read_html() Thread Safety · Issue #16928 · pandas-dev/pandas

Code Sample

#!/usr/bin/python3
import pandas
import threading

def fetch_file():
    url = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html"
    pandas.read_html(url)

thread1 = threading.Thread(target = fetch_file)
thread2 = threading.Thread(target = fetch_file)

thread1.start()
thread2.start()

Output

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "./pandas_bug.py", line 7, in fetch_file
    pandas.read_html(url)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 904, in read_html
    keep_default_na=keep_default_na)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 731, in _parse
    parser = _parser_dispatch(flav)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 691, in _parser_dispatch
    raise ImportError("lxml not found, please install it")
ImportError: lxml not found, please install it

Problem description

read_html() does not appear to be thread-safe. This specific failure seems to be caused by html.py setting _IMPORTS to True too early: the second thread then assumes the import check is already done, enters _parser_dispatch, and raises an exception while the first thread has not yet finished the check.
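To illustrate the suspected race without needing pandas or network access, here is a minimal sketch of the pattern. All names in it (`_checked`, `_has_parser`, `_slow_probe`, `check_racy`, `check_locked`, `run`) are hypothetical stand-ins and not the actual code in html.py; the locked variant is only one possible way to close the window and is not necessarily what the PR does.

```python
import threading
import time

# Hypothetical stand-ins for the module-level state in html.py.
_checked = False      # plays the role of _IMPORTS
_has_parser = False   # plays the role of a "library is available" flag
_lock = threading.Lock()


def _slow_probe():
    """Pretend to import an optional dependency; the sleep widens the race window."""
    time.sleep(0.05)
    return True


def check_racy():
    """Racy pattern: the 'done' flag is set before the probe has finished."""
    global _checked, _has_parser
    if _checked:
        return
    _checked = True               # too early: other threads now skip the probe
    _has_parser = _slow_probe()


def check_locked():
    """Locked variant: probe first, publish the 'done' flag last."""
    global _checked, _has_parser
    with _lock:
        if _checked:
            return
        _has_parser = _slow_probe()
        _checked = True


def run(check, n_threads=2):
    """Reset the state, run `check` in several threads, and collect what each saw."""
    global _checked, _has_parser
    _checked = _has_parser = False
    seen = []

    def worker():
        check()
        seen.append(_has_parser)  # list.append is atomic in CPython

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print("racy:  ", run(check_racy))
    print("locked:", run(check_locked))
```

With `check_racy`, the second thread will typically observe `_has_parser` as still `False`, which corresponds to the spurious ImportError above. With `check_locked`, every thread waits until the probe has completed before reading the result.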

I have written a potential fix and will open a PR shortly.

Expected Output

No exception should be raised, since lxml is installed and the program works fine without multi-threading.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.3-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None