read_html() Thread Safety · Issue #16928 · pandas-dev/pandas
Code Sample
#!/usr/bin/python3
import pandas
import threading

def fetch_file():
    url = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html"
    pandas.read_html(url)

thread1 = threading.Thread(target = fetch_file)
thread2 = threading.Thread(target = fetch_file)

thread1.start()
thread2.start()
Output
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "./pandas_bug.py", line 7, in fetch_file
    pandas.read_html(url)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 904, in read_html
    keep_default_na=keep_default_na)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 731, in _parse
    parser = _parser_dispatch(flav)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 691, in _parser_dispatch
    raise ImportError("lxml not found, please install it")
ImportError: lxml not found, please install it
Problem description
read_html() does not appear to be thread-safe. This specific failure seems to be caused by _IMPORTS in html.py being set to True too early: the second thread then skips the import check, enters _parser_dispatch, and raises an exception while the first thread has not yet finished the check.
I have written a potential fix and will open a PR shortly.
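The race described above can be modeled in isolation. The following is a minimal sketch, not pandas code and not the fix from the PR: slow_probe, RacyCheck, LockedCheck, and observe are hypothetical names, and the sleep merely stands in for a slow import check. It contrasts setting the "done" flag before the check completes with running the check under a lock and setting the flag afterwards.

```python
import threading
import time


def slow_probe():
    """Hypothetical stand-in for a slow optional-dependency import check."""
    time.sleep(0.2)
    return True


class RacyCheck:
    """Marks the check as done before the probe has produced a result."""

    def __init__(self, probe):
        self._probe = probe
        self.checked = False
        self.available = False

    def ensure(self):
        if self.checked:
            return
        self.checked = True  # set too early: other threads now skip the probe
        self.available = self._probe()


class LockedCheck:
    """Runs the probe under a lock and marks it done only afterwards."""

    def __init__(self, probe):
        self._probe = probe
        self._lock = threading.Lock()
        self.checked = False
        self.available = False

    def ensure(self):
        with self._lock:
            if self.checked:
                return
            self.available = self._probe()
            self.checked = True


def observe(check):
    """Call ensure() from two threads; return what each saw afterwards."""
    seen = []

    def worker():
        check.ensure()
        seen.append(check.available)

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print("racy:  ", observe(RacyCheck(slow_probe)))    # usually contains a False
    print("locked:", observe(LockedCheck(slow_probe)))  # both threads see True
```

In the racy variant, the second thread usually returns from ensure() while available is still False, which mirrors the spurious "lxml not found" error. In the locked variant, the second thread waits for the lock and only proceeds once the result is known.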
Expected Output
No exception should be thrown since lxml is installed and the program works fine without multi-threading.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.3-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None