ENH: support storage_options
in pandas.read_html
· Issue #49944 · pandas-dev/pandas (original) (raw)
Feature Type
- Adding new functionality to pandas
Problem Description
I wish pandas.read_html supported the storage_options
keyword argument, just like pandas.read_json and many other functions.
The storage_options
can be helpful when reading data from websites requiring request headers, such as user-agent.
Feature Description
With the suggestion, pd.read_html
will accept storage_options
:
import pandas as pd
url = "http://apps.fibi.co.il/Matach/matach.aspx" user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36" df = pd.read_html(url, storage_options={"User-Agent": user_agent})[0]
Currently, without the storage_options
argument, the request fails with:
df = pd.read_html(url)[0]
ValueError: No tables found
Alternative Solutions
A workaround is to use urllib
and send pandas the HTTP response:
import urllib.request
import pandas as pd
url = "http://apps.fibi.co.il/Matach/matach.aspx" user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36" request = urllib.request.Request(url, headers={"User-Agent": user_agent}) with urllib.request.urlopen(request) as response: df = pd.read_html(response)[0]
This approach is valid but more verbose.
It will be valuable to extend the already widespread support of storage_options
in pandas to pd.read_html
also.
Additional Context
No response