read_html providing title from a attribute as well as the text - in effect duplicating output · Issue #20027 · pandas-dev/pandas (original) (raw)
Code Sample
url = """https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon"""
tables = pd.read_html(url, header=0)
print(tables[0].head())
Problem description
The above code ''should' just extract the displayed text in the HTML table; what's in the dataframe should be what's displayed on screen. This isn't what happens. If the HTML contains a hyperlink with a title attribute, this is picked up and added to the dataframe, duplicating the data.
Expected Output
Year Athlete \
0 1897 John J. McDermott
1 1898 Ronald J. MacDonald
2 1899 Lawrence Brignolia
3 1900 John "Jack" Caffery
4 1901 John "Jack" Caffery
Country/State Time Notes
0 United States (NY) 2:55:10 NaN
1 Canada Canada 2:42:00 NaN
2 United States (MA) 2:54:38 NaN
3 Canada 2:39:44 NaN
4 Canada 2:29:23 2nd victory
Output
Here's the actual output, the duplication is in the Athlete and Country/State columns.
Year Athlete
0 1897 McDermott, John J.John J. McDermott
1 1898 MacDonald, Ronald J.Ronald J. MacDonald
2 1899 Brignolia, LawrenceLawrence Brignolia
3 1900 Caffery, JohnJohn "Jack" Caffery
4 1901 Caffery, JohnJohn "Jack" Caffery
Country/State Time Notes
0 United States United States (NY) 2:55:10 NaN
1 Canada Canada 2:42:00 NaN
2 United States United States (MA) 2:54:38 NaN
3 Canada Canada 2:39:44 NaN
4 Canada Canada 2:29:23 2nd victory