read_html providing title from a attribute as well as the text - in effect duplicating output · Issue #20027 · pandas-dev/pandas

Code Sample

import pandas as pd

# Parse every table on the page, using the first row of each as its header
url = "https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon"
tables = pd.read_html(url, header=0)
print(tables[0].head())

Problem description

The above code should extract only the text displayed in the HTML table; the dataframe should contain what is shown on screen. That isn't what happens. If the HTML contains a hyperlink with a title attribute, the attribute's text is picked up and added to the dataframe, duplicating the data.

Expected Output

   Year                   Athlete  \
0  1897          John J. McDermott   
1  1898         Ronald J. MacDonald   
2  1899          Lawrence Brignolia   
3  1900         John "Jack" Caffery   
4  1901         John "Jack" Caffery   

                      Country/State     Time        Notes  
0                United States (NY)  2:55:10          NaN  
1                            Canada  2:42:00          NaN  
2                United States (MA)  2:54:38          NaN  
3                            Canada  2:39:44          NaN  
4                            Canada  2:29:23  2nd victory 

Output

Here's the actual output; the duplication is in the Athlete and Country/State columns.

   Year                                  Athlete  \
0  1897      McDermott, John J.John J. McDermott   
1  1898  MacDonald, Ronald J.Ronald J. MacDonald   
2  1899    Brignolia, LawrenceLawrence Brignolia   
3  1900         Caffery, JohnJohn "Jack" Caffery   
4  1901         Caffery, JohnJohn "Jack" Caffery   

                      Country/State     Time        Notes  
0  United States United States (NY)  2:55:10          NaN  
1                     Canada Canada  2:42:00          NaN  
2  United States United States (MA)  2:54:38          NaN  
3                     Canada Canada  2:39:44          NaN  
4                     Canada Canada  2:29:23  2nd victory
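
To make "what's displayed on screen" concrete, here is a minimal standard-library sketch of the expected extraction. It rests on an assumption that the issue does not confirm: that the extra text (for example "McDermott, John J.") comes from an element hidden with an inline `display:none` style rather than from the title attribute itself. The `VisibleTextParser` class, the `visible_text` helper and the sample cell are all hypothetical, written for illustration only.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class VisibleTextParser(HTMLParser):
    """Collect text, skipping anything inside an element styled display:none."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hidden_stack = []  # one bool per open element: is it hidden?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Attribute values such as title are never treated as text here.
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        parent_hidden = bool(self.hidden_stack) and self.hidden_stack[-1]
        self.hidden_stack.append(parent_hidden or "display:none" in style)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        if not (self.hidden_stack and self.hidden_stack[-1]):
            self.parts.append(data)


def visible_text(html):
    """Return only the text a browser would display (inline styles only)."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Hypothetical cell modelled on the Athlete column (assumed markup).
cell = (
    '<td><span style="display:none">McDermott, John J.</span>'
    '<a href="/wiki/John_J._McDermott" title="John J. McDermott">'
    'John J. McDermott</a></td>'
)
print(visible_text(cell))  # John J. McDermott
```

The sketch only looks at inline styles and assumes every non-void tag is explicitly closed, so it is not a general HTML renderer. pandas 0.23 and later also document a `displayed_only` argument on `read_html` for skipping `display: none` elements; check the documentation for your installed version before relying on it.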