REF/BUG/ENH/API: refactor read_html to use TextParser by cpcloud · Pull Request #4770 · pandas-dev/pandas (original) (raw)
closes #4697 (refactor issue) (REF/ENH)
closes #4700 (header inconsistency issue) (API)
closes #5029 (comma issue, added this data set, ordering issue) (BUG)
closes #5048 (header type conversion issue) (BUG)
closes #5066 (index_col issue) (BUG)
- [x] figure out `skiprows`, `header`, and `index_col` interaction (a somewhat longstanding `MultiIndex` sorting issue, I just took the long way to get there :))
- [x] spam url not working anymore (US gov "shutdown" is responsible for this, it correctly skips)
- [x] table ordering doc blurb/HTML gotchas (was an actual "bug", now fixed in this PR)
- [x] add tests for rows with a different length (this is already done by the existing tests)
@cpcloud Does this PR actually close #4679 ? That one is specific to Excel.
@jtratner Nope it doesn't, I think I must've mixed up 79 and 97. Thanks
this diff is very difficult to read ... sigh
I think you are trying to bump your stats with html files! lol
@jreback yep
last issue is for me to make sure that sticking with our old pal `codecs.open` with my chosen `errors` param is correct, and that we're not using plain `open` anymore. would be nice to handle encoding/decoding ... but that's for the future
@jreback nah ... i'm trying to avoid slowing down the test suite with a bunch of `@network` tests
raise Exception("invalid names passed _stack_arrays")
nitems, nstacked = len(items), len(stacked)
if nitems != nstacked:
    raise BadDataError('number of names in ref_items must equal the'
can you leave a note here that says "Caller must catch this error"
or I don't know, maybe not, just something like, if you think this could happen then you should catch this error and try to say something more meaningful.
done
I did a first pass. I'd probably like to go over it again and see what I see, but I'm sure it's good.
As an aside, can you help me understand how ordering works for the output tables? My assumption is that the ordering is deterministic. Does it follow the order in the HTML data that's passed in? (i.e., if you found all the line numbers of `<table>` elements, the order of output tables would be the same as the order of those line numbers [unless a table isn't parseable])
@cpcloud Maybe also add that note about table ordering to the html gotchas?
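For what it's worth, a quick check with two tables suggests document order is preserved (hypothetical markup, modern pandas API):

```python
from io import StringIO

import pandas as pd

# Two tables in the document; read_html should return them in source order.
html = """
<table><tr><th>first</th></tr><tr><td>1</td></tr></table>
<table><tr><th>second</th></tr><tr><td>2</td></tr></table>
"""
tables = pd.read_html(StringIO(html))
# tables[0] comes from the first <table> element, tables[1] from the second.
```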
look ok
on tupleize_cols your explanation is odd - no other functions have it as true (by default)
Probably a cycle in the HTML parse tree. Does copy-pasting just the table work?
Surprised that site works...
i think maybe a timeout parameter might be useful
I interrupted and this is the trace:
/home/alex/git/pandas/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    700
    701     try:
--> 702         tables = p.parse_tables()
    703     except Exception as caught:
    704         retained = caught

/home/alex/git/pandas/pandas/io/html.py in parse_tables(self)
    172
    173     def parse_tables(self):
--> 174         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    175         return (self._build_table(table) for table in tables)
    176

/home/alex/git/pandas/pandas/io/html.py in _parse_tables(self, doc, match, attrs)
    396     def _parse_tables(self, doc, match, attrs):
    397         element_name = self._strainer.name
--> 398         tables = doc.find_all(element_name, attrs=attrs)
    399
    400         if not tables:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in find_all(self, name, attrs, recursive, text, limit, **kwargs)
   1165         if not recursive:
   1166             generator = self.children
-> 1167         return self._find_all(name, attrs, text, limit, generator, **kwargs)
   1168     findAll = find_all  # BS3
   1169     findChildren = find_all  # BS2

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in _find_all(self, name, attrs, text, limit, generator, **kwargs)
    483         # Optimization to find all tags with a given name.
    484         elif isinstance(name, basestring):
--> 485             return [element for element in generator
    486                     if isinstance(element, Tag) and element.name == name]
    487         else:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in descendants(self)
   1182         current = self.contents[0]
   1183         while current is not stopNode:
-> 1184             yield current
   1185             current = current.next_element
   1186
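In the meantime, the timeout suggested above can be approximated by fetching the page yourself before handing the markup to `read_html` — a sketch using the modern (Python 3) stdlib; `read_html_with_timeout` is a hypothetical helper name, not part of pandas:

```python
import urllib.request
from io import StringIO

import pandas as pd


def read_html_with_timeout(url, timeout=10, **kwargs):
    # Fetch the page ourselves so we control the socket timeout,
    # then pass the downloaded markup to read_html as usual.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset)
    return pd.read_html(StringIO(html), **kwargs)
```

A hung server then raises `socket.timeout` instead of blocking the parse indefinitely.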
Yep ... @jseabold had a similar issue ... let me find it
essentially the borked html is in a cycle, in this case a node goes to its child then when the current node (the child) goes to the next element, it's actually the previous node (its parent) and on and on ...
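The cycle can be reproduced in miniature without bs4 — a hypothetical `Node` class standing in for the tags, plus the kind of visited-set guard that would break the loop:

```python
# Minimal sketch of the failure mode: a next-pointer walk that would loop
# forever on cyclic markup, and a visited-set guard that terminates it.
class Node:
    def __init__(self, name):
        self.name = name
        self.next_element = None


a, b = Node("parent"), Node("child")
a.next_element = b
b.next_element = a  # borked: the child points back at its parent


def walk(start):
    seen = set()
    cur = start
    # Stop as soon as we revisit a node instead of spinning forever.
    while cur is not None and id(cur) not in seen:
        seen.add(id(cur))
        yield cur
        cur = cur.next_element


names = [n.name for n in walk(a)]  # terminates despite the cycle
```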
Am I doing something wrong here:
pd.read_html("/home/alex/table.html",infer_types=False,header=[0])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0])
/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
838 'data (you passed a negative value)')
839 return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840 parse_dates, tupleize_cols, thousands, attrs)
/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
710 return [_data_to_frame(table, header, index_col, skiprows, infer_types,
711 parse_dates, tupleize_cols, thousands)
--> 712 for table in tables]
713
714
/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
600 skiprows=_get_skiprows(skiprows),
601 parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602 thousands=thousands)
603 df = tp.read()
604
/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
1173 """
1174 kwds['engine'] = 'python'
-> 1175 return TextFileReader(*args, **kwds)
1176
1177
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
485 self.options['has_index_names'] = kwds['has_index_names']
486
--> 487 self._make_engine(self.engine)
488
489 def _get_options_with_defaults(self, engine):
/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
601 elif engine == 'python-fwf':
602 klass = FixedWidthFieldParser
--> 603 self._engine = klass(self.f, **self.options)
604
605 def _failover_to_python(self):
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
1296 if len(self.columns) > 1:
1297 self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298 self.columns, self.index_names, self.col_names)
1299 else:
1300 self.columns = self.columns[0]
/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
736 # if we find 'Unnamed' all of a single level, then our header was too long
737 for n in range(len(columns[0])):
--> 738 if all([ 'Unnamed' in c[n] for c in columns ]):
739 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
740 "multi_index of columns" % ','.join([ str(x) for x in self.header ]))
TypeError: argument of type 'float' is not iterable
This works fine:
pd.read_html("/home/alex/table.html",infer_types=False,header=0)
this is an issue with TextParser
if you need to pass only a single header just use the second version
it doesn't really make a whole lot of sense to pass a singleton list if you just want the first row anyway
only pass a list if you need more than 1 row as a MultiIndex `header`
i'll open an issue about TextParser
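For a well-formed table the single-row vs list distinction looks like this (hypothetical markup, modern pandas API):

```python
from io import StringIO

import pandas as pd

# Hypothetical table with two header rows followed by one data row.
html = """
<table>
  <tr><td>x</td><td>y</td></tr>
  <tr><td>a</td><td>b</td></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""
# A plain int when you want a single header row...
single = pd.read_html(StringIO(html), header=0)[0]
# ...and a list only when you actually want a MultiIndex.
multi = pd.read_html(StringIO(html), header=[0, 1])[0]
```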
That might have been a poor example, I see this issue even with a proper list:
In [6]: pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-1b482817d4da> in <module>()
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])
/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
838 'data (you passed a negative value)')
839 return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840 parse_dates, tupleize_cols, thousands, attrs)
/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
710 return [_data_to_frame(table, header, index_col, skiprows, infer_types,
711 parse_dates, tupleize_cols, thousands)
--> 712 for table in tables]
713
714
/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
600 skiprows=_get_skiprows(skiprows),
601 parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602 thousands=thousands)
603 df = tp.read()
604
/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
1173 """
1174 kwds['engine'] = 'python'
-> 1175 return TextFileReader(*args, **kwds)
1176
1177
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
485 self.options['has_index_names'] = kwds['has_index_names']
486
--> 487 self._make_engine(self.engine)
488
489 def _get_options_with_defaults(self, engine):
/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
601 elif engine == 'python-fwf':
602 klass = FixedWidthFieldParser
--> 603 self._engine = klass(self.f, **self.options)
604
605 def _failover_to_python(self):
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
1296 if len(self.columns) > 1:
1297 self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298 self.columns, self.index_names, self.col_names)
1299 else:
1300 self.columns = self.columns[0]
/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
736 # if we find 'Unnamed' all of a single level, then our header was too long
737 for n in range(len(columns[0])):
--> 738 if all([ 'Unnamed' in c[n] for c in columns ]):
739 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
740 "multi_index of columns" % ','.join([ str(x) for x in self.header ]))
TypeError: argument of type 'float' is not iterable
What does your table look like?
GitHub actually supports tables:
|  | Three months ended April 30 |  | Six months ended April 30 |  |
| --- | --- | --- | --- | --- |
| In millions | 2013 | 2012 | 2013 | 2012 |
| Net revenue: |  |  |  |  |
| Notebooks | $ 3,718 | $ 4,900 | $ 7,846 | $ 9,842 |
| Desktops | 3,103 | 3,827 | 6,424 | 7,033 |
| Workstations | 521 | 537 | 1,056 | 1,072 |
| Other | 242 | 206 | 462 | 415 |
| Personal Systems | 7,584 | 9,470 | 15,788 | 18,362 |
| Supplies | 4,122 | 4,060 | 8,015 | 8,139 |
| Commercial Hardware | 1,398 | 1,479 | 2,752 | 2,968 |
| Consumer Hardware | 561 | 593 | 1,240 | 1,283 |
| Printing | 6,081 | 6,132 | 12,007 | 12,390 |
| Printing and Personal Systems Group | 13,665 | 15,602 | 27,795 | 30,752 |
| Industry Standard Servers | 2,806 | 3,186 | 5,800 | 6,258 |
| Technology Services | 2,272 | 2,335 | 4,515 | 4,599 |
| Storage | 857 | 990 | 1,690 | 1,945 |
| Networking | 618 | 614 | 1,226 | 1,200 |
| Business Critical Systems | 266 | 421 | 572 | 826 |
| Enterprise Group | 6,819 | 7,546 | 13,803 | 14,828 |
| Infrastructure Technology Outsourcing | 3,721 | 3,954 | 7,457 | 7,934 |
| Application and Business Services | 2,278 | 2,535 | 4,461 | 4,926 |
| Enterprise Services | 5,999 | 6,489 | 11,918 | 12,860 |
| Software | 941 | 970 | 1,867 | 1,916 |
| HP Financial Services | 881 | 968 | 1,838 | 1,918 |
| Corporate Investments | 10 | 7 | 14 | 37 |
| Total segments | 28,315 | 31,582 | 57,235 | 62,311 |
| Eliminations of intersegment net revenue and other | (733) | (889) | (1,294) | (1,582) |
| Total HP consolidated net revenue | $ 27,582 | $ 30,693 | $ 55,941 | $ 60,729 |
im betting that if you try to get the part of the table after the header it will work ... otherwise i'll have to take a look later
Aside from the leading $ and trailing ) (which are like that in the HTML), GitHub does a great job of rendering that table.
So this seems to work better:
ret = pd.read_html("/home/alex/table.html",infer_types=False,skiprows=3)[0]
but I am still left with a lot of nan's:
0 1 2 3 4 \
0 Net revenue: nan nan nan nan
1 Notebooks nan $ 3718 nan
2 Desktops nan nan 3103 nan
3 Workstations nan nan 521 nan
4 Other nan nan 242 nan
5 nan nan nan nan nan
6 Personal Systems nan nan 7584 nan
7 nan nan nan nan nan
8 Supplies nan nan 4122 nan
9 Commercial Hardware nan nan 1398 nan
10 Consumer Hardware nan nan 561 nan
11 nan nan nan nan nan
12 Printing nan nan 6081 nan
13 nan nan nan nan nan
14 Printing and Personal Systems Group nan nan 13665 nan
15 nan nan nan nan nan
16 Industry Standard Servers nan nan 2806 nan
17 Technology Services nan nan 2272 nan
18 Storage nan nan 857 nan
19 Networking nan nan 618 nan
20 Business Critical Systems nan nan 266 nan
21 nan nan nan nan nan
22 Enterprise Group nan nan 6819 nan
23 nan nan nan nan nan
24 Infrastructure Technology Outsourcing nan nan 3721 nan
25 Application and Business Services nan nan 2278 nan
26 nan nan nan nan nan
27 Enterprise Services nan nan 5999 nan
28 nan nan nan nan nan
29 Software nan nan 941 nan
30 HP Financial Services nan nan 881 nan
31 Corporate Investments nan nan 10 nan
32 nan nan nan nan nan
33 Total segments nan nan 28315 nan
34 nan nan nan nan nan
35 Eliminations of intersegment net revenue and o... nan nan (733 )
36 nan nan nan nan nan
37 Total HP consolidated net revenue nan $ 27582 nan
38 nan nan nan nan nan
where those nans actually appear to be strings:
In [22]: ret[5][0]
Out[22]: u'nan'
yep as i suspected.
the string `nan`s are because you passed `infer_types=False`, which converts everything to a string (kind of kludgy i know, it's for back compat). `infer_types` will have no effect starting in 0.14
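Until then, the stringified output can be cleaned up by hand — a sketch on a tiny frame mimicking the output above, using `replace` and `pd.to_numeric` (the latter is a modern pandas function, not available in the 0.13-era code under discussion):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the output above: everything is a string,
# including the literal 'nan's.
ret = pd.DataFrame({0: ["Notebooks", "nan"], 2: ["3718", "nan"]})

# Turn the 'nan' strings back into real missing values...
ret = ret.replace("nan", np.nan)
# ...and coerce the numeric column ourselves.
ret[2] = pd.to_numeric(ret[2], errors="coerce")
```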
I think that `parse_dates` should have a little more documentation (I did look at the docs on `read_csv` as well).
It is not clear to me what the default value of `parse_dates` is.
Also, on `read_html`, `parse_dates` is listed as "bool" whereas on `read_csv` it is "boolean, list of ints or names, list of lists, or dict".
I am not sure what setting `parse_dates=False` does.
It seems that some values are being incorrectly parsed as dates (See columns 9-12):
In [38]: pd.read_html("/home/alex/table.html",skiprows=3,parse_dates=False)[0]
Out[38]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 Net revenue: NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
1 Notebooks NaN $ 3718 NaN $ 4900 NaN $ NaT NaN $ NaT NaN
2 Desktops NaN NaN 3103 NaN NaN 3827 NaN NaN NaT NaN NaN NaT NaN
3 Workstations NaN NaN 521 NaN NaN 537 NaN NaN NaT NaN NaN NaT NaN
4 Other NaN NaN 242 NaN NaN 206 NaN NaN NaT NaN NaN NaT NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
6 Personal Systems NaN NaN 7584 NaN NaN 9470 NaN NaN NaT NaN NaN NaT NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
8 Supplies NaN NaN 4122 NaN NaN 4060 NaN NaN NaT NaN NaN NaT NaN
9 Commercial Hardware NaN NaN 1398 NaN NaN 1479 NaN NaN NaT NaN NaN NaT NaN
10 Consumer Hardware NaN NaN 561 NaN NaN 593 NaN NaN NaT NaN NaN NaT NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
12 Printing NaN NaN 6081 NaN NaN 6132 NaN NaN NaT NaN NaN NaT NaN
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
14 Printing and Personal Systems Group NaN NaN 13665 NaN NaN 15602 NaN NaN NaT NaN NaN NaT NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
16 Industry Standard Servers NaN NaN 2806 NaN NaN 3186 NaN NaN NaT NaN NaN NaT NaN
17 Technology Services NaN NaN 2272 NaN NaN 2335 NaN NaN NaT NaN NaN NaT NaN
18 Storage NaN NaN 857 NaN NaN 990 NaN NaN 1690-01-01 00:00:00 NaN NaN 1945-01-01 00:00:00 NaN
19 Networking NaN NaN 618 NaN NaN 614 NaN NaN NaT NaN NaN NaT NaN
20 Business Critical Systems NaN NaN 266 NaN NaN 421 NaN NaN NaT NaN NaN NaT NaN
21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
22 Enterprise Group NaN NaN 6819 NaN NaN 7546 NaN NaN NaT NaN NaN NaT NaN
23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
24 Infrastructure Technology Outsourcing NaN NaN 3721 NaN NaN 3954 NaN NaN NaT NaN NaN NaT NaN
25 Application and Business Services NaN NaN 2278 NaN NaN 2535 NaN NaN NaT NaN NaN NaT NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
27 Enterprise Services NaN NaN 5999 NaN NaN 6489 NaN NaN NaT NaN NaN NaT NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
29 Software NaN NaN 941 NaN NaN 970 NaN NaN 1867-01-01 00:00:00 NaN NaN 1916-01-01 00:00:00 NaN
30 HP Financial Services NaN NaN 881 NaN NaN 968 NaN NaN 1838-01-01 00:00:00 NaN NaN 1918-01-01 00:00:00 NaN
31 Corporate Investments NaN NaN 10 NaN NaN 7 NaN NaN NaT NaN NaN NaT NaN
32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
33 Total segments NaN NaN 28315 NaN NaN 31582 NaN NaN NaT NaN NaN NaT NaN
34 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
35 Eliminations of intersegment net revenue and o... NaN NaN (733 ) NaN (889 ) NaN NaT ) NaN NaT )
36 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
37 Total HP consolidated net revenue NaN $ 27582 NaN $ 30693 NaN $ NaT NaN $ NaT NaN
38 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
What are we supposed to do if we want to use read_html and not have Pandas infer types then?
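One option in later pandas versions is the `converters` argument to `read_html`, which keeps cell text verbatim for the columns you name instead of letting type inference run (hypothetical one-column table):

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>code</th></tr>
  <tr><td>007</td></tr>
</table>
"""
# Without a converter, read_html would parse '007' as the integer 7;
# a str converter keeps the cell text exactly as written.
raw = pd.read_html(StringIO(html), converters={"code": str})[0]
```

For everything else, read normally and post-process with `astype`/`pd.to_numeric` as needed.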