Merge branch 'cpcloud_read_html' · pandas-dev/pandas@6518c79

Commit 6518c79

author

y-p

committed

Merge branch 'cpcloud_read_html'

* cpcloud_read_html:
  DOC: update RELEASE.rst
  ENH: add ability to read html tables directly into DataFrames

12 files changed

Lines changed: 3 additions & 1 deletion

Original file line number Diff line number Diff line change
@@ -30,7 +30,8 @@ pandas 0.11.1
30 30
31 31 **New features**
32 32
33 - -
33 + - pd.read_html() can now parse HTML strings, files or URLs and return DataFrames
34 + courtesy of @cpcloud. (GH3477_)
34 35
35 36 **Improvements to existing features**
36 37
@@ -88,6 +89,7 @@ pandas 0.11.1
88 89 .. _GH3437: https://github.com/pydata/pandas/issues/3437
89 90 .. _GH3455: https://github.com/pydata/pandas/issues/3455
90 91 .. _GH3457: https://github.com/pydata/pandas/issues/3457
92 +.. _GH3477: https://github.com/pydata/pandas/issues/3477
91 93 .. _GH3461: https://github.com/pydata/pandas/issues/3461
92 94 .. _GH3468: https://github.com/pydata/pandas/issues/3468
93 95 .. _GH3448: https://github.com/pydata/pandas/issues/3448
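
The release-note entry in the hunk above says that HTML tables can now be read into DataFrames. As a rough illustration of what turning a `<table>` into rows involves, here is a minimal sketch that uses only the Python standard library. It is not the pandas implementation, which relies on lxml or Beautiful Soup 4; the names `TableRowCollector`, `html_table_to_rows` and `SAMPLE` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration only; not part of pandas.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text chunks of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html):
    """Return the table cells found in an HTML string as a list of rows."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""

rows = html_table_to_rows(SAMPLE)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

With pandas and one of the optional parsers installed, `pd.read_html()` is the entry point this commit adds; per the release note it accepts HTML strings, files or URLs and returns DataFrames, so a hand-rolled parser like the one above should not be needed in practice.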

Lines changed: 2 additions & 0 deletions

@@ -75,6 +75,8 @@ if ( ! $VENV_FILE_AVAILABLE ); then
75 75 pip install $PIP_ARGS xlrd>=0.9.0
76 76 pip install $PIP_ARGS 'http://downloads.sourceforge.net/project/pytseries/scikits.timeseries/0.91.3/scikits.timeseries-0.91.3.tar.gz?r='
77 77 pip install $PIP_ARGS patsy
78 + pip install $PIP_ARGS lxml
79 + pip install $PIP_ARGS beautifulsoup4
78 80
79 81 # fool statsmodels into thinking pandas was already installed
80 82 # so it won't refuse to install itself. We want it in the zipped venv

Lines changed: 7 additions & 0 deletions

@@ -50,6 +50,13 @@ File IO
50 50 read_csv
51 51 ExcelFile.parse
52 52
53 +.. currentmodule:: pandas.io.html
54 +
55 +.. autosummary::
56 + :toctree: generated/
57 +
58 + read_html
59 +
53 60 HDFStore: PyTables (HDF5)
54 61 ~~~~~~~~~~~~~~~~~~~~~~~~~
55 62 .. currentmodule:: pandas.io.pytables

Lines changed: 6 additions & 0 deletions

@@ -99,6 +99,12 @@ Optional Dependencies
99 99 * `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
100 100 * openpyxl version 1.6.1 or higher
101 101 * Needed for Excel I/O
102 + * `lxml <http://lxml.de>`__, or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables
103 + * The main difference between lxml and Beautiful Soup 4 is speed (lxml
104 + is faster); however, Beautiful Soup sometimes returns results closer to
105 + what you might intuitively expect. Both backends are implemented, so try
106 + both to see which one you prefer. They should return very similar results.
107 + * Note that lxml requires Cython to build successfully
102 108
103 109 .. note::
104 110
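
The install notes in the hunk above list lxml and Beautiful Soup 4 as optional dependencies for reading HTML tables. A minimal sketch of how one might check which of the two is importable in the current environment, using only the standard library, is shown below. The helper name `available_html_backends` and the constant `CANDIDATE_BACKENDS` are hypothetical, and this check is an assumption about a convenient workflow, not something pandas performs in this form.

```python
import importlib.util

# Import names of the two optional parsers named in the install notes.
# "bs4" is the import name of the beautifulsoup4 package.
CANDIDATE_BACKENDS = ("lxml", "bs4")


def available_html_backends(candidates=CANDIDATE_BACKENDS):
    """Return the candidate parser modules that can be imported here."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


found = available_html_backends()
if found:
    print("HTML parser backends available:", ", ".join(found))
else:
    print("No HTML parser backend found; install lxml or beautifulsoup4.")
```

Using `importlib.util.find_spec` avoids importing the packages just to test for their presence, which keeps the check cheap.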

Lines changed: 3 additions & 0 deletions

@@ -12,9 +12,12 @@ API changes
12 12
13 13 Enhancements
14 14 ~~~~~~~~~~~~
15 + - pd.read_html() can now parse HTML strings, files or URLs and return DataFrames
16 + courtesy of @cpcloud. (GH3477_)
15 17
16 18 See the `full release notes
17 19 <https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
18 20 on GitHub for a complete list.
19 21
20 22 .. _GH2437: https://github.com/pydata/pandas/issues/2437
23 +.. _GH3477: https://github.com/pydata/pandas/issues/3477

Lines changed: 1 addition & 0 deletions

@@ -33,6 +33,7 @@
33 33 read_fwf, to_clipboard, ExcelFile,
34 34 ExcelWriter)
35 35 from pandas.io.pytables import HDFStore, Term, get_store, read_hdf
36 +from pandas.io.html import read_html
36 37 from pandas.util.testing import debug
37 38
38 39 from pandas.tools.describe import value_range