Merge branch 'cpcloud_read_html' · pandas-dev/pandas@6518c79 (original) (raw)

Commit 6518c79

Author: y-p

Merge branch 'cpcloud_read_html'

* cpcloud_read_html:
  DOC: update RELEASE.rst
  ENH: add ability to read html tables directly into DataFrames

12 files changed
Original file line number Diff line number Diff line change
@@ -30,7 +30,8 @@ pandas 0.11.1
30 30
31 31 **New features**
32 32
33 - -
33 + - pd.read_html() can now parse HTML strings, files, or URLs and return DataFrames
34 + courtesy of @cpcloud. (GH3477_)
34 35
35 36 **Improvements to existing features**
36 37
@@ -88,6 +89,7 @@ pandas 0.11.1
88 89 .. _GH3437: https://github.com/pydata/pandas/issues/3437
89 90 .. _GH3455: https://github.com/pydata/pandas/issues/3455
90 91 .. _GH3457: https://github.com/pydata/pandas/issues/3457
92 +.. _GH3477: https://github.com/pydata/pandas/issues/3477
91 93 .. _GH3461: https://github.com/pydata/pandas/issues/3461
92 94 .. _GH3468: https://github.com/pydata/pandas/issues/3468
93 95 .. _GH3448: https://github.com/pydata/pandas/issues/3448
Original file line number Diff line number Diff line change
@@ -75,6 +75,8 @@ if ( ! $VENV_FILE_AVAILABLE ); then
75 75 pip install $PIP_ARGS xlrd>=0.9.0
76 76 pip install $PIP_ARGS 'http://downloads.sourceforge.net/project/pytseries/scikits.timeseries/0.91.3/scikits.timeseries-0.91.3.tar.gz?r='
77 77 pip install $PIP_ARGS patsy
78 + pip install $PIP_ARGS lxml
79 + pip install $PIP_ARGS beautifulsoup4
78 80
79 81 # fool statsmodels into thinking pandas was already installed
80 82 # so it won't refuse to install itself. We want it in the zipped venv
Original file line number Diff line number Diff line change
@@ -50,6 +50,13 @@ File IO
50 50 read_csv
51 51 ExcelFile.parse
52 52
53 +.. currentmodule:: pandas.io.html
54 +
55 +.. autosummary::
56 +:toctree: generated/
57 +
58 + read_html
59 +
53 60 HDFStore: PyTables (HDF5)
54 61 ~~~~~~~~~~~~~~~~~~~~~~~~~
55 62 .. currentmodule:: pandas.io.pytables
Original file line number Diff line number Diff line change
@@ -99,6 +99,12 @@ Optional Dependencies
99 99 * `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
100 100 * openpyxl version 1.6.1 or higher
101 101 * Needed for Excel I/O
102 + * `lxml <http://lxml.de>`__, or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables
103 + * The main difference between lxml and Beautiful Soup 4 is speed (lxml
104 + is faster); however, Beautiful Soup sometimes returns results closer to
105 + what you might intuitively expect. Both backends are implemented, so try
106 + each to see which you prefer. They should return very similar results.
107 + * Note that lxml requires Cython to build successfully
102 108
103 109 .. note::
104 110
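
The dependency note in this hunk says that either lxml or Beautiful Soup 4 can back HTML table parsing. As a minimal sketch of checking which of the two is importable in a given environment, something like the following could be used. The helper name `available_html_backends` is hypothetical and not part of pandas; it relies only on the standard library.

```python
import importlib.util


def available_html_backends():
    """Return the names of the HTML parsing backends that can be imported.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    # Map a display name to the top-level module each package installs.
    candidates = {"lxml": "lxml", "Beautiful Soup 4": "bs4"}
    return [
        name
        for name, module in candidates.items()
        # find_spec returns None when the top-level module is not installed.
        if importlib.util.find_spec(module) is not None
    ]


if __name__ == "__main__":
    print(available_html_backends())
```

An empty list would suggest that neither optional dependency is installed, in which case `read_html` would presumably have no parser to work with.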
Original file line number Diff line number Diff line change
@@ -12,9 +12,12 @@ API changes
12 12
13 13 Enhancements
14 14 ~~~~~~~~~~~~
15 + - pd.read_html() can now parse HTML strings, files, or URLs and return DataFrames
16 + courtesy of @cpcloud. (GH3477_)
15 17
16 18 See the `full release notes
17 19 <https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
18 20 on GitHub for a complete list.
19 21
20 22 .. _GH2437: https://github.com/pydata/pandas/issues/2437
23 +.. _GH3477: https://github.com/pydata/pandas/issues/3477
Original file line number Diff line number Diff line change
@@ -33,6 +33,7 @@
33 33 read_fwf, to_clipboard, ExcelFile,
34 34 ExcelWriter)
35 35 from pandas.io.pytables import HDFStore, Term, get_store, read_hdf
36 +from pandas.io.html import read_html
36 37 from pandas.util.testing import debug
37 38
38 39 from pandas.tools.describe import value_range
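
The last hunk imports `read_html` into the package namespace, so it should be reachable as `pd.read_html`. The sketch below shows one way it might be called. It assumes a reasonably recent pandas with lxml installed; the sample HTML and the helper name `read_first_table` are made up for illustration. The markup is wrapped in `StringIO` because newer pandas versions prefer a file-like object over a literal HTML string.

```python
from io import StringIO

import pandas as pd

# Illustrative input; any document containing a <table> element would do.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>name</th><th>qty</th></tr>
  </thead>
  <tbody>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>2</td></tr>
  </tbody>
</table>
"""


def read_first_table(html):
    """Parse an HTML string and return the first table as a DataFrame."""
    # read_html returns a list with one DataFrame per <table> it finds.
    tables = pd.read_html(StringIO(html))
    return tables[0]


if __name__ == "__main__":
    print(read_first_table(SAMPLE_HTML))
```

Because the function returns a list of DataFrames, a page with several tables would need an explicit index (or a loop) to pick the one of interest.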