Merge branch 'cpcloud_read_html' · pandas-dev/pandas@6518c79
Commit 6518c79
y-p
committed
Merge branch 'cpcloud_read_html'
* cpcloud_read_html:
  DOC: update RELEASE.rst
  ENH: add ability to read html tables directly into DataFrames
12 files changed
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -30,7 +30,8 @@ pandas 0.11.1 | ||
30 | 30 | |
31 | 31 | **New features** |
32 | 32 | |
33 | - - | |
33 | + - pd.read_html() can now parse HTML string, files or urls and return dataframes | |
34 | + courtesy of @cpcloud. (GH3477_) | |
34 | 35 | |
35 | 36 | **Improvements to existing features** |
36 | 37 | |
@@ -88,6 +89,7 @@ pandas 0.11.1 | ||
88 | 89 | .. _GH3437: https://github.com/pydata/pandas/issues/3437 |
89 | 90 | .. _GH3455: https://github.com/pydata/pandas/issues/3455 |
90 | 91 | .. _GH3457: https://github.com/pydata/pandas/issues/3457 |
92 | +.. _GH3477: https://github.com/pydata/pandas/issues/3457 | |
91 | 93 | .. _GH3461: https://github.com/pydata/pandas/issues/3461 |
92 | 94 | .. _GH3468: https://github.com/pydata/pandas/issues/3468 |
93 | 95 | .. _GH3448: https://github.com/pydata/pandas/issues/3448 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -75,6 +75,8 @@ if ( ! $VENV_FILE_AVAILABLE ); then | ||
75 | 75 | pip install $PIP_ARGS xlrd>=0.9.0 |
76 | 76 | pip install $PIP_ARGS 'http://downloads.sourceforge.net/project/pytseries/scikits.timeseries/0.91.3/scikits.timeseries-0.91.3.tar.gz?r=' |
77 | 77 | pip install $PIP_ARGS patsy |
78 | + pip install $PIP_ARGS lxml | |
79 | + pip install $PIP_ARGS beautifulsoup4 | |
78 | 80 | |
79 | 81 | # fool statsmodels into thinking pandas was already installed |
80 | 82 | # so it won't refuse to install itself. We want it in the zipped venv |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -50,6 +50,13 @@ File IO | ||
50 | 50 | read_csv |
51 | 51 | ExcelFile.parse |
52 | 52 | |
53 | +.. currentmodule:: pandas.io.html | |
54 | + | |
55 | +.. autosummary:: | |
56 | + :toctree: generated/ | |
57 | + | |
58 | + read_html | |
59 | + | |
53 | 60 | HDFStore: PyTables (HDF5) |
54 | 61 | ~~~~~~~~~~~~~~~~~~~~~~~~~ |
55 | 62 | .. currentmodule:: pandas.io.pytables |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -99,6 +99,12 @@ Optional Dependencies | ||
99 | 99 | * `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
100 | 100 | * openpyxl version 1.6.1 or higher |
101 | 101 | * Needed for Excel I/O |
102 | + * `lxml <http://lxml.de>`__, or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables | |
103 | + * The differences between lxml and Beautiful Soup 4 are mostly speed (lxml | |
104 | + is faster), however sometimes Beautiful Soup returns what you might | |
105 | + intuitively expect. Both backends are implemented, so try them both to | |
106 | + see which one you like. They should return very similar results. | |
107 | + * Note that lxml requires Cython to build successfully | |
102 | 108 | |
103 | 109 | .. note:: |
104 | 110 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -12,9 +12,12 @@ API changes | ||
12 | 12 | |
13 | 13 | Enhancements |
14 | 14 | ~~~~~~~~~~~~ |
15 | + - pd.read_html() can now parse HTML string, files or urls and return dataframes | |
16 | + courtesy of @cpcloud. (GH3477_) | |
15 | 17 | |
16 | 18 | See the `full release notes |
17 | 19 | <https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker |
18 | 20 | on GitHub for a complete list. |
19 | 21 | |
20 | 22 | .. _GH2437: https://github.com/pydata/pandas/issues/2437 |
23 | +.. _GH3477: https://github.com/pydata/pandas/issues/3477 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -33,6 +33,7 @@ | ||
33 | 33 | read_fwf, to_clipboard, ExcelFile, |
34 | 34 | ExcelWriter) |
35 | 35 | from pandas.io.pytables import HDFStore, Term, get_store, read_hdf |
36 | +from pandas.io.html import read_html | |
36 | 37 | from pandas.util.testing import debug |
37 | 38 | |
38 | 39 | from pandas.tools.describe import value_range |
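
The release notes in this commit describe `pd.read_html()` as parsing HTML strings, files or URLs and returning DataFrames. As a rough illustration of the underlying idea only, here is a minimal standard-library sketch that pulls the rows out of each `<table>` in an HTML string. It is not how `pandas.io.html` is implemented (per the install notes above, that relies on lxml or Beautiful Soup 4), and the names `TableExtractor` and `extract_tables` are hypothetical. The sketch assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical helper: collect each <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return every table in `html` as a list of rows (assumes no nesting)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


sample = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
"""
tables = extract_tables(sample)
```

Each extracted table is a plain list of string rows, with the header row first. The function added by this commit returns DataFrames instead and delegates parsing to one of the two backends, so real-world markup is better handled there than by a sketch like this.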