readability-lxml
python-readability
Given an HTML document, extract and clean up the main body text and title.
This is a Python port of a Ruby port of arc90's Readability project.
Installation
Install with pip:
$ pip install readability-lxml
Alternatively, install from conda-forge with conda:
$ conda install -c conda-forge readability-lxml
Usage
>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
'Example Domain'

doc.summary() returns the cleaned-up HTML of the main content as a string. For this page it contains the following text (markup omitted here):

Example Domain
This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.

Change Log
- 0.8.4 Better CJK support, thanks @cdhigh
- 0.8.3.1 Support for Python 3.8 - 3.13
- 0.8.3 We can now save all images via keep_all_images=True (default is to save 1 main image), thanks @botlabsDev
- 0.8.2 Added article author(s) (thanks @mattblaha)
- 0.8.1 Fixed processing of non-ASCII HTML via regexps.
- 0.8 Replaced XHTML output with HTML5 output in summary() call.
- 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
- 0.7 Improved handling of HTML5 tags. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
- 0.4 Added video loading and allowed more images per paragraph
- 0.3 Added Document.encoding, positive_keywords and negative_keywords
Licensing
This code is licensed under the Apache License 2.0.
Thanks to
- Latest readability.js
- Ruby port by starrhorne and iterationlabs
- Python port by gfxmonk
- Decruft effort to move to lxml
- "BR to P" fix from readability.js which improves quality for smaller texts
- Contributions from GitHub users
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file readability_lxml-0.8.4.1.tar.gz.
File metadata
- Download URL: readability_lxml-0.8.4.1.tar.gz
- Upload date: May 3, 2025
- Size: 22.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.10
File hashes
Hashes for readability_lxml-0.8.4.1.tar.gz

| Algorithm   | Hash digest                                                      |
| ----------- | ---------------------------------------------------------------- |
| SHA256      | 9d2924f5942dd7f37fb4da353263b22a3e877ccf922d0e45e348e4177b035a53 |
| MD5         | 14af137865e8220ac2af2fcabf5ea931                                 |
| BLAKE2b-256 | 553edc87d97532ddad58af786ec89c7036182e352574c1cba37bf2bf783d2b15 |
See more details on using hashes here.
File details
Details for the file readability_lxml-0.8.4.1-py3-none-any.whl.
File metadata
- Download URL: readability_lxml-0.8.4.1-py3-none-any.whl
- Upload date: May 3, 2025
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.10
File hashes
Hashes for readability_lxml-0.8.4.1-py3-none-any.whl

| Algorithm   | Hash digest                                                      |
| ----------- | ---------------------------------------------------------------- |
| SHA256      | 874c0cea22c3bf2b78c7f8df831bfaad3c0a89b7301d45a188db581652b4b465 |
| MD5         | 993c47451250d45104f41a4886e1ed77                                 |
| BLAKE2b-256 | c7752cc58965097e351415af420be81c4665cf80da52a17ef43c01ffbe2caf91 |
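A downloaded file can be checked against its published SHA256 digest with the standard library alone. This is a minimal sketch: the helper names sha256_of and matches_published_hash are made up for illustration, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, calling matches_published_hash with the path of the downloaded readability_lxml-0.8.4.1.tar.gz and the SHA256 value listed for it should return True for an intact download.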