tabula-py: Read tables in a PDF into DataFrame
tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into pandas DataFrames. tabula-py can also convert a PDF file into a CSV, TSV, or JSON file.
We highly recommend looking at the example notebook and trying it on Google Colab.
For high-level API reference, see High level interfaces.
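As a rough sketch of the two uses described above, the snippet below reads every table in a PDF into pandas DataFrames with `tabula.read_pdf` and writes the same tables to a file with `tabula.convert_into`. The helper names `output_path_for` and `extract_tables` are hypothetical and not part of tabula-py; the sketch assumes tabula-py is installed (`pip install tabula-py`) and that a Java runtime is available.

```python
from pathlib import Path


def output_path_for(pdf_path, output_format="csv"):
    """Derive an output file name from the PDF name.

    Hypothetical helper for this sketch; not part of tabula-py.
    """
    fmt = output_format.lower()
    if fmt not in {"csv", "tsv", "json"}:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return str(Path(pdf_path).with_suffix("." + fmt))


def extract_tables(pdf_path, output_format="csv"):
    """Read all tables as DataFrames and also save them to a file.

    Hypothetical wrapper for this sketch; not part of tabula-py.
    """
    # Imported lazily so the module loads even where tabula-py or Java
    # is not installed.
    import tabula

    # read_pdf returns a list of DataFrames, one per detected table.
    dataframes = tabula.read_pdf(pdf_path, pages="all")

    # convert_into writes the extracted tables to CSV, TSV, or JSON.
    fmt = output_format.lower()
    tabula.convert_into(
        pdf_path,
        output_path_for(pdf_path, fmt),
        output_format=fmt,
        pages="all",
    )
    return dataframes
```

Calling `extract_tables("report.pdf")` with a placeholder file name would be expected to return the tables as DataFrames and write `report.csv` next to the PDF.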
Contents
- Getting Started
- FAQ
  - tabula-py does not work
  - I can’t run from tabula import read_pdf
  - I got an empty DataFrame. How can I resolve it?
  - The result is different from tabula-java. Or, stream option seems not to work appropriately
  - Can I use option xxx?
  - How can I ignore useless area?
  - I faced ParserError: Error tokenizing data. C error. How can I extract multiple tables?
  - I want to prevent tabula-py from stealing focus on every call on my mac
  - I got ? character with results on Windows. How can I avoid it?
  - I can’t extract file/directory names with space on Windows
  - I want to use a different tabula .jar file
  - I want to extract multiple tables from a document
  - Table cell contents sometimes overflow into the next row.
  - I got a warning/error message from PDFBox including org.apache.pdfbox.pdmodel.. Is it the cause of the empty dataframe?
  - java_options is ignored once read_pdf or similar function is called.
  - I can’t figure out accurate extraction with tabula-py. Are there any similar Python libraries?
- Contributing to tabula-py
API Reference