GitHub - tabulapdf/tabula-java: Extract tables from PDF files
tabula-java is a library for extracting tables from PDF files. It is the table extraction engine that powers Tabula. You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.
© 2014-2020 Manuel Aristarán. Available under MIT License. See LICENSE.
Download
Download a version of the tabula-java jar, with all dependencies included, that works on Mac, Windows and Linux, from our releases page.
Commandline Usage Examples
tabula-java provides a command line application:
$ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-f <FORMAT>]
       [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s
       <PASSWORD>] [-t] [-u] [-v]

Tabula helps you extract tables from PDFs

 -a,--area <AREA>           -a/--area = Portion of the page to analyze.
                            Example: --area 269.875,12.75,790.5,561.
                            Accepts top,left,bottom,right i.e. y1,x1,y2,x2
                            where all values are in points relative to the
                            top left corner. If all values are between
                            0-100 (inclusive) and preceded by '%', input
                            will be taken as % of actual height or width
                            of the page. Example: --area %0,0,100,50. To
                            specify multiple areas, -a option should be
                            repeated. Default is entire page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3. If all values are
                            between 0-100 (inclusive) and preceded by '%',
                            input will be taken as % of actual width of
                            the page. Example: --columns %25,50,80.6
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.
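The --pages and --area value formats described in the help text can be sketched in a few lines of Java. This is only an illustration of the documented syntax, not tabula-java's own parser: the class TabulaOptionSketch and the methods parsePages and parseArea are hypothetical names, and the 612 x 792 point page (US Letter) is just an example size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TabulaOptionSketch {

    /** Expands a --pages value such as "1-3,5-7", "3" or "all" into 1-based page numbers. */
    public static List<Integer> parsePages(String spec, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        if (spec.trim().equalsIgnoreCase("all")) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return pages;
        }
        for (String part : spec.split(",")) {
            String[] bounds = part.trim().split("-");
            int first = Integer.parseInt(bounds[0].trim());
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : first;
            if (first < 1 || last < first) {
                throw new IllegalArgumentException("Bad page range: " + part);
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return pages;
    }

    /**
     * Converts an --area value (top,left,bottom,right) into points.
     * A leading '%' means the values are percentages of the page height
     * (top, bottom) or of the page width (left, right).
     */
    public static double[] parseArea(String spec, double pageWidth, double pageHeight) {
        String trimmed = spec.trim();
        boolean percent = trimmed.startsWith("%");
        String[] parts = (percent ? trimmed.substring(1) : trimmed).split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected top,left,bottom,right: " + spec);
        }
        double[] area = new double[4];
        for (int i = 0; i < 4; i++) {
            double value = Double.parseDouble(parts[i].trim());
            if (percent) {
                if (value < 0 || value > 100) {
                    throw new IllegalArgumentException("Percentage out of range: " + value);
                }
                // Indices 0 and 2 are vertical (top, bottom); 1 and 3 are horizontal.
                double extent = (i % 2 == 0) ? pageHeight : pageWidth;
                value = value / 100.0 * extent;
            }
            area[i] = value;
        }
        return area;
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 5, 6, 7]
        System.out.println(parsePages("1-3,5-7", 10));
        // prints [0.0, 0.0, 792.0, 306.0]: the left half of a US Letter page
        System.out.println(Arrays.toString(parseArea("%0,0,100,50", 612, 792)));
    }
}
```

The percentage form is convenient when page sizes vary across documents, while absolute points give exact control for a known layout.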
It also includes a debugging tool; run java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h to list the available options.
You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.
JVM start-up time accounts for much of the cost of the tabula command, so if you need to extract tables from many PDFs, you have a few options for speeding it up:
- the -b option, which allows you to convert all PDFs in a given directory
- the drip utility
- the Ruby, Python, R, and Node.js bindings
- writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.
- waiting for us to implement an API/server-style system (it's on the roadmap)
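As a rough sketch of the "write your own program" option, the loop below handles every PDF in a directory inside a single JVM, so the start-up cost is paid only once. BatchSketch, listPdfs and processAll are hypothetical names, the case-insensitive .pdf match is an assumption, and the extraction step is a placeholder callback where calls into tabula-java would go.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BatchSketch {

    /** Lists the .pdf files directly inside a directory, sorted by path. */
    public static List<Path> listPdfs(Path directory) throws IOException {
        try (Stream<Path> entries = Files.list(directory)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Runs the given extraction step once per PDF and returns how many were handled. */
    public static int processAll(Path directory, Consumer<Path> extractTables) throws IOException {
        List<Path> pdfs = listPdfs(directory);
        for (Path pdf : pdfs) {
            extractTables.accept(pdf);
        }
        return pdfs.size();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder step: a real program would call into tabula-java here.
        int count = processAll(dir, pdf -> System.out.println("would extract: " + pdf));
        System.out.println(count + " PDF(s) handled in one JVM");
    }
}
```

Passing the extraction step as a callback keeps the directory walk independent of whichever extraction mode the program ends up using.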
API Usage Examples
A simple Java code example which extracts all rows and cells from all tables of all pages of a PDF document:
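// Requires org.apache.pdfbox.pdmodel.PDDocument, technology.tabula.* and
// technology.tabula.extractors.SpreadsheetExtractionAlgorithm (tabula-java 1.0.x API assumed)
InputStream in = this.getClass().getResourceAsStream("my.pdf");
try (PDDocument document = PDDocument.load(in)) {
    SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
    PageIterator pi = new ObjectExtractor(document).extract();
    while (pi.hasNext()) {
        // iterate over the pages of the document
        Page page = pi.next();
        List<Table> tables = sea.extract(page);
        // iterate over the tables of the page, then over their rows and cells
        for (Table table : tables) {
            for (List<RectangularTextContainer> row : table.getRows()) {
                for (RectangularTextContainer cell : row) {
                    System.out.print(cell.getText().replace("\r", " ") + "|");
                }
                System.out.println();
            }
        }
    }
}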
