tabula-extractor (old version)

Deprecation Note: This is the old version of the Tabula extraction engine. New projects that want to integrate Tabula should use tabula-java (the new Java version of this extraction engine), unless you prefer to use JRuby. Users looking for the command-line version of Tabula should also use tabula-java.

Extract tables from PDF files. tabula-extractor is the table extraction engine that used to power Tabula.

If you're beginning a new project, consider using tabula-java, a pure-Java version of the extraction engine behind Tabula. If you want Ruby bindings and are okay using JRuby (or have already begun a project), you may continue to use this project. This project's JRuby backend has been replaced with the Java backend; all that remains here is a thin wrapper for Ruby compatibility. This wrapper maintains API backwards-compatibility with the old, pure-JRuby implementation that we all know and love.

Installation

tabula-extractor only works with JRuby 1.7 or newer. Install JRuby, then run:

jruby -S gem install tabula-extractor
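
Because the gem only runs on JRuby, a script can check its interpreter up front and fail with a clear message. The following is a minimal sketch of such a check. The helper name `jruby_supported?` is hypothetical and not part of tabula-extractor; it relies only on the standard `RUBY_ENGINE` constant, the `JRUBY_VERSION` constant that JRuby defines, and RubyGems' `Gem::Version`.

```ruby
require 'rubygems'

# Hypothetical helper, not part of tabula-extractor: returns true when the
# given engine/version pair satisfies the "JRuby 1.7 or newer" requirement.
def jruby_supported?(engine, version)
  return false unless engine == 'jruby' && version
  # Gem::Version compares dotted versions numerically, so "1.10" > "1.7".
  Gem::Version.new(version) >= Gem::Version.new('1.7')
end

# RUBY_ENGINE is missing on very old interpreters; JRUBY_VERSION exists only on JRuby.
engine  = defined?(RUBY_ENGINE) ? RUBY_ENGINE : 'ruby'
version = defined?(JRUBY_VERSION) ? JRUBY_VERSION : nil

unless jruby_supported?(engine, version)
  warn "tabula-extractor needs JRuby 1.7 or newer (running on #{engine})"
end
```

Passing the engine and version in as arguments keeps the comparison easy to exercise from any Ruby, not only from JRuby.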

Usage

    Tabula helps you extract tables from PDFs

    Usage:
           tabula [options] <pdf_file>
    where [options] are:

    Tabula helps you extract tables from PDFs
      --pages, -p <s>: Comma separated list of ranges. Examples: --pages 1-3,5-7 or --pages 3. Default is --pages 1 (default: 1)
      --area, -a <s>: Portion of the page to analyze (top,left,bottom,right). Example: --area 269.875,12.75,790.5,561. Default is entire page
      --columns, -c <s>: X coordinates of column boundaries. Example --columns 10.1,20.2,30.3
      --password, -s <s>: Password to decrypt document. Default is empty (default: )
      --guess, -g: Guess the portion of the page to analyze per page.
      --debug, -d: Print detected table areas instead of processing.
      --format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
      --outfile, -o <s>: Write output to <file> instead of STDOUT (default: -)
      --spreadsheet, -r: Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
      --no-spreadsheet, -n: Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
      --silent, -i: Suppress all stderr output.
      --use-line-returns, -u: Use embedded line returns in cells.
      --version, -v: Print version and exit
      --help, -h: Show this message
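
To show how the options above combine, here is a small sketch that assembles a `tabula` command line from a Ruby hash. The helper `tabula_args` is hypothetical and not part of the gem; it uses only flags listed in the help text above and covers just a handful of them.

```ruby
# Hypothetical helper, not part of tabula-extractor: builds the argument list
# for the `tabula` command from a few of the documented options.
def tabula_args(pdf_file, opts = {})
  args = []
  args += ['--pages', opts[:pages]] if opts[:pages]
  # --area expects four coordinates in the order top,left,bottom,right
  args += ['--area', opts[:area].join(',')] if opts[:area]
  args += ['--format', opts[:format]] if opts[:format]
  args += ['--outfile', opts[:outfile]] if opts[:outfile]
  args << '--spreadsheet' if opts[:spreadsheet]
  args << pdf_file
end

args = tabula_args('whatever.pdf',
                   :pages  => '1-3,5-7',
                   :area   => [269.875, 12.75, 790.5, 561],
                   :format => 'TSV')

# Print the command instead of running it.
command = (['tabula'] + args).join(' ')
puts command

# To actually run it (assumes the gem's `tabula` executable is on the PATH):
# system('tabula', *args)
```

Keeping the arguments as an array and passing them to `system` one by one avoids shell quoting problems with file names that contain spaces.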

Scripting examples

tabula-extractor is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but the tests are a good source of information.

Here's a very basic example:

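    require 'tabula'

    pdf_file_path = "whatever.pdf"
    outfilename = "whatever.csv"

    # The block form of File.open closes the output file automatically.
    File.open(outfilename, 'w') do |out|
      # :all asks the extractor to process every page of the PDF.
      extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)
      extractor.extract.each do |pdf_page|
        # Write each detected spreadsheet-style table as CSV, separated by blank lines.
        pdf_page.spreadsheets.each do |spreadsheet|
          out << spreadsheet.to_csv
          out << "\n\n"
        end
      end
    end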
