Skip to content

qurator-spk/page2tsv

Repository files navigation

TSV - Processing Tools

Create .tsv files that can be viewed and edited with neat.

Installation:

Required python version is 3.11. Consider use of pyenv if that python version is not available on your system.

Activate virtual environment (virtualenv):

source venv/bin/activate 

or (pyenv):

pyenv activate my-python-3.11-virtualenv 

Update pip:

pip install -U pip 

Install tsvtools:

pip install git+https://github.com/qurator-spk/page2tsv.git 

PAGE-XML to TSV Transformation:

Create a TSV file from OCR in PAGE-XML format (with word segmentation):

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1 

In order to create a TSV file for multiple PAGE XML files just perform successive calls of the tool using the same TSV file:

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1 page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2 page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3 page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4 page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5 ... ... ... 

For instance, for the file example.xml:

page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg 

Processing of already existing TSV files:

Create a URL-annotated TSV file from an existing TSV file:

annotate-tsv enp_DE.tsv enp_DE-annotated.tsv 

Command-line interface:

page2tsv --help Usage: page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE Converts a page-XML file into a TSV file that can be edited with neat. Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as parameters and performs NER and EL and the document if these are provided. PAGE_XML_FILE: The source page-XML file. TSV_OUT_FILE: Resulting TSV file. Options: --purpose [NERD|OCR] Purpose of output tsv file. NERD: NER/NED application/ground-truth creation. OCR: OCR application/ground-truth creation. default: NERD. --image-url TEXT An image retrieval link that enables neat to show the scan images corresponding to the text tokens. Example: https://content.staatsbibliothek-berlin. de/zefys/SNP26824620-18371109-0-1-0-0/left,top,wi dth,height/full/0/default.jpg --ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See https://github.com/qurator-spk/sbb_ner for details. Only applicable in case of NERD. --ned-rest-endpoint TEXT REST endpoint of sbb_ned service. See https://github.com/qurator-spk/sbb_ned for details. Only applicable in case of NERD. --noproxy disable proxy. default: enabled. --scale-factor FLOAT default: 1.0 --ned-threshold FLOAT --min-confidence FLOAT --max-confidence FLOAT --ned-priority INTEGER --normalization-file PATH --help Show this message and exit. 
tsv2tsv --help Usage: tsv2tsv [OPTIONS] TSV_IN_FILE Options: --tsv-out-file PATH Write modified TSV to this file. --ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See https://github.com/qurator-spk/sbb_ner for details. --noproxy disable proxy. default: enabled. --num-tokens Print number of tokens in input/output file. --sentence-count Print sentence count in input/output file. --max-sentence-len Print maximum sentence len for input/output file. --keep-tokenization Keep the word tokenization exactly as it is. --sentence-split-only Do only sentence splitting. --show-urls Print contained visualization URLs. --just-zero Process only files that have max sentence length zero,i.e., that do not have sentence splitting. --sanitize-sentence-numbers Sanitize sentence numbering. --show-columns Show TSV columns. --drop-column TEXT Drop column --help Show this message and exit. 
alto2tsv --help Usage: alto2tsv [OPTIONS] ALTO_XML_FILE TSV_OUT_FILE Converts a ALTO-XML file into a TSV file that can be edited with neat. Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as parameters and performs NER and EL and the document if these are provided. ALTO_XML_FILE: The source ALTO-XML file. TSV_OUT_FILE: Resulting TSV file. Options: --purpose [NERD|OCR] Purpose of output tsv file. NERD: NER/NED application/ground-truth creation. OCR: OCR application/ground-truth creation. default: NERD. --image-url TEXT An image retrieval link that enables neat to show the scan images corresponding to the text tokens. Example: https://content.staatsbibliothek-berlin.d e/zefys/SNP26824620-18371109-0-1-0-0/left,top,widt h,height/full/0/default.jpg --ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See https://github.com/qurator-spk/sbb_ner for details. Only applicable in case of NERD. --ned-rest-endpoint TEXT REST endpoint of sbb_ned service. See https://github.com/qurator-spk/sbb_ned for details. Only applicable in case of NERD. --noproxy disable proxy. default: enabled. --scale-factor FLOAT default: 1.0 --ned-threshold FLOAT --ned-priority INTEGER --help Show this message and exit. 

About

PAGE-XML to TSV

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 5