Scraping large pdf tables which span across multiple pages

Question

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès":

    TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT N
    Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9
    Alt Camp DQ Vila-rodona 7,9 5,6 11,0 12,0 16,6 21,6 22,0 24,3 19,9 15,8
    Alt Empordà U1 Cabanes 8,2 6,5 11,7 12,6 17,5 22,0 23,1 24,4 20,4 16,6
    Alt Empordà W1 Castelló d'Empúries 8,1 6,4 11,6 12,9 17,0 21,1 22,0 23,4 20,1 16,4
    [...]
    TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT
    Baix Empordà DF la Bisbal d'Empordà 6,6 5,3 10,9 12,6 17,2 21,9 22,9 24,6 20,3 16
    Baix Empordà UB la Tallada d'Empordà 6,1 5,2 10,7 12,3 16,6 21,3 22,2 23,8 19,7 15
    Baix Empordà UC Monells 6,1 4,6 9,9 11,4 16,5 21,7 23,0 24,5 19,6 15
    [...]
    TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT
    [...]
    Solsonès CA Clariana de Cardener 4,6 3,3 10,3 10,2 16,7 22,3 d.i.
    Solsonès Z8 el Port del Comte (2.316 m) -0,9 -6,3 -0,2 -2,0 5,3 10,5 10,9 13,8 7,8 4,2
    Solsonès VO Lladurs 3,0 2,6 9,5 9,0 15,3 21,4 21,6 24,3 17,5 13,0
    Solsonès VP Pinós 3,0 1,6 8,9 9,2 15,4 21,1 21,3 23,8 17,6 13,3
    Solsonès XT Solsona d.i. 24,3 18,0 13,5
    Tarragonès VQ Constantí 7,9 6,0 11,2 13,1 17,1 21,9 22,6 24,6 20,6 16,6
    Tarragonès XE Tarragona - Complex Educatiu 10,2 7,8 12,3 14,6 18,3 23,0 24,2 26,2 23,0 * 18,4
    Tarragonès DK Torredembarra 9,7 7,7 12,3 14,3 17,9 22,8 24,3 26,2 22,7 18,5
    Terra Alta WD Batea 6,3 5,0 11,2 12,1 18,3 23,0 23,3 25,5 20,2 15,9
    Terra Alta XP Gandesa 6,6 5,2 11,2 12,2 18,1 22,9 23,4 25,6 20,4 16,0

complete file for download - UTF8

So, this output is not very easy to parse. What other approach is available?

It seems that every tool I use can only extract the layout of the table cells; it does not extract which column each cell belongs to. This is most apparent when cells are empty: the empty cells are not in the output, you only get the non-empty "cells" with their layout. Does the PDF itself contain this tabular information? If not, it doesn't make sense to search for a tool that will extract it.

Paid solutions are not out of the question, as they might in the end be cheaper than investing several working days of my time...

What I have tried:

copy paste - makes problems with missing values (pg 5)
save as text from Acrobat (even worse result than copy-paste)
open in Excel as external data source - will not recognize the table
https://www.pdftoexcelonline.com/ - results in error
http://www.pdftoexcel.org/ as well as their trial of Able2Extract - they messed up some columns. They recognized the columns correctly in the preview but in the excel output they were messed up
http://www.pdftoword.com/ - just takes my email and never sends anything
using python on scraperwiki http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/ seems very complicated especially for non-python users and https://scraperwiki.com/ is not free
I have encountered several Python libraries like pdftables, but they are not easy to use for a non-Python developer like me (I was not even able to run these things). Is there any easier way to accomplish the task?
I am trying to use tm library in R as recommended here, but I have encountered some problems

EDIT: I tried the Cloud SDK recommended by Ian. I registered, but I have absolutely no idea where to go from here: how to upload pages, how to recognize them, etc.

How do you want the data from, say, the last page of the PDF to appear? On that page, it seems like there are some "columns" that have two values.
– A5C1D2H2I1M1N2O1R2T1, Commented Aug 20, 2013 at 11:05
Most pages have two columns per month, but that does not seem to be a very big problem. The big question, it would seem, is how the data in the headers of the pages needs to be treated. The question seems woefully underspecified at the moment.
– IRTFM, Commented Aug 20, 2013 at 19:57
What are you talking about when stating "note missing values in lines beginning with 'Solsonès'"? Clearly these values are already missing in the original PDF file.
– Kurt Pfeifle, Commented May 1, 2016 at 17:48

hmatt1 · Accepted Answer · 2013-08-19 23:25:05Z

Ok I took a shot at this and I think it will help, although I'm not sure what you want your final output to look like. I'm happy to work more on this so let me know if there are parts you need help with.

I started by downloading a PDF to Text application from CNET.

After installing, I checked these settings:

PDF to text conversion

The important part here is we're using the physical layout option.

This gave us output that looks like this:

    Taules de Dades de la Xarxa d’Estacions Meteorològiques Automàtiques
    2
    Anuari de dades meteorològiques 2012 / Servei Meteorològic de Catalunya
    2 TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT NOV DES ANY
    Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9 11,0 8,5 14,8
    Alt Camp DQ Vila-rodona 7,9 5,6 11,0 12,0 16,6 21,6 22,0 24,3 19,9 15,8 11,0 8,6 14,7
    Alt Empordà U1 Cabanes 8,2 6,5 11,7 12,6 17,5 22,0 23,1 24,4 20,4 16,6 11,8 8,3 15,3
    Alt Empordà W1 Castelló d'Empúries 8,1 6,4 11,6 12,9 17,0 21,1 22,0 23,4 20,1 16,4 12,1 8,5 15,0
    Alt Empordà VZ Espolla 9,0 6,7 12,4 12,7 17,8 22,0 23,3 24,8 20,9 16,7 12,0 8,9 15,6
    [......]
    3
    Anuari de dades meteorològiques 2012 / Servei Meteorològic de Catalunya
    2 TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT NOV DES ANY
    Baix Empordà DF la Bisbal d'Empordà 6,6 5,3 10,9 12,6 17,2 21,9 22,9 24,6 20,3 16,6 11,9 7,6 14,9
    Baix Empordà UB la Tallada d'Empordà 6,1 5,2 10,7 12,3 16,6 21,3 22,2 23,8 19,7 15,8 11,7 7,6 14,4
    Baix Empordà UC Monells 6,1 4,6 9,9 11,4 16,5 21,7 23,0 24,5 19,6 15,7 11,7 7,2 14,3
    Baix Empordà UD Serra de Daró 6,3 5,3 10,6 12,3 16,8 21,6 22,7 24,3 20,3 16,6 12,2 7,7 14,8
    [......]
    4
    Anuari de dades meteorològiques 2012 / Servei Meteorològic de Catalunya
    2 TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT NOV DES ANY
    Maresme UQ Dosrius - PN Montnegre Corredor 7,2 4,6 10,8 10,7 15,8 20,4 20,8 23,4 18,6 15,1 10,7 7,8 13,9
    Maresme WT Malgrat de Mar 7,4 5,4 11,0 13,0 16,7 21,5 22,8 24,6 20,9 17,2 12,9 8,8 15,2
    Maresme DD Vilassar de Mar 10,1 7,5 12,6 13,9 17,9 22,4 23,7 25,7 22,1 18,4 13,8 10,8 16,6
    Montsià US Alcanar 10,0 7,6 11,8 14,2 17,9 22,7 24,0 25,8 22,0 18,2 13,7 10,7 16,6
    Montsià UU Amposta 9,6 7,5 12,1 14,3 18,3 22,8 23,5 25,3 21,6 18,0 13,1 10,8 16,4
    [......]

You can see the columns line up much better, but we also have headers and page numbers. Also, the COMARCA and i NOM EMA columns have varying widths from page to page. We want to normalize this to fixed-width columns.

I wrote a Perl program to normalize it. It also combines tables with the same title and prints the headers only once, at the top. It writes one file per table title into an output folder, using the title as the file name.

Here's the code:

    #!/bin/perl
    use strict;
    use warnings;
    use open qw(:std :utf8);
    use utf8;

    my $comarca;
    my $nom;
    my $print_headers;
    my $title = "";
    my $fh;

    while(<>) {
        if ( !/Xarxa d’Estacions/ and !/Meteorològiques Automàtiques/ and !/Servei/ and !/^\s*\d+\s*$/ and !/^\s*$/ ) {
            chomp($_);
            if ( /^\s*2/ ) { #title
                s/^\s*2\s*//;
                if ( $title ne $_ ) {
                    $title = $_;
                    $print_headers = 1;
                }
            } elsif ( /COMARCA/ ) { #column headers
                my ($first_col, $second_col, @the_rest) = split(/(CODI +i NOM EMA *)/, $_);
                $comarca = length $first_col;
                $nom = length $second_col;
                if ( $print_headers ) {
                    my $str = sprintf "%-50s %-50s %s\n", $first_col, $second_col, join("", @the_rest);
                    write_string($str);
                    $print_headers = 0;
                }
            } else { #data
                my ($one, $two, $three) = unpack("A${comarca}A${nom}A*", $_);
                my $str = sprintf "%-50s %-50s $three\n", $one, $two;
                write_string($str);
            }
        }
    }

    sub write_string {
        my $string = shift;
        my $file_name = $title;
        $file_name =~ s/[\/\\]//g;
        open ($fh, '>>', ".\/output_folder\/${file_name}.txt") or die "Couldn't open: $!";
        print $fh $string;
        close ($fh);
    }

There are still a few imperfections in the output (you'll see these when you run this), but I wanted to get some feedback on what output would work best for you. There is definitely more we can do to improve the code! The output directory tree looks like this:

    Matt@MattPC ~/perl/pdftotext
    $ find .
    .
    ./convert.pl
    ./EMAtaules2012.txt
    ./output.txt
    ./output_folder
    ./output_folder/AMPLITUD TÈRMICA MITJANA MENSUAL ( ºC ) - 2012?.txt
    ./output_folder/AMPLITUD TÈRMICA MÀXIMA MENSUAL ( ºC ) - 2012?.txt
    ./output_folder/DIRECCIÓ DOMINANT DEL VENT - 2012?.txt
    ./output_folder/GRUIX MÀXIM MENSUAL DE NEU AL TERRA ( cm ) - 2012?.txt
    ./output_folder/HUMITAT RELATIVA MITJANA MENSUAL ( % ) - 2012?.txt
    ./output_folder/MITJANA MENSUAL DE LA HUMITAT RELATIVA MÀXIMA DIÀRIA ( % ) - 2012?.txt
    ./output_folder/MITJANA MENSUAL DE LA HUMITAT RELATIVA MÍNIMA DIÀRIA ( % ) - 2012?.txt
    [......]

Where a file might look like this:

    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT NOV DES ANY
    Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9 11,0 8,5 14,8
    Alt Camp DQ Vila-rodona 7,9 5,6 11,0 12,0 16,6 21,6 22,0 24,3 19,9 15,8 11,0 8,6 14,7
    Alt Empordà U1 Cabanes 8,2 6,5 11,7 12,6 17,5 22,0 23,1 24,4 20,4 16,6 11,8 8,3 15,3
    Alt Empordà W1 Castelló d'Empúries 8,1 6,4 11,6 12,9 17,0 21,1 22,0 23,4 20,1 16,4 12,1 8,5 15,0
    Alt Empordà VZ Espolla 9,0 6,7 12,4 12,7 17,8 22,0 23,3 24,8 20,9 16,7 12,0 8,9 15,6
    Alt Empordà D6 Portbou 9,6 5,5 12,7 12,5 17,4 21,5 22,9 24,4 19,8 17,0 12,3 10,1 15,5
    [......]

Headers are only at the top and all the columns line up. This one is TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012.

I've been thinking of uploading more of the output to a file hosting site, but I don't know which would be a good one, suggestions?

Hope this helps you Tomas!

EDIT: Example of missing entries from AMPLITUD TÈRMICA MÀXIMA MENSUAL ( ºC ) - 2012:

    Solsonès VP Pinós 1 3,1 26 16,9 13 16,7 15 16,6 17 19,2 11 19,6 24 20,4 17 19,1 01 17,5 16 16,5 06 13,1 08 13,9 24 20,4 17/07
    Solsonès XT Solsona 22,2 25 22,2 09 20,1 16 18,6 06 15,3 07 18,2 23 22,2 09/08
    Tarragonès VQ Constantí 1 6,4 19 21,9 23 19,7 11 12,9 07 17,4 23 17,2 21 15,1 18 14,2 18 18,0 15 15,1 02 14,9 07 16,0 10 21,9 23/02

Update

Updated scripts for processing the input file:

    #!/bin/perl
    use strict;
    use warnings;
    use open qw(:std :utf8);
    use utf8;
    use charnames ':full';

    my @column_lengths;
    my $print_headers;
    my $title = "";
    my $fh;

    while(<>) {
        if ( !/Xarxa d’Estacions/ and !/Meteorològiques Automàtiques/ and !/Servei/ and !/^\s*\d+\s*$/ and !/^\s*$/ ) {
            s/[\r\n]+//g;
            s/ +\d+$//;
            if ( /^\s*2/ ) { #title
                s/^\s*2\s*//;
                if ( $title ne $_ ) {
                    $title = $_;
                    $print_headers = 1;
                }
            } elsif ( /COMARCA/ ) { #column headers
                my $comarca = (split(/(COMARCA *)/, $_))[1];
                my $codi = (split(/(CODI *)/, $_))[1];
                my $inomema = (split(/(i NOM EMA *) /, $_))[1];
                my $the_rest = (split(/(i NOM EMA *) /, $_))[2];
                my @rest = split(/( \w+ *)/, $the_rest);
                undef @column_lengths;
                push @column_lengths, length $comarca;
                push @column_lengths, length $codi;
                push @column_lengths, length $inomema;
                for (@rest) {
                    if ( $_ ) {
                        push @column_lengths, length $_;
                    }
                }
                $column_lengths[-1] = "*";
                if ( $print_headers ) {
                    $print_headers = 0;
                    write_string(join(";", unpack( "A" . join("A", @column_lengths), $_)) . "\n");
                }
            } else { #data
                write_string(join(";", unpack( "A" . join("A", @column_lengths), $_)) . "\n");
            }
        }
    }

    sub write_string {
        my $string = shift;
        my $file_name = $title;
        $file_name =~ s/[º]//g;
        $file_name =~ s/[^\w ]//g;
        $file_name =~ s/ +/ /g;
        $file_name =~ s/È/E/g;
        $file_name =~ s/À/A/g;
        $file_name =~ s/Ó/O/g;
        $file_name =~ s/Í/I/g;
        $file_name =~ s/Ç/C/g;
        open ($fh, '>>', ".\/output_folder\/${file_name}.txt") or die "Couldn't open: $!";
        print $fh $string;
        close ($fh);
    }

This one combines lines with the d.i. on the next line.

    #!/bin/perl -i
    use strict;
    use warnings;

    my $last = <>;
    while(<>) {
        my @current_array = split(";", $_);
        if ( /^;+[ \t]+.d\.i\./ ) {
            my @last_array = split(";", $last);
            my @combined_array;
            #print "matches\n";
            for my $element (@current_array) {
                if ( $element =~ /d\.i\./ ) {
                    push @combined_array, $element;
                    shift @last_array;
                } else {
                    push @combined_array, $last_array[0];
                    shift @last_array;
                }
            }
            undef @current_array;
            @current_array = @combined_array;
        }
        $last = join ";", @current_array;
        print $last;
    }

The output is in csv format with semicolon delimiters.
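As a rough sketch of how the semicolon-delimited output could be consumed afterwards (written in Python rather than Perl; the sample rows and the `parse_value` and `read_table` helpers are made up for illustration and are not part of the scripts above), the decimal commas and the empty or `d.i.` cells can be converted like this:

```python
import csv
import io


def parse_value(cell):
    """Convert '7,5' to 7.5; return None for empty or 'd.i.' cells."""
    cell = cell.strip()
    if cell in ("", "d.i."):
        return None
    return float(cell.replace(",", "."))


def read_table(text):
    """Parse semicolon-delimited text whose first row is the header.

    Assumes the first three fields are COMARCA, CODI and station name,
    and that every remaining field is a plain decimal-comma number.
    """
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = [name.strip() for name in next(reader)]
    rows = []
    for record in reader:
        comarca, codi, nom = (c.strip() for c in record[:3])
        values = [parse_value(c) for c in record[3:]]
        rows.append((comarca, codi, nom, values))
    return header, rows


# Invented sample in the shape the Perl script is described as producing.
sample = (
    "COMARCA;CODI;i NOM EMA;GEN;FEB;MAR\n"
    "Solsonès;VP;Pinós;3,0;1,6;8,9\n"
    "Solsonès;XT;Solsona;;d.i.;\n"
)
header, rows = read_table(sample)
```

Cells carrying extra markers (such as the `*` seen in one Tarragonès row, or the tables with two values per month) would need additional handling that this sketch does not attempt.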

Matt, looks good! Is your script able to handle the missing values in comarca Solsones (see my question)?
The script does handle the missing values, except that sometimes the d.i. entries appear on the next line, and we might want to add some code to truncate after the last column, since sometimes we have an extra page number. I'll work on uploading the files; it looks like I have to rename some of them to compress them in a zip (because of special characters).

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-08-23 01:10:45Z

Here is an R solution, but it is not without its flaws.

Part 1: Setup steps

    # Read the lines of your file into R
    x <- readLines("EMAtaules2012.txt")
    # Make sure it shows up as UTF-8 to get proper accents and so on
    Encoding(x) <- "UTF-8"
    # Identify the lines where the data starts
    Start <- grep("COMARCA", x)
    # Grab the names of each table
    ListNames <- gsub("\\s+", " ", x[Start-2])
    # Figure out the number of rows of data per page
    Runs <- rle(diff(cumsum(x != "")))
    Nrows <- Runs$lengths[Runs$lengths > 4]+1
    # Make our life easier by making this column name
    # a single string
    x <- gsub("i NOM EMA", "i_NOM_EMA", x)
    # Since these are fixed width files, we need to figure
    # out the widths of each column. This is the sum of
    # the number of characters in the header row plus
    # the number of spaces between each column name
    Spaces <- gregexpr(x[Start], pattern="\\s+")
    Spaces <- lapply(Spaces, function(x) c(attr(x, "match.length"), 0))
    Chars <- lapply(strsplit(x[Start], "\\s+"), nchar)
    Widths <- lapply(seq_along(Spaces), function(x)
      rowSums(cbind(Spaces[[x]], Chars[[x]])))

Part 2: Using `read.fwf` to get the data in

    # Now, you can use `read.fwf` to read your data files in
    temp <- lapply(seq_along(Start), function(fwf) {
      A <- read.fwf(textConnection(x),
                    widths = c(Widths[[fwf]]),
                    header = FALSE,
                    skip = Start[fwf]+1,
                    n = Nrows[fwf]-2,
                    blank.lines.skip = TRUE,
                    strip.white = TRUE,
                    stringsAsFactors = FALSE)
      # Add in the column names
      names(A) <- scan(what = "character",
                       file = textConnection(x[Start[fwf]]),
                       quiet = TRUE)
      A
    })
    # Assign the table names
    names(temp) <- ListNames
    # Some more cleanup. The original tables span multiple pages
    # in the PDF, but we can `rbind` them together in R
    Tables <- unique(ListNames)
    final <- lapply(seq_along(Tables), function(final) {
      A <- do.call(rbind, temp[names(temp) %in% Tables[final]])
      rownames(A) <- NULL
      A
    })
    # Add the names back in
    names(final) <- Tables

Part 3: Did it work?

    # View the first few rows and columns of the first three tables
    lapply(final[1:3], function(y) head(y[1:5], 3))
    # $` TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012`
    #       COMARCA CODI           i_NOM_EMA GEN FEB
    # 1    Alt Camp   DQ         Vila-rodona 7,9 5,6
    # 2 Alt Empordà   U1             Cabanes 8,2 6,5
    # 3 Alt Empordà   W1 Castelló d'Empúries 8,1 6,4
    #
    # $` TEMPERATURA MÀXIMA MITJANA MENSUAL ( ºC ) - 2012`
    #       COMARCA CODI           i_NOM_EMA  GEN  FEB
    # 1    Alt Camp   DQ         Vila-rodona 13,1 11,7
    # 2 Alt Empordà   U1             Cabanes 15,1 12,4
    # 3 Alt Empordà   W1 Castelló d'Empúries 14,4 11,7
    #
    # $` TEMPERATURA MÍNIMA MITJANA MENSUAL ( ºC ) - 2012`
    #       COMARCA CODI           i_NOM_EMA GEN FEB
    # 1    Alt Camp   DQ         Vila-rodona 3,8 0,5
    # 2 Alt Empordà   U1             Cabanes 2,4 0,9
    # 3 Alt Empordà   W1 Castelló d'Empúries 2,1 0,5

    # Some tables, like those on page 76 (for the table "DIRECCIÓ DOMINANT DEL VENT"),
    # had more columns than others.
    # Did our script take care of that?
    names(final$` DIRECCIÓ DOMINANT DEL VENT`)
    #  [1] "COMARCA"   "CODI"      "i_NOM_EMA" "vent"      "GEN"       "FEB"
    #  [7] "MAR"       "ABR"       "MAI"       "JUN"       "JUL"       "AGO"
    # [13] "SET"       "OCT"       "NOV"       "DES"       "ANY"

It sort of worked. But your input file is not perfect, and that means that there will still be a lot of cleaning up to do. For instance, some columns in the PDF seem to have multiple values. Not sure how you would be able to do any analysis on those.

Hopefully, the comments in the above code help get you started on figuring out how to go about scraping the data in a better way.

Update: Extracting just the data

Continuing after "Part 1" above, here's a solution that relies on (gasp) Excel. The basic idea is that Excel actually does a pretty decent job of detecting where the column breaks are if you import text as Fixed Width.

So, we use R to break up the text into separate pages, one file per page, only the data (not the column names or the row names, which are mostly the same across all datasets).

With that, here's the last R step:

    # Output just the data
    temp <- lapply(seq_along(Widths), function(y) {
      DEL <- sum(Widths[[y]][1:3])-2
      A <- substring(x[(Start[y]+1):(sum(Start[y], Nrows[y]))], DEL)
      # paste0() avoids the spaces that paste() would insert by default,
      # so the files are named "temp_1.txt", "temp_2.txt", ...
      writeLines(A, paste0("temp_", y, ".txt"))
      A
    })

Let's open file "temp_9.txt", which is one that has the missing columns:


^^ Make sure "Fixed Width" is selected -- It should be by default since the file has no delimiters.


^^ Excel shows you a preview of where it is going to make the columns.


^^ I've highlighted the "problem rows" for you to see how it worked out.

We seem to be approaching this with a similar strategy (and you seem further along than I have gotten). You might want to see if your approach comes closer to complete success after removing the "Ã" characters. I get what looks like 100% "registration" after that simple step.
@DWin, I forgot to mention that: I just used Encoding(x) <- "UTF-8" to show the characters properly.
Thanks Ananda, but how did this method work for the missing values in comarca Solsones?

Ian Hopkinson · Accepted Answer · 2013-08-07 08:17:37Z

In the past I have used pdftohtml which can be used to generate xml, described here. The columns are generally fairly well separated so you could use the positioning to extract columns.
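To illustrate the positioning idea: `pdftohtml -xml` writes one `<text>` element per text fragment, with `top` and `left` attributes giving its position on the page. Below is a minimal Python sketch under that assumption. The two-row sample, the hand-picked column boundaries and the `rows_from_page` helper are all invented for illustration; real output has many more fragments, and fragments on one visual row may differ slightly in `top`, so some tolerance would probably be needed.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right
from collections import defaultdict

# Invented sample in the general shape of `pdftohtml -xml` output.
SAMPLE = """<page number="5">
<text top="100" left="40" width="50" height="10" font="0">Solsonès</text>
<text top="100" left="120" width="60" height="10" font="0">VP Pinós</text>
<text top="100" left="300" width="20" height="10" font="0">3,0</text>
<text top="100" left="340" width="20" height="10" font="0">1,6</text>
<text top="115" left="40" width="50" height="10" font="0">Solsonès</text>
<text top="115" left="120" width="60" height="10" font="0">XT Solsona</text>
<text top="115" left="340" width="20" height="10" font="0">d.i.</text>
</page>"""


def rows_from_page(page_xml, boundaries):
    """Group text fragments by `top`, then bin them into columns by `left`.

    `boundaries` holds the left edge of every column except the first,
    so a fragment lands in the column whose edge it has passed.
    """
    page = ET.fromstring(page_xml)
    lines = defaultdict(dict)
    for node in page.iter("text"):
        col = bisect_right(boundaries, int(node.get("left")))
        text = "".join(node.itertext()).strip()
        lines[int(node.get("top"))][col] = text
    ncols = len(boundaries) + 1
    # Columns with no fragment on a row become empty strings.
    return [[lines[top].get(c, "") for c in range(ncols)]
            for top in sorted(lines)]


table = rows_from_page(SAMPLE, boundaries=[100, 280, 330])
```

Because each cell is placed by its horizontal position rather than by its order in the line, a missing value simply leaves its column empty instead of shifting the later values to the left.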

I wrote a large part of pdftables, apologies for the opaqueness! It works OK for some pages of the document you show; for example, page 2 gives me the output at the bottom of this reply. For other pages it falls over, on page 33 for example. The problem there is that there are two numbers under one column heading and they get stuck together by pdftables. The "COMARCA, CODI i, NOM EMA" columns don't get separated in either case. You can submit issues for pdftables on GitHub; I'm not working on it actively at the moment. It is available by pip install.

If you wanted to go the commercial route then Abbyy FineReader is very good, they produce a cloud SDK which will give you 30 or so pages free. They have example code in multiple languages but their support isn't great.

 14 columns, 39 rows 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ----------------------------------------------------------------------------------------------------- 0 | COMARCACODI i NOM EMA| GEN| FEB| MAR| ABR| MAI| JUN| JUL| AGO| SET| OCT| NOV| DES| ANY| 1 | VYNullesAlt Camp| 7,5| 5,5|10,9|12,3|16,7|21,6|22,3|24,4|20,1|15,9|11,0| 8,5|14,8| 2 | DQVila-rodonaAlt Camp| 7,9| 5,6|11,0|12,0|16,6|21,6|22,0|24,3|19,9|15,8|11,0| 8,6|14,7| 3 | Alt EmpordÃ U1Cabanes| 8,2| 6,5|11,7|12,6|17,5|22,0|23,1|24,4|20,4|16,6|11,8| 8,3|15,3| 4 | Alt EmpordÃ W1CastellÃ³ d'EmpÃºries| 8,1| 6,4|11,6|12,9|17,0|21,1|22,0|23,4|20,1|16,4|12,1| 8,5|15,0| 5 | Alt EmpordÃ VZEspolla| 9,0| 6,7|12,4|12,7|17,8|22,0|23,3|24,8|20,9|16,7|12,0| 8,9|15,6| 6 | D6PortbouAlt EmpordÃ | 9,6| 5,5|12,7|12,5|17,4|21,5|22,9|24,4|19,8|17,0|12,3|10,1|15,5| 7 | D4RosesAlt EmpordÃ | 9,3| 7,2|13,0|13,6|18,2|22,6|23,9|25,7|21,3|17,5|13,2| 9,9|16,3| 8 | Alt EmpordÃ U2Sant Pere Pescador| 7,8| 6,3|11,5|12,9|16,8|21,2|22,2|23,6|20,2|16,5|12,3| 8,5|15,0| 9 | Alt EmpordÃ W2Torroella de FluviÃ | 7,4| 6,0|11,2|12,6|16,4|21,2|22,3|23,7|19,9|16,1|11,7| 8,0|14,7| 10 | Alt EmpordÃ W3VentallÃ³| 7,3| 6,2|11,4|12,8|16,9|21,8|22,8|24,3|20,4|16,5|12,0| 8,1|15,1| 11 | Alt PenedÃ¨sWPCanaletes| 7,0| 5,2|11,3|11,9|16,7|21,5|22,0|24,2|19,7|15,6|10,7| 8,1|14,5| 12 | Alt PenedÃ¨sDIFont-rubÃ| 8,1| 6,2|12,0|11,9|16,9|21,8|22,0|24,4|20,0|15,9|11,4| 8,9|15,0| 13 | Alt PenedÃ¨sW4la Granada| 7,0| 5,5|11,2|12,6|17,2|21,9|22,4|24,3|20,0|16,0|11,1| 8,3|14,8| 14 | Alt PenedÃ¨sU3Sant MartÃ Sarroca| 6,4| 5,1|10,9|12,4|17,0|21,8|22,3|24,3|19,9|15,7|10,8| 8,0|14,6| 15 | Alt PenedÃ¨sWYSant SadurnÃ d'Anoia| 6,4| 5,1|11,0|12,8|17,6|22,6|23,2|25,0|20,5|16,2|10,9| 7,8|15,0| 16 | CDla Seu d'UrgellAlt Urgell| 3,6| 2,5| 8,5| 8,4|14,6|20,3|21,0|23,4|16,9|12,2| 7,0| 3,2|11,8| 17 | W5OlianaAlt Urgell| 2,0| 2,7| 9,8|10,2|16,8|23,0|22,9|25,6|19,1|13,9| 8,6| 3,1|13,2| 18 | Alt UrgellCJOrganyÃ | 2,6| 3,5| 9,8| 9,9|16,1|22,0|22,6|25,3|18,8|13,5| 8,2| 2,9|13,0| 19 | Alta 
RibagorÃ§aZ2BoÃ (2.535 m)|-2,4|-7,5|-1,3|-3,4| 3,8| 8,6| 9,4|12,0| 6,3| 2,7|-1,1|-3,2| 2,0| 20 | Alta RibagorÃ§aCTel Pont de Suert| 0,5| 1,6| 6,9| 7,9|14,1|18,0|19,1|20,4|15,7|10,7| 6,1| 1,3|10,2| 21 | CEels Hostalets de PierolaAnoia| 7,3| 5,5|11,7|12,1|17,4|22,4|22,9|25,2|20,3|16,2|11,1| 8,3|15,1| 22 | XBla LlacunaAnoia| 5,4| 3,3| 9,3|10,3|15,6|20,8|20,9|23,3|18,0|14,1| 9,1| 6,9|13,1| 23 | AnoiaXAla Panadella| 3,6| 1,7| 9,2| 8,7|14,9|20,5|20,4|23,2|17,2|13,3| 7,9| 5,1|12,2| 24 | H1Ã’denaAnoia| 5,1| 3,3| 9,4|11,5|16,3|21,7|22,5|24,6|19,4|15,2| 9,3| 6,0|13,7| 25 | WWArtÃ©sBages| 3,5| 2,8| 9,2|11,2|16,6|22,4|23,2|25,1|19,3|15,0| 9,1| 4,3|13,5| 26 | U4Castellnou de BagesBages| 4,8| 3,8|10,5|10,9|16,3|22,0|22,5|25,0|19,3|15,0| 9,6| 5,9|13,9| 27 | R1el Pont de VilomaraBages| 3,8| 3,1| 9,9|12,3|17,4|22,9|23,5|25,4|20,0|15,7| 9,7| 5,0|14,1| 28 | BagesWNMontserrat - Sant Dimes| 6,2| 3,3| 9,7| 8,6|14,8|19,5|19,5|22,4|16,9|13,5| 9,0| 7,1|12,6| 29 | CLSant Salvador de GuardiolaBages| 3,3| 2,8| 9,1|11,5|16,4|22,0|22,4|24,6|19,2|14,9| 9,1| 4,8|13,4| 30 | U5Prades - los HortalsBaix Camp| 2,8| 0,0| 6,4| 7,4|13,0|18,4|18,0|21,3|15,0|11,3| 6,5| 4,1|10,4| 31 | W6RiudomsBaix Camp| 9,7| 7,1|12,0|13,4|17,6|22,4|23,1|25,2|21,2|17,1|12,3|10,1|16,0| 32 | U6Vinyols i els ArcsBaix Camp|10,2| 7,6|12,0|13,8|17,6|22,5|24,0|25,9|22,3|18,2|13,2|11,1|16,6| 33 | Baix EbreU7Aldover|10,0| 8,5|13,2|14,8|19,7|24,6|25,2|27,1|22,7|18,3|12,9|11,1|17,4| 34 | DBel PerellÃ³Baix Ebre| 8,7| 7,0|12,0|13,3|17,9|22,6|23,3|25,3|21,4|17,2|11,9|10,3|15,9| 35 | U9l'AldeaBaix Ebre| 9,9| 8,1|12,5|14,3|18,5|23,3|24,1|26,0|22,1|17,9|13,1|10,7|16,8| 36 | UAl'Ametlla de MarBaix Ebre| 9,6| 7,8|12,3|13,8|18,0|22,9|23,9|25,8|22,0|17,6|12,5|10,6|16,4| 37 | Baix EbreX5PN dels Ports| 3,4|-0,2| 6,5| 6,8|13,4|18,7|17,8|21,2|15,2|11,3| 6,1| 4,9|10,5| 38 | Baix EmpordÃ DOCastell d'Aro| 6,7| 5,1|10,6|12,0|16,2|20,9|21,8|23,8|20,1|16,3|12,2| 8,1|14,5| 
-----------------------------------------------------------------------------------------------------

The unicode problems are down to my dev environment (Spyder).

Dear Ian, thank you, and sorry for the late reply. But 1) how would your tool work on page 5, where there are missing values in the table (for comarca Solsonès)? 2) Is your tool able to join a table which is split across multiple pages into one, coercing the corresponding columns? These problems are actually the essential thing for me; I can survive the "COMARCA, CODI i, NOM EMA" columns being merged.
I tried the Cloud SDK you recommend, I registered but I absolutely don't know where to go from here - how to upload pages, recognize them etc. Please look at my updated question.
Sorry for delay - I thought I was getting notifications, and I'm not! The missing values are no problem, they will simply appear as empty cells in pdftables. Joining tables between pages would be a matter of programming (that's to say as currently configured it would extract a table for each page and how to join them would be a matter of further programming). It looks like there is the same number of columns on each page, and they just differ by their absolute location on the page - in which case joining up should be straightforward.
The cloud sdk is for programmatic use only, there are code samples here: ocrsdk.com/documentation/code-samples However, I'm not clear whether you are looking for a programmatic solution here

canary_in_the_data_mine · Accepted Answer · 2013-08-20 14:04:15Z


If you're wary of diving too deeply into Python or other code-based solutions, a completely different approach for a quick and dirty solution for a small number of pdfs is to outsource the task to MechanicalTurk.

Having multiple users per column allows you to double-check the submitted answers, and you can also publish the resulting .csv table and pay a large amount (say, $5) for every error that a worker can find. Often ends up being way cheaper than your or others' time programming a solution.


Tomas Over a year ago

Mechanical Turk??? Are they for real? Sounds quite offensive and humiliating (almost racist) for those people employed in that! +1 though, seems like one of possible solutions.

canary_in_the_data_mine Over a year ago

Ah, the name is based on 'artificial artificial intelligence'.

IRTFM Over a year ago

In the old days, pre- and intra-WWII, a "computer" was (typically) a woman with a slide rule or other mechanical mathematical device and pencil. One of Richard Feynman's jobs at Los Alamos was managing the "computer" staff. One can also ask why this didn't result in earlier attention to parallelism as a strategy.

LorenzoDonati4Ukraine-OnStrike Over a year ago

@Tomas I hope they didn't mean anything racist. I guess they are simply making an indirect reference to this historical thing.

Community · Accepted Answer · 2017-05-23 12:17:42Z

Although the layout differs across pages when using pdftotext, note that the column headings on individual pages (COMARCA, CODI, etc) seem to line up with the data on that page.

Also, there are many different types of data in your pdf - wind direction, wind strength, humidity, precipitation, etc. So not only does the layout differ across pages for the same data, but the layout differs because there are different data sets as well.

And just for completeness - the missing data for "Solsonès" (as one example) exists in the original PDF. It seems like pdftotext did a reasonable job - the missing data is whitespace, just like in the original PDF.

As a result, it may make sense to stay with pdftotext and treat the pages (which are separated by form feeds) as columnar data and parse using struct as documented here:

How to efficiently parse fixed width files?

One way to make this work would be to detect the form feed, look for the next line starting with "COMARCA", and use the spacing in that line to set up the columns for struct.
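A minimal Python sketch of that idea follows. It uses plain string slicing instead of `struct`, and it rests on assumptions that are not verified against the real file: header names are separated by at least two spaces, every data cell starts at or after its header's start column, and nothing but data rows follows the header on a page. The sample pages and the helper names (`make_line`, `column_starts`, `split_fixed`, `parse_pages`) are invented.

```python
import re

HEADER = ["COMARCA", "CODI", "i NOM EMA", "GEN", "FEB", "MAR"]


def make_line(cells, widths):
    """Build a fixed-width line (only used to create the sample pages)."""
    return "".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()


def column_starts(header_line):
    """Start offset of each header name; names may contain single spaces."""
    return [m.start() for m in re.finditer(r"\S+(?: \S+)*", header_line)]


def split_fixed(line, starts):
    """Cut a line at the given offsets and strip each cell."""
    ends = starts[1:] + [None]
    return [line[s:e].strip() for s, e in zip(starts, ends)]


def parse_pages(text):
    """Return one (header, rows) pair per form-feed-separated page."""
    tables = []
    for page in text.split("\f"):
        lines = page.splitlines()
        idx = next((i for i, line in enumerate(lines)
                    if line.lstrip().startswith("COMARCA")), None)
        if idx is None:
            continue  # page without a table header
        starts = column_starts(lines[idx])
        rows = [split_fixed(line, starts)
                for line in lines[idx + 1:] if line.strip()]
        tables.append((split_fixed(lines[idx], starts), rows))
    return tables


# Two invented pages whose column widths differ, as in the question.
W1, W2 = [12, 6, 12, 7, 7, 7], [10, 6, 11, 6, 6, 6]
page1 = "\n".join([
    "2 TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012", "",
    make_line(HEADER, W1),
    make_line(["Alt Camp", "VY", "Nulles", "7,5", "5,5", "10,9"], W1),
])
page2 = "\n".join([
    "2 TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012", "",
    make_line(HEADER, W2),
    make_line(["Solsonès", "VP", "Pinós", "3,0", "1,6", "8,9"], W2),
    make_line(["Solsonès", "XT", "Solsona", "", "d.i.", ""], W2),
])
tables = parse_pages(page1 + "\f" + page2)
```

In the real output the numbers may be right-aligned or wider than their header names, so the offsets derived from the header would likely need adjusting, for example by cutting at the midpoint of the gap between neighbouring header names.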

Yes it is there because many tools I tried are in Python, and the scraping community seems to be Python-oriented, but I prefer non-python solution if possible. Or, if there is a python tool which doesn't require knowledge of python - e.g. some command-line tool which does the parsing for me, then it is great. I can do the rest in PERL, PHP, bash, etc...
To follow a similar approach in Perl, does unpack help you, as described here? stackoverflow.com/questions/4911044/parse-fixed-width-files
unpack seems useful here (parsing fixed width columns), but I think you would also need a different template for each page since the column widths are different. The plus side is that it would handle missing entries. Tutorial here: perldoc.perl.org/perlpacktut.html

IRTFM · Accepted Answer · 2013-08-20 21:40:58Z

Efforts to construct an index for this (presumably the variation in formats relates to the different sub-reports). These all seem to be for Catalunya:

    heads <- grep(" .+2012", txt)
    notheads <- grep(" .+Anuari de", txt)
    headtxt <- unique(trim(txt[1:length(txt) %in% heads & !1:length(txt) %in% notheads]))
     [1] "TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012"
     [2] "TEMPERATURA MÀXIMA MITJANA MENSUAL ( ºC ) - 2012"
     [3] "TEMPERATURA MÍNIMA MITJANA MENSUAL ( ºC ) - 2012"
     [4] "TEMPERATURA MÀXIMA ABSOLUTA MENSUAL ( ºC ) - 2012"
     [5] "TEMPERATURA MÍNIMA ABSOLUTA MENSUAL ( ºC ) - 2012"
     [6] "AMPLITUD TÈRMICA MITJANA MENSUAL ( ºC ) - 2012"
     [7] "AMPLITUD TÈRMICA MÀXIMA MENSUAL ( ºC ) - 2012"
     [8] "NOMBRE DE DIES DE GLAÇADA ( TN ≤ 0 ºC ) - 2012"
     [9] "PRECIPITACIÓ MENSUAL ( mm ) - 2012"
    [10] "PRECIPITACIÓ MENSUAL MÀXIMA EN 24 HORES ( mm ) - 2012"
    [11] "PRECIPITACIÓ MENSUAL MÀXIMA EN 1 HORA ( mm ) - 2012"
    [12] "PRECIPITACIÓ MENSUAL MÀXIMA EN 30 MINUTS ( mm ) - 2012"
    [13] "PRECIPITACIÓ MENSUAL MÀXIMA EN UN 1 MINUT ( mm ) - 2012"
    [14] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT ≥ 0,1 mm) - 2012"
    [15] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT > 0,2 mm) - 2012"
    [16] "VELOCITAT MITJANA DEL VENT MENSUAL ( m/s ) - 2012"
    [17] "DIRECCIÓ DOMINANT DEL VENT - 2012"
    [18] "MITJANA MENSUAL DE LA RATXA MÀXIMA DIÀRIA DEL VENT ( m/s ) - 2012"
    [19] "RATXA MÀXIMA ABSOLUTA DEL VENT MENSUAL ( m/s ) - 2012"
    [20] "HUMITAT RELATIVA MITJANA MENSUAL ( % ) - 2012"
    [21] "MITJANA MENSUAL DE LA HUMITAT RELATIVA MÀXIMA DIÀRIA ( % ) - 2012"
    [22] "MITJANA MENSUAL DE LA HUMITAT RELATIVA MÍNIMA DIÀRIA ( % ) - 2012"
    [23] "MITJANA MENSUAL DE LA IRRADIACIÓ SOLAR GLOBAL DIÀRIA ( MJ/m2 ) - 2012"
    [24] "PRESSIÓ ATMOSFÈRICA MITJANA MENSUAL, A NIVELL DE L'EMA ( hPa ) - 2012"
    [25] "PRESSIÓ ATMOSFÈRICA MÀXIMA ABSOLUTA MENSUAL ( hPa ) - 2012"
    [26] "PRESSIÓ ATMOSFÈRICA MÍNIMA ABSOLUTA MENSUAL ( hPa ) - 2012"
    [27] "GRUIX MÀXIM MENSUAL DE NEU AL TERRA ( cm ) - 2012"

The parens and dashes interfere with grepping. So trying to get these into a form where the values can be used to identify page header locations by grep(val, txt) succeeds by removing the "\\(.+$" matches, with a single exception (which I decided to fix "by hand"):

    headtxt[14:15]
    #[14] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT ≥ 0,1 mm) - 2012"
    #[15] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT > 0,2 mm) - 2012"
    headtxt <- gsub("\\(.+$", "", headtxt)
    pagedivs <- lapply(headtxt, grep, txt)
    # Seemed reasonable that the first 5 (of 10) should be the first section
    pagedivs[[14]] <- pagedivs[[14]][1:5]
    pagedivs[[15]] <- pagedivs[[15]][6:10]

So looking for a marker to end pages it looks like 4 empty lines is reliable

    > length(notheads)
    [1] 113
    > rl.lens <- rle( nchar(txt) )
    > table(rl.lens$lengths[rl.lens$values==0])
    #  1   4
    #226 113

Removed all the "Ã" because they were creating non-fixed width columns:

    txt <- gsub("Ã", "", txt)
    write(txt, "txt_noAs.txt")

Interestingly, my text editor now shows "à"'s where the "Ã"'s used to appear. At this point one can loop over the pages within page type starting at pagedivs+4 to the location of 4 empty rows and use read.fwf from the 'utils' package. What remains to support this is a layout definition, which you say you already have a handle on, but which could be also inferred using pkg:gsubfn's strapply or a regex solution.

Looking for an approach to develop a regex solution:

    > numfields <- gregexpr("[-[:digit:].]+ ", txt)
    > table( sapply( numfields, length))

       1    2    3    5    6    7    8   11   12   13   14   15
    1201  193    8    1   13   15    2    4 1162  869  308   32
      16   17   19   20   21   23   24   25   26   27   28   30
       1    3    1    1    1    7   10  688  481  168   13    1

So clearly the pages fall into two classes: those where the number of numeric columns is 12-14 and those where they number 23-28. I would have expected this to be a bit different, but I guess the "ANY" columns threw off my expectations.
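The same kind of tally can be sketched in Python. This is not a port of the R call above: it counts whitespace-separated tokens that look like decimal-comma numbers, which is an assumption about what a numeric field looks like in this file, and the sample lines and helper names are invented.

```python
import re
from collections import Counter

# A "numeric field" here means an optional minus sign, digits,
# a decimal comma and more digits, e.g. "7,5" or "-0,9".
NUMBER = re.compile(r"-?\d+,\d+")


def count_numeric(line):
    """Number of whitespace-separated tokens that look like numbers."""
    return sum(1 for tok in line.split() if NUMBER.fullmatch(tok))


def numeric_field_table(lines):
    """Tally how many lines have 0, 1, 2, ... numeric fields."""
    return Counter(count_numeric(line) for line in lines)


sample_lines = [
    "2 TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012",
    "Alt Camp VY Nulles 7,5 5,5 10,9",
    "Solsonès Z8 el Port del Comte (2.316 m) -0,9 -6,3",
]
tally = numeric_field_table(sample_lines)
```

Lines whose count falls well below the usual value for their page would then be candidates for rows with missing entries.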

Thanks DWin, but how did this method work for the missing values in comarca Solsones?
I don't really think I have described "a method" yet. I've looked at the Solsones lines since they have the highest degree of missingness, but in the context of any one page, the column alignment is still preserved. I've also just noticed that an easier page break test is to search for grep("^\\\f", txt). I assumed we could page through with @AnandaMahto's approach, after it had been encapsulated into a function that processed page by page.
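
The form-feed test mentioned in the comment above could be used to cut txt into pages roughly like this. It is only a sketch on invented lines, assuming each page after the first starts with a form feed ("\f"), as pdftotext output normally does when page breaks are not suppressed:

```r
# Toy stand-in for txt: three pages, the 2nd and 3rd starting with a form feed.
txt <- c("page 1 row a", "page 1 row b",
         "\fpage 2 row a", "page 2 row b",
         "\fpage 3 row a")

# Every form feed starts a new page, so a running count gives the page number.
page_id <- cumsum(grepl("^\f", txt)) + 1

# Drop the form feed itself and split the lines by page.
pages <- split(sub("^\f", "", txt), page_id)
pages
```

Each element of pages could then be handed to a page-level function, for example one built around read.fwf.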

Kurt Pfeifle · Accepted Answer · 2016-05-01 18:53:07Z

It's very clear that the original Excel spreadsheet was composed of different sheets which used different column widths.

So the PDF tables also use different column widths. If you look at the PDF, you can see the following groups of page ranges, each of which has identical column widths. Each group also describes different things, as can be seen from the changed headline on each group's starting page (I can identify these differences even without being able to understand Spanish):

pages 2-6 (5 pages)
pages 7-11 (5 pages)
pages 12-16 (5 pages)
pages 17-21 (5 pages)
pages 22-26 (5 pages)
pages 27-31 (5 pages)
pages 32-36 (5 pages)
pages 37-41 (5 pages)
pages 42-46 (5 pages)
pages 47-51 (5 pages)
pages 52-56 (5 pages)
pages 57+58 (2 pages)
pages 59-62 (4 pages)
pages 63-67 (5 pages)
pages 68-72 (5 pages)
pages 73-76 (4 pages)
pages 77-80 (4 pages)
pages 81-84 (4 pages)
pages 85-88 (4 pages)
pages 89-93 (5 pages)
pages 94-98 (5 pages)
pages 99-103 (5 pages)
pages 104-107 (4 pages)
pages 108+109 (2 pages)
pages 110+111 (2 pages)
pages 112+113 (2 pages)
finally, page 114 (1 page only)

So you could let pdftotext extract the table data by these page groups. If the columns do not come out perfectly aligned within a page range, you would have to extract the tables page by page. These should be easy enough to import into Excel as "fixed-width" table data.
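
For anyone driving this from R instead of typing the commands by hand, the page groups can be turned into one pdftotext call per group. This is only a sketch: it assumes the groups are contiguous from page 2 to page 114, the output file names are hypothetical, and the actual call is commented out because it needs pdftotext and the PDF:

```r
# First page of each of the 27 page groups listed above; each group ends
# on the page before the next group starts, and the last one is page 114.
first <- c(seq(2, 52, by = 5),
           57, 59, 63, 68, 73, 77, 81, 85, 89, 94, 99, 104, 108, 110, 112, 114)
last  <- c(first[-1] - 1, 114)

# One argument vector per group. The output names group_01.txt, ... are
# hypothetical; any cropping options are left out here.
args <- lapply(seq_along(first), function(i) {
  c("-layout", "-enc", "UTF-8",
    "-f", first[i], "-l", last[i],
    "EMAtaules2012.pdf", sprintf("group_%02d.txt", i))
})

# Not run here: needs pdftotext on the PATH and the PDF in the working directory.
# for (a in args) system2("pdftotext", a)
args[[5]]
```

The fifth element corresponds to pages 22-26, the same range as the hand-written command in this answer.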

To show you an example (created with Poppler's version of pdftotext):

    pdftotext         \
      -layout         \
      -enc UTF-8      \
      -f 22 -l 26     \
      -nopgbrk        \
      -x 20 -y 82     \
      -W 810 -H 450   \
      EMAtaules2012.pdf \
      -

-f 22 -l 26:
This tells the tool to extract page 22 as the first in the range, and page 26 as the last one.
-nopgbrk:
Tells the tool to not insert page breaks.
-x 20 -y 82:
Sets the upper left corner (in pixels) of the area to extract the table data from. Note that I chose values that also exclude the column headers, not just the page headers and the table names.
-W 810 -H 450:
Sets the width and height (in pixels) of the area to use for table data extraction.

Note: if you use XPDF's version of pdftotext (as available at www.foolabs.com/xpdf/download.html), the command line options -x, -y, -W and -H are not supported. But if you use -table instead of -layout with the XPDF pdftotext, the result should be similar (you will, however, still have to remove the page and column headers manually).

The above command gives you this output (I show only the output for the first two pages, with the width jump exactly at the page border, 2 lines after the Baix Ebre entries):

    Alt Camp VY Nulles -1,4 19 -4,9 12 1,1 07 4,0 07 4,8 01 11,2 13 12,0 02 12,7 31 8,3 27 0,7 29 0,1 30 -1,7 01 -4,9 12/02
    Alt Camp DQ Vila-rodona -0,5 30 -4,5 03 1,3 07 3,4 17 5,5 02 13,0 14 12,8 02 14,6 31 8,9 27 2,6 28 0,2 30 0,6 12 -4,5 03/02
    Alt Empordà U1 Cabanes -3,0 15 -6,0 09 -0,3 02 2,9 25 3,6 01 12,2 11 10,5 24 12,6 27 6,6 27 2,8 30 2,0 30 -4,3 12 -6,0 09/02
    Alt Empordà W1 Castelló d'Empúries -2,7 15 -6,2 09 0,3 02 3,2 07 6,0 01 12,1 16 11,1 24 13,3 27 7,5 27 0,7 30 2,2 23 -3,7 12 -6,2 09/02
    Alt Empordà VZ Espolla -1,8 15 -6,8 09 1,5 19 2,9 07 5,7 01 12,2 12 10,3 24 13,7 07 7,6 20 2,5 30 2,5 07 -4,8 12 -6,8 09/02
    Alt Empordà D6 Portbou 1,7 29 -4,5 04 4,8 06 3,3 16 9,4 01 12,6 11 13,3 01 15,3 06 12,4 26 4,7 28 4,0 30 1,4 12 -4,5 04/02
    Alt Empordà D4 Roses -1,6 15 -4,2 09 2,9 16 4,6 07 7,0 01 13,5 12 13,5 24 15,7 27 8,7 27 2,1 30 3,5 23 -2,5 12 -4,2 09/02
    Alt Empordà U2 Sant Pere Pescador -3,5 15 -6,1 09 -0,2 02 2,6 07 5,8 01 10,3 12 9,6 24 12,7 27 8,0 27 -0,2 30 1,9 23 -3,5 12 -6,1 09/02
    Alt Empordà W2 Torroella de Fluvià -4,0 15 -6,7 09 -1,3 02 1,6 07 3,4 02 9,5 12 9,5 24 12,6 27 6,4 27 -0,6 30 0,9 30 -4,2 12 -6,7 09/02
    Alt Empordà W3 Ventalló -5,0 15 -6,8 09 -0,7 02 1,9 07 4,3 01 10,2 12 10,6 24 12,5 27 6,9 27 -0,7 30 -0,8 30 -5,2 12 -6,8 09/02
    Alt Penedès WP Canaletes -1,0 14 -5,3 12 1,6 07 3,1 17 5,7 03 11,2 13 12,1 02 13,7 31 9,0 27 1,8 29 -0,8 30 -0,6 02 -5,3 12/02
    Alt Penedès DI Font-rubí -1,1 29 -4,9 12 2,0 08 4,4 17 6,9 01 11,6 09 11,8 02 15,1 31 10,0 26 0,3 29 -0,3 30 -0,3 02 -4,9 12/02
    Alt Penedès W4 la Granada -0,9 31 -5,4 13 1,0 07 3,7 17 5,9 01 11,1 13 12,1 02 13,5 31 9,0 26 1,7 29 -0,9 30 -0,3 02 -5,4 13/02
    Alt Penedès U3 Sant Martí Sarroca -4,1 14 -7,2 13 -0,3 08 3,0 07 4,6 03 11,2 12 11,4 02 13,2 31 8,2 26 -0,6 29 -1,1 30 -4,3 02 -7,2 13/02
    Alt Penedès WY Sant Sadurní d'Anoia -2,7 31 -5,7 13 -0,3 08 2,4 07 4,7 01 10,7 12 12,0 02 13,8 31 8,0 27 1,6 30 -2,2 30 -2,8 02 -5,7 13/02
    Alt Urgell CD la Seu d'Urgell -6,9 15 -10,7 12 -4,6 06 -1,5 17 2,1 01 6,3 12 7,5 02 7,2 31 3,1 27 -3,0 29 -4,0 30 -8,4 12 -10,7 12/02
    Alt Urgell W5 Oliana -6,6 31 -12,0 12 -4,3 08 -1,1 14 1,4 01 7,8 12 9,6 02 11,2 26 7,4 26 -3,1 29 -4,5 30 -6,8 10 -12,0 12/02
    Alt Urgell CJ Organyà -8,2 14 -8,8 05 -2,4 19 -0,9 20 1,1 01 6,6 12 9,9 02 10,4 31 5,6 27 -2,2 30 -1,7 30 -7,8 12 -8,8 05/02
    Alta Ribagorça Z2 Boí (2.535 m) -14,3 29 -23,0 03 -13,6 06 -11,5 16 -7,2 01 -1,8 12 0,7 01 -2,0 31 -3,5 26 -14,2 28 -12,9 29 -11,5 06 -23,0 03/02
    Alta Ribagorça CT el Pont de Suert -10,3 15 -11,8 21 -6,4 07 -3,4 17 -0,1 01 3,5 12 5,4 15 5,2 31 1,5 27 -4,9 29 -6,7 30 -9,6 12 -11,8 21/02
    Anoia CE els Hostalets de Pierola -2,0 14 -5,1 13 1,3 07 3,4 17 5,8 01 12,4 12 12,2 02 13,1 31 10,0 27 1,2 29 -0,2 30 -1,9 02 -5,1 13/02
    Anoia XB la Llacuna -6,2 14 -8,2 12 -2,8 07 1,1 17 2,4 03 6,4 13 9,8 24 10,2 31 5,0 27 -1,5 29 -3,2 30 -3,9 01 -8,2 12/02
    Anoia XA la Panadella -3,9 30 -10,1 03 -2,2 06 -1,4 17 4,2 01 8,3 12 8,5 02 9,5 31 7,5 27 -1,2 28 -2,0 30 -4,4 02 -10,1 03/02
    Anoia H1 Òdena -5,6 14 -8,7 13 -4,2 07 0,3 17 2,3 01 7,9 13 10,4 02 12,2 31 5,0 27 -0,7 30 -3,3 30 -4,8 02 -8,7 13/02
    Bages WW Artés -5,9 14 -10,3 11 -4,9 06 -2,1 17 2,2 01 9,0 12 10,4 24 10,6 31 5,0 27 -2,6 29 -5,0 30 -5,6 02 -10,3 11/02
    Bages U4 Castellnou de Bages -5,5 14 -7,5 03 -1,7 06 1,3 17 3,8 01 9,6 12 11,3 02 11,6 31 6,7 27 -0,3 29 -2,9 30 -3,8 02 -7,5 03/02
    Bages R1 el Pont de Vilomara -5,3 14 -9,6 13 -3,0 07 -0,6 17 2,9 01 9,6 13 11,3 02 12,3 31 6,0 27 -1,2 29 -3,4 30 -5,0 02 -9,6 13/02
    Bages WN Montserrat - Sant Dimes -0,3 29 -7,4 12 0,4 19 1,8 17 5,3 21 9,5 12 9,5 02 11,5 31 8,6 26 2,4 29 -0,1 30 -1,0 06 -7,4 12/02
    Bages CL Sant Salvador de Guardiola -6,3 30 -10,1 13 -4,2 07 0,3 17 1,6 01 7,8 13 9,9 24 9,9 31 4,7 27 -1,5 30 -5,0 30 -6,4 02 -10,1 13/02
    Baix Camp U5 Prades - los Hortals -6,6 30 -12,9 12 -5,8 09 -2,7 17 0,7 01 6,8 09 4,9 02 7,8 31 3,8 02 -3,1 29 -5,0 30 -6,6 01 -12,9 12/02
    Baix Camp W6 Riudoms 0,0 13 -3,2 03 2,7 01 4,9 07 6,3 01 13,9 13 14,8 02 16,1 31 10,7 26 4,1 28 3,7 30 1,6 10 -3,2 03/02
    Baix Camp U6 Vinyols i els Arcs -1,1 15 -2,1 03 1,9 15 4,7 07 6,9 01 15,6 02 15,1 01 17,3 31 11,7 26 6,4 28 4,6 30 2,4 10 -2,1 03/02
    Baix Ebre U7 Aldover 0,4 31 -2,0 03 3,7 01 4,0 07 6,6 01 13,4 09 14,8 02 17,1 31 12,2 27 4,5 30 3,7 30 1,0 10 -2,0 03/02
    Baix Ebre DB el Perelló -0,2 15 -2,8 03 3,2 07 6,0 17 7,4 01 15,5 09 15,3 02 16,9 31 12,0 29 5,0 30 3,5 30 1,7 01 -2,8 03/02
    Baix Ebre U9 l'Aldea -1,3 13 -1,2 04 3,5 01 5,2 07 7,1 01 14,3 09 15,5 01 18,2 31 11,4 27 6,0 30 5,6 30 0,6 10 -1,3 13/01
    Baix Ebre UA l'Ametlla de Mar 1,1 15 -2,2 03 4,5 23 5,0 07 6,6 01 14,9 09 15,2 01 17,1 31 11,7 27 4,8 30 4,1 30 2,4 12 -2,2 03/02
    Baix Ebre X5 PN dels Ports -4,5 30 -11,3 04 -4,0 07 -2,8 17 0,2 01 5,8 09 7,4 01 8,0 31 4,8 27 -2,6 29 -4,6 30 -5,8 01 -11,3 04/02
    Baix Empordà DO Castell d'Aro -1,7 15 -7,4 05 -0,4 06 2,2 17 4,9 01 11,2 12 12,1 24 13,6 31 9,1 27 -0,7 29 -1,5 30 -3,0 12 -7,4 05/02
    Baix Empordà DF la Bisbal d'Empordà -3,2 15 -6,8 12 -2,4 06 0,5 17 4,6 01 11,1 12 10,3 24 11,6 31 7,7 27 -1,0 29 -2,2 30 -4,2 12 -6,8 12/02
    Baix Empordà UB la Tallada d'Empordà -4,1 15 -7,1 12 -2,0 06 1,8 17 4,8 01 11,9 12 10,8 24 12,4 31 7,2 27 -0,5 30 -2,2 30 -5,1 12 -7,1 12/02
    Baix Empordà UC Monells -3,7 15 -8,0 13 -3,2 06 -1,2 17 2,7 01 10,5 13 10,5 24 8,8 31 6,2 27 -2,1 29 -2,5 30 -4,8 12 -8,0 13/02
    Baix Empordà UD Serra de Daró -3,2 15 -6,8 12 -1,7 06 0,9 17 4,6 01 11,7 12 10,1 24 11,5 31 7,3 27 0,5 30 -1,7 30 -3,8 12 -6,8 12/02
    Baix Empordà UE Torroella de Montgrí -1,8 15 -5,6 12 -1,1 02 2,5 07 5,5 01 12,6 12 11,8 24 14,3 27 8,4 27 1,0 30 -0,5 30 -3,4 12 -5,6 12/02
    Baix Llobregat UF Begues - PN del Garraf 0,1 29 -5,8 04 2,5 06 3,1 17 6,4 21 11,8 12 12,3 01 14,2 31 10,1 26 1,8 28 0,1 30 -0,4 02 -5,8 04/02
    Baix Llobregat XL el Prat de Llobregat 0,6 30 -4,6 05 2,1 06 5,5 07 8,5 01 12,4 12 14,8 02 16,8 31 9,9 26 3,3 29 1,9 30 0,9 09 -4,6 05/02
    Baix Llobregat D3 Vallirana 0,6 29 -3,1 03 4,1 07 5,4 17 6,7 01 12,9 12 13,9 02 15,9 31 11,3 27 4,7 29 1,9 30 0,4 01 -3,1 03/02
    Baix Llobregat UG Viladecans 1,2 30 -4,1 05 3,8 08 6,2 11 8,4 01 15,0 16 15,4 02 17,5 31 12,1 26 4,2 29 2,2 30 1,1 02 -4,1 05/02
    Baix Penedès WZ Cunit -1,9 30 -4,7 13 3,1 10 2,2 17 7,1 01 13,0 12 13,5 02 14,4 31 11,3 26 1,8 29 1,4 30 -1,6 02 -4,7 13/02
    Baix Penedès UH el Montmell -0,7 29 -4,7 03 1,9 07 3,9 17 5,4 01 11,4 12 10,0 01 13,8 31 9,8 27 1,5 29 0,4 30 0,4 02 -4,7 03/02
    Baix Penedès D9 el Vendrell -1,4 30 -4,2 12 1,2 10 5,3 07 6,4 02 12,9 12 13,2 02 17,7 08 10,7 26 4,3 29 1,1 30 0,1 11 -4,2 12/02
    Baix Penedès WO la Bisbal del Penedès -5,4 14 -5,9 13 -1,3 10 4,5 02 3,8 01 11,6 15 12,9 24 14,6 08 7,0 27 0,9 30 -2,1 30 -2,9 01 -5,9 13/02
    Barcelonès WU Badalona - Museu 2,2 14 -0,8 04 4,9 07 6,7 17 9,9 01 16,7 12 15,9 02 17,2 31 14,2 27 5,6 29 2,9 30 2,4 02 -0,8 04/02
    Barcelonès X4 Barcelona - el Raval 5,5 30 0,6 04 7,9 09 9,1 17 11,6 01 17,6 12 16,6 01 19,4 30 16,3 29 7,6 29 5,6 30 4,5 02 0,6 04/02
    Barcelonès D5 Barcelona - Observatori Fabra 1,0 30 -4,7 03 4,5 07 4,5 17 7,7 21 12,7 12 13,4 02 15,2 31 12,4 27 3,2 28 1,9 30 0,5 02 -4,7 03/02
    Barcelonès X8 Barcelona - Zona Universitària 1,9 14 -1,8 04 4,8 06 6,1 17 7,6 01 14,5 12 14,6 01 16,8 31 13,3 27 5,4 29 2,3 30 2,1 02 -1,8 04/02
    Barcelonès X2 Barcelona - Zoo 3,1 13 -2,3 05 5,1 10 8,5 07 10,1 01 15,9 12 16,6 02 18,0 31 14,8 02 6,8 29 4,3 30 2,2 02 -2,3 05/02
    Berguedà UI Gisclareny -5,1 16 -12,5 04 -4,1 05 -2,7 17 -0,6 01 5,7 13 7,4 02 6,1 31 3,2 26 -2,8 29 -5,1 30 -5,6 12 -12,5 04/02
    Berguedà WV Guardiola de Berguedà -7,4 14 -11,7 12 -5,8 07 -2,9 14 0,6 02 5,7 12 6,3 02 6,3 31 0,9 27 -4,4 30 -5,7 30 -8,4 01 -11,7 12/02
    Berguedà CR la Quar -3,5 29 -11,5 12 -1,8 07 -2,3 17 1,2 01 5,7 12 10,0 15 8,9 31 5,0 27 -1,9 29 -2,7 30 -4,7 01 -11,5 12/02
    Berguedà WM Santuari de Queralt -2,4 29 -9,1 04 -0,8 06 -0,2 11 2,9 01 6,2 12 9,2 02 9,7 31 7,2 26 -1,0 28 -1,3 30 -2,8 12 -9,1 04/02
    Cerdanya Z9 Cadí Nord (2.143 m) - Prat d'Aguiló -11,5 30 -19,6 03 -10,4 06 -9,0 17 -4,5 01 1,8 12 2,9 01 0,9 31 -1,0 26 -10,5 28 -11,4 30 -9,2 02 -19,6 03/02
    Cerdanya DP Das -12,9 14 -16,6 12 -9,7 10 -5,5 14 -2,2 14 0,6 12 2,3 02 3,6 27 -2,8 27 -6,9 30 -8,3 30 -13,5 12 -16,6 12/02
    Cerdanya Z3 Malniu (2.230 m) -12,2 29 -20,6 03 -10,7 06 -9,6 16 -5,4 01 0,4 12 2,9 01 -0,2 31 -0,4 27 -12,1 28 -11,3 30 -9,1 02 -20,6 03/02
    Conca de B. W8 Blancafort -3,1 19 -8,2 11 -2,8 07 1,9 17 2,9 01 10,7 13 11,8 02 12,5 31 6,2 27 -0,3 30 -1,2 30 -3,1 11 -8,2 11/02
    Conca de B. CW l'Espluga de Francolí -2,0 16 -5,9 04 -0,9 07 2,5 17 2,8 01 11,5 04 10,4 02 13,2 31 6,5 27 -0,3 30 -1,0 30 -3,2 12 -5,9 04/02
    Conca de B. UJ Santa Coloma de Queralt -3,4 14 -8,9 03 -1,1 07 -0,4 17 3,4 01 8,3 13 9,2 02 10,7 31 6,7 27 -0,3 28 -1,6 30 -3,4 02 -8,9 03/02
    Garraf UK Sant Pere de Ribes - PN del Garraf -0,3 29 -3,8 04 2,8 06 4,2 17 7,1 01 12,9 12 12,4 02 13,2 31 12,0 27 2,6 29 0,3 30 0,2 02 -3,8 04/02
    Garrigues UL Castelldans -4,9 26 -7,0 06 -1,9 10 1,7 07 3,2 01 11,5 15 12,8 03 13,6 31 5,8 27 -0,5 30 -1,5 30 -5,1 12 -7,0 06/02
    Garrigues UM la Granadella -3,4 11 -7,6 03 -2,5 10 0,6 17 2,7 01 10,9 13 10,8 02 11,5 31 6,2 02 1,1 29 -0,9 30 -3,4 12 -7,6 03/02
    Garrotxa W9 la Vall d'en Bas -6,3 14 -10,9 13 -5,8 07 -2,2 17 1,7 01 8,8 12 6,7 24 8,5 31 4,3 27 -4,3 29 -5,0 30 -6,6 09 -10,9 13/02
    Garrotxa DC Olot -4,9 15 -9,9 12 -3,6 07 -1,8 17 2,6 01 9,0 12 9,9 24 9,6 31 5,5 27 -3,3 29 -3,9 30 -5,9 12 -9,9 12/02
    Gironès UN Cassà de la Selva -4,2 15 -10,7 05 -3,0 06 0,5 17 1,9 01 8,8 12 11,0 24 10,5 31 6,7 27 -3,2 29 -4,4 30 -5,3 12 -10,7 05/02
    Gironès UO Fornells de la Selva -5,8 15 -10,4 13 -4,9 07 -1,5 17 2,2 01 9,3 12 9,2 24 10,3 31 6,1 27 -3,5 29 -4,3 30 -6,3 12 -10,4 13/02
    Gironès XJ Girona -5,1 15 -9,6 13 -4,0 07 -1,6 17 3,1 01 10,2 12 9,7 24 10,4 31 5,7 27 -3,1 29 -3,8 30 -5,7 12 -9,6 13/02
    Gironès WF Vilablareix -5,2 15 -9,9 13 -4,3 07 -1,7 17 3,0 02 9,0 12 9,7 24 11,7 31 5,7 27 -2,8 29 -2,8 30 -4,6 12 -9,9 13/02
    Maresme UP Cabrils 1,6 30 -2,6 11 3,2 07 6,7 17 8,5 01 13,9 12 15,1 02 15,9 31 13,3 26 3,7 28 3,0 30 2,6 12 -2,6 11/02

If you know how to properly operate a text editor, it is very easy and fast to fix this text output, so it will smoothly get imported by Excel...
