3

My Objective: I would like to use GDAL to convert a GeoPDF. I want the vector layers as shp files and the raster layers as tif files. I want to do this in a programmatic way.

Edit: In reality, I want to do this with many geospatial PDFs. I'm prototyping the workflow using Python, but it will probably end up being C++. (End Edit)

The Problem: Naturally, the command to convert a vector layer differs from a raster layer. And I don't know (again in a programmatic way) which layers are vector and which are raster.

What I've Tried: First, here is my sample data https://www.terragotech.com/images/pdf/webmap_urbansample.pdf.

gdalinfo webmap_urbansample.pdf -mdd LAYERS 

gives the layer names:

...
Metadata (LAYERS):
  LAYER_00_NAME=Layers
  LAYER_01_NAME=Layers.BPS_-_Water_Sources
  LAYER_02_NAME=Layers.BPS_-_Facilities
  LAYER_03_NAME=Layers.BPS_-_Buildings
  LAYER_04_NAME=Layers.Sewerage_Man_Holes
  LAYER_05_NAME=Layers.Sewerage_Pump_Stations
  LAYER_06_NAME=Layers.Water_Points
  LAYER_07_NAME=Layers.Roads
  LAYER_08_NAME=Layers.Sewerage_Jump-Ups
  LAYER_09_NAME=Layers.Sewerage_Lines
  LAYER_10_NAME=Layers.Water_Lines
  LAYER_11_NAME=Layers.Cadastral_Boundaries
  LAYER_12_NAME=Layers.Raster_Images
...

From looking at the data I know which layers are vector and which are raster, but I don't know how to parse this output to decide whether to use ogr2ogr or gdal_translate for the conversion.
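As a first step, the LAYERS metadata can be read and parsed from Python rather than from the gdalinfo text output. The following is a minimal sketch, not a definitive implementation: the function names parse_layer_metadata and read_pdf_layer_names are hypothetical, and it assumes the metadata items always have the LAYER_NN_NAME=... form shown above. It only lists the layer names; it does not by itself say which layers are raster.

```python
import re

# Matches items such as "LAYER_07_NAME=Layers.Roads"
_LAYER_KEY = re.compile(r"^LAYER_(\d+)_NAME=(.*)$")


def parse_layer_metadata(items):
    """Turn ['LAYER_00_NAME=Layers', ...] into [(0, 'Layers'), ...]."""
    parsed = []
    for item in items or []:
        match = _LAYER_KEY.match(item.strip())
        if match:
            parsed.append((int(match.group(1)), match.group(2)))
    return sorted(parsed)


def read_pdf_layer_names(path):
    """Read the LAYERS metadata domain of a PDF (needs the GDAL bindings)."""
    from osgeo import gdal  # imported lazily so the parser works without GDAL

    dataset = gdal.Open(path, gdal.GA_ReadOnly)
    if dataset is None:
        raise IOError("GDAL could not open %s" % path)
    return parse_layer_metadata(dataset.GetMetadata_List("LAYERS"))
```

With the sample file, read_pdf_layer_names('webmap_urbansample.pdf') should return the same names that gdalinfo prints, paired with their index.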

Then I thought I could use ogrinfo and just diff all the layers to deduce which ones are raster, but ogrinfo gives me:

...
1: Cadastral Boundaries (Polygon)
2: Water Lines (Line String)
3: Sewerage Lines (Line String)
4: Sewerage Jump-Ups (Line String)
5: Roads
6: Water Points (Point)
7: Sewerage Pump Stations (Point)
8: Sewerage Man Holes (Point)
9: BPS - Buildings (Polygon)
10: BPS - Facilities (Polygon)
11: BPS - Water Sources (Point)

So there's not a one-to-one correspondence with the way these are output.

Does anyone know how to have gdal print the GeoPDF layers and indicate which are raster vs. vector?

2
  • Is there just one raster layer and the rest is vector? I think ogrinfo only lists those that it supports. Commented Jul 30, 2015 at 14:08
  • @bugmenot123 In this data set, yes. There is only one raster layer. If you're getting at the fact that ogrinfo won't list raster layers, then yes, that's what I saw too. I thought I might be able to leverage that fact to tell which layers were raster (because they won't be listed by ogrinfo). By "not a one-to-one comparison", I meant that the layer text is different, e.g., spaces vs underscores and the Layers.* syntax. Commented Jul 30, 2015 at 14:21

2 Answers

1

This is not really the answer, but something I've been using as a workaround.

The script compares the text of the layers between gdalinfo and ogrinfo to infer which ones are raster. This approach isn't definitive though, so I imagine it could be wrong from time to time. Even in this example, LAYER_00_NAME=Layers isn't really a raster layer.

def GetRasterVectorLayers(filename):
    from osgeo import gdal
    from osgeo import ogr
    from difflib import SequenceMatcher

    # get vector layers with ogr
    data_ogr = ogr.Open(filename)
    if data_ogr:
        vector_layers = [ data_ogr.GetLayer(i).GetName() for i in range(data_ogr.GetLayerCount()) ]
    else:
        vector_layers = []

    # get all layers with gdal
    data_gdal = gdal.Open( filename, gdal.GA_ReadOnly )
    layers = data_gdal.GetMetadata_List("LAYERS")

    # peel off label, e.g., LAYER_00_NAME=Layers
    layers = [ layer.split('=')[-1] for layer in layers ]

    # match the text to deduce which layers are vector or raster
    matched_layers = []
    for vector_layer in vector_layers:
        layer_matches = []
        for layer in layers:
            layer_matches.append( [SequenceMatcher(None, vector_layer, layer).ratio(), layer] )
        layer_matches.sort()
        best_match = layer_matches[-1][1]  # -1 gets the highest score, 1 gets the gdalinfo layer name
        matched_layers.append( [vector_layer,best_match] )

    layers_vector = [ match[1] for match in matched_layers ]
    layers_raster = [ layer for layer in layers if layer not in layers_vector ]
    return [layers_raster, layers_vector]

layers_raster, layers_vector = GetRasterVectorLayers('webmap_urbansample.pdf')

layers_raster
# ['Layers', 'Layers.Raster_Images']

layers_vector
# ['Layers.Cadastral_Boundaries', 'Layers.Water_Lines', 'Layers.Sewerage_Lines', 'Layers.Sewerage_Jump-Ups', 'Layers.Roads', 'Layers.Water_Points', 'Layers.Sewerage_Pump_Stations', 'Layers.Sewerage_Man_Holes', 'Layers.BPS_-_Buildings', 'Layers.BPS_-_Facilities', 'Layers.BPS_-_Water_Sources']
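A possible alternative to fuzzy matching is to normalize the names deterministically. The sketch below swaps in a different technique from the script above: it assumes that an OGR layer name is the gdalinfo name with some leading parent components dropped, dots replaced by underscores, and underscores possibly shown as spaces. That pattern holds for the two sample files discussed in this thread but is only an assumption for other PDFs. The function names are hypothetical.

```python
def candidate_keys(gdal_name):
    """All suffixes of the dotted path, joined with underscores."""
    parts = gdal_name.split(".")
    return ["_".join(parts[i:]) for i in range(len(parts))]


def normalize(name):
    """Treat spaces and underscores as the same character."""
    return name.replace(" ", "_")


def match_vector_layers(gdal_names, ogr_names):
    """Map each OGR layer name to the gdalinfo name it appears to come from."""
    lookup = {}
    for gdal_name in gdal_names:
        for depth, key in enumerate(candidate_keys(gdal_name)):
            # depth counts how many leading components were dropped
            lookup.setdefault(normalize(key), []).append((depth, gdal_name))
    matches = {}
    for ogr_name in ogr_names:
        hits = sorted(lookup.get(normalize(ogr_name), []))
        # prefer the match that drops the fewest leading components
        matches[ogr_name] = hits[0][1] if hits else None
    return matches


def unmatched_layers(gdal_names, matches):
    """gdalinfo names with no OGR counterpart: raster *candidates* only."""
    matched = set(matches.values())
    return [name for name in gdal_names if name not in matched]
```

The leftover names are only candidates for raster layers. They can also include parent layers such as Layers, or layers that OGR skipped for other reasons, so this carries the same caveat as the script above.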
2
  • There's nothing wrong with answering your own question, providing it actually provides an answer (which you do). The fact that you are already working with the Python API is important though, when I first read your question I expected you wanted something simpler (like a shell script). I'd add that to your question, as it opens more options. Commented Jul 31, 2015 at 17:19
  • Thanks - you're right, I should have included that information. I edited the question. Commented Jul 31, 2015 at 17:41
0

I'm afraid it is up to the PDF creator how the raster and vector layers are named and numbered. New USGS Topo sheets are combined GeoPDFs, and a sample file (NM_Santa_Fe_20131108_TM_geo.pdf) gives this output from gdalinfo:

LAYER_00_NAME=Map_Collar
LAYER_01_NAME=Map_Collar.Map_Elements
LAYER_02_NAME=Map_Frame
LAYER_03_NAME=Map_Frame.Projection_and_Grids
LAYER_04_NAME=Map_Frame.Geographic_Names
LAYER_05_NAME=Map_Frame.Structures
LAYER_06_NAME=Map_Frame.Transportation
LAYER_07_NAME=Map_Frame.Transportation.Road_Names_and_Shields
LAYER_08_NAME=Map_Frame.Transportation.Road_Features
LAYER_09_NAME=Map_Frame.Transportation.Trails
LAYER_10_NAME=Map_Frame.Transportation.Railroads
LAYER_11_NAME=Map_Frame.Transportation.Airports
LAYER_12_NAME=Map_Frame.PLSS
LAYER_13_NAME=Map_Frame.Hydrography
LAYER_14_NAME=Map_Frame.Terrain
LAYER_15_NAME=Map_Frame.Terrain.Contours
LAYER_16_NAME=Map_Frame.Terrain.Shaded_Relief
LAYER_17_NAME=Map_Frame.Woodland
LAYER_18_NAME=Map_Frame.Boundaries
LAYER_19_NAME=Map_Frame.Boundaries.Jurisdictional_Boundaries
LAYER_20_NAME=Map_Frame.Boundaries.Jurisdictional_Boundaries.International
LAYER_21_NAME=Map_Frame.Boundaries.Jurisdictional_Boundaries.State_or_Territory
LAYER_22_NAME=Map_Frame.Boundaries.Jurisdictional_Boundaries.County_or_Equivalent
LAYER_23_NAME=Map_Frame.Boundaries.Federal_Administered_Lands
LAYER_24_NAME=Map_Frame.Boundaries.Federal_Administered_Lands.National_Park_Service
LAYER_25_NAME=Map_Frame.Boundaries.Federal_Administered_Lands.Department_of_Defense
LAYER_26_NAME=Map_Frame.Boundaries.Federal_Administered_Lands.Forest_Service
LAYER_27_NAME=Images
LAYER_28_NAME=Images.Orthoimage

The layers are ordered hierarchically: sublayers have one or more dots in their names. Layers 00, 02, 06, 14, 18, 19, 23 and 27 are meta layers, each a combination of the sublayers that follow it.
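Meta layers can be picked out from the names alone, because each one is the dotted prefix of at least one other layer. A minimal sketch of that check follows; find_meta_layers is a hypothetical helper name, and it assumes that dots appear in layer names only as hierarchy separators.

```python
def find_meta_layers(names):
    """Return the names that are a parent of at least one other name."""
    name_set = set(names)
    parents = set()
    for name in names:
        parts = name.split(".")
        # every proper dotted prefix of this name is a potential parent
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            if parent in name_set:
                parents.add(parent)
    # keep the original gdalinfo ordering
    return [name for name in names if name in parents]
```

Applied to the question's sample, the same check flags Layers as a meta layer, which is why it shows up without a matching vector layer.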

Ogrinfo reports:

1: Map_Collar
2: Map_Collar_Map_Elements
3: Map_Frame_Projection_and_Grids
4: Map_Frame_Geographic_Names (Multi Line String)
5: Map_Frame_Structures
6: Map_Frame_Transportation_Road_Names_and_Shields
7: Map_Frame_Transportation_Road_Features
8: Map_Frame_PLSS
9: Map_Frame_Hydrography
10: Map_Frame_Terrain_Contours
11: Map_Frame_Woodland
12: Map_Frame_Boundaries_Federal_Administered_Lands_Forest_Service

These are indeed only vector layers, but the hierarchical naming is lost (the dots become underscores) and, unlike in your example, the geometry type is reported for only one layer. Neither list has blanks in the layer names (apart from the geometry type suffix). Meta layers and empty layers are omitted from the vector layer list. The numbering of the layers is not consistent between the two lists (though not inverted as in your example).

If you want to extract all vector layers, you can use

ogr2ogr -f sqlite out.sqlite in.pdf 

If you try that with the GDAL raster commands, GDAL will try to rasterize your vector data as well, which might take some time. So you have to specify the raster layer name explicitly for every layer:

gdalwarp -co "TILED=YES" -co "TFW=YES" rumney_farmforest_geopdf.pdf rumtif01.tif -overwrite --config GDAL_PDF_LAYERS "Graphic_Outline_(display_only)" 
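To script the two kinds of export per layer, the command lines can be assembled in Python and passed to subprocess. This is a sketch under stated assumptions: the helper names are hypothetical, it uses gdal_translate rather than gdalwarp with the same --config GDAL_PDF_LAYERS switch as above, and it assumes ogr2ogr and gdal_translate are on the PATH. Whether every PDF vector layer writes cleanly to a shapefile is not verified here.

```python
import os
import re


def safe_name(layer_name):
    """Make a layer name usable as a file name (dots, spaces -> '_')."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", layer_name)


def vector_export_cmd(pdf_path, ogr_layer_name, out_dir):
    """ogr2ogr arguments that write one vector layer as a shapefile."""
    out_path = os.path.join(out_dir, safe_name(ogr_layer_name) + ".shp")
    # ogr2ogr syntax: ogr2ogr [options] <dst> <src> [<layer> ...]
    return ["ogr2ogr", "-f", "ESRI Shapefile", out_path, pdf_path, ogr_layer_name]


def raster_export_cmd(pdf_path, gdal_layer_name, out_dir):
    """gdal_translate arguments that render only one PDF layer to GeoTIFF."""
    out_path = os.path.join(out_dir, safe_name(gdal_layer_name) + ".tif")
    return [
        "gdal_translate", "-of", "GTiff",
        "--config", "GDAL_PDF_LAYERS", gdal_layer_name,
        pdf_path, out_path,
    ]
```

Each list can be run with subprocess.check_call(cmd). Note that the vector command takes the layer name as OGR reports it, while the raster command takes the name as gdalinfo reports it.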
2
  • So I know how to extract the vector and raster layers. My question was how to know which layers are vector and which are raster so that I can extract each individual layer in an automated way. You point out that the substructure naming is lost, which I also point out in my question. Also, the difference in strings between the two calls (underscores vs spaces). I presume that gdalinfo reports the name of some object whereas ogrinfo gives an attribute value. Is there no way to make this uniform, e.g., through a certain api call? I almost take your answer to mean there is no good answer. Commented Jul 31, 2015 at 17:26
  • The underscore vs. blank difference does not apply to my example. I guess the GeoPDF driver still needs some developer attention, but there seem to be more important bugs to fix. The automation works for vector export into the SpatiaLite database, while the raster layers need some manual work. Commented Jul 31, 2015 at 19:11
