1

I would like to output the StanfordNLP results in protobuf (since its size is much smaller) and read the results back in Python. How should I do that?

I followed the instructions here to output the results serialized with ProtobufAnnotationSerializer, like this:

java -cp "stanford-corenlp-full-2015-12-09/*" \
  edu.stanford.nlp.pipeline.StanfordCoreNLP \
  -annotators tokenize,ssplit \
  -file input.txt \
  -outputFormat serialized \
  -outputSerializer edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer

Then I used protoc to compile CoreNLP.proto, which comes with the StanfordNLP source code, into Python modules like this:

protoc --python_out=. CoreNLP.proto 

Then in Python I read the files back like this:

import CoreNLP_pb2
doc = CoreNLP_pb2.Document()
doc.ParseFromString(open('input.txt.ser.gz', 'rb').read())

The parsing fails with the following error message:

---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
<ipython-input-213-d8eaeb9c2048> in <module>()
      1 doc = CoreNLP_pb2.Document()
----> 2 doc.ParseFromString(open('imed/s5_tokenized/conv-00000.ser.gz', 'rb').read())

/usr/local/lib/python2.7/dist-packages/google/protobuf/message.pyc in ParseFromString(self, serialized)
    183     """
    184     self.Clear()
--> 185     self.MergeFromString(serialized)
    186
    187   def SerializeToString(self):

/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/python_message.pyc in MergeFromString(self, serialized)
   1092         # The only reason _InternalParse would return early is if it
   1093         # encountered an end-group tag.
-> 1094         raise message_mod.DecodeError('Unexpected end-group tag.')
   1095     except (IndexError, TypeError):
   1096       # Now ord(buf[p:p+1]) == ord('') gets TypeError.

DecodeError: Unexpected end-group tag.

UPDATE:

I asked the author of the serializer, Gabor Angeli, and got the answer. The protobuf objects are written to the files with writeDelimitedTo in this line. Changing it to writeTo makes the output files readable in Python with ParseFromString.
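As a rough illustration of the difference: writeDelimitedTo prefixes each message with its byte length encoded as a base-128 varint, whereas writeTo writes the message bytes alone, so ParseFromString ends up treating the length prefix as message content. The sketch below is plain Python 3 with no protobuf dependency. The names encode_varint, decode_varint and split_delimited are hypothetical helpers, and the byte strings are stand-ins for serialized Document messages, not real CoreNLP output.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at data[pos]; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_delimited(data):
    """Yield the raw payload of each length-prefixed message in data."""
    pos = 0
    while pos < len(data):
        size, pos = decode_varint(data, pos)
        if pos + size > len(data):
            raise ValueError("truncated message")
        yield data[pos:pos + size]
        pos += size


# Stand-in payloads (assumption): real input would be serialized Document bytes.
payloads = [b"first message", b"x" * 300]
framed = b"".join(encode_varint(len(p)) + p for p in payloads)
recovered = list(split_delimited(framed))
```

Each payload yielded by split_delimited is what ParseFromString expects to receive, which is why stripping the prefix (or writing with writeTo in the first place) makes the file parse.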

5
  • What version of protoc are you running? protoc --version Commented Sep 11, 2016 at 6:03
  • @sberry: it outputs "libprotoc 3.0.0". Commented Sep 11, 2016 at 6:08
  • That may also be an issue (not sure which version was used to generate the .java files) but see my answer first because that is a problem too. Commented Sep 11, 2016 at 6:09
  • @sberry: The proto file does not include a proto version specification, and the compiler compiled it as proto2, which is correct because the proto file has the "optional" keyword. Commented Sep 11, 2016 at 6:10
  • @sberry: I got the answer from asking Gabor and I put it in the post. Thank you still for helping me :) Commented Sep 11, 2016 at 18:59

2 Answers

2

This question seems to have come up again, so I figured I'd write up a proper answer. The root of the issue is that the proto is written using Java's writeDelimitedTo method, which prefixes each message with its length as a varint and which Google has not implemented for Python. A workaround is to read the proto file with the following function (it assumes the file is not gzipped; if it is, replace f.read() with code that decompresses the file first):

from google.protobuf.internal.decoder import _DecodeVarint
import CoreNLP_pb2

def readCoreNLPProtoFile(protoFile):
    protos = []
    with open(protoFile, 'rb') as f:
        # -- Read the file --
        data = f.read()
        # -- Parse the file --
        # In Java, there's a parseDelimitedFrom() method that makes this easier
        pos = 0
        while pos < len(data):
            # (read the proto: a varint length, then that many bytes)
            (size, pos) = _DecodeVarint(data, pos)
            proto = CoreNLP_pb2.Document()
            proto.ParseFromString(data[pos:(pos + size)])
            pos += size
            # (add the proto to the list; or, `yield proto`)
            protos.append(proto)
    return protos
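For the gzipped case mentioned above, here is a minimal sketch of what could stand in for f.read(). It assumes Python 3, and read_maybe_gzipped is a hypothetical helper name. The check on the first two bytes is only a heuristic based on the gzip magic number, so treat it as a convenience rather than a guarantee.

```python
import gzip


def read_maybe_gzipped(path):
    """Return the file's bytes, decompressing first if it looks like gzip."""
    with open(path, "rb") as f:
        data = f.read()
    # gzip streams start with the magic bytes 0x1f 0x8b (heuristic check).
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data
```

In readCoreNLPProtoFile, the result of read_maybe_gzipped(protoFile) would then play the role of data, and the varint loop stays unchanged.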

The file CoreNLP_pb2 is compiled from the CoreNLP.proto file in the repo with the command:

protoc --python_out /path/to/output/ /path/to/CoreNLP.proto 

Note that as of writing this (version 3.7.0) the format is proto2, not proto3.


0

There is a simple solution in Go. Assume the raw data is in "data" and the parsed result goes into "msg":

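import (
	"google.golang.org/protobuf/encoding/protowire"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
)

// CoreNLPUnmarshal parses the first length-prefixed message in data into msg.
func CoreNLPUnmarshal(data []byte, msg protoreflect.ProtoMessage) error {
	// ConsumeBytes reads the varint length prefix and returns the payload after it.
	bs, n := protowire.ConsumeBytes(data)
	if n < 0 {
		// A negative n encodes a parse error.
		return protowire.ParseError(n)
	}
	return proto.Unmarshal(bs, msg)
}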

1 Comment

The question requires an answer in Python
