I would like to output the StanfordNLP results in protobuf format (since the files are much smaller) and read the results back in Python. How should I do that?
I followed the instructions here to output the results serialized with ProtobufAnnotationSerializer, like this:

    java -cp "stanford-corenlp-full-2015-12-09/*" \
        edu.stanford.nlp.pipeline.StanfordCoreNLP \
        -annotators tokenize,ssplit \
        -file input.txt \
        -outputFormat serialized \
        -outputSerializer \
        edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer

Then use protoc to compile the CoreNLP.proto, which comes with the source code of StanfordNLP, into python modules like this:

    protoc --python_out=. CoreNLP.proto

Then in python I read the files back like this:

    import CoreNLP_pb2
    doc = CoreNLP_pb2.Document()
    doc.ParseFromString(open('input.txt.ser.gz', 'rb').read())

The parsing fails with the following error message:

    ---------------------------------------------------------------------------
    DecodeError                               Traceback (most recent call last)
    <ipython-input-213-d8eaeb9c2048> in <module>()
          1 doc = CoreNLP_pb2.Document()
    ----> 2 doc.ParseFromString(open('imed/s5_tokenized/conv-00000.ser.gz', 'rb').read())

    /usr/local/lib/python2.7/dist-packages/google/protobuf/message.pyc in ParseFromString(self, serialized)
        183     """
        184     self.Clear()
    --> 185     self.MergeFromString(serialized)
        186
        187   def SerializeToString(self):

    /usr/local/lib/python2.7/dist-packages/google/protobuf/internal/python_message.pyc in MergeFromString(self, serialized)
       1092         # The only reason _InternalParse would return early is if it
       1093         # encountered an end-group tag.
    -> 1094         raise message_mod.DecodeError('Unexpected end-group tag.')
       1095     except (IndexError, TypeError):
       1096       # Now ord(buf[p:p+1]) == ord('') gets TypeError.

    DecodeError: Unexpected end-group tag.

UPDATE:
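I asked the author of the serializer, Gabor Angeli, and got the answer. The protobuf objects are written to the files with writeDelimitedTo in this line, which prefixes each message with its byte length encoded as a varint. ParseFromString expects a bare message with no prefix, so it misreads the length bytes as field tags and fails. Changing writeDelimitedTo to writeTo makes the output files readable in Python.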
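
If rebuilding the Java side is not an option, the length prefix can probably be stripped on the Python side instead. The following is only a minimal sketch of that idea, not something confirmed by the serializer's author: `read_varint` and `split_delimited` are hypothetical helper names introduced here, and the sketch assumes the file contents are the raw bytes produced by writeDelimitedTo (if the file turns out to be gzip-compressed, it would have to be decompressed first).

```python
def read_varint(buf, pos=0):
    """Decode a base-128 varint from buf at pos; return (value, next_pos).

    buf must be indexable to integers (e.g. a bytearray).
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if not byte & 0x80:                # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def split_delimited(data):
    """Yield each length-prefixed payload in data, with the prefix removed."""
    buf = bytearray(data)                  # integer indexing on Python 2 and 3
    pos = 0
    while pos < len(buf):
        size, pos = read_varint(buf, pos)
        yield bytes(buf[pos:pos + size])
        pos += size


# Hypothetical usage with the generated module from the question:
#
#   import CoreNLP_pb2
#   with open('input.txt.ser.gz', 'rb') as f:
#       payload = next(split_delimited(f.read()))
#   doc = CoreNLP_pb2.Document()
#   doc.ParseFromString(payload)
```

The varint decoding is written out by hand so that the sketch does not depend on private helpers inside the protobuf package.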