I am trying to parse a gzipped CSV file (where the fields are separated by | characters) to test whether reading the file directly in Python is faster than piping it through zcat file.gz | python to parse the contents.

I have the following code:

#!/usr/bin/python3

import gzip

if __name__ == "__main__":
    total=0
    count=0

    f=gzip.open('SmallData.DAT.gz', 'r')
    for line in f.readlines():
        split_line = line.split('|')
        total += int(split_line[52])
        count += 1

    print(count, " :: ", total)

But I get the following error:

$ ./PyZip.py
Traceback (most recent call last):
  File "./PyZip.py", line 11, in <module>
    split_line = line.split('|')
TypeError: a bytes-like object is required, not 'str'

How can I modify this to read the line and split it properly?

I'm interested mainly in just the 52nd field as delimited by |. The lines in my input file are like the following:

field1|field2|field3|...field52|field53

Is there a faster way to sum all the values in the 52nd field than what I have?

Thanks!

2 Answers

2

You should decode the line before splitting. gzip.open with mode 'r' opens the file in binary mode, so each line is read as a bytes object rather than a str:

split_line = line.decode('utf-8').split('|') 
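
To make the cause of the error concrete, here is a minimal sketch. It writes a tiny gzipped sample (the file name and contents are made up for illustration), reads it back in binary mode, and shows that either decoding the line or splitting on a bytes separator avoids the TypeError:

```python
import gzip
import os
import tempfile

# Hypothetical sample data, written to a temporary file for illustration only.
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wb") as f:
    f.write(b"a|b|7\nc|d|35\n")

with gzip.open(path, "r") as f:  # 'r' means binary mode for gzip.open
    first_line = f.readline()

# Binary mode yields bytes, so a str separator raises TypeError.
try:
    first_line.split("|")
    raised = False
except TypeError:
    raised = True

# Fix 1: decode to str, then split on a str separator.
decoded_fields = first_line.decode("utf-8").split("|")

# Fix 2: stay in bytes and split on a bytes separator.
bytes_fields = first_line.split(b"|")

print(type(first_line).__name__, raised)
print(decoded_fields)
print(bytes_fields)
```

Which fix is preferable probably depends on whether the rest of the script needs str values or can work with bytes.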

The code you have for summing the values is fine. Every line still has to be read and split in order to locate the field, so there is little room to make it fundamentally faster. Note that split_line[52] is zero-based, so it selects the 53rd field; use split_line[51] if you want the 52nd field of every line.
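
The per-line work cannot be avoided, but some of the overhead around it might be trimmed. The sketch below is one possible variant, not a measured improvement: it iterates over the file lazily instead of building a list with readlines(), stays in bytes to skip decoding, and passes a maxsplit so splitting stops once the wanted field is isolated. The function name sum_field, its parameters, and the demo file are assumptions made for illustration; time it against your real data before relying on it.

```python
import gzip
import os
import tempfile


def sum_field(path, index=52, sep=b"|"):
    """Sum the integer at zero-based position `index` of every line."""
    total = 0
    count = 0
    with gzip.open(path, "rb") as f:
        for line in f:  # lazy iteration, no readlines() list in memory
            # Stop splitting once the wanted field has been isolated.
            fields = line.split(sep, index + 1)
            # int() accepts ASCII digits in bytes and ignores the newline.
            total += int(fields[index])
            count += 1
    return count, total


# Hypothetical demo file: three lines of 60 fields each.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wb") as f:
    for value in (1, 2, 3):
        row = ["0"] * 60
        row[52] = str(value)
        f.write(("|".join(row) + "\n").encode("ascii"))

demo_count, demo_total = sum_field(demo_path)
print(demo_count, " :: ", demo_total)
```

The index parameter is zero-based, matching split_line[52] in the question.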

1

Just decode the bytes object to a string, i.e.:

line.decode('utf-8')

Updated script:

#!/usr/bin/python3

import gzip

if __name__ == "__main__":
    total=0
    count=0

    f=gzip.open('SmallData.DAT.gz', 'r')
    for line in f.readlines():
        split_line = line.decode("utf-8").split('|')
        total += int(split_line[52])
        count += 1

    print(count, " :: ", total)
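
As an alternative to calling decode on every line, gzip.open also accepts a text mode ('rt') with an encoding, in which case the lines arrive as str already. The sketch below assumes UTF-8 content and builds a small stand-in file in a temporary directory, since the real SmallData.DAT.gz is not available here; the with statement also makes sure the file is closed.

```python
import gzip
import os
import tempfile

# Hypothetical stand-in for SmallData.DAT.gz: two lines of 53 fields each.
path = os.path.join(tempfile.mkdtemp(), "SmallData.DAT.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for value in (10, 20):
        row = ["x"] * 53
        row[52] = str(value)
        f.write("|".join(row) + "\n")

total = 0
count = 0
# 'rt' opens the file in text mode, so each line is a str.
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        split_line = line.split("|")
        total += int(split_line[52])  # int() tolerates the trailing newline
        count += 1

print(count, " :: ", total)
```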
