numpy loadtxt, unicode, and python 2 or 3

Question

I have a (Windows) text file that Linux reports as:

ISO-8859 text, with very long lines, with CRLF line terminators

I want to read this into NumPy, skipping the first line, which contains labels (with special characters, usually only the Greek mu).

Python 2.7.6, Numpy 1.8.0, this works perfectly:

data = np.loadtxt('input_file.txt', skiprows=1)

Python 3.4.0, Numpy 1.8.0, gives an error:

>>> np.loadtxt('input_file.txt', skiprows=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/numpy/lib/npyio.py", line 796, in loadtxt
    next(fh)
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 4158: invalid start byte

To me this is "buggy" behaviour for the following reasons:

I want to skip the first line so it should be ignored, regardless of its encoding
If I delete the first line from the file, loadtxt works fine in both versions of python
Shouldn't numpy.loadtxt behave the same in python2 and python3?

Questions:

How to get around this problem (using python3 of course)?
Should I file a bug report or is this expected behaviour?

Do you get the same error using genfromtxt()?
– Saullo G. P. Castro, Commented Apr 8, 2014 at 12:47
No! genfromtxt() works. Thanks!
– Louic, Commented Apr 8, 2014 at 12:58
I reported this as an issue on NumPy's GitHub.
– Saullo G. P. Castro, Commented Apr 8, 2014 at 13:16

Saullo G. P. Castro · Accepted Answer · 2014-04-08 13:05:22Z

This looks like a bug in loadtxt(). Try genfromtxt() instead.
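
On newer NumPy releases there may be a more direct route. The following is a minimal sketch, assuming NumPy 1.14 or later (not the 1.8.0 discussed here), where both loadtxt() and genfromtxt() accept an encoding argument. The sample file contents are hypothetical stand-ins for the real data: a Latin-1 header containing the micro sign (byte 0xb5), CRLF line endings, and tab-separated numbers.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample: Latin-1 header with the micro sign, CRLF terminators.
sample = b"t (\xb5s)\tU (\xb5V)\r\n1.0\t2.5e-3\r\n2.0\t3.5e-3\r\n"

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Assumption: NumPy >= 1.14, where `encoding` is a supported keyword.
    # Naming the file's real encoding lets the header be decoded and skipped.
    data = np.loadtxt(path, skiprows=1, encoding="latin-1")
    data_gen = np.genfromtxt(path, skip_header=1, encoding="latin-1")
finally:
    os.remove(path)

print(data)
```

Passing the encoding explicitly avoids relying on the platform default, which is what the UTF-8 decode error in the question points to.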

user297171 · Accepted Answer · 2014-04-08 12:34:16Z

Yes, it seems to be a bug in NumPy: it tries to do some parsing even in skipped rows and fails. Better report it. In the meantime, the documentation says that loadtxt accepts a file object or a generator of strings as its first argument. Try this:

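import numpy as np

# Open in binary mode so the header line is skipped without being decoded
# (in Python 3, text mode would try to decode it and can raise the same error).
with open('input_file.txt', 'rb') as f:
    f.readline()          # discard the label line
    data = np.loadtxt(f)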

P.S. The error 'utf-8' codec can't decode byte 0xb5 in position 4158 doesn't seem to occur at the beginning of the file. Are you sure your file does not contain some odd symbol that is invisible, or that looks like a space but actually is not?
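
One way to check where such bytes sit is to scan the raw bytes rather than the decoded text. The sketch below is only an illustration: find_non_ascii is a hypothetical helper, and the sample bytes are made up to resemble a Latin-1 header followed by one data row.

```python
def find_non_ascii(raw):
    """Return (line_number, byte_offset, byte_value) for each byte >= 0x80."""
    hits = []
    line_number = 1
    for offset, byte in enumerate(raw):
        if byte >= 0x80:
            hits.append((line_number, offset, byte))
        elif byte == 0x0A:  # "\n" ends the current line (CRLF included)
            line_number += 1
    return hits


# Hypothetical sample: the micro sign is byte 0xb5 in ISO-8859-1.
sample = b"t (\xb5s)\tU\r\n1.0\t2.5e-3\r\n"
hits = find_non_ascii(sample)

# The same byte decodes cleanly as Latin-1 but is not valid UTF-8 on its own.
micro = bytes([0xB5]).decode("latin-1")

print(hits, micro)
```

For a real file, the bytes could be obtained with open('input_file.txt', 'rb').read(). If every hit is on line 1, the offending bytes are confined to the header.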

I am not 100% sure what exactly is in the file, although it looks like tab-separated numbers (some of them in scientific notation), and the Linux file command seems to agree. It also looks "normal" in vim using :set list, but I did not look at each individual character in these 200 KB files. If the "position" in the error message is a byte offset, it still falls within the first line, because the lines are very long.
You can try regenerating the file: read it with Python, parse it, convert the fields to numbers, then write them back. That will resolve any formatting issues. Or just use a regular expression to search for anything that is neither a space nor a digit.
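
The regeneration idea could look roughly like this. It is a sketch under assumptions: the source is Latin-1 with one header line, every remaining field is a number, and regenerate is a hypothetical helper name. It rewrites the file as UTF-8 with LF line endings.

```python
import os
import tempfile


def regenerate(src, dst, source_encoding="latin-1"):
    """Rewrite a numeric table as UTF-8 with LF endings, keeping the header."""
    with open(src, "r", encoding=source_encoding) as fin, \
            open(dst, "w", encoding="utf-8", newline="\n") as fout:
        # Text mode translates CRLF to "\n" on read.
        header = fin.readline().rstrip("\n")
        fout.write(header + "\n")
        for line in fin:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            values = [float(field) for field in fields]  # fails loudly on junk
            fout.write("\t".join(repr(v) for v in values) + "\n")


# Hypothetical sample file for demonstration.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "input_file.txt")
    dst = os.path.join(workdir, "clean_file.txt")
    with open(src, "wb") as f:
        f.write(b"t (\xb5s)\tU (\xb5V)\r\n1.0\t2.5e-3\r\n2.0\t3.5e-3\r\n")
    regenerate(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result)
```

Because float() raises ValueError on anything that is not a number, a stray invisible character in the data rows would surface immediately instead of being silently carried over.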
