
I have a text file (created on Windows) that the Linux file command reports as:

ISO-8859 text, with very long lines, with CRLF line terminators 

I want to read this into NumPy, skipping the first line, which contains labels with special characters (usually only the Greek mu).

Python 2.7.6, Numpy 1.8.0, this works perfectly:

data = np.loadtxt('input_file.txt', skiprows=1) 

Python 3.4.0, Numpy 1.8.0, gives an error:

>>> np.loadtxt('input_file.txt', skiprows=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/numpy/lib/npyio.py", line 796, in loadtxt
    next(fh)
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 4158: invalid start byte
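For context, the decoding error itself can be reproduced without NumPy. This is only a minimal sketch, assuming the mu in the header is stored as the single ISO-8859-1 byte 0xb5; the header text used here is made up for illustration:

```python
# Hypothetical header fragment: the micro sign is one byte (0xb5) in latin-1.
header = "t [\u00b5s]".encode("latin-1")

try:
    header.decode("utf-8")      # 0xb5 cannot start a UTF-8 sequence
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)

# latin-1 maps every byte to a character, so this always succeeds.
decoded = header.decode("latin-1")
print(decoded)
```

So any code path that decodes the first line as UTF-8 will fail on it, whether or not the line is later discarded.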

To me this is "buggy" behaviour for the following reasons:

  1. I want to skip the first line so it should be ignored, regardless of its encoding
  2. If I delete the first line from the file, loadtxt works fine in both versions of python
  3. Shouldn't numpy.loadtxt behave the same in python2 and python3?

Questions:

  • How to get around this problem (using python3 of course)?
  • Should I file a bug report or is this expected behaviour?
  • Do you get the same using genfromtxt()? Commented Apr 8, 2014 at 12:47
  • No! genfromtxt() works. Thanks! Commented Apr 8, 2014 at 12:58
  • I created this issue in NumPy's GitHub. Commented Apr 8, 2014 at 13:16

2 Answers


This looks like a bug in loadtxt(); try genfromtxt() instead.
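A rough sketch of what that could look like. The temporary file below is only a stand-in for the real input (its header and values are invented), and the explicit encoding argument assumes NumPy 1.14 or newer; on the NumPy 1.8.0 from the question, the call would be made without it:

```python
import os
import tempfile

import numpy as np

# Stand-in for the real file: latin-1 header with the micro sign (0xb5),
# tab-separated numbers, CRLF line terminators.
raw = "t [\u00b5s]\tU [\u00b5V]\r\n1.0\t2.5e-3\r\n2.0\t3.5e-3\r\n".encode("latin-1")
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# skip_header=1 drops the label line; encoding="latin-1" makes the
# decoding explicit instead of relying on a platform default.
data = np.genfromtxt(path, skip_header=1, encoding="latin-1")
os.remove(path)

print(data.shape)
```

Note that genfromtxt() names the argument skip_header, not skiprows as in loadtxt().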


Yes, this looks like a bug in NumPy: the traceback shows it failing in next(fh), so even the skipped row is still read and decoded. It is worth reporting. In the meantime, the documentation says that loadtxt accepts a file object or a generator of strings as its first argument. Try this:

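# latin-1 decodes the 0xb5 (mu) byte that the default UTF-8 codec rejects
with open('input_file.txt', encoding='latin-1') as f:
    f.readline()  # skip the header line with the labels
    data = np.loadtxt(f)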

P.S. The error 'utf-8' codec can't decode byte 0xb5 in position 4158 doesn't seem to occur at the beginning of the file. Are you sure the file does not contain some odd character that is invisible, or that looks like a space but is not one?

2 Comments

I am not 100% sure what exactly is in the file, although it looks like tab-separated numbers (some of them in scientific notation), and the Linux file command seems to agree. It also looks "normal" in vim using :set list, but I did not look at each individual character in these 200 KB files... If the "position" in the error message refers to the byte offset, then it falls within the first line: the lines are very long.
You can try regenerating the file: read it with Python, parse it, convert the values to numbers, then write them back. That will solve all formatting issues. Or just use a regular expression to search for anything that is neither a space nor a digit.
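One way to check where the offending bytes are is sketched below. It is a variant of the search suggested above: it looks for non-ASCII bytes rather than non-digits, and reports the byte offset and line number of each hit. The helper name find_non_ascii and the sample bytes are made up for illustration:

```python
import re


def find_non_ascii(raw: bytes):
    """Return (byte_offset, line_number, byte_value) for every non-ASCII byte."""
    hits = []
    for match in re.finditer(rb"[^\x00-\x7f]", raw):
        offset = match.start()
        # Line number = count of newlines before the hit, plus one.
        line_no = raw.count(b"\n", 0, offset) + 1
        hits.append((offset, line_no, raw[offset]))
    return hits


# Invented sample: header with two 0xb5 bytes, then one data row.
sample = b"t [\xb5s]\tU [\xb5V]\r\n1.0\t2.5e-3\r\n"
hits = find_non_ascii(sample)
for offset, line_no, value in hits:
    print(offset, line_no, hex(value))
```

For the real file, the bytes would come from open('input_file.txt', 'rb').read(). If every hit reports line 1, the header is the only line that needs special handling.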
