numpy UnicodeDecodeError am I using the right approach with genfromtxt

Question

I am stuck. I want to read a simple CSV file into a NumPy array and seem to have dug myself into a hole. I am new to NumPy and I am sure I have messed this up somehow, as I can usually read CSV files easily in Python 3.4. I don't want to use Pandas, so I thought I would use NumPy to increase my skill set, but I really am not getting this at all. If someone could tell me whether I am on the right track using genfromtxt, or whether there is an easier way, and give me a nudge in the right direction, I would be grateful.

I want to read in the CSV file, manipulate the datetime column to 8/4/2014, then put it in a NumPy array together with the remaining columns. Here is what I have so far, and the error which I am having trouble coding around. I can get the date part of the way there, but I don't see how to add date.strftime("%Y-%m-%d") to datefunc. I also don't see how to format the string for SYM to get around the error. Any help would be appreciated.

the data

    2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99
    2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75
    2015-08-04 02:14:05.415193, AIG, 0.0080808151, 0.0073296055, 0.0076213535, 12.8278962785, 11.635388035, 12.0985236788, -9.2962105215, 3.980405659, -142.8175077335, 71, 42, 33
    2015-08-04 02:14:05.486185, AMZN, 0.0235649449, 0.0305828226, 0.0092703502, 37.4081902773, 48.5487257749, 14.7162247572, 29.7810062852, -69.6877219282, -334.0005615016, 2, 92, 10

the "code" sorry still learning

    import numpy as np
    from datetime import datetime
    from datetime import date,time

    datefunc = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d %H:%M:%S.%f')

    a = np.genfromtxt('/home/dave/Desktop/development/hvanal2016.csv',delimiter = ',',
                      converters = {0:datefunc},
                      dtype='object,str,float,float,float,float,float,float,float,float,float,float,float,float',
                      names = ["date","sym","20sd","10sd","5sd","hv20","hv10","hv5","2010hv","105hv","abshv","2010rank","105rank","absrank"])

    print(a["date"])
    print(a["sym"])
    print(a["20sd"])
    print(a["hv20"])
    print(a["absrank"])

the error

    Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
    [GCC 5.2.1 20151010] on linux
    Type "copyright", "credits" or "license()" for more information.
    >>>
    ============================================================================== RESTART: /home/dave/3 9 15 my slope.py ===============================================================================
    [datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
     datetime.datetime(2015, 8, 4, 2, 14, 5, 325113)
     datetime.datetime(2015, 8, 4, 2, 14, 5, 415193) ...,
     datetime.datetime(2016, 3, 18, 1, 0, 25, 925754)
     datetime.datetime(2016, 3, 18, 1, 0, 26, 26400)
     datetime.datetime(2016, 3, 18, 1, 0, 26, 114828)]
    Traceback (most recent call last):
      File "/home/dave/3 9 15 my slope.py", line 19, in <module>
        print(a["sym"])
      File "/usr/lib/python3/dist-packages/numpy/core/numeric.py", line 1615, in array_str
        return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
      File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 454, in array2string
        separator, prefix, formatter=formatter)
      File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 328, in _array2string
        _summaryEdgeItems, summary_insert)[:-1]
      File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 490, in _formatArray
        word = format_function(a[i]) + separator
    UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

Community · Accepted Answer · 2017-05-23 12:31:30Z

So part of your text is

b'2015-08-04 02:14:05.249392 AA 0.0193103612 ...'

(I'm using b because Py3 genfromtxt opens the file as bytestrings.)

But you specify a ',' delimiter. I don't see any commas.

Let's just try a basic load, no fancy business.

    In [97]: txt=b"""2015-08-04 02:14:05.249392 AA 0.0193103612 0.0193515212 0.0249713335 30.6542480634 30.7195875454 39.640763021 0.2131498442 29.0406746589 13524.5347810182 89 57 99
    2015-08-04 02:14:05.325113 AAPL 0.0170506271 0.0137941891 0.0105915637 27.0670313481 21.8975963326 16.8135861893 -19.0986405157 -23.2172064279 21.5647072302 33 26 75
    """

    In [98]: txt=txt.splitlines()

    In [99]: data=np.genfromtxt(txt,dtype=None)

    In [100]: data
    Out[100]:
    array([ (b'2015-08-04', b'02:14:05.249392', b'AA', 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99),
           (b'2015-08-04', b'02:14:05.325113', b'AAPL', 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75)],
          dtype=[('f0', 'S10'), ('f1', 'S15'), ('f2', 'S4'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')])

The datetime information is in 2 fields:

    In [101]: data[['f0','f1']]
    Out[101]:
    array([(b'2015-08-04', b'02:14:05.249392'),
           (b'2015-08-04', b'02:14:05.325113')],
          dtype=[('f0', 'S10'), ('f1', 'S15')])

Your datefunc does work with a byte substring:

    In [102]: datefunc(b'2015-08-04 02:14:05.249392')
    Out[102]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)

But it requires 2 fields (as defined by the ' ' delimiter). So we need to figure out a way of parsing these 2 substrings as one, rather than split into two fields.
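As an aside, one possible way to rejoin the two whitespace-separated fields after loading is sketched below. This is not what the answer goes on to do; the shortened sample lines, the explicit `encoding="utf-8"` argument, and the `stamps` name are assumptions made for illustration.

```python
import numpy as np

# Shortened, whitespace-delimited sample rows (illustrative only).
lines = [
    "2015-08-04 02:14:05.249392 AA 0.0193103612 89",
    "2015-08-04 02:14:05.325113 AAPL 0.0170506271 33",
]

# With no delimiter given, genfromtxt splits on whitespace, so the date and
# the time land in separate fields named f0 and f1.
data = np.genfromtxt(lines, dtype=None, encoding="utf-8")

# Glue date and time back together in ISO form and let NumPy parse them.
stamps = np.array(
    [f"{d}T{t}" for d, t in zip(data["f0"], data["f1"])],
    dtype="M8[us]",
)
print(stamps)
```

The join happens in plain Python after loading, so it costs an extra pass over the rows; for a comma-delimited file this step should not be needed at all.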

Maybe I'll try changing the sample txt to really use a ',' delimiter (but not between date and time) and see what works.

With the ','-delimited text I get:

    In [117]: data=np.genfromtxt(txt,delimiter=',',dtype=None,usecols=[0,1,2,3])

    In [118]: data.dtype
    Out[118]: dtype([('f0', 'S26'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])

    In [119]: data['f0']
    Out[119]:
    array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
           b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'],
          dtype='|S26')

    In [120]: [datefunc(d) for d in data['f0']]
    Out[120]:
    [datetime.datetime(2015, 8, 4, 2, 14, 5, 249392),
     datetime.datetime(2015, 8, 4, 2, 14, 5, 325113),
     datetime.datetime(2015, 8, 4, 2, 14, 5, 415193),
     datetime.datetime(2015, 8, 4, 2, 14, 5, 486185)]

I used usecols because the full text has 14 fields in the 1st line, and 13 in the others.
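A minimal sketch of why `usecols` helps with a field miscount, using made-up numeric rows rather than the real data: a full read of ragged rows raises a `ValueError`, while selecting only columns that every row has goes through. The `ragged`, `full_read_failed` and `first_two` names are illustrative.

```python
import numpy as np

# Two rows with different field counts (4 and 3), made up for illustration.
ragged = ["1, 2, 3, 4", "5, 6, 7"]

# Reading every column fails because the rows disagree on the column count.
try:
    np.genfromtxt(ragged, delimiter=",", encoding="utf-8")
    full_read_failed = False
except ValueError:
    full_read_failed = True

# Restricting the read to columns present in every row avoids the error.
first_two = np.genfromtxt(ragged, delimiter=",", usecols=[0, 1], encoding="utf-8")
print(full_read_failed, first_two)
```

This only hides the miscount; if the missing field matters, the file itself still needs fixing.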

If I specify the dtype (instead of the easy None), I can replace the strings in the 1st field with these datetime objects:

    In [122]: data=np.genfromtxt(txt,delimiter=',',dtype='O,S5,f,f',usecols=[0,1,2,3])

    In [123]: data
    Out[123]:
    array([ (b'2015-08-04 02:14:05.249392', b' AA', 0.01931036077439785, 0.019351521506905556),
           (b'2015-08-04 02:14:05.325113', b' AAPL', 0.01705062761902809, 0.01379418931901455),....],
          dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])

    In [124]: data['f0']
    Out[124]:
    array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
           b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'], dtype=object)

    ....

    In [126]: data['f0']=[datefunc(d) for d in data['f0']]

    In [127]: data
    Out[127]:
    array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.01931036077439785, 0.019351521506905556),
           (datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.01705062761902809, 0.01379418931901455),...],
          dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])

And with the converter, your call works (more or less):

    In [133]: data=np.genfromtxt(txt,dtype='object,S5,float,float', converters = {0:datefunc},delimiter=',',usecols=[0,1,2,3])

    In [134]: data
    Out[134]:
    array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
           (datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
          dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
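The question also asked how to fold `strftime("%Y-%m-%d")` into the converter and how to handle the symbol column. A minimal sketch, under a few assumptions that are mine rather than the answer's: the `to_day` helper and the shortened field names are hypothetical, `autostrip=True` is used to drop the leading space seen in `b' AA'`, and a sized `U5` string field follows the answer's `S5` instead of the bare `str` from the question. The helper accepts both bytes and str because the type passed to converters depends on the NumPy version and the `encoding` argument.

```python
from datetime import datetime

import numpy as np

# Shortened comma-delimited sample rows (first four columns only).
lines = [
    "2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212",
    "2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891",
]

def to_day(text):
    """Hypothetical converter: full timestamp in, 'YYYY-MM-DD' string out."""
    if isinstance(text, bytes):  # some NumPy versions hand converters bytes
        text = text.decode("utf-8")
    stamp = datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S.%f")
    return stamp.strftime("%Y-%m-%d")

data = np.genfromtxt(
    lines,
    delimiter=",",
    autostrip=True,            # strip the blanks that follow each comma
    encoding="utf-8",
    dtype="U10,U5,f8,f8",      # sized string fields for date and symbol
    names=["date", "sym", "sd20", "sd10"],
    converters={0: to_day},
)
print(data["date"], data["sym"])
```

Storing the day as a string keeps the example close to what the question asked for; a `datetime64` column is usually easier to compute with afterwards.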

NumPy's datetime64 also works with this string. These types can be used as NumPy numbers.

    In [154]: datefunc(b'2015-08-04 02:14:05.249392')
    Out[154]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)

    In [155]: np.datetime64(b'2015-08-04 02:14:05.249392')
    Out[155]: numpy.datetime64('2015-08-04T02:14:05.249392-0700')

Following the question "Importing csv into Numpy datetime64", I got this to work:

    In [175]: data=np.genfromtxt(txt,dtype='M8[us],S5,float,float', delimiter=',',usecols=[0,1,2,3])

    In [176]: data
    Out[176]:
    array([ (datetime.datetime(2015, 8, 4, 9, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
           (datetime.datetime(2015, 8, 4, 9, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
          dtype=[('f0', '<M8[us]'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])

For datetime units, see: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units
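A short sketch of what the units buy you, assuming the timestamps are already in a `datetime64[us]` array (the `stamps`, `days`, `labels` and `elapsed` names are illustrative): casting to day units drops the time of day, which is roughly what the question wanted from `strftime`, and subtraction gives a `timedelta64`.

```python
import numpy as np

# Two of the sample timestamps, stored with microsecond resolution.
stamps = np.array(
    ["2015-08-04T02:14:05.249392", "2015-08-04T02:14:05.325113"],
    dtype="M8[us]",
)

# Casting to day units truncates the time-of-day part.
days = stamps.astype("M8[D]")

# Render the days as 'YYYY-MM-DD' strings.
labels = np.datetime_as_string(days, unit="D")

# datetime64 values support arithmetic; the difference is a timedelta64.
elapsed = stamps[1] - stamps[0]

print(labels, elapsed)
```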

Hey @hpaulj, sorry, there ARE "," in the data; I didn't spot that when I loaded the data into the message. I also reduced the data to 5 lines from 10 to make it more readable.
I tried the new text. There is a field miscount, but I worked around that.
Thanks so much for going the extra mile. I assume from all your good work that there is no better way to do this in NumPy. You have given me an excellent start at trying to improve my approach. Thank you.
