
This sounds like it should be really easy to answer with Google, but I'm finding it impossible to answer most of my nontrivial pandas/pytables questions this way. All I'm trying to do is load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table of 26 fields: a mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my HDF5 file with df.to_hdf().

I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days, versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use data_columns = False you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the HDF5 table; the rest are being dumped by data type into values_block_0 through values_block_4:

```python
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2',
 'values_block_3', 'values_block_4', 'str_col1', 'str_col2',
 'str_col3', 'str_col4']
```

And any query like `df = pd.DataFrame.from_records(table.read_where(condition))` fails with `Exception: Data must be 1-dimensional`.

So my questions are:

  1. Do I really have to use data_columns = True, which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table.
  2. What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
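One middle ground worth noting (the column names below are assumptions): data_columns accepts a list, so only the handful of columns you intend to query need to be broken out, while everything else stays in the fast block layout:

```python
import pandas as pd

# Sketch: declare only the query columns as data columns; the other
# ~20 fields stay packed in values blocks, keeping the load fast.
df = pd.DataFrame({'val': [1.0, 2.0, 3.0],
                   'str_col1': ['a', 'b', 'a']})
df.to_hdf('demo_dc.h5', key='tbl_main', format='table', append=True,
          data_columns=['str_col1'], min_itemsize={'str_col1': 20})

# where= queries now work on the declared data column:
hits = pd.read_hdf('demo_dc.h5', 'tbl_main', where='str_col1 == "a"')
```

Index creation can also be deferred: pass index=False while appending, then call HDFStore.create_table_index('tbl_main', columns=['str_col1']) once after the load.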

  • Hey dts! This question is going to need quite a bit of editing before anyone can answer it. One of the cardinal rules is to ask 1 focused question at a time, so you'll probably get flagged for that. Also, showing a minimal reproducible example of the correct and garbled result, on some smallish dataframes, is going to help any readers figure out how to help. Commented May 31, 2019 at 0:47
  • I thought it was a generic question about data_columns = True versus False ... something that should be well-covered in the pytables documentation, but so far as I can tell is not. Commented May 31, 2019 at 13:28
  • Not sure why the pytables docs should cover this. df.to_hdf(data_columns = True) versus False is a pandas function. If all your CSV files have the same columns of data, and all you want is an HDF5 file of same, maybe you should skip pandas and use pytables or h5py instead. There are SO answers on how to do both. Commented May 31, 2019 at 14:26
  • Right, I stand corrected, should have said pandas documentation. I spent quite a bit of time looking at pytables to skip the dataframe read_csv step, but couldn't get the read_csv speed together with the richness of what read_csv can do while reading. But if pytables is way faster I would settle for just sucking the CSVs up with no fancy processing. So, nobody knows anything about df.to_hdf(data_columns = True versus False)? One of life's great mysteries? Commented May 31, 2019 at 15:17
  • I was going to point you to my answer to a previous question about this. Then I discovered the "question was voluntarily removed by its author." I found my old code and modified to address your situation w/ multiple CSVs. See answer below. Commented May 31, 2019 at 18:50

1 Answer


This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.

  1. Use a loop to read each CSV file with np.genfromtxt() into a np array.
  2. After reading the first CSV file, write the data with the .create_table() method, referencing the np array created in Step 1.
  3. For each additional CSV file, write the data with the .append() method, referencing the np array created in Step 1.

Updated on 6/2/2019 to read a date field (mm/dd/YYYY) and convert to a datetime object. Note changes to the genfromtxt() arguments! The data used is added below the updated code.

```python
import numpy as np
import tables as tb
from datetime import datetime

csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv']
my_dtype = np.dtype([('a', int), ('b', 'S20'), ('c', float),
                     ('d', float), ('e', 'S20')])

with tb.open_file('SO_56387241.h5', mode='w') as h5f:
    for PATH_csv in csv_list:
        csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype,
                                 delimiter=',', encoding=None)
        # modify date in fifth field 'e' (with names=True the field is
        # named 'my_date' after the CSV header); the datetime is stored
        # back as its string form in the S20 field
        for row in csv_data:
            datetime_object = datetime.strptime(
                row['my_date'].decode('UTF-8'), '%m/%d/%Y')
            row['my_date'] = datetime_object
        if '/CSV_Data' in h5f:
            dset = h5f.root.CSV_Data
            dset.append(csv_data)
        else:
            dset = h5f.create_table('/', 'CSV_Data', obj=csv_data)
        dset.flush()
    # no explicit h5f.close() needed: the with block closes the file
```
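A sketch of querying the resulting table directly with PyTables (the sample array below is a stand-in; in the real file the field names come from the CSV header row):

```python
import numpy as np
import tables as tb

# Stand-in data with the same header-derived field names:
sample = np.array([(1, b'one', 1.0), (7, b'seven', 7.0)],
                  dtype=[('my_int', 'i8'), ('my_str', 'S20'),
                         ('my_float', 'f8')])
with tb.open_file('demo_query.h5', mode='w') as h5f:
    h5f.create_table('/', 'CSV_Data', obj=sample)

# Every column of a native PyTables table is queryable with an
# in-kernel condition, even before any index is created:
with tb.open_file('demo_query.h5', mode='r') as h5f:
    rows = h5f.root.CSV_Data.read_where('my_int > 5')
```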

Data for testing:

SO_56387241_1.csv:

```
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
```

SO_56387241_2.csv:

```
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999
```

16 Comments

This approach looks promising ... like maybe I can do everything I was doing with read_csv using genfromtxt instead, and it might have the speed I was expecting. But, being ignorant, I can't figure out how to read the string fields. And, I'm not allowed to enter enough characters here to show what I'm doing. Guess I will try to edit the original post above ...
Can't figure out how to add my code in edit ... what do I do, start a new question on the new topic "genfromtxt"?
Cancel that, the fourth time I tried the syntax like '|S10' it finally worked. Now I'm trying to figure out the string-to-int converters and date conversion ...
What do you mean by "can't figure out how to read the string fields"? When you use dtype=None, genfromtxt should auto-magically create the dtype based on the data it finds. Then create_table will use the np array dtype to define the table field datatypes. Also, remember that you are dealing with unicode objects. I don't know what happens if you have different length string fields on each row. Also, if you know the CSV file format, you can define the dtype and reference it in genfromtxt.
I have to define the dtypes as a list, otherwise the "guesses" on the earlier CSVs will be the wrong sizes for the later CSVs. But even if I didn't need to do this, dtype = None gives garbage for the string fields. At the moment I'm giving up on all my converters and trying to live with none of those, except the dates.
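For what it's worth, a sketch of fixing string widths up front and converting dates at read time with genfromtxt (the field names, widths, and converter column index are assumptions based on the sample data above):

```python
import numpy as np
from datetime import datetime
from io import StringIO

# Fixed-width dtype so every CSV parses to the same layout
# (the widths here are guesses):
my_dtype = np.dtype([('my_int', 'i8'), ('my_str', 'S20'),
                     ('my_float', 'f8'), ('my_exp', 'f8'),
                     ('my_date', 'S10')])
# Converter keyed by 0-based column index: parse mm/dd/YYYY while
# reading and store an ISO-format string:
convert = {4: lambda s: datetime.strptime(s, '%m/%d/%Y').strftime('%Y-%m-%d')}

sample = StringIO('0,zero,0.0,0.00E+00,01/01/1980\n'
                  '1,one,1.0,1.00E+00,02/01/1981')
data = np.genfromtxt(sample, dtype=my_dtype, delimiter=',',
                     converters=convert, encoding=None)
```

With the dates normalized at read time, the per-row strptime loop in the answer would no longer be needed.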
