
I have been searching for quite some time for an answer on how to go about this and can't seem to find anything that works.

I am following a tutorial on using the tf.data API found here. My scenario is very similar to the one in that tutorial (i.e. I have 3 directories containing all the training/validation/test files); however, my files are not images, they are spectrograms saved as CSVs.

I have found a couple of solutions for reading a CSV line by line where each line is a training instance (e.g., How to *actually* read CSV data in TensorFlow?). My issue with that implementation is the required record_defaults parameter, since my CSVs are 500x200.

Here is what I was thinking:

import tensorflow as tf
import pandas as pd

def load_data(path, label):
    # This obviously doesn't work because path and label
    # are Tensors, but this is what I had in mind...
    data = pd.read_csv(path, index_col=0).values
    return data, label

X_train = tf.constant(training_files)   # training_files is a list of the file names
Y_train = tf.constant(training_labels)  # training_labels is a list of labels for each file

train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))

# Here is where I thought I would do the mapping of 'load_data' over each batch
train_data = train_data.batch(64).map(load_data)

iterator = tf.data.Iterator.from_structure(train_data.output_types,
                                           train_data.output_shapes)
next_batch = iterator.get_next()
train_op = iterator.make_initializer(train_data)

I have only used TensorFlow's feed_dict in the past, but I need a different approach now that my data has grown too large to fit in memory.

Any thoughts? Thanks.

  • I don't understand the problem you describe with record_defaults. Can you elaborate? Commented Apr 3, 2018 at 17:11
  • @mikkola sure. As far as I can tell, reading the CSV line by line would require that I create a list of length 200 every time I wanted to read a file (record_defaults = [[0],[0],...,[0]]), and then do something like cols = tf.decode_csv(csv_row, record_defaults=record_defaults) and data = tf.stack(cols), which seemed like a lot of overhead for every file. Commented Apr 3, 2018 at 17:18
  • Ah, I see. Could still be worth a try? You only need to create one constant tensor to do that, and it can be shared between calls, right? Another option I have had success with is to read the whole file contents using tf.read_file, then split them appropriately (see tf.string_split) or interpret them directly as CSV using tf.decode_csv (a sketch of this approach follows these comments). Commented Apr 3, 2018 at 18:02
  • A list of 200 tensors does not sound bad at all, and you can reuse the same tf.constant(0) tensor. I would definitely give it a try. Commented Apr 5, 2018 at 1:05
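
For reference, here is a minimal sketch of the whole-file approach suggested in the comments. It assumes TensorFlow 1.x, exactly 200 columns per spectrogram, and no header row or index column (all assumptions; adjust to match the actual files):

import tensorflow as tf

N_COLS = 200  # assumed number of columns in each spectrogram CSV

def load_csv(path, label):
    # Read the entire file as one string, split it into lines,
    # then decode every line as a CSV record of N_COLS floats.
    contents = tf.read_file(path)
    lines = tf.string_split([contents], delimiter="\n").values
    record_defaults = [[0.0]] * N_COLS        # one shared default per column
    cols = tf.decode_csv(lines, record_defaults=record_defaults)
    data = tf.stack(cols, axis=1)             # shape: (n_rows, N_COLS)
    return data, label

# Map per file (before batching), then batch the decoded spectrograms.
train_data = (tf.data.Dataset.from_tensor_slices((X_train, Y_train))
              .map(load_csv)
              .batch(64))

The map is applied before batch here so that load_csv sees one file path at a time rather than a batch of paths.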

2 Answers


I use TensorFlow 2.0's tf.data to read my CSV dataset. I have several folders, one for each class, and each folder contains thousands of CSV files of data points. Below is the code I use for the data input pipeline. Hope this helps.

import numpy as np
import tensorflow as tf

def tf_parse_filename(filename):
    def parse_filename(filename_batch):
        data = []
        labels = []
        for filename in filename_batch:
            # Read data
            filename_str = filename.numpy().decode()

            # Read .csv file
            data_point = np.loadtxt(filename_str, delimiter=',')

            # Create label
            current_label = get_label(filename)
            label = np.zeros(n_classes, dtype=np.float32)
            label[current_label] = 1.0

            data.append(data_point)
            labels.append(label)
        return np.stack(data), np.stack(labels)

    x, y = tf.py_function(parse_filename, [filename], [tf.float32, tf.float32])
    return x, y

train_ds = tf.data.Dataset.from_tensor_slices(TRAIN_FILES)
train_ds = train_ds.batch(BATCH_SIZE, drop_remainder=True)
train_ds = train_ds.map(tf_parse_filename, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)

# Train on epochs
for i in range(num_epochs):
    # Train on batches
    for x_train, y_train in train_ds:
        train_step(x_train, y_train)

print('Training done!')

"TRAIN_FILES" is a matrix (e.g. pandas dataframe) where the first column is the label of a data point and the second column is the path to the csv file containing the data point.



I suggest looking at this thread. It provides a complete example of how to use the Dataset API to read data from multiple CSV files.

Tensorflow Python reading 2 files

Addendum:

Not sure how relevant the problem still is today, but after seeing the comment from @markdjthomas that the problem here is slightly different and that he needs to read several rows at a time instead of one, the following example may come in handy as well. Sharing it here in case anyone else needs it too...

import tensorflow as tf
import numpy as np
from tensorflow.contrib.data.python.ops import sliding

sequence = np.array([[[1]], [[2]], [[3]], [[4]], [[5]], [[6]], [[7]], [[8]], [[9]]])
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1]

# create TensorFlow Dataset object
data = tf.data.Dataset.from_tensor_slices((sequence, labels))

# sliding window batch
window_size = 3
window_shift = 1
data = data.apply(sliding.sliding_window_batch(window_size=window_size, window_shift=window_shift))
data = data.shuffle(1000, reshuffle_each_iteration=False)
data = data.batch(3)

# iter = dataset.make_initializable_iterator()
iter = tf.data.Iterator.from_structure(data.output_types, data.output_shapes)
el = iter.get_next()

# create initialization ops
init_op = iter.make_initializer(data)

NR_EPOCHS = 3
with tf.Session() as sess:
    for e in range(NR_EPOCHS):
        print("\nepoch: ", e, "\n")
        sess.run(init_op)
        print("1 ", sess.run(el))
        print("2 ", sess.run(el))
        print("3 ", sess.run(el))

And the output...

epoch: 0
1 (array([[[[5]], [[6]], [[7]]], [[[4]], [[5]], [[6]]], [[[1]], [[2]], [[3]]]]), array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=int32))
2 (array([[[[3]], [[4]], [[5]]], [[[2]], [[3]], [[4]]], [[[7]], [[8]], [[9]]]]), array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=int32))
3 (array([[[[6]], [[7]], [[8]]]]), array([[0, 1, 0]], dtype=int32))

epoch: 1
1 (array([[[[1]], [[2]], [[3]]], [[[6]], [[7]], [[8]]], [[[2]], [[3]], [[4]]]]), array([[1, 0, 1], [0, 1, 0], [0, 1, 0]], dtype=int32))
2 (array([[[[5]], [[6]], [[7]]], [[[3]], [[4]], [[5]]], [[[7]], [[8]], [[9]]]]), array([[1, 0, 1], [1, 0, 1], [1, 0, 1]], dtype=int32))
3 (array([[[[4]], [[5]], [[6]]]]), array([[0, 1, 0]], dtype=int32))

epoch: 2
1 (array([[[[1]], [[2]], [[3]]], [[[5]], [[6]], [[7]]], [[[2]], [[3]], [[4]]]]), array([[1, 0, 1], [1, 0, 1], [0, 1, 0]], dtype=int32))
2 (array([[[[4]], [[5]], [[6]]], [[[3]], [[4]], [[5]]], [[[7]], [[8]], [[9]]]]), array([[0, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=int32))
3 (array([[[[6]], [[7]], [[8]]]]), array([[0, 1, 0]], dtype=int32))
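
Note that tf.contrib was removed in TensorFlow 2.x, so sliding.sliding_window_batch is no longer available there. For anyone on TF 2, here is a hedged sketch of an equivalent sliding-window pipeline built with the standard Dataset.window transformation (eager execution assumed; the shuffling details may differ from the run above):

import numpy as np
import tensorflow as tf

sequence = np.array([[[1]], [[2]], [[3]], [[4]], [[5]], [[6]], [[7]], [[8]], [[9]]])
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1]

window_size = 3
window_shift = 1

data = tf.data.Dataset.from_tensor_slices((sequence, labels))

# window() yields a dataset of (sequence_window, label_window) sub-datasets;
# flat_map + batch(window_size) turns each sub-dataset back into plain tensors.
data = data.window(window_size, shift=window_shift, drop_remainder=True)
data = data.flat_map(lambda x, y: tf.data.Dataset.zip((x.batch(window_size),
                                                       y.batch(window_size))))
data = data.shuffle(1000).batch(3)

for el, lab in data:
    print(el.numpy().shape, lab.numpy())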


  • Thanks for your input, but this isn't really what I am looking for, as each line is not a unique training example. The entire file is a single training example (i.e. a 2D array).
