I have a big dataset (300,000 examples × 33,000 features), which of course does not fit into memory. The data are saved in HDF5 format. The values are mostly zeros (sparse data). They look like this:
```
Attr1    52      52      52      52      52      52      52      52     ...
Attr2    umb     umb     umb     umb     umb     umb     umb     umb    ...
CellID   TGC-1   TGG-1   CAG-1   TTC-1   GTG-1   GTA-1   CAA-1   CAC-1  ...
Acc      Gene
243485   RP11-.3    0       0       0       0       0       0       0       0      ...
237613   FAM138A    0       0       0       0       0       0       0       0      ...
186092   OR4F5      0       0       0       0       0       0       0       0      ...
238009   RP11-.7    0       0       0       0       0       0       0       0      ...
239945   RP11-.8    0       0       0       0       0       0       0       0      ...
279457   FO538.2    0       0       0       0       0       0       0       0      ...
228463   AP006.2    0       0       0       0       0       0       0       0      ...
...      ...        ...     ...     ...     ...     ...     ...     ...     ...    ...
```

I have done the following, which works, to load the whole dataset into TensorFlow (loompy is just a package using HDF5 in the background):
```python
import tensorflow as tf
import numpy as np
import loompy

batch_size = 1000

# Get shape, dtype and labels from the file without loading it all.
with loompy.connect(filename, 'r') as ds:
    ds_shape = (batch_size, ds.shape[0])
    ds_dtype = ds[0:1, 0:1].dtype
    labels = np.asarray([ds.ca.CellID, ds.ca.Attr1]).T
    labels_shape = (batch_size, 1)

data_placeholder = tf.placeholder(ds_dtype, ds_shape)
labels_placeholder = tf.placeholder(labels[:, 1].dtype, labels_shape)

dataset = tf.data.Dataset.from_tensor_slices((data_placeholder, labels_placeholder))
dataset = dataset.prefetch(batch_size)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    with loompy.connect(filename, 'r') as ds:
        # Read one column-chunk at a time and re-initialize the iterator on it.
        for i in range(0, ds.shape[1], batch_size):
            batch = ds[0:ds_shape[1], i:i + batch_size].T
            batch_labels = np.asarray([ds.ca.CellID[i:i + batch_size],
                                       ds.ca.Attr1[i:i + batch_size]]).T[:, 1]

            sess.run(iterator.initializer,
                     feed_dict={data_placeholder: batch,
                                labels_placeholder: batch_labels.reshape(batch_size, 1)})
            for _ in range(batch_size):
                print(sess.run(next_element))
```

Output:
```
(array([0, 0, 0, ..., 0, 0, 0], dtype=int32), array([b'52'], dtype=object))
(array([0, 0, 0, ..., 0, 0, 0], dtype=int32), array([b'52'], dtype=object))
...
```
This way, however, I am not able to split my data into train, test, and evaluation sets. Also, I can only shuffle within each batch, which is not effective, since most of the time the data in a batch belong to the same class.
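To make the shuffling problem concrete, here is a toy reproduction with plain NumPy (the array below just stands in for my labels, which are stored sorted by class): shuffling inside each batch still leaves every batch containing only one class.

```python
import numpy as np

# Toy stand-in for my labels: 6 classes x 4 examples, stored sorted by
# class, so every batch of 4 covers exactly one class (as in my file).
labels = np.repeat(np.arange(6), 4)   # [0 0 0 0 1 1 1 1 ...]
batch_size = 4

rng = np.random.default_rng(0)
for i in range(0, len(labels), batch_size):
    batch = labels[i:i + batch_size].copy()
    rng.shuffle(batch)                 # shuffling within the batch...
    # ...still leaves a single class in it:
    assert len(np.unique(batch)) == 1
```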
How do I manipulate this kind of data so that I can load it as train, test, and evaluation sets and perform proper shuffling (preferably utilizing my Titan X GPU as much as possible)?
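What I have in mind (but don't know how to wire into the pipeline above) is a single global permutation of the column indices, split once into the three sets before any batching. A minimal NumPy sketch, where `n_cells` stands in for `ds.shape[1]` and the 80/10/10 split fractions are just an example:

```python
import numpy as np

# Hypothetical sketch: permute the example (column) indices globally once,
# then split them into train/validation/test before any batching.
n_cells = 300_000                      # stand-in for ds.shape[1]
rng = np.random.default_rng(42)
perm = rng.permutation(n_cells)

n_train = int(0.8 * n_cells)           # example 80/10/10 split
n_val = int(0.1 * n_cells)
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

# The three sets are disjoint and together cover every example once.
assert len(train_idx) + len(val_idx) + len(test_idx) == n_cells
```

The open part for me is how to read batches of such arbitrarily permuted indices out of the HDF5 file efficiently, since the loop above reads only contiguous column slices.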