tf.keras.utils.PyDataset

Base class for defining a parallel dataset using Python code.

Every PyDataset must implement the __getitem__() and the __len__() methods. If you want to modify your dataset between epochs, you may additionally implement on_epoch_end(). The __getitem__() method should return a complete batch (not a single sample), and the __len__() method should return the number of batches in the dataset (rather than the number of samples).
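
For illustration, here is a minimal sketch of that contract for a dataset backed by two in-memory NumPy arrays; the class name ArrayDataset and its arguments are illustrative, not part of the API:

import math
import numpy as np
from tensorflow import keras

class ArrayDataset(keras.utils.PyDataset):
    def __init__(self, x, y, batch_size=32, **kwargs):
        # Forward `workers`, `use_multiprocessing`, and `max_queue_size`
        # to the base class via **kwargs.
        super().__init__(**kwargs)
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, not the number of samples.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Return one complete batch, not a single sample.
        low = idx * self.batch_size
        high = min(low + self.batch_size, len(self.x))
        return self.x[low:high], self.y[low:high]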

Args

workers
    Number of workers to use in multithreading or multiprocessing.
use_multiprocessing
    Whether to use Python multiprocessing for parallelism. Setting this to True means that your dataset will be replicated in multiple forked processes. This is necessary to gain compute-level (rather than I/O-level) benefits from parallelism. However, it can only be set to True if your dataset can be safely pickled.
max_queue_size
    Maximum number of batches to keep in the queue when iterating over the dataset in a multithreaded or multiprocessed setting. Reduce this value to reduce the CPU memory consumption of your dataset. Defaults to 10.

Notes:

  • PyDataset is a safer way to do multiprocessing. This structure guarantees that the model will only train once on each sample per epoch, which is not the case with Python generators.
  • The arguments workers, use_multiprocessing, and max_queue_size exist to configure how fit() uses parallelism to iterate over the dataset. They are not used by the PyDataset class directly: when you iterate over a PyDataset manually, no parallelism is applied (see the snippet below).
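
For example, a plain Python loop over a PyDataset runs sequentially in the calling process, one batch at a time, regardless of the parallelism settings (reusing the illustrative ArrayDataset sketched above):

import numpy as np

dataset = ArrayDataset(np.zeros((100, 8)), np.zeros((100,)),
                       batch_size=32, workers=4, use_multiprocessing=True)
# Manual iteration: the parallelism settings above have no effect here;
# each batch is computed on demand in the current process.
for i in range(len(dataset)):
    batch_x, batch_y = dataset[i]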

Example:

from skimage.io import imread
from skimage.transform import resize
import numpy as np
import math
from tensorflow import keras

# Here, `x_set` is a list of paths to the images,
# and `y_set` are the associated classes.

class CIFAR10PyDataset(keras.utils.PyDataset):

    def __init__(self, x_set, y_set, batch_size, **kwargs):
        super().__init__(**kwargs)
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Return number of batches.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Return x, y for batch idx.
        low = idx * self.batch_size
        # Cap upper bound at array length; the last batch may be smaller
        # if the total number of items is not a multiple of batch size.
        high = min(low + self.batch_size, len(self.x))
        batch_x = self.x[low:high]
        batch_y = self.y[low:high]

        return np.array([
            resize(imread(file_name), (200, 200))
            for file_name in batch_x
        ]), np.array(batch_y)
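
To hand the parallel iteration over to fit(), construct the dataset with the desired settings and pass it in directly. A minimal sketch; model, train_paths, and train_labels are placeholders, not defined in this example:

train_data = CIFAR10PyDataset(
    train_paths, train_labels, batch_size=64,
    workers=4, use_multiprocessing=True, max_queue_size=10)
# fit() consumes the PyDataset batch by batch, using the settings
# above to parallelize iteration across workers.
model.fit(train_data, epochs=10)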

Attributes

max_queue_size
    Maximum number of batches to keep in the queue (see the constructor argument of the same name).
num_batches
    Number of batches in the PyDataset.
use_multiprocessing
    Whether Python multiprocessing is used for parallelism (see the constructor argument of the same name).
workers
    Number of workers used in multithreading or multiprocessing (see the constructor argument of the same name).

Methods

on_epoch_end


Method called at the end of every epoch.
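
on_epoch_end() is a no-op by default; a common override reshuffles the sample order between epochs. A minimal sketch extending the illustrative ArrayDataset above (the index-permutation approach is one common pattern, not mandated by the API):

import numpy as np

class ShuffledArrayDataset(ArrayDataset):
    def __init__(self, x, y, batch_size=32, **kwargs):
        super().__init__(x, y, batch_size, **kwargs)
        # Permutation of sample indices, reshuffled after every epoch.
        self.indices = np.arange(len(self.x))

    def __getitem__(self, idx):
        low = idx * self.batch_size
        high = min(low + self.batch_size, len(self.x))
        batch_idx = self.indices[low:high]
        return self.x[batch_idx], self.y[batch_idx]

    def on_epoch_end(self):
        # Called by fit() after each epoch; reshuffle so batches differ.
        np.random.shuffle(self.indices)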

__getitem__


Gets the batch at position index.

Args
index
    Position of the batch in the PyDataset.

Returns
A complete batch.