I'm having difficulties working with the `tf.contrib.data.Dataset` API and wondered if some of you could help. I wanted to transform the entire skip-gram pre-processing of word2vec into this paradigm to play with the API a little bit; it involves the following operations:
- Sequences of tokens are loaded dynamically (to avoid loading the whole dataset in memory at once), so we start with a Stream (in Scala's sense: the data is not all in memory but is loaded when access is needed) of sequences of tokens: `seq_tokens`.
- From each of these `seq_tokens` we extract skip-grams with a Python function that returns a list of tuples `(token, context)` (a minimal sketch follows this list).
- Select the column of tokens as features and the column of contexts as labels.
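To make the second step concrete, here is a minimal sketch of the kind of skip-gram helper I have in mind (the `window` parameter and the exact name are just illustrative):

```python
def skip_gram(seq_tokens, window=2):
    """Build (token, context) pairs from one sequence of tokens.

    For each position, every token at most `window` positions away
    (excluding the token itself) is taken as a context.
    """
    pairs = []
    for i, token in enumerate(seq_tokens):
        for j in range(max(0, i - window), min(len(seq_tokens), i + window + 1)):
            if j != i:
                pairs.append((token, seq_tokens[j]))
    return pairs
```

For example, `skip_gram(["a", "b", "c"], window=1)` gives `[("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]`.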
In pseudo-code, to make it clearer, it would look like the snippet below. We should take advantage of the framework's parallelism rather than loading the data ourselves, so I would first load in memory only the indices of the sequences, then load the sequences inside a `map` (since not all lines are processed at once, the data is loaded lazily and there's no OOM to fear), and apply a function to those sequences of tokens that creates a varying number of skip-grams, which then need to be flattened. In the end, I would formally end up with data of shape (#lines = number of skip-grams generated, #columns = 2).
```
data = range(1:N)
  .map(i => load(i): Seq[String])  // load: Int -> Seq[String], dynamically loads a sequence of tokens (sequences have varying lengths)
  .flat_map(s => skip_gram(s))     // skip_gram: Seq[String] -> Seq[(String, String)], with varying output length

features = data[0] // features
labels   = data[1] // labels
```

I've naively tried to do this with the Dataset API, but I'm stuck. I can do something like:
```python
iterator = (
    tf.contrib.data.Dataset.range(N)
    .map(lambda i: tf.py_func(load_data, [i], [tf.int32, tf.int32]))  # (1)
    .flat_map(?)                                                      # (2)
    .make_one_shot_iterator()
)
```

(1) TensorFlow's not happy here because the loaded sequences have different lengths...
(2) I haven't managed to do the skip-gram part yet. I just want to call a Python function that computes a sequence (of variable size) of skip-grams and flatten it, so that if the return type is a matrix, each of its rows should become a new row of the output Dataset.
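For (2), what I imagine is something along the lines of the sketch below; `skip_gram_matrix` is a hypothetical helper (assuming token ids rather than strings, and an illustrative window size of 2), and I'm not sure that `from_tensor_slices` inside `flat_map` is the right way to express the flattening:

```python
import numpy as np
import tensorflow as tf

def skip_gram_matrix(seq, window=2):
    # Hypothetical helper: turns one int32 sequence into an int32 matrix
    # of shape (num_skip_grams, 2), one (token, context) pair per row.
    pairs = [(seq[i], seq[j])
             for i in range(len(seq))
             for j in range(max(0, i - window), min(len(seq), i + window + 1))
             if j != i]
    return np.asarray(pairs, dtype=np.int32).reshape(-1, 2)

dataset = (
    tf.contrib.data.Dataset.range(N)  # N and load_data as above
    # If load_data returns a single variable-length int32 vector,
    # Tout should be a single dtype, not a list of two.
    .map(lambda i: tf.py_func(load_data, [i], tf.int32))
    # flat_map expects a function returning a Dataset; slicing the
    # (num_skip_grams, 2) matrix along its first axis would flatten
    # the pairs into one row per skip-gram.
    .flat_map(lambda seq: tf.contrib.data.Dataset.from_tensor_slices(
        tf.py_func(skip_gram_matrix, [seq], tf.int32)))
)
iterator = dataset.make_one_shot_iterator()
```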
Thanks a lot if anyone has an idea, and don't hesitate to ask if I forgot to mention any useful details...