
I'm having difficulty working with the tf.contrib.data.Dataset API and wondered if some of you could help. I want to transform the entire skip-gram pre-processing of word2vec into this paradigm to play with the API a little; it involves the following operations:

  1. Sequences of tokens are loaded dynamically (to avoid loading the whole dataset in memory at once), so we start with a stream (in the Scala sense: the data is not all in memory, but loaded on access) of sequences of tokens: seq_tokens.
  2. From each of these seq_tokens we extract skip-grams with a Python function that returns a list of (token, context) tuples.
  3. Select the column of tokens as features and the column of contexts as labels.

In pseudo-code, to make it clearer, it would look like the snippet below. We should take advantage of the framework's parallelism rather than loading the data ourselves, so I would first load into memory only the indices of the sequences, then load the sequences themselves (inside a map, so that even if not all lines are processed synchronously, the data is loaded lazily and there's no OOM to fear), and apply a function to those sequences of tokens that creates a varying number of skip-grams, which then needs to be flattened. In the end, I would formally end up with data of shape (#rows = number of skip-grams generated, #columns = 2).

data = range(1, N)
  .map(i => load(i))           // load: Int -> Seq[String] dynamically loads a sequence of tokens (sequences have varying lengths)
  .flat_map(s => skip_gram(s)) // skip_gram: Seq[String] -> Seq[(String, String)], with varying output length
features = data[0]  // features
labels   = data[1]  // labels

I've tried naively to do so with Dataset's API but I'm stuck, I can do something like:

iterator = (
    tf.contrib.data.Dataset.range(N)
    .map(lambda i: tf.py_func(load_data, [i], [tf.int32, tf.int32]))  # (1)
    .flat_map(?)  # (2)
    .make_one_shot_iterator()
)

(1) TensorFlow's not happy here because the loaded sequences have different lengths...

(2) I haven't managed to do the skip-gram part yet... I just want to call a Python function that computes a variable-size sequence of skip-grams and flatten it, so that if the return type is a matrix, each row is understood as a new row of the output Dataset.
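To make (2) concrete, here is a sketch of the kind of thing I imagine should work, assuming tf.py_func can return variable-length arrays and flat_map accepts a function returning a nested Dataset (load_data and skip_gram here are my own placeholder functions, so I'm not sure this is idiomatic):

import numpy as np
import tensorflow as tf

def load_and_skip_gram(i):
    # Hypothetical: load the i-th sequence and build all (token, context) pairs.
    token_ids = load_data(i)                 # variable-length sequence of ids
    features, labels = skip_gram(token_ids)  # one entry per skip-gram
    return np.asarray(features, np.int64), np.asarray(labels, np.int64)

dataset = (tf.contrib.data.Dataset.range(N)
           .map(lambda i: tuple(tf.py_func(load_and_skip_gram, [i],
                                           [tf.int64, tf.int64])))
           # each element is now a pair of 1-D tensors; flatten each pair
           # into many (feature, label) rows
           .flat_map(lambda f, l:
                     tf.contrib.data.Dataset.from_tensor_slices((f, l))))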

Thanks a lot if anyone has any idea, and don't hesitate to ask if I forgot to mention useful details...


1 Answer


I'm just implementing the same thing; here's how I solved it:

dataset = tf.data.TextLineDataset(filename)
if mode == ModeKeys.TRAIN:
    dataset = dataset.shuffle(buffer_size=batch_size * 100)
dataset = dataset.flat_map(lambda line: string_to_skip_gram(line))
dataset = dataset.batch(batch_size)
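Consuming the dataset isn't shown above; in the TF 1.x API that would typically look something like this:

iterator = dataset.make_one_shot_iterator()
features, labels = iterator.get_next()  # each is a batch of int64 ids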

In my dataset, I treat every line as standalone, so I'm not worrying about contexts that span multiple lines.

I therefore flat-map each line through a function string_to_skip_gram that returns a Dataset whose length depends on the number of tokens in the line.

string_to_skip_gram turns the line into a series of tokens, represented by IDs (via the method tokenize_str), using tf.py_func:

def string_to_skip_gram(line):
    def handle_line(line):
        token_ids = tokenize_str(line)
        (features, labels) = skip_gram(token_ids)
        return np.array([features, labels], dtype=np.int64)

    res = tf.py_func(handle_line, [line], tf.int64)
    features = res[0]
    labels = res[1]
    return tf.data.Dataset.from_tensor_slices((features, labels))
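tokenize_str itself isn't shown here; a minimal stand-in, assuming a pre-built vocab dict mapping each token string to an integer ID (the vocabulary below is purely illustrative), might look like this:

vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}  # hypothetical vocabulary

def tokenize_str(line):
    # line arrives from tf.py_func as a bytes object: decode, split on
    # whitespace, and map each in-vocabulary token to its integer id
    tokens = line.decode("utf-8").split()
    return [vocab[t] for t in tokens if t in vocab]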

Finally, skip_gram returns a list of all possible context words and target words:

def skip_gram(token_ids):
    skip_window = 1
    features = []
    labels = []
    context_range = [i for i in range(-skip_window, skip_window + 1) if i != 0]
    for word_index in range(skip_window, len(token_ids) - skip_window):
        for context_word_offset in context_range:
            features.append(token_ids[word_index])
            labels.append(token_ids[word_index + context_word_offset])
    return features, labels

Note that I'm not sampling the context words here; just using all of them.
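As a quick sanity check (the token IDs here are made up), with skip_window = 1 every interior token is paired once with its left neighbour and once with its right neighbour:

features, labels = skip_gram([10, 20, 30, 40])
# features == [20, 20, 30, 30]
# labels   == [10, 30, 20, 40]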


1 Comment

Works like a charm and is way more elegant than what I had done before, thanks for sharing.
