How to use TensorFlow embeddings in scikit-learn models?

Question

I am trying to use text data as input to a linear regression model. I convert the text to vectors with the Universal Sentence Encoder from TensorFlow Hub as a pretrained model, but this gives me tf.Tensor objects, and I am not able to split the data into training and test sets for a scikit-learn linear regression model. My target feature is continuous.

This gives me embeddings (i.e. a vector of shape (1, 512) for each text in the text column of my pandas DataFrame):

import tensorflow_hub as hub

model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(model_url)
embeddings = model(train['excerpt'])

This is how the data looks:

   id         excerpt                                             target
0  c12129c31  When the young people returned to the ballroom...  -0.340259
1  85aa80a4c  All through dinner time, Mrs. Fayre was somewh...  -0.315372
2  b69ac6792  As Roger had predicted, the snow departed as q...  -0.580118
3  dd1000b26  And outside before the palace a great garden w...  -1.054013
4  37c1b32fb  Once upon a time there were Three Bears who li...   0.247197

This is how the embeddings look:

tf.Tensor: shape=(2834, 512), dtype=float32, numpy=
array([[-0.06747025,  0.02054032, -0.01223458, ...,  0.03468879, -0.04216784,  0.01212691],
       [-0.01053216,  0.01346854,  0.01992477, ...,  0.03078162, -0.0226634 ,  0.04429556],
       [-0.10778417,  0.01735378,  0.00803178, ...,  0.00345916,  0.00552441, -0.02448413],
       ...,
       [ 0.0364146 ,  0.02996029, -0.06757646, ..., -0.00335971, -0.01381749, -0.08319554],
       [ 0.0042374 ,  0.02291174, -0.04473154, ..., -0.02009053, -0.00428826, -0.06476445],
       [-0.0141812 ,  0.03879716,  0.03304171, ...,  0.06709221, -0.05016331,  0.00868828]],
      dtype=float32)

Now I want to use these embeddings as input to a linear regression model, or any regression model, in scikit-learn. But I am not able to split the data using train_test_split(), which gives me this error:

TypeError: Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices, got array([1434, 2653, 2620, ..., 749, 2114, 2389])

This is how I am splitting the data:

X_train, X_test, y_train, y_test = train_test_split(embeddings, train['target'], test_size=0.2, shuffle=True)

It seems like you are passing the wrong data type. Can you also show the data that you are splitting?
– Hakan Akgün, Commented Jul 18, 2021 at 15:21
Instead of passing your output values as train["target"], have you tried passing train["target"].values? The problem seems to be in that part: train["target"] is a pandas Series, which carries the Series index in addition to the values.
– Hakan Akgün, Commented Jul 18, 2021 at 15:42
Same issue even after using train['target'].values. It looks like the issue is the format of the embeddings.
– martian_rover, Commented Jul 18, 2021 at 16:11
Have you tried this? X_train, X_test, y_train, y_test = train_test_split(embeddings.numpy(), train['target'].to_numpy(), test_size=0.2, shuffle=True)
– Loukik, Commented Jul 18, 2021 at 17:10

Abhishek Prajapat · Accepted Answer · 2021-07-18 18:22:14Z

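In train_test_split you are passing a tensor. train_test_split indexes its inputs with an array of shuffled integer positions, and a tf.Tensor does not accept that kind of index, hence the TypeError. Instead, you should pass a NumPy array, like this:

X_train, X_test, y_train, y_test = train_test_split(embeddings.numpy(), train['target'], test_size=0.2, shuffle=True)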
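For a fuller picture, here is a minimal sketch of the whole flow from converted embeddings to a fitted scikit-learn regressor. It is not taken from the question's notebook. The arrays are synthetic stand-ins (an assumption) for model(train['excerpt']) and train['target'], so it runs without TensorFlow or the dataset, and to_numpy is a hypothetical helper that calls .numpy() when the object has it (as an eager tf.Tensor does) and falls back to np.asarray otherwise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def to_numpy(x):
    """Hypothetical helper: use .numpy() when available (e.g. an eager
    tf.Tensor), otherwise fall back to np.asarray."""
    return x.numpy() if hasattr(x, "numpy") else np.asarray(x)


# Synthetic stand-ins (assumption). In the question these would be
# embeddings = model(train['excerpt']) and train['target'].
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512)).astype(np.float32)
target = rng.normal(size=200)

# Convert before splitting, so train_test_split can index with integer arrays.
X = to_numpy(embeddings)
y = to_numpy(target)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# Fit the regressor on the training split and score it on the held-out split.
reg = LinearRegression().fit(X_train, y_train)
r2 = reg.score(X_test, y_test)
print(X_train.shape, X_test.shape)
```

With the real data, the only change should be replacing the synthetic arrays with the encoder output and the target column. The random_state argument is optional and only makes the split reproducible.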
