I am trying to use text data as input to a linear regression model. I convert the text to vectors with the Universal Sentence Encoder from TensorFlow Hub as a pretrained model, but this returns a tf.Tensor, and I am not able to split the data into training and test sets for a scikit-learn linear regression model. My target feature is continuous.

This gives me embeddings (i.e. a vector of shape (1, 512) for each text in the text column of my pandas DataFrame):

import tensorflow_hub as hub
model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(model_url)
embeddings = model(train['excerpt'])

This is how the data looks:

   id         excerpt                                            target
0  c12129c31  When the young people returned to the ballroom...  -0.340259
1  85aa80a4c  All through dinner time, Mrs. Fayre was somewh...  -0.315372
2  b69ac6792  As Roger had predicted, the snow departed as q...  -0.580118
3  dd1000b26  And outside before the palace a great garden w...  -1.054013
4  37c1b32fb  Once upon a time there were Three Bears who li...   0.247197

This is how the embeddings look:

tf.Tensor: shape=(2834, 512), dtype=float32, numpy=
array([[-0.06747025,  0.02054032, -0.01223458, ...,  0.03468879, -0.04216784,  0.01212691],
       [-0.01053216,  0.01346854,  0.01992477, ...,  0.03078162, -0.0226634 ,  0.04429556],
       [-0.10778417,  0.01735378,  0.00803178, ...,  0.00345916,  0.00552441, -0.02448413],
       ...,
       [ 0.0364146 ,  0.02996029, -0.06757646, ..., -0.00335971, -0.01381749, -0.08319554],
       [ 0.0042374 ,  0.02291174, -0.04473154, ..., -0.02009053, -0.00428826, -0.06476445],
       [-0.0141812 ,  0.03879716,  0.03304171, ...,  0.06709221, -0.05016331,  0.00868828]],
      dtype=float32)

Now I want to use these embeddings as input to a linear regression model (or any other regression model) in scikit-learn. However, I am not able to split the data with train_test_split(); it raises this error:

TypeError: Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices, got array([1434, 2653, 2620, ..., 749, 2114, 2389])

This is how I am splitting the data:

X_train, X_test, y_train, y_test = train_test_split(embeddings, train['target'], test_size=0.2, shuffle=True)
  • It seems like you are passing the wrong data type. Can you also show the data you are splitting? Commented Jul 18, 2021 at 15:21
  • Instead of passing the target as train["target"], have you tried train["target"].values? The problem may be there: train["target"] is a pandas Series, which carries an index in addition to the values. Commented Jul 18, 2021 at 15:42
  • Same issue even after using train['target'].values. It looks like the problem is the format of the embeddings. Commented Jul 18, 2021 at 16:11
  • Have you tried this: X_train, X_test, y_train, y_test = train_test_split(embeddings.numpy(), train['target'].to_numpy(), test_size=0.2, shuffle=True)? Commented Jul 18, 2021 at 17:10

1 Answer

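You are passing a tf.Tensor to train_test_split. scikit-learn selects the rows of each split by indexing its inputs with an array of shuffled row indices, and a tf.Tensor does not support that kind of indexing, hence the TypeError. Pass a NumPy array instead, like this: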

X_train, X_test, y_train, y_test = train_test_split(embeddings.numpy(), train['target'], test_size=0.2, shuffle=True)
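
For completeness, here is a minimal sketch of the whole split-and-fit step. To keep it runnable without downloading the encoder, it uses random stand-in data in place of the real embeddings and targets; the names to_numpy, fake_embeddings, and fake_targets are hypothetical and not from the question. With the real data you would pass embeddings and train['target'] instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def to_numpy(x):
    """Return a NumPy array for an eager tf.Tensor or any array-like input."""
    # Eager tf.Tensor objects expose .numpy(); other inputs go through np.asarray.
    return x.numpy() if hasattr(x, "numpy") else np.asarray(x)


# Stand-in data (assumption): 100 rows of 512-dimensional "embeddings"
# and one continuous target per row, mimicking the shapes in the question.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 512)).astype(np.float32)
fake_targets = rng.normal(size=100)

# Convert first, then split: train_test_split can index NumPy arrays.
X_train, X_test, y_train, y_test = train_test_split(
    to_numpy(fake_embeddings),
    to_numpy(fake_targets),
    test_size=0.2,
    shuffle=True,
    random_state=42,
)

# Fit a plain linear regression on the training rows and predict the rest.
reg = LinearRegression().fit(X_train, y_train)
preds = reg.predict(X_test)

print(X_train.shape, X_test.shape, preds.shape)
```

Since the stand-in data is random, the fitted model is meaningless here; the point is only that the split and fit run once the inputs are NumPy arrays.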