I am trying to use text data as the input to a linear regression model. I convert the text to vectors with the Universal Sentence Encoder from TensorFlow Hub as a pretrained model, but this gives me tf.Tensors, and now I am not able to split the data into training and testing sets for a scikit-learn linear regression model (my target feature is continuous).
This gives me embeddings, i.e. a vector of shape (1, 512) for each text in my pandas DataFrame's text column:
```python
import tensorflow_hub as hub

model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(model_url)
embeddings = model(train['excerpt'])
```

This is how the data looks:
```
          id                                            excerpt    target
0  c12129c31  When the young people returned to the ballroom... -0.340259
1  85aa80a4c  All through dinner time, Mrs. Fayre was somewh... -0.315372
2  b69ac6792  As Roger had predicted, the snow departed as q... -0.580118
3  dd1000b26  And outside before the palace a great garden w... -1.054013
4  37c1b32fb  Once upon a time there were Three Bears who li...  0.247197
```

This is how the embeddings look:
```
<tf.Tensor: shape=(2834, 512), dtype=float32, numpy=
array([[-0.06747025,  0.02054032, -0.01223458, ...,  0.03468879,
        -0.04216784,  0.01212691],
       [-0.01053216,  0.01346854,  0.01992477, ...,  0.03078162,
        -0.0226634 ,  0.04429556],
       [-0.10778417,  0.01735378,  0.00803178, ...,  0.00345916,
         0.00552441, -0.02448413],
       ...,
       [ 0.0364146 ,  0.02996029, -0.06757646, ..., -0.00335971,
        -0.01381749, -0.08319554],
       [ 0.0042374 ,  0.02291174, -0.04473154, ..., -0.02009053,
        -0.00428826, -0.06476445],
       [-0.0141812 ,  0.03879716,  0.03304171, ...,  0.06709221,
        -0.05016331,  0.00868828]], dtype=float32)>
```

Now I want to use these embeddings as input to a linear regression model (or any regression model) in scikit-learn, but I am not able to split the data using train_test_split(). It gives me this error:

```
TypeError: Only integers, slices (`:`), ellipsis (`...`), tf.newaxis (`None`) and scalar tf.int32/tf.int64 tensors are valid indices, got array([1434, 2653, 2620, ...,  749, 2114, 2389])
```
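As far as I can tell, the error arises because train_test_split shuffles by indexing each input with a NumPy array of row indices, and a tf.Tensor only accepts integers, slices, and scalar index tensors, not an array of indices. A small sketch of the difference (assuming TensorFlow is installed; the arrays here are toy data, not the real embeddings):

```python
import numpy as np
import tensorflow as tf

idx = np.array([2, 0, 1])  # the kind of index array train_test_split uses

arr = np.arange(12, dtype=np.float32).reshape(3, 4)
rows = arr[idx]  # fancy indexing works on a NumPy array

t = tf.constant(arr)
try:
    t[idx]  # a tf.Tensor rejects an array of indices
except TypeError as e:
    print('TypeError:', e)

# tf.gather is the TensorFlow equivalent, and t.numpy()[idx] also works
gathered = tf.gather(t, idx).numpy()
```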
This is how I am splitting the data:
```python
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, train['target'], test_size=0.2, shuffle=True
)
```
Would converting both inputs to NumPy arrays first work, like this?

```python
X_train, X_test, y_train, y_test = train_test_split(
    embeddings.numpy(), train['target'].to_numpy(), test_size=0.2, shuffle=True
)
```
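Converting to NumPy first should indeed be enough, since train_test_split only needs array-likes that support fancy indexing, and a plain float32 matrix is a valid input for any scikit-learn regressor. A minimal sketch of the downstream pipeline, with random arrays standing in for embeddings.numpy() and train['target'].to_numpy() (the shapes match the question, but the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real embeddings (2834 texts, 512 dims) and targets.
rng = np.random.default_rng(0)
X = rng.standard_normal((2834, 512)).astype(np.float32)
y = rng.standard_normal(2834).astype(np.float32)

# With plain NumPy arrays, train_test_split's index-array shuffling works fine.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
preds = reg.predict(X_test)
print(X_train.shape, X_test.shape)  # (2267, 512) (567, 512)
```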