I have a dataset that combines text and numeric values. I want to cluster my data; an example of the dataset is below. I am using pandas and scikit-learn.

    all_text,amount
    Check Sample row 1,-1154
    Check Sample row 2,-1154

Each row above has one text value and one numeric value. I took the text column and transformed it using TF-IDF.

    vect = TfidfVectorizer(ngram_range=(1, 1), stop_words='english', max_features=1000)
    td_matrix = vect.fit_transform(data['all_text'])
    data['all_text'] = list(td_matrix)
    # Calculating the distance measure derived from cosine similarity
    dbscan = DBSCAN(eps=0.5, min_samples=10)
    dbscan.fit(data)

When I try to create the new dataframe with td_matrix and fit DBSCAN on it, I get the following error:

    array = array.astype(np.float64)
    ValueError: setting an array element with a sequence.

How should I combine the TF-IDF matrix with the numeric column so that I can run my clustering algorithm?

  • You might need to make sure your TF-IDF matrix is a pandas DataFrame; then you can easily concat its columns with the numeric dataframe. Commented Sep 26, 2018 at 8:19
  • If you just need to flag that a certain amount is present on a given row, you might as well include it in the string as text. In ML terms, one-hot encode the "amount" column. Commented Sep 26, 2018 at 12:52
  • Use the new ColumnTransformer in scikit-learn. It will transform and concatenate dataframe columns of various data types. Commented Oct 31, 2018 at 1:05
  • @AnshulTripathi- I know this post is quite old now, but I am currently in the same situation you describe in the question above. Do you remember how you handled the column of arrays so that you could fit the dataframe to DBSCAN? Commented Nov 5, 2020 at 19:34
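
The error most likely comes from `data['all_text'] = list(td_matrix)`: each cell then holds a whole sparse row object, so the column has object dtype and cannot be cast to `float64`. Below is a minimal sketch of the ColumnTransformer suggestion from the comments, assuming scikit-learn 0.20 or later. The toy rows beyond the two shown in the question, the `StandardScaler` step, and the `min_samples=2` value are assumptions made for illustration, not part of the original post.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data in the question's layout; the last two rows are made up.
data = pd.DataFrame({
    "all_text": ["Check Sample row 1", "Check Sample row 2",
                 "Wire transfer fee", "Wire transfer refund"],
    "amount": [-1154, -1154, 25, 30],
})

# A string selector passes a 1-D column (what TfidfVectorizer expects);
# a list selector passes a 2-D frame (what StandardScaler expects).
features = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 1), stop_words="english",
                             max_features=1000), "all_text"),
    # Assumption: scale the amount so it does not dominate the TF-IDF values.
    ("amount", StandardScaler(), ["amount"]),
])

# One purely numeric matrix: TF-IDF columns followed by the scaled amount.
X = features.fit_transform(data)

# eps is taken from the question; min_samples=2 is a placeholder for the toy data.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
```

Here `X` replaces the dataframe as the input to DBSCAN, so no cell ever contains a sequence. Note that DBSCAN uses Euclidean distance by default; if a cosine-based distance is wanted for the text part, that would have to be set through its `metric` parameter, and the scaling of `amount` should be revisited accordingly.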
