Clustering data that has both text and numeric values

I have a dataset that combines text and numeric values, and I want to cluster it. An example of the dataset is below. I am using pandas and scikit-learn.

    all_text,amount
    Check Sample row 1,-1154
    Check Sample row 2,-1154

Each row has one text value and one numeric value. I transformed the text column using TF-IDF.

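    from sklearn.cluster import DBSCAN
    from sklearn.feature_extraction.text import TfidfVectorizer

    vect = TfidfVectorizer(ngram_range=(1, 1), stop_words='english', max_features=1000)
    td_matrix = vect.fit_transform(data['all_text'])
    # Each cell of 'all_text' now holds a sparse row object, not a number
    data['all_text'] = list(td_matrix)

    # Calculating the distance measure derived from cosine similarity
    dbscan = DBSCAN(eps=0.5, min_samples=10)
    dbscan.fit(data)  # this call raises the ValueError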

When I put td_matrix back into the DataFrame and fit DBSCAN on it, I get the following error:

    array = array.astype(np.float64)
    ValueError: setting an array element with a sequence.

How should I combine the TF-IDF matrix with the numeric column so that I can run my clustering algorithm?

edited Sep 26, 2018 at 5:16 by petezurich
asked Sep 25, 2018 at 23:18 by Anshul Tripathi

You might need to make sure your TF-IDF matrix is a pandas DataFrame; then you can easily concatenate its columns with the numeric DataFrame.
– Po Stevanus Andrianta, commented Sep 26, 2018 at 8:19
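
A minimal sketch of the DataFrame-concatenation idea in the comment above, not a confirmed solution from the thread. The third row, the `tfidf_` column prefix, the use of `StandardScaler` on `amount`, and `min_samples=2` are all assumptions made so the toy example runs; `toarray()` densifies the matrix, which may be too memory-hungry for large vocabularies.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Two rows from the question plus one hypothetical row for contrast.
data = pd.DataFrame({
    "all_text": ["Check Sample row 1", "Check Sample row 2", "Wire transfer fee"],
    "amount": [-1154, -1154, 25],
})

vect = TfidfVectorizer(ngram_range=(1, 1), stop_words="english", max_features=1000)
td_matrix = vect.fit_transform(data["all_text"])

# One numeric column per TF-IDF term, ordered by the vectorizer's column index.
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
tfidf_df = pd.DataFrame(
    td_matrix.toarray(),
    columns=[f"tfidf_{t}" for t in terms],
    index=data.index,
)

# Assumption: scale the amount so it does not dominate the TF-IDF values.
amount_df = pd.DataFrame(
    StandardScaler().fit_transform(data[["amount"]]),
    columns=["amount"],
    index=data.index,
)

# Every cell is now a plain float, so DBSCAN can consume the frame.
features = pd.concat([tfidf_df, amount_df], axis=1)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
```

Here `features` has one row per input row and only numeric columns, which avoids storing a sparse row object inside a single DataFrame cell.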

If you just need to flag that a certain amount is present on a given row, you might as well include it in the string as text. In ML terms, one-hot encode the "amount" column.
– Fabio Picchi, commented Sep 26, 2018 at 12:52
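
One way the one-hot suggestion above could look, sketched under assumptions: the third row is hypothetical, `scipy.sparse.hstack` is my choice for joining the two sparse blocks, and `min_samples=2` is picked only so the toy data forms a cluster. Note that one-hot encoding treats each distinct amount as a category, so the magnitude of the amount is lost.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Two rows from the question plus one hypothetical row for contrast.
data = pd.DataFrame({
    "all_text": ["Check Sample row 1", "Check Sample row 2", "Wire transfer fee"],
    "amount": [-1154, -1154, 25],
})

td_matrix = TfidfVectorizer(stop_words="english").fit_transform(data["all_text"])

# One indicator column per distinct amount (sparse output by default).
amount_onehot = OneHotEncoder().fit_transform(data[["amount"]])

# Stack the two sparse blocks side by side; DBSCAN accepts CSR input.
features = hstack([td_matrix, amount_onehot]).tocsr()
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
```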

Use the new ColumnTransformer in scikit-learn. It transforms and concatenates DataFrame columns of different data types.
– PeterB, commented Oct 31, 2018 at 1:05
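
A hedged sketch of the `ColumnTransformer` suggestion above, assuming scikit-learn 0.20 or later. The third row, the choice of `StandardScaler` for `amount`, and `min_samples=2` are assumptions for illustration, not part of the original question.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Two rows from the question plus one hypothetical row for contrast.
data = pd.DataFrame({
    "all_text": ["Check Sample row 1", "Check Sample row 2", "Wire transfer fee"],
    "amount": [-1154, -1154, 25],
})

preprocess = ColumnTransformer([
    # A bare column name hands the vectorizer a 1-D Series of strings.
    ("text",
     TfidfVectorizer(ngram_range=(1, 1), stop_words="english", max_features=1000),
     "all_text"),
    # A list of names hands the scaler a 2-D frame, as it expects.
    ("num", StandardScaler(), ["amount"]),
])

# The result is a single numeric matrix: TF-IDF columns followed by the scaled amount.
features = preprocess.fit_transform(data)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
```

The distinction between `"all_text"` and `["amount"]` matters: `TfidfVectorizer` expects an iterable of strings, while `StandardScaler` expects a two-dimensional input.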

@AnshulTripathi, I know this post is quite old, but I am now in the same situation you describe in the question. Do you remember how you handled the column of arrays so that you could fit the DataFrame with DBSCAN?
– Jazz, commented Nov 5, 2020 at 19:34
