I have a dataset that combines text and numeric values, and I want to cluster it. I am using pandas and scikit-learn. Below is an example of the dataset.
    all_text,amount
    Check Sample row 1,-1154
    Check Sample row 2,-1154

Each row has one text value and one numeric value. I took the text column and transformed it using TF-IDF.

    vect = TfidfVectorizer(ngram_range=(1, 1), stop_words='english', max_features=1000)
    td_matrix = vect.fit_transform(data['all_text'])
    data['all_text'] = list(td_matrix)

    # Calculating the distance measure derived from cosine similarity
    dbscan = DBSCAN(eps=0.5, min_samples=10)
    dbscan.fit(data)

When I put td_matrix back into the dataframe and fit DBSCAN on it, I get the following error:

    array = array.astype(np.float64)
    ValueError: setting an array element with a sequence.

How should I combine the TF-IDF matrix with the numeric column so that I can run my clustering algorithm?
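For context, the error is consistent with `data['all_text']` holding one sparse row object per cell, which cannot be cast to a flat float array. One common approach, sketched below, is to keep the TF-IDF output as a sparse matrix and append the numeric column with `scipy.sparse.hstack` instead of writing it back into the dataframe. The toy rows, the use of `StandardScaler` on `amount`, and `min_samples=2` are assumptions made only so the sketch runs on four rows; they are not taken from the original data, and DBSCAN here uses its default Euclidean metric.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data shaped like the sample above; the last two rows are made up.
data = pd.DataFrame({
    "all_text": [
        "Check Sample row 1",
        "Check Sample row 2",
        "Wire transfer fee",
        "Wire transfer refund",
    ],
    "amount": [-1154, -1154, 25, 30],
})

# TF-IDF returns a sparse matrix of shape (n_rows, n_terms).
vect = TfidfVectorizer(ngram_range=(1, 1), stop_words="english", max_features=1000)
td_matrix = vect.fit_transform(data["all_text"])

# Assumption: scale the numeric column so its raw magnitude does not
# dominate the unit-length TF-IDF rows in the distance computation.
amount_scaled = StandardScaler().fit_transform(data[["amount"]])

# Append the numeric column to the sparse text features, one row per record.
features = hstack([td_matrix, csr_matrix(amount_scaled)], format="csr")

# min_samples is lowered only because the toy data has four rows.
dbscan = DBSCAN(eps=0.5, min_samples=2)
labels = dbscan.fit_predict(features)
```

How much weight the numeric column should carry relative to the text features depends on the data, so the scaling step and `eps` would likely need tuning.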