edited body; edited title
petezurich

Clustering data that has both text and numeric values

I have a dataset that combines text and numeric values, and I want to cluster it. Below is an example of the dataset. I am using pandas and scikit-learn.

    all_text,amount
    Check Sample row 1,-1154
    Check Sample row 2,-1154

Each row above has one text value and one numeric value. I took the text column and transformed it using TF-IDF.

    from sklearn.cluster import DBSCAN
    from sklearn.feature_extraction.text import TfidfVectorizer

    vect = TfidfVectorizer(ngram_range=(1,1), stop_words='english', max_features=1000)
    td_matrix = vect.fit_transform(data['all_text'])
    data['all_text'] = list(td_matrix)
    # Calculating the distance measure derived from cosine similarity
    dbscan = DBSCAN(eps=0.5, min_samples=10)
    dbscan.fit(data)

When I try to create the new dataframe with the td_matrix and fit the data, it throws the following error:

    array = array.astype(np.float64)
    ValueError: setting an array element with a sequence.

How should I combine the TF-IDF matrix with the numeric column so that I can run my clustering algorithm?
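
For illustration, one possible way to build a single feature matrix is to keep the TF-IDF output sparse and stack the numeric column next to it with `scipy.sparse.hstack`, instead of storing the rows back into the dataframe. This is only a minimal sketch under assumptions that are not in the question: the toy rows, the use of `StandardScaler` on `amount`, and `min_samples=2` (chosen so the tiny example can form a cluster) are all hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data in the same shape as the CSV sample in the question.
data = pd.DataFrame({
    "all_text": [
        "Check Sample row 1",
        "Check Sample row 2",
        "Wire transfer fee",
        "Wire transfer refund",
    ],
    "amount": [-1154, -1154, 25, 30],
})

# TF-IDF on the text column; the result is a sparse matrix (n_rows, n_terms).
vect = TfidfVectorizer(ngram_range=(1, 1), stop_words="english", max_features=1000)
td_matrix = vect.fit_transform(data["all_text"])

# Assumption: scale the numeric column so its raw magnitude does not
# swamp the TF-IDF values, which lie between 0 and 1.
amount = StandardScaler().fit_transform(data[["amount"]])  # dense, (n_rows, 1)

# Stack the sparse text features and the numeric column side by side.
features = hstack([td_matrix, csr_matrix(amount)]).tocsr()

# Fit on the combined matrix rather than on the dataframe.
# min_samples=2 is only for this four-row toy example.
dbscan = DBSCAN(eps=0.5, min_samples=2)
labels = dbscan.fit_predict(features)
```

Note that `DBSCAN` uses Euclidean distance by default, so this sketch does not compute a cosine-based distance. How much weight the numeric column should get relative to the text features is a modelling choice, and the scaling shown here is just one option.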

Anshul Tripathi