Cluster the data and give me similar data

Question

Assume that you want to look up a car's price based on the name of the car and its production year.

Name, production year, price
Volvo, 2021, 50000
Volvo, 2012, 16000
Toyota, 2022, 30000

Then if the input data is:

Volvo, 2018

It should return the most similar records, for example:

Volvo, 2021, 50000
Volvo, 2012, 16000

So, I can manually choose the right price.

So what is your question? – user2974951, Commented Oct 12, 2022 at 10:17
df.query("Name == 'Volvo'")? – DarknessPlusPlus, Commented Oct 12, 2022 at 10:40

DarknessPlusPlus · Accepted Answer · 2022-10-12 12:21:09Z

If you want to choose manually, you can simply filter the table with df.query. You can play around with the production_year variable as well, but in your post it seems rather irrelevant.

    import pandas as pd

    data = pd.DataFrame({
        "name": ["Volvo", "Volvo", "Toyota"],
        "production_year": [2021, 2012, 2022],
        "price": [50000, 16000, 30000],
    })

    # Keep only the rows whose name is 'Volvo'
    q = data.query("name == 'Volvo'")

Returns:

Volvo, 2021, 50000
Volvo, 2012, 16000
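If the year should influence the result rather than be ignored, one option is to rank the matching rows by how far their production year is from the requested one. This is only a minimal sketch using the same column names as the query above; `closest_by_year` and the `year_gap` column are hypothetical names introduced for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Volvo", "Volvo", "Toyota"],
    "production_year": [2021, 2012, 2022],
    "price": [50000, 16000, 30000],
})


def closest_by_year(df, name, year, k=2):
    """Return up to k rows with the given name, nearest production year first.

    Hypothetical helper: assumes the columns 'name' and 'production_year'.
    """
    matches = df[df["name"] == name]
    # Absolute distance in years between each candidate and the requested year
    gap = (matches["production_year"] - year).abs()
    return matches.assign(year_gap=gap).sort_values("year_gap").head(k)


result = closest_by_year(data, "Volvo", 2018)
print(result)
```

For the input Volvo, 2018 this lists the 2021 car before the 2012 one, so the closest candidate price appears first.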

I don't know whether your data is in text documents or in a table, but it seems like the latter. In that case, you can jump straight to building something like a recommender system; see https://www.kaggle.com/code/alexanderstefanov/goodreads-search-engine/notebook
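For a table with many columns, one possible way to get "the most similar rows" without writing a query per feature is a nearest-neighbour lookup: encode every feature as a number, put the features on a comparable scale, and rank rows by their distance to the input. The sketch below is not taken from the linked notebook. It assumes one-hot encoding for text columns, min-max scaling, and Euclidean distance, all of which are choices you may want to change, and `most_similar` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "name": ["Volvo", "Volvo", "Toyota"],
    "production_year": [2021, 2012, 2022],
    "price": [50000, 16000, 30000],
})


def most_similar(df, query, feature_cols, k=2):
    """Return the k rows of df closest to `query` on `feature_cols`.

    Hypothetical helper: one-hot encodes text columns, min-max scales
    every column, then ranks rows by Euclidean distance to the query.
    """
    # Append the query as a last row so it gets exactly the same encoding
    query_row = pd.DataFrame([query])[feature_cols]
    combined = pd.concat([df[feature_cols], query_row], ignore_index=True)
    encoded = pd.get_dummies(combined, dtype=float).astype(float)

    # Scale each column to [0, 1]; constant columns keep a span of 1
    span = (encoded.max() - encoded.min()).replace(0, 1.0)
    scaled = (encoded - encoded.min()) / span

    rows = scaled.iloc[:-1].to_numpy()
    target = scaled.iloc[-1].to_numpy()
    distance = np.sqrt(((rows - target) ** 2).sum(axis=1))

    # Smallest distance first
    order = np.argsort(distance)[:k]
    return df.iloc[order].assign(distance=distance[order])


similar = most_similar(
    data,
    {"name": "Volvo", "production_year": 2018},
    feature_cols=["name", "production_year"],
)
print(similar)
```

The price column is deliberately left out of `feature_cols`, since it is the value being looked up. With 40+ features the same function applies unchanged, although how each feature is scaled or weighted then largely decides what "similar" means.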

If your data is in text documents instead, I would use sentence transformers for semantic search.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

    query_embedding = model.encode('Volvo made in 2012 with a price of 10000')
    passage_embedding = model.encode([
        'Volvo made in 2022 with a price of 500000',
        'Volvo made in 2010 with a price of 12000',
        'Volvo made in 2013 with a price of 13337',
        'London is known for manufacturing Volvos',
    ])

    print("Similarity:", util.dot_score(query_embedding, passage_embedding))

Returns the following similarity scores, one per passage in the order given: 0.9150, 0.9557, 0.9385, 0.6478

Thanks for the reply, and my apologies for the ambiguity. The similar results should come out of the clustering algorithm. This example is very simple and can be handled manually, but once you have 40+ features you cannot query by hand, and you probably do not know how to measure similarity. – user141467, Commented Oct 12, 2022 at 11:21

@user141467 I am not sure that clustering is what you need. You could approach it that way, but it seems like you're in the field of neural search / sentence similarity. – DarknessPlusPlus, Commented Oct 12, 2022 at 12:16
