$\begingroup$

This is not the first time I have wondered about this: if I have a categorical variable (cities, in this case) that I later want to convert to dummy variables, should I delete the rows whose value occurs fewer than N times?

For example, New York occurs 400+ times, but some cities appear only once or twice.

What should I do with values that appear only once or twice?

print(df["cities"].value_counts())

Output:

city1     424
city2     107
city3      35
city4      33
city5      28
city6      24
city7      15
city8       7
city9       4
city10      3
city11      2
city12      1
city13      1
city14      1
city15      1
city16      1
city17      1
$\endgroup$

1 Answer

$\begingroup$

No general rule applies to all cases, and your post is missing too much context to say anything conclusive.

Having said that, I think a good approach is to give each frequently occurring city its own category and group all the others under an 'other' category.
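As a rough illustration of that grouping, here is a minimal pandas sketch. The sample data, the `MIN_COUNT` threshold of 3, and the `cities_grouped` column name are all assumptions made for the example, not something taken from your data.

```python
import pandas as pd

# Hypothetical data: two frequent cities and two that appear only once.
df = pd.DataFrame({"cities": ["city1"] * 5 + ["city2"] * 3 + ["city3", "city4"]})

MIN_COUNT = 3  # assumed threshold; choose it based on your own data

# Cities that occur at least MIN_COUNT times keep their own category.
counts = df["cities"].value_counts()
frequent = counts[counts >= MIN_COUNT].index

# Keep frequent cities unchanged and relabel every other city as "other".
df["cities_grouped"] = df["cities"].where(df["cities"].isin(frequent), "other")

# One dummy column per remaining category, including "other".
dummies = pd.get_dummies(df["cities_grouped"], prefix="city")
print(dummies.sum())
```

No rows are dropped this way: the rare cities still contribute observations, they just share a single dummy column.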

Going further, you could have multiple 'other' groups, formed by geography or by any other criterion that makes sense in your context.
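One possible way to build several 'other' groups is sketched below. The `region_of` lookup is entirely hypothetical and would have to come from your own domain knowledge; the threshold and column names are likewise assumptions.

```python
import pandas as pd

# Hypothetical data: one frequent city and four rare ones.
df = pd.DataFrame({"cities": ["city1"] * 4 + ["city9", "city10", "city11", "city12"]})

# Hypothetical city-to-region lookup; cities missing from it fall back to "unknown".
region_of = {"city9": "north", "city10": "north", "city11": "south"}

MIN_COUNT = 3  # assumed threshold

# Per-row frequency of the row's city, then a mask marking the rare ones.
row_counts = df["cities"].map(df["cities"].value_counts())
is_rare = row_counts < MIN_COUNT

# Rare cities become "other_<region>"; frequent cities keep their own name.
fallback = "other_" + df["cities"].map(region_of).fillna("unknown")
df["cities_grouped"] = df["cities"].where(~is_rare, fallback)

print(df["cities_grouped"].value_counts())
```

The resulting column can then be passed to `pd.get_dummies` in the usual way.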

Hope this helps.

$\endgroup$
