Frequency of occurrence - dummy variables

Question

This is not the first time I have wondered about this: if I have a categorical variable (cities, in this case) that I later want to convert to dummy variables, should I delete rows whose value occurs fewer than N times?

For example, New York occurs 400+ times, but some cities appear only once or twice.

What should I do with values that have appeared only once or twice?

print(df[cities].value_counts())

Output:

city1     424
city2     107
city3      35
city4      33
city5      28
city6      24
city7      15
city8       7
city9       4
city10      3
city11      2
city12      1
city13      1
city14      1
city15      1
city16      1
city17      1

89f3a1c · Accepted Answer · 2019-10-29 23:50:40Z

There's no general rule that applies to all cases, and your post is missing too much context to say anything conclusive.

Having said that, I think a good approach is to keep each frequently occurring city as its own category and group all the remaining cities under a single 'other' category. That way you don't have to delete any rows.
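As a rough sketch of that idea in pandas: the column name `city`, the toy data, and the threshold `MIN_COUNT` are all assumptions made for illustration, so adjust them to your own data.

```python
import pandas as pd

# Hypothetical toy data; the real frame and column name will differ.
df = pd.DataFrame({"city": ["city1"] * 5 + ["city2"] * 3 + ["city3", "city4"]})

MIN_COUNT = 3  # assumed threshold; there is no universally correct value

# Find the cities that occur fewer than MIN_COUNT times.
counts = df["city"].value_counts()
rare = counts[counts < MIN_COUNT].index

# Keep frequent cities as they are and relabel the rare ones as 'other'.
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")

# One dummy column per frequent city, plus a single 'city_other' column.
dummies = pd.get_dummies(df["city_grouped"], prefix="city")
print(dummies.columns.tolist())
```

No rows are dropped here; the rare cities simply share one dummy column.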

Going further, you could have several 'other' groups, formed by whatever criteria make sense in your context, for example geography.
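A possible sketch of the multiple-group variant, assuming you can supply a lookup from city to region yourself. The `region_of` mapping, the region names, and the `MIN_COUNT` threshold below are hypothetical placeholders, not something taken from your data.

```python
import pandas as pd

# Hypothetical toy data: one frequent city and three rare ones.
df = pd.DataFrame({"city": ["city1"] * 4 + ["city12", "city13", "city14"]})

# Hypothetical lookup; in practice this comes from domain knowledge.
region_of = {"city12": "north", "city13": "north", "city14": "south"}

MIN_COUNT = 3  # assumed threshold

# Attach each row's city frequency to that row.
row_counts = df["city"].map(df["city"].value_counts())

# Label used for rare cities: 'other_<region>', or 'other_unknown' if unmapped.
fallback = "other_" + df["city"].map(region_of).fillna("unknown")

# Frequent cities keep their own name; rare ones fall back to the region label.
df["city_grouped"] = df["city"].where(row_counts >= MIN_COUNT, fallback)

dummies = pd.get_dummies(df["city_grouped"], prefix="city")
```

Whether regions, population size, or some other criterion is the right grouping depends on what you are modelling.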

Hope this helps.
