
I want to optimize this filter function. It searches two lists for every entity, one of categories and one of tags, which is why it takes a long time to run.

    def get_percentage(l1, l2, sim_score):
        diff = intersection(l1, l2)
        size = len(l1)
        if size != 0:
            perc = (diff/size)
            if perc >= sim_score:
                return True
            else:
                return False

    def intersection(lst1, lst2):
        return len(list(set(lst1) & set(lst2)))

    def filter_entities(country, city, category, entities, entityId):
        valid_entities = []
        tags = get_tags(entities, entityId)
        for index, i in entities.iterrows():
            if i["country"] == country and i["city"] == city:
                for j in i.categories:
                    if j == category:
                        if(get_percentage(i["tags"], tags, 0.80)):
                            valid_entities.append(i.entity_id)
        return valid_entities

2 Answers


You have a couple of unnecessary for loops and if checks in there that you can remove, and you should definitely take advantage of df.loc for selecting elements from your dataframe (assuming entities is a Pandas dataframe):

    def get_percentage(l1, l2, sim_score):
        if len(l1) == 0:
            return False  # shortcut this default case
        else:
            diff = intersection(l1, l2)
            perc = (diff / len(l1))
            return perc >= sim_score  # rather than handling each case separately

    def intersection(lst1, lst2):
        return len(set(lst1).intersection(lst2))  # almost twice as fast this way on my machine

    def filter_entities(country, city, category, entities, entityId):
        valid_entities = []
        tags = get_tags(entities, entityId)
        # Just grab the desired elements directly, no loops
        entity = entities.loc[(entities.country == country) & (entities.city == city)]
        if category in entity.categories and get_percentage(entity.tags, tags, 0.8):
            valid_entities.append(entity.entity_id)
        return valid_entities

It's difficult to say for sure that this will help because we can't really run the code you provided, but this should remove some inefficiencies and take advantage of some of the optimizations available in Pandas.

Depending on your data structure (i.e. if you have multiple matches in entity above), you may need to do something like this for the last three lines above:

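    # itertuples() yields one row at a time; iterating the DataFrame itself yields column names
    for ent in entity.itertuples():
        if category in ent.categories and get_percentage(ent.tags, tags, 0.8):
            valid_entities.append(ent.entity_id)
    return valid_entities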
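To make the multiple-match case concrete, here is a minimal, self-contained sketch. It assumes `entities` is a DataFrame whose `categories` and `tags` columns hold Python lists, which is what the question describes. Because `get_tags` is not shown in the question, this version takes the reference tags as an argument instead; `tag_overlap`, the `reference_tags` parameter, and the sample data are all made up for illustration.

```python
import pandas as pd


def tag_overlap(own_tags, reference_set):
    """Share of own_tags found in reference_set; 0.0 for an empty list."""
    if len(own_tags) == 0:
        return 0.0
    return len(set(own_tags) & reference_set) / len(own_tags)


def filter_entities(entities, country, city, category, reference_tags, sim_score=0.8):
    # Build the reference set once instead of once per row.
    reference_set = set(reference_tags)
    # Cheap scalar comparisons first: a boolean mask drops most rows up front.
    subset = entities[(entities["country"] == country) & (entities["city"] == city)]
    # The list-valued columns are only inspected for the rows that remain.
    return [
        entity_id
        for entity_id, cats, tags in zip(
            subset["entity_id"], subset["categories"], subset["tags"]
        )
        if category in cats and tag_overlap(tags, reference_set) >= sim_score
    ]


# Hypothetical sample data, only to show the expected column layout.
entities = pd.DataFrame(
    {
        "entity_id": [1, 2, 3, 4, 5],
        "country": ["US", "US", "US", "US", "US"],
        "city": ["NYC", "NYC", "NYC", "LA", "NYC"],
        "categories": [["food", "bar"], ["food"], ["museum"], ["food"], ["food"]],
        "tags": [
            ["a", "b", "c", "d", "e"],
            ["a", "b", "c", "d", "x"],
            ["a", "b", "c", "d", "e"],
            ["a", "b", "c", "d", "e"],
            ["a", "x", "y"],
        ],
    }
)

result = filter_entities(entities, "US", "NYC", "food", ["a", "b", "c", "d", "e"])
```

With this sample data, entities 1 and 2 pass all three checks; entity 3 fails on category, 4 on city, and 5 on tag overlap. The main saving over the original loop is that `iterrows()` is never called on rows that the country and city mask already excludes.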

3 Comments

Actually, categories is a list stored in a DataFrame column. I have to check it, and if any one of its values matches, I include that entity.
@HammadKhan roger, I updated it according to what I think I understand your data structure to be...
Thanks, by the way it's performing better now :)

A first step would be to look at Engineero's answer, which removes the unnecessary if checks and for loops. Beyond that, if you are working with a large amount of input data, which should be the case if this takes a noticeably long time, you may want to store the data in a NumPy array instead of lists, since NumPy handles large amounts of data much better, as seen here. NumPy even beats out Pandas DataFrames, as seen here. After a certain point you should ask yourself whether efficiency matters more than the convenience of Pandas; if it does, NumPy will be quicker for large amounts of data.
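One possible way to apply this to the tag comparison, as a rough sketch rather than a tested recipe for your data: encode each entity's tags as a row of booleans over a fixed tag vocabulary, then compute every overlap ratio in a single array operation. The names `encode_tags` and `similar_rows`, the vocabulary, and the sample tags are all assumptions for illustration. Note that this counts distinct tags per entity, so it differs from the original `get_percentage` when a tag list contains duplicates.

```python
import numpy as np


def encode_tags(tag_lists, vocabulary):
    """Multi-hot encode: cell (i, j) is True if entity i carries vocabulary[j]."""
    column = {tag: j for j, tag in enumerate(vocabulary)}
    matrix = np.zeros((len(tag_lists), len(vocabulary)), dtype=bool)
    for i, tags in enumerate(tag_lists):
        for tag in tags:
            matrix[i, column[tag]] = True
    return matrix


def similar_rows(matrix, reference_row, sim_score=0.8):
    """Boolean mask of rows sharing at least sim_score of their tags with reference_row."""
    shared = (matrix & reference_row).sum(axis=1)  # tags in common, per row
    sizes = matrix.sum(axis=1)                     # distinct tags, per row
    # Rows without tags keep a ratio of 0.0 instead of dividing by zero.
    ratio = np.divide(shared, sizes, out=np.zeros(len(sizes)), where=sizes > 0)
    return ratio >= sim_score


# Hypothetical sample data.
vocabulary = ["a", "b", "c", "d", "e", "x", "y"]
tag_lists = [
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c", "d", "x"],
    ["a", "x", "y"],
    [],
]
matrix = encode_tags(tag_lists, vocabulary)
mask = similar_rows(matrix, matrix[0])
```

The encoding step still loops in Python, so this only pays off if the matrix is built once and reused across many queries.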
