
I have a Pandas DataFrame in the following format

    import pandas as pd

    df = pd.DataFrame(
        [[1, 2, 4, 5, 7, 8, 1],
         [1, 3, 1, 3, 4, 6, 1],
         [1, 4, 1, 2, 6, 5, 0],
         [1, 5, 1, 3, 3, 6, 0],
         [2, 6, 3, 5, 1, 3, 1],
         [2, 7, 3, 2, 6, 8, 1],
         [2, 1, 3, 1, 0, 4, 1]],
        columns=['person_id', 'object_id', 'col_1', 'col_2', 'col_3', 'col_4', 'label'])

Printed out, the DataFrame looks like this. It has a person_id and an object_id column, then several feature columns named col_x, and finally the label.

       person_id  object_id  col_1  col_2  col_3  col_4  label
    0          1          2      4      5      7      8      1
    1          1          3      1      3      4      6      1
    2          1          4      1      2      6      5      0
    3          1          5      1      3      3      6      0
    4          2          6      3      5      1      3      1
    5          2          7      3      2      6      8      1
    6          2          1      3      1      0      4      1

I want to use a function from a library that needs its input in a specific format. Specifically, I want to group by person_id and label, and then create a list of lists from the col_x columns plus a regular list with the label of each group. For the example above, the result would be:

    bags = [
        [[4, 5, 7, 8], [1, 3, 4, 6]],
        [[1, 2, 6, 5], [1, 3, 3, 6]],
        [[3, 5, 1, 3], [3, 2, 6, 8], [3, 1, 0, 4]],
    ]
    labels = [1, 0, 1]

What I do now is iterate over the DataFrame and build the two new lists dynamically. However, I know that's not wise, and I am looking for a more Pythonic approach with better performance.

My ugly solution

    bags = []
    labels = []
    uniquePeople = df['person_id'].unique()
    predictors = ['col_1', 'col_2', 'col_3', 'col_4']

    for unp in uniquePeople:
        # Rows of this person with label 1
        person = df[(df['person_id'] == unp) & (df['label'] == 1)][predictors].values
        label = 1
        if len(person) > 0:
            bags.append(person)
            labels.append(label)

        # Rows of this person with label 0
        person = df[(df['person_id'] == unp) & (df['label'] == 0)][predictors].values
        label = 0
        if len(person) > 0:
            bags.append(person)
            labels.append(label)

P.S. I heavily edited the code on the fly to make it suitable for Stack Overflow. If you find something wrong there, don't bother. The aim is to find a better one, not to fix the ugly one :P

2 Answers

2

Use DataFrame.groupby on both columns and apply a lambda function to each group to get a Series of nested lists:

    predictors = ['col_1', 'col_2', 'col_3', 'col_4']
    s = (df.groupby(['person_id', 'label'], sort=False)[predictors]
           .apply(lambda x: x.values.tolist()))
    print(s)

    person_id  label
    1          1                      [[4, 5, 7, 8], [1, 3, 4, 6]]
               0                      [[1, 2, 6, 5], [1, 3, 3, 6]]
    2          1        [[3, 5, 1, 3], [3, 2, 6, 8], [3, 1, 0, 4]]
    dtype: object

Then convert the Series to a list:

    bags = s.tolist()
    print(bags)

    [[[4, 5, 7, 8], [1, 3, 4, 6]], [[1, 2, 6, 5], [1, 3, 3, 6]], [[3, 5, 1, 3], [3, 2, 6, 8], [3, 1, 0, 4]]]

The labels are the second level of the MultiIndex, which you can extract with Index.get_level_values:

    labels = s.index.get_level_values(1).tolist()
    print(labels)

    [1, 0, 1]
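For readers who prefer an explicit loop, roughly the same result can be sketched by iterating over the groupby object and filling both lists in one pass. This is an illustrative variant on the question's sample data, not the method used in this answer; the loop variable names are assumptions made for the example.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [[1, 2, 4, 5, 7, 8, 1],
     [1, 3, 1, 3, 4, 6, 1],
     [1, 4, 1, 2, 6, 5, 0],
     [1, 5, 1, 3, 3, 6, 0],
     [2, 6, 3, 5, 1, 3, 1],
     [2, 7, 3, 2, 6, 8, 1],
     [2, 1, 3, 1, 0, 4, 1]],
    columns=['person_id', 'object_id', 'col_1', 'col_2', 'col_3', 'col_4', 'label'])

predictors = ['col_1', 'col_2', 'col_3', 'col_4']

bags = []
labels = []

# Grouping by two keys yields a (person_id, label) tuple for each group.
# sort=False keeps the groups in order of first appearance.
for (person_id, label), group in df.groupby(['person_id', 'label'], sort=False):
    bags.append(group[predictors].values.tolist())  # list of row lists
    labels.append(int(label))                       # plain Python int

print(bags)
# [[[4, 5, 7, 8], [1, 3, 4, 6]], [[1, 2, 6, 5], [1, 3, 3, 6]], [[3, 5, 1, 3], [3, 2, 6, 8], [3, 1, 0, 4]]]
print(labels)
# [1, 0, 1]
```

The loop still delegates the grouping to pandas, so it avoids the repeated boolean filtering of the original code while keeping each step visible.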

1 Comment

Insane improvement in performance! Awesome! Always shocked by what Pandas can achieve. Thank you
0

Not sure if this is what you are looking for

    import pandas as pd

    # Example DataFrame
    df = pd.DataFrame(
        [[1, 2, 4, 5, 7, 8, 1],
         [1, 3, 1, 3, 4, 6, 1],
         [1, 4, 1, 2, 6, 5, 0],
         [1, 5, 1, 3, 3, 6, 0],
         [2, 6, 3, 5, 1, 3, 1],
         [2, 7, 3, 2, 6, 8, 1],
         [2, 1, 3, 1, 0, 4, 1]],
        columns=['person_id', 'object_id', 'col_1', 'col_2', 'col_3', 'col_4', 'label'])

    # Create a new column holding each row's col_x values as a list
    df['cols'] = df[['col_1', 'col_2', 'col_3', 'col_4']].apply(lambda x: list(x), axis=1)

    # Group by person and label, collecting the row lists into a list of lists;
    # sort=False keeps the groups in order of first appearance
    tmp = df.groupby(['person_id', 'label'], sort=False)[['cols']].agg(list)

    bags = tmp['cols'].tolist()               # unpack
    labels = tmp.index.droplevel(0).tolist()  # label level as a plain list
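Group order is worth checking when comparing the result with the expected bags and labels: groupby sorts the group keys by default, while sort=False keeps them in order of first appearance. Here is a minimal sketch using only the person_id and label columns from the question; the variable names are illustrative.

```python
import pandas as pd

# Only the two grouping columns from the question's sample data.
df = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2],
    'label':     [1, 1, 0, 0, 1, 1, 1],
})

# Default (sort=True): group keys come back sorted.
default_order = df.groupby(['person_id', 'label']).size().index.tolist()

# sort=False: group keys come back in order of first appearance.
appearance_order = df.groupby(['person_id', 'label'], sort=False).size().index.tolist()

print(default_order)     # [(1, 0), (1, 1), (2, 1)]
print(appearance_order)  # [(1, 1), (1, 0), (2, 1)]
```

With the default ordering the labels come out as [0, 1, 1] rather than the [1, 0, 1] shown in the question, so the choice matters whenever the bags and labels are compared by position.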
