I have a pandas DataFrame in the following format:

    import pandas as pd

    df = pd.DataFrame(
        [[1, 2, 4, 5, 7, 8, 1],
         [1, 3, 1, 3, 4, 6, 1],
         [1, 4, 1, 2, 6, 5, 0],
         [1, 5, 1, 3, 3, 6, 0],
         [2, 6, 3, 5, 1, 3, 1],
         [2, 7, 3, 2, 6, 8, 1],
         [2, 1, 3, 1, 0, 4, 1]],
        columns=['person_id', 'object_id', 'col_1', 'col_2', 'col_3', 'col_4', 'label'])

More visually, this is how the DataFrame looks. It has a person_id and an object_id column, followed by several col_x columns and finally the label.
       person_id  object_id  col_1  col_2  col_3  col_4  label
    0          1          2      4      5      7      8      1
    1          1          3      1      3      4      6      1
    2          1          4      1      2      6      5      0
    3          1          5      1      3      3      6      0
    4          2          6      3      5      1      3      1
    5          2          7      3      2      6      8      1
    6          2          1      3      1      0      4      1

I want to use a function from a library that needs its input in a specific format. Specifically, I want to group by person_id and label, and then create a list of lists from the col_x values of each group, plus a regular list holding one label per group. Based on the example above, the result would be:
    bags = [
        [[4, 5, 7, 8], [1, 3, 4, 6]],
        [[1, 2, 6, 5], [1, 3, 3, 6]],
        [[3, 5, 1, 3], [3, 2, 6, 8], [3, 1, 0, 4]]
    ]
    labels = [1, 0, 1]

What I do now is iterate over the DataFrame and build the two new lists dynamically. However, I know this is not a good idea, and I am looking for a more Pythonic approach with better performance.
My ugly solution
    bags = []
    labels = []
    uniquePeople = df['person_id'].unique()
    predictors = ['col_1', 'col_2', 'col_3', 'col_4']

    for unp in uniquePeople:
        # rows for this person with label 1
        person = df[(df['person_id'] == unp) & (df['label'] == 1)][predictors].values
        label = 1
        if len(person) > 0:
            bags.append(person)
            labels.append(label)

        # rows for this person with label 0
        person = df[(df['person_id'] == unp) & (df['label'] == 0)][predictors].values
        label = 0
        if len(person) > 0:
            bags.append(person)
            labels.append(label)

P.S. I heavily edited the code on the fly to make it suitable for Stack Overflow. If you find something wrong there, don't bother. The aim is to find a better solution, not to fix the ugly one :P
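
For reference, the grouping described above could be sketched with `groupby`. This is a minimal sketch rather than a benchmarked solution. It assumes the groups are defined by person_id and label only (which is what the expected output reflects), that the target library accepts plain nested Python lists (hence `.values.tolist()`), and that groups should appear in order of first occurrence (hence `sort=False`):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 4, 5, 7, 8, 1],
     [1, 3, 1, 3, 4, 6, 1],
     [1, 4, 1, 2, 6, 5, 0],
     [1, 5, 1, 3, 3, 6, 0],
     [2, 6, 3, 5, 1, 3, 1],
     [2, 7, 3, 2, 6, 8, 1],
     [2, 1, 3, 1, 0, 4, 1]],
    columns=['person_id', 'object_id', 'col_1', 'col_2', 'col_3', 'col_4', 'label'])

predictors = ['col_1', 'col_2', 'col_3', 'col_4']

bags = []
labels = []
# Assumption: one bag per (person_id, label) pair.
# sort=False keeps the groups in order of first appearance in df.
for (person_id, label), group in df.groupby(['person_id', 'label'], sort=False):
    # Convert the predictor columns of this group to a list of row lists.
    bags.append(group[predictors].values.tolist())
    # Cast the group key to a plain Python int.
    labels.append(int(label))
```

On the sample data this produces the bags and labels listed earlier. With the default `sort=True` the groups would instead come out in sorted key order, so the order of the bags and labels would differ.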