
I need to do a fuzzy groupby where a single record can be in one or more groups.

I have a DataFrame like this:

test = pd.DataFrame({'score1': pd.Series(['a', 'b', 'c', 'd', 'e']),
                     'score2': pd.Series(['b', 'a', 'k', 'n', 'c'])})

Output:

  score1 score2
0      a      b
1      b      a
2      c      k
3      d      n
4      e      c

I wish to have one group per unique score value, where the group keys are the union of the unique values across score1 and score2. Record 0 should be in groups a and b because it contains both score values. Similarly, record 1 should be in groups b and a; record 2 in groups c and k; and so on.
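Roughly, the mapping I'm after (written out as a plain dict of group key to record indices, since that's all it is):

{'a': [0, 1], 'b': [0, 1], 'c': [2, 4], 'd': [3], 'e': [4], 'k': [2], 'n': [3]}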

I've tried doing a groupby on two columns like this:

In [192]: score_groups = test.groupby(['score1', 'score2'])

However, I get the group keys as tuples - ('a', 'b'), ('b', 'a'), ('c', 'k'), etc. - instead of unique group keys where records can be in multiple groups. The output is shown below:

In [193]: score_groups.groups
Out[193]:
{('a', 'b'): [0],
 ('b', 'a'): [1],
 ('c', 'k'): [2],
 ('d', 'n'): [3],
 ('e', 'c'): [4]}

Also, I need the indexes preserved because I'm using them for another operation later. Please help!

3 Comments
  • I think your data-frame probably needs to be transposed. If your columns were: 'index,score_name,score' this would be trivial. Look at pandas.melt to transform your dataframe. Commented May 31, 2016 at 21:58
  • So... you just need to preserve the original index? Commented May 31, 2016 at 23:39
  • Yes, and I also need access to the actual groups in the GroupBy object so that I can perform some operations on the contents of the groups. I've posted my answer. Commented May 31, 2016 at 23:48

3 Answers


Combine the two columns into a single column using e.g. pd.concat():

s = pd.concat([test['score1'], test['score2'].rename('score1')]).reset_index()
s.columns = ['val', 'grp']

   val grp
0    0   a
1    1   b
2    2   c
3    3   d
4    4   e
5    0   b
6    1   a
7    2   k
8    3   n
9    4   c

And then .groupby() on 'grp' and collect 'val' in a list:

s = s.groupby('grp').apply(lambda x: x.val.tolist())

grp
a    [0, 1]
b    [1, 0]
c    [2, 4]
d       [3]
e       [4]
k       [2]
n       [3]

or, if you prefer dict:

s.to_dict()

{'e': [4], 'd': [3], 'n': [3], 'k': [2], 'a': [0, 1], 'c': [2, 4], 'b': [1, 0]}

Or, to the same effect in a single step, skipping renaming the columns:

test.unstack().reset_index(-1).groupby(0).apply(lambda x: x.level_1.tolist())

0
a    [0, 1]
b    [1, 0]
c    [2, 4]
d       [3]
e       [4]
k       [2]
n       [3]
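If you also want to hang on to the GroupBy object itself rather than just the collected lists (see the comments below), a minimal sketch, assuming s is the two-column frame built in the first step (before it is overwritten by the apply):

g = s.groupby('grp')
g.get_group('a')                    # rows of s whose group key is 'a'
#    val grp
# 0    0   a
# 6    1   a
g.get_group('a')['val'].tolist()    # the original indices, preserved in 'val'
# [0, 1]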

4 Comments

The concatenation operation adds new index values. I need the indexes preserved because I'm using them for something else. I've actually created a separate column that replicates the original index. Is it possible to do a groupby() that stores this column's values instead of the actual index?
I just updated my answer - it looks like the original data changed, or I copied it wrong before. I think the index values are captured properly now?
You're right. I changed the question to more closely represent my data. Your answer does give me the correct indices now, however because of the apply() function, access to the original GroupBy object and its contents is lost. I can always use the dictionary values to get the rows from the original dataFrame, but it would be nice to have the GroupBy object itself.
You can access the groupby object before using apply; the original index is preserved in the val column.

Reorganize your data for ease of manipulation; having multiple value columns for the same data will always cause you headaches:

import pandas as pd

test = pd.DataFrame({'score1': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
                     'score2': pd.Series([2, 1, 8, 9, 3], index=['a', 'b', 'c', 'd', 'e'])})
test['name'] = test.index
result = pd.melt(test, id_vars=['name'], value_vars=['score1', 'score2'])

  name variable  value
0    a   score1      1
1    b   score1      2
2    c   score1      3
3    d   score1      4
4    e   score1      5
5    a   score2      2
6    b   score2      1
7    c   score2      8
8    d   score2      9
9    e   score2      3

Now you have only one column for your value and it's easy to groupby score or select by your name column:

hey = result.groupby('value')
hey.groups  # below are the indices that you care about

{1: [0, 6], 2: [1, 5], 3: [2, 9], 4: [3], 5: [4], 8: [7], 9: [8]}
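And if you need to map a group back to the original (letter) index, a small sketch: the name column carries it through the melt, so you can pull it out per group:

hey.get_group(2)['name'].tolist()   # original index labels for the records with value 2
# ['b', 'a']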

2 Comments

This could work nicely, but it doesn't preserve the indexes. When I do a groupby on result, the indexes in the groups are now of result, not of the original dataframe. Also, result.groupby('value') contains groups of result's rows, not the original dataset's rows. I've updated my question.
@lostsoul29 - yes, they're indices on the new dataframe, you just have to use them to index in to that result and it will give you what you want. If you have to preserve the original index you should save it as a column not an index.

Using Stefan's help, I solved it like this.

In [283]: frame1 = test[['score1']]
          frame2 = test[['score2']]
          frame2.rename(columns={'score2': 'score1'}, inplace=True)
          test = pd.concat([frame1, frame2])
          test
Out[283]:
  score1
0      a
1      b
2      c
3      d
4      e
0      b
1      a
2      k
3      n
4      c

Notice the duplicate indices. The indexes have been preserved, which is what I wanted. Now, let's get down to business: the groupby operation.

In [284]: groups = test.groupby('score1')
          groups.get_group('a')   # Get group with key a
Out[284]:
  score1
0      a
1      a

In [285]: groups.get_group('b')   # Get group with key b
Out[285]:
  score1
1      b
0      b

In [286]: groups.get_group('c')   # Get group with key c
Out[286]:
  score1
2      c
4      c

In [287]: groups.get_group('k')   # Get group with key k
Out[287]:
  score1
2      k

I'm baffled by how pandas retrieves rows with the correct index even though they are duplicated. As I understand it, the groupby operation uses an inverted-index data structure to store references (indexes) to the rows. Any insights would be greatly appreciated. Anyone who answers this will have their answer accepted :)
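My hedged guess, from poking at the GroupBy object: grouping appears to work on integer row positions rather than index labels, which is why duplicate labels are harmless. groups reports labels, while indices reports positions (output approximate; it varies by pandas version):

groups.groups    # keys -> index *labels* (duplicated here because of the concat)
# {'a': [0, 1], 'b': [1, 0], 'c': [2, 4], 'd': [3], 'e': [4], 'k': [2], 'n': [3]}
groups.indices   # keys -> integer row *positions*, which are always unique
# {'a': array([0, 6]), 'b': array([1, 5]), 'c': array([2, 9]), 'd': array([3]),
#  'e': array([4]), 'k': array([7]), 'n': array([8])}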

Comments
