1

I have the following Pandas DataFrame:

 start_timestamp_milli end_timestamp_milli name rating 1 1555414708025 1555414723279 Valence 2 2 1555414708025 1555414723279 Arousal 6 3 1555414708025 1555414723279 Dominance 2 4 1555414708025 1555414723279 Sadness 1 5 1555414813304 1555414831795 Valence 3 6 1555414813304 1555414831795 Arousal 5 7 1555414813304 1555414831795 Dominance 2 8 1555414813304 1555414831795 Sadness 1 9 1555414921819 1555414931382 Valence 1 10 1555414921819 1555414931382 Arousal 7 11 1555414921819 1555414931382 Dominance 2 12 1555414921819 1555414931382 Sadness 1 13 1555414921819 1555414931382 Anger 1 

In the above example there are three groups which can be grouped by start_timestamp_milli and end_timestamp_milli. The first group is from index 1 to 4, the second group from index 5 - 8 and the third group from index 9 to 13.

For each such group, if in the name column "Anger" and Happiness" is not present, I would like to insert it with a rating of 0. If it is present, nothing should happen.

The final result should look like this. The added lines are line 5, 6, 11, 12 and 18.

 start_timestamp_milli end_timestamp_milli name rating 1 1555414708025 1555414723279 Valence 2 2 1555414708025 1555414723279 Arousal 6 3 1555414708025 1555414723279 Dominance 2 4 1555414708025 1555414723279 Sadness 1 5 1555414708025 1555414723279 Happiness 0 6 1555414708025 1555414723279 Anger 0 7 1555414813304 1555414831795 Valence 3 8 1555414813304 1555414831795 Arousal 5 9 1555414813304 1555414831795 Dominance 2 10 1555414813304 1555414831795 Sadness 1 11 1555414813304 1555414831795 Happiness 0 12 1555414813304 1555414831795 Anger 0 13 1555414921819 1555414931382 Valence 1 14 1555414921819 1555414931382 Arousal 7 15 1555414921819 1555414931382 Dominance 2 16 1555414921819 1555414931382 Sadness 1 17 1555414921819 1555414931382 Happiness 0 18 1555414921819 1555414931382 Anger 1 

How can this be done?

2 Answers 2

3

Option 1

This very explicitly loops through each group and appends dummy dataframe and drops duplicates.

d = dict(name=['Anger', 'Happiness'], rating=0) cols = ['start_timestamp_milli', 'end_timestamp_milli'] def f(d0, k): d1 = pd.DataFrame({**dict(zip(cols, k)), **d}) return d0.append(d1, ignore_index=True).drop_duplicates('name') pd.concat([f(d, k) for k, d in df.groupby(cols)], ignore_index=True) start_timestamp_milli end_timestamp_milli name rating 0 1555414708025 1555414723279 Valence 2 1 1555414708025 1555414723279 Arousal 6 2 1555414708025 1555414723279 Dominance 2 3 1555414708025 1555414723279 Sadness 1 4 1555414708025 1555414723279 Anger 0 5 1555414708025 1555414723279 Happiness 0 6 1555414813304 1555414831795 Valence 3 7 1555414813304 1555414831795 Arousal 5 8 1555414813304 1555414831795 Dominance 2 9 1555414813304 1555414831795 Sadness 1 10 1555414813304 1555414831795 Anger 0 11 1555414813304 1555414831795 Happiness 0 12 1555414921819 1555414931382 Valence 1 13 1555414921819 1555414931382 Arousal 7 14 1555414921819 1555414931382 Dominance 2 15 1555414921819 1555414931382 Sadness 1 16 1555414921819 1555414931382 Anger 1 17 1555414921819 1555414931382 Happiness 0 

Option 2

This builds a new index and uses reindex

cats = ['Anger', 'Happiness'] cols = ['start_timestamp_milli', 'end_timestamp_milli'] d = df.set_index([*cols, 'name']) i = pd.MultiIndex.from_tuples( [(s, e, n) for s, e in {*zip(*map(df.get, cols))} for n in cats], names=d.index.names ) | d.index df.set_index([*cols, 'name']).reindex(i, fill_value=0).reset_index() start_timestamp_milli end_timestamp_milli name rating 0 1555414708025 1555414723279 Anger 0 1 1555414708025 1555414723279 Arousal 6 2 1555414708025 1555414723279 Dominance 2 3 1555414708025 1555414723279 Happiness 0 4 1555414708025 1555414723279 Sadness 1 5 1555414708025 1555414723279 Valence 2 6 1555414813304 1555414831795 Anger 0 7 1555414813304 1555414831795 Arousal 5 8 1555414813304 1555414831795 Dominance 2 9 1555414813304 1555414831795 Happiness 0 10 1555414813304 1555414831795 Sadness 1 11 1555414813304 1555414831795 Valence 3 12 1555414921819 1555414931382 Anger 1 13 1555414921819 1555414931382 Arousal 7 14 1555414921819 1555414931382 Dominance 2 15 1555414921819 1555414931382 Happiness 0 16 1555414921819 1555414931382 Sadness 1 17 1555414921819 1555414931382 Valence 1 
Sign up to request clarification or add additional context in comments.

2 Comments

The difference between mine and @WeNYoBen is that their answer will add zero for every unique value in name for each grouping of start and end. This may be what the OP wanted but did not explicitly say. I intentionally went through extra trouble to ensure that I only added 'Anger' and 'Happiness' if they didn't exist for a group. OP and any future readers, please choose according to your needs.
Your approach worked well. I would like to add another row to each grouping of start and end. Namely a row with name "Neutral". The rating column of the "Neutral" row should be 1 if Anger and Sadness have a rating of 0 and 0 otherwise. Is this possible?
3

I am using unstack + stack + reindex

s=set(df.name.unique().tolist()+['Anger','Happiness']) df.set_index(df.columns[:-1].tolist()).rating.\ unstack(fill_value=0).\ reindex(columns=s,fill_value=0).\ stack().reset_index() 

3 Comments

This almost works, add a .drop_duplicates().reset_index() to get the desired output
@Vink thank you , i think using set is slightly better than drop_duplicate
agreed, safer & more precise

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.