Pandas groupby multiple columns and retain all other columns

Question

I have a DataFrame that is the concatenation of two identically structured DataFrames: the first is Orders and the second is Cancels. Orders has more than 20,000 rows, and a small number of Cancels have a corresponding OrderNo and ItemCode. I have made the canceled quantities negative, so that when I group the DataFrame by both OrderNo and ItemCode I can sum the quantity fields with agg, which gives the actual quantity shipped, net of canceled orders.

Below is my dataframe:

       OrderNo  OrderDate  LineNo  ClientNo  ItemCode  QtyOrdered  QtyShipped
0       528758   1/3/2017       1   1358538    111931          70          70
1       528791   1/3/2017      10   1254798    110441         300         300
2       528791   1/3/2017       1   1254798   1029071          10          10
3       528791   1/3/2017       2   1254798   1033341          10          10
4       528791   1/3/2017       8   1254798   1040726          15          15
...        ...        ...     ...       ...       ...         ...         ...
28344   537667   2/6/2017      12  43823870  10137992           0          -2
28345   537771   2/7/2017       5   1276705   1041106           0          -4
28346   539524  2/13/2017       6   1254798   1038323           0         -10
28347   542362  2/23/2017      11   1254612   1041108           0          -2
28348   542835  2/23/2017      13   1255235  10137993           0          -5

28349 rows × 7 columns

After running:

ActualOrders = PreActualOrders.groupby(['OrderNo','ItemCode']).agg({'QtyOrdered': 'sum', 'QtyShipped': 'sum'}).reset_index()

I get my desired result, but I lose all the other columns in the DataFrame.

Result sample below:

       OrderNo  ItemCode  QtyOrdered  QtyShipped
28255   543734   1038324           1           1
28256   543734  10137992           1           1
28257   543734  10137993           1           1
28258   543735   1041106           1           1
28259   543735   1041108           1           1
28260   543735  10135359           1           1

What do I need to add in order to keep all the columns of the original DataFrame?

Within each group, the values in those other columns all match, because each cancel row corresponds to its order row.

Thank you,

MTH

MTH · Accepted Answer · 2020-05-11 14:47:24Z

I was able to get the desired result by including the other columns in the agg function with 'first', while 'QtyOrdered' and 'QtyShipped' are aggregated with 'sum'.

ActualOrders = PreActualOrders.groupby(['OrderNo','ItemCode']).agg({'OrderDate': 'first', 'LineNo': 'first', 'ClientNo': 'first', 'QtyOrdered': 'sum', 'QtyShipped': 'sum' }).reset_index()

This yields my desired result:

       OrderNo  ItemCode  OrderDate  LineNo  ClientNo  QtyOrdered  QtyShipped
28255   543734   1038324  2/27/2017       3   1254787           1           1
28256   543734  10137992  2/27/2017       1   1254787           1           1
28257   543734  10137993  2/27/2017       2   1254787           1           1
28258   543735   1041106  2/27/2017       4   1816460           1           1
28259   543735   1041108  2/27/2017       3   1816460           1           1
28260   543735  10135359  2/27/2017       2   1816460           1           1
28261   543735  10137993  2/27/2017       1   1816460           1           1

The sample output doesn't show any difference between QtyOrdered and QtyShipped because the number of matching cancels is very small. The rows that do have a corresponding cancel are adjusted correctly.
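With many columns, typing out each 'first' entry by hand gets tedious. Here is a minimal sketch of building the agg mapping programmatically instead. It assumes the same column names as above; the small frame `pre_actual_orders` is made-up data standing in for PreActualOrders, and `keys`, `sum_cols` and `agg_map` are just illustrative names.

```python
import pandas as pd

# Tiny made-up stand-in for PreActualOrders (values are illustrative only).
pre_actual_orders = pd.DataFrame({
    "OrderNo":    [528758, 528791, 528791, 528791],
    "OrderDate":  ["1/3/2017", "1/3/2017", "1/3/2017", "1/3/2017"],
    "LineNo":     [1, 10, 1, 10],
    "ClientNo":   [1358538, 1254798, 1254798, 1254798],
    "ItemCode":   [111931, 110441, 1029071, 110441],
    "QtyOrdered": [70, 300, 10, 0],
    "QtyShipped": [70, 300, 10, -25],  # last row plays the role of a cancel
})

keys = ["OrderNo", "ItemCode"]
sum_cols = ["QtyOrdered", "QtyShipped"]

# Every column that is neither a key nor summed keeps its first value per group.
agg_map = {
    col: "first"
    for col in pre_actual_orders.columns
    if col not in keys + sum_cols
}
# The quantity columns are summed, so negative cancel rows reduce the totals.
agg_map.update({col: "sum" for col in sum_cols})

actual_orders = pre_actual_orders.groupby(keys).agg(agg_map).reset_index()
print(actual_orders)
```

This only gives sensible results when the non-summed columns really are identical within each group, as they are described to be here; 'first' silently picks one value otherwise.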

Elephant90 · Accepted Answer · 2020-04-22 17:17:37Z

If I understood you correctly, you could maybe try another approach without groupby. Something similar to this:

import pandas as pd

orders = [["123", "1", 10], ["1234", "2", 100], ["12345", "3", 15]]
cancels = [["123", "1", 10]]

df_orders = pd.DataFrame(orders, columns=["OrderNo", "ItemCode", "Amount"])
df_cancels = pd.DataFrame(cancels, columns=["OrderNo", "ItemCode", "Amount"])

# Left merge keeps every order; orders without a cancel get NaN, which is filled with 0.
merged = df_orders.merge(df_cancels, how="left", on=["OrderNo", "ItemCode"], suffixes=("_orders", "_cancels"))
merged["Amount_cancels"] = merged["Amount_cancels"].fillna(0)

print("Before subtracting cancels")
print(merged)

# Subtract the canceled amount from the ordered amount.
merged["Amount_orders"] = merged["Amount_orders"] - merged["Amount_cancels"]

print("After subtracting cancels")
print(merged)
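One caveat with the merge approach: if an order line can have more than one cancel row, a left merge repeats that order row once per matching cancel. Here is a hedged sketch that sums the cancels per key before merging. Whether duplicate cancels actually occur in your data is an assumption, and the sample rows and the names `cancel_totals` and `Amount_net` are made up for illustration.

```python
import pandas as pd

orders = pd.DataFrame(
    [["123", "1", 10], ["1234", "2", 100], ["12345", "3", 15]],
    columns=["OrderNo", "ItemCode", "Amount"],
)
# Two cancel rows for the same order line (made-up data).
cancels = pd.DataFrame(
    [["123", "1", 4], ["123", "1", 3]],
    columns=["OrderNo", "ItemCode", "Amount"],
)

keys = ["OrderNo", "ItemCode"]

# Collapse cancels to one row per key so the merge cannot duplicate order rows.
cancel_totals = cancels.groupby(keys, as_index=False)["Amount"].sum()

merged = orders.merge(
    cancel_totals, how="left", on=keys, suffixes=("_orders", "_cancels")
)
# Orders with no cancel get NaN from the left merge; treat that as 0 canceled.
merged["Amount_cancels"] = merged["Amount_cancels"].fillna(0)
merged["Amount_net"] = merged["Amount_orders"] - merged["Amount_cancels"]
print(merged)
```

The result keeps one row per order line, with the net amount in a separate column rather than overwriting the ordered amount.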
