
I need to find out how many of the first N rows of a dataframe make up (just over) 50% of the sum of values for that column.

Here's an example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 1), columns=list("A"))

          A
0  0.681991
1  0.304026
2  0.552589
3  0.716845
4  0.559483
5  0.761653
6  0.551218
7  0.267064
8  0.290547
9  0.182846

therefore

sum_of_A = df["A"].sum() 

4.868260213425804

and with this example I need to find, starting from row 0, how many rows I need to get a sum of at least 2.43413 (approximating 50% of sum_of_A).

Of course I could iterate through the rows, keep a running sum, and break once it passes 50%, but is there a more concise/Pythonic/efficient way of doing this?

  • There is "cumsum" for a cumulative sum and (if column has no negative values) "searchsorted" to find the point where the sum is greater than a given value. Commented Jan 17, 2023 at 16:05
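The cumsum + searchsorted idea from the comment above can be sketched as follows (a sketch with seeded random data, so the values differ from the question's example; it assumes the column has no negative values, so the cumulative sum is monotonically increasing):

```python
import numpy as np
import pandas as pd

# Reproducible example data (values differ from the question's)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random(10), columns=["A"])

# Running total of column A
cs = df["A"].cumsum()

# First position where the running total reaches half the column sum;
# searchsorted requires the series to be sorted, which holds for a
# cumulative sum of non-negative values.
idx = cs.searchsorted(df["A"].sum() / 2)

# searchsorted returns a 0-based position, so the row count is one more
n_rows = idx + 1
```

This avoids building a boolean mask over the whole column and stops the search with a binary lookup instead.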

1 Answer


I would use .cumsum(), which gives the running total of the column; selecting the rows where that running total is still below half of the total sum gives the rows that make up (just under) 50%:

df[df["A"].cumsum() < df["A"].sum() / 2] 
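To turn the answer's selection into the count the question asks for, one pattern (a sketch with seeded data, not part of the original answer) is to count the rows whose running total stays under half and add one for the row that tips it over:

```python
import numpy as np
import pandas as pd

# Reproducible example data
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random(10), columns=["A"])

half = df["A"].sum() / 2

# Rows whose cumulative sum is still below 50% of the total
under = df[df["A"].cumsum() < half]

# One more row is needed to push the running total to at least 50%
n_rows = len(under) + 1
```

If the very first row already holds at least half the total, `under` is empty and `n_rows` is 1, which is still correct.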

5 Comments

Very interesting idea, but it seems to select the rows which go OVER the 50% value. Using the example above, your code would select rows 5-9.
Yes, did you want the rows under 50%? If so, change the >= to <=.
Yes thanks. The correct comparison for me is using '<'.
Got it. If this answer helped you, please consider accepting it for the benefit of future readers. Have a great day!
If anyone's curious about the real use case, I have a dataframe with usernames in a column and the number of times they've commented in the other and have it sorted descending by this second column. With this I am selecting the first N users which have contributed to around 50% of the comments total :)
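For that use case, the same pattern applies directly. A minimal sketch (the data and the `user`/`comments` column names are hypothetical, assuming the frame is already sorted descending by comment count):

```python
import pandas as pd

# Hypothetical data: comment counts per user, sorted descending
df = pd.DataFrame(
    {"user": ["ann", "bob", "cat", "dan", "eve"],
     "comments": [40, 25, 15, 12, 8]}
)

half = df["comments"].sum() / 2

# Users whose running total stays below half, plus one to cross it
n_users = len(df[df["comments"].cumsum() < half]) + 1

# The first N users, who together account for at least 50% of comments
top_users = df["user"].iloc[:n_users]
```

Here the totals run 40, 65, 80, ..., so the first two users already cover at least half of the 100 comments.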
