3

I have a program which works with a for loop, but it's too slow, and I need to speed it up.

I have a reverse-sorted list of probabilities whose sum is 1. There are over 5 million items.

I want to take the highest probabilities, i.e. the first n items whose collective sum reaches 0.9999.

This was my code:

    for b in sorted_list:
        new_list.append(b)
        if sum(new_list) > 0.9999:
            break

Can anyone suggest a quicker method?

Thank you

Edit: I found that this question was asked before - stackexchange link

However, the suggestions all make use of loops, so I don't think they will be any quicker. Someone at the end suggested a list comprehension, so I am going to google that and see what it means! Thank you

3 Comments
  • This is a classic mistake: calling sum(new_list) on every step makes your algorithm very inefficient. Instead, keep a running total by adding the appended element to the previous sum. Commented Oct 9, 2019 at 16:20
  • can you try np.array(sorted_list)[np.cumsum(sorted_list) < 0.9999] (see the sketch after these comments) Commented Oct 9, 2019 at 16:23
  • I like the idea that I made a classic mistake. Going to replace my code with the suggestion below and check the outcome. Commented Oct 9, 2019 at 16:38
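
A minimal sketch of the vectorized approach from the comment above, assuming sorted_list already exists. Note that the strict < comparison in the comment would drop the element that pushes the sum past the threshold, so this sketch uses np.searchsorted to match the append-then-break loop in the question (floating-point cumulative sums can also differ slightly from repeated Python sums):

    import numpy as np

    probs = np.asarray(sorted_list)    # sorted_list from the question
    csum = np.cumsum(probs)            # all running totals in one vectorized pass
    # First index whose running total exceeds 0.9999; +1 keeps that element,
    # matching the append-then-break behaviour of the original loop.
    cut = np.searchsorted(csum, 0.9999, side='right') + 1
    new_list = probs[:cut].tolist()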

2 Answers

6

Keep a running sum instead of recomputing it every step for the whole list. I.e.

    running_sum = 0
    for b in sorted_list:
        new_list.append(b)
        running_sum += b
        if running_sum > 0.9999:
            break
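
For reference, here is a self-contained version of this loop that can be run as-is; the synthetic data below is illustrative and stands in for the 5-million-item list from the question:

    import random

    # Illustrative data: a reverse-sorted list of probabilities summing to 1.
    values = sorted((random.random() for _ in range(1_000_000)), reverse=True)
    total = sum(values)
    sorted_list = [v / total for v in values]

    new_list = []
    running_sum = 0.0
    for b in sorted_list:
        new_list.append(b)
        running_sum += b        # O(1) per iteration instead of re-summing the list
        if running_sum > 0.9999:
            break

    print(len(new_list), running_sum)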

4 Comments

Thanks very much for the quick replies. I am going to try this solution. I haven't yet had time to digest why you all think this is quicker, but I will certainly try it, and after I have done it, I'll probably have figured out why my process was inefficient. I will probably come back in an hour or so and accept this as the solution, but I want to try it first. Thanks to all for your comments.
The version provided in your question requires that all elements in new_list are accessed to perform addition to compute the sum on each iteration. This computes in O(n!) time over the course of all iterations. However, the incremental addition only needs to perform a single operation with each iteration of the loop, meaning it performs in O(n) time. This presents a significant speedup as the size of n gets large.
@JacobTurpin The complexity is quadratic, not exponential as you suggest. It's a simple nested-iteration issue: on every outer iteration, all the collected inner items are iterated again, which amounts to O(n^2) complexity (see the timing sketch after these comments).
Thanks Jacob. I understand now why the amendment is so much faster.
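
To put numbers on the quadratic-vs-linear point made in the comments above, here is a small timing sketch; the list size and values are arbitrary, chosen so that neither loop breaks early and the slow version still finishes quickly:

    import timeit

    def repeated_sum(xs):
        new_list = []
        for b in xs:
            new_list.append(b)
            if sum(new_list) > 0.9999:   # re-walks new_list every iteration: O(n^2) total
                break

    def running_sum(xs):
        new_list, total = [], 0.0
        for b in xs:
            new_list.append(b)
            total += b                   # constant work per iteration: O(n) total
            if total > 0.9999:
                break

    xs = [1e-9] * 10_000                 # sum stays below 0.9999, so both loops run in full
    print(timeit.timeit(lambda: repeated_sum(xs), number=1))
    print(timeit.timeit(lambda: running_sum(xs), number=1))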
3

sum(iterable) has to visit all elements to calculate the sum. That is unnecessary as you can reuse the sum from the previous iteration. The built-in tool to accumulate such a sum is, well, itertools.accumulate. Moreover, you don't have to append repeatedly. Instead, you can take a single slice at the end:

    from itertools import accumulate

    for i, s in enumerate(accumulate(sorted_list)):
        if s > 0.9999:
            break
    new_list = sorted_list[:i+1]
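
A quick usage sketch of this approach on toy data (the example list is illustrative, not from the question). Note that i deliberately leaks out of the for loop, which is well-defined in Python:

    from itertools import accumulate

    # Toy reverse-sorted probabilities that sum to 1 (illustrative only).
    sorted_list = [0.5, 0.3, 0.15, 0.04, 0.009, 0.00095, 0.00005]

    for i, s in enumerate(accumulate(sorted_list)):
        if s > 0.9999:
            break
    new_list = sorted_list[:i + 1]
    print(new_list, sum(new_list))   # the shortest prefix whose sum exceeds 0.9999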

