
I have looked at this answer, which explains how to compute the value of a specific percentile, and this answer, which explains how to compute the percentiles that correspond to each element.

  • Using the first solution, I can compute the value and scan the original array to find the index.

  • Using the second solution, I can scan the entire output array for the percentile I'm looking for.

However, both require an additional scan if I want to know the index (in the original array) that corresponds to a particular percentile (or the index containing the element closest to that value).

Is there a more direct or built-in way to get the index which corresponds to a percentile?

Note: My array is not sorted and I want the index in the original, unsorted array.
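For concreteness, here is a minimal sketch (my own illustration, not from either linked answer) of the two-step approach described above: compute the percentile value first, then make an extra pass over the unsorted array to find the matching index.

import numpy as np

a = np.array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])  # unsorted example data
value = np.percentile(a, 25)                  # step 1: the percentile value (2.25 here)
index = np.argmin(np.abs(a - value))          # step 2: extra scan for the nearest entry's index (4 here)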


6 Answers


It is a little convoluted, but you can get what you are after with np.argpartition. Let's take an easy array and shuffle it:

>>> a = np.arange(10)
>>> np.random.shuffle(a)
>>> a
array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])

If you want to find e.g. the index of quantile 0.25, this would correspond to the item in position idx of the sorted array:

>>> idx = 0.25 * (len(a) - 1)
>>> idx
2.25

You need to figure out how to round that to an int; say you go with the nearest integer:

>>> idx = int(idx + 0.5)
>>> idx
2

If you now call np.argpartition, this is what you get:

>>> np.argpartition(a, idx)
array([7, 5, 4, 3, 2, 1, 6, 0, 8, 9], dtype=int64)
>>> np.argpartition(a, idx)[idx]
4
>>> a[np.argpartition(a, idx)[idx]]
2

It is easy to check that these last two expressions are, respectively, the index and the value of the .25 quantile.
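For convenience, the steps above can be packaged into a small helper; this is a sketch of my own (the function name quantile_index is made up), not part of the original answer.

import numpy as np

def quantile_index(a, q):
    """Return the index, in the unsorted array a, of the nearest-rank element for quantile q."""
    a = np.asarray(a)
    idx = int(q * (len(a) - 1) + 0.5)   # nearest-integer rank, as above
    return np.argpartition(a, idx)[idx]

a = np.array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])
i = quantile_index(a, 0.25)
print(i, a[i])                          # -> 4 2, matching the result above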


3 Comments

+1; FWIW, your answer would be more obviously correct if a wasn't a shuffle of argpartition(a, idx).
Does this work if the values in the list repeat? y = [0, 0, 0, 2, 2, 4, 5, 5, 9] and int(0.75 * (len(y) - 1) + 0.5) == 6, but y[np.argpartition(y, 6)[6]] outputs 5 and y[5] -> 4 =(
@alvas This does work for lists with repeated values. You have one level of indirection too many in your example: np.argpartition(y, 6)[6] should be 6, and y[6] -> 5.
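A quick check of the claim in the comment above (my own sketch); note that with repeated values argpartition may return either index of a tied pair, but the value is the same either way.

import numpy as np

y = np.array([0, 0, 0, 2, 2, 4, 5, 5, 9])
idx = int(0.75 * (len(y) - 1) + 0.5)   # -> 6
i = np.argpartition(y, idx)[idx]       # 6 or 7, since the value 5 appears twice
print(i, y[i])                         # y[i] is 5, the nearest-rank 0.75 quantile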

If numpy is to be used, you can use the built-in percentile function, but how you do this depends on the version you have (very old < v1.9.0, old < v1.22, or new >= v1.22).

From v1.22.0 of numpy you can write

np.percentile(x, p, method=...) 

with method chosen from:

  • ‘inverted_cdf’

  • ‘averaged_inverted_cdf’

  • ‘closest_observation’

  • ‘interpolated_inverted_cdf’

  • ‘hazen’

  • ‘weibull’

  • ‘linear’ (default)

  • ‘median_unbiased’

  • ‘normal_unbiased’
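As a short usage sketch (my own, assuming numpy >= 1.22, not part of the original answer): pick a method that returns an actual array element, such as 'closest_observation', then locate its index in the unsorted array. The old 'nearest' option is also still accepted as a method value for backwards compatibility.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=1000)   # dummy vector
p = 70                                  # desired percentile

pcen = np.percentile(x, p, method="closest_observation")  # a value taken from x itself
i_near = np.argmin(np.abs(x - pcen))                      # its index in the unsorted array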

For older versions before v1.22

NOTE: The original answer below is deprecated from numpy v1.22.0 onwards - the argument interpolation is deprecated and has been renamed method. The lower, higher and nearest options are retained as method values for backwards compatibility, alongside the default linear. New methods have also been added; see the numpy documentation for details.

From version 1.9.0 of numpy, percentile has the option "interpolation" that allows you to pick out the lower/higher/nearest percentile value. The following will work on unsorted arrays and finds the nearest percentile index:

import numpy as np

p = 70                                        # my desired percentile, here 70%
x = np.random.uniform(10, size=(1000)) - 5.0  # dummy vector

# index of array entry nearest to percentile value
pcen = np.percentile(x, p, interpolation='nearest')
i_near = abs(x - pcen).argmin()

Most people will normally want the nearest percentile value, as above. But just for completeness, you can also ask for the entry below or above the stated percentile value:

# Use this to get index of array entry greater than percentile value:
pcen = np.percentile(x, p, interpolation='higher')

# Use this to get index of array entry smaller than percentile value:
pcen = np.percentile(x, p, interpolation='lower')

For EXTREMELY OLD versions of numpy < v1.9.0, the interpolation option is not available, and thus the equivalent is this:

# Calculate 70th percentile:
pcen = np.percentile(x, p)

i_high = np.asarray([i - pcen if i - pcen >= 0 else x.max() - pcen for i in x]).argmin()
i_low = np.asarray([i - pcen if i - pcen <= 0 else x.min() - pcen for i in x]).argmax()
i_near = abs(x - pcen).argmin()

In summary:

i_high points to the array entry which is the next value equal to, or greater than, the requested percentile.

i_low points to the array entry which is the next value equal to, or smaller than, the requested percentile.

i_near points to the array entry that is closest to the percentile, and can be larger or smaller.

My results are:

pcen -> 2.3436832738049946
x[i_high] -> 2.3523077864975441
x[i_low] -> 2.339987054079617
x[i_near] -> 2.339987054079617
i_high, i_low, i_near -> (876, 368, 368)

i.e. location 876 holds the closest value exceeding pcen, while location 368 holds a value that is even closer, but slightly smaller than the percentile value.
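If it helps, the three lookups can be wrapped into one small helper; this is a sketch of my own (assuming numpy >= 1.22, where 'lower', 'higher' and 'nearest' are accepted as method values), not code from the original answer.

import numpy as np

def percentile_indices(x, p):
    """Return (i_high, i_low, i_near) for percentile p of the unsorted array x."""
    x = np.asarray(x)
    i_high = np.argmin(np.abs(x - np.percentile(x, p, method="higher")))
    i_low = np.argmin(np.abs(x - np.percentile(x, p, method="lower")))
    i_near = np.argmin(np.abs(x - np.percentile(x, p, method="nearest")))
    return i_high, i_low, i_near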

6 Comments

Regarding the solution i_near = abs(x - np.percentile(x, p, interpolation='nearest')).argmin(): it is much faster to do y = np.percentile(x, p, interpolation='nearest') and then i_near = abs(x - y).argmin(), and even a little bit faster to do y = np.percentile(x, p, interpolation='nearest') and then i_near = np.where(x == y)[0][0]
thanks you are right, I will update to include this
I don't think this will work if the desired percentile value is duplicated. It might be important to know the actual position of the percentile.
Duplicates don't matter; the post specifically requests "the index containing the element closest to that value" - in a list of floats, it would actually be extremely unlikely that an entry exactly equals the requested percentile. Remember that percentile value calculations rely on an assumed distribution fit to the "samples" provided. Calculating percentiles directly from the distribution requires very large samples. This solution uses the generally adopted method to calculate percentile values (python, CDO etc.), and then simply finds the list entry nearest to that value.
So, what would be the position of the 75th percentile in this unusual data set: [0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0]? Your code gives i_near=1. I think duplicates do matter. You can argue that they are unlikely, but this is debatable. I have duplicates in my application.
If you want to use the ranked definition of percentile, then the value of the 75th percentile is 0.5 and its position in this unusual data set is any of indices 1 through 7. So 1, the index of the first occurrence of the 75th-percentile value, is a correct answer. If instead you choose a percentile method from the many built into numpy that assumes a fit (answers will vary with the method because the sample is small), some fitted methods may return a 75th-percentile value closer to 1.0 than to 0.5, in which case the code returns 8, which is still correct.
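A quick check of the example discussed in this thread (my own sketch, assuming numpy >= 1.22):

import numpy as np

y = np.array([0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0])
pcen = np.percentile(y, 75, method="nearest")  # -> 0.5 with this method
i_near = np.argmin(np.abs(y - pcen))           # -> 1, the first occurrence of 0.5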

Using numpy,

import numpy as np

arr = [12, 19, 11, 28, 10]
p = 0.75
np.argsort(arr)[int((len(arr) - 1) * p)]

This returns 1, the index of the value 19, which is the 75th percentile of the array.
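A quick check (my own sketch) that the returned index lines up with numpy's own 75th percentile:

import numpy as np

arr = [12, 19, 11, 28, 10]
idx = np.argsort(arr)[int((len(arr) - 1) * 0.75)]
print(idx, arr[idx], np.percentile(arr, 75))   # -> 1 19 19.0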

Comments


You can use numpy's np.percentile as such:

import random
import numpy as np

percentile = 75
mylist = [random.random() for i in range(100)]  # random list
percidx = mylist.index(np.percentile(mylist, percentile, interpolation='nearest'))

Comments


Assuming the array is sorted... Unless I'm misunderstanding you, you can compute the index of a percentile by taking the length of the array minus 1, multiplying it by the quantile, and rounding to the nearest integer.

round( (len(array) - 1) * (percentile / 100.) ) 

should give you the nearest index to that percentile

5 Comments

My array is not sorted and I want the index in the original array. I updated the question to clarify.
Would sorting the array, finding the element at the index nearest the quantile * (length - 1) and then finding its index in the original array solve the problem?
Finding the index in the original array by a linear search would amount to doing one of the two solutions I already listed in the question. :)
Okay, you could zip the original elements with their indices using enumerate, sort by the second element, and then take the pair at quantile * (length - 1). If the original array is unsorted, it's not clear to me that you can avoid doing at least O(n*log(n)) work.
I've tested this a bit and, rather than round( (len(array) - 1) * (percentile / 100.) ), wouldn't the correct formula be round( len(array) * (percentile / 100.) ) - 1? Basically removing 1 from the index at the end instead of from the length.
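A quick comparison of the two formulas from this thread (my own sketch, using Python 3's round, which rounds halves to the nearest even integer); the comment's variant goes out of bounds at the 0th percentile:

n = 10
for pct in (0, 50, 100):
    a = round((n - 1) * (pct / 100.))   # formula from the answer
    b = round(n * (pct / 100.)) - 1     # formula from the comment
    print(pct, a, b)
# 0   0  -1
# 50  4   4
# 100 9   9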

You can select the values in a DataFrame that fall at or above a designated quantile with df.quantile().

df_metric_95th_percentile = df.metric[df['metric'] >= df['metric'].quantile(q=0.95)]
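Since the comment below asks what df is, here is a self-contained sketch (the DataFrame is made up for illustration) that also returns the index label of the row closest to the 95th-percentile value:

import numpy as np
import pandas as pd

df = pd.DataFrame({"metric": np.random.uniform(size=100)})
target = df["metric"].quantile(q=0.95)
idx = (df["metric"] - target).abs().idxmin()   # index label of the nearest entry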

1 Comment

What is df? Please include import statements in your answers!
