2

I have two dataframes of different length. dfSamples (63012375 rows) and dfFixations (200000 rows).

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]}) dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]}) 

I would like to check each value in dfSamples if it is within any two ranges given in dfFixations and then assign a label to this value. I have found this: Check if value in a dataframe is between two values in another dataframe, but the loop solution is terribly slow and I cannot make any other solution work.

Working (but very slow) example:

labels = np.empty_like(dfSamples['tSample']).astype(np.chararray) for i, fixation in dfFix.iterrows(): log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd']) labels[log_range] = 'fixation' labels[labels != 'fixation'] = 'no_fixation' dfSamples['labels'] = labels 

Following this example: Performance of Pandas apply vs np.vectorize to create new column from existing columns I have tried to vectorize this but with no success.

def check_range(samples, tstart, tend): log_range = (samples > tstart) & (samples < tend) return log_range fixations = list(map(check_range, dfSamples['tSample'], dfFix['tStart'], dfFix['tEnd'])) 

Would appreciate any help!

2 Answers 2

3

Use IntervalIndex.from_arrays with IntervalIndex.get_indexer , if not match is returned -1, so checked and set ouput in numpy.where:

i = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") pos = i.get_indexer(dfSamples['tSample']) dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation") print (dfSamples) tSample labels 0 4 fixation 1 6 fixation 2 8 fixation 3 10 no_fixation 4 12 fixation 5 14 fixation 

Performance: In ideal nice sorted not overlap data, in real should be performance different, the best test it.

dfSamples = pd.DataFrame({'tSample':range(10000)}) dfFixations = pd.DataFrame({'tStart':range(0, 10000, 5),'tEnd':range(2, 10000, 5)}) In [165]: %%timeit ...: labels = np.empty_like(dfSamples['tSample']).astype(np.chararray) ...: for i, fixation in dfFixations.iterrows(): ...: log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd']) ...: labels[log_range] = 'fixation' ...: labels[labels != 'fixation'] = 'no_fixation' ...: dfSamples['labels'] = labels ...: ...: 1.25 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [168]: %%timeit ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") ...: dfSamples["labels1"] = np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation") ...: 315 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [170]: %%timeit ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") ...: contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0) ...: dfSamples["labels1"] = np.where(contained, "fixation", "no_fixation") ...: 82.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 

In [166]: %%timeit ...: s = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") ...: pos = s.get_indexer(dfSamples['tSample']) ...: dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation") ...: 27.8 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
Sign up to request clarification or add additional context in comments.

4 Comments

Great answer! For my data set, this is 10x faster than my solution. Thank you!
Hi, I am getting the error message "InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique" even though my intervals are not overlapping. If I use [ii.overlaps(x) for x in ii] only the diagonal is True
@heldm - hmmm, maybe there is at least one overlapped value. Another solution also not working?
Right again. thanks!
1

setup

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]}) dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]}) 

solution

Create an interval index from your start and end points

ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") 

ii.contains is a method which checks if a point is contained by each interval in the interval index, eg

dfSamples["tSample"].apply(ii.contains) 

gives

0 [True, False] 1 [True, False] 2 [True, False] 3 [False, False] 4 [False, True] 5 [False, True] Name: tSample, dtype: object 

We're going to take this result, apply any function to each element (a list) to get a pandas.Series of booleans, which we can then use with numpy.where

dfSamples["labels"] = np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation") 

the result

 tSample labels 0 4 fixation 1 6 fixation 2 8 no_fixation 3 10 no_fixation 4 12 fixation 5 14 no_fixation 

edit: faster version

Using piso v0.6.0

import piso import numpy as np ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both") contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0) dfSamples["labels"] = np.where(contained, "fixation", "no_fixation") 

This will run in a similar time to @jezrael's solution, however it can handle cases where intervals overlaps eg

dfFixations = pd.DataFrame({'tStart':[4,5,12],'tEnd':[8,9,14]}) 

note: I am the creator of piso. Please feel free to reach out with feedback or questions if you have any.

3 Comments

Thanks for your answer! Unfortunately, this seems to be even slower than interrows for me. Took about 5x longer than my solution above
I may have a faster solution for you but I need to know how your intervals are "closed". In your first solution your intervals are closed on both ends, and in your second they're closed on neither.
Would be great! Whatever is fastest will be best in this case. Preferably, only 10 would get the label 'no_fixation'.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.