Extending from the loop vectorization approach you looked into, try searching for "pairwise distance calculation", e.g.: How to efficiently calculate euclidean distance matrix for several timeseries
The method below is fully vectorized (no for loops at all). It relies on NumPy for the compute-intensive 2D part, then returns a pandas dataframe as requested. With your input data sizes of 15000 and 1500 rows, calculating the distances and filtering rows by <= buffer is accomplished in a fraction of a second.
Mock-up data
Let
- dfA represent qinsy_file_2, with nA = 15000 points of 2D coordinates ('xA', 'yA')
- dfB represent segy_vlookup, with nB = 1500 points of 2D coordinates ('xB', 'yB')
import numpy as np
import pandas as pd

nA, nB = 15000, 1500   # input sizes
buffer = 0.01          # arbitrary value

dfA = pd.DataFrame({'xA': np.random.default_rng(seed=0).random(nA),
                    'yA': np.random.default_rng(seed=1).random(nA)})
dfB = pd.DataFrame({'xB': np.random.default_rng(seed=3).random(nB),
                    'yB': np.random.default_rng(seed=4).random(nB)})
With these seeds, if you want to reproduce the process, the mock-up data looks like this:

             xA        yA
0      0.636962  0.511822
1      0.269787  0.950464
2      0.040974  0.144160
3      0.016528  0.948649
4      0.813270  0.311831
...         ...       ...
14995  0.457170  0.507611
14996  0.283829  0.828005
14997  0.679110  0.910104
14998  0.273703  0.294932
14999  0.097813  0.366295

[15000 rows x 2 columns]

            xB        yB
0     0.085649  0.943056
1     0.236811  0.511328
2     0.801274  0.976244
3     0.582162  0.080836
4     0.094129  0.607356
...        ...       ...
1495  0.684086  0.821719
1496  0.383240  0.957020
1497  0.341389  0.064735
1498  0.032523  0.234434
1499  0.500383  0.931106

[1500 rows x 2 columns]
One-liner summary
df_out = dfA[np.sum((np.sqrt((np.atleast_2d(dfA['xA'].values) - np.atleast_2d(dfB['xB'].values).T) **2 + (np.atleast_2d(dfA['yA'].values) - np.atleast_2d(dfB['yB'].values).T) **2))<=buffer, axis=0)>0]
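If SciPy happens to be available (an assumption on my part, not something the answer relies on), the same selection can be written more readably with scipy.spatial.distance.cdist, which builds the same nB x nA Euclidean distance matrix:

from scipy.spatial.distance import cdist   # assumes SciPy is installed

# cdist builds the same (nB, nA) Euclidean distance matrix as the manual broadcasting above
D = cdist(dfB[['xB', 'yB']].values, dfA[['xA', 'yA']].values)
df_out = dfA[(D <= buffer).any(axis=0)]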
Detailed answer
Each point from dfA is tested for distance <= buffer against each point from dfB, to determine whether the corresponding row of dfA should be selected into df_out.
Distance matrix
Every column in the output below corresponds to a row in dfA (and every row to a row in dfB):
D = np.sqrt((np.atleast_2d(dfA['xA'].values) - np.atleast_2d(dfB['xB'].values).T) ** 2
          + (np.atleast_2d(dfA['yA'].values) - np.atleast_2d(dfB['yB'].values).T) ** 2)

D
[[0.69993476 0.18428648 0.80014469 ... 0.59437473 0.67485498 0.57688898]
 [0.40015149 0.44037255 0.41613029 ... 0.59552573 0.21951799 0.20088488]
 [0.49263227 0.53211262 1.12712974 ... 0.13891991 0.86169468 0.93107222]
 ...
 [0.535957   0.88861864 0.3107378  ... 0.91033173 0.2399422  0.38764484]
 [0.66504838 0.75431549 0.0906694  ... 0.93520197 0.24865128 0.14713943]
 [0.44096846 0.23140718 0.91123077 ... 0.17995671 0.6753528  0.69359482]]
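As a side note (an optional optimization, not part of the original code): since distances and buffer are both non-negative, the square root can be skipped by comparing squared distances against buffer**2, which avoids computing np.sqrt over the whole nB x nA matrix. A minimal sketch:

dx = np.atleast_2d(dfA['xA'].values) - np.atleast_2d(dfB['xB'].values).T
dy = np.atleast_2d(dfA['yA'].values) - np.atleast_2d(dfB['yB'].values).T
D2 = dx**2 + dy**2                        # squared distances, shape (nB, nA)
b = np.sum(D2 <= buffer**2, axis=0) > 0   # same boolean selector as in the next step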
Now filter by threshold (distance <= buffer)
At least one value <= buffer in a column of D is enough to select that column for df_out, i.e. the corresponding row in dfA.
b = np.sum(D <= buffer, axis=0) > 0   # boolean selector

b
[ True False False ... False  True  True]
# In this example, at least the first, last and second-to-last rows will be selected,
# and probably many more in between.
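Equivalently (same result, arguably clearer intent), np.any expresses "at least one value per column" directly:

b = np.any(D <= buffer, axis=0)   # True for each dfA point within buffer of at least one dfB point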
Finally, perform the row-wise selection on dfA:

df_out = dfA[b]

df_out
             xA        yA
0      0.636962  0.511822
4      0.813270  0.311831
9      0.935072  0.027559
10     0.815854  0.753513
11     0.002739  0.538143
...         ...       ...
14988  0.039833  0.239034
14994  0.243440  0.428287
14996  0.283829  0.828005
14998  0.273703  0.294932
14999  0.097813  0.366295

[5620 rows x 2 columns]
In this mock-up example, 5620 rows out of 15000 made it into the selection.
Naturally, any additional columns that your actual dfA might have will also be carried over into df_out.
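For illustration (the 'ping' column below is made up, not part of the question's data), the boolean selection preserves whatever columns dfA carries:

dfA['ping'] = np.arange(nA)   # hypothetical extra column
df_out = dfA[b]               # df_out now contains 'xA', 'yA' and 'ping' for the selected rows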
To go further (although this already represents a dramatic improvement over nested loops): can the number of distances actually calculated be reduced?
- There is no symmetry to exploit here to reduce the calculations, unlike in typical pairwise distance algorithms (which test a dataset against itself); all nA x nB distances have to be calculated.
- It might, however, be advantageous to work in chunks, calculating the next chunk of distances for a point only if the condition has not already been satisfied (see the sketch below).
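A rough, untested sketch of that chunking idea (only worthwhile if memory is tight or if many dfA points get selected early): process dfB in chunks and stop computing distances for dfA points that are already within the buffer of an earlier chunk.

xA, yA = dfA['xA'].values, dfA['yA'].values
xB, yB = dfB['xB'].values, dfB['yB'].values

chunk = 100                                  # arbitrary chunk size along dfB
selected = np.zeros(len(dfA), dtype=bool)    # running boolean selector

for start in range(0, len(dfB), chunk):
    todo = ~selected                         # only test dfA points not yet selected
    if not todo.any():
        break
    dx = xA[todo] - xB[start:start + chunk, None]
    dy = yA[todo] - yB[start:start + chunk, None]
    hit = np.any(dx**2 + dy**2 <= buffer**2, axis=0)
    selected[np.flatnonzero(todo)[hit]] = True

df_out = dfA[selected]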
qinsy_file_2.iterrows -> not sure what you have read, but you will find pretty much everywhere that you should not use iterrows… out_df = pd.concat([out_df, row_qinsy]) is something you should never ever do: it iteratively creates bigger and bigger dataframes, resulting in quadratic time (not to mention that creating a dataframe has a huge overhead). You should accumulate data in one or more lists and then create a dataframe from that. I expect this to be a significant bottleneck.
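To make that concrete, a minimal sketch of the pattern the comment recommends (the loop body is illustrative; keep whatever per-row logic you actually need):

# Anti-pattern: growing the dataframe inside the loop (quadratic time)
# out_df = pd.DataFrame()
# for _, row_qinsy in qinsy_file_2.iterrows():
#     out_df = pd.concat([out_df, row_qinsy])

# Recommended: accumulate rows in a plain list, build the dataframe once at the end
rows = []
for _, row_qinsy in qinsy_file_2.iterrows():   # iterrows itself is still slow; vectorize when possible
    rows.append(row_qinsy)                     # or append a dict with just the values you need
out_df = pd.DataFrame(rows)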