I am fairly new to Pandas and Python. I would like to know whether my code is reasonable or whether it is still too tied to the way of thinking of the programming framework I used before.
I have this input file, "test.csv":
    name,A,B,C,D,E,F
    x,1,2,3,0,5,6
    y,5,5,6,0,4,2
    z,2,3,3,0,1,1
    2012-01-01,106.20,48.80,41.60,1015.04,211.13,643.55
    2012-02-01,8.40,-9999.,4.80,15.36,0.37,0.02
    2012-03-01,5.20,7.00,12.20,42.70,2.60,0.33
    2012-04-01,45.60,29.80,48.20,718.18,-9999.,373.28
    2012-05-01,-9999.,21.20,18.30,193.98,17.75,10.34
    2012-06-01,122.40,95.30,103.00,4907.95,2527.59,37253.17
    2012-07-01,-9999.,98.50,83.70,4122.23,1725.15,21355.74
    2012-08-01,-9999.,113.00,94.80,5356.20,2538.84,40836.42
    2012-09-01,-9999.,97.80,96.90,4738.41,2295.76,32667.42
    2012-10-01,50.20,52.60,47.90,1259.77,301.71,1141.42
    2012-11-01,76.40,-9999.,118.00,5858.70,3456.63,60814.94
    2012-12-01,73.80,41.90,31.10,651.55,101.32,198.23

Basically, I have six stations [A,B,C,D,E,F], with a position in x, in y and z. Each station has a value that changes each month.
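I have to apply a function to each value. For a given station and month, the function uses the values of the two closest stations that have valid data, where "closest" is measured by the Euclidean distance computed from the x and y rows. The value -9999. marks missing data.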
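To make the intent concrete, here is a minimal sketch of the computation for a single station and a single month. The helper name `nearest_two_result` and the hard-coded arrays are assumptions for illustration only (the numbers are copied from the x and y rows and from the 2012-02-01 row of the file). It excludes the reference station explicitly instead of relying on it being first after sorting, so it is not a drop-in replacement for the real code.

```python
import numpy as np

# Station coordinates (x, y) for A..F, copied from the x and y rows of test.csv
xy = np.array([[1, 5], [2, 5], [3, 6], [0, 0], [5, 4], [6, 2]], dtype=float)

# Values for 2012-02-01, with -9999. already replaced by NaN (station B is missing)
row = np.array([8.40, np.nan, 4.80, 15.36, 0.37, 0.02])


def nearest_two_result(row, xy, ref):
    """Hypothetical helper: p_ref - sum(p_near / d_near) for station index `ref`."""
    if np.isnan(row[ref]):
        return 0.0  # assumed convention: a missing reference value gives 0
    # Euclidean distance from the reference station to every station
    dist = np.linalg.norm(xy - xy[ref], axis=1)
    # Candidates: every other station that has a valid value
    valid = ~np.isnan(row)
    valid[ref] = False
    candidates = np.flatnonzero(valid)
    # Keep the two candidates with the smallest distance
    nearest = candidates[np.argsort(dist[candidates])[:2]]
    return row[ref] - np.sum(row[nearest] / dist[nearest])


# Station A: B is the closest station but has no data, so C and E are used
result_A = nearest_two_result(row, xy, ref=0)
print(result_A)
```

For station A in that month, B is skipped because its value is missing, so the two neighbours used are C and E.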
This is the code:
""" name of the file to be put of the same folder of the python script """ fname = './test.csv' """ read the station positions """ dfrGEO = pd.read_csv(f_name, parse_dates = True, index_col = 0, nrows = 3, infer_datetime_format = True, cache_dates=True).replace(-9999.0, np.nan) """ read the station values changing in time """ dfrDATA = pd.read_csv(f_name, parse_dates = True, index_col = 0, skiprows = [1,2,3], infer_datetime_format = True, cache_dates=True).replace(-9999.0, np.nan) """ this is the function to apply to each element of the dfrDATA """ def specific_function(p_ref,p_near,d_near): res = p_ref - np.sum(p_near/d_near) return res """ function that pass the two closest values to the function """ def func_each_row(xrow,dd,col_name): # icout = 0 p_ref = xrow[col_name] if np.isnan(p_ref): res = 0. else: values = np.vstack((xrow, dd)) values = values[:, values[1, :].argsort()] values = values[:, ~np.isnan(values).any(axis=0)] p_near = values[0,1:3] d_near = values[1,1:3] res = specific_function(p_ref,p_near,d_near) return res """ compute the distances between all stations """ xp = dfrGEO.loc['x'].values yp = dfrGEO.loc['y'].values r_xy = np.vstack((xp, yp)).T dd = distance.cdist(r_xy, r_xy, 'euclidean') """ function that pass column by column """ def func_each_column(x,dd,dfr): col = x.name ix = list(dfrDATA.columns).index(col) res = dfrDATA.apply(func_each_row, args=(dd[ix,:],col),axis=1) return res modDfObj = dfrDATA.apply(func_each_column, args=(dd, dfrDATA)) I have different questions:
- Is there a more efficient method or data structure for doing this?
- Is the code written properly?
- Can I parallelize it?
Thanks in advance.