Apply a function to each element of a dataframe according to the values of other elements in the corresponding row in pandas and Python

Question

I am fairly new to pandas and Python. I would like to know whether my code is more or less fine, or whether it is too tied to the way of thinking of the programming framework I used before.

This is the input file, "test.csv":

    name,A,B,C,D,E,F
    x,1,2,3,0,5,6
    y,5,5,6,0,4,2
    z,2,3,3,0,1,1
    2012-01-01,106.20,48.80,41.60,1015.04,211.13,643.55
    2012-02-01,8.40,-9999.,4.80,15.36,0.37,0.02
    2012-03-01,5.20,7.00,12.20,42.70,2.60,0.33
    2012-04-01,45.60,29.80,48.20,718.18,-9999.,373.28
    2012-05-01,-9999.,21.20,18.30,193.98,17.75,10.34
    2012-06-01,122.40,95.30,103.00,4907.95,2527.59,37253.17
    2012-07-01,-9999.,98.50,83.70,4122.23,1725.15,21355.74
    2012-08-01,-9999.,113.00,94.80,5356.20,2538.84,40836.42
    2012-09-01,-9999.,97.80,96.90,4738.41,2295.76,32667.42
    2012-10-01,50.20,52.60,47.90,1259.77,301.71,1141.42
    2012-11-01,76.40,-9999.,118.00,5858.70,3456.63,60814.94
    2012-12-01,73.80,41.90,31.10,651.55,101.32,198.23

Basically, I have six stations [A,B,C,D,E,F], each with a position in x, y and z. Each station has a value that changes every month.

I have to apply a function to each value, selecting the two closest values according to the x and y rows.

This is the code:

    import numpy as np
    import pandas as pd
    from scipy.spatial import distance

    # name of the file, to be put in the same folder as the Python script
    f_name = './test.csv'

    # read the station positions (rows x, y and z)
    dfrGEO = pd.read_csv(f_name, parse_dates=True, index_col=0, nrows=3,
                         infer_datetime_format=True, cache_dates=True).replace(-9999.0, np.nan)

    # read the station values changing in time
    dfrDATA = pd.read_csv(f_name, parse_dates=True, index_col=0, skiprows=[1,2,3],
                          infer_datetime_format=True, cache_dates=True).replace(-9999.0, np.nan)

    # this is the function to apply to each element of dfrDATA
    def specific_function(p_ref, p_near, d_near):
        res = p_ref - np.sum(p_near/d_near)
        return res

    # function that passes the two closest values to specific_function
    def func_each_row(xrow, dd, col_name):
        p_ref = xrow[col_name]
        if np.isnan(p_ref):
            res = 0.
        else:
            # row 0: station values, row 1: distances to the reference station
            values = np.vstack((xrow, dd))
            # sort the stations by distance, then drop those with NaN values
            values = values[:, values[1, :].argsort()]
            values = values[:, ~np.isnan(values).any(axis=0)]
            # column 0 is the reference station itself (distance 0)
            p_near = values[0,1:3]
            d_near = values[1,1:3]
            res = specific_function(p_ref, p_near, d_near)
        return res

    # compute the distances between all stations
    xp = dfrGEO.loc['x'].values
    yp = dfrGEO.loc['y'].values
    r_xy = np.vstack((xp, yp)).T
    dd = distance.cdist(r_xy, r_xy, 'euclidean')

    # function that passes the data column by column
    def func_each_column(x, dd, dfr):
        col = x.name
        ix = list(dfr.columns).index(col)
        res = dfr.apply(func_each_row, args=(dd[ix,:], col), axis=1)
        return res

    modDfObj = dfrDATA.apply(func_each_column, args=(dd, dfrDATA))

I have several questions:

Is there any more efficient method/structure to do that?
Is the code written properly?
Can I parallelize it?

Thanks in advance

You use the same f_name - which is undefined - for two different files even though you haven't shown it changing values. This doesn't make any sense. Please show all of your code.
– Reinderien, Commented Mar 2, 2022 at 15:55

Lodinn · Accepted Answer · 2022-03-03 12:26:12Z

It seems that you have some old-school file with headers there. Overall, the approach of reading it in two passes is fine - the rest, however, is fairly hard to read. Variable names convey little meaning to an uninitiated reader (which someone supporting this piece of software 10 years from now very well might be!), and comments only explain the technical details of the functions, not what these functions are for. It is as if you put the comment "# adds a and b" above the line c = a + b without explaining what you are trying to achieve in the first place.

With that said, it seems to me that what you are actually trying to do is compare the output from a given station to an interpolated value from its three closest neighbors (weighted average). I would also imagine you will eventually use this code with actual geocoordinates, in which case scipy's Euclidean distance may be a poor fit and you would want to at least implement the haversine formula.
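As a rough illustration of the haversine suggestion, here is a minimal sketch of a pairwise great-circle distance matrix that could stand in for distance.cdist(..., 'euclidean'). The function name haversine_matrix, the mean Earth radius of 6371 km and the sample coordinates are assumptions made for the example; the question's file only contains unitless x, y and z values.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_matrix(lat_deg, lon_deg):
    """Pairwise great-circle distances in km between all stations."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcast to (n, n) matrices of latitude and longitude differences.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Hypothetical stations: two on the equator 90 degrees apart, one at the pole.
lat = [0.0, 0.0, 90.0]
lon = [0.0, 90.0, 0.0]
dist_km = haversine_matrix(lat, lon)
```

The result has the same square, symmetric shape as the dd matrix in the question, so the rest of the code would not need to change.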

More importantly, your way of passing the list of three closest stations to the columnwise function is extremely obscure. Why not just create a dictionary mapping column names to the three closest stations and the respective distances to them, instead of relying on low-level abstractions and slicing arrays? It is all too easy to get lost in what e.g. values[1,1:3] is supposed to mean: packing heterogeneous data like that in a tuple is acceptable, but a numpy array... Not so much. The legacy of the file format you are using starts to show in the parsing code as well - isolate it to the very first lines of parsing the file and do not use the same approach when dealing with high-level abstractions. Pet peeve: skiprows=3 is cleaner than skiprows=[1,2,3].
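One possible shape for such a dictionary is sketched below, using the x and y rows from the question's test.csv. The helper name nearest_neighbours and the parameter k are made up for the illustration; k defaults to 2 because the question's code slices two neighbors, and it can be set to 3. Note that this ranks stations by distance only and does not look at NaNs in individual rows.

```python
import numpy as np
from scipy.spatial import distance

# Station names and coordinates copied from the x and y rows of test.csv.
stations = ["A", "B", "C", "D", "E", "F"]
x = np.array([1, 2, 3, 0, 5, 6], dtype=float)
y = np.array([5, 5, 6, 0, 4, 2], dtype=float)

def nearest_neighbours(names, x, y, k=2):
    """Map each station name to its k nearest stations as (name, distance) pairs."""
    xy = np.column_stack((x, y))
    dist = distance.cdist(xy, xy, "euclidean")
    neighbours = {}
    for i, name in enumerate(names):
        order = np.argsort(dist[i], kind="stable")
        order = order[order != i][:k]  # drop the station itself, keep the k closest
        neighbours[name] = [(names[j], float(dist[i, j])) for j in order]
    return neighbours

neighbours = nearest_neighbours(stations, x, y, k=2)
```

A lookup such as neighbours["A"] then returns named pairs, which reads more clearly than indexing into a stacked array.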

Finally, I would suggest checking out meteostat - it has a whole bunch of routines similar to what you are dealing with here and it is organized fairly neatly.

Thanks. skiprows=3 is equal to skiprows=[0,1,2]. Am I wrong? Also, how can I create a dictionary mapping column names to the three closest stations without NaN?
– diedro, Commented Mar 3, 2022 at 14:34
Yes, sorry, my bad: it should be header=0, skiprows=range(1,4). If you want to select the three closest stations on a row-by-row basis, ignoring those giving NaNs, that is slightly different from what you described, and there is probably no good way of doing it. But it also raises more serious concerns about the overall logic: what happens if 5 out of 6 stations have NaNs in a given row and you attempt to select 3 without NaNs? Arguably worse still: suppose you have hundreds of stations and, for whatever reason, you end up interpolating data from hundreds of kilometers away.
– Lodinn, Commented Mar 3, 2022 at 15:58
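The skiprows point can be checked on a cut-down copy of the file. This is only a small sketch: the two-column CSV text is a shortened version of test.csv, and range(1, 4) is used because it covers the same rows as [1, 2, 3].

```python
import io
import pandas as pd

# Shortened version of test.csv from the question (two stations, two months).
csv_text = """name,A,B
x,1,2
y,5,5
z,2,3
2012-01-01,106.20,48.80
2012-02-01,8.40,-9999.
"""

# An integer skips that many lines from the top of the file, header included,
# so the 'z' row ends up being used as the header.
by_count = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=3)

# A list (or range) skips exactly those 0-based line numbers and keeps line 0
# as the header.
by_list = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=[1, 2, 3])
by_range = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0,
                       skiprows=range(1, 4))

print(list(by_count.columns))  # column labels taken from the 'z' row
print(list(by_list.columns))   # the real station names
```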
Do take a look at meteostat: the logic implemented there is pretty reasonable (select stations within a 30 km circle by default and interpolate data from those stations only). I mean... you do you, and having to loop through both columns and rows is the sole solution to the problem you are posing, but the main reason people are bothered when their code doesn't look clean is that it is usually an indication of a deeper problem, resulting in poor maintainability of the system as a whole. I would not recommend selecting the three closest stations on a per-row basis even in a one-off research project.
– Lodinn, Commented Mar 3, 2022 at 16:04
You totally get the point. Are you familiar with kriging and with the concept of cross-validation?
– diedro, Commented Mar 3, 2022 at 17:21
The idea could be to add a check in order to skip the station and time step when there are no close stations.
– diedro, Commented Mar 3, 2022 at 17:26
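A check of that kind might look roughly like the sketch below. The function name usable_neighbours, the parameters k and max_dist, and the cut-off of 3.0 are all assumptions for illustration; the coordinates in test.csv carry no stated units, so a real threshold would have to come from the data. The row is the 2012-02-01 line of the question's file with -9999. already replaced by NaN.

```python
import numpy as np
import pandas as pd

stations = ["A", "B", "C", "D", "E", "F"]
x = np.array([1, 2, 3, 0, 5, 6], dtype=float)
y = np.array([5, 5, 6, 0, 4, 2], dtype=float)

def usable_neighbours(row, dist_from_station, station, k=2, max_dist=np.inf):
    """Return the k closest stations that have data in this row and lie within
    max_dist, or None when fewer than k such stations exist."""
    candidates = dist_from_station.drop(station)           # never use the station itself
    has_data = row[candidates.index].notna().to_numpy()    # ignore NaN readings
    candidates = candidates[has_data]
    candidates = candidates[candidates <= max_dist]        # ignore far-away stations
    if len(candidates) < k:
        return None                                        # caller skips this point
    return candidates.nsmallest(k)

# Values for 2012-02-01; station B is missing.
row = pd.Series([8.40, np.nan, 4.80, 15.36, 0.37, 0.02], index=stations)
dist_from_a = pd.Series(np.hypot(x - x[0], y - y[0]), index=stations)

unlimited = usable_neighbours(row, dist_from_a, "A", k=2)
limited = usable_neighbours(row, dist_from_a, "A", k=2, max_dist=3.0)
```

Returning None instead of a value makes the "skip" explicit, so the caller can write NaN for that station and month rather than interpolating from distant stations.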
