@pandas_udf - Linear regressions

Question

Can someone help me?

I need to run a linear regression for each group, where the groups are defined by two columns (group_id and sex).

Example:

    import pandas as pd
    import statsmodels.api as sm
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, DoubleType, StringType

    pdf = pd.DataFrame({
        'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        'sex': ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
        'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
        'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

    # sqlContext is assumed to be provided by the notebook or shell session
    df = sqlContext.createDataFrame(pdf)

    result_schema = StructType([
        StructField('group_id', DoubleType()),
        StructField('sex', StringType()),
        StructField('x', DoubleType())
    ])

    @pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
    def ols(df):
        group_id = df['group_id'].iloc[0]
        sex = df['sex'].iloc[0]
        y = df['y'].astype(int)
        X = df['x'].astype(int)
        X = sm.add_constant(X)
        model = sm.OLS(y, X).fit()
        return pd.DataFrame([[group_id] + [sex] + [model.params[1]]],
                            columns=['group_id'] + ['sex'] + ['x'])

    beta = df.groupby('group_id', 'sex').apply(ols)
    beta.show()

It returns this error:

PythonException: An exception was thrown from a UDF: 'IndexError: index out of bounds',

There is only 1 entry for some groups. How are you going to do a linear regression with 1 point?
– mck, Commented Jan 29, 2021 at 19:33

Paul Brennan · Accepted Answer · 2021-01-29 19:36:07Z

Looking at the data

    df = pd.DataFrame({
        'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        'sex': ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
        'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
        'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

and simply iterating over the same group-by in plain pandas

    for name, sdf in df.groupby(['group_id', 'sex']):
        print(name)
        print(sdf)

we get

    (1, 'F')
       group_id sex  x    y
    2         1   F  2  0.0
    (1, 'M')
       group_id sex  x    y
    0         1   M  0  2.0
    1         1   M  1  1.0
    (2, 'F')
       group_id sex  x    y
    3         2   F  0  0.0
    5         2   F  5  2.5
    (2, 'M')
       group_id sex  x    y
    4         2   M  1  0.5
    (3, 'F')
       group_id sex  x    y
    7         3   F  3  4.0
    8         3   F  4  5.0
    (3, 'M')
       group_id sex  x    y
    6         3   M  2  3.0
    9         3   M  5  6.0

Two points always give a perfect linear fit, and you need at least three before the fit can show any deviation from the data. Some of these groups have only one data point, which means you cannot fit a line to them at all.
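One way to spot the problem groups before fitting anything is to count the rows per group. This is a minimal sketch in plain pandas on the example data from the question; the names `sizes` and `too_small` are only illustrative.

```python
import pandas as pd

pdf = pd.DataFrame({
    'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'sex': ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
    'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

# Number of rows in each (group_id, sex) combination
sizes = pdf.groupby(['group_id', 'sex']).size()

# Groups with fewer than two rows cannot support a straight-line fit
too_small = sizes[sizes < 2]
print(too_small)
```

Here the groups (1, 'F') and (2, 'M') are the ones with a single row.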

mck · Accepted Answer · 2021-01-29 19:47:05Z

Some groups have only one point, which makes a linear regression impossible. To catch that inside the function, you can check the number of rows in the dataframe and return null when there is only one row.

    @pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
    def ols(df):
        group_id = df['group_id'].iloc[0]
        sex = df['sex'].iloc[0]
        if len(df) == 1:
            return pd.DataFrame([[group_id] + [sex] + [None]],
                                columns=['group_id'] + ['sex'] + ['x'])
        else:
            y = df['y'].astype(int)
            X = df['x'].astype(int)
            X = sm.add_constant(X)
            model = sm.OLS(y, X).fit()
            return pd.DataFrame([[group_id] + [sex] + [model.params[1]]],
                                columns=['group_id'] + ['sex'] + ['x'])

    df.groupby('group_id', 'sex').apply(ols).show()

    +--------+---+-------------------+
    |group_id|sex|                  x|
    +--------+---+-------------------+
    |     2.0|  M|               null|
    |     3.0|  F|                1.0|
    |     1.0|  M|-1.0000000000000002|
    |     1.0|  F|               null|
    |     2.0|  F|0.39999999999999986|
    |     3.0|  M| 0.9999999999999998|
    +--------+---+-------------------+
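The `len(df) == 1` check covers single-row groups. On a larger dataset a slightly broader guard may be worth considering: as far as I know, `sm.add_constant` defaults to `has_constant='skip'`, so when every `x` in a group has the same nonzero value no intercept column is added, `model.params` ends up with a single entry, and `model.params[1]` can raise the same `IndexError` even with several rows. The sketch below is an illustration under those assumptions, not the method used in this answer: it swaps in plain pandas and `np.polyfit` for Spark and statsmodels so that it runs on its own, and the helper name `slope_or_none` is hypothetical. It also keeps `y` as float, so group (2, 'F') gives 0.5 rather than the 0.4 shown above, where `astype(int)` truncates 2.5 to 2.

```python
import numpy as np
import pandas as pd

def slope_or_none(group):
    """Return the least-squares slope of y on x, or None if it is not identifiable."""
    x = group['x'].to_numpy(dtype=float)
    y = group['y'].to_numpy(dtype=float)
    # A slope needs at least two rows and at least two distinct x values
    if len(x) < 2 or np.ptp(x) == 0:
        return None
    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)

pdf = pd.DataFrame({
    'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'sex': ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
    'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

# One slope (or None) per (group_id, sex) combination
slopes = {name: slope_or_none(g) for name, g in pdf.groupby(['group_id', 'sex'])}
for name, slope in slopes.items():
    print(name, slope)
```

In the Spark version, the equivalent guard could sit at the top of the grouped-map function, for example `if len(df) < 2 or df['x'].nunique() < 2:`, before `sm.OLS` is called.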

Your answer is very good, but in my real case I use a large dataset grouped by two columns (like group_id and sex in the example). When I filter down to just a few values of sex and group_id it works; however, when I run it on the whole dataset I get the error 'IndexError: index out of bounds'. The error occurs even when I group by a single column. Can you help me again?
Sorry, but I don't understand what you were trying to do. Would you mind opening another question and providing the necessary details?
Thanks @mck. Unfortunately I can't share the data, but I fixed the problem.
