Can someone help me?

I need to run a linear regression for each group, where the groups are defined by two columns.

example:

    import pandas as pd
    import statsmodels.api as sm
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, DoubleType, StringType

    pdf = pd.DataFrame({'group_id':[1,1,1,2,2,2,3,3,3,3],
                        'sex':['M','M','F','F','M','F','M','F','F','M'],
                        'x':[0,1,2,0,1,5,2,3,4,5],
                        'y':[2,1,0,0,0.5,2.5,3,4,5,6]})
    df = sqlContext.createDataFrame(pdf)

    result_schema = StructType([
        StructField('group_id', DoubleType()),
        StructField('sex', StringType()),
        StructField('x', DoubleType())
    ])

    @pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
    def ols(df):
        group_id = df['group_id'].iloc[0]
        sex = df['sex'].iloc[0]
        y = df['y'].astype(int)
        X = df['x'].astype(int)
        X = sm.add_constant(X)
        model = sm.OLS(y, X).fit()
        return pd.DataFrame([[group_id] + [sex] + [model.params[1]]],
                            columns=['group_id'] + ['sex'] + ['x'])

    beta = df.groupby('group_id', 'sex').apply(ols)
    beta.show()

It returns this error:

PythonException: An exception was thrown from a UDF: 'IndexError: index out of bounds',

  • There is only 1 entry for some groups. How are you going to do a linear regression with 1 point? Commented Jan 29, 2021 at 19:33

2 Answers

2

Looking at the data

    df = pd.DataFrame({'group_id':[1,1,1,2,2,2,3,3,3,3],
                       'sex':['M','M','F','F','M','F','M','F','F','M'],
                       'x':[0,1,2,0,1,5,2,3,4,5],
                       'y':[2,1,0,0,0.5,2.5,3,4,5,6]})

and simply iterating over this groupby

    for name, sdf in df.groupby(['group_id', 'sex']):
        print(name)
        print(sdf)

we get

    (1, 'F')
       group_id sex  x    y
    2         1   F  2  0.0
    (1, 'M')
       group_id sex  x    y
    0         1   M  0  2.0
    1         1   M  1  1.0
    (2, 'F')
       group_id sex  x    y
    3         2   F  0  0.0
    5         2   F  5  2.5
    (2, 'M')
       group_id sex  x    y
    4         2   M  1  0.5
    (3, 'F')
       group_id sex  x    y
    7         3   F  3  4.0
    8         3   F  4  5.0
    (3, 'M')
       group_id sex  x    y
    6         3   M  2  3.0
    9         3   M  5  6.0

Two points always give a perfect linear fit, and you need at least three for the fit to deviate from the data at all. Some of these groups, (1, 'F') and (2, 'M'), have only one data point, which means no line can be fitted to them.
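To make the point about group sizes concrete, here is a minimal sketch of the closed-form least-squares slope in plain Python. The function name `ols_slope` is hypothetical and this is not what statsmodels does internally; it only illustrates that the slope divides by the spread of x, which is zero for a single point.

```python
def ols_slope(xs, ys):
    """Return the least-squares slope of y on x, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        # A single point does not determine a line.
        return None
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x; zero when every x is identical.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


print(ols_slope([2], [0.0]))          # group (1, 'F'): one point
print(ols_slope([0, 1], [2.0, 1.0]))  # group (1, 'M'): two points
print(ols_slope([2, 5], [3.0, 6.0]))  # group (3, 'M'): two points
```

The one-point group has no slope, while each two-point group gets the slope of the line through its two points.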


0

Some groups contain only one point, which makes a linear regression impossible. To catch that inside the function, check the number of rows in the DataFrame and return null for the slope when there is only one row.

    @pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
    def ols(df):
        group_id = df['group_id'].iloc[0]
        sex = df['sex'].iloc[0]
        if len(df) == 1:
            # A single row cannot be fitted: return null for the slope.
            return pd.DataFrame([[group_id] + [sex] + [None]],
                                columns=['group_id'] + ['sex'] + ['x'])
        else:
            # Note: astype(int) truncates fractional values (e.g. y = 2.5 becomes 2).
            y = df['y'].astype(int)
            X = df['x'].astype(int)
            X = sm.add_constant(X)
            model = sm.OLS(y, X).fit()
            return pd.DataFrame([[group_id] + [sex] + [model.params[1]]],
                                columns=['group_id'] + ['sex'] + ['x'])

    df.groupby('group_id', 'sex').apply(ols).show()

    +--------+---+-------------------+
    |group_id|sex|                  x|
    +--------+---+-------------------+
    |     2.0|  M|               null|
    |     3.0|  F|                1.0|
    |     1.0|  M|-1.0000000000000002|
    |     1.0|  F|               null|
    |     2.0|  F|0.39999999999999986|
    |     3.0|  M| 0.9999999999999998|
    +--------+---+-------------------+
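The `len(df) == 1` check covers the sample data. A group can also hold several rows that all share the same x value, and as far as I can tell `sm.add_constant` with its default `has_constant='skip'` then adds no intercept column, so `model.params[1]` would fail the same way. A slightly broader guard is sketched below in plain pandas, assuming a hypothetical helper `slope_or_none` and using `np.polyfit` in place of statsmodels so the snippet runs without Spark; the same `nunique()` test could sit at the top of the UDF.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                    'sex': ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
                    'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
                    'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})


def slope_or_none(group):
    """Hypothetical helper: slope of y on x, or None if x has fewer than 2 distinct values."""
    x = group['x'].astype(float)
    y = group['y'].astype(float)
    if x.nunique() < 2:
        # Covers one-row groups and groups whose x values are all identical.
        return None
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


# Map each (group_id, sex) key to its slope, or None when no line can be fitted.
slopes = {key: slope_or_none(group)
          for key, group in pdf.groupby(['group_id', 'sex'])}

for key, slope in slopes.items():
    print(key, slope)
```

This sketch keeps x and y as floats, so the slope for (2, 'F') comes out close to 0.5 rather than the 0.4 in the output above, which results from `astype(int)` truncating 2.5 to 2.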

3 Comments

Your answer is very good, but in my real case I use a large dataset grouped by two columns (like group_id and sex in the example). When I filter down to just a few values of sex and group_id it works, but when I use the whole dataset I get the error 'IndexError: index out of bounds'. The error occurs even when I group by one complete column. Can you help me again?
Sorry, but I don't understand what you were trying to do. Would you mind opening another question and providing the necessary details?
Thanks @mck. Unfortunately I can't share the data, but I fixed the problem.
