
Is there a fast and efficient way to unpivot a DataFrame? I have used the following methods, and although both work on sample data, on the full data set they run for hours and never complete.

Method 1:

from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("question_id"), col(c).alias("response_value"))
        for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.question_id", "kvs.response_value"])
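For reference, a minimal usage sketch of the helper above; the DataFrame name response_df and the user_id key column are assumptions taken from Method 2 below.

# Hypothetical call: unpivot every column except user_id.
# The assert above means the remaining columns must all share one type.
long_df = to_long(response_df, ["user_id"])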

Method 2:

from pyspark.sql import Row

def rowExpander(row):
    rowDict = row.asDict()
    valA = rowDict.pop('user_id')
    for k in rowDict:
        yield Row(**{'user_id': valA, 'question_id': k, 'response_value': row[k]})

user_response_df = spark.createDataFrame(response_df.rdd.flatMap(rowExpander))

2 Answers


Maybe you can try selecting each column as a new dataframe and then unioning all of them, like this:

from functools import reduce
from pyspark.sql import functions as F

# Get all columns except 'user_id'
cols = [c for c in df.columns if c != 'user_id']

# Select user_id plus one other column as a new dataframe:
# the column name becomes `question_id` and its value becomes `response_value`.
# Then union all of these new dataframes together.
df = reduce(
    lambda df1, df2: df1.union(df2),
    [df.select('user_id',
               F.lit(c).alias('question_id'),
               F.col(c).alias('response_value'))
     for c in cols]
)
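One caveat, as a minimal sketch reusing df, cols, and F from above: union matches columns positionally and requires compatible types, so if the question columns have mixed types, cast response_value to a common type first.

frames = [
    df.select('user_id',
              F.lit(c).alias('question_id'),
              # cast to a common type (string here) so the unions line up
              F.col(c).cast('string').alias('response_value'))
    for c in cols
]
long_df = reduce(lambda d1, d2: d1.union(d2), frames)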

df.selectExpr('col1', 'stack(2, "col2", col2, "col3", col3) as (cols, values)')
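For the schema in the question, the stack() expression can also be built programmatically over every response column instead of being written out by hand. A sketch under the assumption that the frame is called response_df with a user_id key column (names taken from the question), noting that stack() needs the stacked values to share a common type:

# Build "stack(n, 'col1', `col1`, 'col2', `col2`, ...)" over all columns
# except user_id (assumes plain column names with no quotes in them).
value_cols = [c for c in response_df.columns if c != 'user_id']
stack_expr = "stack({n}, {args}) as (question_id, response_value)".format(
    n=len(value_cols),
    args=", ".join("'{0}', `{0}`".format(c) for c in value_cols),
)
long_df = response_df.selectExpr('user_id', stack_expr)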
