
How can we explode multiple array columns in Spark? I have a dataframe with 5 stringified array columns and I want to explode all 5. Showing an example with 3 columns for the sake of simplicity.

If I have the following input row:

col1                 col2                              col3
["b_val1","b_val2"]  ["at_val1","at_val2","at_val3"]   ["male","female"]

I want to explode on all 3 array columns, so the output should look like:

b_val1 at_val1 male
b_val1 at_val1 female
b_val2 at_val1 male
b_val2 at_val1 female
b_val1 at_val2 male
b_val1 at_val2 female
b_val2 at_val2 male
b_val2 at_val2 female
b_val1 at_val3 male
b_val1 at_val3 female
b_val2 at_val3 male
b_val2 at_val3 female
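Exploding the three arrays amounts to taking their Cartesian product, so the expected row count is 2 × 3 × 2 = 12. A quick plain-Python sanity check of that expectation (row order may differ from what Spark produces):

```python
from itertools import product

# The three arrays from the example input row
brands = ["b_val1", "b_val2"]
article_types = ["at_val1", "at_val2", "at_val3"]
genders = ["male", "female"]

# Exploding all three columns yields every combination: 2 * 3 * 2 = 12 rows
rows = list(product(brands, article_types, genders))
print(len(rows))                                 # 12
print(("b_val2", "at_val3", "female") in rows)   # True
```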

I tried the following:

SELECT timestamp,
       explode(from_json(brandList, 'array<string>')) AS brand,
       explode(from_json(articleTypeList, 'array<string>')) AS articleTypeList,
       explode(from_json(gender, 'array<string>')) AS gender,
       explode(from_json(masterCategoryList, 'array<string>')) AS masterCategoryList,
       explode(from_json(subCategoryList, 'array<string>')) AS subCategoryList,
       isLandingPage, ...
FROM table

but this is not allowed and I get the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 5: explode(jsontostructs(brandList)), explode(jsontostructs(articleTypeList)), explode(jsontostructs(gender)), explode(jsontostructs(masterCategoryList)), explode(jsontostructs(subCategoryList));

1 Answer


Use withColumn with explode to get the required output.

Let's create a sample dataframe with 3 columns of ArrayType and perform the explode operation:

import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val rdd = spark.sparkContext.makeRDD(List(Row(Array(1, 2, 3), Array("a", "b", "c"), Array("1a", "1b", "1c"))))
val schema = new StructType()
  .add("arraycolumn1", ArrayType(IntegerType))
  .add("arraycolumn2", ArrayType(StringType))
  .add("arraycolumn3", ArrayType(StringType))
val df = spark.createDataFrame(rdd, schema)
df.show(5, false)

+------------+------------+------------+
|arraycolumn1|arraycolumn2|arraycolumn3|
+------------+------------+------------+
|[1, 2, 3]   |[a, b, c]   |[1a, 1b, 1c]|
+------------+------------+------------+

val explodedDF = df
  .withColumn("column1", explode('arraycolumn1))
  .withColumn("column2", explode('arraycolumn2))
  .withColumn("column3", explode('arraycolumn3))
explodedDF.select('column1, 'column2, 'column3).show(5, false)

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|1      |a      |1a     |
|1      |a      |1b     |
|1      |a      |1c     |
|1      |b      |1a     |
|1      |b      |1b     |
+-------+-------+-------+
only showing top 5 rows

Let's do the above steps with fewer lines of code:

val exploded = df.columns.foldLeft(df)((df, column) => df.withColumn(column, explode(col(column))))
exploded.select(df.columns.map(col(_)): _*).show(false)
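The foldLeft works because each explode multiplies the current rows by one array column. A plain-Python model of the same fold (the helper name explode_column is hypothetical, not a Spark API):

```python
from functools import reduce

# Hypothetical model: a "row" is a tuple of cells, where cell idx holds a list.
# explode_column() emits one output row per element of that list, mirroring
# what a single withColumn(column, explode(col(column))) step does.
def explode_column(rows, idx):
    return [row[:idx] + (value,) + row[idx + 1:] for row in rows for value in row[idx]]

# One input row with three array columns, as in the Scala example
table = [([1, 2, 3], ["a", "b", "c"], ["1a", "1b", "1c"])]

# Fold over the column indices, exploding one column at a time,
# just like foldLeft over df.columns
exploded = reduce(explode_column, range(3), table)
print(len(exploded))   # 27 rows: 3 * 3 * 3
print(exploded[0])     # (1, 'a', '1a')
```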

Using spark-sql

df.createOrReplaceTempView("arrayTable")
spark.sql("""
  SELECT column1, column2, column3
  FROM arrayTable
  LATERAL VIEW explode(arraycolumn1) AS column1
  LATERAL VIEW explode(arraycolumn2) AS column2
  LATERAL VIEW explode(arraycolumn3) AS column3
""").show
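For anyone wanting this in PySpark, the foldLeft translates naturally to functools.reduce. A minimal sketch, assuming an existing SparkSession and a DataFrame df whose columns are all of ArrayType, as in the example above:

```python
from functools import reduce
from pyspark.sql.functions import col, explode

# Sketch: explode every array column in turn, mirroring the Scala foldLeft.
# Assumes `df` already exists and all of its columns are arrays.
exploded = reduce(
    lambda acc, c: acc.withColumn(c, explode(col(c))),
    df.columns,
    df,
)
exploded.show(truncate=False)
```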

3 Comments

Any way to do this in SQL? Don't have the flexibility to write scala code
We can do that using LATERAL VIEW; I've edited the post, please have a look.
How to do this in PySpark?
