
How can we explode multiple array columns in Spark? I have a dataframe with 5 stringified array columns and I want to explode all 5. Showing an example with 3 columns for the sake of simplicity.

If I have the following input row:

col1                 col2                              col3
["b_val1","b_val2"]  ["at_val1","at_val2","at_val3"]   ["male","female"]

I want to explode on all 3 array columns, so the output should look like:

b_val1 at_val1 male
b_val1 at_val1 female
b_val2 at_val1 male
b_val2 at_val1 female
b_val1 at_val2 male
b_val1 at_val2 female
b_val2 at_val2 male
b_val2 at_val2 female
b_val1 at_val3 male
b_val1 at_val3 female
b_val2 at_val3 male
b_val2 at_val3 female
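Exploding the three arrays amounts to taking their Cartesian product, so the expected row count is 2 × 3 × 2 = 12. A quick plain-Python sanity check of that expectation (row order may differ from what Spark produces):

```python
from itertools import product

# The three arrays from the example input row
brands = ["b_val1", "b_val2"]
article_types = ["at_val1", "at_val2", "at_val3"]
genders = ["male", "female"]

# Exploding all three columns yields every combination: 2 * 3 * 2 = 12 rows
rows = list(product(brands, article_types, genders))
print(len(rows))                                 # 12
print(("b_val2", "at_val3", "female") in rows)   # True
```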

I tried the following:

SELECT timestamp,
       explode(from_json(brandList, 'array<string>')) AS brand,
       explode(from_json(articleTypeList, 'array<string>')) AS articleTypeList,
       explode(from_json(gender, 'array<string>')) AS gender,
       explode(from_json(masterCategoryList, 'array<string>')) AS masterCategoryList,
       explode(from_json(subCategoryList, 'array<string>')) AS subCategoryList,
       isLandingPage, ...
FROM table

but this is not allowed and I get the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 5: explode(jsontostructs(brandList)), explode(jsontostructs(articleTypeList)), explode(jsontostructs(gender)), explode(jsontostructs(masterCategoryList)), explode(jsontostructs(subCategoryList));

1 Answer


Use withColumn with explode to get the required output.

Let's create a sample dataframe with 3 columns of ArrayType and perform the explode operation:

import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val rdd = spark.sparkContext.makeRDD(List(Row(Array(1, 2, 3), Array("a", "b", "c"), Array("1a", "1b", "1c"))))
val schema = new StructType()
  .add("arraycolumn1", ArrayType(IntegerType))
  .add("arraycolumn2", ArrayType(StringType))
  .add("arraycolumn3", ArrayType(StringType))
val df = spark.createDataFrame(rdd, schema)
df.show(5, false)

+------------+------------+------------+
|arraycolumn1|arraycolumn2|arraycolumn3|
+------------+------------+------------+
|[1, 2, 3]   |[a, b, c]   |[1a, 1b, 1c]|
+------------+------------+------------+

val explodedDF = df
  .withColumn("column1", explode('arraycolumn1))
  .withColumn("column2", explode('arraycolumn2))
  .withColumn("column3", explode('arraycolumn3))
explodedDF.select('column1, 'column2, 'column3).show(5, false)

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|1      |a      |1a     |
|1      |a      |1b     |
|1      |a      |1c     |
|1      |b      |1a     |
|1      |b      |1b     |
+-------+-------+-------+
only showing top 5 rows

Let's do the above steps with fewer lines of code:

val exploded = df.columns.foldLeft(df)((df, column) => df.withColumn(column, explode(col(column))))
exploded.select(df.columns.map(col(_)): _*).show(false)
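The foldLeft works because each explode multiplies the current rows by one array column. A plain-Python model of the same fold (the helper name explode_column is hypothetical, not a Spark API):

```python
from functools import reduce

# Hypothetical model: a "row" is a tuple of cells, where cell idx holds a list.
# explode_column() emits one output row per element of that list, mirroring
# what a single withColumn(column, explode(col(column))) step does.
def explode_column(rows, idx):
    return [row[:idx] + (value,) + row[idx + 1:] for row in rows for value in row[idx]]

# One input row with three array columns, as in the Scala example
table = [([1, 2, 3], ["a", "b", "c"], ["1a", "1b", "1c"])]

# Fold over the column indices, exploding one column at a time,
# just like foldLeft over df.columns
exploded = reduce(explode_column, range(3), table)
print(len(exploded))   # 27 rows: 3 * 3 * 3
print(exploded[0])     # (1, 'a', '1a')
```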

Using spark-sql

df.createOrReplaceTempView("arrayTable")
spark.sql("""
  SELECT column1, column2, column3
  FROM arrayTable
  LATERAL VIEW explode(arraycolumn1) AS column1
  LATERAL VIEW explode(arraycolumn2) AS column2
  LATERAL VIEW explode(arraycolumn3) AS column3
""").show
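For anyone wanting this in PySpark, the foldLeft translates naturally to functools.reduce. A minimal sketch, assuming an existing SparkSession and a DataFrame df whose columns are all of ArrayType, as in the example above:

```python
from functools import reduce
from pyspark.sql.functions import col, explode

# Sketch: explode every array column in turn, mirroring the Scala foldLeft.
# Assumes `df` already exists and all of its columns are arrays.
exploded = reduce(
    lambda acc, c: acc.withColumn(c, explode(col(c))),
    df.columns,
    df,
)
exploded.show(truncate=False)
```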

3 Comments

Any way to do this in SQL? Don't have the flexibility to write scala code
We can do that using LATERAL VIEW; I've edited the post, please have a look.
How to do this in PySpark?
