Split string column based on delimiter and create columns for each value in Pyspark

Question

I have 1000s of files with data in the below format:

    a|b|c|clm4=1|clm5=3
    a|b|c|clm4=9|clm6=60|clm7=23

And I want to read it and convert it to a dataframe as below:

    clm1|clm2|clm3|clm4|clm5|clm6|clm7
    a|b|c|1|3|null|null
    a|b|c|9|null|60|23

I have tried the below method:

    files = [f for f in glob.glob(pathToFile + "/**/*.txt.gz", recursive=True)]
    df = spark.read.load(files, format='csv', sep='|', header=None)

But it gives me the result below:

    clm1, clm2, clm3, clm4, clm5
    a, b, c, 1, 3
    a, b, c, 9, null

To use this method I have to write getItem() for each column, which is not feasible because there are hundreds of columns and most of them are unknown.
– user0204, Commented Jan 25, 2020 at 12:06

blackbishop · Accepted Answer · 2020-01-25 14:41:16Z

For Spark 2.4+, you can read each line of the files into a single string column and then split it on |. This gives an array column that you can transform using higher-order functions:

    df.show(truncate=False)

    +----------------------------+
    |clm                         |
    +----------------------------+
    |a|b|c|clm4=1|clm5=3         |
    |a|b|c|clm4=9|clm6=60|clm7=23|
    +----------------------------+
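The read step that produces this single-column dataframe is not shown in the answer. A minimal sketch of one way it might look, assuming `spark` is an existing SparkSession and `paths` is the list of files from the question. The helper name `read_as_single_column` is hypothetical; `spark.read.text` and `withColumnRenamed` are the standard PySpark calls.

```python
def read_as_single_column(spark, paths, column_name="clm"):
    """Load text files as one string column, one row per input line.

    Hypothetical helper: `spark` is assumed to be a SparkSession and
    `paths` a path string or a list of path strings.
    """
    # spark.read.text returns a dataframe with a single column named "value";
    # .gz files are decompressed transparently based on their extension.
    return spark.read.text(paths).withColumnRenamed("value", column_name)
```

With the question's variables this would be called as `df = read_as_single_column(spark, files)`.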

We use the transform function to convert the array of strings that we get from splitting the clm column into an array of structs. Each struct holds a column name and a value: if the string contains =, the name is the part before it; otherwise the name is clm followed by i+1, where i is the element's zero-based position in the array.
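To make the naming rule concrete, here is a plain-Python sketch of the same logic applied to a single line. It is only an illustration for checking the rule locally, not part of the Spark solution, and `parse_line` is a hypothetical name.

```python
def parse_line(line, sep="|"):
    """Map one raw line to {column_name: value} using the rule described above."""
    record = {}
    for i, token in enumerate(line.split(sep)):
        if "=" in token:
            # Name is the text before the first "=", value the text after the last "=".
            name = token.partition("=")[0]
            value = token.rpartition("=")[2]
        else:
            # Unnamed fields are named by their 1-based position: clm1, clm2, ...
            name, value = f"clm{i + 1}", token
        # Note: a repeated name simply overwrites the earlier value in this sketch.
        record[name] = value
    return record


print(parse_line("a|b|c|clm4=1|clm5=3"))
```

Names that never appear in a given line are simply absent from its dictionary, which corresponds to the nulls in the pivoted result.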

    transform_expr = """
    transform(split(clm, '[|]'), (x, i) -> struct(
            IF(x like '%=%', substring_index(x, '=', 1), concat('clm', i+1)),
            substring_index(x, '=', -1)
        )
    )
    """

Now use map_from_entries to convert the array of structs to a map. Finally, explode the map into (name, value) rows and pivot on the name to get your columns:

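    from pyspark.sql.functions import explode, expr, first, map_from_entries

    # Explode the map into one (col_name, col_value) row per field,
    # then pivot the names back into columns, grouped by the original line.
    df.select("clm",
              explode(map_from_entries(expr(transform_expr))).alias("col_name", "col_value")
              ) \
      .groupby("clm").pivot('col_name').agg(first('col_value')) \
      .drop("clm") \
      .show(truncate=False)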

Gives:

    +----+----+----+----+----+----+----+
    |clm1|clm2|clm3|clm4|clm5|clm6|clm7|
    +----+----+----+----+----+----+----+
    |a   |b   |c   |9   |null|60  |23  |
    |a   |b   |c   |1   |3   |null|null|
    +----+----+----+----+----+----+----+

Thanks for the help. Is there any way I can put a condition in the above code to select only those columns that are present in an existing list of column names?
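On this follow-up: PySpark's `pivot` accepts an explicit list of values as its second argument, so passing the known names there, as in `.pivot('col_name', wanted)`, should restrict the output to those columns. This is an untested suggestion and not part of the accepted answer. The plain-Python sketch below mirrors that behaviour on already-parsed records; `pivot_rows` and `wanted` are hypothetical names.

```python
def pivot_rows(records, wanted=None):
    """Turn a list of {name: value} dicts into (header, rows), using None for gaps."""
    if wanted is None:
        # No allow-list: use every name seen in any record, sorted as strings.
        header = sorted({name for rec in records for name in rec})
    else:
        # Allow-list: keep exactly these names, in the order given.
        header = list(wanted)
    rows = [[rec.get(name) for name in header] for rec in records]
    return header, rows


records = [
    {"clm1": "a", "clm2": "b", "clm3": "c", "clm4": "1", "clm5": "3"},
    {"clm1": "a", "clm2": "b", "clm3": "c", "clm4": "9", "clm6": "60", "clm7": "23"},
]
header, rows = pivot_rows(records, wanted=["clm1", "clm4", "clm6"])
print(header)
print(rows)
```

A name in the list that never occurs in the data still yields a column, filled entirely with None in this sketch.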
