
I have a DataFrame with many columns of str type, and I want to apply a function to all of those columns without renaming them or adding new columns. I tried a for-in loop that executes withColumn (see the example below), but when I run the code it usually fails with a StackOverflowError (it rarely works). The DataFrame is not big at all; it has only ~15,000 records.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # df is a DataFrame
    def lowerCase(string):
        return string.strip().lower()

    lowerCaseUDF = udf(lowerCase, StringType())

    for (columnName, kind) in df.dtypes:
        if kind == "string":
            df = df.withColumn(columnName, lowerCaseUDF(df[columnName]))

    df.select("Tipo_unidad").distinct().show()

The complete error is very long, so I have pasted only a few lines here; you can find the full trace at the link: Complete Trace

    Py4JJavaError: An error occurred while calling o516.showString.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times,
    most recent failure: Lost task 1.3 in stage 2.0 (TID 38, worker2.mcbo.mood.com.ve): java.lang.StackOverflowError
        at java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:2774)

I am thinking that this problem is produced because this code launches many jobs (one for each column of type string), could you show me another alternative or what I am doing wrong?

  • How many columns do you have? Commented Jan 28, 2016 at 16:22
  • @eliasah Around 136; I don't think that's too many. Commented Jan 28, 2016 at 16:23
  • I think the loop keeps the DataFrame in memory each time you compute on it, and the GC doesn't have time to clean it up, thus no memory => SO. Commented Jan 28, 2016 at 16:23
  • @eliasah That's very probable, but I don't have any other user-friendly alternative (the other option would be to do this manually, column by column). Commented Jan 28, 2016 at 16:25
  • Could you try to use a single select instead? This StackOverflowError smells like some kind of issue with growing lineage. Also, I wouldn't use a UDF here; it is kind of wasteful and can be handled directly on the internal representation. Commented Jan 28, 2016 at 16:30

1 Answer


Try something like this:

    from pyspark.sql.functions import col, lower, trim

    exprs = [
        lower(trim(col(c))).alias(c) if t == "string" else col(c)
        for (c, t) in df.dtypes
    ]

    df.select(*exprs)
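Note that select returns a new DataFrame rather than modifying df in place, so assign the result back before checking it. A minimal sketch, using the Tipo_unidad column from your question:

    df = df.select(*exprs)
    df.select("Tipo_unidad").distinct().show()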

This approach has two main advantages over your current solution:

  • It requires only a single projection (no growing lineage, which is most likely responsible for the StackOverflowError) instead of one projection per string column; you can see the difference by comparing the query plans, as sketched below.
  • It operates directly on the internal representation without passing data to Python (BatchPythonProcessing).
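If you want to see the lineage difference for yourself, you can compare the physical plans of the two versions with explain. A minimal sketch, reusing lowerCaseUDF from the question and exprs from the answer above:

    # Loop version: every withColumn stacks another projection on top of the
    # previous plan, so the plan grows with the number of string columns.
    looped = df
    for (columnName, kind) in df.dtypes:
        if kind == "string":
            looped = looped.withColumn(columnName, lowerCaseUDF(looped[columnName]))
    looped.explain()

    # Single-select version: one flat projection over all columns.
    df.select(*exprs).explain()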

2 Comments

Worked perfectly, but what would I do if I had to apply a really complex function to every string column?
Well, pretty much the same way :) If you cannot express it with built-in functions (in Spark 1.6 that shouldn't be a problem; there are enough to choose from to create arbitrarily complex transformations), just replace lower ∘ trim with a UDF.
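For illustration, a sketch of that suggestion, where complexTransform is a hypothetical stand-in for whatever per-value logic you need:

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # Hypothetical complex per-string logic; replace with your own.
    def complexTransform(value):
        return value.strip().lower() if value is not None else None

    complexUDF = udf(complexTransform, StringType())

    # Same single-projection pattern as above, with the UDF in place of lower(trim(...)).
    exprs = [
        complexUDF(col(c)).alias(c) if t == "string" else col(c)
        for (c, t) in df.dtypes
    ]
    df = df.select(*exprs)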
