
If you add any kind of logging to a UDF in PySpark, it doesn't appear anywhere. Is there some way to make the log output show up?

So far I have tried standard Python logging, py4j, and plain print.

We're running PySpark 2.3.2 with the YARN cluster manager on AWS EMR clusters.

For example, here's a function I want to use:

import logging

logger = logging.getLogger(__name__)  # standard Python logging, as mentioned above

def parse_data(attr):
    try:
        # execute something
        ...
    except Exception as e:
        logger.error(e)
        return None

I convert it to a UDF:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

parse_data_udf = F.udf(parse_data, StringType())

And I use it on a DataFrame:

from pyspark.sql import types as pst

dataframe = dataframe.withColumn("new_column", parse_data_udf("column").cast(pst.StringType()))

The logs from the function will NOT appear anywhere.

  • Possible duplicate of stackoverflow.com/questions/25407550/… Commented Oct 15, 2019 at 14:38
  • Look here: stackoverflow.com/questions/40806225/… Commented Oct 15, 2019 at 18:43
  • Both of these are about general logging; my question is specifically about logs inside a UDF. Commented Oct 21, 2019 at 11:47
  • @Mariusz - Sorry, we tried that. It didn't work. Commented Oct 21, 2019 at 20:13
  • Hi, did you come up with any solution? I'm stuck at the same point. Commented Apr 11, 2021 at 20:58

1 Answer


When using YARN as the cluster manager, you can use the following YARN CLI command to check the container logs.

This is where stdout/stderr (and therefore whatever you log inside the UDF) most likely ends up.

yarn logs -applicationId <Application ID> -containerId <Container ID> 
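
As an illustration, here is a minimal sketch of how the UDF from the question could be made to emit messages that end up in those container logs. This is an assumption on my part, not something from the original answer or a documented PySpark API: the idea is to configure a handler inside the function body so the setup runs on the executor's Python worker and writes to stderr, which YARN captures. The helper name get_udf_logger and the logger name are made up for this sketch.

import logging
import sys

def get_udf_logger():
    # Hypothetical helper: attach a stderr handler once per Python worker.
    logger = logging.getLogger("parse_data_udf")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.ERROR)
    return logger

def parse_data(attr):
    # Configuring the logger inside the UDF means the setup runs on the
    # executor's Python worker, not only on the driver.
    logger = get_udf_logger()
    try:
        # execute something
        ...
    except Exception as e:
        logger.error("parse_data failed for %r: %s", attr, e)
        return None

If this works in your environment, the messages should show up in the stderr section of the container logs retrieved with the yarn logs command above; handlers configured only on the driver are not visible to the executors' Python workers.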