
If you add any kind of logging to a UDF in PySpark, it doesn't appear anywhere. Is there some way to make the log output show up?

So far I have tried standard Python logging, py4j, and plain print.

We're running PySpark 2.3.2 with the YARN cluster manager on AWS EMR clusters.

For example, here's a function I want to use:

import logging

logger = logging.getLogger(__name__)  # standard Python logging, as mentioned above

def parse_data(attr):
    try:
        # execute something
        ...
    except Exception as e:
        logger.error(e)
        return None

I convert it to a UDF:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

parse_data_udf = F.udf(parse_data, StringType())

And I use it on a DataFrame:

from pyspark.sql import types as pst

dataframe = dataframe.withColumn("new_column", parse_data_udf("column").cast(pst.StringType()))

The logs from the function will NOT appear anywhere.

  • Possible duplicate of stackoverflow.com/questions/25407550/… Commented Oct 15, 2019 at 14:38
  • Look here: stackoverflow.com/questions/40806225/… Commented Oct 15, 2019 at 18:43
  • Both of these are about general logging; my question is specifically about logs inside a UDF. Commented Oct 21, 2019 at 11:47
  • @Mariusz - Sorry, we tried that. It didn't work. Commented Oct 21, 2019 at 20:13
  • Hi, did you come up with any solution? I'm stuck at the same point. Commented Apr 11, 2021 at 20:58

1 Answer


When using YARN as the cluster manager, you can use the following YARN CLI command to check the container logs.

This is where stdout/stderr (and therefore whatever you log inside the UDF) most likely ends up.

yarn logs -applicationId <Application ID> -containerId <Container ID> 
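
As an illustration, here is a minimal sketch of how the UDF from the question could be made to emit messages that end up in those container logs. This is an assumption on my part, not something from the original answer or a documented PySpark API: the idea is to configure a handler inside the function body so the setup runs on the executor's Python worker and writes to stderr, which YARN captures. The helper name get_udf_logger and the logger name are made up for this sketch.

import logging
import sys

def get_udf_logger():
    # Hypothetical helper: attach a stderr handler once per Python worker.
    logger = logging.getLogger("parse_data_udf")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.ERROR)
    return logger

def parse_data(attr):
    # Configuring the logger inside the UDF means the setup runs on the
    # executor's Python worker, not only on the driver.
    logger = get_udf_logger()
    try:
        # execute something
        ...
    except Exception as e:
        logger.error("parse_data failed for %r: %s", attr, e)
        return None

If this works in your environment, the messages should show up in the stderr section of the container logs retrieved with the yarn logs command above; handlers configured only on the driver are not visible to the executors' Python workers.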