I am trying to set up a decent logging configuration in PySpark. I have a YAML configuration file that sets up several log handlers: the console, a file, and a SQLite DB, all using the format `"%(asctime)s - %(name)s - %(levelname)s - %(message)s"`.
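For context, my YAML looks roughly like this (the SQLite handler is a custom class; `myapp.handlers.SQLiteHandler` and its `db` argument are placeholders for my actual implementation):

```yaml
version: 1
formatters:
  default:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
handlers:
  console:
    class: logging.StreamHandler
    formatter: default
    stream: ext://sys.stdout
  file:
    class: logging.FileHandler
    formatter: default
    filename: mylog.log
  sqlite:
    # custom handler -- module path and 'db' kwarg are placeholders
    class: myapp.handlers.SQLiteHandler
    formatter: default
    db: mylog.db
loggers:
  mylog:
    level: DEBUG
    handlers: [console, file, sqlite]
```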
```python
# SETUP LOGGING
import logging
import logging.config

import yaml

with open(cfile, 'rt') as f:
    config = yaml.safe_load(f.read())
logging.config.dictConfig(config)

# logger named after the current class
lg = logging.getLogger("mylog." + type(self).__name__)
```

So each time I call `lg.xxxx('message')`, everything gets handled quite nicely.
Now, I have found quite a few posts on how to get at log4j from PySpark using `log_handler = sc._jvm.org.apache.log4j`. But I am lost on how to hook this into my existing setup so that all the messages that appear on the PySpark console are also caught and saved to the file and the SQLite DB.
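For reference, this is roughly as far as I have gotten on the log4j side (assuming `sc` is the active SparkContext; the logger name is just an example):

```python
# Grab the log4j package from the JVM via the SparkContext gateway
log4j = sc._jvm.org.apache.log4j

# This gives me a log4j logger -- but it logs through log4j on the JVM side,
# not through my Python logging handlers, so nothing reaches the file or
# the SQLite DB configured above
spark_lg = log4j.LogManager.getLogger("mylog.spark")
spark_lg.info("this only shows up on the Spark console")
```

What I am missing is the step that routes the log4j messages into my Python `logging` handlers (or vice versa).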