Thursday 27 September 2018



Spark-Python Logging


Aim:


To understand how to log from both the driver and the executors while using PySpark.

Details:


We can use the log4j logger directly for driver logging. Executor logging, however, is not as straightforward. This is because the Python workers are spawned by the executor JVMs, and these workers do not have a callback connection to Java. PySpark is built on top of Spark's Java API: data is processed in Python and cached/shuffled in the JVM.

[Diagram: PySpark architecture - Py4J runs inside the Spark driver; executors spawn Python workers]



Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM. If we look closely at the diagram, we can see that Py4J runs inside the Spark driver. We do not have Py4J inside the Spark workers.
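
Since Py4J lives in the driver, driver-side log messages can go straight to log4j through the JVM gateway. Below is a minimal sketch, assuming a SparkContext named sc is already available; the logger name MyDriverLogger is an arbitrary example:

# Driver-side logging: reach the driver JVM's log4j classes through Py4J (sc._jvm)
log4j = sc._jvm.org.apache.log4j
driver_logger = log4j.LogManager.getLogger('MyDriverLogger')
driver_logger.info('Driver started - this message goes to the driver log4j output')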

Since the workers have no Py4J bridge, for executor logging we have to copy the Python module that holds the logging configuration to a distributed filesystem (HDFS/MapRFS, let's say /user/mapr/spark/mysparkexecutorlogger.py) and then ship it to the executors as follows:

# Add the custom logger file to spark context
sc.addPyFile('maprfs:///user/mapr/spark/mysparkexecutorlogger.py')
import mysparkexecutorlogger

Contents of ‘/user/mapr/spark/mysparkexecutorlogger.py’:

import logging

# Module-level logger, configured once per Python worker process
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# The log file is written on the local filesystem of each executor node
handler = logging.FileHandler('/spark/jobs/executor.log')
handler.setLevel(logging.DEBUG)

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
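
A minimal usage sketch for the executor side (the RDD contents and the process_partition function are hypothetical), assuming the sc.addPyFile call above has already been made:

def process_partition(records):
    # Runs on the executors: each Python worker imports the shipped module
    # and appends to /spark/jobs/executor.log on its own node
    import mysparkexecutorlogger
    mysparkexecutorlogger.logger.debug('Started processing a partition')
    for record in records:
        yield record

sc.parallelize(range(100), 4).mapPartitions(process_partition).count()

Note that the log file lives on each executor node's local disk, so messages from different executors end up in separate files that have to be collected per node.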



