Logging

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

Logging primer

import sys
import logging

logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Hello %s", "world")
logging.debug("Hello, take %d", 2)
2019-03-14 15:29:56,359 - INFO - Hello world

Only the INFO message appears: the DEBUG call is filtered out because the logging level is set to INFO.
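
In a single session, calling basicConfig again is a no-op once a handler exists (the next slide assumes a fresh session); to change verbosity in place, the standard-library route is setLevel, sketched here:

logging.getLogger().setLevel(logging.DEBUG)  # lower the root logger's threshold
logging.debug("Hello, take %d", 2)           # this message now gets through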

Logging with DEBUG level

import sys
import logging

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Hello %s", "world")
logging.debug("Hello, take %d", 2)
2018-03-14 12:00:00,000 - INFO - Hello world
2018-03-14 12:00:00,001 - DEBUG - Hello, take 2

Debugging lazy evaluation

Two properties of Spark make it hard to see where time is actually being spent (see the sketch below):

  • lazy evaluation: transformations do no work until an action triggers them
  • distributed execution: the work runs on the cluster's executors, not in the driver
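
A minimal sketch of the problem, assuming an active SparkSession named spark: both transformations return instantly, and the cost only shows up when the action runs.

df = spark.range(10**8)                     # transformation: lazy, returns immediately
doubled = df.selectExpr("id * 2 AS two_id") # transformation: still no job has run
print(doubled.count())                      # action: the full job executes here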

A simple timer

t = timer()
t.elapsed()
1. elapsed: 0.0 sec 
t.elapsed() # called after doing something that takes 2 seconds
2. elapsed: 2.0 sec 
t.reset() # after doing something else that takes time, reset the clock
t.elapsed()
3. elapsed: 0.0 sec 

class timer

import time

class timer:
    def __init__(self):
        self.start_time = time.time()
        self.step = 0

    def elapsed(self, reset=True):
        self.step += 1
        # report the step number and seconds since the last reset
        print("%d. elapsed: %.1f sec"
              % (self.step, time.time() - self.start_time))
        if reset:
            self.reset()

    def reset(self):
        self.start_time = time.time()
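
To reproduce the demo output above without a Spark job, time.sleep can stand in for the slow work (the 2-second sleep is an illustrative assumption, not from the slides):

t = timer()
t.elapsed()    # 1. elapsed: 0.0 sec
time.sleep(2)  # stand-in for a 2-second Spark action
t.elapsed()    # 2. elapsed: 2.0 sec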

Stealth CPU wastage

import logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')

# < create dataframe df here >

t = timer()
logging.info("No action here.")
t.elapsed()
logging.debug("df has %d rows.", df.count())  # count() runs anyway: arguments are evaluated before the level check
t.elapsed()
2018-12-23 22:24:20,472 - INFO - No action here.
1. elapsed: 0.0 sec 
2. elapsed: 2.0 sec
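
The hidden cost can be avoided by checking the logger's effective level before building the message, using the standard library's isEnabledFor (a sketch; the slides' own fix, an ENABLED flag, is on the next slide):

if logging.getLogger().isEnabledFor(logging.DEBUG):
    logging.debug("df has %d rows.", df.count())  # count() now runs only when DEBUG is enabled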

Disable actions

ENABLED = False

t = timer()
logging.info("No action here.")
t.elapsed()
if ENABLED:
    logging.info("df has %d rows.", df.count())  # the action is skipped entirely when ENABLED is False
t.elapsed()
2019-03-14 12:34:56,789 - INFO - No action here.
1. elapsed: 0.0 sec 
2. elapsed: 0.0 sec

Enabling actions

Rerunning the previous example with ENABLED = True triggers the action:

2019-03-14 12:34:56,789 - INFO - No action here.
1. elapsed: 0.0 sec 
2019-03-14 12:34:58,789 - INFO - df has 1107014 rows.
2. elapsed: 2.0 sec 

Let's practice!
