Introduction to Spark SQL in Python
Mark Plutowski
Data Scientist
import sys
import logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Hello %s", "world")
logging.debug("Hello, take %d", 2)  # filtered out: DEBUG is below INFO
2019-03-14 15:52:05,359 - INFO - Hello world
import sys
import logging
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Hello %s", "world")
logging.debug("Hello, take %d", 2)  # now emitted: DEBUG is enabled
2018-03-14 12:00:00,000 - INFO - Hello world
2018-03-14 12:00:00,001 - DEBUG - Hello, take 2
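A caveat if you rerun these snippets in a single interpreter session: basicConfig is a no-op once the root logger already has a handler, so the second call above would not actually lower the level on its own. On Python 3.8+, passing force=True replaces the existing handlers; a minimal sketch (this flag is an aside, not shown in the slides):

import sys
import logging
# force=True (Python 3.8+) clears prior handlers so the new level applies
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    force=True)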
t = timer()
t.elapsed()
1. elapsed: 0.0 sec
# ... do something that takes 2 seconds ...
t.elapsed()
2. elapsed: 2.0 sec
# ... do something you don't want timed, then reset ...
t.reset()
t.elapsed()
3. elapsed: 0.0 sec
import time

class timer:
    def __init__(self):
        # Start the clock when the timer is created
        self.start_time = time.time()
        self.step = 0

    def elapsed(self, reset=True):
        self.step += 1
        print("%d. elapsed: %.1f sec"
              % (self.step, time.time() - self.start_time))
        if reset:
            self.reset()

    def reset(self):
        self.start_time = time.time()
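Note that elapsed() resets the clock by default; passing reset=False takes a cumulative reading instead. A small usage sketch of that parameter (the timings shown are illustrative):

t = timer()
# ... roughly 1 second of work ...
t.elapsed(reset=False)  # 1. elapsed: 1.0 sec -- clock keeps running
# ... roughly 1 more second of work ...
t.elapsed()             # 2. elapsed: 2.0 sec -- cumulative, then resets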
import logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
# < create dataframe df here >
t = timer()
logging.info("No action here.")
t.elapsed()
logging.debug("df has %d rows.", df.count())  # count() runs even though DEBUG is filtered
t.elapsed()
2018-12-23 22:24:20,472 - INFO - No action here.
1. elapsed: 0.0 sec
2. elapsed: 2.0 sec
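The second step took 2.0 seconds even though no DEBUG line was printed: Python evaluates the arguments to logging.debug before the call can be filtered, so df.count() triggers a Spark action regardless of the log level. A minimal self-contained sketch of the pitfall, using a hypothetical expensive_op in place of the Spark action:

import logging
logging.basicConfig(level=logging.INFO)

def expensive_op():
    print("expensive_op ran anyway")  # proves the argument was evaluated
    return 1107014

# The DEBUG record is discarded, but expensive_op() has already run.
logging.debug("df has %d rows.", expensive_op())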
ENABLED = False
t = timer()
logging.info("No action here.")
t.elapsed()
if ENABLED:
    logging.info("df has %d rows.", df.count())
t.elapsed()
2019-03-14 12:34:56,789 - INFO - No action here.
1. elapsed: 0.0 sec
2. elapsed: 0.0 sec
Rerunning the previous example with ENABLED = True triggers the action:
2019-03-14 12:34:56,789 - INFO - No action here.
1. elapsed: 0.0 sec
2019-03-14 12:34:58,789 - INFO - df has 1107014 rows.
2. elapsed: 2.0 sec
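Instead of a hand-rolled ENABLED flag, the standard library offers an equivalent guard: Logger.isEnabledFor checks the effective level without evaluating anything. A sketch, assuming df is the dataframe created earlier:

import logging
logger = logging.getLogger('Pyspark')

# The Spark action only runs when DEBUG logging is actually enabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("df has %d rows.", df.count())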