Loading natural language text

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

The dataset

The Project Gutenberg eBook of The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle.

Available from gutenberg.org


Loading text

df = spark.read.text('sherlock.txt')
print(df.first()) 
Row(value='The Project Gutenberg EBook of The Adventures of Sherlock Holmes')
print(df.count())
5500
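
These snippets assume a SparkSession is already available as spark, as in the course environment. Outside it, a minimal local session (a sketch; the app name is arbitrary) can be created with:

from pyspark.sql import SparkSession

# Build (or reuse) a local session running on all available cores.
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('sherlock') \
    .getOrCreate()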

Loading parquet

df1 = spark.read.load('sherlock.parquet')
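
With no format given, load uses Spark's default data source, which is Parquet. The explicit equivalents are:

df1 = spark.read.format('parquet').load('sherlock.parquet')
# or, equivalently:
df1 = spark.read.parquet('sherlock.parquet')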

Loaded text

df1.show(15, truncate=False)
+--------------------------------------------------------------------+
|value                                                               |
+--------------------------------------------------------------------+
|The Project Gutenberg EBook of The Adventures of Sherlock Holmes    |
|by Sir Arthur Conan Doyle                                           |
|(#15 in our series by Sir Arthur Conan Doyle)                       |
|                                                                    |
|Copyright laws are changing all over the world. Be sure to check the|
|copyright laws for your country before downloading or redistributing|
|this or any other Project Gutenberg eBook.                          |
|                                                                    |
|This header should be the first thing seen when viewing this Project|
|Gutenberg file.  Please do not remove it.  Do not change or edit the|
|header without written permission.                                  |
|                                                                    |
|Please read the "legal small print," and other information about the|
|eBook and Project Gutenberg at the bottom of this file.  Included is|
|important information about your specific rights and restrictions in|
+--------------------------------------------------------------------+

Lower case operation

from pyspark.sql.functions import col, lower

df = df1.select(lower(col('value')))

print(df.first())
Row(lower(value)=
    'the project gutenberg ebook of the adventures of sherlock holmes')
df.columns
['lower(value)']

Alias operation

df = df1.select(lower(col('value')).alias('v'))
df.columns
['v']

Replacing text

from pyspark.sql.functions import regexp_replace

df = df1.select(regexp_replace('value', r'Mr\.', 'Mr').alias('v'))

"Mr. Holmes." ==> "Mr Holmes."

df = df1.select(regexp_replace('value', r"don't", 'do not').alias('v'))

"don't know." ==> "do not know."


Tokenizing text

from pyspark.sql.functions import split

df = df2.select(split('v', '[ ]').alias('words'))
df.show(truncate=False)

Tokenizing text – output

+--------------------------------------------------------------------------------+
|words                                                                           |
+--------------------------------------------------------------------------------+
|[the, project, gutenberg, ebook, of, the, adventures, of, sherlock, holmes]    |
|[by, sir, arthur, conan, doyle]                                                 |
|[(#15, in, our, series, by, sir, arthur, conan, doyle)]                         |
|[]                                                                              |
...
|[please, read, the, "legal, small, print,", and, other, information, about, the]|
...
|[**welcome, to, the, world, of, free, plain, vanilla, electronic, texts**]      |
+--------------------------------------------------------------------------------+

Split characters are discarded

punctuation = r"_|.\?\!\",\'\[\]\*()"
df3 = df2.select(split('v', '[ %s]' % punctuation).alias('words'))
df3.show(truncate=False)
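
The % substitution simply splices the character list into a regular expression character class; printing the composed pattern makes the split rule explicit:

print('[ %s]' % punctuation)
[ _|.\?\!\",\'\[\]\*()]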

Split characters are discarded – output

+-----------------------------------------------------------------------------------+
|words                                                                              |
+-----------------------------------------------------------------------------------+
|[the, project, gutenberg, ebook, of, the, adventures, of, sherlock, holmes]        |
|[by, sir, arthur, conan, doyle]                                                    |
|[, #15, in, our, series, by, sir, arthur, conan, doyle, ]                          |
|[]                                                                                 |
...
|[please, read, the, , legal, small, print, , , and, other, information, about, the]|
...
|[, , welcome, to, the, world, of, free, plain, vanilla, electronic, texts, , ]     |
+-----------------------------------------------------------------------------------+

Exploding an array

from pyspark.sql.functions import explode

df4 = df3.select(explode('words').alias('word'))
df4.show()

Exploding an array – output

+----------+
|      word|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|       the|
|adventures|
|        of|
|  sherlock|
|    holmes|
|        by|
|       sir|
|    arthur|
|     conan|
|     doyle|
+----------+

Explode increases row count

print(df3.count())
5500
print(df4.count())
131404
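
Each array element becomes its own row, so 5,500 lines of text expand to 131,404 words. With one word per row, aggregations become straightforward; for example (not from the original slides), a simple word frequency count:

df4.groupBy('word').count().orderBy('count', ascending=False).show(5)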

Removing empty rows

from pyspark.sql.functions import length

print(df4.count())
131404
nonblank_df = df4.where(length('word') > 0)
print(nonblank_df.count())
107320

Adding a row id column

from pyspark.sql.functions import monotonically_increasing_id

df2 = nonblank_df.select('word', monotonically_increasing_id().alias('id'))

df2.show()
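
Note that monotonically_increasing_id guarantees ids that are increasing and unique, but not necessarily consecutive: the generated value encodes the partition number, so gaps appear at partition boundaries. Consecutive ids like those below indicate the data sits in a single partition.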

Adding a row id column – output

+----------+---+
|      word| id|
+----------+---+
|       the|  0|
|   project|  1|
| gutenberg|  2|
|     ebook|  3|
|        of|  4|
|       the|  5|
|adventures|  6|
|        of|  7|
|  sherlock|  8|
|    holmes|  9|
|        by| 10|
|       sir| 11|
|    arthur| 12|
|     conan| 13|
|     doyle| 14|
|       #15| 15|
+----------+---+

Partitioning the data

from pyspark.sql.functions import when

df2 = df2.withColumn('title', when(df2.id < 25000, 'Preface')
                              .when(df2.id < 50000, 'Chapter 1')
                              .when(df2.id < 75000, 'Chapter 2')
                              .otherwise('Chapter 3'))
df2 = df2.withColumn('part', when(df2.id < 25000, 0)
                             .when(df2.id < 50000, 1)
                             .when(df2.id < 75000, 2)
                             .otherwise(3))
df2.show(10, truncate=False)

Partitioning the data – output

+----------+---+------------+----+
|word      |id |title       |part|
+----------+---+------------+----+
|the       |0  |Preface     |0   |
|project   |1  |Preface     |0   |
|gutenberg |2  |Preface     |0   |
|ebook     |3  |Preface     |0   |
|of        |4  |Preface     |0   |
|the       |5  |Preface     |0   |
|adventures|6  |Preface     |0   |
|of        |7  |Preface     |0   |
|sherlock  |8  |Preface     |0   |
|holmes    |9  |Preface     |0   |
+----------+---+------------+----+

Repartitioning on a column

df3 = df2.repartition(4, 'part')
print(df3.rdd.getNumPartitions())
4
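
A dataframe partitioned like this can also be written out with one directory per value of part (a sketch; the output path sherlock_by_part is made up for illustration):

df3.write.mode('overwrite').partitionBy('part').parquet('sherlock_by_part')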

Reading pre-partitioned text

$ ls sherlock_parts
sherlock_part0.txt   
sherlock_part1.txt   
sherlock_part2.txt   
sherlock_part3.txt   
sherlock_part4.txt   
sherlock_part5.txt       
sherlock_part6.txt   
sherlock_part7.txt   
sherlock_part8.txt     
sherlock_part9.txt     
sherlock_part10.txt     
sherlock_part11.txt  
sherlock_part12.txt
sherlock_part13.txt

Reading pre-partitioned text

df_parts = spark.read.text('sherlock_parts')
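
The number of partitions this produces depends on the file sizes and Spark's file-packing settings rather than simply on the file count; it can be inspected with:

print(df_parts.rdd.getNumPartitions())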

Let's practice!
