Loading natural language

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

The dataset

The Project Gutenberg eBook of The Adventures of Sherlock Holmes,

by Sir Arthur Conan Doyle.

Available at gutenberg.org

Loading text

df = spark.read.text('sherlock.txt')
print(df.first()) 
Row(value='The Project Gutenberg EBook of The Adventures of Sherlock Holmes')
print(df.count())
5500

Loading parquet

df1 = spark.read.load('sherlock.parquet')

Loaded text

df1.show(15, truncate=False)
+--------------------------------------------------------------------+
|value                                                               |
+--------------------------------------------------------------------+
|The Project Gutenberg EBook of The Adventures of Sherlock Holmes    |
|by Sir Arthur Conan Doyle                                           |
|(#15 in our series by Sir Arthur Conan Doyle)                       |
|                                                                    |
|Copyright laws are changing all over the world. Be sure to check the|
|copyright laws for your country before downloading or redistributing|
|this or any other Project Gutenberg eBook.                          |
|                                                                    |
|This header should be the first thing seen when viewing this Project|
|Gutenberg file.  Please do not remove it.  Do not change or edit the|
|header without written permission.                                  |
|                                                                    |
|Please read the "legal small print," and other information about the|
|eBook and Project Gutenberg at the bottom of this file.  Included is|
|important information about your specific rights and restrictions in|
+--------------------------------------------------------------------+

Lowercasing text

from pyspark.sql.functions import lower, col

df = df1.select(lower(col('value')))

print(df.first())
Row(lower(value)=
    'the project gutenberg ebook of the adventures of sherlock holmes')
df.columns
['lower(value)']

Using an alias

df = df1.select(lower(col('value')).alias('v'))
df.columns
['v']

Replacing text

from pyspark.sql.functions import regexp_replace

df = df1.select(regexp_replace('value', r'Mr\.', 'Mr').alias('v'))

"Mr. Holmes." ==> "Mr Holmes."

df = df1.select(regexp_replace('value', "don't", 'do not').alias('v'))

"don't know." ==> "do not know."

Tokenizing text

from pyspark.sql.functions import split

df = df2.select(split('v', '[ ]').alias('words'))
df.show(truncate=False)

Tokenizing – output

+--------------------------------------------------------------------------------------+
|words                                                                                  |
+--------------------------------------------------------------------------------------+
|[the, project, gutenberg, ebook, of, the, adventures, of, sherlock, holmes]           |
|[by, sir, arthur, conan, doyle]                                                       |
|[(#15, in, our, series, by, sir, arthur, conan, doyle)]                               |
|[]                                                                                    |
.
.
.
|[please, read, the, "legal, small, print,", and, other, information, about, the]      |
.
.
.
|[**welcome, to, the, world, of, free, plain, vanilla, electronic, texts**]            |
+--------------------------------------------------------------------------------------+

Split characters are removed

punctuation = r"_|.\?\!\",\'\[\]\*()"
df3 = df2.select(split('v', '[ %s]' % punctuation).alias('words'))
df3.show(truncate=False)

Split characters are removed – output

+---------------------------------------------------------------------------------------+
|words                                                                                   |
+---------------------------------------------------------------------------------------+
|[the, project, gutenberg, ebook, of, the, adventures, of, sherlock, holmes]            |
|[by, sir, arthur, conan, doyle]                                                        |
|[, #15, in, our, series, by, sir, arthur, conan, doyle, ]                              |
|[]                                                                                     |
.
.
.
|[please, read, the, , legal, small, print, , , and, other, information, about, the]    |
.
.
.
|[, , welcome, to, the, world, of, free, plain, vanilla, electronic, texts, , ]         |
+---------------------------------------------------------------------------------------+
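The behavior of splitting on a character class of a space plus punctuation can be checked in plain Python with `re.split` (illustrative only, using one sample line from the output above). Each punctuation character acts as a delimiter and is discarded, which is why empty strings appear wherever a delimiter touches another delimiter or the start or end of the line:

```python
import re

# Character class: a space plus the punctuation characters.
punctuation = r"_|.\?\!\",\'\[\]\*()"
line = '(#15 in our series by sir arthur conan doyle)'
tokens = re.split('[ %s]' % punctuation, line)
print(tokens)
# The leading '(' and trailing ')' each leave an empty-string token.
```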

Exploding an array

from pyspark.sql.functions import explode

df4 = df3.select(explode('words').alias('word'))
df4.show()

Exploding an array – output

+----------+
|      word|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|       the|
|adventures|
|        of|
|  sherlock|
|    holmes|
|        by|
|       sir|
|    arthur|
|     conan|
|     doyle|
+----------+

Explode increases the row count

print(df3.count())
5500
print(df4.count())
131404
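Conceptually, `explode` flattens a column of arrays into one row per element, which is why the row count grows. A plain-Python sketch with hypothetical sample rows standing in for the `words` column:

```python
# Each inner list is one row's array of tokens; explode yields
# one output row per element. Empty arrays contribute no rows.
rows = [['the', 'project'], [], ['by', 'sir']]
exploded = [word for words in rows for word in words]
print(len(rows), len(exploded))  # 3 input rows -> 4 output rows
print(exploded)
```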

Removing empty rows

print(df.count())
131404
from pyspark.sql.functions import length

nonblank_df = df.where(length('word') > 0)
print(nonblank_df.count())
107320
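The `where(length('word') > 0)` filter simply drops the empty-string tokens that the split step produced. A plain-Python analogue on hypothetical sample tokens:

```python
# Keep only tokens with non-zero length, mirroring the where() filter.
words = ['', '#15', 'in', '', 'our', 'series', '']
nonblank = [w for w in words if len(w) > 0]
print(nonblank)  # the empty strings are gone
```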

Adding a row id column

from pyspark.sql.functions import monotonically_increasing_id

df2 = df.select('word', monotonically_increasing_id().alias('id'))

df2.show()

Adding a row id column – output

+----------+---+
|      word| id|
+----------+---+
|       the|  0|
|   project|  1|
| gutenberg|  2|
|     ebook|  3|
|        of|  4|
|       the|  5|
|adventures|  6|
|        of|  7|
|  sherlock|  8|
|    holmes|  9|
|        by| 10|
|       sir| 11|
|    arthur| 12|
|     conan| 13|
|     doyle| 14|
|       #15| 15|
+----------+---+

Partitioning the data

from pyspark.sql.functions import when

df2 = df.withColumn('title', when(df.id < 25000, 'Preface')
                             .when(df.id < 50000, 'Chapter 1')
                             .when(df.id < 75000, 'Chapter 2')
                             .otherwise('Chapter 3'))
df2 = df2.withColumn('part', when(df2.id < 25000, 0)
                             .when(df2.id < 50000, 1)
                             .when(df2.id < 75000, 2)
                             .otherwise(3))
df2.show()

Partitioning the data – output

+----------+---+------------+----+
|word      |id |title       |part|
+----------+---+------------+----+
|the       |0  |     Preface|0   |
|project   |1  |     Preface|0   |
|gutenberg |2  |     Preface|0   |
|ebook     |3  |     Preface|0   |
|of        |4  |     Preface|0   |
|the       |5  |     Preface|0   |
|adventures|6  |     Preface|0   |
|of        |7  |     Preface|0   |
|sherlock  |8  |     Preface|0   |
|holmes    |9  |     Preface|0   |
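The chained `when()/otherwise()` expressions behave like an if/elif ladder evaluated per row. A plain-Python sketch of the same id-to-bucket mapping (the thresholds are the ones from the code above):

```python
# Map a row id to its title and part, mirroring the when() chain.
def title_and_part(row_id):
    if row_id < 25000:
        return 'Preface', 0
    elif row_id < 50000:
        return 'Chapter 1', 1
    elif row_id < 75000:
        return 'Chapter 2', 2
    else:
        return 'Chapter 3', 3

print(title_and_part(0))      # ('Preface', 0)
print(title_and_part(60000))  # ('Chapter 2', 2)
```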

Repartitioning on a column

df2 = df.repartition(4, 'part')
print(df2.rdd.getNumPartitions())
4
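Roughly speaking, `repartition(4, 'part')` routes each row to a partition by hashing the column value modulo the partition count, so rows sharing a `part` value land together. A loose plain-Python analogy (Spark uses its own hash function, not Python's `hash`):

```python
# Stand-in for hash partitioning: same key always maps to the
# same one of the 4 partitions.
def partition_for(key, num_partitions=4):
    return hash(key) % num_partitions

print([partition_for(p) for p in [0, 1, 2, 3]])
```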

Reading pre-partitioned text

$ ls sherlock_parts
sherlock_part0.txt   
sherlock_part1.txt   
sherlock_part2.txt   
sherlock_part3.txt   
sherlock_part4.txt   
sherlock_part5.txt       
sherlock_part6.txt   
sherlock_part7.txt   
sherlock_part8.txt     
sherlock_part9.txt     
sherlock_part10.txt     
sherlock_part11.txt  
sherlock_part12.txt
sherlock_part13.txt

Reading pre-partitioned text

df_parts = spark.read.text('sherlock_parts')

Let's practice!
