Loading natural language text

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

Dataset

Project Gutenberg eBook: The Adventures of Sherlock Holmes,

by Sir Arthur Conan Doyle.

Available at gutenberg.org

Loading text

df = spark.read.text('sherlock.txt')
print(df.first()) 
Row(value='The Project Gutenberg EBook of The Adventures of Sherlock Holmes')
print(df.count())
5500
Loading parquet

df1 = spark.read.load('sherlock.parquet')
The loaded text

df1.show(15, truncate=False)
+--------------------------------------------------------------------+
|value                                                               |
+--------------------------------------------------------------------+
|The Project Gutenberg EBook of The Adventures of Sherlock Holmes    |
|by Sir Arthur Conan Doyle                                           |
|(#15 in our series by Sir Arthur Conan Doyle)                       |
|                                                                    |
|Copyright laws are changing all over the world. Be sure to check the|
|copyright laws for your country before downloading or redistributing|
|this or any other Project Gutenberg eBook.                          |
|                                                                    |
|This header should be the first thing seen when viewing this Project|
|Gutenberg file.  Please do not remove it.  Do not change or edit the|
|header without written permission.                                  |
|                                                                    |
|Please read the "legal small print," and other information about the|
|eBook and Project Gutenberg at the bottom of this file.  Included is|
|important information about your specific rights and restrictions in|
+--------------------------------------------------------------------+
Lowercase operation

from pyspark.sql.functions import lower, col

df = df1.select(lower(col('value')))

print(df.first())
Row(lower(value)=
    'the project gutenberg ebook of the adventures of sherlock holmes')
df.columns
['lower(value)']
Alias operation

df = df1.select(lower(col('value')).alias('v'))
df.columns
['v']
Replacing text

df = df1.select(regexp_replace('value', 'Mr\.', 'Mr').alias('v'))

"Mr. Holmes." ==> "Mr Holmes."

df = df1.select(regexp_replace('value', 'don\'t', 'do not').alias('v'))

"don't know." ==> "do not know."

Tokenizing text

df = df2.select(split('v', '[ ]').alias('words'))
df.show(truncate=False)
Tokenizing text – output

+--------------------------------------------------------------------------------------+
|words                                                                                  |
+--------------------------------------------------------------------------------------+
|[the, project, gutenberg, ebook, of, the, adventures, of, sherlock, holmes]           |
|[by, sir, arthur, conan, doyle]                                                       |
|[(#15, in, our, series, by, sir, arthur, conan, doyle)]                               |
|[]                                                                                    |
.
.
.
|[please, read, the, "legal, small, print,", and, other, information, about, the]      |
.
.
.
|[**welcome, to, the, world, of, free, plain, vanilla, electronic, texts**]            |
+--------------------------------------------------------------------------------------+
Discarding split characters

punctuation = "_|.\?\!\",\'\[\]\*()"
df3 = df2.select(split('v', '[ %s]' % punctuation).alias('words'))
df3.show(truncate=False)
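The character class passed to split() can be checked locally with Python's re module (Spark's split() uses Java regexes, but this class behaves the same way). Note the empty strings that appear wherever a split character sits at the start or end of a line, which is why the blank tokens must be filtered out later:

```python
import re

# Space plus the punctuation characters from the slide's character class.
pattern = r"[ _|.?!\",'\[\]*()]"

line = '(#15 in our series by sir arthur conan doyle)'
# The leading '(' and trailing ')' each produce an empty token.
print(re.split(pattern, line))
# ['', '#15', 'in', 'our', 'series', 'by', 'sir', 'arthur', 'conan', 'doyle', '']
```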
Discarding split characters – output

+---------------------------------------------------------------------------------------+
|words                                                                                   |
+---------------------------------------------------------------------------------------+
|[the, project, gutenberg, ebook, of, the, adventures, of, sherlock, holmes]            |
|[by, sir, arthur, conan, doyle]                                                        |
|[, #15, in, our, series, by, sir, arthur, conan, doyle, ]                              |
|[]                                                                                     |
.
.
.
|[please, read, the, , legal, small, print, , , and, other, information, about, the]    |
.
.
.
|[, , welcome, to, the, world, of, free, plain, vanilla, electronic, texts, , ]         |
+---------------------------------------------------------------------------------------+
Exploding an array

df4 = df3.select(explode('words').alias('word'))
df4.show()
Exploding an array – output

+----------+
|      word|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|       the|
|adventures|
|        of|
|  sherlock|
|    holmes|
|        by|
|       sir|
|    arthur|
|     conan|
|     doyle|
+----------+
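Conceptually, explode() turns each element of an array column into a row of its own; locally this is a nested-list flatten (a sketch over an illustrative list, not Spark code):

```python
# Each inner list plays the role of one row's 'words' array.
rows = [['by', 'sir'], [], ['doyle']]

# Flatten: one output row per array element; empty arrays contribute nothing.
exploded = [word for words in rows for word in words]
print(exploded)  # ['by', 'sir', 'doyle']
```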
Explode increases the row count

print(df3.count())
5500
print(df4.count())
131404
Removing empty rows

print(df.count())
131404
nonblank_df = df.where(length('word') > 0)
print(nonblank_df.count())
107320
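The where(length('word') > 0) filter keeps only non-empty tokens. Its logic can be sketched in plain Python over an illustrative word list:

```python
# Tokens as they come out of split(), including the empty strings
# produced by leading/trailing punctuation (illustrative data).
words = ['', '#15', 'in', 'our', 'series', '', 'doyle', '']

# Keep only words with length > 0, mirroring where(length('word') > 0).
nonblank = [w for w in words if len(w) > 0]
print(nonblank)  # ['#15', 'in', 'our', 'series', 'doyle']
```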
Adding a row id column

df2 = df.select('word', monotonically_increasing_id().alias('id'))

df2.show()
Adding a row id column – output

+----------+---+
|      word| id|
+----------+---+
|       the|  0|
|   project|  1|
| gutenberg|  2|
|     ebook|  3|
|        of|  4|
|       the|  5|
|adventures|  6|
|        of|  7|
|  sherlock|  8|
|    holmes|  9|
|        by| 10|
|       sir| 11|
|    arthur| 12|
|     conan| 13|
|     doyle| 14|
|       #15| 15|
+----------+---+
Partitioning the data

df2 = df.withColumn('title', when(df.id < 25000, 'Preface')
                             .when(df.id < 50000, 'Chapter 1')
                             .when(df.id < 75000, 'Chapter 2')
                             .otherwise('Chapter 3'))
df2 = df2.withColumn('part', when(df2.id < 25000, 0)
                             .when(df2.id < 50000, 1)
                             .when(df2.id < 75000, 2)
                             .otherwise(3))
df2.show()
Pengantar Spark SQL dalam Python

Membagi data – output

+----------+---+------------+----+
|word      |id |title       |part|
+----------+---+------------+----+
|the       |0  |     Preface|0   |
|project   |1  |     Preface|0   |
|gutenberg |2  |     Preface|0   |
|ebook     |3  |     Preface|0   |
|of        |4  |     Preface|0   |
|the       |5  |     Preface|0   |
|adventures|6  |     Preface|0   |
|of        |7  |     Preface|0   |
|sherlock  |8  |     Preface|0   |
|holmes    |9  |     Preface|0   |
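The chained when()/otherwise() expression evaluates its conditions in order, so each row id falls into the first bucket whose threshold it is below. The same rule in plain Python (a sketch of the logic, not Spark code):

```python
# Mirror of when(df.id < 25000, 'Preface').when(...).otherwise('Chapter 3'):
# conditions are checked top to bottom, first match wins.
def title_for(row_id):
    if row_id < 25000:
        return 'Preface'
    elif row_id < 50000:
        return 'Chapter 1'
    elif row_id < 75000:
        return 'Chapter 2'
    else:
        return 'Chapter 3'

print(title_for(0))       # Preface
print(title_for(60000))   # Chapter 2
```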
Repartitioning by a column

df2 = df.repartition(4, 'part')
print(df2.rdd.getNumPartitions())
4
Reading pre-partitioned text

$ ls sherlock_parts
sherlock_part0.txt   
sherlock_part1.txt   
sherlock_part2.txt   
sherlock_part3.txt   
sherlock_part4.txt   
sherlock_part5.txt       
sherlock_part6.txt   
sherlock_part7.txt   
sherlock_part8.txt     
sherlock_part9.txt     
sherlock_part10.txt     
sherlock_part11.txt  
sherlock_part12.txt
sherlock_part13.txt
Reading pre-partitioned text

df_parts = spark.read.text('sherlock_parts')
Let's practice!
