Working with string columns

Reshaping Data with pandas

Maria Eugenia Inzaugarat

Data Scientist

Columns with strings

books

                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251

books['title'].dtypes

dtype('O')

String methods

pandas Series string processing methods
Access easily by str attribute

Splitting into two columns

books

                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251

books['title']

Splitting into two columns

books

                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251

books['title'].str.split(':')

0    [The Civil War,  Vol. 1]
1    [The Civil War,  Vol. 2]
2    [The Civil War,  Vol. 3]

Splitting into two columns

books

                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251

books['title'].str.split(":").str.get(0)

0    The Civil War
1    The Civil War
2    The Civil War

Splitting into two columns

books

                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251

books['title'].str.split(":", expand=True)

               0        1
0  The Civil War   Vol. 1
1  The Civil War   Vol. 2
2  The Civil War   Vol. 3

Splitting into two columns

books[['main_title', 'subtitle']] = books['title'].str.split(":", expand=True)

books.drop('title', axis=1, inplace=True)
pd.wide_to_long(books , stubnames=['ratings', 'sold'], i=['main_title', 'subtitle'], j='year')

                               ratings sold
   main_title  subtitle  year        
The Civil War    Vol. 1  2015      4.3  234
                         2016      4.2  254
                 Vol. 2  2015      4.5  525
                         2016      4.3  515
                 Vol. 3  2015      4.1  242
                         2016      4.2  251

Concatenate two columns

books_new

   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2

Concatenate two columns

books_new

   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2

books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')

0      Virginia Wolf
1    Margaret Atwood
2         Harper Lee

Concatenate two columns

books_new

   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2

books_new['author'] = books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')

books_new

  name_author lastname_author nationality  number_books           author
0    Virginia            Wolf     British            50    Virginia Wolf
1    Margaret          Atwood    Canadian            40  Margaret Atwood
2      Harper             Lee    American             2       Harper Lee

Concatenate two columns

books_new

   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2

books_new.melt(id_vars='author', value_vars=['nationality', 'number_books'], var_name='feature', value_name='value')

            author       feature     value
0    Virginia Wolf   nationality   British
1  Margaret Atwood   nationality  Canadian
2       Harper Lee   nationality  American
3    Virginia Wolf  number_books        50
4  Margaret Atwood  number_books        40
5       Harper Lee  number_books         2

Concatenate index

comics_marvel

           subtitle  year  ratings  sold
main_title                              
Avengers       Next  1992      4.5   234
Avengers    Forever  1998      4.6   224
Avengers       2099  1999      4.8   141

Concatenate index

comics_marvel.head(2)

           subtitle  year  ratings  sold
main_title                              
Avengers       Next  1992      4.5   234
Avengers    Forever  1998      4.6   224

comics_marvel.index = comics_marvel.index.str.cat(comics_marvel['subtitle'], sep='-')

books

                 subtitle  year  ratings  sold
main_title                                    
Avengers-Next        Next  1992      4.5   234
Avengers-Forever  Forever  1998      4.6   224
Avengers-2099        2099  1999      4.8   141

Split index

comics_marvel.index = comics_marvel.index.str.split('-', expand=True)
comics_marvel

                 subtitle  year  ratings  sold
Avengers Next        Next  1992      4.5   234
         Forever  Forever  1998      4.6   224
         2099        2099  1999      4.8   141

Concatenate Series

books_new['name_author']

0    Virginia
1    Margaret
2      Harper

new_list = ['Wolf', 'Atwood', 'Lee']

books_new['name_author'].str.cat(new_list, sep=' ')

0      Virginia Wolf
1    Margaret Atwood
2         Harper Lee

Let's practice!

Reshaping Data with pandas