Working with string columns

Reshaping Data with pandas

Maria Eugenia Inzaugarat

Data Scientist

Columns with strings

books
                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251

 

books['title'].dtypes
dtype('O')
Reshaping Data with pandas

String methods

  • pandas Series string processing methods
  • Access easily by str attribute
Reshaping Data with pandas

Splitting into two columns

books
                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251
books['title']
Reshaping Data with pandas

Splitting into two columns

books
                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251
books['title'].str.split(':')
0    [The Civil War,  Vol. 1]
1    [The Civil War,  Vol. 2]
2    [The Civil War,  Vol. 3]
Reshaping Data with pandas

Splitting into two columns

books
                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251
books['title'].str.split(":").str.get(0)
0    The Civil War
1    The Civil War
2    The Civil War
Reshaping Data with pandas

Splitting into two columns

books
                   title  raitings_2015 sold_2015 raitings_2016 sold_2016 
0  The Civil War:Vol. 1             4.3       234           4.2       254   
1  The Civil War:Vol. 2             4.5       525           4.3       515
2  The Civil War:Vol. 3             4.1       242           4.2       251
books['title'].str.split(":", expand=True)
               0        1
0  The Civil War   Vol. 1
1  The Civil War   Vol. 2
2  The Civil War   Vol. 3
Reshaping Data with pandas

Splitting into two columns

books[['main_title', 'subtitle']] = books['title'].str.split(":", expand=True)
books.drop('title', axis=1, inplace=True)
pd.wide_to_long(books , stubnames=['ratings', 'sold'], i=['main_title', 'subtitle'], j='year')
                               ratings sold
   main_title  subtitle  year        
The Civil War    Vol. 1  2015      4.3  234
                         2016      4.2  254
                 Vol. 2  2015      4.5  525
                         2016      4.3  515
                 Vol. 3  2015      4.1  242
                         2016      4.2  251
Reshaping Data with pandas

Concatenate two columns

books_new
   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2
Reshaping Data with pandas

Concatenate two columns

books_new
   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2
books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')
0      Virginia Wolf
1    Margaret Atwood
2         Harper Lee
Reshaping Data with pandas

Concatenate two columns

books_new
   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2
books_new['author'] = books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')

books_new
  name_author lastname_author nationality  number_books           author
0    Virginia            Wolf     British            50    Virginia Wolf
1    Margaret          Atwood    Canadian            40  Margaret Atwood
2      Harper             Lee    American             2       Harper Lee
Reshaping Data with pandas

Concatenate two columns

books_new
   name_author  lastname_author  nationality  number_books
0    Virginia              Wolf      British            50
1    Margaret            Atwood     Canadian            40
2      Harper               Lee     American             2
books_new.melt(id_vars='author', value_vars=['nationality', 'number_books'], var_name='feature', value_name='value') 
            author       feature     value
0    Virginia Wolf   nationality   British
1  Margaret Atwood   nationality  Canadian
2       Harper Lee   nationality  American
3    Virginia Wolf  number_books        50
4  Margaret Atwood  number_books        40
5       Harper Lee  number_books         2
Reshaping Data with pandas

Concatenate index

comics_marvel
           subtitle  year  ratings  sold
main_title                              
Avengers       Next  1992      4.5   234
Avengers    Forever  1998      4.6   224
Avengers       2099  1999      4.8   141
Reshaping Data with pandas

Concatenate index

comics_marvel.head(2)
           subtitle  year  ratings  sold
main_title                              
Avengers       Next  1992      4.5   234
Avengers    Forever  1998      4.6   224
comics_marvel.index = comics_marvel.index.str.cat(comics_marvel['subtitle'], sep='-')

books
                 subtitle  year  ratings  sold
main_title                                    
Avengers-Next        Next  1992      4.5   234
Avengers-Forever  Forever  1998      4.6   224
Avengers-2099        2099  1999      4.8   141
Reshaping Data with pandas

Split index

comics_marvel.index = comics_marvel.index.str.split('-', expand=True)
comics_marvel
                 subtitle  year  ratings  sold
Avengers Next        Next  1992      4.5   234
         Forever  Forever  1998      4.6   224
         2099        2099  1999      4.8   141
Reshaping Data with pandas

Concatenate Series

books_new['name_author']
0    Virginia
1    Margaret
2      Harper
new_list = ['Wolf', 'Atwood', 'Lee']

books_new['name_author'].str.cat(new_list, sep=' ')
0      Virginia Wolf
1    Margaret Atwood
2         Harper Lee
Reshaping Data with pandas

Let's practice!

Reshaping Data with pandas

Preparing Video For Download...