Wide and long data formats

Reshaping Data with pandas

Maria Eugenia Inzaugarat

Data Scientist

You will learn

  • Wide and long formats
  • Long to wide transformation
  • Wide to long transformation
  • Stacking and unstacking columns
  • Reshaping and handling complex data, such as string columns or JSON data
Reshaping Data with pandas

Why it is important

  • Tidy datasets
  • Data is not in the appropriate format for analysis:
    • Human readable vs. statistical analysis
  • Nested data in DataFrames is complex to handle
  • Get summary statistics for multi-level index DataFrames
Reshaping Data with pandas

Shape of data

  • The way in which a dataset is organized in rows and columns
fifa_players = pd.read_csv("fifa_players.csv")
fifa_players
                name    age   nationality           club
0       Lionel Messi     32     Argentina      Barcelona
1  Cristiano Ronaldo     34      Portugal       Juventus
2    Neymar da Silva     27        Brazil  Saint-Germain
fifa_players.shape
(3, 4)
Reshaping Data with pandas

Wide format

fifa_players

                name    age   nationality           club
0       Lionel Messi     32     Argentina      Barcelona
1  Cristiano Ronaldo     34      Portugal       Juventus
2    Neymar da Silva     27        Brazil  Saint-Germain

Reshaping Data with pandas

Wide format

fifa_players
                         vv  
                name   | age | nationality          club
0       Lionel Messi   | 32  |  Argentina      Barcelona
1  Cristiano Ronaldo   | 34  |   Portugal       Juventus
2    Neymar da Silva   | 27  |     Brazil  Saint-Germain
                         ^^
  • Each feature is in a separate column
Reshaping Data with pandas

Wide format

fifa_players

                name    age   nationality           club
0       Lionel Messi     32     Argentina      Barcelona <--
1  Cristiano Ronaldo     34      Portugal       Juventus <--
2    Neymar da Silva     27        Brazil  Saint-Germain <--

  • Each feature is in a separate column
  • Each rows contains many features of the same player
Reshaping Data with pandas

Wide format

fifa_players
                name    age   nationality           club
0       Lionel Messi     32     Argentina      Barcelona
 --------------------------------------------------------
1    Cristiano Ronaldo    NaN  <-    Portugal       Juventus
 --------------------------------------------------------
2    Neymar da Silva     27        Brazil  Saint-Germain
  • Each feature is in a separate column
  • Each rows contains many features of the same player
  • No repetition but large number of missing values
  • Simple statistics and imputation
Reshaping Data with pandas

Long format

fifa_players_long.head()
               name     variable     value
0 Cristiano Ronaldo  nationality   Portugal 
1 Cristiano Ronaldo         club   Juventus 
2      Lionel Messi          age         32
3      Lionel Messi  nationality  Argentina 
4      Lionel Messi         club  Barcelona
Reshaping Data with pandas

Long format

fifa_players_long.head()
               name     variable     value
0   Cristiano Ronaldo  nationality   Portugal <--
1 Cristiano Ronaldo         club  Juventus 
2      Lionel Messi          age        32
3        Lionel Messi  nationality  Argentina <--
4      Lionel Messi         club  Barcelona
  • Each row represents one feature
Reshaping Data with pandas

Long format

fifa_players_long.head()
               name     variable     value
0   Cristiano Ronaldo  nationality   Portugal <--
1   Cristiano Ronaldo         club   Juventus <--
2      Lionel Messi          age         32
3      Lionel Messi  nationality  Argentina 
4      Lionel Messi         club  Barcelona
  • Each row represents one feature
  • Multiple rows for each player
Reshaping Data with pandas

Long format

fifa_players_long.head()
  |              name |    variable      value
0 | Cristiano Ronaldo | nationality   Portugal
1 | Cristiano Ronaldo |        club   Juventus
2 |      Lionel Messi |         age         32
3 |      Lionel Messi | nationality  Argentina 
4 |      Lionel Messi |        club  Barcelona
        ^^^^^^^^^^^
  • Each row represents one feature
  • Multiple rows for each player
  • A column (name) to identify same player
Reshaping Data with pandas

Long format

fifa_players_long.head()
               name     variable     value
0 Cristiano Ronaldo  nationality   Portugal 
1 Cristiano Ronaldo         club   Juventus 
2      Lionel Messi          age         32
3      Lionel Messi  nationality  Argentina 
4      Lionel Messi         club  Barcelona
  • Each row represents one feature
  • Multiple rows for each player
  • A column (name) to identify same player
  • Tidy data:
    • Better to summarize data
    • Key-value pairs
    • Preferred for analysis and graphing
Reshaping Data with pandas

Reshaping data

  • Transforming a DataFrame or Series structure to adjust it for analysis
    • Transposing a DataFrame
fifa_players.set_index('club')
                             name  age  nationality
         club 
    Barcelona        Lionel Messi   32    Argentina
     Juventus   Cristiano Ronaldo  NaN     Portugal
Saint-Germain     Neymar da Silva   27       Brazil
Reshaping Data with pandas

Reshaping data

  • Transforming a DataFrame or Series structure to adjust it for analysis
    • Transposing a DataFrame
fifa_players.set_index('club')[['name', 'nationality']]
                               name   nationality 
         club 
    Barcelona        Lionel Messi       Argentina
     Juventus   Cristiano Ronaldo        Portugal
Saint-Germain     Neymar da Silva          Brazil
Reshaping Data with pandas

Reshaping data

  • Transforming a DataFrame or Series structure to adjust it for analysis
    • Transposing a DataFrame
fifa_players.set_index('club')[['name', 'nationality']].transpose()
       club         Barcelona           Juventus     Saint-Germain
       name      Lionel Messi  Cristiano Ronaldo   Neymar da Silva
nationality         Argentina           Portugal            Brazil
Reshaping Data with pandas

Reshaping data

  • Converting data from wide to long format and vice versa
  • Unit of analysis:
    • Long format -> characteristic of a player
    • Wide format -> each player
Reshaping Data with pandas

Wide to long transformation

  • Performed using pandas functions, such as:

    • .melt()
    • .wide_to_long()
Reshaping Data with pandas

Long to wide format

  • Transform data using pandas methods, for example:

    • .pivot()
    • .pivot_table()
Reshaping Data with pandas

Let's practice!

Reshaping Data with pandas

Preparing Video For Download...