Reading, inspecting, and cleaning data from CSV

Importing and Managing Financial Data in Python

Stefan Jansen

Instructor

Import and clean data

  • Ensure that pd.DataFrame() is same as CSV source file
  • Stock exchange listings: amex-listings.csv

AmEx Data

Importing and Managing Financial Data in Python

How pandas stores data

  • Each column has its own data format (dtype)
  • dtype affects your calculation and visualization
pandas dtype Column characteristics
object Text, or a mix of text and numeric data
int64 Numeric: whole numbers - 64 bits ($\le 2^{64}$)
float64 Numeric: Decimals, or whole numbers with missing values
datetime64 Date and time information
Importing and Managing Financial Data in Python

Import & inspect

import pandas as pd

amex = pd.read_csv('amex-listings.csv')
amex.info() # To inspect table structure & data types
RangeIndex: 360 entries, 0 to 359
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
 --  ------                 --------------  -----  
 0   Stock Symbol           360 non-null    object 
 1   Company Name           360 non-null    object 
 2   Last Sale              346 non-null    float64
 3   Market Capitalization  360 non-null    float64
 4   IPO Year               105 non-null    float64
 5   Sector                 238 non-null    object 
 6   Industry               238 non-null    object 
 7   Last Update            360 non-null    object 
dtypes: float64(3), object(5)
Importing and Managing Financial Data in Python

Dealing with missing values

# Replace 'n/a' with np.nan
amex = pd.read_csv('amex-listings.csv', na_values='n/a')

amex.info()
RangeIndex: 360 entries, 0 to 359
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
 --  ------                 --------------  -----  
 0   Stock Symbol           360 non-null    object 
 1   Company Name           360 non-null    object 
 2   Last Sale              346 non-null    float64
 3   Market Capitalization  360 non-null    float64
 4   IPO Year               105 non-null    float64
 5   Sector                 238 non-null    object 
 6   Industry               238 non-null    object 
 7   Last Update            360 non-null    object 
dtypes: float64(3), object(5)
Importing and Managing Financial Data in Python

Properly parsing dates

amex = pd.read_csv('amex-listings.csv',
                   na_values='n/a',
                   parse_dates=['Last Update'])

amex.info()
RangeIndex: 360 entries, 0 to 359
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype         
 --  ------                 --------------  -----  
 0   Stock Symbol           360 non-null    object        
 1   Company Name           360 non-null    object        
 2   Last Sale              346 non-null    float64       
 3   Market Capitalization  360 non-null    float64       
 4   IPO Year               105 non-null    float64       
 5   Sector                 238 non-null    object        
 6   Industry               238 non-null    object        
 7   Last Update            360 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(4)
Importing and Managing Financial Data in Python

Showing off the result

amex.head(2) # Show first n rows (default: 5)
  Stock Symbol   Company Name  
0         XXII   22nd Century Group, Inc   
1          FAX   Aberdeen Asia-Pacific Income Fund Inc   

   Last Sale  Market Capitalization  IPO Year  
0     1.3300           1.206285e+08       NaN  
1     5.0000           1.266333e+09    1986.0  

   Sector        Industry              Last Update  
0  Non-Durables  Farming/Seeds/Milling  2017-04-26  
1  NaN           NaN                    2017-04-25
Importing and Managing Financial Data in Python

Let's practice!

Importing and Managing Financial Data in Python

Preparing Video For Download...