Cleaning Data

Intermediate Julia

Anthony Markham

Quantitative Developer

Show column information

  • Column names are the first things that you will look at in a dataset.
  • Generally speaking, column names should be legible and concise.
println(first(stock_data))
DataFrameRow
 Row | Date        Open     High     Low      Close    Adj Close  Volume
     | String15    Float64  Float64  Float64  Float64  Float64    Int64
<----|---------------------------------------------------------------------
   1 | 21/01/2022   164.42   166.33    162.3   162.41    161.473  122848900
Intermediate Julia

Rename a column

  • We might want to rename a column if we feel a different label is more appropriate.
 Row | Date        Open     High     Low      Close    Adj Close  Volume
     | String15    Float64  Float64  Float64  Float64  Float64    Int64
<----|---------------------------------------------------------------------
   1 | 21/01/2022   164.42   166.33    162.3   162.41    161.473  122848900
  • In this case, Adj Close has a space, which we want to avoid.
rename!(stock_data, Dict(:Adj Close => :Adj_Close))
 Row | Date        Open     High     Low      Close    Adj_Close  Volume
     | String15    Float64  Float64  Float64  Float64  Float64    Int64
<----|---------------------------------------------------------------------
   1 | 21/01/2022   164.42   166.33    162.3   162.41    161.473  122848900
Intermediate Julia

Describe and find missing data

  • Missing values are a common source of inconsistency in data.
  • There can be various reasons for missing data:

    • measurement errors
    • transcription errors
    • intentionally missing
  • The describe method can be used to quickly find missing data in a DataFrame.

println(describe(stock_data))
Intermediate Julia

Describe missing data

  • describe() gives us a general overview of our entire DataFrame.
7×7 DataFrame
 Row | variable   mean       min        median     max        nmissing  eltype
     | Symbol     Union     Any          Union     Any        Int64     DataType
<--------------------------------------------------------------------------------
   1 | Date                  1/02/2022             9/12/2022         0  String15
   2 | Open       152.613    126.01     151.19     178.55            0  Float64
   3 | High       154.721    127.77     153.72     179.61            0  Float64
   4 | Low        150.529    124.17     149.34     176.7             0  Float64
   5 | Close      152.698    125.02     151.21     178.96            4  Union
   6 | Adj_Close  152.296    125.02     151.07     178.154           0  Float64
   7 | Volume     8.70851e7  35195900   8.22912e7  182602000         0  Int64
Intermediate Julia

Remove missing data

  • We can drop the missing rows using dropmissing.
println(nrow(stock_data))
252
dropmissing!(stock_data, :"Close")
  • We can confirm that we dropped 4 rows using nrow() again.
println(nrow(stock_data))
248
Intermediate Julia

Replace missing data

  • Simply removing rows with missing values is often not an acceptable approach.
  • Depending on the data, we can replace missing values with a substitute.
replace!(stock_data[!, "Close"], missing => 130)
println(stock_data[[202, 227, 235, 240], :])
 Row | Date        Open     High     Low      Close     Adj Close  Volume
     | String15    Float64  Float64  Float64  Float64?  Float64    Int64
<----|-----------------------------------------------------------------------
   1 | 8/11/2022    140.41   141.43   137.49    130      139.5    89908500
   2 | 14/12/2022   145.35   146.66   141.16    130      143.21   82291200
   3 | 25/12/2022   130.92   132.42   129.64    130      131.86   63814900
   4 | 3/01/2023    130.28   130.9    124.17    130      125.07  112117500
  • Choosing an arbitrary value to replace missing values with is rarely correct.
Intermediate Julia

Let's practice!

Intermediate Julia

Preparing Video For Download...