Descriptive statistics

Introduction to Julia

James Fulton

Climate informatics researcher

Describe function

# Summarize the runs DataFrame
println(describe(df_run))
4×7 DataFrame
 Row | variable  mean      min     median  max        nmissing  eltype
     | Symbol    Union…    Any     Union…  Any        Int64     DataType
_____|__________________________________________________________________
   1 | day                 Monday          Wednesday         0  String
   2 | distance  3833.33   2000    4000.0  5000              0  Int64
   3 | time      23.6967   14.99   23.745  31.68             0  Float64
   4 | raining   0.666667  false   1.0     true              0  Bool

Summary statistics on columns

using Statistics

Functions in Statistics:

  • mean() - Calculate mean of array
  • median() - Calculate median value of array
  • std() - Calculate standard deviation of array values
  • var() - Calculate variance of array values
    # Calculate average of distance column
    average_distance = mean(df_run[:, "distance"])
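The four functions above can be tried without a DataFrame. This is a minimal sketch that assumes the six distances shown in the df_run table are copied into a plain vector named distances (a name chosen here for illustration).

```julia
using Statistics

# Distances in metres, copied from the distance column of df_run
distances = [2000, 5000, 3500, 3000, 4500, 5000]

average_distance = mean(distances)     # arithmetic mean
middle_distance = median(distances)    # 4000.0, matching the describe() output
spread = std(distances)                # sample standard deviation (divides by n - 1)
variance = var(distances)              # sample variance, the square of std

println(average_distance)
println(middle_distance)
println(spread)
println(variance)
```

Both std() and var() use the sample (n - 1) form by default, so var(distances) equals std(distances)^2 up to floating-point rounding.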
    

Other built-in summary functions

  • sum() - Calculate sum of array
  • minimum() - Calculate minimum value in array
  • maximum() - Calculate maximum value in array
total_distance = sum(df_run[:, "distance"])        # Returns 23000

minimum_distance = minimum(df_run[:, "distance"])  # Returns 2000

maximum_distance = maximum(df_run[:, "distance"])  # Returns 5000
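As a quick cross-check, the built-in functions can be combined to reproduce the mean by hand. The sketch below assumes the distance column is copied into a plain vector; the variable names are illustrative only.

```julia
using Statistics

# Distances in metres, copied from the distance column of df_run
distances = [2000, 5000, 3500, 3000, 4500, 5000]

total_distance = sum(distances)
shortest = minimum(distances)
longest = maximum(distances)

# The mean is the total divided by the number of runs
mean_from_total = total_distance / length(distances)

# The range is the gap between the longest and shortest run
distance_range = longest - shortest

println(mean_from_total ≈ mean(distances))   # true
println(distance_range)                      # 3000
```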

Column operations

For columns a and b of DataFrame df

Operation       Scalar example          Array example
Addition        df.a .+ 1               df.a .+ df.b or df.a + df.b
Subtraction     df.a .- 1               df.a .- df.b or df.a - df.b
Multiplication  2 .* df.a or 2 * df.a   df.a .* df.b
Division        df.a ./ 2 or df.a / 2   df.a ./ df.b
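The same rules apply to ordinary vectors, so they can be checked without a DataFrame. In this sketch, a and b are made-up vectors standing in for df.a and df.b.

```julia
# Two example vectors standing in for columns df.a and df.b
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

println(a .+ 1)    # [2.0, 3.0, 4.0]
println(a + b)     # [5.0, 7.0, 9.0]
println(2 * a)     # [2.0, 4.0, 6.0]
println(a .* b)    # [4.0, 10.0, 18.0]
println(a ./ b)    # [0.25, 0.4, 0.5]

# Multiplying two vectors without the dot is not defined,
# which is why the table lists only the dotted form for that case
undotted_fails = try
    a * b
    false
catch err
    err isa MethodError
end
println(undotted_fails)   # true
```

When in doubt, the dotted form works for every elementwise case in the table.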

Calculating run speed

# Convert distances to kilometers
distance_km = df_run.distance ./ 1000
# Convert run times to hours
time_hr = df_run.time ./ 60
println(distance_km)
println(time_hr)
[2.0, 5.0, 3.5, 3.0, 4.5, 5.0]
[0.25, 0.53, 0.37, 0.29, 0.42, 0.51]
6×4 DataFrame   
 Row | distance     time  ...
     |    Int64  Float64  ...
_____|__________________  ...
   1 |     2000    14.99  ...
   2 |     5000    31.68  ...
   3 |     3500    22.02  ...
   4 |     3000    17.25  ...
   5 |     4500    25.47  ...
   6 |     5000    30.77  ...

# Run speed in km/hr
speeds = distance_km ./ time_hr


println(speeds)
[8.01, 9.47, 9.54, 10.43, 10.60, 9.75]
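The speeds above are shown to two decimal places, whereas println() on the raw result prints full floating-point precision. A minimal sketch of one way to get the rounded view, assuming the distance and time columns are copied into plain vectors:

```julia
# Distances (m) and times (min), copied from df_run
distance_km = [2000, 5000, 3500, 3000, 4500, 5000] ./ 1000
time_hr = [14.99, 31.68, 22.02, 17.25, 25.47, 30.77] ./ 60

# Run speed in km/hr
speeds = distance_km ./ time_hr

# Broadcast round() over the array to keep two decimal places
speeds_rounded = round.(speeds, digits=2)
println(speeds_rounded)
```

Rounding is best applied only for display, so that later calculations still use the unrounded speeds.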

Column assignment

# Assign run speeds to new column named "speed"
df_run[:, "speed"] = distance_km ./ time_hr
# Assign using dot form
df_run.speed = distance_km ./ time_hr
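Both assignment forms can be tried on a small stand-in table. This sketch assumes the DataFrames.jl package is installed and uses only the first three runs; the name df_small is chosen here for illustration.

```julia
using DataFrames   # assumption: DataFrames.jl is installed

# A three-row stand-in for df_run
df_small = DataFrame(distance = [2000, 5000, 3500], time = [14.99, 31.68, 22.02])

# Speed in km/hr, computed elementwise from the two columns
speed = (df_small.distance ./ 1000) ./ (df_small.time ./ 60)

# Create the new column with bracket indexing
df_small[:, "speed"] = speed

# Assigning with the dot form to an existing column replaces its values
df_small.speed = round.(speed, digits=2)

println(names(df_small))   # ["distance", "time", "speed"]
println(size(df_small))    # (3, 3)
```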

Column assignment

println(df_run)
6×5 DataFrame
 Row | day        distance     time  raining    speed
     | String        Int64  Float64     Bool  Float64
_____|_______________________________________________
   1 | Wednesday      2000    14.99     true     8.01
   2 | Monday         5000    31.68    false     9.47
   3 | Thursday       3500    22.02     true     9.54
   4 | Tuesday        3000    17.25     true    10.43
   5 | Thursday       4500    25.47    false    10.60
   6 | Monday         5000    30.77     true     9.75

Let's practice!
