Efficient workflow

Data Manipulation in Julia

Katerina Zahradova

Instructor

Tips for names

Short, meaningful names
- wages rather than df
- wages rather than us_min_wages_data_between_1968_and_2020_with_inflation_adjusted_column

Follow naming conventions/patterns
- mixing styles such as state_wage_2020 and effective.2020.dollars makes names hard to remember
- the same goes for capitalization: avoid having state, Year, and REGION in the same DataFrame
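
One way to enforce a single convention is to normalize every name with a small helper. The sketch below is only an illustration: `standardize_name` is a hypothetical helper, not something from the course, and the column names are the ones used as examples above.

```julia
# Hypothetical helper: lowercase a name and turn dots or spaces into underscores.
standardize_name(name) = lowercase(replace(String(name), r"[.\s]+" => "_"))

column_names = ["state", "Year", "REGION", "effective.2020.dollars"]
clean_names = standardize_name.(column_names)

# With DataFrames.jl loaded, rename!(standardize_name, df) applies the same
# helper to every column name of a DataFrame df.
```

Keeping the rule in one function means every DataFrame in a project can be cleaned the same way.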

Too many variables

Don't create too many new variables
- clutters memory
- causes confusion: what is the difference between wages_no_missing, wages_missing_state_only, wages_original_no_missing, wages_state_mean_no_missing, etc.?
Overwrite! Use select!(), transform!(), etc.
Use chain macros to reduce the need for new versions of the same data
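
A minimal sketch of a chain, assuming the Chain.jl and DataFrames.jl packages are installed. The small `wages` table and the `mean_wage` column name are made up for illustration.

```julia
using DataFrames, Chain, Statistics

# Made-up data for illustration only.
wages = DataFrame(
    state = ["A", "A", "B", "B"],
    year = [2019, 2020, 2019, 2020],
    eff_min_wage = [7.0, missing, 8.0, 9.0],
)

# Each step receives the previous result as its first argument,
# so no intermediate variables are needed.
state_means = @chain wages begin
    dropmissing(:eff_min_wage)
    groupby(:state)
    combine(:eff_min_wage => mean => :mean_wage)
end
```

Only `wages` and `state_means` exist afterwards; the intermediate tables never get names.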

Variables instead of hard coding

Prefer variables over hard-coded values

# Rather
replace_missing = 0

replace!(df.col1, missing => replace_missing)
replace!(df.col2, missing => replace_missing)

# Than
replace!(df.col1, missing => 0)
replace!(df.col2, missing => 0)

Make a function of it

Write a function rather than writing the same code over and over again!
- functions prevent typos
- once set up, they are quicker to use

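using Plots

# Function to plot multiple line plots with labels
function make_line_plot(xs, ys, labels; xlabel="", ylabel="", title="")
    # Start with an empty plot that carries the title and axis labels
    p = plot(title = title, xlabel = xlabel, ylabel = ylabel)
    for (x, y, label) in zip(xs, ys, labels)
        # Add each line to p explicitly rather than to the current plot
        plot!(p, x, y, label = label)
    end
    return p
end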
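
The same idea applies to the missing-value replacement from the previous slide. This is a sketch using plain vectors in place of DataFrame columns; `fill_missing!` is a hypothetical helper name.

```julia
# Hypothetical helper: replace missing values in each column, in place.
function fill_missing!(columns, value)
    for col in columns
        replace!(col, missing => value)
    end
    return columns
end

# Plain vectors stand in for DataFrame columns such as df.col1 and df.col2.
col1 = Union{Missing, Int}[1, missing, 3]
col2 = Union{Missing, Int}[missing, 5, missing]
fill_missing!([col1, col2], 0)
```

Adding another column now means extending the list, not copying another `replace!` line.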

Comment and document

Use comments to say what we are doing

# Standardize names
rename!(df, :ColumnOne => :col_1)

# Rows with a missing company
df[ismissing.(df.company), :]

# Pivoting on year and state
unstack(wages, :year, :state, :eff_min_wage)

Document why we are doing things

# Replace missing wages with the minimum,
# assuming the worst case
min_wage = minimum(skipmissing(df.wages))  # avoid shadowing Base.min
replace!(df.wages, missing => min_wage)

# Joining with countries
# To study how countries influence quality
leftjoin(company, countries, on=:location)
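
When the code lives in a function, the "why" can go into a docstring. The sketch below assumes a plain vector of wages; `fill_with_minimum!` is a hypothetical name used only for illustration.

```julia
"""
    fill_with_minimum!(values)

Replace every `missing` entry of `values` with the smallest observed value.

Why: a missing wage is treated as the worst case, so it is never overestimated.
"""
function fill_with_minimum!(values)
    worst_case = minimum(skipmissing(values))
    return replace!(values, missing => worst_case)
end

wages = Union{Missing, Float64}[7.25, missing, 5.15]
fill_with_minimum!(wages)
```

In the REPL, `?fill_with_minimum!` then shows the reasoning alongside the signature.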

Get to know the data

Take the time to understand the data
- Easier to extract information later
- Make plots, print the results, ...
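
A first look at a DataFrame might go like this, assuming DataFrames.jl is installed. The `wages` table is made-up data for illustration.

```julia
using DataFrames

# Made-up data for illustration only.
wages = DataFrame(
    state = ["A", "A", "B", "B"],
    year = [2019, 2020, 2019, 2020],
    eff_min_wage = [7.0, missing, 8.0, 9.0],
)

println(first(wages, 3))   # peek at the first rows
println(describe(wages))   # per-column summary statistics

# Count the gaps before deciding how to handle them.
n_missing = count(ismissing, wages.eff_min_wage)
println("Missing wages: ", n_missing)
```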

Get to know your data

¹ Photo by Myriam Jessier on Unsplash

Ask for help!

Don't reinvent the wheel; use the resources available
- Google
- Stack Overflow
- DataCamp Cheat Sheets
- ...


Have fun!

Have fun, don't give up, and enjoy!

Flight delays in US airports

Structure of flight data

Let's practice!

