Multivariate Probability Distributions in R
Surajit Ray
Professor, University of Glasgow
Use common statistical distributions
Techniques for high-dimensional data
Iris Data from Cambridge University website
5.1 3.5 1.4 0.2 1
4.9 3.0 1.4 0.2 1
4.7 3.2 1.3 0.2 1
Birth Weight data (CSV with column header)
"","case","bwt","gestation","parity","age","height","weight","smoke"
"1",1,120,284,0,27,62,100,0
"2",2,113,282,0,33,64,135,0
From a URL
iris_url <- "https://mlg.eng.cam.ac.uk/teaching/3f3/1011/iris.data"
iris_raw <- read.table(iris_url, sep = "", header = FALSE)
Locally
iris_raw <- read.table("iris.txt", sep = "", header = FALSE)
head(iris_raw, n = 4)
V1 V2 V3 V4 V5
1 5.1 3.5 1.4 0.2 1
2 4.9 3.0 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
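As a minimal sketch of what the `read.table()` call above does with a headerless, whitespace-delimited file, the same three sample rows can be parsed from an in-memory string through the `text` argument. The names `iris_text` and `iris_small` are hypothetical and used only for illustration.

```r
# Three rows copied from the sample shown earlier; there is no header line
iris_text <- "5.1 3.5 1.4 0.2 1
4.9 3.0 1.4 0.2 1
4.7 3.2 1.3 0.2 1"

# sep = "" treats any run of whitespace as the field delimiter;
# header = FALSE tells R that the first line is data, not column names
iris_small <- read.table(text = iris_text, sep = "", header = FALSE)

names(iris_small)  # R assigns default names V1, V2, ... when no header is read
dim(iris_small)    # number of rows and columns
```

Because no header is read, the columns carry the default names until they are renamed explicitly.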
colnames(iris_raw) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
head(iris_raw)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 1
2 4.9 3.0 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
5 5.0 3.6 1.4 0.2 1
6 5.4 3.9 1.7 0.4 1
Check current names of columns
names(iris_raw)
"Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Accessing Sepal length and Sepal width columns
iris_raw[, 1:2]
iris_raw[, c("Sepal.Length", "Sepal.Width")]
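The two ways of selecting columns can be compared directly. The sketch below assumes R's built-in `iris` data frame (from the `datasets` package) as a stand-in for `iris_raw`, since its first four columns have the same names; `by_index`, `by_name`, `as_vector` and `as_df` are hypothetical names.

```r
# Assumption: the built-in iris data frame stands in for iris_raw here
by_index <- iris[, 1:2]                               # select by column position
by_name  <- iris[, c("Sepal.Length", "Sepal.Width")]  # select by column name

# Both selections return the same two-column data frame
identical(by_index, by_name)

# Selecting a single column returns a plain vector by default;
# drop = FALSE keeps the result as a one-column data frame
as_vector <- iris[, "Sepal.Length"]
as_df     <- iris[, "Sepal.Length", drop = FALSE]
```

Selecting by name is usually the safer choice, because it keeps working if the column order changes.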
Change the last variable Species to a factor
iris_raw$Species <- as.factor(iris_raw$Species)
str(iris_raw)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
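The effect of the factor conversion can be seen on a small example. This is a sketch on a made-up data frame (`toy` is a hypothetical name, not part of the course data) whose `Species` column holds numeric codes.

```r
# Hypothetical toy data: species stored as numeric codes
toy <- data.frame(Species = c(1, 1, 2, 3))

is.factor(toy$Species)  # numeric before conversion

# Convert the numeric codes to a factor; each distinct code becomes a level
toy$Species <- as.factor(toy$Species)

is.factor(toy$Species)
levels(toy$Species)     # levels are stored as character strings
```

Column names are case-sensitive, so `toy$species` would not refer to the `Species` column.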
Recode the species labels from 1, 2 and 3 to setosa, versicolor and virginica
library(car)
iris_raw$Species <- recode(iris_raw$Species,
" 1 ='setosa'; 2 = 'versicolor'; 3 = 'virginica'")
str(iris_raw)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
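The same relabelling can also be done in base R, without the car package. The sketch below is an alternative to the `recode()` call above, not the method used in the slides; it relies on the `levels` and `labels` arguments of `factor()`, and `codes` and `species` are hypothetical names.

```r
# Hypothetical vector of numeric species codes
codes <- c(1, 1, 2, 3, 2)

# levels lists the codes as they appear in the data;
# labels gives the name attached to each code, in the same order
species <- factor(codes,
                  levels = c(1, 2, 3),
                  labels = c("setosa", "versicolor", "virginica"))

levels(species)
table(species)  # counts per species
```

This approach needs the codes and labels to be listed in matching order, which makes the mapping explicit in one place.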
Birth Weight data (CSV with column header)
"","case","bwt","gestation","parity","age","height","weight","smoke"
"1",1,120,284,0,27,62,100,0
"2",2,113,282,0,33,64,135,0
"3",3,128,279,0,28,64,115,1
Reading Birth Weight data
bwt <- read.csv("birthweight.csv", row.names = 1)
head(bwt, n = 3)
case bwt gestation parity age height weight smoke
1 1 120 284 0 27 62 100 0
2 2 113 282 0 33 64 135 0
3 3 128 279 0 28 64 115 1
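To see what `row.names = 1` does, the three sample rows shown earlier can be read from a string instead of a file. This is a minimal sketch; `csv_text` and `bwt_small` are hypothetical names used only for illustration.

```r
# Sample rows copied from the Birth Weight data shown earlier
csv_text <- '"","case","bwt","gestation","parity","age","height","weight","smoke"
"1",1,120,284,0,27,62,100,0
"2",2,113,282,0,33,64,135,0
"3",3,128,279,0,28,64,115,1'

# row.names = 1 uses the first (unnamed) column as row names
# instead of keeping it as a data column
bwt_small <- read.csv(text = csv_text, row.names = 1)

names(bwt_small)     # the eight named variables
rownames(bwt_small)  # values taken from the first column
```

Without `row.names = 1`, the unnamed first column would be kept as an extra variable in the data frame.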