Reading multivariate data

Multivariate Probability Distributions in R

Surajit Ray

Professor, University of Glasgow

Course topics

Read and analyze multivariate data
Explore plotting techniques
Use common statistical distributions
- Gaussian and t distributions
Techniques for high-dimensional data
- Principal component analysis (PCA)

Structure of multivariate data

Rectangular in shape - organized by rows and columns
- Rows represent observations
- Columns represent variables
May or may not include:
- Row names or numbers
- Column headers
Possible missing data
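
A minimal sketch of how these structural features can be inspected in R. The data frame `toy` and its values are hypothetical, made up only for illustration.

```r
# Hypothetical toy data: 3 observations (rows), 2 variables (columns),
# with one missing value in the height column
toy <- data.frame(
  height = c(62, 64, NA),
  weight = c(100, 135, 115)
)

dim(toy)             # number of rows and number of columns
colnames(toy)        # column headers
rownames(toy)        # row names (R assigns defaults when none are given)
colSums(is.na(toy))  # number of missing values in each column
```

Checking the dimensions and the missing-value counts first is one reasonable way to confirm that a file was read with the expected shape.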

Multivariate data examples

Iris Data from Cambridge University website

5.1  3.5  1.4  0.2  1
4.9  3.0  1.4  0.2  1
4.7  3.2  1.3  0.2  1

Birth Weight data (CSV with column header)

"","case","bwt","gestation","parity","age","height","weight","smoke"
"1",1,120,284,0,27,62,100,0
"2",2,113,282,0,33,64,135,0

Reading data

From a URL

iris_url <- "https://mlg.eng.cam.ac.uk/teaching/3f3/1011/iris.data"
iris_raw <- read.table(iris_url, sep = "", header = FALSE)

Locally

iris_raw <- read.table("iris.txt", sep = "", header = FALSE)
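
To try the same call without a file or network access, the sketch below passes the three sample rows shown earlier through the `text` argument of `read.table()`. The object names `iris_lines` and `iris_small` are assumptions introduced for this example.

```r
# Three whitespace-separated rows copied from the Iris sample above
iris_lines <- "5.1  3.5  1.4  0.2  1
4.9  3.0  1.4  0.2  1
4.7  3.2  1.3  0.2  1"

# sep = "" splits on any run of whitespace; header = FALSE because
# the data has no column header line
iris_small <- read.table(text = iris_lines, sep = "", header = FALSE)

dim(iris_small)    # rows and columns that were read
names(iris_small)  # default column names assigned by read.table()
```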

Viewing the dataset

head(iris_raw, n = 4)

     V1  V2  V3  V4 V5
1   5.1 3.5 1.4 0.2  1  
2   4.9 3.0 1.4 0.2  1  
3   4.7 3.2 1.3 0.2  1  
4   4.6 3.1 1.5 0.2  1

Assigning column names

colnames(iris_raw) <- c("Sepal.Length", "Sepal.Width", "Petal.Length",
                        "Petal.Width", "Species")

head(iris_raw)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2       1
2          4.9         3.0          1.4         0.2       1
3          4.7         3.2          1.3         0.2       1
4          4.6         3.1          1.5         0.2       1
5          5.0         3.6          1.4         0.2       1
6          5.4         3.9          1.7         0.4       1

Accessing specific columns

Check current names of columns

names(iris_raw)

"Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Accessing Sepal length and Sepal width columns

iris_raw[, 1:2]
iris_raw[, c('Sepal.Length', 'Sepal.Width')]
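
A short sketch comparing selection by position and by name. The data frame `iris_small` is a hypothetical three-row stand-in, not the full dataset.

```r
# Hypothetical three-row stand-in with named columns
iris_small <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7),
  Sepal.Width  = c(3.5, 3.0, 3.2),
  Petal.Length = c(1.4, 1.4, 1.3)
)

# Both forms select the same two columns
by_position <- iris_small[, 1:2]
by_name     <- iris_small[, c("Sepal.Length", "Sepal.Width")]
identical(by_position, by_name)

# Selecting a single column returns a plain vector by default;
# drop = FALSE keeps the result as a data frame
one_col_vec <- iris_small[, "Sepal.Length"]
one_col_df  <- iris_small[, "Sepal.Length", drop = FALSE]
```

Selecting by name is usually the safer choice, since it keeps working if the column order changes.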

Changing data types

Change the last variable Species to a factor

iris_raw$Species <- as.factor(iris_raw$Species)

str(iris_raw)

'data.frame':    150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
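
A small sketch of what `as.factor()` does to a numeric code vector. The vector `codes` is hypothetical.

```r
# Hypothetical numeric species codes
codes <- c(1, 1, 2, 3)

codes_f <- as.factor(codes)

class(codes)     # class before conversion
class(codes_f)   # class after conversion
levels(codes_f)  # distinct values, stored as character labels
```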

Assigning factor labels

Recode the species labels from 1, 2 and 3 to setosa, versicolor and virginica

Use recode() from the car package on the Species factor

library(car) 
iris_raw$Species <- recode(iris_raw$Species,
                          " 1 ='setosa'; 2 = 'versicolor'; 3 = 'virginica'")

str(iris_raw)

'data.frame':    150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
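
As an alternative that needs no extra package, the same relabelling can be sketched with base R's `factor()` and its `levels` and `labels` arguments. This is a swapped-in technique, not the `car::recode()` call used above, and the vector `species_codes` is hypothetical.

```r
# Hypothetical numeric codes standing in for the Species column
species_codes <- c(1, 1, 2, 3, 3)

# Map code 1 -> setosa, 2 -> versicolor, 3 -> virginica
species <- factor(
  species_codes,
  levels = c(1, 2, 3),
  labels = c("setosa", "versicolor", "virginica")
)

levels(species)  # the three species labels
table(species)   # number of observations per species
```

Supplying `levels` explicitly fixes the mapping between codes and labels, so it does not depend on the order in which the codes appear in the data.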

Reading csv data with named columns

Birth Weight data (CSV with column header)

"","case","bwt","gestation","parity","age","height","weight","smoke"
"1",1,120,284,0,27,62,100,0
"2",2,113,282,0,33,64,135,0
"3",3,128,279,0,28,64,115,1

Reading Birth Weight data

bwt <- read.csv("birthweight.csv", row.names = 1)
head(bwt, n = 3)

  case bwt gestation parity age height weight smoke
1    1 120       284      0  27     62    100     0
2    2 113       282      0  33     64    135     0
3    3 128       279      0  28     64    115     1
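
The effect of `row.names = 1` can be checked without the file by reading the three sample rows shown above through the `text` argument of `read.csv()`. The names `bwt_lines` and `bwt_small` are assumptions introduced for this sketch.

```r
# Sample rows copied from the Birth Weight excerpt above
bwt_lines <- '"","case","bwt","gestation","parity","age","height","weight","smoke"
"1",1,120,284,0,27,62,100,0
"2",2,113,282,0,33,64,135,0
"3",3,128,279,0,28,64,115,1'

# row.names = 1 uses the first (unnamed) column as row names
# instead of keeping it as a data column
bwt_small <- read.csv(text = bwt_lines, row.names = 1)

names(bwt_small)     # the eight named variables
rownames(bwt_small)  # values taken from the first column
```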

Let's read some multivariate data!
