Introducing ShortRead

Introduction to Bioconductor in R

Paula Andrea Martinez, PhD.

Data Scientist

Plant genomes

  • Arabidopsis thaliana is a small flowering plant
  • First plant to have its genome sequenced
  • Genome size 135 megabase pairs (Mbp)

Arabidopsis thaliana


Sequencing companies

Sequencer logos

1 Dan Koboldt massgenomics.org

fastq vs fasta

fastq

@ unique sequence identifier

raw sequence string

+ optional id

quality encoding per sequence letter
  • File extensions: fastq, fq
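
The four-line layout above can be unpacked by hand in base R. This is a minimal sketch, not how ShortRead reads files: the record is made up for illustration, and it assumes the common Phred+33 encoding, where each quality character's ASCII code minus 33 gives the score.

```r
# A made-up four-line fastq record (illustrative, not from the course data)
record <- c(
  "@read_1",   # "@" followed by the unique sequence identifier
  "GATTACA",   # raw sequence string
  "+",         # "+" with the optional id left empty
  "IIIIH#5"    # one quality character per sequence letter
)

identifier <- sub("^@", "", record[1])  # drop the leading "@"
bases      <- record[2]
qualities  <- record[4]

# The sequence and quality strings must have the same length
stopifnot(nchar(bases) == nchar(qualities))

# Assumption: Phred+33 encoding (ASCII code minus 33)
phred <- utf8ToInt(qualities) - 33L
phred
# 40 40 40 40 39 2 20

# Phred score Q corresponds to an error probability of 10^(-Q/10)
error_prob <- 10^(-phred / 10)
```

A high score such as 40 means a 1 in 10,000 chance that the base call is wrong, while the score of 2 at the sixth position marks a base that should not be trusted.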

fasta

> unique sequence identifier

raw sequence string

  • File extensions: fasta, fa, seq
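
A fasta record is simpler because it carries no quality line. The base R sketch below groups lines into records by their ">" headers. The sequences and identifiers are invented for illustration, and it assumes a sequence may wrap over several lines, which fasta files commonly do.

```r
# Made-up fasta content; the first sequence wraps over two lines
fasta_lines <- c(
  ">seq_1",
  "GATTACA",
  "GGC",
  ">seq_2",
  "ACGT"
)

# Header lines start with ">"; every header opens a new record
is_header <- startsWith(fasta_lines, ">")
record_id <- cumsum(is_header)

# Identifiers are the header lines without the leading ">"
ids <- sub("^>", "", fasta_lines[is_header])

# Paste the sequence lines of each record back together
seqs <- tapply(fasta_lines[!is_header], record_id[!is_header],
               paste, collapse = "")
seqs <- setNames(as.vector(seqs), ids)
seqs
```

In practice readFasta() does this work and returns a ShortRead object. The sketch only shows what the two-part record structure means.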

fasta

library(ShortRead)
# read fasta
fasample <- readFasta(dirPath = "data/", pattern = "fasta")

# print fasample
print(fasample)
class: ShortRead
length: 500 reads; width: 50 cycles
# methods accessors
methods(class = "ShortRead")

# Write a ShortRead object
writeFasta(fasample, file = "data/sample.fasta")

fastq

library(ShortRead)
# read fastq
fqsample <- readFastq(dirPath = "data/", pattern = "fastq")

# print fqsample
fqsample
class: ShortReadQ
length: 500 reads; width: 50 cycles
# methods accessors
methods(class = "ShortReadQ")

# Write a ShortRead object
writeFastq(fqsample, file = "data/sample.fastq.gz")
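
The accessors listed by methods(class = "ShortReadQ") can be tried on a small object built in memory, so no file is needed. This is a sketch under assumptions: the two reads, their ids, and their quality strings are made up, and the object is created with the ShortReadQ() constructor rather than read from the course data.

```r
library(ShortRead)

# Toy ShortReadQ object with two made-up reads (not the course data)
toy <- ShortReadQ(
  sread   = DNAStringSet(c("GATTACA", "ACGTACG")),
  quality = BStringSet(c("IIIIH#5", "IIIIIII")),
  id      = BStringSet(c("read_1", "read_2"))
)

# Main accessors
sread(toy)    # the sequences, as a DNAStringSet
id(toy)       # the read identifiers, as a BStringSet
quality(toy)  # the encoded qualities, as a FastqQuality object
width(toy)    # number of bases per read
length(toy)   # number of reads

# Decode the quality characters into a matrix of Phred scores
scores <- as(quality(toy), "matrix")
```

The same accessors apply to fqsample. An object read with readFasta() has no quality slot, so quality() is only meaningful for fastq input.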

fastq sample

library(ShortRead)

# set the seed to draw the same read sequences every time
set.seed(123)
# Subsample of 500 reads
sampler <- FastqSampler("data/SRR1971253.fastq", 500)
# save the yield of 500 read sequences
sample_small <- yield(sampler)
# Class ShortReadQ
class(sample_small)
# length 500 reads
length(sample_small)
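
Why set the seed first? The base R sketch below mimics the idea with sample() on made-up read names. It is an analogy only and does not reproduce how FastqSampler works internally: resetting the same seed before a random draw gives the same subsample each time.

```r
# Hypothetical read names standing in for the reads of a large fastq file
read_ids <- sprintf("read_%04d", 1:2000)

# First draw of 500 reads
set.seed(123)
first_draw <- sample(read_ids, 500)

# Resetting the same seed reproduces exactly the same draw
set.seed(123)
second_draw <- sample(read_ids, 500)

identical(first_draw, second_draw)
# TRUE

# Without resetting the seed, the next draw continues the random stream
third_draw <- sample(read_ids, 500)
```

Anyone rerunning the analysis with the same seed works on the same 500 reads, which makes results comparable across runs.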

You are ready!
