Sequence quality

Introduction to Bioconductor in R

Paula Andrea Martinez, PhD.

Data Scientist

Quality scores - Phred table

Quality value Chance it is wrong Accuracy (%)
10 1 in 10 90
20 1 in 100 99
30 1 in 1000 99.9
40 1 in 10000 99.99
50 1 in 100000 99.999
Introduction to Bioconductor in R

Encoding - Phred +33

# quality encoding
encoding(quality(fqsample))

Encoding characters and their scores

 !  "  #  $  %  &  '  (  )  *  +  ,  -  .   #  encoding
 0  1  2  3  4  5  6  7  8  9 10 11 12 13   #  score

 /  0  1  2  3  4  5  6  7  8  9  :  ;  <   #  encoding 
14 15 16 17 18 19 20 21 22 23 24 25 26 27   #  score 

 =  >  ?  @  A  B  C  D  E  F  G  H  I      #  encoding
28 29 30 31 32 33 34 35 36 37 38 39 40      #  score
Introduction to Bioconductor in R

fastq quality

library(ShortRead)
quality(fqsample)
class: FastqQuality
A BStringSet instance 

# Quality is represented with ASCII characters 
[1]    40 ?@@DDDDDHDFDHE>AHFEGFIIEBGDBHH<3FEBEEEEG
[2]    40 BCCDFFFFHHHHHJJJJJJJJJJEHHGHIJJJJJJJJJJJ
[3]    40 BCCFFFFFHFHHHJJJJJJIIJJIIIIIGIIJJIJGIJII
[4]    40 CCCFFFFFHHHHHJJJJJJJJJJIJJJJJJJJJJJJJJJJ
Introduction to Bioconductor in R
library(ShortRead)

sread(fqsample)[1]
# Quality is represented with ASCII characters quality(fqsample)[1]
50 GTCCCATTTACCTCTGACTCTTTTGATGCTGCAATTGCTGCTCATATACT

50 ?@@DDDDDHDFDHE>AHFEGFIIEBGDBHH<3FEBEEEEGGIGIIGHGHC
## PhredQuality instance
pq <- PhredQuality(quality(fqsample))

# transform encoding into scores qs <- as(pq, "IntegerList") qs # print scores
30 31 31 35 35 35 35 35 39 35 37 35 39 36 29 32 39 37 36 38 37 40 40 36 33 38 35 33 39 39 27 18 37 36 33 36
36 36 36 38 38 40 38 40 40 38 39 38 39 34
Introduction to Bioconductor in R

Quality assessment

library(ShortRead)
# Quality assessment
qaSummary <- qa(fqsample, lane = 1)    # optional lane

# class: ShortReadQQA(10) # Names accessible with the quality assessment summary names(qaSummary)
 [1] "readCounts"           "baseCalls"            "readQualityScore"          "baseQuality"
 [5] "alignQuality"         "frequentSequences"    "sequenceDistribution"      "perCycle"  
 [9] "perTile"              "adapterContamination"
# QA elements are accessed with qa[["name"]]
# Get a HTML report
browseURL(report(qaSummary))
Introduction to Bioconductor in R
library(ShortRead)

# sequences alphabet alphabet(sread(fullSample))
A,C,G,T,M,R,W,S,Y,K,V,H,D,B,N,-,+,.
abc <- alphabetByCycle(sread(fullSample))

# Each observation is a letter and each variable is a cycle. First, select the 4 first rows nucleotides A, C, G, T # Then transpose nucByCycle <- t(abc[1:4,])
nucByCycle <- nucByCycle %>% as_tibble() %>% # convert to tibble mutate(cycle = 1:50) # add cycle numbers nucByCycle
    A     C     G     T cycle
16839 16335 16740 10878     1     
13056 13327 12064 22389     2     
13666 15617 13198 18355     3     
14723 15439 14239 16435     4
Introduction to Bioconductor in R

Are you excited?

Introduction to Bioconductor in R

Preparing Video For Download...