Sequence quality

Pengantar Bioconductor di R

Paula Andrea Martinez, PhD.

Data Scientist

Quality scores - Phred table

Quality value Chance it is wrong Accuracy (%)
10 1 in 10 90
20 1 in 100 99
30 1 in 1000 99.9
40 1 in 10000 99.99
50 1 in 100000 99.999
Pengantar Bioconductor di R

Encoding - Phred +33

# quality encoding
encoding(quality(fqsample))

Encoding characters and their scores

 !  "  #  $  %  &  '  (  )  *  +  ,  -  .   #  encoding
 0  1  2  3  4  5  6  7  8  9 10 11 12 13   #  score

 /  0  1  2  3  4  5  6  7  8  9  :  ;  <   #  encoding 
14 15 16 17 18 19 20 21 22 23 24 25 26 27   #  score 

 =  >  ?  @  A  B  C  D  E  F  G  H  I      #  encoding
28 29 30 31 32 33 34 35 36 37 38 39 40      #  score
Pengantar Bioconductor di R

fastq quality

library(ShortRead)
quality(fqsample)
class: FastqQuality
A BStringSet instance 

# Quality is represented with ASCII characters 
[1]    40 ?@@DDDDDHDFDHE>AHFEGFIIEBGDBHH<3FEBEEEEG
[2]    40 BCCDFFFFHHHHHJJJJJJJJJJEHHGHIJJJJJJJJJJJ
[3]    40 BCCFFFFFHFHHHJJJJJJIIJJIIIIIGIIJJIJGIJII
[4]    40 CCCFFFFFHHHHHJJJJJJJJJJIJJJJJJJJJJJJJJJJ
Pengantar Bioconductor di R
library(ShortRead)

sread(fqsample)[1]
# Quality is represented with ASCII characters quality(fqsample)[1]
50 GTCCCATTTACCTCTGACTCTTTTGATGCTGCAATTGCTGCTCATATACT

50 ?@@DDDDDHDFDHE>AHFEGFIIEBGDBHH<3FEBEEEEGGIGIIGHGHC
## PhredQuality instance
pq <- PhredQuality(quality(fqsample))

# transform encoding into scores qs <- as(pq, "IntegerList") qs # print scores
30 31 31 35 35 35 35 35 39 35 37 35 39 36 29 32 39 37 36 38 37 40 40 36 33 38 35 33 39 39 27 18 37 36 33 36
36 36 36 38 38 40 38 40 40 38 39 38 39 34
Pengantar Bioconductor di R

Quality assessment

library(ShortRead)
# Quality assessment
qaSummary <- qa(fqsample, lane = 1)    # optional lane

# class: ShortReadQQA(10) # Names accessible with the quality assessment summary names(qaSummary)
 [1] "readCounts"           "baseCalls"            "readQualityScore"          "baseQuality"
 [5] "alignQuality"         "frequentSequences"    "sequenceDistribution"      "perCycle"  
 [9] "perTile"              "adapterContamination"
# QA elements are accessed with qa[["name"]]
# Get a HTML report
browseURL(report(qaSummary))
Pengantar Bioconductor di R
library(ShortRead)

# sequences alphabet alphabet(sread(fullSample))
A,C,G,T,M,R,W,S,Y,K,V,H,D,B,N,-,+,.
abc <- alphabetByCycle(sread(fullSample))

# Each observation is a letter and each variable is a cycle. First, select the 4 first rows nucleotides A, C, G, T # Then transpose nucByCycle <- t(abc[1:4,])
nucByCycle <- nucByCycle %>% as_tibble() %>% # convert to tibble mutate(cycle = 1:50) # add cycle numbers nucByCycle
    A     C     G     T cycle
16839 16335 16740 10878     1     
13056 13327 12064 22389     2     
13666 15617 13198 18355     3     
14723 15439 14239 16435     4
Pengantar Bioconductor di R

Are you excited?

Pengantar Bioconductor di R

Preparing Video For Download...