Introduction to Bioconductor in R
Paula Andrea Martinez, PhD.
Data Scientist
Quality value | Chance it is wrong | Accuracy (%) |
---|---|---|
10 | 1 in 10 | 90 |
20 | 1 in 100 | 99 |
30 | 1 in 1000 | 99.9 |
40 | 1 in 10000 | 99.99 |
50 | 1 in 100000 | 99.999 |
# quality encoding
encoding(quality(fqsample))
Encoding characters and their scores
! " # $ % & ' ( ) * + , - . # encoding
0 1 2 3 4 5 6 7 8 9 10 11 12 13 # score
/ 0 1 2 3 4 5 6 7 8 9 : ; < # encoding
14 15 16 17 18 19 20 21 22 23 24 25 26 27 # score
= > ? @ A B C D E F G H I # encoding
28 29 30 31 32 33 34 35 36 37 38 39 40 # score
library(ShortRead)
quality(fqsample)
class: FastqQuality
A BStringSet instance
# Quality is represented with ASCII characters
[1] 40 ?@@DDDDDHDFDHE>AHFEGFIIEBGDBHH<3FEBEEEEG
[2] 40 BCCDFFFFHHHHHJJJJJJJJJJEHHGHIJJJJJJJJJJJ
[3] 40 BCCFFFFFHFHHHJJJJJJIIJJIIIIIGIIJJIJGIJII
[4] 40 CCCFFFFFHHHHHJJJJJJJJJJIJJJJJJJJJJJJJJJJ
library(ShortRead)
sread(fqsample)[1]
# Quality is represented with ASCII characters quality(fqsample)[1]
50 GTCCCATTTACCTCTGACTCTTTTGATGCTGCAATTGCTGCTCATATACT
50 ?@@DDDDDHDFDHE>AHFEGFIIEBGDBHH<3FEBEEEEGGIGIIGHGHC
## PhredQuality instance pq <- PhredQuality(quality(fqsample))
# transform encoding into scores qs <- as(pq, "IntegerList") qs # print scores
30 31 31 35 35 35 35 35 39 35 37 35 39 36 29 32 39 37 36 38 37 40 40 36 33 38 35 33 39 39 27 18 37 36 33 36
36 36 36 38 38 40 38 40 40 38 39 38 39 34
library(ShortRead) # Quality assessment qaSummary <- qa(fqsample, lane = 1) # optional lane
# class: ShortReadQQA(10) # Names accessible with the quality assessment summary names(qaSummary)
[1] "readCounts" "baseCalls" "readQualityScore" "baseQuality"
[5] "alignQuality" "frequentSequences" "sequenceDistribution" "perCycle"
[9] "perTile" "adapterContamination"
# QA elements are accessed with qa[["name"]]
# Get a HTML report
browseURL(report(qaSummary))
library(ShortRead)
# sequences alphabet alphabet(sread(fullSample))
A,C,G,T,M,R,W,S,Y,K,V,H,D,B,N,-,+,.
abc <- alphabetByCycle(sread(fullSample))
# Each observation is a letter and each variable is a cycle. First, select the 4 first rows nucleotides A, C, G, T # Then transpose nucByCycle <- t(abc[1:4,])
nucByCycle <- nucByCycle %>% as_tibble() %>% # convert to tibble mutate(cycle = 1:50) # add cycle numbers nucByCycle
A C G T cycle
16839 16335 16740 10878 1
13056 13327 12064 22389 2
13666 15617 13198 18355 3
14723 15439 14239 16435 4
Introduction to Bioconductor in R