Match and filter

Introduction to Bioconductor in R

Paula Andrea Martinez, PhD.

Data Scientist

Duplicate sequences

  • Biological sequence duplicates occur in nature
  • Amplification from the steps in library preparation (PCR)
  • Sequencing the sample more than once

Remove duplicates or at least mark them

  • Whole genome sequencing or exome sequencing

Mark duplicates using a threshold

  • RNA-seq and ChIP-seq

srduplicated

library(ShortRead)

# Counting duplicates: TRUE is the number of duplicates
table(srduplicated(dfqsample))
FALSE  TRUE 
  500   500
# Cleaning reads from duplicates: x[fun(x)]
cleanReads <- mydReads[srduplicated(mydReads) == FALSE]

# Counting duplicates again
table(srduplicated(cleanReads))
FALSE
  500
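
The x[fun(x)] pattern above can be tried without any sequencing data. The following is a minimal base-R sketch, not ShortRead itself: it assumes reads are plain character strings and uses base `duplicated()` as a stand-in for `srduplicated()`, on the assumption that both flag the second and later copies of a sequence and never the first. The toy `reads` vector is made up for illustration.

```r
# Toy reads as plain strings (hypothetical data); real data would be a ShortReadQ object.
reads <- c("ACGT", "TTGA", "ACGT", "GGCC", "TTGA", "ACGT")

# duplicated() marks the second and later copies of a value, never the first.
isDup <- duplicated(reads)

# Count unique (FALSE) versus duplicate (TRUE) reads.
dupCounts <- table(isDup)

# Keep only the first copy of each sequence: x[!fun(x)]
cleanReads <- reads[!isDup]
```

Negating the logical vector with `!` is equivalent to comparing it with `== FALSE` and is the more common idiom in R.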

Creating your own filters

Use srFilter to build a filter from a condition, then apply it as x[fun(x)]

Filter example

library(ShortRead)

# Use a custom filter to remove reads from fqsample
# This filter removes reads shorter than a minimum number of bases
minWidth <- 51
readWidthCutOff <- srFilter(function(x) {width(x) >= minWidth}, name = "MinWidth")
fqsample[readWidthCutOff(fqsample)]
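
The idea behind a custom filter is that it is just a function returning one TRUE or FALSE per read. Below is a base-R sketch of that idea, assuming reads are plain strings and using `nchar()` in place of `width()`. The helper `makeMinWidthFilter` and the toy data are hypothetical names introduced for illustration; they are not part of ShortRead.

```r
# A filter factory: returns a predicate that remembers its own cutoff.
makeMinWidthFilter <- function(minWidth) {
  function(x) nchar(x) >= minWidth
}

# Toy reads of different lengths (hypothetical data).
reads <- c("ACGTACGT", "ACG", "ACGTAC", "AC")

# Build a filter for reads of at least 6 bases and apply it as x[fun(x)].
atLeastSix <- makeMinWidthFilter(6)
keep <- atLeastSix(reads)
longReads <- reads[keep]
```

Passing the cutoff as an argument keeps it bound to the filter, so the result does not depend on a global `minWidth` being defined before the filter is called.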

nFilter

library(ShortRead)

# save your filter, .name is optional
myFilter <- nFilter(threshold = 10, .name = "cleanNFilter")
# use the filter at reading point
filtered <- readFastq(dirPath = "data", pattern = ".fastq", filter = myFilter)
# you will retrieve only those reads that have a maximum of 10 N's
filtered
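
To see what an N-count cutoff does, here is a base-R sketch on plain strings. It assumes the cutoff is inclusive, as the slide's "maximum of 10 N's" suggests. The helper `countN`, the toy reads and the threshold of 2 are hypothetical and chosen only to keep the example small.

```r
# Count the N's in each read; fixed = TRUE matches the literal letter.
countN <- function(x) {
  lengths(regmatches(x, gregexpr("N", x, fixed = TRUE)))
}

# Toy reads with 2, 4 and 0 ambiguous bases (hypothetical data).
reads <- c("ACGTNNAC", "NNNNACGT", "ACGTACGT")
nCounts <- countN(reads)

# Keep reads with at most `threshold` N's.
threshold <- 2
kept <- reads[nCounts <= threshold]
```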

idFilter and polynFilter

library(ShortRead)

# id filter example
myFilterID <- idFilter(regex = ":3:1")
# returns only the reads whose ids match the regular expression
# optional parameters are .name, fixed and exclude
# use the filter at reading point
filtered <- readFastq(dirPath = "data", pattern = ".fastq", filter = myFilterID)
# filter to remove poly-A regions
myFilterPolyA <- polynFilter(threshold = 10, nuc = c("A"))
# will return the sequences that have a maximum number of 10 consecutive A's
# use the filter for subsetting
filtered[myFilterPolyA(filtered)]
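
The id filter can be pictured as a pattern match over the read ids. The base-R sketch below assumes that the `fixed` and `exclude` options mentioned above behave like the `fixed` argument of `grepl()` and a negation of the result. The helper `idKeep` and the ids are hypothetical and are not ShortRead code.

```r
# Return TRUE for ids that match the pattern; exclude = TRUE inverts the selection.
idKeep <- function(ids, regex, fixed = FALSE, exclude = FALSE) {
  hit <- grepl(regex, ids, fixed = fixed)
  if (exclude) !hit else hit
}

# Toy read ids (hypothetical data).
ids <- c("run1:3:1:100", "run1:2:1:200", "run1:3:1:300")

# Keep ids containing ":3:1", then the complementary set.
kept <- ids[idKeep(ids, ":3:1")]
dropped <- ids[idKeep(ids, ":3:1", exclude = TRUE)]
```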

Let's practice using filters!
