Denoiser pipeline - components and walk-through

This vignette contains a detailed walk-through of the denoise algorithm. It goes through the processing of a single sequence, step-by-step. All of the discussed parameters for these individual components can be passed to the denoise function.

Data input

The read_fasta and read_fastq functions allow users to read data into R for denoising. These functions produce dataframes with columns containing the header data, sequence data and PHRED quality scores (for fastq only - there is an option to discard the quality scores if they are not of interest to the user).

fastq_example_file = system.file('extdata/coi_sequel_data_subset.fastq.gz', package = 'debar')
data = read_fastq(fastq_example_file)
## [1] "header_data" "sequence"    "quality"
#head(data) - to view the first few records

Building a DNAseq object

The debar denoising pipeline is built around the custom DNAseq object, which is used to store the input sequence data and the outputs generated by the denoising process.

Below the varible 'ex' will store a DNAseq object, with the raw sequence, the name of the sequence (in this case the read id) and optionally the PHRED quality scores as well.

i = 33 #the row number from the example dataframe to be analyzed in the single sequence demonstration
ex = DNAseq(data$sequence[[i]], name = data$header_data[[i]], phred = data$quality[[i]])
ex #
## A DNAseq object.
## Sample ID: example_read_33
## Raw Sequence:
## attcaaccaatcataaagatattgg...tgattttttggtcaccctgaagttt

The raw sequence, phred score and name can then be accessed through the dollar sign notation. As data are generated by the denoising process, they can also be accessed using this notation.

## [1] "example_read_33"
#can always check to see the available components with the names function
print("Available in the current DNAseq object:")
## [1] "Available in the current DNAseq object:"
## [1] "name"  "raw"   "phred"

Framing the barcode sequence

The frame function is one of the two main workhorses in the denoising pipeline. It takes the raw data sequence and compares it against the nucleotide profile hidden Markov model (PHMM), generating the statistical information (The Viterbi algorithm's output path) that is used to apply corrections to the sequence read. Since the PHMM is a profile of only the 657bp region of COI-5P (and not capable of denoising surrounding data) it also takes the input sequence and removes any flanking regions that fall outside of the 657bp COI-5P region. Don't worry though, if you wish to keep these parts of your sequence, there is an option to include the removed flanking sequences in the output (you may want these if they contain important identifying information, i.e. tags to relate multiplexed reads back to their sample of origin). Just be aware that if you do keep the flaking regions, any errors in these regions will go uncorrected!

By default, the frame function will compare both the forward and reverse compliments of an input read to the PHMM and use the log likelihood values to pick the direction that best matches the COI-5P profile. If you are confident that all reads being denoised are in the forward orientation, then you can speed up the frame function by setting the option: dir_check = FALSE (in this case frame will not generate the reverse compliment, and only pass the sequence through the PHMM in the forward orientation). Additionally frame contains an additional set of options that can be use to filter out small fragments and potential chimeric sequences. When the terminate_rejects option is set to TRUE then the following two checks are run: (1) checks to see if a sequence contains more than max_inserts (default = 400) consecutive insert states in the PHMM path (i.e. 400 or more consecutive bp in the sequence appear to not be part of COI-5P) (2) checks if the sequence contains less than min_match(default = 100) consecutive matches to the PHMM. If either condition is met the sequence will be flagged for rejection and not framed.

ex = frame(ex)

Adjusting the sequence

The adjust function is the component of the denoising pipeline where identified errors in the sequence are corrected. Using the PHMM path output and the information about the framing of the sequence generated by the frame function, adjust applies insertion and deletion corrections to the DNA sequence. The series hidden match, insert and delete states in the PHMM path are assessed and used to correct the DNA sequence. Certain rules are applied in the adjustment algorithm to ensure that adjustments are not being applied to highly erroneous sequences, or sequences with deviations from the expected COI structure that may be biologically true. First, triple inserts or triple deletes (3 consecutive nucleotides missing or added) are not corrected by the adjust algorithm. This is because the insertion or deletion of a codon would preserve the integrity of the amino acid sequence (no frame shift). In fact, certain species do possess COI-5P sequences with full codons missing (i.e. they are 654bp in length); adjustments of sequences of this kind would be erroneous. Additionally, if 5 or more consecutive insert or delete states are encountered, the adjustment is ceased and the sequence is flagged for rejection. Drastic deviations of this kind are extremely improbable and indicative of a larger problem with the sequence than simply minor technical errors (i.e. a contaminant sequence or chimera).

In addition to the DNAseq class object, two other parameters are accepted by the adjust function. censor_length is the parameter used to determine the number of base pairs adjacent to a correction that should be covered up (turned into placeholder characters). This parameter exists because the comparison of sequences to the PHMM via the Viterbi algorithm is not perfect in its identification of the location of indel errors. Censoring base pairs adjacent to the correction applied can increase the probability that the erroneous base pair is remove from a sequence, but comes at the trade off of a loss of information in the read (this limitation is overcome when multiple sequence outputs from a given sample are available). The adjustment corrects the length of the sequence (the erroneous bp is removed when an insertion is identified and a placeholder character is added when a deletion is identified) and censorship maximizes the likelihood of the error being removed. The default censor_length is 7 (i.e. seven bases left or right of an indel adjustment are turned into placeholder characters). This number is not arbitrary, but rather was determined through experimental corrections. In the denoising of a series of 10K barcode sequences each with a single artificial indel error with known position 61.8% of errors were corrected exactly. 38.2% of the time, errors were incorrectly reported by an average of 2.31bp (standard deviation=1.98).

A conservative censorship size of 7 was decided upon by analysis of the mean miss distance (and standard deviation) in only the incorrect adjustments. The mean plus two standard deviations (2.3 + (2*1.98) = 6.26) was rounded up, based on the test results this would lead to the censorship of greater than 95% of errors that were not correctly identified by the PHMM. This explanation is given to the user to inform any potential decisions about altering the censorship length. The default is highly conservative as the summary statistics used to calculate the length omit the 61.8% of the time that corrections were applied exactly. The use can make the censorship of reads as conservative as they wish (censor_length = 657 would mask a read showing a single bp deviation from the COI profile) or disable this feature (censor_length = 0) and accept all corrections made by the adjust algorithm.

The second additional parameter accepted by adjust is added_phred. This parameter is only employed if the DNAseq object possesses PHRED quality scores. The specified character (default is: * - a quality score of 9, which is a low value) is assigned to any placeholder characters that are inserted into the sequence by the adjustment algorithm. This is done so that the the relationships between PHRED scores and their corresponding base pairs are preserved (if a bp is deleted its PHRED score is removed as well), if downstream analyses or processing of the data rely on the PHRED scores, this information is still available.

ex = adjust(ex, censor_length = 4)

The algorithm keeps track of the number of adjustments made.

## [1] 0

The PHMM suggests a bp was missing at position 157 in the profile (the path can be called directly via dollar sign notation ex$data$path). A placeholder (character is a '-' until the final outseq step) is inserted and since adjust was given the parameter censor_length = 4 four base pairs in either side of the correction are also turned into placeholders. Note that since added_phred was left as its default, the inserted bp has a phred score of * (as indicated by the labels shown above the adjusted sequence characters).

##   ~   ~   ~   ~   ~   ~   ~   ~   ~   ~   ~   ~   ~   ~   ~ 
## "c" "t" "t" "c" "t" "t" "t" "a" "t" "a" "g" "t" "t" "a" "t"

Amino Acid check

Following the framing and adjustment of the sequence, aa_check can be used to translate the adjusted DNA sequence to amino acids and conduct a check of the amino acid sequence for the presence of stop codons. This is easier (i.e. translation is only performed once) because the frame function sets the DNA sequences in a common orientation and reading frame. The aa_check function's search for stop codons provides a simple double check of the adjust algorithm's performance. If stop codons are present then this indicates that reading frame shifts have not been corrected properly in the adjustment phase. Sequences of this kind will be flagged for rejection. This check is not perfect, as any persisting frame shifts late in the sequence may not be sufficiently disruptive to produce a stop codon.

A few more technical notes of the aa_check function: 1. sequences are translated using the censored translation table described and implemented within coil. This is done so that the aa_check function is not biased against taxonomic groups that employ more esoteric translation tables. You can override the default behaviour and pass the number of the genetic code if it is known for certain (with trans_table parameter) or even override the reading frame used for translation with the parameter frame_offset. 2. If you are familiar with coil]( you will be aware that it compares translated sequences against an amino acid PHMM to identify sequences with errors. This is not done within the present amino acid check namely in the interest of speed (comparing sequences to the PHMMs is computationally costly) and because the potentially large number of placeholder characters in a sequence resulting from the adjust algorithm would lower the likelihood scores of the corresponding amino acid sequences (due to a surplus of placeholders) and lead to a large number of false rejections.

ex = aa_check(ex) #the default behaviour should suit >99% of user's needs

Creating the output sequence

Once the DNA sequence has been framed, adjusted and checked for evidence of uncorrected errors the output sequence can be generated. The default ambiguous character used in the output sequence is “N” (ambig_char = "N"). This character is substituted into the output sequence where adjustments and censorship have occurred. Additionally, there are two styles of output sequences that can be produced: trimmed and framed sequences or sequences with flanking information reattached. The default behaviour of outseq is to keep the flanks of sequences (controlled by the parameter keep_flanks = TRUE). Any DNA sequence information that was outside of the framed COI-5P profile region is reattached to the ends of the sequence to create the output sequence. This is done so that debar can be applied in the analysis of raw sequence data that contains important information outside of the barcode region required in subsequent analysis steps. For example if the sequence are tagged with sample specific barcode sequences, this information is preserved in the output sequence so that post-denoising sequences can be demultiplexed and assigned to their correct source. Since flanking regions are excluded by the frame function, these regions are not denoised and any errors present in these portions of the sequence will go uncorrected. Note: the PHRED quality scores corresponding to the flanking sequence regions are preserved where applicable. The second option is to set the keep_flanks option to FALSE. This will lead to any flanking sequence being discarded and the output of only the DNA sequence for the COI-5P profile region. If there are missing bp at the front of the COI-5P profile region, placeholder characters are added to that all output sequences are in a common reading frame. If the DNAseq object contains PHRED scores, the PHRED output string corresponding to the desired DNA output string will be generated as well.

#option a - reattach the flanking sequence - use this if you wish to preserve sequence tags
ex = outseq(ex) 
nchar(ex$outseq) # the flanking sequence is reattached
## [1] 708
## [1] "~z~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~3~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~nN~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}~~~~~~~~~~~~~h~~~~~~~"
#option b - output only the sequence for the COI-5P region
ex = outseq(ex, keep_flanks = FALSE) 
ex$outseq #placeholder characters added to the front of the sequence to establish common reading frame when necessary
nchar(ex$outseq) # only the COI-5P region is outout
## [1] 654

Writing to file

Once the denoising of the sequence is completed, its components can be further queried and evaluated in within R, or the sequence can be saved to an output file.

debar contains functions for outputting sequences in either fastq or fasta format (write_fastq and write_fasta respectively). When a sequence is output, the name field of the DNAseq object is used to create the header line for the sequence and the outseq field becomes the sequence written to the file. Both output options have an append option, for which the default is TRUE. When TRUE the output is appended to an existing file with the specified name (filename parameter), if changed to FALSE than an existing file will be overwritten.

# note - the out demonstration markdown cells are set to eval = FALSE so that output files are not generated
# and saved without you reading this message and doing it on purpose. Make sure to check your working directory first!

write_fasta(ex, filename = "out.fa" , append = TRUE) #will append ex's output sequence to out.fa 

write_fasta(ex, filename = "out.fa" , append = FALSE) #will overwrite out.fa with the data for ex.

The write_fastq file contains two additional parameters relating to the handling of PHRED quality scores. If keep_phred is set to TRUE, then the PHRED scores contained in the DNAseq object will be used to build the PHRED line in the fastq record being output. If keep_phred is set to FALSE, or the DNAseq object does not possess PHRED scores (i.e. it originated from a fasta file) then the PHRED line will be filled with the specified phred_placeholder character specified by the user (or the default: #, an extremely low PHRED score of 2).

write_fastq(ex, filename = "out.fq") # default - appends output sequence to the file and keeps the objects phred scores

write_fastq(ex, filename = "out.fq" , append = FALSE, keep_phred = FALSE, phred_placeholder = "?") #alternative - overwrites 
# file and discards the phred scores, replacing them with the character "?" the correct number of times.


Funding for the development of this software was provided by grants in Bioinformatics and Computational Biology from the Government of Canada through Genome Canada and Ontario Genomics and from the Ontario Research Fund. Funders played no role in the study design or preparation of this software.