NSF PreMiEr 2024 (Callahan Lab) | Neha's USP Portfolio

Research Project Title: A Memory and Time Profiling Comparison of Different Processing Modes for the Divisive Amplicon Denoising Algorithm 2 (DADA2) Software

PI Name: Dr. Benjamin Callahan

North Carolina State University, Population Health and Pathobiology Department

Duration of Project Affiliation: 7 weeks

Research Context

Next Generation Sequencing (NGS) has revolutionized the biological sciences because of the sheer volume of sequences that can be analyzed in very short periods of time. One type of targeted NGS is amplicon sequencing, which involves high-throughput analysis of marker genes; it is used by microbiome researchers worldwide to elucidate bacterial diversity and genetic variation within samples. The reads produced by amplicon sequencing are not perfect and always contain some sequencing errors, so they require denoising to produce error-free amplicon sequence variants (ASVs). One of the main software packages used for this denoising step is DADA2.

Research Focus

DADA2 models and corrects amplicon errors to infer exact sample sequences without losing resolution or producing false positives. DADA2 offers three processing modes: independent (the default), pseudo-pooling, and pooling. Independent mode is quite powerful, but it is best suited to common variants that are supported by multiple reads. For rare variants, pooling mode is the better choice; however, it comes at a cost in processing time and memory usage. In such cases, pseudo-pooling may be more useful because it is a lightweight approach that essentially carries out independent mode twice: once to produce priors, and then again to produce ASVs as usual, which keeps its computational cost lower than that of pooling. Our main goal is to address the computational tractability of DADA2 on large datasets. We hope to do this by profiling each of the three processing modes for peak memory consumption and processing time to demonstrate: 1) which cases each DADA2 processing mode is suited to, and 2) that pseudo-pooling is a viable alternative to pooling, particularly for large datasets.
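The control flow of the three modes can be illustrated with a toy model. The sketch below is written in Python rather than R and is not DADA2's algorithm: the detection rule (a hypothetical MIN_READS threshold), the function names, and the choice to treat every first-pass variant as a prior are all assumptions made for illustration. It only shows how pooling evidence across samples, or supplying priors from a first independent pass, can change which rare variants are reported in each sample.

```python
from collections import Counter

MIN_READS = 2  # hypothetical detection threshold for this toy model


def detect(counts, priors=frozenset()):
    """Keep a sequence with >= MIN_READS reads, or any observed sequence that is a prior."""
    return {seq for seq, n in counts.items()
            if n >= MIN_READS or (n >= 1 and seq in priors)}


def independent(samples):
    """Each sample is processed on its own, with no shared information."""
    return [detect(Counter(reads)) for reads in samples]


def pooled(samples):
    """All reads are combined first, so every sample is held in memory at once."""
    pooled_counts = Counter()
    for reads in samples:
        pooled_counts.update(reads)
    found = detect(pooled_counts)
    # Report, per sample, the pooled variants actually observed in that sample.
    return [found & set(reads) for reads in samples]


def pseudo_pooled(samples):
    """Two independent passes: the first yields priors, the second uses them."""
    first_pass = independent(samples)
    # Assumption: any variant found in the first pass becomes a prior.
    priors = frozenset().union(*first_pass)
    return [detect(Counter(reads), priors) for reads in samples]


# Toy reads: "B" is a singleton in sample 1, "C" is a singleton in samples 3 and 4.
samples = [
    ["A", "A", "A", "B"],
    ["A", "A", "B", "B"],
    ["A", "A", "C"],
    ["A", "A", "C"],
]

for name, mode in [("independent", independent),
                   ("pseudo-pooled", pseudo_pooled),
                   ("pooled", pooled)]:
    print(name, [sorted(found) for found in mode(samples)])
```

In this toy data, the pseudo-pooled pass recovers the singleton "B" in sample 1 because "B" was detected in sample 2 during the first pass, while "C" is reported only when all reads are pooled, since it never passes the threshold in any single sample.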

Project Responsibilities

I conducted time and memory profiling on a mouse gut microbiome dataset using the bench package and the Rprof profiler in R; all profiling scripts were run on the BRC cluster at NC State University. Processing time was extracted as total time from bench and as sampling time from Rprof, and the two were plotted separately using the ggplot2 package. The bench_process_memory() function in the bench package reported the current and peak memory usage for each processing mode. The peak memory for each mode was then compiled and plotted. We focus on peak memory rather than current or total memory because peak memory is more indicative of how a program will run in computational terms than total allocated memory, which grows with the required processing time.
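The measurement pattern described above (time each run, record current and peak memory, then compile the results per mode) can be sketched as follows. This is a Python analogue using only the standard library, not the R scripts used in the project: the profile() helper and the two workloads are hypothetical stand-ins, and tracemalloc tracks only Python-level allocations, whereas bench_process_memory() reports memory for the whole R process.

```python
import time
import tracemalloc


def profile(func, *args):
    """Run func once and return its wall-clock time plus current and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    func(*args)
    elapsed = time.perf_counter() - start
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "current_bytes": current, "peak_bytes": peak}


def per_sample(samples):
    """Hypothetical stand-in for sample-by-sample processing."""
    total = 0
    for sample in samples:
        work = [x * 2 for x in sample]  # working copy of a single sample
        total += sum(work)
    return total


def all_at_once(samples):
    """Hypothetical stand-in for pooled processing of every sample together."""
    work = [x * 2 for sample in samples for x in sample]  # all samples at once
    return sum(work)


samples = [list(range(50_000)) for _ in range(8)]

results = {
    name: profile(func, samples)
    for name, func in [("per_sample", per_sample), ("all_at_once", all_at_once)]
}

for name, r in results.items():
    print(f"{name}: {r['seconds']:.4f} s, "
          f"peak {r['peak_bytes'] / 1e6:.1f} MB, "
          f"current {r['current_bytes'] / 1e6:.1f} MB")
```

Because the working data has been released by the time each function returns, the current figure is small in both cases; only the peak figure shows how much memory each strategy needed while it was running, which is why peak memory is the more useful quantity to compare.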