Project FOMC4799 services include NGS sequencing of 16S rRNA gene V3V4 region amplicons from the samples. First and foremost, please
download this report, as well as the raw sequence data, from the download links provided below.
These links will expire after 60 days. We cannot guarantee the availability of your data after 60 days.
Bioinformatics analysis service was not requested; however, we still provide quality trimming, noise filtering, pair merging, and chimera filtering of the sequence data, using the
DADA2 denoising algorithm and pipeline. The denoised, merged and chimera-free ASV (amplicon sequence variant) sequences allow you to perform
downstream analyses such as taxonomy assignment, diversity analysis and differential abundance analysis. If you need our help with these downstream bioinformatics analyses, please contact us.
The samples were processed and analyzed with the ZymoBIOMICS® Service: Targeted
Metagenomic Sequencing (Zymo Research, Irvine, CA).
DNA Extraction: If DNA extraction was performed, one of three different DNA
extraction kits was used, depending on the sample type and sample volume, and was
used according to the manufacturer’s instructions, unless otherwise stated. The kit used
in this project is marked below:
☐ ZymoBIOMICS® DNA Miniprep Kit (Zymo Research, Irvine, CA)
☐ ZymoBIOMICS® DNA Microprep Kit (Zymo Research, Irvine, CA)
☑ ZymoBIOMICS®-96 MagBead DNA Kit (Zymo Research, Irvine, CA)
☐ N/A (DNA Extraction Not Performed)
Elution Volume: 50µL
Additional Notes: NA
Targeted Library Preparation: The DNA samples were prepared for targeted
sequencing with the Quick-16S™ NGS Library Prep Kit (Zymo Research, Irvine, CA).
The Quick-16S™ primers were custom-designed by Zymo Research to provide the best coverage
of the 16S gene while maintaining high sensitivity. The primer set used in this project
is marked below:
☐ Quick-16S™ Primer Set V1-V2 (Zymo Research, Irvine, CA)
☐ Quick-16S™ Primer Set V1-V3 (Zymo Research, Irvine, CA)
☑ Quick-16S™ Primer Set V3-V4 (Zymo Research, Irvine, CA)
☐ Quick-16S™ Primer Set V4 (Zymo Research, Irvine, CA)
☐ Quick-16S™ Primer Set V6-V8 (Zymo Research, Irvine, CA)
☐ Other: NA
Additional Notes: NA
The sequencing library was prepared using an innovative library preparation process in
which PCR reactions were performed in real-time PCR machines to control cycles and
therefore limit PCR chimera formation. The final PCR products were quantified with
qPCR fluorescence readings and pooled together based on equal molarity. The final
pooled library was cleaned up with the Select-a-Size DNA Clean & Concentrator™
(Zymo Research, Irvine, CA), then quantified with TapeStation® (Agilent Technologies,
Santa Clara, CA) and Qubit® (Thermo Fisher Scientific, Waltham, MA).
Control Samples: The ZymoBIOMICS® Microbial Community Standard (Zymo
Research, Irvine, CA) was used as a positive control for each DNA extraction, if
performed. The ZymoBIOMICS® Microbial Community DNA Standard (Zymo Research,
Irvine, CA) was used as a positive control for each targeted library preparation.
Negative controls (i.e. blank extraction control, blank library preparation control) were
included to assess the level of bioburden carried by the wet-lab process.
Sequencing: The final library was sequenced on Illumina® MiSeq™ with a V3 reagent kit
(600 cycles). The sequencing was performed with 10% PhiX spike-in.
The complete report of your project, including all links in this report, can be downloaded by clicking the link provided below. The downloaded file is a compressed ZIP file. Once it is unzipped, open the file “REPORT.html” (it may be shown only as "REPORT" on your computer) by double-clicking it. Your default web browser will open it and you will see the exact content of this report.
Please download and save the file to your computer storage device. The download link will expire 60 days after you receive this report.
Complete report download link:
To view the report, please follow these steps:
1. Download the .zip file from the report link above.
2. Extract all the contents of the downloaded .zip file to your desktop.
3. Open the extracted folder and find the "REPORT.html" file (it may be shown only as "REPORT").
4. Open the REPORT.html file by double-clicking it. Your default browser will open the top page of the complete report. Within the report, there are links to view all the analyses performed for the project.
The raw NGS sequence data is available for download with the link provided below. The data is a compressed ZIP file and can be unzipped to individual sequence files.
Since this is paired-end sequencing, each of your samples is represented by two sequence files, one for READ 1,
with the file extension “*_R1.fastq.gz”, and another for READ 2, with the file extension “*_R2.fastq.gz”.
The files are in FASTQ format and are compressed. FASTQ format is a text-based data format for storing both a biological sequence
and its corresponding quality scores. Most sequence analysis software will be able to open them.
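For readers who want to inspect the files directly, the following is a minimal Python sketch of reading a compressed FASTQ file. It is an illustration only and not part of the service pipeline. It assumes the common four-line record layout (header, sequence, "+" separator, quality string) and Phred+33 quality encoding; the function name read_fastq is hypothetical.

```python
import gzip
from typing import Iterator, List, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the '+' separator line is ignored
            quality = handle.readline().rstrip("\n")
            # Assumption: Phred+33 encoding, so Q = ASCII code - 33.
            scores = [ord(ch) - 33 for ch in quality]
            # Drop the leading '@' from the header to get the read ID.
            yield header[1:], sequence, scores
```

For example, iterating over read_fastq("zr4799_10V3V4_R1.fastq.gz") would yield the READ 1 records of sample S01, one read at a time.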
The Sample IDs associated with the R1 and R2 fastq files are listed in the table below:
Sample ID | Original Sample ID | Read 1 File Name | Read 2 File Name
S01 | 1AB | zr4799_10V3V4_R1.fastq.gz | zr4799_10V3V4_R2.fastq.gz
S02 | 2AB | zr4799_11V3V4_R1.fastq.gz | zr4799_11V3V4_R2.fastq.gz
S03 | 3AB | zr4799_12V3V4_R1.fastq.gz | zr4799_12V3V4_R2.fastq.gz
S04 | 4AB | zr4799_13V3V4_R1.fastq.gz | zr4799_13V3V4_R2.fastq.gz
S05 | 5AB | zr4799_14V3V4_R1.fastq.gz | zr4799_14V3V4_R2.fastq.gz
S06 | 6AB | zr4799_15V3V4_R1.fastq.gz | zr4799_15V3V4_R2.fastq.gz
S07 | 7AB | zr4799_16V3V4_R1.fastq.gz | zr4799_16V3V4_R2.fastq.gz
S08 | 8AB | zr4799_17V3V4_R1.fastq.gz | zr4799_17V3V4_R2.fastq.gz
S09 | 9AB | zr4799_18V3V4_R1.fastq.gz | zr4799_18V3V4_R2.fastq.gz
S10 | 10AB | zr4799_19V3V4_R1.fastq.gz | zr4799_19V3V4_R2.fastq.gz
S11 | 11AB | zr4799_1V3V4_R1.fastq.gz | zr4799_1V3V4_R2.fastq.gz
S12 | 12AB | zr4799_20V3V4_R1.fastq.gz | zr4799_20V3V4_R2.fastq.gz
S13 | 13C | zr4799_21V3V4_R1.fastq.gz | zr4799_21V3V4_R2.fastq.gz
S14 | 14C | zr4799_22V3V4_R1.fastq.gz | zr4799_22V3V4_R2.fastq.gz
S15 | 15C | zr4799_23V3V4_R1.fastq.gz | zr4799_23V3V4_R2.fastq.gz
S16 | 16C | zr4799_24V3V4_R1.fastq.gz | zr4799_24V3V4_R2.fastq.gz
S17 | 17C | zr4799_2V3V4_R1.fastq.gz | zr4799_2V3V4_R2.fastq.gz
S18 | 18C | zr4799_3V3V4_R1.fastq.gz | zr4799_3V3V4_R2.fastq.gz
S19 | 19C | zr4799_4V3V4_R1.fastq.gz | zr4799_4V3V4_R2.fastq.gz
S20 | 20C | zr4799_5V3V4_R1.fastq.gz | zr4799_5V3V4_R2.fastq.gz
S21 | 21C | zr4799_6V3V4_R1.fastq.gz | zr4799_6V3V4_R2.fastq.gz
S22 | 22C | zr4799_7V3V4_R1.fastq.gz | zr4799_7V3V4_R2.fastq.gz
S23 | 23C | zr4799_8V3V4_R1.fastq.gz | zr4799_8V3V4_R2.fastq.gz
S24 | 24C | zr4799_9V3V4_R1.fastq.gz | zr4799_9V3V4_R2.fastq.gz
Please download and save the file to your computer storage device. The download link will expire 60 days after you receive this report.
DADA2 is a software package that models and corrects Illumina-sequenced amplicon errors.
DADA2 infers sample sequences exactly, without coarse-graining into OTUs,
and resolves differences of as little as one nucleotide. DADA2 identified more real variants
and output fewer spurious sequences than other methods.
DADA2’s advantage is that it uses more of the data. The DADA2 error model incorporates quality information,
which is ignored by all other methods after filtering. The DADA2 error model incorporates quantitative abundances,
whereas most other methods use abundance ranks if they use abundance at all.
The DADA2 error model identifies the differences between sequences, e.g., A->C,
whereas other methods merely count the mismatches. DADA2 can parameterize its error model from the data itself,
rather than relying on previous datasets that may or may not reflect the PCR and sequencing protocols used in your study.
The DADA2 pipeline includes several tools for read quality control, including quality filtering, trimming, denoising, pair merging and chimera filtering. Below are the major processing steps of DADA2:
Step 1. Read trimming based on sequence quality
The quality of NGS Illumina sequences often decreases toward the end of the reads.
DADA2 allows the poor-quality read ends to be trimmed off in order to improve the error
model building and pair merging performance.
Step 2. Learn the Error Rates
The DADA2 algorithm makes use of a parametric error model (err) and every
amplicon dataset has a different set of error rates. The learnErrors method
learns this error model from the data, by alternating estimation of the error
rates and inference of sample composition until they converge on a jointly
consistent solution. As in many machine-learning problems, the algorithm must
begin with an initial guess, for which the maximum possible error rates in
this data are used (the error rates if only the most abundant sequence is
correct and all the rest are errors).
Step 3. Infer amplicon sequence variants (ASVs) based on the error model built in the previous step. This step is also called sequence "denoising".
The outcome of this step is a list of ASVs that are the equivalent of oligonucleotides.
Step 4. Merge paired reads. If the sequencing products are read pairs, DADA2 will merge the R1 and R2 ASVs into single sequences.
Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding
denoised reverse reads, and then constructing the merged “contig” sequences.
By default, merged sequences are only output if the forward and reverse reads overlap by
at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).
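The merging rule described above can be illustrated with a simplified Python sketch. This is not DADA2's implementation: DADA2 aligns the reads, whereas this sketch only looks for an exact match between the end of the forward read and the start of the reverse-complemented reverse read. The function names and the min_overlap parameter are hypothetical; the default of 12 mirrors the minimum overlap stated above.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 12) -> Optional[str]:
    """Merge a read pair on an exact overlap, or return None if none is found."""
    rc = reverse_complement(reverse)
    max_overlap = min(len(forward), len(rc))
    # Try the longest candidate overlap first, down to the minimum allowed.
    for overlap in range(max_overlap, min_overlap - 1, -1):
        # The overlap must be identical in both reads (no mismatches allowed).
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None
```

A pair whose reads share fewer than min_overlap identical bases returns None, which corresponds to a read pair that is dropped at the merging step.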
Step 5. Remove chimeras.
The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants
after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs.
Chimeric sequences are identified if they can be exactly reconstructed by
combining a left-segment and a right-segment from two more abundant “parent” sequences. The frequency of chimeric sequences varies substantially
from dataset to dataset, and depends on factors including experimental procedures and sample complexity.
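The reconstruction test described above can be sketched in Python as follows. This is a simplified illustration, not DADA2's algorithm: it treats a "left segment" as an exact prefix and a "right segment" as an exact suffix, performs no alignment, and expects the caller to pass only sequences that are more abundant than the candidate. The function name is_bimera is hypothetical.

```python
from typing import Iterable


def is_bimera(candidate: str, parents: Iterable[str]) -> bool:
    """True if candidate equals a prefix of one parent joined to a suffix of another."""
    # A sequence cannot be its own parent.
    pool = [p for p in parents if p != candidate]
    for cut in range(1, len(candidate)):
        left, right = candidate[:cut], candidate[cut:]
        left_parents = [p for p in pool if p.startswith(left)]
        right_parents = [p for p in pool if p.endswith(right)]
        # Require two different parents for the two segments.
        if any(lp != rp for lp in left_parents for rp in right_parents):
            return True
    return False
```

For instance, with parents "AAAAAAAAAA" and "CCCCCCCCCC", the sequence "AAAAACCCCC" is flagged because its left half comes from the first parent and its right half from the second.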
Results
1. Read Quality Plots
NGS sequence analysis starts with visualizing the quality of the sequencing. Below are the quality plots of the first
sample for the R1 and R2 reads separately. In gray-scale is a heat map of the frequency of each quality score at each base position. The mean
quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines.
The forward reads are usually of better quality. It is common practice to trim the last few nucleotides to avoid less well-controlled errors
that can arise there. The trimming affects the downstream steps, including error model building, merging and chimera calling. FOMC uses an empirical
approach that tests many combinations of trim lengths in order to obtain the best final amplicon sequence variants (ASVs); see the next
section, “Optimal trim length for ASVs”.
Below is the link to a PDF file for viewing the quality plots for all samples:
2. Optimal trim length for ASVs
The final number of merged and chimera-filtered ASVs depends on the quality filtering (hence trimming) at the very beginning of the DADA2 pipeline.
To maximize the yield of final ASVs, an empirical approach was used:
1. Create a random subset of each sample consisting of 5,000 R1 and 5,000 R2 reads (to reduce computation time).
2. Trim 10 bases at a time from the ends of both R1 and R2, up to 50 bases.
3. For each combination of trimmed lengths (e.g., 300x300, 300x290, 290x290, etc.), run the trimmed reads through the entire DADA2 pipeline to obtain chimera-filtered, merged ASVs.
4. Select the combination with the highest percentage of input reads becoming final ASVs, and apply it to the complete set of data.
Below is the result of this operation, showing ASV percentages of total reads for all trimming combinations (first column = R1 length in bases; first row = R2 length in bases):
R1/R2 | 281 | 271 | 261 | 251 | 241 | 231
321 | 0.50% | 9.02% | 13.35% | 13.68% | 14.24% | 14.48%
311 | 1.09% | 9.12% | 12.79% | 13.43% | 14.29% | 14.79%
301 | 0.51% | 8.51% | 12.26% | 13.18% | 13.98% | 14.76%
291 | 0.54% | 8.39% | 11.44% | 12.73% | 13.58% | 14.19%
281 | 3.47% | 16.04% | 20.54% | 23.30% | 24.34% | 25.50%
271 | 5.11% | 18.24% | 22.84% | 27.75% | 29.84% | 31.94%
Based on the above result, the trim length combination of R1 = 271 bases and R2 = 231 bases (the highest value in the table above, 31.94%) was chosen for generating final ASVs for all sequences.
This combination retained the highest percentage of input reads as merged, non-chimeric ASVs and was used for downstream analyses, if requested.
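The selection step can be reproduced from the table above with a short Python sketch. The percentages are copied from that table; the variable and function names are hypothetical and the code only illustrates the "pick the maximum" rule, not the pipeline that produced the numbers.

```python
from typing import Dict, List, Tuple

# R2 lengths (table columns) and ASV percentages per R1 length (table rows).
R2_LENGTHS: List[int] = [281, 271, 261, 251, 241, 231]
ASV_PERCENT: Dict[int, List[float]] = {
    321: [0.50, 9.02, 13.35, 13.68, 14.24, 14.48],
    311: [1.09, 9.12, 12.79, 13.43, 14.29, 14.79],
    301: [0.51, 8.51, 12.26, 13.18, 13.98, 14.76],
    291: [0.54, 8.39, 11.44, 12.73, 13.58, 14.19],
    281: [3.47, 16.04, 20.54, 23.30, 24.34, 25.50],
    271: [5.11, 18.24, 22.84, 27.75, 29.84, 31.94],
}


def best_trim_combination(
    table: Dict[int, List[float]], r2_lengths: List[int]
) -> Tuple[int, int, float]:
    """Return (R1 length, R2 length, percentage) for the highest percentage."""
    # Tuples compare on the percentage first, so max() picks the best cell.
    pct, r1, r2 = max(
        (pct, r1, r2)
        for r1, row in table.items()
        for r2, pct in zip(r2_lengths, row)
    )
    return r1, r2, pct


print(best_trim_combination(ASV_PERCENT, R2_LENGTHS))  # (271, 231, 31.94)
```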
3. Error plots from learning the error rates
After DADA2 builds the error model for the set of data, it is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates.
The error rates for each possible transition (A→C, A→G, …) are shown below. Points are the observed error rates for each consensus quality score.
The black line shows the estimated error rates after convergence of the machine-learning algorithm.
The red line shows the error rates expected under the nominal definition of the Q-score.
In the ideal result, the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop
with increased quality as expected.
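The "nominal definition of the Q-score" refers to the standard Phred relationship between a quality score Q and the error probability p, namely p = 10^(-Q/10). A minimal Python sketch of this relationship is given below; the function name is hypothetical.

```python
def nominal_error_rate(q: float) -> float:
    """Error probability implied by a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Print the nominal error rate for a few common quality scores.
for q in (10, 20, 30, 40):
    print(q, f"{nominal_error_rate(q):.4f}")
```

Under this definition, Q20 corresponds to one expected error in 100 bases and Q30 to one in 1,000.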
Forward Read R1 Error Plot
Reverse Read R2 Error Plot
The PDF version of these plots is available here:
4. DADA2 Result Summary
The table below shows the summary of the DADA2 analysis,
tracking the paired read counts of each sample through all the steps of the DADA2 denoising process,
including end-trimming (filtered), denoising (denoisedF, denoisedR), pair merging (merged) and chimera removal (nonchim).
Sample ID | input | filtered | denoisedF | denoisedR | merged | nonchim
F4799.S01 | 64,098 | 30,416 | 29,449 | 29,301 | 28,227 | 7,843
F4799.S02 | 71,970 | 33,810 | 32,524 | 32,284 | 30,874 | 10,638
F4799.S03 | 52,179 | 23,687 | 22,832 | 22,708 | 21,645 | 7,555
F4799.S04 | 65,271 | 31,358 | 30,699 | 30,234 | 29,746 | 6,900
F4799.S05 | 59,323 | 28,494 | 27,775 | 27,342 | 26,970 | 5,807
F4799.S06 | 55,664 | 27,169 | 26,331 | 26,032 | 25,422 | 6,069
F4799.S07 | 61,610 | 25,550 | 24,689 | 24,595 | 24,062 | 6,043
F4799.S08 | 55,782 | 25,721 | 24,707 | 24,858 | 24,291 | 5,762
F4799.S09 | 67,935 | 33,455 | 32,583 | 32,471 | 31,828 | 7,871
F4799.S10 | 58,903 | 29,677 | 28,856 | 28,850 | 28,017 | 6,829
F4799.S11 | 49,987 | 20,881 | 20,188 | 19,848 | 19,112 | 5,481
F4799.S12 | 50,607 | 25,024 | 24,469 | 24,133 | 23,877 | 5,514
F4799.S13 | 75,547 | 37,585 | 36,853 | 36,701 | 35,875 | 8,593
F4799.S14 | 63,619 | 30,506 | 29,687 | 29,475 | 28,987 | 6,731
F4799.S15 | 42,720 | 6,060 | 5,882 | 5,850 | 5,832 | 2,355
F4799.S16 | 68,328 | 34,173 | 33,533 | 33,180 | 32,760 | 6,935
F4799.S17 | 58,792 | 26,291 | 25,555 | 25,354 | 24,382 | 6,680
F4799.S18 | 81,880 | 40,591 | 39,780 | 39,414 | 38,642 | 8,717
F4799.S19 | 52,053 | 22,761 | 21,758 | 21,728 | 20,446 | 7,397
F4799.S20 | 66,806 | 27,551 | 26,418 | 26,206 | 24,477 | 9,174
F4799.S21 | 75,172 | 34,561 | 33,639 | 32,996 | 31,910 | 10,257
F4799.S22 | 67,588 | 29,486 | 28,478 | 28,039 | 26,469 | 9,608
F4799.S23 | 22,542 | 9,478 | 8,723 | 8,612 | 8,213 | 1,679
F4799.S24 | 68,652 | 29,144 | 28,106 | 27,659 | 26,083 | 9,180
Total (all samples) | 1,457,028 | 663,429 | 643,514 | 637,870 | 618,147 | 169,618
Percentage of input | 100.00% | 45.53% | 44.17% | 43.78% | 42.43% | 11.64%
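The percentages in the summary table are each step's total read count divided by the total input read count. The short Python sketch below reproduces them from the totals reported in the table; the variable and function names are hypothetical.

```python
from typing import Dict

# Total read counts across all 24 samples, copied from the summary table.
STEP_TOTALS: Dict[str, int] = {
    "input": 1_457_028,
    "filtered": 663_429,
    "denoisedF": 643_514,
    "denoisedR": 637_870,
    "merged": 618_147,
    "nonchim": 169_618,
}


def percent_of_input(totals: Dict[str, int]) -> Dict[str, float]:
    """Express each step's total as a percentage of the input total."""
    base = totals["input"]
    return {step: round(100 * count / base, 2) for step, count in totals.items()}


for step, pct in percent_of_input(STEP_TOTALS).items():
    print(f"{step}: {pct:.2f}%")
```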
This table can be downloaded as an Excel table below:
5. DADA2 Amplicon Sequence Variants (ASVs)
A total of 1050 unique merged and chimera-free ASV sequences were identified, and their corresponding
read counts for each sample are available in the "ASV Read Count Table", with rows for the ASV sequences and columns for the samples. This read count table can be used for
microbial profile comparison among different samples, and the sequences provided in the table can be used for taxonomy assignment.