METHOD
Library preparation for metagenomic sequencing with Illumina
1 Genotek Inc., Moscow
2 Moscow South-West High School No. 1543, Moscow, Russia
Correspondence should be addressed: Anna Krasnenko
Nastavnicheskiy per. 17, str. 1, pod. 14, Moscow, 105120; annakrasnenko@gmail.com
Acknowledgements: the authors thank Daria Plakhina and Ivan Stetsenok of Genotek for their help and Sergey Glagolev of Moscow South-West High School No. 1543 for his valuable advice and comments.
Contribution of the authors to this work: Krasnenko AYu, Eliseev AYu — analysis of literature, research planning and implementation, data analysis and interpretation; Borisevich DI, Tsukanov KYu — bioinformatic analysis; Davydova AI — drafting of a manuscript; Ilinsky VV — research planning, scientific advisor. All authors participated in editing of the manuscript.
The human body is home to a number of bacterial communities inhabiting the mouth cavity, gut, genitourinary system, etc. The totality of microorganisms that have symbiotic relationships with the host is called the microbiome [1]. Study of the human microbiome provides understanding of what bacteria live in healthy and diseased individuals [2].
Sequencing of those regions of bacterial genomes that discriminate between bacterial species and sometimes genera is called metagenomic sequencing, or marker gene sequencing, and is currently widely used in microbiome studies [3]. Regions of the 16S ribosomal RNA (16S rRNA) gene are especially convenient for sequencing. Because this gene is highly conserved in prokaryotes, universal primers can be used to amplify the target sequence, which makes the whole procedure cost-effective and less time-consuming [4, 5]. The 16S rRNA gene harbors conserved and variable regions. The latter contain single-base substitutions that are instrumental in identifying microbial species or genera: once these substitutions are detected by sequencing, they can be matched against regularly updated public databases.
One of the core steps of the metagenomic sequencing workflow is sample preparation, i.e. converting source nucleic acids into a library of DNA fragments ready to be loaded onto the sequencer. Many sequencing platforms exist, and although their sample preparation strategies are broadly the same, they differ in details related to the techniques employed for signal detection during sequencing [6]. Sample preparation aims at obtaining DNA fragments that serve to identify the species or genus of the studied microorganism. In metagenomics, accuracy is largely determined by the choice of primers used to produce amplicons by polymerase chain reaction (PCR) [7, 8]. In 2013 Klindworth et al. compiled a list of 512 primer pairs organized into three subgroups based on the next generation sequencing (NGS) technique they can be used for: group 1 consisted of primers for the Illumina and Ion Torrent platforms (small amplicons), group 2 consisted of primers for the 454 Life Sciences platform (middle-sized amplicons), and group 3 included primers intended for PacBio and similar platforms (large amplicons) that could also be used to prepare genomic libraries of colonial species [9]. Each group offers a few universal primer pairs for archaea and bacteria instrumental in species/genus identification.
In this work we discuss some aspects of metagenomic sample sequencing with Illumina with a particular focus on library preparation for further sequencing [10]. Illumina-based sequencing takes place in a flow cell coated with single-stranded oligos that are complementary to the library adapters ligated to source DNA fragments and enable hybridization. A polymerase extends the hybridized DNA fragments attached to the flow cell surface. PCR produces multiple copies of a single template molecule, forming millions of dense clusters. Clusters are then sequenced in parallel: complementary strands are synthesized from fluorescently tagged nucleotides, and the emitted signal is recorded after the addition of each nucleotide to the strand. This technology sets certain requirements for sample preparation, which are explained below.
Library preparation for further sequencing with Illumina
Sequencing can be performed on various types of biological material, such as saliva, earwax, nasal mucosal swabs, etc. We focused on the general aspects of library preparation for sequencing with Illumina regardless of the sample type. The sequencing workflow includes five steps: 1) extracting intact DNA from the sample; 2) selecting genome regions for sequencing and choosing primers for further PCR-based amplification (PCR quality is very important because it determines sequencing quality); 3) double barcoding of the obtained libraries; 4) sequencing itself; 5) bioinformatic analysis of the obtained data.
Many protocols and reagent kits are available for effective DNA extraction [11], depending on the type of the analyzed sample; therefore, there is no need to discuss this step in more detail here. In our study the quality of the extracted DNA was checked by agarose gel electrophoresis, and its concentration was measured with a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, USA) according to the manufacturer’s protocol [12].
DNA extraction is followed by PCR amplification of the studied DNA fragments for further sequencing. For this work we chose a few regions of the 16S rRNA gene for the reasons explained above. The quality of the obtained fragments depends on the complementarity of the selected primers to the regions of the 16S rRNA gene [13, 14]. Primers consist of a region-specific sequence complementary to the flanking region of the target fragment and a synthetic sequence that is non-complementary to the region-specific sequence and will hybridize to the adapter. At least four 3′-end nucleotides should be non-complementary within and between primers to avoid primer-dimer formation. Even a small mismatch of 3–4 nucleotides at the 3′-end of the primer can significantly reduce PCR quality, even if the annealing temperature has been adjusted [15, 16].
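The 3′-end rule above can be checked computationally before primers are ordered. The following is a minimal sketch, not the procedure used in this work: it assumes primers contain only A, C, G and T (degenerate bases common in universal 16S primers are not handled), it looks only for perfect antiparallel pairing of the last four bases, and the toy sequences are hypothetical rather than those listed in tab. 1.

```python
# Watson-Crick pairs for unambiguous bases only; degenerate IUPAC codes
# (N, W, H, ...) are NOT handled and will raise a KeyError.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def three_prime_end_can_anneal(primer_a, primer_b, n=4):
    """True if the last n bases of primer_a (its 3' end) can pair perfectly,
    in antiparallel orientation, with any stretch of primer_b."""
    tail = primer_a.upper()[-n:]
    return reverse_complement(tail) in primer_b.upper()


def dimer_risks(primers, n=4):
    """Check every ordered pair (self-pairs included) from a dict
    name -> sequence and return the (name_a, name_b) pairs at risk."""
    return [
        (name_a, name_b)
        for name_a, seq_a in primers.items()
        for name_b, seq_b in primers.items()
        if three_prime_end_can_anneal(seq_a, seq_b, n)
    ]


# Hypothetical toy primers, not the ones listed in tab. 1.
toy_primers = {"fwd": "GGGGAAAA", "rev": "CCTTTTCC"}
print(dimer_risks(toy_primers))
```

A pair reported by `dimer_risks` is only a candidate for redesign; dedicated primer design software also weighs thermodynamic stability, which this sketch ignores.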
Many regularly updated databases contain 16S rRNA gene sequences identified for a plethora of microbial species [17, 18], which facilitates rapid selection of universal primers using dedicated software if necessary [19]. For this work we chose universal primer pairs for the V3 and V4 regions of the 16S rRNA gene [23]. The synthetic sequence of the chosen primers was represented by sequences complementary to Nextera and TruSeq adapters (tab. 1).
Once the primers are selected, the PCR protocol needs to be optimized: a number of parameters, such as primer concentration, DNA concentration, annealing temperature, Mg2+ concentration and the number of cycles, have to be adjusted to obtain a sufficient amount of good-quality amplicons for further sequencing. PCR quality control is performed by agarose gel electrophoresis. Negative and positive controls are a must. A negative control is usually a PCR mix without the DNA template. In our case, two DNA samples were used as positive controls: one from the genus Rhizobium and another from the genus Rhodococcus.
PCR yield can be affected by primer-dimer formation [20]. Primer dimers can be seen in the agarose gel during PCR quality control (fig. 1). Diluting the primers or adjusting the annealing temperature can solve the problem. The annealing temperature determines the purity of the reaction product because it governs how specifically the primers bind to the DNA template. State-of-the-art equipment makes it possible to run a temperature gradient and optimize the annealing temperature in a single run. High yields also depend on Mg2+ levels: Mg2+ ions bind to dNTPs, primers, the DNA template and chelating agents (EDTA) present in the buffer [21]. Polymerase activity is known to increase at high Mg2+ concentrations, though polymerase specificity decreases. As a rule, a range of Mg2+ concentrations from 1 to 4 mM in 0.5 mM increments is tested to select the optimal concentration for the reaction mix.
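The Mg2+ titration series can be laid out with a few lines of code. The sketch below lists the concentrations from 1 to 4 mM in 0.5 mM increments and converts each to a pipetting volume with C1·V1 = C2·V2. The 25 µL reaction volume, the 25 mM stock concentration and the Mg2+-free buffer are assumptions made for illustration, not values taken from tab. 2.

```python
def mg_titration(start_mM=1.0, stop_mM=4.0, step_mM=0.5):
    """Return the list of final Mg2+ concentrations (mM) to test."""
    n_steps = round((stop_mM - start_mM) / step_mM)
    return [round(start_mM + i * step_mM, 3) for i in range(n_steps + 1)]


def stock_volume_uL(final_mM, reaction_uL=25.0, stock_mM=25.0):
    """Volume of Mg2+ stock to add, from C1*V1 = C2*V2.

    Assumes the reaction buffer itself contains no Mg2+; the default
    reaction volume and stock concentration are illustrative only.
    """
    return final_mM * reaction_uL / stock_mM


for conc in mg_titration():
    print(f"{conc:.1f} mM Mg2+ -> {stock_volume_uL(conc):.2f} uL of stock")
```

If the buffer already contains Mg2+, its contribution has to be subtracted from the target concentration before the volume is calculated.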
When optimizing the PCR protocol, we found that primers complementary to Nextera adapters were the best for the amplification of the V3 and V4 regions of the 16S rRNA gene. This is probably because the unique Nextera sequence is non-complementary to the regions of the studied bacterial genome, which prevents the formation of side products. The optimized PCR protocol is shown in tab. 2. PCR was performed using the StepOnePlus system (Applied Biosystems, USA).
The harvested libraries are dual-indexed (barcoded) in another PCR step. Barcoding means adding an 8-nucleotide index sequence to the DNA fragment to facilitate subsequent discrimination between different sample sets [22]. There is a wide selection of reagent mixes for barcoding that can be used in Illumina-based sequencing, such as the Nextera XT Index Kit. We used oligos synthesized by Evrogen, Russia. We ran a few tests with various PCR parameters and found that barcoding yields did not depend on the purity of the DNA template and required no sample purification. The optimal PCR parameters for Nextera-based barcoding with Nextera primers (tab. 3) are shown in tab. 4.
It is worth remembering that poor library quality control leads to errors during sequencing. We usually perform quality control using the Agilent Bioanalyzer 2100 (Agilent Technologies, USA) (fig. 2).
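To illustrate how dual indexes discriminate between samples, the following sketch assigns reads to samples by their pair of 8-nucleotide indexes. It is a simplified illustration under stated assumptions: the index sequences and sample names are made up (they are not the oligos in tab. 3), the labels `i7` and `i5` are used here only to tell the two indexes apart, and matching is exact, whereas production demultiplexers often tolerate a mismatch.

```python
from collections import defaultdict

INDEX_LENGTH = 8  # index length stated in the text


def build_lookup(samples):
    """samples: dict sample_name -> (i7_index, i5_index).
    Returns a dict (i7, i5) -> sample_name after basic validation."""
    lookup = {}
    for name, (i7, i5) in samples.items():
        if len(i7) != INDEX_LENGTH or len(i5) != INDEX_LENGTH:
            raise ValueError(f"{name}: indexes must be {INDEX_LENGTH} nt long")
        key = (i7.upper(), i5.upper())
        if key in lookup:
            raise ValueError(f"{name}: index pair already used by {lookup[key]}")
        lookup[key] = name
    return lookup


def demultiplex(reads, lookup):
    """reads: iterable of (read_id, i7_index, i5_index).
    Returns dict sample_name -> list of read ids; reads whose index pair
    is not in the lookup are collected under 'undetermined'."""
    bins = defaultdict(list)
    for read_id, i7, i5 in reads:
        sample = lookup.get((i7.upper(), i5.upper()), "undetermined")
        bins[sample].append(read_id)
    return dict(bins)


# Hypothetical indexes and sample names, for illustration only.
lookup = build_lookup({
    "gut_01": ("AAAACCCC", "GGGGTTTT"),
    "gut_02": ("AAAACCCC", "TTTTGGGG"),
})
reads = [
    ("r1", "AAAACCCC", "GGGGTTTT"),
    ("r2", "AAAACCCC", "TTTTGGGG"),
    ("r3", "ACGTACGT", "GGGGTTTT"),
]
print(demultiplex(reads, lookup))
```

The two toy samples share the same first index and differ only in the second, which is the situation dual indexing is designed to resolve.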
Sample sequencing
We will not focus on the sequencing step itself in this article, because standard sequencing protocols are supplied by the vendor [24, 25]. We performed MiSeq paired-end sequencing (Illumina) with 250 bp reads according to the standard protocol.
Bioinformatic analysis
The obtained nucleotide sequences are processed and classified as suggested by the Ribosomal Database Project, ver. 11.5 (Michigan State University, USA), using RDPTools ver. 2016-07-21 [26, 27]. The classification confidence threshold (-conf) should be set to 50 % as recommended in [28].
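The effect of the confidence threshold can be shown on a single read. The sketch below is not a parser for RDPTools output: it assumes each read's classification has already been loaded as a list of hypothetical (rank, taxon, confidence) tuples ordered from domain downwards, and it truncates the lineage at the first rank whose confidence falls below the threshold. Treating a value exactly equal to the threshold as passing is also an assumption.

```python
def trim_lineage(lineage, conf=0.5):
    """Keep ranks from the top of the lineage until the first one whose
    confidence is below `conf`; everything deeper is discarded.

    lineage: list of (rank, taxon, confidence) tuples, top rank first.
    Returns a list of (rank, taxon) tuples.
    """
    kept = []
    for rank, taxon, confidence in lineage:
        if confidence < conf:
            break
        kept.append((rank, taxon))
    return kept


# Illustrative values only, not results from this study.
example = [
    ("domain", "Bacteria", 1.00),
    ("phylum", "Firmicutes", 0.97),
    ("class", "Bacilli", 0.81),
    ("genus", "Lactobacillus", 0.42),
]
print(trim_lineage(example))
```

In this example the read is reported down to class level, because the genus assignment does not reach the 50 % threshold.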
Quality control is essential in sequencing. At least 95 % of sequences in each sample must be high quality, i.e. contain no adapters or contaminating elements that cannot be mapped onto the human genome. The number of reads per sample is especially important in metagenomic sequencing, but there is no universal rule: the required number depends mostly on the purpose of the study. If the study aims at identifying the dominant bacteria in the sample, the number of reads can be low. For example, 350 reads per each of 22 human gut samples revealed the presence of two dominant bacterial phyla: Firmicutes (75 %) and Bacteroidetes (18 %) [4]. However, higher read numbers increase the chances of discovering microbial “minorities” in the sample and reduce the effects of sampling error. A high-resolution metagenomic analysis requires at least 10,000 reads [4].
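The link between read depth and the detection of rare taxa can be made concrete with a simple probability model. The sketch below assumes that reads are sampled independently and that a taxon makes up a fixed fraction p of the sequenced amplicons, so the chance of seeing it at least once in N reads is 1 − (1 − p)^N. PCR bias and sequencing errors are ignored, and the 0.1 % abundance is a hypothetical value chosen for illustration.

```python
import math


def detection_probability(abundance, n_reads):
    """Probability that a taxon with the given relative abundance
    (0 < abundance < 1) is represented by at least one of n_reads reads,
    assuming independent sampling of reads."""
    return 1.0 - (1.0 - abundance) ** n_reads


def reads_needed(abundance, target_probability):
    """Smallest number of reads for which the detection probability
    reaches target_probability under the same assumption."""
    return math.ceil(math.log(1.0 - target_probability) / math.log(1.0 - abundance))


rare = 0.001  # hypothetical taxon at 0.1 % relative abundance
for depth in (350, 10_000):
    print(depth, round(detection_probability(rare, depth), 4))
print(reads_needed(rare, 0.95))
```

Under this model a taxon at 0.1 % abundance is more likely to be missed than found at 350 reads, while at 10,000 reads it is almost certain to appear at least once.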
The proportion of unclassified sequences and unknown bacterial sequences, as well as the median proportion of sequences for which both genus and species have been reliably identified, should be consistent with the results expected of 16S rRNA-based microbial metagenomic analysis. The proportion of unclassified sequences should not exceed 20 %, the genus is expected to be identified for at least 70 % of sequences, and the species for at least 50 %. Still, these figures may vary depending on the study.
Normally, a table is compiled based on the analysis, showing a taxonomic hierarchy (domain, phylum, class, etc.) and providing information on the taxonomic tree and relative abundance of taxa in the sample. An example of such a table is shown in tab. 5.
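The quality criteria listed above can be collected into a single per-sample check. The sketch below is one possible reading of them, not a validated pipeline step: the thresholds are those quoted in the text, but the choice of denominators (all reads for the high-quality fraction, high-quality reads for the classification fractions) is an assumption, and the read counts in the example are invented.

```python
THRESHOLDS = {
    "min_high_quality": 0.95,  # share of reads that must be high quality
    "max_unclassified": 0.20,  # upper limit for unclassified reads
    "min_genus": 0.70,         # share of reads identified to genus
    "min_species": 0.50,       # share of reads identified to species
}


def qc_report(total, high_quality, unclassified, genus, species):
    """Return a dict check_name -> bool for one sample.

    All arguments are read counts. Classification fractions are taken
    relative to the high-quality reads (an assumption of this sketch).
    """
    if total <= 0 or high_quality <= 0:
        raise ValueError("read counts must be positive")
    return {
        "high_quality": high_quality / total >= THRESHOLDS["min_high_quality"],
        "unclassified": unclassified / high_quality <= THRESHOLDS["max_unclassified"],
        "genus": genus / high_quality >= THRESHOLDS["min_genus"],
        "species": species / high_quality >= THRESHOLDS["min_species"],
    }


# Invented counts for a single sample.
report = qc_report(total=12000, high_quality=11640,
                   unclassified=1500, genus=9000, species=6200)
print(report, "PASS" if all(report.values()) else "FAIL")
```

Because the text notes that these figures may vary between studies, the thresholds are kept in one dictionary so they can be adjusted in a single place.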
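A relative abundance table of this kind can be derived directly from per-read lineages. The sketch below assumes each read's classification is available as a tuple of taxon names ordered from domain downwards (a hypothetical in-memory layout, not the RDPTools file format) and counts reads at a chosen depth of the hierarchy. Reads not classified that deep are pooled as "unclassified" so that the fractions refer to all reads.

```python
from collections import Counter


def abundance_table(lineages, depth):
    """Summarise reads at a given depth of the taxonomic hierarchy.

    lineages: list of tuples such as ("Bacteria", "Firmicutes", ...).
    depth: 1 for domain, 2 for phylum, and so on.
    Returns rows (lineage string, read count, relative abundance),
    most abundant first.
    """
    counts = Counter()
    for lineage in lineages:
        if len(lineage) >= depth:
            counts[tuple(lineage[:depth])] += 1
        else:
            counts[("unclassified",)] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(" > ".join(key), n, n / total) for key, n in counts.most_common()]


# Toy data: four reads, for illustration only.
toy_lineages = [
    ("Bacteria", "Firmicutes", "Bacilli"),
    ("Bacteria", "Firmicutes", "Clostridia"),
    ("Bacteria", "Firmicutes"),
    ("Bacteria", "Bacteroidetes", "Bacteroidia"),
]
for name, count, fraction in abundance_table(toy_lineages, depth=2):
    print(f"{name}\t{count}\t{fraction:.2f}")
```

Calling the function with increasing `depth` values gives one block of rows per taxonomic level, which together form the hierarchical table.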
CONCLUSIONS
Selecting primers for amplification is an important step in metagenomic sequencing of 16S rRNA gene regions. The quality of the obtained amplicons determines the accuracy of sequencing. We suggest an approach to designing an optimal PCR protocol for sample preparation that can be used to adjust PCR parameters for library preparation, and we identify problems that may occur at this step. Although library preparation for metagenomic sequencing has been widely discussed in the literature [29], we have attempted to design a well-defined protocol and propose optimal parameters for the amplification of the 16S rRNA V3 and V4 regions using universal primers for further sequencing with Illumina. Of note, amplification yield and quality depend on many factors, including the purity of the reagents; therefore, our PCR protocol is not universal, and its conditions may need to be adjusted depending on the reagents used.