METHOD
Designing of custom barcodes for sequencing on the MGI platform
Pirogov Russian National Research Medical University
Correspondence should be addressed: Anna O. Shmitko
Ostrovityanova, 1/1, Moscow, 117997, Russia; moc.liamg@79imhsanna
Funding: this research was funded by the grant №075-15-2019-1789 from the Ministry of Science and Higher Education of the Russian Federation allocated to the Center for Precision Genome Editing and Genetic Technologies for Biomedicine.
Author contribution: Shmitko AO — study planning, data collection, writing Original Draft Preparation; Bulusheva IA — methodology, data analysis, writing Original Draft Preparation; Vasiliadis IA, Suchalko ON, Syrko DS — data analysis, software, visualization; Belova VA — study planning, manuscript review and editing; Pavlova AS — data curation, data analysis, software; Korostin DO — conceptualization, supervision, methodology, manuscript review and editing.
MGI Tech is a relatively new player in the NGS market, founded in 2016 as a subsidiary of BGI Group [1–3]. The company's first sequencing platform, the MGISEQ-2000, was introduced in 2017, followed by the MGISEQ-200RS and MGISEQ-T7 platforms. MGI produces a range of sequencers based on DNA nanoball technology and cPAS sequencing [4]. These instruments support single-end and paired-end sequencing with either single or dual barcoding. Samples are barcoded during the ligation of adapters that contain the barcode sequences. DNA library barcoding is necessary for labeling sequences from different biological samples and for identifying reads when the temporary sequencing files are converted into the commonly used fastq format. MGI barcodes are 10 bp long.
The standard kits for library preparation and sequencing with the mid-throughput sequencer DNBSEQ G-400 are designed for single-indexed sequencing, whereas the dual barcoding mode is optional and requires purchasing additional kits. Currently, MGI provides a kit that includes 96 barcode adapters for the ligation step in DNA library preparation for single-end sequencing. In addition, MGI lists 32 barcode sequences for synthesis.
The G-400 system is sensitive to nucleotide balance at each cycle of barcode sequencing: the quality drops drastically if the same position in the barcodes loaded on one lane is occupied by the same nucleotide. The barcodes pooled on a lane must therefore be chosen so that their sequences complement each other, that is, they must form a compatible (balanced) set. The set of 96 adapters provided by MGI forms 11 balanced sets (2 sets of 4, 8 sets of 8, and 1 set of 24). In practice, however, it is often necessary to combine samples carrying barcodes from different sets, to change the number of samples loaded on a lane, and to vary their ratio. In the laboratory routine, it is not uncommon for one or several DNA libraries to fail quality control at the final stage, and a flexible approach to combining samples simplifies pooling the remaining libraries for loading on the lane. Samples that require different amounts of output data, such as exomes sequenced at ×200 and ×100 coverage, must also be accommodated. By providing a small number of barcodes and sets, the manufacturer thus limits the users of this platform and prevents them from exploiting its full sequencing potential. This may prove critical when selecting a sequencing platform. Custom solutions for various applications have been developed for the Illumina platform [5–7], whereas for the MGI platform, such solutions have not yet been provided.
We previously developed software that selects the optimal combination of the provided barcodes for MGI adapter sets at various ratios and sample numbers [8]. The updated software, which includes the custom barcodes, is available in the GitHub repository (https://github.com/genomecenter/BC-store/tree/custom-adapter-sets). Other software for selecting a balanced ratio of barcodes, depending on the sequencing task, was developed earlier for Illumina NGS instruments [9–11].
The purpose of this paper is to present an algorithm we have developed that can generate the required number of barcode sequences for a given study. Using this algorithm, we designed 252 barcodes that form 63 balanced sets of 4 barcodes each, where any set can be combined with any of the others.
METHODS
Method formulation and barcode selection
The sequencer has limits on the intensity of the signal it can register from the fluorophores corresponding to the nucleotides. If the same position of the barcodes contains the same nucleotide, the read quality drops significantly, leading to errors in barcode identification and in the subsequent assignment of reads to samples [8]. Therefore, we had to design barcodes that form the most balanced combinations. The sequence design algorithm is based on the "quad method": to each barcode from the MGI set, we add three barcodes obtained by consecutive substitutions of its bases (fig. 1A, B).
Following this method, each of the 96 barcodes can serve as the root barcode of its own quad, yielding 96 × 4 = 384 unique barcodes.
As each base accounts for exactly 25% at every position within a quad, the resulting combination is perfectly balanced and guarantees the highest quality of sequencing.
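The quad construction can be illustrated with a minimal Python sketch. The cyclic substitution order A→C→G→T→A and the root sequence used here are assumptions made for illustration only; the substitution scheme actually used is the one shown in fig. 1.

```python
# Assumed cyclic substitution order; fig. 1 defines the scheme actually used.
NEXT_BASE = {"A": "C", "C": "G", "G": "T", "T": "A"}


def make_quad(root):
    """Return the root barcode plus three barcodes obtained by applying
    the base substitution once, twice and three times at every position."""
    quad = [root]
    for _ in range(3):
        quad.append("".join(NEXT_BASE[base] for base in quad[-1]))
    return quad


def is_perfectly_balanced(barcodes):
    """True if A, C, G and T occur equally often at every position."""
    n = len(barcodes)
    for column in zip(*barcodes):
        if any(column.count(base) * 4 != n for base in "ACGT"):
            return False
    return True


# Hypothetical 10 bp root, not a sequence from the MGI kit.
quad = make_quad("ACGTACGTAC")
for letter, barcode in zip("ABCD", quad):
    print(letter, barcode)
print("balanced:", is_perfectly_balanced(quad))
```

Because every position cycles through all four bases, any root sequence gives a quad in which each base accounts for 25% at each position.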
Verification of compliance with the criteria
Validation of the compatibility based on the balance
As each quad is perfectly balanced, any number of quads can be combined with each other. The ratio between the quads in a pool can vary; however, the ratios between barcodes in each quad should be equal.
Furthermore, we checked whether it was possible to generate pools containing 4n + 2 barcodes, where n is the number of complete quads. We checked the compatibility using the BC-Store software by combining 10 barcodes (fig. 2). In a pool of 10 barcodes, the fraction of each nucleotide at any position lies between 0.2 and 0.3, which meets the criteria for a balanced combination. The pool remains balanced when any two barcodes from the same quad are added to n quads in amounts equal to or lower than those of the barcodes in the complete quads.
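The 4n + 2 case can be reproduced with a short calculation. The sketch below is not the BC-Store implementation; it only counts per-position base fractions for two complete quads plus two barcodes from a third quad. The roots and the substitution order are hypothetical.

```python
NEXT_BASE = {"A": "C", "C": "G", "G": "T", "T": "A"}  # assumed substitution order


def make_quad(root):
    """Root barcode plus three consecutive base substitutions of it."""
    quad = [root]
    for _ in range(3):
        quad.append("".join(NEXT_BASE[base] for base in quad[-1]))
    return quad


def position_fractions(pool):
    """pool is a list of (barcode, relative_amount) pairs.
    Returns one dict per barcode position mapping base -> fraction."""
    total = sum(amount for _, amount in pool)
    fractions = []
    for i in range(len(pool[0][0])):
        share = dict.fromkeys("ACGT", 0.0)
        for barcode, amount in pool:
            share[barcode[i]] += amount
        fractions.append({base: value / total for base, value in share.items()})
    return fractions


# Hypothetical roots: two complete quads and two barcodes from a third quad.
roots = ["ACGTACGTAC", "AACCGGTTAA", "GATTACAGCA"]
pool = [(bc, 1.0) for root in roots[:2] for bc in make_quad(root)]
pool += [(bc, 1.0) for bc in make_quad(roots[2])[:2]]

fractions = position_fractions(pool)
lowest = min(min(f.values()) for f in fractions)
highest = max(max(f.values()) for f in fractions)
print(len(pool), round(lowest, 3), round(highest, 3))  # prints: 10 0.2 0.3
```

Two barcodes from the same quad never share a base at any position, so with equal amounts each position holds two bases at 3/10 and two at 2/10. Lowering the relative amount of the two extra barcodes moves all four fractions closer to 0.25.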
Verifying the compatibility of barcodes based on a mismatch number
At the next step, all quads were checked for compatibility by the number of mismatches. Each sample labeled with a barcode had to be uniquely identifiable, so the barcode sequences had to differ sufficiently from one another. We selected a threshold of 4 mismatches, as all 96 of the 10 bp barcodes provided by the manufacturer differ by more than 4 bases. The analysis also included the MGI verification barcode 999 (a 10 bp technical sequence present in the original software's demultiplexing file). We constructed a graph of incompatible quads (S1 Fig.) and, using an adjacency matrix (S2 Fig.), selected 63 quads (252 barcodes) that are compatible with each other based on the number of permitted mismatches (fig. 3). The sequences of all 252 barcodes are listed in S1 Table.
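A minimal sketch of this check is given below. It assumes that two quads are compatible when every pair of their barcodes differs in at least `min_mismatches` positions, and it picks a compatible subset with a simple greedy heuristic. Both the exact comparison rule and the greedy selection are assumptions; the text does not specify how the 63 quads were selected from the graph. All sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def quads_compatible(quad1, quad2, min_mismatches=4):
    """Assumed rule: every cross-quad pair differs in >= min_mismatches bases."""
    return all(hamming(a, b) >= min_mismatches for a, b in product(quad1, quad2))


def incompatibility_graph(quads, min_mismatches=4):
    """Adjacency sets; an edge joins two quads that are NOT compatible."""
    graph = {i: set() for i in range(len(quads))}
    for i in range(len(quads)):
        for j in range(i + 1, len(quads)):
            if not quads_compatible(quads[i], quads[j], min_mismatches):
                graph[i].add(j)
                graph[j].add(i)
    return graph


def greedy_compatible_subset(graph):
    """Greedy heuristic: take quads with the fewest conflicts first.
    This does not guarantee the largest possible compatible set."""
    chosen = []
    for node in sorted(graph, key=lambda n: len(graph[n])):
        if all(node not in graph[c] for c in chosen):
            chosen.append(node)
    return chosen


# Hypothetical quads; the second one repeats barcodes of the first.
quads = [
    ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TACGTACGTA"],
    ["CGTACGTACG", "GTACGTACGT", "TACGTACGTA", "ACGTACGTAC"],
    ["AACCGGTTAA", "CCGGTTAACC", "GGTTAACCGG", "TTAACCGGTT"],
]
graph = incompatibility_graph(quads)
selected = greedy_compatible_subset(graph)
print(graph)
print(sorted(selected))
```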
Validating the uniqueness
We checked whether the sequences of the designed barcodes are present among the original MGI barcodes. This is necessary for generating the barcode file used for automatic demultiplexing. For this purpose, we created a Venn diagram of the custom and original MGI barcode sets. The two sets overlap in 63 sequences, all of which are original MGI barcodes, while the other 189 sequences are unique and do not coincide with MGI barcodes from any of the kits (fig. 4).
Preparation for sequencing
Adapter synthesis
According to the manufacturer's instructions, each individual adapter is prepared by annealing two oligonucleotides.
One of them (the top oligonucleotide) contains the barcode sequence and a phosphate at the 5'-end (Ad153_5T_1-index # (1~128) according to the manufacturer). The sequence of the bottom oligonucleotide (Ad153Ω_Bottom_2) is partially complementary to the top oligonucleotide (https://en.mgitech.cn/Download/download_file/id/71) [12].
The sequences of the oligonucleotides containing barcodes 1A–1D are shown in Table 1; the sequences for all 252 barcodes are listed in S1 Table.
To prepare the adapters, 1 µL of 5 M NaCl, 10 µL of 200 µM top oligonucleotide, and 10 µL of 200 µM bottom oligonucleotide were added to 79 µL of LowTE buffer. The mixture was then heated at 95 °C for 2 minutes and gradually cooled to 17 °C at a rate of 0.5 °C every 30 seconds.
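The final concentrations and the duration of the cooling ramp follow directly from these volumes and rates. The short calculation below is only an arithmetic check; it assumes that the ramp runs in uniform 0.5 °C steps of 30 seconds with no additional hold times.

```python
# Volumes (µL) and stock concentrations taken from the protocol above.
nacl_ul, nacl_stock_mM = 1, 5000        # 5 M NaCl expressed in mM
oligo_ul, oligo_stock_uM = 10, 200      # applies to each oligonucleotide
buffer_ul = 79                          # LowTE buffer

total_ul = nacl_ul + 2 * oligo_ul + buffer_ul
final_oligo_uM = oligo_stock_uM * oligo_ul / total_ul
final_nacl_mM = nacl_stock_mM * nacl_ul / total_ul

# Cooling ramp: 95 °C down to 17 °C in 0.5 °C steps, 30 s per step (assumed uniform).
ramp_steps = (95 - 17) / 0.5
ramp_minutes = ramp_steps * 30 / 60

print(f"total volume: {total_ul} µL")
print(f"each oligonucleotide: {final_oligo_uM} µM, NaCl: {final_nacl_mM} mM")
print(f"cooling ramp: {ramp_steps:.0f} steps, {ramp_minutes:.0f} min")
```

The mixture therefore has a total volume of 100 µL with each oligonucleotide at 20 µM in 50 mM NaCl, and the cooling step takes 78 minutes.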
The algorithm of uploading new barcodes to a sequencer
To demultiplex the sequenced libraries automatically, we followed MGI's recommendations and created a .csv file (S2 Table) containing the barcode sequences: the new custom barcodes, the original MGI barcodes, and the 999 validation barcode. MGI barcodes included in the quads were named nA, where n is the adapter number in the original MGI kit, while custom barcodes were named nB, nC, and nD according to the order of quad formation. The names of the original MGI barcodes not included in quads remained unchanged. The barcode numbers were separated from the barcode sequences by commas without spaces.
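The naming scheme can be sketched as follows. The two-column layout (name, then sequence), the row order, and the function name `write_barcode_csv` are assumptions for illustration; the actual file is provided in S2 Table. All sequences below are hypothetical, including the one standing in for the 999 validation barcode.

```python
import csv
import io


def write_barcode_csv(handle, quads, unchanged_mgi, validation_seq):
    """Write 'name,sequence' rows without spaces.

    quads: {MGI adapter number: [root, second, third, fourth barcode]}
    unchanged_mgi: {MGI adapter number: sequence} for barcodes outside quads
    validation_seq: sequence of the 999 validation barcode
    """
    writer = csv.writer(handle, lineterminator="\n")
    for number, sequences in sorted(quads.items()):
        for letter, sequence in zip("ABCD", sequences):
            writer.writerow([f"{number}{letter}", sequence])
    for number, sequence in sorted(unchanged_mgi.items()):
        writer.writerow([str(number), sequence])
    writer.writerow(["999", validation_seq])


# Hypothetical sequences, not taken from the MGI kit or S1 Table.
quads = {1: ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TACGTACGTA"]}
unchanged_mgi = {5: "AACCGGTTAA"}

buffer = io.StringIO()
write_barcode_csv(buffer, quads, unchanged_mgi, "GGTTAACCGG")
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Writing to an in-memory buffer keeps the example free of side effects; passing a file opened with `newline=""` would produce the .csv file itself.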
RESULTS
To validate the designed barcodes, we prepared libraries with the synthesized custom adapters (Evrogen). The libraries, prepared following the standard MGI protocol, were pooled, enriched using the SureSelect Human All Exon v7 kit [13], and sequenced in PE100 mode on the DNBSEQ G-400. Demultiplexing into fastq files was performed by basecalllite, the MGI software built into the G-400, using the uploaded file containing the barcode sequences. By default, the algorithm considers a read "undecoded" if there are two or more mismatches in the 10 bp barcode sequence. Therefore, the fraction of undecoded reads can be used as a quality metric for the performance of custom barcode adapters. We compared the fraction of undecoded reads in the complete data from each lane sequenced with custom barcodes (44 lanes) with that from previous runs that employed MGI barcodes (44 lanes). On average (mean ± SD), the fractions of undecoded reads per lane were 1.08 ± 0.19% for the MGI adapters and 1.68 ± 0.22% for the custom adapters (fig. 5). Although the proportion of undecoded data was higher with custom barcodes than with the original barcodes (T = 13.5, df = 83, p-value = 1.17E-22), we consider the absolute difference negligible relative to the total data output of a single lane. The amounts of undecoded and total data in GB are presented in S3 Table.
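The decoding rule and the resulting metric can be illustrated with a small sketch. This is not the basecalllite implementation; it only applies the stated rule (a read is undecoded when its barcode has two or more mismatches) to a few hypothetical sequences, and it additionally treats ambiguous matches as undecoded, which is an assumption.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(read_barcode, barcodes, max_mismatches=1):
    """Return the name of the single barcode within max_mismatches of the
    read, or None if the read is undecoded (no match or ambiguous match)."""
    hits = [
        name
        for name, sequence in barcodes.items()
        if hamming(read_barcode, sequence) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def undecoded_fraction(read_barcodes, barcodes):
    """Fraction of reads whose barcode could not be assigned to a sample."""
    undecoded = sum(decode(r, barcodes) is None for r in read_barcodes)
    return undecoded / len(read_barcodes)


# Hypothetical barcode table and observed barcode reads.
barcodes = {"1A": "ACGTACGTAC", "1B": "CGTACGTACG"}
reads = [
    "ACGTACGTAC",  # exact match to 1A
    "ACGTACGTAA",  # one mismatch to 1A, still decoded
    "ACGTACGTTT",  # two mismatches to 1A, undecoded
]
print([decode(r, barcodes) for r in reads])
print(round(undecoded_fraction(reads, barcodes), 3))
```

Because all barcodes in the designed set differ from each other in several positions, a read with a single sequencing error remains closest to its own barcode.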
Thus, we have developed a viable approach for designing custom barcodes, the "quad method," which allows for simultaneous sequencing of more than 96 samples on the MGI platform. We obtained 189 custom barcodes that can be combined with 63 MGI barcodes to form 63 balanced quads. One barcode in each quad is an original MGI barcode (nA, where n is the number of the original barcode), and the other three are custom barcodes (nB, nC, nD) that complement it.
These quads can be combined with each other in any number and at any ratio, as long as the barcodes within the same quad are pooled in equal amounts. It is also possible to create library pools with 4n + 2 barcodes, where n is the number of complete quads and the two additional barcodes are any two barcodes from another quad. In this case, the fraction of each of the two additional barcodes should not exceed the fractions of the others.
DISCUSSION
The MGI platform is designed for fast, high-throughput sequencing and offers undeniable benefits, but it also has limitations. We attempted to overcome some of the limits imposed by the solutions and kits provided by the manufacturer. Our approach improves the efficiency of sequencing and expands the possibilities of the MGI platform. However, it is important to bear in mind that combinations of the quads with some original MGI adapters not included in the quads may fail to meet the compatibility criterion for the number of mismatches. We therefore recommend checking such combinations using the BC-Store software. We assume that the higher fraction of undecoded reads may be related to the lower purity of the synthesized oligonucleotides compared to those supplied by MGI [14]. Previously, we ordered the synthesis of identical barcodes from two different manufacturers and observed that the proportion of undecoded reads was elevated for one of them.
CONCLUSIONS
The custom barcodes we devised make it possible to alter the ratio and the number of libraries loaded on a lane depending on the purpose and the required amount of data. Using the BC-Store software we developed earlier, libraries can be pooled more easily and quickly for sequencing on MGI NGS instruments in either paired-end or single-end mode. Therefore, given its advantages and disadvantages, this method can be used as an addition or an alternative to the solution provided by MGI.