REVIEW
CRISPR-Cas systems of Mycobacterium tuberculosis: the structure, transformation in different lineages in the process of evolution and a possible role in the formation of virulence and drug resistance
1 Vavilov Institute of General Genetics, RAS, Moscow
2 Department of Bioinformatics, Faculty of Biological and Medical Physics, Moscow Institute of Physics and Technology (State University), Dolgoprudny
Correspondence should be addressed: Marina Zaychikova
ul. Gubkina 3, Moscow, 119991; marinaz15@yandex.ru, valerid@vigg.ru
CRISPR-CAS SYSTEMS IN BACTERIA: STRUCTURE AND CLASSIFICATION
To date, CRISPR-Cas systems have been identified in approximately 40% of bacterial and 90% of archaeal genomes [1, 2]. These systems consist of two essential components: CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) arrays and Cas (CRISPR-associated) proteins. Repetitive sequences of equal length alternating with unique regions (spacers) were described in the E. coli genome as early as 1987 [3], but their function was unclear at the time. In the 2000s, CRISPR-Cas systems were shown to play a role in bacterial immunity [4, 5]. They have since been shown to participate in a number of other cellular processes, including DNA repair, regulation of gene expression and virulence formation [6]. Interestingly, the direct repeats (DR) discovered in the M. tuberculosis genome in the 1990s were used for genotyping (spoligotyping) of mycobacteria even before the immune function of CRISPR-Cas systems was described, and their polymorphism is well studied [7, 8].
CRISPR-Cas systems are very diverse. Each functional array contains three essential elements: repeats, spacers and a leader sequence. Adjacent to the array is a set of cas genes coding for proteins with various functional domains that interact with nucleic acids [9]. Although the sets of cas genes that carry out the different stages of the CRISPR-Cas mechanism differ between systems, they share common features. For example, the majority of known active CRISPR-Cas systems contain two proteins, Cas1 and Cas2. These proteins form a complex that integrates new spacers into the array next to the leader sequence. Over the array’s lifetime, some spacers can be lost as a result of recombination between repeats [10]. The array can also be acquired, partially or fully, through horizontal gene transfer (HGT) [11].
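The array dynamics described above (acquisition next to the leader, loss through recombination between repeats) can be illustrated with a toy model. This is only a sketch: the class name, the method names and the placeholder leader, repeat and spacer strings are hypothetical and do not come from the cited works.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrisprArray:
    """Toy model of a CRISPR array; spacers[0] is the spacer closest to the leader."""
    leader: str
    repeat: str
    spacers: List[str] = field(default_factory=list)

    def acquire(self, spacer: str) -> None:
        # New spacers are integrated at the leader-proximal end of the array.
        self.spacers.insert(0, spacer)

    def lose(self, start: int, end: int) -> None:
        # Recombination between two repeats removes the spacers lying between
        # them (positions start..end-1, counted from the leader).
        del self.spacers[start:end]

    def sequence(self) -> str:
        # Leader, then repeat-spacer units, closed by a terminal repeat.
        units = "".join(self.repeat + s for s in self.spacers)
        return self.leader + units + (self.repeat if self.spacers else "")


# Placeholder strings stand in for real sequences.
array = CrisprArray(leader="LEADER", repeat="R")
for name in ["a", "b", "c", "d"]:
    array.acquire(name)
order_after_acquisition = list(array.spacers)  # newest spacer first
array.lose(1, 3)                               # drops the two middle spacers
```

In this model the oldest spacers always sit farthest from the leader, so the order of spacers records the order of acquisition.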
CRISPR-Cas systems are classified based on the composition of their cas-loci. According to the currently used classification, they are subdivided into 2 classes, 5 types and multiple subtypes [12]. Class 1 (types I, III and IV) comprises CRISPR-Cas systems with multisubunit effector complexes; in class 2 systems (types II and V), all functions of the effector complex are performed by a single protein, such as Cas9 [12]. Type II CRISPR-Cas systems are of paramount importance for biotechnology and specifically for genome editing, but they are quite rare and have been detected only in bacterial genomes [12]. The majority of CRISPR-Cas systems can be unambiguously assigned to one of the 5 main types. However, there are organisms whose cas-loci do not fit into the current classification.
The cas1 and cas2 genes, the CRISPR-Cas components involved in the integration of new spacers into the array, deserve particular attention. Although there is evidence that both have a role in spacer integration, all enzymatic activities necessary for this process reside in Cas1, whereas the catalytic activity of Cas2 is not required to form the Cas1-Cas2 complex or to insert a new spacer. Cas2 is known to be an mRNA interferase that specifically cleaves ribosome-bound mRNA. At first glance, such activity seems out of place in the integration of new spacers. However, some researchers suggest that Cas2 may have originated from ancient mobile elements, such as toxin-antitoxin (TA) systems [13, 14]. In view of this, it may be assumed that Cas2 retains its ancestral toxin-like endoribonuclease activity in the CRISPR-Cas system, but that this activity is reversibly inhibited when Cas2 interacts with Cas1 to form the Cas1-Cas2 complex. According to this hypothesis, if the CRISPR-Cas system fails to inhibit viral growth, Cas2 is activated (possibly through Cas1 degradation) and stops translation, driving the cell to suicide or into a dormant state. The participation of Cas2 in spacer integration may be connected to the regulation or stabilization of Cas1 following the formation of the Cas1-Cas2 complex, which at the same time reversibly inactivates Cas2 [15]. The possible participation of Cas2 in driving the cell into a persistent state is a promising area of pathogen research, for M. tuberculosis in particular.
FUNCTIONS OF CRISPR-CAS SYSTEMS IN BACTERIA
Because CRISPR-Cas systems are widespread and very diverse, it is no wonder that evidence of their involvement in different cellular processes keeps accumulating in the literature [6]. Apart from adaptive immunity, the best-known and best-described CRISPR-Cas function is the regulation of gene expression. For example, the life cycle of the soil bacterium Myxococcus xanthus includes stages of fruiting body formation and sporulation. Formation of the fruiting body and the subsequent differentiation of its cells into myxospores are rigorously regulated by intercellular signals and intracellular signaling cascades, in which the type I-C CRISPR-Cas system of M. xanthus acts as a component of a positive feedback loop and participates in sporulation [16].
Today, there is evidence that CRISPR-Cas systems can engage in DNA repair. It has been established that purified Cas1 (YgbT) from Escherichia coli interacts, both physically and genetically, with key components of DNA repair systems, such as the products of the recB, recC and ruvB genes [17]. The researchers demonstrated that the ygbT deletion strain has increased sensitivity to DNA damage.
Similar phenotypes have been observed in strains with a deleted CRISPR cluster, which indicates that at least some CRISPR-Cas components are involved in DNA repair.
Another alternative function of CRISPR-Cas systems pertains to their participation in biofilm formation [18]. A study of the type I-F CRISPR-Cas system of the opportunistic pathogen Pseudomonas aeruginosa revealed that this system inhibits biofilm formation. This CRISPR-dependent ability relies on the interaction between a certain spacer and its matching protospacer located in the bacteriophage genome. The interaction eventually leads to the induction of phage-related genes that, in turn, trigger the death of surface cells. These findings suggest that CRISPR-Cas systems possess another mechanism unrelated to adaptive immunity.
Bacteria usually regulate their gene expression post-transcriptionally by means of various small non-coding RNAs. Although these RNA molecules control a great deal of cell physiology, only a few of them participate in the recognition of invading nucleic acids, ceding this role to CRISPR-Cas systems. Unlike eukaryotic systems, bacterial CRISPR-Cas systems cleave DNA, which means that if they engaged in the regulation of endogenous genes, the bacterial chromosome would inevitably be destroyed. Surprisingly, though, an article published in Nature in 2013 reported a mechanism of post-transcriptional regulation in Francisella novicida in which a virulence gene is regulated by the Cas9 protein and CRISPR-associated small RNA [19]. Hypothetically, Cas9 directs its activity against endogenous mRNA rather than DNA. An association between CRISPR-Cas systems and the ability of bacterial strains to exhibit increased virulence or even drug resistance has been shown in a number of studies [20]. Regarding the alternative functions of CRISPR-Cas systems, some authors hypothesize that the effect on biofilm formation in Pseudomonas aeruginosa is a by-product of the “classical” CRISPR-Cas immune function, whereas virulence regulation in Francisella novicida and developmental regulation in Myxococcus xanthus arose independently [6].
The history of the gradual discovery of different CRISPR-Cas functions, starting with the immune function, resembles the exploration of RNA interference in eukaryotes. At first, RNA interference was shown to have a role in immune defense, and it was not until later that its effects on various cellular processes, including gene regulation and heterochromatin formation, were discovered [21]. Some authors draw a parallel between CRISPR-Cas systems and RNA interference [22, 23].
CRISPR-CAS SYSTEMS IN MYCOBACTERIA: GENERAL STRUCTURE AND PECULIARITIES OF THE CAS-OPERON IN M. TUBERCULOSIS H37RV
The Mycobacterium genus comprises a wide range of organisms, including human pathogens, among which members of the Mycobacterium tuberculosis complex (MTBC) are the most important. This complex includes Mycobacterium tuberculosis, the major causative agent of tuberculosis. M. tuberculosis is genetically heterogeneous and can be divided into several groups, the so-called lineages. Each lineage is characterized by a certain set of mutations accumulated in the course of evolution [24–26]. Isolates of different lineages differ in phenotype, specifically in their ability to develop drug resistance (DR), virulence and pathogenicity, all of which determine the severity of the disease [27, 28]. The most widespread and clinically significant lineages of M. tuberculosis are Beijing, Haarlem, LAM, and S. The Beijing lineage (in particular, the recently emerged B0/W-148 sublineage) is the most epidemiologically important one due to its high prevalence and propensity to develop DR [29, 30]. The Haarlem lineage is characterized by increased virulence [28]. Also of interest are the EAI and Ural lineages, whose reduced virulence makes them less prevalent [28]. EAI is an ancient lineage geographically limited to South East Asia [31]. The Ural lineage, which is related to Haarlem, is not widespread either and appears to have reduced transmissibility [32] (fig. 1).
Given their possible role in virulence formation [19, 20], CRISPR-Cas systems are an interesting object of research, especially in different M. tuberculosis lineages.
To date, CRISPR-Cas systems have been identified in 14 mycobacterial species [34]. All of these systems are located on the chromosome. CRISPR arrays with more than 5 repeats have been identified in only 3 mycobacterial species: M. tuberculosis and M. bovis, which belong to the MTBC, and the pathogenic M. avium. M. avium lacks the cas-genes that would normally be adjacent to the CRISPR array, whereas the CRISPR loci of M. tuberculosis and M. bovis are very similar in structure. This reflects their close evolutionary relationship and is consistent with their phylogeny [34, 35]. The M. tuberculosis CRISPR-Cas system has a structure typical of type III-A systems [34].
We analyzed the CRISPR-Cas systems in 41 complete genome sequences of different M. tuberculosis lineages available in the NCBI RefSeq database, including 13 Beijing genomes, 3 B0/W-148 genomes, 2 EAI genomes, 10 Haarlem genomes, 1 Ural genome, 2 S genomes, and 10 LAM genomes. Because few complete genomes were available for some lineages, we additionally analyzed a number of draft genomes: 7 B0/W-148 genomes, 4 Ural genomes, 3 EAI genomes, and 3 S genomes. Genotyping was based on marker polymorphisms [36–38]. For some genomes, the genotype of the isolate was already known from the literature. The search for and analysis of CRISPR-Cas systems was conducted using two tools, CRISPRFinder and CRISPR Recognition Tool [39, 40]. Fig. 2 shows the typical structure of M. tuberculosis CRISPR-Cas systems exemplified by the H37Rv strain, the standard reference genome.
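To make the notion of a repeat-spacer array concrete, the sketch below pulls spacers out of a sequence when the direct repeat is already known. It is a deliberately naive illustration and not the algorithm used by CRISPRFinder or CRISPR Recognition Tool, which discover repeats de novo and tolerate mismatches. The function name, the `max_spacer` cut-off and the toy sequence are assumptions made for this example.

```python
from typing import List


def extract_spacers(sequence: str, repeat: str, max_spacer: int = 60) -> List[str]:
    """Return the spacers lying between consecutive exact copies of `repeat`.

    Assumes the repeat is known and occurs without mismatches
    (a simplification; real detection tools make neither assumption).
    """
    spacers: List[str] = []
    pos = sequence.find(repeat)
    while pos != -1:
        nxt = sequence.find(repeat, pos + len(repeat))
        if nxt == -1:
            break
        spacer = sequence[pos + len(repeat):nxt]
        # Keep only plausible spacers: non-empty and not longer than max_spacer.
        if 0 < len(spacer) <= max_spacer:
            spacers.append(spacer)
        pos = nxt
    return spacers


# Toy sequence: flank + (repeat + spacer) x 2 + terminal repeat + flank.
toy_repeat = "GTCGTC"
toy_sequence = "AAAA" + toy_repeat + "ACGTAC" + toy_repeat + "TTGCAA" + toy_repeat + "CCCC"
toy_spacers = extract_spacers(toy_sequence, toy_repeat)
```

Counting the returned spacers per genome gives the kind of array-length figure that is compared between lineages in the following sections.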
The majority of the analyzed M. tuberculosis strains had two long CRISPR-arrays (fig. 2) [8]. The only exception was strain 7199-99 of the Haarlem lineage: its CRISPR2 array had been reduced starting from spacer 12, together with the region between the arrays, resulting in a single array of 33 spacers. The largest number of spacers in an M. tuberculosis genome is 57 [8]; the smallest is 10, as observed in some B0/W-148 strains. Adjacent to the CRISPR1 array were 9 cas-genes, namely cas2, cas1, csm6, csm5, csm4, csm3, csm2, cas10 (csm1), and cas6 (fig. 2). The cas-genes of M. tuberculosis are highly conserved. In our study, no mutations were detected in cas1, cas2, csm4, csm2 and cas6. The other analyzed genes had single random mutations (tab. 1). The CRISPR2 array was separated from the CRISPR1 array by a sequence of ~1300 bp (fig. 2) containing two annotated transposases of the IS6110 family [34]. Of note, the CRISPR-Cas systems of M. tuberculosis typically have a short leader sequence of 48 bp [34].
DISTINCTIVE CHARACTERISTICS OF CRISPR-CAS SYSTEMS IN DIFFERENT M. TUBERCULOSIS LINEAGES
Beijing lineage
The region containing cas1, cas2, csm5 and csm6 (tab. 1), as well as the CRISPR1 array, was missing in the analyzed Beijing isolates [8, 34]. The remaining CRISPR2 array had only 14 spacers instead of 18, of which 10 (Sp1-Sp10) are shared by all M. tuberculosis lineages and 4 (SpB11-SpB14, where B stands for Beijing) are specific to the Beijing lineage and absent in other M. tuberculosis lineages (fig. 3). Curiously, the majority of the analyzed strains had two annotated transposases in the region between the csm4 gene and the CRISPR2 array. In the Beijing lineage, csm4 is significantly shorter than its ortholog in other lineages: the protein it encodes is either 76 or 116 to 118 amino acid residues long, whereas in other M. tuberculosis lineages it is 302 residues long. A protein of about 100 residues cannot retain the conserved domains necessary for the interaction with Csm3 inside the Csm1–Csm4–Csm3 complex [41]. This implies that the interference stage may be disrupted in the Beijing lineage. The Beijing lineage originated in North China, Korea and Japan about 7,000 years ago [37] (fig. 4). It appears that after this lineage separated from the others, its CRISPR2 array continued to incorporate the CRISPR2-specific spacers SpB11-SpB14. This could be due to differences in the environmental factors the pathogen encountered. The lineage then lost several cas-genes, including cas1 and cas2, which are involved in the integration of new spacers, and array growth stopped. As a result, representatives of the Beijing lineage normally have only one array (CRISPR2) with 14 spacers. However, some isolates of the evolutionarily young Beijing sublineage B0/W-148 appear to have lost some of these spacers: a number of isolates lack SpB13 and SpB14, while others have lost all 4 Beijing-specific spacers SpB11-SpB14. Interestingly, we found these 4 spacers in the CRISPR arrays of some M. bovis strains.
The high mutation frequency and decreased DNA repair described for the Beijing lineage in the literature [42] may result from the reduction of the CRISPR-Cas system and could be a cause of the variability and drug resistance observed in this lineage. The hypothesized association between reduced or missing CRISPR-Cas systems and DR is consistent with the findings of a recent study of the pathogen Campylobacter jejuni, which demonstrated that the strains causing the most severe gastroenteritis and post-infectious complications have shortened CRISPR-arrays or lack the CRISPR-Cas system altogether [20, 43].
Ural and Haarlem lineages
A typical feature of the Ural and Haarlem lineages is spacer insertions, which occur in the CRISPR array at the locus following the Sp3 spacer. The insertions are found in all Ural isolates but in only some of the analyzed Haarlem isolates. Importantly, we observed the same spacers in some M. bovis and two EAI isolates; therefore, past recombination events and horizontal gene transfer cannot be ruled out. We also observed a few cases of spacer loss or acquisition by the CRISPR2 array in the Ural and Haarlem lineages. For example, 3 Ural isolates lacked the Sp4-Sp6 spacers in the CRISPR2 array, and 2 Haarlem isolates were missing the Sp6 spacer in it.
EAI lineage
Of all M. tuberculosis lineages, EAI has the longest CRISPR-arrays, possibly because it is one of the most ancient lineages. In some isolates, the CRISPR2 array is more than 24 spacers long, and the CRISPR1 array contains over 30 spacers. The largest number of spacers was found in the isolate HN-024: 25 spacers in CRISPR1 and 34 spacers in CRISPR2; some of them were unique.
S and LAM lineages
On the whole, the S and LAM lineages have the canonical M. tuberculosis CRISPR-Cas structure (fig. 2), although some polymorphism can be observed. For example, one LAM isolate was missing the Sp4-Sp6 spacers in its CRISPR1 array, and another LAM isolate had lost the Sp20 spacer from the same array. To sum up, the CRISPR1 array of M. tuberculosis is highly variable and can therefore be conveniently used in genotyping [8]. Although spacer deletions are common, they almost never affect the 10 highly conserved ancestral CRISPR2 spacers Sp1-Sp10 distal to the leader sequence. The same is true for mutations. The protospacers of Sp1-Sp10 remain unidentified. Ancient spacers are usually regarded as barely significant because of the high variability and rapid evolution of the prophages against which they once protected the bacteria; nevertheless, they are intact in all analyzed M. tuberculosis lineages and do not undergo deletions. This suggests another possible explanation: the Sp1-Sp10 spacers are vital for the bacteria, and their role is yet to be elucidated.
THE SEARCH FOR FUNCTIONALLY RELATED PARTNERS AND COMPENSATORY MECHANISMS IN THE BEIJING LINEAGE WITH REDUCED CRISPR-CAS SYSTEMS
The search for functionally related partners and for mechanisms compensating for the functions of cas1, cas2, csm5, and csm6 in the Beijing lineage was conducted by phylogenetic profiling of the genomic sequences of different M. tuberculosis lineages (in total, 130 complete genome sequences available in NCBI were analyzed). A phylogenetic profile (PP) is a binary vector that records the presence or absence of a sequence coding for a protein of interest in each genome of a group of organisms [44]. The underlying assumption is that genes belonging to the same functional pathway evolve together; therefore, genes with similar or inverted PP can be treated as candidate functionally related partners or candidate compensatory mechanisms, respectively.
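The idea of similar and inverted profiles can be shown on a small made-up example. In the sketch below, the genome labels, the gene names other than cas1, the presence/absence data and the `tolerance` parameter are all hypothetical, and Hamming distance is used only as a simple stand-in for profile comparison; it is not the measure reported in this study.

```python
from typing import Dict, List, Set

# Hypothetical gene content per genome (illustrative only, not data from the review).
GENOMES: List[str] = ["nonBeijing_1", "nonBeijing_2", "Beijing_1", "Beijing_2"]
CONTENT: Dict[str, Set[str]] = {
    "nonBeijing_1": {"cas1", "geneA", "geneC"},
    "nonBeijing_2": {"cas1", "geneA", "geneC"},
    "Beijing_1": {"geneB", "geneC"},
    "Beijing_2": {"geneB", "geneC"},
}


def build_profile(gene: str) -> List[int]:
    """Binary vector: 1 if the gene is present in a genome, else 0."""
    return [1 if gene in CONTENT[g] else 0 for g in GENOMES]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of genomes in which the two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


def relation(query: List[int], other: List[int], tolerance: int = 0) -> str:
    """Classify `other` relative to `query` as partner, compensator or unrelated."""
    mismatches = hamming(query, other)
    if mismatches <= tolerance:
        return "partner"        # similar profile: lost and kept together
    if len(query) - mismatches <= tolerance:
        return "compensator"    # inverted profile: present where the query is absent
    return "unrelated"


cas1 = build_profile("cas1")
calls = {g: relation(cas1, build_profile(g)) for g in ("geneA", "geneB", "geneC")}
print(calls)
```

Here geneA follows cas1 exactly, geneB appears only where cas1 is absent, and geneC is present everywhere and therefore carries no signal.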
Using phylogenetic profiling, we identified orthologous gene groups in different M. tuberculosis lineages, constructed binary vectors and a pairwise distance matrix for the vectors, and clustered the PP. Construction and visualization of PP were done in OrthoFinder v.2.0.0 [45] and Count [46]. The pairwise distance matrix was constructed from the mutual information (MI) values: $D_{\mathrm{MI}} = 1 - \mathrm{MI}$. Cluster analysis was performed using the unweighted pair group method with arithmetic mean (UPGMA) [47].
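The distance and clustering steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the OrthoFinder or Count workflow: MI is computed in bits without normalization (so it cannot exceed 1 for binary profiles), the four profiles are hypothetical, and the small `upgma` function is a hand-written average-linkage routine that returns a nested tuple instead of a dendrogram with branch lengths.

```python
import math
from itertools import combinations
from typing import Dict, List


def mutual_information(a: List[int], b: List[int]) -> float:
    """Mutual information (bits) between two binary presence/absence profiles."""
    n = len(a)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            pxy = sum(1 for i, j in zip(a, b) if i == x and j == y) / n
            px = sum(1 for i in a if i == x) / n
            py = sum(1 for j in b if j == y) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi


def upgma(labels: List[str], dist: Dict[frozenset, float]):
    """Average-linkage clustering; returns the tree as nested tuples."""
    size = {lab: 1 for lab in labels}   # cluster -> number of leaves
    d = dict(dist)                      # working copy of the distances
    while len(size) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(list(size), 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                # Size-weighted mean of the distances from a and b to c.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Hypothetical profiles over four genomes (illustrative only).
profiles = {
    "cas1": [1, 1, 0, 0],
    "geneA": [1, 1, 0, 0],   # same pattern as cas1
    "geneB": [0, 0, 1, 1],   # inverted pattern
    "geneC": [1, 0, 1, 0],   # independent of cas1
}
labels = list(profiles)
dist = {
    frozenset((g, h)): 1.0 - mutual_information(profiles[g], profiles[h])
    for g, h in combinations(labels, 2)
}
tree = upgma(labels, dist)
```

Because MI is symmetric with respect to relabeling 0 and 1, an identical profile and an inverted profile are equally close to cas1 under this distance, so both candidate partners and candidate compensators end up in the same cluster; telling them apart requires looking at the direction of the association separately.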
The cluster analysis of PP allowed us to identify genes that had undergone evolutionary events similar to those undergone by cas1, cas2, csm5, and csm6 (fig. 5 A). The loss of some CRISPR-Cas components in a number of Beijing isolates of M. tuberculosis may have been accompanied by at least two evolutionary losses and one acquisition of a genome region (in different regions of the chromosome) (fig. 5 B and C).
The analysis of PP of Beijing M. tuberculosis genomes revealed long deletions specific to this lineage. Because of those deletions, the orthologs of Rv0071, Rv0072, Rv0073 and Rv1761c, Rv1760, Rv1758 (identifiers correspond to the genes in the M. tuberculosis H37Rv genome; see tab. 2) have similar phylogenetic profiles (partner profiles, fig. 5 B). It should be noted that the chromosomal region harboring the genes Rv1761c, Rv1760 and Rv1758 is flanked by inverted repeats of IS6110 elements, which belong to the IS3 family. In the second round of the analysis, a long insertion specific to the Beijing lineage was revealed; because of that insertion, the orthologs of CFBS_RS10335, CFBS_RS10345, CFBS_RS10350, CFBS_RS10355, CFBS_RS10360, CFBS_RS10365, and CFBS_RS21395 (identifiers correspond to the genes in the M. tuberculosis CCDC5079 genome; tab. 2) have phylogenetic profiles closely resembling inverted profiles (compensator profiles, fig. 5 C).
To sum up, our PP analysis has identified a number of genes that have undergone similar evolutionary events. The loss of cas1, cas2, csm5, and csm6 in the Beijing lineage of M. tuberculosis was accompanied by the loss and acquisition of other genes (tab. 2). Those candidate genes have the potential to participate in mechanisms compensating for cas-gene functions or to be their functionally related partners in M. tuberculosis, and they are a subject for further research.
CONCLUSION
The CRISPR-Cas systems of M. tuberculosis vary considerably between the lineages: some (EAI) have long arrays, others (Beijing) are partially reduced. Therefore, the presence of an active type III-A CRISPR-Cas system is not an essential prerequisite for the evolutionary success of a lineage in terms of pathogenicity, virulence, transmissibility and adaptability.
The partial loss of the array and of several cas-genes in the Beijing lineage of M. tuberculosis seems to have resulted in a full or partial loss of the ability of its CRISPR-Cas system to destroy foreign DNA. Disturbances in the functioning of the CRISPR-Cas system in one of the most successful M. tuberculosis lineages may have been accompanied by the activation of mechanisms compensating for the lost genes (for example, our analysis revealed a long insertion in the Beijing lineage) and by the loss of the functional partners of cas-genes. The latter follows from the assumption that a gene that has lost its functional partner is no longer retained by selection and is eventually eliminated from the genome, as illustrated by the detected long deletions specific to the Beijing lineage. It should be noted that the observed patterns of evolutionary losses and acquisitions could be random and require further experimental verification. Phylogenetic profiling provides a basis for generating hypotheses and material for further research.
Although the CRISPR-Cas system of M. tuberculosis Beijing strains may be inactive, it is assumed that in the lineages that have a full set of cas-genes and repeats, the CRISPR-Cas system retains its activity and is capable of contributing to the defense against foreign DNA [34]. Considering the short leader sequence typical of all CRISPR-Cas systems of M. tuberculosis, it may be more productive to focus on the exploration of their alternative functions, such as regulation of gene expression, DNA repair and virulence formation.
The structure of M. tuberculosis CRISPR-Cas systems has been studied and described in great detail [8], but the role of these systems is still unclear. For the majority of mycobacterial spacers, no protospacers have so far been identified among the known mycobacteriophages. This is probably because the majority of M. tuberculosis bacteriophages were isolated from M. smegmatis and only then tested for their ability to infect M. tuberculosis [34]. Different lineages have different sets of spacers and therefore, possibly, different immunity, which means varying degrees of resistance to phages and different regulation of gene expression. Apart from the spacers, the role of the Cas2 protein outside the Cas1-Cas2 complex is of interest in light of the previously proposed hypothesis [15] that, when activated independently, this protein can stop translation and drive the cell into a dormant state or promote cell death. The search for Cas2 inhibitors is also of interest. Particular attention should be paid to the possible functional link between CRISPR-Cas systems, specifically the cas-genes, and TA systems [38]. The possible participation of CRISPR-Cas systems in virulence formation and drug resistance may allow the development of novel approaches to combating drug-resistant strains of M. tuberculosis.