ORIGINAL RESEARCH
Clarifying the status of some mutations considered pathogenic using attributes of harmless mutations
1 Bioinformatics Data Processing Department, Genotek Ltd., Moscow, Russia
2 Lomonosov Moscow State University, Moscow, Russia
3 The Core Facilities Center “Genetic Polymorphism”, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
Correspondence should be addressed: Dmitry O. Korostin
ul. Gubkina, d. 3, Moscow, Russia, 119991; d.korostin@gmail.com
The impact of single nucleotide polymorphisms (SNPs) on the phenotype is hard to predict. Existing tools for predicting mutation pathogenicity have a number of flaws, such as sensitivity and specificity that do not exceed 75–80 % for SNPs. In addition, they often do not annotate insertions and deletions [1, 2, 3].
Pathogenic mutations described in experimental articles are collected into databases, such as the Online Mendelian Inheritance in Man database (OMIM, [4]) and the Human Gene Mutation Database (HGMD [5]). However, the term "pathogenicity" can be interpreted broadly, and there is no consensus on what it implies. As a result, different databases apply different criteria when selecting mutations for inclusion; their contents therefore differ and need to be reconciled.
Non-pathogenic mutations are often identified by indirect indicators, such as allele frequency in a population and the effect on the amino acid sequence of a protein. As new data become available, these indicators can help us understand how the existing databases can be improved. Knowing which mutations described as pathogenic meet the criteria for non-pathogenic variants is important for the practical use of the data derived from these databases. This knowledge can also help us understand why certain genetic variants affect the phenotype while others do not.
Scientists who rely on HGMD in their research may not realize that, apart from clearly deleterious mutations, it currently includes harmless ones assessed as pathogenic. In this study, the pathogenicity of mutations included in HGMD was evaluated using bioinformatic tools. Allele frequencies annotated in HGMD were compared to those from the Exome Aggregation Consortium (ExAC) release 0.3 [6]; the effect of HGMD mutations on the amino acid sequence of proteins was analyzed, and their pathogenicity was predicted using the most common bioinformatic tools: snpEff, PolyPhen-2 and SIFT.
METHODS
A public version of HGMD (fourth quarter of 2014) containing 73,208 mutations was used as the source of pathogenic mutations. Their allele frequencies were calculated using snpEff 4.0. The obtained data were compared to the allele frequencies from the Exome Aggregation Consortium (ExAC) 0.3 dataset, which included whole exome and whole genome sequencing data from 60,706 samples of unrelated patients. ExAC provides allele frequency data on six populations: African, Latino, East Asian, South Asian, Finnish and European (non-Finnish). All unidentified samples are grouped as “Other”. At the time of our analysis, the number of genotyped samples for each annotated mutation varied between populations, from about 500 for “Other” to 30,000 for Europeans. Allele frequencies were compared using bcftools [7].
HGMD mutations affecting the amino acid sequence of proteins were identified using snpEff 4.0 [8]. The likely pathogenicity of each mutation was predicted using the PolyPhen-2 and SIFT utilities. These utilities are standard tools for predicting mutation pathogenicity; neither of them used HGMD data as a training set.
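The per-population comparison described above can be illustrated with a minimal Python sketch. It is not the bcftools-based procedure used in the study: it only assumes that an allele count (AC) and a total allele number (AN) are already available for each population, and that populations with too few genotyped samples are skipped. The class name, the function names, the cutoff `MIN_GENOTYPED_SAMPLES` and the example counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class PopulationCount:
    """Allele count (AC) and allele number (AN) of one variant in one population."""
    name: str
    allele_count: int   # number of alternative alleles observed
    allele_number: int  # number of alleles genotyped (2 per diploid sample)


# Hypothetical cutoff; the article does not state the value that was used.
MIN_GENOTYPED_SAMPLES = 100


def allele_frequency(pop: PopulationCount,
                     min_samples: int = MIN_GENOTYPED_SAMPLES) -> Optional[float]:
    """Return AC/AN, or None if too few samples were genotyped.

    Assumes an autosomal variant, i.e. two alleles per genotyped sample.
    """
    genotyped_samples = pop.allele_number // 2
    if genotyped_samples < min_samples:
        return None
    return pop.allele_count / pop.allele_number


def max_population_frequency(
        pops: Iterable[PopulationCount]) -> Optional[Tuple[str, float]]:
    """Return (population, frequency) with the highest usable frequency."""
    usable = {}
    for pop in pops:
        freq = allele_frequency(pop)
        if freq is not None:
            usable[pop.name] = freq
    if not usable:
        return None
    best = max(usable, key=usable.get)
    return best, usable[best]


# Made-up counts for illustration only (not taken from ExAC).
example = [
    PopulationCount("European (non-Finnish)", 600, 60000),
    PopulationCount("Other", 5, 100),  # only 50 samples: skipped
]
print(max_population_frequency(example))
```

Taking the highest frequency across populations is one possible design choice: a variant that is common in any single population is hard to reconcile with a rare, highly penetrant disease.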
RESULTS
snpEff annotation
Mutations obtained from HGMD were annotated by snpEff, and the frequency of each mutation type was established according to the snpEff classification. We found that in many cases a mutation has more than one prediction, meaning that it can be assigned to several types at the same time. This usually happens when a mutation is located within one gene and the adjacent genes are also used for its annotation. For variants belonging to more than one type, we kept the type with the most severe impact according to the algorithm suggested by the snpEff developers (see the table below) [8].
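A minimal Python sketch of this kind of filtering is given below. It assumes that the choice is made by ranking the putative impact categories that snpEff assigns to each annotation (HIGH, MODERATE, LOW, MODIFIER); the exact procedure applied in the study may differ, and the function name and the input format are hypothetical.

```python
from typing import List, Tuple

# snpEff putative impact categories, from most to least severe.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}

# One annotation of a variant: (effect name, impact category).
Annotation = Tuple[str, str]


def most_severe(annotations: List[Annotation]) -> Annotation:
    """Return the annotation with the most severe impact.

    If several annotations share the top impact, the first one is kept.
    """
    if not annotations:
        raise ValueError("variant has no annotations")
    return min(annotations, key=lambda ann: IMPACT_RANK[ann[1]])


# A variant that is missense in one gene and upstream of a neighbouring gene.
variant_annotations = [
    ("upstream_gene_variant", "MODIFIER"),
    ("missense_variant", "MODERATE"),
    ("synonymous_variant", "LOW"),
]
print(most_severe(variant_annotations))  # ('missense_variant', 'MODERATE')
```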
Annotation with ExAC
18,159 (25 %) mutations present in HGMD are described in ExAC.
Results obtained by PolyPhen-2 and SIFT
We predicted mutation pathogenicity using the PolyPhen-2 and SIFT utilities. PolyPhen-2 uses two models for pathogenicity prediction: HumDiv and HumVar. According to the developers’ description, HumVar predicts Mendelian diseases better, while HumDiv is more efficient with complex phenotypes and mildly deleterious alleles [9]. We chose the HumDiv model in order to use a wider definition of pathogenicity. The default thresholds were used to separate pathogenic and possibly pathogenic variants.
PolyPhen-2 annotated 52,248 mutations; 39,032 (72 %) of them were identified as pathogenic and 6,220 (11 %) as possibly pathogenic. SIFT analyzed 53,097 mutations, with 34,638 (65 %) identified as pathogenic and 4,358 (8 %) as possibly pathogenic (with low probability). Both utilities recognized the variants submitted to the database as pathogenic in 70–80 % of cases, which corresponds to their expected performance [2, 3].
DISCUSSION
Using ExAC database as a resource containing data on allele frequency
A technical description of ExAC has not been released yet, but the database is known to include data from both population genetic studies and sequencing projects describing samples of patients with various diseases. We believe that such disease projects contribute fewer samples than population genetic studies, so their effect on the resulting frequency should be negligible, especially when a large number of individuals has been analyzed in the population genetic studies. For this reason, our analysis did not cover mutations that had been genotyped in only a few individuals. With this restriction, we believe that ExAC can be used to estimate allele frequencies in studies such as ours. The developers of this database state that it can be used as a reference set of allele frequencies for disease studies.
Presence of synonymous mutations in HGMD
snpEff assigned 95 % of all mutations obtained from HGMD to two groups: missense mutations and nonsense mutations. However, about 2.5 % of mutations were identified as synonymous (see the table). Although pathogenic synonymous variants have been described in the literature, in most cases synonymous mutations are considered harmless. We focused on this group as the group of variants with the most disputable pathogenicity. PolyPhen-2 does not assess the pathogenicity of synonymous mutations because it relies on the effect of a mutation on the protein amino acid sequence. SIFT does allow synonymous mutations to be assessed; it identified only 4 out of 1,793 synonymous mutations as pathogenic. It is highly probable that the remaining 1,789 mutations (~2.5 % of all mutations in HGMD) are not pathogenic because they do not have any other signs of pathogenicity.
Analysis of synonymous pathogenic mutations in HGMD
Only one of the four synonymous mutations in HGMD identified as pathogenic by SIFT is described in dbSNP [10]. It is the NM_005228.3:c.2361G>A (NP_005219.2:p.Gln787=) mutation with rsid rs1050171. According to Zhang et al. [11], this mutation is associated with lung cancer; its molecular mechanism of action has not been identified yet. The frequency of the alternative (“mutant”) allele A is about 43 %, according to the “1000 genomes” project data presented in dbSNP. The ClinVar database [12] defines this SNP as benign [13]. SIFT probably classifies this mutation as pathogenic because it occurs at a conserved position. The mutation is located at codon position 3, which is usually less conserved than positions 1 and 2 and therefore gets a lower score. However, for this mutation the PhyloP Vertebrate evolutionary conservation score obtained from the UCSC Genome Browser [14], combined with the scores of positions 1 and 2 of adjacent codons, is much higher than the score of other third codon position nucleotides, which is indicative of high conservation of the nucleotide of interest.
The true nature of this mutation is therefore hard to identify. On the one hand, there is evidence that this mutation is non-pathogenic: the data from the ClinVar database, its synonymous type, and the high frequency of the variant allele in the population. On the other hand, the SIFT prediction, the inclusion of the mutation in HGMD and the high evolutionary conservation suggest that this variant is pathogenic. This example illustrates the difficulty of predicting mutation pathogenicity: even manual analysis cannot provide an unambiguous interpretation of the results, because the assigned status of the mutation depends on the choice of analysis tool.
Variants with a mutation present in a heterozygote only
To analyze the mutations absent in the samples in the homozygous state, we have chosen four mutations, each being present in a heterozygote in more than 75 % of samples and in a homozygote in less than 5 % of samples (according to the ExAC data):
- chr1:1650845G>A (rs1059831, gene CDK11A, HGMD phenotype: associated with type 2 diabetes) [14],
- chr2:112614429G>A (rs72936240, gene ANAPC1, HGMD phenotype: protein deficit associated with the risk of cancer) [15],
- chr7:142458451A>T (rs111033566, gene PRSS1, HGMD phenotype: hereditary pancreatitis) [16],
- chr17:7197581G>T (rs189257850, gene YBX2, HGMD phenotype: associated with male infertility) [17].
Variants 2 and 3 have not been found in the homozygous state in any population, and variant 1 has been found in the homozygous state in only one out of 8,209 samples in the South Asian population. Strangely, only 203 samples have been genotyped for variant 4, while about 60,000 samples have been genotyped for each of variants 1–3. For variant 4, only one individual out of 52 in the East Asian population has been described as homozygous, and 13 individuals out of 62 have been described as homozygous in the Latin American population.
These mutations are mainly found in heterozygotes, which can be explained if in the homozygous state they cause death or at least cannot be inherited. Based on the phenotype analysis, the absence of homozygotes for variants 2 and 4 can be explained by early death or infertility of their carriers. Variant 4 is the most interesting one, but it is the only variant that has not been genotyped widely. It is difficult to understand why this mutation is highly frequent in one of the populations and why the number of individuals analyzed in this population is so low. Because the number of analyzed individuals is low, those data may have been obtained by analyzing diseased individuals (see the description of ExAC specifics above), so no predictions for this variant are possible. Variant 2 can be described as lethal in the homozygous state. We suppose that, although it is not obvious that variants 1 and 3 are lethal, the existing data indicate that these mutations cause death or infertility in homozygotes.
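The selection criterion stated above (heterozygous in more than 75 % of samples, homozygous in less than 5 %) can be sketched in Python as follows. The sketch assumes that the numbers of homozygous reference, heterozygous and homozygous alternative samples have already been extracted for each variant; the class name, the function name and the example counts are hypothetical and are not ExAC data.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCounts:
    """Numbers of genotyped samples by genotype for one variant."""
    hom_ref: int  # homozygous for the reference allele
    het: int      # heterozygous
    hom_alt: int  # homozygous for the alternative allele

    @property
    def total(self) -> int:
        return self.hom_ref + self.het + self.hom_alt


# Thresholds taken from the selection criterion in the text.
HET_MIN_FRACTION = 0.75
HOM_MAX_FRACTION = 0.05


def is_heterozygote_enriched(counts: GenotypeCounts) -> bool:
    """True if the variant is heterozygous in >75 % and homozygous in <5 % of samples."""
    if counts.total == 0:
        return False
    het_fraction = counts.het / counts.total
    hom_fraction = counts.hom_alt / counts.total
    return het_fraction > HET_MIN_FRACTION and hom_fraction < HOM_MAX_FRACTION


# Made-up counts for illustration only.
print(is_heterozygote_enriched(GenotypeCounts(hom_ref=100, het=800, hom_alt=10)))
print(is_heterozygote_enriched(GenotypeCounts(hom_ref=500, het=400, hom_alt=100)))
```

Applying the thresholds to sample fractions rather than to allele frequencies keeps the sketch close to the wording of the criterion.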
CONCLUSIONS
Assessing mutation pathogenicity is a difficult task. Sometimes neither automatic nor manual analysis can classify a mutation as clearly pathogenic or harmless. In the absence of experimental data on transgenic organisms carrying the mutation of interest, the existing databases can still be used for pathogenicity analysis, but they should be used carefully. Automatic use of those databases is limited by the quality of the data they contain. It is important to manually check whether the mutations described in experimental works are pathogenic, especially if the claims of their pathogenicity do not agree with the database prediction.