METHOD
The approach to patient clustering based on the microchip data confined to distinct loci using the combinations of variants
Lopukhin Federal Research and Clinical Center of Physical-Chemical Medicine of the Federal Medical Biological Agency, Moscow, Russia
Correspondence should be addressed: Elena I. Sharova
Malaya Pirogovskaya, 1 str. 3, Moscow, 119435, Russia; moc.liamg@87avorahs
Funding: the study was supported by the President of the Russian Federation (grant for young postdocs МК-2951.2022.1.4)
Acknowledgements: the authors thank dbGaP for providing access to the phs000421.v1.p1 and phs000001.v3.p1 datasets. The dataset with the dbGaP registration number phs000421.v1.p1 was obtained from the genetic study of Fuchs' endothelial corneal dystrophy (FECD) available from https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000421.v1.p1. The authors recognize the grants that have been used to support registration of cases and controls and would be used in this GWAS: R01EY016514 (DUEC, PI: Gordon Klintworth), R01EY016482 (CWRU, PI: Sudha Iyengar) and 1X01HG006619-01 (PI: Sudha Iyengar, Natalie Afshari). The authors express their gratitude to the FECD study participants and the FECD research team for their valuable contribution to this study. The dataset with the dbGaP registration number phs000001.v3.p1 was obtained from the Age-Related Eye Disease Study (AREDS) database available from https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1. AREDS was funded by the National Eye Institute (N01-EY-0-2127). The authors thank the AREDS participants and research team for their valuable contribution to this study. The authors express their gratitude to L.O. Skorodumova, research fellow at the Lopukhin Federal Research and Clinical Center of Physical-Chemical Medicine, for valuable suggestions, comments, and support.
Author contribution: Sharova EI — concept and selection of data; Sharova EI, Iulmetova LN — planning and selection of methods; Kulemin NA — project funding and management; Iulmetova LN — design and computation; Sharova EI, Iulmetova LN, Kulemin NA — discussion, manuscript writing and editing.
Compliance with ethical standards: the study was performed according to the principles of the Declaration of Helsinki using the data of the phs000421.v1.p1 and phs000001.v3.p1 projects, the access to which was approved and provided by dbGaP in accordance with the policy of approval and access to specific datasets.
Finding a biological basis for the inheritability of phenotypes is one of the main tasks of modern medical genetics. Generally, approaches aimed at the detection of pathogenic genomic variants can be divided into two categories: biological and mathematical. Biological methods include the approaches that explain phenotypes based on the studied biochemical processes. When it is impossible to directly trace the biochemical pathway underlying phenotype formation, but the disease shows a familial tendency, various statistical approaches are applied: genome-wide association studies (GWAS) [1], polygenic risk score (PRS) [2], haplotype identification approaches [3], and other methods. Basic GWAS methodology performs single nucleotide polymorphism (SNP) association testing to identify SNP loci exceeding a genome-wide significance p-value threshold. Thus, the GWAS results for any disorder representing a combination of rare inherited mutations could be inaccurate, since the number of rare polymorphisms don’t meet the significance criteria. The PRS approach might be considered as an extension of GWAS, however, it also evaluates the effect of each SNP independently. For some disorders the genetic basis can’t be explained by biological or popular statistical methods. The inheritance of such phenotypes is based on the haplotype architectures. We define a haplotype as a linear combination of a certain number (up to several hundred) of the linked variable variants that together forma small number (less than 100, 10–20 on average) of allele variants. The approach involving identification of specific haplotype variants is actively used in pharmacogenetics for analysis of P450 cytochromes. For example, there are more than 120 haplotype variants for CYP2D6 resulting from more than 500 polymorphisms [4]. However, this approach is extremely rarely discussed with reference to the majority of loci of polygenic diseases.
GWAS is commonly applied to the nervous system disorders, polygenic developmental disorders and neurodegenerative diseases, such as amyotrophic lateral sclerosis, Parkinson's disease, schizophrenia, autism spectrum disorders. GWAS method allows to identify genome regions, the alterations of which are overrepresented in affected individuals relative to the general (control) population. GWAS also handles the structural variations that can’t be detected directly by the chip SNPs but are in linkage disequilibrium with those ones.In particular, the amyotrophic lateral sclerosis GWAS detects the C9orf72 gene locus comprising the G4C2 expanded six-nucleotide repeat (GGGGCC) [5], however, the repeat variants are not detected directly with the chip. The Huntington's disease GWAS reveals the chromosome 15 HTT gene locus comprising trinucleotide repeats [6], but there are no probes matching the repeat region in the chip.
Fuchs Endothelial Corneal Dystrophy (FECD) is a hereditary eye disease characterized by a decrease in the number of corneal endothelial cells that maintain the corneal stroma water balance. FECD is a polygenic disease that is of considerable interest for genetic research [7]. There are two FECD forms: early onset and late onset FECD. These forms have different genetic bases. Early onset FECD is diagnosed at the age below 50 and is a very rare disorder associated with the COL8A2 gene pathogenic variants [8]. The late onset FECD manifests at the age greater than 50 and it is the most common form of FECD. It was shown that late onset FECD is associated with the intronic CTG18.1 trinucleotide repeat expansion in TCF4 [9]. According to our data and the data provided by foreign authors, the CTG18.1 intronic trinucleotide repeat expansion in TCF4 is the most common FECD-associated variant among Caucasoid populations. The expansion of at least one allele of CTG18.1 trinucleotide repeat was detected in approximately two thirds of the FECD patients in European descent cohorts. Later Afshari et al. [10] made an attempt to find other variants associated with FECD in a bigger cohort also using GWAS. They confirmed the association of the TCF4 locus and identified three new loci in the genes KANK4, LAMC1 genes and near the ATP1B1 gene, however, their independence from trinucleotide repeat expansion was not tested [10]. The role of mutations in ZEB1 [11], SLC4A11 [12], AGBL1 [13], and LOXHD1 [14] in the development of FECD is also discussed. The question, whether FECD is a set of phenocopies or a polygenic disease, still remains open. The reported asymptomatic carriers of the repeat expansion [9, 15] and the disputable nature of the clear monogenic link of FECD to some other genes suggest that late onset FECD is a set of polygenic phenocopies. This makes it similar to other late onset repeat expansion diseases.
Thus it leads to the question if it's possible to split the patients into groups within the loci using GWAS results and what accuracy can be achieved. And is it possible to stratify late onset FECD patients by expansion/no expansion based on the microchip-based data? Are the haplotype stratification results and simple patient grouping based on the minor allele of SNPs comparable for these purposes? The study was aimed to develop and test the approach of dividing patients into groups based on the chip-based genotyping and genomewide association study (GWAS) results.
METHODS
The analysis was carried on dbGaP datasets corresponding to two studies: the FECD Genetics Multi-center Study [16] and the Age-Related Eye Disease Study (AREDS, Refractive Error Substudy) [17–18]. All samples were genotyped on Illumina HumanOmni2.5-4v1 arrays. Clinical manifestations of the disease were classified using a modified Krachmer grading scale based on the slit lamp biomicroscopy data [19].
Both sample-level and variant-level quality control (QC) was performed. The genotyping data were preprocessed using the PLINK 1.9 software [20], GRAF 2.4 [21–22], and code written in R version 4.1.0.
First genotypes with GenCall (GC) scores below 0.3 were removed. Subsequent QC selected markers met the following criteria: missing genotype rate < 10%, minor allele frequency > 1%, number of Mendel errors, a Hardy-Weinberg Equilibrium p-value > 1 × 10–10 for control samples and p-value > 1 × 10–15 for FECD patients. Duplicate markers, i.e. markers with different IDs but identical genetic positions and allele coding, were detected and analyzed separately. Both markers of each pair of duplicates were excluded from consideration. One marker with the lowest missing genotype rate was excluded from each pair of duplicates showing 10 differences or more. A total of 1,580,746 SNP markers were included in the analysis after applying all the filters.
The following inclusion criteria were defined for the group of FECD patients: age 47 or older; keratoplasty in at least one eye or grade 2 or above disease (according to the modified Krachmer grading scale) in at least one eye.
Inclusion criteria for the control group: age 60 or older; normal cornea with no epithelial, endothelial, or stromal abnormalities except corneal injuries.
Exclusion criteria: samples with Mendel errors, samples with mismatch between annotated and genetic sex (determined based on the X chromosome heterozygosity rate and Y chromosome genotype counts); samples with genotype missingness above 5%; relatives up to the second degree of relationship (according to GRAF-rel).
The population structure was estimated using GRAF-pop in order to obtain a genetically homogenous sample. The samples identified as outliers in the genetic distance coordinates were filtered out. The patients were divided into groups according to the potential carrier state of repeat expansion in three stages:
Stage 1: selection of significant variants;
Stage 2: clustering the study participants based on the haplotypes/combinations of the selected variants, calculation of the repeat expansion rate;
Stage 3: evaluation of the concordance between the obtained repeat expansion rate and the percentage of the repeat expansion carriage according to the experimental data reported in previous studies. The repeat allele was considered as expanded if the number of the repeats was ≥ 40 and as unexpanded if the number of the repeats was < 40.
In the first stage, variants were tested for association with FECD using logistic regression with sex and the first six principal components as covariates. p-values were adjusted for multiple testing using the Benjamini–Hochberg method. The chromosome 18 (carrying the locus with the repeats) variants were first filtered by p-value < 1 × 10–15. For comparison with the haplotypebased approach, three SNPs showing the lowest p-values in the resulting set of variants were considered as the potential markers of the increased number of repeats. Additionally marker pruning based on LD (r2 > 0.6) was performed. The genotype matrix was encoded according to the dominant inheritance model.
In the second stage, we used the assumption that the patients with the repeat expansion in the TCF4 gene carried the certain combinations of SNPs. We expected that the FECD samples would cluster within the TCF4 locus based on the haplotypes and the combinations of individual variants. However, individuals with phenocopy due to expansion would fall into common clusters based on the similarity of the combinations of minor variants. Asymptomatic control repeat carriers from the control sample (2‒10%) and a fraction of the control sample carrying minor haplotypes with no repeats would fall into the same clusters. Furthermore, the combinations of major variants and haplotypes showing predominance of major alleles would form clusters mostly of the control sample representatives. However, these clusters would also include some FECD patients with phenocopy and some patients with the expansion no longer linked to minor haplotypes (in 7% of FECD patients, linked haplotypes and repeats sometimes break apart, which has been earlier demonstrated for the rs613872 variant [23]). That is why the percentage of FECD patients and subjects with no FECD can be used as a surrogate marker of the carriage of the repeats in specific clusters
Agglomerative hierarchical clustering implemented in the hclust function of the stats R package was used for clustering. The algorithm arranges the data into a tree representation by merging the pairs of clusters with the minimum distance into a new cluster. The algorithm takes the matrix of pairwise distances between the points (samples) as input; initially each point represents a distinct cluster. Since the haplotypes are not identical, we expected that there would be more than two clusters, while the optimal number of clusters was defined by the Silhouette metrics.
For each cluster we calculated the percentage of patients and controls. Clusters with a patient predominance we considered associated with FECD. For the selected three SNPs, carriage of the minor allele was considered as a marker of the repeat expansion carriage.
To evaluate the resulting partition, we calculated the odds ratio from the estimates of the expansion status in each group.
The last step was to compare the results obtained by the proposed approach with the experimental data reported in the previous studies. We selected studies based on the following conditions:
The number of repeats in TCF4 was determined by means of fragment analysis or triplet repeat primed PCR.
The study participants were individuals of European ancestry.
The sample size was at least 50 people for each comparison group.
RESULTS
After the quality control procedure, the discovery dataset consists of 3,660 samples of European ancestry (tab. 1) and 1,580,746 SNP markers.
Since GWAS was performed on the same cohort of samples that were studied in the Afshari et al [10], its results (fig. 1) are comparable to those described in the article. The genomic inflation factor was 1.05, which indicated slight population stratification.
For further analysis, only the locus of chromosome 18 was considered. Filtering by p-values resulted in 134 SNPs, three of which with the lowest p-values were rs784257, rs72932578, and rs618869 (and according to gnomAD v3.1.2, the frequencies of the С, T, and С minor alleles in the European population are 0.17932, 0.05649, and 0.13451, respectively). These variants were further tested in terms of dividing patients into groups.
The haplotype block size was 50 variants left after pruning. After clustering the samples were divided into 10 subgroups (fig. 2).
Clustering has shown that clusters with a predominance of control group participants are homogeneous in terms of their representation. However, three clusters with the potentially increased repeat counts (in which FECD patients prevail) are heterogeneous in terms of haplotypes. This is reflected by the uneven distribution of patients with various phenotypes within each cluster. This may be due to both asymptomatic carriers of the increased repeat number in this locus and the resolution of the population-variable chip SNPs that is not enough for accurate division of samples based on the repeats of varying length.
Our analysis has shown that that the proportion of people from clusters with a presumptive carriage of expanded repeats in the group of samples with FECD is significantly higher than in the control group (tab. 2). Furthermore, the calculated rate of probable repeat expansion carriers varies significantly depending on the selected method (prediction of expansion based on the haplotypes/combinations of variants or based on the genotypes of certain variants with low p-values).
To verify the results obtained we have selected the studies involving experimental determination of the repeat expansion. The number of repeats is routinely defined by conventional fragment analysis or triplet repeat primed PCR with subsequent fragment analysis. A total of five papers with appropriate samples have been found (tab. 3).
To compare the predicted and reported frequencies of the expansion carriers we have merged the samples from the papers. Comparative analysis has shown that markers reproduce the frequency of the expansion carriers in the comparison groups to a different extent (fig. 3).
None of the applied approaches represent the repeat frequency in the group of patients and the control group accurately enough compared to the results of direct typing reported in the papers (tab. 4). However, the haplotype-based approach outperformed the SNPs in terms of odds ratio by covering the 95% confidence intervals of the samples used in two studies.
It is interesting to note that the individual variants we have considered produce extremely discordant results (fig. 4), i.e. quite different people are carriers of minor allele in these variants, which makes the applied metrics volatile. rs784257 differs most from the haplotype-based approach in terms of the allele carrier state, it is also the most significant variant according to the GWAS results. At the same time, it shows the maximum discrepancy in the proportions of potential expansion carriers in the control group and no better correspondence with the FECD group. This allele is most likely to show weaker linkage to the repeat carrier state than the other two alleles.
DISCUSSION
Molecular genetic stratification of patients with polygenic diseases is a useful tool for studying the disease genetics. Furthermore, there could be patients with the groups of causal variants linked to various haplotypes within one phenotype. Despite the fact that the gene is definitely associated with the disease, p-values of the variants would be higher due to the large number of the groups of linked variants, i.e. the variants that are significant for every group do not surpass the generally accepted significance threshold (p-value < 5 × 10–8) due to the features of the disease genetic basis. Moreover, many loci, the significance of which is close to the generally accepted threshold, are characterized by the marked sparseness of the significant variants, under which only a few variants are strongly
associated with the disease. Thus, it is impossible to choose between genomic variants as the population outliers (the significance of which results from random population frequency shift) or assigning these variants to the potentially significant group of variants. That is why the genetic data structuring methods that involve assessing interactions between both variants within haplotype blocks and haplotype blocks are a promising tool for the disease genetic basis clarification.
GWAS makes it possible to obtain more information about the disease genetic basis than exclusion of variants based on p-values and loci formation with reference to the nearest gene that represents the transition from the "variant" level to the "gene" level. However, questions remain about unequal contribution of various loci to the genetic basis of the disease in specific groups of people with the same phenotype. This is due, among other things, to the lack of advanced approaches to formation of the combinations of variants, i.e. to working at the intermediate level between the "variant" level and "gene" level. Since the variants show incomplete linkage, it would be reasonable to consider the sets of haplotypes/combinations of variants that define the differentiated disease risk instead of the specific risk haplotypes or protective haplotypes. This means that the variant with the highest population attributable risk (combination of allele frequency and relative risk) is likely to be the most significant one in the locus.
Assessment of the groups of haplotypes linked to the causal variants is still a challenging task, however, it more and more often outperforms GWAS, even despite the lack of the high throughput standard approaches. The GWAS performed in 2005 showed that the CFH gene was associated with agerelated macular degeneration [27]. Later it was reported that this association was not confined to individual variants and was also observed in the groups of patients with structural alterations, such as partial deletions of the CFHR1-5 genes [27].
Furthermore, it was found that most of the variation attributed to individual variants was in fact the marker of haplotypes showing large-scale structural alterations in this region. And these are haplotype variants of the locus structure, including those with different population abundance, that show much stronger correlation with the risk of retinal degeneration than the majority of individual variants in this locus [28].
In this study we have implemented sample clustering by variants of the region containing the expansion based on the data on the association of individual variants with the repeats [14, 23, 29], particularly, allele G of the rs613872 variant, and haplotype blocks [29]. After clustering the samples of the group of patients with FECD turned out to be distributed unevenly across the clusters, which was indirect evidence of clustering by haplotypes linked to the expansion. All clusters except one (cluster 3) had a clear status of the repeats. Uncertainty in defining the status was due to parity between patients and controls in the cluster. In the future we have to decide what to do with such clusters: re-cluster people in these clusters in the case-by-case manner or leave them with uncertain status. It is also necessary to select another clustering metrics. This requires additional data that include both sample genotyping results and information about the repeat length. Regardless of these limitations, the results obtained using the haplotypebased approach were better than the results shown by individual variants. However, these results turned out to be not precise enough to consider our method optimal.
This work accomplished two goals.
- Initial testing of an approach that allows stratification of patients and control groups at the intermediate level (not the level of a single variant and not the level of the gene closest to the locus) without first understanding the haplotype structure of the locus. The proportion of patients with FECD and control samples in clusters is used as a measure, allowing this approach to be used for diseases in which the approximate proportion of individuals with a phenotype closely related to or due to changes in a given locus is not known in advance.
- Obtaining a subsample of patients with FECD and no expansion carrier state for precision re-analysis of GWAS in order to refine genetic structure in this particular category of patients.
In the future, patient clustering will make it possible not only to allocate groups within the phenotype showing a strong contribution from distinct genetic variants, including structural variants, but also to propose the basis and approaches to predicting the patients' responses to various types of therapy.
CONCLUSIONS
The study has shown the possibility of using the haplotypebased approach for genetic stratification of patients based on the cause of the genetic disorder, namely the presence of the repeat expansion. The findings have made it possible to draw the following conclusions: 1) the haplotype-based approach is better suited for detection of the association of loci with certain groups of patients than individual variants; 2) for a more accurate picture we should reconsider the approach to defining the haplotype composition and modeling the data matrix for clustering. In particular, it is planned to analyze some methods of computing the genetic similarities (genetic distances (genetic distances) among samples and apply more specific methods for initial selection of variants; 3) the results obtained show that clustering splits the patients with FECD and the control group based on the groups of haplotypes/combinations of variants associated with the repeat expansion. Further testing of the approach requires additional evidence base that demands the use of more validation data.