ORIGINAL RESEARCH
Identification of prognostically significant DNA methylation signatures in patients with various breast cancer types
1 Research Centre for Medical Genetics, Moscow, Russia
2 Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia
3 Pirogov Russian National Research Medical University, Moscow, Russia
Correspondence should be addressed: Alexey I. Kalinkin
Moskvorechye, 1, Moscow, 115522; ur.xednay@2akiexela
Funding: the study was supported by the Ministry of Science and Higher Education of the Russian Federation within the framework of the Federal Scientific and Technical Program for the Development of Genetic Technologies in 2019–2027 (agreement № 075-15-2021-1073).
Author contribution: Kalinkin AI — study design, data acquisition, analysis and interpretation, manuscript writing; Sigin VO — manuscript writing; Nemtsova MV — study concept and design; Strelnikov VV — study concept and design, scientific editing.
According to the Global Cancer Observatory (GLOBOCAN), about 2.3 million of new breast cancer (BC) cases and 684,996 deaths from BC were reported in 2020. BC, being the most common type of cancer all over the world [1], is a highly heterogeneous disease with varying molecular and clinical characteristics [2].
Today, BC subtypes are defined by immunohistochemical (IHC) staining of tumor tissue [3], particularly based on the estrogen, progesterone, HER2 receptor protein expression in the tumor and on the cancer cell proliferation rate. The development of methods for gene expression analysis involving the use of DNA microarrays played a major role in determining BC molecular subtypes. The use of the classifier based on the expression of 50 PAM50 genes makes it possible to clearly distinguish luminal A (LumA), luminal B (LumB), HER2enriched (HER2+) molecular subtypes, as well as basal-like or triple-negative breast cancer (TNBC) [4]. TNBC that comprises 15–20% of all BC cases is characterized by aggressiveness, high metastasis rate, frequent relapses, and low survival rate compared to other BC subtypes [5]. Multigene microarraybased test systems make it possible to obtain prognostic information that is important for cancer patients, especially in cases of equivocal predictions made based on the clinical characteristics and IHC markers. Such systems include Mammaprint/Blueprint and Prosigna/PAM50, which, in addition to their predictive value, provide the possibility of division into molecular subtypes [6]. These systems can be used to define high or low risk of relapse in female BC patients, however, this option is not yet available for TNBC and HER2+ molecular subtypes due to a lack of clinical trials.
Epigenetic changes modulate genome utilization through histone modification, changes in histone variant composition, chromatin remodeling, DNA methylation, positioning of nucleosomes and non-coding RNAs (expression of specific miRNAs). For the effect to become manifest, all of the above mentioned epigenetic alterations act in concert. DNA methylation is one of the best known factors of gene expression regulation. It occurs due to covalent modification of cytosines through the methyl group attachment to the C5-positions of cytosine residues in the context of CpG dinucleotides [7]. CpG dinucleotides tend to concentrate in the GC-rich DNA regions known as CpG islands, many of which are located in promoter gene regions and long repeat regions, such as retrotransposable elements or centromere repeats. Methylation of cytosine is mediated by the enzyme class known as methyltransferases (DNMT) [7]. A total of five DNMT family members have been identified in mammals: DNMT1, DNMT2, DNMT3a, DNMT3b, and DNMT3L. DNMT3a and DNMT3b are de novo methyltransferases that interact with non-methylated CpG dinucleotides. DNMT1 is responsible for methylation maintenance during replication in S phase. It has been shown that DNMT3L stimulates de novo methylation that involves DNMT3a and mediates transcriptional repression with the help of histone deacetylase 1 (HDAC1) [7]. Aberrant DNA methylation is associated with a wide range of diseases and appears to be most marked in malignant tumors [8]. Studies of recent years show that every epithelial tumor contains about 10–15 genes inactivated by the genome structural changes and hundreds of genes inactivated by DNA hypermethylation. This demonstrates the importance of this modification for tumor development. Total hypomethylation is one more feature of tumor genomes. This is a genome-wide hypomethylation that results mainly from the loss of methylation at repetitive elements and leads to genome instability and chromosomal rearrangement [8]. The increased promoter methylation in the tumor suppressor genes suppressing various mechanisms of tumor progression, that resuts in epigenetic silencing and reversible inactivation of these genes, plays an important role in BC pathogenesis [8]. Identification of the tumor-specific aberrant DNA methylation patterns can be useful for early diagnosis of cancer, differential diagnosis of malignant neoplasms, in the capacity of prognostic and predictive markers [9]. The study of specific DNA methylation patterns identified by genome-wide analysis makes an important contribution to understanding of BC pathogenesis [10]. As noted above, each cancer type is divided into subtypes. There are genomic patterns, including epigenetic ones, that are typical for these subtypes. Thus, it is necessary to perform specific genome-wide DNA methylation profiling in cancer patients, along with the conventional assessment of the promoter hypermethylation point events in certain genes [11].
Prognosis involves prediction of the possible course and outcome of cancer. Survival analysis that is based on mathematical approach to cancer prognosis makes it possible to predict the likelyhood of staying alive after a certain time. Because of their biological importance and stability, DNA methylation markers are an effective prognostic factor [12]. In one of the studies, the data of the genome-wide DNA methylation analysis of BC samples from The Cancer Genome Atlas Breast Cancer (TCGA-BRCA) database were used to construct a model of seven CpG dinucleotides that made it possible to clearly distinguish breast tumors of all subtypes and normal tissues, and to identify six methylation sites that strongly correlated with overall survival (OS) [13]. The analysis of methylation data from open sources by LASSO regression and boosting revealed 29 and 11 CpG dinucleotides associated with OS, resperctively [14]. The study of data taken from the open source (TCGA-BRCA) also made it possible to identify three genes (TDRD10, PRAC2, and TMEM132C), the methylation status of which had some predictive value, however, this was true mostly for estrogen receptor-positive breast tumors [15]. A prognostic model that comprises five genes (TGFBR2, EIF4EBP1, FOSB, BCL2A1, ADRB2) has been developed for TNBC based on the data obtained from TCGA-BRCA. The model is equally well suited for prediction of OS and disease free survival (DFS) [16].
Research is necessary due to the lack of such signatures for HER2-enriched subtype and a rather limited number of signatures for other BC molecular subtypes. The diagnostic potential of the existing survival prediction models is also uncertain, that is why we have used a modified algorithm to search for CpG dinucleotides associated with all available clinical endpoints found in the TCGA-BRCA database.
The study was aimed to obtain various signatures based on the open data on DNA methylation in BC from The Cancer Genome Atlas Breast Cancer for prediction of various clinical endpoints (overall survival, disease-free survival, and progression-free survival) for BC molecular subtypes and test the relationship between the clinicopathological characteristics and the signatures obtained.
METHODS
The publicly available clinical parameters and the data of the genome-wide DNA methylation profiling obtained using the HumanMethylation450 (HM450) hybridization chips (Illumina Inc.; USA) within the framework of The Cancer Genome Atlas Breast Cancer (TCGA-BRCA) project (https://portal.gdc.cancer. gov/projects/TCGA-BRCA) were acquired and processed using the TCGAbiolinks software package [17]. Inclusion criteria for patients to be used for further selection of candidate CpG pairs were as follows: appropriate BC molecular subtype, availability of accessible clinicopathological information, availability of the DNA methylation profiling data obtained using the Illumina HumanMethylation450 chips. Exclusion criteria: no data on the time values for clinical endpoints, patient's age, TNM stage and grade. Then the results obtained using the patients' FFPE (formalin fixed paraffin-embedded) blocks and cross-hybridization probes were excluded from the profiling data matrix.
Selection of CpG pairs assocciated with OS, DFS or progression-free survival (PFS) was performed by univariate Cox regression method [18]. Of all the selected CpG pairs, those subjected to multiple testing adjustment (adjusted value p < 0.05, Wald test was used) by the false discovery rate (FDR) method were further analyzed. The LASSO Cox regression [19] method implemented in the SurvHiDim software package [20] was used to select the most stable CpG pairs. Multivariate Cox regression [21] was used to calculate the CpG-based signatures and to test the independence between the patients' clinical parameters and these signatures. The ability to classify various outcomes was defined by logistic regression method. Stratification into the high- and low-risk categories was performed using the median. The cvROC (cross-validated receiver operative curve) method [22] was used to test the quality of the models constructed and to plot the ROC curves. The best sensitivity and specificity values were defined by the Youden's index method. The Kaplan–Meier curves were constructed using the survminer software package [23]. The Mantel–Cox test was used to compare two survival curves. The 10-fold cross-validation method was used throughout all stages of marker selection and signature calculation. All the listed above calculations were performed using the R statistical programming language [24].
RESULTS
The studied TCGA-BRCA data set included the DNA methylation profile obtained using the HM450 chips and the clinicopathological characteristics of 735 primary BC samples. After exclusion of paraffin-embedded samples, there were a total of 555 LumA+B subtype (LumAB) samples, 134 TN subtype samples, and 46 HER2-enriched subtype samples (tab. 1). Prior to selection of traits for further analysis, crosshybridization probes were excluded from the methylation data matrix, so that the number of probes reduced from 485,577 to 456,344, respectively.
The next stage of analysis involved using univariate Cox regression to search for methylation sites that correlate with the duration of OS, DFS and PFS in various BC molecular subtypes. After the initial selection with allowance for the multiple testing adjusted p-value, the following probes were selected:
- 10,433 probes associated with OS in the LumAB subtypes, 3214 probes in the TN subtype, 6471 probes in the HER2-enriched subtype;
- 4419 probes associated with DFS in the LumAB subtypes, 168 probes in the TN subtype, 483 probes in the HER2-enriched subtype;
- 2345 probes associated with PFS in the LumAB subtypes, 43 probes in the TN subtype, 3216 probes in the HER2-enriched subtype.
LASSO Cox regression was used for each of the listed sets, allowing for selection of CpG dinucleotides that were most important for analysis. Different numbers of such CpG pairs were identified during each stage of cross-validation. CpG pairs found in more than 50% of cross-validation data splits were selected (tab. 2).
To select the combinations of CpG dinucleotides showing significant correlations with various survival outcomes, we assessed all possible combinations (signatures) of such CpG dinucleotides in various BC molecular subtypes. For each clinical endpoint and BC molecular subtype, cvAUC (crossvalidated area under curve, the average area under curve at all stages of cross-validation), sensitivity, specificity and accuracy were defined for various combinations. The first 10 combinations showing high cvAUC values were tested for independence of the clinicopathological characteristics. The diagnostic characteristics of these combinations along with the number of probes and genes belonging to the probes are provided in tab. 3.
The combination of 12 CpG dinucleotides for prediction of OS in the LumAB subtype was the largest defined combination, while the combination of two CpG dinucleotides for prediction of DFS in the HER2-expressing subtype was the smallest one. The cvROC (cross-validated receiver operative characteristics: ROC curve is plotted at every stage of cross-validation, then the average curve is constructed) curves and the Kaplan–Meier curves were plotted for each signature to show the diagnostic potential and estimate the survival function. The LumAB combinations showed lower cvAUC values (0.76–0.83), while the combinations for TN and HER2-expressing subtypes showed high cvAUC values with fewer nember of combinations (0.83–1) (fig. 1).
Our combinations are independent of the clinical characteristics (tab. 4). This makes it possible to use the risk indicators of these clinical endpoints in any group of patients.
The Kaplan–Meier curve analysis revealed a significant (p < 0.05) decrease in OS, DFS and PFS in the group of patients with high risk of death, relapse and disease progression compared to the group of patients with low risk. This was true for all BC molecular subtypes and all selected combinations (fig. 2).
DISCUSSION
In this study, we considered the possibility of identifying the CpG dinucleotide differentially methylated sites to predict survival outcomes in various BC molecular subtypes using the methods of survival analysis and DNA methylation data. The approach to calculation of differential methylation based on univariate Cox regression is widely used in a variety of studies. Thus, this method was used for identification of 249,810 and 249,811 probes based on DNA methylation data associated with ovarian cancer and BC, respectively [12], and for identification of probes based on DNA methylation data associated with cutaneous melanoma [25].
We have shown that the use of various combinations (2–12 CpG dinucleotides) makes it possible to achieve acceptable (cvAUC 0.7–0.8), high (0.8–0.9) and very high quality (0.9–1) of classification into high and low risk of death, relapse and progression. During the study we have identified 47 probes/ genes (SLC30A7, EXTL2, C15orf41, MIA3, NIPAL3, HEY2, HK1, DIRC3, TMEM41A, SH3BP5L, RFX2, SLC25A39, BAT2, ZNF417, PSMA6, RG9MTD3, ZNF827, ABCC5, HLA-DRB5, HIST3H2A, RERE, SPAG5, cg13447284, cg03512997, LIN54, RASGRP2, LDLRAD3, ZNF643, PKNOX1, KCNMB2, ZFAND1, HDAC9, cg13745678, DPPA5, cg02927111, PKNOX1, SSU72, CADPS2, PEX5L, GSTM4, cg26290926, BIRC5, cg10660854, SLC43A1, BOD1, cg00297843, KCNN1), the methylation of which is associated with OS, DFS and PFS. We failed to define which genes seven probes (cg13447284, cg03512997, cg13745678, cg02927111, cg26290926, cg10660854, cg00297843) belonged to. Five of these probes (BIRC5, PKNOX1, SPAG5, HDAC9, PSMA6) have been previously reported in scientific literature as molecular markers of BC patients' survival based on the data of gene expression quantification [26–31].
It is noteworthy that in our study we found the same number of CpG dinucleotides (HM450 probes) for prediction of OS and DFS (based on five probes) as other researchers [16], but the probes varied according to genes they belonged to. According to our data, the signature for prediction of PFS consists of six probes; no PFS signiture was calculated during the study [16]. Other researchers suggest using invidual markers of promoter hypermethylation in seven genes (RASSF1, BRCA1, PITX2, RARB, PGR, CDH1, and PCDH10) for prediction of OS and DFS outcome in patients with ER+ BC, they also consider using the panel of three genes (GSTP1, RASSF1, and RARB) for prediction of OS based on the literature data analysis (systematic review of the reports) [26], while we have used a strategy of developing panels of six, nine and 12 methylation markers based on the marker diagnostic potential defined by statistical analysis of the experimental data set.
Among genes included in our combinations, attention is drawn to BIRC5 (encodes baculoviral IAP repeat-containing protein 5) that is overexpressed in the majority of tumors, including BC, and is associated with poorer prognosis of overall, disease-free and progression-free survival. It has been shown that the use of taxane chemotherapy drugs may increase the expression of this gene [27]. PKNOX1 (gene of the short arm of chromosome 21 that encodes eponymous protein and plays an important role in embryogenesis) is a tumor suppressor gene, while the increased expression of this gene is associated with poorer survival rate [28]. The increased expression of SPAG5 (encodes protein associated with the mitotic spindle apparatus), associated with poorer prognosis of OS, DFS and PFS only in estrogen receptor-positive (ER+) breast tumors, is also a prognostic factor [29], which is confirmed by our study. The findings of the study that involved ER+ BC samples show that the increased expression of HDAC9 epigenetic enzyme (encodes protein, histone deacetylase 9) in tumors is associated with the poorer prognosis of DFS [30]. Our study shows the association between the abnormal methylation of this gene and survival of patients with estrogen receptornegative (ER–) tumors, and more precisely with TNBC. The reduced DFS of patients with ER+ BC was associated with the increased expression of PSMA6 (encodes proteasome subunit alpha type-6) [31], which was also confirmed by our findings.
CONCLUSIONS
Molecular epigenetic signatures for various BC types were discovered using the survival analysis methods, combinations of various methylation sites, and estimation of diagnostic parameters. This method may be recommended to search for signatures typical for BC and other tumor diseases. In the future the discovered epigenetic signatures may be used to develop the methylation-sensitive quantitative PCR assays. After clinical trials, such assays may become a cheaper and more practical alternative to gene expression microarrays, without reducing diagnostic performance.