ORIGINAL RESEARCH
Interactions between gene pools of Russian and Finnish-speaking populations from Tver region: analysis of 4 million SNP markers
1 Vavilov Institute of General Genetics, Moscow, Russia
2 Research Center for Medical Genetics, Moscow, Russia
3 Biobank of North Eurasia, Moscow, Russia
4 Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
Correspondence should be addressed: Oleg P. Balanovsky
Gubkina, 3, Moscow, 119991; balanovsky@inbox.ru
Funding: the study was supported by the Russian Ministry of Science and Higher Education (Government Contract # 011–17 dated September 26, 2017). Genotyping and manuscript preparation were done under the DNA-based identification Research and Technology Project of the Union State. Bioinformatic analysis and interpretation of the obtained results were carried out under the State Assignment of the Russian Ministry of Science and Higher Education for Bochkov Research Centre for Medical Genetics.
Acknowledgement: we thank all the donors who took part in this study, the Biobank of North Eurasia for DNA collections and Napolskikh VV, the corresponding member of RAS, for his contribution to data interpretation.
10.24075/brsmu.2020.072
The city of Tver and the adjacent territories situated at the border between Central and Northwest Russia played an important role in the country’s history in general and the interactions between Russian and Western Finnish-speaking populations in particular. Before Slavic colonization, which started around the middle of the 1st millennium, this area was inhabited by Finno-Ugric tribes, predominantly by the Merya. In the early 12th century, a settlement of merchants and craftsmen, which came to be known as Tver, emerged in the estuary of the Tvertsa river. In the middle of the 13th century, Tver rose as one of the 3 Grand Russian Principalities of the Mongol invasion period. For two centuries, Tver was vying with Moscow for the right to unify Russian lands under its rule, maintaining its status as a center of attraction for human resources.
The 15–16th centuries marked the beginning of Karelian migration from the Karelian Isthmus and the areas adjacent to Lake Ladoga, which lie to the north-west of today’s Tver region. In the wake of the Russo-Swedish war, the migration intensified dramatically. By 1670, as many as 25,000 to 30,000 Orthodox Karelians had fled to the lands of Tver. The refugees settled in the areas devastated by famine and chaos during the Time of Troubles. They founded their own compact settlements away from Russian villages. The subsequent waves of Karelian migration were not so massive [1, 2]. Thus, the exodus of Karels from their homeland produced an ethnographic group of Tver Karels who maintained their native Karelian language (the Finnic subgroup of Finno-Ugric languages) for centuries. In 1937, the Karelian national district was established with an administrative center in Likhoslavl. Two years later, in 1939, it was abolished, and the activists of the Karelian movement were arrested. This might have driven some Tver Karels to rethink their ethnic self-identification. According to the censuses, the Karelian population shrank about twentyfold in less than a century, declining from 150,000 people in 1930 (of whom 95% spoke Karelian) to 7,000 in 2010 [3]; still, Karels remained within the borders of their habitat [4].
The fact that 2 ethnic groups, Tver Russians and Tver Karels, have been living side by side for over 3 centuries raises the question of possible genetic admixture between these two populations. This question was partially answered in our previous publication on the analysis of the Tver Karelian gene pool, which we conducted using a panel of 49 lineage-informative Y-chromosome SNPs for Eastern European populations [5]. We demonstrated the genetic similarity of Y-chromosomes between Tver Karels and the indigenous populations of Northeast Europe, especially South Karels and Karelian Veps. The study showed that Tver Karels retained their ancestral Y-chromosomal gene pool throughout more than 10 generations in spite of the dramatic twentyfold population decline and years of mingling with the Russian population. The massive population decline might be explained by a change in the self-identification of Tver Karels and their assimilation by the Russian population. If this explanation is valid, it would be natural to expect that the genome of today’s Tver Russians will contain an increased proportion of Y-chromosomal variants typical of Northeastern European populations in general and Karels in particular. It is known that interethnic marriages between neighboring ethnic groups produce a more stable Y-chromosomal gene pool compared to the autosomal gene pool because the majority of such marriages are patrilocal (a woman moves into her husband’s home village), which results in the geographical migration of mitochondrial DNA and autosomes but no geographical migration of Y chromosomes. Together, these two factors might have caused the autosomal gene pools of Tver Karels and Tver Russians to influence each other strongly and to become more homogeneous than their Y-chromosomal gene pools.
Studying the gene pools of indigenous peoples with a genome-wide SNP panel is essential for cataloging the genetic diversity of the Russian population and identifying the distinct features of regional gene pools. These data are important for pharmacogenomics and forensics. The majority of existing pharmacogenetic protocols have been designed for European populations and may not produce a satisfactory result for the Russian populations which carry other allelic variants; besides, the frequencies of well-studied alleles differ significantly between the ethnic groups living in Russia, similarly to the populations of East Asia and Africa [6, 7]. The studies investigating the frequencies of pharmacogenetic markers in Russian populations have been summarized in a recent review [8]. Data on the gene pools are instrumental in forensic analysis in cases when there is a need to identify the origin of a person using only trace amounts of DNA. Currently, there are a few systems for DNA-based identification, and a few others are still in development, but the key thing is the availability of genetic data on ancestral populations [9, 10].
The aim of this study was to characterize the gene pools of Tver Karels and Tver Russians using a genome-wide panel of 4 million autosomal SNPs and to analyze the gene flow between these 2 populations. Conducted on a large dataset of samples from European Russia, the analysis will serve a more general purpose of exploring the interactions between Slavic and Finnish-speaking populations.
METHODS
This field study of the Russian and Karelian populations inhabiting Tver region followed the method detailed in [11]. The study included only unrelated individuals who did not share a common grandparent (according to the information they provided in the questionnaire) and whose ancestors from at least 2 previous generations had been born in Tver region, self-identified as Russian or Karelian and had no memories of other ethnicities in their ancestry.
Participants were eligible for the study if 1) both of their grandmothers and both of their grandfathers identified as Russian or Karelian; 2) they were willing to give informed consent to participate.
The following exclusion criteria were applied: poor DNA quality or insufficient DNA amount for whole-genome genotyping.
The population of Tver Karels was represented by 11 individuals whose ancestors came from the central part of Tver Karels’ habitat, including Likhoslavl district (n = 4), Maksatikha district (n = 1), Spirovo district (n = 2), and Rameshki district (n = 4). In 1930, the Karelian population of these 4 districts numbered 88,000, amounting to 58% of the total population of Tver Karels (the distribution was as follows: 15% resided in Likhoslavl, 19% in Maksatikha, 8% in Spirovo, and 16% in Rameshki districts). In 2010, there were only 5,000 Karels living in these 4 districts, constituting 78% of the total population of Tver Karels (36% in Likhoslavl, 13% in Maksatikha, 15% in Spirovo, and 14% in Rameshki districts).
Tver Russians were represented by 30 individuals. Since we aimed to study interactions between the Russian and Karelian gene pools, the plan was to compile the Russian subset in such a way that it would represent the areas that did not overlap geographically with the habitat of Tver Karels but were in its vicinity. This strategy appears to be optimal for determining the intensity of gene flow from Russians to Karels for two reasons. On the one hand, the degree of genetic variation between Tver Karels and the populations of remote Russian settlements without a history of direct contact with Karelians might turn out to be too high due to the genetic differences existing between Russian populations. On the other hand, the degree of genetic variation between Karels and Russians living in Karelian villages might be too low, as the Russian villagers might actually be the descendants of Karels who once started to self-identify as a different ethnicity. For extra control, we studied several Russian populations (instead of one) living at various distances from the habitat of Tver Karels. The Eastern population of Tver Russians occupies the area neighboring the habitat of Tver Karels (fig. 1). In the autosomal gene pool analysis, this population was represented by 13 individuals, all born in the Kashin district of Tver region. The Western population of Tver Russians was selected in such a way that geographically it was at a greater distance from the habitat of Tver Karels than the Eastern Russian population. The Western subset comprised 15 individuals born in the Selizharovo district of Tver region. Two individuals from the Torzhok district to the south of Likhoslavl, the administrative center of Tver Karels, were allocated to a separate Southern group. Thus, a total of 41 samples collected from the residents of Tver region were genotyped using a genome-wide SNP panel. Fig. 1 shows the places of origin for each of the 4 grandparents of every participant.
The gene pools of Tver Russians and Tver Karels were compared to the gene pools of the Russian populations from neighboring territories (Arkhangelsk, Vologda, Voronezh, Kursk, Pskov, Novgorod, Smolensk, and Yaroslavl) and of South and North Karels residing in Karelia (n = 16). In total, 27 Karelian, 100 Russian and a number of other East European genomes (Belarusian, Vepsian, Votian, Izhora Ingrian, Lithuanian, and Ukrainian) were analyzed using the same genome-wide SNP panel. The majority of the listed populations were previously studied using panels of Y-chromosome markers [5, 12, 13].
All DNA samples, including those collected in Tver region and those representing the comparison group, were genotyped for a panel of about 4.5 million SNPs using an Infinium Omni5Exome-4 v1.3 BeadChip Kit (Illumina; USA) and an iScan genotyping system (Illumina; USA). Primary data analysis and quality control were carried out in GenomeStudio v2011.1 (Illumina; USA). For all the studied samples, the CallRate value was at least 0.99. Thus, genotypes were generated for 4,559,465 SNP markers.
The obtained genotypes were uploaded to the GG-base [14] and are now available for downloading (RussiansTverKashin, RussiansTverSelizharovo, RussiansTverTorzhok, TverKarelians).
Primary data analysis was performed using classic principal component analysis (PCA), which allowed us to identify the overall structure of the studied gene pools. Shared genetic drift between the studied populations was measured with f3-statistics. The D-statistic was employed to identify the direction of gene flow between the studied populations.
Data filtering was done in PLINK 1.9 [15, 16]. The applied filters are described below.
Prior to PCA, we filtered out SNPs with the genotyping rate of < 95% (geno 0.05) and the minor allele frequency of < 1% (maf 0.01); we also excluded samples with > 10% missing genotype rates (mind 0.1); SNPs that were in high linkage disequilibrium with each other (r2 > 0.2) were pruned using a sliding window of 1,500 SNPs shifting 150 SNPs at a time (indep-pairwise 1500 150 0.2). The output files contained 274,036 SNPs and 126 (of the initial 131) samples. Principal components were computed in EIGENSTRAT smartpca [17, 18] with 5 outlier removal iterations. The results generated by smartpca were visualized using Python 3, pandas [19, 20], matplotlib [21] and seaborn [22] libraries.
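The filtering and pruning steps listed above can be sketched as PLINK 1.9 command lines. The sketch below only assembles the argument lists; it does not run PLINK. The file prefixes and the split into two PLINK calls (one to compute the pruning list, one to extract the retained SNPs) are assumptions made for illustration and are not stated in the text.

```python
from typing import List


def plink_qc_commands(bfile: str, out: str) -> List[List[str]]:
    """Build PLINK 1.9 argument lists for the pre-PCA filters.

    `bfile` and `out` are hypothetical file prefixes chosen by the caller.
    """
    # Filters reported in the text: geno 0.05, maf 0.01, mind 0.1.
    filters = ["--geno", "0.05", "--maf", "0.01", "--mind", "0.1"]

    # Call 1: compute the list of SNPs retained after LD pruning
    # (window of 1,500 SNPs, step of 150 SNPs, r2 threshold of 0.2).
    # PLINK writes the retained SNP IDs to <out>.prune.in.
    prune = (
        ["plink", "--bfile", bfile]
        + filters
        + ["--indep-pairwise", "1500", "150", "0.2", "--out", out]
    )

    # Call 2 (assumed workflow): keep only the retained SNPs and
    # write a new binary fileset for smartpca.
    extract = (
        ["plink", "--bfile", bfile]
        + filters
        + ["--extract", f"{out}.prune.in", "--make-bed", "--out", f"{out}_pruned"]
    )
    return [prune, extract]
```

Each returned list could be passed to `subprocess.run` on a machine where the `plink` executable is installed.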
To prepare the data for ADMIXTURE analysis, the same filters were applied (mind 0.1, geno 0.05, maf 0.01). After that, SNPs pairs with r2 > 0.2 were pruned. The resultant dataset was analyzed in ADMIXTURE v1.3.0 [23]; cross validation errors were calculated for each k.
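Cross-validation errors for different values of k can be compared by collecting the "CV error" lines that ADMIXTURE prints when it is run with the --cv option. The following is a minimal sketch of that bookkeeping; the helper names are hypothetical, and picking the k with the lowest error is a common convention rather than a rule reported in this study.

```python
import re
from typing import Dict

# ADMIXTURE reports cross-validation results as lines such as
# "CV error (K=5): 0.52110".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text: str) -> Dict[int, float]:
    """Collect {k: cross-validation error} from concatenated ADMIXTURE logs."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def lowest_cv_k(cv_errors: Dict[int, float]) -> int:
    """Return the k with the smallest cross-validation error."""
    if not cv_errors:
        raise ValueError("no CV error lines found")
    return min(cv_errors, key=cv_errors.get)
```

Plotting the parsed errors against k is often more informative than the minimum alone, because the curve may be nearly flat over several neighboring values of k.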
F3 statistics measure the genetic drift shared by two populations, i.e. the degree of their common ancestry relative to an outgroup. F3 statistics were computed in qp3Pop (AdmixTools) [24] using a Yoruba population from the 1000 Genomes Project as an outgroup [25]. Apart from the Yoruba dataset, the analysis covered 668 samples genotyped for 3,757,004 markers. The following filters were applied: mind 0.1, geno 0.05, maf 0.01; SNP pairs with r2 > 0.5 were excluded from the dataset. The resultant dataset included 1,144,136 SNPs in a total of 635 samples.
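The idea behind the outgroup f3 statistic can be illustrated with a few lines of code: for every SNP, the allele-frequency differences between each test population and the outgroup are multiplied, and the products are averaged over SNPs. This is a simplified sketch working directly on population allele frequencies; it omits the finite-sample corrections and the block-jackknife standard errors that qp3Pop provides, and the function name is hypothetical.

```python
from typing import Sequence


def outgroup_f3(
    outgroup: Sequence[float],
    pop_a: Sequence[float],
    pop_b: Sequence[float],
) -> float:
    """Naive outgroup f3: mean over SNPs of (a - o) * (b - o).

    Inputs are per-SNP allele frequencies in the outgroup (o) and in the
    two test populations (a, b). Larger values mean that the two test
    populations share more drift after splitting from the outgroup.
    """
    if not outgroup or not (len(outgroup) == len(pop_a) == len(pop_b)):
        raise ValueError("frequency vectors must be non-empty and of equal length")
    total = sum((a - o) * (b - o) for o, a, b in zip(outgroup, pop_a, pop_b))
    return total / len(outgroup)
```

Ranking populations by this value relative to a fixed target population gives an ordering by shared drift of the kind used when populations are listed by their similarity to Tver Karels.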
The D-statistic is a tool for detecting genetic admixtures between 4 populations. In its classic version, the most genetically distant population (an African one) serves as an outgroup, and the test identifies the direction of gene flow between 3 remaining populations. The calculations were performed in qpDstat (AdmixTools) using a Yoruba population as an outgroup. In total, 748 samples and 3,757,004 SNPs were analyzed. The following filters were applied: mind 0.05; geno 0.2; maf 0.01; r2 > 0.6. The resultant dataset included 1,355,253 SNPs in 633 samples.
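For illustration, the D-statistic for four populations W, X, Y, Z can be computed from per-SNP allele frequencies as the sum of (w - x)(y - z) divided by the sum of (w + x - 2wx)(y + z - 2yz). The sketch below also derives a Z score with a delete-one-block jackknife. It is a simplified, unweighted version that assumes contiguous blocks of roughly equal size; qpDstat itself uses a weighted block jackknife over genomic blocks, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Freqs = Tuple[float, float, float, float]  # (w, x, y, z) for one SNP


def d_statistic(snps: Sequence[Freqs]) -> float:
    """D(W, X; Y, Z) from per-SNP allele frequencies."""
    num = sum((w - x) * (y - z) for w, x, y, z in snps)
    den = sum((w + x - 2 * w * x) * (y + z - 2 * y * z) for w, x, y, z in snps)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den


def jackknife_z(snps: Sequence[Freqs], n_blocks: int) -> Tuple[float, float]:
    """Return (D, Z) using an unweighted delete-one-block jackknife."""
    if n_blocks < 2 or len(snps) < n_blocks:
        raise ValueError("need at least 2 blocks and one SNP per block")
    size = len(snps) // n_blocks
    full = d_statistic(snps)
    leave_one_out: List[float] = []
    for b in range(n_blocks):
        start = b * size
        # The last block absorbs any remainder.
        end = start + size if b < n_blocks - 1 else len(snps)
        kept = list(snps[:start]) + list(snps[end:])
        leave_one_out.append(d_statistic(kept))
    mean = sum(leave_one_out) / n_blocks
    var = (n_blocks - 1) / n_blocks * sum((v - mean) ** 2 for v in leave_one_out)
    if var == 0:
        raise ValueError("jackknife variance is zero")
    return full, full / math.sqrt(var)
```

A result is conventionally treated as reliable when the absolute value of Z exceeds 3; the sign of D indicates which pair of populations shares more alleles.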
RESULTS
The position of Tver Russians and Tver Karels in the PCA space constructed from the genome-wide panel of 4,500,000 SNPs is shown in fig. 2. The Tver Karelian sample is closer to the samples of Karelian Karels and lies at some distance from the analyzed Russian populations (Tver, Novgorod, Vologda and Yaroslavl). Only one sample of Tver Karels is genetically close to Vologda Russians. All other samples of Tver Karels cluster together, demonstrating little genetic variation. This clustering is consistent with the results of our previous study, which analyzed Y-chromosome lineages [5] and concluded that the community of Tver Karels retained its ancestral gene pool.
At the same time, the analysis of the autosomal markers included in the panel reveals a shorter genetic distance between Tver Karels and Russians. Fig. 2 shows that samples of North Karels, South Karels, Tver Karels, and Russians together form a clinal gradient. The closest to Karels are Russians from Vologda region; Tver, Pskov and Central Russian populations constitute a single genetic “cloud”. Genetic differences between the Western and Eastern groups of Tver Russians are slight yet discernible and consistent with their geography: the Western population of Tver Russians shares its genetic space with Pskov samples, which is seen in the PC plot, whereas the Eastern population of Tver Russians (Kashin district) remains on the periphery. Remarkably, two samples from the Eastern population join the Novgorod-Yaroslavl group, which the second principal component differentiates from the rest of the Russian populations (fig. 2).
According to PCA, the highest degree of similarity exists between Tver and Karelian Karels but not between Tver Karels and the studied Russian populations. Still, PCA results encourage a hypothesis that the gene pools of Russians and Tver Karels might be characterized by a slight degree of mutual admixture. Of the 3 studied Karelian populations, only Tver Karels are shifted towards Russians, whereas Tver Russians, similarly to the other Russian populations analyzed in this paper, keep their genetic distance from all of the studied Karelian populations. This suggests that the most intense gene flow occurred from Russians to Karels and not the other way around. F3 statistics clarify the degree of genetic similarity between Tver Karels and Eastern European populations: the closest to Tver Karels are Baltic populations, including the Izhora (Inger), the Vote, South Karels, Veps, Lithuanians, and North Karels (listed in descending order). Russian populations are more genetically distant from Tver Karels and can be arranged in the following descending order based on the degree of similarity: Pskov, Novgorod, West Tver, Smolensk, Kursk, East Tver, Yaroslavl, Vologda, Voronezh, and North East Arkhangelsk. Notably, the genetic similarity between Tver Russians and Tver Karels is far from pronounced.
The ADMIXTURE analysis can qualitatively and quantitatively assess the contribution of ancestral populations to a studied gene pool. With ADMIXTURE, it is possible to vary the number of ancestral populations k to detect common ancestral components at varying levels of resolution.
At k = 5 (fig. 3; table), a substantial contribution is visible for only 2 components. Component A is shown in blue; its contribution is the greatest in the speakers of Uralic languages. Component B (Lithuanians, Belarusians, Ukrainians and most Russian populations) is shown in ochre. Component A prevails in Karelian Karels (85%; see table). By contrast, component B is observed in only a few individual samples representing this group but is found in every sample of Tver Karels, comprising 41% of their genomes (see table). Component B occurs twice as frequently in Tver Russians, making up 80% of their genomes. Thus, the results of the ADMIXTURE analysis at k = 5 do not contradict the hypothesis about partial gene flow from Russians to Tver Karels.
At k = 6 (fig. 3), the picture becomes more detailed and complex, now showing the contribution of the bright yellow component C (Karelian genomes; see table). In Karelian Karels, its contribution reaches 100%; it is about half as frequent (52%) in Tver Karels and very rare in Tver Russians (8%) and Pskov Russians (4%), indicating that gene flow from these two groups to Karels was either insignificant or zero. Component C is present in Russians because all Eastern European populations share common ancestry. Surprisingly, component C is detected in other populations inhabiting the territories that neighbor Tver region, including the Russians of Novgorod (39%), Yaroslavl (30%) and Vologda (20%).
At k = 8 (see fig. 3), the chart reflects the differentiation of component C. The bright yellow component (arbitrarily termed “West Finnish”) still looks influential in Karels and Vologda Russians (96% in Karelian Karels, 53% in Tver Karels, 20% in Vologda Russians; this contribution is designated as component E).
However, its contribution to the genomes of other Russian populations is minimal. Perhaps the presence of this component in the genomes of Russian populations does not reflect their recent intermixing with Karels but is evidence of historically distant events, such as the origin of Russian populations from Slavs mingling with autochthonous Finnish-speaking tribes.
Thus, at k = 8 only Vologda Russians are characterized by a prominent (one-fifth of the genome) contribution of the Western Finnish component E. In other Russian populations, where component E is absent, a different component (shown in light gray, I) is observed. Its contribution is the greatest in Novgorod (91%) and Yaroslavl (90%) Russians, accounting for almost the entire genome. Component I also constitutes over one-third of the genome in Tver (39%), Pskov (36%) and Vologda (34%) Russians. This “Novgorod” component also occurs in other studied populations of Central and Southern Russia, making up at least 38% of their genomes.
Based on what proportion of the genome is represented by components I and K (arbitrarily termed “South Russian”), 2 groups of Tver Russians can be identified. Interestingly, the genetic composition of these groups does not match their geography: component K is dominant relative to component I in the Western part of Tver region bordering on Novgorod region (K/I = 63/27), whereas in the Eastern population of Tver Russians the contributions of both components are equal (K/I = 42/42); in Central Tver, the “Novgorod” component I comprises the entire gene pool (100%).
DISCUSSION
This study was conducted using a genome-wide panel of autosomal SNPs. Its findings support the conclusion of our previous study, in which we used a panel of Y-chromosome markers: the genetic distance between Tver Karels and Karelian Karels is shorter than between Tver Karels and their Russian neighbors residing in Tver region [5]. Importantly, it was not only descriptive statistics (PCA, ADMIXTURE) but also D-statistics that underpinned this conclusion. Classically, the D-statistic (a normalized form of the f4-statistic) employs one African population as an outgroup. The method helps to understand the direction of gene flow between the 3 remaining populations; its results are considered reliable at |Z| > 3. The Z scores generated by the D-statistic (Yoruba, TverKarelians; SouthKarelians, TverRussians) for the Eastern and Western populations of Tver Russians were –6.9 and –5.0, respectively. This shows that the gene pool of Tver Karels is closer to the gene pool of Karelian Karels than to the gene pool of Tver Russians. At the same time, there is more pronounced genetic similarity between Tver Karels and the studied Russian populations than between the studied Russian populations and Karels from South (and certainly North) Karelia. Z scores for Tver Karels become statistically significant if the analysis includes more southern (relative to Tver) populations of Russians. For example, the D-statistic (Yoruba, RussiansSmolensk; TverKarelians, Karelians) produces Z = –3.4 for the Russian populations inhabiting the south of Smolensk region. This indicates that the gene pool of Tver Karels, which, on the whole, is similar to that of Karelian Karels, is reliably close to the gene pools of Smolensk Russians and other Russian populations.
To sum up, assuming that initially the ancestors of Tver Karels and Karelian Karels existed as a single population [1, 2, 4], D-statistics prove that later the ancestors of Tver Karels accepted the genetic contribution of populations inhabiting the southern territories of the East European plain. Eastern Europe has witnessed a lot of complex migration patterns, so the source of southern admixture in Tver Karels cannot be identified with absolute certainty; however, history suggests that the best candidates here are the Russian populations of Tver and the neighboring regions.
CONCLUSIONS
We have studied the gene pools of Tver Karels and Tver Russians using a panel of 4,500,000 autosomal SNPs and compared them to the samples of Karelian Karels and the inhabitants of Russian regions bordering on Tver (Pskov, Novgorod, Vologda, Yaroslavl). The applied statistical methods (PCA, ADMIXTURE, D- and f3-statistics) generated consistent results.
The gene pool of Tver Karels retains its similarity to the gene pool of Karelian Karels despite a long (300 to 500 years) history of living among the larger Russian population and the twentyfold population decline during the 20th century. At the same time, the gene pool of Tver Karels exhibits greater similarity to the Russian gene pool than the gene pools of the other analyzed Karelian populations do. Having compared the findings of the analysis of autosomal SNP markers (a partial shift towards the Russian gene pool) and the previously obtained results of genotyping for Y-chromosome markers (no detected admixture between Tver Karels and Russians), we conclude that gene flow between Russians and Tver Karels was predominantly determined by marriages between Karelian men and Russian women.
Demographic data (the sharp decline in the Tver Karelian population) and historical events suggest that Tver Karels changed their ethnic self-identification and were assimilated by the Russian population. Therefore, it would be logical to hypothesize that the genome of Tver Russian descendants of Karels who once changed their ethnic self-identification contains a greater Karelian genetic component. This, however, is not the case: Tver Russians are as genetically distant from Karels as Pskov Russians.