- Brief Report
- Open access
Compilation of all known HERV-K HML-2 proviral integrations
Mobile DNA volume 16, Article number: 21 (2025)
Abstract
Human endogenous retroviruses (HERVs) occupy 8% of the human genome. Although most HERV integrations are severely degenerated by mutations, the most recently integrated proviruses, such as members of the HERV-K HML-2 subfamily, partially retain regulatory and protein-coding capacity. The number of HML-2 proviral copies reported in the literature for the modern human population keeps changing as new integrations are uncovered. The first comprehensive list of HML-2 proviral loci was compiled in 2011 and included a total of 91 proviruses. Since then, multiple articles have published additions and modifications to that list, mainly in the form of new polymorphic proviral sites, updated chromosomal band characterizations or the corresponding coordinates in the newer version of the human reference genome. In the present study, we systematically searched the literature for lists of HML-2 proviruses and their coordinates and cross-examined the information for every proviral locus, also against the human genome. We gathered all available data about all HML-2 proviral integrations identified to date and updated, corrected and refined the coordinates in both human genome assemblies currently used in research, so that they incorporate the whole provirus in each case. We thereby present an exhaustive (to date) catalogue of all known HML-2 proviruses and their respective coordinates, as a tool for studies aiming to decipher the role of HERVs in health and disease, especially for high-throughput data analyses, which could lead to the discovery of links between specific HERV integrations and biological mechanisms or medical disorders.
Introduction
A little less than half of the human genome is occupied by transposable (or mobile) elements (TEs). These elements are able to self-propagate and generate a multitude of new copies of themselves throughout the DNA of the host organism [1]. The origin of eukaryotic TEs can be traced back to eubacterial transposons and retroelements [2, 3]. Probably the most complex structure and copying mechanism have evolved in LTR (Long Terminal Repeat) retrotransposons, which are candidates for being the ancestors of retroviruses, following the acquisition of fusogenic env or env-like genes [2]. Retroviruses may invade the host germline cells and become endogenized [3]. Human endogenous retroviruses (HERVs) arose from the chromosomal integration of ancient retroviruses into the germline cells of human ancestors; such integrations became heritable when they were not fatal for the developing embryo or for the germline cell itself. Importantly, some HERV sequences, either protein-coding or regulatory, have been co-opted for essential cellular and organismal functions [4]. Initial retroviral insertions typically deposit a full-length provirus that is capable of copying and pasting itself, thereby creating new genomic insertions even long after endogenization. Over human evolutionary history the rate of production of new HERV insertions has declined [4], with only one HERV lineage known to have produced new insertions possibly as recently as 100,000 years ago [5]. The HERV-K HML-2 (Human endogenous MMTV-like 2) subfamily has been shown to be the most recently active HERV group, containing the youngest known proviruses and a plethora of insertionally polymorphic members [5]. Although many attempts have been made to discover de novo HERV insertions, which would indicate that an endogenous retrovirus is still active in the human germline, no such activity has been detected [6].
The HML-2 group is investigated particularly intensively because it comprises proviruses with full-length retroviral ORFs (open reading frames) and intact regulatory elements. Intriguingly, four proviruses (HERV-K115 at 8p23.1, HERV-K119 at 12q14.1, HERV-K113 at 19p12b and De9 at Xq21.33) have uninterrupted ORFs for all genes [6, 7]. This could be an indication not only of retrotranspositional activity but also of infectivity, since the presence of a full-length env gene may indicate a capability for virulence. Indeed, retroviral proteins produced by one HML-2 integration could complement retroviral proteins produced by another HML-2 integration and possibly form a viral particle. Although HML-2 virus-like particles (VLPs) have been isolated from early-stage embryos, placental and cancerous tissues, no proof of HML-2 VLP infectivity has been found [8, 9]. Still, evidence of HML-2 transcription has been found in numerous pathological conditions including neurodegenerative diseases, autoimmune disorders and cancer [10].
The investigation of HML-2 involvement in the development, progression and outcome of a pathology requires awareness of the HML-2 loci residing in the human genome. Two major types of HML-2 integrations are found in the human genome: full-length or almost full-length proviruses and solo LTRs. Full-length HML-2 proviruses incorporate a set of core retroviral protein-coding gene clusters (i.e., gag-pol-env), flanked by LTRs on both ends. Solitary LTRs (solo LTRs) emerge as a consequence of homologous recombination between the LTRs and the subsequent deletion of the internal sequence [11]. The sequences of full-length HML-2 proviruses have been subject to evolutionary forces that led to degeneration over time through point mutations or large-scale insertions/deletions. As a result, some proviruses, especially the oldest integrations, have either lost parts of their genome, or incorporated non-HML-2 sequences (or both). The first extensive list of HML-2 proviruses was compiled and published by Subramanian et al. [12] in 2011 (from now on “Subramanian list”). To categorize an HML-2 integration as a provirus, the authors stated that it should have at least one LTR plus a part of the internal protein-coding sequence in the same orientation [12]. According to this criterion, the authors identified 91 proviruses and established a universal nomenclature based on chromosome band location in the human genome, as earlier names for HML-2 loci varied among research groups. Since then, a number of new HML-2 proviral loci have been identified, mainly polymorphic ones that are found in certain human subpopulations or individuals [6, 13]. Moreover, more recent catalogues of HML-2 proviruses have been published [6, 14,15,16], although none is exhaustive. Thus, the Subramanian list has been for over a decade the gold standard for HML-2 studies and an invaluable tool for proviral genomics.
Apart from the discovery of additional proviral insertions, which in some cases fall in the same chromosomal band as other, already recognised proviruses, the HML-2 proviruses in the Subramanian list are described by chromosomal positions in the genome assembly GRCh37.p13 (hg19) of the Human Genome Project, whereas positions in the currently used GRCh38.p14 (hg38) are slightly shifted. Moreover, in some cases additional information about HML-2 proviral sites has been published in recent literature, for example regarding polymorphic alleles of certain known integrations, or pericentromeric HML-2 integrations [13, 17,18,19]. In addition, Bendall et al. [20] created an annotation catalogue (from now on “Bendall S1 File”) containing HERV provirus annotations (including HML-2) based upon information deduced from the Repbase database [21] and the RepeatMasker (developed by A.F.A. Smit, R. Hubley, and P. Green; see http://www.repeatmasker.org/) track in the UCSC browser on the human (GRCh38/hg38) genome. The lengths of some of the proviruses contained in this file, which is nonetheless concordant with the Subramanian list [20], differ from (and in most cases exceed) the lengths provided for the same proviruses in the Subramanian list. Also, the Bendall S1 File contains four novel HML-2 proviral insertions that are not only missing from the Subramanian list but also have never been mentioned in the literature (discussed below). On the other hand, this file lacks some polymorphic and pericentromeric HML-2 proviruses that are also absent from the reference genome.
After careful comparison of the existing HML-2 proviral lists, extrapolation of the chromosomal positions of proviruses in both genome releases (hg19-hg38), and thorough inspection of all possible polymorphic sites found in the literature, we provide an updated catalogue of all HML-2 proviruses that have been identified in the human genome to date. The catalogue contains, among other data, updated chromosomal locus nomenclature, HML-2 proviral coordinates in both genome releases, GenBank accession numbers for complete proviral genomes, and information on polymorphic alleles. This is the most up-to-date and extensive compilation of HML-2 proviruses currently available, and it can serve as a useful tool for further studies of HERVs.
Materials and methods
Assignment of correct chromosomal coordinates to the HML-2 proviruses
For each proviral integration, we compared the proviral length that resulted from the subtraction of coordinates in hg38 (provided in the Bendall S1 File [20] and also from a list of HML-2 proviruses with hg38 coordinates published by Chabukswar et al. [14]) with the respective length that resulted from the coordinates in the Subramanian list. If the lengths were equal, we assumed that the coordinates in hg38 were correct. In some cases, we also utilized the length information provided in Genbank deposited genomes for each provirus (Additional File 1) as well as in the Repeatmasker track of the UCSC genome browser. For a final step of confirmation, we used the online BLAST [22] tool to compare the sequences corresponding to the abovementioned coordinates in hg19 (Subramanian list) and hg38 that we downloaded from NCBI. For almost half of the proviruses, the different methods for length calculation described above resulted in a variety of length estimations. In these cases, for each proviral integration, we explored the sequences flanking the coordinates that resulted in the shorter proviral entry and used BLAST alignment with the longer proviral entry to find if these flanking sequences were actually part of the provirus.
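The length comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' actual workflow: the `Interval` type and the `same_length` helper are hypothetical names, and the coordinates used are those reported later in this article for the 5q11.2 solo LTR in hg19 and hg38.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic span; a hypothetical helper type for this sketch."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Length obtained by subtracting the coordinates, as in the text.
        return self.end - self.start


def same_length(a: Interval, b: Interval) -> bool:
    """True when two coordinate pairs describe spans of equal length."""
    return a.length == b.length


# 5q11.2 solo LTR coordinates as reported in this article.
hg19_5q11_2 = Interval("chr5", 58759612, 58760580)
hg38_5q11_2 = Interval("chr5", 59463786, 59464754)

print(hg19_5q11_2.length, hg38_5q11_2.length)
print(same_length(hg19_5q11_2, hg38_5q11_2))
```

When the two lengths disagree, the shorter entry is the one whose flanking sequences would then be inspected with BLAST, as the paragraph above describes.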
For proviral integrations that are available in literature only in the hg19 build or only in the hg38 build, the LiftOver tool of the UCSC genome browser [23] was used for conversion from one reference genome version to the other.
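As a small practical aid, coordinates written in the style used in this article (for example "Chr5: 58759612–58760580", with an en dash) can be normalised into the plain `chrN:start-end` position format before being pasted into a conversion tool. The sketch below assumes that this position format is what the LiftOver web form accepts; the function name `to_position` is hypothetical.

```python
import re

# Matches strings such as "Chr5: 58759612–58760580" (hyphen or en dash,
# optional thousands separators, optional whitespace).
_COORD = re.compile(
    r"^\s*chr\s*([0-9A-Za-z_]+)\s*:\s*([\d,]+)\s*[-\u2013]\s*([\d,]+)\s*$",
    re.IGNORECASE,
)


def to_position(text: str) -> str:
    """Normalise a coordinate string to 'chrN:start-end'."""
    match = _COORD.match(text)
    if match is None:
        raise ValueError(f"unrecognised coordinate string: {text!r}")
    chrom, start, end = match.groups()
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if end_i < start_i:
        raise ValueError("end coordinate is smaller than start coordinate")
    return f"chr{chrom}:{start_i}-{end_i}"


print(to_position("Chr5: 58759612\u201358760580"))
```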
Phylogenetic analyses
HML-2 proviral sequences were downloaded from NCBI using their hg38 coordinates provided in Additional File 1. Sequences were clustered according to their length into three groups: 8-9kb sequences, 5-8kb and under 5kb. Sequences of each group were aligned to HERV-K113 (Additional File 1) using ClustalW [24], were combined and manually edited in BioEdit v. 7.7.1 [25]. The phylogeny was inferred using the Maximum Likelihood method and Tamura-Nei (1993) model [26] of nucleotide substitutions and the tree with the highest log likelihood (-177,392.14) is shown. The initial tree for the heuristic search was selected by choosing the tree with the superior log-likelihood between a Neighbor-Joining (NJ) tree [27] and a Maximum Parsimony (MP) tree. The NJ tree was generated using a matrix of pairwise distances computed using the Tamura-Nei (1993) model [26]. The MP tree had the shortest length among 10 MP tree searches, each performed with a randomly generated starting tree. The analytical procedure encompassed 102 nucleotide sequences with 24,141 positions in the final dataset. All missing-information and alignment gap sites were retained (‘Use all sites’ option for Gaps/missing data field). Evolutionary analyses were conducted in MEGA12 [28] utilizing up to 4 parallel computing threads.
Results and discussion
Addition of novel (polymorphic) HML-2 proviruses available in literature to the list
As mentioned above, the Subramanian list (divided into two tables in the original publication [12]) includes a total of 91 proviruses that the authors either discovered during their study or mined from older publications. A more recent study by the same group [6] added four novel polymorphic proviruses (termed 8q24.3c, 19p12d, 19p12e and Xq21.33 in [6]) to the list; three of these were published for the first time, while one (the second one) had been described earlier by other groups [15, 17, 29]. Interestingly, the polymorphic Xq21.33 is estimated to be the youngest HML-2 provirus in humans [30]. With the addition of the abovementioned proviruses, it seems that the 19p12 band, which is in close proximity to the centromere of chromosome 19, harbours five HML-2 proviruses, at least three of which are insertionally polymorphic, meaning that an individual could carry either the provirus or the pre-integration site [15]. The first two proviruses in this band were named 19p12a and 19p12b by Subramanian et al. [12], and were also known as K52 and K113, respectively. The third provirus in this band, coordinates-wise, was not included in the Subramanian list, because it was discovered a year after the publication of the latter [17, 29]. In 2015, Macfarlane & Badge [15] proposed the designation 19p12c for this provirus, whereas Wildschutte et al. [6] call it 19p12d, because the Subramanian list already contained a 19p12c provirus (also called K51), which, however, resides downstream from the 19p12c of Macfarlane & Badge [15]. The next provirus in this chromosomal band was termed 19p12e by Wildschutte et al., who discovered it [6]. Last in this band is K51, which, as mentioned before, was termed 19p12c in the Subramanian list and 19p12d by Macfarlane & Badge [15].
From the above, it follows that the “a, b, c, d, e, etc.” indicator that follows the chromosomal band characterization, corresponding to the succession of proviruses, is rather misleading and confusing, especially as novel integrations are being uncovered within the same band, between the already existing integrations. In that case, renaming of proviruses becomes mandatory, which, however, may cause even greater confusion given the existence of publications that use the old nomenclature. Thus, we propose a new naming system for the chromosomal band positions of HML-2 proviruses, which utilizes the provirus start coordinates in the hg38 main assembly. Namely, instead of a, b, c…, we propose the use of a three-digit number in parentheses, with the last digit corresponding to the last digit of the start coordinate, the middle digit corresponding to the last digit of the thousands in the start coordinate, and the first digit corresponding to the last digit of the millions in the start coordinate (if the start coordinate is only some hundreds of thousands, then the first digit of the designation is zero). For example, 19p12a, with start coordinate 20,387,400, can be renamed as 19p12(070). Since the majority of chromosomal bands and sub-bands (in cases of larger bands) are less than 10 million base pairs in length, we believe that this naming system makes it highly unlikely that two proviruses will claim the same name. Large chromosomal bands, such as the q12 band of chromosome Y, are mostly heterochromatic, meaning that they are very poor in genes and other expressed elements. In our updated proviral HML-2 catalogue, we present all proviruses that are situated in chromosomal bands with multiple counterparts under the naming system proposed above (Additional File 1).
Macfarlane & Badge also discovered a new, polymorphic provirus in the 1p31.1 band of chromosome 1, upstream from the 1p31.1 or K116 provirus included in the Subramanian list [15]. Hence, they proposed the designation 1p31.1a for the newly found provirus, and 1p31.1b for the previously known one. Herein, we renamed these proviruses as 1p31.1(398) and 1p31.1(576), respectively.
Three years later, Thomas et al. [16] found an indication of the existence of a novel unfixed HML-2 proviral integration in 5q11.2. This integration exists as a solo LTR in the hg19 and the hg38 genome builds, under the coordinates Chr5: 58759612–58760580 and Chr5: 59463786–59464754, respectively. In general, polymorphisms in HML-2 loci can take the form of a full (or almost full)-length provirus, a solo LTR or the unoccupied site (pre-integration site) [31]. Since Thomas et al. [16] reported merely the prediction that a part of the human population hosts a provirus between these coordinates, the sequence of this provirus is unavailable and further investigation is required to confirm this dimorphism, as the authors themselves state.
As recently as 2022, Chabukswar et al. [14] detected a new fixed provirus in the chromosomal band 6q11.1, which we included in our HML-2 catalogue (Additional File 1). The authors also provided a list of proviral HML-2 sequences, which comprised only the proviruses of the Subramanian list (with the exception of K113 and K105) plus the additional proviral HML-2 integration they found.
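The proposed three-digit designation can be derived mechanically from the hg38 start coordinate. The following is a minimal sketch of that rule as worded above; the function names `band_suffix` and `rename` are hypothetical and are not part of the published catalogue.

```python
def band_suffix(start: int) -> str:
    """Three-digit suffix built from an hg38 start coordinate.

    First digit:  last digit of the millions.
    Middle digit: last digit of the thousands.
    Last digit:   last digit of the coordinate itself.
    """
    if start < 0:
        raise ValueError("start coordinate must be non-negative")
    millions_digit = (start // 1_000_000) % 10   # 0 when start < 1,000,000
    thousands_digit = (start // 1_000) % 10
    units_digit = start % 10
    return f"({millions_digit}{thousands_digit}{units_digit})"


def rename(band: str, start: int) -> str:
    """Append the suffix to a chromosomal band name."""
    return f"{band}{band_suffix(start)}"


# Example from the text: 19p12a starts at 20,387,400 in hg38.
print(rename("19p12", 20_387_400))
```

Because the suffix uses only three digits of the coordinate, two proviruses in the same band could in principle still collide; the sketch makes no attempt to detect that.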
We also examined the Bendall S1 File [20], which, as mentioned above, is based on information from Repbase, RepeatMasker and UCSC. We detected four additional proviruses that have not been reported in the literature, but are present in the hg38 reference genome, in chromosomal bands 1p36.21, 11p15.4, 11q12.3 and 16p11.2 (Table 1, Additional File 1). At least one provirus had already been detected in each of those bands and included in the Subramanian list. We also found that some HML-2 proviruses already reported in the literature were absent from the Bendall S1 File (see Additional File 2). The reason that the bioinformatics tools used by Bendall et al. [20] to detect TEs did not find these proviruses lies mainly in their polymorphic nature: polymorphic proviral HML-2 alleles found in human individuals or sub-populations, which appear as solo LTRs or pre-integration sites in the reference genome, lack proviral coordinates both in hg19 and in hg38. Another “problem” that we observed regarding the HML-2 loci listed in the Bendall S1 File is that four proviruses known from the literature are each divided into two parts and referred to as two separate proviruses (Additional File 3). One of these four proviruses is HERV-K115 in 8p23.1, which is one of the youngest proviruses known, with relatively intact ORFs [30]. It is therefore a well-studied proviral integration with a well-characterized, canonical proviral genome of 9463 bp, which in no way could consist of two proviruses of 4893 and 4510bp, respectively, as listed in the Bendall S1 File (Additional File 3).
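One simple way to flag a possibly split annotation is to test whether the lengths of neighbouring fragments add up to roughly the known length of a single provirus. The sketch below uses the HERV-K115 figures quoted above; the helper name and the 100 bp tolerance are assumptions made purely for illustration, and the fragments need not sum exactly to the full length (for instance, a short stretch between them may be unannotated).

```python
from typing import Sequence


def fragments_match_known_provirus(
    fragment_lengths: Sequence[int],
    known_length: int,
    tolerance: int = 100,  # hypothetical threshold, not taken from the article
) -> bool:
    """True when the fragment lengths sum to the known length within tolerance."""
    return abs(sum(fragment_lengths) - known_length) <= tolerance


# HERV-K115 (8p23.1): two listed fragments versus the canonical genome length.
k115_fragments = (4893, 4510)
k115_known_length = 9463

print(sum(k115_fragments))
print(fragments_match_known_provirus(k115_fragments, k115_known_length))
```

A positive result is only a hint that two entries may describe one provirus; confirming it still requires inspecting the sequences themselves.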
An unusual group of HML-2 proviruses that has only one member included in the Subramanian list, are the centromeric and pericentromeric ones. The Subramanian list contains K105, the chromosomal location of which was unknown, as it was found to be situated within an unassembled, at the time, centromeric region [32]. High-repetitiveness of sequences in the centromeric and pericentromeric regions posed additional difficulties in identifying HERV integrations. Later it was discovered that K105 is a variant of K111, a provirus that was deposited earlier in the NCBI database [18]. Proviruses K105, K111 and K112 (also as solo LTRs) were found to be situated in the centromeric region of chromosomes 21 and 22 [18]. The same research group found two other types of centromeric HML-2 proviruses: K222 and the recombinant form K111/K222 in pericentromeric regions mostly of chromosomes 13, 14 and 15 [19]. However, it should be noted that at least 15 chromosomes host centromeric insertions of the above-mentioned HML-2 proviruses [13]. Another intriguing finding regarded the assumed copy number of these integrations, which also varies among individuals, thus creating copy number polymorphisms [13]. The precise copy number of centromeric proviruses is unattainable to date due to the highly-repetitive nature and also the lack of annotation of centromeric sequences [13, 17, 19]. Possibly the recently completed Telomere-to-Telomere (T2T) Project, covering the sequencing of remaining 8% of the human genome composed mostly of the repetitive regions (centromeres, telomeres, etc.) [33], could aid the resolution of the abovementioned problem. It is important to note that the Bendall S1 File [20] does not contain centromeric and pericentromeric HML-2 proviruses as they are not annotated in the human genome (Additional File 2). We included basic information about these proviruses in the catalogue presented herein (Additional File 1).
It should also be noted that in 2013, Shin et al. [7] identified a very rare type of HML-2 insertion, termed atypical or non-classical. These integrations lacked the 5’ and 3’ LTR regions, consisted only of a short internal sequence and are possibly associated with genome repair mechanisms ensuring stability. Since we are only considering here proviruses that meet the definition given by Subramanian et al. [12] (see above), we are not including atypical HML-2 integrations in our catalogue.
To sum up, we have identified in the literature 10 HML-2 proviruses that were discovered after the publication of the Subramanian list (Table 1). Not counting the pericentromeric HML-2 integration K105, the Subramanian list contained 90 proviruses. In our updated, comprehensive catalogue (Additional File 1), we included 88 of these 90 proviruses along with the 10 more recent ones. Two proviral sites (7p22.1a and 7p22.1b) that Subramanian et al. [12] considered as two separate proviruses (K108L and K108R) can actually be regarded as one, containing two tandem proviral integrations that share a common LTR in the middle. Because (a) the two integrations are not autonomous, as the 3’LTR of the right one (K108R) functions as the 5’LTR for the left one (K108L), (b) we have found transcripts from this provirus that begin at the 5’LTR of the right one and end at the 3’LTR of the left one (as yet unpublished data) and (c) K108 has frequently been regarded in the literature as one HML-2 provirus [8, 34, 35], we decided to include it in our catalogue as one proviral site (Additional File 1). Therefore, the number of HML-2 proviral integrations known to date totals 99, without taking into consideration the centromeric and pericentromeric HML-2 integrations, whose number is not yet precisely known [13, 19].
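The tally behind the final figure can be written out explicitly. This short sketch only restates the counts given in the paragraph above; the variable names are ours.

```python
subramanian_total = 91          # proviruses in the Subramanian list
pericentromeric_k105 = 1        # K105, counted separately
separate_k108_entries = 2       # 7p22.1a and 7p22.1b (K108L, K108R)
merged_k108_site = 1            # K108 treated as a single proviral site
newer_proviruses = 10           # reported after the Subramanian list

non_pericentromeric = subramanian_total - pericentromeric_k105   # 90
retained_as_listed = non_pericentromeric - separate_k108_entries  # 88
catalogue_total = retained_as_listed + merged_k108_site + newer_proviruses

print(non_pericentromeric, retained_as_listed, catalogue_total)
```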
Refinement of coordinates
The second major part of our study was to assign the correct coordinates in the latest edition of the reference genome (hg38) to each HML-2 provirus of our updated catalogue, since the Subramanian list included coordinates in the hg19 assembly. Towards this goal, we drew information from the Bendall S1 File [20] and from a list of HML-2 proviruses with hg38 coordinates published by Chabukswar et al. [14].
In five cases (Additional File 4), the coordinates given by the Subramanian list resulted in a longer provirus than the coordinates given by the Bendall S1 File or Chabukswar et al. [14]. Thus, we examined the sequences flanking the hg38 coordinates from the last two sources for identity with the missing parts of the proviruses. This process resulted in the correction of the hg38 coordinates so that they include the whole provirus (Additional File 4).
In cases where the coordinates given by the Subramanian list resulted in a shorter provirus than the coordinates given by the Bendall S1 File or Chabukswar et al. [14], we used BLAST to search the regions flanking the hg19 coordinates for sequences identical to the missing parts of the provirus. In this way we refined several hg19 coordinates of HML-2 proviruses (Additional File 1).
In one case (the provirus in 22q11.23), the coordinates in the Subramanian list, the Bendall S1 File and Chabukswar et al. [14] all produced different proviral lengths. This provirus offers a good opportunity to describe in more depth the steps we followed to correct and refine previously published coordinates. First, we used BLAST to compare the sequence defined by the coordinates of the Subramanian list in hg19 (8881bp) with the longest proviral sequence defined by the coordinates of the Bendall S1 File (10839bp). We found that sequences were missing from both sides of the hg19 provirus. Subsequently, we inspected these hg38 coordinates in the RepeatMasker track of the UCSC genome browser and found that the provirus is possibly even longer than the one defined in the Bendall S1 File. However, it is interrupted by non-proviral sequences as well as by sequences that belong to other TEs and repetitive elements. We therefore searched the literature for information about this specific integration and found that this provirus is a complex one: it contains two 5’LTRs, separated by a 163bp gag sequence and a longer non-HML-2 sequence. It also contains a gag gene with an intact ORF, a pol gene impaired by mutations and an env gene that is also non-functional owing to the insertion of an inverted Alu element. The 3’LTR sequence is also interrupted by a MER11A LTR sequence [36]. Since there is evidence that transcription begins at the first 5’LTR [36], we confirmed that the Bendall S1 File coordinates for the 5’ end of this provirus are correct. However, the respective coordinates for the 3’ end coincide with the part of the 3’LTR that precedes the interruption by the MER11A LTR element, whereas the 3’LTR of 22q11.23 extends beyond the MER (MEdium Reiteration frequency) element for another 491bp. Therefore, we provide here the corrected hg38 coordinates for this integration, which result in a 12367bp long provirus.
Using the corresponding UCSC information for hg19, we revised also the hg19 coordinates for 22q11.23 (Additional File 1). Similar methodology was applied for the deciphering of all other HML-2 proviral coordinates and lengths conflicts.
For the three polymorphic proviruses that exist as solo LTRs in the most recent assemblies of the reference genome (Additional File 2), it is not possible to provide proviral coordinates as the provirus exists in the genome of only part of the human population. These loci are not included in the Bendall S1 File or in the Chabukswar et al. [14] table, therefore we used the LiftOver tool of the UCSC genome browser [23] to convert the hg19 coordinates provided by the original publication of these polymorphic proviruses or by the Subramanian list for three out of the four loci.
Also, five proviruses (8q24.3(404), 19p12(184), 19p12(217), 19p12d, Xq21.33) are found as pre-integration sites in the hg19 version of the human genome. Two of them (19p12(184) and 19p12(217)) are detected as proviruses in the alternative chromosomes published for hg38. We found the exact coordinates in the Bendall S1 File and included them in our catalogue (Additional File 1). For the other three, Wildschutte et al. [6] provide the pre-integration site in hg19. Thus, we again used LiftOver to convert the pre-integration site coordinates in hg19 to the respective coordinates in hg38. The hg38 pre-integration site coordinates are provided herein (Additional File 1).
As mentioned above (Table 1), four proviruses (1p36.21(327), 11p15.4(495), 11q12.3(281) and 16p11.2(427)) were included in the Bendall S1 File but not in the Subramanian list or anywhere else in the literature. To find the coordinates in hg19, we used the LiftOver tool to convert the hg38 coordinates from the Bendall S1 File to hg19 coordinates. We cross-checked the results both through BLAST and through the RepeatMasker track showing the LTR elements in the respective genome assembly. For the provirus 1p36.21(327), the LiftOver tool did not produce any result in hg19. This, along with the failure of a manual search, indicates that this provirus does not exist in hg19. Similarly, the provirus 16p11.2(427) is also missing from the hg19 assembly, since the LiftOver conversion returned the coordinates of provirus 16p11.2(476), which in the hg38 assembly is situated 500kb downstream from 16p11.2(427).
Other types of information regarding proviruses, such as aliases, polymorphic alleles, references, etc. were found in literature [6, 12, 14,15,16, 19] and are provided in Additional File 1.
Confirmation of the HML-2 nature of novel proviral additions
To ensure that the 10 HML-2 proviruses that we found in literature (Table 1) and are missing from the Subramanian list, indeed cluster together with the existing HML-2 proviruses of the list and also that refinement of coordinates did not alter significantly the proviral HML-2 phylogenetic tree topology, we produced an alignment of all 99 HML-2 proviruses of Additional File 1, based on the coordinates in hg38 column (Additional File 5). We then proceeded to the construction of a Maximum Likelihood phylogenetic tree (Fig. 1). As it can be seen in Fig. 1, the 10 novel HML-2 proviruses (highlighted in green) belong in fact to the HML-2 family. The tree also resembles closely the topology of the Bayesian inference trees published by Subramanian et al. [12] of HML-2 endogenous elements and gag and env genes.
Fig. 1 Evolutionary analysis of all known HML-2 proviruses. The phylogeny was inferred using the Maximum Likelihood method and Tamura-Nei (1993) model [26] of nucleotide substitutions and the tree with the highest log likelihood is shown. Highlighted in green are proviruses that appeared in the literature after the publication of the Subramanian list and are therefore missing from both the list and the phylogenetic analyses conducted by its authors [12]. For the sake of correctness of the alignment on which the phylogenetic tree was based, we split the 7p22.1 locus, which contains two tandem proviral integrations sharing an LTR in the middle, into 7p22.1a and 7p22.1b. The shared LTR was used in the 7p22.1a alignment entry as the 3’LTR and in 7p22.1b as the 5’LTR. Although proviruses 8p22 and 17p13.1 can be aligned, are incorporated in all HML-2 lists cited in the main text, including the Subramanian list, and are also recognised by RepeatMasker as HML-2 elements, they are distant in sequence and resemble an outgroup; hence Subramanian et al. [12] suggested that they are presumably members of a group of proviruses closely related to HML-2, which they termed HML-11. We found that a similar divergence is observed for provirus 1q21.3, which is possibly also a close relative of HML-2 proviruses. The scale bar at the bottom represents genetic change.
Concluding remarks
Due to a plethora of articles being published regarding the number and chromosomal position of HML-2 proviruses as well as the new polymorphic sites that are constantly being uncovered, after a thorough study of all respective literature and the actual sequences of HML-2 proviral loci in the human genome, we composed a comprehensive catalogue of all currently known HML-2 proviruses (Additional File 1). We followed the definition of proviruses given by Subramanian et al. [12], as integrations that have “at least one LTR associated with internal coding sequence in the same orientation”. Our catalogue includes a total of 99 proviruses (apart from the centromeric and pericentromeric ones, the precise number of which is unknown), characterized by chromosomal band position and coordinates in the hg19 (GRCh37) and hg38 (GRCh38.p14) reference genome versions. In some cases, coordinates in hg19 have been slightly corrected from the ones given in Subramanian et al. [12] to include the whole provirus. Coordinates in hg38 are also corrected from the ones previously provided in literature [14, 20], for the same reason. Out of the 99 proviruses, 42 are over 9kb in length and therefore can be presumably characterized as full-length proviruses, comprising two LTRs of ~ 1kb each, and an internal region of > 7kb [7]. However, length is not indicative of true HML-2 genetic composition, as some integrations host (a) large deletions and/or insertions of non-HML-2 sequences, such as other transposable elements or unknown sequences (e.g., 19p12(061)), (b) tandem duplications of HML-2 sequences, such as a second 5’LTR (e.g., 22q11.23), or even a second truncated HML-2 provirus (e.g., 7p22.1), (c) partial/truncated HML-2 core genes in combination with large insertions of non-HML-2 sequences (e.g., 20q11.22). It should be noted that this catalogue corresponds to the current knowledge of HML-2 integrations, as new copies may be detected in the future, especially polymorphic ones. 
Moreover, the incorporation of human genomes from Africa into public databases increases the likelihood of uncovering novel HML-2 proviruses, since the diversity of HML-2 elements was found to be higher in African than in non-African populations [37].
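As an illustration of the length criterion discussed above, the following minimal Python sketch flags catalogue entries longer than 9 kb as presumed full-length. The locus names and coordinates are hypothetical placeholders, not values from Additional File 1, and the 1-based inclusive coordinate convention is an assumption. As noted above, length alone does not establish true HML-2 composition, so such a flag is only a first-pass filter.

```python
from dataclasses import dataclass

# ">9 kb" threshold taken from the text; everything else below is illustrative.
FULL_LENGTH_MIN_BP = 9_000


@dataclass(frozen=True)
class Provirus:
    name: str
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive

    @property
    def length(self) -> int:
        # Inclusive coordinates, so add 1 to the difference.
        return self.end - self.start + 1


def presumed_full_length(provirus: Provirus) -> bool:
    """Return True when the locus is strictly longer than 9 kb."""
    return provirus.length > FULL_LENGTH_MIN_BP


# Hypothetical entries, NOT real catalogue coordinates.
catalogue = [
    Provirus("locus_A", "chr1", 100_001, 109_500),  # 9,500 bp
    Provirus("locus_B", "chr2", 200_001, 206_000),  # 6,000 bp
    Provirus("locus_C", "chr3", 300_001, 309_000),  # exactly 9,000 bp
]

long_loci = [p.name for p in catalogue if presumed_full_length(p)]
print(long_loci)  # ['locus_A']
```

A locus flagged in this way would still need sequence-level inspection, since large non-HML-2 insertions or tandem duplications can inflate its length.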
In HERV research, and especially in research involving the more active HML-2 endogenous retroviral sites, lists of proviral loci, such as the one provided by Subramanian et al. [12] or the more recent S1 File of Bendall et al. [20], have been used by numerous groups as a basis for the analysis of high-throughput data. It is therefore important that these lists be as up to date and as accurate as possible with respect to current knowledge. In this study, we aimed to present an updated and refined list of HML-2 proviruses and their respective coordinates in both genome assemblies (hg19 and hg38), so as to include proviral parts that were previously overlooked. Moreover, since the Bendall S1 File has been used in many bioinformatic studies of HML-2 expression profiles [38,39,40,41], our finding that some proviruses are missing from the file (Additional File 2) and that some are split into two proviruses (Additional File 3) may help scientists in the field become aware of these discrepancies and take corrective measures in their HERV studies.
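Readers who wish to check their own annotation file against a reference catalogue could use a simple interval-overlap comparison, sketched below in Python. This is not the procedure used in this study; it is a minimal illustration that assumes both lists use the same genome assembly and inclusive coordinates. All names and coordinates are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    chrom: str
    start: int  # assumed inclusive
    end: int    # assumed inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def compare(catalogue: list[Interval], annotation: list[Interval]):
    """Return (missing, split): catalogue loci with no overlapping annotation
    entry, and catalogue loci overlapped by two or more annotation entries."""
    missing, split = [], []
    for locus in catalogue:
        hits = [entry for entry in annotation if overlaps(locus, entry)]
        if not hits:
            missing.append(locus.name)
        elif len(hits) > 1:
            split.append(locus.name)
    return missing, split


# Hypothetical data, NOT real coordinates from any published file.
catalogue = [
    Interval("locus_A", "chr1", 1_000, 10_000),
    Interval("locus_B", "chr2", 5_000, 14_000),
    Interval("locus_C", "chr3", 2_000, 9_000),
]
annotation = [
    Interval("a1", "chr1", 1_000, 10_000),
    Interval("c1", "chr3", 2_000, 5_000),
    Interval("c2", "chr3", 5_200, 9_000),
]

missing, split = compare(catalogue, annotation)
print(missing)  # ['locus_B']
print(split)    # ['locus_C']
```

The quadratic scan is adequate for lists of roughly a hundred loci; larger annotations would benefit from sorting or an interval tree.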
Data availability
All data analysed during this study are included in this published article and its supplementary information files.
Abbreviations
- HERV: Human Endogenous Retrovirus
- HML-2: Human endogenous MMTV-like 2
- LTR: Long Terminal Repeat
- MER: MEdium Reiteration frequency
- ORF: Open Reading Frame
- TE: Transposable Element
- VLP: Virus-Like Particle
References
Li W, Pandya D, Pasternack N, Garcia-Montojo M, Henderson L, Kozak CA, et al. Retroviral elements in pathophysiology and as therapeutic targets for amyotrophic lateral sclerosis. Neurotherapeutics. 2022;19(4):1085–101.
Eickbush TH, Malik HS. Origins and Evolution of Retrotransposons. Mobile DNA II. 2007. p. 1111–44.
Wells JN, Feschotte C. A field guide to eukaryotic transposable elements. Annu Rev Genet. 2020;54:539–61.
Feschotte C, Gilbert C. Endogenous viruses: insights into viral evolution and impact on host biology. Nat Rev Genet. 2012;13(4):283–96.
Holloway JR, Williams ZH, Freeman MM, Bulow U, Coffin JM. Gorillas have been infected with the HERV-K (HML-2) endogenous retrovirus much more recently than humans and chimpanzees. Proc Natl Acad Sci U S A. 2019;116(4):1337–46.
Wildschutte JH, Williams ZH, Montesion M, Subramanian RP, Kidd JM, Coffin JM. Discovery of unfixed endogenous retrovirus insertions in diverse human populations. Proc Natl Acad Sci U S A. 2016;113(16):E2326–34.
Shin W, Lee J, Son SY, Ahn K, Kim HS, Han K. Human-specific HERV-K insertion causes genomic variations in the human genome. PLoS ONE. 2013;8(4):e60605.
Fuchs NV, Loewer S, Daley GQ, Izsvák Z, Löwer J, Löwer R. Human endogenous retrovirus K (HML-2) RNA and protein expression is a marker for human embryonic and induced pluripotent stem cells. Retrovirology. 2013;10:115.
Bhardwaj N, Montesion M, Roy F, Coffin JM. Differential expression of HERV-K (HML-2) proviruses in cells and virions of the teratocarcinoma cell line Tera-1. Viruses. 2015;7(3):939–68.
Hohn O, Hanke K, Bannert N. HERV-K(HML-2), the best preserved family of HERVs: endogenization, expression, and implications in health and disease. Front Oncol. 2013;3:246.
Hughes JF, Coffin JM. Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: implications for human and viral evolution. Proc Natl Acad Sci U S A. 2004;101(6):1668–72.
Subramanian RP, Wildschutte JH, Russo C, Coffin JM. Identification, characterization, and comparative genomic distribution of the HERV-K (HML-2) group of human endogenous retroviruses. Retrovirology. 2011;8:90.
Kaplan MH, Kaminski M, Estes JM, Gitlin SD, Zahn J, Elder JT, et al. Structural variation of centromeric endogenous retroviruses in human populations and their impact on cutaneous T-cell lymphoma, Sezary syndrome, and HIV infection. BMC Med Genomics. 2019;12(1):58.
Chabukswar S, Grandi N, Tramontano E. Prolonged activity of HERV-K(HML2) in Old World Monkeys accounts for recent integrations and novel recombinant variants. Front Microbiol. 2022;13:1040792.
Macfarlane CM, Badge RM. Genome-wide amplification of proviral sequences reveals new polymorphic HERV-K(HML-2) proviruses in humans and chimpanzees that are absent from genome assemblies. Retrovirology. 2015;12:35.
Thomas J, Perron H, Feschotte C. Variation in proviral content among human genomes mediated by LTR recombination. Mob DNA. 2018;9:36.
Contreras-Galindo R, Kaplan MH, Contreras-Galindo AC, Gonzalez-Hernandez MJ, Ferlenghi I, Giusti F, et al. Characterization of human endogenous retroviral elements in the blood of HIV-1-infected individuals. J Virol. 2012;86(1):262–76.
Contreras-Galindo R, Kaplan MH, He S, Contreras-Galindo AC, Gonzalez-Hernandez MJ, Kappes F, et al. HIV infection reveals widespread expansion of novel centromeric human endogenous retroviruses. Genome Res. 2013;23(9):1505–13.
Zahn J, Kaplan MH, Fischer S, Dai M, Meng F, Saha AK, et al. Expansion of a novel endogenous retrovirus throughout the pericentromeres of modern humans. Genome Biol. 2015;16(1):74.
Bendall ML, de Mulder M, Iniguez LP, Lecanda-Sanchez A, Perez-Losada M, Ostrowski MA, et al. Telescope: Characterization of the retrotranscriptome by accurate estimation of transposable element expression. PLoS Comput Biol. 2019;15(9):e1006453.
Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110(1–4):462–7.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.
Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 2006;34(Database issue):D590–8.
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–80.
Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–8.
Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10(3):512–26.
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–25.
Kumar S, Stecher G, Suleski M, Sanderford M, Sharma S, Tamura K. MEGA12: molecular evolutionary genetic analysis version 12 for adaptive and green computing. Mol Biol Evol. 2024;41(12):msae263.
Lee E, Iskow R, Yang L, Gokcumen O, Haseley P, Luquette LJ 3rd, et al. Landscape of somatic retrotransposition in human cancers. Science. 2012;337(6097):967–71.
Kaplan MH, Contreras-Galindo R, Jiagge E, Merajver SD, Newman L, Bigman G, et al. Is the HERV-K HML-2 Xq21.33, an endogenous retrovirus mutated by gene conversion of chromosome X in a subset of African populations, associated with human breast cancer? Infect Agent Cancer. 2020;15:19.
Garcia-Montojo M, Doucet-O’Hare T, Henderson L, Nath A. Human endogenous retrovirus-K (HML-2): a comprehensive review. Crit Rev Microbiol. 2018;44(6):715–38.
Barbulescu M, Turner G, Seaman MI, Deinard AS, Kidd KK, Lenz J. Many human endogenous retrovirus K (HERV-K) proviruses are unique to humans. Curr Biol. 1999;9(16):861–8.
Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53.
Terry SN, Manganaro L, Cuesta-Dominguez A, Brinzevich D, Simon V, Mulder LCF. Expression of HERV-K108 envelope interferes with HIV-1 production. Virology. 2017;509:52–9.
Khadjinova AI, Wang X, Laine A, Ukadike K, Eckert M, Stevens A, et al. Autoantibodies against the envelope proteins of endogenous retroviruses K102 and K108 in patients with systemic lupus erythematosus correlate with active disease. Clin Exp Rheumatol. 2022;40(7):1306–12.
Goering W, Schmitt K, Dostert M, Schaal H, Deenen R, Mayer J, et al. Human endogenous retrovirus HERV-K(HML-2) activity in prostate cancer is dominated by a few loci. Prostate. 2015;75(16):1958–71.
Macfarlane C, Simmonds P. Allelic variation of HERV-K(HML-2) endogenous retroviral elements in human populations. J Mol Evol. 2004;59(5):642–56.
La Ferlita A, Distefano R, Alaimo S, Beane JD, Ferro A, Croce CM, et al. Transcriptome analysis of human endogenous retroviruses at locus-specific resolution in non-small cell lung cancer. Cancers (Basel). 2022;14(18):4433.
Burn A, Roy F, Freeman M, Coffin JM. Widespread expression of the ancient HERV-K (HML-2) provirus group in normal human tissues. PLoS Biol. 2022;20(10):e3001826.
Russ E, Mikhalkevich N, Iordanskiy S. Expression of human endogenous retrovirus group K (HERV-K) HML-2 correlates with immune activation of macrophages and type I interferon response. Microbiol Spectr. 2023;11(2):e0443822.
She J, Du M, Xu Z, Jin Y, Li Y, Zhang D, et al. The landscape of hervRNAs transcribed from human endogenous retroviruses across human body sites. Genome Biol. 2022;23(1):231.
Acknowledgements
Not applicable.
Funding
This work was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) through the ‘2nd Call for H.F.R.I. Research Projects to Support Faculty Members and Researchers’ (Project Number 2517).
Author information
Contributions
G.M. designed research, supervised and secured funding; E.K. performed the literature search, analyzed the data and wrote the first draft. Both authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13100_2025_359_MOESM1_ESM.xlsx
Additional file 1. Catalogue of all HML-2 proviruses that have been identified in the human genome to date. The catalogue is provided as an .xlsx file and contains, among other data, updated chromosomal locus nomenclature, HML-2 proviral coordinates in both genome releases, Genbank Accession Numbers for complete proviral genomes, polymorphic alleles information, etc.
13100_2025_359_MOESM2_ESM.docx
Additional file 2. HML-2 proviruses missing from the S1 File of Bendall et al. 2019 [20].
13100_2025_359_MOESM3_ESM.docx
Additional file 3. HML-2 proviruses that are reported integral in literature, but are divided into two separate proviruses in the S1 File of Bendall et al. 2019 [20].
13100_2025_359_MOESM5_ESM.meg
Additional file 5. Full alignment of all HML-2 proviruses known to date. Sequences are provided as a MEG File, and can be viewed as alignment in the MEGA software, and as plain text in common word processing applications.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kyriakou, E., Magiorkinis, G. Compilation of all known HERV-K HML-2 proviral integrations. Mobile DNA 16, 21 (2025). https://doi.org/10.1186/s13100-025-00359-8
DOI: https://doi.org/10.1186/s13100-025-00359-8