A novel molecular mechanism in human genetic disease: A DNA repeat-derived lncRNA

Abstract

Two thirds of the human genome is composed of repetitive sequences. Despite their prevalence, DNA repeats are largely ignored. The vast majority of our genome is transcribed to produce non protein-coding RNAs. Among these, long non protein-coding RNAs represent the most prevalent and functionally diverse class. The relevance of the non protein-coding genome to human disease has mainly been studied with regard to altered microRNA expression and function in human cancer. In contrast, the elucidation of the involvement of long non-coding RNAs in disease is only in its infancy. We have recently found that a chromatin-associated, long non protein-coding RNA regulates a Polycomb/Trithorax epigenetic switch at the basis of the repeat-associated facioscapulohumeral muscular dystrophy, a common muscle disorder. Based on this, we propose that long non-coding RNAs produced by repetitive sequences contribute to shaping the epigenetic landscape in normal human physiology and in disease.

Keywords: repetitive element, non protein-coding RNA, polycomb, trithorax, chromatin, muscular dystrophy

Long Non Protein-Coding RNAs

Nowadays, we are witnessing a progressive deconstruction of the protein-centric view that has governed the molecular biology field for decades.1 Although functional non protein-coding RNAs (ncRNAs) such as tRNAs and snRNAs were discovered a long time ago, the recognition of RNA as a functional molecule per se, acting at key levels in the complex, multilayered regulation of eukaryotic gene expression, occurred only in recent years.2 This was triggered by the milestone discovery that the vast majority of mammalian genomes is transcribed to generate a wide array of ncRNAs.3

The main classification of ncRNAs is based on their length. Typically, molecules shorter than 200 nucleotides (nt) are termed short or small ncRNAs, while transcripts longer than 200 nt are defined as long ncRNAs (lncRNAs).4
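The length-based convention described above can be written as a one-line rule. The following is only an illustrative sketch: the function name and labels are hypothetical, and the handling of a transcript of exactly 200 residues (treated here as long) is an assumption, since the text does not specify that boundary case.

```python
# Cutoff quoted in the text; the treatment of exactly 200 is an assumption.
LENGTH_CUTOFF = 200


def classify_ncrna(length: int) -> str:
    """Return 'short ncRNA' or 'lncRNA' for a non-coding transcript length."""
    if length <= 0:
        raise ValueError("transcript length must be a positive integer")
    # Shorter than the cutoff: short/small ncRNA; otherwise long ncRNA.
    return "short ncRNA" if length < LENGTH_CUTOFF else "lncRNA"
```

For instance, a transcript of about 22 residues (a typical microRNA size) falls in the short class, whereas the 1.7 kb RepA transcript discussed later falls in the long class.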

Short ncRNAs, which include microRNAs, piwi-interacting RNAs and short interfering RNAs, have been the most extensively characterized in recent years.5 They are involved in several biological processes, such as post-transcriptional gene regulation, heterochromatin formation and the control of mobile repetitive elements.5

LncRNAs represent the most numerous and functionally diverse class of RNA produced by mammalian cells.6 They are predominantly nuclear6 and have been implicated in the regulation of gene expression at the transcriptional and post-transcriptional levels and in the formation of functional sub-compartments in the nucleus2,7 (Fig. 1). Despite the growing interest in lncRNAs, they remain poorly explored in terms of biological relevance, cellular function and mechanism of action.

Figure 1. Schematic representation of the diverse cellular functions of lncRNAs. LncRNAs are localized both in the cytoplasm (in yellow) and in the nucleus (in red). The vast majority of lncRNAs has yet to be functionally characterized (?). Inside the nucleus, lncRNAs can act in the negative (-) or positive (+) regulation of gene expression at transcriptional level. Furthermore, specific nuclear lncRNAs can promote a functional compartmentalization of the nucleus. They are involved in the constitution of specific sub-nuclear structures such as nucleoli (rRNAs), speckles (MALAT1) and paraspeckles (NEAT1). It has also been described that cytoplasmic lncRNAs (in yellow) can compete with mRNAs (in grey) for the binding to miRNAs (in light blue), interfering with the available cellular pool of miRNAs, ultimately influencing gene expression at the post-transcriptional level.

Most of the ncRNA analyses are performed on the non-repetitive, non-exonic, polyA+ portion of the mammalian transcriptomes. It has to be noted that over two-thirds of the mammalian genome is composed of repetitive elements.8 Moreover, there are many examples of ncRNAs that are antisense or partially overlapping to protein-coding genes.3 Finally, there are indications that the majority of ncRNAs are non-polyadenylated.6 Hence, the range, depth and complexity of the mammalian transcriptome are far from being fully characterized.9

Chromatin remodeling factors are crucial players in defining the epigenetic state of a cell and their genomic recruitment needs fine-tuned temporal and spatial regulation. Generally, chromatin remodeling components lack sequence-specific DNA-binding domains and need to be guided to their targets by auxiliary factors. Interestingly, lncRNAs tend to be enriched in the nucleus and are increasingly implicated in the epigenetic regulation of gene expression.6,10 In particular, several lncRNAs bind to multiple chromatin regulatory proteins to mediate their recruitment to specific genomic targets.11-19

LncRNAs display features that are particularly suitable for molecules mediating the targeting of protein complexes to precise genomic sites. Indeed, compared to transcription factors, which typically bind to simple and degenerate DNA motifs, lncRNAs can specify “unique addresses” in the genome. Additionally, lncRNAs can form secondary structures that can function as “bait” for protein interaction, suggesting that transcription of lncRNAs might contribute to the formation of specific epigenetic landscapes.

A paradigmatic example of the role of lncRNAs in epigenetic gene regulation is represented by X-chromosome inactivation. Since its discovery more than fifty years ago,20 this has become one of the most exciting fields in molecular biology. X-chromosome inactivation is a multi-step cascade of events in which several lncRNAs, produced by the so-called X-chromosome Inactivation Center (XIC), collaborate with the Polycomb Group (PcG) of proteins, leading to chromatin compaction and epigenetic repression.21

Together with the Trithorax group (TrxG) of proteins, PcG constitutes an evolutionarily conserved antagonistic system that regulates gene expression at the epigenetic level.22,23 Typically, PcG is associated with gene repression and TrxG with gene activation. Initially discovered as key regulators of homeotic (Hox) genes during development in D. melanogaster, PcG and TrxG factors are now known to play crucial roles in many biological phenomena such as cell proliferation,24 stem cell identity,25 cancer,26 genomic imprinting27 and X-chromosome inactivation.28

In mammalian X-inactivation, the 1.7 kb lncRNA RepA recruits the multiprotein complex Polycomb Repressive Complex 2 (PRC2) to the future inactive X chromosome (Xi) by directly interacting with one of the PRC2 components, enabling full induction of the lncRNA Xist.13,29 With the help of YY1,30 Polycomb Repressive Complex 1 (PRC1)31 and PRC2,13 Xist “paints” the X-chromosome and mediates the epigenetic repression of the whole chromosome territory. On the active X-chromosome (Xa), instead, the antisense 40 kb Tsix lncRNA32 functions by maintaining low Xist expression levels and by interfering with PcG binding to RepA, thus preventing the repressive cascade.13,33

Pioneering studies conducted in flies revealed that transcription of ncRNAs from PcG/TrxG binding sites can counteract PcG silencing.11,34-36 The discovery in mammals of similar transcripts originating from PcG/TrxG targets such as the Hox cluster15,37 suggests that ncRNA transcription could be a general feature of the regulation of PcG/TrxG function.

So far, the exact molecular mechanism(s) through which lncRNAs recruit epigenetic regulators remain largely unclear. Moreover, the vast majority of the lncRNAs characterized up to now function in trans to epigenetically repress gene expression.

We have recently contributed to this field through the identification of the first activating lncRNA involved in a human genetic disease: facioscapulohumeral muscular dystrophy (FSHD) (see below).

Summary of Our Paper

FSHD (MIM #158900) is the third most common myopathy38 and is characterized by progressive wasting of facial, upper arm and shoulder girdle muscles. FSHD is an autosomal-dominant hereditary disorder with peculiar features.39 The disease is not caused by classical mutations in a protein-coding gene. Rather, it is associated with a reduction in the copy number of the 3.3 kb macrosatellite D4Z4 repeat, which maps to the sub-telomeric region of the long arm of human chromosome 4 (4q35). D4Z4 is extremely polymorphic in the general population, ranging from 11 to 150 copies.40,41 In contrast, FSHD patients carry a contracted D4Z4 array, containing only 1 to 10 units,42,43 suggesting that a gain-of-function element, linked to a threshold D4Z4 copy number effect, is involved in the disease.
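The copy-number ranges quoted above amount to a simple threshold rule, sketched below for illustration only. The function name and return labels are hypothetical, the snippet uses just the two ranges stated in the text, and it is in no way a diagnostic criterion.

```python
def d4z4_array_range(units: int) -> str:
    """Map a D4Z4 copy number onto the ranges quoted in the text.

    1-10 units: contracted array, as reported for FSHD patients.
    11-150 units: range reported for the general population.
    Anything else is outside the ranges given in the text.
    """
    if units < 0:
        raise ValueError("copy number cannot be negative")
    if 1 <= units <= 10:
        return "contracted"
    if 11 <= units <= 150:
        return "general-population"
    return "outside-reported-range"
```

The single cutoff between 10 and 11 units is what the text refers to as a threshold copy number effect.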

D4Z4 belongs to a family of human tandem repeats termed macrosatellites that are non-centromerically located.44 Together with other members of the family, such as DXZ4 on chromosome X45 and RS447 on 4p,46 D4Z4 is extremely GC-rich. Notably, the area occupied by D4Z4 repeats in healthy subjects represents one of the largest GC-rich regions of the human genome.

Several FSHD clinical features, such as the variability in severity and rate of progression, the gender bias in penetrance, the asymmetric muscle wasting and the discordance of the disease in monozygotic twins, strongly suggest the involvement of epigenetic factors.47 Accordingly, DNA methylation,48 histone modifications49,50 and higher order chromatin structure49,51,52 are altered in FSHD patients. While in healthy subjects the 4q35 locus is organized as repressed chromatin, FSHD has been associated with an epigenetic switch that leads to the inappropriate de-repression of several 4q35 genes, among which are the leading FSHD candidates.53,54 Although this has been known for over a decade, the molecular mechanism through which D4Z4 repeats regulate chromatin structure and gene expression at 4q35 has remained elusive.

Our recent work indicates that the D4Z4 repeat is a novel Polycomb target.18 Indeed, D4Z4 repeats are able to initiate de novo PcG recruitment, and a reduction of PcG levels causes de-repression of the 4q35 locus. Importantly, FSHD patients display reduced PcG binding and reduced spreading of the PcG histone mark H3K27me3 at the disease locus compared to controls.18

Based on our data, we propose that in healthy subjects, the presence of many D4Z4 units results in extensive PcG binding, DNA methylation, histone de-acetylation and chromatin compaction, leading to a repressive chromatin organization. In FSHD patients, reduction of D4Z4 copy number results in a critical reduction of PcG binding and, as a consequence, reduction of PcG silencing on the contracted 4q35 allele. This creates an epigenetic environment permissive for the transcription of an activating lncRNA, which we named DBE-T, originating proximally to and covering part of the repeat array (Fig. 2). DBE-T is produced solely in FSHD patients and is required for opening up the 4q35 chromatin and for de-repression of 4q35 genes. Mechanistically, we discovered that DBE-T is a chromatin-associated RNA that functions in cis, as demonstrated by the lack of transcriptional de-repression of 4q35 genes in cells expressing DBE-T in trans from a transgene. Our data show that DBE-T remains associated with the FSHD locus chromatin and promotes the recruitment of the TrxG protein Ash1L through direct binding. Ash1L recruitment is associated with the accumulation of H3K36me2 at the FSHD locus, which in turn leads to de-repression of FSHD candidate genes18 (Fig. 2). In this respect, DBE-T appears to be a very interesting target for the development of therapeutic approaches aimed at normalizing 4q35 gene expression in FSHD patients.

Figure 2. Model of DBE-T mediated de-repression of 4q35 genes in FSHD. In healthy subjects, the D4Z4 repeat array carrying many units displays extensive binding of PcG proteins leading to repression of the 4q35 locus. In FSHD patients, reduction in D4Z4 copy number critically diminishes PcG binding and silencing, allowing for transcription of the lncRNA DBE-T to occur. DBE-T functions in cis to promote opening of chromatin structure and de-repression of 4q35 genes through direct binding and recruitment of the TrxG protein Ash1L, which drives H3K36me2 at the FSHD locus.

Open Questions

Classical FSHD patients (FSHD1) carry the deletion of just D4Z4 repeats on a single 4q35 allele. FSHD1 patients also carrying deletions proximal to the D4Z4 array have been described.55 Notably, some of these proximal deletions cover the region where we mapped the DBE-T transcriptional start site (TSS). Future work is required to determine whether DBE-T (possibly a truncated version or a version starting from an alternative start site) is still produced in these patients.

A far less common form of the disease, FSHD2 (MIM #158901), is not linked to D4Z4 deletions at 4q35 but, through an unknown mechanism, is also characterized by loss of heterochromatin at the FSHD locus.56 Interestingly, FSHD2 patients display loss of heterochromatin not only from 4q35 but also from 10q26, where a macrosatellite repeat array almost identical to D4Z4 exists, suggesting a functional communication between these chromosomes, reminiscent of transvection in Drosophila.57 Consistent with this, somatic pairing of 4q35 and 10q26 D4Z4 has been reported.58 Whether DBE-T is also produced in FSHD2 patients and has a role in the general loss of heterochromatin at 4q35 and 10q26 remains to be investigated.

The copy number of D4Z4 repeats correlates with the age of onset and inversely with the clinical severity of FSHD.59-64 In general, alleles with 1-3 D4Z4 repeats are associated with a severe form of disease that presents in childhood, while patients with 8-10 repeats develop a milder disease with reduced penetrance. Interestingly, FSHD affects the various muscle types differently. Recently, a possible relationship between clinical and epigenetic parameters was reported.56 It would be important to determine whether DBE-T plays a role in mediating the epigenetic component of the clinical severity in FSHD.

A recent study conducted in yeast suggests that lncRNAs acting in cis could be responsible for the generation of a wide spectrum of fine-tuned, modulated effects on their targets, allowing for the production of variegated gene expression and, as a consequence, variegated phenotypes among genetically identical cells.65 From this perspective, it would be extremely interesting to examine the expression of DBE-T in asymmetrically affected muscles of FSHD patients, in order to test the association between the production of this lncRNA and the stochastic establishment of a dystrophic phenotype only in certain axes of the body during development.

Even though methylation of H3K36 has been mainly associated with gene activation, there are reports ascribing multiple roles to this histone mark. Different biological outcomes depend on when (developmental stage or cell cycle phase) or where (promoter region or gene body) the methylated H3K36 mark is placed and on which histone reader binds to this modification.66 While in budding yeast all levels of H3K36 methylation are executed by a single enzyme, termed Set2, which is coupled to transcriptional elongation,67 in mammals multiple, non-redundant proteins are responsible for the different flavors of H3K36 methylation, suggesting an increase in functional complexity during evolution. In particular, H3K36me2 has been linked to gene activation,68,69 prevention of spurious intragenic transcription initiation70 and the cellular response to DNA double-strand breaks.71 In agreement with a recent study,69 we observed enrichment of H3K36me2 in proximity of the DBE-T TSS.18 Notably, Ash1L knockdown leads to reduced H3K36me2 accumulation and decreased target expression, supporting an activating role for this histone mark in our system.

Intriguingly, there appears to be an evolutionarily conserved antagonism between H3K36 and H3K27 methylation.72-74 Considering that both the PcG hallmark H3K27me3 and H3K36me2 peak in proximity of the DBE-T TSS, it would be interesting to better characterize the relationship between these two marks in the context of FSHD.

Conclusions

Transcripts arising from repeats are cell-type specific and a source of regulatory information, as repeats can provide alternative promoters, alternative exons, regulatory ncRNAs and short interfering RNAs targeting cellular genes.75-79 The influence of repeat transcription on the transcriptional output of mammalian genomes is surprisingly broad. In mammalian cells, around 6-30% of all transcripts initiate within repetitive elements.77 Hence, transcription of repetitive elements could be one of the major driving forces of evolution.

The majority of the typical PcG histone marks are located outside protein-coding regions and mainly in genomic repeats.31,80-83 Our results support the hypothesis that repetitive elements might function as genomic binding platforms for PcG, whose activity can be regulated by lncRNAs.18,83

RNAs are arbitrarily defined as non-protein coding if they do not contain an ORF longer than 100 amino acids.4,84 Nevertheless, this dichotomy does not necessarily reflect distinct biological functions.4,84-86 Indeed, there are examples of protein-coding RNAs that also display non-coding functions.87-93 Moreover, RNAs encoding small (less than 100 amino acids) functional peptides have been reported.94 Thus, the complexity of RNA functions has only begun to be unveiled, and upcoming discoveries might hold further surprises.
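To make the arbitrariness of this operational definition concrete, the 100-amino-acid rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an established annotation pipeline: the function names are hypothetical, only the three forward reading frames are scanned, an ORF is taken to run from the first AUG to the next in-frame stop codon, and ORFs lacking a stop codon are ignored.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}
MAX_NONCODING_ORF_AA = 100  # threshold quoted in the text


def longest_orf_aa(rna: str) -> int:
    """Longest AUG-to-stop ORF, in amino acids, over the three forward frames."""
    seq = rna.upper().replace("T", "U")  # accept DNA-style input as well
    best = 0
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None  # index of the AUG that opened the current ORF
        for idx, codon in enumerate(codons):
            if start is None and codon == "AUG":
                start = idx
            elif start is not None and codon in STOP_CODONS:
                # Count codons from the AUG up to, but excluding, the stop.
                best = max(best, idx - start)
                start = None
    return best


def is_called_noncoding(rna: str) -> bool:
    """Apply the rule: no ORF longer than 100 amino acids."""
    return longest_orf_aa(rna) <= MAX_NONCODING_ORF_AA
```

Under this rule, a transcript whose longest ORF encodes a 100-residue functional peptide would still be labeled non-coding, which is exactly the kind of case that the dichotomy fails to capture.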

Acknowledgements

This work is a partial fulfillment of Valentina Casa’s PhD in Molecular Medicine, Program in Neuroscience, San Raffaele University. The Gabellini laboratory is supported by the European Research Council (ERC), the Italian Epigenomics Flagship Project, the Italian Ministry of Health and the FSHD Global Research Foundation. D. Gabellini is a Dulbecco Telethon Institute Assistant Scientist.

Footnotes

References