MEME: discovering and analyzing DNA and protein sequence motifs (original) (raw)
Journal Article
,
* To whom correspondence should be addressed. Tel: +61 7 3346 2614; Fax: +61 7 3346 2101; Email: t.bailey@imb.uq.edu.au
Search for other works by this author on:
,
Search for other works by this author on:
,
Search for other works by this author on:
Search for other works by this author on:
Received:
14 February 2006
Revision received:
21 March 2006
Cite
Timothy L. Bailey, Nadya Williams, Chris Misleh, Wilfred W. Li, MEME: discovering and analyzing DNA and protein sequence motifs, Nucleic Acids Research, Volume 34, Issue suppl_2, 1 July 2006, Pages W369–W373, https://doi.org/10.1093/nar/gkl198
Close
Navbar Search Filter Mobile Enter search term Search
Abstract
MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel ‘signals’ in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource ( http://meme.nbcr.net ) and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.
INTRODUCTION
The purpose of MEME (Multiple EM For Motif Elicitation) (rhymes with ‘team’) ( 1 , 2 ) is to allow users to discover signals (called ‘motifs’) in DNA or protein sequences. The user of MEME inputs a set of sequences believed to share some (unknown) sequence signal(s). For example, some or all of a set of promoters from co-expressed and/or orthologous genes may contain binding sites (the ‘signal’) for the same transcription factor ( 3 ). Similarly, a set of proteins that interact with a single host protein may do so via similar domains (the ‘signal’) ( 4 ). Both types of sequence signals can often be represented as motifs-ungapped, approximate sequence patterns. Using a process akin to gapless, local, multiple sequence alignment, MEME searches for statistically significant motifs in the input sequence set. In this way, MEME can discover the binding sites for the shared transcription factor in the set of promoters or the common protein–protein binding domains in the set of proteins. MEME can also be used to discover motifs describing many other types of DNA or protein signals besides transcription factor binding sites and protein–protein interaction domains.
To use MEME via the website, the user provides a set of sequences in the FASTA format by either uploading a file or by cut-and-paste. The only other required input is an email address where the results will be sent. (A planned future version will remove this requirement by providing temporary storage of the results on the web server for a preset period of time.) By default, MEME looks for up to three motifs, each of which may be present in some or all of the input sequences. MEME chooses the width and number of occurrences of each motif automatically in order to minimize the ‘ E -value’ of the motif—the probability of finding an equally well-conserved pattern in random sequences. By default, only motif widths between 6 and 50 are considered, but the user may change this as well as several other aspects of the search for motifs.
The MEME output is HTML and shows the motifs as local multiple alignments of (subsets of) the input sequences, as well as in several other formats ( Figure 1 ). ‘Block diagrams’ show the relative positions of the motifs in each of the input sequences. Buttons on the MEME HTML output allow one or all of the motifs to be forwarded for analysis by other web-based programs. Clicking on a button allows all of the motifs to be sent to the MAST web server where various sequence databases (or uploaded sequences) can be searched for sequences matching the motifs. This is useful in cases, for example, where the user would like to find whether the motif of interest is also present in other genes or genomes.
MAST is a web-based tool that can be used to search for sequences that match one or more motifs. It can be used to look for sequences that contain motifs found by MEME, by other motif discovery tools or that are taken from a motif database. The MAST website, reached via the same URL as the MEME website, provides numerous nucleotide and protein databases for searching. MAST queries may contain any number of motifs, and it scores each sequence in the selected database using all of the motifs. In the first example above, MAST can search DNA sequences for matches to the putative transcription factor binding site (TFBS) motifs found by MEME in a set of promoter sequences. MAST can search for matches in protein sequences to the putative protein–protein interaction motifs found in the second MEME example.
Users of MEME via the website or locally installed versions are asked to cite this article as well as the primary reference for MEME (5). Users of MAST are asked to cite this article and Ref. ( 6 ).
MOTIF DISCOVERY STRATEGIES
Motif discovery can be viewed as a ‘needle in a haystack’ problem. The motif discovery algorithm is looking for a set of similar short sequences (the needle) in a set of much longer sequences (the haystack). The problem is easier when the motif instances are long and very similar to each other. It gets much harder when the motif instances are short and/or degenerate, or the input sequences are very long.
Discovering TFBS motifs in a set of DNA sequences (e.g. genomic regions upstream of genes) is a difficult task owing to the tendency of binding sites to be short and degenerate, and owing to the fact that promoter regions are often difficult to identify precisely. The problem tends to be worse in eukaryotes than in prokaryotes and yeast because eukaryotic TFBS tend to be shorter and more variable ( 7 ).
To successfully discover TFBS motifs with MEME, it is necessary to choose and prepare the input sequences carefully. Candidate sequences can be the promoters of genes believed to be co-regulated based on the evidence from expression microarray experiments, or sequences appearing to bind to a transcription factor based on chromatin immunoprecipitation experiments. The sequences should be as short as possible and contain as few ‘noise’ sequences (sequences not containing any motif) as possible. Ideally, the sequences should be <1000 bp long ( 8 ). Including more than 40 motif-containing sequences generally does not improve TFBS motif discovery with MEME and similar algorithms ( 9 ). If the sequences contain low-information segments that do not contain motifs of interest, it can be helpful to remove them using the DUST program (R. L. Tatusov and D. J. Lipman, unpublished NCBI/Toolkit), which is available for downloading at http://blast.wustl.edu/pub/dust/ . Repetitive DNA elements should also be removed from the sequences input to MEME using the RepeatMasker program (A. Smit, R. Hubley and P. Green, unpublished data), which can be accessed via the Web ( http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker ).
It should be noted that MEME is not suited to whole-genome TFBS motif discovery. Owing to their shortness and degeneracy, TFBS motifs become statistically ‘invisible’ in the context of a whole genome. The sensitivity of the search for TFBS motifs can be improved by using a ‘higher-order background sequence model’, but this option is only available currently when users download the MEME source code and install it locally. Instructions for the installation are available at the MEME website ( http://meme.nbcr.net/meme/website/meme-download.html ) by clicking on ‘View MEME man page’; see the documentation for the ‘-bfile’ switch there.
Protein motifs are generally easier to discover owing to the length of the protein alphabet and the chemical similarity among groups of amino acids. This allows shorter motifs to be more statistically significant and makes it easier to distinguish functional motifs from statistical artifacts. To use MEME to discover protein motifs, the same basic guidelines apply as with DNA motifs—keep the sequences as short as possible and include as few sequences that are not likely to contain the motif as possible in the input to MEME. Low-complexity regions can be removed from the protein input sequences using the SEG program ( 10 ).
ANALYZING MOTIFS USING THE MEME OUTPUT HYPERLINKS
The MEME HTML output contains buttons making it easy to analyze the motifs it discovers. By clicking on the button labeled ‘Compare PSPM to known motifs in JASPAR database’ following each motif, the DNA motif can be compared to each of the motifs in the JASPAR database ( 11 ) of known TFBS motifs. Similarly, protein motifs may be compared with protein motifs in the BLOCKS database of protein motifs ( 12 ) by clicking on the ‘submit BLOCK’ button following each motif on the MEME form. This takes the user to the ‘BLOCKS server’ where clicking on ‘LAMA’ will compare the motif with those in the BLOCKS database. The BLOCKS server also allows users to display protein motifs in many different ways, including LOGOS ( 13 ) or phylogenetic trees, by clicking on the corresponding buttons on the BLOCKS server form. By clicking on one of the file output formats under Logos, the user is able to obtain a LOGOS diagram similar to that shown in Figure 2 .
To search sequences for matches to the motifs found by MEME, users can click on the ‘MAST’ button at the top of the MEME output form. This will take the user to the MAST website where they can select the database to search. Since MAST is sequence-oriented, TFBS motifs should only be used to search promoter regions. These are listed in the MAST database pull-down menu as ‘Upstream Sequence Databases’. Currently, only a few organisms are supported. However, users can upload their own database of promoter sequences for searching using MAST. Protein motifs can be used to search any of the sequence databases provided by the MAST website since MAST can search either protein or nucleotide databases with protein motifs. The MAST database are updated weekly.
WEB SERVER AND USER SUPPORT
As of MEME version 3.5, the configuration and installation of MEME (including the web server) is significantly simplified by using Autoconf ( http://www.gnu.org/software/autoconf/autoconf.html ) and Automake ( http://www.gnu.org/software/automake/automake.html ) from the GNU Build System. An installation session for MEME and MAST web server may be as simple as follows:
cd meme_3.5.2
./configure --prefix=$HOME/meme --with-url= http://www.nbcr.net/
meme --enable-web
make
make test
make install
Supported platforms now include Linux, Solaris, MacOS X, Cygwin and Irix.
The MEME web server hosted by NBCR is queried by about 800 different users (based on unique email addresses) each month. Usage has been growing steadily since the service was first introduced in 1996. Figure 3 shows usage growth at the NBCR server since 2000.
To meet the growing user demand and take advantage of the emerging grid-computing resources ( 14 ), we have made MEME available for the installation on Linux clusters using either the RPM package manager or Rocks. The RPM package manager is a tool for managing software installation on computers running many versions of the Linux operating system. Rocks ( http://www.rocksclusters.org ) is a highly customized toolkit for computational biologists and engineers to build and maintain Linux clusters. The current NBCR MEME web server cluster is built using the MEME roll for Rocks and requires minimal maintenance effort.
MEME and MAST can be downloaded and installed free of charge by academic users via the website: ( http://meme.nbcr.net/meme/website/meme-download.html ). Approximately 300 users download the MEME/MAST software each month. The MEME support team offers assistance to the MEME and MAST user community through the forum ( http://nbcr.net/forum/viewforum.php?f=5 ) or the mailing list ( meme@nbcr.net ). Institutes interested in setting up MEME mirror sites are encouraged to contact us for any assistance.
FUTURE DIRECTIONS
To increase the sensitivity of MEME searches, we will add an option in the web server to let the user upload a background sequence model to MEME. We hope to add algorithms for removing low-complexity regions (SEG and DUST) and repeated elements (RepeatMasker) in the MEME website as a convenience to users. These services will also be exposed as web services and are integrated using workflow tools developed by using NBCR.
We have also planned to add buttons to the MEME output to allow TFBS motifs to be used in searching for cis -regulatory modules via algorithms such as MCAST ( 15 ). MCAST will be configured to be able to search the same DNA databases as MAST. In conjunction with this, we will add databases of upstream sequences for many additional organisms to the MAST/MCAST websites to facilitate the analysis of TFBS motifs discovered by using MEME.
NBCR has developed a set of tools built on top of the open source software that allows bioinformatics applications to be deployed as Web Services easily (S. Krishnan, B. Stearn, K. Bhatia, W. W. Li and P. Arzberger, manuscript submitted) and leverage the Cyberinfrastructure components transparently ( 14 ). A prototype has been deployed using MEME as a scientific driver ( 16 ) that offers a user with a dynamic pool of distributed compute resource, workflow management console and a friendly user interface. This portal will be deployed to the production web server in the future.
Figure 1
Sample MEME output.This portion of an MEME HTML output form shows a protein motif that MEME has discovered in the input sequences. The sites identified as belonging to the motif are indicated, and above them is the ‘consensus’ of the motif and a color-coded bar graph showing the conservation of each position in the motif. Some of the hyperlinked buttons that allow the motif to be viewed and analyzed in other ways can be seen at the bottom of the screen shot.
Figure 2
LOGO of protein motif. LOGOS are a visualization tool for motifs. The height of a letter indicates its relative frequency at the given position ( x -axis) in the motif.
Figure 3
Usage of MEME at the NBCR web server. The plot shows the number of different users submitting jobs to the NBCR MEME web server each month since December 2000. Usage figures for March 2006 include up to March 20 only.
The authors acknowledge NBCR award from NCRR, NIH P41 RR08605, for support of the MEME and MAST website. TLB acknowledges grant from NIH, R01 RR021692-01, for support of continuing development of the MEME and related sequence analysis tools. T.L.B. also acknowledges the ARC Centre for Bioinformatics (ACB) (ARC CE0348221) for infrastructure support for the MEME mirror site at the ACB. Funding to pay the Open Access publication charges for this article was provided by the NIH.
Conflict of interest statement . None declared.
REFERENCES
1
Bailey, T.L. and Elkan, C.
1995
Unsupervised Learning of Multiple Motifs In Biopolymers Using EM
Mach. Learn
21
51
–80
2
Bailey, T.L. and Elkan, C.
1995
The value of prior knowledge in discovering motifs with MEME Proceedings of the Third International Conference on Intelligent Systems for Molecular biology, July In Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T., Wodak, S. (Eds.). Menlo Park, CA AAAI Press pp.
21
–29
3
Lyons, T.J., Gasch, A.P., Alex Gaither, L., Botstein, D., Brown, P.O., Eide, D.J.
2000
Genome-wide characterization of the Zap1p zinc-responsive regulon in yeast
Proc. Natl Acad. Sci. USA
97
7957
–7962
4
Fang, J., Haasl, R.J., Dong, Y., Lushington, G.H.
2005
Discover protein sequence signatures from protein-protein interaction data
BMC Bioinformatics
6
1
–8
5
Bailey, T.L. and Elkan, C.
1994
Fitting a mixture model by expectation maximization to discover motifs in biopolymers Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, August In Altman, R.B., Brutlag, D.L., Karp, P.D., Lathrop, R.H., Searls, D.B. (Eds.). Menlo Park, CA AAAI Press pp.
28
–36
6
Bailey, T.L. and Gribskov, M.
1998
'Combining evidence using P -values: application to sequence homology searches
Bioinformatics
14
48
–54
7
Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., et al.
2005
Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites
Nat. Biotechnol
.
23
137
–147
8
Pevzner, P.A. and Sze, S.H.
2000
Combinatorial approaches to finding subtle signals in DNA sequences Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, August. In Bourne, P.E., Gribskov, M., Altman, R.B., Jensen, N., Hope, D., Lengauer, T., Mitchell, J.C., Scheeff, E.D., Smith, C., Strande, S., Weissig, H. (Eds.). Menlo Park, CA AAAI Press pp.
269
–278
9
Hu, J., Li, B., Kihara, D.
2005
Limitations and potentials of current motif discovery algorithms
Nucleic Acids Res
.
33
4899
–4913
10
Wootton, J.C. and Federhen, S.
1966
Analysis of compositionally biased regions in sequence databases
Methods Enzymol
266
554
–571
11
Sandelin, A., Alkema, W., Engström, P., Wasserman, W.W., Lenhard, B.
2004
JASPAR: an open-access database for eukaryotic transcription factor binding profiles
Nucleic Acids Res
32
D91
–D94
12
Henikoff, J.G., Pietrokovski, S., Henikoff, S.
1997
Recent enhancements to the blocks database servers
Nucleic Acids Res
.
25
222
–225
13
Schneider, T.D. and Stephens, R.M.
1990
Sequence logos: a new way to display consensus sequences
Nucleic Acids Res
.
18
6097
–6100
14
Foster, I. and Kesselman, C.
The Grid 2: Blueprint for a New Computing Infrastructure
2004
2nd edn San Francisco, CA Morgan Kaufmann Publishers, Inc
15
Bailey, T.L. and Noble, W.S.
2003
Searching for statistically significant regulatory modules
Bioinformatics
19
Suppl 2,
II16
–II25
16
Li, W.W., Krishnan, S., Mueller, K., Misleh, C., Arzberger, P.
2006
Building cyberinfrastructure for bioinformatics using service oriented architecture Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, May In Bu Sung, F.L., Abramson, D., Cai, W., Graupner, S., Jin, H., Sloot, P. (Eds.). USA IEEE Press (in press)
© The Author 2006. Published by Oxford University Press. All rights reserved The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
I agree to the terms and conditions. You must accept the terms and conditions.
Submit a comment
Name
Affiliations
Comment title
Comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.
Citations
Views
Altmetric
Metrics
Total Views 13,390
10,199 Pageviews
3,191 PDF Downloads
Since 12/1/2016
Month: | Total Views: |
---|---|
December 2016 | 8 |
January 2017 | 21 |
February 2017 | 53 |
March 2017 | 73 |
April 2017 | 58 |
May 2017 | 61 |
June 2017 | 61 |
July 2017 | 45 |
August 2017 | 62 |
September 2017 | 40 |
October 2017 | 48 |
November 2017 | 56 |
December 2017 | 144 |
January 2018 | 182 |
February 2018 | 143 |
March 2018 | 190 |
April 2018 | 199 |
May 2018 | 137 |
June 2018 | 106 |
July 2018 | 107 |
August 2018 | 110 |
September 2018 | 98 |
October 2018 | 131 |
November 2018 | 123 |
December 2018 | 120 |
January 2019 | 161 |
February 2019 | 166 |
March 2019 | 180 |
April 2019 | 174 |
May 2019 | 180 |
June 2019 | 139 |
July 2019 | 154 |
August 2019 | 161 |
September 2019 | 141 |
October 2019 | 131 |
November 2019 | 157 |
December 2019 | 121 |
January 2020 | 157 |
February 2020 | 134 |
March 2020 | 136 |
April 2020 | 135 |
May 2020 | 127 |
June 2020 | 161 |
July 2020 | 115 |
August 2020 | 154 |
September 2020 | 139 |
October 2020 | 207 |
November 2020 | 181 |
December 2020 | 186 |
January 2021 | 137 |
February 2021 | 114 |
March 2021 | 183 |
April 2021 | 143 |
May 2021 | 148 |
June 2021 | 143 |
July 2021 | 88 |
August 2021 | 119 |
September 2021 | 128 |
October 2021 | 140 |
November 2021 | 135 |
December 2021 | 129 |
January 2022 | 139 |
February 2022 | 114 |
March 2022 | 143 |
April 2022 | 140 |
May 2022 | 199 |
June 2022 | 125 |
July 2022 | 131 |
August 2022 | 139 |
September 2022 | 136 |
October 2022 | 140 |
November 2022 | 168 |
December 2022 | 111 |
January 2023 | 129 |
February 2023 | 141 |
March 2023 | 204 |
April 2023 | 243 |
May 2023 | 193 |
June 2023 | 133 |
July 2023 | 112 |
August 2023 | 142 |
September 2023 | 153 |
October 2023 | 156 |
November 2023 | 139 |
December 2023 | 180 |
January 2024 | 215 |
February 2024 | 166 |
March 2024 | 224 |
April 2024 | 240 |
May 2024 | 230 |
June 2024 | 158 |
July 2024 | 194 |
August 2024 | 155 |
September 2024 | 184 |
October 2024 | 190 |
November 2024 | 144 |
×
Email alerts
Citing articles via
More from Oxford Academic