MEME: discovering and analyzing DNA and protein sequence motifs

Journal Article

* To whom correspondence should be addressed. Tel: +61 7 3346 2614; Fax: +61 7 3346 2101; Email: t.bailey@imb.uq.edu.au

Received:

14 February 2006

Revision received:

21 March 2006

Timothy L. Bailey, Nadya Williams, Chris Misleh, Wilfred W. Li, MEME: discovering and analyzing DNA and protein sequence motifs, Nucleic Acids Research, Volume 34, Issue suppl_2, 1 July 2006, Pages W369–W373, https://doi.org/10.1093/nar/gkl198

Abstract

MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel ‘signals’ in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource ( http://meme.nbcr.net ) and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.

INTRODUCTION

The purpose of MEME (Multiple EM For Motif Elicitation) (rhymes with ‘team’) ( 1 , 2 ) is to allow users to discover signals (called ‘motifs’) in DNA or protein sequences. The user of MEME inputs a set of sequences believed to share some (unknown) sequence signal(s). For example, some or all of a set of promoters from co-expressed and/or orthologous genes may contain binding sites (the ‘signal’) for the same transcription factor ( 3 ). Similarly, a set of proteins that interact with a single host protein may do so via similar domains (the ‘signal’) ( 4 ). Both types of sequence signals can often be represented as motifs, that is, ungapped, approximate sequence patterns. Using a process akin to gapless, local, multiple sequence alignment, MEME searches for statistically significant motifs in the input sequence set. In this way, MEME can discover the binding sites for the shared transcription factor in the set of promoters or the common protein–protein binding domains in the set of proteins. MEME can also be used to discover motifs describing many other types of DNA or protein signals besides transcription factor binding sites and protein–protein interaction domains.
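To make the notion of an ungapped, approximate pattern concrete, the following minimal sketch summarizes a handful of aligned sites as per-position letter frequencies and a consensus string. The function names and the example sites are invented for illustration; this is not MEME's code and it does not perform motif discovery.

```python
from collections import Counter

def position_frequencies(sites, alphabet="ACGT"):
    """Return one {letter: frequency} dict per column of equal-width, ungapped sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("ungapped motif sites must all have the same width")
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({letter: counts[letter] / len(sites) for letter in alphabet})
    return matrix

def consensus(matrix):
    """Most frequent letter at each position (ties go to the earlier alphabet letter)."""
    return "".join(max(column, key=column.get) for column in matrix)

# Made-up example sites: similar to each other but not identical.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pspm = position_frequencies(sites)
print(consensus(pspm))
for position, column in enumerate(pspm, start=1):
    print(position, column)
```

Each column of the resulting matrix sums to one, which is why such a summary tolerates the occasional mismatch at a position while still capturing the conserved letters.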

To use MEME via the website, the user provides a set of sequences in the FASTA format either by uploading a file or by cut-and-paste. The only other required input is an email address where the results will be sent. (A planned future version will remove this requirement by providing temporary storage of the results on the web server for a preset period of time.) By default, MEME looks for up to three motifs, each of which may be present in some or all of the input sequences. MEME chooses the width and number of occurrences of each motif automatically in order to minimize the ‘E-value’ of the motif, that is, the probability of finding an equally well-conserved pattern in random sequences. By default, only motif widths between 6 and 50 are considered, but the user may change this as well as several other aspects of the search for motifs.
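For readers preparing input programmatically, a minimal FASTA reader might look like the sketch below. It assumes the common convention of a ‘>’ header line followed by one or more sequence lines; the function name is hypothetical and the web server may accept or reject variants that this sketch does not handle.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""  # keep only the identifier
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

example = ">promoter1 upstream region\nACGTTGAC\ntcaaggt\n>promoter2\nTTGACTCA\n"
for name, sequence in parse_fasta(example):
    print(name, len(sequence), sequence)
```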

The MEME output is HTML and shows the motifs as local multiple alignments of (subsets of) the input sequences, as well as in several other formats ( Figure 1 ). ‘Block diagrams’ show the relative positions of the motifs in each of the input sequences. Buttons on the MEME HTML output allow one or all of the motifs to be forwarded for analysis by other web-based programs. Clicking on a button allows all of the motifs to be sent to the MAST web server where various sequence databases (or uploaded sequences) can be searched for sequences matching the motifs. This is useful in cases, for example, where the user would like to find whether the motif of interest is also present in other genes or genomes.

MAST is a web-based tool that can be used to search for sequences that match one or more motifs. It can be used to look for sequences that contain motifs found by MEME, by other motif discovery tools or that are taken from a motif database. The MAST website, reached via the same URL as the MEME website, provides numerous nucleotide and protein databases for searching. MAST queries may contain any number of motifs, and it scores each sequence in the selected database using all of the motifs. In the first example above, MAST can search DNA sequences for matches to the putative transcription factor binding site (TFBS) motifs found by MEME in a set of promoter sequences. MAST can search for matches in protein sequences to the putative protein–protein interaction motifs found in the second MEME example.
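The idea of scoring a sequence against a motif can be illustrated with a simplified sketch: the motif's position-specific probabilities are converted to log-odds scores against a background, and every window of the sequence is scored. This is only an illustration under stated assumptions (uniform background, a small pseudocount, letters restricted to A, C, G and T, one strand); MAST's actual scoring and its statistical treatment of multiple motifs are described in Ref. ( 6 ) and are not reproduced here. All names are hypothetical.

```python
import math

def log_odds_matrix(pspm, background, pseudocount=0.01):
    """Convert per-position probabilities into log2 odds against the background."""
    pssm = []
    for column in pspm:
        total = 1.0 + pseudocount * len(column)  # renormalise after adding pseudocounts
        pssm.append({
            letter: math.log2(((p + pseudocount) / total) / background[letter])
            for letter, p in column.items()
        })
    return pssm

def best_match(sequence, pssm):
    """Return (start, score) of the highest-scoring window, or None if too short."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[letter] for column, letter in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Toy three-column motif that strongly prefers "TGA" (illustrative values only).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_pspm = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
pssm = log_odds_matrix(toy_pspm, background)
for name, seq in {"seq1": "CCTGACC", "seq2": "CCCCCCC"}.items():
    print(name, best_match(seq, pssm))
```

A positive best score indicates a window that resembles the motif more than the background; a database search would rank sequences by a statistic derived from such scores.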

Users of MEME via the website or locally installed versions are asked to cite this article as well as the primary reference for MEME (5). Users of MAST are asked to cite this article and Ref. ( 6 ).

MOTIF DISCOVERY STRATEGIES

Motif discovery can be viewed as a ‘needle in a haystack’ problem. The motif discovery algorithm is looking for a set of similar short sequences (the needle) in a set of much longer sequences (the haystack). The problem is easier when the motif instances are long and very similar to each other. It gets much harder when the motif instances are short and/or degenerate, or the input sequences are very long.
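A rough back-of-the-envelope calculation shows why short or degenerate motifs and long sequences make the problem harder. The sketch below assumes a uniform, independent background over four letters, a single strand, and a fixed target word matched with at most a given number of mismatches; these are simplifying assumptions for illustration and are not how MEME computes significance.

```python
import math

def match_probability(width, max_mismatches=0, alphabet_size=4):
    """Probability that a random word is within max_mismatches of a fixed target word."""
    p_match = 1.0 / alphabet_size
    p_miss = 1.0 - p_match
    return sum(
        math.comb(width, k) * p_miss**k * p_match**(width - k)
        for k in range(max_mismatches + 1)
    )

def expected_chance_hits(n_sequences, length, width, max_mismatches=0):
    """Expected number of chance matches across all start positions in the input set."""
    positions = n_sequences * max(length - width + 1, 0)
    return positions * match_probability(width, max_mismatches)

# Compare a long exact motif, a short exact motif, and a short degenerate motif
# in 20 sequences of 1000 bp each.
for width, mismatches in [(12, 0), (6, 0), (6, 1)]:
    hits = expected_chance_hits(20, 1000, width, mismatches)
    print(f"width={width}, mismatches<={mismatches}: {hits:.3g} chance hits expected")
```

Under these assumptions the expected number of chance matches grows linearly with total sequence length and rises sharply as the motif gets shorter or more mismatches are tolerated.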

Discovering TFBS motifs in a set of DNA sequences (e.g. genomic regions upstream of genes) is a difficult task owing to the tendency of binding sites to be short and degenerate, and owing to the fact that promoter regions are often difficult to identify precisely. The problem tends to be worse in higher eukaryotes than in prokaryotes and yeast because TFBS in higher eukaryotes tend to be shorter and more variable ( 7 ).

To successfully discover TFBS motifs with MEME, it is necessary to choose and prepare the input sequences carefully. Candidate sequences can be the promoters of genes believed to be co-regulated based on the evidence from expression microarray experiments, or sequences appearing to bind to a transcription factor based on chromatin immunoprecipitation experiments. The sequences should be as short as possible and contain as few ‘noise’ sequences (sequences not containing any motif) as possible. Ideally, the sequences should be <1000 bp long ( 8 ). Including more than 40 motif-containing sequences generally does not improve TFBS motif discovery with MEME and similar algorithms ( 9 ). If the sequences contain low-information segments that do not contain motifs of interest, it can be helpful to remove them using the DUST program (R. L. Tatusov and D. J. Lipman, unpublished NCBI/Toolkit), which is available for downloading at http://blast.wustl.edu/pub/dust/ . Repetitive DNA elements should also be removed from the sequences input to MEME using the RepeatMasker program (A. Smit, R. Hubley and P. Green, unpublished data), which can be accessed via the Web ( http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker ).
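The length and count guidelines above can be turned into a simple pre-submission check. The sketch below is only a convenience under stated assumptions: the function name and thresholds-as-parameters are hypothetical, it cannot tell which sequences actually contain a motif, and it does not perform low-complexity or repeat masking.

```python
def check_input_set(sequences, max_length=1000, max_count=40):
    """Return human-readable warnings for a {name: sequence} dict of DNA sequences."""
    warnings = []
    for name, seq in sequences.items():
        if len(seq) >= max_length:
            warnings.append(
                f"{name}: {len(seq)} bp; consider trimming below {max_length} bp"
            )
    if len(sequences) > max_count:
        warnings.append(
            f"{len(sequences)} sequences supplied; more than {max_count} "
            "motif-containing sequences generally does not improve discovery"
        )
    return warnings

example = {"geneA_upstream": "ACGT" * 300, "geneB_upstream": "TTGACTCA" * 50}
for message in check_input_set(example):
    print(message)
```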

It should be noted that MEME is not suited to whole-genome TFBS motif discovery. Owing to their shortness and degeneracy, TFBS motifs become statistically ‘invisible’ in the context of a whole genome. The sensitivity of the search for TFBS motifs can be improved by using a ‘higher-order background sequence model’, but this option is only available currently when users download the MEME source code and install it locally. Instructions for the installation are available at the MEME website ( http://meme.nbcr.net/meme/website/meme-download.html ) by clicking on ‘View MEME man page’; see the documentation for the ‘-bfile’ switch there.
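A higher-order background model is, in essence, a table of how often short tuples of letters occur in background sequence. The sketch below counts tuples of length 1 up to order+1 and normalizes within each length. It is a hedged illustration only: the function name is hypothetical, and the exact file layout expected by the ‘-bfile’ switch should be taken from the MEME man page rather than from this example.

```python
from collections import Counter

def tuple_frequencies(sequences, order, alphabet="ACGT"):
    """Frequencies of all observed tuples of length 1..order+1, normalised per length."""
    freqs = {}
    for k in range(1, order + 2):
        counts = Counter()
        for seq in sequences:
            seq = seq.upper()
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Skip tuples containing ambiguity codes such as N.
                if all(letter in alphabet for letter in kmer):
                    counts[kmer] += 1
        total = sum(counts.values())
        for kmer, count in sorted(counts.items()):
            freqs[kmer] = count / total
    return freqs

# Toy background sequences (illustrative only).
background_seqs = ["ACGTACGTTTGACA", "GGGACGTNACGTAA"]
for kmer, freq in tuple_frequencies(background_seqs, order=1).items():
    print(kmer, f"{freq:.3f}")
```

An order-1 table of this kind captures dinucleotide biases that a single-letter background would miss, which is the intuition behind the improved sensitivity mentioned above.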

Protein motifs are generally easier to discover owing to the larger size of the protein alphabet and the chemical similarity among groups of amino acids. This allows shorter motifs to be more statistically significant and makes it easier to distinguish functional motifs from statistical artifacts. To use MEME to discover protein motifs, the same basic guidelines apply as with DNA motifs: keep the sequences as short as possible and include as few sequences that are not likely to contain the motif as possible in the input to MEME. Low-complexity regions can be removed from the protein input sequences using the SEG program ( 10 ).

ANALYZING MOTIFS USING THE MEME OUTPUT HYPERLINKS

The MEME HTML output contains buttons making it easy to analyze the motifs it discovers. By clicking on the button labeled ‘Compare PSPM to known motifs in JASPAR database’ following each motif, the DNA motif can be compared to each of the motifs in the JASPAR database ( 11 ) of known TFBS motifs. Similarly, protein motifs may be compared with protein motifs in the BLOCKS database of protein motifs ( 12 ) by clicking on the ‘submit BLOCK’ button following each motif on the MEME form. This takes the user to the ‘BLOCKS server’ where clicking on ‘LAMA’ will compare the motif with those in the BLOCKS database. The BLOCKS server also allows users to display protein motifs in many different ways, including LOGOS ( 13 ) or phylogenetic trees, by clicking on the corresponding buttons on the BLOCKS server form. By clicking on one of the file output formats under Logos, the user is able to obtain a LOGOS diagram similar to that shown in Figure 2 .
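For readers curious how a sequence logo is drawn, the sketch below follows the general scheme of Ref. ( 13 ): the total height of a column is its information content in bits, and each letter's height is that total scaled by the letter's frequency. It omits the small-sample correction and is not the code used by the BLOCKS server; the function name is hypothetical.

```python
import math

def logo_heights(column, alphabet_size=None):
    """Return {letter: height in bits} for one motif column of letter frequencies."""
    size = alphabet_size or len(column)
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    information = math.log2(size) - entropy  # total stack height for this column
    return {letter: p * information for letter, p in column.items()}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
mixed = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
for label, column in [("conserved", conserved), ("mixed", mixed), ("uniform", uniform)]:
    print(label, {k: round(v, 3) for k, v in logo_heights(column).items()})
```

For protein motifs the same function can be called with alphabet_size=20, in which case a fully conserved position carries log2(20), roughly 4.32 bits, rather than 2 bits.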

To search sequences for matches to the motifs found by MEME, users can click on the ‘MAST’ button at the top of the MEME output form. This will take the user to the MAST website where they can select the database to search. Since MAST is sequence-oriented, TFBS motifs should only be used to search promoter regions. These are listed in the MAST database pull-down menu as ‘Upstream Sequence Databases’. Currently, only a few organisms are supported. However, users can upload their own database of promoter sequences for searching using MAST. Protein motifs can be used to search any of the sequence databases provided by the MAST website since MAST can search either protein or nucleotide databases with protein motifs. The MAST databases are updated weekly.

WEB SERVER AND USER SUPPORT

As of MEME version 3.5, the configuration and installation of MEME (including the web server) is significantly simplified by using Autoconf ( http://www.gnu.org/software/autoconf/autoconf.html ) and Automake ( http://www.gnu.org/software/automake/automake.html ) from the GNU Build System. An installation session for MEME and MAST web server may be as simple as follows:

cd meme_3.5.2
./configure --prefix=$HOME/meme --with-url=http://www.nbcr.net/meme --enable-web
make
make test
make install

Supported platforms now include Linux, Solaris, MacOS X, Cygwin and Irix.

The MEME web server hosted by NBCR is queried by about 800 different users (based on unique email addresses) each month. Usage has been growing steadily since the service was first introduced in 1996. Figure 3 shows usage growth at the NBCR server since 2000.

To meet the growing user demand and take advantage of the emerging grid-computing resources ( 14 ), we have made MEME available for installation on Linux clusters using either the RPM package manager or Rocks. The RPM package manager is a tool for managing software installation on computers running many versions of the Linux operating system. Rocks ( http://www.rocksclusters.org ) is a highly customized toolkit that computational biologists and engineers can use to build and maintain Linux clusters. The current NBCR MEME web server cluster is built using the MEME roll for Rocks and requires minimal maintenance effort.

MEME and MAST can be downloaded and installed free of charge by academic users via the website: ( http://meme.nbcr.net/meme/website/meme-download.html ). Approximately 300 users download the MEME/MAST software each month. The MEME support team offers assistance to the MEME and MAST user community through the forum ( http://nbcr.net/forum/viewforum.php?f=5 ) or the mailing list ( meme@nbcr.net ). Institutes interested in setting up MEME mirror sites are encouraged to contact us for any assistance.

FUTURE DIRECTIONS

To increase the sensitivity of MEME searches, we will add an option in the web server to let the user upload a background sequence model to MEME. We hope to add algorithms for removing low-complexity regions (SEG and DUST) and repeated elements (RepeatMasker) to the MEME website as a convenience to users. These services will also be exposed as web services and integrated using workflow tools developed by NBCR.

We also plan to add buttons to the MEME output to allow TFBS motifs to be used in searching for cis-regulatory modules via algorithms such as MCAST ( 15 ). MCAST will be configured to search the same DNA databases as MAST. In conjunction with this, we will add databases of upstream sequences for many additional organisms to the MAST/MCAST websites to facilitate the analysis of TFBS motifs discovered using MEME.

NBCR has developed a set of tools built on top of open source software that allows bioinformatics applications to be deployed easily as Web Services (S. Krishnan, B. Stearn, K. Bhatia, W. W. Li and P. Arzberger, manuscript submitted) and to leverage Cyberinfrastructure components transparently ( 14 ). A prototype has been deployed using MEME as a scientific driver ( 16 ); it offers users a dynamic pool of distributed compute resources, a workflow management console and a friendly user interface. This portal will be deployed to the production web server in the future.

Figure 1. Sample MEME output. This portion of a MEME HTML output form shows a protein motif that MEME has discovered in the input sequences. The sites identified as belonging to the motif are indicated, and above them is the ‘consensus’ of the motif and a color-coded bar graph showing the conservation of each position in the motif. Some of the hyperlinked buttons that allow the motif to be viewed and analyzed in other ways can be seen at the bottom of the screen shot.

Figure 2. LOGO of protein motif. LOGOS are a visualization tool for motifs. The height of a letter indicates its relative frequency at the given position (x-axis) in the motif.

Figure 3. Usage of MEME at the NBCR web server. The plot shows the number of different users submitting jobs to the NBCR MEME web server each month since December 2000. Usage figures for March 2006 include up to March 20 only.

ACKNOWLEDGEMENTS

The authors acknowledge the NBCR award from NCRR, NIH P41 RR08605, for support of the MEME and MAST website. T.L.B. acknowledges a grant from NIH, R01 RR021692-01, for support of continuing development of MEME and related sequence analysis tools. T.L.B. also acknowledges the ARC Centre for Bioinformatics (ACB) (ARC CE0348221) for infrastructure support for the MEME mirror site at the ACB. Funding to pay the Open Access publication charges for this article was provided by the NIH.

Conflict of interest statement . None declared.

REFERENCES

1. Bailey, T.L. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using EM. Mach. Learn., –80.

2. Bailey, T.L. and Elkan, C. (1995) The value of prior knowledge in discovering motifs with MEME. In Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T., Wodak, S. (Eds.), Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, July. AAAI Press, Menlo Park, CA, pp. –29.

3. Lyons, T.J., Gasch, A.P., Alex Gaither, L., Botstein, D., Brown, P.O., Eide, D.J. (2000) Genome-wide characterization of the Zap1p zinc-responsive regulon in yeast. Proc. Natl Acad. Sci. USA, 7957–7962.

4. Fang, J., Haasl, R.J., Dong, Y., Lushington, G.H. (2005) Discover protein sequence signatures from protein-protein interaction data. BMC Bioinformatics, –8.

5. Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Altman, R.B., Brutlag, D.L., Karp, P.D., Lathrop, R.H., Searls, D.B. (Eds.), Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, August. AAAI Press, Menlo Park, CA, pp. –36.

6. Bailey, T.L. and Gribskov, M. (1998) Combining evidence using P-values: application to sequence homology searches. Bioinformatics, –54.

7. Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol., 137–147.

8. Pevzner, P.A. and Sze, S.H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. In Bourne, P.E., Gribskov, M., Altman, R.B., Jensen, N., Hope, D., Lengauer, T., Mitchell, J.C., Scheeff, E.D., Smith, C., Strande, S., Weissig, H. (Eds.), Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, August. AAAI Press, Menlo Park, CA, pp. 269–278.

9. Hu, J., Li, B., Kihara, D. (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res., 4899–4913.

10. Wootton, J.C. and Federhen, S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.

11. Sandelin, A., Alkema, W., Engström, P., Wasserman, W.W., Lenhard, B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., D91–D94.

12. Henikoff, J.G., Pietrokovski, S., Henikoff, S. (1997) Recent enhancements to the blocks database servers. Nucleic Acids Res., 222–225.

13. Schneider, T.D. and Stephens, R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 6097–6100.

14. Foster, I. and Kesselman, C. (2004) The Grid 2: Blueprint for a New Computing Infrastructure, 2nd edn. Morgan Kaufmann Publishers, Inc., San Francisco, CA.

15. Bailey, T.L. and Noble, W.S. (2003) Searching for statistically significant regulatory modules. Bioinformatics, Suppl 2, II16–II25.

16. Li, W.W., Krishnan, S., Mueller, K., Misleh, C., Arzberger, P. (2006) Building cyberinfrastructure for bioinformatics using service oriented architecture. In Bu Sung, F.L., Abramson, D., Cai, W., Graupner, S., Jin, H., Sloot, P. (Eds.), Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, May. IEEE Press, USA (in press).

© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
