MICA: desktop software for comprehensive searching of DNA databases (original) (raw)

BMC Bioinformatics volume 7, Article number: 427 (2006)Cite this article

Abstract

Background

Molecular biologists work with DNA databases that often include entire genomes. A common requirement is to search a DNA database to find exact matches for a nondegenerate or partially degenerate query. The software programs available for such purposes are normally designed to run on remote servers, but an appealing alternative is to work with DNA databases stored on local computers. We describe a desktop software program termed MICA (_K_-M er I ndexing with C ompact A rrays) that allows large DNA databases to be searched efficiently using very little memory.

Results

MICA rapidly indexes a DNA database. On a Macintosh G5 computer, the complete human genome could be indexed in about 5 minutes. The indexing algorithm recognizes all 15 characters of the DNA alphabet and fully captures the information in any DNA sequence, yet for a typical sequence of length L, the index occupies only about 2_L_ bytes. The index can be searched to return a complete list of exact matches for a nondegenerate or partially degenerate query of any length. A typical search of a long DNA sequence involves reading only a small fraction of the index into memory. As a result, searches are fast even when the available RAM is limited.

Conclusion

MICA is suitable as a search engine for desktop DNA analysis software.

Background

Researchers are increasingly working with large DNA databases. For example, the human genome is approximately 3 gigabases. Searching these databases has traditionally been done by using web applications to communicate with dedicated servers. As an alternative analysis tool, desktop computers offer richer and more responsive graphical interfaces. Desktop software programs are available for displaying and manipulating plasmids and other relatively small DNA molecules. Such functionality could theoretically be extended to large DNA databases, because typical desktop computers now have hard disk capacities of hundreds of GB. However, most bioinformatics applications load all of the relevant DNA data into main memory, so the RAM capacity of desktop computers remains a limitation. The challenge is to create desktop DNA analysis software that accomodates large DNA databases while using modest amounts of RAM.

A basic requirement for such software is rapid searching of a DNA database to find all exact matches for a query sequence. The desired search speeds can only be achieved by indexing the database. One well-characterized indexing strategy is to generate a suffix tree [1]. Although suffix trees have been used productively for some molecular biology applications, such as aligning whole genomes [2], they consume large amounts of memory, up to 15 bytes or more per base. Suffix arrays are more compact than suffix trees and can provide similar search capabilities, but they still require 4–8 bytes per base [3]. These conventional suffix-based search strategies rely on loading the entire index into memory, so they are not currently suitable for desktop analysis of large DNA databases. A potential solution to this problem is to generate a compressed suffix array using a Burrows-Wheeler transform [4]. The indexes generated by this method occupy as little as 0.3 bytes per base, but indexing is slow: generation of a compressed suffix array for the human genome required many hours on a desktop computer [4].

Non-suffix-based indexing stategies are currently in more widespread use for DNA databases. The SSAHA algorithm divides a DNA sequence into nonoverlapping _K_-mers, and stores the position of these _K_-mers in a hash table [5]. A similar indexing method is used by the BLAT algorithm [6]. Both SSAHA and BLAT generate relatively small indexes, on the order of 1 byte or less per base, and they can be orders of magnitude faster than FASTA or BLAST, which index the query sequence rather than the database [79]. SSAHA and BLAT have proven to be powerful for applications such as mapping sequence reads to a genome, or aligning mRNA sequences with the corresponding genomic DNA sequences. However, SSAHA and BLAT have limitations. Unlike suffix-based algorithms, which can identify all matches to any query sequence, SSAHA cannot detect a match of fewer than K bases, and requires 2_K_-1 consecutive matching bases to guarantee that a match will be registered. Because SSAHA sorts the search results, efficient searching is achieved by ignoring the _K_-mers that occur most frequently in the database. Similarly, BLAT sacrifices completeness for speed.

These various algorithms have been designed with the assumption that the complete index of a DNA database will be stored in main memory. Such algorithms are inconvenient for desktop applications. An index might be too large to fit within the RAM of a personal computer. Even if sufficient RAM were available, loading an index might be time-consuming. For example, a compressed suffix array of the human genome occupies about 1 GB [4], and reading those data from a hard disk into RAM would require tens of seconds, an unacceptable delay for users who expect fast access to information. We are developing software that will allow users to open and browse a large DNA database as rapidly as they open a plasmid file. The main innovation of our approach is to leave the index on disk, and to retrieve the relevant data selectively during a search.

For general purpose DNA analysis software, the database searches need to be comprehensive. Applications include searching whole chromosomes to create complete lists of restriction sites or oligonucleotide hybridization sites. We have met this need with an indexing and searching algorithm called MICA (_K_-M er I ndexing with C ompact A rrays). The indexes occupy ~2 bytes per base and can be generated quickly. Unlike most other algorithms, MICA indexes not only the standard nucleotide base characters A, C, G, and T, but also the degenerate base characters B, D, H, K, M, R, S, V, W, Y, and N. Efficient search procedures identify all matches for a nondegenerate or partially degenerate query of any length. When a file is being searched, only a fraction of the index is loaded into memory. The result is that desktop computers with modest amounts of RAM can rapidly open and search large DNA databases.

Implementation

Index structure

Indexing a DNA sequence can be achieved by scanning the sequence with a window of width K. For a sequence of length L, sliding the window one base at a time yields L - K + 1 overlapping _K_-mers. A MICA index uses arrays to store all of the positions for each _K_-mer in the subject DNA sequence. Because chromosomal DNA molecules can be up to several hundred million bases long, the _K_-mer position values would normally be represented as 4-byte integers. We reduce this data storage requirement by dividing the sequence into C separate "chunks" of 65,535 (216-1) bases, where C = ceiling(L/65,535). The intra-chunk positions of each nondegenerate _K_-mer are stored as 2-byte integers. The absolute positions of a _K_-mer within the full DNA sequence can then be calculated with the aid of a list specifying the number of instances of the _K_-mer within each chunk.

In addition to the four nondegenerate base characters, there are 10 characters (B, D, H, K, M, R, S, V, W, and Y) representing partially degenerate base possibilities, plus one character (N) that represents any possible base. Partially degenerate _K_-mers must be recognized if the index is to capture all of the information in any DNA sequence. It would be wasteful to create an array for all of the different partially degenerate _K_-mers because most of the array elements would typically be empty. Instead, for each partially degenerate _K_-mer that is actually present in a subject sequence, we use a 4-byte integer to record the absolute position of the _K_-mer, followed by a _K_-byte string to record the sequence of the _K_-mer. This approach is inefficient with regard to data storage, but most DNA sequences contain very few partially degenerate _K_-mers, and the simplicity of the data format facilitates searching. A separate strategy is used for stretches of N's, which are commonly used to indicate undefined portions of a sequence. Here S designates the number of "N-stretches" of K or more consecutive N's. MICA efficiently indexes N-stretches by recording their starting positions and their lengths.

Also stored in the index is the topology of the subject DNA molecule (linear or circular). With circular DNA molecules, MICA finds matching sequences that span the numerically defined origin.

Table 1 summarizes the MICA file structure, together with the generic data storage requirements for each part of the file. As an example, Table 1 lists the specific data storage requirements for human chromosome 1, which at ~246 million bases represents one of the longest DNA molecules that needs to be indexed. To ensure that memory addresses can be represented as 4-byte integers, we have constrained MICA to index individual DNA sequences of no more than 1 gigabase.

Table 1 MICA file structure and data storage requirements

Full size table

Choice of K

The main body of a MICA index is the Chunk Data Array, which stores the positions of the nondegenerate K_-mers (Table 1). The total number of position values is largely independent of K. However, there are 4_K different nondegenerate K_-mers, so the Chunk Data Array is divided into 4_K partitions. Each partition is divided into C sub-partitions that contain the intra-chunk position values. The sizes of these sub-partitions are recorded in a list at the beginning of the partition. As a result, the practical upper limit for K is 7, because for K = 8 there would be 65,536 sub-partitions per chunk, and the lists of sub-partition sizes would occupy more space than the lists of _K_-mer position values. The practical lower limit for K is 3, because reducing K increases the size of the partitions that must be read from disk, thereby slowing most searches.

Index creation

To create a MICA index, a subject DNA sequence in FASTA format [10] is scanned using a window of width K. Both uppercase and lowercase characters are recognized. An initial scan fills in the data for index elements E – J (Table 1). Then the appropriate memory for the data arrays (index elements K – M) is allocated, and a second scan fills in the positions of the _K_-mers and N-stretches. These operations are fast, partly because building the index requires no sorting.

If sufficient memory is available, indexing speed is maximized by building the entire index in memory and then writing to disk in a single step. In the case of human chromosome 1, this process requires about 0.67 GB of RAM, an amount that is available on many desktop computers. If memory is limiting, only a subset of the _K_-mer position values are stored in memory at a given time, and the index is written to disk in multiple steps.

Currently, MICA embeds a copy of the DNA sequence within the file. This sequence consists of uppercase characters in 8-bit ASCII format, and therefore occupies L bytes. The original sequence file is then dispensable for searching. For future applications, MICA will be integrated with software that automatically generates a suitably formatted DNA sequence.

General search strategy

If a DNA sequence occupies more than 16 chunks (~1 megabase), only elements A – C and E – J of the MICA file are initially read from disk (Table 1). These reads are very fast because they involve a small amount of data, just over 1 KB for K = 4 or just over 16 KB for K = 6. During a search, MICA uses the data from this first portion of the index to find the relevant position values. For example, to find the positions of a nondegenerate _K_-mer, an entry in the Chunk Counts Summary (index element H) indicates where the relevant position values can be read from the Chunk Data Array (index element K). Thus, MICA selectively reads only the essential data from disk, thereby performing efficient I/O operations and minimizing RAM usage.

Figure 1 provides a pseudocode summary of the basic MICA search routines. The query length Q can range from one base to the length of the subject DNA sequence. Both strands of the DNA molecule are searched. For a query that is palindromic–i.e., identical to its reverse complement–a single search is performed. For a query that is nonpalindromic, two successive searches are performed, one with the query and another with the reverse complement of the query. If the DNA molecule is circular, the initial search is followed by a secondary search for matches that span the origin. The key step in this secondary search involves dividing the query in half and then checking for one of two possibilities: either the first half-query matches within the last _Q_-1 bases of the DNA sequence, or the second half-query matches within the first _Q_-1 bases of the DNA sequence.

Figure 1

figure 1

Pseudocode summary of the basic MICA search routines. See the text for additional information about memory management, intersection algorithms, and merge operations for degenerate _K_-mers.

Full size image

If a query is shorter than K bases, it is extended to K bases by adding N's, and is then treated as being partially degenerate (see below). If a query is exactly K bases, the search consists of converting the 2-byte intra-chunk position values for that _K_-mer to 4-byte absolute position values. If a query is longer than K bases, it is decomposed into constituent _K_-mers, which are examined as follows. The list of intra-chunk position values for the first _K_-mer is read from the index and converted to absolute position values, thereby creating an initial working list. Each value in the working list is then compared with the second _K_-mer list. The result is a new working list, which indicates where both the first and second _K_-mers from the query match the subject DNA sequence. This new working list is then compared with the next _K_-mer list, and so on. In this manner, MICA progressively trims the initial working list to generate a final list of matches.

With a query longer than K bases, the constituent _K_-mers are examined in increasing order of their frequency of appearance in the subject DNA sequence. For example, a search for the 12-mer AAAACCCCGGGG using K = 4 might involve calculating the positions for CCCC, then comparing each CCCC position against the list of positions for GGGG, then comparing each CCCCGGGG position against the list of positions for AAAA, which in this case would be the most common of the three 4-mers. This strategy of starting with the rarest _K_-mer can significantly accelerate searches because some _K_-mers are found less frequently than others and therefore result in fewer comparisons. In chromosome 1, the most common 4-mer (AAAA) appears 56 times more often than the rarest 4-mer (CGCG), and the most common 6-mer (TTTTTT) appears 929 times more often than the rarest 6-mer (CGTACG).

Each successive _K_-mer search is limited to the range of chunks that generated hits for the current working list. In the example above, after the CCCCGGGG hits have been identified, the search for AAAA is limited to the chunks between the first and last occurrence of CCCCGGGG. If a working list contains no hits, the search is terminated. This range limitation method can accelerate searches when a query has a small number of matches to the subject sequence.

Searching with partially degenerate queries

A partially degenerate query can be expanded by searching for all of the possible matching sequences. For example, the restriction enzyme Bsp 1286I has the recognition sequence GDGCHC, which can potentially match 9 nondegenerate sequences (when the second base is A, G, or T, and the fifth base is A, C, or T) and 40 partially degenerate sequences (when the second base is D, R, K, or W, or the fifth base is H, M, Y, or W). A search for GDGCHC will therefore return matches for all of the 49 possible matching 6-mers. Alternatively, searches for a partially degenerate query can be restricted to return only literal matches to that character string, so a search for "GDGCHC" will return matches only for the single 6-mer GDGCHC.

MICA scans a partially degenerate query to determine an efficient search strategy that limits the degeneracy of the constituent _K_-mers. For example, the restriction enzyme Bae I has the recognition sequence ACNNNNGTAYC, and for K = 4 the matches are found by searching for GTAY, TAYC, and ACNN. The MICA index is ordered lexicographically, so the _K_-mer ACNN invokes 16 contiguous disk reads from the Chunk Data Array, whereas the equivalent _K_-mers NNAC and NACN would invoke non-contiguous reads. Because contiguous reads are faster than non-contiguous reads, MICA pushes any degeneracy to the end of a _K_-mer whenever possible.

With a partially degenerate _K_-mer, the working list must be compared to multiple individual _K_-mer lists using an intersection algorithm. An obvious approach would be to adapt the method that is used with a single nondegenerate _K_-mer. In that case, MICA finds the intersection of the working list and the next _K_-mer list using a standard technique: a pointer is assigned to the _K_-mer list, and for each successive element in the working list, the pointer is advanced until the value in the _K_-mer list equals or exceeds the value in the working list [11]. When there are multiple _K_-mer lists, a pointer can be assigned to each one, and a working list element can be compared to all of the _K_-mer lists. However, this method becomes very inefficient if the working list is larger than the individual _K_-mer lists, because most of the comparisons fail to advance the pointers. MICA therefore uses an alternative intersection algorithm for partially degenerate _K_-mers. A boolean array of 65,535 elements is used to represent the positions in a chunk. For a given chunk, all of the individual _K_-mer lists are scanned, and the 2-byte position values are recorded by setting the corresponding boolean elements to true, yielding a boolean array that indicates which positions in the chunk match one of the _K_-mers. Then the intersection is obtained by checking whether each working list element corresponds to a value of true in the boolean array. This method is efficient due to the relatively small number of operations and the sequential nature of the memory accesses.

When a _K_-mer is very degenerate, a substantial amount of time may be needed to read and process the index data. In such cases, MICA switches to an alternate mode that uses the embedded DNA sequence. The entire query is compared to DNA sequence fragments that overlap each hit in the working list. Based on empirical tests, MICA was configured to perform this mode switch whenever the amount of index data that would need to be read exceeds 33% of the total DNA sequence data. This condition typically arises with extremely degenerate _K_-mers such as ANNN (K = 4) or ANNNNN (K = 6). Even with a less degenerate _K_-mer, the alternate mode is used if more data would be read by using the index than by directly examining the DNA sequence. Thus, at each stage of a search, MICA takes advantage of the fastest available option.

Memory management during searches

Reading data from disk is slow, so the search speed can be maximized by pre-loading the entire file into main memory. MICA uses this approach when the sequence occupies up to 16 chunks. The corresponding files usually occupy no more than 3 MB of RAM and can be read from disk in a fraction of a second.

For longer sequences, as described above, MICA sacrifices some potential search speed in exchange for rapid index loading and low memory usage. The only parts of the index that are initially read into memory are the elements that describe the structure of the data arrays. During a search, the position values for the relevant _K_-mers are selectively read from disk. These read operations are usually the rate-limiting step in the search, but they are relatively efficient because of the compact 2-byte indexing format and because all of the positions for each _K_-mer are stored contiguously. Only a small portion of the index is used at a given time, so a typical search requires very little memory.

If a query sequence contains a degenerate or otherwise abundant _K_-mer, then reading the full list of position values for that _K_-mer might require more RAM than is available. MICA deals with this situation by dividing the subject sequence into segments and searching each segment in turn. During the search of a given segment, only the corresponding _K_-mer position values are read into memory.

Results and discussion

MICA was coded in C++ and tested on a 2.5-GHz G5 Macintosh running OS X (10.4, Tiger) with 2.5GB of RAM. As subject data we used the May 2005 Ensembl release of the human genome [12], comprising 3.08 gigabases in 25 files representing the linear chromosomes 1–22, X, and Y, plus the circular mitochondrial chromosome. A simple graphical user interface was later constructed using Trolltech's Qt 4.1.

Indexing performance

For a server application, a large index may be acceptable if sufficient memory is available, and slow indexing is acceptable because the index is created once and then used indefinitely. For a desktop application, smaller indexes are desirable because they occupy less disk space. Moreover, versatility is increased if the index can be created and updated rapidly, because this feature facilitates the analysis of new sequences and the modification of existing sequences.

Table 2 shows representative MICA index sizes and indexing times. With human chromosomes, storage requirements for the indexes were just under 2 bytes per base, reflecting the existence of N-stretches in the current genome assembly. With a computer-generated random sequence containing no degenerate base characters, the storage requirement was slightly over 2 bytes per base. The indexing time for chromosome 1, which is ~246 million bases, was 19.3 sec for K = 4 or 27.1 sec for K = 6. Only 56% of the K = 6 indexing time (15.3 sec) was due to the indexing procedure itself, with most of the remaining time being consumed by writing the completed index to disk. Indexing the entire human genome required 262 sec (4.4 min) for K = 4 or 345 sec (5.8 min) for K = 6. These speeds should enable a researcher to process a DNA database and move promptly to the analysis stage.

Table 2 Representative sizes, creation times, and loading times for MICA indexes

Full size table

To simulate indexing with limited RAM, we instructed MICA to index chromosome 1 using the procedure that would be followed if only 100 MB of RAM were available. The indexing time for K = 6 was 126 sec, which should still be adequate for most applications.

Searching performance

The subject sequences were chromosome 1 or the entire human genome, and both DNA strands were searched. With the entire genome, the relevant index elements for each chromosome were loaded separately into memory for each search, but these loading times (Table 2) accounted for only a small fraction of the total search times. A series of searches was performed with nondegenerate queries of various lengths and with several partially degenerate queries.

Table 3 shows representative search times for K = 4. As expected, searches for 4-mers were the fastest, with each search requiring an average of 0.13 sec for chromosome 1 or 2.5 sec for the entire genome. Searches for 6- or 8-mers took about three times as long. For 15-mers, the average search times were 0.56 sec for chromosome 1 or 9.0 sec for the entire genome, about 50% longer than for 8-mers. As the query length increased further, the search times actually decreased as MICA took advantage of rare 4-mers within the queries.

Table 3 Representative search times for K = 4

Full size table

Table 4 shows representative search times for K = 6. As expected, searches for 6-mers were the fastest. For 8- to 100-mers, the K = 6 searches were approximately five times as fast as the corresponding K = 4 searches. The reason for this difference is that read operations are the slowest step for most searches, and an average 6-mer occupies much less index space than an average 4-mer.

Table 4 Representative search times for K = 6

Full size table

When the query length is less than K, the search times are relatively long because multiple _K_-mer lists must be merged. As an example for K = 4, the positions of each 3-mer were found by merging four 4-mer lists, so the 3-mer searches were much slower than the 4-mer searches (Table 3). For K = 6, the 4-mer searches were much slower than the 6-mer searches, and the 3-mer searches were slower still (Table 4). Multiway merges are naturally suited to parallel processing [13], and we are exploring the possibility of accelerating merges by harnessing the enhanced multithreading capacity of newer desktop computers.

For a given query length, searches are fast if there are very few matches to the query, and somewhat slower if multiple matches are distributed throughout the subject sequence. As a test case of a query with many matches, we used a conserved 30-base fragment of the Alu repeat element [14]. This Alu fragment was found 1,130 times in chromosome 1 and 14,041 times in the entire genome. The searches for the Alu fragment took only slightly longer than the searches for rare 30-mers (Tables 3 and 4). Thus, MICA delivers strong performance even with repeated sequences, which are challenging for some other search algorithms [5].

To test partially degenerate queries, we searched for the recognition sequences of the restriction enzymes Bsp 1286I (GDGCHC), Bgl I (GCCNNNNNGGC), and Bae I (ACNNNNGTAYC). In the case of Bgl I, the search involved generating lists of positions for the 3-mers GCC and GGC, and then finding the intersection of those lists. As described above, 3-mer searches are expensive because of the merges, so MICA defers such merges until after the intersection operations. This approach made searching for the Bgl I recognition sequence about twice as fast as searching for a single 3-mer (Tables 3 and 4). The searches for the non-palindromic Bae I recognition sequence were relatively slow because MICA needed to process the data for the 2-mers AC and GT. In general, the most time-consuming searches are those involving _K_-mers with substantial degeneracy, because multiple individual _K_-mer lists need to be read from disk and then examined.

Effects of varying K

We performed extensive tests with K chosen to be either 4 or 6. For _K_= 6 the individual _K_-mer reads were 16-fold smaller on average than for K = 4, yet the K = 6 searches for typical nondegenerate queries were faster by only about five-fold (Tables 3 and 4). The reason for this discrepancy is that the K = 6 reads are so small that disk seek times become limiting. Thus, increasing K to 7 would only marginally accelerate searches for typical nondegenerate queries. Moreover, a larger value of K would be detrimental with very short queries and with some partially degenerate queries. For example, when searching for the Bgl I recognition sequence, MICA expands the 3-mer GCC to the 4-fold degenerate GCCN for K = 4, but expands the same 3-mer to the 64-fold degenerate GCCNNN for K = 6 or the 256-fold degenerate GCCNNNN for K = 7. The best overall compromise for most purposes seems to be K = 6.

In the case of DNA molecules such as plasmids for which only a few KB are needed to store the DNA sequence, a K = 6 index is excessively large because it requires 24 KB to record how many times each nondegenerate _K_-mer is present. By contrast, a K = 4 index requires only 1.5 KB to store this information for a molecule of up to 65,535 bp. Therefore, MICA uses K = 4 if the DNA sequence fits within one chunk, or K = 6 if the DNA sequence occupies two or more chunks.

Effects of memory usage

For the genome-wide searches listed in Table 4, the amount of RAM used by MICA ranged from about 1.6 MB for a rare 6-mer to 40 MB for Bae I sites. These numbers are small because the searches were performed one chromosome at a time. To determine how memory limitation affects search times, we searched for Bae I sites in chromosome 1 under conditions that simulated different amounts of available RAM (Figure 2). The memory check algorithm estimated conservatively that searching all of chromosome 1 would require a maximum of 47.4 MB of RAM. As a result, when the available RAM dropped below this level, chromosome 1 was searched in segments. The search speeds decreased accordingly, but this decrease was not severe until the available RAM dropped below about 10 MB. Thus, even when a database contains large chromosomal sequence files, searching can be performed efficiently using a computer with modest amounts of memory.

Figure 2

figure 2

Example of search times as a function of available RAM. To simulate searching with various amounts of free memory, we instructed MICA (K = 6) to search chromosome 1 for Bae I sites (ACNNNNGTAYC) using the procedures that would be followed if the indicated amounts of RAM were available. For these measurements, each search was preceded by a large number of extraneous reads, which flushed the main memory of any prior data from the index.

Full size image

Because all of the nuclear human chromosomes exceed 1 megabase in length, MICA does not load the full indexes into memory, but instead loads only the elements that describe the structure of the data arrays. The benefits of this strategy are rapid index loading and low memory usage. The disadvantage is that searches are slower than they would be if the full indexes were pre-loaded into memory. To quantify this effect, we repeated the test searches after pre-loading into memory the full MICA file for chromosome 1. The resulting values, listed in brackets in Tables 3 and 4, show that pre-loading accelerated the searches. This acceleration was dramatic with nondegenerate queries of length K or more. Yet the pre-loading operation took approximately 18 sec, a prohibitively long time if the goal is to do one or a few quick searches. By avoiding a time-consuming preparatory stage, MICA enables users to search large sequences without delay.

Immediately after indexing, the entire index is typically available in memory, so this time point is convenient for doing any routine searches. As a model for such a routine search, we indexed chromosome 1 using K = 6, and then searched for the recognition sequences of the 236 restriction enzymes sold by New England BioLabs. This search was fast at just under 11 sec.

Relation to other approaches

The MICA index is hash-based but is actually related to suffix-based indexes. Hunt [15] has defined the "suffix sequoia" as a suffix tree derivative in which the entries are truncated at a string length of K. The result is a lexicographically-ordered _K_-mer array, which resembles the Chunk Data Array in the MICA index (Table 1), except that the suffix sequoia is about twice the size because absolute positions are stored as 4-byte integers. Another algorithm that resembles MICA is ACMES, which creates a hash table of 8-mers and accesses only the relevant hash bins during a search [16, 17]. ACMES can find exact matches for queries of any length, although the indexes are large because this program was designed to search both sequence and annotation data. Thus, a number of researchers have converged on the same general strategy for indexing DNA sequences.

Despite these similarities, MICA has the following unusual features that make it advantageous, especially for desktop applications.

  1. (1)
    A MICA index occupies only about 2_L_ bytes, and a mammalian genome can be indexed in a few minutes. The previously described indexes that support comprehensive searching are either substantially larger or require much longer to generate.
  2. (2)
    Rapid searching is accomplished without loading large amounts of data into main memory, because only a small fraction of each index is typically read from disk. Even if memory is quite limited, the search operations are still fast.
  3. (3)
    Because MICA indexes 4- to 6-mers, very short queries can be matched quickly. This functionality is particularly useful for finding restriction sites. By comparison, ACMES indexes 8-mers and is therefore slower at matching very short queries [17].
  4. (4)
    MICA can recognize and index all 15 characters of the standard DNA alphabet. By comparison, degenerate base characters are read as A's by SSAHA [5] and are excluded from the index by BLAT [6]. ACMES expands degenerate base characters by adding index entries for all of the corresponding nondegenerate 8-mers [17], but this approach has the disadvantage of indexing possible sequences as if they were actually present. With MICA, the index precisely captures the information in the original sequence, and the searches find all matches to any nondegenerate or partially degenerate query.

Conclusion

MICA is designed to be a core indexing and search engine. Because the underlying approach is very simple, we were able to optimize the algorithms extensively to take advantage of the properties of modern desktop computers [18]. The end result meets our goal of enabling users to open and search large DNA databases rapidly on computers with limited RAM.

In its present form, MICA is ideally suited to comprehensive searching for exact matches in a DNA database. Such a database might represent, e.g., a genome or a collection of plasmid vectors. Potential applications include: in silico restriction enzyme digestion, which can be used to type organisms by amplified fragment length polymorphism (AFLP) analysis or pulsed field gel electrophoresis [19, 20]; "virtual PCR" to predict the specificity of PCR amplification from complex templates [21, 22]; and the automated definition of oligonucleotide-flanked sequence-tagged sites (STSs) in genomic sequences [23, 24]. We are incorporating MICA into desktop software that allows for versatile browsing and manipulation of chromosome-sized DNA sequences. For example, a MICA-based PCR simulator allows us to simulate a PCR amplification from human genomic DNA in 2–3 sec.

MICA could be extended by adding alignment algorithms for identifying sequences that are similar but not identical to the query. Such algorithms have been widely studied and implemented [1, 79, 15, 25, 26], and they can benefit greatly from using an index to find "seeds" for the alignments [6, 27, 28]. In addition, MICA could easily be modified to operate in server mode. For this purpose, faster searching of large sequences would be achieved by loading the complete indexes into memory, as illustrated in Tables 3 and 4.

Availability and requirements

Project name: MICA – Desktop Software for Comprehensive Searching of DNA Databases

Project home page: MICA binaries for Macintosh OS X (mica_mac.dmg) and Windows (mica_1.0_win.exe) are available at http://www.ibridgenetwork.org/technology.asp?Page=FB2310FB.

Operating systems: Macintosh OS X (10.3.9 or higher) and Windows (NT, 2000, XP)

Programming language: C++

Other requirements: None

License: Freely available for academic and non-profit use. Researchers can request assistance with obtaining the MICA source code and integrating it with other software.

Any restrictions to use by non-academics: Commercial users require a license. For questions regarding commercial uses, please contact the University of Chicago's Office of Technology and Intellectual Property, UCTech, at (773) 702–1692 or www.uctech.uchicago.edu.

References

  1. Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge, Cambridge University Press; 1997:534.
    Chapter Google Scholar
  2. Kurtz S, Philippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5: R12. 10.1186/gb-2004-5-2-r12
    Article PubMed Central PubMed Google Scholar
  3. Abouelhoda MI, Kurtz S, Ohlebusch E: Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithms 2004, 2: 53–86. 10.1016/S1570-8667(03)00065-0
    Article Google Scholar
  4. Lippert RA, Mobarry CM, Walenz BP: A space-efficient construction of the Burrows-Wheeler transform for genomic data. J Comp Biol 2005, 12: 943–951. 10.1089/cmb.2005.12.943
    Article CAS Google Scholar
  5. Ning A, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res 2001, 11: 1725–1729. 10.1101/gr.194201
    Article PubMed Central CAS PubMed Google Scholar
  6. Kent WJ: BLAT-The BLAST-like alignment tool. Genome Res 2002, 12: 656–664. 10.1101/gr.229202. Article published online before March 2002
    Article PubMed Central CAS PubMed Google Scholar
  7. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85: 2444–2448. 10.1073/pnas.85.8.2444
    Article PubMed Central CAS PubMed Google Scholar
  8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999
    Article CAS PubMed Google Scholar
  9. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
    Article PubMed Central CAS PubMed Google Scholar
  10. FASTA format description [http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml\]
  11. Knuth DE: The Art of Computer Programming. Volume 3: Sorting and Searching. 2nd edition., Addison-Wesley; 1998:800.
    Google Scholar
  12. Ensembl Genome Browser [http://www.ensembl.org/index.html\]
  13. Greene WA: k-way merging and k-ary sorts: ; Birmingham, AL. 31st Annual ACM Southeast Conference 1993, 127–135.
    Google Scholar
  14. Price AL, Eskin E, Pevzner PA: Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res 2004, 14: 2245–2252. 10.1101/gr.2693004
    Article PubMed Central CAS PubMed Google Scholar
  15. Hunt E: The suffix sequioa index for approximate string matching. DCS Tech Report, Dept of Computing Science, University of Glasgow, http://wwwdcsglaacuk/publications/PAPERS/7185/TR-2003–135pdf 2003, 1–26.
    Google Scholar
  16. Reneker J, Shyu CR, Zeng P, Polacco JC, Gassmann W: ACMES: fast multiple-genome searches for short repeat sequences with concurrent cross-species information retrieval. Nucleic Acids Res 2004, 32: W649-W653.
    Article PubMed Central CAS PubMed Google Scholar
  17. Reneker J, Shyu CR: Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals. BMC Bioinformatics 2005, 3: 111. 10.1186/1471-2105-6-111
    Article Google Scholar
  18. Crawford I, Wadleigh K: Software Optimization for High Performance Computing: Creating Faster Applications., Prentice Hall; 2000:377.
    Google Scholar
  19. Rombauts S, Van de Peer Y, Rouzé P: AFLPinSilico, simulating AFLP fingerprints. Bioinformatics 2003, 19: 776–777. 10.1093/bioinformatics/btg090
    Article CAS PubMed Google Scholar
  20. Bikandi J, San Millán R, Rementeria A, Garaizar J: In silico analysis of complete bacterial genomes: PCR, AFLP-PCR and endonuclease restriction. Bioinformatics 2004, 5: 798–799. 10.1093/bioinformatics/btg491
    Article Google Scholar
  21. Lexa M, Horak J, Brzobohaty B: Virtual PCR. Bioinformatics 2001, 17: 192–193. 10.1093/bioinformatics/17.2.192
    Article CAS PubMed Google Scholar
  22. Boutros PC, Okey AB: PUNS: transcriptomic- and genomic-in silico PCR for enhanced primer design. Bioinformatics 2004, 20: 2399–2400. 10.1093/bioinformatics/bth257
    Article CAS PubMed Google Scholar
  23. Rotmistrovsky K, Jang W, Schuler GD: A web server for performing electronic PCR. Nucleic Acids Res 2004, 32: W108-W112. 10.1093/nar/gnh102
    Article PubMed Central CAS PubMed Google Scholar
  24. Murphy K, Raj T, Winters RS, White PS: me-PCR: a refined ultrafast algorithm for identifying sequence-defined genomic elements. Bioinformatics 2004, 20: 588–590. 10.1093/bioinformatics/btg466
    Article CAS PubMed Google Scholar
  25. Li M, Ma B, Kisman D, Tromp J: Patternhunter II: highly sensitive and fast homology search. J Bioinform Comput Biol 2004, 2: 417–439. 10.1142/S0219720004000661
    Article CAS PubMed Google Scholar
  26. Noé L, Kucherov G: Improved hit criteria for DNA local alignment. BMC Bioinformatics 2004, 5: 149. 10.1186/1471-2105-5-149
    Article PubMed Central PubMed Google Scholar
  27. Ning Z, Spooner W, Spargo A, Leonard S, Rae M, Cox A: The SSAHA trace server. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB 2004) 2004, 544–545.
    Google Scholar
  28. Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 2005, 21: 1859–1875. 10.1093/bioinformatics/bti310
    Article CAS PubMed Google Scholar

Download references

Acknowledgements

Thanks to Michael Scott and Eugene Losev for helpful comments on the manuscript. This work was supported by The University of Chicago and by GSL Biotech, LLC. Both organizations approved the design and conclusions of the study, and the decision to submit this manuscript for publication.

Author information

Authors and Affiliations

  1. GSL Biotech, LLC, 5211 S. Kenwood Ave, Chicago, IL, 60615, USA
    William A Stokes & Benjamin S Glick
  2. Department of Molecular Genetics and Cell Biology, and Institute for Biophysical Dynamics, University of Chicago, 920 East 58th Street, Chicago, IL, 60637, USA
    Benjamin S Glick

Authors

  1. William A Stokes
    You can also search for this author inPubMed Google Scholar
  2. Benjamin S Glick
    You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence toBenjamin S Glick.

Additional information

Authors' contributions

WAS generated the code, and BSG guided the project. Both authors contributed to the algorithm design.

Both authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Stokes, W.A., Glick, B.S. MICA: desktop software for comprehensive searching of DNA databases.BMC Bioinformatics 7, 427 (2006). https://doi.org/10.1186/1471-2105-7-427

Download citation

Keywords