Complete Genome Sequence of the Giant Virus OBP and Comparative Genome Analysis of the Diverse ϕKZ-Related Phages

Abstract

The 283,757-bp double-stranded DNA genome of Pseudomonas fluorescens phage OBP shares a general genomic organization with Pseudomonas aeruginosa phage EL. Comparison of this genomic organization, assembled in syntenic genomic blocks interspersed with hyperplastic regions of the ϕKZ-related phages, supports the proposed division in the “EL-like viruses,” and the “phiKZ-like viruses” within a larger subfamily. Identification of putative early transcription promoters scattered throughout the hyperplastic regions explains several features of the ϕKZ-related genome organization (existence of genomic islands) and evolution (multi-inversion in hyperplastic regions). When hidden Markov modeling was used, typical conserved core genes could be identified, including the portal protein, the injection needle, and two polypeptides with respective similarity to the 3′-5′ exonuclease domain and the polymerase domain of the T4 DNA polymerase. While the N-terminal domains of the tail fiber module and peptidoglycan-degrading proteins are conserved, the observation of C-terminal catalytic domains typical for the different genera supports the further subdivision of the ϕKZ-related phages.

INTRODUCTION

Pseudomonas fluorescens phage OBP (vB_PflM_OBP) (30, 47) is a member of a growing group of giant phages, isolated, to date, only from Pseudomonas species and represented by the completely sequenced and well-studied Pseudomonas aeruginosa myovirus ϕKZ (9, 18, 19, 31–33, 36, 42). Three other phages encoding many proteins with similarity to ϕKZ proteins have been completely sequenced: EL (23), 201ϕ2-1 (52), and ϕPA3 (43). Krylov et al. (33) assigned 19 unsequenced Pseudomonas phages to this group. We refer to the group as ϕKZ-related phages. Lavigne et al. (35) have argued that EL should be classified as a genus separate from ϕKZ and 201ϕ2-1 based on its less extensive levels of similarity. By that criterion, ϕPA3 belongs to the “phiKZ-like viruses,” and the results reported here classify OBP as the second member of the possible genus “EL-like viruses.”

OBP shares a number of definitive properties with the other ϕKZ-related phages. Typically, ϕKZ-related phages have a very large icosahedral head, ∼122 nm in diameter, and a long (∼190-nm) contractile tail surrounded by fibers. A ϕKZ-related phage head contains a proteinic inner body, which has been speculated to organize the packaged DNA (18, 32). The exceptionally large genomes of the ϕKZ-related phages (between 211 and 317 kb of nonredundant sequence) are composed of circularly permuted and terminally redundant linear double-stranded DNAs as determined for ϕKZ and 201ϕ2-1 (42, 52). The ϕKZ-related genomes display a pronounced difference in G+C content (between 36.8 and 48%) and all have markedly lower G+C content than the chromosomes of their GC-rich Pseudomonas hosts (60 to 66% G+C).

This group of phages has numerous genes involved in nucleotide metabolism (e.g., thymidylate synthase, thymidylate kinase, ribonucleoside diphosphate reductase, subunit beta [NrdB], and dihydrofolate reductase) and at least six genes encoding β and divided β′ subunits specific for multisubunit bacterial RNA polymerases. Three of these RNAP subunits are virion associated, suggesting that they are injected into the cells for expression of early phage genes (52).

Since the ϕKZ-related viruses have diverged strongly from the better-characterized Myoviridae and additionally show a higher divergence rate than “T4-like viruses” or host genomes (52), annotation using sequence comparison has been seriously hampered even for the conserved core genes (e.g., those encoding tail fibers, baseplate module, and DNA polymerase).

This article presents the sequence analysis and a detailed genome annotation of P. fluorescens phage OBP. Hidden Markov modeling was included to clarify the portal and DNA polymerase genes of all the ϕKZ-related phages—two functions long thought to be present but previously lacking gene identifications. The genes encoding the injection needle component of the cell-puncturing device, the putative tail fiber module(s), and an inner head multiprotein family were also located.

MATERIALS AND METHODS

Phage amplification and purification.

Bacteriophage OBP was isolated from a sample of compost from the Tver district in Russia in 2001. Phage OBP was amplified on its host, Pseudomonas fluorescens Pf 1.1, according to Shaburova et al. (47). Subsequent phage purification and concentration was conducted using the CsCl ultracentrifugation method.

Phage DNA isolation and sequencing.

Phage genomic DNA was isolated according to Naryshkina et al. (45). Initial sequence data (∼36.0 kb) were obtained from a shotgun library of phage DNA in pUC18 (Amersham Biosciences, Amersham, Great Britain) with the dideoxy terminator sequencing method [ABI 3130 sequencer, BigDye chemistry (Applied Biosystems, Foster City, CA)]. In the next step, the complete phage genome was sequenced by pyrosequencing at the McGill University and Génome Québec Innovation Centre (Montreal, QC, Canada). Three nonoverlapping contigs of ∼61.0, 90.7, and 133.1 kb were generated and matched to the dideoxy shotgun sequencing data. Primer walking directly on phage genomic DNA was performed until a single contig was generated. Finally, regions with possible sequence errors, based on bioinformatic analysis of the sequenced phage genome, were also verified using the primer walking method. Four frameshift errors were identified and corrected, resulting in an estimated error rate of ∼1/45,000 bp for the regions which could be bioinformatically verified based on protein similarity. One could expect one to three additional errors of this kind in the remaining portion of the genome.
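The error estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes that the ∼1/45,000 rate is simply the four corrected frameshifts divided by the length of the bioinformatically verifiable regions, and that the same rate applies uniformly to the rest of the genome. Neither assumption is stated explicitly in the text, and the variable names are illustrative only.

```python
# Back-of-the-envelope check of the frameshift error estimate.
# Assumption: the ~1/45,000 rate equals the 4 corrected frameshifts divided
# by the length of the regions that could be verified by protein similarity.
GENOME_LENGTH_BP = 283_757
CORRECTED_FRAMESHIFTS = 4
ERROR_RATE_PER_BP = 1 / 45_000

# Length of the verifiable regions implied by the stated rate.
verified_bp = CORRECTED_FRAMESHIFTS / ERROR_RATE_PER_BP

# Assumption: the remaining sequence carries errors at the same rate.
unverified_bp = GENOME_LENGTH_BP - verified_bp
expected_remaining_errors = unverified_bp * ERROR_RATE_PER_BP

print(f"verifiable regions: ~{verified_bp:,.0f} bp")
print(f"remaining regions:  ~{unverified_bp:,.0f} bp")
print(f"expected additional errors: ~{expected_remaining_errors:.1f}")
```

Under these assumptions the remaining portion is expected to carry roughly two further errors, which falls within the range of one to three quoted above.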

Open reading frames (ORFs) were identified with GeneMark (39), heuristic GeneMark (7), and frame-by-frame GeneMark (48) in combination with ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). If multiple translation initiation sites for an ORF were suggested, manual inspection for the presence of a convincing Shine-Dalgarno sequence and/or close packing with the nearest upstream ORF was used to select the most probable start codon.
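To make the notion of an ORF call concrete, the following is a minimal sketch of a forward-strand ORF scan. It is not the algorithm used by GeneMark or ORF Finder: it applies no statistical gene model, ignores the reverse strand, and always reports the most upstream in-frame start codon, whereas the procedure described above selects among candidate starts by manual inspection of Shine-Dalgarno sequences and spacing. The function name `find_orfs`, the `min_codons` parameter, and the toy sequence are all hypothetical.

```python
# Minimal forward-strand ORF scan; an illustration only, not GeneMark.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    For each stop codon only the most upstream in-frame start is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # remember the first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence containing one short ORF (ATG AAA GGG TAA).
toy = "CCATGAAAGGGTAACC"
print(find_orfs(toy))  # [(2, 14, 2)]
```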

Genome annotation.

Translated ORFs were compared with known proteins using a locally implemented version of PSI-BLAST (2) with the entire NCBI nonredundant (nr) plus environmental protein (env_nr) databases. Generated matches of borderline significance were further analyzed by reverse PSI-BLAST within the NCBI databases into which all OBP hypothetical proteins had been incorporated and with a member of the putative related family as the query. Searches of established protein family profiles were conducted with RPSBLAST (reverse PSI-BLAST) (41), HHpred (49, 50), and HMMER (15). Since most myoviral protein families have not yet been added to the international family databases, we constructed family alignments from all proteins encoded by bacteriophage T4 or bacteriophage P2 and embedded them within the HHpred library system. The family build strategy used the target2k strategy of the Sequence Alignment and Modeling system (SAM) (25, 28), with PSI-BLAST used in place of BLASTP. OBP family profiles were also built with SAM, and then family-to-family profile matching was done with a local implementation of the hhsearch program within the HHpred system.

In addition, all putative OBP ORFs were analyzed using a diverse set of programs: the secondary structure prediction algorithm of PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/); the Statistical Analyses of Protein Sequences package (SAPS) (8); transmembrane helix prediction by TMHMM2.0 (29); coiled-coil prediction using the COILS server (40); and prediction of signal peptides by the SignalP (6) and LipoP 1.0 (27) algorithms. Searches for tRNAs were conducted using tRNAscan-SE, version 1.23 (38), and Rfam (21). Potential conserved intergenic motifs were scanned with MEME/MAST (4) and PHIRE (34) to identify phage-specific regulatory elements. Putative factor-independent terminators were identified with TransTerm (13), followed by PHIRE analysis for motif detection. Sequence logos were calculated by WebLogo (http://weblogo.berkeley.edu/) (12).

Nucleotide sequence accession number.

The OBP phage genome has been deposited in GenBank under the accession no. JN627160.

RESULTS AND DISCUSSION

Shaburova et al. (47) reported that the giant virulent bacteriophage OBP, isolated from a compost sample in the Tver district (Russia, 2001), forms small turbid plaques on its bacterial host, Pseudomonas fluorescens Pf 1.1. Using transmission electron microscopic imaging, they revealed that OBP belongs to morphogroup A1 of the family Myoviridae of double-stranded DNA bacteriophages (119-nm head; 191-nm contractile tail). The OBP genome assembled into a circular sequence of 283,757 bp comprising 309 predicted open reading frames (ORFs), which are, according to the orientation of transcription, organized into clusters and predominantly (86.4%) orientated clockwise on the genomic map (see Table S1 in the supplemental material). The genome is tightly organized, leaving only 6.1% noncoding sequence with generally small inter-ORF sequences. (Of the 170 inter-ORF sequences greater than 15 bp, only 21 are greater than 125 bp.) Most ORFs either are spaced exactly consistently with a Shine-Dalgarno sequence and/or transcriptional control sequences or are slightly overlapping (only 20 overlaps greater than 15 bp). Most ORFs are predicted to initiate translation at an AUG codon, while the GUG and UUG start codons are predicted for 15 and two ORFs, respectively.
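The percentages in this paragraph translate into absolute numbers as follows. This is a small worked calculation using only the figures quoted above; the count of AUG-initiated ORFs assumes that every ORF not assigned a GUG or UUG start uses AUG, which the text implies but does not state outright.

```python
# Convert the quoted genome statistics into absolute counts.
GENOME_LENGTH_BP = 283_757
TOTAL_ORFS = 309

clockwise_orfs = round(0.864 * TOTAL_ORFS)      # 86.4% oriented clockwise
noncoding_bp = round(0.061 * GENOME_LENGTH_BP)  # 6.1% noncoding sequence

gug_starts, uug_starts = 15, 2
# Assumption: all remaining ORFs are predicted to start at AUG.
aug_starts = TOTAL_ORFS - gug_starts - uug_starts

print(f"clockwise ORFs: ~{clockwise_orfs} of {TOTAL_ORFS}")
print(f"noncoding sequence: ~{noncoding_bp:,} bp")
print(f"AUG starts: {aug_starts} ({100 * aug_starts / TOTAL_ORFS:.1f}%)")
```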

Although the average G+C content of OBP (43.5%) is significantly lower than that of the host P. fluorescens (about 60% G+C), only four tRNA genes specific for the codons for Met (AUG), Arg (AGA), Ser (AGC), and Asn (AAC) were predicted. Since phages selectively recruit tRNAs to compensate for the compositional differences between the viral and the host genome (5), it could be expected that the OBP tRNAs translate AT-rich codons. However, only one (the Arg codon) translates a codon with an A in the synonymous positions.
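The codon argument can be checked directly. The sketch below takes the four codons listed above and asks which of them end in A or U, treating the third codon position as the synonymous position. That simplification is an assumption of this example, since Arg and Ser codons can also vary at other positions.

```python
# Codons read by the four predicted OBP tRNAs, as listed in the text.
trna_codons = {"Met": "AUG", "Arg": "AGA", "Ser": "AGC", "Asn": "AAC"}

# Assumption: the "synonymous position" is the third codon base.
at_ending = {aa: codon for aa, codon in trna_codons.items()
             if codon[2] in "AU"}

print(at_ending)  # {'Arg': 'AGA'}
```

Only the Arg codon qualifies, in agreement with the observation that the tRNA set does not obviously compensate for the AT-rich codon usage of the phage.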

Moreover, the tRNAs are not conserved among the other ϕKZ-related phages, which also have a lower GC content [ϕKZ (36.8% G+C), EL (49.3%), and 201ϕ2-1 (45.3%)] than the respective Pseudomonas species they infect. However, three major virion structural proteins produced in high numbers at the end of the replication cycle and well conserved in the ϕKZ-related phages [proteins constituting the tail tube (Gp202), tail sheath (Gp203), and capsid (Gp11)] display a higher-than-average G+C content (50.1%, 47.9%, and 50.3%, respectively). This higher G+C content is consistent with an adaptation to facilitate translation of genes encoding high-copy-number proteins while using the bacterial machinery adapted to the higher G+C content of the host.
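As an illustration of the comparison made here, the following sketch defines a simple G+C calculation and tabulates how far the three structural genes sit above the OBP genome average of 43.5%. The function `gc_content` and the toy sequence are hypothetical helpers; the percentages are those quoted in the text and are not recomputed from sequence.

```python
def gc_content(seq):
    """Return the G+C percentage of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


# Toy sequence only, to show the function in use.
print(f"{gc_content('ATGCGC'):.1f}")  # 66.7

# Values quoted in the text (percent G+C).
GENOME_GC = 43.5
structural_gc = {
    "Gp202 (tail tube)": 50.1,
    "Gp203 (tail sheath)": 47.9,
    "Gp11 (capsid)": 50.3,
}
excess = {name: gc - GENOME_GC for name, gc in structural_gc.items()}
for name, delta in excess.items():
    print(f"{name}: +{delta:.1f} points above the genome average")
```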

Genome organization.

The OBP genome showed similarity at the DNA sequence level to bacterial genomes in three small regions corresponding to the genes coding for two tRNAs, the Arg (positions 16124 to 16196) and Asn (16829 to 16901) tRNAs, and for the ribonucleotide diphosphate reductase β subunit (207819 to 208952). Homology to phages was observed only at the protein level.

Among proteins encoded by other phages found to be similar by BLASTP, the closest homologs were nearly always from P. aeruginosa phage EL. BLASTP and PSI-BLAST results for the five ϕKZ-related phages generally supported the idea that OBP genes were more like EL genes and that 201ϕ2-1 genes were more like ϕKZ genes, with the newly reported ϕPA3 being just slightly more like 201ϕ2-1 than like ϕKZ (data not shown). Hence, although there is considerable variation in numbers of genes, indicating substantial participation in genetic exchanges, there appears to be a core set of orthologs, consistent with vertical descent and supporting division of the ϕKZ-related phages into the EL-like viruses and the phiKZ-like viruses as per Lavigne et al. (35).

Of the 95 OBP proteins found to be similar to EL proteins, most fell in a similar order and orientation along the genome. These are shown in blue on the map in Fig. 1. The genes matching EL tended to fall in blocks of synteny (blocks A, B, and C) (Fig. 1; also, see Table S1 in the supplemental material) with relatively low levels of disruption separated by regions of a more hyperplastic character. The division of phage genomes into syntenic blocks and hyperplastic regions was found to be a useful means of describing vertical descent in the presence of significant amounts of horizontal transfer in the analysis of ϕKZ-related and T4 genomes (11, 17, 23). The OBP/EL syntenic blocks were similar to syntenic blocks between EL and ϕKZ (23) and among 201ϕ2-1, ϕKZ, and EL (52). Syntenic blocks appearing in all five ϕKZ-related phages are indicated in Fig. 1. A tendency for both syntenic blocks and single genes to have undergone inversion in ϕKZ-related phages was noted previously (23, 52). Several previously noted inversions are mapped in Fig. 1 in red, along with additional single-gene inversions in OBP and ϕPA3.

Fig 1.


Genome sequence comparison of the ϕKZ-related phages. Genes inherited by vertical descent are shown in blue (syntenic position and orientation), red (syntenic position but inverted orientation with respect to OBP), and green (positional shift in the genome due to a multi-inversion process). Orange genes are ϕKZ-related homologs but are inconsistent with vertical descent. Genes in black have no ϕKZ-related homolog, but a homolog has been found elsewhere. Transcriptional orientation is indicated by the light blue (left to right) and pink (right to left) boxes above the OBP genome. Blocks of synteny (A, B, and C) are interspersed with hyperplastic regions.

When additional sets of homologs among the ϕKZ-related phages were investigated, another form of genomic rearrangement was discovered, typified by the two-gene block OBP-Gp262–OBP-Gp263. The homologs of these two genes in EL are displaced to the left of the preceding block of genes, which was already established to be orthologous between OBP and EL. However, none of the genes were inverted in their orientation. Such an arrangement could occur during vertical descent if the larger segment (OBP-Gp258–OBP-Gp263) were to invert and then the two subsegments (OBP-Gp258–OBP-Gp261 and OBP-Gp262–OBP-Gp263) were to reinvert, generating EL-Gp165-166 and EL-Gp168-171. Other examples that might be explained by a multi-inversion process are presented in green in Fig. 1. In three cases, one of the two blocks interchanging positions was still in inverted orientation. In two cases, both segments had apparently reinverted. Hence, the process that generates inversions tends to be focused in hyperplastic regions, often affecting the same genes several times over the time span separating these phages.
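The proposed multi-inversion process can be traced on a signed gene list, where each number stands for an OBP gene and the sign for its orientation. This is a minimal sketch of the model described above, not a reconstruction of the actual evolutionary events; the helper `invert` is a hypothetical name.

```python
def invert(segment):
    """Invert a segment: reverse the gene order and flip every orientation."""
    return [-gene for gene in reversed(segment)]


# OBP genes Gp258..Gp263; a positive sign marks the original orientation.
block = [258, 259, 260, 261, 262, 263]

# Step 1: the whole segment inverts.
step1 = invert(block)
print(step1)  # [-263, -262, -261, -260, -259, -258]

# Step 2: the two subsegments reinvert independently.
two_gene_block, four_gene_block = step1[:2], step1[2:]
final = invert(two_gene_block) + invert(four_gene_block)
print(final)  # [262, 263, 258, 259, 260, 261]
```

The two-gene block ends up to the left of the four-gene block while every gene keeps its original orientation, which is the pattern seen when the OBP and EL gene orders are compared.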

The end effect is that the entire genomes of the ϕKZ-related phages can be viewed as syntenic chromosomes interrupted by gene insertions and deletions and punctuated by some particularly hyperplastic regions. The hyperplastic nature of some regions has been exaggerated by the presence of several local inversions, obscuring the syntenic relationship among the genomes. The most prominent hyperplastic regions are between B and C in OBP/EL and to the right of C in the phiKZ-like viruses. The rearrangements specific to the orthologs of OBP-Gp263 (furthest right green set in Fig. 1) may be informative with regard to the differential placement of the major hyperplastic regions of the “EL-like viruses” and the phiKZ-like viruses. In order to explain the transfer of the OBP-Gp263 ortholog to the right of syntenic block C in the phiKZ-like viruses, it is necessary to propose an inversion of the entire block C. The entire blue segment would then have reinverted prior to engaging in the two subinversions drawn between EL and ϕKZ in Fig. 1. It is possible that the entire region between B and C was transferred along with the OBP-Gp263 ortholog to the right of block C by this process. Hence, the major hyperplastic regions of both EL-like viruses and phiKZ-like viruses may have been derived from the same ancestral region.

It has been proposed (11, 36) that regions of conserved synteny encode proteins that assemble into complexes and that this arrangement was relatively intolerant of disruption by inserted genetic material. Conversely, integration of a series of genes with nonessential functions would create a locus that was relatively tolerant of subsequent substitution with other genes. Consistent with this theory, the syntenic blocks contain a large and fixed set of essential genes involved in virion morphology and DNA replication and expression. Most of the genes apparently imported into hyperplastic regions are orphans, providing little information about their functions. But the theory suggests that they are nonessential genes, providing weakly useful functions in such a way that they often get replaced by alternative nonessential genes.

Noncoding signals. (i) Rho-independent terminators.

Analysis of the inter-ORF regions with TransTerm (26) revealed the presence of 72 ρ-independent terminators with G+C-rich stems of 5 to 25 bp and four-base loops (with a few larger loops), followed by a stretch of several U bases which are often interrupted (see Table S1 in the supplemental material). Many terminators were predicted at the end of the transcriptional blocks, and each time two opposite transcriptional blocks converge, a bidirectional terminator was predicted. However, other terminators were found between genes where there is no room for a promoter to restart transcription, suggesting some kind of antitermination or pausing of the RNA polymerase. Twelve terminators contain the conserved UUCG tetraloop motif, a highly thermodynamically stable motif abundant in prokaryotic and eukaryotic RNAs and also conserved in many ϕKZ stem-loop terminators (16, 42). Using PHIRE (34), a 13-nucleotide motif was identified within the stems of 13 transcription terminators. These terminators are clustered in four groups, and nine of them are bidirectional (Table 1).

Table 1.

Conservation of a 13-nucleotide motif in the stems of 13 ρ-independent transcription terminators

| Terminator | Orientation^a | A stretch^b | Stem | Loop | Stem | U stretch^b |
| --- | --- | --- | --- | --- | --- | --- |
| T6 | bi | AUAUAA | CCCUCUACUCUCC | UUUC | GGAGAGUAGAGGG | AAUUAUGAUCUUUAUUU |
| T8 | bi | AAAAUAAAAUAACAAAGA | CCCCUACCCCUCC | UUUC | GGAGUGGGUAGGG | UUAUCUUUAAUUUAUUUUU |
| T11 | | | CCCCUAUCUCGCC | UUUG | GGUGGGGUAGGGG | UUUUUGAUU |
| T90 | bi | AAAGAAUAACAUAUU | CCCCUUACCCUCC | GAAA | GGAAGGGUAAGGG | UUAUGUUUGUU |
| T96^c | + | | CCCUACUCCCUCC | GCAA | GGAAGGGGUAGGGG | UUUAUGUUU |
| T99^c | bi | AAAGACAUAU | CCCCUACCCACUCC | GUCAA | GGAGGGGUAGGGG | UUUAU |
| T104 | bi | AUAAAU | CCCCUACCCCUCC | UUGAC | GGAGUGGGUAGGG | GUUAUGUUUGUUUCUU |
| T126 | + | | CUCCUACCUCACC | UUCG | GGUGGGGUAGGGG | GUUUUU |
| T139 | bi | AAACAUAU | CCUCUACCCCUCC | GUAA | GGAAGGGUAGAGG | UUAU |
| T146 | bi | AAACGACAUAUAC | UCCCUACCCUUCC | UUGC | GGAGGGGUAGGGG | GGUUAUGCUUAUGUU |
| T254^d | + | | CCCCUACCCCUCC | GUAA | GGAAGGGUAGGGU | UUU |
| T262 | bi | AUAAAU | CCCCUACCCCUCC | UAAU | GGAGUGGGUAGGG | CUUAUGUUU |
| T275 | bi | AAACAUAA | CCCCUACCCCUCC | GUAA | GGAAGGGUAGGGA | GUAUAUGCUUUAUUUUUUUU |
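The stem sequences in Table 1 can be checked for base-pairing potential with a short script. The sketch below pairs the 5′ stem with the 3′ stem in a gapless antiparallel arrangement and counts Watson-Crick and G-U wobble pairs. It is a simplification under stated assumptions: it does not model bulges or internal loops, so terminators whose two stem arms differ in length or contain an unpaired base will score lower than their true structure would suggest. The function name `count_pairs` is hypothetical, and only two rows of the table are used as examples.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def count_pairs(stem5, stem3):
    """Count (Watson-Crick, G-U wobble) pairs in a gapless antiparallel pairing.

    The first base of the 5' arm is paired with the last base of the 3' arm,
    and so on inward. Arms of unequal length are compared over the shorter one.
    """
    wc = wobble = 0
    for a, b in zip(stem5, reversed(stem3)):
        if (a, b) in WATSON_CRICK:
            wc += 1
        elif (a, b) in WOBBLE:
            wobble += 1
    return wc, wobble


# Stem arms copied from Table 1 (5' arm, 3' arm).
stems = {
    "T6": ("CCCUCUACUCUCC", "GGAGAGUAGAGGG"),
    "T126": ("CUCCUACCUCACC", "GGUGGGGUAGGGG"),
}
for name, (stem5, stem3) in stems.items():
    print(name, count_pairs(stem5, stem3))
# T6 (13, 0)
# T126 (11, 2)
```

T6 forms a perfect 13-bp Watson-Crick stem, while T126 closes its 13-bp stem with two G-U wobble pairs around the UUCG tetraloop.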

(ii) A prospective phage-specific early transcription promoter.

Within the inter-ORF regions, a noncoding DNA sequence motif was found at 25 locations throughout the OBP genome (see Table S1 in the supplemental material). The sequence is represented in Fig. 2 as a sequence logo in comparison to a similar motif found in phage 201ϕ2-1. Similar sequences have been reported for phage EL (WTTTYAAACCTACATATY) (23), and ϕKZ (TATATTAC) (42). These motifs are always oriented the same way relative to the downstream gene and are often preceded by a ρ-independent terminator motif, consistent with their being transcriptional promoters. The logo presentation reveals that these motifs may well be functionally interchangeable between phages even though there has been drift with regard to which base is most prevalent in specific positions in different phages. The logos (Fig. 2) are superficially similar to bacterial promoter sequences; however, a hidden Markov model constructed from the 201ϕ2-1 sequences using SAM and used to search a variety of fully sequenced Pseudomonas genomes did not produce a single bacterial match. This is consistent with the high information content of the logos (∼25 bits), which is sufficient to prevent chance matches more often than once per 33 million bases (12). Hence, these are most probably phage-specific promoters. None of the matches to ϕKZ-related phages were at the head of operons with identifiable middle or late expressed genes. However, operons containing the nonvirion RNA polymerase genes were headed by these motifs. Hence, we conclude that these are early promoters.
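The relation between information content and chance matches quoted above follows from a standard rule of thumb: a motif carrying R bits of information is expected to occur by chance about once per 2^R positions in random sequence. The sketch below works through that arithmetic for R = 25. The genome length used at the end is an illustrative round number of the order of a Pseudomonas chromosome, not a value taken from this study.

```python
# A motif with R bits of information is expected by chance roughly
# once every 2**R positions of random sequence.
bits = 25
bases_per_chance_match = 2 ** bits
print(f"{bases_per_chance_match:,}")  # 33,554,432

# Assumption: an illustrative host genome of 6 Mb, searched on one strand.
hypothetical_genome_bp = 6_000_000
expected_chance_matches = hypothetical_genome_bp / bases_per_chance_match
print(f"{expected_chance_matches:.2f}")  # 0.18
```

With about 33.5 million bases needed per chance hit, finding no match in a bacterial genome of a few megabases is what one would expect if the motif is absent from the host.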

Fig 2.


Logo representation of the phage-specific promoters of phage OBP in comparison to a similar motif of phage 201ϕ2-1. When the consensus base differs, both phages often contain instances of the base characteristic of the other, indicating that these promoters may be functionally interchangeable.

Searching of the ϕKZ-related genomes with the SAM HMMs for these motifs revealed that all instances were on the same strand. This would be explained if these are targets of the injected virion RNA polymerase, and if early transcription is involved in genome import. This would be consistent with the high information content of the transcriptional promoters, since it would be important that the limited number of injected polymerase molecules not be distracted by binding to the host chromosome. Since ϕKZ-related genomes are circularly permuted, the importation hypothesis would require early promoters to be scattered at multiple positions physically throughout the chromosome but always in the same orientation, as is the case. This could further explain why late functions are scattered into genomic islands in ϕKZ-related phages, rather than being organized into a single large cluster as is the case for some large genome phages with fixed ends (22).

Gene products involved in nucleotide metabolism.

OBP encodes several proteins inferred to be involved in nucleotide metabolism: thymidylate kinase (Gp8), thymidylate synthase (Gp234), thymidine kinase (Gp249), dihydrofolate reductase (Gp241), and the ribonucleotide diphosphate reductase beta subunit (NrdB; Gp211). Two putative electron carriers that might support ribonucleotide diphosphate reductase function were assigned: a glutaredoxin (Gp12) and a 2Fe-2S ferredoxin (Gp210). Large genome phages often carry genes that augment or replace host functions for nucleotide synthesis. Duplication of host pyrimidine biosynthetic functions was shown to enhance T4 growth (14). Other ϕKZ-related phages carry homologs of some of these but not others (Fig. 1). Curiously, EL carries only one gene from this set of functions (thymidylate kinase), and it is neither particularly similar in sequence to the OBP thymidylate kinase nor in a syntenic position. Indeed, none of the OBP nucleotide metabolism genes appear in a syntenic position relative to homologs in other ϕKZ-related phages, and all of them are more similar to bacterial genes than to other ϕKZ-related genes. This implies that the ancestor of the EL-like viruses lost its nucleotide metabolism genes and that OBP (and to a lesser extent EL) has been reacquiring them. None of the OBP genes appear to be especially similar to their Pseudomonas paralogs, as would have been expected if they were acquired from the host genomes. Instead they are quite divergent from any gene currently in GenBank, as would be expected if they had been traveling in the viral gene pool for an extended period of time. This is of interest because, unlike host takeover genes, the host chromosome has homologs of these genes, yet the phages still prefer to acquire the versions in the viral gene pool. This implies that these genes do not just duplicate host gene function but that they have become customized to support phage growth in some way.

Newly recognized components of the replication complex: two putative polypeptides with similarity to RB69-like DNA polymerases.

Although the presence of a phage-encoded DNA polymerase (DNAP) for the ϕKZ-related phages has been presumed since the appearance of the first article about the ϕKZ genomic sequence (42), no similarity to other DNAPs has previously been found. With the new ϕKZ-related sequences, PSI-BLAST found two polypeptides (OBP-Gp55 and OBP-Gp99) in syntenic genome region B matching T4 DNAPs. T4 DNAPs are exemplified by the crystal structure of the DNAP (Gp43) of the enterobacterial phage RB69 (20, 53). RB69-Gp43 is 63% identical to T4-Gp43 itself, is a member of the eukaryotic pol α family, and has the domain structure shown in Fig. 3. It contains an N-terminal half with an editing 3′-5′ exonuclease domain of the DNAQ family and an N-terminal domain involved in some autoregulatory function in DNAP expression (24). This N-terminal half is attached via a small linker to a C-terminal half, with DNAP activity located in a right-hand-like structure composed of thumb, finger, and palm subdomains. The PSI-BLAST matches clearly showed that OBP-Gp55 (first PSI-BLAST iteration) has elements of the polymerase domain and OBP-Gp99 (second PSI-BLAST iteration) has the 3′-5′ exonuclease domain (shown in dark gray in Fig. 3). However, PSI-BLAST could not establish which, if either, of these proteins contained the finger domain and the invariant aspartate catalytic residue in the beginning of the polymerase domain. To clarify this, SAM alignments were made of the OBP sequences plus their immediate relatives and confined to just the C-terminal part of Gp99 or the N-terminal part of Gp55, so that the obviously matched segments were excluded. In this way, HMMs produced from the SAM alignments cannot be biased to match RB69-Gp43 by their adjacency to a well-matched region. These HMMs were compared to the RB69-Gp43 HMMs in the context of a full HHpred library search. It then became clear that OBP-Gp55 matches the missing finger domain in a broken fashion (shown in light gray in Fig. 3). 
The existence of this shortened finger domain in OBP-Gp55 is supported by the match to the crystal structure of the Sulfolobus solfataricus DNAP model, because that polymerase also has a shorter finger. This caused the alignment break within the finger domain itself to go away, leaving no discrepancy except that OBP-Gp55 has a 200-residue insert within its palm domain. However, the essential catalytic aspartic acid residue in the palm domain at the left of the finger domain (46) is not matched in OBP-Gp55. Although there is a conserved aspartate in the region in OBP-Gp55 (asterisk in Fig. 3), there is no other similarity at all relating that area to DNA polymerase. Hence, OBP-Gp55 does not seem to include a complete polymerase domain.

Fig 3.


Sequence alignment of two OBP polypeptides (Gp55 and Gp99) to the T4-like DNA polymerase (Gp43) of the enterobacterial phage RB69. RB69-Gp43 consists of an N-terminal half with an N-terminal domain (N-ter) and a 3′-5′ exonuclease domain of the DNAQ family, which is connected via a small linker (ln) to the C-terminal polymerase domain, composed of thumb, finger (Fn), and palm (PolBc) subdomains. PSI-BLAST matches are indicated in dark gray, while light gray shows similarity regions identified after alignments of the hidden Markov models of the ϕKZ relatives of both polypeptides and the HMMs of the RB69-Gp43 relatives. Asterisks indicate conserved aspartate residues.

The confusion caused by the missing catalytic residues in OBP-Gp55 was relieved by analysis of the C-terminal part of OBP-Gp99. The Gp99 match extends into the DNAP domain itself, with the missing active site aspartic acid (asterisk in Fig. 3) within the motif D[IL]D[VI][TV][GS][AT]YP, matching the RB69-related DNAP motif [LF]D[LF]XSLYP and having a similar predicted secondary structure in the area. Other features of OBP-Gp99 support the idea that these two polypeptides form a complex. OBP-Gp99 matches two regions in RB69-Gp43 that form the contacts that mount the exonuclease domain on the polymerase domain. These are the N-terminal domain and the linker between the exodomain and the polymerase domain. OBP-Gp55 has an extra 200-residue domain inserted in the palm, and OBP-Gp99 has an additional C-terminal domain. These would be juxtaposed underneath the palm structure, where they may assist in stabilizing the association of the two polypeptides. There appears to be no structural reason why these two polypeptides could not function together as a working DNA polymerase.

Gene products involved in peptidoglycan degradation.

The structural lysozyme of OBP-Gp276, orthologous to EL-Gp183, is composed of a large structural part and a small C-terminal lysozyme domain (RPSBLAST; amino acids [aa] 2069 to 2218; 1E-17). However, at the C-terminal distal end, OBP-Gp276 contains approximately 170 residues that do not align (by PSI-BLAST) with the ∼300 residues of EL-Gp183. This extreme C-terminal portion (2043 to 2237) of the structural lysozyme of phage ϕKZ, Gp181, was shown to enhance the specific catalytic activity 3-fold (10). Similarity to the ϕKZ and 201ϕ2-1 homologs (Gp181 and Gp276, respectively), however, breaks off before the catalytic domain.

The putative OBP endolysin, Gp279, probably has a modular structure with an N-terminal substrate binding module (Pfam; aa 13 to 96; 0.0016) and a C-terminal catalytic module with putative lysozyme activity (RPSBLAST COG3179; aa 126 to 292; 2E-15). This domain architecture, in an inverse arrangement, is characteristic of endolysins of phages infecting Gram-positive bacteria (37) but unique for the ϕKZ-related phages EL and ϕKZ, which infect a Gram-negative host (9). SAM assembles an alignment of two 60-residue domains at the N terminus of the OBP and EL endolysins which is able to align the N-terminal domain of ϕKZ-Gp144 (PSI-BLAST; 3.5E-3) and matches Pfam01471, a putative peptidoglycan-binding domain (HHpred; 0.016). These observations firm up the annotation of the N-terminal peptidoglycan binding domain of OBP-Gp279. At the C terminus, the endolysins of phages EL and ϕKZ display a C-terminal soluble lytic transglycosylase (SLT) domain which, however, does not show any detectable similarity with the catalytic lysozyme module of the endolysin of OBP, Gp279 (9). OBP-Gp279 instead displays similarity to a marginal region (∼80 aa) of the lysozyme module of the structural lysozyme of OBP (4E-6) and EL (3E-4), suggesting a common origin for both lysozyme modules.

Besides these two OBP lytic gene products found in genomic positions identical to those of their ϕKZ-related counterparts, the OBP genome encodes a second endolysin-type gene product, Gp149. The 255-residue OBP-Gp149 contains an SLT domain (Pfam; aa 46 to 99; 0.0078) but no N-terminal peptidoglycan-binding domain. This SLT domain is located in a gene segment (aa 35 to 241) with PSI-BLAST similarity to the catalytic region of the ϕKZ-like structural lysozymes (aa 1876 to 2078) and to the SLT domain of the ϕKZ endolysin, Gp144 (9, 19). Strikingly, an N-terminal signal peptide in OBP-Gp149 with a cleavage site between aa 23 and 24 is predicted (SignalP and LipoP). The absence of the tripartite structure (N-H-C) within the signal peptide typical of an endolysin signal-arrest-release sequence suggests that OBP-Gp149 is a SecA-dependent secreted endolysin with an N-terminal secretion signal (54). In contrast to OBP-Gp279 and its EL and ϕKZ counterparts (Gp188 and Gp144, respectively), this second OBP endolysin (Gp149) is a globular enzyme, typical of endolysins of phages infecting Gram-negative species (9, 37).

All ϕKZ-related phages display conservation of the structural region of their structural lysozyme and, furthermore, a similar peptidoglycan-binding domain within their modular endolysin, suggesting common ancestry of these enzymes. While for all phiKZ-like viruses a conserved SLT domain was observed for these lytic gene products, a lysozyme module was identified for three lytic proteins of the EL-like viruses, indicating that these domains were probably acquired soon after the genera diverged. The similarity of the SLT domain of the OBP globular endolysin to that of the phiKZ-like viruses provides evidence for horizontal gene transfer between the two genera, while the SLT domain of the EL endolysin may have diverged beyond recognition or may be the result of a different recombinatorial exchange.

Two new virion protein functional assignments.

The injection needle was assigned to OBP-Gp146. PSI-BLAST originally made an association with the injection needle of myovirus P2 (P2-GpV); to confirm this, we made an alignment and HMM of the ϕKZ-related orthologs only and found that hhsearch could align that HMM with the P2-GpV model in the HHpred libraries with an E value of 0.001. The best-characterized injection needle is T4-Gp5 (3, 44), and there is cross matching between the HHpred models for T4-Gp5 and P2-GpV in the injection needle domain of T4-Gp5. The T4 injection needle has the tail lysozyme domain embedded within the same polypeptide, whereas most myoviruses have a stand-alone injection needle homologous to P2-GpV while the lysozyme is fused to some other baseplate or tail tip component (22). The OBP component carrying the lysozyme domain is discussed above.

The portal protein was assigned to OBP-Gp132. This assignment was aided by having an HHpred-style HMM library derived from previously compiled extensive alignments of homologs of proteins from the prototypical myovirus P2 (51). Searching an HMM constructed from OBP-Gp132 and its orthologs against all P2 HMMs produced a top scoring match to P2-gpQ, with an E value of 0.005. Conversely, searching the P2-gpQ HMM against a library of all OBP families produced a top scoring match to OBP-Gp132, with an E value of 0.01. These matches are too weak to remain significant in the context of a full HHpred library search (about 100,000 HMMs). Hence, the assignment is conditional on the expectation that ϕKZ-related phages have a portal homologous to that of other myoviruses. If and only if that assertion is accepted does it become fair to search only the smaller myoviral protein family library and accept the correspondingly smaller E values. The fact that the P2 alignment included essentially all other known myoviruses (more than 1,100 members) encouraged us to accept it as representative of myoviral portal proteins. An unusual property of the OBP-Gp132 portal protein family is that the 201ϕ2-1 ortholog has been shown to be proteolytically processed by the maturation protease to remove an N-terminal domain (52). The N-terminal domain was not part of the portal alignment. To our knowledge, no other tailed phage portal protein has been found to be subjected to proteolytic processing. However, structural modeling revealed that the extra domain would be positioned inside the capsid adjacent to the capsid wall (data not shown). Hence, we propose that the giant phage portal proteins have an extra domain to help guide assembly of the oversized capsid shell.
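The dependence of an E value on library size can be illustrated with a minimal sketch. It assumes the usual convention that an E value is the expected number of chance hits at a given score and therefore scales linearly with the number of models searched. The function name and the size of the restricted myoviral library (200 HMMs) are hypothetical, since that size is not stated here; only the figure of about 100,000 HMMs and the E value of 0.005 come from the text.

```python
def rescale_evalue(evalue: float, searched_size: int, target_size: int) -> float:
    """Rescale an E value from one library size to another.

    Assumes E = (per-comparison chance probability) x (number of models
    searched), so the E value grows linearly with library size.
    """
    if searched_size <= 0 or target_size <= 0:
        raise ValueError("library sizes must be positive")
    return evalue * target_size / searched_size


# Hypothetical size of the restricted myoviral (P2-derived) HMM library.
SMALL_LIBRARY = 200
# Approximate size of a full HHpred library, as stated in the text.
FULL_LIBRARY = 100_000

# E value reported for the OBP-Gp132 family against P2-gpQ in the small library.
full_library_evalue = rescale_evalue(0.005, SMALL_LIBRARY, FULL_LIBRARY)
print(f"E value in a full-library search: {full_library_evalue:.2f}")
```

Under these assumed sizes, the same match would be expected about 2.5 times by chance in a full-library search, which is why the assignment depends on accepting the restricted library as the appropriate search space.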

Putative tail fiber families.

A first OBP paralog family (OBP-Gp142–OBP-Gp145) contains gene products orthologous to the EL virion structural proteins Gp113 to Gp116, respectively. Both the OBP and EL paralog families contain an N-terminal paralog block, called block A, of approximately 250 residues and a second paralog block, block B (∼110 residues), at the C terminus. Homologs of the N-terminal domain A are also found in the phiKZ-like viruses (see Table S1 in the supplemental material). OBP-Gp142/EL-Gp113 consists only of the single N-terminal paralog domain A, while their ϕKZ-like counterparts contain an extra 500 residues, indicating some recombination length changes. OBP-Gp143 and OBP-Gp144 have gene structures (N-terminal A and C-terminal B blocks) and lengths (∼1,250 residues) similar to those of their EL orthologs, Gp114 and Gp115, respectively. Their similarity in the N terminus extends to ∼350 residues, while their large midsections do not match up and probably diverged beyond recognition. Also, within the ϕKZ-like orthologs of approximately 460 residues, a similar C-terminal region is observed; however, it lacks similarity to the C-terminal block B of the EL-like viruses. OBP-Gp145 and EL-Gp116 share a truncated paralog domain A, which lacks the first 150 to 180 residues. While the ϕKZ-like orthologs are restricted to this truncated A block, the OBP-Gp145 C terminus, about 800 residues longer than EL-Gp116, matches lipoproteins of marine bacteria (PSI-BLAST) and consists of repeated domains which match the 120-residue Mycoplasma protein DUF285 (RPSBLAST; Pfam03382; 8E-38).

As paralog domain A matches a DUF2479 domain found at the N terminus in a supposed tail fiber family of phages infecting different bacterial species (HHpred; Pfam 10651; 3.5E-7), it can be hypothesized that this paralogous family found within the different ϕKZ-related phages probably encodes the abundant proteinaceous fibers attached to the virions (33, 52). The conserved N-terminal paralog domain A, inherited from a single common ancestor, attaches the nonhomologous C-terminal domains to the phage particle. Similarity of the C-terminal segments within the EL-like viruses and within the phiKZ-like viruses indicates that these domains developed after divergence of these two genera.

Additionally, a second large gene family was identified in the OBP genome with an internally repetitive structure suggestive of fiber formation. The gene products were OBP-Gp203 to OBP-Gp207, OBP-Gp209, and OBP-Gp226. The internal repeat is approximately 120 residues, and the different proteins have between 1 and 12 of these domains. The proteins are often headed by a distinctive N-terminal domain. No homologs of these sequences were found. Because other ϕKZ-related phages do not carry this family, there are no mass spectrometry data available at this time to confirm that the gene products are attached to the virion.

An OBP internal head protein multigene family.

A large number of proteins have been observed to undergo proteolytic processing in the other ϕKZ-related phages and are presumed to be inner head proteins (36, 52). Many of these were part of a highly divergent multigene family, so hidden Markov modeling was required just to find all the 201ϕ2-1 and ϕKZ family members. When PSI-BLAST was used to annotate the OBP genome, more OBP gene products were found to match that inner head protein family than were found by scanning with the HMM constructed from 201ϕ2-1, ϕKZ, and EL members of the family. This unusual result prompted a closer examination of the sequence relationships within this family. Two factors explain it. (i) These sequences diverge much faster than the other ϕKZ-related orthologs. The protease cleavage sites are the most conserved residues in the proteins. (ii) In some cases the family members are more similar to other members in the same genome than to members in the other ϕKZ-related genomes. Because we included OBP proteins in the version of the nr library searched by PSI-BLAST, the search could identify other paralogs in the same genome that are hard to find with a profile or HMM formed from family members in another genome. The final list of OBP family members was Gp109 to Gp113, Gp89, and Gp262. Previously unrecognized members of the family in EL were also found in this search. The total list in EL is Gp50, Gp54, Gp61 to Gp64, and Gp165.

Unique to the ϕKZ-related phages is the cylindrical inner body observed within the phage capsid (31–33, 47). This inner body is thought to be a dense spiral structure around which DNA molecules are spooled. It can be hypothesized that the interior head paralog family could be involved in organizing the genomic phage DNA inside the head structure. Thomas et al. (52) noted that these polypeptides have a generally acidic propeptide of approximately 200 residues, whereas the mature segments tend to be neutral or slightly positively charged. They proposed that the propeptides contribute to the scaffolding function, whereas the neutral to slightly positively charged segments complex with the DNA and remain behind inside the capsid after processing. Moreover, Thomas et al. (52), using mass spectrometry and spectrum counting, estimated that there are large amounts of these polypeptides in the mature capsid. Perhaps their observed diversity is necessary to prevent aggregation at these concentrations. Alternatively, these proteins could be injected into the cell to perform host takeover functions, and their observed sequence diversity may be a consequence of host coadaptation.
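The charge contrast between propeptide and mature segment can be checked with a rough residue count. The sketch below is only an approximation under stated assumptions: it counts Asp/Glu as −1 and Lys/Arg as +1, ignores histidine and the termini, and assumes the propeptide is the N-terminal part of the precursor. The function names and the toy sequence are hypothetical and are not taken from any OBP gene product.

```python
ACIDIC = set("DE")  # aspartate, glutamate: counted as -1 each
BASIC = set("KR")   # lysine, arginine: counted as +1 each


def approximate_net_charge(sequence: str) -> int:
    """Crude net charge near neutral pH: (#K + #R) - (#D + #E)."""
    seq = sequence.upper()
    positives = sum(aa in BASIC for aa in seq)
    negatives = sum(aa in ACIDIC for aa in seq)
    return positives - negatives


def charge_by_segment(sequence: str, cleavage_index: int) -> tuple[int, int]:
    """Return (propeptide charge, mature-segment charge).

    Assumes the propeptide is sequence[:cleavage_index] and the mature
    segment is sequence[cleavage_index:].
    """
    propeptide = sequence[:cleavage_index]
    mature = sequence[cleavage_index:]
    return approximate_net_charge(propeptide), approximate_net_charge(mature)


# Toy precursor (hypothetical): an acidic N-terminal part, then a basic part.
toy_precursor = "MDEEDLSEEA" + "GKTRLAKSVN"
propeptide_charge, mature_charge = charge_by_segment(toy_precursor, 10)
print(propeptide_charge, mature_charge)
```

For real family members, the cleavage index would have to come from the experimentally determined or predicted protease cleavage site rather than from a fixed offset.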

The sequence relationships in this family require that some level of concerted evolution has occurred. In other words, family members are sometimes lost and regenerated by gene duplication. All of the ϕKZ-related phages have a major locus at a syntenic position and at least one satellite locus. The satellite loci suggest that horizontal transfer is also involved in maintaining this family in each genome.

Conclusion.

Comparison of all sequenced ϕKZ-related genomes highlighted again their genomic organization into blocks of synteny, which encode a large set of essential gene products involved in morphology, DNA replication, and expression, interspersed with hyperplastic regions coding for what we assume to be nonessential genes. These hyperplastic regions seem to be prone to inversion processes which often affect the same gene or gene segment several times and thereby obscure the syntenic relationship among these genomes. Nevertheless, this organizational arrangement of the ϕKZ-related genomes helps to discern vertical descent and supports the idea of further division of the ϕKZ-related phages into EL-like viruses and phiKZ-like viruses as postulated by Lavigne et al. (35), possibly within a single subfamily.

With respect to other phage groups with large genomes, the pattern of interspersed hypervariable regions is similar to that observed for T4-like phages and their relatives, including the distantly related T4 phages infecting Cyanobacteria (11, 17). However, inversions of the type observed in ϕKZ-related phages are not common among T4-related phages or any other large genome phages. This would suggest that the evolutionary processes creating dispersal of syntenic islands are common to other phage genomes but that the processes creating inversions are peculiar to the ϕKZ-related phages. The evolutionary time and divergence covered by the characterized ϕKZ-related phages (52) and T4-related phages (17) are comparable, so a similar propensity for inversion should have shown itself if it were present among the T4-related phages. The single example of a large-genome (>200 kb) phage with fixed ends (22) shows a different pattern of synteny. In this phage, 0305ϕ8-36, the structural and morphogenesis genes are clustered together, covering about one-half the genome. There are breaks in synteny between 0305ϕ8-36 and its single known homologous relative, but they appear to reflect substitution of auxiliary structural virion components, not interspersal of nonstructural gene cassettes. The divergence time separating 0305ϕ8-36 from its relative (22) is comparable to the time separating the ϕKZ-related phages, so the peculiarity of the evolutionary pattern affecting 0305ϕ8-36 is not simply a matter of making comparisons over vastly different time spans. The agreement between ϕKZ-related phages and T4-related phages suggests that their pattern of evolution is the default for circularly permuted genomes, whereas the pattern of evolution affecting 0305ϕ8-36 is distorted by advantages derived from organizing genes by time of expression relative to fixed ends (22).

We propose that the peculiar incidence of inversions in the descent of ϕKZ-related phages is related to their unusual transcription program. We identified 25 instances of a sequence motif which we interpreted to be early promoters involved in DNA import (see Results). Key among the observations connecting these motifs to DNA import is their uniform unidirectionality among all ϕKZ-related genomes in spite of their residing in the most hyperplastic portions of their genomes. From the lack of conservation of the genes driven by these promoters among the different genomes, it can be inferred that these genes have nonessential functions, but sequence comparison provides little information about their specific functions. However, their association with early promoters means that they are early-expressed genes, and the latter are usually involved in host takeover. Hence, the early promoter motif provides a much-needed observable parameter for diagnosing early host takeover operons. Further, if it is assumed that these genes are acquired with the ϕKZ-related phage-specific early promoters already embedded and are often initially integrated in the wrong orientation, this would drive subsequent inversions that envelop flanking genes in the process. The peculiar tendency of ϕKZ-related genomes to undergo inversions concentrated around hyperplastic regions would thus be explained.
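The unidirectionality argument rests on counting motif occurrences separately on each strand. A minimal sketch of such a count follows; it is not the procedure used for the analysis reported here. The motif `TATAATG` and the toy genome are hypothetical placeholders (the actual early promoter motif is not reproduced in this section), and the function names are illustrative only.

```python
import re
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def motif_strand_counts(genome: str, motif_regex: str) -> Counter:
    """Count motif matches on the forward (+) and reverse (-) strands.

    A lookahead is used so that overlapping matches are also counted.
    """
    pattern = re.compile(f"(?=({motif_regex}))")
    counts = Counter()
    counts["+"] = sum(1 for _ in pattern.finditer(genome.upper()))
    counts["-"] = sum(1 for _ in pattern.finditer(reverse_complement(genome)))
    return counts


# Toy genome (hypothetical) carrying the placeholder motif twice on one strand.
toy_genome = "CCTATAATGCCGGTATAATGAA"
counts = motif_strand_counts(toy_genome, "TATAATG")
print(dict(counts))
```

A motif whose matches fall entirely on one strand, as in this toy case, would be called unidirectional; a palindromic motif would match both strands at the same position and would need separate handling.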

Genome annotation of ϕKZ-related phages through sequence comparison to other myoviruses is complicated by a high divergence barrier originating from both ancient common ancestry and a high divergence rate (52). We have employed a suite of profile and HMM methods to increase the annotation in the face of this barrier. These allow us to formulate phage-specific protein families that are absent from the international family databases and to draw connections between such families developed from well-characterized phages and proteins in the current target genomes. Since the HHpred HMM-HMM comparison offers the greatest signal-to-noise recognition of distant similarity of all methods we have tested, we merged our homemade models into an HHpred style library for maximum utility. We also used HHpred to address a weakness inherent in all family-building methods in that an unnamed sequence from a distant family can be incorporated at an early iteration with low confidence, leading to inclusion of named members of the distant family at high confidence at later iterations. This causes the weak step in the chain of inferences relating the two families to hide behind an exaggerated E value reported at the later iteration. For annotations where this artifact would be a concern, we separately aligned families representing the ϕKZ-related proteins and the putative distant relatives and converted them to HMMs. We can then be confident that the E value reported for the HMM-to-HMM comparison between these families evaluates the weakest link between the ϕKZ-related protein and the applied annotation.

Using hidden Markov modeling, conserved core genes (the portal, the injection needle, and the DNA polymerase genes) were located in all the ϕKZ-related phages. The HMMs of the ϕKZ-related orthologs of the portal and the injection needle proteins could be aligned with HMMs of the homologous proteins of the prototypical myovirus P2. The T4-like DNA polymerase activity is provided by two separate polypeptides with similarity to the T4 polymerase domain and the 3′-5′ exonuclease domain, which are suggested to interact to form one working DNA polymerase complex. Furthermore, a highly divergent multigene family encoding inner head proteins which may be involved in the organization of the genomic phage DNA inside the phage capsid was located. Also, a paralogous family found within each of the ϕKZ-related phages which is likely to encode the abundant proteinaceous modules associated with the virions was identified. While their N-terminal structural domain is conserved across the ϕKZ-related phages, their C-terminal catalytic segments are conserved only within the EL-like viruses and within the phiKZ-like viruses. Similar observations were made for the peptidoglycan-degrading late-gene products and further support the division of the ϕKZ-related phages into two separate genera.

Supplementary Material

Supplemental material

ACKNOWLEDGMENTS

We thank Mike Cao for assistance in promoter motif analysis and Matthew Dorsett for assistance in analysis of the portal protein family (Department of Biochemistry, University of Texas Health Science Center, San Antonio, Texas). Anneleen Cornelissen holds a predoctoral fellowship of the Instituut voor de aanmoediging van Innovatie door Wetenschap en Technologie in Vlaanderen (I.W.T., Belgium). Stephen C. Hardies was supported by a pilot grant from the University of Texas Health Science Center at San Antonio (Texas). Andrew M. Kropinski was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada.

Footnotes

Published ahead of print 30 November 2011

REFERENCES
