restez -- submission · Issue #232 · ropensci/software-review (original) (raw)

Hi all!

Find attached my review. Sorry for the delay. Great job overall @DomBennett! Note that it's my first time reviewing for rOpenSci, and partially for that reason, I think I tended to focus on higher level stuff.

Please don't hesitate to ask for any clarification.

-- Evan


Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

Functionality

Final approval (post-review)

Estimated hours spent reviewing:

I estimate that I spent 13 hours on this review.


Review Comments

General Comments

Overall, this is a well thought out and documented package. I don't work that often with GenBank sequences myself, but based on discussion with others, it seems this package will usefully solve the problem of easily downloading and querying sequences from NCBI. In general, all described functionality in the package seemed to operate as intended. I particularly appreciated the pkgdown site and multiple vignettes. These will be great for new users and offer an easy way to communicate changes and new features as the package develops. I tried to conduct my review in the spirit of a "naive user". I first went through some of the code base checks recommended in the rOpenSci reviewing guide. From there, I worked through the package functionality in a way I expect most new users would: I went from the README to the "Get started" tutorial on through the other vignettes. I referenced the function descriptions and help files throughout. I think my comments represent mostly minor issues, but hopefully they will improve overall organization and usability of the package. I've arranged them below in rough order of importance.

Specific Comments

Explain Maintenance of Multiple SQL Databases: Given the performance issues associated with querying large databases, users might reasonably want to have multiple local SQL databases containing data from different taxa, different sequence lengths, etc. I would guess that re-coding functions to allow for the assignment of different SQL database names within the same restez path would be a bit much at this point (but potentially a feature for future versions?). In lieu of that, I would simply recommend pointing out to users that setting different restez paths would be an easy way to maintain different databases containing different sets of information that could referenced from within one analysis script. This is a very simple solution, but it didn't occur to me until I spent a few hours with the package, so a new user would probably benefit from knowing that it's an option for maintaining multiple local databases.

More Informative Error Messages When restez Path Not Set: You've done your due diligence in reminding users to run restez_path_set() when they call library(restez), but I wonder if this is enough to prevent errors from new users. If users have a GenBank database on their machine, start an R session, run library(restez), then attempt any of the query functions, they will be met with some opaque errors. Especially when people are loading a host of packages to initiate a script, package-specific warnings can get overlooked. How much trouble would it be to check that the restez path is actually set as a first step in the query functions?

Change list_db_ids() Defaults: I understand the intuition behind setting n = 100 as the default for this function (prevent users from gathering all IDs in a massive database), but I would consider either changing this or printing a warning to the screen telling users by default they will retrieve a maximum of 100 sequence IDs. I think this is pretty critical because this will likely be a key step in many users' workflow: download a GenBank dataset, query to get a vector of all sequence IDs, then go from there. I think it will cause confusion if users download a large database, gather IDs, but then don't realize they're only working with the first 100 IDs by default.

Don't Export restez_ready()?: Is there a specific use case for restez_ready()? It seems to me it replicates the functionality of restez_status(). Both return TRUE/FALSE, but restez_status() has more useful information. I would tend towards not exporting restez_ready() just to simplify things for users.

Update restez_status() With Number of Records: Along these lines, one critical piece of information that might be usefully added to restez_status() is the number of records contained in the current SQL database. Users can currently see the size of the database, but naturally they might want to know the actual number of sequences. The only way I could think to easily find this information at present was to run list_db_ids() on the database, then find out the length() of this vector.

Clarify SQL Database Size Requirements: I think there is some issue with the communication of expected database sizes when running db_download(). Or else, the files I happened to pick have extremely long sequences relative to examples in the help files? As a test run, I attempted to download all of the viral sequences on GenBank. db_download() informed me nicely about what was expected:

You've selected a total of 54 file types. These represent:    
● 'Viral'  
Each file contains about 250 MB of decompressed data.  
54 files amounts to about 13.5 GB  
Additionally, the resulting SQL database takes about 50 MB per file  
In total the uncompressed files and SQL database should amount to 16.2 GB Is that OK?

However, after download and database creation (which went fine), restez_status() told a different story:

> restez_status()  
Checking setup status at '~/Desktop/restez' ...  
... found 55 files in 'downloads/'  
... totalling 12.18 GB  
... found 'sql_db' of 18.35 GB  
[1] TRUE

So all told, my local, raw sequence files were smaller than expected (only 12 GB), but the SQL database was much larger than anticipated, resulting in a local database folder that was roughly twice what was expected (30 GB). Is this a peculiarity of this dataset? Something that went wrong with the database generation?

Revise References To Defunct Functions: There were some confusing references to gb_download() in various parts of the documentation, including the README file. I gather this function was replaced by db_download() in the current version of the package? In any case, appropriately modifying these references is important since anyone trying to follow along with the example code in the README and elsewhere will pretty quickly run into some confusion at present.

Differing gb_record_get() and entrez_fetch() Behavior: It's great that you've integrated restez functionality with some things offered by rentrez. However, I noticed that behaviors of "regular" restez functions and those rentrez wrappers is slightly different. For example (after running demo_db_create()), running cat(gb_record_get("demo_1")) and cat(entrez_fetch(db = "nucleotide", id = "demo_1", rettype = "fasta")) do not return the same information, yet they should be querying the same (local) sequence record. Is this the expected behavior?

Typo Fixes: spelling::spell_check_package() and spelling::spell_check_files("README.Rmd") revealed some minor typos in various portions of the package documentation, particularly in README.Rmd and other vignette .Rmd files. These didn't significantly affect my understanding of the package but should be corrected. In addition, when db_download() is run, there seems to be a typo in the printed text? If a user wanted all mammalian sequences, I think they would want to download file types "12 13 15", not "12 14 15" (which would give viral sequences in addition to mammalian sequences). After reading through the package vignettes more, it occurs to me this may be the result of GenBank changes (i.e., particular taxa getting assigned different file numbers)? In that case, it might be better to revise your text such that it doesn't reference particular file numbers at all in order to future-proof it.

Make Vignette Availability More Obvious: It's great that you put in the time to generate useful vignettes for this package. I'd recommend some subtle changes to the README to make this more apparent for users. Why not move the sentence "For more detailed tutorials, visit the restez website" under a new heading below "Quick Examples" called "Detailed Tutorials"? I think that would be preferable since then new users could first execute the "Quick Examples" code to see how setup, querying, and Entrez wrappers work. Then they could move on to more in-depth tutorials with the vignettes. I think the link in this case should also point directly to the articles page of the package website (rather than the website index, which just replicates the README page).

Revise README Graphic: I appreciated the README file graphic that illustrated the overall package organization. I wonder if a couple of small changes could make this more comprehensive, however. First, would it make more sense for the white box for "restez/" to be moved above the blue box containing "downloads/" and "sql_db"? To me, this would be more accurate since "downloads/" and "sql_db" will be found within the "restez/" directory on a user's machine. Second, could you update the "Query" box to contain all the exported gb_*_get() functions in the package?

Better Organize Function Types (Families): This primarily applies to the ?restez description page and the function "References" page on the website. This occurred to me only because the ?restez help page describes having three sets of functions but then goes on to list four. In addition, the function organization on the "References" page is a little weird since functions from the same grouping can actually come from different source files (and those source files can overlap with other groups). For example, under the "Database" functions, some are from R/setup.R while others are from R/general-get-tools.R. This might make it a little hard for users to wrap their minds around which functions are to be used at which stage of the restez workflow. I would probably suggest four function categories (something like restez Path Setup, GenBank Database Functions, GenBank Query Functions, and Entrez Wrappers) all organized such that source files don't overlap among categories.

Harmonize Vignette Workflows: This is minor, but I might consider reorganizing the multiple vignettes slightly such that they all follow the same conventions and workflow (i.e., all set restez paths to the user's desktop, all delete local GenBank databases following analysis to illustrate best practices, etc.).

Performance Issues With Querying Large Databases: There may not be a good way around this, but I noticed that the gb_*_get() family of functions were a bit sluggish when used to query a large database (i.e., all GenBank viral sequences). For example, a gb_definition_get() call on the viral database required ~24 seconds to return the definition for a single ID, and most of this time seemed to be spent on DBI::dbFetch() according to Rprof() (note I'm certainly no expert in profiling functions). Perhaps for future updates you might consider whether different functions would allow for speedier ID matching of the user's query in the local database? Another solution could be to facilitate users loading specific records from their local SQL database into an R environment object? This might not be the most efficient method, but I'd imagine these objects could be parsed faster?

In the course of discussing with @noamross, he suggested migrating to a MonetDB structure from SQL might improve these performance issues. I definitely don't think this is critical to package functionality at present, but just a consideration for future package development.

roxygen2 Update: devtools::check() and devtools::test() returned no major problems with the package, but I did note that there is a newer version of roxygen2 (6.1.0) available. Not sure if updating will improve the package in any discernible way.