This module is useful for downloading very large result sets of sequences
from NCBI given a text query. It uses YAML to create a configuration
file that maintains project state. If network or server issues interrupt
execution, the download can easily be restarted after the last batch.
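The restart-after-the-last-batch idea can be sketched roughly as follows.
This is an illustration only, not this module's code: it is written in
Python, it stores state as JSON rather than YAML (to stay within the
standard library), and every name in it (load_state, save_state, run,
fetch_batch, next_index) is hypothetical.

```python
import json
import os


def load_state(path, start_index=0):
    """Return the saved state, or a fresh one beginning at start_index."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_index": start_index}


def save_state(path, state):
    """Write the state via a temporary file, then rename it into place,
    so an interruption cannot leave a half-written state file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, total, batch_size, fetch_batch):
    """Fetch batches in order, recording progress after each success.

    fetch_batch(start, count) is a caller-supplied (hypothetical) function
    that downloads one batch and raises an exception on failure.
    """
    state = load_state(path)
    while state["next_index"] < total:
        start = state["next_index"]
        count = min(batch_size, total - start)
        fetch_batch(start, count)  # may raise on network or server errors
        state["next_index"] = start + count
        save_state(path, state)
    return state["next_index"]
```

If fetch_batch raises partway through, calling run again with the same
state file resumes at the first batch that was not recorded as complete.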
Working scripts are included in the script directory.
The recommended workflow is:
1. Copy the scripts and edit them for a specific project. Use a
new number as the project ID.
2. Begin downloading by running fetch-all.pp, which will first
submit a query and save the resulting WebEnv key in a
project-specific configuration file (using YAML).
3. The next morning, kill the fetch-all.pp process and run
fetch-missing.pp until it completes.
4. Restart fetch-all.pp.
5. If you wish to re-download "not available" sequences, you may
run fetch-unavailable.pp. However, they will be downloaded at
the end of fetch-all.pp if it completes normally.
If your query result set is so large that your WebEnv times out, simply
start a new project with the last index of the previous project, and
it will pick up the result set from there (with a new WebEnv).
Warning: You may lose a very small number of sequences if your download extends
across multiple projects. However, our testing shows that the results
generated with the same query within a few days of each other are largely
in the same order.
Note: This module was used to download 11,550,000 fasta-formatted protein
sequences over the course of eight days.
INSTALLATION
To install this module, run the following commands: