StarBLAST¶
Welcome to StarBLAST!¶
StarBLAST is a scalable extension of SequenceServer BLAST that makes BLAST accessible to educators and researchers running classroom-scale searches concurrently.
StarBLAST utilizes cctools for faster, distributed computing and CyVerse’s Visual Interactive Computing Environment (VICE).
What is SequenceServer?¶
SequenceServer is a front-end implementation of BLAST with an improved GUI and customizable database input, developed by the Wurmlab at Queen Mary University of London (Priyam et al., 2019). However, it has limited scaling capabilities and can be difficult to deploy. StarBLAST extends SequenceServer for easier deployment and scales to a larger number of simultaneous users (e.g., students).
The StarBLAST Suite¶
Navigate to each implementation for information on guided deployment:
Contacts & Issues¶
If you have questions, suggestions or have encountered a problem, please raise an issue on our GitHub Issues page.
Official Publication & Citation¶
The official publication is in the Journal of Open Source Education (JOSE): https://doi.org/10.21105/jose.00102
Cite our work as: “Cosi, M., Forstedt, J., Gonzalez, E., Xu, Z., Peri, S., Tuteja, R., Blumberg, K., Campbell, T., Merchant, N. & Lyons E. (2021). StarBLAST: a scalable BLAST+ solution for the classroom. Journal Of Open Source Education, 4(38), 102. doi: 10.21105/jose.00102”
StarBLAST-VICE: Web Deployment for Small Classes (<25)¶
StarBLAST-VICE is a customizable implementation of SequenceServer, deployed as a VICE (Visual Interactive Computing Environment) web application and hosted on the CyVerse Discovery Environment (DE). StarBLAST-VICE can be launched with up to 8 CPU cores, 16 GB RAM, and 512 GB disk space.
Note
Before proceeding, a CyVerse account is required. Click here to register or log in.
Launching StarBLAST-VICE with Example Databases¶
- Click on the following button to launch SequenceServer in the CyVerse Discovery Environment with the SWISS-PROT protein database (requires a CyVerse account). If you are already in the DE, you can find the StarBLAST-VICE app by clicking the Apps button and searching for “StarBLAST-VICE”.
Note
SWISS-PROT is a curated protein sequence database; read more on the release or in its original publication.
- Choose your own analysis name and the DE output folder. Click “Launch Analysis”.
- Check the notifications Bell Icon for a link to access your SequenceServer instance. This might take a few minutes. Once the notification shows that the app is running, click on the link. This will open a loading screen in a new tab; once the app is loaded, you should be able to BLAST through the SequenceServer app.
- To test, click here for a sample DNA sequence.
- Paste the query sequence, select both available databases, and submit the job.
Adding Your Own Databases to StarBLAST-VICE¶
To add your own BLAST databases, you will need a .fasta (or .fa, .faa, .fna) file containing the reference sequences you’d like to use. These are easily obtained from NCBI or other databases.
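Before uploading, it can help to sanity-check the FASTA file locally. The sketch below assumes a shell with grep and mktemp available; the function name count_fasta_records and the two short records in example.fasta are hypothetical and exist only for illustration.

```shell
#!/bin/bash
# Count FASTA records: every record starts with a header line beginning with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical two-record nucleotide file, written to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/example.fasta" <<'EOF'
>seq1 hypothetical example record
ACGTACGTACGT
>seq2 hypothetical example record
TTGACCGATTAG
EOF

echo "records: $(count_fasta_records "$workdir/example.fasta")"
```

A count of zero usually means the file is not in FASTA format (for example, a compressed archive or an HTML error page saved under a .fasta name).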
- Within the CyVerse DE, click on the “Data” icon.
- Select “Upload” and specify the import from your Desktop or its URL. This will be stored in your personal folder.
- Click on the “Apps” icon and use the search bar to find “Create BLAST Database” or click here. As there may be multiple apps with the same name, please locate the “Create BLAST Database” app developed by Upendra Kumar Devisetty (use the better reviewed one).
- Enter a name for your database under “Analysis Name”; this will become the name of the folder containing your database.
- “Select output folder” should be your personal folder or any folder of your choice (the default is a folder named “analyses” within your personal folder).
- In the “Inputs” tab, select “Browse” and choose the fasta file you uploaded. Select Nucleotide or Protein under “Input Sequence Format”. Under “Prefix”, choose a name that reflects your database (e.g. a_thaliana).
- Click “Launch Analysis” and wait to be notified of its completion. Upon completion, navigate to the output folder specified in step 4.1. Inside you will find a directory with the name you specified in step 4, followed by a timestamp. Within this folder you will find logs and the newly generated database (for a nucleotide database, .nhr, .nin, .nog, .nsd, .nsi, .nsq files will be found; for a protein database, .phr, .psq files will be found).
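If the output folder is downloaded or mounted locally, the presence of the database files can be checked with a few lines of shell. This is a minimal sketch, not part of StarBLAST: the function name check_blast_db is hypothetical, and it only looks for a subset of the extensions listed in the step above.

```shell
#!/bin/bash
# Check that a folder holds BLAST database files for a given prefix.
# Usage: check_blast_db <folder> <prefix> <nucl|prot>
check_blast_db() {
    local dir=$1 prefix=$2 dbtype=$3 exts ext missing=0
    case "$dbtype" in
        nucl) exts="nhr nin nsq" ;;   # subset of the nucleotide files listed above
        prot) exts="phr psq" ;;       # protein files listed above
        *) echo "unknown database type: $dbtype" >&2; return 2 ;;
    esac
    for ext in $exts; do
        if [ ! -f "$dir/$prefix.$ext" ]; then
            echo "missing: $dir/$prefix.$ext" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (folder name is hypothetical):
#   check_blast_db ./my_database_folder a_thaliana nucl && echo "database looks complete"
```

Large databases may be split into numbered volumes, which this simple check does not handle.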
Launching StarBLAST-VICE with Your Own Databases¶
To launch StarBLAST-VICE with your own database:
- Use the same launch button as for the Example Databases, but do not click “Launch Analysis” just yet.
- In the “Input” tab, select the folder containing your database (if not specified, the default is swissprot-db).
Note
You will not be able to see the databases within the folder at this step. Ensure that the database files (as explained in step 4.2 of the previous section) are present beforehand.
- Click “Launch Analysis”. This might take a few minutes.
Accessing your running Apps¶
The notification bell should show your currently running apps and jobs.
To see all your jobs and access your running apps (and app history), navigate to the Analyses button.
StarBLAST-Docker: Cloud Deployment for Medium Classes (25-100)¶
To deploy StarBLAST on a cloud provider, you will need an account with that provider. This example uses XSEDE’s JetStream Cloud service. You can access JetStream using an XSEDE account, a Globus account, or via institutional access to XSEDE (search for your institution name from the drop-down menu on JetStream’s login page).
This setup uses a “Foreman” instance for the front-end SequenceServer and one or more “Worker” instances to distribute the computational load of running BLAST. Docker containers are used to deploy the Foreman and Workers through deployment scripts. These deployment scripts are designed to:
Launching Foreman & Worker Instances¶
1. Login to JetStream Cloud.
2. From JetStream’s top menu, navigate to “Projects” and select “Create New Project”.
3. In the “Project Name” field, name your project and add a description.
4. From JetStream’s dashboard, select “Launch New Instance”.
5. Be sure to change the default tab from “Show Featured” to “Show All”, search for “Docker_starBLAST” and select the “Docker_starBLAST” image (or click here); click “Launch”.
6. In the pop-up menu you can customize your instance (e.g. the instance size; use at least an m1.xlarge instance for the Foreman, with at least 60GB of disk space); then select “Advanced Options”.
7. Select “Create a New Script”.
8. Title the script “Foreman” or similar, select “Raw Text” and copy and paste the Foreman script, linked below. The scripts generate a password and username based on the user account, but these can be personalized if needed (not suggested for new users). Select “Save and Add Script” and then “Continue to Launch”.
Deployment Scripts
- The deployment scripts for a Foreman instance (atmo_deploy_master.sh) can be found here.
- The deployment scripts for a Worker instance (atmo_deploy_worker.sh) can be found here.
Note
This step is required to be done once for the Foreman and once for each Worker instance. The deployment scripts are stored for future use.
9. Repeat steps 4-8 for one or more Worker instance(s), using the Worker deployment script. Use large or extra large images (at least 60GB of disk space is required).
Note
Deployment on JetStream Cloud takes at least 10-20 minutes, and the wait time increases with the size of the BLAST database.
Start BLASTING! Now anyone can enter the <FOREMAN_IP_ADDRESS> into their browser and access SequenceServer.
StarBLAST-HPC: HPC Deployment for Large Classes (>100)¶
The StarBLAST-HPC setup is designed to distribute BLAST searches across multiple nodes on a High-Performance Computer and uses a Master-Worker setup similar to StarBLAST-Docker (an Atmosphere instance as the Master, and the HPC as the Worker). It is suggested that the Worker is set up ahead of time.
Some command line knowledge is required for setup.
HPC Requirements and Setup¶
The following are required on the HPC:
- iRODS version 4.0 or newer
- ncbi-BLAST+ version 2.9.0 or newer
- CCTools version 7.0.21 or newer
- glibc version 2.14 or newer
- Support for CentOS7
- CyVerse user account
iRODS, ncbi-BLAST+ and CCTools should be available in your home directory, which can be found using
cd
pwd
It should output something similar to
/home/<U_NUMBER>/<USER>/
iRODS Installation Guide¶
- From your home directory, obtain and install iRODS with the command
wget https://files.renci.org/pub/irods/releases/4.1.10/ubuntu14/irods-icommands-4.1.10-ubuntu14-x86_64.deb
apt-get install ./irods-icommands-4.1.10-ubuntu14-x86_64.deb
- Upon installation, set up the iCommands (requires a CyVerse account):
iinit
- You will be prompted to connect to CyVerse with:
host name (DNS): data.cyverse.org
port #: 1247
username: <CyVerse_ID>
zone: iplant
password: <CyVerse_password>
iRODS should now be installed and configured. If problems persist, a more in-depth tutorial on iRODS and iCommands installation can be found here.
ncbi-BLAST+ Installation Guide¶
- From your home directory, obtain and decompress ncbi-BLAST+ with
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.9.0/ncbi-blast-2.9.0+-x64-linux.tar.gz
tar -xvf ncbi-blast-2.9.0+-x64-linux.tar.gz
- Add ncbi-BLAST+ to the path (change the path to reflect the correct location of the ncbi-BLAST+ bin files):
export PATH=$HOME</PATH/TO/BLAST/BIN/>:$PATH
At this point, ncbi-BLAST+ should be installed and accessible.
- BLAST databases need to be downloaded into a <DATABASE>/ directory in the home folder:
/home/<U_NUMBER>/<USER>/<DATABASE>/
Note
An example set of BLAST databases can be downloaded with iRODS from /iplant/home/cosimichele/200503_Genomes_n_p. Read more on installing iRODS and iCommands above.
CCTools Installation Guide¶
- From your home directory, obtain and decompress CCTools with
wget https://ccl.cse.nd.edu/software/files/cctools-7.1.6-source.tar.gz
tar -xvf cctools-7.1.6-source.tar.gz
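- Add CCTools to the path (change the path to reflect the correct location of the CCTools bin files; because the archive above is a source release, the bin files exist only after CCTools has been built):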
export PATH=$HOME</PATH/TO/CCTOOLS/BIN/>:$PATH
At this point, CCTools should be installed and accessible.
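The two PATH exports above can be made repeatable with a small helper that avoids adding the same directory twice. This is a sketch under assumptions: the function name prepend_path and both install locations are hypothetical and need to be adjusted to where the bin directories actually are.

```shell
#!/bin/bash
# Prepend a directory to PATH only if it is not already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Hypothetical install locations; adjust to the real bin directories.
BLAST_BIN="$HOME/ncbi-blast-2.9.0+/bin"
CCTOOLS_BIN="$HOME/cctools/bin"

prepend_path "$BLAST_BIN"
prepend_path "$CCTOOLS_BIN"
echo "$PATH"
```

An export only lasts for the current shell session; adding the same lines to a shell startup file such as ~/.bashrc is a common way to keep the setting across logins.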
Note
CCTools only works if your HPC has glibc version 2.14 or newer. In the following examples, glibc and BLAST+ are loaded through module load. module load is not necessary if the HPC system already supports glibc 2.14 and if ncbi-BLAST+ has been added to the path as described above.
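Whether the requirements are met can be checked from a login shell before submitting any job. The following is a minimal sketch, assuming GNU coreutils (sort -V) and a glibc-based system where ldd --version reports the glibc version; the helper name version_ge is hypothetical.

```shell
#!/bin/bash
# True when version $1 is greater than or equal to version $2 (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report which of the required command-line tools are on the PATH.
for tool in iinit blastn work_queue_factory; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# On glibc systems the first line of `ldd --version` ends with the glibc version.
glibc_version=$(ldd --version 2>/dev/null | head -n 1 | awk '{print $NF}') || true
if [ -n "$glibc_version" ] && version_ge "$glibc_version" 2.14; then
    echo "glibc $glibc_version is new enough for CCTools"
else
    echo "glibc version '${glibc_version:-unknown}' may be too old; consider module load"
fi
```

A tool reported as missing here may still be available through module load on your system.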
Launching Workers on the HPC¶
The HPC uses a .pbs and qsub system to submit jobs.
- Create a .pbs file that contains the following code and change the <VARIABLES> to your preferred options:
#!/bin/bash
#PBS -W group_list=<GROUP_LIST>
#PBS -q windfall
#PBS -l select=<N_OF_NODES>:ncpus=<N_OF_CPUS>:mem=<N_MEMORY>gb
#PBS -l place=pack:shared
#PBS -l walltime=<MAX_TIME>
#PBS -l cput=<MAX_TIME>
module load blast
module load unsupported
module load ferng/glibc
module load singularity
export CCTOOLS_HOME=/home/<U_NUMBER>/<USER>/<CCTOOLS_DIRECTORY>
export PATH=${CCTOOLS_HOME}/bin:$PATH
cd /home/<U_NUMBER>/<USER>/<WORKERS_DIRECTORY>
MASTER_IP=<MASTER_IP>
MASTER_PORT=<PORT_NUMBER>
TIME_OUT_TIME=<TIME_OUT_TIME>
PROJECT_NAME=<PROJECT_NAME>
/home/<U_NUMBER>/<USER>/<CCTOOLS_DIRECTORY>/bin/work_queue_factory -T local -M $PROJECT_NAME --cores <N_CORES> -w <MIN_N_WORKERS> -W <MAX_N_WORKERS> -t $TIME_OUT_TIME
An example of a .pbs file running on the University of Arizona HPC:
#!/bin/bash
#PBS -W group_list=lyons-lab
#PBS -q windfall
#PBS -l select=2:ncpus=12:mem=24gb
#PBS -l place=pack:shared
#PBS -l walltime=02:00:00
#PBS -l cput=02:00:00
module load blast
module load unsupported
module load ferng/glibc
module load singularity
export CCTOOLS_HOME=/home/u12/cosi/cctools-7.0.19-x86_64-centos7
export PATH=${CCTOOLS_HOME}/bin:$PATH
cd /home/u12/cosi/cosi-workers
MASTER_IP=128.196.142.13
MASTER_PORT=9123
TIME_OUT_TIME=1800
PROJECT_NAME="starBLAST"
/home/u12/cosi/cctools-7.0.19-x86_64-centos7/bin/work_queue_factory -T local -M $PROJECT_NAME --cores 12 -w 1 -W 8 -t $TIME_OUT_TIME
In the example above, the user already has BLAST installed (and loads it using module load blast). The script will submit to the HPC nodes a minimum of 1 and a maximum of 8 workers per node.
- Submit the .pbs script with:
qsub <NAME_OF_PBS>.pbs
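Rather than editing the template by hand, the .pbs file can be generated from a handful of variables. The sketch below mirrors the University of Arizona example above; the function name render_pbs, the variable names, and the output location are hypothetical, and the module load lines are site-specific.

```shell
#!/bin/bash
# Values mirror the example above; change them for your own HPC.
GROUP_LIST=lyons-lab
N_NODES=2
N_CPUS=12
MEM_GB=24
MAX_TIME=02:00:00
CCTOOLS_HOME=/home/u12/cosi/cctools-7.0.19-x86_64-centos7
WORKERS_DIR=/home/u12/cosi/cosi-workers
PROJECT_NAME=starBLAST
MIN_WORKERS=1
MAX_WORKERS=8
TIME_OUT_TIME=1800
PBS_FILE="${TMPDIR:-/tmp}/starblast_workers.pbs"   # hypothetical output location

# Write the .pbs file; \$ keeps a variable literal so it expands on the HPC node.
render_pbs() {
    cat > "$1" <<EOF
#!/bin/bash
#PBS -W group_list=${GROUP_LIST}
#PBS -q windfall
#PBS -l select=${N_NODES}:ncpus=${N_CPUS}:mem=${MEM_GB}gb
#PBS -l place=pack:shared
#PBS -l walltime=${MAX_TIME}
#PBS -l cput=${MAX_TIME}
module load blast
module load unsupported
module load ferng/glibc
module load singularity
export CCTOOLS_HOME=${CCTOOLS_HOME}
export PATH=\${CCTOOLS_HOME}/bin:\$PATH
cd ${WORKERS_DIR}
\${CCTOOLS_HOME}/bin/work_queue_factory -T local -M ${PROJECT_NAME} --cores ${N_CPUS} -w ${MIN_WORKERS} -W ${MAX_WORKERS} -t ${TIME_OUT_TIME}
EOF
}

render_pbs "$PBS_FILE"
cat "$PBS_FILE"
# Submit with: qsub "$PBS_FILE"
```

Keeping the values at the top of the script makes it easier to reuse the same template for classes of different sizes.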
Setting Up the Master VM on the Cloud Service¶
Set up the Master instance for StarBLAST-HPC by following the same steps as for StarBLAST-Docker, but without adding the Master deployment script. Additionally, BLAST databases need to be loaded manually into the <DATABASE>/ folder.
Once the VM is running, access it through ssh or by using the Web Shell (“Open Web Shell” button on your VM’s page). Once inside, follow the steps below.
Note
Important: the path to the database on the Master needs to be the same as the one on the Worker.
- Ensure the databases on both the Master VM and Worker HPC are in the same directory. On the Worker HPC, go to the <DATABASE>/ directory and run:
pwd
Then, on your Master VM, create a directory with the same path as the output above:
mkdir -p SAME/PATH/TO/HPC/DATABASE/DIRECTORY/
- Now the <DATABASE>/ directories have been set up to contain the desired databases. You can use the same preset databases as for StarBLAST-Docker or make your own from a .fasta (or .fa, .faa, .fna) file using BLAST+’s makeblastdb, referenced in StarBLAST-VICE. Both require iRODS (JetStream comes with iRODS pre-installed) and a CyVerse account.
Access iRODS using:
iinit
You will be prompted to connect to CyVerse with:
host name (DNS): data.cyverse.org
port #: 1247
username: <CyVerse_ID>
zone: iplant
password: <CyVerse_password>
- Once connected, retrieve and move the databases to your <DATABASE>/ folder (shown for the preset databases):
iget -rKVP /iplant/home/cosimichele/200503_Genomes_n_p
mv GCF_* /DATABASE/DIRECTORY/
- Move the databases to the HPC using either sftp or the same steps as above if your HPC system has iRODS.
- Use this code within the Master instance to launch SequenceServer:
docker run --rm --name sequenceserver-scale -p 80:3000 -p 9123:9123 -e PROJECT_NAME=<PROJECT_NAME> -e WORKQUEUE_PASSWORD=<PASSWORD> -e BLAST_NUM_THREADS=<N THREADS> -e SEQSERVER_DB_PATH="/home/<U_NUMBER>/<USER>/<DATABASE_DIRECTORY>" -v /DATABASE/ON/MASTER:/DATABASE/ON/WORKER zhxu73/sequenceserver-scale:no-irods
An example is:
docker run --rm --name sequenceserver-scale -p 80:3000 -p 9123:9123 -e PROJECT_NAME=starBLAST -e WORKQUEUE_PASSWORD= -e BLAST_NUM_THREADS=2 -e SEQSERVER_DB_PATH="/home/u12/cosi/DATABASE" -v /home/u12/cosi/DATABASE:/home/u12/cosi/DATABASE zhxu73/sequenceserver-scale:no-irods
Note
The custom database folder on the Master needs to have read and write permissions.
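Because the database path appears three times in the docker run command, it is easy to mistype one of them. The sketch below only prints the command so it can be reviewed before running; the function name build_master_cmd is hypothetical, and the flags and image name are copied from the example above.

```shell
#!/bin/bash
# Build the docker run command for the Master from a few variables.
# The database path is reused so the Master and Worker paths stay identical.
build_master_cmd() {
    local project=$1 password=$2 threads=$3 db_path=$4
    printf 'docker run --rm --name sequenceserver-scale -p 80:3000 -p 9123:9123 '
    printf -- '-e PROJECT_NAME=%s -e WORKQUEUE_PASSWORD=%s -e BLAST_NUM_THREADS=%s ' \
        "$project" "$password" "$threads"
    printf -- '-e SEQSERVER_DB_PATH="%s" -v %s:%s ' "$db_path" "$db_path" "$db_path"
    printf 'zhxu73/sequenceserver-scale:no-irods\n'
}

# Values taken from the example above; the password is left empty as in that example.
build_master_cmd starBLAST "" 2 /home/u12/cosi/DATABASE
```

Once the printed command looks right, it can be pasted into the Master's shell.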
Start BLASTING! Now anyone can enter the <MASTER_IP_ADDRESS> in their browser to access SequenceServer.
Using SequenceServer¶
SequenceServer provides access to BLAST+ commands through a simple GUI. Here, we show examples of how to BLAST using SequenceServer. For additional documentation, please visit SequenceServer’s official website and original publication.
Note
These examples assume that you have already launched StarBLAST. Visit the other User Guides to learn more about launching StarBLAST.
SequenceServer’s Main Page¶
On the main page, the user will see:
- The main input box where nucleotide (DNA) or amino acid (protein) sequences can be input using the FASTA convention.
- The nucleotide databases (left) and protein databases (right). The user will be able to choose which databases to BLAST against by clicking the boxes to the left of the databases’ names.
- The advanced parameters box. A list and description of all the advanced options can be accessed by pressing the “?” button.
Note
Advanced Parameters can heavily influence the BLAST results; we suggest reading the descriptions beforehand.
The input box will recognize the added nucleotide or amino acid sequence. The user can then select the database of choice (this step can be performed before adding the query sequence). In the example below, an isoform of the Wacky protein FASTA sequence was added to the input box and the Drosophila melanogaster (D. melanogaster) DNA database was selected.
BLAST Loading & Results Page¶
After clicking BLAST (in this case TBLASTN), the page will switch to a loading screen. How long this screen is shown depends on:
- Computational power of the foreman (BLASTing is done but the machine has difficulties displaying the results due to the number of outputs).
- Computational power and availability of workers.
- Length of query.
Below is the result output of the Wacky BLAST search. This page will display BLAST-related statistical results such as Query coverage (%), Total score, E-value, and Identity for the whole query (top) and specific sequences (below). For more information on the BLAST output, visit the NCBI BLAST FAQ page or this Medium article.
No-hit Example & Further Reading¶
Below, we input the protein sequence of the human p53 gene, a well-known tumor suppressor. Then, we purposefully select non-human databases to check for possible BLAST hits, expecting no results.
Here is the BLAST results page reporting no resulting BLAST hits, as expected.
For a more comprehensive and in-depth understanding of BLAST, results and advanced parameters, please refer to the official NCBI BLAST Handbook.