BLAST Search Tutorial

In this tutorial, we perform a BLAST search against a simple local BLAST database. BLAST is a tool bioinformaticians use to compare a sample genetic sequence to a database of known sequences; it’s one of the most widely used bioinformatics tools. We follow an existing BLAST tutorial, but convert the whole exercise into a BioFlows pipeline.

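You can find the original tutorial at https://pawseysc.github.io/container-workflows/08-bio-example/index.html (the same URL appears in the website field of the pipeline below). Please open it in another browser tab.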

BioFlows Command Line

Open a new terminal window and run the bf executable.

$ bf
    BioFlows is a distributed pipeline framework for expressing ,
designing and running scalable reproducible and distributed computational bioinformatics workflows in cloud containers.
BioFlows Framework consists of software tools and cloud microservices that communicate together to achieve a highly distributed ,
highly coordinated and fault tolerant environment to run parallel bioinformatics pipelines onto cloud containers
and cloud servers. BioFlows also has BioFlows Description Language (BDL) which is an imperative and declarative standard for
describing and expressing computational bioinformatics tools and pipelines, BDL is flexible , easy to use
and a human readable language that enables researchers to design reproducible and scalable computational pipelines.
The language is based entirely on Yet Another Markup Language (YAML).

        Usage:
          bf [command]

        Available Commands:
          Dag         This command enables creation of GraphViz graph for BioFlows Pipeline
          Hub         Helper actions to interact with BioFlows centralized hub which contains published tools and pipelines
          Node        A group of helper commands which enables joining remote cluster or starts a local cluster of Bioflows workers
          Show        This command enables manipulation of tools and BioFlows Pipeline(s)
          Tool        Helper commands to manage and run single BioFlows Tools
          Workflow    A group of helper commands that allow manipulating, managing , running and submitting BioFlows Pipeline(s)
          help        Help about any command
          validate    validates a given BioFlows tool or pipeline definition file. It checks whether the file is valid and well-formatted or not.
                The file path could be a Local File System Path or a remote URL.


        Flags:
              --config string          config file (default is $HOME/.bf.yaml)
              --data_dir string        The directory which contains raw data.
          -h, --help                   help for bf
              --output_dir string      Output Directory where the running tool will save data.
              --params_config string   A file which contains your Pipeline specific initial parameters' values. You can know the required parameters for your pipeline through reading its definition file or running bf validate command.
          -t, --toggle                 Help message for toggle
          -v, --version                version for bf

        Use "bf [command] --help" for more information about a command.

Creating the pipeline

id: blast_prions
type: pipeline
name: Blast_Prion
description:
  - "We’ll be running a BLAST (Basic Local Alignment Search Tool) example with a container from BioContainers. BLAST is a tool bioinformaticians use to compare a sample genetic sequence to a database of known sequences; it’s one of the most widely used bioinformatics tools."
website: https://pawseysc.github.io/container-workflows/08-bio-example/index.html
version: 0.0.1
steps:

As you can see, this is a pipeline definition (type: pipeline) with a name, a description, a version, and a reference to the URL of the original tutorial. The author of that tutorial downloaded a query FASTA sequence and a genome sequence, then used the genome sequence to build a local BLAST database for fast local searches. Because the query sequence is a protein, not DNA, the search uses the blastp tool.

There are many ways to design this pipeline. You can download the original data yourself, put it in a folder, and add only the makeblastdb and BLAST search steps to the pipeline; or you can add steps that download the data first, followed by the makeblastdb and BLAST search steps.
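
The first option (fetching the data by hand, outside the pipeline) could look roughly like the sketch below. It reuses the two URLs and the curl and gunzip calls that appear in the pipeline definition in this tutorial; the function name fetch_blast_inputs and its directory argument are only illustrative and are not part of BioFlows.

```shell
#!/bin/sh
# Sketch: fetch the two input files by hand instead of using download steps.
# fetch_blast_inputs and its argument are hypothetical names for illustration.

QUERY_URL="http://www.uniprot.org/uniprot/P04156.fasta"
GENOME_URL="ftp://ftp.ncbi.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.1.protein.faa.gz"

fetch_blast_inputs() {
    dest="${1:-.}"
    mkdir -p "$dest" || return 1
    # -L follows redirects; -o names the local file explicitly.
    curl -L -o "$dest/$(basename "$QUERY_URL")" "$QUERY_URL" || return 1
    curl -o "$dest/$(basename "$GENOME_URL")" "$GENOME_URL" || return 1
    # gunzip replaces the .gz archive with the uncompressed file.
    gunzip "$dest/$(basename "$GENOME_URL")"
}

# Usage (needs network access):
#   fetch_blast_inputs ./data
```

With the files already in place, the pipeline itself would only need the makeblastdb and blastp steps.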

Please read the original tutorial first to understand what the author is doing, and then read the following YAML file, which includes all the steps.

id: blast_prions
type: pipeline
name: Blast_Prion
description:
  - "We’ll be running a BLAST (Basic Local Alignment Search Tool) example with a container from BioContainers. BLAST is a tool bioinformaticians use to compare a sample genetic sequence to a database of known sequences; it’s one of the most widely used bioinformatics tools."
website: https://pawseysc.github.io/container-workflows/08-bio-example/index.html
version: 0.0.1
steps:
  - id: DownloadFasta
    name: DownloadFasta
    inputs:
      - type: string
        displayname: URL for fasta file
        name: fasta_url
        value: http://www.uniprot.org/uniprot/P04156.fasta
    outputs:
      - type: file
        name: fasta_file
        value: "{{self_dir}}/P04156.fasta"
    command: "curl -L -o {{fasta_file}} {{fasta_url}}"
  - id: DownloadGenome
    name: DownloadGenome
    inputs:
      - type: string
        name: gfasta_url
        value: "ftp://ftp.ncbi.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.1.protein.faa.gz"
    outputs:
      - type: file
        name: hs_gfasta
        value: "{{self_dir}}/zebrafish.1.protein.faa.gz"
    command: "curl -O {{gfasta_url}}"
  - id: UnzipGenome
    name: UnzipGenome
    depends: DownloadGenome
    command: "gunzip {{DownloadGenome.hs_gfasta}}"
    outputs:
      - type: file
        name: hs_fasta
        value: "{{DownloadGenome.location}}/zebrafish.1.protein.faa"

  - id: MakeBlastdb
    name: MakeBlastdb
    depends: DownloadFasta,UnzipGenome
    imageId: biocontainers/blast:v2.2.31_cv2
    command: "makeblastdb -in {{UnzipGenome.hs_fasta}} -dbtype prot"
  - id: FastaAlignment
    name: FastaAlignment
    depends: MakeBlastdb
    imageId: biocontainers/blast:v2.2.31_cv2
    outputs:
      - type: file
        name: results_file
        value: "{{self_dir}}/results.txt"
    command: "blastp -query {{DownloadFasta.fasta_file}} -db {{UnzipGenome.hs_fasta}} -out {{results_file}}"

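If you have followed the original tutorial, these steps should be self-explanatory. Pay close attention to how the parameters are linked together: a step refers to an output of an earlier step as {{StepId.output_name}}, for example {{DownloadFasta.fasta_file}}.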
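
One way to review those links is to list every cross-step reference in the definition. The sketch below assumes the definition is stored in a file and relies on grep -o, which GNU and BSD grep provide; list_step_refs is a hypothetical helper, not a bf subcommand.

```shell
#!/bin/sh
# Sketch: list every "{{Step.name}}" reference found in a pipeline file.
# list_step_refs is a hypothetical helper for illustration only.

list_step_refs() {
    # -o prints only the matching text; sort -u removes duplicates.
    grep -oE '\{\{[A-Za-z0-9_]+\.[A-Za-z0-9_]+\}\}' "$1" | sort -u
}

# Usage:
#   list_step_refs prions.yaml
```

For the definition above, this should report references into DownloadFasta, DownloadGenome and UnzipGenome. Placeholders without a step prefix, such as {{self_dir}}, are not matched.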

Save this definition to a file, for instance prions.yaml.
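
The help output shown earlier lists a validate command that checks whether a definition file is valid and well-formatted. A minimal sketch of using it on the saved file follows; the invocation bf validate <file> is assumed from that help text (bf validate --help should confirm it), and validate_pipeline is a hypothetical wrapper.

```shell
#!/bin/sh
# Sketch: validate a pipeline file before drawing or running it.
# validate_pipeline is a hypothetical wrapper; the exact arguments of
# "bf validate" are assumed from the help text, not confirmed.
# This wrapper only handles local paths, although the help text says
# that validate also accepts a remote URL.

validate_pipeline() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 2
    fi
    if ! command -v bf >/dev/null 2>&1; then
        echo "bf executable not found in PATH" >&2
        return 127
    fi
    bf validate "$1"
}

# Usage:
#   validate_pipeline prions.yaml
```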

Drawing Pipeline Diagram

Now we have a real, reusable bioinformatics pipeline that does something useful and is decoupled from any host file system. Suppose you are writing a paper and need a diagram of this pipeline to include in it. A single bf command produces a publication-quality diagram.

First, install Graphviz. On Debian or Ubuntu, run the following command.

$ sudo apt-get install graphviz

Second, run the bf Dag command:

$ bf Dag /your/script/location/prions.yaml | dot -Tsvg > prions.svg

After the command finishes, the directory where you ran it contains a file named prions.svg. Double-click this file to view it.

It will look something like this:

[Figure: prions.svg, the diagram of the pipeline generated by bf Dag]
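
Graphviz can render the same graph to other formats through the -T option of dot, which helps when a journal asks for PNG or PDF figures. The sketch below wraps the command in a small function; render_dag is a hypothetical helper, and which formats are available depends on the local Graphviz build.

```shell
#!/bin/sh
# Sketch: render the pipeline graph in a chosen Graphviz output format.
# render_dag is a hypothetical helper; it assumes "bf Dag" writes DOT text
# to standard output, as the pipe in the command above implies.

render_dag() {
    yaml="$1"
    format="${2:-svg}"
    case "$format" in
        svg|png|pdf) ;;
        *) echo "unsupported format: $format" >&2; return 1 ;;
    esac
    # Output name: input file name with .yaml replaced by the format.
    out="$(basename "$yaml" .yaml).$format"
    bf Dag "$yaml" | dot "-T$format" > "$out"
}

# Usage:
#   render_dag /your/script/location/prions.yaml png   # writes prions.png
```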

Running your Pipeline

You can now run your pipeline. Pass --output_dir, the directory where the engine saves a folder for each step with all of its result files; --data_dir names the directory that contains raw data.

$ bf Workflow run --output_dir=/your/output/directory --data_dir=/any/other/data/directory /your/script/location/prions.yaml
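
Once the run finishes, the BLAST report written by the FastaAlignment step (results.txt) should be somewhere under the output directory. The sketch below locates it without assuming the exact folder layout, which is decided by the engine; find_results is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: locate the BLAST report after a run.
# find_results is a hypothetical helper. It assumes only that results.txt
# ends up somewhere below the output directory.

find_results() {
    find "${1:-.}" -type f -name 'results.txt'
}

# Usage:
#   find_results /your/output/directory
```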