How to BLAST Protein or DNA Sequences Against a Genome in Linux: A Step-by-Step Guide

- January 09, 2025

BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics tool for comparing nucleotide or protein sequences against databases, including genomes. It’s an essential tool for researchers in genomics, bioinformatics, and related fields to identify similarities between sequences and to annotate genomes. In this blog post, we will walk you through the steps to blast a protein or DNA sequence against a genome using Linux. This guide assumes you have a working knowledge of Linux and have BLAST+ installed on your system.

Prerequisites

Before we get started, you need to ensure that you have:

BLAST+ installed on your Linux system.
A query sequence file in FASTA format.
A genome database to blast against.

Step 1: Install BLAST+ on Linux

To install the BLAST+ tools on Debian or Ubuntu, open a terminal and run the following commands:

sudo apt-get update
sudo apt-get install ncbi-blast+

This installs the BLAST+ version packaged by your distribution, which may lag behind the latest NCBI release. On other Linux distributions or macOS, use the appropriate package manager instead (e.g., brew for macOS).
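To confirm that the installation worked, you can check that the programs used later in this guide are on your PATH. The snippet below is a minimal sketch; the helper name check_tools is made up for illustration and is not part of BLAST+.

```shell
# Report whether each named program can be found on PATH.
# check_tools is a hypothetical helper, not a BLAST+ command.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# The three BLAST+ programs this guide relies on.
check_tools makeblastdb blastn tblastn

# If blastn is present, print its version string.
if command -v blastn >/dev/null 2>&1; then
    blastn -version
fi
```

If any tool is reported as missing, revisit the installation step before continuing.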

Step 2: Obtain a Genome Database

Next, you need a reference genome to blast your sequence against. You can either download a genome from NCBI or another public database, or you can use your own custom genome.

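For example, you can download the human reference genome (GRCh38.p13) from NCBI with the wget command, saving it under a shorter name:

wget -O genomic.fna.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz

The download is gzip-compressed, so decompress it before use:

gunzip genomic.fna.gz

Once you have the uncompressed genome file (genomic.fna), you will need to format it for use with BLAST using the makeblastdb command.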
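Before building the database, it can help to confirm that the FASTA file is complete and not truncated. The following sketch counts sequences and residues with awk; fasta_summary is a hypothetical helper name, and the file name genomic.fna is assumed from the commands in this guide.

```shell
# Print the number of sequences and the total number of residues in a FASTA file.
# fasta_summary is a hypothetical helper written for this guide.
fasta_summary() {
    awk '/^>/ { nseq++; next }                               # header lines start a new sequence
         { gsub(/[ \t\r]/, ""); total += length($0) }        # sequence lines: strip whitespace, add length
         END { printf "sequences=%d residues=%.0f\n", nseq, total }' "$1"
}

# Usage:
# fasta_summary genomic.fna
```

A sequence count of zero, or a residue total far below the expected genome size, usually points to an incomplete download or a file that is still compressed.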

Step 3: Format the Genome Database for BLAST

BLAST requires the genome sequence to be in a specific database format. Use the makeblastdb tool to format the genome FASTA file.

makeblastdb -in genomic.fna -dbtype nucl -out genome_db

Here:

-in genomic.fna: The input FASTA file containing the genome.
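-dbtype nucl: Specifies that the database contains nucleotide sequences, which is the case for a genome (use prot only when the database itself contains protein sequences; the type of your query does not matter here).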
-out genome_db: Specifies the name of the output database.

This will create the necessary database files for BLAST to use during sequence comparison.

Step 4: Prepare Your Query Sequence

Now that you have your genome database ready, you need the sequence that you want to blast against the genome. This can be either a protein or a DNA sequence. Your query sequence should be in FASTA format. Here’s an example of a protein sequence:

>protein_query
MKTLLILVLIYLASLRHHDQKSTVVRVAGGIEEAGV

Save this sequence in a file (e.g., protein_query.fasta).
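One way to create the file from the terminal is with a here-document, followed by a quick sanity check. This is a sketch that uses only standard shell tools; the variable name query_length is chosen here for illustration.

```shell
# Write the example query to protein_query.fasta.
cat > protein_query.fasta <<'EOF'
>protein_query
MKTLLILVLIYLASLRHHDQKSTVVRVAGGIEEAGV
EOF

# Check that the first line is a FASTA header (starts with ">").
if head -n 1 protein_query.fasta | grep -q '^>'; then
    echo "header ok"
fi

# Count residues on all non-header lines, ignoring whitespace.
query_length=$(awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' protein_query.fasta)
echo "residues: $query_length"
```

For the example sequence above this reports 36 residues.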

Step 5: Perform BLAST Search

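Now you are ready to run the BLAST search. Because the genome database is a nucleotide database, the program to use depends on your query type: blastn for a nucleotide query, and tblastn for a protein query. tblastn translates the database in all six reading frames and compares the protein query against those translations. blastp will not work here, because it requires a protein database.

For nucleotide sequences, use blastn (dna_query.fasta is a placeholder for your own nucleotide FASTA file):

blastn -query dna_query.fasta -db genome_db -out results.txt -evalue 1e-5 -outfmt 6

For protein sequences, use tblastn:

tblastn -query protein_query.fasta -db genome_db -out results.txt -evalue 1e-5 -outfmt 6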

Explanation of parameters:

-query protein_query.fasta: Specifies the query sequence file.
-db genome_db: The name of the formatted genome database.
-out results.txt: The output file where results will be saved.
-evalue 1e-5: Sets the E-value threshold (adjust based on your requirements).
-outfmt 6: Specifies the output format (tabular: one tab-separated line per hit with no header, which is easy to parse).
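The parameters above can be wrapped in a small script that picks the program for you. The sketch below assumes a nucleotide database, where a nucleotide query calls for blastn and a protein query calls for tblastn. The function names and the 90% threshold are assumptions made for this example, and the type guess is only a heuristic that can misjudge short or unusual sequences.

```shell
# Guess whether a FASTA file ($1) is nucleotide or protein.
# Heuristic (an assumption for this sketch): if at least 90% of the letters
# are A, C, G, T, U or N, treat the file as nucleotide.
guess_seq_type() {
    awk '!/^>/ {
             s = toupper($0); gsub(/[^A-Z]/, "", s)   # keep letters only
             total += length(s)
             nuc += gsub(/[ACGTUN]/, "", s)           # gsub returns the number of matches
         }
         END { if (total > 0 && nuc / total >= 0.9) print "nucl"; else print "prot" }' "$1"
}

# Choose the BLAST+ program for searching a nucleotide database.
pick_program() {
    if [ "$(guess_seq_type "$1")" = "nucl" ]; then
        echo blastn
    else
        echo tblastn
    fi
}

# Run the search: run_blast QUERY_FASTA DB_NAME OUT_FILE
run_blast() {
    prog=$(pick_program "$1")
    "$prog" -query "$1" -db "$2" -out "$3" -evalue 1e-5 -outfmt 6
}

# Usage (requires BLAST+ and a formatted database):
# run_blast protein_query.fasta genome_db results.txt
```

Keeping the program choice in one function makes it easier to loop over many query files without editing the command each time.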

Step 6: Interpret BLAST Results

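The output file (results.txt) contains one tab-separated line per hit, with 12 columns by default and no header line. In order, they are:

Query ID (qseqid): The name of your query sequence.
Subject ID (sseqid): The name of the matching sequence from the genome.
% Identity (pident): The percentage of identical matches between the query and subject.
Alignment Length (length): The length of the alignment between the query and subject.
Mismatches (mismatch): The number of mismatched positions in the alignment.
Gap Openings (gapopen): The number of gap openings in the alignment.
Query Start and Query End (qstart, qend): The aligned region on the query.
Subject Start and Subject End (sstart, send): The aligned region on the subject.
E-value (evalue): The number of hits with this score expected by chance; lower values are more significant.
Bit Score (bitscore): The normalized alignment score; higher values indicate better matches.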

Example output (abbreviated for illustration; real -outfmt 6 lines contain all 12 columns):

protein_query  chr1    100.0   50  0  50  1  50  1e-50

In this example, the query aligns with 100% identity over a 50-residue stretch on chromosome 1 (chr1) with an E-value of 1e-50, which indicates a highly significant match.
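Because the tabular output is plain text, standard tools such as awk can filter it. The sketch below assumes the default 12-column -outfmt 6 layout, where column 3 is percent identity, column 11 is the E-value, and column 12 is the bit score. The function names are made up for this example.

```shell
# Keep hits with percent identity >= $2 and E-value <= $3.
# $1 is a file in default 12-column -outfmt 6 layout.
filter_hits() {
    awk -F '\t' -v min_id="$2" -v max_e="$3" \
        '($3 + 0) >= (min_id + 0) && ($11 + 0) <= (max_e + 0)' "$1"
}

# Print the hit with the highest bit score (column 12) for each query.
# The order of queries in the output is not guaranteed.
best_hit_per_query() {
    awk -F '\t' '!($1 in best) || ($12 + 0) > score[$1] {
                     best[$1] = $0; score[$1] = $12 + 0
                 }
                 END { for (q in best) print best[q] }' "$1"
}

# Usage:
# filter_hits results.txt 90 1e-5 > strong_hits.txt
# best_hit_per_query results.txt > best_hits.txt
```

The thresholds of 90% identity and 1e-5 in the usage lines are only examples; suitable cutoffs depend on how closely related you expect the query and genome to be.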

Step 7: Optional: Visualize or Analyze the Results

You may want to visualize or further analyze the BLAST results, especially if you are working with large genomes or multiple query sequences. Tools like Blast2GO (functional annotation), JBrowse (a genome browser), or custom scripts can help you annotate the findings or view the hits in their genomic context.
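As one example of a custom script, the hits can be converted to BED, a simple interval format that many genome browsers can load as a track. The sketch below assumes default 12-column -outfmt 6 input, where columns 9 and 10 are the subject start and end. The function name hits_to_bed is hypothetical, and the BED score field is simply set to 0.

```shell
# Convert default -outfmt 6 hits ($1) to 6-column BED:
# chrom, start (0-based), end, name, score, strand.
hits_to_bed() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
        {
            s = $9 + 0; e = $10 + 0; strand = "+"
            # BLAST reports reverse-strand hits with start > end; swap them.
            if (s > e) { t = s; s = e; e = t; strand = "-" }
            # BLAST coordinates are 1-based inclusive; BED starts are 0-based.
            print $2, s - 1, e, $1, 0, strand
        }' "$1"
}

# Usage:
# hits_to_bed results.txt > hits.bed
```

Sorting the resulting file by chromosome and start position (for example with sort -k1,1 -k2,2n) is often required before a browser or indexing tool will accept it.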

Conclusion

BLAST is a powerful tool that can help you align protein or DNA sequences against a genome. By following these steps, you can quickly perform sequence comparisons, identify homologous genes, or annotate new sequences within a genome. With the flexibility of the command line and Linux-based tools, you can automate and scale your analyses for large datasets.
