A Comprehensive Guide to RNA-Seq Analysis in R

RNA-Seq is a powerful method for studying gene expression and transcriptomics, and R provides an extensive ecosystem for analyzing RNA-Seq data. This guide will walk you through the key steps of RNA-Seq analysis using popular R packages.

Workflow Overview

Quality Control: Assess the quality of raw sequencing reads.
Alignment/Quantification: Align reads to a reference genome or quantify transcript abundance.
Normalization: Adjust data for library size and sequencing depth.
Differential Expression Analysis (DEA): Identify genes with significant expression changes.
Visualization and Functional Analysis: Use plots and pathway tools to interpret results.

Step 1: Install Necessary R Packages

# Install Bioconductor Manager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install core packages for RNA-Seq analysis
BiocManager::install(c("DESeq2", "edgeR", "limma", "pheatmap", "clusterProfiler", "org.Hs.eg.db"))
install.packages("ggplot2")  # For visualization

Step 2: Load Libraries and Data

library(DESeq2)
library(edgeR)
library(ggplot2)
library(pheatmap)
library(clusterProfiler)
library(org.Hs.eg.db)

# Load count data and metadata
counts <- read.csv("counts.csv", row.names = 1)  # Raw count matrix
metadata <- read.csv("metadata.csv")  # Sample information

Example Count Matrix (counts.csv):

Gene	Sample1	Sample2	Sample3	Sample4
GeneA	120	100	500	400
GeneB	50	60	20	15

Example Metadata (metadata.csv):

SampleID	Condition
Sample1	Control
Sample2	Control
Sample3	Treatment
Sample4	Treatment

Step 3: Quality Control

Before starting the analysis, ensure the data is high quality.

Visualize Library Sizes

library_sizes <- colSums(counts)
barplot(library_sizes, main = "Library Sizes", ylab = "Total Reads", col = "blue")

Filter Lowly Expressed Genes

# Remove genes with low counts across all samples
keep <- rowSums(counts >= 10) >= 2
filtered_counts <- counts[keep, ]

Step 4: Create DESeq2 Dataset

Prepare Data

dds <- DESeqDataSetFromMatrix(countData = filtered_counts, 
                              colData = metadata, 
                              design = ~ Condition)

Normalize Data

dds <- DESeq(dds)
normalized_counts <- counts(dds, normalized = TRUE)

Step 5: Differential Expression Analysis

Extract Results

results <- results(dds, contrast = c("Condition", "Treatment", "Control"))

# Summary of results
summary(results)

# Save results to a CSV file
write.csv(as.data.frame(results), "differential_expression_results.csv")

Filter Significant Genes

significant_genes <- results[results$padj < 0.05 & abs(results$log2FoldChange) > 1, ]

Step 6: Visualization

MA Plot

plotMA(results, main = "MA Plot", ylim = c(-5, 5))

Volcano Plot

results_df <- as.data.frame(results)
results_df$Significant <- ifelse(results_df$padj < 0.05 & abs(results_df$log2FoldChange) > 1, "Yes", "No")

ggplot(results_df, aes(x = log2FoldChange, y = -log10(pvalue), color = Significant)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Volcano Plot", x = "Log2 Fold Change", y = "-Log10 P-Value")

Heatmap of Top Genes

# Select top differentially expressed genes
top_genes <- head(rownames(significant_genes), 20)
heatmap_data <- normalized_counts[top_genes, ]

pheatmap(log2(heatmap_data + 1), 
         cluster_rows = TRUE, 
         cluster_cols = TRUE, 
         main = "Heatmap of Top Differentially Expressed Genes")

Step 7: Functional Enrichment Analysis

Perform GO and KEGG Enrichment

# Convert gene IDs to Entrez IDs
library(clusterProfiler)
gene_list <- rownames(significant_genes)
entrez_ids <- mapIds(org.Hs.eg.db, keys = gene_list, column = "ENTREZID", keytype = "SYMBOL")

# GO Enrichment Analysis
go_results <- enrichGO(gene = entrez_ids, OrgDb = org.Hs.eg.db, ont = "BP")

# KEGG Pathway Analysis
kegg_results <- enrichKEGG(gene = entrez_ids, organism = "hsa")

# Visualize results
barplot(go_results, showCategory = 10, title = "GO Biological Processes")
dotplot(kegg_results, showCategory = 10, title = "KEGG Pathway Enrichment")

Step 8: Save Normalized Data

Save normalized data for downstream analysis or sharing:

write.csv(as.data.frame(normalized_counts), "normalized_counts.csv")

Step 9: Additional Analysis

Time-Series Analysis: Use packages like TCseq for dynamic RNA-Seq data.
Single-Cell RNA-Seq: Use tools like Seurat for scRNA-seq analysis.

Conclusion

RNA-Seq analysis in R is a robust approach for exploring gene expression and gaining biological insights. By combining tools like DESeq2, ggplot2, and clusterProfiler, you can uncover significant patterns and pathways. These steps can be tailored for specific research questions and datasets.

AgriBio Insights

Search This Blog