RNA-Seq is a powerful method for studying gene expression and transcriptomics, and R provides an extensive ecosystem for analyzing RNA-Seq data. This guide will walk you through the key steps of RNA-Seq analysis using popular R packages.
Workflow Overview
- Quality Control: Assess the quality of raw sequencing reads.
- Alignment/Quantification: Align reads to a reference genome or quantify transcript abundance.
- Normalization: Adjust data for library size and sequencing depth.
- Differential Expression Analysis (DEA): Identify genes with significant expression changes.
- Visualization and Functional Analysis: Use plots and pathway tools to interpret results.
Step 1: Install Necessary R Packages
# Install Bioconductor Manager
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Install core packages for RNA-Seq analysis
BiocManager::install(c("DESeq2", "edgeR", "limma", "pheatmap", "clusterProfiler", "org.Hs.eg.db"))
install.packages("ggplot2") # For visualization
Step 2: Load Libraries and Data
library(DESeq2)
library(edgeR)
library(ggplot2)
library(pheatmap)
library(clusterProfiler)
library(org.Hs.eg.db)
# Load count data and metadata
counts <- read.csv("counts.csv", row.names = 1) # Raw count matrix
metadata <- read.csv("metadata.csv") # Sample information
Example Count Matrix (counts.csv):
| Gene | Sample1 | Sample2 | Sample3 | Sample4 |
|---|---|---|---|---|
| GeneA | 120 | 100 | 500 | 400 |
| GeneB | 50 | 60 | 20 | 15 |
Example Metadata (metadata.csv):
| SampleID | Condition |
|---|---|
| Sample1 | Control |
| Sample2 | Control |
| Sample3 | Treatment |
| Sample4 | Treatment |
Step 3: Quality Control
Before starting the analysis, ensure the data is high quality.
Visualize Library Sizes
library_sizes <- colSums(counts)
barplot(library_sizes, main = "Library Sizes", ylab = "Total Reads", col = "blue")
Filter Lowly Expressed Genes
# Remove genes with low counts across all samples
keep <- rowSums(counts >= 10) >= 2
filtered_counts <- counts[keep, ]
Step 4: Create DESeq2 Dataset
Prepare Data
dds <- DESeqDataSetFromMatrix(countData = filtered_counts,
colData = metadata,
design = ~ Condition)
Normalize Data
dds <- DESeq(dds)
normalized_counts <- counts(dds, normalized = TRUE)
Step 5: Differential Expression Analysis
Extract Results
results <- results(dds, contrast = c("Condition", "Treatment", "Control"))
# Summary of results
summary(results)
# Save results to a CSV file
write.csv(as.data.frame(results), "differential_expression_results.csv")
Filter Significant Genes
significant_genes <- results[results$padj < 0.05 & abs(results$log2FoldChange) > 1, ]
Step 6: Visualization
MA Plot
plotMA(results, main = "MA Plot", ylim = c(-5, 5))
Volcano Plot
results_df <- as.data.frame(results)
results_df$Significant <- ifelse(results_df$padj < 0.05 & abs(results_df$log2FoldChange) > 1, "Yes", "No")
ggplot(results_df, aes(x = log2FoldChange, y = -log10(pvalue), color = Significant)) +
geom_point() +
theme_minimal() +
labs(title = "Volcano Plot", x = "Log2 Fold Change", y = "-Log10 P-Value")
Heatmap of Top Genes
# Select top differentially expressed genes
top_genes <- head(rownames(significant_genes), 20)
heatmap_data <- normalized_counts[top_genes, ]
pheatmap(log2(heatmap_data + 1),
cluster_rows = TRUE,
cluster_cols = TRUE,
main = "Heatmap of Top Differentially Expressed Genes")
Step 7: Functional Enrichment Analysis
Perform GO and KEGG Enrichment
# Convert gene IDs to Entrez IDs
library(clusterProfiler)
gene_list <- rownames(significant_genes)
entrez_ids <- mapIds(org.Hs.eg.db, keys = gene_list, column = "ENTREZID", keytype = "SYMBOL")
# GO Enrichment Analysis
go_results <- enrichGO(gene = entrez_ids, OrgDb = org.Hs.eg.db, ont = "BP")
# KEGG Pathway Analysis
kegg_results <- enrichKEGG(gene = entrez_ids, organism = "hsa")
# Visualize results
barplot(go_results, showCategory = 10, title = "GO Biological Processes")
dotplot(kegg_results, showCategory = 10, title = "KEGG Pathway Enrichment")
Step 8: Save Normalized Data
Save normalized data for downstream analysis or sharing:
write.csv(as.data.frame(normalized_counts), "normalized_counts.csv")
Step 9: Additional Analysis
- Time-Series Analysis: Use packages like
TCseqfor dynamic RNA-Seq data. - Single-Cell RNA-Seq: Use tools like
Seuratfor scRNA-seq analysis.
Conclusion
RNA-Seq analysis in R is a robust approach for exploring gene expression and gaining biological insights. By combining tools like DESeq2, ggplot2, and clusterProfiler, you can uncover significant patterns and pathways. These steps can be tailored for specific research questions and datasets.

Comments
Post a Comment