Genome assembly

Genome Assembly Infographic - BIOEDUC.COM
Bioinformatics Deep-Dive

Genome Assembly Guide

Interactive Infographic

Reconstructing the Blueprint of Life

A genome cannot be read end to end in a single pass. Sequencing instruments instead read millions of DNA fragments, producing raw "reads." Assembly software then finds overlaps between reads, merges them, and reconstructs the complete biological sequence.

Algorithm-Driven
QC Metrics
NGS & TGS Platforms
📰

The Giant Newspaper Analogy

Imagine printing thousands of copies of the same newspaper, running them all through a shredder, and then trying to reconstruct the original front-page story just by matching overlapping words on the shredded strips. That is de novo genome assembly.

Shredded Strips = Reads | Complete Paper = Genome

Comparing Sequencing Modalities

Short vs. Long Reads

Short Reads (NGS)

  • High Accuracy: >99.9% base accuracy.
  • Cost-Effective: Low cost per gigabase of data.
  • Resolution Limit: Cannot resolve repeats longer than the read or fragment length, because no single read spans them.
Length: 100 - 300 bp | Illumina / MGI

Long Reads (TGS)

  • Spans Repeats: A single read can bridge long, complex repetitive regions.
  • Accuracy Evolution: Raw error rates were once high. PacBio HiFi (circular consensus sequencing) now produces long reads with accuracy above 99%, and raw Nanopore accuracy has also improved.
  • Expense: Comparatively higher instrument setup cost.
Length: 10k - 100k+ bp | Oxford Nanopore / PacBio

Assembly Pipeline Architecture

1

Fragmentation & Sequencing

High-molecular-weight DNA is isolated, sheared, and loaded onto a sequencer to generate raw reads.

2

Error Correction & Preprocessing

Removal of low-quality bases and adapter sequences, followed by computational correction of sequencing artifacts in the reads.
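As a rough illustration of the low-quality base removal described above, the sketch below trims the 3' end of a read until it reaches a base at or above a quality threshold. The helper names `phred_scores` and `trim_3prime`, the threshold of 20, and the Phred+33 encoding are assumptions made for this example. Dedicated trimming tools use more elaborate logic, such as sliding windows and adapter matching.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop trailing bases whose Phred score is below min_quality.

    Hypothetical helper: trims from the 3' end only and stops at the
    first base (scanning backwards) that meets the threshold.
    """
    scores = phred_scores(quality_string)
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


read = "ACGTACGTAC"
qual = "IIIIIIII#!"  # 'I' = Q40, '#' = Q2, '!' = Q0
trimmed_seq, trimmed_qual = trim_3prime(read, qual)
print(trimmed_seq)  # ACGTACGT
```

Only the two low-scoring bases at the end are removed. The high-quality prefix and its quality string stay aligned.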

3

Contig Construction (Graph Stage)

Algorithms merge overlapping reads into gap-free consensus sequences called contigs.

4

Scaffolding & Polishing

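Paired-end reads or long reads establish the order, orientation, and approximate spacing of contigs. Gaps of unknown sequence between contigs are filled with runs of N so that the contigs keep their relative chromosomal positions. Polishing then corrects residual base errors in the consensus.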
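The gap-marking idea can be sketched in a few lines. The function `build_scaffold` and its parameters are hypothetical, and the sketch assumes the contigs have already been ordered and oriented. The placeholder length used for gaps of unknown size is an assumed default, not something stated in this guide.

```python
def build_scaffold(contigs, gap_sizes, unknown_gap=100):
    """Join ordered, oriented contigs with runs of N.

    contigs     : contig sequences in scaffold order
    gap_sizes   : estimated gap length between neighbours (None = unknown)
    unknown_gap : assumed placeholder length for gaps of unknown size
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size per adjacent contig pair")
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_sizes):
        n_run = unknown_gap if gap is None else max(gap, 1)
        parts.append("N" * n_run)
        parts.append(contig)
    return "".join(parts)


scaffold = build_scaffold(["ACGT", "GGCC", "TTAA"], [3, None], unknown_gap=5)
print(scaffold)  # ACGTNNNGGCCNNNNNTTAA
```

The N runs carry no sequence information. They only preserve the estimated distance between contigs.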

Core Assembly Algorithms

OLC (Overlap-Layout-Consensus)

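Typically applied to long-read data sets (PacBio/Nanopore). Computes pairwise overlaps between reads (all against all, in principle), lays the reads out along a path through the resulting overlap graph, and derives a consensus sequence from that layout.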

Diagram: Read 1 and Read 2 joined by their overlap
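A toy version of the overlap and layout steps is sketched below. It is a simplification under stated assumptions: overlaps must match exactly, reads are error-free and on the same strand, and merging is greedy. Real OLC assemblers use approximate alignment and a separate consensus step. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        # All-against-all comparison: quadratic in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[i] + reads[j][olen:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGCAT", "CATGGA", "GGATTC"]))  # ['ATGCATGGATTC']
```

The nested loop makes the all-against-all cost visible. That cost is why this approach suits fewer, longer reads rather than very large numbers of short ones.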

De Bruijn Graph (k-mers)

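Used primarily with short reads. Splits each read into overlapping substrings of length k (k-mers) and links k-mers that overlap by k-1 bases. Executes quickly and handles high coverage depth well, because identical k-mers from many reads collapse into the same part of the graph.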

ATG TGC GCA
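The k-mer chain above can be turned into a small graph. This sketch uses one common formulation in which nodes are (k-1)-mers and each k-mer contributes an edge. The function names are hypothetical, and the walk handles only unbranched paths, with no error correction or repeat resolution.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the (k-1)-mers that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Follow edges from `start` while each node has exactly one distinct successor."""
    sequence = start
    node = start
    visited = set()
    while node in graph and len(set(graph[node])) == 1 and node not in visited:
        visited.add(node)
        node = graph[node][0]
        sequence += node[-1]
    return sequence


graph = de_bruijn_edges(["ATGCA"], k=3)
print(graph)                          # {'AT': ['TG'], 'TG': ['GC'], 'GC': ['CA']}
print(walk_unbranched(graph, "AT"))   # ATGCA
```

The k-mers ATG, TGC, and GCA become the edges AT→TG, TG→GC, and GC→CA. Walking them in order spells the original sequence.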

Assembly Quality Controls

BUSCO Complete Score 98.2%

Checks how many of the universal single-copy orthologs expected for the chosen taxonomic lineage are present and complete in the assembly.

N50 Statistics Essential Metric

The largest contig length L such that contigs of length L or longer together make up at least 50% of the total assembly size.

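Formula: $\mathrm{N50} = \max\{\, L : \sum_{L_i \ge L} L_i \;\ge\; \tfrac{1}{2} \sum_i L_i \,\}$, where $L_i$ is the length of contig $i$.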
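The definition translates directly into a short calculation: sort contig lengths from longest to shortest and accumulate them until half of the total is reached. The function name `n50` and the example lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300, half = 150
print(n50(lengths))  # 70
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 300-base assembly, so the N50 is 70.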

Interactive K-mer Sandbox

Enter a sequence to see how a De Bruijn graph assembler breaks a read into k-mers:

Extracted Blocks
ATG TGC GCA CAT ATG TGA
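The extracted blocks above are consistent with the input ATGCATGA and k = 3. That input is inferred from the blocks, since the page does not show it. A minimal sketch of the extraction, with a hypothetical helper name:

```python
from collections import Counter


def extract_kmers(sequence, k):
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = extract_kmers("ATGCATGA", 3)
print(" ".join(kmers))       # ATG TGC GCA CAT ATG TGA
print(Counter(kmers)["ATG"])  # 2
```

A sequence of length n yields n - k + 1 k-mers. ATG occurs twice here, which is the kind of repeated k-mer that a De Bruijn graph collapses into a single element.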

Reference Mapping

Before assembling a genome, ask whether a high-quality reference genome exists for the same or a closely related species. If it does, raw reads can be mapped directly to it, avoiding the computationally expensive overlap search of de novo assembly:

🌱 De Novo: No reference required
🗺️ Guided Assembly: Uses existing template
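To show why mapping to a template is cheaper than comparing reads with each other, here is a simplified seed-and-verify sketch. It indexes the reference once by k-mer, then places each read by looking up its first k-mer. The function names and the `max_mismatches` parameter are hypothetical. The sketch also assumes an exact seed match and substitution errors only, whereas production read mappers use compressed indexes and gapped alignment.

```python
from collections import defaultdict


def build_index(reference, k):
    """Record every position at which each k-mer occurs in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for placements seeded by the read's first k-mer."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTTGCATGCAAGT"
index = build_index(reference, k=4)
print(map_read("TGCATGCA", reference, index, k=4))  # [(4, 0)]
print(map_read("TGCATGGA", reference, index, k=4))  # [(4, 1)]
```

Each read costs one index lookup plus a short verification, rather than a comparison against every other read.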