Architecture

This page describes how cigar-sashimi is put together. End users do not need it; it is aimed at contributors and anyone integrating with the JSON output.

Components

| Crate / package | Role |
| --- | --- |
| cigar-sashimi-core (core/) | Rust library + binary: reads BAM, walks CIGAR, extracts events, clusters, filters, emits JSON. |
| cigar-sashimi-render (render/) | Node renderer: turns core JSON into SVG or self-contained interactive HTML. |
| cigar-sashimi (root script) | Bash wrapper that runs core → renderer for normal CLI use. |
| cigar-sashimi-wasm (wasm/) | WebAssembly bindings wrapping the core library for the browser. |
| web/ | Browser app (index.html) that drives the WASM module and renders inline. |

Pipelines

CLI:

BAM + region (+ optional GTF)
  → cigar-sashimi-core   (Rust)
  → JSON
  → cigar-sashimi-render (Node)
  → SVG or HTML

Browser (identical analysis, no server):

.bam + .bai (+ optional GTF)   (file pickers)
  → process_bam()              (WASM = core compiled to wasm32)
  → JSON string
  → renderer JS                (bundled in web/index.html)
  → interactive plot

Both paths call the same core code, so output is identical for identical inputs.

Core pipeline (in order)

  1. BAM query & per-read filtering (bam.rs): query the region via noodles; drop reads by --exclude-flags and --min-mapq; walk each CIGAR; discard insertion/clip events shorter than the length thresholds; compute canonical minimizers per event.
  2. CIGAR walking (cigar.rs): iterate ops in reference space: M/=/X add coverage; N emits a junction arc; D emits a deletion arc; I and S record events at the current reference position; H is ignored.
  3. Coverage accumulation (coverage.rs): collapse per-base hits into a dense depth array.
  4. Annotation resolution (gtf.rs, introns.rs), in one of three modes:
     - --gtf given → parse exons (merged into a flat track), derive introns as the complement, and parse per-transcript exon/CDS records.
     - no --gtf, compression on → coverage-based exon/intron detection (--intron-window, --intron-cov).
     - --no-compression → no annotation track.
     In every mode, --assembly is parsed independently into an assembled-transcript track.
  5. Arc counting (arcs.rs): deduplicate identical junction/deletion spans and count supporting reads.
  6. Clustering (cluster.rs): group insertion and soft-clip events by reference position (5 bp tolerance), then by minimizer Jaccard similarity (--min-jaccard).
  7. Filtering (filter.rs): apply count and coverage-ratio thresholds to arcs and clusters.
  8. Serialize (types.rs): convert internal 0-based coordinates to 1-based inclusive and emit JSON.
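
The CIGAR walking, coverage accumulation, and arc counting steps above can be sketched as follows. This is a minimal illustration, not the code in cigar.rs, coverage.rs, or arcs.rs: the `Op`, `ArcKind`, and `Walk` types and all three function names are hypothetical, M/=/X are collapsed into a single `Match` variant, and reads are assumed to be already filtered.

```rust
use std::collections::HashMap;

/// CIGAR op kinds (hypothetical enum; `Match` stands for M, = and X).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Op { Match, Ins, Del, Skip, SoftClip, HardClip }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ArcKind { Junction, Deletion }

/// What one read contributes, in 0-based half-open reference coordinates.
#[derive(Debug, Default)]
pub struct Walk {
    pub blocks: Vec<(u64, u64)>,          // aligned blocks that add coverage
    pub arcs: Vec<(ArcKind, u64, u64)>,   // N and D spans
    pub points: Vec<(Op, u64, u32)>,      // I and S events: (op, ref position, length)
}

/// Walk one CIGAR in reference space, starting at `ref_start`.
pub fn walk_cigar(ref_start: u64, cigar: &[(Op, u32)]) -> Walk {
    let mut out = Walk::default();
    let mut pos = ref_start;
    for &(op, len) in cigar {
        let end = pos + u64::from(len);
        match op {
            Op::Match => { out.blocks.push((pos, end)); pos = end; }
            Op::Skip => { out.arcs.push((ArcKind::Junction, pos, end)); pos = end; }
            Op::Del => { out.arcs.push((ArcKind::Deletion, pos, end)); pos = end; }
            // I and S do not consume reference; record them at the current position.
            Op::Ins | Op::SoftClip => out.points.push((op, pos, len)),
            Op::HardClip => {}
        }
    }
    out
}

/// Dense per-base depth over [region_start, region_end).
pub fn accumulate_coverage(region_start: u64, region_end: u64, walks: &[Walk]) -> Vec<u32> {
    let mut depth = vec![0u32; (region_end - region_start) as usize];
    for w in walks {
        for &(s, e) in &w.blocks {
            // Clip each block to the region; the range is empty if they do not overlap.
            for p in s.max(region_start)..e.min(region_end) {
                depth[(p - region_start) as usize] += 1;
            }
        }
    }
    depth
}

/// Deduplicate identical spans and count the reads supporting each.
pub fn count_arcs(walks: &[Walk]) -> HashMap<(ArcKind, u64, u64), u32> {
    let mut counts = HashMap::new();
    for w in walks {
        for &arc in &w.arcs {
            *counts.entry(arc).or_insert(0) += 1;
        }
    }
    counts
}
```

For a read starting at reference position 100 with CIGAR 5S10M100N10M2I5M3D5M, this sketch yields a junction over [110, 210), a deletion over [225, 228), a soft clip at 100, and an insertion at 220. The intron and the deleted bases receive no coverage.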

Minimizers

Each insertion/soft-clip sequence is reduced to a short ordered set of canonical minimizer hashes (k-mers hashed in both orientations, minimum per sliding window). Two events are clustered when their minimizer sets reach the Jaccard threshold, which groups near-identical inserted/clipped sequences without full alignment.
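
A minimal sketch of the idea, under stated assumptions: the hash function (FNV-1a here), the k and w values, the use of a `BTreeSet` as the "ordered set", and the greedy first-match clustering are all illustrative choices, not necessarily what cluster.rs does. The `Event` type and the function names are hypothetical.

```rust
use std::collections::BTreeSet;

fn complement(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => b'T', b'C' => b'G', b'G' => b'C', b'T' => b'A',
        _ => b'N',
    }
}

/// 64-bit FNV-1a over a byte stream (stand-in for whatever hash the core uses).
fn fnv1a(bytes: impl Iterator<Item = u8>) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Canonical k-mer hash: the smaller of the forward and reverse-complement hashes.
fn canonical_hash(kmer: &[u8]) -> u64 {
    let fwd = fnv1a(kmer.iter().map(|b| b.to_ascii_uppercase()));
    let rev = fnv1a(kmer.iter().rev().map(|&b| complement(b)));
    fwd.min(rev)
}

/// Minimum canonical hash in every window of `w` consecutive k-mers.
pub fn minimizers(seq: &[u8], k: usize, w: usize) -> BTreeSet<u64> {
    let mut out = BTreeSet::new();
    if k == 0 || w == 0 || seq.len() < k {
        return out;
    }
    let hashes: Vec<u64> = seq.windows(k).map(canonical_hash).collect();
    if hashes.len() <= w {
        // Sequence shorter than one full window: keep the single overall minimum.
        out.insert(*hashes.iter().min().unwrap());
        return out;
    }
    for win in hashes.windows(w) {
        out.insert(*win.iter().min().unwrap());
    }
    out
}

/// Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0 for two empty sets.
pub fn jaccard(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

pub struct Event { pub pos: u64, pub mins: BTreeSet<u64> }

/// Greedy clustering: an event joins the first cluster whose first member is
/// within `pos_tol` bp and at least `min_jaccard` similar; otherwise it starts a new one.
/// Returns clusters as lists of indices into `events`.
pub fn cluster(events: &[Event], pos_tol: u64, min_jaccard: f64) -> Vec<Vec<usize>> {
    let mut clusters: Vec<Vec<usize>> = Vec::new();
    for (i, ev) in events.iter().enumerate() {
        let hit = clusters.iter().position(|c| {
            let rep = &events[c[0]];
            rep.pos.abs_diff(ev.pos) <= pos_tol && jaccard(&rep.mins, &ev.mins) >= min_jaccard
        });
        match hit {
            Some(j) => clusters[j].push(i),
            None => clusters.push(vec![i]),
        }
    }
    clusters
}
```

Because each k-mer is hashed in both orientations and the smaller value kept, a sequence and its reverse complement produce the same minimizer set, so strand does not affect clustering in this sketch.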

Coordinates

Region input and all JSON output are 1-based inclusive. Computation is 0-based half-open internally; a single to_1based() pass converts at serialization (interval and point starts += 1; exclusive ends already equal the inclusive 1-based end).
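
The conversion rule can be illustrated with a short sketch. The `Interval` type and the function names are hypothetical and are not the actual to_1based() implementation; the inverse function for parsing region input is an assumption about how the 1-based input is brought into the internal representation.

```rust
/// An interval; which convention applies depends on where it came from.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Interval { pub start: u64, pub end: u64 }

/// 0-based half-open [start, end) -> 1-based inclusive [start + 1, end].
/// Only the start moves: the exclusive 0-based end is numerically the
/// inclusive 1-based end.
pub fn interval_to_1based(iv: Interval) -> Interval {
    Interval { start: iv.start + 1, end: iv.end }
}

/// A single 0-based position -> 1-based position.
pub fn point_to_1based(pos: u64) -> u64 {
    pos + 1
}

/// Inverse, for 1-based inclusive region input. Returns None for invalid input
/// (start of 0, or end before start).
pub fn region_to_0based(start_1: u64, end_1: u64) -> Option<Interval> {
    if start_1 == 0 || end_1 < start_1 {
        return None;
    }
    Some(Interval { start: start_1 - 1, end: end_1 })
}
```

The interval length is preserved across the conversion: a 0-based [99, 200) span covers 200 − 99 = 101 bases, and its 1-based form [100, 200] covers 200 − 100 + 1 = 101 bases.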

Highlighting novel exons/introns

Novelty is computed by the renderer when an assembly track (--assembly) is shown alongside a reference (--gtf): an assembly exon is novel if its exact start:end is absent from the reference exon set, and an assembly junction is novel if its donor:acceptor pair is absent from the reference junction set.

By default the first and last exon of each assembled transcript are excluded from novel-exon highlighting, because assemblies rarely reproduce UTR ends exactly and would otherwise flag almost every terminal exon. --include-first-last-exon disables this suppression. This is why a genuinely novel terminal exon can appear unhighlighted — it is expected behavior, not a missing event; the tooltip still tags it [novel, terminal].
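
The rules above can be sketched as follows. The renderer itself is Node; Rust is used here only for consistency with the other examples on this page. The `ExonCall` type and the function names are hypothetical, exons are assumed to be sorted by start, and representing a junction as (end of upstream exon, start of downstream exon) is an assumption about how the donor:acceptor pair is keyed.

```rust
use std::collections::HashSet;

type Span = (u64, u64); // (start, end)

#[derive(Debug, PartialEq, Eq)]
pub struct ExonCall {
    pub novel: bool,       // exact start:end absent from the reference exon set
    pub terminal: bool,    // first or last exon of the assembled transcript
    pub highlighted: bool, // what the plot would actually highlight
}

/// Junctions implied by consecutive exons (assumed keying, see lead-in).
pub fn junctions(exons: &[Span]) -> Vec<Span> {
    exons.windows(2).map(|p| (p[0].1, p[1].0)).collect()
}

/// Classify each exon of one assembled transcript.
pub fn call_exons(
    assembled: &[Span],
    ref_exons: &HashSet<Span>,
    include_first_last: bool,
) -> Vec<ExonCall> {
    let last = assembled.len().saturating_sub(1);
    assembled
        .iter()
        .enumerate()
        .map(|(i, ex)| {
            let novel = !ref_exons.contains(ex);
            let terminal = i == 0 || i == last;
            // Terminal exons stay unhighlighted unless the suppression is disabled.
            let highlighted = novel && (include_first_last || !terminal);
            ExonCall { novel, terminal, highlighted }
        })
        .collect()
}

/// Each assembled junction paired with whether it is absent from the reference set.
pub fn novel_junctions(assembled: &[Span], ref_junctions: &HashSet<Span>) -> Vec<(Span, bool)> {
    junctions(assembled)
        .into_iter()
        .map(|j| (j, !ref_junctions.contains(&j)))
        .collect()
}
```

Keeping `novel` and `highlighted` as separate flags mirrors the behavior described above: a novel terminal exon is still reported as novel (which is what the tooltip shows) even when it is not highlighted.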

JSON contract

The JSON emitted by the core is the shared contract between Rust and the renderer. Changes to it are cross-module API changes and should be tested on both sides.

Top-level fields:

| Field | Present | Contents |
| --- | --- | --- |
| region | always | { chrom, start, end } (1-based inclusive). |
| coverage | always | { start, values[] }: per-base depth from start. |
| arcs | always | [{ type: "junction"\|"deletion", start, end, count, strand }]. |
| insertion_clusters | always | [{ cluster_id, type, position, count, representative_length, representative_seq, minimizers[] }]. |
| softclip_clusters | always | as above, plus clip_side: "left"\|"right". |
| exons | if annotated | [{ start, end }]: merged exon intervals. |
| introns | if annotated | [{ start, end }]: complement of exons in the region. |
| transcripts | if --gtf | per-transcript { transcript_id, gene_name, strand, exons[], cds[], cov?, attributes }. |
| ins_transcripts | if --assembly | assembled transcripts, each with exons[] and events[] (insertion/softclip with sequence + minimizers). |
| filters_applied | always | echo of the thresholds used. |
| metadata | always | { bam, total_reads_in_region, tool_version }. |
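
For integrators, a partial consumer-side view of this contract might look like the sketch below. Only the field names come from the table; the Rust types (integer widths, `Option` for conditionally present fields) and the `Output` and `depth_at` names are assumptions, and most fields are omitted for brevity.

```rust
/// 1-based inclusive interval, as emitted in the JSON.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Span { pub start: u64, pub end: u64 }

#[derive(Clone, Debug)]
pub struct Region { pub chrom: String, pub start: u64, pub end: u64 }

#[derive(Clone, Debug)]
pub struct Coverage { pub start: u64, pub values: Vec<u32> }

impl Coverage {
    /// Depth at a 1-based position: values[i] is the depth at start + i.
    /// Returns None outside the covered range.
    pub fn depth_at(&self, pos: u64) -> Option<u32> {
        let idx = pos.checked_sub(self.start)?;
        self.values.get(idx as usize).copied()
    }
}

/// Partial top-level object (assumed shape; arcs, clusters, transcripts,
/// filters_applied and metadata are left out of this sketch).
#[derive(Clone, Debug)]
pub struct Output {
    pub region: Region,
    pub coverage: Coverage,
    pub exons: Option<Vec<Span>>,   // present only if annotated
    pub introns: Option<Vec<Span>>, // present only if annotated
}
```

Modelling the "if annotated" fields as `Option` forces a consumer to handle their absence explicitly, which matters because the exons and introns fields are missing rather than empty when there is no annotation track.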

Renderer

render/src/ builds a layout from the JSON (layout.js) and emits either deterministic SVG (render-svg.js) or interactive HTML (render-html.js). Tracks drawn: coverage (with intron compression and insertion-gap shading), junction/deletion arcs, insertion/soft-clip markers, the gene model, and the assembled-transcript track with optional novel-feature highlighting. The HTML build embeds a client script providing hover tooltips, wheel zoom, drag pan, and reset.

WASM bindings

wasm/src/lib.rs exposes a single entry point:

#[wasm_bindgen]
pub fn process_bam(
    bam_bytes: &[u8],
    bai_bytes: &[u8],
    region_str: &str,
    params_json: &str,
    gtf_bytes: Option<Vec<u8>>,
    gtf_ins_bytes: Option<Vec<u8>>,
) -> Result<String, JsValue>

It runs the same core pipeline against in-memory byte slices (no filesystem) and returns the same JSON schema as the CLI.