Architecture
This page describes how cigar-sashimi is put together. End users do not need it;
it is aimed at contributors and anyone integrating with the JSON output.
Components
| Crate / package | Role |
|---|---|
| cigar-sashimi-core (core/) | Rust library + binary: reads BAM, walks CIGAR, extracts events, clusters, filters, emits JSON. |
| cigar-sashimi-render (render/) | Node renderer: turns core JSON into SVG or self-contained interactive HTML. |
| cigar-sashimi (root script) | Bash wrapper that runs core → renderer for normal CLI use. |
| cigar-sashimi-wasm (wasm/) | WebAssembly bindings wrapping the core library for the browser. |
| web/ | Browser app (index.html) that drives the WASM module and renders inline. |
Pipelines
CLI:
BAM + region (+ optional GTF)
→ cigar-sashimi-core (Rust)
→ JSON
→ cigar-sashimi-render (Node)
→ SVG or HTML
Browser (identical analysis, no server):
.bam + .bai (+ optional GTF) (file pickers)
→ process_bam() (WASM = core compiled to wasm32)
→ JSON string
→ renderer JS (bundled in web/index.html)
→ interactive plot
Both paths call the same core code, so output is identical for identical inputs.
Core pipeline (in order)
- BAM query & per-read filtering (bam.rs): query the region via noodles; drop reads by --exclude-flags and --min-mapq; walk each CIGAR; discard insertion/clip events shorter than the length thresholds; compute canonical minimizers per event.
- CIGAR walking (cigar.rs): iterate ops in reference space: M/=/X add coverage; N emits a junction arc; D emits a deletion arc; I and S record events at the current reference position; H is ignored.
- Coverage accumulation (coverage.rs): collapse per-base hits into a dense depth array.
- Annotation resolution (gtf.rs, introns.rs): one of the three modes below. Separately, --assembly is parsed independently into an assembled-transcript track.
  - --gtf given → parse exons (merged into a flat track), derive introns as the complement, and parse per-transcript exon/CDS records.
  - no --gtf, compression on → coverage-based exon/intron detection (--intron-window, --intron-cov).
  - --no-compression → no annotation track.
- Arc counting (arcs.rs): deduplicate identical junction/deletion spans and count supporting reads.
- Clustering (cluster.rs): group insertion and soft-clip events by reference position (5 bp tolerance), then by minimizer Jaccard similarity (--min-jaccard).
- Filtering (filter.rs): apply count and coverage-ratio thresholds to arcs and clusters.
- Serialize (types.rs): convert internal 0-based coordinates to 1-based inclusive and emit JSON.
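The CIGAR-walking step can be illustrated with a small sketch. It is not the code in cigar.rs: the Op and Event types and the walk_cigar function are hypothetical names chosen for this example, and it assumes 0-based half-open reference coordinates as described under Coordinates.

```rust
/// CIGAR operation kinds. `Match` stands for any of M, = or X.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Match,
    Ins,
    Del,
    RefSkip,
    SoftClip,
    HardClip,
}

/// What one read contributes (0-based, half-open reference coordinates).
#[derive(Debug, PartialEq)]
enum Event {
    Covered { start: u64, end: u64 },
    Junction { start: u64, end: u64 },
    Deletion { start: u64, end: u64 },
    Insertion { pos: u64, len: u32 },
    SoftClip { pos: u64, len: u32 },
}

/// Walk a CIGAR in reference space, starting at the read's alignment start.
fn walk_cigar(ref_start: u64, cigar: &[(Op, u32)]) -> Vec<Event> {
    let mut pos = ref_start;
    let mut events = Vec::new();
    for &(op, len) in cigar {
        let span = u64::from(len);
        match op {
            // M/=/X consume the reference and add coverage.
            Op::Match => {
                events.push(Event::Covered { start: pos, end: pos + span });
                pos += span;
            }
            // N consumes the reference and becomes a junction arc.
            Op::RefSkip => {
                events.push(Event::Junction { start: pos, end: pos + span });
                pos += span;
            }
            // D consumes the reference and becomes a deletion arc.
            Op::Del => {
                events.push(Event::Deletion { start: pos, end: pos + span });
                pos += span;
            }
            // I and S do not consume the reference; they are recorded
            // at the current reference position.
            Op::Ins => events.push(Event::Insertion { pos, len }),
            Op::SoftClip => events.push(Event::SoftClip { pos, len }),
            // H carries no sequence and is ignored.
            Op::HardClip => {}
        }
    }
    events
}
```

For a read starting at 1000 with CIGAR 5S10M100N10M, this sketch records a soft clip at 1000, coverage over [1000, 1010), a junction over [1010, 1110), and coverage over [1110, 1120).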
Minimizers
Each insertion/soft-clip sequence is reduced to a short ordered set of canonical minimizer hashes (k-mers hashed in both orientations, minimum per sliding window). Two events are clustered when their minimizer sets reach the Jaccard threshold, which groups near-identical inserted/clipped sequences without full alignment.
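A minimal sketch of the idea follows. It is an illustration, not the core's implementation: the function names, the FNV-1a hash, and the k and w parameters are assumptions made for this example, since the actual hash function and window sizes are not stated here.

```rust
use std::collections::BTreeSet;

/// FNV-1a, used only as a stand-in hash for this sketch.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Reverse complement; anything other than A/C/G/T becomes N.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b.to_ascii_uppercase() {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Canonical hash: the smaller of the forward and reverse-complement hashes,
/// so a k-mer and its reverse complement map to the same value.
fn canonical_hash(kmer: &[u8]) -> u64 {
    let fwd: Vec<u8> = kmer.iter().map(|b| b.to_ascii_uppercase()).collect();
    fnv1a(&fwd).min(fnv1a(&revcomp(&fwd)))
}

/// Minimum canonical hash in every window of `w` consecutive k-mers.
/// Sequences with fewer than `w` k-mers are treated as one window.
fn minimizers(seq: &[u8], k: usize, w: usize) -> BTreeSet<u64> {
    if k == 0 || w == 0 || seq.len() < k {
        return BTreeSet::new();
    }
    let hashes: Vec<u64> = seq.windows(k).map(canonical_hash).collect();
    let win = w.min(hashes.len());
    hashes
        .windows(win)
        .map(|wnd| *wnd.iter().min().unwrap())
        .collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty.
fn jaccard(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```

Because the hashes are canonical, a sequence and its reverse complement yield the same minimizer set in this sketch, so their Jaccard similarity is 1.0. Two events would be merged when jaccard(...) reaches the --min-jaccard threshold.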
Coordinates
Region input and all JSON output are 1-based inclusive. Computation is 0-based
half-open internally; a single to_1based() pass converts at serialization (interval
and point starts += 1; exclusive ends already equal the inclusive 1-based end).
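The conversion rule can be written out as a short sketch. The Interval type and function names are hypothetical, chosen for this example; only the arithmetic is taken from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    start: u64,
    end: u64,
}

/// 0-based half-open [start, end) to 1-based inclusive [start + 1, end].
/// The exclusive end is numerically equal to the inclusive 1-based end.
fn interval_to_1based(iv: Interval) -> Interval {
    Interval { start: iv.start + 1, end: iv.end }
}

/// A single 0-based position to its 1-based equivalent.
fn point_to_1based(pos: u64) -> u64 {
    pos + 1
}

/// Inverse, as would be needed for 1-based region input.
/// Returns None for a start of 0, which is not a valid 1-based coordinate.
fn interval_to_0based(iv: Interval) -> Option<Interval> {
    Some(Interval { start: iv.start.checked_sub(1)?, end: iv.end })
}
```

For example, the internal interval [99, 200) covers 101 bases and serializes as 100..200, which also covers 101 bases when both ends are counted.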
Highlighting novel exons/introns
Novelty is computed by the renderer when an
assembly track (--assembly) is shown alongside a reference (--gtf): an assembly exon is
novel if its exact start:end is absent from the reference exon set, and an assembly
junction is novel if its donor:acceptor pair is absent from the reference junction set.
By default the first and last exon of each assembled transcript are excluded from
novel-exon highlighting, because assemblies rarely reproduce UTR ends exactly and would
otherwise flag almost every terminal exon. --include-first-last-exon disables this
suppression. This is why a genuinely novel terminal exon can appear unhighlighted: that is
expected behavior, not a missing event, and the tooltip still tags it [novel, terminal].
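The novelty rules can be sketched as set lookups. The renderer itself is JavaScript; the logic is expressed in Rust here only to keep the examples on this page in one language. The Transcript type and function names are hypothetical, exons are assumed to be sorted by start, and a junction is assumed to be the pair (end of one exon, start of the next).

```rust
use std::collections::HashSet;

/// (start, end) of an exon, or (donor, acceptor) of a junction.
type Span = (u64, u64);

struct Transcript {
    /// Exons sorted by start coordinate.
    exons: Vec<Span>,
}

/// Junctions implied by consecutive exons.
fn junctions(t: &Transcript) -> Vec<Span> {
    t.exons.windows(2).map(|p| (p[0].1, p[1].0)).collect()
}

/// Assembly exons whose exact start:end is absent from the reference exon set.
/// Unless `include_first_last` is set, terminal exons are never reported.
fn novel_exons(
    assembly: &Transcript,
    reference: &[Transcript],
    include_first_last: bool,
) -> Vec<Span> {
    let known: HashSet<Span> = reference
        .iter()
        .flat_map(|t| t.exons.iter().copied())
        .collect();
    let last = assembly.exons.len().saturating_sub(1);
    assembly
        .exons
        .iter()
        .enumerate()
        .filter(|&(i, _)| include_first_last || (i != 0 && i != last))
        .filter(|&(_, e)| !known.contains(e))
        .map(|(_, e)| *e)
        .collect()
}

/// Assembly junctions whose donor:acceptor pair is absent from the reference.
fn novel_junctions(assembly: &Transcript, reference: &[Transcript]) -> Vec<Span> {
    let known: HashSet<Span> = reference.iter().flat_map(|t| junctions(t)).collect();
    junctions(assembly)
        .into_iter()
        .filter(|j| !known.contains(j))
        .collect()
}
```

Matching on the exact pair keeps the check cheap, at the cost of flagging an exon that differs from the reference by a single base, which is why the terminal-exon suppression exists.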
JSON contract
The JSON emitted by the core is the shared contract between Rust and the renderer. Changes to it are cross-module API changes and should be tested on both sides.
Top-level fields:
| Field | Present | Contents |
|---|---|---|
| region | always | { chrom, start, end } (1-based inclusive). |
| coverage | always | { start, values[] }: per-base depth from start. |
| arcs | always | [{ type: "junction"\|"deletion", start, end, count, strand }]. |
| insertion_clusters | always | [{ cluster_id, type, position, count, representative_length, representative_seq, minimizers[] }]. |
| softclip_clusters | always | as above, plus clip_side: "left"\|"right". |
| exons | if annotated | [{ start, end }]: merged exon intervals. |
| introns | if annotated | [{ start, end }]: complement of exons in the region. |
| transcripts | if --gtf | per-transcript { transcript_id, gene_name, strand, exons[], cds[], cov?, attributes }. |
| ins_transcripts | if --assembly | assembled transcripts, each with exons[] and events[] (insertion/softclip with sequence + minimizers). |
| filters_applied | always | echo of the thresholds used. |
| metadata | always | { bam, total_reads_in_region, tool_version }. |
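As an example of consuming one of these fields, the sketch below looks up the depth at a given position in the coverage object. It assumes that values[0] is the depth at the 1-based position start, which is the natural reading of "per-base depth from start" but is not spelled out in the table. The Coverage struct and depth_at function are hypothetical names for this example, and JSON parsing is left out.

```rust
/// In-memory mirror of the `coverage` field: { start, values[] }.
struct Coverage {
    /// 1-based position of values[0] (assumed).
    start: u64,
    values: Vec<u32>,
}

/// Depth at a 1-based position, or None if the position lies outside the array.
fn depth_at(cov: &Coverage, pos: u64) -> Option<u32> {
    let offset = pos.checked_sub(cov.start)?;
    cov.values.get(offset as usize).copied()
}
```

Returning None for out-of-range positions keeps "no data" distinct from a genuine depth of zero inside the region.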
Renderer
render/src/ builds a layout from the JSON (layout.js) and emits either deterministic
SVG (render-svg.js) or interactive HTML (render-html.js). Tracks drawn: coverage
(with intron compression and insertion-gap shading), junction/deletion arcs,
insertion/soft-clip markers, the gene model, and the assembled-transcript track with
optional novel-feature highlighting. The HTML build embeds a client script providing
hover tooltips, wheel zoom, drag pan, and reset.
WASM bindings
wasm/src/lib.rs exposes a single entry point:
#[wasm_bindgen]
pub fn process_bam(
    bam_bytes: &[u8],
    bai_bytes: &[u8],
    region_str: &str,
    params_json: &str,
    gtf_bytes: Option<Vec<u8>>,
    gtf_ins_bytes: Option<Vec<u8>>,
) -> Result<String, JsValue>
It runs the same core pipeline against in-memory byte slices (no filesystem) and returns the same JSON schema as the CLI.