Bioconductor-Friendly Multithreaded BAM Processing

BamScale is a multithreaded BAM processing package for R built on top of the ompBAM C++ engine. It is designed for Bioconductor users who need high-throughput BAM parsing while preserving familiar Rsamtools and GenomicAlignments workflow patterns.

Why BamScale in Bioconductor Workflows?

BamScale focuses on three goals:

speed on modern multi-core systems,
compatibility with common Bioconductor input/output contracts,
transparent benchmarking and reproducibility.

In many Bioconductor pipelines, BAM access remains a major bottleneck. The cause is not only BAM decompression itself but also that existing R-facing workflows often read each file in an effectively single-threaded way. BamScale addresses this by exposing the ompBAM engine's OpenMP threading directly at the R interface, while still returning Bioconductor-friendly objects.

The practical design goal is not to replace familiar Bioconductor workflows with a separate ecosystem. Instead, BamScale aims to preserve the way users already work with:

Rsamtools::scanBam()-style field extraction,
GenomicAlignments::readGAlignments()-style alignment-object workflows,
BiocParallel file-level execution when multiple BAMs are processed together.

BamScale is therefore intended to slot in where existing tooling already fits, adding a within-file threading axis that can relieve the parsing bottleneck in alignment-centric workloads.

Key capabilities:

OpenMP-enabled per-file parallelism via threads,
optional multi-file parallelism via BiocParallel (BPPARAM),
ScanBamParam-like filtering (mapqFilter, flag, which, what, tag),
multiple output modes:
- data.frame and S4Vectors::DataFrame,
- GenomicAlignments::GAlignments,
- GenomicAlignments::GAlignmentPairs,
- scanBam-shaped list output (as = "scanBam").

Throughout this page and the benchmarks, step1 refers to the common alignment-metadata extraction workload built from fields such as:

qname
flag
rname
pos
mapq
cigar

This is the kind of BAM access pattern used by many downstream QC, filtering, and fragment-level summary workflows.

Compatibility with Existing Bioconductor Workflows

BamScale is designed to be familiar to users who already work with Rsamtools and GenomicAlignments.

step1-style metadata extraction maps naturally onto scanBam()-like use cases.
as = "GAlignments" and as = "GAlignmentPairs" support alignment-object workflows used downstream in Bioconductor.
BiocParallel integration preserves the standard file-level parallel workflow model.

The main added capability is that BamScale can also use multiple threads within each BAM file through ompBAM. That is the key difference from the baseline workflow pattern and the main reason it can remove read-parsing bottlenecks in workloads such as metadata extraction and alignment-object construction.
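
As a rough illustration of the compatibility described above, the following sketch builds a standard Rsamtools::ScanBamParam and passes it to bam_read() while requesting a GAlignments result. The `param` argument name is an assumption inferred from the references to param$which elsewhere on this page; consult ?bam_read for the actual signature. The body only runs when BamScale, ompBAM, and Rsamtools are installed.

```r
# Sketch only: the `param` argument to bam_read() is assumed, not confirmed.
have_pkgs <- all(vapply(
  c("BamScale", "ompBAM", "Rsamtools"),
  requireNamespace, logical(1), quietly = TRUE
))

if (have_pkgs) {
  bam <- ompBAM::example_BAM("Unsorted")

  # Standard Rsamtools filter: mapped reads with MAPQ >= 30
  p <- Rsamtools::ScanBamParam(
    flag = Rsamtools::scanBamFlag(isUnmappedQuery = FALSE),
    mapqFilter = 30L
  )

  ga <- BamScale::bam_read(
    file = bam,
    param = p,           # assumed argument name
    as = "GAlignments",  # output mode listed under key capabilities
    threads = 2
  )
}
```

If the assumption holds, `ga` can be handed to downstream GenomicAlignments code in the same way as the result of readGAlignments().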

Installation

Prerequisites

R with a C++17 toolchain
OpenMP-capable compiler/runtime
ompBAM available in your R library

Current install route (pre-Bioconductor release)

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ompBAM")

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("cparsania/BamScale")

After Bioconductor acceptance

BiocManager::install("BamScale")

Quick Start

library(BamScale)

bam <- ompBAM::example_BAM("Unsorted")

# 1) Step1-style extraction
x <- bam_read(
  file = bam,
  what = c("qname", "flag", "rname", "pos", "mapq", "cigar"),
  threads = 4
)

# 2) Seq/qual in comparator-compatible mode
sq <- bam_read(
  file = bam,
  what = c("qname", "seq", "qual"),
  as = "data.frame",
  seqqual_mode = "compatible",
  threads = 4
)

# 2b) Seq/qual in compact mode (returns raw vectors, not plain strings)
sq_compact <- bam_read(
  file = bam,
  what = c("qname", "qwidth", "seq", "qual"),
  as = "data.frame",
  seqqual_mode = "compact",
  threads = 4
)

# 3) Fast chromosome-level counts
cnt <- bam_count(file = bam, threads = 4)

Interpreting `seqqual_mode = "compact"`

Compact mode is an optimized BamScale-specific representation for seq and qual. It does not return ordinary character strings:

seq is returned as a list-column of raw vectors containing BAM-native packed sequence bytes (two bases per byte)
qual is returned as a list-column of raw vectors containing per-base numeric Phred bytes
qwidth is required to decode compact seq back to base letters correctly
a quality byte value of 255 represents missing quality

Compact mode is therefore best interpreted as a lower-level, deferred-decoding representation. It is useful when extraction throughput matters more than immediate string materialization. If downstream code expects ordinary sequence or quality strings, use seqqual_mode = "compatible" instead, or decode compact output explicitly:

sq_compact_decoded <- decode_seqqual_compact(sq_compact)
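
To make the compact representation more concrete, here is a minimal base-R sketch of how BAM-native packed bytes map back to strings. It is not BamScale's decoder; decode_packed_seq() and decode_phred() are hypothetical helper names, and the sketch assumes the layout defined by the SAM/BAM specification (4-bit base codes over "=ACMGRSVTWYHKDBN", high nibble first) plus the 255-means-missing convention stated above.

```r
# BAM 4-bit base codes 0..15, per the SAM/BAM specification
bam_bases <- strsplit("=ACMGRSVTWYHKDBN", "")[[1]]

# Hypothetical helper: unpack two bases per byte, then trim to qwidth
decode_packed_seq <- function(packed, qwidth) {
  b <- as.integer(packed)
  hi <- b %/% 16L  # high nibble holds the first base of each byte
  lo <- b %% 16L   # low nibble holds the second base
  codes <- as.vector(rbind(hi, lo))[seq_len(qwidth)]
  paste(bam_bases[codes + 1L], collapse = "")
}

# Hypothetical helper: numeric Phred bytes to a Phred+33 string
decode_phred <- function(q) {
  q <- as.integer(q)
  if (length(q) == 0L || all(q == 255L)) return(NA_character_)
  rawToChar(as.raw(q + 33L))
}

# Five bases packed into three bytes; the last low nibble is padding
packed <- as.raw(c(0x12, 0x48, 0xF0))
seq_str <- decode_packed_seq(packed, qwidth = 5L)        # "ACGTN"
qual_str <- decode_phred(as.raw(c(30, 30, 20, 40, 2)))   # "??5I#"
```

The padding nibble in the final byte is why qwidth is needed: without it, an odd-length read cannot be distinguished from an even-length read ending in "=".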

Parallelism Model

BamScale can parallelize on two axes:

across files via BPPARAM workers,
within each file via OpenMP threads.
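
A call that uses both axes might look like the following sketch. The file paths are hypothetical placeholders, and passing a character vector to `file` together with BPPARAM is inferred from the description on this page rather than confirmed here; the shape of the returned object is not asserted.

```r
# Hypothetical BAM paths; replace with real files.
bam_files <- c("sample1.bam", "sample2.bam")

can_run <- requireNamespace("BamScale", quietly = TRUE) &&
  requireNamespace("BiocParallel", quietly = TRUE) &&
  all(file.exists(bam_files))

if (can_run) {
  res <- BamScale::bam_read(
    file = bam_files,
    what = c("qname", "flag", "rname", "pos", "mapq", "cigar"),
    BPPARAM = BiocParallel::SnowParam(workers = 2),  # two files at once
    threads = 2                                      # two OpenMP threads per file
  )
}
```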

Approximate effective concurrency:

min(length(file), bpnworkers(BPPARAM)) * threads

When auto_threads = TRUE, BamScale preserves higher per-file thread counts when possible by reducing the number of concurrently active file workers before shrinking per-file threads.
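
The arithmetic above can be sketched in base R. effective_concurrency() restates the formula directly; plan_concurrency() is a hypothetical illustration of the "reduce file workers before shrinking threads" idea under an assumed core budget, not BamScale's actual auto_threads logic.

```r
# Approximate number of threads active at once
effective_concurrency <- function(n_files, n_workers, threads) {
  min(n_files, n_workers) * threads
}

# Hypothetical planner: fit a core budget by dropping file workers first,
# and shrink per-file threads only if a single worker still exceeds it.
plan_concurrency <- function(n_files, n_workers, threads, cores) {
  workers <- min(n_files, n_workers)
  threads <- min(threads, cores)
  workers <- max(1, min(workers, cores %/% threads))
  list(workers = workers, threads = threads)
}

# 8 files, 4 BPPARAM workers, 4 threads each: 16 threads in flight
effective_concurrency(n_files = 8, n_workers = 4, threads = 4)

# 8 requested workers on a 16-core budget: 4 workers, 4 threads each
plan <- plan_concurrency(n_files = 8, n_workers = 8, threads = 4, cores = 16)
```

Comparing the effective concurrency with the available cores before launching a run is a cheap way to spot oversubscription.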

Benchmark Results and Reproducibility

A dedicated benchmark article summarizes the current results and the relevant cross-tool comparisons:

pkgdown article:
- https://cparsania.github.io/BamScale/articles/benchmark-results.html
source:
- vignettes/benchmark-results.Rmd

The article includes:

step1, galignments, and seqqual benchmark results
like-for-like cross-package comparisons against Rsamtools and GenomicAlignments
compact-versus-compatible seqqual results for BamScale

Benchmarking

Benchmark workflow and benchmark reporting assets are documented in:

inst/benchmarks/README.md

Current Limitations

param$which is currently implemented as sequential filtering rather than indexed random-access jumps, so region-restricted reads still scan the whole file.
seqqual_mode = "compact" is optimization-oriented and not intended for strict cross-package output-equivalence comparisons.
GAlignments and GAlignmentPairs outputs exclude unmapped records by design.

Community and Support

Bioconductor support site (recommended for user-facing questions):
- https://support.bioconductor.org
Development issues and feature requests:
- https://github.com/cparsania/BamScale/issues

When posting performance reports, include:

package versions,
hardware/storage context,
exact benchmark command and profile,
threads/BPPARAM settings.
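
Part of this context can be gathered programmatically. The following base-R sketch collects versions and basic hardware information; pkg_version() is a hypothetical helper, and the package list is only a suggestion. Storage details, the benchmark command, and the threads/BPPARAM settings still need to be stated by hand.

```r
# Hypothetical helper: installed version of a package, or NA if absent
pkg_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

report <- list(
  r_version = R.version.string,
  packages  = vapply(c("BamScale", "ompBAM", "Rsamtools", "BiocParallel"),
                     pkg_version, character(1)),
  os        = unname(Sys.info()[c("sysname", "release", "machine")]),
  cores     = parallel::detectCores()
)
str(report)
```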

Citation

citation("BamScale")

If BamScale contributes to performance claims, please also cite ompBAM.

Contributing

Pull requests are welcome. Please include:

a short motivation,
tests for behavior changes,
benchmark evidence for performance claims.

Before opening a PR, run:

R CMD build .
R CMD check --as-cran BamScale_*.tar.gz
Rscript -e "BiocCheck::BiocCheck('.')"

License

MIT (LICENSE).
