BamScale is a multithreaded BAM processing package for R built on top of the ompBAM C++ engine. It is designed for Bioconductor users who need high-throughput BAM parsing while preserving familiar Rsamtools and GenomicAlignments workflow patterns.

Why BamScale in Bioconductor Workflows?

BamScale focuses on three goals:

  • speed on modern multi-core systems,
  • compatibility with common Bioconductor input/output contracts,
  • transparent benchmarking and reproducibility.

In many Bioconductor pipelines, BAM access is still a major bottleneck. The cause is not only BAM decompression itself, but also that existing R-facing workflows often read each file with an effectively single-threaded access pattern. BamScale addresses this by exposing the ompBAM engine's OpenMP threading directly at the R interface, while still returning Bioconductor-friendly objects.

The practical design goal is not to replace familiar Bioconductor workflows with a separate ecosystem. Instead, BamScale aims to preserve the way users already work with Rsamtools, GenomicAlignments, and BiocParallel.

BamScale is therefore intended to fit wherever existing tooling already fits, while adding a within-file threading axis that can remove the current parsing bottleneck for alignment-centric workloads.

Key capabilities:

In this README, step1 refers to the common alignment-metadata extraction workload built from fields such as:

  • qname
  • flag
  • rname
  • pos
  • mapq
  • cigar

This is the kind of BAM access pattern used by many downstream QC, filtering, and fragment-level summary workflows.
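
For orientation, the same field set can be requested through the baseline Rsamtools interface. The sketch below is not BamScale code: it uses Rsamtools::scanBam() with the example BAM file that ships with Rsamtools, and it is guarded so that it does nothing when Rsamtools is not installed. The variable name step1_fields is chosen here for illustration only.

```r
# Baseline step1-style extraction with Rsamtools (not BamScale).
step1_fields <- c("qname", "flag", "rname", "pos", "mapq", "cigar")

if (requireNamespace("Rsamtools", quietly = TRUE)) {
  # Example BAM file shipped with the Rsamtools package.
  bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

  # Restrict the scan to the step1 fields.
  param <- Rsamtools::ScanBamParam(what = step1_fields)

  # scanBam() returns a list with one element per requested range;
  # without `which` ranges there is a single element.
  res <- Rsamtools::scanBam(bam, param = param)[[1]]
  str(res, max.level = 1)
}
```

The result is a named list with one vector per requested field, which is the shape that step1-style workflows typically consume.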

Compatibility with Existing Bioconductor Workflows

BamScale is designed to be familiar to users who already work with Rsamtools and GenomicAlignments.

  • step1-style metadata extraction maps naturally onto scanBam()-like use cases.
  • as = "GAlignments" and as = "GAlignmentPairs" support alignment-object workflows used downstream in Bioconductor.
  • BiocParallel integration preserves the standard file-level parallel workflow model.

The main added capability is that BamScale can also use multiple threads within each BAM file through ompBAM. That is the key difference from the baseline workflow pattern and the main reason it can remove read-parsing bottlenecks in workloads such as metadata extraction and alignment-object construction.

Installation

Prerequisites

  • R with a C++17 toolchain
  • OpenMP-capable compiler/runtime
  • ompBAM available in your R library

Current install route (pre-Bioconductor release)

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ompBAM")

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("cparsania/BamScale")

After Bioconductor acceptance

BiocManager::install("BamScale")

Quick Start

library(BamScale)

bam <- ompBAM::example_BAM("Unsorted")

# 1) Step1-style extraction
x <- bam_read(
  file = bam,
  what = c("qname", "flag", "rname", "pos", "mapq", "cigar"),
  threads = 4
)

# 2) Seq/qual in comparator-compatible mode
sq <- bam_read(
  file = bam,
  what = c("qname", "seq", "qual"),
  as = "data.frame",
  seqqual_mode = "compatible",
  threads = 4
)

# 3) Fast chromosome-level counts
cnt <- bam_count(file = bam, threads = 4)

Parallelism Model

BamScale can parallelize on two axes:

  • across files via BPPARAM workers,
  • within each file via OpenMP threads.

Approximate effective concurrency:

min(length(file), bpnworkers(BPPARAM)) * threads
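
As a rough illustration of the formula above, the following base-R sketch computes the approximate effective concurrency for a few configurations. The helper effective_concurrency() is hypothetical and not part of BamScale; in a real session the worker count would typically come from bpnworkers(BPPARAM).

```r
# Hypothetical helper (not part of BamScale): approximate number of
# threads active at once, following the formula above.
effective_concurrency <- function(n_files, n_workers, threads) {
  # At most one worker is busy per file, so surplus workers sit idle.
  min(n_files, n_workers) * threads
}

# 8 BAM files, 4 file-level workers, 4 OpenMP threads per file.
effective_concurrency(n_files = 8, n_workers = 4, threads = 4)
#> [1] 16

# 2 BAM files: only 2 of the 4 workers can be active.
effective_concurrency(n_files = 2, n_workers = 4, threads = 4)
#> [1] 8
```

Comparing this number with the available cores is a quick way to spot oversubscription before launching a run.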

When auto_threads = TRUE, BamScale preserves higher per-file thread counts when possible by reducing the number of concurrently active file workers before shrinking per-file threads.
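
The "reduce workers first, then threads" policy can be sketched in base R. This is only an illustration under an assumed core budget: the function plan_concurrency() and its max_cores argument are hypothetical, and BamScale's actual auto_threads logic may differ.

```r
# Hypothetical illustration of the policy; not BamScale's implementation.
plan_concurrency <- function(n_files, n_workers, threads, max_cores) {
  # Workers beyond the number of files would never be active.
  workers <- min(n_files, n_workers)

  # Step 1: keep per-file threads and drop concurrent file workers.
  while (workers > 1 && workers * threads > max_cores) {
    workers <- workers - 1
  }

  # Step 2: only if a single worker still exceeds the budget,
  # shrink the per-file thread count.
  if (workers * threads > max_cores) {
    threads <- max(1, max_cores)
  }

  list(workers = workers, threads = threads)
}

# 4 workers x 4 threads on an assumed 8-core budget:
# workers drop to 2, threads stay at 4.
plan_concurrency(n_files = 8, n_workers = 4, threads = 4, max_cores = 8)
```

The design choice this illustrates is that within-file threads are treated as the more valuable resource, so file-level concurrency is given up first.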

Benchmark Results and Reproducibility

A dedicated benchmark article summarizes the current results and the relevant cross-tool comparisons.

The article includes:

  • step1, galignments, and seqqual benchmark results
  • fair cross-package comparisons against Rsamtools and GenomicAlignments
  • compact-versus-compatible seqqual results for BamScale

Benchmarking

Benchmark workflow and benchmark reporting assets are documented in:

Current Limitations

  • param$which is currently implemented as sequential filtering rather than indexed random-access jumps.
  • seqqual_mode = "compact" is optimization-oriented and not intended for strict cross-package output-equivalence comparisons.
  • GAlignments and GAlignmentPairs outputs exclude unmapped records by design.
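
Because GAlignments and GAlignmentPairs outputs drop unmapped records, their record counts can differ from a step1-style extraction of the same file. The base-R sketch below uses a made-up toy table, not real BamScale output, to show how unmapped records can be identified from the SAM flag (bit 0x4) when reconciling the two.

```r
# Toy step1-style table; the values are invented for illustration.
aln <- data.frame(
  qname = c("r1", "r2", "r3"),
  flag  = c(0L, 4L, 16L),
  stringsAsFactors = FALSE
)

# Bit 0x4 of the SAM flag marks an unmapped read.
is_unmapped <- bitwAnd(aln$flag, 4L) != 0L

# Keep only mapped records, mirroring what alignment objects retain.
mapped <- aln[!is_unmapped, ]
mapped$qname
#> [1] "r1" "r3"
```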

Community and Support

When posting performance reports, include:

  • package versions,
  • hardware/storage context,
  • exact benchmark command and profile,
  • threads/BPPARAM settings.
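
Part of this context can be collected programmatically. The sketch below uses only base R and the bundled utils and parallel packages; the package list and the helper pkg_version() are illustrative choices, and packages that are not installed are reported as NA.

```r
# Packages whose versions are most relevant to a BamScale report
# (an illustrative selection, adjust as needed).
pkgs <- c("BamScale", "ompBAM", "Rsamtools",
          "GenomicAlignments", "BiocParallel")

# Hypothetical helper: installed version as a string, or NA if missing.
pkg_version <- function(p) {
  if (length(find.package(p, quiet = TRUE)) == 0) {
    return(NA_character_)
  }
  as.character(utils::packageVersion(p))
}

report <- list(
  r_version     = R.version.string,
  platform      = R.version$platform,
  logical_cores = parallel::detectCores(),
  packages      = vapply(pkgs, pkg_version, character(1))
)
str(report)
```

Storage details, the exact benchmark command, and the threads/BPPARAM settings still need to be added by hand.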

Citation

citation("BamScale")

If BamScale contributes to performance claims, please also cite ompBAM.

Contributing

Pull requests are welcome. Please include:

  • a short motivation,
  • tests for behavior changes,
  • benchmark evidence for performance claims.

Before opening a PR, run:

R CMD build .
R CMD check --as-cran BamScale_*.tar.gz
Rscript -e "BiocCheck::BiocCheck('.')"

License

MIT (LICENSE).