Efficient manipulation of genomic sequences in Python, inspired by the design of Bioconductor's Biostrings package.
The core design relies on a "pool and ranges" memory model:
- DNAStringSet stores all sequences in a single contiguous block of memory (the pool).
- Individual sequences are defined by start and width coordinates (the ranges).
- Slicing a DNAStringSet returns a view (a new set of ranges pointing to the same pool), so subsetting copies no sequence data and stays fast regardless of the size of the pool.
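The pool-and-ranges idea can be illustrated with a small standalone sketch. This is not the package's implementation: the class `PooledSeqs` and its fields are hypothetical names chosen for illustration, and it assumes plain ASCII sequences.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PooledSeqs:
    """Hypothetical illustration class; not part of biostrings."""

    pool: bytes               # all sequences concatenated into one block
    starts: Tuple[int, ...]   # offset of each sequence within the pool
    widths: Tuple[int, ...]   # length of each sequence

    @classmethod
    def from_strings(cls, seqs):
        starts, widths, offset = [], [], 0
        for s in seqs:
            starts.append(offset)
            widths.append(len(s))
            offset += len(s)
        pool = "".join(seqs).encode("ascii")
        return cls(pool, tuple(starts), tuple(widths))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # A view: new ranges, same pool object (no sequence bytes copied).
            return PooledSeqs(self.pool, self.starts[key], self.widths[key])
        start, width = self.starts[key], self.widths[key]
        return self.pool[start:start + width].decode("ascii")


seqs = PooledSeqs.from_strings(["ACGT", "GATTACA", "ACGTACGT"])
view = seqs[1:]                # only the two range tuples are sliced
print(view[0])                 # GATTACA
print(view.pool is seqs.pool)  # True
```

Slicing touches only the range tuples, so its cost grows with the number of selected sequences rather than with the total number of bases in the pool.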
To get started, install the package from PyPI:
pip install biostrings
The DNAString class represents a single DNA sequence. It enforces the IUPAC DNA alphabet and supports efficient byte-level operations.
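To make "enforces the alphabet" and "byte-level operations" concrete, here is a minimal standalone sketch of validation and reverse complement using a byte translation table. It does not use or describe the package's internals; the function name `reverse_complement` and the exact set of accepted symbols (the 15 uppercase IUPAC nucleotide codes plus the gap character `-`) are assumptions made for this example.

```python
# Uppercase IUPAC nucleotide codes and their complements; '-' (gap) maps to itself.
# The accepted symbol set is an assumption for this sketch.
_CODES = b"ACGTMRWSYKVHDBN-"
_COMPS = b"TGCAKYWSRMBDHVN-"
_COMPLEMENT = bytes.maketrans(_CODES, _COMPS)
_ALLOWED = frozenset(_CODES)


def reverse_complement(seq: str) -> str:
    """Validate seq against the table above, then reverse-complement it."""
    raw = seq.encode("ascii")
    bad = set(raw) - _ALLOWED
    if bad:
        symbols = bytes(sorted(bad)).decode("ascii")
        raise ValueError(f"symbols outside the DNA alphabet: {symbols!r}")
    # One table lookup per byte, then a reversed copy.
    return raw.translate(_COMPLEMENT)[::-1].decode("ascii")


print(reverse_complement("TTGAAAA-CTC-N"))  # N-GAG-TTTTCAA
```

Working on bytes keeps each base at one byte and lets `bytes.translate` do the complement in a single pass.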
from biostrings import DNAString
# Create a DNA string
dna = DNAString("TTGAAAA-CTC-N")
print(dna)
# Output: TTGAAAA-CTC-N
# Basic operations
print(len(dna)) # 13
print(dna[0:3]) # DNAString(length=3, sequence='TTG')
# Reverse Complement
# Handles IUPAC ambiguity codes correctly (e.g., N -> N, M -> K)
rc = dna.reverse_complement()
print(rc)
# Output: N-GAG-TTTTCAA
The DNAStringSet is the primary container for handling collections of sequences (e.g., reads from a FASTA file).
from biostrings import DNAStringSet
# Efficiently create a set from a list of strings
seqs = [
    "ACGT",
    "GATTACA",
    "TTGAAAA-CTC-N",
    "ACGTACGT",
]
dss = DNAStringSet(seqs, names=["s1", "s2", "s3", "s4"])
print(dss)
# Output:
# <DNAStringSet of length 4>
# [ 1] 4 ACGT s1
# [ 2] 7 GATTACA s2
# [ 3] 13 TTGAAAA-CTC-N s3
# [ 4] 8 ACGTACGT s4
This project has been set up using BiocSetup and PyScaffold.