Estimate archive size, usable payload, and synthesis cost

DNA data storage sounds futuristic because the density of molecular storage can be astonishingly high. The challenge is that a real archive is never just pure payload. Synthetic strands have a chosen length, a practical encoding scheme, and some fraction of the sequence budget must be spent on redundancy or other overhead so the stored file can survive synthesis and sequencing errors. This calculator turns those ideas into a quick planning model. You enter strand length, how many strands you have, how efficiently you expect to encode bits into each base, and what share of the raw capacity you want to reserve for error correction or similar overhead. The result is a fast estimate of effective archive capacity, total synthesis cost, and cost per megabyte.

That makes the tool useful for two different kinds of questions. First, it can answer a basic sizing question: if you synthesize a batch of oligos at a certain scale, how much digital information could they plausibly hold after accounting for redundancy? Second, it can answer a budgeting question: if each base pair costs a certain amount to synthesize, how expensive does each megabyte of usable payload become? Those are the two numbers most people want early in a project, even before they commit to a full wet-lab design.

The page uses a deliberately simple model, so the numbers should be read as directional rather than final. Real systems may reserve sequence for primers, addresses, and random-access tags. They also face biochemical constraints such as GC balance, homopolymer avoidance, strand loss, and uneven sequencing coverage. Instead of pretending to simulate all of that, this calculator lets you absorb many of those penalties into the two most important practical knobs: bits encoded per base and overhead. Lower the first value or raise the second value to make your estimate more conservative.

One detail worth noting before you calculate: the output label says MB, but the conversion here uses binary megabytes, meaning bytes divided by 1024 × 1024. That is technically a MiB-style conversion, although many tools still label it MB. If you compare your result with a vendor datasheet that uses decimal megabytes, the values will differ slightly even if the underlying archive size is the same.
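To see how large the labeling difference is, the short sketch below converts the same bit count both ways. The function names are illustrative and are not taken from the page's own code; the 224,000,000-bit input is the effective capacity from the default worked example.

```python
# Illustrative helpers; the page itself reports only the binary figure.
BYTES_PER_BINARY_MB = 1024 * 1024    # what the calculator labels "MB" (strictly MiB)
BYTES_PER_DECIMAL_MB = 1000 * 1000   # what many vendor datasheets mean by "MB"


def bits_to_binary_mb(bits: float) -> float:
    """Convert bits to binary megabytes, matching the page's output format."""
    return bits / 8 / BYTES_PER_BINARY_MB


def bits_to_decimal_mb(bits: float) -> float:
    """Convert bits to decimal megabytes, as used on many datasheets."""
    return bits / 8 / BYTES_PER_DECIMAL_MB


if __name__ == "__main__":
    bits = 224_000_000  # effective bits in the default worked example
    print(f"binary MB:  {bits_to_binary_mb(bits):.2f}")   # about 26.70
    print(f"decimal MB: {bits_to_decimal_mb(bits):.2f}")  # 28.00
```

The decimal figure is always about 4.9% larger than the binary one, because 1024 × 1024 / 1,000,000 = 1.048576.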

What each input means and how the formula works

The first input, base pairs per strand, is the length of each synthetic oligo. Longer strands provide more physical sequence and therefore more total raw capacity, but longer sequences are not automatically better in practice because synthesis quality, downstream handling, and fixed non-payload regions can all become more important. The second input, number of strands, scales the archive directly. If you double the strand count while keeping everything else constant, you double the total number of bases and therefore double both raw and effective storage capacity.

The third input, bits encoded per base, captures how efficiently your encoding maps digital information into DNA symbols. In a perfectly unconstrained world, DNA's four nucleotides could represent up to 2 bits per base, because each of four possible symbols can encode log₂(4) = 2 bits. Practical systems usually land below that ceiling. They may avoid certain patterns, reserve sequence for addressing, and include design rules that make synthesis and decoding more reliable. As a result, values such as 1.3 to 1.8 bits per base are common rough-planning choices, while a value close to 2.0 should be treated as an optimistic upper bound rather than a default assumption.

The fourth input, error correction overhead (%), is where the simplified model absorbs the cost of robustness. In the real world, overhead can stand in for error-correcting codes, primer or indexing burden, extra copies for dropout tolerance, or other non-payload sequence. If you type 30, the calculator assumes that 30% of the raw bit capacity is unavailable for user payload. A lower overhead percentage produces a more optimistic result; a higher percentage produces a more conservative one.

The optional cost field is straightforward. It treats synthesis cost as a constant price per base pair and multiplies that price by the total number of bases in the archive. It does not include sequencing, quality control, shipping, labor, storage media, or other experimental overheads. That means the cost estimate is intentionally narrow: it answers, 'What would the DNA itself cost at this unit price?' rather than 'What would the whole project cost end to end?'

The core relationship is compact enough to show directly:

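$$\text{EffectiveBits} = (B_s \times N_s \times b) \times (1 - o)$$

$$\text{MB} = \frac{\text{EffectiveBits}}{8 \times 1024 \times 1024}$$

Here B_s is base pairs per strand, N_s is the number of strands, b is bits encoded per base, and o is the overhead expressed as a fraction (30% becomes 0.30).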

In plain language, the calculator first finds total bases by multiplying strand length by strand count. It then multiplies by bits per base to get raw bit capacity. Finally, it applies the overhead fraction, leaving the effective bit capacity available to actual stored data. The cost estimate is separate: total bases multiplied by cost per base pair. This simple chain is why the tool is useful for sensitivity testing. If you are unsure about one assumption, adjust only that field and see how much the final answer moves.
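The chain described above can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not the page's actual implementation: the names `ArchiveEstimate` and `estimate_archive`, the input validation, and the handling of a zero-payload result are all choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class ArchiveEstimate:
    total_bases: int
    raw_bits: float
    effective_bits: float
    effective_mb: float       # binary megabytes: bytes / (1024 * 1024)
    total_cost_usd: float
    cost_per_mb_usd: float


def estimate_archive(
    bases_per_strand: int,
    num_strands: int,
    bits_per_base: float,
    overhead_pct: float,
    cost_per_base_usd: float = 0.0,
) -> ArchiveEstimate:
    """Apply the simplified capacity and cost model described in the text."""
    # Validation rules are an assumption of this sketch.
    if not 0 <= overhead_pct <= 100:
        raise ValueError("overhead_pct must be between 0 and 100")
    if not 0 < bits_per_base <= 2:
        raise ValueError("bits_per_base must be in (0, 2] for four nucleotides")

    total_bases = bases_per_strand * num_strands
    raw_bits = total_bases * bits_per_base
    effective_bits = raw_bits * (1 - overhead_pct / 100)
    effective_mb = effective_bits / (8 * 1024 * 1024)

    # Cost is charged on every synthesized base, not just on payload.
    total_cost = total_bases * cost_per_base_usd
    # With no usable payload, cost per MB is undefined; report infinity here.
    cost_per_mb = total_cost / effective_mb if effective_mb > 0 else float("inf")

    return ArchiveEstimate(
        total_bases, raw_bits, effective_bits, effective_mb, total_cost, cost_per_mb
    )


if __name__ == "__main__":
    print(estimate_archive(200, 1_000_000, 1.6, 30, 0.0001))
```

Because each input enters the model as a simple multiplier, changing one argument at a time is an easy way to run the sensitivity tests mentioned above.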

Worked example with the default values

Suppose you use the default settings: 200 base pairs per strand, 1,000,000 strands, 1.6 bits per base, 30% overhead, and a synthesis price of $0.0001 per base pair. First calculate total bases: 200 × 1,000,000 = 200,000,000 base pairs. Next convert those bases into raw digital capacity by multiplying by the encoding density: 200,000,000 × 1.6 = 320,000,000 raw bits.

Now account for overhead. If 30% of the raw capacity is not usable payload, then 70% remains. That gives 320,000,000 × 0.70 = 224,000,000 effective bits. Divide by 8 to get 28,000,000 bytes, then divide by 1024 × 1024 to express the result in the page's MB output format. That works out to about 26.70 MB of usable payload. The synthesis cost in this simplified model is 200,000,000 × 0.0001 = $20,000, which implies a cost per MB of roughly $748.98.

This example is useful because it shows how dramatically overhead shapes the answer. If you kept the same total bases and bits per base but raised overhead from 30% to 50%, usable payload would fall from 224,000,000 bits to 160,000,000 bits, a drop of nearly 29%. Likewise, if you held overhead steady at 30% and raised bits per base from 1.6 to 1.8, raw capacity would rise to 360,000,000 bits and effective capacity to 252,000,000 bits, a proportional 12.5% gain in both. In other words, the model rewards efficiency and scale, but every gain still passes through the overhead filter.
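The two what-if cases above can be checked with a few lines of Python. The helper name `effective_bits` is illustrative; the inputs are the default values from the worked example.

```python
def effective_bits(total_bases: int, bits_per_base: float, overhead_fraction: float) -> float:
    """Effective payload in bits after removing the overhead fraction."""
    return total_bases * bits_per_base * (1 - overhead_fraction)


TOTAL_BASES = 200 * 1_000_000  # default strand length times default strand count

baseline = effective_bits(TOTAL_BASES, 1.6, 0.30)
more_overhead = effective_bits(TOTAL_BASES, 1.6, 0.50)
denser_encoding = effective_bits(TOTAL_BASES, 1.8, 0.30)

if __name__ == "__main__":
    for label, value in [
        ("baseline (1.6 bits/base, 30%)", baseline),
        ("overhead raised to 50%", more_overhead),
        ("bits per base raised to 1.8", denser_encoding),
    ]:
        change = (value / baseline - 1) * 100
        print(f"{label}: {value:,.0f} bits ({change:+.1f}% vs baseline)")
```

Raising overhead by 20 percentage points costs more payload here than the denser encoding gains, which is the point of the comparison.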

When you interpret your own output, think of the capacity line as a planning estimate rather than a guarantee. If the answer seems surprisingly small, it usually means one of two things: either the total amount of DNA in the model is smaller than you intuitively expected, or your overhead and encoding assumptions are deliberately conservative. If the answer seems surprisingly large, check that you are not using an unrealistically high bits-per-base value with unrealistically low overhead at the same time.

Assumptions, limits, and practical planning tips

The main simplifying assumption is that overhead is modeled as one percentage applied to the entire raw archive. That is mathematically clean and easy to understand, but real DNA storage systems often have multiple layers of loss or non-payload sequence. Primer binding sites may be fixed-length rather than percentage-based. Addressing may scale with archive organization. Some redundancy may be logical, while other redundancy may be physical, such as keeping more copies to survive dropout. If you want the calculator to imitate a stricter design, increase overhead or reduce the nominal payload-bearing length of each strand before you calculate.
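One way to reduce the payload-bearing length before calculating is sketched below. The 20-base primer at each end is a hypothetical value chosen only for illustration, and the function names are not part of the calculator. The sketch also assumes that primer bases are still synthesized and paid for, so it keeps the cost on the full strand length.

```python
def payload_bases(strand_length: int, fixed_bases: int) -> int:
    """Bases left for payload after fixed-length regions such as primers."""
    if fixed_bases < 0 or fixed_bases > strand_length:
        raise ValueError("fixed_bases must be between 0 and strand_length")
    return strand_length - fixed_bases


def effective_bits(bases_per_strand: int, num_strands: int,
                   bits_per_base: float, overhead_fraction: float) -> float:
    """Same simplified model as the calculator, applied to payload bases."""
    return bases_per_strand * num_strands * bits_per_base * (1 - overhead_fraction)


if __name__ == "__main__":
    STRAND_LENGTH = 200
    PRIMER_BASES = 2 * 20        # hypothetical: one 20-base primer at each end
    NUM_STRANDS = 1_000_000

    usable = payload_bases(STRAND_LENGTH, PRIMER_BASES)
    bits = effective_bits(usable, NUM_STRANDS, 1.6, 0.30)
    # Assumption: every synthesized base is billed, primers included.
    cost = STRAND_LENGTH * NUM_STRANDS * 0.0001

    print(f"payload bases per strand: {usable}")
    print(f"effective bits: {bits:,.0f}")
    print(f"synthesis cost (full strands): ${cost:,.2f}")
```

If you enter the shortened length directly into the calculator, remember that its cost output will then be based on the shortened length as well.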

A second important assumption is that bits per base is treated as constant across the whole archive. In practice, encoding efficiency depends on the exact coding strategy and the constraints you enforce. Stronger rules that avoid difficult motifs can improve biochemical reliability while lowering effective information density. That tradeoff is not a flaw in the calculator; it is the core reason this input exists. If you are comparing architectures, run an optimistic case and a conservative case side by side. The gap between those scenarios often tells you more than either single result by itself.

The cost output is intentionally narrow. It helps compare synthesis prices or estimate the order of magnitude of a DNA write, but it is not a total cost of ownership model. Sequencing, verification, library prep, sample storage, and expert labor can dominate the budget depending on project scale. A result that looks affordable in raw synthesis dollars may still be impractical as an everyday backup system. That is why DNA is usually discussed for long-term, infrequently accessed archives rather than normal active storage.

For quick planning, three scenarios are especially helpful. An optimistic exploratory run might use 1.8 to 2.0 bits per base and 10% to 25% overhead, which is useful for understanding a ceiling. A balanced planning run might use 1.5 to 1.7 bits per base and 25% to 50% overhead, which is often a better rough budget case. A high-robustness run might use 1.2 to 1.5 bits per base and 50% to 80% overhead, which is appropriate when you want to stress-test how much redundancy and design constraint could shrink the payload.
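The three scenarios can be turned into capacity ranges with a short script. This is a sketch, not part of the calculator: the dictionary layout and function names are assumptions, and each range is bracketed by pairing the lowest bits per base with the highest overhead (worst case) and the highest bits per base with the lowest overhead (best case).

```python
BITS_PER_BINARY_MB = 8 * 1024 * 1024

# Ranges copied from the three planning scenarios in the text.
SCENARIOS = {
    "optimistic": {"bits_per_base": (1.8, 2.0), "overhead_pct": (10, 25)},
    "balanced": {"bits_per_base": (1.5, 1.7), "overhead_pct": (25, 50)},
    "high_robustness": {"bits_per_base": (1.2, 1.5), "overhead_pct": (50, 80)},
}


def effective_mb(total_bases: int, bits_per_base: float, overhead_pct: float) -> float:
    """Effective capacity in binary megabytes under the simplified model."""
    return total_bases * bits_per_base * (1 - overhead_pct / 100) / BITS_PER_BINARY_MB


def scenario_ranges(total_bases: int) -> dict:
    """Return (worst_case_mb, best_case_mb) for each scenario."""
    ranges = {}
    for name, params in SCENARIOS.items():
        low_bpb, high_bpb = params["bits_per_base"]
        low_overhead, high_overhead = params["overhead_pct"]
        worst = effective_mb(total_bases, low_bpb, high_overhead)
        best = effective_mb(total_bases, high_bpb, low_overhead)
        ranges[name] = (worst, best)
    return ranges


if __name__ == "__main__":
    # Default archive size: 200 bases per strand, 1,000,000 strands.
    for name, (worst, best) in scenario_ranges(200 * 1_000_000).items():
        print(f"{name}: {worst:.2f} to {best:.2f} MB")
```

The width of each range, and the overlap between neighboring scenarios, gives a quick sense of how much the planning estimate depends on assumptions rather than on archive size.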

It also helps to remember where DNA fits among storage technologies. DNA can be extraordinarily dense and potentially durable over long timescales when stored correctly, but it is slow to write and slow to read. Tape is far more convenient for many enterprise backup workflows. Hard drives and SSDs are far better for routine access. The reason to model DNA is not that it wins every storage contest today, but that it opens a different design space: very dense, very cold, very long-lived archival storage where access latency is acceptable.

Storage medium	Approximate capacity context	Typical role	Access speed
DNA (synthetic archival)	Extremely dense; theoretical figures reach around 10¹⁷ to 10¹⁸ bytes per gram	Cold, long-term archives	Very slow
Magnetic tape	Tens of terabytes per cartridge	Enterprise backup and retention	Slow, mostly sequential
Hard disk drive	Several terabytes per drive	General-purpose storage	Moderate
Solid-state drive	Up to tens of terabytes per device	High-performance active workloads	Fast

If you are selecting inputs for a serious proposal, one good habit is to document what each assumption represents. For example, you might decide that a lower bits-per-base value already includes some primer burden and sequence constraints, while your overhead value is reserved mostly for ECC and redundancy. Being explicit about that interpretation makes the calculator easier to use consistently across multiple design discussions.

For further reading, classic papers by Church and colleagues, Goldman and colleagues, and Organick and colleagues offer useful context for how real DNA data storage systems balance density, redundancy, and random access. The calculator does not replace those design details, but it gives you a compact framework for thinking about the same tradeoffs.

Frequently asked questions

How much data can a gram of DNA theoretically store? Published estimates often place the theoretical range on the order of 10¹⁷ to 10¹⁸ bytes per gram under very favorable assumptions. That headline figure is about physical density, not necessarily about immediately usable payload in a practical archive. Once you add indexing, redundancy, and experimental realities, the effective number can be much lower.

Why is error correction necessary? DNA storage pipelines can suffer substitutions, insertions, deletions, and outright strand loss. Error correction and redundancy provide a buffer against those failures. In this calculator, overhead is the simple stand-in for all of that protection. The benefit is clarity: you can instantly see how more robustness changes usable capacity.

Is DNA storage ready for everyday backups? Usually no. For most users and most organizations, conventional media remain far cheaper and much faster for regular read and write activity. DNA is most compelling when the archive will be written infrequently, read rarely, stored for a very long time, and judged more by density and durability than by speed.

What values should I try first for bits per base and overhead? If you want a reasonable starting point, use something like 1.5 to 1.7 bits per base with 25% to 50% overhead. That usually gives a more believable planning estimate than jumping straight to the theoretical 2 bits per base with minimal redundancy. If you need a deliberately conservative estimate, lower bits per base and increase overhead together.

Does the calculator explicitly model primers, indexing, and random access tags? Not separately. You can approximate those burdens by either reducing payload-bearing strand length before entering it, reducing bits per base, increasing overhead, or combining those choices. The point is to keep the arithmetic transparent while still letting you reflect real design penalties.

Optional mini-game: Archive Sprint

Want a fast intuition for why payload and overhead need to be judged together? In this quick canvas mini-game, three DNA archive batches appear at a time. Your job is to tap the batch with the highest effective payload, not the most tempting raw numbers. Bigger strands and higher bits per base help, but a heavy overhead percentage can quietly erase the advantage. The game lasts about 75 seconds, ramps up every 20 seconds, works with touch or mouse, and also accepts keyboard picks with 1, 2, and 3.

Pick the batch with the highest effective payload using the same logic as the calculator: roughly compare bp × bits/base × (1 − overhead). Click or tap a card, or press 1, 2, or 3. Correct picks build streaks and earn small time boosts. Flashy raw-bit decoys start showing up after the warm-up.

Runs are short, mobile-friendly, and your best score is saved on this device.

Takeaway: usable DNA capacity is raw capacity multiplied by the fraction that remains after overhead is removed.

The mini-game is separate from the calculator itself, so it never changes the math above. It is simply a playful way to rehearse the same idea the form quantifies: the most impressive-looking batch is not always the one with the most usable payload after redundancy is accounted for.

DNA Data Storage Capacity Calculator
