Introduction
Teams often talk about AI cost as if it were mostly a GPU problem, but many real-world projects spend just as much time and money assembling usable data. A dataset can look affordable when you only consider the first labeling pass, then become far more expensive after review, rework, versioning, and experiment compute are added. That is why this planner separates the budget into recognizable pieces instead of hiding everything inside one average rate.
The most important idea behind the calculator is compounding. If you double the number of samples, annotation usually doubles. If you add another iteration because the guidelines changed or an active-learning round is planned, annotation rises again and QA typically rises with it. Storage may stay small for text projects but become meaningful for image, audio, or video pipelines, especially when you keep multiple versions. Training compute is separate for a reason as well: many teams understate it when they focus only on labeling invoices and forget the cost of repeated model runs.
Use this tool for planning conversations, not false precision. It is helpful when you need to compare a small pilot with a production rollout, estimate the impact of a higher QA standard, or explain to stakeholders why a second labeling pass has a bigger budget effect than it first appears. The result is an estimate, but it is an estimate that can be read, questioned, and improved.
How to use
Start by deciding on one time horizon and sticking with it. If your team budgets projects as a single build cost, enter totals for the whole effort. If your storage charges are monthly and you want a monthly comparison, keep the other numbers on a monthly basis too. Consistency matters more than any one convention. Once the horizon is clear, fill in the fields from the top of the form using the same units throughout.
- Number of Samples is the count of items you plan to label, such as support tickets, images, short audio clips, medical scans, or sensor windows.
- Cost per Sample is the average cost of one labeling pass for one item. If your organization thinks in hourly labor, convert that hourly cost into a per-item rate using observed throughput.
- Preprocessing Cost covers fixed work such as cleaning, deduplication, schema mapping, data formatting, prompt or guideline prep, and similar setup tasks.
- Model Training Budget captures compute and experiment spend that should not be hidden inside annotation, such as GPU runs, fine-tuning cycles, sweeps, or evaluation pipelines.
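- Iterations is the number of labeling passes over the data; QA percentage is the review cost expressed as a share of the annotation subtotal; the storage inputs are the dataset size in GB and the price per GB for your chosen retention period. Together they describe how much the dataset will be revisited, reviewed, and retained.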
After you click Estimate Budget, read the subtotals before you look at the total. If the annotation subtotal is already far above expectations, the problem is usually volume, per-sample pricing, or an unrealistic iteration assumption. If the QA subtotal seems too low, your review process may be broader than the percentage model used here. In that case, either raise the QA percentage or move fixed review labor into preprocessing.
A practical habit is to run the calculator three times: a baseline case, a conservative case, and an optimistic case. Change only a few variables each time. That approach makes the model easier to explain and helps you see which levers matter most. In many projects, the biggest drivers are not subtle at all: they are samples, cost per sample, and the number of effective passes over the data.
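The three-scenario habit can be sketched in a few lines of Python. This is only an illustration, not the calculator's own code: the `Inputs` class and `total_budget` function are hypothetical names, the baseline reuses the figures from the worked example on this page, and the optimistic and conservative variations are assumed values chosen purely for demonstration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Inputs:
    """Hypothetical container mirroring the form fields."""
    samples: int
    cost_per_sample: float
    iterations: float
    qa_percent: float
    storage_gb: float
    price_per_gb: float
    preprocessing: float
    training: float


def total_budget(x: Inputs) -> float:
    """Annotation + QA + storage + preprocessing + training."""
    annotation = x.samples * x.cost_per_sample * x.iterations
    qa = annotation * x.qa_percent / 100
    storage = x.storage_gb * x.price_per_gb
    return annotation + qa + storage + x.preprocessing + x.training


baseline = Inputs(10_000, 0.06, 2, 15, 300, 0.02, 1_200, 2_500)

# Change only a few variables per scenario (assumed values).
scenarios = {
    "optimistic": replace(baseline, iterations=1.3),
    "baseline": baseline,
    "conservative": replace(baseline, iterations=3, qa_percent=20),
}

totals = {name: total_budget(x) for name, x in scenarios.items()}
for name, total in totals.items():
    print(f"{name:>12}: ${total:,.2f}")
```

Because each scenario differs from the baseline in only one or two fields, any gap between totals can be traced back to a specific lever.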
Formula and worked example
The math is intentionally direct. Annotation is calculated first because it is the core variable cost: samples × cost per sample × iterations. QA is then applied as a percentage of annotation only. Storage is size in GB × price per GB. Preprocessing and training are treated as separate budget lines and added at the end, so the total is annotation + QA + storage + preprocessing + training.
Suppose you plan to label 10,000 samples at $0.06 each, expect 2 iterations, reserve $1,200 for preprocessing, budget $2,500 for training, set QA at 15% of annotation, and estimate 300 GB of storage at $0.02 per GB for the chosen retention period. The calculation reads like this: annotation is 10,000 × 0.06 × 2 = $1,200. QA is 15% of that annotation subtotal, or $180. Storage is 300 × 0.02 = $6. Add annotation, QA, and storage to the preprocessing and training lines, and the total budget becomes $1,200 + $180 + $6 + $1,200 + $2,500 = $5,086.
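The same arithmetic can be checked with a short Python sketch. The function name `estimate_budget` and its parameter names are assumptions made for illustration; the calculator's actual implementation is not shown on this page and may differ.

```python
def estimate_budget(samples, cost_per_sample, iterations, qa_percent,
                    storage_gb, price_per_gb, preprocessing, training):
    """Return the three subtotals and the total described in the text."""
    annotation = samples * cost_per_sample * iterations
    qa = annotation * qa_percent / 100       # QA applies to annotation only
    storage = storage_gb * price_per_gb
    total = annotation + qa + storage + preprocessing + training
    return {"annotation": annotation, "qa": qa,
            "storage": storage, "total": total}


# Inputs from the worked example above.
result = estimate_budget(
    samples=10_000, cost_per_sample=0.06, iterations=2, qa_percent=15,
    storage_gb=300, price_per_gb=0.02, preprocessing=1_200, training=2_500,
)
for name, value in result.items():
    print(f"{name}: ${value:,.2f}")
```

With the example inputs this prints subtotals of $1,200.00, $180.00, and $6.00, and a total of $5,086.00.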
The useful lesson is not just the answer. It is the shape of the answer. If iterations rise from 2 to 3, annotation jumps to $1,800 and QA rises with it to $270, lifting the total from $5,086 to $5,776. That means a change in process design can matter more than small differences in storage price. Likewise, if your per-sample estimate is based on guesswork rather than measured throughput, the total can drift quickly. A simple calculator is powerful when it shows you which input deserves better measurement.
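To see that shape directly, a small loop over the iteration count is enough. This is a minimal sketch using the worked-example figures; the helper name `annotation_and_qa` is hypothetical.

```python
def annotation_and_qa(samples, cost_per_sample, iterations, qa_percent):
    """Annotation subtotal and the QA subtotal derived from it."""
    annotation = samples * cost_per_sample * iterations
    qa = annotation * qa_percent / 100
    return annotation, qa


# Lines that do not change with iterations: preprocessing, training, storage.
fixed = 1_200 + 2_500 + 300 * 0.02

totals = {}
for iterations in (1, 2, 3):
    annotation, qa = annotation_and_qa(10_000, 0.06, iterations, 15)
    totals[iterations] = annotation + qa + fixed
    print(f"{iterations} iteration(s): annotation ${annotation:,.2f}, "
          f"QA ${qa:,.2f}, total ${totals[iterations]:,.2f}")
```

Each extra pass adds $690 here ($600 of annotation plus $90 of QA), which is more than a hundred times the entire storage line in this example.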
Assumptions and reading the result
The result area shows three explicit subtotals before the final total: annotation, QA, and storage. Preprocessing and training are folded into the total because they are already entered as direct budget lines. The calculation assumes QA is proportional to annotation, which is often a good first approximation when review, auditing, and adjudication scale with labeling volume. It does not assume that preprocessing or training scale the same way. That separation keeps the estimate easy to audit.
There are also clear limits. The calculator does not include project management, legal review, privacy engineering, vendor setup fees, dataset licensing, compliance audits, or opportunity cost. Those may be significant in production settings. If one of those items is material for your team, the cleanest approach is to add it to preprocessing if it is mostly fixed, or adjust the per-sample figure if it scales directly with each item labeled.
- Sanity-check scale: doubling samples should roughly double annotation and QA if the other inputs are unchanged.
- Check iteration logic: extra passes multiply annotation first, then QA follows because it is a percentage of annotation.
- Confirm storage horizon: monthly storage prices should be paired with monthly retention totals, not mixed with annual assumptions.
- Read the total as a planning estimate: if you need purchasing accuracy, use the calculator to prepare better questions for vendors rather than treating it as the quote itself.
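The first two checks in the list above can be written as assertions. The sketch below assumes the same formula as the worked example; `subtotals` is a hypothetical helper, not part of the calculator.

```python
import math


def subtotals(samples, cost_per_sample, iterations, qa_percent):
    """Return (annotation, qa) under the percentage-of-annotation model."""
    annotation = samples * cost_per_sample * iterations
    qa = annotation * qa_percent / 100
    return annotation, qa


base = subtotals(10_000, 0.06, 2, 15)
doubled = subtotals(20_000, 0.06, 2, 15)
extra_pass = subtotals(10_000, 0.06, 3, 15)

# Scale check: doubling samples doubles both annotation and QA.
assert math.isclose(doubled[0], 2 * base[0])
assert math.isclose(doubled[1], 2 * base[1])

# Iteration check: QA stays at 15% of annotation after an extra pass.
assert math.isclose(extra_pass[1] / extra_pass[0], 0.15)
print("sanity checks passed")
```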
The best way to interpret the total is as a structured conversation starter. If the number is uncomfortable, do not ask only whether the price is too high. Ask which component is driving it. A large annotation subtotal suggests volume or task complexity. A large QA subtotal suggests either high quality standards or unclear instructions. A large training budget may indicate too many experiments or an expensive model strategy. The output is most useful when it points you toward the next operational decision.
Ways to make the estimate more realistic
The input most people guess badly is cost per sample. A quick fix is to derive it from time. If annotators handle 120 items per hour and fully loaded labor is $24 per hour, the implied base labeling cost is $24 ÷ 120 = $0.20 per item before review overhead. That number will still vary with task type, tooling, and domain expertise, but it is usually more defensible than choosing a round number that simply feels plausible.
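The conversion is a single division, sketched here with a hypothetical helper name. The inputs are the figures from the paragraph above.

```python
def cost_per_sample(hourly_cost, items_per_hour):
    """Convert a fully loaded hourly rate into a per-item labeling cost."""
    if items_per_hour <= 0:
        raise ValueError("items_per_hour must be positive")
    return hourly_cost / items_per_hour


rate = cost_per_sample(hourly_cost=24.0, items_per_hour=120)
print(f"${rate:.2f} per item")
```

Measuring `items_per_hour` on a small pilot batch, rather than guessing it, is what makes the resulting rate defensible.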
Iterations deserve similar care. A second pass does not always mean relabeling the entire dataset. Sometimes only 30% of the data is revisited after a pilot or disagreement study. In those cases, an effective iteration count can be more honest than a whole number. One full pass plus a 30% rework pass is about 1.3 iterations. This is why the iterations field on the form accepts decimals even though many teams first think of it as an integer.
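One way to arrive at an effective iteration count is to sum the fraction of the dataset touched in each pass. The sketch below is an assumption about how you might compute the value before entering it; `effective_iterations` is a hypothetical helper, and the sample count and rate are taken from the worked example.

```python
def effective_iterations(pass_fractions):
    """Sum the share of the dataset labeled in each pass (1.0 = full pass)."""
    if any(f < 0 for f in pass_fractions):
        raise ValueError("pass fractions must be non-negative")
    return sum(pass_fractions)


# One full pass plus a 30% rework pass.
iterations = effective_iterations([1.0, 0.3])
annotation = 10_000 * 0.06 * iterations
print(f"effective iterations: {iterations:.1f}")
print(f"annotation subtotal: ${annotation:,.2f}")
```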
Finally, remember that small recurring items can become large over time. Storage is the classic example. A low price per GB feels negligible until raw files, processed files, intermediate artifacts, backups, and additional dataset versions all coexist for months. If reproducibility matters, retention is not an afterthought. Put the intended horizon into the storage inputs now, or your later operating costs may look like a surprise when they were really just omitted from the original plan.
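A rough way to size the storage inputs is to add up every artifact you intend to keep, then multiply by the number of retained versions and the number of months. Everything in the sketch below is an assumption for illustration: the artifact sizes, the version count, the 12-month horizon, and the treatment of $0.02 as a per-GB monthly price. Multiplying all artifacts by the version count is a deliberately pessimistic simplification.

```python
def storage_cost(artifact_gb, versions, months, price_per_gb_month):
    """Upper-bound storage estimate: every artifact kept for every version."""
    total_gb = sum(artifact_gb.values()) * versions
    return total_gb, total_gb * price_per_gb_month * months


# Hypothetical artifact sizes in GB.
artifacts = {"raw": 300, "processed": 150, "intermediate": 50, "backup": 300}

total_gb, cost = storage_cost(artifacts, versions=3, months=12,
                              price_per_gb_month=0.02)
print(f"retained data: {total_gb:,} GB")
print(f"storage over the horizon: ${cost:,.2f}")
```

Under these assumptions the storage line grows from a few dollars to several hundred, which is the kind of omission the paragraph above warns about.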
Practical FAQ
- Should internal staff time be included?
- Yes, if that time is a real constraint or real cost in your organization. Fixed setup work such as writing guidelines or building a validation set fits naturally inside preprocessing. Ongoing labeling work is often better represented in cost per sample. Some teams keep internal labor separate from cash spend, but if you do that, be explicit about what the calculator total excludes.
- How should setup fees or vendor minimums be modeled?
- Fixed fees belong in preprocessing because they do not scale neatly with each item labeled. Minimum commitments can be handled by increasing the effective number of samples, raising the per-sample rate, or simply adding the shortfall as a fixed preprocessing line. The best choice is whichever one you can explain consistently across all scenarios.
- What if my workflow has label, verify, and adjudicate stages?
- Think about how the work is priced. If verification behaves like a second structured pass over many items, modeling it through iterations may be clearer. If it behaves like review time applied to annotation output, the QA percentage is often cleaner. When both happen, split them: use iterations for the repeated labeling work and QA for review overhead.
- Why separate storage and training instead of folding them into one rate?
- Because they behave differently. Storage usually depends on retention and copies, while training depends on experiment intensity and model choice. Combining them into one blended rate can hide the real driver of change. If the total rises after a new active-learning loop, you want to know whether the increase came from more labels, more review, more retained data, or more compute. Separate lines make that visible.
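For the vendor-minimum question above, the "add the shortfall as a fixed line" option can be sketched as follows. The $1,500 minimum is a hypothetical figure, and `minimum_shortfall` is an illustrative helper rather than anything the calculator provides.

```python
def minimum_shortfall(samples, cost_per_sample, iterations, vendor_minimum):
    """Amount to add to preprocessing when annotation falls below a minimum."""
    annotation = samples * cost_per_sample * iterations
    return max(0.0, vendor_minimum - annotation)


# Worked-example annotation ($1,200) against an assumed $1,500 minimum.
shortfall = minimum_shortfall(10_000, 0.06, 2, vendor_minimum=1_500)
preprocessing = 1_200 + shortfall
print(f"shortfall: ${shortfall:,.2f}")
print(f"preprocessing to enter: ${preprocessing:,.2f}")
```

If annotation already exceeds the minimum, the shortfall is zero and the preprocessing figure is unchanged.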
Mini-game: Budget Rebalance
Optional break: this arcade mini-game turns budget planning into a fast review drill. Cost events fall into five lanes that match the calculator categories. Tap the correct lane, or press keys 1 through 5, exactly when an event reaches the glowing review line. Good timing contains the cost spike; poor timing lets pressure build in that category. Survive the full run, keep budget integrity high, and notice which lane becomes hardest to control as the round changes.
