It’s quite difficult to approximate the range and unpredictability of test data, especially if you haven’t been studying that data yourself. You can write a test data generator, but unless you tweak it to account for the things that make data look “real”, you won’t get data that reflects your real production data. Real data has typos, odd outliers, and non-linear distributions.
Claude Code can help you generate data that’s closer to what you’d see in production, but as with anything LLM-assisted, the quality heavily depends on how well you understand the process yourself.
Here’s what you need to know when asking Claude Code to generate data, and why a “general” approach will produce data that misses many of your edge cases.
Test data isn’t uniform
Here’s an example of an LLM-written, naive generation script:
import random
from faker import Faker

fake = Faker()
rows = []
for _ in range(10000):
    row = {
        "name": fake.name(),
        "age": random.randint(18, 80),
        "income": random.randint(30000, 200000),
        "created_at": fake.date_time_this_year(),
    }
    rows.append(row)
This will give you really basic test data. Real data almost never distributes uniformly: income won’t spread evenly between $30K and $200K; rather, it’ll cluster at lower values with a few high outliers. Customer ages in a consumer app won’t follow a flat line either. Nor will timestamps: they’ll cluster during business hours and campaigns, and might be lighter on weekends.
When you generate everything with random.randint, you lose arguably the most important facet of that data: its shape. The unusual and unpredictable data points are the ones most likely to break your app anyway.
When asking Claude Code to help you write a generator, you’ll want to be explicit about distributions, which means giving it the shape of the data you want.
Getting distributions right
This obviously depends on how your actual data scatters, but there are a few general rules of thumb for data distributions.
Financial data
Money often follows a log-normal distribution. Financial data typically clusters at lower amounts with progressively rarer high outliers, and log-normal replicates this shape well. It also tends to conform to Benford’s Law, which observes that naturally occurring quantities start with low leading digits far more often than high ones. A financial auditor running a Benford test will immediately notice that uniformly distributed transaction amounts are synthetic.
NumPy has functions that will sample from different distributions:
import numpy as np
# Income in USD — log-normal fits consumer income distributions well
incomes = np.random.lognormal(mean=10.8, sigma=0.85, size=10000).astype(int)
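To see the Benford effect for yourself, here’s a small sketch that counts leading-digit frequencies of log-normal incomes and compares them to Benford’s expected values (the sample is regenerated so the snippet is self-contained; the exact parameters are the illustrative ones above, not fitted to any real dataset):

```python
import numpy as np

# Regenerate incomes so this snippet stands alone.
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10.8, sigma=0.85, size=10000).astype(int)

# Observed frequency of each leading digit 1-9 vs Benford's prediction.
leading = np.array([int(str(x)[0]) for x in incomes])
observed = [float(np.mean(leading == d)) for d in range(1, 10)]
benford = [np.log10(1 + 1 / d) for d in range(1, 10)]

for d in range(1, 10):
    print(f"digit {d}: observed {observed[d-1]:.3f}, Benford {benford[d-1]:.3f}")
```

A log-normal with this sigma only approximates Benford (it spans barely more than an order of magnitude), but it’s dramatically closer than uniform data, where every leading digit would sit near 0.11.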
Age data
When it comes to user ages, you’ll need to understand whether your users are normally distributed or bimodal. A lot of consumer apps have two user clusters: a younger adult group and an older group, and a valley between them. Once you determine the pattern in your production data, tell Claude Code what it looks like, and have it create a model for you:
young = np.random.normal(28, 5, 6000)
older = np.random.normal(52, 8, 4000)
ages = np.clip(np.concatenate([young, older]), 18, 90).astype(int)
Names and places data
Zipf’s law applies to names and cities: the most common values are orders of magnitude more frequent than rarer ones. This means you should see many instances of “Smith” and “Johnson” in your name column. Faker weights name frequency by default (use_weighting=True), which handles this distribution reasonably well.
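If you’re generating names without Faker, the same effect can be sketched directly with rank-based Zipf weights. The surname list and exponent below are illustrative assumptions, not census figures:

```python
import numpy as np

# Common US surnames in rough frequency order (illustrative list).
surnames = ["Smith", "Johnson", "Williams", "Brown", "Jones",
            "Garcia", "Miller", "Davis", "Rodriguez", "Martinez"]

# Zipf-style weights: frequency proportional to 1 / rank^s.
ranks = np.arange(1, len(surnames) + 1)
weights = 1.0 / ranks**1.07  # exponent near 1 is typical for name data
weights /= weights.sum()

rng = np.random.default_rng(0)
sampled = rng.choice(surnames, size=10000, p=weights)
```

With these weights the top-ranked name shows up roughly an order of magnitude more often than the last one, which is what a real name column looks like.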
Temporal data
Uniform random selection is the wrong model for timestamps. Real event data follows a Poisson process with a time-varying intensity. Ask Claude Code to weight timestamp generation by hour of day and day of week based on what you’ve noticed about your traffic patterns (for B2B SaaS that usually means weekday work hours, while for B2C it might mean evenings/weekends).
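A simple way to approximate this without a full Poisson process is to sample the hour of day from observed weights. The hourly weights below are a made-up B2B-ish pattern; replace them with what you actually measure:

```python
import numpy as np
from datetime import datetime, timedelta

# Assumed relative traffic by hour (0-23) for a weekday-work-hours pattern.
hour_weights = np.array(
    [1, 1, 1, 1, 1, 2, 4, 8, 14, 18, 20, 19,    # midnight to 11am
     16, 18, 19, 17, 14, 10, 6, 4, 3, 2, 2, 1],  # noon to 11pm
    dtype=float)
hour_weights /= hour_weights.sum()

rng = np.random.default_rng(7)
base = datetime(2024, 3, 4)  # a Monday, as an example anchor date
hours = rng.choice(24, size=10000, p=hour_weights)
minutes = rng.integers(0, 60, size=10000)
timestamps = [base + timedelta(hours=int(h), minutes=int(m))
              for h, m in zip(hours, minutes)]
```

Extending this with a second weight vector over day-of-week gives you the weekday/weekend skew as well.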
The challenge that is correlated fields
When you’re generating fake data, you shouldn’t generate each field independently. This will ruin the “realness” of your fake data. Since Faker generates each field in isolation, you could end up with a 19-year-old VP of Finance earning $280K, and many of your rows will be similarly implausible. In real life, certain fields are correlated, like job title and salary, or location and phone number.
You’ll want to tell Claude Code to generate correlated fields together. The easiest way to do this is to set conditional rules: generate age first, then derive an income range and likely job seniority from it. For location-dependent fields like phone area codes, generate the state first and look up a real area code from that state (instead of using fake.phone_number() independently).
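The conditional approach can be as simple as a few age-gated bands. The bands, titles, and noise parameters below are assumptions for illustration; the point is that seniority and income are derived from age rather than drawn independently:

```python
import numpy as np

rng = np.random.default_rng(1)

def plausible_row():
    # Generate age first, then derive title and income from it.
    age = int(np.clip(rng.normal(38, 12), 18, 75))
    if age < 25:
        title, income = "Analyst", rng.normal(45000, 8000)
    elif age < 35:
        title, income = "Senior Analyst", rng.normal(70000, 15000)
    elif age < 45:
        title, income = "Manager", rng.normal(95000, 20000)
    else:
        title, income = "Director", rng.normal(130000, 30000)
    return {"age": age, "title": title, "income": int(max(income, 30000))}

rows = [plausible_row() for _ in range(1000)]
```

Because income is conditioned on age, the resulting columns come out positively correlated, and a 19-year-old Director simply can’t occur.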
Use a covariance matrix for tightly-coupled numerical fields like age and income:
means = [40, 75000]  # age, income
cov = [[120, 124000], [124000, 800000000]]  # off-diagonal chosen so age-income correlation ~0.4
samples = np.random.multivariate_normal(means, cov, size=10000)
When prompting Claude for this, use natural language to explain the relationships between fields, e.g. “older users tend to have higher incomes and more senior job titles” will help it figure out the conditional logic.
You might also want to give CC real numbers or ranges from production to show the type of values you see within your specific data.
Important edge cases
Good test data shouldn’t be perfect or clean. You’ll want to include some values that are designed to break your app. Claude Code is great at generating these on request, but you’ll need to ask for them explicitly in your prompt:
String edge cases
Names with apostrophes (O'Brien) can break unparameterized queries, and names with emojis or zero-width characters will pass VARCHAR(50) constraints in some databases and fail in others. And a string that is exactly one character longer than your column’s max length is the most reliable way to find off-by-one errors in your INSERT logic.
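A handful of these can be kept as a literal list and appended to the generated name column. The 51-character value assumes a VARCHAR(50) column; adjust the length to your schema:

```python
# String edge cases to append to a generated name column.
string_edge_cases = [
    "O'Brien",        # apostrophe: breaks unparameterized SQL
    "Zoë 💜",         # accented char + emoji: exercises encoding/collation
    "Smith\u200b",    # zero-width space: invisible but present
    "A" * 51,         # one char past an assumed VARCHAR(50) limit
]
```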
Temporal edge cases
1970-01-01 00:00:00 (Unix epoch zero) is frequently misinterpreted as NULL, and the 2038-01-19 32-bit timestamp overflow is getting here faster than people think. Timestamps in DST transition windows (specifically the repeated hour when clocks fall back) occur twice and will confuse any system that assumes timestamps are monotonic.
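These three can also be expressed as concrete values to splice into a generated timestamp column. The DST example assumes a US fall-back transition (the naive 01:30 below occurs twice in US/Eastern on that date):

```python
from datetime import datetime, timezone

temporal_edge_cases = [
    datetime(1970, 1, 1, 0, 0, 0, tzinfo=timezone.utc),    # epoch zero
    datetime(2038, 1, 19, 3, 14, 7, tzinfo=timezone.utc),  # 32-bit overflow moment
    datetime(2024, 11, 3, 1, 30, 0),  # inside an assumed US DST fall-back window
]
```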
Types of Null
NULL, empty string "", and the literal string "null" are all semantically distinct, but applications constantly conflate them. Ask Claude to include all three variants in your dataset so you can make sure your app handles them correctly.
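In a Python generator these are just three distinct values, and it’s worth printing how differently they behave to confirm your pipeline doesn’t flatten them into one:

```python
# Three "empty" variants that applications constantly conflate.
null_variants = [None, "", "null"]

for v in null_variants:
    # None is SQL NULL, "" is a present-but-empty value,
    # "null" is a four-character string.
    print(repr(v), "| is None:", v is None, "| empty string:", v == "")
```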
Intentional duplicates
Real email columns have ~0.5–2% duplicates (often shared family accounts or re-registrations). If you generate your emails with fake.unique.email(), you’ll never exercise your deduplication logic.
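One way to sketch this is to generate unique addresses and then overwrite a small fraction with existing ones. The 1% rate here is picked from the ~0.5–2% range above, and the plain `user{i}@example.com` addresses are placeholders for whatever generator you use:

```python
import random

random.seed(3)

# Placeholder unique emails; swap in your real generator here.
emails = [f"user{i}@example.com" for i in range(10000)]

# Overwrite ~1% of positions with an address that already exists elsewhere.
n_dupes = len(emails) // 100
for idx in random.sample(range(len(emails)), n_dupes):
    emails[idx] = random.choice(emails)
```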
When prompting CC for edge cases, do it in a separate generation pass: ask it to append N rows with specific anomaly patterns after generating the main dataset. This keeps the main generation logic clean (extra instructions in a prompt dilute the weight an LLM gives everything else) while still covering the edge cases.
Be smart with your prod data
Before showing Claude samples from your production data, make sure to anonymize them and remove any PII. LLMs are great at learning from patterns, but feeding in real customer data puts that data at risk, and your LLM provider’s data-handling policies probably won’t align with your company’s security requirements.
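A minimal masking pass before pasting a sample row into a prompt might look like the sketch below. The field names are hypothetical; extend the function to whatever PII your schema actually holds:

```python
import hashlib

def mask_row(row):
    # Redact free-text PII outright; hash identifiers so joins still work
    # across rows without exposing the underlying value.
    masked = dict(row)
    masked["name"] = "REDACTED"
    masked["email"] = hashlib.sha256(row["email"].encode()).hexdigest()[:12]
    return masked

# Hypothetical sample row.
sample = {"name": "Ada Lovelace", "email": "ada@example.com", "income": 91000}
print(mask_row(sample))
```

Note that hashing alone isn’t full anonymization (low-cardinality fields can be brute-forced), but it preserves the statistical shape, which is all the model needs.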
Validate before you test with it
Once you’ve got a generated dataset, make sure you do a few quick checks (to help you avoid false test passes). Some examples:
- Plot the leading digit distribution of any financial amounts (it should not be anywhere near uniform)
- The correlation between age and income should be positive and moderate
- You’ll want to human-review ~50–100 random records for general plausibility
Use Claude Code to help write these validation checks. It can be helpful to get them in the form of a second script that runs after data gen, then prints a summary.
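Such a validation script might look like the sketch below. It assumes a dataset with age and income columns already loaded as NumPy arrays (synthetic correlated arrays stand in here so the snippet runs standalone):

```python
import numpy as np

# Stand-in data: replace with your generated age/income columns.
rng = np.random.default_rng(5)
ages = rng.normal(40, 11, 10000)
incomes = 2000 * ages + rng.normal(0, 25000, 10000)

# Check 1: age-income correlation should be positive and moderate.
corr = np.corrcoef(ages, incomes)[0, 1]
assert 0.2 < corr < 0.9, f"suspicious age-income correlation: {corr:.2f}"

# Check 2: leading digit of income should be far from uniform (~0.11 each).
leading = np.array([int(str(int(abs(x)))[0]) for x in incomes])
freq_1 = float(np.mean(leading == 1))
print(f"correlation: {corr:.2f}, leading-digit-1 frequency: {freq_1:.2f}")
```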
Better data means better testing
Claude Code can help you get data that is nearly as unpredictable as your real data, but it works best when you give it constraints up front. Be as specific as you can about distributions, correlations, and edge-case requirements; that’s far more efficient than cleaning up your data later.
If you want to run your generated dataset against your app in a production-like environment, give Shipyard a try. Spin up an ephemeral environment per branch with your test data, and run your full test suite before anything touches staging.