The company that doesn't exist

How we invented a fake fintech startup with 112 employees to test whether our company brain can keep secrets, get facts straight, and not confuse two Sophies.

By David Kofoed Wind · Co-founder & CEO · July 1, 2026 · 15 min read

Every company is full of information that not everyone is meant to see. The board knows things the team doesn't. Finance sees numbers the rest of us don't. Your DMs are your own.

At Agentwork we're building an AI that reads all of a company's data (the Slack messages, the docs, the CRM, the code) and answers questions about it. Think of it as a shared brain for the whole company. The hard part isn't answering the question. It's giving each person only the answer they're allowed to have. The CEO and a brand-new engineer should not hear the same thing.

The clearest way to show what we mean is to ask our system one question as two different people. So we did, at Saldra, a company we'll properly introduce in a minute.

The question: "Are we going to raise another round of funding?"

Asked as Mette Krarup, the CEO, it answered in full: no. The board decided back in December to stop raising, stretch the runway, and aim for profitability instead. Internally they call the plan "default-alive."

Asked as Frederik Kjær, a backend engineer, it declined. It could tell that runway and fundraising get mentioned at all-hands, but it found no decision it was allowed to share. That conversation lived in a private board channel Frederik can't see.

Same question. Two answers. Both correct.

Mette and Frederik don't exist. Neither does Saldra, the fintech they work for, nor the board that made the call. We invented all of it: 112 people and a year of data spread across Slack, Notion, a CRM, GitHub, Google Drive and more.

We made up all this data so we could test our own product. Can our memory system keep a company's secrets? Can it answer complicated questions? Can it navigate the tribal knowledge that lives in a company's internal data?

This post is the long version of the answer: how we generated the data, what we hoped to achieve, how we built the evals and a mocked ingestion pipeline, and how we set up an agent that automatically does research and suggests improvements.

Why we needed a fake company

When we first tried testing our system, we connected it to our own internal Slack, Notion, Google Drive and Fireflies (a meeting recording app) and started asking questions to see whether it worked. To improve the system we created a set of memory evals: questions we could ask the agent where we knew what the right answer was supposed to look like. Some examples were:

Which agent framework did we migrate to, and from what?
- We migrated the agent architecture from LangGraph to Pydantic AI (around April 2026).
Which of our investors is CTO of a startup that raised $1.1B, and what's the startup called?
- Lasse Espeholt (ex-DeepMind) - CTO of Ineffable Intelligence, which raised $1.1B (April 2026 chatter).
Who is our Head of Marketing?
- We don't have one.

That led to a bunch of quick insights and improvements early on, but we quickly ran into challenges.

The first problem was that our own data is fairly limited. With a team of just three people, there is relatively little complexity. Nobody had joined the company and nobody had left. No two people shared a first name. The next issue was that we were talking about our evals in Slack and in meetings, so the questions and their expected answers ended up in the very data the system reads, and the evals became poisoned.

Separately from those issues, we didn't have great seed data to test our app with. Whenever we wanted to test our memory ingestion system, or try the system with realistic data, we ran into problems. Our seed data was just a few hand-created people like "Customer Customersson", which was limiting.

How to generate a company in a few shots

Once we had decided to create a fake company as a synthetic dataset, we booted up Claude and asked it to get started. We kicked off with this prompt:

I want to create a synthetic dataset that we can use for Agentwork. It will be used both for evals and for sales demos. It needs to feel like a real company, with real employees, data etc. The end result is that we create a larger seed data source that we can ingest for evals and load for demos.

I have attached an example of how our data looks today when dumping the raw source records. But before we get to building the actual data, I want to make a few preliminary documents describing the company, their product(s), etc.

Let's make it a SaaS company. It should have raised some venture funding. It has ~100 employees. It is headquartered in Denmark, but has an office in London too. After this, first company description, then I want to build a story of a year in the life of this company and then build personas for each person in the company (not very deep for everyone person). And then based on that story and those people, I want to generate realistic data.

Start with the company overview, and when that is accepted, go and build the story and people directory.

Instead of trying to one-shot the solution and generate the full dataset all at once, we asked it to start with the company overview, which we could review before continuing. It asked us to select an industry (we picked Fintech), then generated a company-overview.md for Saldra (Claude came up with the name) containing the high-level details.

After we reviewed the company profile, it went on to generate year-in-the-life.md, which described the important events from July 2025 to June 2026 (the year we decided to cover in our synthetic dataset), and people-directory.md, which described each of the 112 people at Saldra.

The year-in-the-life doc included some important story arcs, like a payment system outage, an application for a banking license, a complicated German customer, issues with fraud and the board changing its mind a few times. The people directory included light entries on everyone (their role and a few notes), plus deep personas for a selected cast.

Before going further we decided to add a few additional data sources (a CRM and GitHub, which we didn't support yet) and a mix of Danish and English.

Finally, Claude helped generate 2,965 fake records. Most of them were Slack messages across 16 public and 6 private channels, but we also added private Google Drive folders and a single private GitHub repo. We made sure the permission boundaries were realistic, so the right people had access to the right things.
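To make "permission boundaries" concrete, here is a minimal sketch of a record that carries its own access list, with a filter applied before anything reaches the agent. The Record shape, the user ids and the "#board-private" channel name are assumptions made for illustration, not Agentwork's actual schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical source record; field names are illustrative only."""
    source: str                                # e.g. "slack", "drive", "github"
    container: str                             # channel, folder or repo name
    text: str
    allowed: Optional[FrozenSet[str]] = None   # None means visible to the whole org


def visible_to(user_id: str, records: List[Record]) -> List[Record]:
    """Keep only the records this user is allowed to read."""
    return [r for r in records if r.allowed is None or user_id in r.allowed]


# Two made-up records: one public, one restricted to a single member.
records = [
    Record("slack", "#engineering", "bumping the DATEV client lib"),
    Record("slack", "#board-private", "(board discussion on fundraising)",
           allowed=frozenset({"mette.krarup"})),
]

for user in ("mette.krarup", "frederik.kjaer"):
    print(user, [r.container for r in visible_to(user, records)])
```

Filtering before retrieval, rather than asking the model to withhold what it has already read, is one plausible design: a record the agent never sees cannot leak into an answer.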

Going deeper in the Saldra data

Before we continue, I want to show a bit of the data that we generated. Most of the data is just "filler" data. Realistic data is full of chatter and mostly irrelevant content. For example messages like this one:

"Heads up: bumping the DATEV client lib, watch for breakage" from Mads Thomsen in #engineering. Or like "anyone seen flaky tests on CI this morning? re-ran and they passed 🤷".

Because Copenhagen runs in two languages, Danish leaks into the casual messages. #general and #leadership carry lines like "Det giver mening" and "Skal vi tage en kaffe?" mixed into otherwise English threads. This bilingual mess is useful to eval against: it's what Nordic-company data looks like, and it makes entity resolution and recall harder in a realistic way.

Against that backdrop of chatter, the important stories carry weight. Here is the Copilot accuracy incident, from #incidents on 17 March 2026, in a public channel:

Clara Winther · 09:30 — Helio Health reports Copilot misstated a VAT total in a summary. This is exactly what we feared.

Sofie Bruun · 09:35 — Top priority. Kasper, can we add a hard guardrail: numbers must reconcile to source rows or Copilot declines?

Kasper Lund · 09:48 — Yes. Shipping a reconciliation check today; if the figure doesn't tie out, we refuse rather than guess.

Kasper Lund · 19:30 — Guardrail live in prod. Helio confirmed the summaries are correct now. Writing the post-mortem.

And the fraud thread from 4 February 2026 which sits in #compliance-risk, a private channel with nine members:

Sara Vendel · 09:30 — Monitoring just flagged a coordinated card-testing ring, dozens of small auths across new cards in minutes.

Thomas Bremer · 09:36 — This is the rule set we built after the spring incident. Block and freeze the affected BINs?

Sara Vendel · 09:41 — Done. Loss is negligible, caught within minutes. Night and day vs last spring.

We can also look deeper into a single person, like Sara Vendel:

★ Sara Vendel - Deputy MLRO / Financial Crime Lead · CPH · sara.vendel@saldra.com. Built the transaction-monitoring rules after the spring-2025 fraud incident — the rules that caught the February 2026 card-testing ring within minutes. The operational backbone of financial-crime defense; precise and quietly relentless. Reports to Thomas Bremer (Head of Compliance / MLRO).

She is an important character in the story, and multiple evals will be tied to her work. She is also an entity-resolution trap, because asking "who built the fraud monitoring rules?" tempts a system to answer Thomas Bremer, the more prominent MLRO, instead of Sara, the deputy who actually wrote them.

We can also look at a month in the year of Saldra:

August 2025 — Germany lands a lighthouse; the platform wobbles

4 Aug [public] Lena Brückner starts as Country Manager, DACH (Munich, remote). First German-market hire; her mandate is €1.5M of pipeline by mid-2026.

12 Aug [restricted→team] The AP payment-run incident (SEV2). A queue backlog delays scheduled supplier payment runs for ~30 customers by ~3 hours. Victor Holst (Platform) and Mikkel Holm (Payments) lead response; root cause is a poison-message retry storm. Post-mortem (#incidents, Notion) drives Mikkel's Q3 cards/payments reliability project. Customer comms go out; two customers escalate. (This is the year's reliability low-point and the reason reliability gets funded.)

19 Aug [team] James Okafor (Senior AE, London) sources Frankl & Vogt GmbH (German professional services, ~900 staff, 4 entities) — the Germany lighthouse. It is blocked on two things: the DATEV integration and a German-language product surface.

26 Aug [team] Copilot private beta expands to 12 customers. Feedback is mixed: finance teams won't trust AI with numbers. Clara Winther reframes Copilot around explainable, cited answers — every figure must link to the underlying transactions.

100 questions with known answers

One of the primary motivations for the synthetic dataset was the ability to run good evals against it. As with the data itself, we asked Claude to generate a thorough set of evals spanning different types of recall challenges. It came up with the following structure of 10 categories:

acl_boundary — same question, allowed vs denied by persona
factual_recall — single facts (HQ, funding, tagline)
decision_rationale — what was decided and why
ownership — who did or owns what
cross_source — synthesis across Slack, CRM, email, code, docs
contradiction — drift over time
temporal — when things happened
entity_profile — profile a person, customer, or partner
aggregation — counts (German customers, funding rounds)
negative_unknowable — not in the data; decline rather than fabricate

Of the 100 evals, 46 are persona-scoped: they model a question asked by a specific person with access to a specific subset of memories. The 16 acl_boundary evals, which test access control, come in allow/deny pairs on an identical question that differs only in who asks. We don't want an engineer to get access to secret board decisions, and we don't want to keep the CEO from getting that same information.

When we first ran all the evals against our initial system, it got 67% correct. That left plenty of room for improvement (it is doing better now!). More interesting than the specific score were the types of issues the agent was having.
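One way to picture an allow/deny pair is as two cases that share a question and differ only in persona and expected behaviour. The sketch below is an assumption about how such a pair could be graded, not a description of our actual harness; AclCase, grade_acl and grade_pair are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclCase:
    """Hypothetical eval case for the acl_boundary category."""
    question: str
    persona: str
    expect: str  # "allow" or "deny"


def grade_acl(case: AclCase, answered: bool) -> bool:
    """Pass if the agent answered an 'allow' case or declined a 'deny' case."""
    return answered == (case.expect == "allow")


def grade_pair(allow: AclCase, deny: AclCase,
               answered_allow: bool, answered_deny: bool) -> bool:
    """A pair passes only if both halves pass, so leaks and over-refusals both fail."""
    if allow.question != deny.question:
        raise ValueError("a pair must share the same question")
    return grade_acl(allow, answered_allow) and grade_acl(deny, answered_deny)


question = "Are we going to raise another round of funding?"
allow = AclCase(question, "Mette Krarup", "allow")
deny = AclCase(question, "Frederik Kjær", "deny")

print(grade_pair(allow, deny, answered_allow=True, answered_deny=False))
```

Grading the pair as a unit catches a system that passes the deny half simply by refusing everyone.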

The biggest issue turned out to be a gap in the synthetic data. Claude had generated evals that could not actually be answered. The data needed to answer the question was in the company storyline and people directory, but had not actually been encoded in source records. For example, most people's titles were not accessible anywhere, so we ended up adding a new HRIS source that would include that kind of data.

The other major issues were under-escalation (the agent answered too quickly instead of digging deeper), entity confusion (there are two different Sophies in the dataset) and grading noise (the answer was actually fine, but the eval rubric was bad).
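A gap like this can be caught before any eval runs by checking that every fact an eval depends on appears somewhere in the source records. The sketch below assumes each eval lists its required facts as short strings and uses plain substring matching, which is a crude stand-in for a real check; the eval ids and the required_facts field are hypothetical.

```python
from typing import Dict, List


def unanswerable(evals: List[Dict], records: List[str]) -> List[str]:
    """Return ids of evals that need a fact found in no source record."""
    corpus = " ".join(text.lower() for text in records)
    return [
        e["id"]
        for e in evals
        if not all(fact.lower() in corpus for fact in e["required_facts"])
    ]


# Source records: message text only, taken from the fraud thread.
records = [
    "Monitoring just flagged a coordinated card-testing ring, "
    "dozens of small auths across new cards in minutes.",
    "Done. Loss is negligible, caught within minutes. Night and day vs last spring.",
]

evals = [
    {"id": "fraud-what", "required_facts": ["card-testing ring"]},
    # The title lives in the people directory, not in any source record.
    {"id": "title-sara", "required_facts": ["Deputy MLRO"]},
]

print(unanswerable(evals, records))
```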

Two copies of the same company

In addition to the eval setup, our other motivation was to create a good dataset for demos and testing our app. To achieve this we created a generator that creates two versions of the Saldra org when we deploy a new staging app (which we do for each new PR).

One version is called "Saldra ingested" and the other "Saldra fresh". They contain the same people, and fundamentally the same data, but in the ingested version all memories have already been ingested and processed. In the fresh version the connectors are all set up, but data ingestion is paused. This allows us to quickly test the app in a state full of data, while also testing our ingestion process itself (which takes at least 10 minutes for Saldra). Ingestion is also not deterministic since it requires an LLM to synthesize records into memories, so we can test both with and without determinism.

We have tried to keep the ingestion code for the synthetic dataset as similar as possible to our normal ingestion code, so that we can properly test whether a change breaks ingestion. This sometimes creates challenges, because the synthetic data connectors are not real connectors with proper authentication and so on.

When demoing the system, we can use a PR app and demo the app as one of the two orgs depending on what we want to show. And it allows us to quickly switch between different users in the org to show off different interactions.

Building a self-improving system

Inspired by the work by Karpathy on auto-research, we set up our own system to do the same. We gave Claude Code access to our synthetic dataset, and we gave it the ability to run evals on its own. Then we told it to run the evals, figure out where our memory system falls short, propose a fix, test it and if it works produce a PR that we can review.

After each research run we ask Claude to record its notes in a Markdown file so it can review them in future sessions. Sometimes we just let it loose, and sometimes we ask it to focus on accuracy, performance or a specific sub-problem that we feel could be valuable to think about. We might also feed it high-level ideas that we want to investigate further.

Claude is smart enough to run each eval multiple times, to build scripts for itself to make ingestion faster when it makes sense and to reason about what kinds of improvements are generally useful. We try to constrain it to think subtractively instead of just making prompts longer and longer.
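Running each eval several times matters because a single pass or fail from a non-deterministic system says little. A minimal sketch of the idea, assuming an eval can be wrapped in a zero-argument callable that returns True on a pass (pass_rate and the stubbed outcomes are hypothetical):

```python
from typing import Callable


def pass_rate(run_eval: Callable[[], bool], trials: int = 5) -> float:
    """Run one eval several times and return the fraction of passing runs."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    passes = sum(1 for _ in range(trials) if run_eval())
    return passes / trials


# Stand-in for a flaky eval: a fixed sequence of outcomes instead of a real agent call.
outcomes = iter([True, False, True, True, False])
rate = pass_rate(lambda: next(outcomes), trials=5)
print(rate)
```

Comparing pass rates rather than single outcomes makes it less likely that noise is mistaken for an improvement.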

In one recent run, it spent many hours, many eval runs and many tokens before concluding that none of its hypotheses actually worked. Previously this kind of work would have taken an engineer weeks. Now it happens while we do other things.

In another recent run it looked for improvements to our entity resolution code, finding a new idea that significantly improved our eval performance across all entity resolution evals, without regression on any other evals. After merging that in we also asked Claude to come up with a set of even harder entity resolution evals that would pressure test our new solution. In this way the synthetic Saldra dataset is not static, but evolving with our product.
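The "better on the target, no regression elsewhere" rule can be written as a small gate over per-category scores. This is a sketch under assumptions: the category keys and all scores below are invented for illustration, and regressions and accept are hypothetical helper names.

```python
from typing import Dict, List


def regressions(before: Dict[str, float], after: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """Categories whose score dropped by more than the tolerance."""
    return sorted(
        cat for cat, score in before.items()
        if after.get(cat, 0.0) < score - tolerance
    )


def accept(before: Dict[str, float], after: Dict[str, float], target: str) -> bool:
    """Accept a change only if the target improved and nothing else regressed."""
    improved = after.get(target, 0.0) > before[target]
    return improved and not regressions(before, after)


# Made-up pass rates per category, before and after a candidate change.
before = {"entity_resolution": 0.60, "acl_boundary": 0.90, "temporal": 0.70}
after_good = {"entity_resolution": 0.80, "acl_boundary": 0.90, "temporal": 0.70}
after_bad = {"entity_resolution": 0.80, "acl_boundary": 0.85, "temporal": 0.70}

print(accept(before, after_good, "entity_resolution"))
print(regressions(before, after_bad))
```

A non-zero tolerance would be a reasonable choice when scores come from noisy, repeated runs.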

What comes next?

One obvious limitation of our synthetic dataset is that it is only as good as our ability to make it realistic. If the data we generated is less complex than real data, we underestimate the real world and optimize for the wrong metric. One way to address this over time is to keep reworking the dataset as we learn more. Luckily, LLMs are great at parsing large datasets to find structure and interesting bits.

Another challenge is the interactive nature of the Agentwork product. When you ask the Agentwork agent a question, it might realise that the answer is not to be found in the memory system, but that it knows who to ask. In that case, Agentwork can (with your permission) go out and talk to team members to figure out the right answer. Modelling this in our eval set requires a slightly more involved simulation of those interactions. It is not super complicated to imagine how to solve this, and hopefully it is also not the last exciting feature we ship.