Over the last few years, I’ve spent a lot of time at the intersection of complex business data and custom business software. Building version one of the Icehouse Ventures investor portal meant taking financial data that had lived in Excel and Microsoft Access for years and turning it into a database that could power real software.
The gap between a random spreadsheet and “structured data” is bigger than most people realise. Excel is a magical tool for financial analysis but it’s not really designed to be a database, CRM or layout tool. Yet somehow all of use have reached for Excel (or Google Sheets) as a project management timeline, customer list, RSVP tracker or product requirements list. It’s fast, has just enough structure (rows and columns) and just enough flexibility (colours, borders and headings).
The problems I’ve run into most when being handed someone else’s spreadsheet aren’t really complexity problems. They’re habits and hygiene. Practices that are completely sensible to the person who built the spreadsheet, but completely invisible to the poor analyst, engineeer or agent that has to process it. Here are the common spreadsheet habits that have caused me the most pain.
Colour as Data
It’s the most natural thing in the world to slap colour on a spreadsheet when you are in a hurry. Red text for overdue. Green cells for approved. Yellow for someone-needs-to-follow-up. It looks clean. It communicates at a glance. The problem is that colour lives in the formatting layer, not the data layer. When someone has to export your spreadsheet to CSV, or push it through an API, or feed it to an AI agent the colour is gone. Whatever you were communicating with it disappears.
The simple fix is an explicit column. If red means “Overdue”, add a Status column and write “Overdue”. It’s a bit more typing upfront and a lot less confusion downstream.
Layout as Meaning
This one is more subtle. You have a list of transactions and instead of repeating “Q1 2024” on every row, you put it as a bold sub-heading (with a nice horizonal row underline to break things up) above the relevant rows. Visually? Elegant. Semantically? A problem. Here’s a test: if someone innocently re-sorted your spreadsheet alphabetically before it was handed to your tech team, would the sheet lose information? If the answer is yes, your data is broken.
Meaning has to survive re-sorting. A nice subheading that provides the only context for all the grouped rows below it doesn’t survive. The rows scatter, the heading stays put, and the relationship is gone. (And don’t get me started on merged cells and indentations across columns.) The fix again is a simple additional column. Repeat the category value on every row. It looks redundant. But it isn’t. That’s just what tidy data looks like.
Hiding Things
Excel’s “hide” feature is genuinely useful for skimming large data and non-destructive analysis. (Although, I’d rather people used Pivot Tables and the Auto-filter tool). But for data handoffs, hidden rows and columns are a nasty trap. When you export a spreadsheet with hidden rows or columns to CSV, everything comes across whether hidden or not. The recipient (me, your SaaS or your agent) has no way of knowing those rows exist unless they go looking in the original XLS for something that “isn’t there”. Automated pipelines certainly won’t and I almost imported several thousand hidden rows this week. This is particularly unpleasant because the bug is silent. The data looks clean, processes without error, and yet produces wrong numbers. Simple fix: Before handing off data, delete hidden rows and columns rather than hiding them. Or unhide everything and make a conscious decision about whether those rows belong.
Mixing Tags and Categories
These two concepts are easy to confuse and important to keep separate. A category is mutually exclusive and exhaustive. Every item belongs to exactly one. “Investment stage” is a category, a startup company is either Seed Stage or Series A, not both. A “bridge or extension” could be a valid modifier, but not an excuse to put the same company in to stages at once.
By contrast, a tag is optional and multiple. An item might have none, one, or many. “Themes we’re tracking” is a tag, a company could be operating at the intersection of AI, climate, and fintech simultaneously. The tell is comma-separated values inside a single cell: “AI, Climate, Fintech”. That’s a one-to-many relationship crammed into a one-to-one field. Perfectly readable as text, but risky as a data structure. If you find yourself reaching for the comma key inside a cell, that’s usually a sign the dataset has outgrown what a flat spreadsheet can do cleanly.
The fix: Categories get their own column. Tags either get individual columns with yes/no inside or a cell with commas and some coherence to avoid entropy or realistically the data moves to a proper database with a proper join to the pick-list.
Invisible Duplicates
Not all duplicates are errors. The same investor appearing twice in a list might be intentional, maybe a married couple sharing an email inbox, or two different people who happen to have the same name. The problem is when duplicates are ambiguous and there’s no way to tell. De-duplication is one of the most time-consuming parts of any data migration, and it’s made much worse when rows don’t carry any tie-breaker information. The fix is an additional identifier like email address, LinkedIn URL, phone number, website, company registration number. These serve two purposes: they let you merge genuinely duplicate records with more confidence, and they make intentional “twins but not duplicates” (same name, different person) immediately legible.
Seeing the patterns
Most of these problems have the same root cause: encoding information in ways that make sense to human eyes but are invisible to machines, databases, and agents. Colour, layout, hiden rows, and comma-delimited multi-values are all presentation tricks. They’re not “data”.
Good business data is explicit, survives re-sorting, and exporting, lives in labelled columns, and means one thing per cell. The closer your spreadsheet looks to a database table, the less translation work stands between your data and something useful. And in a world where AI agents, APIs and automated workflows are increasingly the ones doing that translation, the gap between “looks right” and “is right” matters more than ever.