Data Articles - Peter J Thomson

Over the last few years, I’ve spent a lot of time at the intersection of complex business data and custom business software. Building version one of the Icehouse Ventures investor portal meant taking financial data that had lived in Excel and Microsoft Access for years and turning it into a database that could power real software.

The gap between a random spreadsheet and “structured data” is bigger than most people realise. Excel is a magical tool for financial analysis, but it’s not really designed to be a database, CRM or layout tool. Yet somehow all of us have reached for Excel (or Google Sheets) as a project management timeline, customer list, RSVP tracker or product requirements list. It’s fast, has just enough structure (rows and columns) and just enough flexibility (colours, borders and headings).

The problems I’ve run into most when being handed someone else’s spreadsheet aren’t really complexity problems. They’re habits and hygiene: practices that are completely sensible to the person who built the spreadsheet, but completely invisible to the poor analyst, engineer or agent that has to process it. Here are the common spreadsheet habits that have caused me the most pain.

Colour as Data

It’s the most natural thing in the world to slap colour on a spreadsheet when you are in a hurry. Red text for overdue. Green cells for approved. Yellow for someone-needs-to-follow-up. It looks clean. It communicates at a glance. The problem is that colour lives in the formatting layer, not the data layer. When someone has to export your spreadsheet to CSV, or push it through an API, or feed it to an AI agent, the colour is gone. Whatever you were communicating with it disappears.

The simple fix is an explicit column. If red means “Overdue”, add a Status column and write “Overdue”. It’s a bit more typing upfront and a lot less confusion downstream.
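As a rough sketch of why the explicit column pays off, here is what the recipient can do once the status is written down. The invoice data, the column names and the rows_with_status helper are all made up for illustration.

```python
import csv
import io

# Hypothetical CSV export. The Status column carries what red, green and
# yellow formatting used to carry in the original sheet.
EXPORT = """Invoice,Amount,Status
INV-001,1200,Overdue
INV-002,450,Approved
INV-003,980,Follow up
INV-004,300,Overdue
"""


def rows_with_status(csv_text, status):
    """Return the rows whose Status column equals the given value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["Status"] == status]


overdue = rows_with_status(EXPORT, "Overdue")
print([row["Invoice"] for row in overdue])
```

With colour alone, there would be nothing in the CSV for that filter to read.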

Layout as Meaning

This one is more subtle. You have a list of transactions and instead of repeating “Q1 2024” on every row, you put it as a bold sub-heading (with a nice horizontal row underline to break things up) above the relevant rows. Visually? Elegant. Semantically? A problem. Here’s a test: if someone innocently re-sorted your spreadsheet alphabetically before it was handed to your tech team, would the sheet lose information? If the answer is yes, your data is broken.

Meaning has to survive re-sorting. A nice subheading that provides the only context for all the grouped rows below it doesn’t survive. The rows scatter, the heading stays put, and the relationship is gone. (And don’t get me started on merged cells and indentations across columns.) The fix again is a simple additional column. Repeat the category value on every row. It looks redundant. But it isn’t. That’s just what tidy data looks like.
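If you are on the receiving end of a sheet like this, the repair is usually a "fill down" pass. The sketch below assumes a sub-heading is any row where only the first cell is filled in. That rule, the sample rows and the flatten_subheadings name are my own illustration, and real sheets will need their own test for what counts as a heading.

```python
def flatten_subheadings(rows):
    """Copy each sub-heading onto the detail rows beneath it.

    Assumption: a sub-heading is a row whose first cell has text and whose
    remaining cells are all empty.
    """
    tidy = []
    current_period = None
    for row in rows:
        first, rest = row[0], row[1:]
        if first and not any(rest):
            current_period = first  # remember the heading, drop the row
            continue
        tidy.append([current_period, *row])
    return tidy


# Hypothetical sheet: period headings above company / date / amount rows.
SHEET = [
    ["Q1 2024", "", ""],
    ["Acme", "2024-01-15", "5000"],
    ["Zeta", "2024-02-20", "1200"],
    ["Q2 2024", "", ""],
    ["Beta", "2024-04-03", "800"],
]

tidy = flatten_subheadings(SHEET)

# Re-sort alphabetically by company: every row still knows its period.
resorted = sorted(tidy, key=lambda r: r[1])
for row in resorted:
    print(row)
```

After the fill-down, the alphabetical sort from the test above no longer destroys anything.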

Hiding Things

Excel’s “hide” feature is genuinely useful for skimming large data and non-destructive analysis. (Although I’d rather people used Pivot Tables and the Auto-filter tool.) But for data handoffs, hidden rows and columns are a nasty trap. When you export a spreadsheet with hidden rows or columns to CSV, everything comes across whether hidden or not. The recipient (me, your SaaS or your agent) has no way of knowing those rows were meant to be hidden unless they go back to the original XLS and look for something that “isn’t there”. Automated pipelines certainly won’t, and I almost imported several thousand hidden rows this week. This is particularly unpleasant because the bug is silent. The data looks clean, processes without error, and yet produces wrong numbers. Simple fix: before handing off data, delete hidden rows and columns rather than hiding them. Or unhide everything and make a conscious decision about whether those rows belong.
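For anyone who wants a pipeline to catch this, here is a minimal sketch. It relies on the fact that the modern .xlsx format (not the older binary .xls) stores each worksheet as XML in which a hidden row carries a hidden attribute. The sample XML is hand-written for illustration and hidden_row_numbers is a hypothetical helper. In a real workbook you would first pull the worksheet XML out of the .xlsx zip archive (typically a path like xl/worksheets/sheet1.xml, though I would check rather than assume).

```python
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hand-written stand-in for a worksheet part; row 3 is hidden.
SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1"><c r="A1" t="inlineStr"><is><t>Investor</t></is></c></row>
    <row r="2"><c r="A2" t="inlineStr"><is><t>Alice</t></is></c></row>
    <row r="3" hidden="1"><c r="A3" t="inlineStr"><is><t>Bob</t></is></c></row>
    <row r="4"><c r="A4" t="inlineStr"><is><t>Carol</t></is></c></row>
  </sheetData>
</worksheet>"""


def hidden_row_numbers(sheet_xml):
    """Return the row numbers flagged as hidden in one worksheet's XML."""
    root = ET.fromstring(sheet_xml)
    return [
        int(row.get("r"))
        for row in root.findall("m:sheetData/m:row", NS)
        if row.get("hidden") in ("1", "true")
    ]


print(hidden_row_numbers(SHEET_XML))
```

A check like this will not decide whether the rows belong, but it turns a silent bug into a question someone has to answer.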

Mixing Tags and Categories

These two concepts are easy to confuse and important to keep separate. A category is mutually exclusive and exhaustive. Every item belongs to exactly one. “Investment stage” is a category: a startup company is either Seed Stage or Series A, not both. A “bridge or extension” could be a valid modifier, but not an excuse to put the same company in two stages at once.

By contrast, a tag is optional and multiple. An item might have none, one, or many. “Themes we’re tracking” is a tag: a company could be operating at the intersection of AI, climate, and fintech simultaneously. The tell is comma-separated values inside a single cell: “AI, Climate, Fintech”. That’s a one-to-many relationship crammed into a one-to-one field. Perfectly readable as text, but risky as a data structure. If you find yourself reaching for the comma key inside a cell, that’s usually a sign the dataset has outgrown what a flat spreadsheet can do cleanly.

The fix: Categories get their own column. Tags get either individual yes/no columns or a single comma-separated cell with a consistent vocabulary to keep entropy in check. Realistically, once the tags start to matter, the data moves to a proper database with a proper join to the pick-list.
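To make the "proper join" idea concrete, here is a small sketch that unpacks a comma-separated tag cell into one row per company and tag, which is roughly the shape a join table would take. The company data and both helper names are hypothetical, and lower-casing the tags is just one possible way to keep the vocabulary consistent.

```python
def split_tags(cell):
    """Split a comma-separated cell into clean, de-duplicated tags."""
    tags = (raw.strip().lower() for raw in cell.split(","))
    # dict.fromkeys drops repeats while keeping the original order.
    return list(dict.fromkeys(tag for tag in tags if tag))


def to_join_rows(companies):
    """Return one (company, tag) pair per tag; the stage stays a column."""
    return [
        (name, tag)
        for name, _stage, tag_cell in companies
        for tag in split_tags(tag_cell)
    ]


# Hypothetical rows: name, investment stage (category), themes (tags).
COMPANIES = [
    ("Acme", "Seed", "AI, Climate, Fintech"),
    ("Beta", "Series A", "climate"),
    ("Gamma", "Seed", ""),
]

company_tags = to_join_rows(COMPANIES)
for pair in company_tags:
    print(pair)
```

Note that the stage is left alone as a single value per company, while a company with no tags simply produces no rows.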

Invisible Duplicates

Not all duplicates are errors. The same investor appearing twice in a list might be intentional, maybe a married couple sharing an email inbox, or two different people who happen to have the same name. The problem is when duplicates are ambiguous and there’s no way to tell. De-duplication is one of the most time-consuming parts of any data migration, and it’s made much worse when rows don’t carry any tie-breaker information. The fix is an additional identifier like email address, LinkedIn URL, phone number, website, company registration number. These serve two purposes: they let you merge genuinely duplicate records with more confidence, and they make intentional “twins but not duplicates” (same name, different person) immediately legible.
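Here is a rough sketch of how a tie-breaker column changes the de-duplication job. It groups records by name and uses email as the second identifier. The records, the three labels and the classify_same_name helper are invented for illustration, and a real migration would still want a human to review anything flagged.

```python
from collections import defaultdict


def classify_same_name(records):
    """Group records by name, then use email as the tie-breaker."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"].strip().lower()].append(rec)

    result = {}
    for name, group in by_name.items():
        if len(group) < 2:
            continue  # only one record with this name, nothing to resolve
        emails = {rec["email"].strip().lower() for rec in group}
        if "" in emails:
            result[name] = "ambiguous"  # a tie-breaker is missing
        elif len(emails) == 1:
            result[name] = "likely duplicate"
        else:
            result[name] = "likely different people"
    return result


# Hypothetical investor list.
RECORDS = [
    {"name": "Sam Lee", "email": "sam@example.com"},
    {"name": "sam lee", "email": "SAM@example.com"},
    {"name": "Alex Chen", "email": "alex.chen@example.com"},
    {"name": "Alex Chen", "email": "achen@example.org"},
    {"name": "Pat Singh", "email": ""},
    {"name": "Pat Singh", "email": "pat@example.com"},
    {"name": "Jo Park", "email": "jo@example.com"},
]

verdicts = classify_same_name(RECORDS)
for name, verdict in verdicts.items():
    print(name, "->", verdict)
```

The "ambiguous" bucket is the expensive one, and it only shrinks when the source sheet carries a second identifier.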

Seeing the patterns

Most of these problems have the same root cause: encoding information in ways that make sense to human eyes but are invisible to machines, databases, and agents. Colour, layout, hidden rows, and comma-delimited multi-values are all presentation tricks. They’re not “data”.

Good business data is explicit, survives re-sorting and exporting, lives in labelled columns, and means one thing per cell. The closer your spreadsheet looks to a database table, the less translation work stands between your data and something useful. And in a world where AI agents, APIs and automated workflows are increasingly the ones doing that translation, the gap between “looks right” and “is right” matters more than ever.

Data Driven: Harnessing Data and AI to Reinvent Customer Engagement by Tom Chavez, Chris O’Hara and Vivek Vaidya is one of those rare books that you can tell has been written by real-world practitioners (not consultants or theorists). It’s the best book that I’ve read so far on marketing technology and customer experience.

The three authors have all been deeply immersed in projects using big data to personalise and optimise marketing at scale:

Tom Chavez was a co-founder of marketing technology platform Krux before it sold to Salesforce. He’s now a co-founder of startup studio Super{Set}.
Chris O’Hara was responsible for marketing at Krux and now leads marketing for Salesforce Marketing Cloud.
Vivek Vaidya was a co-founder and the CTO at Krux and is now a co-founder of Super{Set}.

The book was published in 2018 and is the first time I’ve seen someone really tell the truth about the nitty-gritty of delivering data-driven marketing at scale. The authors focus on the practical ways that data flows through a modern business.

The case studies aren’t just abstract examples, but real projects that Tom, Chris and Vivek have worked on with clients.

Personally, I’ve lived through the complexities of integrating CRM systems with web analytics, customer loyalty data and all sorts of other complex data sets. The real value in bringing these things all together isn’t just in marketing, it’s transformative for the whole business.

The book is so good that I thought the best way to review it was to pull out some samples that summarise the key concepts in the authors’ own words (I’ve lightly edited the samples to fit the blog format). To really get the value of the content you need to read the examples and context, so I recommend you grab the hardcover, Kindle or audiobook.

Principles of customer data

There are three foundational principles described in the book:

Embrace the always changing “human becoming”. Let go of static concepts like audience segments and inflexible stages of the marketing funnel. Your customers and audiences are leaving behind breadcrumbs that can help you engage with them in an ever-evolving, multifaceted way.
You have more data than you think … and you think you have more data than you actually do. Whether you believe you’re sitting on a mountain of data, or you think you’re hopelessly impoverished, the truth is somewhere in the middle. Build an inventory of your data assets, the data that’s hiding in plain sight as well as the missing (but gatherable) data within reach.
There is no single truth, just more and less useful theories. The data game isn’t about being right or wrong; it’s about being more right than wrong as you compile a continually evolving theory of the consumer. Make mistakes; learn fast.

How to use data

After unifying your data-in and your data-out in a single platform, apply the five sources of data-driven power:

Segmentation. Let a thousand flowers bloom as you define and continually re-define your customers and audiences.
Activation. Target and measure the audiences you want to reach across all the available channels and platforms. Know exactly where “your people” are and how to reach them.
Personalization. Build on segmentation and activation to achieve personalization in the broadest sense – cooler content, more relevant commerce, smarter selling and servicing of your customers.
Optimization. Adjust the velocity, pacing, reach, and frequency of your messages to achieve maximum efficiency in your marketing spend.
Insights. Use the output from your data systems to accumulate richer insights into your customers. Put this data back into your marketing engine to increase precision, effectiveness, and efficiency.

Avoid data pitfalls

The book warns readers to avoid five pitfalls when implementing a data-driven strategy:

Absence of clear goals for data transformation: Technology itself is not a panacea. Leverage it at the right time in concert with the right people, woven into the right process.
Lack of a formal owner: Empower a leader and a team to get the job done.
Operating in a silo: Data transformation entails coordination of complexity. Break the silos and pre-empt the tribalism that can stymie the best-laid plans.
Boiling the ocean: Celebrate small wins to build organizational momentum. Leverage measurable results from early-stage initiatives to fund future steps.
Failure to anticipate risk: Remain zen-like but relentless when the inevitable glitches occur. Data pipelines will break; naysayers will grumble. Persuade constituents of the career and company benefits to them of getting behind an all-in data strategy.

Data layers

When establishing a data strategy for your organization, use the three-layer model to chart the who-what-when-where:

Know: Invest in data management to know your customers in a dynamic, 360-degree, real-time way. Reach them intelligently and with precision across every channel.
Personalize: Extend your brand and grow revenue by giving each customer more of what they want and less of what they don’t want. Use AI and machine learning to personalize all engagement—advertising, content, commerce, sales, and service.
Orchestrate: Reach your customers at just the right time and in the right place by mapping the journey they take with your brand. Measure the effectiveness of different touchpoints in varying sequences and combinations. Intentionally craft journeys that lead to the engagement you seek.

The authors also include guidance on applying artificial intelligence to marketing data and finish the book with several forecasts for future trends across customer experience, data and technology.

The book is available in hardcover, Kindle and audiobook. The audiobook is read by LJ Ganser who was a voice actor in Grand Theft Auto and has recorded almost 50 audiobooks. His voice is confident and clear. He really takes you along for the ride with Tom, Chris and Vivek for some of their client meetings in the early days of Krux.

Overall, a highly recommended book for anyone who wants to understand how data and technology are changing marketing and customer experiences.

Tag: Data

The Five Deadly Sins of Business Data in Excel