The value of analyzing dirty data.

Imagine you are an investor sitting on a data set of transactions with the following 6 out of 20 attributes: apps, fintech, financial services, internet, information technology, mobile apps. And yes, they’re either-or in your table.

What can you make of it?

Well, a first impulse is to say that the labels are vague: what the heck is an internet company today? If you run a hair salon that takes bookings online, are you an internet company? I've been building software-based businesses for almost 20 years now and still have a hard time describing what an internet technology company is. Wikipedia is rather vague too.

A second would be to notice the overlapping pairs - internet & information technology, apps & mobile apps, or fintech & financial services. Each pair means roughly the same thing, so the labels are redundant.

A third would be to stop here and do nothing further. The data is dirty and, unless you can clean it up, it is useless.

And that’s it, right? Wrong. A fourth answer is to crunch it in Excel, analyze it and draw conclusions from it. 🤷‍♂️
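The same crunching works outside Excel too. Below is a minimal Python sketch, not a definitive method. The transaction list is made up for illustration, it assumes "either-or" means one label per transaction, and the `MERGED` grouping of the overlapping pairs is an assumption of the sketch rather than something the data provides.

```python
from collections import Counter

# Hypothetical sample, invented for illustration: one label per transaction.
transactions = [
    "apps", "fintech", "internet", "mobile apps", "financial services",
    "internet", "information technology", "fintech", "apps", "internet",
]

# Assumed grouping of the overlapping pairs (a judgment call, not given by the data).
MERGED = {
    "apps": "apps",
    "mobile apps": "apps",
    "fintech": "finance",
    "financial services": "finance",
    "internet": "internet/IT",
    "information technology": "internet/IT",
}


def count_raw(labels):
    """Count transactions per label exactly as they appear in the table."""
    return Counter(labels)


def count_merged(labels, mapping):
    """Count transactions per group; labels missing from the mapping stay as-is."""
    return Counter(mapping.get(label, label) for label in labels)


raw_counts = count_raw(transactions)
merged_counts = count_merged(transactions, MERGED)

# Print the merged groups, largest first.
for label, n in merged_counts.most_common():
    print(f"{label}: {n}")
```

Counting both ways keeps the dirty labels visible while still letting you compare the broader groups.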
