Are you doing business in 107 countries? Or 7? Data cleansing matters in predictive analytics

Data Hygiene for Predictive Analytics and AI

On a recent assignment where we set up lead generation in Salesforce, we used an existing customer database to build a statistical model to score a leads database.

The client does business in 7 countries. Or so they said, and I believed them.

But they quickly added a caveat: “no one has looked at our database in a while.”

First, we looked at their billing country field. This was an open text field in their Salesforce system that could be edited by just about anyone. What we found was amazing. There were so many variations for each country that unique values quickly grew to 107. Misspellings, case differences, punctuation and abbreviations all added up to create many versions of the same country!

Our first task was to identify the obvious countries and group them. Then we corrected the remaining values. This took 107 distinct values down to the correct 7 countries.
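The two-step cleanup described above (group the obvious variants, then review what is left) can be sketched roughly as follows. This is a minimal illustration, not the process we actually ran: the alias table, the sample values, and the helper names `clean_key` and `normalize_country` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical alias table: cleaned variant -> canonical country name.
ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "great britain": "United Kingdom",
    "canada": "Canada",
}

def clean_key(raw):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", raw.lower())
    return " ".join(text.split())

def normalize_country(raw):
    """Return the canonical name, or None if the value needs manual review."""
    return ALIASES.get(clean_key(raw))

# Made-up sample of free-text entries, including one misspelling.
raw_values = ["USA", "U.S.A.", "usa ", "United States", "U.K.", "Canada", "Cananda"]

normalized = [normalize_country(v) for v in raw_values]
counts = Counter(n for n in normalized if n is not None)
needs_review = [v for v, n in zip(raw_values, normalized) if n is None]

print(counts)        # obvious variants, now grouped
print(needs_review)  # leftovers to correct by inspection
```

Anything the alias table does not recognize is returned as `None` rather than guessed at, so misspellings end up in a short review list instead of being silently mapped to the wrong country.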

How to prevent this from happening again? Our client locked the country field, and set up a drop-down menu of countries.

Now you might think: they worked with that situation for a while, and it didn’t affect Salesforce users too much. So why was it a problem?

Well, clean data is necessary for sales intelligence, especially if you plan to model or score your data.

Even a simple field can pose a challenge.

A field should not have multiple values that mean the same thing. For example, the state field should not contain “Illinois,” “IL” and “Ill.” for the same state. If it does, one of the values can be missed when grouping states for analysis. Important data would be left out, and the analytic results would be misleading.
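To make the risk concrete, here is a small sketch of how a group-by total changes once the variants are merged. The order amounts, the alias table, and the function names are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical alias table for one state; a real one would cover all states.
STATE_ALIASES = {"illinois": "IL", "il": "IL", "ill": "IL"}

def canonical_state(raw):
    """Map known variants to one code; leave unknown values unchanged."""
    key = raw.strip().strip(".").lower()
    return STATE_ALIASES.get(key, raw.strip())

def totals_by(orders, key_fn):
    """Sum order amounts per state, using key_fn to choose the grouping key."""
    totals = defaultdict(int)
    for state, amount in orders:
        totals[key_fn(state)] += amount
    return dict(totals)

# Made-up orders: three spellings of the same state, plus one other state.
orders = [("Illinois", 500), ("IL", 300), ("Ill.", 200), ("WI", 100)]

raw_totals = totals_by(orders, lambda s: s)      # groups on the text as entered
clean_totals = totals_by(orders, canonical_state)

print(raw_totals["IL"])    # 300: only the rows typed exactly as "IL"
print(clean_totals["IL"])  # 1000: all three spellings combined
```

An analyst filtering on "IL" in the raw data would see less than a third of the state's total in this example, with nothing in the output to signal that rows were missed.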

There may be other fields with which data can be cross-referenced.

If there is a shipping country and a billing country, chances are both should be the same. So correct the data, then carry over the values into the other field so you have fewer blank and invalid rows.
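One cautious way to carry values across is to fill a field only when it is blank, and to flag rows where the two fields disagree instead of overwriting either one. The sketch below assumes the values have already been cleaned; the function name `reconcile` and the sample values are hypothetical.

```python
def reconcile(billing, shipping):
    """Fill a blank country from the other field.

    Returns (billing, shipping, conflict). conflict is True when both
    fields are filled but differ, so the row can be reviewed by a person.
    """
    billing = (billing or "").strip()
    shipping = (shipping or "").strip()
    if billing and not shipping:
        shipping = billing
    elif shipping and not billing:
        billing = shipping
    conflict = bool(billing and shipping and billing != shipping)
    return billing, shipping, conflict

print(reconcile("Canada", ""))        # shipping filled from billing
print(reconcile(None, "Mexico"))      # billing filled from shipping
print(reconcile("Canada", "Mexico"))  # both kept, flagged as a conflict
```

Keeping conflicts separate matters because "chances are" is not a guarantee: some customers legitimately bill to one country and ship to another.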

The poor quality of data can be a clue to processes that are broken.

When you see a data problem, it’s time to be a detective. Explore whether the same problematic logic is being used elsewhere, whether an algorithm was coded incorrectly in another process, or whether the wrong data was fed in. Assume that the problem might lie in more than one location.

Instill a proactive, ongoing culture that data cleanliness is everyone’s responsibility.

People who create, process, and consume data all have a responsibility to keep an eye out for data quality. Like the TSA sign says in the airport: “If you see something, say something.”

If something doesn’t look right, escalate it to the system admin, IT, or the analytics team for review.


Conclusion: Do not assume anything “should be obvious,” and do not lay blame for bad data. As you can see from the simple example above about the country field, even the most basic data hygiene should be approached with respect. As my friend’s father, an electrician, said: “The day I stop fearing electricity is the day I will stop working.”
