Are You Doing Business in 107 Countries? Or 7? Data Hygiene Matters in Predictive Analytics

Valgen Sales and Marketing Analytics for Salesforce

On a recent assignment where we set up lead generation in Salesforce, we used an existing customer database to build a statistical model to score a leads database.

The client does business in 7 countries. Or so they said, and I believed them.

But they quickly added a caveat: “no one has looked at our database in a while.”

First, we looked at their billing country field. This was an open text field in their Salesforce system that could be edited by just about anyone. What we found was amazing. There were so many variations for each country that unique values quickly grew to 107. Misspellings, case differences, punctuation and abbreviations all added up to create many versions of the same country!

Our first task was to identify the obvious countries and group them. Then we corrected the remaining data. This took the 107 unique values down to the correct 7 countries.
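The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the process we ran for the client: the alias table, the country names in it, and the function names are all hypothetical, and a real table would be built by reviewing the actual values in the field. Exact matches after cleanup are grouped automatically, while likely misspellings are only suggested so a person can confirm them.

```python
import difflib
import re
from collections import Counter

# Hypothetical alias table: cleaned-up variant -> canonical country name.
ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "great britain": "United Kingdom",
    "canada": "Canada",
}


def normalize_key(raw):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", raw.lower())
    return " ".join(text.split())


def canonical_country(raw):
    """Return the canonical name, or None if the value needs review."""
    if raw is None:
        return None
    return ALIASES.get(normalize_key(raw))


def suggest_country(raw, cutoff=0.8):
    """Suggest a canonical name for a likely misspelling, or None."""
    matches = difflib.get_close_matches(
        normalize_key(raw), list(ALIASES), n=1, cutoff=cutoff
    )
    return ALIASES[matches[0]] if matches else None


def summarize(values):
    """Count values per canonical country and collect unresolved ones."""
    cleaned, unresolved = Counter(), Counter()
    for value in values:
        country = canonical_country(value)
        if country is None:
            unresolved[value] += 1
        else:
            cleaned[country] += 1
    return cleaned, unresolved


if __name__ == "__main__":
    sample = ["USA", "U.S.A.", "united states", "U.K.", "Untied States"]
    cleaned, unresolved = summarize(sample)
    print(cleaned)
    for value in unresolved:
        print(value, "-> suggestion:", suggest_country(value))
```

Keeping the fuzzy suggestions separate from the automatic grouping is a deliberate choice in this sketch: a wrong automatic correction would silently move records into the wrong country.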

How to prevent this from happening again? Our client locked the country field, and set up a drop-down menu of countries.

Now you might think: they worked with that situation for a while, and it didn’t affect sales users too much. So why was it a problem?

Well, clean data is necessary for sales intelligence, especially if you plan to model or score your data.

Even a simple field can pose a challenge. 

Without consistent coding guidance, it’s hard to create a segmentation or master filter at the top level before analyzing the data. In this case, it would not have been wise to mix multiple countries within one analysis.

There may be other fields with which data can be cross-referenced.

If there is a shipping country and a billing country, chances are that both ought to be the same. So correct the data, then carry over the values into the other field so you have fewer blank and invalid rows.
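As a rough sketch of that carry-over step, assuming each record is a dictionary with hypothetical `billing_country` and `shipping_country` keys and that a set of valid country names already exists, the logic might look like the following. Because "both ought to be the same" is a likelihood rather than a rule, the sketch only fills a blank or invalid value and leaves two valid but different values alone.

```python
def fill_missing_country(record, valid_countries):
    """Return a copy of record with a blank or invalid country field
    filled from the other country field, when that one is valid.

    The key names are hypothetical; adjust them to the real schema.
    """
    billing = record.get("billing_country")
    shipping = record.get("shipping_country")
    billing_ok = billing in valid_countries
    shipping_ok = shipping in valid_countries

    updated = dict(record)  # do not modify the caller's record
    if billing_ok and not shipping_ok:
        updated["shipping_country"] = billing
    elif shipping_ok and not billing_ok:
        updated["billing_country"] = shipping
    # If both are valid (even if different) or both are invalid,
    # the record is left unchanged for a person to review.
    return updated


if __name__ == "__main__":
    valid = {"United States", "Canada"}
    row = {"billing_country": "Canada", "shipping_country": ""}
    print(fill_missing_country(row, valid))
```

Run this only after the country values themselves have been corrected, so that an invalid spelling is never copied into the other field.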

The state of data quality in one field can be a clue as to which other fields could be a problem.

Even if a field is not used in modeling, always keep in mind how its data was created, when it is updated and by whom, and what procedures are used to correct bad data.

Proactively address quality and instill the ongoing practice of treating data cleanliness as everyone’s responsibility.

Have a method to collect feedback and incorporate it. This way, when you are ready to perform analyses, there are fewer surprises with data quality.


Parting thoughts: Do not assume anything “should be obvious” or lay blame for bad data. As you can see from the simple example above, all businesses have to approach data hygiene with care, caution and respect. As my friend’s father, an electrician, said: “The day I stop fearing electricity is the day I will stop working.”

