Spring Cleaning: Data hygiene tips that keep your sales data always ready for analysis

Predictive analytics needs a foundation of clean data. Here are top tips from our most recent lead gen implementation on Salesforce.com. You can use these immediately in any environment:

  1. Address standardization and change of address. Typically up to 30% of records lack complete address information. This affects both deliverability and duplicate search. If you have not done so in the past 3 years, run a National Change of Address (NCOA) process to get your customers’ new addresses. Stay up to date on your customers, because 10% of businesses move each year. When an address change swaps a street address for a PO Box in the same city, retain both: one for mailing and one for secondary validation.
  2. Dupes are the silent killer. Because no single definition of a duplicate fits every case, de-dupe the files using a variety of match logic. You can try loose match logic (fewer criteria, which flags more duplicates) or tight match logic (more criteria, which flags fewer duplicates). Take address elements into account, but use transactional information to determine which record to keep or drop. For example, between duplicates you might want to merge a record with multiple contacts into the record with the largest sales amount or the longest time on file. Run dupes within your accounts, as well as against other sources. If you see the same record in two tables, delete one, or clearly mark why both exist and set a shelf life after which one expires.
  3. Fix bad data entry practices. For the key fields you use most, particularly text fields, consolidate the misspellings and mixed cases that make reporting difficult. This can feel like a huge task, but you do not have to do everything at once: fix just five fields this week, more next week. This is also an opportunity to update correction rules and fix legacy errors. As a bonus, segmentation and reporting become much easier.
  4. Match to a B2B database or Tier 1 compiled list. While the obvious next step may be to append the firmographic elements for analytics, you can also use the match results to help further clean your data. Segment the unmatched customers and evaluate whether they are duplicated somewhere else, have failed address standardization, have any useful fields, or are possibly orphaned from a prior merge exercise. It’s springtime: prune the dead weight!
  5. Append firmographic data. While technically it is not cleaning, appending to and enriching your data can pay big dividends from a hygiene perspective. Compare all contact information against the external data and fix format errors, missing extensions, suite numbers, etc., while adding new contact info you did not have.
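
To make tip 2 concrete, here is a minimal Python sketch of loose versus tight match logic, with the surviving record chosen by largest sales amount. The records, the field names (name, zip, sales) and the helper functions are all hypothetical, and matching on a normalized name is only one of many possible criteria, not a recommended standard.

```python
from collections import defaultdict

# Hypothetical account records; field names are illustrative only.
accounts = [
    {"id": 1, "name": "Acme Corp", "zip": "10001", "sales": 5000},
    {"id": 2, "name": "ACME Corp.", "zip": "10001", "sales": 12000},
    {"id": 3, "name": "Acme Corp", "zip": "94105", "sales": 700},
]

def norm(text):
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def loose_key(rec):
    # Fewer criteria (name only), so more records are flagged as duplicates.
    return (norm(rec["name"]),)

def tight_key(rec):
    # More criteria (name plus ZIP), so fewer records are flagged.
    return (norm(rec["name"]), rec["zip"])

def find_duplicates(records, key_fn):
    """Group records by match key and return groups with more than one record."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return [group for group in groups.values() if len(group) > 1]

def pick_survivor(group):
    """Use transactional data to choose the record to keep: largest sales."""
    return max(group, key=lambda rec: rec["sales"])

loose_groups = find_duplicates(accounts, loose_key)
tight_groups = find_duplicates(accounts, tight_key)
survivor = pick_survivor(tight_groups[0])
```

On this sample, the loose key groups all three records together, while the tight key groups only the two records that share a ZIP code. Comparing the two result sets is a quick way to see which candidates deserve a manual look before anything is merged.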

Are you doing business in 107 countries? Or 7? Data hygiene matters in predictive analytics

On a recent lead generation assignment, we took on an existing customer database to build a statistical model for scoring a leads database. The client does business in 7 countries, or so they said, and I believed them. But they quickly added a caveat: “No one has looked at our database in a while.”

First, we looked at their billing country field. This had been an open text field in their Salesforce.com system that could be edited by just about anyone. What we found was amazing. There were so many variations of each country that the unique values had proliferated to 107. Misspellings, case differences, punctuation and abbreviations all conspired to create many versions of the same country!

Our first order of business was to identify the obvious countries and group them, followed by corrections to the remaining data. This took 107 countries down to the correct 7. The country field was then locked, and going forward a drop-down menu of countries is used to prevent this proliferation from happening again.
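
The grouping step can be sketched in a few lines of Python. This is not the process we ran for the client, just an illustration: the alias table, the sample values and the function name are all made up, and the countries shown are not the client's. The idea is to strip case, punctuation and spacing, map the obvious variants to one canonical name, and set aside whatever is left for manual correction.

```python
from collections import Counter

# Hypothetical alias table; the keys are lowercase, letters-only variants.
ALIASES = {
    "us": "United States",
    "usa": "United States",
    "unitedstates": "United States",
    "unitedstatesofamerica": "United States",
    "uk": "United Kingdom",
    "unitedkingdom": "United Kingdom",
    "canada": "Canada",
}

def normalize_country(raw):
    """Return the canonical country name, or None if the value is unknown."""
    key = "".join(ch for ch in raw.lower() if ch.isalpha())
    return ALIASES.get(key)

# Made-up examples of free-text entries, including one misspelling.
raw_values = ["USA", "U.S.A.", "united states", "Untied States", "UK", " Canada "]

cleaned = [normalize_country(value) for value in raw_values]

# Values the alias table could not resolve go to a manual review list.
unmapped = [value for value, country in zip(raw_values, cleaned) if country is None]

# Tally of records per canonical country after grouping.
counts = Counter(country for country in cleaned if country is not None)
```

Here the misspelled "Untied States" is the only value left in the review list. Each manual correction can be added to the alias table so the same variant is handled automatically the next time.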

For statistical modelers, why is cleaning data important for sales intelligence?

First of all, we want to point out that even a simple field can pose a challenge. In the absence of consistent coding guidance, it becomes hard to create a segmentation or master filter at the top with which to analyze data. And here, it is not wise to mix multiple countries within a single analysis.

Second, there may be other fields with which data can be cross-referenced. If there is a shipping country and a billing country, chances are that both ought to be the same. So correct the data, then carry over the values into the other field so you have fewer blank and invalid rows overall.
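
As a rough Python sketch of that carry-over, assuming hypothetical field names (billing_country, shipping_country) and a made-up list of valid countries: a value is copied only when one field is valid and the other is blank or invalid. Records where both fields are valid but disagree are left alone, since that is a judgment call for a person.

```python
# Hypothetical set of valid values, e.g. the entries of the drop-down menu.
VALID_COUNTRIES = {"United States", "United Kingdom", "Canada"}

def cross_fill(record, valid_countries):
    """Copy a valid country into the other field when that field is blank or invalid."""
    billing = record.get("billing_country")
    shipping = record.get("shipping_country")
    billing_ok = billing in valid_countries
    shipping_ok = shipping in valid_countries

    fixed = dict(record)  # work on a copy so the original record is untouched
    if billing_ok and not shipping_ok:
        fixed["shipping_country"] = billing
    elif shipping_ok and not billing_ok:
        fixed["billing_country"] = shipping
    # Both valid (even if different) or both invalid: leave as-is for review.
    return fixed

filled = cross_fill(
    {"id": 1, "billing_country": "Canada", "shipping_country": ""},
    VALID_COUNTRIES,
)
conflict = cross_fill(
    {"id": 2, "billing_country": "Canada", "shipping_country": "United States"},
    VALID_COUNTRIES,
)
```

Run this only after the country values themselves have been corrected, otherwise a bad value in one field simply blocks the fill.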

Third, the state of data quality in one field can give a clue as to which other fields could be a problem. Even for fields not used in modeling, always keep in mind how the data was created, who updates it and when, and what procedures are used to correct bad data.

Finally, proactively start addressing quality and instill an ongoing practice of making data cleanliness everyone’s responsibility. Have a method by which to collect feedback and incorporate it. This way, when you are ready to perform analyses, there are fewer surprises with respect to data quality.

Parting thought: Do not assume anything “should be obvious,” and do not go looking for where to lay blame for bad data. As you can see from the simple example above, all businesses have to approach data hygiene with care, caution and respect. As my friend’s father, an electrician, said: “The day I stop fearing electricity is the day I will stop working.”