The Data Quality Imperative

Featured Contributor, October 13, 2016



Earlier this month, OMD’s Julie Fleischer and Neustar’s Steven Wolfe Pereira poked at data integrity issues during a Chicago conference, creating some timely swirl by highlighting a simple truth: We lack data integrity standards. Advertisers are becoming skeptical of claims about data. We owe them some accountability, but they owe themselves some diligence as well.  


The enemy: oversimplification

People glom on to the difference between “deterministic” and “probabilistic” data as some sort of magical line in the sand, but that false sense of comfort is dangerous. Yes, facts exist (deterministic data), but riddle me this: If my credit card shows I shop at Whole Foods, am I rich? Any inference made from deterministic data sends us right back down a statistical rat hole.


Another schism divides planning and activation. I might know for sure that certain cookies clicked on a hotel ad. Planners may decide they want to contact that group. For television, there is no choice but to reduce that audience to a demographic. Online, the exact people can be activated. This is one scenario that taps the value of tight integration between a data-management platform (DMP) and a demand-side platform (DSP).


Grim reality

The CEO of a large agency recently said to me, “Everyone tells me their data is great. How would I know?” Indeed, how would he? Even high-quality data can be ruined by inappropriate use. Online contact usually depends on data, and the quality of the audience (as data) is just as important as the quality of the context. Data quality might be half of effectiveness.


It’s time for advertisers to hold themselves and their suppliers accountable for the quality of data and the conclusions derived from it. The risk of not getting this right will be the commoditization of data (and ergo, consumers). That stands, in my opinion, high on the list of strategic risks for the online ad industry.


So, by inferred popular demand, I present here a listicle of quality attributes for advertising data.


Recency


If you are buying, say, “beauty category buyers,” it’s pretty safe to say that they are still interested in beauty after six months. But if you are buying six-month-old “Auto Intenders,” there is a pretty good chance they no longer need a car.
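One way to make recency concrete is to give each segment its own shelf life and test every observation against it. The sketch below is only an illustration: the segment names and the day counts are hypothetical placeholders, not industry standards, and the right values would have to come from the category itself.

```python
from datetime import date

# Hypothetical shelf lives in days. The only grounded idea here is that
# beauty interest outlasts six months while auto intent may not.
MAX_AGE_DAYS = {
    "beauty_buyers": 365,
    "auto_intenders": 90,
}


def is_fresh(segment: str, observed_on: date, today: date) -> bool:
    """Return True if the observation is still within the segment's shelf life."""
    age_days = (today - observed_on).days
    return age_days <= MAX_AGE_DAYS[segment]


today = date(2016, 10, 13)
six_months_ago = date(2016, 4, 13)

print(is_fresh("beauty_buyers", six_months_ago, today))   # still usable
print(is_fresh("auto_intenders", six_months_ago, today))  # likely stale
```

With these placeholder limits, the same six-month-old observation passes for beauty and fails for auto, which is the whole point: recency is relative to the purchase cycle.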


Veracity of the inference


How is the purported meaning related to the data? For example, if I went to a page that mentions the word “skin,” am I interested in skin cream?


Observation vs. declaration


Some data is derived by observing what people did (for example, via cookies or panels). Other data is derived from what people said. Third-party data sites, which collect observations (see: http://www.bluekai.com/registry/ ), are pretty good at nailing interests from Web site behaviors; demographics, not so much.


Conformance with actual intent


Say you bought a segment of people interested in adhesives (glue intenders!) for a $2.50 CPM, and lo and behold, it’s 100,000,000 browsers. To validate, ask some of them. What will they say? I like glue? I studied principles of adhesion in engineering school?


Proximity of use to source


This is segment “telephone” in the data supply chain: From data collector, to data aggregator, to DMP, to DSP, and maybe a Boolean “and,” on-ramping, de-duping, and domain-space resolution. Organically grown soybeans can end up as Cheez Whiz™. 


Likelihood of actual reach


There are several reasons a cookie may never create reach. The user may never show up in the footprint, or may simply have deleted the cookie. You could buy 50 million users and only find half, or a tenth, of them. Time helps, but this is a serious impairment.
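The arithmetic is worth spelling out. In this rough sketch, `match_rate` is an assumed name for the share of purchased users you actually find; the 50 million figure and the half and one-tenth rates come from the example above.

```python
def reachable_users(purchased: int, match_rate: float) -> int:
    """Users you can actually expect to find, given a match rate between 0 and 1."""
    if not 0.0 <= match_rate <= 1.0:
        raise ValueError("match_rate must be between 0 and 1")
    return round(purchased * match_rate)


def price_multiplier(match_rate: float) -> float:
    """How much more each reached user really costs versus the quoted price."""
    if not 0.0 < match_rate <= 1.0:
        raise ValueError("match_rate must be greater than 0 and at most 1")
    return 1.0 / match_rate


purchased = 50_000_000
for rate in (0.5, 0.1):
    found = reachable_users(purchased, rate)
    print(f"match rate {rate:.0%}: {found:,} users, "
          f"{price_multiplier(rate):.0f}x the quoted cost per user")
```

At a one-tenth match rate, every user you do reach effectively costs ten times what the rate card implied.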


Fit with actual prospect density


You might buy a segment of 20 million new pet owners, but are there really that many out there? Inflated estimates of prospect density are the first symptom of naïve hope.


Census vs. sample


Sampling is the lovable, wonky heart of statistics. It’s all a gamble unless you are counting cards, in which case you have a census (i.e., not a sample). If your uncertainty lasts for over four hours, please call your data scientist.


Noise vs. signal


Data is dirty. We call it noise. It runs from 0 to 99%. It’s best to know.
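Why is it best to know? Because noise quietly reprices everything you buy. As a back-of-the-envelope sketch, assume (and this is a simplification) that noisy impressions are worth nothing, and borrow the $2.50 CPM from the glue example purely for illustration:

```python
def effective_cpm(quoted_cpm: float, noise_share: float) -> float:
    """CPM actually paid for the clean part of a segment.

    noise_share is the fraction of the segment that is noise (0.0 to just
    under 1.0). Assumes noisy impressions carry no value at all.
    """
    if not 0.0 <= noise_share < 1.0:
        raise ValueError("noise_share must be at least 0 and less than 1")
    return quoted_cpm / (1.0 - noise_share)


for noise in (0.0, 0.5, 0.9, 0.99):
    print(f"noise {noise:.0%}: effective CPM ${effective_cpm(2.50, noise):,.2f}")
```

At the far end of the range, a segment that is 99% noise turns a $2.50 CPM into roughly $250 for the signal you wanted.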


So, there’s a thought starter: Data is not magic. There is good data and horrible data. Much depends on how it is applied. It’s not all that esoteric. A little common sense can go a long way.


 


MediaPost.com: Search Marketing Daily
