
Take your data to dinner

Clean data is difficult to define. Simply put, clean data is data that is analysis-ready, but the term ‘analysis-ready’ has baked into it much more than initially meets the eye. Perhaps, then, the easiest way to get clear on what clean data is, is to understand what it is not. After all, as Tolstoy put it, “All happy families are alike; each unhappy family is unhappy in its own way.”


Dealing with dirty data

Dirty data is, by definition, data that is not analysis-ready: it is fraudulent, corrupted, or distorted in one of any number of ways, depending on how the data is collected. In surveys, for example, questions can be answered hurriedly, answered at random, answered in a systematic pattern (such as always selecting the third option), answered in pretence, answered boastfully, and so on, to name just a few.


Suddenly, the data is corrupted, and what we are left with is not fit for clean, reliable analysis, let alone decision-making.


What this means in practice is that only data from respondents who are genuine and consciously engaged with the survey should be considered “clean”. A market research firm, therefore, needs a process for screening out the various forms of dirty data.


Clean data are necessary for real-world, tangible decisions

A simple equivalent might be something like this. Imagine that you’re a user on an online dating platform. You’re looking to find your match, but out there online, it’s tricky - how do you know the people you meet are who they say they are? What if they’re only funny over text? What if, in reality, they don’t look like their profile picture? What if their online persona is only just that? Meeting someone in person gives you an immediate sense of whether that person is genuine or not — but online, we become reliant on cues: does my match eat the same types of food as me, watch the same shows, or support the same causes? These are all proxies for that first meeting when your instinct takes over and screens your match. That first meeting is the dirty data detection mechanism of dating.


In an online survey, we almost never meet our participants in person. Getting to clean data therefore requires building mechanisms to screen your ‘matches’ in the same way you might in the online dating world. Without the proper infrastructure, one is liable to let in those who are not genuine.


At this point, you’ve scrolled through your potential matches online, and now it’s time to get serious. You want to begin a conversation, get to know them better, and check that when you actually talk to them, they are who they say they are. Being a researcher myself, I would suggest something scientifically proven to achieve results, such as Arthur Aron’s 36 questions that lead to love. This is also why all our clients benefit from survey questions written by our experienced market research analysts. Without a well-written question, the answer will always look like dirty data.


Surveying beyond the ‘what’: looking at the ‘how’

We all know that on a date, how a question is answered is as important as what is answered. Enter SlashData’s bespoke survey tool. While the survey asks questions written by our experienced market research analysts, the survey tool gathers information about how respondents answer them in order to screen for potentially non-genuine ‘matches’. It is this “how” that is critical to our advanced integrated cleansing system.


Clean data example 1

Here is a concrete example: other researchers are often shocked to learn that at SlashData, we know exactly how long it took a respondent to answer any given question. The industry-standard screen is ‘total time in the survey’, but this is easily gamed by respondents. How genuine do you think your date would be, for example, if they spent ten minutes answering your first question about the other dates they had met online, but only 10 seconds on each of your remaining questions about likes, dislikes, family, work, hobbies, hopes and dreams? Their total survey time might be 10 minutes and 30 seconds, a respectable average of 2 minutes 37 seconds per question (over four questions), which would pass the industry-standard ‘total survey time’ check. Industry standards would tell you your match is ‘clean’, but you’d be right to feel more than a little dubious.


At SlashData, we would be looking far closer at the time taken to answer each question because, when you’re looking for ‘the one’, it matters. This innovation would not be possible without our survey tool.
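To make the idea concrete, here is a minimal sketch of the difference between a total-time screen and a per-question screen. This is an illustration, not SlashData’s actual methodology; the thresholds and function names are assumptions chosen for the example.

```python
# Illustrative sketch only: comparing a total-time screen with a
# per-question screen. Thresholds are hypothetical, not industry values.
MIN_TOTAL_SECONDS = 60        # assumed minimum acceptable total time
MIN_SECONDS_PER_QUESTION = 15  # assumed minimum per-question time

def passes_total_time_screen(times):
    # The industry-standard check: only the sum matters.
    return sum(times) >= MIN_TOTAL_SECONDS

def passes_per_question_screen(times):
    # The stricter check: every individual answer must take long enough.
    return all(t >= MIN_SECONDS_PER_QUESTION for t in times)

# The respondent from the example: ten minutes on the first question,
# then 10 seconds on each of the remaining three (10 min 30 s in total).
times = [600, 10, 10, 10]
print(passes_total_time_screen(times))    # True: the average looks fine
print(passes_per_question_screen(times))  # False: three rushed answers
```

The same respondent passes the aggregate screen and fails the per-question one, which is exactly why recording timing at the question level matters.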


Clean data example 2

Another example: as your date progresses and you start to learn more, it is only human nature to run a consistency check. Think of it as checking your respondent’s life story. “You’ve made a React app? But you told me you didn’t know JavaScript.” A red flag. A little later, you realise your first date is actually a UI designer. Consistency rules are critical to detecting whether the picture the respondent paints stacks up as a whole. SlashData typically implements a consistency rule for every three questions; a large survey may include sixty-plus validations. These rules allow us to check whether the data we’re getting is reliable. While such rules can be implemented regardless of the survey tool, you should ask whether your market research firm takes care to design and implement them.
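A consistency rule of this kind can be expressed as a simple implication over a respondent’s answers. The sketch below mirrors the React/JavaScript example from the text; the field names are illustrative and not SlashData’s actual survey schema.

```python
# Illustrative consistency rule (hypothetical field names): a respondent
# who reports building a React app should also report knowing JavaScript.
def react_implies_javascript(response):
    if "React" in response.get("frameworks_used", []):
        return "JavaScript" in response.get("languages_known", [])
    return True  # the rule does not apply to this respondent

# A respondent whose story does not stack up:
respondent = {"frameworks_used": ["React"], "languages_known": ["Python"]}
print(react_implies_javascript(respondent))  # False: flag for review
```

A real survey would carry dozens of rules like this, each a cheap check on whether the answers are mutually coherent.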


Data consistency is key

A flip side to consistency is advanced pattern detection. Is your date taking the same amount of time to consider each of your questions, or are they disinterestedly taking the easy way out? At SlashData, we’ve seen respondents recruited from top panel providers who nevertheless always choose the third option on the list. Depending on the survey tool your market research company uses, this may be impossible to detect when the options are randomised. And how do you know whether your respondent scrolled far enough down the screen to grasp the full question, or whether they simply picked from the options they could immediately see? These characteristics are critical when deciding how trustworthy a respondent is.
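The “always the third option” pattern can be caught only if the survey tool records the displayed position of each chosen option, since randomisation changes which answer sits in that position. Here is a hedged sketch of such a check; the threshold is an assumption for illustration.

```python
from collections import Counter

# Illustrative sketch: flag respondents who keep picking the same
# on-screen position. This relies on the survey tool logging the
# *displayed* position of each choice, not just the option's value.
def looks_like_straight_lining(chosen_positions, threshold=0.9):
    if not chosen_positions:
        return False
    # Share of answers falling on the single most common position.
    (_, top_count), = Counter(chosen_positions).most_common(1)
    return top_count / len(chosen_positions) >= threshold

positions = [3] * 18 + [1, 2]  # third option on 18 of 20 questions
print(looks_like_straight_lining(positions))  # True: suspicious pattern
```

A respondent who genuinely varies their answers would spread choices across positions once the options are randomised, so a high concentration on one position is a strong signal.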


After all this, imagine the worst-case scenario. Imagine that after all that effort - the swiping, the small talk, the screening out of the dodgy ones - you turn up to find that your date is… just here for the food. You’ve been let down and probably feel used. This is an instance where the incentives (a nice meal, or a cash amount for answering the survey) yield problematic results. In an online environment, it takes industry-leading adaptations to detect repeat responding and bot responses. An advanced cleansing methodology should take into account the device used to answer the survey and whether any proxy, VPN, or IP masking was used, all while not penalising legitimate use of these privacy tools. This requires careful consideration of response patterns and respondent metadata. Bots and repeat responders are a real problem for clean data: both artificially inflate the number of survey responses and add considerable noise to the results. Consistency checks also play an important part in catching this type of response.
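One common building block for repeat-responder detection is a coarse fingerprint over respondent metadata. The sketch below shows the idea only: the fields, and the decision to treat a repeated fingerprint as suspicious rather than disqualifying, are assumptions for illustration, and a real methodology weighs many more signals.

```python
import hashlib

# Illustrative sketch: hash a few metadata fields into a fingerprint and
# flag responses whose fingerprint has been seen before. Field names are
# hypothetical; a matching fingerprint is a signal for review, not proof.
def fingerprint(meta):
    raw = "|".join([meta.get("ip", ""),
                    meta.get("user_agent", ""),
                    meta.get("screen_resolution", "")])
    return hashlib.sha256(raw.encode()).hexdigest()

seen = set()

def is_possible_repeat(meta):
    fp = fingerprint(meta)
    if fp in seen:
        return True
    seen.add(fp)
    return False

meta = {"ip": "203.0.113.7", "user_agent": "Mozilla/5.0",
        "screen_resolution": "1920x1080"}
print(is_possible_repeat(meta))  # False: first time this fingerprint appears
print(is_possible_repeat(meta))  # True: same fingerprint again
```

Because shared networks and common devices can collide, a flag like this is best combined with the timing, pattern, and consistency signals described above before a response is discarded.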


Love is a serious matter. Clean data is crucial.

Finding a genuine match takes effort, and the same is as true of screening online data as it is of online dates. Confirming that data is ‘analysis-ready’ requires a number of mechanisms to sift out the dodgy dates, or dirty data, and their importance should not be underestimated. Per-question speed tests, click-pattern detection, consistency checks, AI bot detection, and repeat-responder detection are just some of the technical achievements of SlashData’s cleansing methodology. SlashData’s bespoke in-house survey tool is best in class at providing these inputs to any cleansing process. Is your data clean? It is critical to ask your market research partner whether they have the information to actually ensure it is.


If you want to better understand our process or explore a specific topic together, let’s talk.


About the author

Jed Stephens, Senior Data Scientist

Jed has several years of research experience in the academic and industry sectors, mainly focusing on applied statistical research and computational implementations of new statistical methods. He holds an MSc in Statistics. His interest is in turning data into informed, actionable decisions.

