In the blog post, Take your Data to Dinner, we discussed the importance of clean online survey data: not all survey responses are suitable for analysis – in fact, negative and often fraudulent actors abound. That blog explained how to screen data in order to establish its reliability and cleanliness.
In this post, the focus is more technical. Looking strategically at what modern, industry-leading data collection methodology should involve, we unpack the nuts and bolts of the kind of information a market research firm should be collecting from its survey respondents so as to be informed about the quality of the data they provide.
What counts as a fraudulent response to our surveys
First and foremost, who are the negative actors that we’re talking about here? It should be noted that these groups are not the only sources of dirty data, but at SlashData we have observed, broadly speaking, three categories of fraudulent respondents:
Complete response bots. These bots circumvent the survey tool entirely and deliver their answers as a complete set directly to an API endpoint. Think of it as a fully packaged response delivered without a single click on the survey tool. Their responses feel forced, clumsy, and lack a human signature.
Bot responses from web automation tools. It is always sad to see sophisticated browser automation tools such as Selenium harnessed in this manner. These tools work by taking control of the browser according to a programmed routine designed by a developer. It may not be wholly surprising that, as a market research firm in the tech space (with a proud pedigree amongst the developer audience), we have seen a good few iterations of this sort of bot. Responses like these carry hidden systematic patterns which can lead to their detection.
Humans working as click farms. This is an individual or a group using one or more physically (or even virtually) distinct devices to register multiple responses to a survey. These responses are human-like but often reek of disinterest and are usually riddled with contradictions. Their answers ring hollow rather than authentic.
I’ll explain how we tackle each of these groups. As a prerequisite, I’ll first describe the information SlashData gathers in order to do so, and then the overall mechanism – a trust index – which integrates these multiple information sources. Equipped with this, I’ll return to each of the negative actor groups and explain how each trips itself into negative trust territory.
A decision is only as good as the information upon which it is made.
Judging data quality on a number of unique pieces of information
Before making any decision, it is obviously prudent to gather information. The nightmare situation of any court judge (or market research firm) is the presentation of new evidence mid-trial: not only does it prompt a re-evaluation of all that has gone before but, in some instances, it casts doubt on the information already provided. For some market research firms, the threat that new ‘mid-trial’ information poses is so concerning that it acts as a deterrent to even seeking such information in the first place. This is a research bias! It is a bias that helps explain why market research firms say they already have enough information to judge the quality of a respondent and are not hungry for more, or to have that judgement challenged.
To avoid dismissing information with an important impact on our understanding of the cleanliness of our data, we do not accept this position at SlashData. Instead, we have focused on acquiring several unique pieces of information and integrating them through an overall mechanism – a trust index – which combines everything we are able to gather about our respondents into a single score against which we can evaluate their credibility as bona fide respondents.
It is obviously only worth designing a trust index if you have multiple sources of information: click position monitoring, reCAPTCHA validation, and IP and proxy detection methods, to name just a few. Some of the key pieces of data at the core of SlashData’s approach to data collection are viewport data (telling us exactly how much of the question the respondent actually engaged with), per-question timing, and third-party validation mechanisms such as GitHub verification for developer profiles.
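To make the shape of this information concrete, here is a minimal sketch of the kind of per-response metadata record such an approach might collect. The field names are illustrative assumptions, not SlashData’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ResponseMetadata:
    """Hypothetical per-response record; field names are assumptions for illustration."""
    respondent_id: str
    viewport_coverage: Optional[float] = None          # fraction of each question actually scrolled into view
    question_times: Dict[str, float] = field(default_factory=dict)  # question_id -> seconds spent
    recaptcha_passed: Optional[bool] = None            # result of reCAPTCHA validation, if run
    uses_privacy_tools: bool = False                   # proxy / VPN / other privacy-preserving tooling detected
    geo_matches_reported_location: Optional[bool] = None
    github_verified: bool = False                      # third-party validation of a developer profile
    user_agent: str = ""
```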
Constructing a trust index for our data
Just having information does not give an answer. That is why constructing a trust index is such a sophisticated (and yet essential) process.
Here, the Trust Index designed at SlashData is the key to a coherent system of quality control: it works by tracking actions that build and diminish trust. To give an example, a response that does not use privacy-preserving tools and whose geolocation matches its reported location builds trust. A response that does use privacy-preserving tools is trust neutral – it could come from either a good or a bad actor, and on this information alone there is not sufficient evidence to determine which. But couple privacy-preserving tools with a signature that suggests the same device, closely matching start times and a known incentive to defraud, and the overall picture is one of diminished trust.
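As an illustration of how such signals might be folded into a single score, here is a minimal sketch. The weights, field names and thresholds are assumptions chosen for clarity; the post only establishes that some signals build trust, some are neutral and some diminish it.

```python
# Minimal trust-scoring sketch. All weights and signal names are illustrative
# assumptions, not SlashData's actual Trust Index.
def trust_score(signals: dict) -> float:
    score = 0.0

    # Trust-building: no privacy-preserving tools and geolocation matches the reported location.
    if not signals.get("uses_privacy_tools") and signals.get("geo_matches_reported_location"):
        score += 1.0

    # Privacy-preserving tools on their own are trust neutral: no adjustment either way.

    # Trust-diminishing combination: privacy tools plus a shared device signature,
    # closely matching start times and a known incentive to defraud.
    if (signals.get("uses_privacy_tools")
            and signals.get("shares_device_signature")
            and signals.get("close_start_times")):
        score -= 2.0

    # A failed reCAPTCHA is an extremely trust-diminishing action on its own.
    if signals.get("recaptcha_failed"):
        score -= 3.0

    return score
```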
Deploy and Detect
Let us return to the three groups of negative actors we identified above to demonstrate how we are able to detect each by deploying our industry-leading data collection methodology.
Complete response bots. This type of bot scores very negatively on the Trust Index because it bypasses many of our other data collection mechanisms – for instance, it would not have a sensibly defined viewport (because its responses never came through the survey tool). This is a quick and reliable red flag to its credibility, which means we can be sure that this type of bot has been completely removed from our clean data at SlashData.
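A minimal sketch of such a check, assuming a hypothetical viewport_coverage field: a submission that never produced any viewport data cannot have come through the survey tool, so it is flagged straight away.

```python
# Illustrative only: the field name is an assumption, not SlashData's schema.
def is_complete_response_bot(meta: dict) -> bool:
    # No viewport data at all means the answers never passed through the survey tool.
    return meta.get("viewport_coverage") is None
```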
Bot responses from web automation tools. This is a more sophisticated type of bot fraud. Because the bot interacts with the survey tool, it is able to create responses that contain basic viewport information and other artefacts that our data collection mechanisms are on the lookout for. There are some dead giveaways, though: the bots’ response times between questions are either fixed (taking precisely 2 seconds, for instance) or follow a simple programmed probability distribution, such as a normal or uniform distribution. A real human takes more or less time per question depending on the length of the question, its sophistication (grid or list) and whether it is multi- or single-choice. Too much uniformity in this area signals a response of low trust value. Web automation tools also fail to pass reCAPTCHA tests, and failing a reCAPTCHA test in any instance is an extremely trust-diminishing action.
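One plausible way to flag this kind of timing uniformity is to look at how much the per-question times vary relative to their mean. The sketch below uses a coefficient-of-variation threshold; the threshold value and minimum question count are assumptions for illustration, not figures from SlashData’s methodology.

```python
import statistics

# Flag responses whose per-question timings are suspiciously uniform.
# Threshold and minimum length are illustrative assumptions.
def timings_look_automated(question_times_sec: list, cv_threshold: float = 0.15) -> bool:
    if len(question_times_sec) < 5:
        return False  # too few questions to judge reliably
    mean = statistics.mean(question_times_sec)
    if mean == 0:
        return True   # instantaneous answers are not human
    stdev = statistics.stdev(question_times_sec)
    # Human timings vary with question length and type; a near-constant timing
    # profile (low coefficient of variation) is a low-trust signal.
    return (stdev / mean) < cv_threshold
```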
Humans working as click farms. Detecting this type of fraud requires screening across multiple data sources, since it is a broad group containing several levels of professionalism. The least sophisticated examples answer questions carelessly, almost randomly, and thereby fail multiple trust tests for consistency; those that are slightly more sophisticated may attempt to change devices or use privacy-preserving tools, but subtle pointers give them away and they are inevitably detected. The most sophisticated require the analysis of multiple patterns and combinations involving unlikely userAgent device signatures, privacy information and the responses themselves.
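As a rough sketch of how several weak signals of this kind might be combined into a click-farm suspicion flag: the field names and the “two or more signals” rule below are illustrative assumptions, not SlashData’s actual detection logic.

```python
# Combine several weak indicators into a single click-farm suspicion flag.
# Signal names and the threshold are assumptions for illustration.
def looks_like_click_farm(meta: dict) -> bool:
    signals = [
        meta.get("answers_internally_contradictory", False),   # careless, near-random answering
        meta.get("unlikely_user_agent_combination", False),    # device/browser pairing that rarely occurs in the wild
        meta.get("uses_privacy_tools", False) and meta.get("shares_device_signature", False),
        meta.get("close_start_times_with_other_responses", False),
    ]
    return sum(bool(s) for s in signals) >= 2
```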
You cannot prevent, fight or detect fraud if you do not have the data.
Do not let cancerous data consume your research. Partnering with a market research firm that is constantly committed to improving its data collection methodologies for online surveys is one way you can avoid publishing noise and instead focus on making a noise about your research!
We can get the clean data you need to optimise your strategy. How? Let’s talk.
About the author
Jed Stephens, Senior Data Scientist
Jed has several years of research experience in the academic and industry sectors, mainly focusing on applied statistical research and computational implementations of new statistical methods. He holds an MSc in Statistics. His interest is in turning data into informed, actionable decisions.