AI-powered surveys: Hyped or helpful?

Some wonder if AI-powered surveys are over-hyped vaporware that legitimizes bad science. Here’s how they really work.

For marketers with an interest in research, it’s a good time to start talking about AI-facilitated online surveys. What are those, exactly? They’re surveys that use machine learning to engage with respondents (think of a chatbot) and then manage much of the back-end data work involved in fielding and reporting (think of pure drudgery). We have had a great experience with GroupSolver, and other examples include Acebot, Wizer, Attuned (specific to HR) and Worthix. There are others out there, and probably even more by the time you read this.

The good news is that a conversation about AI-facilitated online surveys is well underway. The bad news is that it’s rife with exaggerated claims. By distinguishing between hype and genuine promise, it’s possible to set some realistic expectations and tap into the technology’s benefits without overinvesting your time and research dollars in false promises (which seems to happen a lot with AI).

As Dr. Melanie Mitchell puts it in “Artificial Intelligence Hits the Barrier of Meaning,” AI is outstanding at doing what it is told, but not at uncovering human meaning. If that’s true, what possible use can AI have for online surveys? There are five themes that need to be addressed.

Reduced customer fatigue

One misconception is that AI surveys reduce fatigue because traditional surveys are too long. Not quite. Surveys are only too long if they are poorly crafted, but that has nothing to do with how the instrument is administered. Where AI does help is in creating an experience that is very comfortable for the respondent because it looks and feels like a chat session. The informality helps respondents feel more at ease and is well-suited to a mobile screen. The possible downside is that responses are less likely to be detailed because people may be typing with their thumbs.

Open-ended questions

There are three advantages to how AI treats open-ended questions. First, the platform we used takes that all-important first pass at codifying a thematic analysis of the data. When you go through the findings, the machine will have already grouped responses into the themes it has identified. If you are using grounded theory (i.e., looking for insights as you go), this can be very helpful in building momentum toward developing your insights.

Secondly, the AI also facilitates the thematic analysis by getting each respondent to help with the coding process, as part of the actual survey. After the respondent answers “XYZ,” the AI tells the respondent that other people answered “ABC,” and asks whether that is similar to what the respondent meant. This process continues until respondents have not only given their own answers but have weighed in on the answers of other respondents (or on pre-seeded responses you want to test). The net result for the researcher is a pre-coded sentiment analysis you can work with immediately, without spending hours coding responses from scratch.

The downside of this approach is that you will be combining both aided and unaided responses. This is useful if you need to get group consensus to generate insights, but it’s not going to work if you need completely independent feedback. Something like GroupSolver works best in cases where you otherwise might consider open-ended responses, interviews, focus groups, moderated message boards or similar instruments that lead to thematic or grounded theory analyses.

The third advantage of this approach over moderated qualitative methodologies is that the output not only gives you coded themes but also gauges their relative importance. This gives you a dimensional, psychographic view of the data, complete with levels of confidence, that can be helpful when you look for hidden insights and opportunities to drive communication or design interventions.

Surveys at the speed of change

There are claims out there that AI helps drive speed-to-insight and integration with other data sources in real time. This is the ultimate goal, but it’s still a long way off. It’s not a matter of connecting more data pipelines; it’s that behavioral data and survey data do very different things. Data science tells us what is happening but not necessarily why it’s happening, because it’s not meant to uncover behavioral drivers. Unless we’re dealing with highly structured data (e.g., Net Promoter Score), we still need human intervention to make sure the two types of data are speaking the same language. That said, AI can create incredibly fast access to the types of quantitative and qualitative data that surveys often take time to uncover, which does bode very well for increased speed to insight.

Cross-platform and self-learning ability

There is an idea out there that AI surveys can access ever-greater sources of data for an ever-broader richness of insight. Yes and no. Yes, we can get the AI to learn from large pools of respondent input. But, once again, without two-factor human input (from respondents themselves and from the researcher), the results are not to be trusted, because they are likely to miss underlying meaning.

Creates real-time, instant surveys automatically

The final claim we need to address is that AI surveys can be created nearly instantaneously or even automatically. There are some tools that generate survey questions on the fly, based on how the AI interprets responses. It’s a risky proposition. It’s one thing to let respondents engage with each other’s input, but it’s quite another to let them drive the actual questions you ask. An inexperienced researcher may substitute respondent-driven input for researcher insight. That said, if AI can take away some drudgery from the development of the instrument, as well as the back-end coding, so much the better. “Trust but verify” is the way to go.

So, this quote from Picasso may still hold true: “Computers are useless. They can only give you answers,” but now they can make finding the questions easier too.

Summary

The good news is that AI can do what it’s meant to do – reduce drudgery. And here’s some more good news (for researchers): There will always be a need for human intervention when it comes to surveys because AI can neither parse meaning from interactions nor substitute for research strategy. The AI approaches that succeed will be the ones that most effectively facilitate that human intervention in the right way, at the right time.

Are your A/B tests just junk science?

Here are six lessons for marketers from the “Replication Crisis” in the sciences to keep in mind as you design your next round of A/B tests.

In today’s digital-first advertising world, many marketing leaders aspire to conduct marketing as a science. We speak in scientific terms – precision, measurement, data; we hire young professionals with bachelor of science degrees in marketing; we teach our teams to test their hypotheses with structured experiments.

Yet, most marketers have no idea that the sciences are facing a methodological reckoning, as it has come to light in recent years that many published results – even in respected, peer-reviewed journals – fail to replicate when the original experiments are reproduced. This phenomenon, known as the “Replication Crisis,” is far from a niche concern. Recent reports suggest that the majority of psychology studies fail to replicate, and certainly many marketers are beginning to feel that, for all the “successful” A/B tests they’ve run, high-level business metrics haven’t much improved.

How could this happen? And what can marketers learn from it? Here are six key points to keep in mind as you design your next round of A/B tests.

The meaning of ‘statistical significance’

You might be tempted to skip this section, but most marketers cannot correctly define statistical significance, so stick with me for the world’s quickest review of this critical concept. (For a more thorough introduction, see here and here.)

We begin any A/B test with the null hypothesis:

There is no performance difference between the ads I’m testing.

We then run the test and gather data, which we ultimately hope will lead us to reject the null hypothesis, and conclude instead that there is a performance difference.

Technically, the question is:

Assuming that the null hypothesis is true, and any difference in performance is due entirely to random chance, how likely am I to observe the difference that I actually see?

Calculating p-values is tricky, but the important thing to understand is: the lower the p-value, the more confidently we can reject the null hypothesis and conclude that there is a real difference between the ads we’re testing. Specifically, a p-value of 0.05 means that, if there were truly no difference, a performance gap at least as large as the one you observed would arise by random chance only 5 percent of the time.

Historically, p-values of 0.05 or less were deemed “statistically significant,” but it’s critical to understand that this is just a label applied by social convention. In an era of data scarcity and no computers, this was arguably a reasonable standard, but in today’s world, it’s quite broken, for reasons we’ll soon see.

Practical advice: when considering the results of A/B tests, repeat the definition of p-value like a mantra. The concept is subtle enough that a consistent reminder is helpful.

‘Statistical significance’ does not imply ‘practical significance’

The first and most important weakness of statistical significance analysis is that, while it can help you assess whether or not there is a performance difference across your ads, it says nothing about how big or important the difference might be for practical purposes. With enough data, inconsequential differences could be considered “statistically significant.”

For example, imagine that you run an A/B test with two slightly different ads. You run 1,000,000 impressions for each ad, and you find that version A gets 1,000 engagements, whereas version B gets 1,100 engagements. Using Neil Patel’s A/B calculator (just one of many on the web), you will see that this is a “statistically significant” result – the p-value is 0.01, which is well beyond the usual 0.05 threshold. But is this result practically significant? The engagement rates are 0.1 percent and 0.11 percent respectively – an improvement, but hardly a game-changer in most marketing contexts. And remember, it took 2M impressions to reach this conclusion, which costs real money in and of itself.
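For readers who want to see the mechanics, here is a minimal sketch (in Python, using SciPy) of the two-proportion z-test that calculators like this typically run under the hood. The exact p-value depends on assumptions such as one-tailed vs. two-tailed testing, but the result clears the conventional 0.05 bar either way.

    # A rough sketch of the two-proportion z-test behind this kind of calculator.
    # Numbers are taken from the example above.
    from math import sqrt
    from scipy.stats import norm

    impressions_a, engagements_a = 1_000_000, 1_000
    impressions_b, engagements_b = 1_000_000, 1_100

    rate_a = engagements_a / impressions_a   # 0.10 percent engagement rate
    rate_b = engagements_b / impressions_b   # 0.11 percent engagement rate

    # Pooled rate under the null hypothesis that there is no real difference.
    pooled = (engagements_a + engagements_b) / (impressions_a + impressions_b)
    se = sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    z = (rate_b - rate_a) / se

    print(f"relative lift: {(rate_b - rate_a) / rate_a:.0%}")   # 10%
    print(f"one-tailed p:  {norm.sf(z):.3f}")                   # ~0.015
    print(f"two-tailed p:  {2 * norm.sf(abs(z)):.3f}")          # ~0.029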

My practical advice to marketing leaders is to accept the fact that slight tweaks will rarely have the dramatic impact we seek. Embrace the common-sense intuition that it usually takes meaningfully different inputs to produce practically significant outcomes. And reframe the role that testing plays in your marketing so that your team understands significance analysis as a means of comparing meaningfully different marketing ideas rather than as a definition of success.

Beware ‘publication bias’

But… what about all those articles we’ve read – and shared with our teams – that report seemingly trivial A/B tests delivering huge performance gains? “How adding a comma raised revenue by 30 percent.” “This one emoji changed my business,” etc.

While there are certainly little nuggets of performance gold to be found, they are far fewer and farther between than an internet search would lead you to believe, and the concept of “publication bias” helps explain why.

This has been a problem in the sciences too. Historically, experiments that didn’t deliver statistical significance at the p-value = 0.05 level were deemed unworthy of publication, and mostly they were simply forgotten about. This is also known as the “file drawer effect,” and it implies that for every surprising result we see published, we should assume there is a shadow inventory of similar studies that never saw the light of day.

In the marketing world, this problem is compounded by several factors: the makers of A/B testing software have strong incentives to make it seem like easy wins are just around the corner, and they certainly don’t publicize the many experiments that failed to produce interesting results. And in the modern media landscape, counterintuitive results tend to get shared more frequently, creating a distribution bias as well. We don’t see, or talk about, the results of all the A/B tests that came back insignificant.

Practical advice: Remember that results that seem too good to be true probably are. Ground yourself by asking “how many experiments did they have to run to find a result this surprising?” Don’t feel pressured to reproduce headline-worthy results; instead, stay focused on the unremarkable but much more consequential work of testing meaningfully different strategies and looking for practically significant results – that’s where the real value will be found.

Beware ‘p-hacking’

Data is a scientifically inclined marketer’s best friend, but it should come with a warning label, because the more data dimensions you have, the more likely you are to fall into the anti-pattern known as “p-hacking” in one way or another. P-hacking is the label given to the many ways that data analysis can produce seemingly “statistically significant” results from pure noise.

The most flagrant form of p-hacking is simply running an experiment over and over again until you get the desired result. Remember that a 0.05 significance threshold implies a 5 percent false-positive rate when there is no real difference; run the same experiment 20 times and you should expect one “significant” result by chance alone. If you have enough time and motivation, you can effectively guarantee a significant result at some point. Drug companies have been known to do things like this to get a drug approved by the FDA – not a good look.
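A quick simulation makes the point concrete. The sketch below (illustrative only, with a made-up 1 percent conversion rate) runs 20 A/A tests – both “variants” are identical – and counts how many come out “significant” anyway; the chance of at least one false positive across 20 tries is roughly 64 percent (1 - 0.95^20).

    # Illustrative simulation: 20 identical A/A tests with no real difference.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def aa_test_p_value(n=10_000, rate=0.01):
        """Two-tailed p-value for one A/A test where no true difference exists."""
        a = rng.binomial(n, rate)
        b = rng.binomial(n, rate)
        pooled = (a + b) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * (2 / n))
        z = ((b - a) / n) / se if se > 0 else 0.0
        return 2 * norm.sf(abs(z))

    p_values = [aa_test_p_value() for _ in range(20)]
    hits = sum(p < 0.05 for p in p_values)
    print(f"{hits} of 20 identical experiments came out 'significant' at p < 0.05")
    # On average about 1 in 20; the chance of at least one is ~64% (1 - 0.95**20).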

Most marketing teams would never do something this dumb, but there are subtler forms of p-hacking to watch out for, many of which you or your teammates have probably committed.

For example, consider a simple Facebook A/B test. You run two different ads, targeting the same audience – simple enough. But what often happens when the high-level results prove to be unremarkable is that we dig deeper into the data in search of more interesting findings. Perhaps if we only look at women, we’ll find a difference? Only men? What about looking at the different age bands? Or iPhone vs. Android users? Segmenting the data this way is easy, and generally considered a good practice. But the more you slice and dice the data, the more likely you are to identify spurious results, and the more extreme your p-values must be to carry practical weight. This is especially true if your data analysis is exploratory (“let’s check men vs. women”) rather than hypothesis-driven (“our research shows that women tend to value this aspect of our product more than men – perhaps the results will reflect that?”). For a sense of just how bad this problem is, see the seminal article “Why Most Published Research Findings Are False,” which is credited with persuasively raising an early alarm about p-hacking and publication bias.
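The arithmetic behind this warning is simple: at a 0.05 threshold, every extra slice of the data is another chance to be fooled. A back-of-the-envelope calculation (assuming independent segments, which is a simplification):

    # Chance of at least one spurious "significant" segment when a null effect
    # is sliced into k independent segments, each tested at alpha = 0.05.
    alpha = 0.05
    for k in (1, 4, 8, 16):
        family_wise_error = 1 - (1 - alpha) ** k
        print(f"{k:2d} segments -> {family_wise_error:.0%} chance of a false positive somewhere")
    # 1 -> 5%, 4 -> 19%, 8 -> 34%, 16 -> 56%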

In the sciences, this problem has been addressed by a practice called “pre-registration,” in which researchers publish their research plans, including the data analyses they expect to conduct so that consumers of their research can have confidence that the results are not synthesized in a spreadsheet. In marketing, we typically don’t publish our results, but we owe it to ourselves to apply something like these best practices.

Practical advice for marketing teams: Throw the p=0.05 threshold for “statistical significance” out entirely – a lot of p-hacking stems from people searching endlessly for a result that hits some threshold, and in any case, our decision-making should not rely on arbitrary binaries. And make sure that your data analysis is motivated by hypotheses grounded in real-world considerations.

Include the cost of the experiment in your ROI

An often-overlooked fact of life is that A/B tests are not free. They take time, energy and money to design and execute. And all too often, people fail to ask whether or not an A/B test is likely to be worth their time.

Most A/B testing focuses on creative, which is appropriate, given that ad performance is largely driven by creative. And most of what’s written on A/B testing acts as if great creative falls from the sky and all you need to do is test to determine which version works best. This might be reasonable if you’re talking about Google search ads, but for a visual medium like Facebook, creative is time-consuming and expensive to produce. Especially in the video era, the cost of producing videos to test is often higher than the expected gains.

For example, let’s say you have a $25k total marketing budget, and you’re trying to decide whether to spend $2k producing a single ad or $5k producing five different variants. If we assume that you need to spend $1k of media on each ad variant to test its performance as part of an A/B test, you’d need your winning ad to perform at least 20 percent better than baseline for A/B testing to be worthwhile. You can play with these assumptions using this simple spreadsheet I created.
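If you’d rather see the arithmetic than a spreadsheet, here is a rough sketch of the same break-even calculation. The assumptions are inferred from the example, not stated in it: creative costs $2k for one ad or $5k for five variants, each variant gets $1k of test media, the eventual winner performs at baseline times (1 + lift) both during and after the test, and everything else runs at baseline.

    # Break-even lift needed to justify testing five creative variants instead
    # of producing one ad, under the assumptions described above (illustrative).
    def value_without_test(budget, creative_cost=2_000, baseline=1.0):
        return (budget - creative_cost) * baseline   # all media behind one ad

    def value_with_test(budget, lift, creative_cost=5_000, variants=5,
                        test_media_per_variant=1_000, baseline=1.0):
        media = budget - creative_cost
        test_media = variants * test_media_per_variant
        # During the test, one variant is the eventual winner; the rest are baseline.
        test_value = (variants - 1) * test_media_per_variant * baseline \
                     + test_media_per_variant * baseline * (1 + lift)
        winner_value = (media - test_media) * baseline * (1 + lift)
        return test_value + winner_value

    for budget in (25_000, 15_000):
        lift_pct = 0
        while value_with_test(budget, lift_pct / 100) < value_without_test(budget):
            lift_pct += 1
        print(f"${budget:,} budget -> break-even lift of roughly {lift_pct}%")
    # Roughly 19-20% at a $25k budget and 50% at a $15k budget.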

Twenty percent may not sound like much, but anyone who’s done significant A/B testing knows that such gains aren’t easy to come by, especially if you’re operating in a relatively mature context where high-level positioning and messages are well-defined. The Harvard Business Review reports that the best A/B test Bing ever ran produced a 12 percent improvement in revenue, and Bing runs more than 10,000 experiments a year. You may strike gold on occasion, but realistically, you’ll find incremental wins sometimes, and no improvement most of the time. If your budget is smaller, it’s that much harder to make the math work. With a $15k budget, you need a 50 percent improvement just to break even.

Practical advice: Remember that the goal is to maximize advertising ROI, not just to experiment for experiment’s sake. Run ROI calculations up front to determine what degree of improvement you would need to make your A/B testing investment worthwhile. And embrace low-cost creative solutions when testing – while there are trade-offs, it may be the only way to make the math work.

Don’t peek!

Marketers love a good dashboard, and calculations are so easy in today’s world that it’s tempting to watch our A/B test results develop in real time. This, however, introduces another subtle problem that can fundamentally undermine your results.

Data is noisy, and the first data you collect will almost certainly deviate from your long-term results in one way or another. Only over time, as you gather more data, will your results settle toward the true long-term average – early extremes get washed out, a combination of regression to the mean and the law of large numbers – and to get meaningful results, you have to let that process play out.

If you look at the p-value on a continuous basis and declare victory the moment it dips below your significance threshold, you are doomed to reach the wrong conclusions. Why? Statistical significance analysis is based on the assumption that your sample size was fixed in advance.

As March Madness heats up, a sports betting analogy might help develop intuition. Tonight, in a game that features superstar Zion Williamson’s highly anticipated return from injury, Duke will take on Syracuse in the ACC tournament. Duke is favored by 11.5 points, meaning that if you bet on Duke, they need to win by 12 points or more for you to win the bet. But critically, only the final score matters! If you tried to bet on a similar but different proposition – that Duke would take a 12-point lead at some point in the game – you’d get far worse odds, because as any basketball fan knows, a 12-point lead can come and go in the blink of an eye, especially in March.

Constantly refreshing the page to check in on your A/B tests is like confusing a “Duke to win by more than 11.5” bet with a “Duke to lead by more than 11.5 at any point in the game” bet. P-values will fluctuate up and down as data is collected, and it’s far more likely that you’ll see a low p-value at some point along the way than at the very end of the test.
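To see how large the effect is, here is an illustrative simulation (the batch sizes and conversion rate are my own assumptions): an A/A test with no real difference, checked after every batch of traffic, where we “declare victory” the first time p drops below 0.05. A fixed-horizon test would be fooled about 5 percent of the time; continuous peeking is fooled far more often.

    # Illustrative simulation of peeking at a null A/A test after every batch.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    def peeking_fools_us(batches=50, batch_size=1_000, rate=0.01):
        a = b = na = nb = 0
        for _ in range(batches):
            a += rng.binomial(batch_size, rate); na += batch_size
            b += rng.binomial(batch_size, rate); nb += batch_size
            pooled = (a + b) / (na + nb)
            se = np.sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
            if se > 0:
                p = 2 * norm.sf(abs(a / na - b / nb) / se)
                if p < 0.05:
                    return True   # stopped early and "won" on a nonexistent effect
        return False

    trials = 1_000
    fooled = sum(peeking_fools_us() for _ in range(trials)) / trials
    print(f"false-positive rate with continuous peeking: {fooled:.0%} (vs ~5% without)")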

Sometimes, peeking can still be warranted. In clinical trials of new medicines, for example, experiments are sometimes stopped midway if the preliminary results show great harm to one group or another; in these rare and extreme situations, it’s considered unethical to continue the test. But this is not likely to apply to your marketing A/B tests, which pose very few risks, especially if well-designed.

Practical advice: Define your tests up-front, check in only periodically, and only consider stopping a test early if the results are truly extraordinary, such that continuing the test seems to have real practical costs.

Conclusion

A scientific approach to marketing undoubtedly holds incredible promise for the field. But nothing is foolproof; all too often, marketers deploy subtle scientific tools with only a superficial understanding and end up wasting a tremendous amount of time, energy, and money as a result. To avoid repeating these mistakes, and to realize the benefits of a rational approach, marketing leaders must learn from the mistakes that contributed to the Replication Crisis in the sciences.

Martech can deliver personalized consumer experiences, but humans are still required

The human component of marketing remains as important as ever to make technology more effective over the lifetime of a brand-consumer relationship.

In the ever-evolving age of digital marketing, marketers are competing against many things: time constraints, a noisy marketplace, endless channels and formats for content, just to name a few. But the most challenging task is delivering contextually relevant content. Brand marketers are beginning to see help come from artificial intelligence. The talk of AI and martech isn’t new, but real use cases are now emerging. Specifically, AI lends itself well to the analytics and predictive data that marketers are now using to deliver what people care about most at any given moment. By taking into account all of the detailed data points across every digital channel over the lifetime of a brand-and-customer relationship, marketing is becoming increasingly relevant and personalized.

Data is the fuel for AI

AI has proven most useful in analyzing data and putting meaningful optimizations to work in real-time. It has the potential to greatly collapse the time-to-market for optimization as compared to traditional operational processes that take weeks or even months to deploy, analyze and improve based on the results. It helps brands understand what, when and how to serve customers to drive the dream state marketers have been aspiring to reach for a long time: real-time contextual relevance.

To get to contextual relevance, AI finds its fuel in data. There are a couple of different types of data that brands can lean on to make AI more intelligent:

  • Implicit data – This is data collected from behaviors such as clicks, purchases, engagement with certain types of content/products, etc. It’s the bread crumb trail of data exhaust that is naturally left behind by digital engagements without having to ask consumers for it. It’s the largest data set that brands have and it’s also the hardest to manage and put to use in meaningful ways.
  • Explicit data – This is data collected by soliciting and receiving direct feedback from customers. This can include profile data points, Net Promoter Score or other survey responses, feedback during customer service interactions, etc.

Implicit data is the most powerful data set because it’s massive, and the more data the AI is fed, the smarter it becomes. Explicit data is incredibly important too, but you don’t get a full picture of it without merging digital and customer service inputs.
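As a purely illustrative sketch (not any particular vendor’s data model), merging the two kinds of input can be as simple as keying both streams to the same customer record, so every channel – including the human one discussed below – works from the same picture. Field names and IDs here are hypothetical.

    # Illustrative only: fold implicit behavioral events and explicit feedback
    # into one customer profile.
    from collections import defaultdict

    implicit_events = [
        {"customer": "c-102", "type": "click", "item": "running-shoes"},
        {"customer": "c-102", "type": "purchase", "item": "running-shoes"},
    ]
    explicit_feedback = [
        {"customer": "c-102", "nps": 9, "note": "Sizing ran small; exchange was easy."},
    ]

    profiles = defaultdict(lambda: {"events": [], "feedback": []})
    for event in implicit_events:
        profiles[event["customer"]]["events"].append(event)
    for item in explicit_feedback:
        profiles[item["customer"]]["feedback"].append(item)

    print(profiles["c-102"])   # one unified view for marketing and service alike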

The human channel and AI

The gap between customer service and digital marketing is still wide, but closing it is more important now than ever. Customer expectations are rising, and so are the chances that people abandon a brand over poor service and marketing. Customer service is the place where one-to-one personalization is king, and arguably where the most impactful interactions between brands and customers happen. When it comes to salvaging and strengthening relationships by creating unique, personalized experiences, service can be the definer.

While real human interaction will become the differentiator, AI can optimize these interactions. Just as a marketer would orchestrate email, SMS, push, app and web experiences, brands can add the “human channel” to the mix. AI can help determine when a human should be the channel of choice and suggest what the message from that human could be. As interactions become more digitally focused, including the emergence of chatbots, speaking with a knowledgeable human at a brand will anchor any brand-and-consumer relationship.

And what about the explicit data that is needed to fuel AI? The information flow needs to be bi-directional, so that data captured by the human channel is fed back into the brand’s data ecosystem, ultimately to be used by other channels. While human interaction is a cost center for a brand, its value is unmatched: the customer gets a better experience, and the brand collects valuable data that makes the next engagement even better. It’s a win-win.

Things to consider

Three things to consider when implementing AI solutions:

  • Scale: Unfortunately, many brands that fine-tune their data and marketing strategies do so without the ability to truly scale, so they stop at ‘walk,’ so to speak, and never get to ‘run.’ If that applies to you, reconfigure your scaling strategy and work AI components in from the start, with the clear goal of using AI to expand your ability to operate at scale.
  • Keep it personal: Leaning too much on machines and AI makes brand-to-consumer relationships impersonal. Clearly define your engagement strategies and prioritize the touchpoints that benefit from the human touch. Sincerity in building the relationship is paramount.
  • Lean on your customer data: Unless they opt out, every customer is leaving a breadcrumb trail of feedback, and those data points are clues to how you can better serve them. AI can be instrumental in understanding a customer and strengthening those relationships by providing a contextually relevant view of the customer and generating recommendations on how best to move forward.

As an industry, we get excited about AI and how it can change the way we interact with and understand our customers. But at the end of the day, the human component of marketing remains most important and is required to make our technology, including AI, more effective. Customer service reps, data scientists and digital strategists with an eye for emerging tech will be the valuable players who ensure the entire ecosystem operates to its fullest potential.
