A Marketer’s Guide to Kaggle for Analytics and Data Science

The post A Marketer’s Guide to Kaggle for Analytics and Data Science appeared first on CXL.

Kaggle, the Google-acquired data science platform, started as a virtual meeting point for machine-learning geeks to compete on predictive accuracy scores.

It evolved into a Swiss Army knife for data science and analytics—one that can help data professionals, including data-driven marketers, elevate their analytics game.

Despite being a free service, Kaggle can help address an increasing number of data challenges:

  • How to find reliable data sources to enrich existing customer and marketing data;
  • How to find ideas, inspiration, and relevant code for a new data analysis without reinventing the wheel;
  • How to collaborate efficiently on a data project with colleagues; 
  • How to apply machine learning and artificial intelligence to marketing analytics projects.

This is, of course, just a partial list. This post focuses on these and other marketing-friendly use cases for Kaggle. 

What is Kaggle?

Kaggle launched in 2010 and became known as a platform for hosting machine-learning competitions, typically sponsored by large companies, governments, and research institutes.

Their goal was (and still is) to leverage the collective intelligence of thousands of data scientists around the world to solve a data problem. 

kaggle competition list.
In a Kaggle competition, you can compete for jobs or money (or glory). But the platform has evolved from that initial use case. 

In 2017, Kaggle was acquired by Google. After the acquisition, it started branching out into more areas of data science and analytics. The aim is clear—to become a one-stop shop for data professionals. (It’s currently being rebranded as “the home for data science.”)

Below, I discuss five fresh and relevant features for marketers, regardless of technical ability:

  1. Kaggle datasets;
  2. Kaggle community analyses;
  3. Kaggle Notebooks;
  4. Kaggle cloud integrations;
  5. Machine learning with Kaggle.

To make the most of Kaggle, having some ability to work with code is helpful. If you don't code, however, no worries—this isn't a technical post. 

1. Kaggle datasets: Access high-quality, relevant data.

Have you ever been in the following situation? You're staring at a large data file with lots of numbers but little explanation. You're trying to figure out what each row and column represents, and no one seems to have precise documentation.

What if we could ensure our datasets were clearly documented? This goes beyond just having a data dictionary for feature definitions.

What if we knew who collected the data, the sources and methodology they used, and if any data is missing? And, if so, why? Is it random? Is there a pattern or reason behind it? Wouldn’t it be nice to know, too, if someone, somewhere, is actively maintaining the dataset?

This is the idea behind Kaggle datasets, a collection of thousands of high-quality datasets—all with an automatic quality score based on availability of metadata. These datasets are searchable and have helpful tags attached to them (e.g., industry, data type, associated analyses, etc.).

Where applicable, the data sources are verified, too. And there’s an added bonus: Given an initial dataset, Kaggle can make recommendations for relevant, complementary datasets. 

There are more than 20,000 datasets in Kaggle, including census, employment, and geographic data, which analysts can access and analyze directly from their browsers. Most importantly, there’s a large variety of datasets related to marketing, ecommerce, and sales.

searching for datasets in kaggle.
Some interesting marketing datasets to explore. They come with a quality score ranging from 1 to 10 based on how complete the documentation is.

How do you find datasets on Kaggle?

It couldn’t be easier:

  1. Connect to kaggle.com. (There’s an optional Google login.) 
  2. Look for the datasets section near the top of the page.
  3. Enter a keyword to search the datasets database.
  4. Scan the results, review the dataset quality scores, interestingness scores, and short descriptions.
  5. Select the dataset that resonates most with you.

Bonus dataset: Google Analytics data from the Google Merchandise store

google analytics data and kaggle.
(Image source)

If you work with Google Analytics, there’s a bonus for you: a dataset associated with the first Kaggle machine-learning competition, which was based on Google Analytics data and concluded earlier this year. 

Digital analysts can access raw, hit-level data (with full ecommerce implementation) that spans a full year of customer activity in the Google Merchandise store. 

Working with this dataset is valuable for understanding the underlying structure of Google Analytics data and for experimenting with advanced statistical and data-mining techniques that can't be applied to aggregate data (which is the norm with standard Google Analytics).
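
As a rough, stdlib-only sketch (with made-up field names, not the actual BigQuery export schema), hit-level data lets you reconstruct individual sessions and page paths that aggregate reports hide:

```python
from collections import defaultdict

# Hypothetical hit-level rows: (visitor_id, timestamp_seconds, page).
hits = [
    ("v1", 0, "/home"), ("v1", 40, "/shirts"), ("v1", 2000, "/home"),
    ("v2", 10, "/home"), ("v2", 70, "/cart"),
]

SESSION_GAP = 30 * 60  # start a new session after 30 minutes of inactivity

sessions = defaultdict(list)  # visitor -> list of sessions (lists of pages)
last_seen = {}
for visitor, ts, page in sorted(hits, key=lambda h: (h[0], h[1])):
    if visitor not in last_seen or ts - last_seen[visitor] > SESSION_GAP:
        sessions[visitor].append([])  # gap exceeded: open a new session
    sessions[visitor][-1].append(page)
    last_seen[visitor] = ts

# Per-session page paths: something aggregate pageview counts can't give you.
paths = {v: [" > ".join(s) for s in ss] for v, ss in sessions.items()}
print(paths)
```

The same grouping logic generalizes to any per-session metric (time between hits, events per session, and so on).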

2. Kaggle community analyses: Jump start your analysis by reviewing others’ work. 

When starting to analyze your marketing data, finding relevant datasets to combine with your original one is useful. But it’s even better if you can see all existing work that’s been published on a given dataset by other Kagglers. This can be a source of inspiration but also a time saver, especially in the initial stage of an analysis. 

It’s sometimes daunting to choose among all available analyses. Similar to a social network, Kaggle shows you how the community has interacted with each piece of work, which can help you spot ideas and analyses that stand out. It’s also a good opportunity to interact and network with members of the Kaggle community who have overlapping interests. 

image showing how communities can help generate a stronger idea.
Kaggle has 3.5 million members contributing code and data. It’s always possible to find inspiration in other Kagglers’ work. (Image courtesy of Kaggle)

A good example of this is the Google Analytics dataset from the previous section. It's accompanied by hundreds of approaches on how to analyze digital analytics data from the Kaggle community—including some from Kaggle grandmasters.

How do you find relevant marketing analyses on Kaggle? 

  1. After selecting a dataset as described in the previous step, you’ll notice that there are several independent Notebooks associated with it. (Notebooks are discussed below in more detail.) 
  2. Every Notebook represents an analysis that includes narrative, code, and output, such as visualizations and data tables with summary statistics. 
  3. To get started, select the one with the highest number of upvotes, a sign of quality and approval from the community. 
  4. If the analysis is indeed of high interest, it’s possible to “fork” the Notebook, thus generating a copy of both the code and data. 
  5. Then, either run the script as is or make changes by creating your own version. An interesting option is to substitute the original author’s data with your own similar dataset before executing the code. 

3. Kaggle Notebooks: Access a powerful laptop in the cloud.

By now, you’ve selected a dataset and collected some good ideas from the Kaggle community to help you get started. As a next step, you’ll want to apply this to your own data.

What’s the most suitable place for all this to happen? An obvious option is your local desktop or laptop. Alternatively, you can go the Kaggle way by working with Kaggle Notebooks (previously known as Kaggle Kernels). This has benefits, especially in cases when:

  • The dataset is several gigabytes in size and impractical to move around or load into local memory every time you analyze it.
  • The task is computationally intensive, and you don’t want to slow down your laptop for the rest of the day.
  • You’re planning to share your analysis with collaborators. 

Let’s have a closer look.

example of a kaggle notebook.
Kaggle Notebooks contain code, computation, and narrative. Work with R, Python, and SQL code directly from the browser—no need to install anything.

Notebooks and computation

A Kaggle Notebook is essentially a powerful computer that Kaggle lets you access in the cloud. It used to be available only for use with public data during competitions. Recently, Kaggle started offering it for private projects at no cost and with the option to use private datasets. 

Visually, Kaggle Notebooks look like Jupyter Notebooks, containing computation, code, and narrative—but they come with some nice extras:

  • They’re equipped with processing hardware, CPUs and GPUs, for computationally demanding analyses. This processing power is useful if you have a lengthy computation or expect a high volume of data to be returned after an API call.
  • They have 16 gigabytes of RAM, which can fit large datasets into memory. (That's more capacity than the average laptop.) 
  • Notebooks have all the latest software libraries preinstalled, as well as versions of R and Python, the main programming languages for data science and analytics.
  • You can attach one or more datasets to a Notebook in a single click, with a total size of up to 100 gigabytes.

Notebooks and collaboration

You can share your analyses with colleagues—without the dreaded “but it works on my machine” scenario. When you share a private Notebook with your collaborators, they automatically access the same isolated computational environment, including the software libraries and version of the programming languages.

Thanks to Docker, the popular containerization technology, there’s no need to install or update software, and no risk of causing software conflicts. 

As soon as your work is done, select public or private visibility for the notebook and share it with collaborators. They can view and run the analysis interactively with one click, straight from their browser.

4. Kaggle cloud integrations: Get access to Google Cloud tech. 

Working within the Kaggle environment acquaints you with cloud workflows. It also offers exposure to new tools and tech—opportunities to pick up new skills, many of which are vital to marketers and digital analysts.

This is thanks largely to integrations Kaggle has with BigQuery and BigQuery ML, and Google Data Studio.

kaggle plus data studio.
google bigquery and kaggle.

I won’t discuss these integrations in great detail here—CXL has several sources (linked above) with detailed product walkthroughs. When it comes to how this works with Kaggle, the essence is that you can:

  • Access data stored in BigQuery directly via Kaggle with some SQL code, then analyze it directly on Kaggle with R or Python.
  • Build and evaluate regression and clustering models without extensive knowledge of machine-learning frameworks.
  • Load a dataset in Kaggle, shape it, and then—via the Data Studio connector—explore the data visually in the Data Studio interface or create dashboards to share with your team.

There’s also an integration with Google Sheets and a brand new one with Google AutoML (see the next section). I wouldn’t be surprised to see more integrations since Kaggle is now part of Google Cloud.

5. Machine learning with Kaggle: High-quality machine learning and AI with zero code. 

Integration with Google’s AutoML was announced in November 2019. It deserves a section of its own because of its potential impact.

As a concept, AutoML isn’t entirely new, but making it accessible as a product en masse via Kaggle is a noteworthy development. The human expertise that’s required for machine-learning development is scarce, a fact often brought up as a bottleneck for the field.

AutoML can lower the barrier to entry for development of machine-learning applications in marketing. It allows marketers with a general understanding of the machine-learning process to use advanced, powerful AI models safely—and without needing to be programmers.

AutoML, which is now available on Kaggle, can also save massive amounts of time spent developing and testing a model manually (the typical case right now).

This won’t, of course, be “AI at the push of a button.” The marketer (or whoever applies AutoML) will need to understand the basics of the process. Unlike other features in Kaggle, its use may result in costs for computation.

In any case, AutoML is a hands-on way to get started with machine learning and AI for marketing, directly within Kaggle.


Kaggle doesn’t cover all aspects of a data and analytics workflow. It’s not the tool to develop production-level systems or store and manage all of your analysis code and artifacts. However, it’s a practical collaboration tool with which marketers can access relevant datasets, explore data, and get ideas to jumpstart their analysis.

Computationally, it’s like a powerful, cloud-based laptop that’s always available for public or private projects. It’s also a bridge to many other cloud services provided by Google, such as BigQuery and Google Data Studio.

Last but not least, AutoML has the potential to be a game changer. It lowers the barrier to entry and empowers marketers to get directly involved in the development of AI and machine learning for projects. 

Becoming familiar with Kaggle Notebooks, the Cloud integrations, and all the other elements of the Kaggle environment can make a future transition to a full-fledged AI platform, including Google’s AI platform, much easier.

The best way to get started? Explore the datasets and ways the Kaggle community has analyzed them. Try the Google Analytics revenue prediction dataset and analysis Notebooks, or the conversion optimization dataset with ROI analysis for Facebook marketing campaigns. 

Happy Kaggling. 

Behavior-Based Attribution Using Google BigQuery ML

The post Behavior-Based Attribution Using Google BigQuery ML appeared first on CXL.

Why create a custom attribution tool? Because with out-of-the-box tools, you’re limited by their functionality, data transformations, models, and heuristics.

With raw data, you can build any attribution model that fits your business and domain area best. While few small companies collect raw data, most large businesses bring it into a data warehouse and visualize it using BI tools such as Google Data Studio, Tableau, Microsoft Power BI, Looker, etc.

Google BigQuery is one of the most popular data warehouses. It’s extremely powerful, fast, and easy to use. In fact, you can use Google BigQuery not only for end-to-end marketing analytics but to train machine-learning models for behavior-based attribution.

In a nutshell, the process looks like this:

  1. Collect data from ad platforms (e.g., Facebook Ads, Google Ads, Microsoft Ads, etc.);
  2. Collect data from your website—pageviews, events, UTM parameters, context (e.g., browser, region, etc.), and conversions;
  3. Collect data from the CRM about final orders/lead statuses;
  4. Stitch everything together by session to understand the cost of each session;
  5. Apply attribution models to understand the value of each session.
multi-channel attribution process map with google bigquery ml.
From start to finish: data collection; extract, transform, and load; Google BigQuery processing; and application—refining campaigns or reporting. 
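
Step 4 above can be sketched in a few lines of Python (hypothetical field names and an even per-session cost split, purely for illustration; a real pipeline would do this in SQL over warehouse tables):

```python
# Hypothetical rows, standing in for warehouse tables.
ad_spend = {("2019-11-01", "november_offer_1"): 100.0}  # (date, campaign) -> total cost
sessions = [
    {"session_id": "s1", "date": "2019-11-01", "campaign": "november_offer_1"},
    {"session_id": "s2", "date": "2019-11-01", "campaign": "november_offer_1"},
    {"session_id": "s3", "date": "2019-11-01", "campaign": None},  # organic: no ad cost
]

# Count sessions per (date, campaign) so each day's spend can be spread evenly.
counts = {}
for s in sessions:
    key = (s["date"], s["campaign"])
    counts[key] = counts.get(key, 0) + 1

# Step 4: attach a cost to every session.
for s in sessions:
    spend = ad_spend.get((s["date"], s["campaign"]), 0.0)
    s["cost"] = spend / counts[(s["date"], s["campaign"])] if spend else 0.0

print([s["cost"] for s in sessions])  # the two paid sessions split the $100
```

Once each session carries a cost, step 5 (attributing value back to sessions) becomes a join against the same table.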

Below, I share how to unleash the full potential of raw data and start using Google BigQuery ML tomorrow to build sophisticated attributions with just a bit of SQL—no knowledge of Python, R, or any other programming language needed.

The challenge of attribution

For most companies, it’s unlikely that a user will buy something with their first click. Users typically visit your site a few times from different sources before they make a purchase.

Those sources could be:

  • Organic search;
  • Social media;
  • Referral and affiliate links;
  • Banner ads and retargeting ads.

You can see all of the sources in reports like Top Conversion Paths in Google Analytics:

Top Conversion Paths report in Google Analytics.

Over time, you accumulate a trove of source types and chains. That makes channel performance difficult to compare. For this reason, we use special rules to assess revenue from each source. Those rules are commonly called attribution rules.

Attribution is the identification of a set of user actions that contribute to a desired outcome, followed by value assignment to each of these events.

There are two types:

  1. Single channel;
  2. Multi channel.
single channel vs. multi-channel attribution graphic.

Single-channel types assign 100% of value to one source. A multi-channel attribution model distributes the conversion values across several sources, using a certain proportion. For the latter, the exact proportion of value distribution is determined by the rules of attribution.

Single-channel attribution doesn’t reflect the reality of modern customer journeys and is extremely biased. At the same time, multi-channel attribution has its own challenges:

  1. Most multi-channel attribution models are full of heuristics—decisions made by (expert) people. For example, with a linear attribution model, the same value is assigned for each channel; in a U-shaped model, most value goes to the first and the last channel, with the rest split among other channels. These attribution models are based on algorithms that people invented to reflect reality, but reality is much more complex.
  2. Even data-driven attribution isn’t fraud-tolerant because, in most cases, it takes into account only pageviews and UTM parameters; actual behavior isn’t analyzed.
  3. All attribution models are retrospective. They tell us some statistics from the past while we want to know what to do in the future.
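
The heuristics mentioned in the first point are easy to state in code. Here is a minimal sketch of linear and U-shaped (position-based) attribution over a conversion path, assuming the common 40/40/20 convention for the U-shaped split:

```python
def _add(out, ch, v):
    out[ch] = out.get(ch, 0.0) + v

def linear(path, value):
    """Equal credit to every touchpoint in the path."""
    out = {}
    for ch in path:
        _add(out, ch, value / len(path))
    return out

def u_shaped(path, value, ends=0.4):
    """40% to the first and last touch; the remaining 20% split across the middle."""
    if len(path) < 3:
        return linear(path, value)  # no middle touches: fall back to an even split
    out = {}
    _add(out, path[0], value * ends)
    _add(out, path[-1], value * ends)
    for ch in path[1:-1]:
        _add(out, ch, value * (1 - 2 * ends) / (len(path) - 2))
    return out

path = ["organic", "facebook", "retargeting", "email"]
print(linear(path, 100))    # 25 to each channel
print(u_shaped(path, 100))  # 40 / 10 / 10 / 40
```

Both models are pure conventions: the coefficients were chosen by people, which is exactly the limitation the machine-learning approach below tries to remove.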

As an example, let’s look at the following figures, which reflect the stats of two November ad campaigns:

example of ad campaign comparison.

“November Offer 1” performs two times better in terms of Cost-Revenue Ratio (CRR) compared to “November Offer 2.” If we had known this in advance, we could’ve allocated more budget to Offer 1 while reducing budget for Offer 2. However, it’s impossible to change the past.
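
For reference, CRR is ad spend divided by the revenue it generated; with hypothetical numbers (not the ones in the screenshot):

```python
def crr(cost, revenue):
    """Cost-Revenue Ratio: ad spend as a share of the revenue it produced."""
    return cost / revenue

offer_1 = crr(1_000, 10_000)  # every $1 of spend returned $10
offer_2 = crr(1_000, 5_000)   # twice as expensive per revenue dollar
print(offer_1, offer_2)
```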

In December, we’ll have new ads, so we won’t be able to reuse November statistics and the gathered knowledge.

another ad campaign comparison (unknown).

We don’t want to spend another month investing in an inefficient marketing campaign. This is where machine learning and predictive analytics come in.

What machine learning changes

Machine learning allows us to transform our approach to attribution and analytics from algorithmic (i.e., Data + Rules = Outcome) to supervised learning (i.e., Data + Desired Outcome = Rules).

Actually, the very definition of attribution (left) fits the machine-learning approach (right):

comparison of definitions between attribution and machine learning.

As a result, machine-learning models have several potential advantages:

  1. They're fraud tolerant;
  2. They can reuse gained knowledge;
  3. They can redistribute value across multiple channels (without manual algorithms and predefined coefficients);
  4. They take into account not only pageviews and UTM parameters but actual user behavior.

Below, I describe how we built such a model, using Google BigQuery ML to do the heavy lifting.

Four steps to build an attribution model with Google BigQuery ML

Step 1: Feature mining

It’s impossible to train any machine-learning model without proper feature mining. We decided to collect all possible features of the following types:

  • Recency features. How long ago an event or micro-conversion happened in a certain timeframe;
  • Frequency features. How often an event or micro-conversion happened in a certain timeframe;
  • Monetary features. The monetary value of events or micro-conversions that happened in a certain timeframe;
  • Contextual features. Information about user device, region, screen resolution, etc.;
  • Feature permutations. Permutations of all the above features to capture non-linear correlations.
feature mining from analytics coming together into a single source.
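
To make the recency/frequency/monetary idea concrete, here's a toy sketch with a made-up event format (a real pipeline would compute these features in SQL over raw event tables):

```python
NOW = 1_000_000  # reference timestamp in seconds, e.g. the session start

# Hypothetical raw events: (timestamp, event_name, monetary_value).
events = [
    (NOW - 86_400 * 9, "add_to_cart", 30.0),  # nine days ago
    (NOW - 86_400 * 2, "add_to_cart", 50.0),  # two days ago
    (NOW - 3_600,      "image_click",  0.0),  # one hour ago
]

def rfm(events, name, window, now=NOW):
    """Recency / frequency / monetary features for one event type in a timeframe."""
    hits = [(ts, v) for ts, ev, v in events if ev == name and now - ts <= window]
    recency = min((now - ts for ts, _ in hits), default=None)  # seconds since last
    frequency = len(hits)
    monetary = sum(v for _, v in hits)
    return {"recency": recency, "frequency": frequency, "monetary": monetary}

week = 86_400 * 7
print(rfm(events, "add_to_cart", week))  # only the two-day-old event is in window
print(rfm(events, "image_click", week))
```

Repeating this per event type, per timeframe, and per session is how a handful of raw event streams fan out into hundreds of model features.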

Compared to simple, algorithmic attribution models, it's important to collect and analyze hundreds or thousands of different events. Below, for example, are different micro-conversions that could be collected on an ecommerce product page for a fashion website:

  • Image impressions;
  • Image clicks;
  • Checking size;
  • Selecting size;
  • Reading more details;
  • Viewing the size map;
  • Adding to Wishlist;
  • Adding to cart, etc.
example of fashion ecommerce product page.

Some micro-conversions have more predictive power than you might expect. For example, after building our model, we found that viewing all images on a product page has enormous predictive power for future purchases.

In a standard ecommerce funnel analysis, a product pageview with one image viewed counts the same as a product pageview with all images viewed. A full analysis of micro-conversions tells a far different story.

We use SegmentStream JavaScript SDK, part of our platform, to collect all possible interactions and micro-conversions and store them directly in Google BigQuery. But you can do the same if you export hit-level data into Google BigQuery via the Google Analytics API or if you have Google Analytics 360, which has a built-in export to BigQuery.

process of collecting behavioral events in google bigquery for processing.

Once raw data is collected, it should be processed into attributes, where every session has a set of features (e.g., device type) and a label (e.g., a purchase in the next 7 days):

description of features and labels in bigquery database.

This dataset can now be used to train a model that will predict the probability of making a purchase in the next 7 days.

Step 2: Training a model

In the days before Google BigQuery machine learning, training a model was a complex data engineering task, especially if you wanted to retrain your model on a daily basis.

You had to move data back and forth between the data warehouse and a TensorFlow or Jupyter Notebook environment, write some Python or R code, then upload your model and predictions back to the database:

example of length of process before bigquery ml.

With Google BigQuery ML, that’s not the case anymore. Now, you can collect the data, mine features, train models, and make predictions—all inside the Google BigQuery data warehouse. You don’t have to move data anywhere:

example of streamlined process with google bigquery ml.

And all of those things can be implemented using some simple SQL code.

For example, to create a model that predicts the probability to buy in the next 7 days based on a set of behavioral features, you just need to run an SQL query like this one*:

         CREATE OR REPLACE MODEL `projectId.segmentstream.mlModel`
         OPTIONS
           ( model_type = 'logistic_reg' )
         AS
         SELECT
           * EXCEPT (labels),
           labels.buyDuring7Days AS label
         FROM
           `projectId.segmentstream.mlLearningSet`

*Put your own Google Cloud project ID instead of projectId, your own dataset name instead of segmentstream, and your own model name instead of mlModel; mlLearningSet is the name of the table with your features and labels, and labels.buyDuring7Days is just an example.

That’s it! In 30–60 seconds, you have a trained model with all possible non-linear permutations, learning and validation set splits, etc. And the most amazing thing is that this model can be retrained on a daily basis with no effort.

Step 3: Model evaluation

Google BigQuery ML has all the tools built-in for model evaluation. And you can evaluate your model with a simple query (replacing the placeholder values with your own):

         SELECT
           *
         FROM
           ML.EVALUATE(MODEL `projectId.segmentstream.mlModel`,
             (SELECT
                * EXCEPT (labels),
                labels.buyDuring7Days AS label
              FROM
                `projectId.segmentstream.mlLearningSet`
              WHERE
                date > 'YYYY-MM-DD'),
             STRUCT(0.5 AS threshold))

This returns the model's evaluation metrics, and the BigQuery console also visualizes the characteristics of the model:

  • Precision-recall curve;
  • Precision and recall vs. threshold;
  • ROC curve.
precision-recall curve, roc curve, and other charts.
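
As a reminder of what the threshold-based metrics mean, here is a toy, stdlib-only illustration of precision and recall at the 0.5 threshold used above (the predictions are invented, not model output):

```python
# (predicted probability, actual label) pairs -- toy values for illustration.
preds = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 1), (0.2, 0), (0.1, 0)]

def precision_recall(preds, threshold=0.5):
    tp = sum(1 for p, y in preds if p >= threshold and y == 1)  # true positives
    fp = sum(1 for p, y in preds if p >= threshold and y == 0)  # false positives
    fn = sum(1 for p, y in preds if p < threshold and y == 1)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(preds))  # lowering the threshold trades precision for recall
```

Sweeping the threshold from 0 to 1 and plotting these pairs is exactly how the precision-recall and ROC curves above are built.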

Step 4: Building an attribution model

Once you have a model that predicts the probability to buy in the next seven days for each user, how do you apply it to attribution?

A behavior-based attribution model:

  • Allocates value to each traffic source;
  • Predicts the probability to buy at the beginning and end of the session;
  • Calculates the “delta” between the two.
example of how attribution is modeled by looking at the change in purchasing behavior.
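
The delta logic can be sketched in a few lines (hypothetical session records; in practice, these probabilities come from the trained BigQuery ML model):

```python
# One row per session: traffic source plus the model's score at session start/end.
sessions = [
    {"source": "facebook", "session_start_p": 0.10, "session_end_p": 0.25},
    {"source": "criteo",   "session_start_p": 0.25, "session_end_p": 0.27},
]

channel_value = {}
for s in sessions:
    delta = s["session_end_p"] - s["session_start_p"]  # the attributed delta
    channel_value[s["source"]] = channel_value.get(s["source"], 0.0) + delta

print(channel_value)  # facebook moved the user most, so it gets most of the credit
```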

Let’s consider the above example:

  1. A user came to the website for the first time. The predicted probability of a purchase for this user was 10%. (Why not zero? Because it’s never zero. Even based on contextual features such as region, browser, device, and screen resolution, the user will have some non-zero probability of making a purchase.)
  2. During the session, the user clicked on products, added some to their cart, viewed the contact page, then left the website without a purchase. The calculated probability of a purchase at the end of the session was 25%.
  3. This means that the session initiated from Facebook pushed the user's probability of making a purchase from 10% to 25%. That's why 15% is allocated to Facebook.
  4. Then, two days later, the user returned to the site by clicking on a Criteo retargeting ad. During the session, the user’s probability didn’t change much. According to the model, the probability of making a purchase changed only by 2%, so 2% is allocated to Criteo.
  5. And so on…

In our Google BigQuery database, we have the following table for each user session:

table with purchasing behavior data.


  • session_start_p is the predicted probability to buy at the beginning of the session.
  • session_end_p is the predicted probability to buy at the end of the session.
  • attributed_delta is the value allocated to the session.

In the end, we have values for all channels/campaigns based on how they impact the users’ probability to buy in the next 7 days during the session:

full attribution model illustrated.

Tidy as that chart may be, everything comes with pros and cons.

Pros and cons of behavior-based attribution


Pros:

  • Process hundreds of features, as opposed to algorithmic approaches.
  • Marketers and analysts have the ability to select domain-specific features.
  • Predictions are highly accurate (up to 95% ROC-AUC).
  • Models are retrained every day to adapt to seasonality and UX changes.
  • Ability to choose any label (predicted LTV, probability to buy in X days, probability to buy in current session).
  • No manual settings.


Cons:

  • Need to set up advanced tracking on the website or mobile app.
  • Need 4–8 weeks of historical data.
  • Not applicable for projects with fewer than 50,000 users per month.
  • Additional costs for model retraining.
  • No way to analyze post-view where there is no website visit.


Machine learning still depends on the human brain’s analytical capabilities. Though unmatched in its data processing capabilities, Google BigQuery ML evaluates factors that the user selects—those deemed fundamental for their business based on previous experience.

Algorithms either prove or disprove the hypothesis of the marketer and allow them to adjust the features of the model to nudge predictions in the needed direction. Google BigQuery ML can make that process vastly more efficient.
