Kaggle, the Google-acquired data science platform, started as a virtual meeting point for machine-learning geeks to compete on predictive accuracy scores.
It evolved into a Swiss Army knife for data science and analytics—one that can help data professionals, including data-driven marketers, elevate their analytics game.
Despite being a free service, Kaggle can help address an increasing number of data challenges:
This is, of course, just a partial list. This post focuses on these and other marketing-friendly use cases for Kaggle.
Kaggle launched in 2010. It became known as a platform for hosting machine-learning competitions. The competitions were typically sponsored by large companies, governments, and research institutes.
Their goal was (and still is) to leverage the collective intelligence of thousands of data scientists around the world to solve a data problem.
In a Kaggle competition, you can compete for jobs or money (or glory). But the platform has evolved from that initial use case.
In 2017, Kaggle was acquired by Google. After the acquisition, it started branching out into more areas of data science and analytics. The aim is clear—to become a one-stop shop for data professionals. (It’s currently being rebranded as “the home for data science.”)
Below, I discuss five fresh and relevant features for marketers, regardless of technical ability:
To make the most of Kaggle, some ability to work with code is helpful. If you don’t code, however, no worries: this isn’t a technical post.
Have you ever been in the following situation? You’re staring at a large data file with lots of numbers but little explanation. You’re trying to figure out what each row and column represents, and no one seems to have precise documentation.
What if we could ensure our datasets were clearly documented? This goes beyond just having a data dictionary for feature definitions.
What if we knew who collected the data, which sources and methodology they used, and whether any data is missing? And, if so, why? Is it random? Is there a pattern or reason behind it? Wouldn’t it be nice to know, too, if someone, somewhere, is actively maintaining the dataset?
This is the idea behind Kaggle datasets, a collection of thousands of high-quality datasets, each with an automatic quality score based on the availability of metadata. These datasets are searchable and have helpful tags attached to them (e.g., industry, data type, and associated analyses).
Where applicable, the data sources are verified, too. And there’s an added bonus: Given an initial dataset, Kaggle can make recommendations for relevant, complementary datasets.
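If you do code, a few lines are enough to check whether missing values look random or cluster in one segment. The sketch below is only an illustration: it uses a small made-up campaign table (the column names and figures are hypothetical, not taken from any Kaggle dataset) and nothing beyond Python’s standard library.

```python
import csv
import io
from collections import defaultdict

# Hypothetical campaign export; columns and values are made up for illustration.
SAMPLE_CSV = """campaign,channel,spend,conversions
spring_sale,email,120.0,14
spring_sale,social,300.0,
brand_awareness,social,250.0,
brand_awareness,email,80.0,9
retargeting,search,410.0,31
retargeting,social,,12
"""


def is_missing(value):
    """Treat empty strings and common placeholders as missing."""
    return value is None or value.strip().lower() in {"", "na", "n/a", "null"}


def missing_share_by_column(rows):
    """Return {column: fraction of rows where the value is missing}."""
    counts = defaultdict(int)
    for row in rows:
        for column, value in row.items():
            counts[column] += is_missing(value)
    return {column: counts[column] / len(rows) for column in rows[0]}


def missing_share_by_group(rows, target, group):
    """Return {group value: fraction of that group's rows missing `target`}."""
    totals, missing = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[group]] += 1
        missing[row[group]] += is_missing(row[target])
    return {key: missing[key] / totals[key] for key in totals}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(missing_share_by_column(rows))
print(missing_share_by_group(rows, target="conversions", group="channel"))
```

In this made-up sample, every missing conversions value sits in the social channel, which would hint at a tracking gap rather than random loss. That is exactly the kind of detail good dataset documentation should spell out.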
There are more than 20,000 datasets in Kaggle, including census, employment, and geographic data, which analysts can access and analyze directly from their browsers. Most importantly, there’s a large variety of datasets related to marketing, ecommerce, and sales.
Some interesting marketing datasets to explore. They come with a quality score ranging from 1 to 10 based on how complete the documentation is.
It couldn’t be easier:
If you work with Google Analytics, there’s a bonus for you: the dataset from the first Kaggle machine-learning competition to be based on Google Analytics data, which concluded earlier this year.
Digital analysts can access raw, hit-level data (with full ecommerce implementation) that spans a full year of customer activity in the Google Merchandise store.
Working with this dataset helps you understand the underlying structure of Google Analytics data. It also lets you experiment with advanced statistical and data-mining techniques that can’t be applied when the data is in aggregate form (the norm with standard Google Analytics).
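To see why hit-level data matters, here is a minimal sketch with made-up rows. The field names and values are hypothetical and do not reflect the actual schema of the Google Merchandise Store dataset. An aggregate report would give you total revenue and visitor count, and therefore only the mean. Per-visitor rows also give you the median and the spread.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical hit-level rows: (visitor_id, session_id, hit_type, revenue).
# Names and values are illustrative only, not the real dataset schema.
HITS = [
    ("v1", "s1", "pageview", 0.0),
    ("v1", "s1", "transaction", 20.0),
    ("v2", "s2", "pageview", 0.0),
    ("v2", "s3", "pageview", 0.0),
    ("v3", "s4", "pageview", 0.0),
    ("v3", "s4", "transaction", 500.0),
    ("v4", "s5", "pageview", 0.0),
    ("v4", "s6", "transaction", 30.0),
]


def revenue_per_visitor(hits):
    """Sum revenue across all hits for each visitor."""
    totals = defaultdict(float)
    for visitor_id, _session_id, _hit_type, revenue in hits:
        totals[visitor_id] += revenue
    return dict(totals)


def sessions_per_visitor(hits):
    """Count distinct sessions for each visitor."""
    sessions = defaultdict(set)
    for visitor_id, session_id, _hit_type, _revenue in hits:
        sessions[visitor_id].add(session_id)
    return {visitor: len(ids) for visitor, ids in sessions.items()}


per_visitor = revenue_per_visitor(HITS)
print("mean revenue per visitor:", mean(per_visitor.values()))      # 137.5
print("median revenue per visitor:", median(per_visitor.values()))  # 25.0
print("sessions per visitor:", sessions_per_visitor(HITS))
```

The mean (137.5) is pulled up by a single large order, while the median (25.0) describes the typical visitor far better. Only row-level data lets you tell the two apart.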
When starting to analyze your marketing data, finding relevant datasets to combine with your original one is useful. But it’s even better if you can see all existing work that’s been published on a given dataset by other Kagglers. This can be a source of inspiration but also a time saver, especially in the initial stage of an analysis.
It’s sometimes daunting to choose among all available analyses. Similar to a social network, Kaggle shows you how the community has interacted with each piece of work, which can help you spot ideas and analyses that stand out. It’s also a good opportunity to interact and network with members of the Kaggle community who have overlapping interests.
A good example of this is the Google Analytics dataset from the previous section. It’s accompanied by hundreds of approaches to analyzing digital analytics data, all contributed by the Kaggle community and including some from Kaggle grandmasters.
By now, you’ve selected a dataset and collected some good ideas from the Kaggle community to help you get started. As a next step, you’ll want to apply this to your own data.
What’s the most suitable place for all this to happen? An obvious option is your local desktop or laptop. Alternatively, you can go the Kaggle way by working with Kaggle Notebooks (previously known as Kaggle Kernels). This has benefits, especially in cases when:
Let’s have a closer look.
Kaggle Notebooks contain code, computation, and narrative. Work with R, Python, and SQL code directly from the browser—no need to install anything.
A Kaggle Notebook is essentially a powerful computer that Kaggle lets you access in the cloud. It used to be available only for use with public data during competitions. Recently, Kaggle started offering it for private projects at no cost and with the option to use private datasets.
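For readers who do code, a typical first step in a Notebook is finding the files of the datasets you attached. The sketch below assumes they are mounted read-only under `/kaggle/input`, the convention Kaggle Notebooks use. The helper name `list_data_files` is my own, and outside Kaggle the function simply returns an empty list.

```python
from pathlib import Path

# Assumed mount point for datasets attached to a Kaggle Notebook.
KAGGLE_INPUT = Path("/kaggle/input")


def list_data_files(root=KAGGLE_INPUT, suffixes=(".csv", ".json")):
    """Return data files under `root`, sorted; empty list if `root` is absent."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


# Prints one path per attached data file (nothing when run outside Kaggle).
for path in list_data_files():
    print(path)
```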
Visually, Kaggle Notebooks look like Jupyter Notebooks, but they come with some nice extras:
You can share your analyses with colleagues without the dreaded “but it works on my machine” scenario. When you share a private Notebook with your collaborators, they automatically access the same isolated computational environment, including the same software libraries and programming-language versions.
Thanks to Docker, the popular containerization technology, there’s no need to install or update software, and no risk of causing software conflicts.
As soon as your work is done, select public or private visibility for the notebook and share it with collaborators. They can view and run the analysis interactively with one click, straight from their browser.
Working within the Kaggle environment acquaints you with cloud workflows. It also offers exposure to new tools and tech—opportunities to pick up new skills, many of which are vital to marketers and digital analysts.
I won’t discuss these integrations in great detail here—CXL has several sources (linked above) with detailed product walkthroughs. When it comes to how this works with Kaggle, the essence is that you can:
There’s also an integration with Google Sheets and a brand new one with Google AutoML (see the next section). I wouldn’t be surprised to see more integrations since Kaggle is now part of Google Cloud.
Integration with Google’s AutoML was announced in November 2019. It deserves a section of its own because of its potential impact.
As a concept, AutoML isn’t entirely new, but making it accessible as a product en masse via Kaggle is a noteworthy development. The human expertise that’s required for machine-learning development is scarce, a fact often brought up as a bottleneck for the field.
AutoML can lower the barrier to entry for development of machine-learning applications in marketing. It allows marketers with a general understanding of the machine-learning process to use advanced, powerful AI models safely—and without needing to be programmers.
AutoML, which is now available on Kaggle, can also save massive amounts of time spent developing and testing a model manually (the typical case right now).
This won’t, of course, be “AI at the push of a button.” The marketer (or whoever applies AutoML) will need to understand the basics of the process. Unlike other features in Kaggle, its use may result in costs for computation.
In any case, AutoML is a hands-on way to get started with machine learning and AI for marketing, directly within Kaggle.
Kaggle doesn’t cover all aspects of a data and analytics workflow. It’s not the tool to develop production-level systems or store and manage all of your analysis code and artifacts. However, it’s a practical collaboration tool with which marketers can access relevant datasets, explore data, and get ideas to jumpstart their analysis.
Computationally, it’s like a powerful, cloud-based laptop that’s always available for public or private projects. It’s also a bridge to many other cloud services provided by Google, such as BigQuery and Google Data Studio.
Last but not least, AutoML has the potential to be a game changer. It lowers the barrier to entry and empowers marketers to get directly involved in developing AI and machine-learning projects.
Becoming familiar with Kaggle Notebooks, the Cloud integrations, and all the other elements of the Kaggle environment can make a future transition to a full-fledged AI platform, including Google’s AI platform, much easier.
The best way to get started? Explore the datasets and ways the Kaggle community has analyzed them. Try the Google Analytics revenue prediction dataset and analysis Notebooks, or the conversion optimization dataset with ROI analysis for Facebook marketing campaigns.