Kaggle is a community data science platform that connects amateur and professional data scientists with companies that have problems solvable with data science skills. While many firms, including NASA (which we discussed in class), leverage a crowdsourced data science model, Kaggle benefits from strong network effects, having built a reputation as the go-to place for data science. On one side, it boasts some of the world’s best data scientists, who are attracted to the platform by the diversity of projects, the active community, and the reputation and prize money on offer. On the other side are companies willing to pay for access to that talent pool (recruitment competitions) and for the opportunity to solve some of their hardest problems (featured competitions). According to ComputerWorld, since its inception in 2009, the Kaggle community has submitted more than four million machine learning models to competitions and shared 170,000 forum posts, more than 250,000 kernels, and 1,000 datasets [https://global-factiva-com.prd1.ezproxy-prod.hbs.edu/redir/default.aspx?P=sa&an=IDGCWA0020170609ed6900002&cat=a&ep=ASE]. Below we identify the core elements and processes that help make Kaggle work, based on both personal experience competing on the platform and firsthand conversations with the Kaggle team.
Screenshot of Kaggle’s Competitions
Unique problems: At any given time, Kaggle hosts 10-20 competitions, each representing a different problem that a company wants to solve. The problems are typically very diverse, spanning many industries and changing over time. During its screening process, Kaggle requires that a company explain why its problem is unique and solvable via data science. In addition to featured and recruitment competitions, Kaggle offers research competitions for cases where a company has a unique problem but is unsure whether a solution exists.
Quality control: Once the problem has been identified, a Kaggle engineer works with a dedicated resource at the company to review the underlying dataset and the target variable (what the company is looking to predict), and helps the company come up with an evaluation metric if it does not have one already. In addition to design and scope, Kaggle works with the company to define the rules and logistics and to configure the launch of the competition. This process can take up to three months, after which the competition is typically open for a further 2 to 3 months.
Prize money: The company puts up a cash prize for the winners (usually a first, second, and third prize) totaling anywhere from $15,000 to $125,000. Featured competitions typically command the largest prize money and are featured at the top of the competitions webpage.
Screenshot of Kaggle’s infrastructure for data analysis
Ease of use: Competition participants conduct their analysis on the Kaggle platform. Kaggle provides a uniform infrastructure for analysis that allows you to easily import the data and run a solution at scale. Kaggle community members often share an exploratory data analysis for each competition, making it easier for others to get started with their own analysis. They are incentivized to do this through a rating system that rewards community members for sharing their work, based on how well that work is received across the community.
Screenshot of Live Leaderboard
Live leaderboard: Once the data scientist is satisfied with her solution, she submits it and her code is run against a test dataset to be evaluated. The participant does not have access to the test dataset. This is referred to as “out of sample” testing, and it ensures that the contestant’s model does not overfit the data she has been working with and that the algorithm is robust enough to work on real-life data. The contestant is automatically graded based on her accuracy. The definition of accuracy can vary, but it is roughly the total absolute difference between the actual and the model-predicted numbers. Those with the smallest out-of-sample difference tend to rank highest. The scores are generated in real time so that contestants can see their progress on a public leaderboard, also maintained by Kaggle.
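To make the scoring idea concrete, here is a minimal Python sketch of how a leaderboard could rank submissions by total absolute difference on a hidden test set. It is an illustration under stated assumptions, not Kaggle’s actual implementation: the function names, the contestant names, and the numbers are all hypothetical, and real competitions each define their own evaluation metric.

```python
from typing import Dict, List, Tuple


def total_absolute_error(actual: List[float], predicted: List[float]) -> float:
    """Sum of |actual - predicted| over every row of the hidden test set."""
    if len(actual) != len(predicted):
        raise ValueError("A submission must contain one prediction per test row.")
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def rank_leaderboard(
    hidden_actual: List[float], submissions: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every submission and sort so the smallest error comes first."""
    scores = {
        name: total_absolute_error(hidden_actual, preds)
        for name, preds in submissions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical data: contestants never see hidden_actual, only their score.
hidden_actual = [10.0, 20.0, 30.0]
submissions = {
    "alice": [11.0, 19.0, 33.0],  # errors 1 + 1 + 3
    "bob": [10.0, 22.0, 30.0],    # errors 0 + 2 + 0
}

leaderboard = rank_leaderboard(hidden_actual, submissions)
for position, (name, score) in enumerate(leaderboard, start=1):
    print(position, name, score)
```

Because the true values stay on the server side, a contestant can only improve her position by building a model that generalizes, not by tuning against the answers.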
Connecting parties: At the end of the contest, the winners are awarded the prize money. Since the winners’ names and solutions are not initially made public, companies pay Kaggle for the winning solution, for access to the contact information of the top contestants, or for both. Companies have typically used Kaggle to recruit top talent worldwide because, even after obtaining a winning solution, they still need to integrate it into a production environment, where constraints may limit the ability to use complex solutions (complexity is something that Kaggle does not penalize).
Kaggle was acquired by Google in 2017. Google has continued to market Kaggle independently but has since integrated it into its cloud platform. As a result, contestants are now able to analyze larger datasets in a more real-world environment. Google currently provides this service free of charge to contestants as part of its mission of “democratizing AI for all.”