Machine learning has a dirty secret. Despite all the buzzwords and the wonders accomplished with this technology, there’s a very human element at the root of every application. The secret is that to train a model, a real flesh-and-blood person had to tag, label, describe, or otherwise interact with massive amounts of data.
For good reason, developing this data is considered unengaging and unrewarding work. In HBO’s show Silicon Valley, where the grunt work of the tech industry is a running gag, Erlich, one of the show’s main characters, attempts to trick a Stanford introductory computer science class into labeling millions of images of food with which to train his “See Food” application. The application couldn’t be built without the data, and the effort required to build a meaningful dataset was beneath his station as a titan of the technology industry. It’s a commentary on a problem unique to machine learning: the data required to train the model is invaluable and extremely difficult to get, yet producing it is considered lowly work.
A company that’s been solving this problem is CrowdFlower. Founded in 2007, the company adds the human element to machine learning by crowdsourcing the construction and cleaning of massive datasets. It does this through a large online workforce that answers questions that are easy for a human but impossible for a state-of-the-art machine learning model.
CrowdFlower solves a complex problem in a scalable way:
- CrowdFlower continually audits its contributor base with “test” questions to ensure that contributors keep providing accurate responses. Traditionally, this kind of quality control relies on cross-checking answers among multiple contributors to ensure a consistent and precise answer to each problem.
If a person answers too many questions incorrectly, then they are removed from the system and can no longer participate in further tagging operations. Additionally, their previous results are thrown out.
- CrowdFlower enables companies to scale their operations up or down to match demand. Because the company runs so many projects with so many contributors, a customer gets fast turnaround when it needs to build a dataset but is not left with a significant amount of idle time once its tagging operations are complete. Organizing such large groups would be untenable for all but the largest data science operations.
- CrowdFlower supports continuous training. Through its API, even after a model is deployed, developers can use the platform in several ways, such as passing hard-to-categorize cases to a real human in semi-real time or auditing new data to ensure that the model continues to perform correctly.
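As a rough illustration of the quality-control and escalation ideas in the list above, here is a minimal Python sketch. It is not CrowdFlower's API or its actual implementation. The function names, the 70% test-accuracy cutoff, and the 0.8 confidence threshold are all assumptions chosen for the example.

```python
from collections import Counter, defaultdict

# Hypothetical cutoff: minimum share of test questions a contributor must get right.
MIN_TEST_ACCURACY = 0.7


def trusted_contributors(test_answers, answer_key, min_accuracy=MIN_TEST_ACCURACY):
    """Return the contributors whose accuracy on test questions meets the cutoff.

    test_answers: list of (contributor, question_id, answer) tuples.
    answer_key: dict mapping question_id to the known correct answer.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for contributor, question_id, answer in test_answers:
        seen[contributor] += 1
        if answer == answer_key[question_id]:
            correct[contributor] += 1
    return {c for c in seen if correct[c] / seen[c] >= min_accuracy}


def aggregate_labels(judgments, trusted):
    """Majority-vote each item using only judgments from trusted contributors.

    judgments: list of (contributor, item_id, label) tuples.
    Returns a dict of item_id -> (winning_label, agreement), where agreement
    is the winning label's share of the trusted votes for that item.
    """
    votes = defaultdict(Counter)
    for contributor, item_id, label in judgments:
        if contributor in trusted:  # results from removed contributors are discarded
            votes[item_id][label] += 1
    results = {}
    for item_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        results[item_id] = (label, count / sum(counter.values()))
    return results


def needs_human(model_confidence, threshold=0.8):
    """Decide whether a deployed model's prediction should be sent to a person."""
    return model_confidence < threshold


# Example: "bob" fails the test questions, so his judgment on img1 is ignored.
answer_key = {"t1": "cat", "t2": "dog"}
test_answers = [
    ("ann", "t1", "cat"), ("ann", "t2", "dog"),
    ("bob", "t1", "dog"), ("bob", "t2", "cat"),
    ("cy", "t1", "cat"), ("cy", "t2", "dog"),
]
trusted = trusted_contributors(test_answers, answer_key)
judgments = [
    ("ann", "img1", "pizza"),
    ("bob", "img1", "salad"),
    ("cy", "img1", "pizza"),
]
labels = aggregate_labels(judgments, trusted)
```

The agreement score returned alongside each label gives a simple signal for which items may deserve a second look, and `needs_human` shows the same idea applied to a deployed model's own confidence.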
CrowdFlower charges for access to the pool of people who categorize the data. A customer on the platform sets a price for each unit of data they want to collect, and CrowdFlower then takes a commission of roughly 20%. The more complex each unit of data, the higher the price needed to entice a member of the community to take the job. Additionally, the higher the price set for each unit of data, the quicker the project will get done, because more people will be willing to take on the task. As in most things, customers are paying for a combination of speed, complexity, and volume. The more of each a customer wants, the more they are going to pay.
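To make the arithmetic concrete, the following sketch estimates what a labeling project might cost. It assumes the roughly 20% commission is added on top of what contributors are paid, which the description above does not confirm. It also assumes each unit may be judged by more than one person. The function and parameter names are hypothetical.

```python
def estimate_project_cost(units, price_per_unit, judgments_per_unit=1,
                          commission_rate=0.20):
    """Estimate the cost of a labeling job.

    Assumption: the platform's commission is charged on top of contributor pay.
    """
    contributor_pay = units * judgments_per_unit * price_per_unit
    commission = contributor_pay * commission_rate
    return {
        "contributor_pay": contributor_pay,
        "commission": commission,
        "total": contributor_pay + commission,
    }


# 10,000 images, three judgments each, at 2 cents per judgment.
estimate = estimate_project_cost(10_000, 0.02, judgments_per_unit=3)
print(f"Estimated total: ${estimate['total']:.2f}")
```

Raising `price_per_unit` to finish faster, or `judgments_per_unit` to improve accuracy, scales the total linearly, which is the speed, complexity, and volume trade-off described above.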
The company has had success thus far but faces an interesting dilemma. The better the models it helps train become, and the more machine learning models can accomplish on their own, the less need there is for human intervention and, in turn, for its services. The number of applications for machine learning is likely growing in the short term, and the demand for building datasets will likely continue to grow with it. Eventually, though, the company will have solved its way out of the problem it exists to solve.