They say you shouldn’t judge a book by its cover, but we all do: vibrant colors, bold graphics, and eye-catching text draw us in. Many of YouTube’s top content creators spend hours perfecting their thumbnails, because a thumbnail is often the first thing viewers see while looking for something interesting to watch. Thumbnails that accurately and appealingly represent content are a great marketing tool for drawing viewers in. Conversely, thumbnails that misrepresent their videos leave users disappointed and annoyed, leading to poor performance on metrics such as watch time and back-button clicks. That in turn makes those videos less prominent on YouTube, resulting in fewer clicks, views, and follow-on clicks, and less ad revenue. Better thumbnails therefore create value for all the parties involved with the YouTube platform: the users, the content providers, and the advertisers. They also help Google capture a part of this created value by charging advertisers a revenue share.
YouTube employs deep learning techniques to solve the problem of automatically and intelligently generating the “right” thumbnail for each of its videos. Deep learning, a branch of machine learning built on artificial neural networks, involves “training” a computational model on large numbers of examples so it can recognize patterns in data such as images or natural language. The model adjusts its internal parameters as it is fed labeled examples, and its predictions are then checked against held-out data, so it “learns” from experience – like a child learning to communicate.
Inspired by the remarkable advances of deep neural networks (DNNs) in computer vision (such as image and video classification), YouTube recently launched an automatic thumbnail generator to help creators showcase their video content. Here is how it works: while a video is being uploaded, YouTube first samples frames from the video at one frame per second. Each sampled frame is then run through a quality model and assigned a quality score. The frames with the highest scores are selected, enhanced, and rendered as thumbnails in different sizes and aspect ratios.
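The sample-score-select pipeline described above can be sketched roughly as follows. This is a minimal illustration, not YouTube’s implementation: frames are represented as plain dictionaries of timestamps, and `quality_score` is a deterministic placeholder standing in for the trained DNN quality model.

```python
def sample_frames(video_duration_s, fps_sampled=1):
    """Sample one frame per second of video.

    In a real system this would decode actual image frames; here each
    "frame" is just a dict carrying its timestamp, as a sketch.
    """
    return [{"t": t} for t in range(int(video_duration_s * fps_sampled))]


def quality_score(frame):
    # Placeholder for the DNN quality model's output in [0, 1].
    # A real model would score the frame's framing, focus, subject, etc.
    return (frame["t"] % 7) / 7.0


def pick_thumbnail_candidates(frames, k=3):
    """Score every sampled frame and keep the top-k by quality score.

    The selected frames would then be enhanced and rendered as
    thumbnails in several sizes and aspect ratios.
    """
    return sorted(frames, key=quality_score, reverse=True)[:k]


frames = sample_frames(video_duration_s=30)
best = pick_thumbnail_candidates(frames, k=3)
```

The key design point is that scoring is independent per frame, so the whole pass parallelizes trivially across the sampled frames at upload time.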
Now, judging the “quality” of a video frame can be very subjective – people often have very different opinions and preferences when selecting frames as video thumbnails. To overcome this challenge, YouTube collected a large set of well-annotated training examples to feed into its neural network. Fortunately, in addition to algorithmically generated thumbnails, many YouTube videos also come with carefully designed custom thumbnails uploaded by their creators. Those thumbnails typically follow the recommended practices for choosing a thumbnail: they are well framed, in focus, and centered on a specific subject (e.g. the main character in the video). To train the DNN model, YouTube’s developers fed it custom thumbnails from popular videos as positive (high-quality) examples, and randomly selected screen captures as negative (low-quality) examples. Compared to the previous automatically generated thumbnails, the DNN-powered model is able to select frames of much better quality. YouTube reports that, in a human evaluation, the thumbnails produced by the new model were preferred to those from the previous thumbnail generator in more than 65% of side-by-side ratings.
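The construction of that weakly labeled training set can be sketched as below. Everything here is a hedged illustration under stated assumptions: the thumbnail and frame objects are opaque stand-ins for images, and the labeling rule (creator-uploaded custom thumbnails as positives, randomly sampled frames as negatives) is the only part taken from the description above.

```python
import random


def build_training_set(custom_thumbnails, videos, negatives_per_video=2, seed=0):
    """Assemble (example, label) pairs for a binary quality classifier.

    Creator-uploaded custom thumbnails from popular videos are labeled 1
    (high quality); frames sampled at random from videos are labeled 0
    (low quality). Each `video` is assumed to be a dict with a "frames"
    list -- a hypothetical structure for this sketch.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    examples = [(thumb, 1) for thumb in custom_thumbnails]
    for video in videos:
        for _ in range(negatives_per_video):
            examples.append((rng.choice(video["frames"]), 0))
    rng.shuffle(examples)  # avoid all-positive / all-negative ordering
    return examples
```

One caveat worth noting: the random negatives are only *probably* low quality, so the labels are noisy, but with enough examples a DNN can still learn the intended high-versus-low quality distinction.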
YouTube’s DNN-powered thumbnail generator is a great example of a company employing advanced machine learning and data analytics techniques to make computers perform, at scale, tasks that traditionally only human beings could do well. Deep learning is an interesting space to watch, with applications ranging from better YouTube thumbnails to accent recognition and sentiment analysis, and on to improved drug discovery and toxicology.