Every four years, the US media is blessed with a money-making bonanza. During the 2016 presidential debates, CNN was able to charge 40 times its standard rate for a 30-second TV commercial [1]. Elections present an easy opportunity for value creation and capture. Value creation is simplified because the issues are defined even before the race, the protagonists are predetermined, and the journalists know exactly where to look for stories. The audience, in return, has expectations: to enhance their value capture, the news agencies need to provide a reliable flow of information and, most importantly, a credible prediction of the outcome. This credibility determines the long-term ability of the news industry to capture additional value during the election season. However, for the second time in a few months, the news agencies got the prediction wrong. First it was Brexit, and then the US presidential election. While there is still no verdict on why the predictions turned out to be wrong, some of the issues with data aggregation and faulty assumptions are now glaringly obvious.
FiveThirtyEight.com provides an overview of how election outcome prediction models work. At the most basic level, various survey and non-survey data are fed into supervised learning models, which are trained on historical data, and the signals from the various models are aggregated according to predetermined criteria. As the election nears, the weights on signals from regression models are reduced, and the weights on sample polling results are increased. The Atlantic and the Pew Research Center discuss some of the issues with the way data was aggregated for regression analysis:
Non-response bias: When sampling the population through polls, certain demographics such as low-income households or the rural population are difficult to reach. Data cleaning and regression models attempt to correct for this problem based on historical behavior of these demographic groups. But in the case of the 2016 elections, the underrepresented groups deviated overwhelmingly from the historical trend.
The “Shy Trumper” hypothesis: Because of the media frenzy and general societal reaction, openly claiming to be a Trump supporter was not a popular move. Although the hypothesis remains unproven, some claim that many Trump supporters may have represented themselves as Clinton supporters or “undecided” in the polls. The demographic models that allocate undecided respondents to the different camps would then also have been skewed towards Hillary Clinton.
The likely-voter models: One could argue that this is an unnecessary adjustment to the dataset. Currently, however, regressions on demographic characteristics are used to adjust poll results for whether a poll participant is likely to vote in the actual election. It is an indirect way of further adjusting the weights on signals from different regression models. So if certain races or income classes show up in higher numbers than expected, or many traditional Midwestern voters choose to abstain, the predictions turn out to be inaccurate.
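To make the adjustments above concrete, here is a minimal sketch in Python. It is not the method used by FiveThirtyEight or any news agency. The `Respondent` fields, the group labels, the turnout scores, and the linear weighting schedule in `blend` are all assumptions made purely for illustration. The first function reweights a poll so that each demographic group counts in proportion to its share of the population (the usual correction for non-response) and then scales each respondent by an assumed likelihood of voting. The second blends a regression-model estimate with a poll estimate, shifting weight towards the polls as election day approaches.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    group: str           # demographic cell, e.g. "rural" or "urban" (hypothetical labels)
    supports_a: bool     # True if the respondent backs candidate A
    turnout_prob: float  # likely-voter score in [0, 1], assumed to come from a separate model


def weighted_support(sample, population_share):
    """Share backing candidate A after group reweighting and likely-voter scaling."""
    n = len(sample)
    counts = {}
    for r in sample:
        counts[r.group] = counts.get(r.group, 0) + 1

    num = 0.0
    den = 0.0
    for r in sample:
        sample_share = counts[r.group] / n
        # Non-response correction: up-weight groups under-represented in the sample.
        w = population_share[r.group] / sample_share
        # Likely-voter adjustment: scale by the assumed probability of turning out.
        w *= r.turnout_prob
        num += w * (1.0 if r.supports_a else 0.0)
        den += w
    return num / den


def blend(model_estimate, poll_estimate, days_to_election, horizon=100):
    """Mix model and poll estimates; the linear schedule is an illustrative assumption."""
    days = min(max(days_to_election, 0), horizon)
    poll_weight = 1.0 - days / horizon  # reaches 1.0 on election day
    return poll_weight * poll_estimate + (1.0 - poll_weight) * model_estimate
```

As a usage example, take a sample of eight urban and two rural respondents drawn from a population that is split evenly between the two groups. If three of the urban and both of the rural respondents back candidate A, raw support is 50%, but `weighted_support` returns 68.75% when every turnout score is 1.0. The estimate is only as good as the assumed population shares and turnout scores: if the hard-to-reach group behaves differently from what history suggests, or turns out at a different rate, the corrected figure can be further from the truth than the raw one.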
So why, in this age of machine learning and readily available sentiment analysis, did the news channels not do anything to boost their predictive power? More importantly, why did they not review their methods after the Brexit outcome? Some of this has to do with competition and incentives. Although in the long run a news agency’s revenue potential is tied to credibility, the short-term focus on competing for revenue share through “speed-to-market” does not leave any time for a detailed examination of modeling assumptions and outputs. Hopefully, the two recent episodes of missed forecasts will serve as a wake-up call for the media channels using predictive analytics.