If we define "big data" as huge amounts of information about the behavior of individual economic agents, "small data" must be small amounts of information about the behavior of huge collections of individuals. In this article, we explore the importance of small data in risk modeling and other applications and explain how the analysis of small data can help make big data analytics more useful. We consider the problem of representativeness and whether small data can help make big data more representative of the underlying population. We also look at recessions and the potential to use big data when making predictions. Finally, we look at the calibration of big data models to aggregate predictions made using small data analytics.
Take all the data that exist in the universe and define two mutually exclusive and exhaustive subsets. For convenience, label these "big data" and "small data."
What is it about a particular data point that makes it "big"? Most would say that a complete log of my past internet searches is "big" but interviewing me to ask whether I'm employed is much smaller. Mining tracked movements to discover restaurant preferences is, in data terms, huge. Calling on a landline to ask questions about such preferences, even if repeated a million times, is microscopic.
So in all cases, big data involves the collection of breadcrumbs left by people as they go about their day-to-day business. Another common theme is the use of digital technology for information retrieval. Can we identify any pre-digital forms of big data?
Credit bureaus, companies like Experian and Equifax, may be prime candidates. Either to your benefit or detriment, these entities track your activities in the consumer finance sector – applications, trades, payments, and defaults – and then build scores to predict your probability of default and other types of credit behavior. Equifax, for example, dates back to 1899 and provides data that are highly detailed, individual-specific, and voluminous, characteristics that are common to other quintessential forms of big data. Another early example might be CARFAX, a US service established during the 1980s to provide detailed information about the history of individual used cars. It would be hugely ironic, however, if we identified a pioneering big data company that was named after two 19th-century technologies, these being the motor car and the facsimile machine.
In both of these instances, the information housed by the companies would have been costly to collect, compile, and maintain during the early decades, and errors would have been common. An interesting feature shared by both companies is that they can gather data without the consumer having to do anything. If I service my car or trade it in, CARFAX will find out about it from the dealership; if I fail to pay my credit card bill, the bank will inform Equifax. For the dealerships and banks, making these reports incurs some cost, much lower in the digital age, but also provides many commercial benefits. Most notably, all lending institutions can manage risk much more effectively in a world of universal credit reporting.
Rather than the number of observations or the technology used for collection, the defining feature of big data seems to be that it is collected unobtrusively. It is this lack of intrusion that allows the size of datasets to grow explosively. Digital technology enables data collection to become more unobtrusive, even surreptitious, in a wide variety of areas. It is, however, possible to identify a handful of datasets that were already "big" before smartphones became ubiquitous.
The notion of collecting data unobtrusively is not exactly new, though technology dramatically expands the efficiency of data collection. One of the jobs I held briefly during my student days in the late 1980s involved sitting on a street corner and counting all the white cars that passed a particular point, arguably the most boring job ever conceived. While it strongly motivated me to stay in school, my toils likely went unnoticed by passing motorists. It is humbling to think that Google Maps can now collect the output of a few billion teenage Hugheses every minute of every day. I know this will come as a surprise to many of you, but the data are also now of far better quality. I vividly remember daydreaming at various times and making up bogus numbers to satisfy my (largely uninterested) bosses.
In contrast, small data are obtrusive, and they have a significant marginal cost of collection. These costs might be borne by the collector of the data, by the subject, or by both. We identify datasets where costs are incurred only by the collector as an intermediate form that is neither big nor small. The point here is that companies may be willing to gather large databases from unencumbered members of the public so long as the benefits of collection exceed the costs. If costs fall on the subjects, it is highly unlikely they will be willing to comply in large numbers. Datasets that are costly for both sides of the transaction will typically be small in every sense of the word. Costs here include both monetary costs and opportunity costs.
Now that we have separated the concept of big data from the technology used to collect it, we can start to get a little more concrete in our analysis of its properties. Big data is really quite different from small data. The passive manner of its collection gives it a number of interesting attributes, most of which are unquestionably good.
In this article, though, our aim is to identify shortcomings in the big data paradigm and make a positive case for the continued importance of small data. Examples of small data include most macroeconomic series, information collected from old-style surveys of varying types, and data that are generally less detailed in nature. If a database existed in 2000, say, it is most likely "small," though some datasets will be incorrectly classified using this criterion.
Quite recently, I discovered the "timeline" feature on Google Maps. Initially, I was aghast that the good folks at Alphabet knew where I was 24/7, and I vowed to turn it off. Proactive person that I am, however, I let it slide for a few days and then noticed the increasingly helpful transport-related suggestions being made by the app.
Are you listening, Big Brother? I find this stuff quite useful!
Many people, though, would have switched it off. I both envy and pity these people – they are either more adventurous or more paranoid than I am, possibly justifiably so. Personally, I figure that if some corporation wants to track my boring life, it can knock itself out. Targeted ads are sometimes helpful, and a closely tracked digital life provides lots of small efficiency gains. I can, however, understand and respect the choice the refusers have made.
Many of the sets of big data you see are simply not very representative. We all have a choice, and therein lies the rub. Google, Facebook, and a number of others are in the process of collecting reams of data on the behavior of people who, for whatever reason, don't mind being tracked. Insofar as the included group is representative of the target population of interest, there is no problem. If the target is different, however, an unrepresentative dataset can often be difficult or impossible to adjust to make it consistent with the desired group.
One useful example of this comes from measuring price inflation. Academics at the Massachusetts Institute of Technology used big data techniques to scrape prices from internet transactions; they creatively called their work the Billion Prices Project (BPP). By unobtrusively collecting web-based data, the researchers are able to provide price information at a very detailed level, at a daily or even minute-by-minute frequency, for individual products and services or any aggregation thereof. The downside, of course, is that offline transactions are simply not recorded. Most services, like haircuts and plumbing, are typically transacted the old-school way, as are purchases of important staples like food and shelter. Retailers often have different prices for goods purchased online, where Amazon is a constant competitor. Many transactions are still conducted using cash, and these are notoriously difficult to track.
Anyone can check the veracity of the index implied by the BPP by comparing it to the official, small-data Consumer Price Index (CPI). In late 2015 in the US, the BPP aggregate price index fell somewhat faster than the CPI, but it accelerated slightly more during the first half of 2016. Overall, the two series track quite closely, however, which should reassure users of the underlying big data that the statistics are behaving appropriately. Nonetheless, those looking for a general indicator of inflation need to reconcile the situations in which the BPP index deviates significantly from the official underlying price level. If this can be achieved, the detailed analysis made possible by the BPP can then be used most productively.
Big data, apparently, does not move markets. However, intrusion can dramatically improve the quality of statistics.
In collecting the US unemployment rate, for instance, authorities from the Bureau of Labor Statistics (BLS) carefully design the survey to be representative of the entire US population. While they do not have the power to compel individuals to answer the questions, they can at least appeal to their sense of civic duty to participate, and they apparently enjoy a lot of success in achieving this end. The underlying data are collected carefully, consistently, and with integrity. The process is intrusive, given that 60,000 households must agree to participate in a series of eight phone or face-to-face interviews conducted over a 16-month window. The end result, though, is a statistic that is almost universally trusted and that often moves markets the moment it is released.
One "big" data point that comes close is the ADP National Employment Report (NER), which is produced by ADP in conjunction with Moody's Analytics. We use big data sourced from ADP payroll information to create a preview of the next US employment report. The resultant statistic has considerable predictive power and is closely followed by market participants eager to know the looming jobs number. The ADP data themselves are highly detailed, though somewhat unrepresentative, allowing interested users of the data to dig very deeply into the nature of the US labor market.
The model we use to "nowcast" the employment report could, however, be used to calibrate big data-style analyses based on the ADP numbers. The leap of faith required here is that the detailed information captures relativities between groups tolerably well while also assuming that the "big picture" view provided by the BLS report accurately represents the target population.
The moral here is that you can potentially use small data to make big data analyses more representative. Using both forms of data allows you to have your cake and eat it too.
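The calibration idea can be made concrete with a small sketch. It assumes the simplest possible scheme, proportional rescaling, in which the relativities between groups are taken from the detailed source and the total is taken from the representative survey. The sector names and figures are hypothetical and are not drawn from ADP or BLS data, and this is not a description of the model used for the NER.

```python
def calibrate_to_aggregate(group_estimates, official_total):
    """Rescale group-level estimates so they sum to an official total.

    Ratios between groups are preserved; only the overall level changes.
    """
    raw_total = sum(group_estimates.values())
    if raw_total == 0:
        raise ValueError("group estimates sum to zero; cannot rescale")
    scale = official_total / raw_total
    return {group: value * scale for group, value in group_estimates.items()}


# Hypothetical monthly job gains (thousands) by sector from a detailed,
# possibly unrepresentative, payroll-style source.
payroll_gains = {"manufacturing": 20.0, "services": 150.0, "construction": 30.0}

# Hypothetical headline gain (thousands) from a representative survey.
official_gain = 180.0

calibrated = calibrate_to_aggregate(payroll_gains, official_gain)
for sector, gain in calibrated.items():
    print(f"{sector}: {gain:.1f}")
```

The design choice mirrors the leap of faith described above: the sketch trusts the big data for the split between sectors and the small data for the level. If the detailed source misstates the relativities themselves, rescaling will not repair them.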
One interesting point about big data moving markets is that it may be doing so in the form of a constant drip as opposed to the sudden deluge provided by the employment report. The idea is that if market participants know, from insights delivered by big data, the precise position of the economy at all times, the information will always be reflected in market prices. For this to be true, we would expect the impact of the official employment report to be nullified or at least dramatically downgraded. An unexpected jobs report would be a thing of the past, and large market movements would not be seen immediately after any important economic release.
We have seen no research indicating that economic releases are less important than they used to be.
The potential uses of big data in forecasting are many and varied. They ostensibly come in two forms: using big data to potentially improve traditional small data macro forecasting techniques; and using big data to make detailed, granular predictions of micro behavior. Both are of interest to us here.
On the first use, we have seen some small successes and a few highly publicized failures. While our primary concern is with economic and financial forecasting, a useful case study comes from the world of epidemiology – specifically, Google Flu Trends (GFT). This service used Google searches for flu-related keywords to predict, in the short term, flu occurrences as recorded by the Centers for Disease Control and Prevention (CDC) in the US. Initial claims suggested that big data was highly predictive of recorded doctor visits, but within a few years it had started to dramatically overpredict instances of the disease. The program was put into hiatus in 2015, though Google continues to encourage academic research into disease prediction using its abundant search data.
The failure of GFT is usually attributed to two problems, the first of which is overfitting. The researchers at GFT found a high correlation with many search terms that were not directly related to the flu but instead shared a common seasonal pattern with instances of the dreaded grippe. For example, one driver of the prediction model was searches for "high school basketball," whose season piques the interest of American Google users at the same time as the winter flu season. If the basketball season is particularly competitive and prompts greater interest, however, this will have no bearing on the transmission of disease-carrying microorganisms. Those with a background in econometrics will recognize this problem as one of spurious correlation, one of the primary costs of an overreliance on data mining techniques. While ex post predictions were undoubtedly helped by searches for hoops, more pertinent ex ante predictions were unquestionably harmed.
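A simulated illustration of the seasonal trap may help. The sketch below uses entirely synthetic data: two series share a winter peak but are otherwise driven by independent noise. The series names, the seasonal amplitude, and the noise level are all assumptions chosen for illustration; nothing here reproduces the GFT model or its data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def remove_monthly_means(series):
    """Subtract each calendar month's average, leaving the non-seasonal part."""
    means = [sum(series[m::12]) / len(series[m::12]) for m in range(12)]
    return [value - means[i % 12] for i, value in enumerate(series)]


rng = random.Random(0)  # fixed seed so the run is repeatable
months = range(12 * 20)  # 20 years of synthetic monthly observations

# Shared seasonal cycle plus independent noise for each series.
season = [10 * math.cos(2 * math.pi * t / 12) for t in months]
flu_cases = [s + rng.gauss(0, 1) for s in season]
basketball_searches = [s + rng.gauss(0, 1) for s in season]

raw_corr = pearson(flu_cases, basketball_searches)
adjusted_corr = pearson(
    remove_monthly_means(flu_cases),
    remove_monthly_means(basketball_searches),
)
print(f"correlation in levels: {raw_corr:.2f}")
print(f"correlation after removing seasonality: {adjusted_corr:.2f}")
```

In levels, the two series look almost perfectly related. Once the common seasonal pattern is stripped out, what remains is independent noise, so the correlation should collapse toward zero. An unusually exciting basketball season would therefore say nothing about an unusually bad flu season.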
The other problem with GFT was a kind of Hawthorne effect. This is the situation in which human participants in an experiment know they are being monitored and change their behavior either consciously or unconsciously. In the case of GFT, Google understandably wanted to make its research findings available to its users. The problem was that publicity changed the nature of searches being made. By making the watched aware that they are being watched, the utility of watching is often fatally compromised.
Those who are interested in using big data to look for shifts in aggregate credit trends, future financial market performance, or the outlook for the macroeconomy should learn the lessons provided by the failure of GFT. It may be tempting to blindly mine big data, but all revealed correlations must have some additional basis in order to be robust and useful. You may find that searches for "living more frugally" are correlated with increases in claims for unemployment insurance, but the model will fail if asceticism and minimalism ever become trendy again (assuming they ever were).
For those looking to trade using big data, any advantage gained will likely be very short-lived. The prospects for using big data to predict more cyclical aspects of the economy – like credit performance – are somewhat better. Identifying signals that portend turning points, the hardest things to forecast effectively, may be quite fruitful, though it will be extremely difficult to find the right signals among the myriad false flags. It is also important to make sure the signals you receive from unobtrusive data are not warped by any incentives acting on the behavior of the people being watched.
Human nature tells us that more information must always be better than less; that if I had data on everything, I could accurately predict absolutely anything. Veterans of the forecasting game, however, know that this way of thinking is dangerously naïve. Ridiculously simple models – like autoregressive integrated moving averages (ARIMAs), exponential smoothers, or random walks – are extremely hard to beat for forecast accuracy. Small data macro models like vector autoregressions (VARs) and structural models hold a key place in the macro forecaster's arsenal because they can capture feedback loops that are often critically important when projecting the behavior of complex systems.
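One practical way to respect this lesson is to score any new model against a naive benchmark before trusting it. The sketch below assumes a one-step-ahead random walk benchmark and mean absolute error as the accuracy measure; the series and the candidate forecasts are hypothetical numbers invented for illustration.

```python
def random_walk_forecasts(series):
    """One-step-ahead forecasts: next period is predicted to equal this period."""
    return series[:-1]


def mean_absolute_error(actuals, forecasts):
    """Average absolute gap between actual values and forecasts."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def relative_mae(series, candidate_forecasts):
    """Candidate MAE divided by random walk MAE.

    Values below 1 mean the candidate beat the naive benchmark.
    Assumes the series is not constant (the benchmark error is nonzero).
    """
    actuals = series[1:]
    benchmark = mean_absolute_error(actuals, random_walk_forecasts(series))
    candidate = mean_absolute_error(actuals, candidate_forecasts)
    return candidate / benchmark


# Hypothetical monthly unemployment rates (percent).
unemployment = [5.0, 4.9, 4.9, 5.0, 4.7, 4.8]

# Hypothetical forecasts from a more elaborate model for periods 2 to 6.
candidate = [5.1, 4.8, 5.0, 4.9, 4.6]

ratio = relative_mae(unemployment, candidate)
print(f"relative MAE versus random walk: {ratio:.2f}")
```

In this made-up example the elaborate model loses to "no change," which is the kind of result that forecasting veterans learn to expect. A candidate built on big data earns its place only when the ratio falls reliably below 1 on data it has not seen.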
If used wisely, big data could enhance these traditional, old-school forecasting models. The likelihood that big data techniques replace small data forecasting methods, however, is vanishingly small.
If we want to make highly detailed predictions about human behavior, big data is really the only game in town. In the credit space, perhaps the most exciting potential applications of big data involve scraping social media and other online activities for clues about potential creditworthiness. Credit, after all, is a matter of trust, and if it can be demonstrated that someone has deep and strong connections to a wide social network, it should be indicative of sound personal attributes like reliability and constancy. Scores based on a person's social network should be especially powerful when applied to the underbanked, including young people and new migrants, and when utilized in countries without a well-established credit reporting infrastructure.
In areas where credit scores are known to work well, however, the best we can hope for is that big data will allow incremental efficiency gains to be made.
Traditional credit scores rely on an observed historical willingness and ability to remain current on one's obligations. It's strictly business. The score does not care whether you used your credit card to sponsor a poor child or to buy yourself a new big-screen television. It only cares about the extent to which you have utilized your available credit line and whether you subsequently remained current on your obligation to repay the incurred debt.
It's very difficult to improve your credit score artificially. I have heard of some folks in the US who maintain numerous credit cards with high limits in a bid to boost their scores, to what end I'm not entirely sure. I have also seen research conducted by credit bureaus that suggests that average credit scores are rising because people, in general, are getting better at managing their financial affairs. I'm rather skeptical that this is what's happening – a strong labor market may simply be holding down default rates – but the theory is indeed vaguely plausible.
In contrast, a score based on social media mixes business with pleasure. Suppose I go to a dealership to buy my dream car. It turns out that the score based on my social network is too low and I get rejected for the finance I need to buy it. Now I send you a friend request on Facebook. I sat three rows behind you in Algebra I and we once shared a cab ride back from a party. You may think I'm making a nice gesture in trying to reconnect when in fact I'm cynically asking you to approve my credit application.
While a pure, well-established social network may be a great indicator of creditworthiness, injecting questions of money into the network makes it fundamentally less social. A social network credit score may thus be self-defeating, providing illusory, short-term performance gains while simultaneously poisoning online society.
In the world of commercial credit, pleasure is never a factor. One application of big data to business scoring involves using web traffic, sentiment analysis, and text mining to improve traditional scores based on financial statements and stock and bond prices. I am hard-pressed to think of a business that wants to decrease its web presence or that wants to be viewed in a negative light by its key stakeholders or customers. These features are undoubtedly more important in some industries than in others, but all well-run companies would prefer to boost these kinds of metrics.
In the case of consumer analytics, the big data is often beside the point. I liked Richard Branson's LinkedIn post, therefore I'm more likely to buy cheese. Among commercial entities, where long-term profitability and growth are the only concerns, we feel that big data will be more impactful because online activity cuts to the heart of what the business is trying to achieve.
More information is always better than less. In every situation, if data collection is cheap or free or is a byproduct of core business activities, the data should be analyzed, even if their usefulness is not immediately apparent. Having the data, however, does not necessarily mean you should use them or that you should believe any particular piece of analysis that stems from their existence. The data, after all, are valuable simply because of the options they may provide to analysts when plying their trade.
With big data, the temptation of the analyst is akin to that of a kid in a candy store. With so much data, surely our forecasts will be more accurate, our customers will be happier and better targeted, and our profits will inevitably rise. This is all probably true, though we still need to develop the skills to decode the complex signals from big data to make it useful in our businesses. If everything is statistically significant, you then face the difficult question of working out which measured effects are most materially significant to the operation of the business.
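The gap between statistical and material significance is easy to demonstrate. The sketch below assumes a standard two-proportion z-test with a pooled standard error, applied to hypothetical purchase counts for two very large customer groups; the group sizes and rates are invented for illustration.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z-statistic for the difference between two sample proportions,
    using the pooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


# Hypothetical: ten million customers in each group.
n = 10_000_000
purchases_control = 500_000   # 5.00% purchase rate
purchases_targeted = 505_000  # 5.05% purchase rate

z = two_proportion_z(purchases_control, n, purchases_targeted, n)
lift = purchases_targeted / n - purchases_control / n

print(f"z-statistic: {z:.1f}")
print(f"lift in purchase rate: {lift * 100:.2f} percentage points")
```

With samples this large, a lift of five hundredths of a percentage point produces a z-statistic of about 5, far beyond conventional thresholds. Whether such a lift justifies the cost of the targeting is a business question that the test statistic cannot answer.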
Finally, we need to remember not to throw the baby out with the bathwater. Small data methodologies, especially where they pertain to forecasting, are very well-honed and will be very hard to beat using non-traditional data sources.
Big data analysts, nevertheless, should try their darnedest to do so.