Conceptual Challenges of Climate Risk Scoring

We compare two approaches to evaluating climate risk: estimating observable metrics, such as losses or adjusted default probabilities; and assigning constructed scores. The former approach allows for adding up—or otherwise mathematically combining—hazard scores to obtain one that summarizes the totality of a firm’s exposure. The latter does not. While we can define a hazard score consistently by combining any number of objective metrics, attempting to put these constructed scores together gives rise to mathematical problems. Treating the inherently subjective constructed variables as if they were objectively measured or observed leads to the combined score being largely determined by how we normalize them — a choice incidental to the problem at hand. Nevertheless, it could lead to equally valid models reaching very different conclusions based on the same data. This is because constructed variables represent orderings and are therefore defined up to a positive monotonic transformation. Using a rank-based approach that is robust to hazard score normalization gives rise to a setup in which Arrow’s Impossibility Theorem applies. Therefore, no such hazard score combination procedure can avoid violating at least one of the axioms specified by Arrow (1950). Due to this unavoidable inconsistency, we recommend the estimation approach whenever metrics of economic losses are available; and, if they are not, using only observable metrics to define hazard scores. We do not recommend combining constructed hazard scores.

1. Introduction

The global temperature is currently about 1°C above pre-industrial levels. Climate change has already affected nearly every facet of the global economy, from infrastructure, agriculture, and commercial and residential property to human health and labor productivity.¹ Further warming is bound to result in economic damage, although its extent is not yet known; much depends upon the still uncertain policy response.² As climate impacts will vary across locations and industries, there is growing interest among businesses in climate risk: their own, as well as that of their customers, suppliers, and competitors.

But how do we precisely define and quantify climate risk? In this paper, we narrow this question to physical risk, the risks resulting from climatic events such as wildfires, storms, and floods (as opposed to transition risks, which result from policy action taken to transition the economy away from fossil fuels). As global warming progresses, we expect the frequency and severity of physical risk events to increase, causing business disruptions (such as damaged facilities) that result in measurable, monetary losses. Expected damages can be accounted for when predicting the future value of a firm’s assets. Asset value, in turn, is a major determinant of default probability. So physical climate risk reflects either uncertain monetary losses or the impact of those losses upon asset values and default probabilities. All these variables are well-defined, their current values can be observed, and future ones can be estimated with a sufficiently rich dataset.

A major challenge in modeling physical climate risk is that weather hazards are diverse. Facilities resilient to floods, due for example to being located on a hill, may be vulnerable to storms or wildfires. Therefore, a well-designed climate risk model should consist of several sub-models, specific to different hazard types, with the final score integrating the risks. The problem is undeniably complex due to the stringent data requirements, as many different circumstances must be factored in. However, if the output of each sub-model can be translated to a decrease in earnings, revenue, or asset value—in other words, in an observable, monetary variable—we can integrate them into a “total climate score.” In more technical terms, if we know the distributions of a finite number of independent random variables (or the joint distribution of dependent ones), each representing losses due to a particular hazard, we can compute, or at least simulate, the distribution of their sum. In the simplest case, losses are additive: If, in any pre-determined period, a facility is expected to lose $10,000 of value due to floods, $20,000 due to hurricanes and nothing due to other climate hazards, the expected total loss is $30,000.
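The aggregation of observable losses described above can be sketched numerically. The following is a minimal illustration rather than part of the original analysis: the exponential loss distributions, their independence, and the names `simulate_total_losses`, `flood_loss`, and `hurricane_loss` are all assumptions, chosen only so that the expected losses match the $10,000 and $20,000 of the example.

```python
import random


def simulate_total_losses(samplers, n_draws=100_000, seed=0):
    """Draw n_draws scenarios and return the total loss in each one."""
    rng = random.Random(seed)
    return [sum(draw(rng) for draw in samplers) for _ in range(n_draws)]


def flood_loss(rng):
    # Assumption: exponentially distributed loss with a mean of $10,000.
    return rng.expovariate(1 / 10_000)


def hurricane_loss(rng):
    # Assumption: exponentially distributed loss with a mean of $20,000.
    return rng.expovariate(1 / 20_000)


# Hazards are drawn independently here; dependent hazards would require
# sampling from their joint distribution instead.
totals = simulate_total_losses([flood_loss, hurricane_loss])
mean_total = sum(totals) / len(totals)
p95 = sorted(totals)[int(0.95 * len(totals))]

print(f"simulated expected total loss: ${mean_total:,.0f}")
print(f"simulated 95th percentile of total loss: ${p95:,.0f}")
```

Because the losses are expressed in a common unit, the simulated mean of the total should land close to the $30,000 obtained by simple addition, and any other summary of the total-loss distribution (such as a high percentile) is equally well defined.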

Over the years, we have observed the advent of a very different approach to evaluating physical climate risk. Firms are assigned scores, usually ranging from 0 to 100, meant to characterize their exposure to various climate hazards. Unlike losses, which can be observed when they occur and estimated beforehand, scores are constructed, usually by combining multiple variables (sometimes including expert judgement). Scores are not monetary, nor are they expressed in any well-defined units. They represent orderings: a flood score of 67 is meaningful only when compared to another flood score of, say, 80, in which case we can tell which rated entity is more exposed. In contrast, losses have a stand-alone, absolute interpretation; we need no information on other facilities’ exposures to understand what an expected loss of $30,000/year means. None of this is necessarily a problem with scores: many practical applications only require an ordering.

However, a difficulty arises when one attempts to add scores up (or combine them using any other mathematical procedure) as if they were (observable) losses. We show that this approach can lead to two problems. Since scores are constructed and only represent orderings, we can apply any order-preserving transformation to any score without affecting the information it conveys or its interpretation. This makes the final score incredibly sensitive to the choice of normalization—and a subtle alteration here can completely change the ordering implied by the final score. On the other hand, one could treat scores explicitly as orderings and combine them as such. However, a seminal paper by Arrow (1950) has shown that this cannot be done without violating certain commonsense axioms. This result is known as Arrow’s Impossibility Theorem.³ In the climate context, it means that any total score constructed by aggregating three or more hazard orderings must display inconsistencies: The implications of the total score and its components will, at times, contradict each other. It is impossible to avoid this, hence the name of the theorem.

Both problems described above are exacerbated by the fact that hazard scores (e.g., flood scores) are sometimes constructed by integrating multiple sub-scores rather than observable variables. Then, issues arise at the hazard score level too—so even the components of the “total score” are flawed, and the effects compound. Neither issue, however, impacts observable climate risk metrics such as adjusted default probabilities or losses, nor does it affect scores based exclusively on observable variables.

The paper is organized as follows: Section 2 explains the differences between observable and constructed variables, focusing on the implications of combining them. It also pinpoints the setups in which Arrow’s Impossibility Theorem applies. Section 3 states the theorem in an accessible way, providing illustrative examples. Section 4 provides additional context and discusses the practical implications for existing climate scores. Section 5 concludes.

2. Observable and Constructed Variables

2.1 Definitions

Let us define an observable variable as one that we can either count or measure: that is, one for which we can specify a fixed amount (a unit) and express any value it can take as a multiple of that unit. Variables that can be captured from our environment have this property, hence the name. Let us call variables that are not observable “constructed”: since we can’t observe them, we must come up with them. The population of Beijing, the height of the Empire State Building, or the weight of a blue whale are all observable; we can count or measure them in appropriate units. President Joe Biden’s GPA, Exxon Mobil’s governance score, or the result of an Amazon customer satisfaction survey are all constructed: they don’t admit units, and they are either defined by combining other variables or generated to express judgements and preferences.

Observable variables are cardinal; in other words, their magnitudes have precise meanings. If we say that One World Trade Center is 2.3 times taller than One Canada Square, it is a different statement than saying it’s 2.4 times taller.⁴ This is because the heights of both buildings can be related to a common unit (be it 1 meter or 1 foot). In contrast, constructed variables are ordinal: even if they admit numerical values, these can only be interpreted relative to one another. The fact that a student scored 90 points on a mathematics test tells us nothing about his skill without additional context (which we would not need to interpret the statement that he is six feet tall). What we can say is that he is presumably less skillful than a student who scored 180 points on the same test, but it would be a mistake to claim the latter is “twice as good.” Since there are no units of mathematical aptitude, what would the basis of such a statement be? One consequence of this is that we could multiply the scores by any positive number (or apply any other positive monotonic, and therefore order-preserving, mathematical transformation), and they would still convey the same information as before. Of course, even though the magnitudes lack an absolute meaning, the ability to compare the two students can still be very helpful for, e.g., a college admissions committee.⁵ The fact that a bug in a computer system doubling the test scores of all applicants (or squaring them, or taking their natural logarithms) wouldn’t have any impact on the committee’s interpretation of them is certainly not a problem.

2.2 Estimating Observable Variables and Defining Constructed Ones

In empirical science, we often deal with observable variables. Specifically, we use statistical methods to capture how an observable (“dependent”) variable is impacted by other (“independent”) variables, which can, in principle, be either observable or constructed. The resulting model allows us to predict the dependent variable’s value even before it is observed. For example, consider a model connecting losses due to storms across multiple firms in a given year to the values of their assets:

StormLoss_f = β_0 + β_1 AssetValue_f + ε_f

in which f indexes firms and ε_f is an error term.

StormLoss_f is an observable variable—the US dollar is one possible unit. We can estimate the parameters β_0 and β_1 to evaluate how asset value contributes to losses caused by storms and use the result to predict future storm losses based on the value of assets of any firm of interest.

Consider an ostensibly similar relationship:

Sustainability_f = γ_0 + γ_1 Scope1Emissions_f + γ_2 CharitableDonations_f

Sustainability is a constructed variable. We can’t define a unit of it or observe it. Nor can we estimate the above relationship from any dataset. We can only define it: choose γ_0, γ_1, and γ_2 and express the sustainability of a firm as a function of its Scope 1 carbon emissions and charitable donations. Could we use a different set of variables to predict Sustainability_f? Technically yes, but we don’t need to. We already know which variables determine sustainability because we have defined it. If we have the data on firm f’s Scope 1 emissions and charitable donations, we know all there is to know about Sustainability_f. For that reason, constructed variables are rarely used as dependent variables. If readily available, they can be utilized as independent variables to explain and predict an observable variable. Outside of the realm of science, however, we often see constructed variables. All kinds of evaluations are expressed using these.

2.3 The Problem with Combining Constructed Variables as if They Were Observable

So far, we argued that we can predict the value of an observable variable using statistical modeling and formally define a constructed variable. These are completely different exercises: one consists of uncovering and exploiting an existing empirical relationship; the other of creating one to express judgements in a precise and systematic manner. The main point of this paper is that there is an additional caveat related to the latter. While we can use both observable and constructed variables to estimate an observable variable (statistical methods allowing us to select the best predictors), there are mathematical problems associated with using a set of constructed variables to define a new constructed variable.

Why does it matter whether the variables used to define a new constructed variable are themselves observed or constructed? Consider two firms, A and B. Suppose we have two constructed scores for each of them, expressing, say, flood and storm risk (the lower the better), the sum of the two being the final score. Let the score values be as in the table below:

Firm | Flood | Storm | Total
A    | 25    | 25    | 50
B    | 49    | 4     | 53

With a total of 50 against 53, Firm A ranks as the less exposed of the two.

Because the scores are constructed, we can apply any positive monotonic transformation to them without affecting the ordering they imply. In other words, we can normalize and re-normalize any one of them in any way we like, since, in the absence of units, magnitudes mean nothing. Let us apply a simple transform to all the scores: take the square root of each score and multiply it by 10, that is, x ↦ 10√x.

Results are summarized in the table below:

Firm | Flood | Storm | Total
A    | 50    | 50    | 100
B    | 70    | 20    | 90

Ranks have reversed, even though the ordering for each hazard has not. Firm A is still more resilient to floods and Firm B to storms. Note that we could have used different normalizations for different hazards (for example, divided the flood scores by 10 and multiplied the storm scores by π), which would also have reversed the original ordering. The ability to apply positive monotonic transforms to constructed variables gives us so much room for maneuver that we can rearrange the final ordering in an (almost) completely arbitrary way. One thing we cannot do is reverse the ordering if one firm has better scores than another in every category, but the more categories there are, the less of a constraint this is. Of course, if the flood and storm scores were observed (e.g., representing the number of events affecting each firm worldwide or losses in thousands of dollars), we couldn’t apply monotonic transforms, as the magnitudes would be directly linked to observations; losing $50,000 is very different from losing $25,000. That is one reason why, if unambiguous judgement is what we aim for, the constructed score should be a function of observed variables only.
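The reversal can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the score values (flood 25 and 49, storm 25 and 4) and the transform x ↦ 10√x are those implied by the comparisons quoted in Section 2.4, while the helper `total_ranking` and the dictionary names are introduced here purely for illustration.

```python
import math

# Constructed hazard scores (lower is better), as implied by the text.
scores = {
    "A": {"flood": 25, "storm": 25},
    "B": {"flood": 49, "storm": 4},
}


def total_ranking(scores, transforms):
    """Sum transformed hazard scores per firm; return (firms by total, totals)."""
    totals = {
        firm: sum(transforms[hazard](value) for hazard, value in by_hazard.items())
        for firm, by_hazard in scores.items()
    }
    # Lowest total first, i.e. least exposed firm first.
    return sorted(totals, key=totals.get), totals


# Three order-preserving normalizations of the same underlying scores.
identity = {"flood": lambda x: x, "storm": lambda x: x}
sqrt10 = {"flood": lambda x: 10 * math.sqrt(x), "storm": lambda x: 10 * math.sqrt(x)}
mixed = {"flood": lambda x: x / 10, "storm": lambda x: math.pi * x}

for name, transforms in [("identity", identity), ("10*sqrt(x)", sqrt10), ("mixed", mixed)]:
    order, totals = total_ranking(scores, transforms)
    print(f"{name:>10}: order={order}, totals={totals}")
```

With the raw scores, Firm A comes first (50 against 53). Under both alternative normalizations, each of which leaves every per-hazard ordering intact, Firm B comes first.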

2.4 Rank-Based Approaches: Setting the Stage for Arrow’s Impossibility Theorem

There is a mathematically interesting objection to be raised, however. Positive monotonic transformations don’t affect orderings, hence one natural solution to the problem outlined above would be to base the final constructed variable on ranks rather than score values. We could, in other words, use the data of the form:

Firm | Flood rank | Storm rank
A    | 1          | 2
B    | 2          | 1

Ranks follow from 25<49 and 25>4 (with the transform applied, 50<70 and 50>20). An alternative way to present the orderings would be:

Position | Flood | Storm
1st      | A     | B
2nd      | B     | A

in which a higher position denotes greater resilience or, equivalently, less exposure to a given hazard. We could construct the total score based on these exclusively. That would indeed prevent monotonic transformations from affecting the scores, as they are order-preserving. However, there is another problem associated with the rank-based solution: Arrow’s Impossibility Theorem applies. It tells us that the ordering implied by the new variable and the orderings used to construct it will be mutually inconsistent in general. The theorem is what we turn our attention to in the following section.
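The robustness of ranks to normalization can be illustrated directly. The sketch below is an assumption-laden example rather than a prescribed procedure: the helper `ranks` is hypothetical, and the scores and the transform x ↦ 10√x are the ones implied by the comparisons above (25<49 and 25>4 becoming 50<70 and 50>20).

```python
import math


def ranks(scores):
    """Map each firm to its rank (1 = lowest score = most resilient)."""
    ordered = sorted(scores, key=scores.get)
    return {firm: position + 1 for position, firm in enumerate(ordered)}


def transform(x):
    # One positive monotonic transformation; any other would do.
    return 10 * math.sqrt(x)


flood = {"A": 25, "B": 49}
storm = {"A": 25, "B": 4}

for name, hazard in [("flood", flood), ("storm", storm)]:
    before = ranks(hazard)
    after = ranks({firm: transform(score) for firm, score in hazard.items()})
    print(name, before, after, "unchanged" if before == after else "changed")
```

Since the transformation is strictly increasing, the ranks computed before and after it coincide for each hazard, which is exactly why a rank-based total cannot be moved by renormalization.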

3. Arrow’s Impossibility Theorem

For the sake of brevity, we do not state Arrow’s Impossibility Theorem formally or provide the proof. We encourage the reader interested in the mathematical details to consult “Three Brief Proofs of Arrow’s Impossibility Theorem” by John Geanakoplos. Let us instead describe the theorem’s implications by using an example appropriate for our climate-related application.

Consider three firms, A, B, and C. They are ordered by their exposures to three types of acute weather events: Flood, Storm, and Wildfire. The orderings are summarized by the following table:

Position | Flood | Storm | Wildfire
1st      | A     | B     | C
2nd      | B     | A     | A
3rd      | C     | C     | B

in which a higher position implies greater resilience/less exposure to a given hazard. For example, firm B is the least exposed/most resilient to storms and the most exposed/least resilient to wildfires. A useful intuitive interpretation would be that there are three experts, each specialized in one hazard, who order firms according to their exposures. The challenge is to construct an aggregate climate ordering which takes the Flood, Storm, and Wildfire orderings and returns, for example:

Position | Aggregate
1st      | A
2nd      | B
3rd      | C

The interpretation would be that having assigned their respective scores, the three experts get together and decide on an aggregate score. If the aggregation resulted in the ordering above, we would conclude that firm A is the least exposed/most resilient to acute weather events of the three types in general and firm C is the most exposed/least resilient.

Arrow’s Impossibility Theorem states that no way of aggregating orderings can simultaneously satisfy the following three criteria (“axioms”):

1. Non-Dictatorship: There does not exist a single hazard such that the aggregate ordering is always identical to the ordering according to it. In other words, it is not the case that the final ordering is always identical to either the Flood ordering or the Storm ordering or the Wildfire ordering. No single expert is a “dictator.”⁶ A violation of this axiom would mean that we only need to collect data on one of the hazards to infer the aggregate ordering, as the other ones have no impact upon it.

2. Unanimity: If firm X is less exposed/more resilient to every hazard than firm Y, then it should be deemed less exposed/more resilient by the aggregate ordering. For example, if A were above C in the Flood, Storm, and Wildfire orderings alike, A would also have to be above C in the aggregate ordering. A violation of this axiom would be highly counterintuitive. How could one firm be more resilient to every hazard and yet less resilient to all of them combined? If all three experts independently judged that A is less exposed/more resilient than C to their respective hazards, they could hardly let the collective score imply the reverse.

3. Independence of Irrelevant Alternatives: If the ordering of firms X and Y remains unchanged for all hazards, the aggregate ordering of the two firms should not change either. If all experts retain their opinion about the relative exposures of firms X and Y, but change their minds about some other firm(s), that shouldn’t affect the aggregate ordering of X and Y.

Independence of Irrelevant Alternatives is trickier to understand than the other two axioms, so we explain what it means to violate it with an extended example. Let us first propose an aggregation procedure that satisfies Non-Dictatorship and Unanimity but violates Independence of Irrelevant Alternatives. A simple example of such is the Borda Count. Let us apply it as follows:

1. In each category, the least exposed firm receives N-1 points (in which N is the number of firms), the next one receives N-2 points, and so on down to the last one, which receives 0 points.
2. Points in the Storm category are multiplied by 1.5. (This is a tie-breaking rule; in general, Borda Count could use arbitrary weights for all hazards.)
3. Firms are ordered decreasingly by the sums of points they received across categories.
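The three steps above can be sketched as a short function. This is an illustrative implementation, not the only way to write a weighted Borda Count: the function name `borda`, its signature, and the dictionary layout are assumptions, and the orderings used are the ones consistent with the description of the running example (B least exposed to storms and most exposed to wildfires, with an aggregate ordering of A, B, C).

```python
def borda(orderings, weights=None):
    """Weighted Borda Count.

    orderings: hazard -> list of firms, least exposed first.
    weights:   hazard -> multiplier for that hazard's points (default 1).
    Returns (ranking, points) with the ranking sorted by decreasing points.
    Ties, if any, are left in first-seen order rather than resolved.
    """
    weights = weights or {}
    points = {}
    for hazard, order in orderings.items():
        n = len(order)
        for position, firm in enumerate(order):
            # First place earns n - 1 points, last place earns 0.
            earned = weights.get(hazard, 1) * (n - 1 - position)
            points[firm] = points.get(firm, 0) + earned
    ranking = sorted(points, key=points.get, reverse=True)
    return ranking, points


# Orderings of the running example, least exposed firm first.
orderings = {
    "flood": ["A", "B", "C"],
    "storm": ["B", "A", "C"],
    "wildfire": ["C", "A", "B"],
}

ranking, points = borda(orderings, weights={"storm": 1.5})
print(ranking, points)
```

Under these assumptions, the weighted totals are 4.5 points for A, 4 for B, and 2 for C, so the aggregate ordering is A, B, C.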

As a preference-aggregating mechanism, Borda Count tends to favor broadly acceptable options. Particularly if there were many firms, having a high (but not necessarily first) position awarded by all experts could easily earn a firm more points than being ranked first by any two experts but close to last by the third. That is because the more firms there are, the smaller the relative difference in points awarded between any two consecutive positions. That property would make Borda Count an ostensibly reasonable mechanism for our purpose: we wouldn’t want to award a high resilience score to a firm very safe from two hazards but highly vulnerable to the third. Nevertheless, from the mathematical standpoint the choice of rule is irrelevant. A straightforward corollary to Arrow’s Impossibility Theorem implies that any rule satisfying Non-Dictatorship and Unanimity will violate Independence of Irrelevant Alternatives. Therefore, even though many aggregation rules other than Borda Count answer this description, our choice is without loss of generality.

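If we applied Borda Count to our initial example, firms would get the following numbers of points for each hazard:

Firm | Flood | Storm (×1.5) | Wildfire | Total
A    | 2     | 1.5          | 1        | 4.5
B    | 1     | 3            | 0        | 4
C    | 0     | 0            | 2        | 2

Giving rise to the final ordering:

Position | Aggregate
1st      | A
2nd      | B
3rd      | C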

Note that each hazard has a meaningful impact on the final ordering, so Non-Dictatorship holds. If one firm ranks higher than another for every hazard, it receives more points in every category and so its total score is higher. Thus, Unanimity is likewise satisfied. Independence of Irrelevant Alternatives is violated in general, but it holds in this particular case. Suppose we only considered firms A and B. Then, with only two firms, the less exposed one receives 1 point in each category and the other receives 0:

Firm | Flood | Storm (×1.5) | Wildfire | Total
A    | 1     | 0            | 1        | 2
B    | 0     | 1.5          | 0        | 1.5

The same calculation gives A 2.5 points against 1 for C, and B 2.5 points against 1 for C. Therefore, even in individual comparisons firm A is less exposed/more resilient than either B or C and B is less exposed/more resilient than C, which is consistent with the rating involving A, B, and C.

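However, suppose firm C decides to invest in increasing its resilience and moves part of its business to a location less exposed to storms (and no more exposed to the other hazards). Suppose the move places it ahead of A, but not B, in terms of storm resilience. The orderings thus become:

Position | Flood | Storm | Wildfire
1st      | A     | B     | C
2nd      | B     | C     | A
3rd      | C     | A     | B

Adding up points:

Firm | Flood | Storm (×1.5) | Wildfire | Total
A    | 2     | 0            | 1        | 3
B    | 1     | 3            | 0        | 4
C    | 0     | 1.5          | 2        | 3.5

The aggregate ordering is now B, C, A.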

Note that nothing about B has changed. It has 4 points as before, as both A and C are still inferior to it in terms of storm resilience. Yet it has suddenly emerged as the least risky firm. Shareholders of firm A have every reason to complain: just because one firm became more resilient than A in only one category, A emerged as the least resilient of the three firms. Their complaints are all the better founded since, if we considered only firms A and B, we would still have:

Firm | Flood | Storm (×1.5) | Wildfire | Total
A    | 1     | 0            | 1        | 2
B    | 0     | 1.5          | 0        | 1.5

This is because the relative ordering of A and B has not changed in any category.⁷ Thus, according to the rule, firm A is simultaneously the least resilient firm of A, B, and C, but if compared only to B (the leader of the ranking) it would be deemed more resilient. This is the consequence of violating the Independence of Irrelevant Alternatives axiom. Here is one way to interpret this result: If we asked the three experts to evaluate firms A, B, and C and select two for, say, a climate-resilient ETF, they would have chosen B and C, leaving A out. If, however, we independently asked them to compare A, the firm left out, to B, the leader of the ranking, they would have concluded, using the very same methodology, that A is less exposed to climate risk.
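The violation can be reproduced in a few lines. The sketch below is illustrative only: `borda_ranking` and `restrict` are hypothetical helper names, and the orderings are those consistent with the description of the example after firm C's move (C ahead of A, but not B, in storm resilience; everything else unchanged).

```python
def borda_ranking(orderings, storm_weight=1.5):
    """Borda Count with Storm points multiplied by storm_weight.

    orderings: hazard -> list of firms, least exposed first.
    Returns (ranking, points), ranking sorted by decreasing points.
    """
    points = {}
    for hazard, order in orderings.items():
        weight = storm_weight if hazard == "storm" else 1
        n = len(order)
        for position, firm in enumerate(order):
            points[firm] = points.get(firm, 0) + weight * (n - 1 - position)
    return sorted(points, key=points.get, reverse=True), points


def restrict(orderings, firms):
    """Keep only the given firms, preserving their relative order per hazard."""
    return {
        hazard: [firm for firm in order if firm in firms]
        for hazard, order in orderings.items()
    }


# Orderings after firm C's move, least exposed firm first.
after_move = {
    "flood": ["A", "B", "C"],
    "storm": ["B", "C", "A"],
    "wildfire": ["C", "A", "B"],
}

full_ranking, full_points = borda_ranking(after_move)
pair_ranking, pair_points = borda_ranking(restrict(after_move, {"A", "B"}))

print("all three firms:", full_ranking, full_points)
print("A and B only:   ", pair_ranking, pair_points)
```

Rated together with C, firm A finishes last behind B and C. Rated against B alone, with exactly the same per-hazard orderings of the two firms, A finishes ahead of B.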

Another consequence of the Independence of Irrelevant Alternatives violation is that firm B finds itself on top of the ranking, even though its relative position in all the hazard orderings remained the same. One would, of course, not expect stakeholders of firm B to have issues with it. However, suppose firm A overtook C in the storms rating, returning to the original ordering. Then, stakeholders of B could complain about it losing the top position due to a change in the ordering of A and C, both of which are less resilient to storms than B and which doesn’t affect B’s climate risk in any way.

4. Discussion

Arrow’s Impossibility Theorem was originally stated in the context of social welfare as a point about voting systems. In those terms, the experts in our running example are voters and the firms are candidates in an election. Having proven that every electoral system is bound to suffer from logical difficulties, Arrow himself was skeptical about the theorem’s practical significance, stating that any system can work badly rather than will work badly.⁸ However, there seems to be little doubt as to the result’s significance with regard to climate scores. Elections are infrequent events, with relatively few candidates. Typically, who wins matters more than the ordering of the remaining contenders. In contrast, climate scores are produced for tens of thousands (if not millions) of firms, and, in principle, the more frequently they can be updated, the better. Also, the score awarded to each firm matters, much more so than who is the most resilient company in any category. For these reasons, the issues associated with the Theorem have ample opportunity to arise.

How severely do they affect physical climate risk scores? Climate and ESG data providers tend not to make their methodologies public, so we don’t know how exactly (and how often) these problems materialize, only that they do (which the Theorem guarantees). However, there is a growing empirical literature on ESG scores—which, mathematically, work much like constructed physical climate scores—that might suggest an answer.

In the last few years, several authors have discovered inconsistencies in ESG scores across multiple providers. Berg, Kölbel, and Rigobon (2022) report the average correlation between the scores by five major ESG data providers at 0.61 and compare it to the 0.99 correlation between the credit ratings issued by Moody’s and S&P. Kotsantonis and Serafeim (2019) notice disagreement across ESG issue scores (analogous to climate hazard scores), concluding that these can be constructed in different ways, none of which is strictly superior to the next. Christensen, Serafeim, and Sikochi (2021) show that additional disclosures lead to more disagreement about ratings across three ESG data providers. This result showcases the difference between observed and constructed variables that we discussed in Section 2: one would expect two different procedures of estimating an observed variable to converge as we provide more data. However, constructed variables, such as scores, reflect judgements, which tend to diverge.

We can interpret these results as a consequence of either inconsistent normalizations or Independence of Irrelevant Alternatives violations. Consider two climate data providers using the same datasets and methodology, which involves averaging constructed hazard scores. However, let the way they normalize hazard scores be different. Even if we assume that the orderings of all firms by each hazard are the same across both providers, the different normalizations can result in substantially divergent final orderings, as demonstrated in the example in Section 2.3.

If the two providers appreciated the fact that hazard scores are (constructed) orderings and treated them as such, most likely they’d violate Independence of Irrelevant Alternatives, as for most common-sense integration procedures both Non-Dictatorship and Unanimity hold. In that case, if they employed the same datasets, the final orderings would also be the same. However, if either of the providers acquired more data, relative orderings of the firms already present in the “old” dataset would change as a consequence of the Independence of Irrelevant Alternatives violation, leading to divergent final scores across the data providers even for the firms they both rate. This is as if, in our Section 3 example, one provider asked the three experts to rate only firms A and B and the other asked them to rate A, B, and C.

5. Conclusion

We conclude by summarizing our recommendations with respect to evaluating physical climate risk. If possible, using an observed variable as the metric works best. Metrics such as the probability of being affected by an acute weather event, or a monetary loss, are objective and unambiguous. They remain so even if we cannot estimate them precisely (e.g., due to limited data availability). If we need to express an opinion, and no observed metric is appropriate for that purpose, care should be taken to define the score (a constructed variable) using only observed variables. Then, even though such an exercise is inherently subjective, the score can avoid the internal consistency problems we have reviewed (whether it will be predictive of any observed variable of interest is another matter). Defining a new constructed variable using other constructed variables will, depending on the method used, either cause the normalization problem or result in a violation of at least one of Arrow’s axioms by the Impossibility Theorem.

6. Appendix A: Utility Functions

Constructed variables are extensively used in modern economics. After numerous attempts in the late 19th and early 20th century, economists abandoned the idea of utility (happiness) units. It turned out that the benefit of acquiring a good, or even a fixed sum of money, varies too much across individuals and circumstances for it to be possible to define a fixed increase. What is observable and objective is that, given a choice between two alternatives, an individual chooses one over the other. If we define a set of alternatives, say, A_1, A_2, A_3, …, we can ask an agent to perform pairwise comparisons and, without loss of generality, say he prefers A_1 to A_2 to A_3 and so on. Utility functions are a useful way of summarizing (“representing”) preferences. If, given the options A_1 and A_2, the agent chooses the former, we say that any function u such that u(A_1) ≥ u(A_2) represents his preferences. For multiple options, as in the example above, a function u such that

u(A_1) ≥ u(A_2) ≥ u(A_3) ≥ …

represents the agent’s preferences. Importantly, the function itself remains an ordering. Its magnitude has no meaning. The easiest way to see it is to notice that a positive rescaling such as

v(A_i) = a · u(A_i)

for any a>0 represents the very same preferences as u, since the choices it implies the agent would make remain unchanged.

References

Arrow, Kenneth J., “A Difficulty in the Concept of Social Welfare.” Journal of Political Economy, 58, 1950.

Berg, Florian, Julian F. Kölbel, and Roberto Rigobon, “Aggregate Confusion: The Divergence of ESG Ratings.” Review of Finance, 2022.

Christensen, Dane, George Serafeim and Anywhere Sikochi, “Why is Corporate Virtue in the Eye of The Beholder? The Case of ESG Ratings.” The Accounting Review, 97, 2021.

Geanakoplos, John, “Three Brief Proofs of Arrow’s Impossibility Theorem.” Cowles Foundation Discussion Papers, 1368, 1996.

Kotsantonis, Sakis, and George Serafeim, “Four Things No One Will Tell You About ESG Data.” Journal of Applied Corporate Finance, 31, 2, 2019.

McKenna, Phil, “Vote of No Confidence.” New Scientist, 198, 2008.

Nordhaus, William, “The Climate Casino: Risk, Uncertainty, and Economics for a Warming World.” Recorded Books, Inc. and Blackstone Publishing, 2013.

Footnotes

¹Intergovernmental Panel on Climate Change, “Global Warming of 1.5°C,” 2018. World Meteorological Organization, “Provisional Report on the State of the Global Climate 2020”: https://library.wmo.int/index.php?lvl=notice_display&id=21804#.X_89fuhKg2x

²One could say we are well into what William Nordhaus called the “Climate Casino” in his eponymous book.

³Kenneth Arrow won the Nobel Memorial Prize in Economics in 1972 for his work in general equilibrium and welfare economics. The Impossibility Theorem was one of his key contributions to the latter field. We provide more context in the Discussion section.

⁴Incidentally, the former statement is true. One World Trade Center is 1776 ft (541.3 m); One Canada Square is 770 ft (235 m).

⁵Most modern economics is built upon ordinal (constructed) utility functions. See Appendix A for a brief overview.

⁶Note that in our example, the aggregate ordering happens to be identical to the Flood ordering. This is not a violation of Non-Dictatorship: if the ordering proposed by the flood expert (or one of the other hazard experts) were different, it might end up differing from the aggregate. The axiom only says the aggregate ordering cannot always be the same as the Flood ordering. Intuitively, when the three experts get together, at least two of them should have a say rather than the “dictator” drafting the aggregate ordering without regard to what the other two have to say.

⁷Neither has the ordering between B and C. This is because the only change in ordering concerned firms A and C.

⁸McKenna, Phil, “Vote of No Confidence,” New Scientist, 198, 2008.