layout: true

<div class="my-footer"><span><a href="https://dajmcdon.github.io/dsges" style="color:white">dajmcdon.github.io/dsges</a></span></div>

---
background-image: url("gfx/cover.svg")
background-size: contain
background-position: top

<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>

.center[# Your model is beautiful, but does it predict?]

.pull-left[
<br/>

### Daniel J. McDonald
### University of British Columbia
#### NeurIPS: I Can't Believe It's Not Better
]

.pull-right[.center[.middle[
<br/>
<table>
<tr>
<td>![:scale 50%](gfx/qr-slides.png)</td>
<td>![:scale 50%](gfx/qr-code.png)</td>
<td>![:scale 50%](gfx/qr-mywww.png)</td>
</tr>
<tr>
<td>
Slides</td> <td>
Code</td> <td>
My WWW</td>
</tr>
</table>
]]]

???

* Thanks to the organizers for the invitation.
* Cosma was originally scheduled, but couldn't make it.
* I'll try to do my best impression.

---

# Themes

</br>

### "Beauty" or perhaps "elegance" as a research goal

</br>

### A long-gestating paper without a home

</br>

### Why are these models .tertiary[still] so bad?

</br>

???

Beauty:

* Connects to the themes of this workshop well, I think.
* The idea that a certain mathematical explanation for a real-world phenomenon is somehow "satisfying" when it incorporates as many of the underlying factors as possible.

Long-gestating:

* I'll describe the details more later, but suffice it to say that this work started 10+ years ago and still isn't even a completed working paper.

Why:

* Not the first, nor will we be the last, to point out the flaws, but the models are still terrible.

--

### A parable from Macroeconomics

---

.pull-left[
![](https://static1.squarespace.com/static/5ff2adbe3fe4fe33db902812/t/600a67d7222f7e1fc65f70c5/1611294689562/Screen+Shot+2021-01-21+at+11.02.06+AM.png)
]
.pull-right[
![](https://hastie.su.domains/ElemStatLearn/CoverII_small.jpg)
]

???

* I've been teaching undergraduates from these familiar texts.
* Familiar to most in ML: how to understand and evaluate predictive models.
* Statistical models often predict well, even if they are not built up from a realistic model of the world.
* Ridge regression -- no more confidence intervals or p-values.
* Random Forests, Boosting, Deep Learning -- non-parametric and hard to interpret as explanatory, but they may predict very well.

---
class: middle

<blockquote cite="Jason Zweig, WSJ 8 August 2009"><a href=https://www.wsj.com/articles/SB124967937642715417>Over a 13-year period, [David Leinweber] found, [that annual butter production in Bangladesh] "explained" 75% of the variation in the annual returns of the Standard & Poor’s 500-stock index.</br></br>By tossing in U.S. cheese production and the total population of sheep in both Bangladesh and the U.S., Mr. Leinweber was able to "predict" past U.S. stock returns with 99% accuracy.</a></blockquote>

---
class: middle

### 1. Good prediction is possible despite no structural relationship

</br>

### 2. Erroneously conflating .tertiary[In-sample] accuracy with .tertiary[Out-of-sample] accuracy

</br>

???

* The quote illustrates the point with its absurdity.
* But it is a little bit problematic from a technical point of view.

---

## What about the reverse? .tertiary[No.]

<br/><br/>

### A correct causal model that gives accurate counterfactual predictions .tertiary[must] make accurate statistical predictions.

<br/><br/>

* Spirtes, Glymour, and Scheines — Causation, Prediction, and Search (1993)
* Pearl — Causality: Models, Reasoning, and Inference (2000)
* Morgan and Winship — Counterfactuals and Causal Inference: Methods and Principles for Social Research (2015, 2nd ed.)

???

* By the contrapositive, inaccurate predictions -> incorrect causal model.

---
class: middle, inverse, center

# Macroeconomics and causal inference

???

* How has macro evolved with causal inference?

---

## Prior to the mid-1970s,

### Macroeconomics focused on empirical relationships

.center[![:scale 50%](gfx/phillips.png)]

.pull-right-wide[.small[Source: [AW Phillips "The Relation Between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861–1957"](https://doi.org/10.1111/j.1468-0335.1958.tb00003.x)]]

???

* Illustration of a typical (famous) empirical result in macro.
* Inverse relationship between wages (y-axis) and unemployment.
* Says: "when unemployment is low, we have to pay more to entice people to work."
* Makes perfect sense.
* Due to a number of factors:
  1. Getting those who don't want to work to work.
  2. Getting people to leave their current job to come work for me (competition).
  3. On the high side: everyone is looking, so I don't have to compete.
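* A minimal R sketch of the in-sample vs. out-of-sample point from two slides back (not the talk's actual code): regressing one series on unrelated random walks can yield a sizeable in-sample R^2 while the out-of-sample errors are no better than a naive forecast. The sample size and the "butter/cheese/sheep" stand-ins are invented for illustration.

```r
set.seed(1)
n <- 120                                      # invented sample size, roughly "quarterly data"
nonsense <- replicate(3, cumsum(rnorm(n)))    # three unrelated random walks ("butter", "cheese", "sheep")
returns  <- cumsum(rnorm(n))                  # an unrelated target series
train <- 1:100; test <- 101:n

fit <- lm(returns[train] ~ nonsense[train, ])
summary(fit)$r.squared                        # in-sample R^2 is often sizeable (spurious regression)

preds <- cbind(1, nonsense[test, ]) %*% coef(fit)
mean((returns[test] - preds)^2)               # out-of-sample MSE of the "fitted" model
mean((returns[test] - returns[100])^2)        # "no-change" baseline, usually competitive or better
```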
---

.pull-left-narrow[
</br></br>
![](https://www.nobelprize.org/images/lucas-13470-content-portrait-mobile-tiny.jpg)
]

.pull-right-wide[
<blockquote cite="Robert Lucas, Econometric Policy Evaluation: A Critique"><a href=https://doi.org/10.1016/S0167-2231(76)80003-6>Given that the structure of an econometric model consists of optimal decision rules of economic agents, and that optimal decision rules vary systematically with changes in the structure of series relevant to the decision maker, it follows that any change in policy will systematically alter the structure of econometric models.</a></blockquote>
]

???

* If I, as a central banker or Congress, "do something", it alters the observed relationship.
* Can't just intervene.
* So the Phillips curve doesn't tell me how to think about policy, only what happens in the absence of policy.
* Sets up a cyclical argument.

--

### The sentiment is really causal inference, but without the language of Pearl

???

* A graphical model illustrates this immediately, but that language wasn't around at the time.
* See Susan Athey, Guido Imbens, and others for lots of fascinating work on Econ and Causal Inference.

---

## Dynamic Stochastic General Equilibrium Models

* Observe the highly aggregated actions of individual agents, and try to learn their reward function.
* Inverse Reinforcement Learning, but with tiny amounts of data

.pull-left[
`\begin{aligned}
\max_{c_t,l_t}U &= E\sum_{t=0}^\infty \beta^t u(c_t,l_t)\\
&\textrm{subject to}\\
y_t &= z_t g(k_t,n_t),\\
1 &= n_t + l_t \\
y_t &= c_t + i_t \\
k_{t+1} &= i_t + (1-\delta)k_t \\
\ln z_t &= (1-\rho)\ln\overline{z} + \rho\ln z_{t-1} + \epsilon_t \\
\epsilon_t &\stackrel{iid}{\sim} \mbox{N}(0,\ \sigma_\epsilon^2).
\end{aligned}`
]

.pull-right[
![:scale 80%](gfx/dsge.png)

.pull-right-narrow[.small[Source: [Brad DeLong](https://www.bradford-delong.com) ]]
]

???

* Add "microfoundations" whose "deep" parameters are invariant to policy. (Supposedly.)
* Then I can "directly" intervene and see what happens.
* The model is for an individual, and then we assume that all individuals are the same, and then we assume that the aggregate looks like an individual.
* [Matthew Jackson and Leeat Yariv](https://doi.org/10.2139/ssrn.2684776) argue that this last assumption is false for most DSGE specifications.
* We don't observe the individual driver, but maybe "the number of crashes and the total number of parking tickets in the country in a 3-month period".

---

<blockquote cite='"On DSGE Models" by Christiano, Eichenbaum, and Trabandt (2017 Working paper version)'><a href=https://faculty.wcas.northwestern.edu/~yona/research/DSGE.pdf> People who don’t like dynamic stochastic general equilibrium (DSGE) models are dilettantes. By this we mean they aren’t serious about policy analysis.</a> <!--Why do we say this? ... As Lucas (1980) pointed out roughly forty years ago, the only place that we can do experiments is in our models. Dilettantes who only point to the existence of competing forces at work – and informally judge their relative importance via implicit thought experiments – can never give serious policy advice.--> </blockquote>

--

<blockquote cite='"Where Modern Macroeconomics Went Wrong" by Joseph Stiglitz'><a href=https://www.ineteconomics.org/uploads/papers/Where-Modern-Macroeconomics-Went-Wrong.pdf>I believe that most of the core constituents of the DSGE model are flawed—sufficiently badly flawed that they do not provide even a good starting point for constructing a good macroeconomic model.</a></blockquote>

???
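* A minimal R sketch of the exogenous log-productivity AR(1) from the DSGE slide, ln z_t = (1 - rho) ln(zbar) + rho ln z_{t-1} + eps_t. The parameter values below are placeholders, not the published calibration, and the household's decision rules (which require solving the model) are not shown.

```r
# AR(1) in logs: ln z_t = (1 - rho) * ln(zbar) + rho * ln z_{t-1} + eps_t
# rho, zbar, and sigma_eps are placeholder values, not estimated or calibrated ones.
simulate_z <- function(n = 200, rho = 0.95, zbar = 1, sigma_eps = 0.007) {
  lnz <- numeric(n)
  lnz[1] <- log(zbar)
  eps <- rnorm(n, mean = 0, sd = sigma_eps)
  for (t in 2:n) lnz[t] <- (1 - rho) * log(zbar) + rho * lnz[t - 1] + eps[t]
  exp(lnz)                                  # return z_t itself
}
z <- simulate_z()
# A full simulation would feed z_t through y_t = z_t * g(k_t, n_t) together with
# the household's optimal decision rules for c_t, l_t, and k_{t+1}.
```

* This shock process is the only source of randomness in the model: everything else is deterministic given the shocks.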
* Stiglitz is far from the only critical voice (or even the only critical laureate). See Robert Solow's testimony before the House in 2010, Brad DeLong, Robert Waldman, Chris Sims, and others.

--

### I guess I'm a dilettante, but at least I'm in good company.

---

## Long in development with nowhere to go

I started (and mostly finished) this work in 2009 with Cosma Shalizi and Mark Schervish, at the beginning of my PhD.

Gets revisited every 3 years. Mostly "on the back burner".

???

* Largely the "exploration" phase of my thesis work.
* I was interested in this because I'd worked at the Fed, but I wasn't exactly sure what my contribution would be.
* Coincided with the Financial Crisis of 2008-09.
* Got funded by the first INET grant cycle (George Soros), so that I could avoid TA duties over the summer.
* Not obvious how spending a lot of time on it would lead to tenure. Not clear it has any journal home.

--

<br/><br/>

### The original question: "Why didn't these models predict the Financial Crisis?"

<br/><br/>

* .secondary[Macroeconomics answer:] because they were missing important structural mechanisms; let's add them
* .secondary[Our answer:] because they don't predict .tertiary[anything], you just got lucky up to now.

???

* Thesis work ended up focusing on proving the latter for time series (learning theory), rather than the empirical demonstration.

---

## Exploratory behaviour you should demand.

<br/><br/>

### 3 things your beautiful model should do:

<br/><br/>

<table style="width: 100%; vertical-align:top">
<tr>
<td style="width:33%; vertical-align:top">.center[.secondary[.huge[1.]]] <br/>Be internally consistent: if you simulate it, and train on the simulation, you should eventually recover the truth.</td>
<td style="width:33%; vertical-align:top">.center[.secondary[.huge[2.]]] <br/>It should work better on real data than it does on nonsense.</td>
<td style="width:33%; vertical-align:top">.center[.secondary[.huge[3.]]] <br/>It should predict as well as simple baselines (if not better).</td>
</tr>
</table>

???

* The first two are about matching reality.
* The third is more about defining what we mean by "predict well".
* We don't have a "bakeoff" like with the digits data or ImageNet.
* Fall back on something like "does it do better than random guessing?". Standard with AUC analysis, etc.

---
class: middle, inverse, center

# So how does the DSGE do?

--

.huge[1.] Repeat many times: Simulate lots of data. Train on increasingly long periods. Predict.

---

## 1. Are its parameter estimates consistent? .tertiary[No.]

<img src="gfx/parm-estimates-1.svg" width="100%" />

.secondary[.large[Might be OK.]]

???

* Note the light blue window.
* Emphasize:
  1. We used the published model parameters.
  2. Years of data.
  3. Blue/orange lines.
  4. CI bands are quantiles over replicates (50/80).
* Perhaps not identifiable.
* Local minima might still yield good predictors.
* But if so, it undermines model interpretability.

---

## 1. Can it eventually predict its own data? .tertiary[No.]

<img src="gfx/individual-series-1.svg" width="100%" />

???

* Relative to the model that generated the data (don't learn the parameters).

---

## 2. Mess with its brain: just relabel all the series

This model is not symmetric in the inputs/outputs. They're meaningful (supposedly).

Example: give it Income where it thinks it's getting the Interest Rate, re-train the model, and produce out-of-sample predictions.

<img src="gfx/flipping-plot-1.svg" width="100%" />

* Of 5040 permutations, about 51% of the "wrong" ones are better.

---

## 3. Compare to simple baselines

<blockquote cite='"How Useful Are Estimated DSGE Model Forecasts for Central Bankers?" by Edge and Gürkaynak'><a href=https://www.jstor.org/stable/41012847>In line with the results in the DSGE model forecasting literature, we found that the root mean squared errors (RMSEs) of the DSGE model forecasts were similar to, and .secondary[often better than], those of the BVAR and Greenbook forecasts. [However...] these models [all] showed [...] almost no forecasting ability. Thus, our comparison is not between one good forecast and another; rather, .secondary[all three methods of forecasting are poor]....</a></blockquote>

???

* Define BVAR (Bayesian vector autoregression).
* Define Greenbook (the Fed staff's internal forecast).

---
class: middle, center, inverse

# Bad models are used in other fields, too.

???

* I've spent about 16 months deploying and, especially, evaluating COVID forecasts.

---

## Forecasting COVID-19 Incident Cases in the US

* 20 months later, why are so many models worse than [Baseline](https://delphi.cmu.edu/forecast-eval/)?

<img src="gfx/overall-cases-1.svg" width="100%" />

* Of 38 teams, 10 routinely beat the baseline.

???
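* The [Baseline](https://delphi.cmu.edu/forecast-eval/) referenced above is deliberately simple. A rough R sketch of the idea (propagate the last observation forward and build quantiles from recently observed changes), not the Hub's exact implementation; the example counts are made up.

```r
# Illustrative flat-line baseline: point forecast = last observed value,
# predictive quantiles from historical h-step-ahead changes.
baseline_quantiles <- function(y, h = 4, probs = c(0.1, 0.25, 0.5, 0.75, 0.9)) {
  last_obs <- tail(y, 1)
  changes <- diff(y, lag = h)                                 # how much the series moved over h steps
  pmax(last_obs + quantile(changes, probs, na.rm = TRUE), 0)  # case counts can't be negative
}

# Made-up weekly incident case counts, purely for illustration:
y <- c(120, 150, 180, 210, 260, 240, 230, 250, 300, 340)
baseline_quantiles(y, h = 4)
```

* Beating something this simple should be a low bar, which is what makes the 10-of-38 number on the slide striking.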
* Lines are the official CDC forecast (blue) and a pilot ensemble that uses recent skill (orange).
* The score is for quantile forecasts. See more at the COVID-19 Forecast Hub.
* Very hard problem. Many teams are bad, but bad at different times.
* Weighting by recent performance doesn't seem to help the way it should.
* This is a public evaluation (so you can recreate this and see who's doing well).

---

## Conclusions

1. Models that predict well can be useless for policy.
2. Models that predict poorly require serious caveats / justification before being used for policy.
3. This is hard with social science. In many other fields, we discard models that predict poorly (particle physics, astronomy, medicine, etc.).
4. Suggested minimal criteria for a policy-relevant model:
  - Internally consistent.
  - Robust to nonsense.
  - Better than simple baselines.

--

  - Others??

???

* Worth codifying some standard metrics to subject your new model to.
* Perhaps the same kind of "public" declarations as in medical studies.
* Just because it "controls for" or "includes" some potentially relevant feature doesn't mean it's better.

---

## Thanks

.pull-left[.center[
![:scale 30%](gfx/shalizi.jpg) ![:scale 30%](gfx/schervish.jpg)![:scale 39%](gfx/darren.jpg)
]]

.pull-right[ .center[
![:scale 50%](gfx/inet.png) ![:scale 50%](https://www.nsf.gov/images/logos/NSF_4-Color_bitmap_Logo.png) ![](https://www.nserc-crsng.gc.ca/img/logos/img-logo2-en.png)
]]

--

.center[.huge[
.secondary[ICBINB Organizers]
]]

???

* Aaron, Jessica, Stephanie, Francisco, and Melanie