Predicting 2021 BADLY – Methodology

Maybe I’m crazy, but I spent my free time this past week building a bad election prediction model.

It’s a bad model for three reasons.

The first is that I decided to build it in a week’s worth of spare time, so it’s missing a number of important features — most notably, any measure of the quality of local candidates (even incumbency, or whether a party is running a candidate in a given riding at all — I literally assumed, incorrectly, that the BQ runs in every Quebec riding and the other five parties run in all 338 ridings).

The second is that I decided to build it in a week’s worth of spare time, so it’s missing the robustness that a real model should have; I hacked in some local variation that seemed about right to me, and it runs a little slower than would be strictly ideal — about 6 or 7 seconds per simulated election, which sounds fast, but not when you want thousands of iterations.

The third is that it’s not even built with 2021 polling data. Which is a pretty serious flaw, if you ask me.

Why bother

Well, it seemed like a cool idea at the time, for one. And for another, I think that (with better data and a little more time) it could be an improvement on Canadian election projection techniques. Here’s a quick rundown of the major election forecasting techniques, with massive simplifications:

American elections involve the electoral college, where each state has a certain number of electors and (in most cases) the winner of the state gets all of its electors. So the key things to do in a US election projection are to analyze the polls for each state, and then develop a model that represents the uncertainty inherent in the polls and the potential for a swing to be correlated across states. The classic example was 2016, when lower-education white voters in the Midwest went for Trump much more than anticipated, pulling a bunch of states over to his side. This correlation is the tough part; ask Sam Wang at Princeton about that. US elections attract the most forecasting energy and are the best known — even a leading Canadian predictor, 338Canada, takes its name from FiveThirtyEight.

This method doesn’t work in Canada for two big reasons. The main one is that our polling data is much thinner on the ground. In the US, there are usually only 10-15 relevant states in a (non-landslide) election, and they get polled heavily; there’s good data on what the voters in Iowa are thinking. In Canadian elections, there aren’t 50 states in play, there are 338 ridings, so the polling is spread thinly. Forget riding-level polling; pollsters don’t even report province-level results for provinces like Nova Scotia, with 11 ridings. (Which makes sense; a 2,000-person poll, which would be on the large side, will only average 5 or 6 respondents per riding, and in some cases may have 1 or 2.)

The second reason is that US elections are basically two-way affairs; in Canada, there are somewhere between 4 and 6 major parties, and the contests come in many flavours: red-blue, blue-orange, red-orange, rouge-azur and so on. Recent US elections have had 5 or 6 states switch sides; recent Canadian elections have had 62 (2019) and 158 (2015) seats switch parties.

The primary methods I’ve seen used here in Canada (some places are not clear about their methodology) are variations on the ‘proportional swing’ method. This means, essentially, that if the Pink party got 20% of the vote in a particular province last time, and 14% in a specific riding, and today they’re polling at 30% in that province (150% of the old vote), then we assume they get 21% in the riding (14% × 150% = 21%).
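
As a sketch of that arithmetic (the function name is mine, not any forecaster’s actual code), the proportional swing calculation is just:

```python
def proportional_swing(riding_share, old_prov_share, new_prov_share):
    """Scale a riding's previous vote share by the province-wide swing ratio.

    All shares are percentages. Example from the text: a party at 14% in a
    riding, whose provincial vote went from 20% to 30%, is projected at
    14 * (30 / 20) = 21%."""
    return riding_share * (new_prov_share / old_prov_share)
```

One obvious wrinkle: a big enough provincial surge can project a riding share above 100%, so a real implementation would need to clamp or renormalize the shares.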

This is not an unreasonable methodology; it allows regional polling to have an impact on local seats, and it’s a reasonable assumption that if a party is doing well somewhere, it’ll be doing much better where it was strongest, and not as much better where it was weakest last time.

The United Kingdom shares the same problems in forecasting as Canada, which makes sense given that our electoral system is derived from theirs. Lots of small constituencies, many of them in play, multiple parties across the spectrum including regional parties. Historically, swing models were used. However, more recently, the UK has seen the use of more advanced statistical models, using MRP (Multilevel Regression and Poststratification), developed by Andrew Gelman and colleagues.

I was inspired by the use of MRP to try a not entirely dissimilar technique: discrete choice microsimulation. Hopefully it’s interesting; I’d like to look into it more (without such a crushing deadline – election projections are a lot less interesting the day after).

My approach

To try and bang this out as quickly as possible, here’s my technique in a nutshell:

Estimate models. First, I take survey data (more on the provenance in a bit) that records people’s vote preferences. I estimate a multinomial logit discrete choice model, a statistical model that gives a ‘score’ to each alternative (in this case, each party); it has some useful mathematical properties that make it convenient for simulating this type of decision. The result is a model that predicts how likely someone is to vote for each party based on their demographics (like age, income and education) and where they live (including region, as well as properties of the riding). There’s also a second model that predicts turnout; the 60% or so of the population that actually votes is older, more educated and wealthier than the population at large.

Create a world. Since the model requires us to know about people and where they live, I built a synthetic population. This uses actual Census responses from individual people, which keep all of the correlations between different demographic components. I use a procedure called combinatorial optimization to produce a population that contains these records, but matches key demographic measures for each riding. So my virtual version of Calgary Centre has 44% of residents with a university degree, and 71% of households living in apartments, and so on, just like the real one. (To save on run time, my synthetic version of Canada only has 5% of the population of the real one.)

Simulate the election. This is a process of running through the 1.8 million or so people in my virtual Canada and applying the models. Everybody gets a score reflecting how likely they are to vote, and then a score for each of the parties, including a random component that represents all the variance that isn’t in the models. If the person votes, they vote for their preferred party, and that vote gets added to the riding’s tally.

Repeat. I run the models over and over with slight variances to the parameters; these reflect uncertainties in the process. So one run, the turnout might be a little lower, the Conservatives might be a little more attractive, low income people might be more likely to vote for the NDP, and the Green candidate in Calgary-Confederation might do a little better. The next run, a different mix of these. This is the least-founded part of the model; basically, I messed around with the variances to get something like a +/- of 3% for national vote shares and +/- 6% for riding level shares.

The data-free approach

So as I mentioned, this approach is based on building a model out of polling data — the individual records of people who were polled, not just aggregate totals. But I don’t have any polling data. So I tried to do the next best thing: I made some polling data. Actually, I synthesized some polling data using the same approach I used to build the synthetic population.

The 2019 Canadian Election Study is a large set of highly detailed surveys of people, done both before and after the election, for academic use. I used the online panel, which is around 30K people. I then used the most recent Angus Reid poll data (September 15–18, 2021), which includes vote shares by region, by age group and gender, by income, and by education level. I didn’t pick Angus Reid because I think they are better or more reliable than other pollsters; I merely picked them because they provide the most detailed demographic breakdowns.

So, for example, the Angus Reid poll has the Liberals polling at 14% amongst Alberta residents, 31% in Quebec, 38% nationally amongst women over 55, 29% nationally amongst people with household incomes between $50K and $100K, and 27% amongst people with a high school education or less.

Using the combinatorial optimization software, I sampled 2042 surveys (the same as in the poll) that matched all of these totals, for all parties, geographies and demographics. The sampling process tried about a million swaps in and out of the synthetic poll to match, which sounds like a lot but took a minute or so. Computers are great!
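
The swap procedure itself is simple to sketch as a greedy hill climb (the function and variable names here are mine, and a real implementation would update the error incrementally rather than recomputing it from scratch each swap):

```python
import random

def fit_sample(pool, n, targets, error_fn, max_swaps=1_000_000):
    """Select n records from pool whose aggregate totals match `targets`.

    Starts from a random sample, then repeatedly proposes replacing one
    sampled record with one drawn from the pool, keeping the swap only if
    it lowers the mismatch score returned by error_fn(sample, targets)."""
    sample = random.sample(pool, n)
    err = error_fn(sample, targets)
    for _ in range(max_swaps):
        if err == 0:
            break  # perfect match against every target total
        i = random.randrange(n)
        trial = sample[:i] + [random.choice(pool)] + sample[i + 1:]
        trial_err = error_fn(trial, targets)
        if trial_err < err:  # greedy: keep only improving swaps
            sample, err = trial, trial_err
    return sample
```

For the synthetic poll, `error_fn` would total up every regional and demographic vote share in the sample and measure the distance from the published Angus Reid marginals.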

So this synthetic poll matches the top-line numbers and contains more detailed results — I know which constituency every survey is in, the respondent’s exact age, more detailed income, and so on. The problem is that it’s not actually current survey data; it surveys people who were choosing between parties led by Andrew Scheer and Elizabeth May, at a time when People’s Party support was only 2%. If the PPC does as well as current polling suggests, they may triple their vote share, which means their new supporter demographics are likely to differ from the 2019 ones; 2 out of every 3 PPC voters would be new to the party by definition.

Building models

With synthetic poll data in hand, I get to estimate models; that is, to establish the relationships between people and the choices they make. The underlying approach I use is multinomial logistic regression (using the free and open, but slow, Biogeme). The theory comes from economics; Daniel McFadden won the 2000 Nobel Memorial Prize for it. Basically, each choice is assigned a ‘utility’, which can be thought of as a score: the higher the score, the more attractive the alternative. I always like a worked example; here’s the score for not voting in the turnout model:

  • -0.5150 points to start with.
  • 0.0221 points for every thousand dollars below $50K in income
  • -0.00281 points for every thousand dollars above $50K (to a max of $250K) in income
  • -0.0863 points for each year above 50 in age
  • 0.000918 points for the age above 50 squared
  • Points for education: 0.7923 if you have less than a high school education, 0.2543 for high school, 0 for college/apprenticeship/CEGEP/etc., -0.3928 for a Bachelor’s degree, -0.5244 for a higher degree.

So to compare three people — say Maurice, a 25-year-old with a high school diploma and $25K income; Ibrahim, a 50-year-old with a college diploma and $50K income; and Shirley, a 60-year-old with a doctorate and $200K income:

Category       Maurice    Ibrahim    Shirley
Base           -0.5150    -0.5150    -0.5150
Income          0.5519     0         -0.4217
Age             0          0         -0.7707
Education       0.2543     0         -0.5244
Total Score     0.2912    -0.5150    -2.2319

So these are the utilities for not voting; as it happens, I set the utility for voting at 0 (these utilities are an arbitrary scale relative to each other). So Maurice is more likely to not vote than to vote, while Shirley is very unlikely to not vote. I say ‘more likely’ and ‘unlikely’ because there’s also a random component that represents the unknown (‘unobserved’ in science-talk) portion of the utility. Maybe Maurice is also the president of his campus political club, maybe Shirley is claustrophobic and hates the idea of a polling booth. This random portion is assumed to be Gumbel distributed, which is very much like a normal (bell-curve) distribution, with slightly wider tails and much better computational properties. With this random component included, Maurice is about 57% likely to not vote, Ibrahim is 37% likely to not vote, and Shirley is only 10% likely to not vote.
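
The worked example above can be reproduced in a few lines (a sketch of the turnout model only; the coefficients are the rounded ones listed above, so the totals differ from the table in the third or fourth decimal):

```python
import math

EDUCATION_POINTS = {'less_hs': 0.7923, 'hs': 0.2543, 'college': 0.0,
                    'bachelor': -0.3928, 'higher': -0.5244}

def not_vote_utility(age, income_k, education):
    """Deterministic utility of NOT voting; the utility of voting is fixed at 0.

    income_k is household income in thousands of dollars."""
    u = -0.5150  # base constant
    if income_k < 50:
        u += 0.0221 * (50 - income_k)  # points per $1K below $50K
    else:
        u -= 0.00281 * (min(income_k, 250) - 50)  # per $1K above $50K, capped at $250K
    if age > 50:
        u += -0.0863 * (age - 50) + 0.000918 * (age - 50) ** 2
    return u + EDUCATION_POINTS[education]

def p_not_vote(age, income_k, education):
    # With Gumbel-distributed errors, a two-way choice reduces to the logistic function
    return 1 / (1 + math.exp(-not_vote_utility(age, income_k, education)))
```

Calling `p_not_vote(25, 25, 'hs')`, `p_not_vote(50, 50, 'college')` and `p_not_vote(60, 200, 'higher')` gives back roughly the 57%, 37% and 10% figures for Maurice, Ibrahim and Shirley.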

The same thing applies to the vote choice model, except that there are six alternatives instead of two (I assume that the Liberals, Conservatives, NDP, Greens and People’s Party run in all ridings, the Bloc in all Quebec ridings, and no one else), and the relationships are more complicated. What’s in the model is:

  • Constant for each party – the part unexplained by everything else
  • Region level constant – adjusting the vote by region
  • Gender
  • Age
  • Income
  • Language (first official language — French or English; this is important for Bloc voters)
  • Education
  • Immigrant status
  • Visible minority status
  • Density of electoral district (Conservatives do better in lower density areas)
  • Share of vote in previous election by party (for the PPC I use the Liberal+NDP+Green share, since the model is estimated on 2019 poll data and in this case the previous election is 2015, with no PPC)
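
Under the multinomial logit, a person’s party utilities turn into choice probabilities via the usual softmax formula; a quick sketch (the party abbreviations and values here are made up for illustration):

```python
import math

def choice_probabilities(utilities):
    """Multinomial logit: P(i) = exp(V_i) / sum_j exp(V_j).

    utilities maps each available alternative to its deterministic utility."""
    peak = max(utilities.values())  # subtract the max for numerical stability
    exps = {alt: math.exp(v - peak) for alt, v in utilities.items()}
    total = sum(exps.values())
    return {alt: e / total for alt, e in exps.items()}
```

Parties that don’t run in a riding (the Bloc outside Quebec) are simply left out of the dictionary, and the remaining probabilities renormalize automatically.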

The biggest thing missing here is candidate quality — even just the presence of an incumbent; the reason is that there wasn’t enough time to prepare the data. But more complexities could be investigated — do white voters behave differently in areas with high visible minority populations? Do different ethnic communities behave differently? Are there interactions between these variables?

These models are then ‘calibrated’; that is, the constants are adjusted so the model matches overall totals. For voting turnout, I set a goal of 64% turnout on average (about the average of the last 4 to 6 elections), and for the national and regional party shares, I used the CBC Poll Tracker aggregation — poll aggregation is a lot of hard work, and I want to stand on the shoulders of giants.
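
The standard trick for this kind of calibration (a sketch of the textbook alternative-specific-constant adjustment, not necessarily the exact code I ran) is to nudge each constant by the log of the ratio of the target share to the currently predicted share, then re-run the model and repeat until it converges:

```python
import math

def calibrate_constants(constants, predicted, targets):
    """One iteration of constant calibration for a logit model:
    c_new = c_old + ln(target share / predicted share).

    All three arguments are dicts keyed by alternative; shares are fractions."""
    return {alt: c + math.log(targets[alt] / predicted[alt])
            for alt, c in constants.items()}
```

When the model already predicts the target share, the log ratio is zero and the constant stays put.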

Fake Canada

As mentioned above, I built a ‘synthetic population’ to represent all of these different demographic variables; each riding has its own mix of age, income, education and so on. Statistics Canada makes ‘microdata’ available (sort of; the US Census Bureau does a much better job). These represent individual long-form 2016 Census responses; to avoid identification, many variables are rounded or coded, the file is only a sample of responses, and location is identified only vaguely, by province and metro area. Large metro areas (over about 300K people) are identified specifically, while smaller metros are combined; for instance, the records from Quebec are in Montreal / Quebec City / Ottawa (i.e. the Gatineau part of the Ottawa CMA in Quebec) / Sherbrooke or Trois-Rivières / Somewhere else in Quebec.

So each district uses somewhat local residents. (I assigned each constituency to the area containing the largest share of its population.) These records are then sampled and resampled (a lot: about 300 million samplings) to build a population that matches the total population in each constituency along the following dimensions:

  • Household size
  • Household income
  • Dwelling type
  • Age (detailed 5 year bins)
  • First official language spoken (English or French)
  • Indigenous status
  • Citizenship
  • Education level
  • Visible minority status
  • Immigrant status (by year of immigration — pre 1991, 1991 to 2005, 2006 or more recent)

As mentioned above, these are scaled down to represent only 5% of the population in the district; this speeds up the runtime by a factor of 20 but keeps plenty of variety in the dataset; most constituencies have around 5000 people in them.

Simulating…

And then, the thing is just to put the pedal to the metal and crank through the simulations. Each simulation takes the 1.8 million people in the synthetic population and goes through the following steps:

  • Ditch the people who can’t vote
  • Assign a turnout utility (score) to each person
  • Assign utilities for each of six parties for each person
  • Add the random components and pick a turnout status and a preferred party
  • Add up the votes cast by the people who voted in each electoral district
  • Prepare totals (who won each seat, what the overall turnout is, that sort of thing) and write them to a file
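
Per person, the middle steps boil down to a couple of Gumbel draws (a sketch with my own function names; the real run works over pandas columns rather than one person at a time):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_person(not_vote_utility, party_utilities):
    """Simulate one person: returns the party voted for, or None if they stay home.

    not_vote_utility: deterministic utility of not voting (voting is fixed at 0).
    party_utilities: dict of party -> deterministic utility."""
    # Gumbel-distributed errors are what make the implied probabilities logit
    if not_vote_utility + rng.gumbel() > rng.gumbel():
        return None  # stayed home
    noisy = {p: v + rng.gumbel() for p, v in party_utilities.items()}
    return max(noisy, key=noisy.get)
```

Tallying the non-None results by riding (with, say, `collections.Counter`) then gives the seat-by-seat vote totals.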

Each iteration, the parameters in the turnout and vote preference models are given some random variation to reflect uncertainty in the process. For instance, the high school parameter for not voting is 0.2543; each run it’s perturbed with a standard deviation of roughly 0.1, so most of the time it’ll be between 0.2 and 0.3, and occasionally outside that. Sometimes high-school-educated people will come out to vote more often and sometimes less, basically. This creates some variance that I hope is reasonable. For the turnout model, I use the standard error from the estimation, reflecting the uncertainty around each parameter.

For the main election model, I’ve had to turn down the uncertainties to get more realistic behaviour — I’m aiming for about +/- 3% vote share swings nationally, and around +/- 8% or so in individual ridings. This is sort of arbitrary; the estimated uncertainties of the parameters are massive because I’m only using 2,000 observations, so I only use 10% of the standard error (20% for the national vote share). This still permits waves where men turn out for the NDP more than estimated and so on; it just keeps them to (what seems to my sleepy eyes) a reasonable rate. I’ve added one additional source of uncertainty: a district-level noise term, so each riding can swing a little outside the national and regional trends.
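
The perturbation step is just a scaled normal draw per coefficient; a sketch (my names, assuming the parameters and their standard errors are kept in dicts):

```python
import numpy as np

rng = np.random.default_rng()

def perturb_parameters(params, std_errs, scale=0.1):
    """Draw one simulated election's coefficients: each parameter gets
    normal noise with standard deviation scale * (its estimation std. error)."""
    return {name: value + rng.normal(0.0, scale * std_errs[name])
            for name, value in params.items()}
```

With `scale=1.0` this reproduces the full estimation uncertainty (what the turnout model uses); `0.1` or `0.2` damps it down for the vote model as described above.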

(For the nerds: I used Python, and specifically pandas, for the model.)

And that’s the (bad) model that I’ve built, and the model that’s running right now on my machine. I’ll see what nonsense it spits out.
