Author Archives: Michael Alvarez

Pre-registration of 16 and 17 year olds in California

California has recently launched a program that allows eligible 16- and 17-year-olds to pre-register to vote. When they turn 18, their registration becomes active. Here’s more information from the CA Secretary of State’s website.

It’s going to be quite interesting in 2018 and 2020 to evaluate how this initiative works. Will those who pre-register be more likely to turn out to vote than those who do not (obviously, controlling for all of the factors that might lead 18-year-olds to register and vote)? Who uses this program, and does it have any consequences for how organizations and campaigns conduct voter registration drives and get-out-the-vote activities? Lots of interesting questions here for future study!

New research on election forensics

Election forensics is a hot topic for research these days, and recently Arturas Rozenas from NYU published an interesting new paper in Political Analysis (the journal I co-edit). His paper, “Detecting Election Fraud from Irregularities in Vote-Shares”, should be of interest to folks studying election integrity and election fraud. Here’s the abstract:

I develop a novel method to detect election fraud from irregular patterns in the distribution of vote-shares. I build on a widely discussed observation that in some elections where fraud allegations abound, suspiciously many polling stations return coarse vote-shares (e.g., 0.50, 0.60, 0.75) for the ruling party, which seems highly implausible in large electorates. Using analytical results and simulations, I show that sheer frequency of such coarse vote-shares is entirely plausible due to simple numeric laws and does not by itself constitute evidence of fraud. To avoid false positive errors in fraud detection, I propose a resampled kernel density method (RKD) to measure whether the coarse vote-shares occur too frequently to raise a statistically qualified suspicion of fraud. I illustrate the method on election data from Russia and Canada as well as simulated data. A software package is provided for an easy implementation of the method.

And since Political Analysis requires that authors provide code and data to replicate the work reported in their paper, here are the replication materials from Arturas.
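One simple way to see why coarse vote-shares can be common without any fraud is to count how many (votes, voters) combinations land exactly on a given share. The sketch below is only an illustration of that arithmetic, not the resampled kernel density method from the paper; the function name and the cap of 200 voters per polling station are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction


def exact_share_counts(max_voters):
    """Count the (votes, voters) pairs that produce each exact vote-share.

    Every polling station size from 1 to max_voters is enumerated, along
    with every possible vote count for one party at that size.
    """
    counts = Counter()
    for voters in range(1, max_voters + 1):
        for votes in range(voters + 1):
            counts[Fraction(votes, voters)] += 1
    return counts


counts = exact_share_counts(200)
for share in (Fraction(1, 2), Fraction(3, 5), Fraction(3, 4), Fraction(51, 100)):
    print(f"{float(share):.2f} is reachable from {counts[share]} combinations")
```

A share like 0.50 is reachable from every even station size, while 0.51 needs a station size that is a multiple of 100, so exact hits on low-denominator fractions pile up purely because votes are integers.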

Tweeting the Election

We are pleased to launch this new website, which provides live, real-time summary analytics of the conversation currently occurring on Twitter about election administration and voting technology, as Americans go to vote in the 2016 Presidential election. This website was developed at Caltech by Emily Mazo, Shihan Su, and Kate Lewis as their project for CS 101, in collaboration with myself and the Caltech/MIT Voting Technology Project.

This website offers two views of some of the discussion about the election that is now occurring on Twitter. These visualizations compare the discussion occurring amongst people with different political ideologies: we have separated the tweets by the predicted ideology of the author.

The first view is geographic, showing the sentiment of incoming tweets by state. In the geographic view, which shows the average sentiment in every state over the past six hours and is updated every few minutes, dots on the map display the most recent tweets, and you can hover over those dots to view the content of the tweet.

At the top of the website, clicking over to the timeline view, you can see the sentiment of the recent incoming tweets by hour, for the past twelve hours.

In each view, we offer data on four different election administration and voting technology issues: tweets about absentee and early voting, polling places, election day voting, and voter identification. We collect these tweets by filtering results from the Public Streaming Twitter API, by keyword, for tweets falling into these categories.

Furthermore, we classify each incoming tweet in two ways. First, we do a sentiment analysis on the incoming tweet, to classify it as positive (given as green in both views) or negative (given as red in both views). We also classify the Twitter users as being more likely to be liberal or conservative in their political orientation, so that viewers can get a sense for how discussion on Twitter of these election administration and voting technology issues breaks by ideology.

We continue to work to improve our analytical tools, and the presentation of this data on this website. We welcome comments, please send them to

Finally, this website is currently best viewed in fullscreen mode, on a larger laptop or computer screen. Mobile optimization, and viewing on smaller sized screens, is for future development.

Some Background

This is part of a larger project at Caltech, begun in the fall of 2014, in which we have been collecting tweets like these to study election administration. The background of the project, as well as some initial analyses validating these data, is contained in the working paper, Adams-Cohen et al. (2016), “Election Monitoring Using Twitter” (forthcoming).

1. How we collect these tweets
The Tweet data used in this visualization is collected from Twitter’s Public Streaming API. The stream is filtered on keywords for the four different topics of interest:

Election Day Voting: Provisional ballot, Voting machine, Ballot.
Polling Places: Polling place line, precinct line, Pollworker, Poll worker.
Absentee Voting: Absentee ballot, mail ballot, vote by mail, voting by mail, early voting.
Voter ID: Voter identification, Voting identification, Voter ID.
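The keyword lists above can be written down as a small topic lookup. This is a minimal local sketch that uses case-insensitive substring matching; it is not the Streaming API's own matching logic, and the names `TOPIC_KEYWORDS` and `topics_for` are made up for illustration.

```python
# Keyword lists copied from the four topics above, lower-cased for matching.
TOPIC_KEYWORDS = {
    "Election Day Voting": ["provisional ballot", "voting machine", "ballot"],
    "Polling Places": ["polling place line", "precinct line", "pollworker", "poll worker"],
    "Absentee Voting": ["absentee ballot", "mail ballot", "vote by mail",
                        "voting by mail", "early voting"],
    "Voter ID": ["voter identification", "voting identification", "voter id"],
}


def topics_for(text):
    """Return every topic whose keyword list has a phrase appearing in the text."""
    lowered = text.lower()
    return [topic for topic, keywords in TOPIC_KEYWORDS.items()
            if any(keyword in lowered for keyword in keywords)]


print(topics_for("Long polling place line this morning"))
print(topics_for("Mailed my absentee ballot"))
```

Note that a broad keyword like "ballot" also matches tweets about absentee ballots, so a single tweet can fall into more than one topic under this simple scheme.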

2. Sentiment analysis
The Tweets collected from this stream were then fed into two classification models. The first model classified them into positive or negative classes based on their text. This model was created with crowdsourced data: about 5,000 Tweets from an earlier collection, gathered in the same manner as those in the visualization, were each labeled for sentiment (valence) on a positive-to-negative scale by at least three crowd workers, and those ratings were averaged to create a standard label for each Tweet. This training set of Tweets and labels was then used to create a term frequency-inverse document frequency vector for the words in each Tweet in the set. These vectors were used to train a decision tree regression model to predict the value for the sentiment in future Tweets (a high positive predicted value indicating higher positive sentiment, and a more negative value indicating a more negative sentiment).
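To make the label averaging and the term frequency-inverse document frequency step concrete, here is a small standard-library sketch. It assumes one common smoothed IDF formula and L2-normalized vectors, which may differ from what the project actually used; the function names and toy data are hypothetical, and the decision tree regression step is not shown.

```python
import math
from collections import Counter
from statistics import mean


def average_labels(ratings_by_tweet):
    """Average the crowd workers' ratings for each tweet into one label."""
    return {tweet_id: mean(ratings) for tweet_id, ratings in ratings_by_tweet.items()}


def fit_idf(tokenized_docs):
    """Learn a smoothed inverse document frequency for every training term."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Build an L2-normalized TF-IDF vector over the fixed training vocabulary."""
    vocab = sorted(idf)
    counts = Counter(token for token in tokens if token in idf)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw)) or 1.0
    return [value / norm for value in raw]


# Toy data: three ratings per tweet on a -1 (negative) to 1 (positive) scale.
labels = average_labels({"t1": [1, 0, -1], "t2": [1, 1, 0.5]})
training_docs = [["long", "line", "polling"], ["voted", "early", "line"]]
idf = fit_idf(training_docs)
vector = tfidf_vector(["line", "line", "unknown"], idf)
```

Words that never appeared in the training set (like "unknown" here) are simply ignored, which is what keeps new tweets in the same vector space as the training data.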

The Tweets that appear on tweetingtheelection were streamed from the Twitter API, stripped of stop words, hashtags, URLs, and any other media, then processed to create Tf-Idf vectors to represent each Tweet using the same vocabulary as the original model. These vectors were then passed through the decision tree regression model, which predicted sentiment labels for them.
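The cleaning step described above might look something like the following sketch. The stop-word list is a tiny illustrative one, and treating "stripped of hashtags" as removing the whole hashtag token (rather than just the `#` character) is an assumption on my part.

```python
import re

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "at", "i", "in", "is", "my", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
TOKEN_RE = re.compile(r"[a-z']+")


def clean_tweet(text):
    """Drop URLs, hashtags, and stop words, returning lower-cased word tokens."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_tweet("Waited 2 hours at the polling place! #vote https://t.co/abc"))
```

The resulting token list is what would then be turned into a Tf-Idf vector using the vocabulary learned from the training set.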

3. Ideology classification
This model classifies tweets as conservative or liberal based on their text.
Training data
Training data for this model is obtained in two steps.
First, we obtain ideal point estimates for Twitter users from Pablo Barbera’s work, “Birds of the Same Feather Tweet Together. Bayesian Ideal Point Estimation Using Twitter Data”. In that work, Barbera develops a latent space model that estimates each user’s ideal point from the following links between users.

Second, we match the user IDs obtained from Barbera with the tweets collected by R. Michael Alvarez’s lab from 2016-04-19 to 2016-06-23. In other words, we label each tweet using the label of the user who created it. We matched around 55,000 tweets and use these as our training data.
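The matching step amounts to a join on user ID. Here is a rough sketch of the idea; the field names, the sign convention (positive ideal points treated as conservative), and the cutoff at zero are all assumptions for illustration rather than details of the actual pipeline.

```python
def label_tweets(tweets, ideal_points, threshold=0.0):
    """Attach each author's ideology label to their tweets.

    tweets: iterable of dicts with "user_id" and "text" keys (hypothetical).
    ideal_points: dict mapping user_id to an estimated ideal point.
    Tweets from users without an ideal point estimate are dropped.
    """
    labeled = []
    for tweet in tweets:
        score = ideal_points.get(tweet["user_id"])
        if score is None:
            continue  # no estimate for this user, so no training label
        label = "conservative" if score > threshold else "liberal"
        labeled.append((tweet["text"], label))
    return labeled


# Toy example with made-up user IDs and scores.
tweets = [
    {"user_id": 1, "text": "Voted early today"},
    {"user_id": 2, "text": "Long line at my precinct"},
    {"user_id": 3, "text": "Where is my mail ballot?"},
]
ideal_points = {1: -1.2, 2: 0.8}
training_data = label_tweets(tweets, ideal_points)
```

Because the label comes from the author rather than from the text itself, every tweet by the same user gets the same label.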

4. Classifier
The classifier we use is a convolutional neural network.
The model was originally developed by Yoon Kim. We adopt Denny Britz’s implementation of this model in TensorFlow.

The model we adopted has four layers. The first layer embeds words into 128-dimensional vectors, which are learned within the neural network. The next layer is a convolution layer that uses three filter sizes (3, 4, and 5), i.e. filters that slide over 3, 4, or 5 words at a time, and applies 128 filters for each of the three filter sizes. Next, the model max-pools the result of the convolutional layer for each filter size into a 128-feature vector and adds dropout regularization. The final softmax layer then classifies the result.
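For readers who want to see how the pieces fit together, here is a toy forward pass written in plain Python. It is a sketch of the general architecture only, not the TensorFlow implementation we used: the dimensions are shrunk (8-dimensional embeddings and 4 filters per size instead of 128) so it runs quickly, the weights are random and untrained, dropout is left out because it only applies during training, and all names are hypothetical.

```python
import math
import random


def conv_max_pool(embedded, filters, width):
    """Slide each filter over `width` consecutive word vectors, apply ReLU,
    and keep the largest activation per filter (max-over-time pooling)."""
    pooled = []
    for weights, bias in filters:
        best = float("-inf")
        for start in range(len(embedded) - width + 1):
            window = [x for vec in embedded[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(weights, window)) + bias
            best = max(best, max(activation, 0.0))
        pooled.append(best)
    return pooled


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def forward(token_ids, params):
    """Embed the tokens, pool features for each filter width, then classify."""
    embedded = [params["embedding"][token] for token in token_ids]
    features = []
    for width, filters in params["conv"].items():
        features.extend(conv_max_pool(embedded, filters, width))
    logits = [sum(w * f for w, f in zip(row, features)) + bias
              for row, bias in params["dense"]]
    return features, softmax(logits)


def init_params(vocab_size, embed_dim, widths, n_filters, n_classes, seed=0):
    """Create small random weights; a real model would learn these."""
    rng = random.Random(seed)

    def rand(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    return {
        "embedding": [rand(embed_dim) for _ in range(vocab_size)],
        "conv": {w: [(rand(w * embed_dim), 0.0) for _ in range(n_filters)]
                 for w in widths},
        "dense": [(rand(n_filters * len(widths)), 0.0) for _ in range(n_classes)],
    }


params = init_params(vocab_size=50, embed_dim=8, widths=(3, 4, 5),
                     n_filters=4, n_classes=2)
features, probs = forward([1, 7, 3, 9, 12, 4, 0], params)
```

In this sketch the pooled outputs for the three filter widths are concatenated before the final layer, so the toy feature vector has 3 x 4 = 12 entries; the input needs at least as many tokens as the widest filter.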

5. Geocoding
The incoming tweets are geocoded. We use the coordinates of where the tweet was sent from, if they are provided (if the user opted into geocoding). If that information is unavailable, we use the location of the user from their profile (if provided).
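The fallback logic can be sketched in a few lines. The field names below follow the classic Twitter tweet JSON as I understand it (a GeoJSON-style `coordinates` object in longitude, latitude order, and a free-text `user.location` string); treat them as assumptions, and note that turning a profile string into a state would need a further geocoding step that is not shown.

```python
def tweet_location(tweet):
    """Prefer the tweet's own coordinates; fall back to the profile location."""
    point = tweet.get("coordinates")
    if point and point.get("coordinates"):
        lon, lat = point["coordinates"]  # GeoJSON order: longitude first
        return {"source": "tweet", "lat": lat, "lon": lon}
    profile_location = (tweet.get("user") or {}).get("location")
    if profile_location:
        return {"source": "profile", "place": profile_location}
    return None  # nothing to geocode


geotagged = {"coordinates": {"type": "Point", "coordinates": [-118.125, 34.138]}}
profile_only = {"coordinates": None, "user": {"location": "Pasadena, CA"}}
print(tweet_location(geotagged))
print(tweet_location(profile_only))
```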


There are many people to thank. First off, the website was created by Emily Mazo, Shihan Su, and Kate Lewis, as their project for CS 101 at Caltech. That class is taught by Yisong Yue and Omer Tamuz, professors at Caltech.

Earlier work done on this project, in particular the ongoing collection of Twitter data like these since 2014 at Caltech, has been done in collaboration with Nailen Matschke (who developed the original python-based Twitter collection tool, as well as the MySQL database where all of the data collected so far is stored); with Clare Hao and Cherie Jia, who worked in the summer of 2016 as SURF students, developing some preliminary python tools to analyze the Twitter data; and Nick Adams-Cohen, a PhD student in the social sciences at Caltech, who is studying the use of Twitter data for public opinion and political behavior research.

We’d like to thank Pablo Barbera for sharing his ideological placement data with us.

Other colleagues and students who have helped with this research project include Thad Hall, Jonathan Nagler, Lucas Nunez, Betsy Sinclair, and Charles Stewart III.

How secure are state voter registration databases?

Over the past few months, there’ve been a number of reports that state voter registration databases have come under cyberattack. Most recently, FBI Director James Comey discussed the attacks in a hearing of the U.S. House Judiciary Committee. While details have been few, between what’s been reported by the FBI recently, and the various attacks on the email accounts of political parties and political leaders in the U.S., it seems clear that the U.S. election infrastructure is being probed for vulnerabilities.

So exactly how secure are state voter registration databases? I’ve been asked about this a number of times recently, by the media, colleagues, and students.

The potential threats to state voter registration databases have been known for a long time. In fact, I wrote a paper on this topic in October 2005 — that’s not a typo, October 2005. The paper, “Potential Threats to Statewide Voter Registration Systems”, is available as a Caltech/MIT Voting Technology Project Working Paper. It’s also part of a collection of working papers in a NIST report, “Developing an Analysis of Threats to Voting Systems: Preliminary Workshop Summary.”

The context for my 2005 paper was that states were then rushing to implement their new computerized statewide voter registries, as required after the passage of the Help America Vote Act. At the time, a number of researchers (myself included) were concerned that, in the rush to develop and implement these databases, important questions about their security and integrity were going unaddressed. So the paper was meant to provide some guidance about the potential security and integrity problems, in the hopes that they would be better studied and addressed in the near future.

The four primary types of threats that I wrote about regarded:

  • Authenticity of the registration file: attacks on the transmission path of voter registration data from local election officials to the state database, or attacks on the transmission path of data between the state registry and other state officials (for example, departments of motor vehicles).
  • Security of personal information in the file: state voter files contain a good deal of personal information, including names, birthdates, addresses, and contact information, which could be quite valuable to attackers.
  • Integrity of the file: the primary data files could be corrupted, either by mistakes which enter the data and are difficult to remove, or by systematic attack.
  • System failure: the files could fail at important moments, either due to problems with their architecture or technology, or if they come under systematic “denial of service” attacks.

By 2010, when I was a member of a National Academies panel, “Improving State Voter Registration Databases”, many of these same concerns were raised by panelists, and by the many election officials and observers of elections who provided input to the panel. It wasn’t clear how much progress had been made by 2010, towards making sure that the state voter registration systems then in place were secure.

Fast-forward to 2016, and despite the concerns raised in 2005 and 2010, very little research has focused on the security and integrity of state voter registration databases, certainly nowhere near the amount that has focused on the security of other components of the American election infrastructure, in particular remote and in-person voting systems. I’d be happy to hear of research that folks have done; I’m aware of only a very few research projects that have looked at state voter registration systems. For example, there’s a paper that I worked on in 2009 with Jeff Jonas, Bill Winkler, and Rebecca Wright, where we matched the voter registration files from Oregon and Washington, in an effort to determine the utility of interstate matching to find duplicates between the states. Another paper that I know of is by Steve Ansolabehere and Eitan Hersh, which looks at the quality of the information in state voter registries. But there’s not been a lot of systematic study of the security of state voter registries; I recommend that researchers (like our Voting Technology Project) direct resources towards studying voter registration systems now, and in the immediate aftermath of the 2016 election.

Beyond the need for more research on the security of state voter registration databases, there are some immediate steps that election officials can take. The obvious step is to take action to make sure that these databases are now as secure as possible, and to determine whether there is any forensic evidence that the files might have been attacked or tampered with recently. A second step is to make sure that the database system will be robust in the face of a systematic denial of service attack. Third, election officials can devise a systematic approach towards providing pre- and post-election audits of their databases, something that I’ve strongly recommended in other work on election administration auditing (with Lonna Atkeson and Thad Hall). If election officials do audit their voter registration databases and processes, those reports should be made available to the public.

Felony Disenfranchisement

I frequently am asked by students, colleagues, and the media, about how many people in the U.S. cannot participate in elections because of felony disenfranchisement laws. Given the patchwork quilt of felony disenfranchisement laws across the states, and a lack of readily available data, it’s often hard to estimate what the rate of felony disenfranchisement might be.

The Sentencing Project has released a report that provides information and data about felony disenfranchisement and the 2016 federal elections in the U.S. Here are their key findings, quoted from their report:

“Our key findings include the following:

– As of 2016, an estimated 6.1 million people are disenfranchised due to a felony conviction, a figure that has escalated dramatically in recent decades as the population under criminal justice supervision has increased. There were an estimated 1.17 million people disenfranchised in 1976, 3.34 million in 1996, and 5.85 million in 2010.

– Approximately 2.5 percent of the total U.S. voting age population – 1 of every 40 adults – is disenfranchised due to a current or previous felony conviction.

– Individuals who have completed their sentences in the twelve states that disenfranchise people post-sentence make up over 50 percent of the entire disenfranchised population, totaling almost 3.1 million people.

– Rates of disenfranchisement vary dramatically by state due to broad variations in voting prohibitions. In six states – Alabama, Florida, Kentucky, Mississippi, Tennessee, and Virginia – more than 7 percent of the adult population is disenfranchised.

– The state of Florida alone accounts for more than a quarter (27 percent) of the disenfranchised population nationally, and its nearly 1.5 million individuals disenfranchised post-sentence account for nearly half (48 percent) of the national total.

– One in 13 African Americans of voting age is disenfranchised, a rate more than four times greater than that of non-African Americans. Over 7.4 percent of the adult African American population is disenfranchised compared to 1.8 percent of the non-African American population.

– African American disenfranchisement rates also vary significantly by state. In four states – Florida (21 percent), Kentucky (26 percent), Tennessee (21 percent), and Virginia (22 percent) – more than one in five African Americans is disenfranchised.”

This looks like a useful resource for those interested in understanding the possible electoral implications of felony disenfranchisement laws across the U.S.
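For anyone who wants to see how the quoted figures relate to one another, here is a quick arithmetic check. It uses only the rounded numbers quoted above, so it can confirm that the figures are mutually consistent but nothing more precise than that; the variable names are mine.

```python
import math

# Rounded figures as quoted from the report above.
total_2016 = 6.1e6          # people disenfranchised as of 2016
total_1976 = 1.17e6         # people disenfranchised in 1976
post_sentence = 3.1e6       # disenfranchised after completing their sentences
florida_post_sentence = 1.5e6

checks = {
    "1 of every 40 adults is 2.5 percent": math.isclose(1 / 40, 0.025),
    "post-sentence group is over 50 percent of the total":
        post_sentence / total_2016 > 0.50,
    "Florida is about 48 percent of the post-sentence total":
        round(100 * florida_post_sentence / post_sentence) == 48,
    "1 in 13 is over 7.4 percent": 1 / 13 > 0.074,
    "7.4 percent is more than four times 1.8 percent": 7.4 / 1.8 > 4,
}

growth_since_1976 = total_2016 / total_1976  # roughly a fivefold increase

for claim, holds in checks.items():
    print(("consistent:   " if holds else "inconsistent: ") + claim)
```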

Questions about postal voting

Since the origins of the Caltech/MIT Voting Technology Project in 2000, the VTP has noted a number of concerns about postal voting. Our original report in 2001 noted that postal voting represents clear tradeoffs, with benefits including convenience, but with potential risks, especially regarding the reliability and security of balloting by mail.

Our most recent report reiterated these same concerns, but added another, as there is new research indicating that many of the reductions in residual votes (a key measure of voting system reliability and accuracy) are at risk because of the increase in postal voting. One of these papers studies residual votes in California (“Voting Technology, Vote-by-Mail, and Residual Votes in California, 1990-2010”). The other is a national-level study, “Losing Votes by Mail.” There is an important signal in the residual vote data from recent elections: increased postal voting is associated with increased residual votes.
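For readers new to the measure, the residual vote rate for a contest is usually computed as the share of ballots cast that did not record a valid vote in that contest (because of an overvote, an undervote, or a spoiled ballot). A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def residual_vote_rate(ballots_cast, valid_votes):
    """Share of ballots cast with no valid vote counted in the contest."""
    if ballots_cast <= 0:
        raise ValueError("ballots_cast must be positive")
    if not 0 <= valid_votes <= ballots_cast:
        raise ValueError("valid_votes must be between 0 and ballots_cast")
    return (ballots_cast - valid_votes) / ballots_cast


# Made-up example: 10,000 ballots cast, 9,850 valid votes for president.
rate = residual_vote_rate(10_000, 9_850)
print(f"Residual vote rate: {rate:.1%}")
```

Comparing this rate between mail ballots and in-person ballots in the same contest is the kind of comparison the studies above rely on.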

Now comes word of a new concern about the reliability of postal voting. Upcoming Austrian elections might be postponed due to faulty glue used in the ballot envelopes. This video helps explain the problem.

While we’ve raised questions in the past about the reliability of the mail system for balloting (in particular, noting that there’s always a risk that balloting materials might be delayed or misdirected, especially for overseas and military voters covered by the UOCAVA and MOVE Acts), a basic malfunction of postal voting material is not an issue that we’ve heard much of in the past. But clearly it may be an issue in the future, so researchers will need to keep an eye on what is learned from this Austrian postal ballot problem and how it is resolved, and determine how to prevent problems like these from happening.

California’s massive 2016 voter guide

I’m glad that I recently had a large and sturdy mailbox installed at the end of our driveway. Our previous mailbox was small, rusty, and was starting to lean to one side — had the mail carrier tried to leave California’s massive, 224-page, 2016 general election voter information guide in our old mailbox, I have no doubt it would have immediately toppled over.

The LA Times has a fun video that shows the printing of this super-sized voter information guide.

Don’t get me wrong, I think that it’s great that California voters receive the voter information guide from our Secretary of State (it’s available online in pdf format as well, which might be more easily usable for many voters). The information guide helps remind voters about the upcoming election; it provides useful information about voter rights and resources about registration and voting; and it also gives lots and lots of detailed information about all of the ballot measures that we will have on our ballots in California this fall.

But with seventeen statewide measures on the ballot (this does not include county or local measures), the information guide is a bit intimidating this election season. Californians are being asked to provide their input into a wide range of statewide issues, including fiscal matters like school and revenue bonds, tax extensions and new taxes, the death penalty, and marijuana legalization. These are important issues, and this fall voters will need to take a close look at the voter information guide to get a better understanding of these issues and to figure out how to cast their ballots.

With so many issues on the ballot, and with a lot of important candidate races (a presidential race, the U.S. Senate contest, and lots of competitive congressional and state legislative races), it’s a long ballot. Combine the long ballot with a lot of interest in this election, and there’s a good chance we will see strong turnout throughout the state this fall, which even with widespread voting by mail will likely mean long waits at polling places on election day.

In any case, Californians should be on the lookout for their massive voter information guide in their mailboxes, or take a look at the online version. Just make sure that you have a sturdy mailbox, and don’t drop it on your toes when your copy arrives soon.

Estimating Turnout with Self-Reported Survey Data

There’s long been a debate about the accuracy of voter participation estimates that use self-reported survey data. The seminal research paper on this topic, by Rosenstone and Wolfinger, was published in 1978 (available here for those of you with JSTOR access). They pointed out a methodological problem in the Current Population Survey data they used in their early and important analysis: there seemed to be more people in the survey reporting that they voted, than likely voted in the federal elections they studied.

In the years since the publication of Rosenstone and Wolfinger’s paper, there’s been a lot of debate among academic researchers about this apparent misreporting of turnout in survey self-reports of behavior, much more than I can easily summarize here. But many survey researchers have been using “voter validation” to try to alleviate these potential biases in their survey data, which involves matching survey respondents who say they voted to administrative voter history records (after the election); this approach has been used in many large-scale academic surveys of political behavior, including many of the American National Election Studies.

In an important new study, recently published in Public Opinion Quarterly, Berent, Krosnick and Lupia set out to test the validation of self-reports of turnout against post-election voter history data. Their paper, “Measuring Voter Registration and Turnout in Surveys: Do Official Government Records Yield More Accurate Assessments”, is one that people interested in studying voter turnout using survey data should read. Here are the important results from their paper’s abstract:

We explore the viability of turnout validation efforts. We find that several apparently viable methods of matching survey respondents to government records severely underestimate the proportion of Americans who were registered to vote. Matching errors that severely underestimate registration rates also drive down “validated” turnout estimates. As a result, when “validated” turnout estimates appear to be more accurate than self-reports because they produce lower turnout estimates, the apparent accuracy is likely an illusion. Also, among respondents whose self-reports can be validated against government records, the accuracy of self-reports is extremely high. This would not occur if lying was the primary explanation for differences between reported and official turnout rates.

This is an important paper, which deserves close attention. Because it questions one of the common means of trying to validate self-reported turnout, we need not only additional research to confirm its results but also new research to better understand how best to adjust self-reported survey participation to get the most accurate turnout estimate we can from survey data.

Estimating racial and ethnic identity from voting history data

Researchers who have participated in redistricting efforts, or who for other reasons have used voter history files in their work, know how difficult it is to estimate a voter’s racial and ethnic identity from these data. These files typically contain a voter’s name, date of birth, address, date of registration, and their participation in recent elections. The usual approach that many have taken to estimate each voter’s racial or ethnic identity has been to use “surname dictionaries”, which classify many of the last names in a voter history file into racial or ethnic groups.

The obvious problem is that with an increasingly diverse society, this surname matching procedure may be less and less accurate. The surnames of many Americans are no longer necessarily accurate for estimating racial or ethnic identity.
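To show what the surname-dictionary approach described above looks like in practice, and where it breaks down, here is a minimal sketch. The table entries are placeholders with made-up probabilities, not values from any real surname list, and the 0.75 cutoff and function name are assumptions for illustration.

```python
# Placeholder dictionary: surname -> probability of each group (made-up values).
SURNAME_TABLE = {
    "SURNAME_A": {"group_1": 0.90, "group_2": 0.10},
    "SURNAME_B": {"group_1": 0.20, "group_2": 0.80},
    "SURNAME_C": {"group_1": 0.55, "group_2": 0.45},
}


def classify_surname(surname, table=SURNAME_TABLE, min_prob=0.75):
    """Return the most likely group for a surname, or None when the name is
    missing from the dictionary or no group is likely enough."""
    probs = table.get(surname.strip().upper())
    if probs is None:
        return None
    group, prob = max(probs.items(), key=lambda item: item[1])
    return group if prob >= min_prob else None


voter_file = ["Surname_A", "surname_b", "SURNAME_C", "Surname_D"]
print([classify_surname(name) for name in voter_file])
```

Names that are absent from the dictionary, or that are spread fairly evenly across groups, end up unclassified, which is exactly where this procedure loses accuracy as surnames become less informative.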

Charles recently wrote about one recent paper in Political Analysis on this topic, by Kosuke Imai and Kabir Khanna, “Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records”. Charles provided an excellent summary of this article, but I’d like to point out to readers that the Imai and Khanna article is now available for free reading online, so check it out asap!

The other recent article in Political Analysis on this question is by J. Andrew Harris, “What’s in a Name? A Method for Extracting Information about Ethnicity from Names.” Here’s Harris’s abstract:

Questions about racial or ethnic group identity feature centrally in many social science theories, but detailed data on ethnic composition are often difficult to obtain, out of date, or otherwise unavailable. The proliferation of publicly available geocoded person names provides one potential source of such data—if researchers can effectively link names and group identity. This article examines that linkage and presents a methodology for estimating local ethnic or racial composition using the relationship between group membership and person names. Common approaches for linking names and identity groups perform poorly when estimating group proportions. I have developed a new method for estimating racial or ethnic composition from names which requires no classification of individual names. This method provides more accurate estimates than the standard approach and works in any context where person names contain information about group membership. Illustrations from two very different contexts are provided: the United States and the Republic of Kenya.

Harris’s paper is open access, which means it’s also freely available for people to read online.

There’s a lot of interesting research going on in how to use these types of administrative datasets for innovative research; I encourage readers to take a look at both papers, and I’d also like to note that the code and data for both papers are available on the Political Analysis Dataverse.