
Tweeting the Election

We are pleased to launch this new website, www.tweetingtheelection.com, which provides live, real-time summary analytics of the conversation currently occurring on Twitter about election administration and voting technology as Americans go to vote in the 2016 Presidential election. This website was developed at Caltech by Emily Mazo, Shihan Su, and Kate Lewis as their project for CS 101, in collaboration with me and the Caltech/MIT Voting Technology Project.

This website offers two views of some of the discussion about the election that is now occurring on Twitter. These visualizations compare the discussion occurring amongst people with different political ideologies: we have separated the tweets by the predicted ideology of the author.

The first view is geographic, showing the sentiment of incoming tweets by state. In the geographic view, which shows the average sentiment in every state over the past six hours and is updated every few minutes, dots on the map display the most recent tweets, and you can hover over those dots to view the content of the tweet.

By clicking over to the timeline view at the top of the website, you can see the sentiment of recent incoming tweets by hour, for the past twelve hours.

In each view, we offer data on four different election administration and voting technology issues: tweets about absentee and early voting, polling places, election day voting, and voter identification. We collect these tweets by filtering results from the Public Streaming Twitter API, by keyword, for tweets falling into these categories.

Furthermore, we classify each incoming tweet in two ways. First, we do a sentiment analysis on the incoming tweet, to classify it as positive (shown in green in both views) or negative (shown in red in both views). We also classify the Twitter users as more likely to be liberal or conservative in their political orientation, so that viewers can get a sense of how discussion on Twitter of these election administration and voting technology issues breaks down by ideology.

We continue to work to improve our analytical tools and the presentation of these data on this website. We welcome comments; please send them to tweetingtheelection@gmail.com.

Finally, this website is currently best viewed in fullscreen mode, on a larger laptop or desktop screen. Mobile optimization and support for smaller screens are planned for future development.

Some Background

This is part of a larger project at Caltech, begun in the fall of 2014, in which we have been collecting tweets like these to study election administration. The background of the project, as well as some initial analyses validating these data, is contained in the working paper by Adams-Cohen et al. (2016), “Election Monitoring Using Twitter” (forthcoming).

1. How we collect these tweets
The Tweet data used in this visualization is collected from Twitter’s Public Streaming API. The stream is filtered on keywords for the four different topics of interest:

Election Day Voting: provisional ballot, voting machine, ballot.
Polling Places: polling place line, precinct line, pollworker, poll worker.
Absentee Voting: absentee ballot, mail ballot, vote by mail, voting by mail, early voting.
Voter ID: voter identification, voting identification, voter ID.
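
For readers curious about the mechanics, here is a minimal sketch of this kind of keyword filtering against Twitter's Public Streaming API as it existed in 2016, using the tweepy library (version 3.x). The credentials and the print statement are placeholders; the project's actual collection tool writes each tweet to a MySQL database.

    import tweepy

    # Keywords for the four topic streams described above.
    TRACK_TERMS = [
        "provisional ballot", "voting machine", "ballot",                    # election day voting
        "polling place line", "precinct line", "pollworker", "poll worker",  # polling places
        "absentee ballot", "mail ballot", "vote by mail", "voting by mail",
        "early voting",                                                      # absentee and early voting
        "voter identification", "voting identification", "voter id",        # voter ID
    ]

    class KeywordListener(tweepy.StreamListener):
        def on_status(self, status):
            # In the real pipeline each matching tweet would be stored
            # for later sentiment and ideology classification.
            print(status.id_str, status.text)

        def on_error(self, status_code):
            # Disconnect on rate limiting (HTTP 420) rather than retrying aggressively.
            return status_code != 420

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholder credentials
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    stream = tweepy.Stream(auth=auth, listener=KeywordListener())
    stream.filter(track=TRACK_TERMS)   # keyword matching is case-insensitive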

2. Sentiment analysis
The Tweets collected from this stream were then fed into two classification models. The first model classified them into positive or negative classes based on their text. This model was created with crowdsourced data: about 5,000 Tweets from a previous collection, gathered in the same manner as those in the visualization, were labeled with sentiment (valence) on a positive-to-negative scale by at least three crowd workers each, and those labels were then averaged to create a standard label for each Tweet. This training set of Tweets and labels was then used to create a term frequency-inverse document frequency (tf-idf) vector for the words in each Tweet in the set. These vectors were used to train a decision tree regression model to predict the sentiment of future Tweets (a higher positive predicted value indicating more positive sentiment, and a more negative value indicating more negative sentiment).

The Tweets that appear on tweetingtheelection.com were streamed from the Twitter API, stripped of stop words, hashtags, URLs, and any other media, then processed into tf-idf vectors representing each Tweet, using the same vocabulary as the original model. These vectors were then passed through the decision tree regression model, which predicted sentiment labels for them.
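
As a concrete illustration, here is a minimal sketch of this sentiment pipeline in scikit-learn. The example Tweets, labels, and tree depth are illustrative assumptions; the actual model was trained on the roughly 5,000 crowd-labeled Tweets described above.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.tree import DecisionTreeRegressor

    # Crowd-labeled training data: tweet text and the averaged valence score
    # (positive values = positive sentiment, negative values = negative sentiment).
    train_texts = [
        "voting early was quick and easy today",
        "the line at my polling place is a disaster",
    ]
    train_scores = [0.8, -0.7]

    # Build tf-idf vectors over the training vocabulary.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    X_train = vectorizer.fit_transform(train_texts)

    # Fit a decision tree regression model to predict sentiment scores.
    model = DecisionTreeRegressor(max_depth=10, random_state=0)
    model.fit(X_train, train_scores)

    # New streamed Tweets (already stripped of hashtags, URLs, and media)
    # are vectorized with the same vocabulary and then scored.
    new_tweets = ["my absentee ballot never arrived"]
    X_new = vectorizer.transform(new_tweets)
    print(model.predict(X_new))   # > 0 leans positive, < 0 leans negative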

3. Ideology classification
This model classifies tweets as conservative or liberal based on their text.
Training data
Training data for this model were obtained in two steps.
First, we obtained ideal point estimates for Twitter users from Pablo Barbera’s paper “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” In that work, Barbera develops a latent space model that estimates each user’s ideal point by studying the following links (who follows whom) between users.

Second, we matched the user IDs obtained from Barbera with the tweets collected by R. Michael Alvarez’s lab from 2016-04-19 to 2016-06-23. In other words, we labeled tweets using the label of the user who created them. We matched around 55,000 tweets and used these as our training data.

4. Classifier
The classifier we use is a convolutional neural network.
The model was originally developed by Yoon Kim in his paper “Convolutional Neural Networks for Sentence Classification.” We adapt Denny Britz’s implementation of this model in TensorFlow.

The model we adopted has four layers. The first layer embeds words into 128-dimensional vectors, which are learned within the neural network. The convolutional layer that follows uses three filter sizes (3, 4, and 5), i.e., sliding over 3, 4, or 5 words at a time, and applies 128 filters for each filter size. Next, the model max-pools the output of the convolutional layer into a feature vector and adds dropout regularization. A final softmax layer then classifies the result.
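
The following is a minimal Keras sketch of that architecture: a learned 128-dimensional embedding, parallel convolutions with filter sizes 3, 4, and 5 (128 filters each), max-pooling, dropout, and a softmax output. The vocabulary size, sequence length, and dropout rate are illustrative assumptions, and this is an approximation of the architecture rather than the exact implementation we adapted.

    from tensorflow.keras import layers, Model

    VOCAB_SIZE = 20000   # assumption: size of the tweet vocabulary
    MAX_LEN = 40         # assumption: maximum tweet length in tokens
    EMBED_DIM = 128
    NUM_FILTERS = 128
    FILTER_SIZES = (3, 4, 5)

    inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)   # 128-dim word vectors learned in the network

    # One convolution + max-pool branch per filter size (sliding over 3, 4, or 5 words at a time).
    branches = []
    for size in FILTER_SIZES:
        conv = layers.Conv1D(NUM_FILTERS, size, activation="relu")(x)
        branches.append(layers.GlobalMaxPooling1D()(conv))

    x = layers.Concatenate()(branches)
    x = layers.Dropout(0.5)(x)                             # dropout regularization
    outputs = layers.Dense(2, activation="softmax")(x)     # liberal vs. conservative

    model = Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])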

5. Geocoding
The incoming tweets are geocoded. We use the coordinates of where the tweet was sent from, if they are provided (if the user opted into geocoding). If that information is unavailable, we use the location of the user from their profile (if provided).
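
In code, that fallback logic looks roughly like the sketch below. The tweet is assumed to be a standard Twitter API status object (a dictionary), and geocode_profile_location is a hypothetical helper that turns a free-text profile location into coordinates.

    def locate_tweet(tweet, geocode_profile_location):
        """Best-effort location for a tweet: exact coordinates if the user
        opted in to geocoding, otherwise the profile location, otherwise None."""
        if tweet.get("coordinates"):
            # The Twitter API stores GeoJSON points as [longitude, latitude].
            lon, lat = tweet["coordinates"]["coordinates"]
            return lat, lon
        profile_location = (tweet.get("user") or {}).get("location")
        if profile_location:
            return geocode_profile_location(profile_location)   # hypothetical helper
        return None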

Thanks

There are many people to thank. First off, www.tweetingtheelection.com was created by Emily Mazo, Shihan Su, and Kate Lewis as their project for CS 101 at Caltech. That class is taught by Yisong Yue and Omer Tamuz, professors at Caltech.

Earlier work done on this project, in particular the ongoing collection of Twitter data like these since 2014 at Caltech, has been done in collaboration with Nailen Matschke (who developed the original python-based Twitter collection tool, as well as the MySQL database where all of the data collected so far is stored); with Clare Hao and Cherie Jia, who worked in the summer of 2016 as SURF students, developing some preliminary python tools to analyze the Twitter data; and Nick Adams-Cohen, a PhD student in the social sciences at Caltech, who is studying the use of Twitter data for public opinion and political behavior research.

We’d like to thank Pablo Barbera for sharing his ideological placement data with us.

Other colleagues and students who have helped with this research project include Thad Hall, Jonathan Nagler, Lucas Nunez, Betsy Sinclair, and Charles Stewart III.

How secure are state voter registration databases?

Over the past few months, there’ve been a number of reports that state voter registration databases have come under cyberattack. Most recently, FBI Director James Comey discussed the attacks in a hearing of the U.S. House Judiciary Committee. While details have been few, between what’s been reported by the FBI recently and the various attacks on the email accounts of political parties and political leaders in the U.S., it seems clear that the U.S. election infrastructure is being probed for vulnerabilities.

So exactly how secure are state voter registration databases? I’ve been asked about this a number of times recently, by the media, colleagues, and students.

The potential threats to state voter registration databases have been known for a long time. In fact, I wrote a paper on this topic in October 2005 — that’s not a typo, October 2005. The paper, “Potential Threats to Statewide Voter Registration Systems”, is available as a Caltech/MIT Voting Technology Project Working Paper. It’s also part of a collection of working papers in a NIST report, “Developing an Analysis of Threats to Voting Systems: Preliminary Workshop Summary.”

The context for my 2005 paper was that states were then rushing to implement their new computerized statewide voter registries, as required after the passage of the Help America Vote Act. At the time, a number of researchers (myself included) were concerned that, in the rush to develop and implement these databases, important questions about their security and integrity would go unaddressed. So the paper was meant to provide some guidance about the potential security and integrity problems, in the hope that they would be better studied and addressed in the near future.

The four primary types of threats that I wrote about regarded:

  • Authenticity of the registration file: attacks on the transmission path of voter registration data from local election officials to the state database, or attacks on the transmission path of data between the state registry and other state agencies (for example, departments of motor vehicles).
  • Security of personal information in the file: state voter files contain a good deal of personal information, including names, birthdates, addresses, and contact information, which could be quite valuable to attackers.
  • Integrity of the file: the primary data files could be corrupted, either by mistakes which enter the data and are difficult to remove, or by systematic attack.
  • System failure: the files could fail at important moments, either due to problems with their architecture or technology, or if they come under systematic “denial of service” attacks.

By 2010, when I was a member of a National Academies panel, “Improving State Voter Registration Databases”, many of these same concerns were raised by panelists, and by the many election officials and observers of elections who provided input to the panel. It wasn’t clear how much progress had been made by 2010, towards making sure that the state voter registration systems then in place were secure.

Fast-forward to 2016, and very little research has been done on the security and integrity of state voter registration databases; despite the concerns raised in 2005 and 2010, there’s not been a great deal of research focused on the security of these systems, certainly nowhere near the amount of research that has focused on the security of other components of the American election infrastructure, in particular, the security of remote and in-person voting systems. I’d be happy to hear of research that folks have done; I’m aware of only a very few research projects that have looked at state voter registration systems. For example, there’s a paper that I worked on in 2009 with Jeff Jonas, Bill Winkler, and Rebecca Wright, where we matched the voter registration files from Oregon and Washington, in an effort to determine the utility of interstate matching to find duplicates between the states. Another paper that I know of is by Steve Ansolabehere and Eitan Hersh, which looks at the quality of the information in state voter registries. But there’s not been a lot of systematic study of the security of state voter registries; I recommend that researchers (like our Voting Technology Project) direct resources towards studying voter registration systems now, and in the immediate aftermath of the 2016 election.

In addition to calling for more research on the security of state voter registration databases, election officials can take some immediate steps. The obvious step is to take action to make sure that these databases are now as secure as possible, and to determine whether there is any forensic evidence that the files might have been attacked or tampered with recently. A second step is to make sure that the database system will be robust in the face of a systematic denial of service attack. Third, election officials can devise a systematic approach towards providing pre- and post-election audits of their databases, something that I’ve strongly recommended in other work on election administration auditing (with Lonna Atkeson and Thad Hall). If election officials do audit their voter registration databases and processes, those reports should be made available to the public.

Felony Disenfranchisement

I frequently am asked by students, colleagues, and the media, about how many people in the U.S. cannot participate in elections because of felony disenfranchisement laws. Given the patchwork quilt of felony disenfranchisement laws across the states, and a lack of readily available data, it’s often hard to estimate what the rate of felony disenfranchisement might be.

The Sentencing Project has released a report that provides information and data about felony disenfranchisement and the 2016 federal elections in the U.S. Here are their key findings, quoted from their report:

“Our key findings include the following:

– As of 2016, an estimated 6.1 million people are disenfranchised due to a felony conviction, a figure that has escalated dramatically in recent decades as the population under criminal justice supervision has increased. There were an estimated 1.17 million people disenfranchised in 1976, 3.34 million in 1996, and 5.85 million in 2010.

– Approximately 2.5 percent of the total U.S. voting age population – 1 of every 40 adults – is disenfranchised due to a current or previous felony conviction.

– Individuals who have completed their sentences in the twelve states that disenfranchise people post-sentence make up over 50 percent of the entire disenfranchised population, totaling almost 3.1 million people.

– Rates of disenfranchisement vary dramatically by state due to broad variations in voting prohibitions. In six states – Alabama, Florida, Kentucky, Mississippi, Tennessee, and Virginia – more than 7 percent of the adult population is disenfranchised.

– The state of Florida alone accounts for more than a quarter (27 percent) of the disenfranchised population nationally, and its nearly 1.5 million individuals disenfranchised post-sentence account for nearly half (48 percent) of the national total.

– One in 13 African Americans of voting age is disenfranchised, a rate more than four times greater than that of non-African Americans. Over 7.4 percent of the adult African American population is disenfranchised compared to 1.8 percent of the non-African American population.

– African American disenfranchisement rates also vary significantly by state. In four states – Florida (21 percent), Kentucky (26 percent), Tennessee (21 percent), and Virginia (22 percent) – more than one in five African Americans is disenfranchised.”

This looks like a useful resource for those interested in understanding the possible electoral implications of felony disenfranchisement laws across the U.S.

Questions about postal voting

Since the origins of the Caltech/MIT Voting Technology Project in 2000, the VTP has noted a number of concerns about postal voting. Our original report in 2001 noted that postal voting represents clear tradeoffs, with benefits including convenience, but with potential risks, especially regarding the reliability and security of balloting by mail.

Our most recent report reiterated these same concerns, but added another: new research indicates that many of the reductions in residual votes (a key measure of voting system reliability and accuracy) are at risk because of the increase in postal voting. One of these papers studies residual votes in California (“Voting Technology, Vote-by-Mail, and Residual Votes in California, 1990-2010”). The other is a national-level study, “Losing Votes by Mail.” There is an important signal in the residual vote data from recent elections: increased postal voting is associated with increased residual votes.

Now comes word of a new concern about the reliability of postal voting. Upcoming Austrian elections might be postponed due to faulty glue used in the ballot envelopes. This video helps explain the problem.

While we’ve raised questions in the past about the reliability of the mail system for balloting (in particular, noting that there’s always a risk that balloting materials might be delayed or misdirected, especially for overseas and military voters covered by the UOCAVA and MOVE Acts), a basic malfunction of postal voting materials is not an issue that we’ve heard much about in the past. But it clearly may be an issue in the future, so researchers will need to keep an eye on what is learned from this Austrian postal ballot problem, how it is resolved, and how to prevent problems like these from happening again.

California’s massive 2016 voter guide

I’m glad that I recently had a large and sturdy mailbox installed at the end of our driveway. Our previous mailbox was small, rusty, and was starting to lean to one side — had the mail carrier tried to leave California’s massive, 224-page, 2016 general election voter information guide in our old mailbox, I have no doubt it would have immediately toppled over.

The LA Times has a fun video that shows the printing of this super-sized voter information guide:
http://www.latimes.com/la-pol-vn-printing-the-california-voter-information-guide-2-20160909-premiumvideo.html

Don’t get me wrong, I think that it’s great that California voters receive the voter information guide from our Secretary of State (it’s available online in pdf format as well, which might be more easily usable for many voters). The information guide helps remind voters about the upcoming election, it provides useful information about voter rights and resources about registration and voting, and it also gives lots and lots of detailed information about all of the ballot measures that we will have on our ballots in California this fall.

But with seventeen statewide measures on the ballot (this does not include county or local measures), the information guide is a bit intimidating this election season. Californians are being asked to provide their input into a wide range of statewide issues, including fiscal matters like school and revenue bonds, tax extensions and new taxes, the death penalty, and marijuana legalization. These are important issues, and this fall voters will need to take a close look at the voter information guide to get a better understanding of these issues and to figure out how to cast their ballots.

With so many issues on the ballot, and with a lot of important candidate races (a presidential race, the U.S. Senate contest, and lots of competitive congressional and state legislative races), it’s a long ballot. Combine the long ballot with a lot of interest in this election, and there’s a good chance we will see strong turnout throughout the state this fall, which even with widespread voting by mail will likely mean long waits at polling places on election day.

In any case, Californians should be on the lookout for their massive voter information guide in their mailboxes, or take a look at the online version. Just make sure that you have a sturdy mailbox, and don’t drop it on your toes when your copy arrives soon.

Estimating Turnout with Self-Reported Survey Data

There’s long been a debate about the accuracy of voter participation estimates that use self-reported survey data. The seminal research paper on this topic, by Rosenstone and Wolfinger, was published in 1978 (available here for those of you with JSTOR access). They pointed out a methodological problem in the Current Population Survey data they used in their early and important analysis: there seemed to be more people in the survey reporting that they voted, than likely voted in the federal elections they studied.

In the years since the publication of Rosenstone and Wolfinger’s paper, there’s been a lot of debate among academic researchers about this apparent misreporting of turnout in survey self-reports of behavior, much more than I can easily summarize here. But many survey researchers have been using “voter validation” to try to alleviate these potential biases in their survey data, which involves matching survey respondents who say they voted to administrative voter history records (after the election); this approach has been used in many large-scale academic surveys of political behavior, including many of the American National Election Studies.

In an important new study, recently published in Public Opinion Quarterly, Berent, Krosnick, and Lupia set out to test the validation of self-reports of turnout against post-election voter history data. Their paper, “Measuring Voter Registration and Turnout in Surveys: Do Official Government Records Yield More Accurate Assessments?”, is one that people interested in studying voter turnout using survey data should read. Here are the important results from their paper’s abstract:

We explore the viability of turnout validation efforts. We find that several apparently viable methods of matching survey respondents to government records severely underestimate the proportion of Americans who were registered to vote. Matching errors that severely underestimate registration rates also drive down “validated” turnout estimates. As a result, when “validated” turnout estimates appear to be more accurate than self-reports because they produce lower turnout estimates, the apparent accuracy is likely an illusion. Also, among respondents whose self-reports can be validated against government records, the accuracy of self-reports is extremely high. This would not occur if lying was the primary explanation for differences between reported and official turnout rates.

This is an important paper, which deserves close attention. As it is questioning one of the common means of trying to validate self-reported turnout, not only do we need additional research to confirm their results, we need new research to better understand how we can best adjust self-reported survey participation to get the most accurate turnout estimate that we can, using survey data.

Estimating racial and ethnic identity from voting history data

Researchers who have participated in redistricting efforts, or who for other reasons have used voter history files in their work, know how difficult it is to estimate a voter’s racial and ethnic identity from these data. These files typically contain a voter’s name, date of birth, address, date of registration, and their participation in recent elections. The usual approach that many have taken to estimate each voter’s racial or ethnic identity has been to use “surname dictionaries,” which map many of the last names in a voter history file to racial or ethnic groups.

The obvious problem is that with an increasingly diverse society, this surname matching procedure may be less and less accurate. The surnames of many Americans are no longer necessarily accurate for estimating racial or ethnic identity.

Charles recently wrote about a paper in Political Analysis on this topic, by Kosuke Imai and Kabir Khanna, “Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records”. Charles provided an excellent summary of this article, but I’d like to point out to readers that the Imai and Khanna article is now available for free reading online, so check it out asap!

The other recent article in Political Analysis on this question is by J. Andrew Harris, “What’s in a Name? A Method for Extracting Information about Ethnicity from Names.” Here’s Harris’s abstract:

Questions about racial or ethnic group identity feature centrally in many social science theories, but detailed data on ethnic composition are often difficult to obtain, out of date, or otherwise unavailable. The proliferation of publicly available geocoded person names provides one potential source of such data—if researchers can effectively link names and group identity. This article examines that linkage and presents a methodology for estimating local ethnic or racial composition using the relationship between group membership and person names. Common approaches for linking names and identity groups perform poorly when estimating group proportions. I have developed a new method for estimating racial or ethnic composition from names which requires no classification of individual names. This method provides more accurate estimates than the standard approach and works in any context where person names contain information about group membership. Illustrations from two very different contexts are provided: the United States and the Republic of Kenya.

Harris’s paper is open access, which means it’s also freely available for people to read online.

There’s a lot of interesting research going on in how to use these types of administrative datasets for innovative research; I encourage readers to take a look at both papers, and I’d also like to note that the code and data for both papers are available on the Political Analysis Dataverse.

VTP report: “The Voting Technology Project: Looking Back, Looking Ahead”

The Caltech/MIT Voting Technology Project has recently released the first of a series of reports for the 2016 U.S. presidential election. This report, “The Voting Technology Project: Looking Back, Looking Ahead,” outlines the history of the Caltech/MIT Voting Technology Project (VTP) and discusses some of the issues and states where the VTP will be focusing its collective research activities for this fall’s election.

As this report discusses, the VTP was formed immediately after the 2000 presidential election. In particular, the project was established to study the problems associated with voting technologies in that election, and to propose solutions for those technological problems before the next presidential election in 2004. To assess the problems with voting technology in 2000, the VTP was constituted as a bicoastal, interdisciplinary research group.

While studying the issues with voting technology in the 2000 presidential election was our initial focus, the team quickly figured out that voting technologies were not the only issues plaguing U.S. elections: our studies revealed that significant numbers of votes were lost in the 2000 presidential election to problems other than bad voting technology. The non-technological issues that the VTP identified were voter registration, absentee voting, and problems with polling place practices.

The VTP issued our first major research studies in June and July 2001 — fifteen years ago! The first of those studies examined the reliability of existing voting technologies, using the residual votes metric; the second study took a broader focus, and used the lost votes measure to compare the problems of voting technology to those associated with other aspects of the election process in the U.S. Both of these 2001 studies were significant: they were widely read by policymakers, election officials, other academics, and the interested public. These studies played important roles in the development of federal, state, and local election reform efforts after 2001, including the Help America Vote Act. These studies also laid the foundation for the development of a surge of interest in the study of election administration and voting technology by academics, which fifteen years later has grown to include researchers across the globe, who jointly have produced many important books and articles on voter registration, voter identification, absentee voting, voting technology, polling place practices, and election administration.

Fifteen years later, the VTP continues to carry out ambitious and important research on voting technology and election administration. As a project, we have released a number of important policy reports since 2001, we have published our research widely, we have helped election officials and policymakers across the globe improve their voting technology and election administration practices, we have trained many students in the science of elections, and we have collaborated widely with researchers at many other colleges and universities.

As this new report discusses, going into the 2016 November general elections in the U.S., the VTP will be focusing on many of the same issues which have received our attention in the past (ironically, in some of the same states where we have focused our studies in past elections). Our efforts will involve the study of how to improve polling place practices, in particular the elimination of long lines at polling places on Election Day. We will continue our studies of voter identification and authentication procedures, and how new technologies might allow for accurate identification without disenfranchisement. The VTP will be looking closely at the performance of old and new voting technologies that will be used this fall. Finally, we will also be studying voting-by-mail and early voting in the states which widely use those convenience voting options. We’ll provide additional reports about those studies as the election season progresses, and issue post-election evaluations when we have results to share with our colleagues and readers.

The 2000 presidential election was unique, and the combination of problematic voting technologies with a very close election focused the attention of the world on how American elections are conducted. The good news is that much has improved in the conduct of federal elections in the U.S. since 2000, and the research community now has metrics and methods to study election performance well.

However, in battleground states, where the presidential election will be fought, it’s likely that attention will again focus on administrative and technological issues in November 2016, especially if none of the presidential candidates can easily claim an Electoral College victory on the evening of November 8, 2016. We hope that the release of this report, and the others that we will publish between now and Election Day, will help minimize the number of votes that are lost in the electoral process.

Media exit polls, election analytics, and conspiracy theories

The integrity of elections is a primary concern in a democratic society. One of the most important developments in the study of elections in recent decades has been the rapid development of tools and methods for evaluation of elections, most specifically, what many call “election forensics.” I and a number of my colleagues have written extensively on election evaluation and forensics; I refer interested readers to the book that Lonna Atkeson, Thad Hall, and I wrote, Evaluating Elections, and to the book that I edited with Thad and Susan Hyde, Election Fraud.

One question that continues to arise is whether observed differences between election results and media exit polls are evidence of electoral manipulation or election fraud. These questions have been raised in a number of recent U.S. presidential elections, and have come up again in the recent presidential primary elections in the U.S. In a recent piece in the New York Times, Nate Cohn wrote about these claims, and why we should be cautious in using media exit polls to detect election fraud. Each of the points that Cohn makes is valid and important, so this is an article worth reading closely.

I’d add to Cohn’s arguments, and note that while media exit polls have clear weaknesses as the sole forensic tool for determining the integrity of an election, we have a wide variety of other tools and methods to use in situations where there are questions raised about an election.
As Lonna, Thad, and I wrote in Evaluating Elections, a good post-election study of an election’s integrity should involve a variety of data sources and multiple methods, including surveys and polls, post-election audits, and forensic analysis of disaggregated election returns. Each analytic approach has its strengths and weaknesses (media exit polls included), so by approaching the study of election integrity using as many data sources and methods as we can, we can best locate where we might want to launch further investigation of potential problems in an election.

I have no doubt that we will hear more about the use of exit polls to evaluate the integrity of the presidential election this fall. Keep in mind Cohn’s cautionary points about using exit polls for this purpose, and also keep in mind that there are many other ways to evaluate the integrity of an election that have been tested and used in past elections. Media exit polls aren’t a great forensic tool, as Cohn argues: the types of exit polls that the news media uses to make inferences about voting behavior are not designed to detect election fraud or manipulation. Rather, those interested in a detailed examination of an election’s integrity should instead use the full array of analytic forensic tools that have been developed and tested in the research literature.