Estimating racial and ethnic identity from voting history data

Researchers who have participated in redistricting efforts, or who have used voter history files in their work for other reasons, know how difficult it is to estimate a voter’s racial and ethnic identity from these data. These files typically contain a voter’s name, date of birth, address, date of registration, and record of participation in recent elections. The usual approach to estimating each voter’s racial or ethnic identity has been to use “surname dictionaries”, which assign many of the last names in a voter history file to racial or ethnic groups.
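To make the surname-dictionary idea concrete, here is a minimal sketch of the lookup step. Everything in it is an assumption for illustration: the dictionary entries, the abstract group labels, the shares, and the function name `classify_surname` are all invented, and real surname lists are far larger and built from census-style tabulations.

```python
from collections import Counter

# Hypothetical surname dictionary: surname -> assumed share of people with
# that surname in each group. All names, groups, and numbers are invented.
SURNAME_DICT = {
    "GARCIA": {"group_a": 0.90, "group_b": 0.07, "group_c": 0.03},
    "NGUYEN": {"group_a": 0.02, "group_b": 0.03, "group_c": 0.95},
    "SMITH": {"group_a": 0.05, "group_b": 0.85, "group_c": 0.10},
}


def classify_surname(surname, dictionary=SURNAME_DICT):
    """Return the most likely group for a surname, or None if it is unlisted."""
    shares = dictionary.get(surname.strip().upper())
    if shares is None:
        return None  # surname not covered by the dictionary
    return max(shares, key=shares.get)


# Surnames as they might appear in a voter file (inconsistent case and spacing).
voters = ["Garcia", "smith ", "Nguyen", "Kowalczyk"]
labels = [classify_surname(name) for name in voters]
counts = Counter(labels)

print(labels)
print(counts)
```

Note that any surname missing from the dictionary comes back as `None`, so in a sketch like this a share of the file simply goes unclassified.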

The obvious problem is that in an increasingly diverse society, this surname matching procedure is likely to become less and less accurate. For many Americans, a surname is no longer a reliable guide to racial or ethnic identity.

Charles recently wrote about one paper in Political Analysis on this topic, by Kosuke Imai and Kabir Khanna, “Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records”. Charles provided an excellent summary of this article, but I’d like to point out to readers that the Imai and Khanna article is now available for free reading online, so check it out asap!

The other recent article in Political Analysis on this question is by J. Andrew Harris, “What’s in a Name? A Method for Extracting Information about Ethnicity from Names.” Here’s Harris’s abstract:

Questions about racial or ethnic group identity feature centrally in many social science theories, but detailed data on ethnic composition are often difficult to obtain, out of date, or otherwise unavailable. The proliferation of publicly available geocoded person names provides one potential source of such data—if researchers can effectively link names and group identity. This article examines that linkage and presents a methodology for estimating local ethnic or racial composition using the relationship between group membership and person names. Common approaches for linking names and identity groups perform poorly when estimating group proportions. I have developed a new method for estimating racial or ethnic composition from names which requires no classification of individual names. This method provides more accurate estimates than the standard approach and works in any context where person names contain information about group membership. Illustrations from two very different contexts are provided: the United States and the Republic of Kenya.

Harris’s paper is open access, which means it’s also freely available for people to read online.
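The abstract's point that classifying individual names can give poor estimates of group proportions is easy to see in a toy example. The sketch below is not Harris's method. It only contrasts classify-then-count with a simple alternative, averaging per-name probabilities, and all names, groups, and probabilities in it are invented.

```python
# Hypothetical P(group | name) values, invented purely for illustration.
NAME_PROBS = {
    "NAME_X": {"group_a": 0.6, "group_b": 0.4},
    "NAME_Y": {"group_a": 0.7, "group_b": 0.3},
}


def classify_and_count(names, probs):
    """Assign each name to its most likely group, then report group shares."""
    counts = {}
    for name in names:
        best = max(probs[name], key=probs[name].get)
        counts[best] = counts.get(best, 0) + 1
    return {group: n / len(names) for group, n in counts.items()}


def average_probabilities(names, probs):
    """Average the per-name probabilities without classifying anyone."""
    totals = {}
    for name in names:
        for group, p in probs[name].items():
            totals[group] = totals.get(group, 0.0) + p
    return {group: t / len(names) for group, t in totals.items()}


# A made-up locality with ten people: five of each name.
names = ["NAME_X"] * 5 + ["NAME_Y"] * 5
hard = classify_and_count(names, NAME_PROBS)
soft = average_probabilities(names, NAME_PROBS)

print("classify and count:", hard)
print("average probabilities:", soft)
```

Under these assumed probabilities, classify-then-count puts everyone in `group_a`, while the averaged estimate keeps roughly a third of the locality in `group_b`. The toy case shows only that hard classification discards the uncertainty in each name.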

There’s a lot of interesting research going on in how to use these types of administrative datasets for innovative research; I encourage readers to take a look at both papers, and I’d also like to note that the code and data for both papers are available on the Political Analysis Dataverse.