Assessing the Reliability and Variability of the TopDown Algorithm for Redistricting Data
In a webinar on May 21, Census Bureau subject matter experts described the results of new research into the reliability and variability of the Disclosure Avoidance System’s TopDown Algorithm (the modernized approach to statistical noise infusion based on differential privacy) for meeting P.L. 94-171 redistricting use cases.
The research concluded the following:
- The TopDown Algorithm produced reliable statistics for all congressional districts and state legislative districts.
- The TopDown Algorithm produced reliable statistics (defined below) for various demographic groups in:
  - Block groups (a proxy for voting districts) with a minimum total population of 550 to 599 people, and
  - Places and Minor Civil Divisions (a proxy for less-populated off-spine geographies) with a minimum total population of 350 to 399 people.
- The calibrated randomness of the TopDown Algorithm’s noise injection resulted in less variability than the first demonstration data released in October 2019, with relative variability decreasing as the size of the geography increases.
As outlined below, “reliability” is defined by how well the TopDown Algorithm’s output compares to the published 2010 Census data. Specifically, for any block group with a total count between 550 and 599 people, and for any Minor Civil Division or place with between 350 and 399 people, the difference between the TopDown Algorithm’s ratio for the largest demographic group and the corresponding ratio from the swapping algorithm (the disclosure avoidance method used for the 2010 Census) is less than or equal to five percentage points at least 95% of the time.
A Note About Comparisons with the Published 2010 Census Swapped Data
It’s important to note that the research compares 2010 Census results produced by the April 2021 version of the TopDown Algorithm against the published 2010 Census results, which included swapping, the disclosure avoidance method used in 2010. Some of the apparent error between the two data sets is therefore due to the swapping itself.
To preserve the integrity of the traditional methods like the swapping algorithm used for 2010 and prior censuses, the parameters of those methods and their impact on data accuracy must be kept confidential.
Note also that all tuning of the Disclosure Avoidance System has used the unswapped data for comparison. We are only able to publicly share analysis on accuracy and fitness-for-use as comparisons to the swapped data because of the confidentiality requirements described above.
Identifying Redistricting Data Use Cases
The Census Bureau gathered use cases (user needs) to inform reliability criteria for redistricting data from redistricting data users, practitioners, experts, civil rights organizations, and the public. Through those discussions, we confirmed that population counts, including accurate counts of race, ethnicity, and voting-age characteristics, are critical to core redistricting needs: drawing districts of similar size, accurately identifying communities of interest in voting districts, and analyzing communities for racially polarized voting when enforcing Section 2 of the Voting Rights Act (VRA).
To analyze performance towards the overall redistricting use case, we began comparing the output from the Disclosure Avoidance System to the output of the published data from the 2010 Census, starting with congressional and state legislative districts.
We then requested more specific use cases from the Department of Justice, which provided 20 Section 5 redistricting plans to add to our analysis. Section 5 of the Voting Rights Act required jurisdictions with a history of racial discrimination in voting to preclear any proposed voting change with the Department of Justice to ensure that the proposed change would not have a discriminatory effect on minority racial and ethnic groups. Although Section 5 is currently inactive, we still felt that these plans would be useful to our analysis, because they provided us examples of the smaller districts where most redistricting occurs.
While small demographic groups are important, in the context of redistricting it is the largest demographic groups that have the potential to form districts where a sufficiently large (and compact) minority group has the opportunity “to elect representatives of their choice.”
Establishing Proxy Districts
A unique challenge to analyzing the impact of disclosure avoidance on redistricting is that districts are not pre-formed; it’s not possible to know the final configuration of districts until they are designed, which can only happen after the census data is published.
To overcome this challenge, we expanded our research to measure accuracy for census block groups as a proxy for voting districts. Block group geography is a useful proxy because block groups are available nationwide, providing wall-to-wall coverage of the country. Much like small districts, block groups are small geographic areas composed of blocks, and they typically contain between 600 and 3,000 people.
Through our earlier discussions with redistricting data users and experts as well as the Department of Justice, we learned that most of the work of redistricting is done for jurisdictions that have a population size of 2,500 people or less.
If a jurisdiction of that size is divided into four or five districts, each district will contain 625 or 500 people, respectively. Through conversations with these same groups, we also established that a jurisdiction with only 500 people is not likely to divide itself into districts. With that in mind, we established 500 people as the district size for accuracy targeting.
In the April 2021 DAS, we used an optimized geographic spine. Spine optimization meant that official tabulation block groups became off-spine entities. We also studied other off-spine geographies that serve as the political boundaries for legislative entities. Off-spine entities do not receive noise-infused counts directly; they instead gain their accuracy from the assemblage of blocks and other whole levels of geography that are contained within them.
These other targeted areas were places (both incorporated places, which are legally defined, and Census Designated Places, which are not legally defined), plus Minor Civil Divisions, which are the townships, boroughs, and towns in those states where they’re functioning governments.
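To make this concrete, here is a minimal Python sketch of how an off-spine entity’s count can be assembled from the noisy counts of the on-spine units that tile it. The geography identifiers and counts are entirely hypothetical, and the actual DAS spine and aggregation logic are considerably more involved.

```python
# Toy illustration: an off-spine entity (e.g., a place) receives no direct
# noise-infused count of its own. Its estimate is the sum of the noisy
# counts of the on-spine units (blocks, whole tracts, etc.) that exactly
# tile it. All identifiers and numbers below are hypothetical.

noisy_on_spine_counts = {
    "block_0001": 212,
    "block_0002": 148,
    "tract_0101": 1530,  # a whole tract contained within the place
}

# On-spine units whose union exactly covers the hypothetical place.
units_in_place = ["block_0001", "block_0002", "tract_0101"]

place_estimate = sum(noisy_on_spine_counts[u] for u in units_in_place)
print(f"Off-spine place estimate: {place_estimate}")  # -> 1890
```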
It is important to note that these targets were not the only considerations as we tuned the TopDown Algorithm. We also used a broad array of additional accuracy measures based on feedback from data users regarding geography, racially polarized voting analysis, previous demonstration data analysis, and effects on downstream data products’ accuracy.
Establishing Reliability Criteria for Redistricting Data
Feedback from the Department of Justice and other data users taught us that, in the context of redistricting, it is the largest demographic groups that have the potential to form districts where a sufficiently large (and compact) minority group has the opportunity to “elect representatives of their choice.”
The Census Bureau believes that consideration of the largest demographic group(s) is supported by Section 2 of the Voting Rights Act of 1965 (as amended) and is called for by one of the three Gingles preconditions set out in the U.S. Supreme Court case Thornburg v. Gingles (1986) for establishing a violation of Section 2.
The potential for creating an electoral district that provides minority citizens with the opportunity to elect candidates of their choice is not necessarily limited to those block groups in which that group is the “largest demographic group.” For example, a demographic group could make up the second largest population group in two or more contiguous, randomly created block groups. A different configuration of constituent blocks could result in that group being the basis of a district that affords the requisite opportunity to elect.
We therefore established the following criteria for reliability:
For any block group with a total count of between 550 and 599 people, and for Minor Civil Divisions and places containing between 350 and 399 people, the difference between the TopDown Algorithm ratio of the largest demographic group and the corresponding swapping algorithm’s ratio for the largest demographic group is less than or equal to five percentage points at least 95% of the time.
In other words, using block groups as an example: for at least 95% of block groups across the country that have between 550 and 599 people, the share held by the largest demographic group will be within five percentage points of the share computed from the swapped data.
Examples in the paper are instructive: Block Group 131350505461, in Lawrenceville, GA, has a population of 10,000. In 2010 (using the swapping algorithm), 4,475 people in that block group identified as Black/Non-Hispanic (a 44.75% share), making Black/Non-Hispanic the largest racial or ethnic group in the block group. Our criterion would lead us to assess whether the share of the Black/Non-Hispanic population under the TDA was within five percentage points of that 2010 share. If so, that block group would meet our accuracy target.
We conducted that same type of analysis for every block group, place, and MCD in the country.
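To illustrate how such a check might be computed, here is a minimal Python sketch of the reliability criterion applied to block groups. The data structures, demographic group names, and counts are hypothetical; the Bureau’s actual evaluation pipeline is far more extensive.

```python
# Sketch of the reliability check for block groups with 550-599 people.
# All block-group data below are hypothetical.

def shares(counts):
    """Each group's share of the total population."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def meets_target(swapped, tda, tolerance_pp=5.0):
    """True if the TDA share of the largest demographic group is within
    `tolerance_pp` percentage points of the published (swapped) 2010 share.
    The largest group is identified from the published 2010 data."""
    swapped_shares = shares(swapped)
    largest = max(swapped_shares, key=swapped_shares.get)
    diff_pp = abs(shares(tda)[largest] - swapped_shares[largest]) * 100
    return diff_pp <= tolerance_pp

# Hypothetical (swapped 2010 counts, TDA counts) pairs for in-scope block groups.
block_groups = [
    ({"white_nh": 300, "black_nh": 200, "hispanic": 60},
     {"white_nh": 304, "black_nh": 196, "hispanic": 60}),
    ({"white_nh": 100, "black_nh": 350, "hispanic": 120},
     {"white_nh": 112, "black_nh": 330, "hispanic": 128}),
]

fraction_ok = sum(meets_target(s, t) for s, t in block_groups) / len(block_groups)
print(f"{fraction_ok:.0%} of block groups meet the 5-percentage-point target")
print("Reliable" if fraction_ok >= 0.95 else "Not reliable")
```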
Variability Within and Between Data “Runs”
The second part of our research focused on the extent to which the random number generator produces unexpected variability in data results. The study confirmed that relative variability in the TDA decreases as we consider larger pieces of geography and population; this decrease may be a reflection of post-processing.
The TopDown Algorithm generates noise using a cryptographically secure random number generator. For every statistic being calculated, the algorithm draws the amount of statistical noise to add to or subtract from that statistic from a probability distribution centered at zero, whose spread is determined by the privacy-loss budget.
The most likely circumstance is that the generator pulls a zero and injects no noise. With slightly less likelihood, it will pull a one or a negative one. With even less likelihood, a two or a negative two, and so on.
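As an illustration of this kind of zero-centered integer noise, here is a minimal Python sketch of a two-sided (discrete) geometric sampler, one standard construction in differentially private systems. The epsilon value is purely illustrative, and the production TopDown Algorithm’s exact distribution and parameters may differ.

```python
import math
import random

# A cryptographically secure randomness source, as described above.
rng = random.SystemRandom()

def two_sided_geometric(epsilon):
    """Draw integer noise with P(k) proportional to exp(-epsilon * |k|):
    zero is the most likely draw, +/-1 slightly less likely, +/-2 less
    likely still, and so on."""
    def geometric():
        # Geometric on {0, 1, 2, ...} with ratio exp(-epsilon) between
        # successive probabilities.
        return int(math.log(1.0 - rng.random()) / -epsilon)
    # The difference of two i.i.d. geometric draws is two-sided geometric.
    return geometric() - geometric()

# Illustrative only: a smaller epsilon (a tighter privacy-loss budget)
# yields larger, more variable noise.
true_count = 542  # hypothetical statistic
print(true_count + two_sided_geometric(epsilon=0.5))
```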
Even with the exact same algorithm and the exact same settings, the random number generator produces variation from one run (one full pass of the system) to the next, simply by virtue of the difference in the random numbers that are generated.
By running the algorithm and producing data 25 times, we were able to assess not just the variability due to the settings of the system, which is possible with a single run, but also the variability from one run to the next: with the “luck of the draw,” the generator may pull different random numbers and yield different results.
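As a toy illustration of this multi-run analysis (reusing the hypothetical `two_sided_geometric` sampler sketched above), the snippet below noises the same statistic 25 times and summarizes the run-to-run spread. Comparing a small and a large hypothetical total also hints at why relative variability shrinks as geographies grow, though the real system’s aggregation and post-processing make the full picture more complex.

```python
import statistics

def run_once(true_count, epsilon):
    # One "run": the same statistic, freshly noised.
    return true_count + two_sided_geometric(epsilon)

for true_count in (550, 50_000):  # small vs. large hypothetical geography
    runs = [run_once(true_count, epsilon=0.5) for _ in range(25)]
    spread = statistics.stdev(runs)
    print(f"pop {true_count}: stdev across 25 runs = {spread:.1f} "
          f"({spread / true_count:.4%} of the total)")
```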
The analysis showed that this run-to-run variability did not appear to impact statistics for legislative districts.
Join Us Friday, June 4, for a Webinar on This Research
We are hosting a webinar this Friday, June 4, to walk through the research and take audience questions. The webinar will be recorded and posted as part of our series on Disclosure Avoidance. Transcripts and recordings for the previous webinars are also on the series page.
- Date: Friday, June 4
- Time: 2:00 – 3:00 pm (ET)
- Dial-in: 888-996-4917
- Passcode: 9385910#
- Link: Log-In Details
- WebEx event number (if needed): 199 855 0149
- Event password: Census#1 (required only if you join from a WebEx application or mobile app)
2021 Key Dates, Redistricting (P.L. 94-171) Data Product:
The Census Bureau’s Data Stewardship Executive Policymaking Committee (DSEP) will meet in early June to review the latest data regarding the TopDown Algorithm and approve settings and parameters. Their decisions will be informed by the feedback we’ve received from numerous stakeholders, which has resulted in ongoing fine-tuning of the algorithm since the release of the last demonstration data set on April 28. Additional fine-tuning as directed by the DSEP will continue through June, with quality control analysis leading to the FTP release of the redistricting data by August 16.
Early June:
- The Census Bureau’s Data Stewardship Executive Policy (DSEP) Committee makes the final determination of the privacy-loss budget (PLB) and system parameters for P.L. 94-171, based on data user feedback.
Late June:
- Final DAS production run and quality control analysis begins for P.L. 94-171 data.
By August 16:
- Release 2020 Census P.L. 94-171 data as Legacy Format Summary File*.
September:
- Census Bureau releases the Privacy-Protected Microdata Files (PPMFs) and Detailed Summary Metrics from applying the production version of the DAS to the 2010 Census data.
- Census Bureau releases production code base for P.L. 94-171 redistricting summary data file and related technical papers.
By September 30:
- Release 2020 Census P.L. 94-171 data** and Differential Privacy Handbook.
* Released via Census Bureau FTP site.
** Released via data.census.gov.
Was this forwarded to you?
Sign up to receive your own copy!