Purpose

The primary purpose of this project is to evaluate the efficacy of privacy enhancing technologies (PETs) for algorithmic transparency. If successful, the project aims to enable researchers outside Twitter to study data and models held by the firm using PETs, without having direct access to the underlying information being studied.

Datasets

The central datasets in the first project come from the paper Algorithmic Amplification of Politics on Twitter, together with synthetic reproductions of the private datasets used in that paper, created for development and testing. The largest synthetic dataset contains approximately 1 billion rows of data.

PETs used

Remote execution (sometimes called federated learning/analytics), differential privacy, and secure multi-party computation.

Details of computation

A dataset is uploaded to a PySyft domain node, and the data owner configures the node with a user account for a data scientist. The data scientist can then obtain a pointer to the uploaded dataset that lets them perform operations on it as if it were a normal NumPy array, although they cannot see the results of their computations along the way.
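The pointer workflow can be illustrated with a toy, single-process model. This is a minimal sketch and not the PySyft API: `DomainNode`, `Pointer`, and their methods are hypothetical names introduced here, and plain Python lists stand in for NumPy arrays. In a real deployment the node runs on the data owner's infrastructure, so the private values are physically out of reach rather than hidden by convention.

```python
class DomainNode:
    """Hypothetical stand-in for a data owner's node (not the PySyft API)."""

    def __init__(self):
        self._store = {}    # object id -> private value held on the node
        self._next_id = 0

    def upload(self, values):
        """Store a private dataset and return only a pointer to it."""
        return self._register(list(values))

    def _register(self, value):
        obj_id = self._next_id
        self._next_id += 1
        self._store[obj_id] = value
        return Pointer(self, obj_id)

    def _run(self, op, obj_id, *args):
        """Run an operation next to the data; the result stays on the node."""
        data = self._store[obj_id]
        if op == "mul":
            result = [x * args[0] for x in data]
        elif op == "add":
            result = [x + args[0] for x in data]
        elif op == "sum":
            result = sum(data)
        else:
            raise ValueError(f"unsupported operation: {op}")
        return self._register(result)


class Pointer:
    """Handle a data scientist holds; it exposes no values, only operations."""

    def __init__(self, node, obj_id):
        self._node = node
        self._obj_id = obj_id

    def __mul__(self, scalar):
        return self._node._run("mul", self._obj_id, scalar)

    def __add__(self, scalar):
        return self._node._run("add", self._obj_id, scalar)

    def sum(self):
        return self._node._run("sum", self._obj_id)

    def __repr__(self):
        return f"<Pointer id={self._obj_id}>"


node = DomainNode()                    # set up by the data owner
ptr = node.upload([1.0, 2.0, 3.0])     # private data goes onto the node
total_ptr = (ptr * 2 + 1).sum()        # array-style code, executed remotely
print(total_ptr)                       # shows a pointer, not the total
```

Every operation returns another pointer, so the data scientist can chain array-style expressions while intermediate and final values remain on the node until they are released through a separate, privacy-protected step.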

Once they have concluded their computation, they can see the results by using the adversarial differential privacy system, which adds statistical noise to their result. Adding this noise also spends privacy budget, which is tracked at the level of individual data subjects.
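The release step can be sketched as a noisy query that debits a per-subject ledger. This is a simplified illustration under stated assumptions, not the mechanism used in the project: it assumes pure epsilon-differential privacy with Laplace noise, one value per data subject, and the same charge for every subject in the query. `PrivacyLedger`, `private_sum`, and `BudgetExceeded` are hypothetical names, and the floating-point noise sampling shown here is not hardened for production use.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a query would overspend some data subject's budget."""


class PrivacyLedger:
    """Tracks the remaining epsilon for each data subject separately."""

    def __init__(self, subject_ids, budget_per_subject):
        self.remaining = {sid: budget_per_subject for sid in subject_ids}

    def spend(self, subject_ids, epsilon):
        # Check everyone first so a rejected query spends nothing.
        if any(self.remaining[sid] < epsilon for sid in subject_ids):
            raise BudgetExceeded("query would exceed a subject's budget")
        for sid in subject_ids:
            self.remaining[sid] -= epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, ledger, epsilon, lower, upper, rng):
    """Noisy sum over `values`, a dict of subject id -> one number."""
    ledger.spend(values.keys(), epsilon)
    clipped = [min(max(v, lower), upper) for v in values.values()]
    # Adding or removing one subject moves the sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
data = {"alice": 3.0, "bob": 7.5, "carol": 1.0}   # made-up values
ledger = PrivacyLedger(data.keys(), budget_per_subject=1.0)

noisy = private_sum(data, ledger, epsilon=0.4, lower=0.0, upper=10.0, rng=rng)
print(noisy, ledger.remaining)   # each subject now has about 0.6 left
```

Once any subject's remaining budget is smaller than the requested epsilon, `spend` raises `BudgetExceeded` and no result is released, which is how the ledger bounds what can be learned about that person.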

Parties and trust relationship

The current phase of the project is designed for an external trusted party to test the system by using PySyft to reproduce research results against a synthetic dataset. The next phase of the project is anticipated to replace this data with data from the paper for end-to-end testing.

Implementation status

Ongoing Proof of Concept

Resources

Announcing our Partnership with Twitter to Advance Algorithmic Transparency

Investing in privacy enhancing tech to advance transparency in ML

Algorithmic Amplification of Politics on Twitter

Christchurch Call Initiative on Algorithmic Outcomes https://www.christchurchcall.com/media-and-resources/news-and-updates/christchurch-call-initiative-on-algorithmic-outcomes/

Christchurch Call Initiative on Algorithmic Outcomes https://www.amcham.co.nz/page-1334006/12928098


Background

Since 2016, Twitter users have been able to choose a preferred order for viewing their Home timeline from two options. The first option is to view Tweets from accounts the user has chosen to follow, presented in reverse chronological order. The second option is to view Tweets selected and ordered by a personalization algorithm, which prioritizes content for each user based on the system design and on how that user interacts with the system. As a result, older Tweets and Tweets from accounts the user does not follow may be prioritized in the Home timeline.

In October 2021, Twitter published findings from an internal analysis of whether its recommendation algorithms amplify political content. The study analyzed millions of Tweets posted between 1 April and 15 August 2020 by elected officials in seven countries: Canada (House of Commons members), France (French National Assembly members), Germany (German Bundestag members), Japan (House of Representatives members), Spain (Congress of Deputies members), United Kingdom (House of Commons members), and United States (official and personal accounts of House of Representatives and Senate members). It also analyzed hundreds of millions of Tweets containing links to articles shared on Twitter during the same period.

The study found that Tweets from elected officials are algorithmically amplified when compared to political content on the reverse chronological timeline. Algorithmic amplification was found to be an individualized effect (i.e., similar users received different results). Tweets posted from accounts on the political right received more algorithmic amplification than those posted by accounts on the political left in all countries but Germany. News outlets were categorized based on media bias ratings from two independent organizations, and the study found that right-leaning news outlets received greater algorithmic amplification than left-leaning news outlets.

Twitter’s ML Ethics, Transparency, and Accountability (META) team aims to determine whether the algorithmic amplification identified in the study results from preferential treatment in the algorithm’s design or instead reflects user interactions, so that adverse impacts can be reduced. In addition to sharing aggregate data with researchers in order to reproduce the study, META would like to provide researchers with access to the raw data from which the aggregates were calculated. However, privacy concerns have previously prevented the sharing of raw data, which limits reproducibility and the benefits of having many researchers examine these important issues from multiple perspectives and approaches.

Case Study description

On 20 January 2022, Twitter announced a partnership with OpenMined, an open-source nonprofit organization, to use PETs to replicate the findings of the political amplification study using synthetic data based on the original data. 

Most differential privacy work occurs in a “trusted curator” setting, where the data scientist has access to the raw (and often sensitive) data, and determines for themselves how much noise is sufficient to add in order to protect privacy, before publishing. This assumes a lot of expertise and trust.

In contrast, OpenMined has been building its differential privacy system in an adversarial setting, where differential privacy mechanisms are designed to protect data from an adversary that is studying it. In practice, this means that any output party is required to remain within a privacy budget (low trust). Additionally, the use of remote execution environments and Tensor pointers allows output parties to use the data without having intimate knowledge of sophisticated differential privacy mechanisms and how and when to apply them.

In addition, OpenMined’s differential privacy system allows for privacy budgets to be stored at an individual data subject level. This ensures that an output party can be further limited as to how much information they can learn about any individual in a dataset. The ability to store and track privacy budgets at an individual level also allows for much tighter accounting of privacy loss compared to traditional differential privacy, which adopts a pessimistic approach by considering only the worst-case estimate over all data subjects and all possible values of their data, for every single analysis.
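To make the contrast concrete, consider a sum released with Laplace noise of scale b, where each subject's contribution is clipped to magnitude at most C. Worst-case accounting charges every subject C/b, while individual accounting can charge a subject whose clipped contribution has magnitude |v| only |v|/b. The sketch below compares the two charges. It is a simplified illustration under these assumptions, with hypothetical function names and made-up values, and should not be read as OpenMined's actual accounting method.

```python
def worst_case_charges(values, clip, noise_scale):
    """Charge every subject as if they contributed the maximum `clip`."""
    epsilon = clip / noise_scale
    return {sid: epsilon for sid in values}


def individual_charges(values, clip, noise_scale):
    """Charge each subject according to their own clipped contribution."""
    return {sid: min(abs(v), clip) / noise_scale for sid, v in values.items()}


contributions = {"alice": 0.0, "bob": 1.0, "carol": 10.0}   # made-up values
worst = worst_case_charges(contributions, clip=10.0, noise_scale=20.0)
indiv = individual_charges(contributions, clip=10.0, noise_scale=20.0)

for sid in contributions:
    print(sid, worst[sid], indiv[sid])
```

In this example a subject who contributes nothing is charged nothing, and only the subject at the clipping bound pays the full worst-case amount. Because individual charges depend on the private values themselves, the per-subject balances would also need to be kept hidden from the output party.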

However, to the best of our knowledge, no large-scale demonstrations of this kind of differential privacy system (adversarial, and with a large number of individual privacy budgets) have been conducted. The partnership between OpenMined and Twitter would be the first to attempt to show an adversarial differential privacy system with millions of individual privacy budgets for different data subjects. 

A future aim of the project is to enable researchers to conduct studies with actual data rather than being limited to the data currently available via the public Twitter API. Therefore, this project serves as a first step towards implementing PETs to enable researchers to conduct research using the same data that Twitter uses in their own internal analyses to improve accountability while preserving privacy.

Outcomes and lessons learned

It is possible to have a differential privacy system that tracks millions of individual privacy budgets.

Differential privacy in an adversarial setting is possible, and displays several benefits:

  • The output party (a data scientist in this case) can work with a Tensor pointer using the exact same functions and methods as if they were using their statistical analysis framework of choice (NumPy, PyTorch, etc.).
  • The output party is not forced to know how differential privacy works to get the insights they want.
  • Output parties are constrained by their own privacy budget as to how much they can learn about a given dataset, thereby limiting any adversarial party’s ability to do damage.

This case study demonstrated that it is possible to perform queries on a dataset on a PySyft domain node and get results without ever viewing the private or sensitive dataset being queried.

Consequently, privacy enhancing methods from this study are being further developed through an international pilot initiative to facilitate research on real world data across multiple platforms.

In September 2022, in conjunction with the UN General Assembly and Christchurch Call leaders summit, New Zealand Prime Minister Jacinda Ardern and French President Emmanuel Macron announced the Christchurch Call Initiative on Algorithmic Outcomes, a partnership between New Zealand, the United States, Twitter, Microsoft and OpenMined to develop and test a differential privacy system to enable privacy preserving research across multiple online platforms. The pilot will serve as a “proof of function” regarding the use of PETs to facilitate open and transparent, multi-stakeholder research to enable better understanding of algorithmic outcomes, particularly the role of algorithms in content discovery and amplification, while preserving the privacy of individuals. The pilot will also demonstrate that the underlying techniques can be scaled to meet real-world legal, policy and other requirements.