What has happened?
Yesterday afternoon (13/02), Social Science One announced what they called an “unprecedented” dataset of URLs shared on Facebook. The data, comprising about 38,000,000 URLs, spans 2017/01/01–2019/07/31 and is the largest dataset Facebook have ever made available to researchers.
Who are Social Science One?
Social Science One is an organisation run by Gary King and Nathaniel Persily that creates “partnership[s] between academic researchers and private industry to advance the goals of social science”. First launched in 2018, Social Science One’s partnership with Facebook “focuses on the effect of social media on democracy and elections, with access to Facebook data” [1].
What is the data?
The Facebook full URL shares dataset contains information about the distribution of URLs on Facebook and how the platform’s users interacted with them [2].
This includes aggregated data on the users who viewed, shared, liked, reacted to, shared without viewing, and otherwise interacted with these URLs. To be included in the data, a URL must have been publicly posted at least one hundred times. The data also covers Facebook shares (content shared from an existing post, rather than posted as an original URL).
The data offers vast potential to researchers and could offer insights into:
- The most common types of websites linked to (news; government; retail; etc.)
- The most successful types of links in terms of likes, reactions and Facebook shares
For misinformation researchers, the possibilities include:
- How Facebook users actively engage with misinformation
- How many posts are reported by users as false, hate speech or spam
- The proportions of URLs that originate from sites creating disinformation
The ideas listed above are by no means exhaustive.
How much data is there?
Most people will be familiar with the kilobyte (KB), megabyte (MB), gigabyte (GB) and probably the terabyte (TB). For most of us, terabytes are where our sense of scale maxes out. For example, my total server space for my research is ~13 terabytes – enough for about 25,000,000 images.
The Social Science One/Facebook release details how, in constructing this dataset, around an exabyte of raw data from Facebook was processed. An exabyte is 1,048,576 terabytes (2^60 bytes, using the binary convention). For the vast majority of people, this amount of information is inconceivable.
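As a quick back-of-the-envelope check (a minimal sketch in Python, using the binary convention above; the 13 TB figure is just my own server example):

```python
# Unit conversion, binary convention: 1 TB = 2**40 bytes, 1 EB = 2**60 bytes.
BYTES_PER_TB = 2 ** 40
BYTES_PER_EB = 2 ** 60

terabytes_per_exabyte = BYTES_PER_EB // BYTES_PER_TB
print(f"1 exabyte = {terabytes_per_exabyte:,} terabytes")  # 1,048,576

# How many ~13 TB research servers would an exabyte fill?
print(f"about {terabytes_per_exabyte // 13:,} servers of 13 TB each")  # about 80,659
```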
Data Description
The numbers:
| Parameter/Measure | Value |
| --- | --- |
| Date range | 2017/01/01–2019/07/31 (31 months) |
| Number of URLs | 37,983,353 |
| Countries | 38 |
| Age brackets | 18-24, 25-34, 35-44, 45-54, 55-64, 65+ |
Some of the key URL attributes in the data are:
| Attribute | Description | Example |
| --- | --- | --- |
| gender | The self-assigned Facebook user gender. | Male, Female, Other |
| url_rid | A unique URL ID created specifically for this dataset. | fb.me/ |
| clean_url | The full, original URL. | https://www.nytimes.com/2018/07/09/world/asia/thailand-cave-rescue-live-updates.html |
| parent_domain | The parent domain name. | nytimes.com |
| full_domain | The full domain name from the URL. | http://www.nytimes.com |
| first_post_time [timestamp] | The date/time at which the URL was first posted by a user on Facebook, truncated to 10-minute increments. | YYYY-MM-DD HH:MM:SS |
| share_title [text] | The og:title field in the original URL’s HTML. | 8 Saved, 5 to Go in Thai Cave Rescue |
| 3pfc_rating | The rating assigned if the URL was sent to third-party fact-checkers. | NULL if not fact-checked; otherwise ‘True’, ‘False’, ‘Prank Generator’, ‘False Headline or Mixture’, ‘Opinion’, ‘Satire’, ‘Not Eligible’ or ‘Not Rated’, among others. |
| spam_usr_feedback | The total number of unique users who reported posts containing the URL as spam. | An integer |
| false_news_usr_feedback | The total number of unique users who reported posts containing the URL as false news. | An integer |
| hate_speech_usr_feedback | The total number of unique users who reported posts containing the URL as hate speech. | An integer |
| public_shares_top_country | URL shares are tallied by country, and only the country with the most (differentially private*) shares is provided. | An ISO 3166-1 alpha-2 code, e.g. PL |
*see What is differential privacy? below
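To make the schema concrete, here is a minimal sketch of the kind of exploratory question the attribute table supports. The file name url_shares_extract.csv is hypothetical and the columns simply mirror the table above; the real data is only accessible through Social Science One’s approved research environment, not a local file.

```python
import pandas as pd

# Hypothetical local extract of the URL table; the real dataset is accessed
# through Social Science One's secure environment, not a CSV on disk.
df = pd.read_csv("url_shares_extract.csv")

# Which parent domains attract the most false-news reports from users?
reports_by_domain = (
    df.groupby("parent_domain")["false_news_usr_feedback"]
      .sum()
      .sort_values(ascending=False)
)
print(reports_by_domain.head(10))

# What proportion of URLs received any third-party fact-check rating?
rated_share = df["3pfc_rating"].notna().mean()
print(f"{rated_share:.1%} of URLs were rated by third-party fact-checkers")
```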
How is the data presented and accessed?
To access the data, academics and researchers must send proposals to Social Science One. If access is granted, training in data access and analysis will be provided to help researchers handle the huge troves of data. Proposals will be answered within a week and assessed on four main criteria (more information can be found here):
- Academic merit & feasibility
- Research ethics
- Likelihood that the knowledge resulting from the project will advance social good
- Qualifications of the proposed team
Proposals will be assessed solely by Social Science One with no involvement or pre-conditions from Facebook.
The full codebook for the dataset and how it is presented can be found here. The codebook also details another aspect of the dataset: differential privacy.
What is differential privacy?
Differential privacy is a way for social media companies (and other organisations) to provide data for research in a way that protects individual users. Tianqing Zhu (Deakin University, Australia) discusses differential privacy in the following way:
“Differential privacy makes it possible for tech companies to collect and share aggregate information about user habits, while maintaining the privacy of individual users.” [3]
Differential privacy works by deliberately adding statistical noise to a dataset – replacing some true values with incorrect ones – to protect the identity of individuals. In its simplest (randomised response) form, it takes every value in a set and essentially flips a coin on that value: one side keeps the true value, the other replaces it.
This is explained well by Mark Hansen (Columbia University, United States) here using the visualisation of a spinner. With an ε value of 0.00, the spinner is the fair coin mentioned above – each value has only a 50% chance of being reported truthfully. The higher the ε value, the more likely the true value is to be kept: Facebook’s ε of 0.45 corresponds to a truthful-report probability of e^0.45/(1 + e^0.45) ≈ 61%, meaning the reported values are only slightly more reliable than a coin flip.
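Here is a minimal sketch of that spinner as binary randomised response, assuming the simple keep-or-flip model described above rather than Facebook’s exact mechanism (which is detailed in the codebook):

```python
import math
import random

def truthful_probability(epsilon: float) -> float:
    """Chance the spinner reports the true value under epsilon-DP
    binary randomised response: e^eps / (1 + e^eps)."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomised_response(true_value: bool, epsilon: float) -> bool:
    """Report the true value with probability p; otherwise report its opposite."""
    return true_value if random.random() < truthful_probability(epsilon) else not true_value

print(f"{truthful_probability(0.00):.3f}")  # 0.500 - a fair coin: pure noise
print(f"{truthful_probability(0.45):.3f}")  # 0.611 - barely better than a coin flip
print(f"{truthful_probability(5.00):.3f}")  # 0.993 - almost always the true value
```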
While Social Science One have released two papers and open-source software to try to address this, the dataset in its current form is of very limited value to researchers: a large fraction of the reported values are statistical noise, not true values.
These measures, taken by Facebook and not by Social Science One, have not been well received, as they drastically constrain the research that can be conducted. Social Science One strongly criticised Facebook’s decision:
“after over a year of discussion, it became clear that applying differential privacy would be a legally acceptable path forward for Facebook. We disagreed with Facebook’s legal view, but the technology is now being used by the U.S. Census Bureau and other leading technology companies, and regulators seem to find it acceptable for sharing data in ways they would not otherwise allow. We think of differential privacy as a technological solution to a political problem”
“Differential privacy works by censoring certain values in the data and adding specially calibrated noise to statistical results or data cell values. This appropriately obscures any individual’s actions who may be in the data. However, from a statistical point of view, censoring and noise are the same as selection bias and measurement error bias — both serious statistical issues. It makes no sense to go through all this effort to provide data to researchers only to have researchers (and society at large) being misled and drawing the wrong conclusions about the effects of social media on elections and democracy.” [4]
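To illustrate the measurement-error point in the quote above, the following sketch (continuing the same hedged randomised-response model, not Facebook’s actual mechanism) shows how naively averaging noisy reports pulls an estimate towards 50% – and how it can be de-biased when ε is known:

```python
import math
import random

def simulate(true_rate: float, epsilon: float, n: int = 100_000) -> None:
    # Probability a report is truthful under binary randomised response.
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    # A report equals the true value exactly when the "spinner" lands on truth.
    reports = [
        (random.random() < true_rate) == (random.random() < p)
        for _ in range(n)
    ]
    naive = sum(reports) / n                   # biased towards 0.5
    corrected = (naive + p - 1) / (2 * p - 1)  # inverts E[report] = p*t + (1-p)*(1-t)
    print(f"true={true_rate:.2f}  naive={naive:.3f}  corrected={corrected:.3f}")

# With epsilon = 0.45, a true rate of 10% shows up as roughly 41% before correction.
simulate(true_rate=0.10, epsilon=0.45)
```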
Conclusion – the main takeaways
This data release from Social Science One and Facebook is contradictory. The underlying data offers an unprecedented view into how information, disinformation and misinformation spread, but Facebook’s decision to apply differential privacy to the standard it has leaves the released data with very little analytical value.
It’s disappointing that Facebook would make researchers go through a rigorous proposal and application process only to then provide them with data liable to yield far more false positives and false negatives than reliable results.
Facebook, and Social Science One, need to put effort into true data sharing – providing researchers with information that protects platform users but also allows productive research to be carried out.
References:
[1] Messing, Solomon; DeGregorio, Christina; Hillenbrand, Bennett; King, Gary; Mahanti, Saurav; Nayak, Chaya; Persily, Nathaniel; State, Bogdan; Wilkins, Arjun, 2019, “Facebook Privacy-Protected URL Table Release”, https://doi.org/INSERTNEWDOI, Harvard Dataverse, V0
[4] Ruggles, S., Fitch, C., Magnuson, D., & Schroeder, J. (2019, May). Differential privacy and Census data: Implications for social and economic research. AEA Papers and Proceedings, 109, 403–408.