Take care of your data management skills. Data management skills are fundamental for researchers. Together with data management planning, they ensure that researchers can identify and manage risks related to data handling (e.g., data protection, data security, data access rights, and data storage). The University of Helsinki's Data Support provides free data management training for researchers, as well as guidance and tools for data management planning.
Please note! Several social media platforms (e.g., X) have in recent years restricted researchers' access to their data, which has complicated API-based data collection. We are updating the handbook to reflect the current situation.
This guide covers content collected from social media platforms for research purposes (social media data) as well as other digital footprints. For researchers, it is essential to understand the operational principles of the social media platform being studied. It is also important to actively monitor changes in social media platforms, as the field evolves at an exceptionally fast pace. Social media is still a relatively new research environment. These specific characteristics result in many details that need to be considered. However, this should not be discouraging, as social media data can be a valuable resource for research. See also Toolkit for Digital Methods.
Having pre-agreed guidelines for collecting social media data facilitates research. Since social media data collection usually begins quickly after an interesting phenomenon has been identified, successful data management depends on pre-agreed guidelines within the research team. By planning data collection in advance, your research team can save valuable time, effort, and resources. Below are key questions for which the research team should agree on basic solutions in advance. Otherwise, decisions may have to be made in a rush, which can lead to immediate problems (the collected data may not align well with the research objectives) or long-term issues (data processing becomes difficult due to incompatible tools). Therefore, pay attention to these aspects as early as possible. (See also Venturini et al. 2018).
Key aspects to plan and clarify in advance include:
Does your planned data processing pose a high risk to research participants? At the University of Helsinki, you can use the "Does my research require an impact assessment?" form, available in the researcher's data protection guidelines in Flamma, to assist in this evaluation.
The collection of social media data is governed by a developer agreement. Social media data collection largely depends on the terms set by social media platforms, and researchers typically need to sign the platform's developer agreement, which sets specific limits on data collection and storage. By signing the developer agreement, the researcher gains access to social media data through an application programming interface (API) or other research interface. Some platform APIs may require payment, which should be considered when preparing the project budget.
The terms of service of social media platforms prohibit scraping, which refers to machine-assisted or automated data collection from public web pages. Scraping could, in theory, allow researchers to bypass the restrictions set by the developer agreement, but the terms of service of most platforms prohibit it. These terms vary by platform and change over time, so it is important to check them regularly to ensure compliance. The academic community also debates whether it is ethically necessary to follow platform terms of service, particularly when the goal is to study public discourse or disinformation (see, for example, Bruns 2019; Rogers 2018; Sandvig 2017).
Review the end-user license agreement (EULA) of the platform in question. The end-user license agreements (hereafter EULA) and terms of service of social media platforms, in addition to developer agreements, also define the scope of research activities, even though a platform’s EULA may not explicitly mention research use. In such cases, the license agreement and terms of service should be compared with current legislation or various fair use policies (e.g., Fair Use in U.S. copyright law, quotation rights or private use rights in Finnish legislation). (See, for example, Laaksonen & Salonen 2018; Obar & Oeldorf-Hirsch 2020).
Separate agreements apply to social media data obtained from commercial entities. Social media data for research purposes can also be acquired by purchasing pre-collected data from a commercial provider. This requires entering into an agreement with the commercial entity, which may be just as restrictive as a developer agreement. Carefully review the contract before signing.
A compromise solution for complying with the General Data Protection Regulation (GDPR) information requirement is to make public "noise" about one's research or to waive the information requirement. According to the GDPR, research participants—in this case, users of a social media platform whose data is included in the study—must be informed about the collection and use of their personal data. In principle, the information requirement for social media data is met when the user has accepted the platform's EULA and terms of service. However, very few users have actually read the EULA or terms of service, which theoretically places the responsibility of informing users on the researcher. In practice, it is impossible for the researcher to inform every individual whose content is included in the study—social media data is typically collected in multiple languages from hundreds, thousands, or even millions of users. Additionally, some user accounts may no longer exist by the time the research is published.
In certain limited cases, it is also possible to waive the requirement to inform research participants—for example, when the data has been obtained from sources other than the participants themselves and informing them proves impossible or would require disproportionate effort. If informing participants is not feasible, the researcher or research team can announce the data collection on the social media platform where the research is being conducted, on the research project’s website, or by other means of making public "noise" about the study. For example, the researcher or research team could post tweets with relevant hashtags related to the research topic, informing users about the ongoing study.
Even if the requirement to inform participants has been waived or only general public announcements about the research have been made as described above, a data protection notice must still be prepared and submitted for archiving to the data protection officer of the home institution (at the University of Helsinki, via email to tietosuoja@helsinki.fi). The data protection notice also serves as a document fulfilling the obligation of accountability. A template for this document can be found in Flamma at the University of Helsinki.
Social media data is primarily low-risk personal data but may also contain sensitive information. Social media data is personal data that can also be sensitive, as users may express their political views or religious beliefs. However, social media data is often public (visible to anyone on the internet) or semi-public (visible to all users of the platform), which lowers its risk level. In theory, every social media platform user has accepted the EULA and terms of service when joining the platform, which by default classify posts as public or semi-public. However, few average users have actually read the details of the EULA or terms of service (see, for example, Obar & Oeldorf-Hirsch 2020). As a result, it cannot be assumed that users are aware that their posts are as public as, for example, opinion pieces in newspapers. This presents an ethical challenge for researchers and complicates compliance with the requirement to inform research participants. (See, for example, AoIR).
Anonymization of social media data, or of direct quotes taken from it, can often be reversed using phrase searches. When publishing research, it must be taken into account that anonymization can often be reversed by performing a phrase search with a search engine. If the data is publicly available on the internet (e.g., a tweet, a Reddit post), a phrase search is highly likely to reverse the anonymization with minimal effort. Anonymization that can be reversed by a simple phrase search leaves the data identifiable, meaning it does not meet GDPR requirements for the anonymous processing of personal data. Complete anonymization would also require the removal of linked or mentioned accounts. When studying well-defined or rare social media phenomena, the risk of identifying individuals increases.
The anonymity of research participants can be strengthened by, for example, translating social media data into another language (if the original language differs from the language of the research) and not providing the original-language version. Another possibility is to use paraphrasing, meaning expressing the content in a way that remains faithful to the original meaning but is not word-for-word (see, for example, Markham 2012). These solutions bring the research closer to an ethically responsible application of GDPR requirements.
The transparency of research based on social media data improves when all actions taken are documented. Social media platform APIs, software used for data collection, and other tools may change, update, or become obsolete, making it impossible to replicate data collection. The transparency and reproducibility of research suffer if data collection cannot be repeated. By documenting the actions taken—such as the version of the API used, the software and scripts applied—transparency and reproducibility improve significantly. Another challenge to reproducibility is that content disappears from social media platforms when users leave or delete content. Additionally, users may later modify their old posts in ways that alter their meaning. It is therefore essential to document when the data collection took place.
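One lightweight way to do this is to store a small provenance record next to the data. The sketch below is only an illustration; the field names and values are hypothetical, not a standard.

```python
import json
from datetime import datetime, timezone

# Hypothetical provenance record; adapt the fields to your own project.
provenance = {
    "platform": "X",
    "api_version": "v2",
    "collection_tool": "twarc2 (version recorded here)",
    "query": "#examplehashtag lang:fi",
    "collected_at": datetime.now(timezone.utc).isoformat(),
    "collected_by": "research-team-name",
    "collection_script": "collect_posts.py (git commit recorded here)",
    "notes": "Recent search; retweets included.",
}

with open("collection_provenance.json", "w", encoding="utf-8") as f:
    json.dump(provenance, f, ensure_ascii=False, indent=2)
```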
The transparency of the data collection process is compromised when social media data is obtained from a commercial provider. When purchasing data from a commercial provider, the researcher may not know the principles by which the provider acquired the data for its platform. This is particularly problematic for large datasets, where it is difficult for the researcher to determine exactly how the data was collected. The lack of transparency in the collection process increases the researcher’s uncertainty about how the commercial provider’s platform selects data for the researcher. These selections are made by algorithms that may be classified as trade secrets. This creates blind spots in research that can be detrimental and may go unnoticed even by the researcher. (Joseph et al. 2014; Morstatter et al. 2013.)
An effective way to collect social media data is to use ready-made data collection tools. Several free, ready-made data collection tools are available online that make it easier to collect social media data for research purposes. These collectors are typically based on Python or R packages and open-source code; for example, Twarc is a widely used tool for collecting X data. Using these tools requires the researcher to have a developer agreement as well as at least basic programming skills or familiarity with the documentation. It is particularly recommended to review the data collection tool's documentation and, if necessary, its source code (both are available online for Twarc, for example).
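As a rough illustration of what using such a tool looks like, the snippet below follows the search example given in Twarc's own documentation for the X API v2; the bearer token and the query are placeholders, and the platform's access terms and pricing may have changed since this guide was written.

```python
import json
from twarc import Twarc2, expansions

# Placeholder credentials: requires a developer agreement and a bearer token.
client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

# Search recent posts matching an example query (placeholder).
for page in client.search_recent("#examplehashtag lang:fi"):
    # Flatten the paged response so each tweet carries its expansions.
    for tweet in expansions.flatten(page):
        print(json.dumps(tweet))
```

Writing each post on its own line (JSON Lines) keeps the raw data easy to process with the kind of cleaning scripts described further below.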
The transparency of social media data collection can be increased by programming custom data collection tools. Programming custom collectors requires significant programming skills and a considerable time investment from the researcher. The most commonly used data collection tools have been so widely adopted that they have undergone extensive testing through use, which collectors programmed by individual researchers lack. On the other hand, the researcher has precise knowledge of how the social media dataset is formed. If desired, the researcher can also share the collectors and scripts they have used (e.g., on GitHub), further improving the transparency of data collection.
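For illustration, a minimal custom collector often amounts to little more than paginated API requests written to a file. The sketch below uses a made-up endpoint, parameter names, and token; a real collector must follow the platform's actual API documentation and developer agreement.

```python
import json
import time
import requests

# Hypothetical endpoint and token: real values depend on the platform's API.
API_URL = "https://api.example.com/v1/posts/search"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

def collect(query: str, outfile: str, max_pages: int = 10, pause: float = 1.0) -> None:
    """Fetch paginated search results and write them as JSON Lines."""
    params = {"q": query}
    with open(outfile, "w", encoding="utf-8") as f:
        for _ in range(max_pages):
            response = requests.get(API_URL, headers=HEADERS, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
            for item in data.get("results", []):
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
            next_token = data.get("next_token")  # illustrative pagination field
            if not next_token:
                break
            params["next_token"] = next_token
            time.sleep(pause)  # stay within rate limits

collect("#examplehashtag", "posts.jsonl")
```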
Cleaning raw social media data can take a significant amount of time. Raw social media data contains a lot of extraneous information that researchers must filter out to make the data more useful for analysis. Additionally, the data may not be in a file format that standard office software can process, meaning it often needs to be converted into a different format. This may involve, for example, trimming JSON file contents, processing data with scripts, and saving it in CSV format.
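As a minimal sketch of this step, the snippet below reads a JSON Lines file of collected posts, keeps only a few fields, and writes the result as CSV. The field names are illustrative; the actual keys depend on the platform and the API version.

```python
import csv
import json

# Illustrative field names; actual keys vary by platform and API version.
FIELDS = ["id", "created_at", "author_id", "text"]

with open("posts.jsonl", encoding="utf-8") as infile, \
     open("posts_clean.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=FIELDS)
    writer.writeheader()
    for line in infile:
        post = json.loads(line)
        # Keep only the selected fields and discard the rest of the raw record.
        writer.writerow({field: post.get(field, "") for field in FIELDS})
```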
When cleaning social media data, it is essential to plan the intended analysis in advance, as the structure and outcome of the analysis will be determined by the processed data. Without proper planning, parts or even the entirety of the data cleaning process may need to be repeated, unnecessarily consuming research resources. The time required for cleaning raw data can be difficult to estimate, but it can easily take weeks or even months—especially with poor or no planning.
Transforming raw data into a format useful for research may reduce transparency and reproducibility if work processes are not documented. How has the raw data been processed and modified to better suit the research? What scripts have been used? Writing down answers to these kinds of questions after each stage of work significantly improves the transparency of the research. Documenting work processes also enhances internal communication and efficiency within the research team, as understanding another researcher’s scripts without documentation can be extremely challenging, time-consuming, and frustrating. The documentation of the research process can be considered metadata for the study, which is generally publishable even if the data itself cannot be made publicly available.
Instead of describing social media data itself, it is more meaningful to focus on documenting work processes. Social media data is generated independently of the researcher and for purposes outside of research, meaning that the researcher cannot document all metadata related to its creation. For example, a JSON file collected via X’s API contains a large amount of metadata, some of which (e.g., the background color of a user’s profile page) is rarely relevant for research. A researcher cannot be expected to be responsible for how a social media platform generates metadata or defines the parameters of the data provided to researchers. However, it is important for researchers to stay informed about the platform’s updates and their potential impact on data collection. Additionally, familiarity with critical research on social media platforms can be beneficial. In research documentation based on social media data, the most significant responsibilities for researchers relate not to describing the data itself but rather to describing the work process.
There is no single storage solution that fits all situations. When selecting a storage solution, the following factors should be considered: How many people need access to the data? How much data is there? How sensitive is the data? Is the research being conducted in collaboration with other universities? Because of these considerations, there is no universally suitable storage solution. It is highly recommended to carefully explore different storage options. The storage solutions offered by the University of Helsinki are listed in this table. Non-recommended options include commercial cloud services and external hard drives/USB sticks.
Here are some suggested storage solutions suitable for social media data. These recommendations are based on treating social media data as personal data that is low-risk but potentially sensitive.
Developer agreements prevent researchers from openly sharing or archiving collected social media data in its original form, or at all. Social media platforms impose restrictions on making data publicly available and archiving it for long-term storage. For example, X has allowed tweet data to be shared in a "dehydrated" format, meaning only tweet-specific ID numbers are provided. To retrieve the full content of the tweets ("rehydrate" them), these IDs must be fed back into X’s API. "Rehydration" can also be done without a developer agreement, but in that case the process must be carried out one tweet at a time. In spring 2023, X’s API became paid, making large-scale "rehydration" more problematic: data collection is now limited to 10,000 tweets per month for $100, with higher usage requiring additional payments. Instead of sharing the social media data itself, researchers can make their scripts publicly available, for example on GitHub, after the study. However, for this to be useful, the code should be cleaned and commented, as unstructured or undocumented scripts can be difficult or impossible to reuse.
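As an illustration, "dehydration" itself is a simple operation: only the ID of each post is kept, and the shared ID list can later be rehydrated through the API (Twarc's documentation, for instance, describes ready-made dehydrate and hydrate commands for this). The file and field names below are placeholders.

```python
import json

# "Dehydrate": keep only the post IDs so they can be shared
# without the post content. File names are placeholders.
with open("posts.jsonl", encoding="utf-8") as infile, \
     open("post_ids.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        post = json.loads(line)
        outfile.write(str(post["id"]) + "\n")
```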
After the research, the curation of social media data remains the researcher’s responsibility. The researcher must ensure that the social media data they have does not include posts that have since been deleted or otherwise hidden. The deletion of a post can be interpreted as the user’s expression of their right to be forgotten under GDPR, meaning the researcher must remove the content from their dataset. The requirement to delete content is also stated, for example, in X’s own guidelines. In practice, these considerations must be addressed within the research team, balancing the societal and scientific significance of the study, the constraints set by companies, and research ethics. The curation requirement also presents additional challenges for researchers, as precisely replicating research findings becomes extremely difficult or even impossible if the same dataset cannot be accessed or reused. Further complicating the practical execution of curation is the transition of X’s API to a paid model in spring 2023, since curation requires rehydrating the data, meaning it must be processed through the API. The curation requirement also gradually reduces the amount of data available to researchers and, in some cases, does so dramatically within a short period if the social media platform loses users for any reason. Under the curation requirement, the researcher’s object of study is always the social media platform and its content as it exists at the moment of retrieval, rather than the authentic discussions that may have since been deleted for various reasons. This significantly limits the possibilities for historical research on social media.
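In practice, curation can be approached by rehydrating the stored IDs, recording which posts still resolve, and filtering the local dataset accordingly. The sketch below assumes such a list of still-available IDs has already been produced; all file and field names are placeholders.

```python
import json

# still_available_ids.txt is assumed to list the IDs that still resolved
# when the stored IDs were rehydrated through the platform's API.
with open("still_available_ids.txt", encoding="utf-8") as f:
    available_ids = {line.strip() for line in f if line.strip()}

# Keep only posts that were still retrievable at curation time.
with open("posts.jsonl", encoding="utf-8") as infile, \
     open("posts_curated.jsonl", "w", encoding="utf-8") as outfile:
    for line in infile:
        post = json.loads(line)
        if str(post["id"]) in available_ids:
            outfile.write(line)
```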
Storing social media data after research for the researcher’s own use. The recommended options for storing raw data after research for personal use include CSC services and the University of Helsinki’s own storage solutions, such as the home directory (Z-drive) or the group directory (P-drive). The main issue with these solutions is their dependence on university user accounts: if the researcher’s employment at the university ends, access to these services is also lost. This condition applies to many other services as well, such as IDA. In such cases, transferring data to other services should be anticipated and planned well in advance. Commercial cloud services or external hard drives are not recommended for storing data after research. However, if no other options are available, encrypting the files and storing them in multiple locations can provide additional security.
Deleting data is also an option. Not all research data needs to be preserved; it can be destroyed after the research is completed. If the data is to be deleted, simply using the operating system's delete function (usually Delete) is not sufficient, as files can potentially be recovered later. More information on secure file deletion can be found on the University of Helsinki's Helpdesk website.
Bruns, A. (2019). After the ‘APIcalypse’: social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566. https://doi.org/10.1080/1369118X.2019.1637447
Joseph, K., Landwehr, P. M., & Carley, K. M. (2014). Two 1%s don’t make a whole: Comparing simultaneous samples from Twitter’s Streaming API. Lecture Notes in Computer Science, 8393, 75–83. https://doi.org/10.1007/978-3-319-05579-4_10
Laaksonen, S.-M., & Salonen, M. (2018). Kuka saa päättää, mitä dataa tutkijalla on käytössään? Ei ainakaan amerikkalainen suuryritys. [Who gets to decide what data researchers have access to? Not an American corporation, at least.] Rajapinta. https://rajapinta.co/2018/12/04/kuka-saa-paattaa-mita-dataa-tutkijalla-on-kayto…
Markham, A. (2012). Fabrication as Ethical Practice. Information, Communication & Society, 15(3), 334–353. https://doi.org/10.1080/1369118X.2011.641993
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter’s Streaming API with Twitter’s Firehose. Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM).
Obar, J. A., & Oeldorf-Hirsch, A. (2020). The biggest lie on the Internet: ignoring the privacy policies and terms of service policies of social networking services. Information, Communication & Society, 23(1), 128–147. https://doi.org/10.1080/1369118X.2018.1486870
Rogers, R. (2018). Social media research after the fake news debacle. Partecipazione e Conflitto: The Open Journal of Sociopolitical Studies, 11(2), 557–570. https://doi.org/10.1285/i20356609v11i2p557
Sandvig, C. (2017). Heading to the courthouse for Sandvig v. Sessions. https://socialmediacollective.org/2017/10/19/heading-to-the-courthouse-for-sand…
Venturini, T., Bounegru, L., Gray, J., & Rogers, R. (2018). A reality check(list) for digital methods. New Media and Society, 20(11), 4195–4217. https://doi.org/10.1177/1461444818769236