Research Data Guide: Textual Data Refined for Research Use


Take care of your data management skills. Data management skills are fundamental for researchers. Together with data management planning, they ensure that researchers can identify and manage risks related to data handling (e.g., data protection, data security, data access rights, and data storage). The University of Helsinki's Data Support provides free data management training for researchers. Data Support also offers guidance and training, as well as tools for data management planning.


This guide addresses research data management issues related to textual data refined for research use – particularly text content handled in a structured format. The text content may originate from sources external to the research group, such as various memory institutions, or it may be collected by the research group itself – for example, scanned books and other documents. This guide primarily focuses on working with textual data obtained from external sources, but the proposed solutions can also be applied to self-collected or project-generated data.

Special attention is given to converting data into RDF format (Resource Description Framework, a standardized model for data exchange, particularly between web applications) and creating a semantic web interface (see, for example, BiographySampo [the website is in Finnish]).
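To make the RDF conversion step concrete, below is a minimal sketch in Python using the rdflib library; the namespace, class, property names, and example values are illustrative assumptions, not the data model of any existing Sampo portal.

    # A minimal sketch of turning one structured record into RDF triples with rdflib.
    # The ex: namespace and all property names below are hypothetical.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS, XSD

    EX = Namespace("http://example.org/biography/")  # hypothetical namespace

    g = Graph()
    g.bind("ex", EX)

    # One refined record, e.g. a row parsed from a source organization's export.
    person = EX["person_001"]
    g.add((person, RDF.type, EX.Person))
    g.add((person, RDFS.label, Literal("Example Person", lang="en")))
    g.add((person, EX.birthYear, Literal("1885", datatype=XSD.gYear)))
    g.add((person, EX.occupation, Literal("author", lang="en")))

    # Serialize as Turtle so the data can be loaded into a triple store
    # behind a semantic web portal.
    print(g.serialize(format="turtle"))

The same pattern scales to whole datasets: each record becomes a set of triples, and the resulting graph can be served through a SPARQL endpoint or a portal built on top of it.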

This guide uses the following terms:

  • raw data – natural or structured textual data
  • source organizations – entities from which the raw data is obtained
  • refined data – textual data converted into a structured format
  • processing pipeline – the software used to transform raw data into refined data
  • portal – the end-user interface

Published data as a research output holds the same status as an article or a book. In projects that study large structured textual datasets, publishing the refined data generated during the project is emphasized more strongly than in typical projects employing a text-based research approach. In addition to research articles and other traditional research publications, the structured refined data is invariably made available through a separate user interface or published as such. To enable the publication of the structured data generated in the research project, attention should be given to proactive data management planning.

The documentation of data and the research process is important. To make data publicly accessible for others to use, the data and research process must be carefully documented. It is advisable to record at the earliest possible stage how data is collected and refined, what potential shortcomings arise in the processing, etc.

Multidisciplinarity and multiple organizations. Research is typically conducted collaboratively among multiple organizations and across disciplines—for example, computer scientists refine the data, while historians conduct research based on the generated data. The more carefully and earlier the research team plans the project and its data management, the fewer resources will be needed during the project for repeating processing, programming, or other workflows. Below are some key questions that should be resolved before starting the research.

Key aspects that should be planned in advance include:

  • Where does the data come from? Is it collected independently or obtained from a source organization?
  • If the data comes from a source organization, what kind of agreements need to be written, and what restrictions on data usage will be included in the agreements?
  • Which research organizations are involved in the project, and how will the division of tasks between different organizations and researchers be arranged?
  • How can version control be designed to be unambiguous and understandable to all parties?
  • How can the processing pipeline be designed so that the process is automatable and reproducible?
  • In what format will the data be processed and described?
  • Which refined datasets will be published? Where? Under what conditions?

Routine in making agreements. Research involving large textual datasets often relies on text corpora obtained from other organizations, making agreements related to data management crucial. It is in the best interest of both the research group and the organization providing the text corpus to have a clear, written, and mutually agreed-upon contract for data usage and publication. Establishing a routine for drafting agreements on data handling is advisable. The agreement does not need to be complex, but clarity and unambiguity are essential. The agreement should clearly define at least what data is provided, to whom, under what conditions, what data can be published, under what license, and in which repository. Vague agreements can lead to the publication of incorrect data, additional work, and ambiguities regarding agreed-upon wording, which unnecessarily consume resources on both sides.

Guidance and a preliminary information form available on the University of Helsinki’s intranet can be used to help draft agreements, which should also be reviewed with the organization's legal team before signing.

Boundaries defined by agreements. Research activities are regulated by both agreements and legislation. When studying large textual datasets, agreements with source organizations play a significant role, as these organizations provide the text or metadata corpus for research use, and research cannot proceed without their permission. Therefore, it is particularly important for the research team to be aware of the restrictions imposed by prior agreements that the source organization has made concerning the data. Typical requirements may include filtering out living individuals from the dataset or removing specific sections before publication.

Boundaries defined by legislation. From a legal perspective, the European Union's General Data Protection Regulation (GDPR) in particular defines how personal data relating to living individuals may be processed and published. Individuals whose data is processed (referred to as data subjects under the GDPR) have the right to be informed about what data is collected about them, how it is used, and by whom. It is essential for the research team to be aware of the GDPR and other legal requirements before publishing data or a related portal, as non-compliance could make the publication unlawful.

How to comply with the GDPR requirements? If the dataset consists of personal data of deceased individuals (e.g., WarSampo database of casualties) or is inherently publicly available (e.g., ParliamentSampo [the website is in Finnish]), the GDPR does not affect the creation or publication of the refined data. However, datasets that contain personal data of both deceased individuals and living individuals or their relatives (e.g., BiographySampo [the website is in Finnish]) fall into a legal gray area regarding GDPR compliance. In such cases, a dedicated privacy notice for the portal must be created and easily accessible. Creating such a notice is a hallmark of high-quality data handling and publication.

For data publications, either all individuals whose data is included must be contacted, or the research team must verify how the source organization has informed individuals about the use of their personal data. The data protection measures adopted by the source organization also determine the boundaries for research groups refining the data. Additionally, early contact with the institution’s data protection officer and legal advisors is recommended to ensure the legality of the planned research, data publication, and portal.

Refining data into a structured format may reveal otherwise hidden sensitive personal data. In the case of BiographySampo (the website is in Finnish), the information is not particularly sensitive or high-risk but still constitutes personal data. An important question is whether refining the data into a structured format introduces a new informational layer to the exposed personal data. If the refinement reveals otherwise hidden sensitive personal information (i.e., special categories of personal data as defined by GDPR), particular attention must be paid to data handling.

It is preferable to collect too many data fields rather than too few. Since textual datasets are typically gathered from multiple source organizations, the data is inherently heterogeneous. Therefore, it must be standardized into a format that is as structured as possible to be useful for research. To make this standardization process as efficient as possible, it is advisable to request that source organizations provide more data fields rather than fewer. This can potentially save a significant amount of time during research by preventing the need to repeat certain tasks as the work progresses.

Text should be as structured as possible. When refining large textual datasets, the data should be as structured as possible. For example, tabular files (e.g., .csv) are more structured and thus easier to use than text documents (e.g., .docx). Completely unstructured, natural text is particularly challenging to refine, which should be taken into account when planning and scheduling research. To ensure smooth research progress, it is advisable to request textual datasets from source organizations in the most structured format possible, such as .csv.
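As a small illustration of why a structured delivery format pays off, the sketch below reads a hypothetical CSV export with Python's standard csv module; the file name and column names are assumptions made for the example.

    # A minimal sketch: reading a structured CSV export received from a source
    # organization. The file name and column names are hypothetical.
    import csv

    with open("source_export.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # the header row gives every field a name
        for row in reader:
            # Each record arrives already split into named fields, so no ad hoc
            # text parsing is needed before further refinement.
            print(row["id"], row["title"], row["year"])

The same records delivered as free-flowing text in a .docx file would first require layout analysis and manual splitting before any comparable processing could begin.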

A lack of routine leads to time-consuming investigations. Just as with contracts, acquiring data from source organizations often involves a great deal of internal processing within those organizations, partly due to the absence of established procedures. These processes can delay the start of the research. One possible way to streamline the process is to prepare a contact template within the research group. This template would outline the basic information about the requested data, how it will be refined, planned publications (such as a portal, data publication, and research articles), and so on. The key focus in designing the contact template should be on making the internal processes of the source organizations as smooth as possible.

The time required to fulfill source organizations' requirements. Source organizations may impose requirements on which parts of the textual dataset can be used or published in a portal or as a data publication. A typical example is the removal of living individuals from the dataset due to GDPR requirements. Implementing these requirements can be time-consuming, but potential problems can be avoided with early planning and by allocating sufficient time for this step.

Despite careful planning and clear agreements, the raw data that is refined is not error-free. Typical issues include errors and inaccuracies related to optical character recognition (OCR). In principle, the researcher or research group can manually correct these errors, but in the long term, a better solution is to contact the source organization and request them to modify the raw data. This prevents the research team from having to make the same corrections repeatedly when the raw data is reprocessed through the pipeline. With this in mind, it is advisable to allow flexibility in scheduling, as correcting errors may take time for the source organization.

The processing pipeline must be repeatable, documented, and automatable. Especially when designing a user portal with a lifespan of several years, the data refinement processing pipeline should be designed to be repeatable and automated. This means that when the source organization’s raw data changes for any reason, those changes are reflected in the refined data of the portal by rerunning the raw data through the processing pipeline. It is therefore crucial to create comprehensive documentation of the pipeline’s programming and functionality so that any broken features can be repaired, even if the composition of the research team changes.
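One way to meet these requirements is to structure the pipeline as a single re-runnable script in which every stage is a named, documented function, as in the sketch below; the stage names, file formats, and paths are illustrative assumptions, not a description of any existing pipeline.

    # A minimal sketch of a re-runnable processing pipeline: raw data in, refined
    # data out. Directory names, file formats, and stages are hypothetical.
    import json
    from pathlib import Path

    RAW_DIR = Path("data/raw")          # dumps received from the source organization
    REFINED_DIR = Path("data/refined")  # structured output used by the portal

    def extract(raw_file: Path) -> list:
        """Parse one raw file into a list of records (real parsing logic omitted)."""
        return [{"id": raw_file.stem, "text": raw_file.read_text(encoding="utf-8")}]

    def transform(records: list) -> list:
        """Clean and enrich the records, e.g. normalize whitespace or link entities."""
        return [{**r, "text": r["text"].strip()} for r in records]

    def load(records: list, out_file: Path) -> None:
        """Write the refined records to disk in a structured format."""
        out_file.write_text(json.dumps(records, ensure_ascii=False, indent=2),
                            encoding="utf-8")

    def run_pipeline() -> None:
        REFINED_DIR.mkdir(parents=True, exist_ok=True)
        for raw_file in sorted(RAW_DIR.glob("*.txt")):
            load(transform(extract(raw_file)), REFINED_DIR / f"{raw_file.stem}.json")

    if __name__ == "__main__":
        # Rerunning this script after the source organization updates the raw data
        # regenerates the refined data without any manual steps.
        run_pipeline()

Because the whole chain runs from a single entry point, rerunning it after a raw data update is a one-command operation, and the docstrings double as a minimum level of pipeline documentation.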

Clarity in internal communication within the research team should be prioritized. Since research on large textual datasets is typically conducted in research groups involving multiple organizations, internal communication within the research team should be emphasized from a data management perspective as well. Clear and unambiguous communication ensures that two researchers do not perform the same processing on the same dataset and that no ambiguities arise in version control.

The importance of documenting processing rather than the data itself. Since the research of large textual datasets primarily relies on data obtained from source organizations, in principle, there is no need to describe the raw data itself. In practice, however, source organizations rarely describe their data in great detail, as the material was not originally collected for open publication – in such cases, the responsibility for documentation falls on the research team. If additional descriptions and metadata are needed, their formulation should be integrated into other processing. More important is the documentation of research processes, as it ensures research transparency and facilitates the transmission of good practices and tools. Important aspects to document include the tools used as well as versioning information (which operating system version, software library, and source code were used, etc.).
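One lightweight way to capture this versioning information is to write it out automatically at the end of every pipeline run, as in the sketch below; the output file name and the list of recorded packages are assumptions made for the example.

    # A minimal sketch: record the runtime environment next to the refined data so
    # that a pipeline run can be reproduced later. The output file name is hypothetical.
    import json
    import platform
    import sys
    from importlib import metadata

    packages = ["rdflib"]  # list the libraries your pipeline actually uses

    environment = {
        "python_version": sys.version,
        "operating_system": platform.platform(),
        "packages": {name: metadata.version(name) for name in packages},
    }

    with open("pipeline_environment.json", "w", encoding="utf-8") as f:
        json.dump(environment, f, indent=2)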

The separate publication of metadata promotes open science. In research on large textual datasets, it is common to describe the processing pipeline as part of a research article, but from the perspective of open science, there is nothing preventing the processing pipeline from also being described in a metadata file, for example in Etsin. Metadata files can describe the processing pipeline, document the tools used and their versions, and so on. Such metadata files make it easy to share effective and well-established methods that benefit other researchers.
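As one illustration of what such a description could contain, the sketch below writes a simple machine-readable summary of the pipeline to a JSON file; the fields, values, and repository URL are generic assumptions, not the metadata schema actually required by Etsin.

    # A minimal sketch of a machine-readable description of the processing pipeline.
    # Field names, values, and the repository URL are hypothetical examples.
    import json

    pipeline_metadata = {
        "title": "Processing pipeline for refined textual data",
        "description": "Converts raw text dumps into structured RDF for the project portal.",
        "source_code": "https://version.helsinki.fi/example-group/example-pipeline",
        "tools": [
            {"name": "Python", "version": "3.11"},
            {"name": "rdflib", "version": "7.0.0"},
        ],
        "license": "CC BY 4.0",
    }

    with open("pipeline_metadata.json", "w", encoding="utf-8") as f:
        json.dump(pipeline_metadata, f, ensure_ascii=False, indent=2)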

When selecting a storage solution, several factors must be taken into account, including:

  • how many people need access to the data,
  • how much data there is,
  • how sensitive the material is,
  • whether the research is conducted in collaboration with other universities.

For this reason, there is no single storage solution that is always suitable for everyone. It is highly recommended to carefully familiarize oneself with different storage solutions. The storage solutions offered by the University of Helsinki are listed in this table. The worst options are commercial cloud services, external hard drives, and USB sticks.

In the case of large textual datasets, version control plays a particularly important role when selecting a storage solution. When raw data is processed multiple times through a processing pipeline, it must be possible to ensure that any changes made to the data during processing are also reversible. For example, if the data becomes corrupted and the latest backup is several iterations old, good version control saves research resources.
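A simple complement to the storage service's own version control is to record a checksum of every refined file after each pipeline run, as sketched below, so that later runs show exactly which files have changed; the directory and manifest names are assumptions made for the example.

    # A minimal sketch: compute SHA-256 checksums of the refined data files after
    # each pipeline run. Comparing manifests from two runs reveals exactly which
    # files changed. Directory and manifest names are hypothetical.
    import hashlib
    import json
    from pathlib import Path

    REFINED_DIR = Path("data/refined")

    manifest = {}
    for path in sorted(REFINED_DIR.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(REFINED_DIR))] = digest

    Path("refined_checksums.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )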

Here are some recommended storage solutions for large textual datasets. The recommendations are based on the storage solutions offered by the University of Helsinki and the assumption that textual datasets contain at most low-risk personal data.

  • For source code, it is recommended to use the University of Helsinki’s GitLab, i.e., version.helsinki.fi.
  • If a researcher is working alone at the University of Helsinki and accumulates less than 100 GB of data, the home directory (Z drive) is a good option, as it provides sufficient data security, precise access control, version control, and automatic backup.
  • If a researcher is working alone or in a group consisting only of University of Helsinki researchers and accumulates less than 10 TB of data, the group directory (P drive) is a good option, as it provides sufficient data security, precise access control, version control, and automatic backup.
  • If the research group includes external collaborators outside the University of Helsinki, suitable solutions include several CSC services, Fairdata IDA, and, if necessary, Microsoft Teams. It should be noted that many of these solutions are not suitable for storing data containing personal information. If personal data must be stored in these services, the data should be encrypted (e.g., using Cryptomator or 7-Zip software). A commonly used storage solution is CSC’s Allas; however, it should be noted that data cannot be processed within Allas, and for processing, the data must be downloaded to another storage solution.
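For the encryption step mentioned above, one possible approach is sketched below: packaging a data folder with the 7-Zip command-line tool called from Python. It assumes the 7z executable is installed and on the PATH, and the folder and archive names are hypothetical.

    # A minimal sketch: package and encrypt a data folder with 7-Zip before
    # uploading it to a shared service. Assumes the 7z command-line tool is
    # installed; folder and archive names are hypothetical.
    import subprocess

    subprocess.run(
        [
            "7z", "a",           # add files to an archive
            "-p",                # 7z prompts for the password interactively
            "-mhe=on",           # also encrypt file names (7z format only)
            "personal_data.7z",  # output archive
            "data/refined/",     # folder to encrypt
        ],
        check=True,
    )

Cryptomator offers a comparable result through a graphical interface, which may be easier when several team members need to work with the encrypted files.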

Textual datasets refined for research purposes are typically published in two forms in addition to research articles: as data for further refinement and as a portal. Data for further refinement can be made openly available as a data publication, for example on platforms such as Zenodo, the Language Bank of Finland, and Linked Data Finland. The portal is published as a separate website with access that is as open as possible. In both cases, the same data is presented but in different formats: in the portal, the data is more easily accessible to a wider audience, while as a data publication it is more useful for fellow researchers, who can adapt it to their own research. There may also be some differences in the published data due to requirements from source organizations – for example, a source organization may require the removal of certain information from the data publication, while the portal redirects users to the source organization's own service for that information.

Long-term preservation needs to be planned in advance. Since projects researching large textual datasets focus not only on research but also on publishing portals and refined data, these projects have an especially long tail. The goal is for the portal to remain accessible even after the project's funding has ended. Therefore, it is particularly important to carefully plan how the portal's functionality will be ensured in the long term—for example: what happens to the portal when the project's principal investigator retires, how will server costs be covered, and how will software updates be managed?

