Take care of your data management skills. Data management skills are fundamental for researchers. Together with data management planning, they ensure that researchers can identify and manage risks related to data handling (e.g., data protection, data security, data access rights, and data storage). The University of Helsinki's Data Support provides
This guide covers individual-level registry data collected by authorities for administrative and planning purposes, which were not originally gathered for research use. The data are maintained by registry authorities, the most significant of which are
Managing registry data requires planning, even when accessed remotely. Individual-level registry data often contain personal information that can be used to identify individuals. The largest registry authorities (Statistics Finland, Findata) do not provide researchers with individual-level data but instead offer the possibility to process pseudonymized datasets via remote access within their own secure environments (Statistics Finland's Fiona, Findata's Kapseli). In these cases, many aspects of data management fall under the responsibility of the registry authorities. Researchers contribute to responsible data management by complying with the data protection rules of the remote access system. However, there are still important considerations related to the use of registry data, such as ensuring sufficient time and financial resources. These factors should be carefully considered when preparing a data management plan (DMP).
The processing of personal and sensitive data must be carefully planned. When handling personal and sensitive data, it is essential to assess the risks associated with data processing – in other words, how much harm the disclosure of the information could cause to an individual or a community. A useful tool for assessing risk levels is a Data Protection Impact Assessment (DPIA). When planning data processing, it is particularly important to identify potential risks at different stages, such as the possibility of data leaking to unauthorized parties. Proper training for everyone handling the data is crucial in this regard. Careful handling also applies to remote access to data. Additionally, selecting secure storage locations is essential (see "Data storage during the project"). When combining registry data with other datasets, such as survey data, it is advisable to allocate sufficient time for consulting legal experts.
Allocate sufficient time for the application process and consider costs. When planning research using registry data, it is crucial to ensure adequate time and financial resources. The timelines and acquisition costs set by statistical authorities should be checked as early as the research planning phase. Application processes for data are typically slow (often taking over a year). The costs of acquiring and using data are continuously increasing. Prices and timelines depend on the scope and complexity of the data, the size of the research group using the data, and the number of registry authorities involved in the process. Costs consist of usage permits, data preparation, and possible remote access system fees. The largest registry authorities publish price and timeline estimates for typical datasets on their websites:
Collaborating in research saves resources. Due to long permit processes and high data costs, it is advisable to use registry data within existing research projects whenever possible by engaging in research collaboration. It is possible to apply for access permits for new researchers to use datasets already available to research groups. Information on collaboration opportunities and other researchers already using the data can be sought from one’s home institution and colleagues in the same field.
Legislation related to registry data. Individual-level registry data can be used either as they are or combined with other data, such as surveys. Data management regulations and principles differ depending on whether the dataset consists solely of registry data or if it is linked to other sources.
(See, for example, the Secondary Use Act (Act on the Secondary Use of Social and Health Data);
Aggregated statistical data based on official records (such as counts, averages, etc.) are subject to their own data management regulations and principles (
Research based on registry data does not require informing participants. Research that relies solely on registry data does not require informing the individuals whose data is being used. With registry data, it cannot be assumed that the individuals in the registry can be reached with reasonable effort, for example, when dealing with the entire population of Finland. However, the situation changes if registry data is combined with other sources, such as surveys, in which case informing participants is required.
Keep in mind that registry data is often sensitive. Restricted-access registry datasets typically contain individual-level data, making them sensitive. To obtain a usage permit for such data, researchers must familiarize themselves with and commit to the data security regulations of the relevant registry authorities. Even when data is processed in a remote access system, the researcher remains responsible for handling the dataset. Responsible data management is carried out by adhering to the rules of the remote access system.
The researcher is required to manage data responsibly as a data controller. When conducting registry research, either the researcher or the institution carrying out the research acts as the data controller, which entails responsible data management: "The data controller determines the processing of the dataset, ensures the security and appropriateness of personal data processing, and is responsible for the proper disposal or archiving of the dataset at the end of the research. If the researcher is employed by the institution conducting the research, the institution serves as the data controller. If the researcher or research group conducts the research independently, the researcher or research group may act as the data controller themselves." (
Research based on registry data does not always require ethical review. Research that relies solely on registry data does not require an ethical review: "Ethical committee evaluation is not required for research using public and published information, registry and document data, or archival materials." (See
Registry data is produced and provided for research use by registry authorities. Researchers do not collect these datasets themselves but specify in their data permit applications which registry data they need for their study. The application must include a variable list and a research plan justifying the use of the requested datasets. It is also advisable to apply for a sufficiently long usage period to allow for research verification or the review process of research publications before the data is destroyed.
Separate guidelines on Applying for Registry Data and Combining Registry Data with Surveys:
Guidelines for Applying for Registry Data
Guidelines for Combining Registry Data with Surveys
Storage and processing of registry data in a remote access system. Registry data cannot be transferred out of the remote access system provided by the registry administrator for processing. The use of Statistics Finland’s remote access system is subject to country-specific restrictions and is not possible, for example, from the United States. The Fiona environment of Statistics Finland and the Kapseli environment of Findata are produced by the CSC – IT Center for Science. Each system has its own software for data processing, and users can also request the installation of their own analysis tools and codes. In the remote access system, all researchers in a project who have been granted access to the same dataset can access all the project’s data. If access needs to be restricted, the project must be divided into smaller sub-projects. It is not possible to manage folders within a project: in the remote access system, all users also have access to each other’s work folders.
Registry data in the remote access system cannot be exported, but analysis results can be obtained under certain conditions. The registry data itself remains within the closed environment of the registry administrator, but group-level analysis results, such as group averages and regression coefficients, can be extracted from the system. The requested output undergoes a data protection review process before being sent to the researcher (see, for example,
Consider the limited capacity of remote access systems for data processing. Registry datasets are often large, containing millions of observations and hundreds of variables. For analyses requiring a lot of memory and even for storing analytical datasets, the system’s capacity limits may quickly be reached (for example, in Statistics Finland’s system, usage fees are tiered based on the computing power used). In data management, it is advisable to anticipate the challenges related to limited processing capacity and agree within the research team using the same dataset on practices such as deleting intermediate analysis steps.
Allocate ample time for data preprocessing and integration. Registry data based on administrative records is often raw data, better suited for administrative needs and information systems than for research. Transforming raw data into a research dataset—by defining the study population and period, combining information from different sources, and operationalizing research concepts into existing data—is a slow and decision-intensive manual process. Additionally, the processes of data accumulation, availability, and quality of different registry data, as well as potential changes over time, often need to be clarified separately with registry authorities. In this regard, research collaboration is often beneficial, as many registry researchers face similar challenges, and ideas and advice for solutions can be obtained from others.
Describe in the research publication what pre-existing datasets have been used. In registry research publications, the focus is typically on the description of the analysis dataset rather than the raw data. For the sake of research reproducibility and open science, a good practice would be to report which registries or pre-built data modules were used to construct the analysis dataset, as well as the analytical methods applied in data preparation. Reporting data at the variable and registry level is also recommended, as the content of pre-existing datasets changes over time, making it difficult for reviewers or future users to determine retrospectively what dataset was available to the researcher at the time of the study.
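One way to make this kind of reporting routine is to keep a small machine-readable record of the sources behind each analysis dataset. The sketch below is only an illustration, not a format required by any registry authority or journal: the class names, field names, and the registry and variable names are all hypothetical placeholders.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class SourceDataset:
    """One registry or pre-built data module used in the study (hypothetical fields)."""
    registry: str            # name of the registry or data module
    holder: str              # registry authority that provided it
    variables: list[str]     # variables actually used
    reference_period: str    # years covered by the extract
    retrieved: str           # when the extract was made available


@dataclass
class AnalysisDatasetRecord:
    """Documents how an analysis dataset was built from its sources."""
    title: str
    sources: list[SourceDataset] = field(default_factory=list)
    preparation_steps: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses to plain dicts and lists
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder values only; replace with the names given in the data permit.
record = AnalysisDatasetRecord(
    title="Example analysis dataset",
    sources=[
        SourceDataset(
            registry="example_population_module",
            holder="Example registry authority",
            variables=["birth_year", "region"],
            reference_period="2010-2020",
            retrieved="2023-05",
        )
    ],
    preparation_steps=[
        "Restricted the population to the study period",
        "Linked sources on the pseudonymized person identifier",
    ],
)

print(record.to_json())
```

A record like this contains no individual-level data, so it can be kept alongside the analysis code and attached to a publication as supplementary documentation.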
There is a difference between the storage of remote access data and transferred datasets. The storage of datasets from Statistics Finland and Findata takes place in remote access environments produced by the CSC – IT Center for Science. Datasets transferred to researchers’ own institutions must be stored in a secure environment, with options including the university’s Umpio system or secure storage devices. The university’s shared workspace cannot be used to store data containing personal information, even if folder access can be restricted. Researchers must determine how the dataset is transferred from the data controller to their own storage and ensure secure data transfer and storage.
Possible data processing and storage in the University of Helsinki's environments. Data should be stored and processed in a pseudonymized and encrypted form whenever possible. The home (Z:) and group directories (P:) available to university members are suitable for low- and medium-risk datasets that have been pseudonymized. Pseudonymization keys must always be stored separately from the actual dataset, in an encrypted format and in a different storage location. High-risk identifiable data must be stored in
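As a minimal sketch of the principle that pseudonymization keys are kept apart from the data, the example below replaces a direct identifier with a keyed hash (HMAC-SHA-256). This is one possible technique chosen for illustration, not a method prescribed by this guide; the identifiers, field names, and function names are hypothetical, and the step of writing the key to a separate encrypted location is deliberately left out.

```python
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Generate a random 32-byte secret key. Store it apart from the dataset."""
    return secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of a direct identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records; "person_id" stands in for any direct identifier.
key = new_key()
records = [
    {"person_id": "EXAMPLE-001", "age": 34},
    {"person_id": "EXAMPLE-002", "age": 58},
]

# The pseudonymized dataset keeps the analysis variables but drops the identifier.
pseudonymized = [
    {"pseudo_id": pseudonymize(r["person_id"], key), "age": r["age"]}
    for r in records
]
```

The same identifier always maps to the same pseudonym under a given key, so datasets can still be linked, while anyone without the key cannot recompute the mapping from the identifiers alone.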
The storage of remotely accessed registry data is the responsibility of the registry administrator. When the data usage permit expires, the researcher’s access to the data ends. The storage or deletion of data within the registry authority’s remote access system is not the responsibility of the researcher. However, the researcher is responsible for saving program codes and results from the remote access system to document the research and ensure reproducibility.
The researcher is responsible for the storage and disposal of transferred datasets. Utilize university services for data storage. During the research project, data can be stored in a personal home directory (if used individually) or a group directory. Guidelines for different storage solutions suitable for various purposes can be found in the
The home and group directories at the University of Helsinki are backed up every hour and function on Windows, Mac, and Linux operating systems. These directories are hosted on the university's own servers. Every university member has access to a home directory (Z-drive on Windows computers).
When the data usage permit expires, the data must be securely deleted.
Results derived from registry data are typically included in research publications. These results are based on statistical analyses of the dataset but do not contain individual-level data. Results are often published alongside the research article. Unpublished descriptive results may also be worth preserving for research documentation.
Analysis codes used for registry data can be stored and published. The program codes used for processing registry data remain with the researcher, and some journals may require them to be published alongside the research article. Analysis codes are text files that allow researchers to revisit their study if needed. Publishing program codes supports research reproducibility and is considered good scientific practice. For publication, it is recommended to store program codes in
Describing analysis codes facilitates reuse. Program codes should be documented with sufficient detail. Some general documentation is always necessary, but other researchers may not need a detailed explanation for every command. Researchers conducting similar statistical analyses can usually read and understand at least part of the program code without line-by-line documentation.
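The level of documentation described above might look like the following sketch: a header stating purpose, inputs, outputs, and steps, plus comments that explain decisions rather than individual commands. The data, variable names, and study period are invented for illustration.

```python
"""
Purpose : Estimate mean income by region for the study period (illustrative).
Inputs  : dataset and variable names as listed in the data permit (placeholders here)
Outputs : group-level means only; no individual-level rows
Steps   : 1) restrict to the study period, 2) compute means by group
Requires: Python 3.9+, standard library only
"""
from collections import defaultdict


def group_means(rows, group_key, value_key):
    """Return {group: mean of value_key} for an iterable of dict rows."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        totals[row[group_key]][0] += row[value_key]
        totals[row[group_key]][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


# Invented example rows standing in for the analysis dataset.
rows = [
    {"year": 2019, "region": "A", "income": 100.0},
    {"year": 2020, "region": "A", "income": 20.0},
    {"year": 2021, "region": "A", "income": 40.0},
    {"year": 2020, "region": "B", "income": 20.0},
]

# Decision: the study period is 2020-2021, so earlier years are excluded.
in_period = [r for r in rows if 2020 <= r["year"] <= 2021]

means = group_means(in_period, "region", "income")
```

The header and the single decision comment carry most of the documentation here; the commands themselves are left to speak for readers familiar with this kind of analysis.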
PDF version of the guide will be added soon.