Take care of your data management skills. Data management skills are fundamental for researchers. Together with data management planning, they ensure that researchers can identify and manage risks related to data handling (e.g., data protection, data security, data access rights, and data storage). The University of Helsinki's Data Support provides free data management training for researchers, as well as guidance and tools for data management planning.
This guide covers individual-level registry data collected by authorities for administrative and planning purposes, which were not originally gathered for research use. The data are maintained by registry authorities, the most significant of which are Statistics Finland and public registry authorities under the Secondary Use Act, which are listed on Findata's website. Findata is the data permit authority through which registry data under the Secondary Use Act, i.e., social and health sector data, can be applied for. Registry data not covered by the Secondary Use Act must be requested separately from each respective registry authority (e.g., Statistics Finland, the Finnish National Agency for Education, the Digital and Population Data Services Agency, the Finnish Defence Forces, the Legal Register Centre). Individual-level registry data can be used for research either as they are or combined with other data, such as surveys (see "Guidelines for Combining Registry Data with Surveys"). The linkage of individual-level data between different registries and other data sources is possible using the personal identity code, which uniquely identifies all permanent residents of Finland.
Managing registry data requires planning, even when the data are accessed remotely. Individual-level registry data often contain personal information that can be used to identify individuals. The largest registry authorities (Statistics Finland, Findata) do not provide researchers with individual-level data but instead offer the possibility to process pseudonymized datasets via remote access within their own secure environments (Statistics Finland's Fiona, Findata's Kapseli). In these cases, many aspects of data management fall under the responsibility of the registry authorities. Researchers contribute to responsible data management by complying with the data protection rules of the remote access system. However, there are still important considerations related to the use of registry data, such as ensuring sufficient time and financial resources. These factors should be carefully considered when preparing a data management plan (DMP).
The processing of personal and sensitive data must be carefully planned. When handling personal and sensitive data, it is essential to assess the risks associated with data processing, in other words, how much harm the disclosure of the information could cause to an individual or a community. A useful tool for assessing risk levels is a Data Protection Impact Assessment (DPIA). When planning data processing, it is particularly important to identify potential risks at different stages, such as the possibility of data leaking to unauthorized parties. Proper training for all individuals handling the data is crucial in this regard. Careful handling also applies to remote access to data. Additionally, selecting secure storage locations is essential (see "Data storage during the project"). When combining registry data with other datasets, such as survey data, it is advisable to allocate sufficient time for consulting legal experts.
Allocate sufficient time for the application process and consider costs. When planning research using registry data, it is crucial to ensure adequate time and financial resources. The timelines and acquisition costs set by statistical authorities should be investigated as early as the research planning phase. Application processes for data are typically slow (often taking over a year). The costs of acquiring and using data are continuously increasing. Prices and timelines depend on the scope and complexity of the data, the size of the research group using the data, and the number of registry authorities involved in the process. Costs consist of usage permits, data preparation, and possible remote access system fees. The largest registry authorities publish price and timeline estimates for typical datasets on their websites: Statistics Finland's schedule estimates and pricing; Findata's queue status and pricing. Check the policies of your home institution regarding covering costs: some faculties have allocated budget funds for paying data access fees.
Collaborating in research saves resources. Due to long permit processes and high data costs, it is advisable to use registry data within existing research projects whenever possible by engaging in research collaboration. It is possible to apply for access permits for new researchers to use datasets already available to research groups. Information on collaboration opportunities and other researchers already using the data can be sought from one’s home institution and colleagues in the same field.
Legislation related to registry data. Individual-level registry data can be used either as they are or combined with other data, such as surveys. Data management regulations and principles differ depending on whether the dataset consists solely of registry data or if it is linked to other sources.
(See, for example, the Secondary Use Act (Act on the Secondary Use of Social and Health Data); Flamma page: Secondary use of social and health data in research).
Aggregated statistical data based on official records (such as counts, averages, etc.) are subject to their own data management regulations and principles (Statistics Finland Act [in Finnish]).
Research based on registry data does not require informing participants. Research that relies solely on registry data does not require informing the individuals whose data is being used. With registry data, it cannot be assumed that the individuals in the registry can be reached with reasonable effort, for example, when dealing with the entire population of Finland. However, the situation changes if registry data is combined with other sources, such as surveys, in which case informing participants is required.
Keep in mind that registry data is often sensitive. Restricted-access registry datasets typically contain individual-level data, making them sensitive. To obtain a usage permit for such data, researchers must familiarize themselves with and commit to the data security regulations of the relevant registry authorities. Even when data is processed in a remote access system, the researcher remains responsible for handling the dataset. Responsible data management is carried out by adhering to the rules of the remote access system.
The researcher is required to manage data responsibly as a data controller. When conducting registry research, either the researcher or the institution carrying out the research acts as the data controller, which entails responsible data management: "The data controller determines the processing of the dataset, ensures the security and appropriateness of personal data processing, and is responsible for the proper disposal or archiving of the dataset at the end of the research. If the researcher is employed by the institution conducting the research, the institution serves as the data controller. If the researcher or research group conducts the research independently, the researcher or research group may act as the data controller themselves." (Findata FAQ: What is meant by a data controller? See also Statistics Finland: "Data license holder as data controller", scroll to the bottom.)
Research based on registry data does not always require ethical review. Research that relies solely on registry data does not require an ethical review: "Ethical committee evaluation is not required for research using public and published information, registry and document data, or archival materials." (See Ethical Principles of Research Involving Human Subjects and Ethical Review in the Human Sciences in Finland. Finnish National Board on Research Integrity (TENK), 2019: 16). The situation may change if registry data is combined with other sources, such as survey data. In such cases, the research may be required to undergo ethical review. Information on ethical review procedures has been compiled on the University of Helsinki's website. HSSH can assist with requesting a statement from an ethical review committee. In certain cases, ethical approval must be obtained from both the registry authority and the researcher's home institution. Processing times for these approvals can be long, so it is advisable to apply early.
Registry data is produced and provided for research use by registry authorities. Researchers do not collect these datasets themselves but specify in their data permit applications which registry data they need for their study. The application must include a variable list and a research plan justifying the use of the requested datasets. It is also advisable to apply for a sufficiently long usage period to allow for research verification or the review process of research publications before the data is destroyed.
Separate guidelines on Applying for Registry Data and Combining Registry Data with Surveys:
Guidelines for Applying for Registry Data
Guidelines for Combining Registry Data with Surveys
Storage and processing of registry data in a remote access system. Registry data cannot be transferred out of the remote access system provided by the registry administrator for processing. The use of Statistics Finland’s remote access system is subject to country-specific restrictions and is not possible, for example, from the United States. The Fiona environment of Statistics Finland and the Kapseli environment of Findata are produced by CSC – IT Center for Science. Each system has its own software for data processing, and users can also request the installation of their own analysis tools and code. In the remote access system, all researchers in a project who have been granted access to the same dataset can access all the project’s data, including each other’s work folders, because folder-level access cannot be managed within a project. If access needs to be restricted, the project must be divided into smaller sub-projects.
Registry data in the remote access system cannot be exported, but analysis results can be obtained under certain conditions. The registry data itself remains within the closed environment of the registry administrator, but group-level analysis results, such as group averages and regression coefficients, can be extracted from the system. The requested output undergoes a data protection review process before being sent to the researcher (see, for example, “Use of Microdata in FIONA” for more on the review process).
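The distinction between individual-level rows and group-level results can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name group_means, the field names, the sample rows, and the minimum group size of 5 are all hypothetical, and suppressing small groups is a common disclosure-control practice used here as an example, not the actual review rule of Statistics Finland or Findata.

```python
from collections import defaultdict

# Hypothetical threshold: the real output rules are set by the registry authority.
MIN_GROUP_SIZE = 5


def group_means(rows, group_field, value_field, min_group_size=MIN_GROUP_SIZE):
    """Aggregate individual-level rows into group counts and means.

    Groups with fewer than min_group_size observations are suppressed
    (reported as None) so that the output contains no near-individual values.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[group_field]] += row[value_field]
        counts[row[group_field]] += 1

    summary = {}
    for group, n in counts.items():
        if n >= min_group_size:
            summary[group] = {"n": n, "mean": sums[group] / n}
        else:
            summary[group] = {"n": None, "mean": None}  # suppressed
    return summary


# Made-up example rows: six observations in region A, two in region B.
rows = [{"region": "A", "value": v} for v in (10, 20, 30, 40, 50, 60)]
rows += [{"region": "B", "value": v} for v in (15, 25)]

summary = group_means(rows, "region", "value")
for group, stats in sorted(summary.items()):
    print(group, stats)
```

In this sketch only the summary dictionary, not the rows list, would be a candidate for output, and it would still be subject to the authority's own review.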
Consider the limited capacity of remote access systems for data processing. Registry datasets are often large, containing millions of observations and hundreds of variables. For analyses requiring a lot of memory and even for storing analytical datasets, the system’s capacity limits may quickly be reached (for example, in Statistics Finland’s system, usage fees are tiered based on the computing power used). In data management, it is advisable to anticipate the challenges related to limited processing capacity and agree within the research team using the same dataset on practices such as deleting intermediate analysis steps.
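A rough back-of-the-envelope estimate shows why capacity limits are reached quickly and why keeping only the needed variables helps. The Python sketch below assumes a hypothetical dataset of 5 million rows and 300 variables stored as 8-byte numeric values; these figures and the function name estimate_memory_gib are illustrative and not taken from any registry authority's documentation.

```python
def estimate_memory_gib(n_rows, n_columns, bytes_per_value=8):
    """Estimate the in-memory size of a numeric table in GiB.

    Assumes every value takes bytes_per_value bytes (8 for a 64-bit number)
    and ignores any overhead added by the analysis software.
    """
    return n_rows * n_columns * bytes_per_value / 1024**3


# Hypothetical full extract: 5 million observations, 300 variables.
full = estimate_memory_gib(5_000_000, 300)

# Hypothetical analysis dataset: same observations, only 25 variables kept.
subset = estimate_memory_gib(5_000_000, 25)

print(f"full extract:     {full:.1f} GiB")
print(f"analysis dataset: {subset:.1f} GiB")
```

Under these assumptions the full extract needs roughly 11 GiB of memory while the reduced analysis dataset stays under 1 GiB, which is one reason to agree early on which intermediate files are kept and which are deleted.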
Allocate ample time for data preprocessing and integration. Registry data based on administrative records is often raw data, better suited for administrative needs and information systems than for research. Transforming raw data into a research dataset (by defining the study population and period, combining information from different sources, and operationalizing research concepts with the available data) is a slow and decision-intensive manual process. Additionally, the accumulation, availability, and quality of different registry data, as well as potential changes over time, often need to be clarified separately with registry authorities. In this regard, research collaboration is often beneficial, as many registry researchers face similar challenges, and ideas and advice for solutions can be obtained from others.
Describe in the research publication what pre-existing datasets have been used. In registry research publications, the focus is typically on the description of the analysis dataset rather than the raw data. For the sake of research reproducibility and open science, a good practice would be to report which registries or pre-built data modules were used to construct the analysis dataset, as well as the analytical methods applied in data preparation. Reporting data at the variable and registry level is also recommended, as the content of pre-existing datasets changes over time, making it difficult for reviewers or future users to determine retrospectively what dataset was available to the researcher at the time of the study.
There is a difference between the storage of remote access data and transferred datasets. The storage of datasets from Statistics Finland and Findata takes place in remote access environments produced by the CSC – IT Center for Science. Datasets transferred to researchers’ own institutions must be stored in a secure environment, with options including the university’s Umpio system or secure storage devices. The university’s shared workspace cannot be used to store data containing personal information, even if folder access can be restricted. Researchers must determine how the dataset is transferred from the data controller to their own storage and ensure secure data transfer and storage.
Possible data processing and storage in the University of Helsinki's environments. Data should be stored and processed in a pseudonymized and encrypted form whenever possible. The home (Z:) and group directories (P:) available to university members are suitable for low- and medium-risk datasets that have been pseudonymized. Pseudonymization keys must always be stored separately from the actual dataset, in an encrypted format and in a different storage location. High-risk identifiable data must be stored in Umpio, which is the University of Helsinki’s secure computing environment. The Helpdesk data storage table (in Finnish) can be consulted to check which storage locations are suitable for sensitive data. Additional information on handling sensitive data can also be found in the Helpdesk guidelines.
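The principle of keeping the pseudonymization key apart from the dataset can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of the University of Helsinki or of any registry authority: the function name pseudonymize, the field name person_id, and the sample records are hypothetical, and the steps of encrypting the key and writing it to a different storage location are deliberately left out.

```python
import secrets


def pseudonymize(records, id_field):
    """Replace the identifier in each record with a random pseudonym.

    Returns two separate objects:
      - the pseudonymized records (safe to keep with the analysis data)
      - the key, mapping pseudonym -> original identifier, which should be
        encrypted and stored in a different location than the records
    """
    id_to_pseudonym = {}
    used = set()
    pseudonymized = []
    for record in records:
        original_id = record[id_field]
        if original_id not in id_to_pseudonym:
            pseudonym = secrets.token_hex(8)
            while pseudonym in used:  # guard against an unlikely collision
                pseudonym = secrets.token_hex(8)
            used.add(pseudonym)
            id_to_pseudonym[original_id] = pseudonym
        new_record = dict(record)  # copy, so the input is not modified
        new_record[id_field] = id_to_pseudonym[original_id]
        pseudonymized.append(new_record)

    key = {pseudonym: original for original, pseudonym in id_to_pseudonym.items()}
    return pseudonymized, key


# Made-up example records; "person_id" is a hypothetical field name.
records = [
    {"person_id": "A-001", "income": 31000},
    {"person_id": "B-002", "income": 45000},
    {"person_id": "A-001", "income": 32500},
]

pseudo_records, key = pseudonymize(records, "person_id")
```

The same person receives the same pseudonym in every record, so records can still be linked within the dataset, while re-identification requires the separately stored key.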
The storage of remotely accessed registry data is the responsibility of the registry administrator. When the data usage permit expires, the researcher’s access to the data ends. The storage or deletion of data within the registry authority’s remote access system is not the responsibility of the researcher. However, the researcher is responsible for saving program codes and results from the remote access system to document the research and ensure reproducibility.
The researcher is responsible for the storage and disposal of transferred datasets. Utilize university services for data storage. During the research project, data can be stored in a personal home directory (if used individually) or a group directory. Guidelines for different storage solutions suitable for various purposes can be found in the University of Helsinki's Data Support wiki table (in Finnish). Document for yourself and your team where the data is stored, so that you can, for example, delete all the data you have promised to destroy when the need arises.
The home and group directories at the University of Helsinki are backed up every hour and function on Windows, Mac, and Linux operating systems. These directories are hosted on the university's own servers. Every university member has access to a home directory (Z-drive on Windows computers). See the instructions for obtaining a group directory. If the data is sensitive, a suitable storage location at the University of Helsinki is Umpio.
When the data usage permit expires, the data must be securely deleted. The Helpdesk guidelines should be followed, especially for deleting sensitive datasets. Physical storage media, such as external hard drives or CDs, can be physically destroyed: a broken CD, for example, cannot be repaired. Special disposal containers for CDs are available upon request from facility services. IT support (helpdesk@helsinki.fi) can also handle the deletion if the researcher is unsure or dealing with highly sensitive data.
Results derived from registry data are typically included in research publications. These results are based on statistical analyses of the dataset but do not contain individual-level data. Results are often published alongside the research article. Unpublished descriptive results may also be worth preserving for research documentation.
Analysis code used for registry data can be stored and published. The program code used for processing registry data remains with the researcher, and some journals may require it to be published alongside the research article. Analysis code consists of text files that allow researchers to revisit their study if needed. Publishing program code supports research reproducibility and is considered good scientific practice. For publication, it is recommended to store program code in Zenodo and link it to GitHub. In Zenodo, it receives a persistent identifier and is safely stored for the researcher's future use. Program code and algorithms should be licensed under the MIT or GNU license (see Which license is suitable for software or data?).
Describing analysis codes facilitates reuse. Program codes should be documented with sufficient detail. Some general documentation is always necessary, but other researchers may not need a detailed explanation for every command. Researchers conducting similar statistical analyses can usually read and understand at least part of the program code without line-by-line documentation.