Research Data Guide: Register Data

This guide covers individual-level registry data collected by authorities for administrative and planning purposes, which were not originally gathered for research use.

Take care of your data management skills. Data management skills are fundamental for researchers. Together with data management planning, they ensure that researchers can identify and manage risks related to data handling (e.g., data protection, data security, data access rights, and data storage). The University of Helsinki's Data Support provides free data management training for researchers. Data Support also offers guidance and training, as well as tools for data management planning.


This guide covers individual-level registry data collected by authorities for administrative and planning purposes, which were not originally gathered for research use. The data are maintained by registry authorities, the most significant of which are Statistics Finland and public registry authorities under the Secondary Use Act, which are listed on Findata's website. Findata is the data permit authority through which registry data under the Secondary Use Act, i.e., social and health sector data, can be applied for. Registry data not covered by the Secondary Use Act must be requested separately from each respective registry authority (e.g., Statistics Finland, the Finnish National Agency for Education, the Digital and Population Data Services Agency, the Finnish Defence Forces, the Legal Register Centre). Individual-level registry data can be used for research either as they are or combined with other data, such as surveys (see "Guidelines for Combining Registry Data with Surveys"). The linkage of individual-level data between different registries and other data sources is possible using the personal identity code, which uniquely identifies all permanent residents of Finland.

Managing registry data requires planning, even when accessed remotely. Individual-level registry data often contain personal information that can be used to identify individuals. The largest registry authorities (Statistics Finland, Findata) do not hand individual-level data over to researchers but instead offer the possibility to process pseudonymized datasets via remote access within their own secure environments (Statistics Finland's Fiona, Findata's Kapseli). In these cases, many aspects of data management fall under the responsibility of the registry authorities. Researchers contribute to responsible data management by complying with the data protection rules of the remote access system. However, there are still important considerations related to the use of registry data, such as ensuring sufficient time and financial resources. These factors should be carefully considered when preparing a data management plan (DMP).

The processing of personal and sensitive data must be carefully planned. When handling personal and sensitive data, it is essential to assess the risks associated with data processing – in other words, how much harm the disclosure of the information could cause to an individual or a community. A useful tool for assessing risk levels is a Data Protection Impact Assessment (DPIA). When planning data processing, it is particularly important to identify potential risks at different stages, such as the possibility of data leaking to unauthorized parties. Proper training for all individuals handling the data is crucial in this regard. Careful handling also applies to remote access to data. Additionally, selecting secure storage locations is essential (see "Data storage during the project"). When combining registry data with other datasets, such as survey data, it is advisable to allocate sufficient time for consulting legal experts.

Allocate sufficient time for the application process and consider costs. When planning research using registry data, it is crucial to ensure adequate time and financial resources. The timelines and acquisition costs set by statistical authorities should be checked as early as the research planning phase. Application processes for data are typically slow (often taking over a year). The costs of acquiring and using data are continuously increasing. Prices and timelines depend on the scope and complexity of the data, the size of the research group using the data, and the number of registry authorities involved in the process. Costs consist of usage permits, data preparation, and possible remote access system fees. The largest registry authorities publish price and timeline estimates for typical datasets on their websites: Statistics Finland's schedule estimates and pricing, and Findata's queue status and pricing. Check the policies of your home institution regarding covering costs – some faculties have allocated budget funds for paying data access fees.

Collaborating in research saves resources. Due to long permit processes and high data costs, it is advisable to use registry data within existing research projects whenever possible by engaging in research collaboration. It is possible to apply for access permits for new researchers to use datasets already available to research groups. Information on collaboration opportunities and other researchers already using the data can be sought from one’s home institution and colleagues in the same field.

Legislation related to registry data. Individual-level registry data can be used either as they are or combined with other data, such as surveys. Data management regulations and principles differ depending on whether the dataset consists solely of registry data or if it is linked to other sources.

(See, for example, the Secondary Use Act (Act on the Secondary Use of Social and Health Data); Flamma page: Secondary use of social and health data in research).

Aggregated statistical data based on official records (such as counts, averages, etc.) are subject to their own data management regulations and principles (Statistics Finland Act [in Finnish]).

Research based on registry data does not require informing participants. Research that relies solely on registry data does not require informing the individuals whose data is being used. With registry data, it cannot be assumed that the individuals in the registry can be reached with reasonable effort, for example, when dealing with the entire population of Finland. However, the situation changes if registry data is combined with other sources, such as surveys, in which case informing participants is required.

Keep in mind that registry data is often sensitive. Restricted-access registry datasets typically contain individual-level data, making them sensitive. To obtain a usage permit for such data, researchers must familiarize themselves with and commit to the data security regulations of the relevant registry authorities. Even when data is processed in a remote access system, the researcher remains responsible for handling the dataset. Responsible data management is carried out by adhering to the rules of the remote access system.

The researcher is required to manage data responsibly as a data controller. When conducting registry research, either the researcher or the institution carrying out the research acts as the data controller, which entails responsible data management: "The data controller determines the processing of the dataset, ensures the security and appropriateness of personal data processing, and is responsible for the proper disposal or archiving of the dataset at the end of the research. If the researcher is employed by the institution conducting the research, the institution serves as the data controller. If the researcher or research group conducts the research independently, the researcher or research group may act as the data controller themselves." (Findata FAQ: What is meant by a data controller? See also Statistics Finland: "Data license holder as data controller", scroll to the bottom.)

Research based on registry data does not always require ethical review. Research that relies solely on registry data does not require an ethical review: "Ethical committee evaluation is not required for research using public and published information, registry and document data, or archival materials." (See Ethical Principles of Research Involving Human Subjects and Ethical Review in the Human Sciences in Finland, Finnish National Board on Research Integrity (TENK), 2019: 16). The situation may change if registry data is combined with other sources, such as survey data. In such cases, the research may be required to undergo ethical review. Information on ethical review procedures has been compiled on the University of Helsinki's website. HSSH can assist with requesting a statement from an ethical review committee. In certain cases, ethical approval must be obtained from both the registry authority and the researcher's home institution. Processing times for these approvals can be long, so it is advisable to apply early.

Registry data is produced and provided for research use by registry authorities. Researchers do not collect these datasets themselves but specify in their data permit applications which registry data they need for their study. The application must include a variable list and a research plan justifying the use of the requested datasets. It is also advisable to apply for a sufficiently long usage period to allow for research verification or the review process of research publications before the data is destroyed.

Separate guidelines on Applying for Registry Data and Combining Registry Data with Surveys:

Guidelines for Applying for Registry Data

  • A permit to use registry data must be obtained from the relevant registry authority. Findata's website provides good basic guidelines for applying for registry data, and Statistics Finland's website also includes guidance on the application process.
  • Applying for a data permit requires planning, as the requested datasets and selected variables must be defined in the application stage. It is advisable to contact registry authorities early in the planning phase to ensure that the application includes the most suitable data for implementing the research plan. Datasets:
  • Data from different authorities, such as Findata and Statistics Finland, can be linked at the individual level: Statistics Finland's guidelines.
  • If applying for a permit to combine data from multiple registry authorities, permission must be obtained from each authority separately, specifying what data will be included in the final dataset, where the data will be stored, and where it will be analyzed.
  • Findata only provides data to security-audited environments (see Toini Register). The University of Helsinki does not have such a secondary use environment. Therefore, registry data is primarily processed within Findata's Kapseli or Statistics Finland's Fiona remote access systems. These CSC-maintained remote access systems provide most statistical software programs (e.g., R, Stata, SPSS, SAS). A restricted version of CSC's SD Desktop is included in the remote access environments under the Secondary Use Act. It is free for University of Helsinki researchers but has limited software options (LibreOffice, Python, R) and does not allow the installation of additional software or files (e.g., scripts). All data uploaded to SD Desktop must go through Findata, requiring a separate, paid application.
  • Statistics Finland's data can generally only be accessed through Statistics Finland's own remote access system. The following data can be transferred to Findata's remote access system or a researcher's home institution: cause of death data, age, gender, education, occupation, and socioeconomic status (see "What data can be transferred to Findata or another institution?").
  • It is important to get the permit process started early enough, as obtaining data takes at least several months, and for customized datasets, often over a year. Registry authorities publish estimated timelines on their websites: Statistics Finland's estimates and Findata's queue status. Timelines vary, but it is advisable to anticipate that everything will generally take longer than expected.
  • A separate permit must be applied for each user. New users can be granted access later.
  • Each user must sign a confidentiality agreement when applying for a data permit.
  • A separate access permit must be requested for remote access systems (Statistics Finland's Fiona or Findata's Kapseli) from the system administrator. Remote access is established either from the university’s workstation or a virtual desktop infrastructure (VDI). Before applying for remote access, contact Helsinki University's IT Center (datasupport@helsinki.fi or it4science@helsinki.fi) to obtain the most up-to-date information on the required technical contact person for the application. Once the connection is established from the University of Helsinki to the remote access system, users log in as specified by the system administrator (e.g., using two-factor authentication).
  • Each remote user must submit a user- and workspace-specific remote access agreement. Remote access can also be granted from home or abroad, but there may be country-specific restrictions. If a researcher changes their workplace or research affiliation, a request to update the location must be submitted. Even a phone number change must be reported, as two-factor authentication relies on it.
  • Usage permits are granted for a specified period, after which access to the system and data is revoked. However, extensions are common, and applying for an extension is a simpler process than the initial application—though it still involves costs.

Guidelines for Combining Registry Data with Surveys

  • Permission to link survey and registry data must be obtained from an ethics committee and the relevant registry authorities. Applications must specify which linkages will be made.
  • Consent for registry linkages must be obtained from survey respondents. Participants must be informed about which registry data will be linked. This can be done briefly on the survey form with a link to a research website providing a more detailed explanation of the planned linkages and data usage (e.g., a privacy notice).
  • The consent letter for participants must also be submitted to the registry authority and the university's ethics committee. In the planning phase it is important to anticipate that the process can be lengthy and modifications to the consent letter may be required.
  • The linking of Statistics Finland and Findata registry data to surveys is carried out by the respective registry authority within its own remote access system. Survey data containing personal identity numbers is sent to the registry authority, which is responsible for pseudonymization. Using pseudonymized identifiers, registry data can be linked to the survey within the remote access system. Pseudonymized data still allows individuals to be distinguished and linked across different datasets, meaning the data remains personal data and must be handled according to data protection regulations.
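
The linkage step described in the last point can be illustrated with a minimal Python sketch. It is not the procedure used by Statistics Finland or Findata, whose pseudonymization methods are not described in this guide. The keyed hash (HMAC-SHA256), the secret key, the function names, and all records below are assumptions made up for illustration.

```python
import hashlib
import hmac

# Hypothetical secret held only by the party doing the pseudonymization
# (in practice the registry authority, never the researcher).
SECRET_KEY = b"example-key-not-for-real-use"


def pseudonymize(identity_code: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an identity code using a keyed hash."""
    digest = hmac.new(key, identity_code.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def link(survey_rows, registry_rows):
    """Join survey and registry rows on their shared pseudonym."""
    registry_by_pid = {row["pid"]: row for row in registry_rows}
    linked = []
    for row in survey_rows:
        match = registry_by_pid.get(row["pid"])
        if match is not None:
            linked.append({**match, **row})
    return linked


# Made-up identifiers and values, not real personal identity codes.
survey = [
    {"pid": pseudonymize("FAKE-ID-001"), "satisfaction": 4},
    {"pid": pseudonymize("FAKE-ID-002"), "satisfaction": 2},
]
registry = [
    {"pid": pseudonymize("FAKE-ID-001"), "education": "tertiary"},
    {"pid": pseudonymize("FAKE-ID-003"), "education": "secondary"},
]

linked = link(survey, registry)
print(len(linked))  # only FAKE-ID-001 appears in both sources
```

Because the same input always yields the same pseudonym, rows stay linkable across datasets, which is exactly why pseudonymized data still counts as personal data.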

Storage and processing of registry data in a remote access system. Registry data cannot be transferred out of the remote access system provided by the registry administrator for processing. The use of Statistics Finland’s remote access system is subject to country-specific restrictions and is not possible, for example, from the United States. The Fiona environment of Statistics Finland and the Kapseli environment of Findata are produced by the CSC – IT Center for Science. Each system has its own software for data processing, and users can also request the installation of their own analysis tools and codes. In the remote access system, all researchers in a project who have been granted access to the same dataset can access all the project’s data. If access needs to be restricted, the project must be divided into smaller sub-projects. It is not possible to manage folders within a project: in the remote access system, all users also have access to each other’s work folders.

Registry data in the remote access system cannot be exported, but analysis results can be obtained under certain conditions. The registry data itself remains within the closed environment of the registry administrator, but group-level analysis results, such as group averages and regression coefficients, can be extracted from the system. The requested output undergoes a data protection review process before being sent to the researcher (see, for example, “Use of Microdata in FIONA” for more on the review process).
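
As an illustration of what a group-level output might look like, the following Python sketch computes group averages and withholds groups below a minimum size before export. The threshold value, the function name, and the data are assumptions for illustration only; the actual disclosure rules are set by the registry authority and applied in its review process.

```python
from collections import defaultdict

# Hypothetical threshold; the real minimum group size is defined by the
# registry authority's output rules, not by this sketch.
MIN_GROUP_SIZE = 3


def group_means(rows, group_key, value_key, min_size=MIN_GROUP_SIZE):
    """Return {group: (n, mean)}; groups smaller than min_size map to None."""
    values = defaultdict(list)
    for row in rows:
        values[row[group_key]].append(row[value_key])
    result = {}
    for group, vals in values.items():
        n = len(vals)
        # Withhold both the count and the mean for small groups.
        result[group] = (n, sum(vals) / n) if n >= min_size else None
    return result


# Made-up individual-level rows standing in for data inside the remote system.
rows = [
    {"region": "A", "income": 30000},
    {"region": "A", "income": 34000},
    {"region": "A", "income": 32000},
    {"region": "B", "income": 41000},
]

table = group_means(rows, "region", "income")
print(table)
```

Checking outputs in this way before requesting them does not replace the authority's review, but it may reduce the number of rejected output requests.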

Consider the limited capacity of remote access systems for data processing. Registry datasets are often large, containing millions of observations and hundreds of variables. For analyses requiring a lot of memory and even for storing analytical datasets, the system’s capacity limits may quickly be reached (for example, in Statistics Finland’s system, usage fees are tiered based on the computing power used). In data management, it is advisable to anticipate the challenges related to limited processing capacity and agree within the research team using the same dataset on practices such as deleting intermediate analysis steps.
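
One way to stay within memory limits is to aggregate while reading, so that the full dataset never has to be held in memory or saved as an intermediate file. The Python sketch below illustrates the idea with the standard library only. The column names and the small in-memory sample standing in for a large file are assumptions.

```python
import csv
import io
from collections import defaultdict


def streaming_means(lines, group_col, value_col):
    """Compute per-group means in one pass, keeping only running totals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row[group_col]] += float(row[value_col])
        counts[row[group_col]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Small in-memory sample; with a real file, an open file handle
# (opened with newline="") would be passed instead.
sample = io.StringIO("year,income\n2019,100\n2019,300\n2020,250\n")
means = streaming_means(sample, "year", "income")
print(means)
```

Memory use here grows with the number of groups rather than the number of rows, which is the property that matters when datasets contain millions of observations.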

Allocate ample time for data preprocessing and integration. Registry data based on administrative records is often raw data, better suited for administrative needs and information systems than for research. Transforming raw data into a research dataset—by defining the study population and period, combining information from different sources, and operationalizing research concepts into existing data—is a slow and decision-intensive manual process. Additionally, the processes of data accumulation, availability, and quality of different registry data, as well as potential changes over time, often need to be clarified separately with registry authorities. In this regard, research collaboration is often beneficial, as many registry researchers face similar challenges, and ideas and advice for solutions can be obtained from others.

Describe in the research publication what pre-existing datasets have been used. In registry research publications, the focus is typically on the description of the analysis dataset rather than the raw data. For the sake of research reproducibility and open science, a good practice would be to report which registries or pre-built data modules were used to construct the analysis dataset, as well as the analytical methods applied in data preparation. Reporting data at the variable and registry level is also recommended, as the content of pre-existing datasets changes over time, making it difficult for reviewers or future users to determine retrospectively what dataset was available to the researcher at the time of the study.
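
A simple way to keep this information available for reporting is a machine-readable record maintained alongside the analysis code. The following Python sketch is one possible layout, not an established standard. The class names, field names, and all values are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SourceRecord:
    """One registry or pre-built data module used in the analysis dataset."""
    authority: str
    dataset: str
    variables: list
    years: str
    received: str  # date on which this version of the data was delivered


@dataclass
class DataManifest:
    """Sources and preparation steps behind one analysis dataset."""
    study: str
    sources: list = field(default_factory=list)
    preparation_steps: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder entries only; real names come from the data permit.
manifest = DataManifest(study="Example study")
manifest.sources.append(
    SourceRecord(
        authority="Example registry authority",
        dataset="example_module",
        variables=["birth_year", "education_level"],
        years="2000-2020",
        received="2024-01-15",
    )
)
manifest.preparation_steps.append("Restricted to birth cohorts 1970-1979")

print(manifest.to_json())
```

Recording the delivery date per source addresses the point that the content of pre-existing datasets changes over time.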

There is a difference between the storage of remote access data and transferred datasets. The storage of datasets from Statistics Finland and Findata takes place in remote access environments produced by the CSC – IT Center for Science. Datasets transferred to researchers’ own institutions must be stored in a secure environment, with options including the university’s Umpio system or secure storage devices. The university’s shared workspace cannot be used to store data containing personal information, even if folder access can be restricted. Researchers must determine how the dataset is transferred from the data controller to their own storage and ensure secure data transfer and storage.

Possible data processing and storage in the University of Helsinki's environments. Data should be stored and processed in a pseudonymized and encrypted form whenever possible. The home (Z:) and group directories (P:) available to university members are suitable for low- and medium-risk datasets that have been pseudonymized. Pseudonymization keys must always be stored separately from the actual dataset, in an encrypted format and in a different storage location. High-risk identifiable data must be stored in Umpio, which is the University of Helsinki’s secure computing environment. The Helpdesk data storage table (in Finnish) can be consulted to check which storage locations are suitable for sensitive data. Additional information on handling sensitive data can also be found in the Helpdesk guidelines.
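
The separation of dataset and pseudonymization key can be sketched in Python as follows. This is an illustrative assumption about how a researcher might split a self-collected dataset, not a university-provided tool; the function name, field names, and records are made up. Encrypting the key table and placing it in a different storage location are deliberately left outside the sketch.

```python
import secrets


def split_identifiers(rows, id_field):
    """Replace direct identifiers with random pseudonyms.

    Returns (pseudonymized_rows, key_table). The key table maps each
    pseudonym back to the original identifier and must be stored
    separately from the pseudonymized rows.
    """
    pseudonym_for = {}
    pseudonymized = []
    for row in rows:
        original = row[id_field]
        if original not in pseudonym_for:
            pseudonym_for[original] = secrets.token_hex(8)
        cleaned = {k: v for k, v in row.items() if k != id_field}
        cleaned["pid"] = pseudonym_for[original]
        pseudonymized.append(cleaned)
    key_table = {pid: original for original, pid in pseudonym_for.items()}
    return pseudonymized, key_table


# Made-up records; no real persons.
records = [
    {"name": "Fake Person 1", "answer": 3},
    {"name": "Fake Person 2", "answer": 5},
    {"name": "Fake Person 1", "answer": 4},
]

data, key_table = split_identifiers(records, "name")
print(data)
```

Only `data` would go to the working directory; `key_table` is the part that needs the stricter, separate storage.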

The storage of remotely accessed registry data is the responsibility of the registry administrator. When the data usage permit expires, the researcher’s access to the data ends. The storage or deletion of data within the registry authority’s remote access system is not the responsibility of the researcher. However, the researcher is responsible for saving program codes and results from the remote access system to document the research and ensure reproducibility.

The researcher is responsible for the storage and disposal of transferred datasets. Utilize university services for data storage. During the research project, data can be stored in a personal home directory (if used individually) or a group directory. Guidelines for different storage solutions suitable for various purposes can be found in the University of Helsinki's Data Support wiki table (in Finnish). Document for yourself and your team where the data is stored so that you can, if needed, delete all the data you have promised to destroy.

The home and group directories at the University of Helsinki are backed up every hour and function on Windows, Mac, and Linux operating systems. These directories are hosted on the university's own servers. Every university member has access to a home directory (Z-drive on Windows computers). See the instructions for obtaining a group directory. If the data is sensitive, a suitable storage location at the University of Helsinki is Umpio.

When the data usage permit expires, the data must be securely deleted. The Helpdesk guidelines should be followed, especially for deleting sensitive datasets. Physical storage media, such as external hard drives or CDs, can be destroyed; for example, a broken CD cannot be repaired. Special disposal containers for CDs are available upon request from facility services. IT support (helpdesk@helsinki.fi) can also handle the deletion if the researcher is unsure or dealing with highly sensitive data.

Results derived from registry data are typically included in research publications. These results are based on statistical analyses of the dataset but do not contain individual-level data. Results are often published alongside the research article. Unpublished descriptive results may also be worth preserving for research documentation.

Analysis codes used for registry data can be stored and published. The program codes used for processing registry data remain with the researcher, and some journals may require them to be published alongside the research article. Analysis codes are text files that allow researchers to revisit their study if needed. Publishing program codes supports research reproducibility and is considered good scientific practice. For publication, it is recommended to store program codes in Zenodo and link them to GitHub. In Zenodo, they receive a persistent identifier and are safely stored for the researcher's future use. Program codes and algorithms should be licensed under the MIT or GNU license (see Which license is suitable for software or data?).

Describing analysis codes facilitates reuse. Program codes should be documented with sufficient detail. Some general documentation is always necessary, but other researchers may not need a detailed explanation for every command. Researchers conducting similar statistical analyses can usually read and understand at least part of the program code without line-by-line documentation.
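
As a rough illustration of this level of documentation, the Python sketch below pairs a short header describing inputs and outputs with comments only where a research decision is made. The cohort limits, column names, and function name are hypothetical.

```python
"""Prepare the analysis dataset for a hypothetical registry study.

Inputs : rows with 'birth_year' and 'income' (made-up column names)
Output : rows restricted to the study cohort, with income in thousands
Needs  : Python 3, standard library only
"""

COHORT_START, COHORT_END = 1970, 1979  # hypothetical study cohort


def prepare(rows):
    """Keep cohort members and add income expressed in thousands."""
    prepared = []
    for row in rows:
        # Study population: birth cohorts 1970-1979, both ends included.
        if COHORT_START <= row["birth_year"] <= COHORT_END:
            prepared.append({**row, "income_k": row["income"] / 1000})
    return prepared


# Made-up rows for demonstration.
example = [
    {"birth_year": 1965, "income": 28000},
    {"birth_year": 1975, "income": 35000},
]
print(prepare(example))
```

The header answers what the script needs and produces, while the single inline comment records the one choice a reader could not infer from the code alone.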
