FAQ: Data cleaning

All kinds of files accumulate in various storage locations during a research project, so it’s a good practice to clean things up every now and then. Is there anything that can be deleted or stored more sensibly? Some storage locations are more economical and environmentally sustainable than others. And would it be possible—and desirable—to share some of the data with others? See below for our best tips on cleaning up your data.

Data Clean-up Week

Join us to clean up research data together!

Come meet us on Friday 29 May from 10:00 to 12:00 on the campuses if you have questions about cleaning, storing, or sharing data. You can also ask general questions about research data management, such as preparing a data management plan (DMP).

Our service points during Data Clean-up Week on Friday 29 May from 10:00 to 12:00:

City Centre: Kaisa House, Guidance Corner, 3rd floor
Kumpula: Library group study room Polaris (G108a)
Viikki: EE Building (Agnes Sjöberginkatu 2), 1st floor lobby
Meilahti: Biomedicum, meeting room Kuutti (B328a1). The room is located at the top of the B-wing stairs on the 3rd floor.

Why should I clean research data?

Data cleaning generally refers to measures aimed at improving the quality and comprehensibility of data. These measures may include, for example, deleting unnecessary or unusable files, standardizing file naming conventions, and creating a clear and appropriate folder structure.

Storing data always consumes resources. For this reason, storage should always serve a clear purpose. Such purposes may include, for example, the further use of the data or the verification of results. It makes no sense to keep files that no longer serve any purpose. Unnecessary copies and obsolete versions should therefore be removed from the data. It may also be better to delete files entirely if they are so poorly described that their purpose or context can no longer be determined.
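
Unnecessary copies can be hard to spot by file name alone. As a minimal sketch, assuming Python 3.9 or newer and treating a "copy" as a file with byte-identical content, the following groups identical files by their SHA-256 hash. The function names file_digest and find_duplicates are illustrative and not part of any university tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root that have byte-identical content."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only groups with more than one member contain duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

The sketch only reports candidate groups and deletes nothing, so you can still decide which copy to keep and whether the others are really unnecessary.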

If your data contains personal information, it is of the utmost importance to ensure (in addition to an appropriate storage location) that files containing personal information are deleted as originally promised to the research participants in the privacy notice.

These measures improve the comprehensibility of the entire dataset and its potential for future use. Data cleaning also ensures the sustainable and ethical preservation of the dataset.

Where should I keep data at different stages of the research process?

The storage and preservation of research data should be planned according to the different phases of the research. Well-chosen storage and preservation locations ensure the security, discoverability, and usability of research data throughout the research process, and they also facilitate the reuse of the data.

1. When data is collected, processed, or analyzed

During the active phase of research, convenient places for storing and sharing data include, for example:

shared network drives (e.g., P-drive)
OneDrive
Teams

For code and scripts, a version control service is particularly useful, such as:

GitHub
GitLab

For sensitive data there are secure environments like:

University of Helsinki’s Umpio service
CSC’s solutions: ePouta, SD Connect and SD Desktop

During the analysis phase, additional computing power may also be required, for which both the university and CSC offer dedicated solutions. for more examples for storing and sharing data.

2. When the data is no longer being edited

Once the research data are finalized, they should be well documented and archived for long-term preservation in a reliable data archive that provides persistent identifiers (such as DOIs).

If a discipline-specific repository is available, such as a gene bank or a language bank, it should be used. Some services are intended for medium-term preservation, such as UH Databank (for 5–15 years), while others preserve data for future generations, such as UH Data archive. In some services, data are freely accessible to everyone (as in Zenodo), whereas in others access is restricted (as in UH Databank).

Well-chosen storage solutions ensure the security, findability, and usability of research data throughout the research process, and they also promote the reuse of datasets. Read more about .

How do I create a well-organized folder structure?

A well-organized folder structure in a research project supports reproducibility and collaboration. A proper folder structure ensures that information is easily accessible to other project members—or even to yourself—a year (or even a week) from now. Attached is an example that you can adapt and modify according to the size and topic of your project.

A few basic rules:

Separate source material (raw data), processed data, and results.
Remember to keep an unmodified copy of the source data/raw data.
Keep file names informative and consistent.
Create a README.txt file that explains where each data set is located—especially if the data is stored in different locations.
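
The basic rules above can be sketched as a small Python script that creates an empty project skeleton. This is only one possible layout: the folder names in FOLDERS and the function name create_project_skeleton are assumptions made for illustration, so adapt them to your own project.

```python
from pathlib import Path

# Hypothetical folder names: raw data, processed data and results are kept apart.
FOLDERS = ["data/raw", "data/processed", "results", "scripts", "docs"]


def create_project_skeleton(root: Path) -> Path:
    """Create the folders and a README.txt stub; return the README path."""
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite an existing README
        lines = ["Project folder overview", ""]
        lines += [f"{folder}/ : (describe the contents here)" for folder in FOLDERS]
        readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling create_project_skeleton(Path("my_project")) would create the folders under my_project and a README.txt listing them, ready to be filled in with the actual locations of each data set.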

How do I come up with a good file name?

A good file name is clear, informative, and consistent. Its purpose is to convey at a glance what the file contains, what it relates to, and at which stage of the research process it was created. Well‑named files save time, reduce errors, and improve the reproducibility of research. When creating a file name, ask yourself: Can I understand the purpose of this file without opening it?

A good file name is:

descriptive but concise
consistent with other files in the project
technically safe, so that file names do not break when files are transferred between systems
understandable even after a significant amount of time has passed

Examples of good file names:

survey_responses_raw_2024-03-12.csv
regression_results_income_model1.RData
interview_codes_thematic_v1.xlsx

Avoid file names that do not clearly indicate the content, such as data.csv, analysis_new.xlsx, or final.docx.

1. Use a consistent structure

Choose a single file‑naming structure and apply it to all files. A common and effective model is:

project_description_stage_date_version.fileextension

Example:

climate_study_temperature_processed_2024-05-01_v2.csv

Consistency is more important than perfection: using the same logic makes it easier to understand files quickly, even months later.
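
One way to keep the structure consistent is to build file names with a small helper instead of typing them by hand. The following is a minimal sketch in Python; the function name build_file_name and its parameters are illustrative assumptions that mirror the model above.

```python
from datetime import date


def build_file_name(project: str, description: str, stage: str,
                    day: date, version: int, extension: str) -> str:
    """Join the parts as project_description_stage_date_version.extension."""
    parts = [project, description, stage, day.isoformat(), f"v{version}"]
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_file_name("climate_study", "temperature", "processed",
                      date(2024, 5, 1), 2, "csv"))
# prints: climate_study_temperature_processed_2024-05-01_v2.csv
```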

2. Avoid special characters and spaces

Use:

lowercase letters
underscores (_) or hyphens (-)

Avoid:

spaces
Scandinavian characters (ä, ö)
special characters (?, %, &, #)

Following these conventions helps prevent problems when data are transferred between operating systems, manipulated using command‑line tools, or analysed using code-based methods.
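
As an illustration of these conventions, the sketch below converts an arbitrary name into a technically safe one using only the Python standard library. The function name safe_file_name is hypothetical, and the exact rules (for example, turning ä into a and dropping punctuation) are assumptions you may want to adjust.

```python
import re
import unicodedata


def safe_file_name(name: str) -> str:
    """Return a lowercase name limited to a-z, 0-9, '_', '-' and '.'."""
    # Decompose accented letters (ä -> a + combining mark), then drop non-ASCII.
    # Letters without such a decomposition (for example ø) are dropped entirely.
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    # Replace each run of whitespace with a single underscore.
    ascii_name = re.sub(r"\s+", "_", ascii_name.strip().lower())
    # Remove everything that is not a letter, digit, underscore, hyphen or dot.
    return re.sub(r"[^a-z0-9_.-]", "", ascii_name)


print(safe_file_name("Kyselyn Tulokset (päivitetty) 100%.csv"))
# prints: kyselyn_tulokset_paivitetty_100.csv
```

Because different inputs can map to the same safe name, check for collisions before renaming existing files in bulk.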

3. Include dates in a standard format

If timing is relevant, use the format YYYY-MM-DD.

Example:

experiment_log_2024-11-07.txt
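
A practical benefit of the YYYY-MM-DD format is that ordinary alphabetical sorting also puts the files in chronological order. The short Python sketch below demonstrates this with made-up log file names.

```python
from datetime import date

# Made-up example names; date.isoformat() produces YYYY-MM-DD.
names = [
    f"experiment_log_{date(2024, 11, 7).isoformat()}.txt",
    f"experiment_log_{date(2023, 12, 31).isoformat()}.txt",
    f"experiment_log_{date(2024, 2, 15).isoformat()}.txt",
]

# Plain alphabetical sorting is also chronological with this date format.
for name in sorted(names):
    print(name)
# prints:
# experiment_log_2023-12-31.txt
# experiment_log_2024-02-15.txt
# experiment_log_2024-11-07.txt
```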

4. Version naming

Include a version identifier in file names when files are updated over time, such as _v1, _v2, _v3 or _001, _002, _003.

Example:

manuscript_methods_v3.docx

Avoid vague file names, such as final.docx or final_really_final_new.docx.

If the project is large or involves multiple contributors, the use of a version control system (e.g. ) is strongly recommended.
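
For projects that number versions in the file name, the _v1, _v2, _v3 scheme described above can be applied mechanically. The following Python sketch computes the name of the next version; the function name next_version_name is an illustrative assumption, and it handles only the _vN style, not the _001 style.

```python
import re
from pathlib import Path

# Matches a trailing version identifier such as _v3 or _v03.
VERSION_PATTERN = re.compile(r"_v(\d+)$")


def next_version_name(file_name: str) -> str:
    """Return file_name with its _vN suffix incremented (or _v1 added)."""
    path = Path(file_name)
    match = VERSION_PATTERN.search(path.stem)
    if match is None:
        return f"{path.stem}_v1{path.suffix}"
    number = int(match.group(1)) + 1
    width = len(match.group(1))  # keep zero padding such as _v01
    base = path.stem[:match.start()]
    return f"{base}_v{number:0{width}d}{path.suffix}"


print(next_version_name("manuscript_methods_v3.docx"))
# prints: manuscript_methods_v4.docx
```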

How can I securely delete data?

Simply deleting or overwriting a file usually does not permanently erase it; a skilled expert can still recover such files. File deletion must therefore be carried out carefully to ensure that files are truly and reliably removed.

When deleting files, you must also ensure that there are no backups of the files to be deleted that could be recovered at a later date. For older hard drives and portable storage devices, the only reliable way to delete files may be to physically destroy the entire drive.

There are various tools available for securely deleting individual files on Windows, macOS, and Linux operating systems. See the for more information.