Join us to clean up research data together!
Come meet us on Friday 29 May from 10:00 to 12:00 on the campuses if you have questions about cleaning, storing, or sharing data. You can also ask general questions about research data management, such as preparing a data management plan (DMP).
Our service points during Data Clean-up Week on Friday 29 May from 10:00 to 12:00:
Data cleaning generally refers to measures aimed at improving the quality and comprehensibility of data. These measures may include, for example, deleting unnecessary or unusable files, standardizing file naming conventions, and creating a clear and appropriate folder structure.
Storing data always consumes resources. For this reason, storage should always serve a clear purpose. Such purposes may include, for example, the further use of the data or the verification of results. It makes no sense to keep files that no longer serve any purpose. Unnecessary copies and obsolete versions should therefore be removed from the data. It may also be better to delete files entirely if they are so poorly described that their purpose or context can no longer be determined.
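Unnecessary copies can be hard to spot by eye, especially when they have been renamed. The following is a minimal Python sketch, not an official tool, that groups files with byte-identical content by comparing SHA-256 checksums. The function names `file_digest` and `find_duplicates` are illustrative assumptions. The sketch only reports candidates and deletes nothing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root that have byte-identical content."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only the groups that contain more than one file.
    return [group for group in by_digest.values() if len(group) > 1]
```

Calling `find_duplicates(Path("my_project"))` on a hypothetical project folder returns a list of groups. Review each group by hand before deciding which copy to keep.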
If your data contains personal information, it is of the utmost importance to ensure (in addition to an appropriate storage location) that files containing personal information are deleted as originally promised to the research participants in the privacy notice.
These measures improve the comprehensibility of the entire dataset and its potential for future use. Data cleaning also ensures the sustainable and ethical preservation of the dataset.
The storage and preservation of research data should be planned according to the different phases of the research. Well-chosen storage and preservation locations ensure the security, discoverability, and usability of research data throughout the research process, and they also facilitate the reuse of the data.
1. When data is collected, processed, or analyzed
During the active phase of research, convenient places for storing and sharing data include, for example:
For code and scripts, a version control service is particularly useful, such as:
For sensitive data there are secure environments like:
During the analysis phase, additional computing power may also be required, for which both the university and CSC offer dedicated solutions.
2. When the data is no longer being edited
Once the research data are finalized, they should be well documented and archived for long-term preservation in a reliable data archive that provides persistent identifiers (for example, DOIs). Such archives include:
If a discipline-specific repository is available, such as a gene bank or a language bank, it should be used. Some services are intended for medium-term preservation, such as UH Databank (for 5–15 years), while others preserve data for future generations, such as UH Data archive. In some services, data are freely accessible to everyone (as in Zenodo), whereas in others access is restricted (as in UH Databank).
Well-chosen storage solutions ensure the security, findability, and usability of research data throughout the research process, and they also promote the reuse of datasets. Read more about
A well-organized folder structure in a research project supports reproducibility and collaboration. A proper folder structure ensures that information is easily accessible to other project members—or even to yourself—a year (or even a week) from now. Attached is an example that you can adapt and modify according to the size and topic of your project.
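If you set up several projects, creating the same skeleton each time keeps them consistent. Below is a minimal Python sketch under stated assumptions: the folder names in `PROJECT_FOLDERS` are hypothetical and are not a layout prescribed by this guide, so adapt them to your own project.

```python
from pathlib import Path

# Hypothetical layout: these folder names are illustrative assumptions,
# not a structure prescribed by this guide.
PROJECT_FOLDERS = [
    "data/raw",
    "data/processed",
    "docs",
    "scripts",
    "results",
]


def create_project_structure(root: Path) -> list[Path]:
    """Create the folders under root and return their paths."""
    created = []
    for relative in PROJECT_FOLDERS:
        folder = root / relative
        # exist_ok=True makes the function safe to run again later.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Running `create_project_structure(Path("my_project"))` creates any missing folders and leaves existing ones untouched.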
A few basic rules:
A good file name is clear, informative, and consistent. Its purpose is to convey at a glance what the file contains, what it relates to, and at which stage of the research process it was created. Well‑named files save time, reduce errors, and improve the reproducibility of research. When creating a file name, ask yourself: Can I understand the purpose of this file without opening it?
A good file name is:
Examples of good file names:
Avoid file names that do not clearly indicate the content, such as data.csv, analysis_new.xlsx, or final.docx.
1. Use a consistent structure
Choose a single file‑naming structure and apply it to all files. A common and effective model is:
project_description_stage_date_version.fileextension
Example:
Consistency is more important than perfection: using the same logic makes it easier to understand files quickly, even months later.
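One way to stay consistent is to build file names with a small helper instead of typing them by hand. The following Python sketch assembles a name according to the model above. The function name `build_file_name` and the example values are hypothetical.

```python
from datetime import date


def build_file_name(
    project: str,
    description: str,
    stage: str,
    day: date,
    version: int,
    extension: str,
) -> str:
    """Join the parts as project_description_stage_date_version.extension."""
    parts = [project, description, stage, day.isoformat(), f"v{version}"]
    return "_".join(parts) + "." + extension


# Hypothetical example values.
name = build_file_name("survey", "responses", "cleaned", date(2025, 1, 15), 2, "csv")
print(name)  # survey_responses_cleaned_2025-01-15_v2.csv
```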
2. Avoid special characters and spaces
Use:
Avoid:
Following these conventions helps prevent problems when data are transferred between operating systems, manipulated using command‑line tools, or analysed using code-based methods.
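Existing file names can be tidied with a short script. This Python sketch assumes that the safe characters are unaccented Latin letters, digits, underscores, and hyphens. That set is an assumption made for illustration, so adjust the pattern to your own conventions. The function name `sanitize_file_name` is hypothetical.

```python
import os
import re
import unicodedata


def sanitize_file_name(name: str) -> str:
    """Return a file name limited to letters, digits, '_' and '-'."""
    stem, suffix = os.path.splitext(name)
    # Split accented letters into base letter plus mark, then drop the marks.
    stem = unicodedata.normalize("NFKD", stem)
    stem = stem.encode("ascii", "ignore").decode("ascii")
    # Replace runs of whitespace with a single underscore.
    stem = re.sub(r"\s+", "_", stem.strip())
    # Remove every remaining character outside the assumed safe set.
    stem = re.sub(r"[^A-Za-z0-9_-]", "", stem)
    return stem + suffix


print(sanitize_file_name("Päiväkirja 2025 (draft).txt"))  # Paivakirja_2025_draft.txt
```

The function only returns the new name. Renaming the file itself, for example with `Path.rename`, is left as a separate and deliberate step.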
3. Include dates in a standard format
If timing is relevant, use the format YYYY-MM-DD.
Example:
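A practical benefit of the YYYY-MM-DD format is that plain alphabetical sorting puts the files in chronological order. The following Python sketch illustrates this with hypothetical file names.

```python
from datetime import date

# Hypothetical file names that carry a YYYY-MM-DD date.
names = [
    "interview_2024-11-03.txt",
    "interview_2024-02-17.txt",
    "interview_2023-12-30.txt",
]

# Ordinary text sorting is also chronological sorting.
sorted_names = sorted(names)
print(sorted_names[0])   # interview_2023-12-30.txt
print(sorted_names[-1])  # interview_2024-11-03.txt

# date.isoformat() produces the same format for new file names.
stamp = date(2024, 2, 17).isoformat()
print(stamp)  # 2024-02-17
```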
4. Version naming
Include a version identifier in file names when files are updated over time, such as _v1, _v2, _v3 or _001, _002, _003.
Example:
Avoid vague file names, such as final.docx or final_really_final_new.docx.
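Version suffixes are easier to keep in order when the next one is computed rather than typed. The Python sketch below handles only the `_v` style and assumes the suffix sits at the end of the name, before the file extension. The function name `next_version_name` is hypothetical.

```python
import re

# Matches a trailing version suffix such as _v2 or _v009.
VERSION_PATTERN = re.compile(r"_v(\d+)$")


def next_version_name(stem: str) -> str:
    """Return the stem with its version number increased by one."""
    match = VERSION_PATTERN.search(stem)
    if match is None:
        # No version suffix yet, so start numbering at 1.
        return f"{stem}_v1"
    width = len(match.group(1))  # keep zero padding, e.g. 009 -> 010
    number = int(match.group(1)) + 1
    return f"{stem[:match.start()]}_v{number:0{width}d}"


print(next_version_name("report_v2"))    # report_v3
print(next_version_name("report_v009"))  # report_v010
```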
If the project is large or involves multiple contributors, the use of a version control system (e.g.
Simply deleting or overwriting a file usually does not erase it permanently; an expert can often still recover the data. File deletion must therefore be carried out carefully to ensure that files are truly and reliably removed.
When deleting files, you must also ensure that there are no backups of the files to be deleted that could be recovered at a later date. For older hard drives and portable storage devices, the only reliable way to delete files may be to physically destroy the entire drive.
There are various tools available for securely deleting individual files on Windows, macOS, and Linux operating systems. See the