Standardised data and FAIR principles

Invisible research data: the importance of standardisation

Research data remains invisible unless it is stored electronically. Even then, data may be neither reusable nor machine-readable because of unstructured data formats, local standards, or a limited willingness to share data.1 This includes detailed experimental instructions, precise instrument settings, and analysis results. Such gaps in research data management make it difficult to reproduce results and can waste time and resources.

To make research data reusable, it needs to be standardised and harmonised. This is known as making data FAIR, so the scientific community can use and share data in the best possible way. The first step is to prepare and process data so colleagues can easily reuse it. To prevent misuse of data, different access rights can be defined to grant or restrict access.

What is FAIR data, and how do the FAIR principles help to ensure that data is provided in a harmonised and consistent manner?

 

The FAIR principles

According to the GO FAIR initiative, the four FAIR principles ensure that research data is optimally organised, accessible and verifiable for both humans and machines. However, this does not mean that the data is usable and accessible without restriction. FAIR data is not the same as open data!

The acronym FAIR stands for Findable, Accessible, Interoperable and Reusable.

The FAIR Principles, Sangya Pundir/ Wikimedia Commons / FAIR_data_principles – 2016/ CC-BY-SA-4.0/

 

The FAIR principle in detail

Findability

Findability is the most critical aspect of reusable data. Researchers need to describe their datasets with meaningful, comprehensive and standardised additional information that provides structured properties for understanding the dataset. This information is called metadata. The more comprehensive the metadata, the more visible the research data. It is important to write metadata with a clear, unambiguous vocabulary to enable efficient searching. Simple metadata can include the author’s name, the date the file was created, the file type or the file size. The more specific the metadata, the more information it provides to help understand a dataset.

An example is the flexural strength test of a new material. This test determines how much stress a material can withstand when bent before it breaks. To make the results of such tests understandable and reproducible, researchers need to document extensive metadata about the materials used, the test conditions and the results. For example, the metadata associated with the test will explain how the flexural strength of porcelain ceramics differs from that of ceramics used in building materials such as roof tiles.

 

Simple example of metadata (in bold) and corresponding raw data for a flexural strength study in materials research. The data shown is an example only and may differ from the real data.
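As a sketch of what such a record could look like in practice, the following Python snippet pairs descriptive metadata fields with raw measurements. All field names and values here are illustrative assumptions, not real test data.

```python
# Illustrative sketch: metadata (descriptive fields) and raw data for a
# flexural strength test, represented as a simple structured record.
# All names and values are assumptions for demonstration purposes.

metadata = {
    "author": "J. Doe",                      # who created the dataset
    "created": "2024-03-15",                 # ISO 8601 date
    "file_type": "CSV",
    "material": "porcelain ceramic",
    "test_method": "three-point bending",    # how the test was performed
    "units": {"flexural_strength": "MPa"},   # unit of the measured values
}

raw_data = [
    # (specimen id, flexural strength in MPa)
    ("S-01", 98.2),
    ("S-02", 101.7),
    ("S-03", 95.4),
]

# A derived result computed from the raw data
mean_strength = sum(value for _, value in raw_data) / len(raw_data)
```

Without the metadata, the numbers in `raw_data` would be meaningless to anyone but the original experimenter; with it, another researcher (or a machine) knows what was measured, how, and in which unit.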

 

To improve data integration, metadata needs to be linked to globally unique and persistent identifiers (PIDs). An identifier works like a permanent Internet link. A commonly used identifier is the Digital Object Identifier (DOI), which uniquely identifies scientific publications and research datasets. When a dataset is published in a database (repository), it is given a DOI that identifies it permanently, even if the storage location of the data changes. Many repositories generate such an identifier automatically, which improves the visibility of the data and allows other researchers to refer to it.

Let’s say a materials research group has published a dataset on the flexural strength of a newly developed material. The researchers upload the data to a publicly accessible repository, such as Zenodo or Figshare, and receive the DOI 10.1234/materialdata.2024.123. Researchers use this DOI to access the dataset directly and to cite it in scientific publications. The dataset can be found under this DOI even years later because it is registered in the DOI system, which always points to the current storage location. Since the metadata and the research data are two separate sets of information, the metadata must contain the identifier of the research data it describes.
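The resolution mechanism can be sketched in a few lines: a DOI is turned into a URL on the global doi.org resolver, which then forwards to the dataset’s current storage location. The DOI below is the article’s illustrative example, not a registered identifier.

```python
# Sketch of DOI resolution: a citation stays valid even if the dataset
# moves, because the doi.org proxy always redirects to the current
# location. The DOI used here is a made-up example from the text.

def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a given DOI."""
    return f"https://doi.org/{doi}"

url = doi_to_url("10.1234/materialdata.2024.123")
```

Opening that URL in a browser (or with an HTTP client) would redirect to wherever the repository currently hosts the dataset, which is exactly why DOIs are cited instead of raw storage paths.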

Comprehensive metadata and identifiers alone are not enough to make research data findable if no one knows it exists. Indexing is one way of making digital resources visible to search engines and, thus, to other researchers. Internet search engines can ‘read’ (process) and index the contents of text-based files and certain encoded document formats. This allows users to find the data in search results, for example, when searching for a particular term on Google.

To make data FAIR, it is not enough for users to know how to find it. They also need to be able to access it. This is called accessibility.

Accessibility

Findable data is not automatically accessible data. A dataset may be findable yet still require special permission, a licence, or prior registration and login before it can be accessed. In materials research, for example, the flexural strength data of a new material may be stored in a database but only be retrievable through protocol-based authentication and authorisation. These protocols (most commonly HTTP(S) or FTP) should be free and globally available so that the data and its description (metadata) are accessible. However, accessibility does not mean that the data are ‘open’ or ‘free’: the protocols should maximise the accessibility of the data while specifying the exact conditions under which the data can be accessed.
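A minimal sketch of such protocol-based access, using Python’s standard library: the request is made over HTTPS and carries an access token that the repository would issue after registration. The endpoint URL and token here are hypothetical placeholders, not a real service.

```python
import urllib.request

# Sketch of protocol-based authentication and authorisation over HTTPS.
# Both the endpoint and the token are hypothetical examples.
ENDPOINT = "https://repository.example.org/datasets/flexural-strength"
TOKEN = "example-access-token"  # would be issued after registration/login

request = urllib.request.Request(
    ENDPOINT,
    headers={"Authorization": f"Bearer {TOKEN}"},  # the authorisation step
)

# urllib.request.urlopen(request) would then fetch the (meta)data,
# provided the token actually grants access to this dataset.
```

The important point is that the access mechanism itself (HTTPS plus a standard authorisation header) is open and universally implementable, even though the data behind it may be restricted.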

Even if the original dataset is no longer available, the metadata should remain accessible. Researchers can still find out who collected the data and what articles have been published on the topic. For example, they can track down the creators of the original data, who can then provide access to the original dataset.

Interoperability

Data and metadata should be machine-readable without the need for specialised tools. It should be possible to link or process data with other datasets, systems and tools. Interoperability ensures that data and metadata can be used and combined in different applications and research contexts.

One criterion for this is a formal, broadly applicable language that is accessible to the scientific community and independent of individual researchers or research institutions. For the data themselves, scientists should use standardised, well-defined vocabularies and ontologies so that others can interpret them consistently and reliably. To enhance the interoperability of materials research data, internationally recognised terms and formats are used. This allows researchers and machines to combine the materials data with other datasets and gain new information from the links between datasets.6
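What such a controlled vocabulary does can be sketched as a simple mapping from local, ad-hoc terms onto one agreed standard term. The vocabulary entries below are illustrative assumptions, not an official ontology.

```python
# Sketch: mapping locally used terms onto a shared controlled vocabulary
# so that independently produced datasets can be combined. The entries
# are illustrative assumptions, not taken from a real ontology.

CONTROLLED_VOCABULARY = {
    "bend strength": "flexural_strength",
    "flexural strength": "flexural_strength",
    "modulus of rupture": "flexural_strength",  # a common synonym
}

def normalise_term(local_term: str) -> str:
    """Return the standard term for a local one, or fail loudly."""
    key = local_term.strip().lower()
    if key not in CONTROLLED_VOCABULARY:
        raise KeyError(f"unmapped term: {local_term!r}")
    return CONTROLLED_VOCABULARY[key]
```

Failing loudly on unknown terms (rather than passing them through) is deliberate: silent pass-through is how inconsistent vocabularies creep back into merged datasets.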

In addition, metadata should be linked to other data and other metadata. These links should describe the relationships between datasets to increase visibility and enable correct citation.

Reusability

We have already discussed how data can be accessed in the “Accessibility” section. The “Reusability” section explains who may use the data and under what conditions; this is also called legal interoperability. Common licences for data use define the conditions under which the data can be reused. Common examples are the Creative Commons licences for data and the MIT licence for software.

These licences state that the data may be used for research and development if the original author or institution is credited. This allows other researchers to use the data in their studies without seeking further permission. The more meaningful attributes are attached to a dataset, the easier it is to find and reuse. The number and quality of attributes can be increased by providing metadata that describes the context, such as a test protocol or parameter settings. Reusable data also means that users know what rights are associated with the data and where it came from.

Scientists can reuse data more efficiently if it is harmonised. This means, for example, using a common template and vocabulary and consistent data types. Materials research data can be (re)used more effectively if it conforms to established materials research standards, such as the ASTM standards for mechanical testing of ceramics.
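One small but typical harmonisation step is converting all measurements to a single agreed unit before merging datasets. The sketch below brings strength values recorded in different units onto megapascals; the source values and units are illustrative assumptions.

```python
# Sketch: harmonising strength measurements to one unit (MPa) before
# reuse. The input values and their units are illustrative assumptions.

UNIT_TO_MPA = {"MPa": 1.0, "GPa": 1000.0, "kPa": 0.001}

def to_mpa(value: float, unit: str) -> float:
    """Convert a strength value from the given unit to megapascals."""
    return value * UNIT_TO_MPA[unit]

# Two datasets recorded in different units, merged onto one scale
measurements = [(0.1, "GPa"), (98.2, "MPa")]
harmonised = [to_mpa(value, unit) for value, unit in measurements]
```

After this step the values are directly comparable, which is what allows a merged dataset to feed straight into downstream analysis or simulation without per-source special cases.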

These measures ensure that the dataset is structured in such a way that it is not only accessible and understandable but can also be used directly in other projects without further adaptation, e.g., for the development of similar materials or for modelling material properties in simulations.

 

What can we learn from FAIR data?

The FAIR initiative aims to provide high-quality, standardised data that can be reused in the long term. FAIR data should help to organise research better and to gain new knowledge more quickly. This will save time and resources. It should also improve visibility, which in turn can encourage collaboration. It will also provide a record of integrity, helping to prevent misconduct.

The process that describes step by step how we can achieve FAIR data is the FAIRification process. The Three-point FAIRification Framework provides a practical guide.

 

Further Information

 
