Addressing major societal challenges such as health, climate change, energy, food availability, migration and peace depends on the contributions of a distributed and diverse international network of researchers and subject experts. The aim of open science is to improve the accessibility of research outputs, including articles, data and other research objects, so that researchers, industry and the public can make use of, build on, and verify these research outputs.
Among research outputs, research data are often the most diverse, as diverse as the international network of experts who perform research. Datasets may be small or large, simple or complex, structured or unstructured. Data may stem from hundreds of different subjects, may be produced by numerous methodologies, and may exist in a plethora of different formats. This diversity extends to data management practices, which vary in quality and comprehensiveness. Historically, large structured datasets in well-established disciplines have been more likely to adopt unified and standardized formats that are defined and accepted within the discipline. Similarly, well-established disciplines tend to have common and well-understood workflows, whereas in the long tail of research it is not unusual for researchers to use a variety of tools and to develop ad hoc data workflows. Long-tail datasets, which vary radically in source, discipline, size, subject, provenance, funding, format, longevity, location and complexity, are less likely to adhere to common standards. Because long-tail data are so widely distributed and diverse, it is challenging to ensure that they are discoverable and stored in appropriate formats, with the curation and metadata needed to facilitate reuse; such data have also received less attention historically. Furthermore, the terms used to refer to long-tail data, e.g. ‘small data’, ‘legacy data’ or ‘orphan data’, have contributed to diminishing the perceived importance of such data.
Considering that a large portion of research datasets (and associated research funding) are found in the long tail, it is paramount that we address the specific and unique data management challenges for these data. The risks of neglecting long-tail data are real and significant. They include limited reproducibility, transparency, and verifiability of research results, as well as unnecessary costs associated with the duplication of research data. Moreover, the potential benefits of reuse are significantly reduced.
The Research Data Alliance (RDA) “Long Tail of Research Data Interest Group” has been assessing the situation of long-tail data over the last three years, and urges the broader community to consider the risks and opportunities related to these data. This document provides seven recommendations for a variety of stakeholders, including governments, funders, research institutions and researchers, to help improve the current approach to managing long-tail data. We call on the community to work together to create the necessary and sufficient conditions to properly steward these valuable research outputs for future generations of researchers.