Data reproducibility

Thursday, 6 June 2013 in Science by Astrid Pellieux

Being able to reproduce the results of a scientific study is a necessary condition for the progress of science. It is by this means that a result can be validated or rejected. However, not all scientific papers provide enough information to allow researchers to replicate a study. The lack of detailed protocols, raw data, data analysis procedures, computer code, etc. is responsible for the non-reproducibility of a study. In addition, this lack of information may prevent the study from being reused to explore new hypotheses. Consequently, researchers are increasingly being encouraged to share all data related to a publication to enhance the growth of science.

The main way for researchers to share their work is to publish a full-text article, whose conventional format does not allow the addition of detailed information such as complete protocols, datasets, computer code, etc. Without this information, the results of a study can’t be reproduced, and thus its accuracy can’t be verified. In fact, a scientific phenomenon can only be reported as proven when it has been reproduced several times. As a consequence, researchers are asked to share all data related to a publication to ensure the study’s reproducibility. This allows other researchers to validate or reject the conclusions of a study and to identify false positive results. Several studies illustrate the importance of reproducibility. The controversy around an arsenic-associated bacteria study published in 2010 in Science is a good example. The accuracy of this study has been strongly questioned, notably because of its non-reproducibility. Another example is the “Growth in a Time of Debt” study on the relationship between government debt and economic growth, published in 2010 in the American Economic Review, whose results could not be reproduced because of calculation errors.

Being able to reproduce a study will also dissuade some researchers from fraudulently turning their negative results into positive ones. Moreover, researchers are encouraged to publish negative results and their associated data. Doing so enhances the transparency of science and informs other scientists that a given experiment does not work under given experimental conditions. In this regard, F1000Research will not charge an article fee for negative results until the end of August. Some journals are specifically dedicated to the publication of negative results, notably the open access, peer-reviewed Journal of Negative Results in Biomedicine.

Nowadays, sharing data is not only good scientific practice but also a requirement of some funding agencies and journals. For example, the US National Science Foundation (NSF) expects researchers to share the data from all studies carried out under its grants. To that end, the foundation requires researchers to include a two-page data management plan in each funding proposal. This document must describe the steps that will be taken to share data. The European Commission is also involved in scientific data sharing. Horizon 2020, the EU’s research and innovation funding program for 2014-2020, aims to improve access to research results produced in Europe. The data sharing policies of some funding agencies can be found on the BioSharing website. Some journals also require researchers to share all data related to a study: PLoS, BioMed Central, and Science are examples of publishers and journals developing data sharing policies. Despite the involvement of several funding agencies and journals, not all actors in the research world encourage researchers to share their data. Moreover, sharing may seem complicated and a waste of time to some researchers. That’s why the Reproducibility Initiative has been launched by Science Exchange, PLoS ONE, Figshare, and Mendeley. Its aim is to promote the sharing of all data supporting a study by rewarding researchers who take part in this practice.

The Reproducibility Initiative. Source: blog.mendeley.com

Many web services exist to help researchers share data (protocols, raw data, analysis procedures, computer code, etc.). The most common way is to store them in public repositories. Some of these archives specialize in a particular field of research. For example, Gene Expression Omnibus (GEO), ArrayExpress, and GenBank are repositories dedicated to genomic data. Others, such as Dryad and Figshare, are general repositories that accept a wider variety of data, including datasets and protocols. Specific archives also exist for open source software and computer code; Bitbucket and GitHub are good examples. Finally, some web platforms help researchers find a suitable repository. That’s the case of Databib and, more recently, re3data.org, launched in late May.

Another way to share data is to use an Electronic Laboratory Notebook (ELN). It is a management tool that allows researchers to record everything about their work on a daily basis, from the formulation of a hypothesis to the establishment of a conclusion. Although not always open to everybody, an ELN stores all data files in one place, which makes the work easier when the time comes to make the data publicly available. Moreover, some ELNs, such as UsefulChem and OpenWetWare, are permanently open to the public online. One of our products at shazino is hivebench, an ELN that allows you to store all your research data and share it with your colleagues.

To allow as many people as possible to access all the supporting data of a publication, it is important to standardize the file formats used for each kind of data. For example, an open document format such as OpenDocument Text is preferable for word processing documents. Moreover, some standards are already well established in specific fields of research, and their use is therefore recommended. This is the case of the MIAME standard for gene expression data. A catalogue listing data standards and how they are used can be found on the BioSharing website.

Finally, to allow everybody to find the data, it is really important for researchers to add one or several links inside their article pointing to where the different supporting data are stored. When the data used have been produced by other researchers, it is also important to cite those datasets in the publication. The international consortium DataCite is an important actor in the area of data citation. To facilitate citations, DataCite notably works on assigning persistent identifiers, such as the Digital Object Identifier (DOI), to datasets.
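
As an illustration, here is a minimal Python sketch of how a dataset citation with a DOI link could be assembled. The citation pattern (creators, year, title, publisher, identifier) only loosely follows the general shape recommended for data citations; the DatasetRecord type, the function names, and the DOI itself are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal metadata needed to cite a dataset (hypothetical structure)."""
    creators: Tuple[str, ...]
    year: int
    title: str
    publisher: str  # typically the repository hosting the data
    doi: str        # bare DOI, without any URL prefix


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a link through the doi.org resolver."""
    return "https://doi.org/" + doi.strip()


def format_citation(record: DatasetRecord) -> str:
    """Build a one-line citation: creators (year). title. publisher. DOI link."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.year}). {record.title}. "
        f"{record.publisher}. {doi_url(record.doi)}"
    )


# Hypothetical dataset: the names, title and DOI are made up for illustration.
record = DatasetRecord(
    creators=("Doe, J.", "Roe, R."),
    year=2013,
    title="Example growth measurements",
    publisher="Example Repository",
    doi="10.1234/example.5678",
)
citation = format_citation(record)
print(citation)
```

Because the identifier is persistent, the link in the citation should keep pointing to the dataset even if the repository later changes its own web addresses.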

For more information, see:
Open Science, the future of scientific research - Shazino blog
How to make a paper reproducible? - Reproducible Research
Challenges in irreproducible research - Nature
Article collections on Data standardization, sharing and publication - BMC Research Notes