Important, Widely Used and Well-Reported Datasets

March 31, 2017 PLOS Collections 10th Anniversary Collections

In conjunction with PLOS ONE’s 10th anniversary celebration, the journal is launching a Datasets Collection to highlight articles with datasets that are noteworthy because of their impact and usefulness. The collection was assembled by PLOS ONE Senior Editor Meghan Byrne in collaboration with members of the PLOS ONE Editorial Staff, the PLOS ONE Editorial Board and the PLOS-wide Data Advisory Board.

There are many reasons to share data—ensuring reproducibility, promoting scientific progress, facilitating reuse and increasing the value of research, to name a few. Moreover, research shows data sharing correlates with higher quality research and increased citation rates.

Despite a growing recognition of the importance of data sharing, it is still not common practice in all areas of research. With its broad scope and large volume, PLOS ONE has been in a unique position to make an impact on data availability across scientific disciplines.

From its launch, PLOS ONE has encouraged data sharing. Since 2014, along with the other six PLOS journals, it has asked authors to make their data publicly available whenever possible and to specify the location of their underlying data in a data availability statement. Furthermore, to increase the visibility and reusability of supporting data and metadata, PLOS ONE deposits all supporting information files to figshare, an open, public repository.

Below, we provide insight into why articles were highlighted.

Social Networks

Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter
Sune Lehmann: “This paper has made an MTurk generated list of word-valences openly available to the research community. As a first-cut sentiment analysis method, this dataset is invaluable and I’ve downloaded it at least dozens of times to use for both teaching & research.”

Paleontology

Multivariate Analyses of Small Theropod Dinosaur Teeth and Implications for Paleoecological Turnover through Time
Andrew Farke: “This paper assembles a massive dataset of measurements for over 1,000 teeth of small carnivorous dinosaurs, which has been really useful to help track changes in dinosaur diversity and distribution prior to the big extinction at the end of the Mesozoic.” (Andrew Farke also discussed the paper in a blog post titled “And this is why we should always provide our data.”)

Developmental Biology

A Low Dose of Dietary Resveratrol Partially Mimics Caloric Restriction and Retards Aging Parameters in Mice
This 2008 study is one that Marc Robinson-Rechavi’s group references in their research. The authors used mice to ask whether a compound found in red wine has health benefits similar to those of caloric restriction. Data from their study are publicly available in the NCBI Gene Expression Omnibus, and the paper has been cited almost 400 times.

Physiology

The Human Serum Metabolome
Psychogios et al. assembled a collection of over 4,500 small molecule metabolites found in human serum. The authors made the dataset freely available and the paper has been cited over 450 times.

Computational Biology

Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome
In this paper investigating the reproducibility of a previously published computational biology study, the authors (one of whom is Data Advisory Board member Phil Bourne) made all the data, software and workflows fully available, and they included a detailed description of the workflows in the Supporting Information files, which are also available in figshare.

Information Sciences

Data Sharing by Scientists: Practices and Perceptions
Jake Carlson: “Tenopir and her co-authors conducted a survey developed by DataONE to understand the perceptions and practices of researchers across the United States on sharing their research data. Prior to this, analysis of data management, sharing and curation practices had largely been conducted at an institutional level. By making their data available to others the authors have provided the data librarian and curation communities a means to benchmark the progress made in making data sharing a normative part of scholarship.”

Economic Geography

An Economic Geography of the United States: From Commutes to Megaregions
Nelson and Rae analyzed a dataset of over 4 million commuter trips to better understand megaregions within the United States. They deposited their underlying data and additional analysis files to figshare. As of March 2017, the dataset had been downloaded over 6,000 times.

Bioinformatics

A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives
Andreas Prlić: “We need more and better benchmark sets to evaluate bioinformatics methods. This manuscript not only contains a new protein sequence alignment benchmark that tries to reproduce current challenges when exploring sequence space, it also contains a comprehensive evaluation of the current best methods.”

Animal Science and Welfare

Epidemiological Investigations of North American Zoo Elephant Welfare (PLOS ONE Collection)
The papers in this collection show how it can be possible to share aggregated data when the individual-level data are sensitive and therefore subject to restrictions. For reasons relating to protection of the facilities and animals included in this study, access restrictions apply to the individual-level data underlying the findings, but the authors were able to share a valuable dataset of de-identified, population-level data that underlies nine papers published in PLOS ONE.

Applied Psychology

Closing the Achievement Gap through Modification of Neurocognitive and Neuroendocrine Function: Results from a Cluster Randomized Controlled Trial of an Innovative Approach to the Education of Children in Kindergarten
In this cluster randomized controlled trial, Blair and Raver found that teaching self-regulation can help lower achievement gaps in kindergarten, particularly in high-poverty schools. The individual-level data from over 750 children in 29 schools are available in Dryad.

Neurology

Prediction and classification of Alzheimer disease based on quantification of MRI deformation
Gregory Petsko: “Although its impressive accuracy (>95% in differentiating mild AD from healthy elderly) is still not high enough to make it useful in younger populations where the prevalence of the disease is less than 10%, the fact that MRI, a widely-available technology that is much cheaper to employ than PET scanning, could be a tool for predicting and classifying Alzheimer’s disease is an important step towards a cost-effective method that is accurate enough to avoid significant false-positives.”
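
To make the arithmetic behind this comment concrete, here is a minimal sketch of how disease prevalence affects the share of positive results that are true positives (the positive predictive value). It assumes, purely for illustration, that the quoted ">95%" accuracy can be read as 95% sensitivity and 95% specificity; the paper's actual figures may differ, and the function name is hypothetical.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Fraction of positive test results that are true positives (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Assumed values for illustration only: 95% sensitivity and specificity.
    for prevalence in (0.10, 0.05, 0.01):
        ppv = positive_predictive_value(0.95, 0.95, prevalence)
        print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}")
```

Under these assumptions, about two thirds of positive results are correct at 10% prevalence and only about half at 5%, which is why a classifier that looks highly accurate can still produce many false positives in low-prevalence populations.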

Conservation Science

Patterns of Vertebrate Diversity and Protection in Brazil
The authors analyzed the diversity of birds, mammals, and amphibians in Brazil and the effects of efforts to protect these species. They deposited their data to Dryad and provided GIS-ready datasets on a dedicated website, http://biodiversitymapping.org.

Clinical Trials

Terminated Trials in the ClinicalTrials.gov Results Database: Evaluation of Availability of Primary Outcome Data and Reasons for Termination
Williams et al. deposited the data from this study of terminated clinical trials to Dryad. The dataset collates information on over 900 clinical trials registered in ClinicalTrials.gov.

Neuroscience

PLOS ONE Academic Editor Daniele Marinazzo’s selections and commentary:

A Correspondence between Individual Differences in the Brain’s Intrinsic Functional Architecture and the Content and Form of Self-Generated Thoughts
“Everything is shared: raw fMRI data, behavioral data, questionnaires etc., and the unthresholded statistical maps are uploaded to NeuroVault.org, to facilitate meta-analysis and comparisons with other analyses.”

Reproducibility and Temporal Structure in Weekly Resting-State fMRI over a Period of 3.5 Years
“Extremely valuable longitudinal dataset, conveniently annotated and shared on NITRC.”

Multiband multi-echo imaging of simultaneous oxygenation and flow timeseries for resting state connectivity
“Precious multimodal dataset, shared on openfMRI with the convenient BIDS format, allowing easy integration with several analysis pipelines.”

Marine Science

Millimeter-Sized Marine Plastics: A New Pelagic Habitat for Microorganisms and Invertebrates
Reisser et al. investigated the organisms living on millimeter-size plastics floating in the ocean and made available hundreds of scanning electron microscopy images and analyses of plastic particles. The dataset has been downloaded over 5,000 times.

Meta-Research on Data Sharing

PLOS Data Advisory Board Member Lisa Johnston highlights additional articles based on the bibliography of her recent book, Curating Research Data. Some of these papers are also highlighted in the PLOS Open Data Collection.

A Principal Component Analysis of 39 Scientific Impact Measures

Sharing Detailed Research Data Is Associated with Increased Citation Rate

Do Altmetrics Work? Twitter and Ten Other Social Web Services

Sharing underlying data is essential for accelerating scientific advances and maximizing the value of research. Through this collection, we hope to demonstrate that the impact of a work can be reflected not only in the significance of the results but also in the usefulness of the underlying data, while we also recognize the contributions of authors who have created and shared important, widely used, and well-reported datasets.