We recently asked a colleague to share a dataset they published along with their paper at one of the ACM conferences. The paper had the "Artifacts available" badge in the ACM Digital Library, highlighting the research in the paper as reproducible. Yet, the instructions to get the dataset required several steps rather than just a link: log in, find the paper, click on a tab, scroll, get to the dataset. It was much better than receiving the dataset by email. But in many other research disciplines (biology, geophysics, biodiversity, social sciences, cultural heritage), sharing data and other research artifacts is streamlined and is the cultural norm. Computer science (CS) is pretty good at sharing software. How did CS researchers fall behind many other sciences in how we think about sharing data?
Let's start by distinguishing three different aspects of data sharing: open data, data required for reproducibility of published research, and data as a first-class citizen in scientific discourse. All three aspects are related, but they are not the same: a dataset can be open but not citable or easily discoverable, for example. Or a dataset may be findable and interoperable, but not open.