A domain-specific language for describing machine learning dataset

by Joan Giner-Miguelez, Abel Gómez, Jordi Cabot

Released as an article.

2022  

Abstract

Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift where data issues are given the attention they deserve, and more standard practices around the gathering and processing of datasets start to be discussed and established. So far, these proposals are mostly high-level guidelines described in natural language and, as such, they are difficult to formalize and apply to particular datasets. In this sense, and inspired by these proposals, we define a new domain-specific language (DSL) to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. We believe this DSL will facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The DSL is implemented as a Visual Studio Code plugin, and it has been published under an open source license.

Archived Files and Locations

application/pdf  652.1 kB
file_duw2nvlovzabbfelyfqmmxpvtu
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2022-07-05
Version   v1
Language   en
arXiv  2207.02848v1
Work Entity
Access all versions, variants, and formats of this work (e.g., preprints)
Catalog Record
Revision: 8e6bd5b5-dd9c-43d4-958e-72bd6f8418da