Assessing Software Defect Prediction on WLCG Software: a Study with Unlabelled Datasets and Machine Learning Techniques
by
Elisabetta Ronchieri, Marco Canaparo, Davide Salomoni, Barbara Martelli
2019
Abstract
Software defect prediction aims to identify the parts of a software system that are likely to contain faulty modules (judged, for example, by complexity, maintainability, and other software characteristics) and that therefore require attention. Machine Learning (ML) has proven to be of great value in a variety of Software Engineering tasks, including software defect prediction, even with unlabelled datasets: datasets that contain a set of features (i.e. software metrics) for the various software modules (such as files, classes and functions) but lack a classification of those modules, such as their defectiveness. To accomplish these tasks, datasets have to be collected for the various modules and properly preprocessed before ML techniques are applied: these activities are essential to manage missing values, to remove inconsistencies in the data, and to produce labelled datasets. Unlabelled datasets represent the vast majority of software datasets. Extracting the complete set of features (defectiveness included) and labelling the various modules take effort and time. Several approaches in the literature build a prediction model on unlabelled datasets, and they entail a high number of permutations. Cloud computing infrastructure, GPU-equipped resources and an adequate ML framework make it possible to build a software defect prediction model within a reasonable computation time. This study describes the analysis of new unlabelled datasets from WLCG software, coming from HEP-related experiments and middleware, with ML techniques, implementing models in different available frameworks, such as Weka, R and Python-based frameworks. We have evaluated these frameworks on four aspects: learning curve, extensibility, hardware utilization and speed. This study also includes new approaches to labelling the various modules, motivated by the heterogeneity of the software metrics distributions.
Our results suggest that predictive accuracy is generally above 96%; furthermore, our procedure keeps track of the predicted defective [...]
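The preprocessing and labelling steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a simple heuristic in which missing metric values are replaced by the column median and a module is labelled defective when it exceeds the per-metric median on more than half of its metrics. The function names, the three metrics and the sample values are all hypothetical.

```python
from statistics import median


def impute_missing(rows):
    """Replace None entries with the median of the observed values in that column."""
    n_cols = len(rows[0])
    medians = [
        median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [
        [medians[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


def label_by_thresholds(rows):
    """Label a module 1 (defective) if it exceeds the per-metric median
    on more than half of its metrics, else 0. Assumed heuristic only."""
    n_cols = len(rows[0])
    thresholds = [median(r[j] for r in rows) for j in range(n_cols)]
    labels = []
    for r in rows:
        violations = sum(1 for j in range(n_cols) if r[j] > thresholds[j])
        labels.append(1 if violations > n_cols / 2 else 0)
    return labels


# Hypothetical modules; columns are
# (lines of code, cyclomatic complexity, number of methods).
modules = [
    [120, 4, 6],
    [950, 31, 40],
    [80, None, 3],
    [400, 12, 15],
    [1500, 45, None],
]

clean = impute_missing(modules)
labels = label_by_thresholds(clean)
print(labels)
```

The resulting labels could then serve as the target column for a supervised classifier in any of the frameworks named above; the choice of median thresholds and the majority rule are design assumptions made only to keep the example short.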
Archived Files and Locations
application/pdf 1.0 MB
zenodo.org (repository), web.archive.org (webarchive)
Type: article-journal
Stage: published
Date: 2019-11-05