On Valid and Reliable Experiments in Music Information Retrieval

Project Summary

Every experimental science is based on the notion of valid and reliable experiments, i.e., experiments that really measure what one wants to examine and that yield repeatable results. Music Information Retrieval (MIR), as the interdisciplinary science of retrieving information from music, conducts experiments with a multitude of methods from machine learning, statistics, signal processing, artificial intelligence, etc. It relies on the proper evaluation of all these methods to measure the success of new algorithms or, in more general terms, to chart the progress of the field as a whole. The principal role of computer experiments and their statistical evaluation within MIR is now widely accepted and understood, but the more fundamental notions of validity and reliability in MIR experiments are still rarely discussed within the field.

This lack of awareness of valid and reliable MIR experimentation is at the heart of a number of seemingly puzzling phenomena in recent MIR research. Marginally and imperceptibly altered data, so-called adversarial examples, are able to drastically reduce the performance of state-of-the-art MIR systems. It has even been claimed that such easily fooled MIR systems therefore do not use musical knowledge at all. Other authors have pointed out that, due to a lack of inter-rater agreement when annotating ground truth data, performance in many MIR tasks can never exceed a certain glass ceiling, since it is not meaningful for an algorithm to model specific raters. One form of algorithmic bias arises from the difficulty of learning in high-dimensional spaces, where some data objects act as 'hubs': they are abnormally close to many other data objects and thereby distort music recommendation, since hub songs are recommended over and over again.

Although a small but growing body of work and literature concerning these MIR problems exists, what is still lacking is an understanding of their true nature: they are problems of validity and reliability in MIR experimentation. Since a failure to comprehend this fundamental issue at the heart of MIR is severely impeding progress in the field, our main goals in this project are: (i) to provide a framework for valid and reliable experimentation in MIR; (ii) to advance the state of the art concerning adversarial examples, inter-rater agreement and algorithmic bias by conducting exemplary valid and reliable MIR experiments.

The main focus of this project is on MIR, where the above-mentioned phenomena are especially apparent, but the very same problems also have ramifications for machine learning in general, so our research has the potential to advance progress in MIR and far beyond.
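
The hubness phenomenon mentioned in the summary can be illustrated with a small numerical sketch. It is not taken from the project's own software. It assumes synthetic Gaussian data, Euclidean distance and k = 5, and the function names are purely illustrative. For each point it counts how often that point appears among the k nearest neighbors of the other points (its k-occurrence) and reports the skewness of these counts, a common indicator of hubness: a strongly right-skewed distribution means that a few points are neighbors of very many others.

```python
import numpy as np

def k_occurrence(X, k=5):
    """Count how often each point appears in the k-nearest-neighbor
    lists of the other points (Euclidean distance)."""
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    d = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbors
    return np.bincount(nn.ravel(), minlength=X.shape[0])

def skewness(v):
    """Sample skewness (third standardized moment)."""
    c = v.astype(float) - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5

# Illustrative setup (assumption): 1000 i.i.d. standard normal points
rng = np.random.default_rng(0)
results = {}
for dim in (3, 100):
    X = rng.standard_normal((1000, dim))
    counts = k_occurrence(X, k=5)
    results[dim] = (counts, skewness(counts))
    print(f"dim={dim}: max k-occurrence={counts.max()}, "
          f"skewness={results[dim][1]:.2f}")
```

Under these assumptions, the skewness and the maximum k-occurrence are expected to be clearly larger in 100 dimensions than in 3, which is the kind of effect that makes hub songs appear in recommendation lists again and again.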

Project Details

Project Number


Principal Investigator

Arthur Flexer

Project Period

May 2019 - April 2022

Funding Amount

€ 347.476,50


Scientific Publications

Feldbauer R., Rattei T., Flexer A.: scikit-hubness: Hubness Reduction and Approximate Neighbor Search, Journal of Open Source Software, 5(45), 1957, 2020. DOI:


Flexer A., Lallai T.: Can We Increase Inter- and Intra-Rater Agreement in Modeling General Music Similarity?, in Proceedings of the 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019. Also available as: OFAI-TR-2019-01.


Paischer F., Prinz K., Widmer G.: Audio Tagging With Convolutional Neural Networks Trained With Noisy Data, Technical Report, DCASE2019 Challenge, 2019.


Prinz K., Flexer A.: End-to-End Adversarial White Box Attacks on Music Instrument Classification, arXiv:2007.14714 [eess.AS], 2020.


Prinz K., Flexer A.: Weak Multi-Label Audio-Tagging with Class Noise, Late Breaking/Demo at the 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.


Prinz K., Flexer A., Widmer G.: On End-to-End White-Box Adversarial Attacks in Music Information Retrieval, Transactions of the International Society for Music Information Retrieval, 4(1), pp.93–104, 2021. DOI:


Prinz K., Flexer A., Widmer G.: The Impact of Label Noise on a Music Tagger, in Proceedings of the 13th International Workshop on Machine Learning and Music, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2020. See also arXiv:2008.06273 [eess.AS].