Our AI-Generated Music Identification journey

2024-05-22 | 4 min read

By Nicolas Dauban (DSP engineer)

One of the major developments of 2023 was the leap in maturity of generative AI. New tools accessible to the general public have appeared, enabling images, videos, text and music to be generated from a simple prompt. AI music generation models have progressed to the point where it may soon be impossible to tell their output apart from the work of real artists by ear alone. This progress is a source of concern for some artists and other players in the music industry. With this in mind, we set out to develop a system that identifies AI-generated music based solely on the audio signal. This article covers the development of the project: the creation of a dataset, the training, and the evaluation of our detection system.

Preliminary study

State of the art

The generation of music by AI has been one of the major areas of interest in MIR (Music Information Retrieval) research in recent years. Models such as MusicGen, AudioLDM2, MusicLM and Suno are achieving levels of quality that are increasingly close to that of music produced by humans using conventional tools.

There are tools that detect AI-generated voices, and tools capable of identifying different musical characteristics, such as mood, genre or instrumentation. However, as far as we know, no tool in the scientific literature or on the market can yet identify music generated by artificial intelligence. This gap is largely explained by how recently these technologies have matured.

Model Architecture

At first, we considered AI-generated music detection as a special case of music auto-tagging. In the same way that a musical instrument can be identified, we assumed that AI-generated music could also be identified from its timbre. This hypothesis was put forward after analyzing samples generated by various models using IRCAM’s dedicated algorithm.

Audio quality is crucial because it reflects the entire audio creation process, from recording to post-mastering effects. IRCAM already did groundbreaking work on the subject in 2021 with "Audio Defect Detection in Music with Deep Networks". More recently, it has taken part, alongside other research institutes, in the AQUA-RIUS (Audio Quality Analysis for Representing, Indexing, and Unifying Signals) project, which aims to further investigate audio quality using deep learning.

With these algorithms available, we were confident that the quality-control route was also worth taking, with the aim of identifying recurring low-quality features in AI-generated tracks.

Combining and cross-checking these quality and authenticity results with timbre identification could enable a reliable verdict on whether or not a track comes from a generative AI model.

Dataset gathering

Collection of AI-made music

From publicly available repositories of AI-generated tracks, we selected around 1,000 musical extracts totaling 500 minutes. In addition, we selected around 500 minutes from our own dataset of original tracks, acquired specifically for model training.

Implementation and training

The model used for AI detection is a convolutional neural network, implemented in PyTorch. Its input is the greyscale spectrogram images, at the optimal dimensions obtained in the previous step. Its output consists of two neurons: the first gives the probability that a piece was generated by AI, the second the probability that it was “genuinely” produced. The model, whose architecture is shown below, is made up of convolutional layers, pooling layers, dense layers and non-linearities. We also used batch normalization.
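To make the description above more concrete, here is a minimal PyTorch sketch of a network with the same kinds of building blocks (convolutions, pooling, batch normalization, non-linearities, dense layers, two outputs). It is not our production architecture: the class name `SpectrogramCNN`, the layer counts, the channel sizes and the 128 x 256 input shape are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class SpectrogramCNN(nn.Module):
    """Hypothetical small CNN: greyscale spectrogram in, two class scores out."""

    def __init__(self) -> None:
        super().__init__()
        # Convolutional feature extractor (layer sizes are illustrative only).
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 channel = greyscale
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Fixed-size output so the dense layers accept any spectrogram size.
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Dense head ending in two neurons: [AI-generated, genuine].
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


model = SpectrogramCNN().eval()
# Random stand-in for a batch of 4 greyscale spectrograms (assumed 128 x 256).
batch = torch.rand(4, 1, 128, 256)
with torch.no_grad():
    # Softmax turns the two raw scores into probabilities that sum to 1.
    probs = torch.softmax(model(batch), dim=1)
```

With this layout, `probs[:, 0]` would be read as the probability that an extract is AI-generated and `probs[:, 1]` as the probability that it is genuine.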

During the training phase, we tested several combinations of batch size and initial learning rate to achieve the best results. The values we settled on varied greatly from one model to another: between the first two models, the batch sizes differed by a ratio of 1 to 8 and the learning rates by a ratio of 1 to 50. In both cases, the accuracy obtained on the validation data during training was greater than 99%.
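A search over batch size and initial learning rate can be sketched as a simple grid search. The code below is only an illustration of the idea: `grid_search` and `fake_run` are hypothetical names, the candidate values are assumptions, and `fake_run` is a stand-in scoring function, not a real training run.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def grid_search(
    train_and_validate: Callable[[int, float], float],
    batch_sizes: Iterable[int],
    learning_rates: Iterable[float],
) -> Optional[Tuple[int, float, float]]:
    """Try every (batch size, learning rate) pair and keep the best one.

    `train_and_validate` is expected to train a model with the given
    settings and return its validation accuracy.
    """
    best = None
    for batch_size, lr in product(batch_sizes, learning_rates):
        accuracy = train_and_validate(batch_size, lr)
        if best is None or accuracy > best[2]:
            best = (batch_size, lr, accuracy)
    return best


def fake_run(batch_size: int, lr: float) -> float:
    """Stand-in for a real training run (assumption: peak at 32 and 1e-3)."""
    return 1.0 - abs(batch_size - 32) / 1000 - abs(lr - 1e-3)


best_batch_size, best_lr, best_accuracy = grid_search(
    fake_run, batch_sizes=[8, 32, 64], learning_rates=[1e-4, 1e-3, 1e-2]
)
```

In practice, `fake_run` would be replaced by a function that trains one detection model and returns the accuracy measured on the validation data.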

Model evaluation

To evaluate the detection models, we generated data using two prompt-based generative models.

We created prompts based on combinations of 10 musical genres, 10 emotions and 10 instruments. Here are two examples of prompts generated by keyword combinations: "romantic disco bass" and "sad jazz violin".
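The keyword combinations can be enumerated with a Cartesian product. In the sketch below, only "disco", "jazz", "romantic", "sad", "bass" and "violin" come from the examples above; the other keywords, the list names and the `build_prompts` helper are assumptions made for illustration.

```python
from itertools import product

# 10 keywords per category; most entries are illustrative assumptions.
EMOTIONS = ["romantic", "sad", "happy", "angry", "calm",
            "nostalgic", "tense", "dreamy", "playful", "dark"]
GENRES = ["disco", "jazz", "rock", "pop", "techno",
          "reggae", "blues", "folk", "metal", "funk"]
INSTRUMENTS = ["bass", "violin", "piano", "guitar", "drums",
               "flute", "saxophone", "trumpet", "cello", "synthesizer"]


def build_prompts(emotions, genres, instruments):
    """Return one "<emotion> <genre> <instrument>" prompt per combination."""
    return [
        f"{emotion} {genre} {instrument}"
        for emotion, genre, instrument in product(emotions, genres, instruments)
    ]


prompts = build_prompts(EMOTIONS, GENRES, INSTRUMENTS)
```

With 10 keywords in each category, this yields 10 x 10 x 10 = 1,000 distinct prompts.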

We also wanted data similar to what a user would obtain in a real-world scenario. To obtain a large number of such prompts, we used a language model. Here are two examples of the prompts obtained: "Make a dreamy ambient soundscape" and "Produce a Bollywood dance groove".

For both detection models, we observed the output for three types of input: 300 extracts generated by the first generative model, 300 extracts generated by the second, and over 1,000 extracts of human-created songs from our proprietary dataset (which were absent from the training and evaluation datasets).

The results are shown in the table below:

The false negative rate was very low and the true positive rate very high. We can see that the detection models are very specific to the generation models on which they were trained. This makes it possible to identify which AI model generated a track detected as AI-generated, which could prove valuable in certain use cases.
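For readers who want to reproduce this kind of evaluation, the rates mentioned above can be computed from ground-truth labels and detector decisions as sketched below. The function name `detection_rates` and the toy data are assumptions; the numbers are not our results. Note that the true positive rate and the false negative rate always sum to 1, so the false positive rate on human-made tracks is the complementary figure to look at.

```python
def detection_rates(is_ai, predicted_ai):
    """Compute detection rates from booleans (True = AI-generated)."""
    pairs = list(zip(is_ai, predicted_ai))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    positives, negatives = tp + fn, fp + tn
    nan = float("nan")  # returned when a class has no examples
    return {
        "true_positive_rate": tp / positives if positives else nan,
        "false_negative_rate": fn / positives if positives else nan,
        "false_positive_rate": fp / negatives if negatives else nan,
        "true_negative_rate": tn / negatives if negatives else nan,
    }


# Toy data: 4 AI-generated extracts followed by 4 human-made ones.
labels = [True, True, True, True, False, False, False, False]
decisions = [True, True, True, False, False, False, False, True]
rates = detection_rates(labels, decisions)
```

Running the same function once per detection model and per input type gives the kind of cross-model comparison described above.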

Conclusion

This approach produced results so accurate, and so discriminating between the two AI models, that we're convinced it could serve as the definitive tool for music industry stakeholders. We have since applied it successfully to more music generation models, with the aim of covering present and future generative AI music models.

