AI composing music



Do you know what AI model would be best for let it learn composing music? I really don't know where to start there. Are there some good papers out there? I would say, if I use a NN, my only option would be a recurrent NN, because it needs to have a concept of timing, of chord progressions, and so on, right? I am also wondering how the learning function would look like, and how I could give the AI so much feedback as they usually need. Any tips, any literature, any tips?

Btw, if I made something wrong, or didn't write the question well, please tell it to me before (possibly) downvoting, I don't have much experience at doing that.


Posted 2018-08-27T16:13:16.433

Reputation: 307



Best Practices — Artificial Musical Composition Design

The best way to produce produce a learning of musical composition is to architect a solution that contains the basic components of composition proficiency.

  • Analysis of prior work
  • Music theory
  • Creative application of those two
  • Transcription or MIDI representation

The learning is in the first and third items and the second is a model of musical scales, harmony, melody, and rhythm. For the learning, GAN (Generative Adversarial Network) is the best choice from the array of machine learning approaches that are well developed as of this writing.

Critique of Other Approaches

Although DeepMind Technologies Limited (London) has some well designed solutions to offer, their WaveNet raw audio generation software produces a nice voice synthesis, but the music generation leaves much to be desired, which is fine. Their goal was better speech synthesis.

Although generative models is the way to go, as mentioned above, DeepMind will not likely provide you with the source to their proprietary software. There is a non-proprietary implementation, but the software lacks a comprehensive model of either traditional or progressive music theory.

CNN and LSTM are for recognition, which is not at all the best choice for analyzing prior compositions when several compositions are available in MIDI or other digital transcription formats. Even files representing 19th century player piano rolls are adequate for training examples.


It is not the composer's responsibility to play instruments well or sing with grace and power. The composer determines a piece as a matrix of notes with the following features.

  • Tone (frequency)
  • Intensity (amplitude)
  • Timbre (as from a specific instrument)
  • Start time (relative)
  • Duration
  • Modifications during the note

The start time and duration are expressed as a function of time signature.

A musical composition is a representation of this matrix. An artificial orchestra, band, vocalist, choir, or DJ can be constructed to provide the sound, but that is a separate art. There is no reason why an artificial composer can't provide a piece to human performers either.

This is an important distinction. An artificial composer does not produce sound. They produce the matrix of notes.

Music Theory

Music theory can be best represented in object oriented software, as a library, providing all musical options for notes, chords, melodies, and rhythms as abstractions.


This is where deep networks may prove effective, since the function of applying the patterns learned (via a GAN or some other generative sub-system) to the music theory and the production of a musical expression that will be perceived as having merit by listeners, buyers, and/or critics may be a complex function best developed in an artificial network.

The disparity (error) function in various capacities in this system's design is key.

The GAN must converge based on a metric expressing the fitness of the piece, and the deep network must converge on a selection of generated pieces based on the perception of merit mentioned above. Ultimately, the primary feedback will be from listening events, downloads, purchases, and reviews. How those aggregate is also key. If the goal is money, then purchases may preempt the others in aggregation of feedback. If the goal is fame, performances and reviews may matter most. If the goal is reach, then listening events in YouTube or downloads of an mp3 of a performance of the composition may be the primary indicator.

How the GAN and the deep network that evaluates GAN output and may also market the best results interact with one another presents an interesting problem, which is largely a problem in probability, balance, and connection topology.


There are many MIDI libraries and there are software packages that will produce notes on a staff in PDF or print form. The output of the composition is the easiest of the engineering tasks involved in this system's design.


Posted 2018-08-27T16:13:16.433

Reputation: 1 124


There are a few of them. The most recent I've found is from DeepMind, "The challenge of realistic music generation: modelling raw audio at scale", available here:

This video is a great analysis of it.


Posted 2018-08-27T16:13:16.433

Reputation: 803

Thanks for linking to the paper in addition to the video analysis. – DukeZhou – 2018-08-27T20:13:19.183

@BlueMoon93, I like this answer, but maybe you could link to some other ones. EvoMusArt puts out a few papers every year, for example:

– John Doucette – 2018-08-30T11:36:50.247

@JohnDoucette I don't know much on the theme, and I didn't know that article/team. You should compile your own answer, so the OP gets notified – BlueMoon93 – 2018-08-30T12:14:55.060


I am also new to the neural network architecture game but from what I have learned so far I think you have a few good options to choose from.

A recurrent neural network (RNN) would be a standard approach but if you're looking for something more robust you could look into a Long Short Term Memory network (LSTM). The neurons have a memory of past events and can recall that later on. It is a subset of RNN.

Perhaps you could go a little further and use a Convolutional Neural Network (CNN). So far these type of networks have been highly successful for image recognition. You could abstract a song piece as an image. Each pixel could be a progression in time and the value of the pixel could be the actual note.

Also take a look at this article for a good overview of several different neural network types.

Jeff Wingo

Posted 2018-08-27T16:13:16.433

Reputation: 9