It's been a while...
I probably could have kept this post for myself, since the main reason I'm writing this is to force myself to better formalize and structure these ideas. But I also wouldn't mind discussing these topics with other people. ;)
This will be a look at generative models for music from the perspective of a
deep learning researcher. In particular, I will be taking the standpoint that
such models should aim to possess some sort of creativity: Producing truly
novel and "interesting" artifacts. To put it another way:
They should model the process, not the data.
I realize that many of the aspects I bring up are
already being considered and actively researched in other communities (e.g.
computational creativity). However, my main goal is to bring up specifically why
some of the current research directions in deep generative modeling are (IMHO)
misguided and why resources might be better spent on other
problems.
Generative modeling may be summarized as: Given a set (or more generally, a
domain) of data x, build a model of the probability distribution p(x). In
fact, most "modern" deep generative modeling frameworks (such as GANs or
VAEs)
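Glossing over the (important) differences between these frameworks, their shared goal can be summarized roughly as follows (writing p_θ for the model distribution with parameters θ and p_data for the data distribution):

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big] \;=\; \arg\min_\theta \; D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, p_\theta\big)$$

(VAEs optimize a lower bound on this likelihood, and GANs minimize a different divergence via a discriminator instead of an explicit likelihood -- but in every case, the target is the distribution defined by the data.)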
While such models are mostly built and tested in the image domain (especially
faces), attempts at creating music "from nothing" are becoming more ambitious
(e.g. Music Transformer).
Please note: I am aware that there are contexts/applications where training a generative model to "copy" a distribution is actually the goal, and there is nothing wrong with that. However, models like the ones mentioned above are usually presented "as is", with their ability to generate an endless stream of music as the main selling point. Interestingly, they are often explicitly advertised as generating music "in a certain style", which IMHO masks their limitations somewhat by pretending that generating hours of Mozart-like music (for example) is the whole point. Of course, there is some value here -- namely, exploring/showcasing model architectures that are capable of the kind of long-term structure needed to produce music. I'm certainly not proposing to get rid of deep models altogether -- their incredible expressiveness should be leveraged. But I believe that at some point, the relentless scaling-up should be stopped, or at least paused for a while, and the insights applied to more creative approaches to making music.
As an approach radically different from copy models, it might be possible to
start generating artifacts (e.g. pieces of music) without any reference data
whatsoever.
Given a fixed data distribution for training, a generative model will be "done" eventually. That is, it will have converged to the "best" achievable state given the architecture, data, learning goals, training procedure etc. If we then start sampling thousands upon thousands of outputs (compositions) from the model, these will all come from the exact same (albeit possibly extremely complex) distribution. Diversity can be achieved by using conditional distributions instead, but these will still be stationary.
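As a toy illustration of this stationarity (a hypothetical `model` object, not any particular framework): once training has converged and the parameters are frozen, every sample is an independent draw from one and the same distribution.

```python
# Hypothetical sketch: a trained, frozen generative model is a stationary sampler.
model.eval()  # parameters are fixed from here on
compositions = [model.sample() for _ in range(10_000)]
# All 10,000 pieces are i.i.d. draws from the same distribution p_theta(x);
# composition #10,000 is in no way informed by compositions #1 ... #9,999.
```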
It should be clear that this is not a reasonable model of any creative process, nor will it ever create something truly novel. On the contrary, such a process should be non-stationary, that is, always evolving. New genres develop, old ones fall out of favor. Ideas are iterated upon. New technology becomes available, fundamentally disrupting the "generative distribution". Such things should (in my opinion) be much more interesting to model and explore than a literal jukebox.
Concretely, I believe that concepts from research on open-endedness are a much better fit for modeling the kind of ever-evolving process sketched above.
To expand upon the speculation on "ex nihilo" generative models at the end of the last section, it could be interesting to use a trained copy model to initialize an open-ended search process, which is then perhaps guided by more general principles/priors. This would allow for exploring possible evolutions of existing musical genres.
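A very rough sketch of what such a setup could look like (everything here is hypothetical: `copy_model`, `mutate`, `novelty` and `prior_score` are stand-ins, and novelty search is just one of many possible open-ended objectives):

```python
# Speculative sketch: seed an open-ended, novelty-driven search with samples
# from a trained "copy model", then let the population drift away from the data.
population = [copy_model.sample() for _ in range(100)]  # start inside a known genre
archive = []                                            # record of what has been "done" already

for generation in range(1_000):
    candidates = [mutate(piece) for piece in population]  # e.g. edit notes, swap instruments
    # Prefer pieces unlike anything in the archive, weighted by general priors
    # (e.g. "is this still recognizable as music at all?").
    scored = sorted(candidates, key=lambda c: novelty(c, archive) * prior_score(c), reverse=True)
    population = scored[:100]
    archive.extend(population)
```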
Generative modeling of music is usually done at one of two levels.
The first one
is the symbolic level, where music is represented by MIDI, ABC notation, piano
rolls or some other format. What is common to such representations is that they
use (often discrete) "entities" encoding notes or note-like events in terms of
important properties such as the pitch and length of a tone. Importantly, there
is no direct relation to audio in these symbols -- the sequences need to be
interpreted first, e.g. by playing them on some instrument. This implies that
the same MIDI sequence can sound vastly different when it is interpreted via two
different instruments. This is arguably already a problem in itself, since
widely used symbolic representations lack means of encoding many factors that
are important in contemporary music (electronic music in particular).
In that
regard, it is quite telling that many symbolic models are trained on classical
(piano) music. Here, the instrument is known and fixed, and so it can be assumed
that a "sensible" sequence of symbols will sound good.
However, there is a second problem related to the interpretation of musical symbols, which is perhaps easier to miss. Namely, the symbols have absolutely no meaning by themselves. Previously I said that they usually encode factors such as the pitch or length of a tone -- but the exact relationships are imposed by human interpretation. Take the typical western twelve-tone equal temperament (which MIDI note pitches are also commonly mapped to) as an example: Here, tones twelve steps apart are one octave apart (i.e. they double in fundamental frequency). Tones seven steps apart (a fifth) have a frequency ratio of roughly 3:2, etc. Generally, each step up increases the frequency by about 6% compared to the next-lower tone. Such intervals undoubtedly play an important role in human perception of music. But these relations are completely absent from a symbolic note representation. For a model trained on such data, there might as well be five tones to an octave, or thirteen, or... The concept of intervals does not arise from symbolic data, and thus a model trained on such data cannot learn about it.
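To make the missing "meaning" concrete: the usual mapping from MIDI note numbers to frequencies (A4 = MIDI note 69 = 440 Hz, twelve-tone equal temperament) lives entirely outside the symbolic data -- it is a convention we apply when interpreting the symbols.

```python
# The interpretation that gives MIDI pitches their meaning; none of this is
# visible in the symbols a model is trained on.
def midi_to_hz(note: int, a4_hz: float = 440.0) -> float:
    """Twelve-tone equal temperament: 12 steps per octave, A4 = MIDI note 69."""
    return a4_hz * 2 ** ((note - 69) / 12)

print(midi_to_hz(69))                   # 440.0   (A4)
print(midi_to_hz(81))                   # 880.0   (one octave up: ratio 2)
print(midi_to_hz(76))                   # ~659.26 (a fifth up: ratio ~1.498, roughly 3:2)
print(midi_to_hz(70) / midi_to_hz(69))  # ~1.0595, i.e. each step is ~6% higher
```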
Then why do such models manage to produce data that sounds "good", with harmonic
intervals we find pleasing? This is simply because the models copy the data they
receive during training. If the data tends to use certain intervals and avoid
others, the model
will do so, as well. The difference is that the training data was generated
(i.e. the songs were composed) with a certain symbol-to-sound relationship in mind (e.g. twelve-tone equal temperament). However, this relationship is lost on the model, which merely copies what it has been taught without understanding the ultimate "meaning" (in terms of actual sound). In fact, this seems incredibly close to John Searle's famous Chinese Room argument.
Aside from the symbolic level, it is also possible to directly generate data
on the audio (waveform) level.
Recently, OpenAI released their Jukebox model, which operates directly at this level.
It may be possible to combine symbolic and waveform approaches to achieve the best of both worlds. Essentially, this means using a symbolic-level model to produce sequences of symbols, and then a waveform model that translates those symbols into sound. This preserves many advantages of symbolic models (e.g. explicit, interpretable representations and specific ways of making sound) while also allowing the model to "connect" with the domain we eventually care about (audio).
While this sounds good in theory, there are of course problems with this approach, too. The main one is probably how to formulate a joint model for symbols and sound. A major obstacle here is that it is not possible to backpropagate through discrete symbols. Since most symbolic models output soft probability distributions over symbols, this is not a problem in the purely symbolic setting. But a joint model would probably not be able to work with such soft outputs, since it would be like "pressing every piano key a little bit, some more strongly than others". Still, there are workarounds for this issue, such as vector quantization with straight-through estimators -- or dropping gradient-based methods entirely and using alternatives like reinforcement learning or evolutionary computing instead.
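As an illustration of the straight-through idea (a generic PyTorch-style sketch, not any particular published model; `symbol_model` and `synth` in the final comment are hypothetical): use a hard, discrete symbol in the forward pass, but let gradients flow as if the soft distribution had been used.

```python
import torch
import torch.nn.functional as F

def straight_through_onehot(logits: torch.Tensor) -> torch.Tensor:
    """Forward pass: a hard one-hot choice ("press exactly one key").
    Backward pass: gradients flow through the soft distribution."""
    soft = F.softmax(logits, dim=-1)
    index = soft.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(soft).scatter_(-1, index, 1.0)
    # Value equals `hard`, but the gradient is taken with respect to `soft`.
    return hard + (soft - soft.detach())

# e.g.: symbols = straight_through_onehot(symbol_model(context))
#       audio = synth(symbols)  # hypothetical differentiable symbol-to-audio stage
```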
Besides this problem at the symbolic level, there is also one in the symbol-to-audio pipeline: This generator needs to be differentiable, too. This means we cannot simply train a symbolic model with a "real" piano (samples), since the instrument cannot be backpropagated through. Alternatively, using standard neural network architectures (e.g. Wavenet) can lead to artifacts and/or slow generation as discussed before. Personally, I am really interested in approaches like DDSP that preserve differentiability while incorporating a sensible inductive bias for audio, leading to much better quality with simpler models.
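To illustrate what "a sensible inductive bias for audio" can mean, here is a drastically simplified sketch in the spirit of DDSP (not the actual DDSP code): an additive synthesizer built from plain tensor operations, so that an audio-level loss can be backpropagated into whatever network produced its control parameters.

```python
import torch

def harmonic_synth(f0: torch.Tensor, amplitudes: torch.Tensor,
                   sample_rate: int = 16_000) -> torch.Tensor:
    """Differentiable additive synth: a sum of harmonics of a fundamental f0.

    f0:         (num_samples,) fundamental frequency per sample, in Hz
    amplitudes: (num_samples, num_harmonics) per-harmonic amplitudes
    """
    num_harmonics = amplitudes.shape[-1]
    harmonics = torch.arange(1, num_harmonics + 1, dtype=f0.dtype)    # 1, 2, 3, ...
    freqs = f0[:, None] * harmonics[None, :]                          # (samples, harmonics)
    phases = 2 * torch.pi * torch.cumsum(freqs / sample_rate, dim=0)  # integrate frequency
    return (amplitudes * torch.sin(phases)).sum(dim=-1)               # audio signal, (samples,)
```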
A possible hybrid approach could go like this:
Colton
Heath & Ventura
On the other hand, autoregressive models as well as flow-based models (which generalize the former) can explicitly compute probabilities for a given data point, which might be taken as a proxy for "quality". A model could use this to reject bad samples (e.g. that resulted from an unfortunate random draw) on its own. This is troublesome, however, since it is not clear a priori what a "high" probability is, and accordingly what kind of score one should strive for. This is particularly true in the (common) case where the data is treated as continuous, and the probabilities computed are actually densities. Also, this approach seems inappropriate for judging novelty -- truly novel work would likely receive a low probability and thus be difficult to differentiate from work that is simply low-quality, which would also receive a low score.
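In code, the kind of self-rejection described above might look like this (`model.sample` and `model.log_prob` are hypothetical stand-ins; choosing `threshold` is precisely the unclear part):

```python
# Speculative sketch: reject samples the model itself considers unlikely.
# Note the two problems from the text: log-densities have no absolute scale,
# and a genuinely novel piece would be rejected just like a bad one.
def sample_with_self_rejection(model, threshold: float, max_tries: int = 100):
    x = model.sample()
    for _ in range(max_tries):
        if model.log_prob(x) >= threshold:
            return x
        x = model.sample()
    return x  # give up and return the last attempt
```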
Additionally, none of these models use their "self-judging" abilities to actually iterate and improve on their own outputs. This is fairly common in a creative process: Create something (perhaps only partially), judge which parts are good/bad and improve on the ones that are lacking. Here, I find self-attention approaches such as the transformer interesting: The model can essentially take multiple turns in creating something, looking at specific parts of its own output and using this information to iterate further. However, current transformer models usually do not produce actual outputs (in data space) at each layer; instead they compute on high-dimensional hidden representations and only produce an output at the very end.
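What I have in mind is closer to the following (pure speculation; `generate_draft`, `judge_segments` and `rewrite_segment` are all hypothetical), where the model repeatedly returns to data space and revises its weakest parts:

```python
# Speculative sketch: create, self-judge, and revise in data space.
piece = model.generate_draft()
for _ in range(num_revisions):
    scores = model.judge_segments(piece)           # per-segment quality estimates
    worst = min(range(len(scores)), key=scores.__getitem__)
    piece = model.rewrite_segment(piece, worst)    # only touch the weakest segment
```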
Given our evolutionary history, I believe it's safe to say that perception came first, and the ability and desire for creativity arose out of these capacities. At the same time, generative and perceptual processes could also be tightly interlinked inside a model, e.g. using a predictive coding framework. At this time, I don't know enough about PC to really make a judgement (or go into more detail), however.
Likely the biggest challenge in modeling (human) creativity is that art is usually "about" something, meaning that it relates in some way to the artist's own experience in the world. As such, properly approaching this subject seems to require solving strong AI. However, there may be ways to at least make steps towards a solution via simpler methods. One example could be multi-modal representations. As humans, we are able to connect and interrelate perceptions from different senses, e.g. vision, touch and hearing. We can also relate these perceptions to memories, abstract concepts, language etc. It seems obvious that such connections inform many creative artifacts. For example, program music provides a "soundtrack" that fits a story or series of events. Such music is neither creatable nor understandable without understanding language/stories (which in turn requires general world knowledge). On a more personal level, an artist may create a piece of music that somehow mirrors a specific experience, say, "lying at night at the shore of a calm mountain lake".
Models that simply learn to approximate a given data distribution (limited to
the modality of interest) clearly cannot
make such connections.
To summarize: Current deep generative models merely copy a fixed data distribution and are therefore stationary, whereas a creative process should be open-ended and ever-evolving; symbolic models are not grounded in actual sound; hybrid symbolic/audio models (ideally with differentiable synthesis) could combine the strengths of both levels; creative systems should be able to judge and iterate on their own outputs; and ultimately, creative artifacts are "about" something, which ties them to perception and experience beyond a single modality.
Each of these points offers several directions for future research to explore. It is possible that none of these proposed methods/directions will result in anything comparable to copy models, in terms of surface-level quality, for a very long time. However, I believe it is important to break the mould of trying to make progress by throwing humongous amounts of compute at highly complex data distributions. Instead, generative music (at least for creative purposes) should start from first principles and accept that the results might be "lame" for a while. In the long term, this has the potential to teach us about music, about creativity in general, and about ourselves. Can Jukebox do that?