In my post of https://alsuren.wordpress.com/2008/07/31/, I introduced the Hidden Markov Modelling framework upon which the rest of the project is based. In today’s post, I will go over a few of the different models that can be plugged into the framework, and discuss their Practical limitations.

As the work on this section is still in progress, this post is likely to be quite rough until I’ve explored a few things. You will also notice that my TODO list is embedded in this post. If you have any thoughts on what I should put to the top of the TODO list, please post a comment at the bottom. Also, I have made many assertions that may not actually be valid. If you see something that looks wrong, feel free to point it out.

One of the useful properties of Markov Models is that the same model parameters used in the inference/tracking model can also be used to generate synthetic data. This is useful, as it lets you listen to what your model *thinks* a song sounds like, and compare it against the real thing. Throughout the rest of this project, we will try to make sure that we can generate synthetic data from our model, and that it sounds similar to what we expect music to sound like. Here are a few such models.

## Deterministic Model

In the last post, we looked at a generative model that looks like the following Graphical Model:

We used a State Transition Matrix that looked like this:

and a look-up table as our output model. Essentially, the only thing that was random in the system was its start position.

### Resilience to Noise

This model will immediately fail in the presence of any noise.

### Implementation Issues

This model was not implemented, as it is completely unrealistic.

## Coin Toss Speed Model

A possible extension to this model is to have a position transition matrix which looks like:

Where p is the probability of moving on to the next position in the bar. If p=1, the system will behave exactly like our previous model. This has an expected rate of progress of p, and the position after time t is binomial distributed.

### Problems with the Generative Model

There are many limitations to this model. The first is that it has no notion of what its speed has been in the past, so it will speed up and slow down at random like a primary school child, or an acid jazz musician.

### Resilience to Noise

This model is better than the previous model, because if music is generated with the model, and then slowed down, it is still capable of tracking the bar position, whereas the old system would simply output a uniform probability of 0 for all possible positions.

Because of the position is no longer deterministic, it is impossible to track the bar position *exactly*, but the best guess of the algorithm will be as good as that of a human observer. If the values of P(x[t]=position | a[0…t]) are plotted as brightness values on an image for each value of x and t, areas of uncertainty in x will look blurred. They will tend to be sharp just after the start of a note, and blurred towards the end.

If the music is sped up at any instant to more than 1 x per t, then this model will output all zeros from that point onwards. Similarly, if any notes are not played, or are played in the wrong order, the model will fail.

### Implementation Issues

This model was not implemented.

## Coin Toss Emission Model

One way to amend the above model is to change the emission model. We say that the musician will only play the note recorded in the score with a probability of e (notes are dropped with probability 1-e). Also, with probability s, a note will be picked at random, and played instead. If e=1, and s=0, this will behave like the above model.

### Problems with the Generative Model

In addition to the above problems, long continuous notes will often be cut into chunks by sections of dropped notes, and there will be random squeaks. It will sound even more like a child learning to play than the previous model.

### Resilience to Noise

If notes are removed from or added to the piece, they will be treated as randomly dropped/inserted notes, and the tracking will continue as before. If the piece is temporarily sped up to greater than 1 x per t, the resulting notes can be explained away as random insertions.

### Implementation Issues

This model was not implemented. All of the above models require something to infer the notes played with high precision. This is possible, but extremely expensive computationally.

## Gaussian Emission Model

This is the first emission model that was implemented in Python. The format of input/output to the model is mono .wav data with 44100 samples per second.

Rather than having notes written in the score, we will say that the score just contains the loudness of the music at that point. Every hundredth of a second, the musician will output 441 samples of (Gaussian) white noise with variance equal to that indicated in the score (in reality, the variance will be Gaussian distributed). TODO: Add a wav file sample of the output.

When observing, the value of P(samples | x[t]) is actually the Gaussian probability density function, given by

P(samples | x[t]) = (1/(sqrt(2*pi)*sigma) ** 441) * exp( -var(samples)*441 / (2*(sigma ** 2)) ).

Where sigma is taken from the score, and var(samples) is the variance of the samples. It is left as an exercise for the readers to verify the correctness of the mathematics for this, if they so wish. The important thing is that the maximum likelihood position for x[t] is the one that has the same variance as the observed samples. This is shown for 10 samples below, and for 441 samples below that.

### Problems with the Generative Model

The fact that the sound generated is white means that it sounds like a brush on a drum. Also, as the variance of the samples is used as the observation: observing a song, and then generating sound that matches those observations makes the song unrecognisible.

### Resilience to Noise

This model is very good at tracking synthetic data. One thing to be careful of is that the variance of the input, and that of the score are the same, and that there is not too much clipping, or quantisation error (.wav stores 16bit signed integer values).

### Implementation Issues

There is a lot of scope for numerical instability here. Rather than implementing the formula for P(samples | x[t]) given above, it is recommended to implement log(P(samples | x[t])), and then use exp() after. Also, rather than passing P(a[0…t] | x[t]) through the algorithm (which decays to zero exponentially quickly), one should pass through P(x[t] | a[0…t]), whose sum remains constant at all times. If one is interested in P(a[0…t] | x[t]), it is possible to return P(a[t] | x[t], a[0…t-1]) for each t along with each P(x[t] | a[0…t]). It is then possible to take logs, and use the cumsum() function to calculate the log likelihood at each t.

## Extra State Variable for the Speed

Previously, we have looked at only a single state variable. Now, we will add a second variable for the speed, v[t]. Rather than having the position advancing randomly, we now make the position advance by v[t] each time. The Graphical Model now looks like this:

It is possible to do an analysis similar to that done in the last post, in order to find an expression for P(a[0…t-1] | x[t-1], v[t-1]), and so on. An alteranative approach (suggested by C. M. Bishop) is to merge the states, to look like the diagram below, and simply construct a single large transition matrix.

Setting v to 1 will make the system behave like the original model. Letting v[t] = 1 with probability p, or v[t] = 0 with probability 1 – p will make the system behave like the Coin Toss Speed Model. Alternatively, if we have a state transition matrix for v that looks like the diagram below, we will be able to generate constant tempo music by making p small.

### Problems with the Generative Model

The model will either only generate music with 3 speeds, or wander all over the place.

If p is small, it will only be able to track music at 3 tempos. If p is large, it will lose track of the beat too quickly when there is a break in the music. [TODO: Create some graphs of what the spread looks like for some values of p, and work out the conditions on p to get a spread of x with a single maximum.]

### Resilience to Noise

This model works for synthetic data under many forms of noise for single instrument pieces, but if multiple instruments are played, or voices are included in the song, it will fail.

### Implementation Issues

In order to get acceptable resolution in speeds, the score must be *very* large (1000 or more for 10 possible speeds). This means that for each operation, a 10000 item long vector for P(a[0…t-1] | x[t-1], v[t-1]) must be multiplied by the 10000×10000 speed transition matrix. This is a reasonably expensive thing to do, but the matrix is very sparse, and has regular structure (it will only have of the order of 30000 non-zero elements, and it contains sub-matrices which have very low bandwidth. If using LAPACK (most matrix algebra systems will use this either directly or indirectly) then it is possible to inform the computer of this sparseness, and speed up the computation with very little effort. At best, this will square root the cost of computation for a single

Another problem is that creating a single large transition matrix is very expensive in terms of memory. If a dense matrix representation is used, it will generally require a contiguous block of memory to be allocated from the heap. For large scores, memory fragmentation will make this impossible. Luckily, this only ever needs to be done once for each model, and once it has completed the matrix can be converted to a sparse format.

An alternative implementation is to separate the state variables, and do some maths [TODO: does anyone want me to do this here? I have it written in my folder]. This way, the state transition matrix for speed, P(v[t] | v[t-1]) becomes a 10×10 with items only in the 3 diagonals. The transition matrix for P(x[t] | x[t-1], v[t-1]) then becomes a Kronecker delta again, and can be evaluated using slice operations again. Instinctively, this would seem far quicker than the simplistic method outlined above, but if a scripting language like Python or Matlab is used, then the interpreter overhead for the slicing operations may turn out to be significant for small matrices. Also, a clever LAPACK implementation could concievably reduce the original matrix multiplication to this anyway.

It is also possible to turn the speed transition matrix into a symmetrically extended 1-D convolution, using the kernel [p, 1-2p, p].

[TODO: plot a graph of execution time vs array size for each implementation]

## Filtered Gaussian Emission Model

Rather than storing just a single value of the variance, the spectrum will be split up onto bands, using a Discrete Wavelet Transform. Each band will therefore have its variance stored in the score.

In order to generate music, the musician will look in his score, and generate random noise for each band at the appropriate variance. He will then inverse DWT the bands, to get the real signal.

### Problems with the Generative Model

I’ve not implemented it yet, so I don’t know what it sounds like. I should have it done by the weekend, as I see my supervisor on Monday.

### Resilience to Noise

It should filter out other instruments better than the previous model, but we’ll have to see.

### Implementation Issues

I am currently in the process of implementing this model.

I clapped along with that. Please do not respond with a dimsissive comment.

Comment by johnhoffman — November 27, 2008 @ 2:21 pm