July 31, 2008

Hidden Markov Models for Musical Generation and Observation (Draft 3. Please read and give feedback)

Filed under: Uncategorized — Tags: , , , — alsuren @ 12:12 am

[EDIT 1: Largely re-structured to give more context for the fundamental concepts.]
[EDIT 2: Addressed some of Gavin’s points (link to a description of D-separation, and some general cleaning up of algebra.) Thanks Gav.
Also added some headings to help the casual reader.
Also note that I have changed my wordpress theme, so that the diagrams don’t get completely chewed up by your web browser.]
This post is an outline of my project so far. It is a kind-of 0th draft for what my report might contain. Please give feedback on how readable it all is (note: I already know that my diagrams use () and my text uses []. Unfortunately, I have a lot of diagrams, so changing this by hand would be tedious. I will write a script to do it for me at some point soon. I will also find a way to re-size my diagrams so that they fit on the page nicely). It’s likely to be mostly maths, but I’ll try to keep it as readable as possible, with as many pictures as possible. Please read as far as you can follow, and then post a comment, or email, or facebook, or MSN, or whatever telling me what I need to clarify.


The aim of this project is to analyse music, specifically focussing on rhythm. Before this can be done, we need to be able to track where we are in a bar. Once we can track where we are in the bar, we can start to extract useful features, and not be completely thrown off if two songs are very slightly faster or slower than each other. In order to do *this*, we need to have a model of how music is generated.

Toy Model of Music Generation

A really simple model would be the following:

Physical Description

Assume that a musician always plays music at the same tempo, and that he always plays the same bar in a loop. The musician starts at a random position in the bar, to make things a little random. Every hundredth of a second, the musician looks at his score, which tells him what to play for that hundredth of a second. He then keeps playing that note for a hundredth of a second, before moving on to the next position, and looking at his score again.

Note that this model of a musician is very limited. It might be valid for a music box, but not a human. That’s okay, because we are only studying it to introduce markov models. In the next article, we will explore more robust models, and outline the practical problems encountered when implementing them.

Technical Description

We call his position in the bar his state. Because we cannot observe x, we call it a “hidden state variable”. It has a state transition diagram that looks like this:
Deterministic Bar Position State Transition Diagram

The fact that the notes he plays are determined by his position in the bar can be represented using a trellis diagram:
Super-Simple Bar pointer Trellis Diagram.
The trellis diagram is a Graphical Model of our system. Arrows on a graphical model are always in the direction cause -> effect. The above trellis diagram also shows the fact that x[t] is only dependant on x[t-1] (in this case, x[t] = (x[t-1] + 1) % (X+1), using C-style syntax)

Observation Model

All we observe is what our musician plays at each time instant (marked a[t] on the diagram), and we wish to guess where he is in the bar at each time instant (x[t]). A good goal is to guess the value of x[t] that is most likely, given our observations, and our model (the score).

Base Case

We will start with a single observation.

At t=0, we observe that he is playing a C#. We wish to evaluate P(x[0]=position | a[0]=C# ) for each possible position in the bar, and pick the maximum. As a Graphical Model (shown below), a[0] is filled in, as its value has been given, and we wish to evaluate the probability of x[0] for each position. We will use more complicated graphical models later, so it is worthwhile familiarising yourself with the notation.
Graphical Model for a single Observation.

The same information can also be shown in a form that will be more familiar to the general public: The Venn Diagram below shows the events x[0]=6, and a[0]=C#. Note that the diagram is not to scale.
Venn Diagram (not to scale).


The white circle represents the event that the musician starts (“x[0]”) at position 6 in the bar (“x[0]=6”). If out diagram were to scale, the proportion of diagram covered by this circle would represent the probability of this event happening. We refer to this probability as P(x[0]=6). Because we model the musician as starting in a random position in the score, this is 1 / the number of possible starting positions.


The event that the first note we hear is a C# (“a[0]=C#”) is coloured in blue (this is arbitrary). The proportion of the diagram covered by the blue circle is the probability that a[0]=C#. We denote this as P(a[0]=C#). This can be found by working out the proportion of the score that contains a C#.

Conditional Probability, and Philosophy

At this point, a common objection is “But you just told us that we observed him playing a C#, so surely P(a[0]=C#)=1?” This is a good point, that needs clarifying. Usually, when we wish to evaluate P(a[0]=C#) given that we have observed C#, we write P(a[0]=C# | a[0]=C#) (read the “|” as “given”). Now, strictly, when we write P(a[0]=C#), we should really write P(a[0]=C# | model) instead. This is an important philosophical point, because P(a[0]=C#) could be interpreted as P(a[0]=C# | omniscience), which is always either 0 or 1. For the rest of this article, wherever nothing is noted as “given”, assume “given nothing other than my model”. This point is worthy of an article on its own, but we will not dwell on it any further here.

a[0] given x[0]

The proportion of the x[0]=6 circle that is also covered by the a[0]=C# circle represents the probability that a[0]=C# given that x[0]=6 (written as P(a[0]=C# | x[0]=6)). This can actually be read directly from the score: If there is a C# written in the position 6 in the score, then we have a 100% chance of hearing a C# (this would be represented by putting the 6 circle entirely within the C# circle). Otherwise it is 0 (which would be represented by not letting the circles overlap). Rather than having to create hundreds of diagrams, we will simply say “this diagram is not to scale”, and let the reader use his imagination.

x[0] given a[0]

The proportion of the a[0]=C# circle that is also covered by the x[0]=6 circle represents the probability of the musician having started in position 6, given that we have heard a C#. P(x[0]=6 | a[0]=C#). This is what we wish to calculate in order to find our best guess of where we started. We will get back to this later.

a[0] AND x[0]

The area where both circles overlap represents the probability that both a[0]=C# and x[0]=6. We write this as P(x[0]=6, a[0]=C#). It should be noted that this can be calculated as P(x[0]=6 | a[0]=C#) * P(a[0]=C#). This result is known as the product rule of probability.

A more Consice and General Notation

At this point, we will clean up the notation a little. Previously, we have explicitly written a[0]=C#, but the results above also work if we are interested in the event a[0]=D (simply replace C# with D), or a[0]>C# (if you consider one note to be greater than another). In order to capture this generality, we could use another variable for the note, but this very quickly becomes clunky. Instead, we will simply leave out the =C#. This has the added advantage that there is only a single = on each line, making things neater. When actually writing algorithms, the =C# can be substituted back in, as long as it is done consistently.

returning to x[0] given a[0]

Re-writing the product rule using the new notation looks like this:
P(x[0], a[0]) = P(x[0] | a[0]) * P(a[0])
P(x[0], a[0]) = P(a[0] | x[0]) * P(x[0])

and therefore:
P(x[0] | a[0]) * P(a[0]) = P(a[0] | x[0]) * P(x[0])

Re-arranging to get an expression for P(x[0] | a[0]):
P(x[0] | a[0]) = P(a[0] | x[0]) * P(x[0]) / P(a[0])
Those of you in the know will recognise this as Bayes’ rule.

Numerical Example, and Consolidation

The terms on the right hand side are all known, as seen above. But how is this useful? To make this useful, we must substitute in some terms. First, we substitute in our observation, and the position we’re wanting to test:
P(x[0]=6 | a[0]=C#) = P(a[0]=C# | x[0]=6) * P(x[0]=6) / P(a[0]=C#)

Then we can read the probabilities from our model. Let’s say there is a C# at position 6, and also at 7, 8, and 9. Let’s also say that the score is 20 positions long. That means:
P(a[0]=C# | x[0]=6) = 1
P(x[0]=6) = 1/20
P(a[0]=C#) = 4/20

P(x[0]=6 | a[0]=C#) = 1 * (1/20) / (4/20)
P(x[0]=6 | a[0]=C#) = 1/4

This is also the case for x[0] = 7, 8, or 9. It is 0 elsewhere, because the P(a[0]=C# | x[0]=10) term goes to 0. A useful check is that if we sum over all possible values of x[0], we get 4 * 1/4 + 16 * 0 = 1.

Considering More Obaservations

So by observing just a single snippet of a note, we can narrow down our position to one of 4 positions. This is useful, but it is only the start. Now we wish to incorporate some more observations, and build up a better guess about where we are in the bar.

Recursive Definition of P(x[t-1] | a[0…t-1])

In order to do this, we will jump straight to time t, and use all previous observations. We wish to calculate the probability of x[t] given all of the observations from a[0] to a[t], written P(x[t] | a[0…t]). We would like to get this in terms of P(x[t-1] | a[0…t-1]) and other known quantities, because that way, we can iterate from t=0 (the result for which is already known, from above) and only require the the output of the last iteration each time. This is a common pattern in convolutional error correction codes, as it lets us take into account all previous observations without running out of memory by trying to store P(x[t] | a[0…t]) for all previous t (you can just store the position of each maximum).

The first thing we do is use Bayes’ rule to make sure that we only have terms that can be factorised using the product rule. We will substitute terms back in later.

P(x[t] | a[0…t]) = P(a[0…t] | x[t]) * P(x[t]) / P(a[0…t]) — [1]

To help you follow this, I should point out that in each line of working, I will expand a single term from the right hand side of the last line. First, take P(a[0…t] | x[t]) (graphical model below).

Graphical Model for P(a[0...t] | x[t])

D-Separation and Conditional Independance

In this model, we notice that x[t] is a given node, and that it blocks the path between a[t] and a[t-1]. We say that a[t] and a[t-1] are conditionally independent given x[t], so you can bring a[t] out of the expression, simply by multiplying, as you can do with any independent random variables. is a useful summary of some situations where we can or can’t quickly infer conditional independence using d-separation.
P(a[0…t] | x[t]) = P(a[0…t-1] | x[t]) * P(a[t] | x[t]) — [2]

By the law of total probability (
P(a[0…t-1] | x[t]) = sum_position( P(a[0…t-1], x[t-1]=position | x[t]) ) — [3]

P(a[0…t-1], x[t-1] | x[t]) = P(a[0…t-1]| x[t-1], x[t]) * P(x[t-1] | x[t]) — [4]

P(a[0…t-1]| x[t-1], x[t]) looks like this:
Graphical Model for P(a[0...t-1] | x[t-1], x[t])
Notice that the path between a[t-1] and x[t] is blocked by x[t-1], so there is no point in conditioning on x[t]. i.e.
P(a[0…t-1]| x[t-1], x[t]) = P(a[0…t-1]| x[t-1]) — [5]
Notice that this is the same form as the left hand side of [2]. This will come in useful later.

Returning to [4],
P(x[t-1] | x[t]) = P(x[t] | x[t-1]) * P(x[t-1]) / P(x[t])
but, if not conditioned upon anything:
P(x[t]) = P(x[t-1]) , so:
P(x[t-1] | x[t]) = P(x[t] | x[t-1]) — [6]
Note that in our case, P(x[t] | x[t-1]) is 1 where x[t] = x[t-1] + 1, and 0 elsewhere, but we will keep this expression general, as it will be useful later.

Now substitute [6] and [5] into [4] into [3] into [2], and we get:
P(a[0…t] | x[t]) = P(a[t] | x[t]) * sum_position( P(a[0…t-1] | x[t-1]=position) * P(x[t] | x[t-1]=position) ) — [7]

Returning to [1], we need to find an expression for P(a[0…t]). Using the law of total probability, we find:
P(a[0…t]) = sum_position( P(a[0…t] | x[t]=position) * P(x[t]=position) )

But P(x[t]) is a constant, so we can take it outside the sum:
P(a[0…t]) = P(x[t]) *sum_position( P(a[0…t] | x[t]=position) ) — [8]

Now, we substitute [8] into [1] and cancel:
P(x[t] | a[0…t]) = P(a[0…t] | x[t]) / sum( P(a[0…t] | x[t]) ) — [9]


So we now have an expression [9] for P(x[t] | a[0…t]) in terms of P(a[0…t] | x[t]), and we have an expression [] for P(a[0…t] | x[t]) in terms of P(a[0…t-1] | x[t-1]).
Setting t=0, we find that P(a[0] | x[0]) is something that we have already evaluated (we call this the base case of the recursion). We can now go from t=1 onwards, and find values of P(a[0…t] | x[t]), using the formula in [7], then substitute into [9], and find a maximum, and we have a very sensible guess of where we are in the bar at time t.

Some Implementation Considerations

It should be noted that when we have very few observations, we will not have a single distinct maximum. In this toy example, we start with 4 after 1 observation, then 3, then 2, then 1. In more practical models, we tend to be even more unsure, and for even longer periods of time. Storing the full array of probabilities in this case is a possible idea, but can cause us to swallow memory. Trying to fit a probability distribution to the matrix will often let us capture the structure in just a few parameters.

It should also be noted that there are many tweaks that can be made to speed this calculation up for the toy model shown here. In this model, P(x[t] | x[t-1]) is a Kronecker delta function, so sums can be translated into index operations in C-inspired languages, or into slice operations in fortran-inspired languages. We have not made this optimisation here, because later, we will use non-deterministic state transitions, so the generality of being able to substitute an arbitrary state transition matrix will come in useful. The same goes for the emission probabilities: Currently, we are using a deterministic model, but in the next article, we will explore models which fit to real data.

Blog at