Alsuren

July 31, 2008

Hidden Markov Models for Musical Generation and Observation (Draft 3. Please read and give feedback)

Filed under: Uncategorized — Tags: , , , — alsuren @ 12:12 am

[EDIT 1: Largely re-structured to give more context for the fundamental concepts.]
[EDIT 2: Addressed some of Gavin’s points (link to a description of D-separation, and some general cleaning up of algebra.) Thanks Gav.
Also added some headings to help the casual reader.
Also note that I have changed my wordpress theme, so that the diagrams don’t get completely chewed up by your web browser.]
This post is an outline of my project so far. It is a kind-of 0th draft for what my report might contain. Please give feedback on how readable it all is (note: I already know that my diagrams use () and my text uses []. Unfortunately, I have a lot of diagrams, so changing this by hand would be tedious. I will write a script to do it for me at some point soon. I will also find a way to re-size my diagrams so that they fit on the page nicely). It’s likely to be mostly maths, but I’ll try to keep it as readable as possible, with as many pictures as possible. Please read as far as you can follow, and then post a comment, or email, or facebook, or MSN, or whatever telling me what I need to clarify.

Introduction

The aim of this project is to analyse music, specifically focussing on rhythm. Before this can be done, we need to be able to track where we are in a bar. Once we can track where we are in the bar, we can start to extract useful features, and not be completely thrown off if two songs are very slightly faster or slower than each other. In order to do *this*, we need to have a model of how music is generated.

Toy Model of Music Generation

A really simple model would be the following:

Physical Description

Assume that a musician always plays music at the same tempo, and that he always plays the same bar in a loop. The musician starts at a random position in the bar, to make things a little random. Every hundredth of a second, the musician looks at his score, which tells him what to play for that hundredth of a second. He then keeps playing that note for a hundredth of a second, before moving on to the next position, and looking at his score again.

Note that this model of a musician is very limited. It might be valid for a music box, but not a human. That’s okay, because we are only studying it to introduce markov models. In the next article, we will explore more robust models, and outline the practical problems encountered when implementing them.

Technical Description

We call his position in the bar his state. Because we cannot observe x, we call it a “hidden state variable”. It has a state transition diagram that looks like this:
Deterministic Bar Position State Transition Diagram

The fact that the notes he plays are determined by his position in the bar can be represented using a trellis diagram:
Super-Simple Bar pointer Trellis Diagram.
The trellis diagram is a Graphical Model of our system. Arrows on a graphical model are always in the direction cause -> effect. The above trellis diagram also shows the fact that x[t] is only dependant on x[t-1] (in this case, x[t] = (x[t-1] + 1) % (X+1), using C-style syntax)

Observation Model

All we observe is what our musician plays at each time instant (marked a[t] on the diagram), and we wish to guess where he is in the bar at each time instant (x[t]). A good goal is to guess the value of x[t] that is most likely, given our observations, and our model (the score).

Base Case

We will start with a single observation.

At t=0, we observe that he is playing a C#. We wish to evaluate P(x[0]=position | a[0]=C# ) for each possible position in the bar, and pick the maximum. As a Graphical Model (shown below), a[0] is filled in, as its value has been given, and we wish to evaluate the probability of x[0] for each position. We will use more complicated graphical models later, so it is worthwhile familiarising yourself with the notation.
Graphical Model for a single Observation.

The same information can also be shown in a form that will be more familiar to the general public: The Venn Diagram below shows the events x[0]=6, and a[0]=C#. Note that the diagram is not to scale.
Venn Diagram (not to scale).

x[0]

The white circle represents the event that the musician starts (“x[0]”) at position 6 in the bar (“x[0]=6”). If out diagram were to scale, the proportion of diagram covered by this circle would represent the probability of this event happening. We refer to this probability as P(x[0]=6). Because we model the musician as starting in a random position in the score, this is 1 / the number of possible starting positions.

a[0]

The event that the first note we hear is a C# (“a[0]=C#”) is coloured in blue (this is arbitrary). The proportion of the diagram covered by the blue circle is the probability that a[0]=C#. We denote this as P(a[0]=C#). This can be found by working out the proportion of the score that contains a C#.

Conditional Probability, and Philosophy

At this point, a common objection is “But you just told us that we observed him playing a C#, so surely P(a[0]=C#)=1?” This is a good point, that needs clarifying. Usually, when we wish to evaluate P(a[0]=C#) given that we have observed C#, we write P(a[0]=C# | a[0]=C#) (read the “|” as “given”). Now, strictly, when we write P(a[0]=C#), we should really write P(a[0]=C# | model) instead. This is an important philosophical point, because P(a[0]=C#) could be interpreted as P(a[0]=C# | omniscience), which is always either 0 or 1. For the rest of this article, wherever nothing is noted as “given”, assume “given nothing other than my model”. This point is worthy of an article on its own, but we will not dwell on it any further here.

a[0] given x[0]

The proportion of the x[0]=6 circle that is also covered by the a[0]=C# circle represents the probability that a[0]=C# given that x[0]=6 (written as P(a[0]=C# | x[0]=6)). This can actually be read directly from the score: If there is a C# written in the position 6 in the score, then we have a 100% chance of hearing a C# (this would be represented by putting the 6 circle entirely within the C# circle). Otherwise it is 0 (which would be represented by not letting the circles overlap). Rather than having to create hundreds of diagrams, we will simply say “this diagram is not to scale”, and let the reader use his imagination.

x[0] given a[0]

The proportion of the a[0]=C# circle that is also covered by the x[0]=6 circle represents the probability of the musician having started in position 6, given that we have heard a C#. P(x[0]=6 | a[0]=C#). This is what we wish to calculate in order to find our best guess of where we started. We will get back to this later.

a[0] AND x[0]

The area where both circles overlap represents the probability that both a[0]=C# and x[0]=6. We write this as P(x[0]=6, a[0]=C#). It should be noted that this can be calculated as P(x[0]=6 | a[0]=C#) * P(a[0]=C#). This result is known as the product rule of probability.

A more Consice and General Notation

At this point, we will clean up the notation a little. Previously, we have explicitly written a[0]=C#, but the results above also work if we are interested in the event a[0]=D (simply replace C# with D), or a[0]>C# (if you consider one note to be greater than another). In order to capture this generality, we could use another variable for the note, but this very quickly becomes clunky. Instead, we will simply leave out the =C#. This has the added advantage that there is only a single = on each line, making things neater. When actually writing algorithms, the =C# can be substituted back in, as long as it is done consistently.

returning to x[0] given a[0]

Re-writing the product rule using the new notation looks like this:
P(x[0], a[0]) = P(x[0] | a[0]) * P(a[0])
P(x[0], a[0]) = P(a[0] | x[0]) * P(x[0])

and therefore:
P(x[0] | a[0]) * P(a[0]) = P(a[0] | x[0]) * P(x[0])

Re-arranging to get an expression for P(x[0] | a[0]):
P(x[0] | a[0]) = P(a[0] | x[0]) * P(x[0]) / P(a[0])
Those of you in the know will recognise this as Bayes’ rule.

Numerical Example, and Consolidation

The terms on the right hand side are all known, as seen above. But how is this useful? To make this useful, we must substitute in some terms. First, we substitute in our observation, and the position we’re wanting to test:
P(x[0]=6 | a[0]=C#) = P(a[0]=C# | x[0]=6) * P(x[0]=6) / P(a[0]=C#)

Then we can read the probabilities from our model. Let’s say there is a C# at position 6, and also at 7, 8, and 9. Let’s also say that the score is 20 positions long. That means:
P(a[0]=C# | x[0]=6) = 1
P(x[0]=6) = 1/20
P(a[0]=C#) = 4/20

Therefore:
P(x[0]=6 | a[0]=C#) = 1 * (1/20) / (4/20)
P(x[0]=6 | a[0]=C#) = 1/4

This is also the case for x[0] = 7, 8, or 9. It is 0 elsewhere, because the P(a[0]=C# | x[0]=10) term goes to 0. A useful check is that if we sum over all possible values of x[0], we get 4 * 1/4 + 16 * 0 = 1.

Considering More Obaservations

So by observing just a single snippet of a note, we can narrow down our position to one of 4 positions. This is useful, but it is only the start. Now we wish to incorporate some more observations, and build up a better guess about where we are in the bar.

Recursive Definition of P(x[t-1] | a[0…t-1])

In order to do this, we will jump straight to time t, and use all previous observations. We wish to calculate the probability of x[t] given all of the observations from a[0] to a[t], written P(x[t] | a[0…t]). We would like to get this in terms of P(x[t-1] | a[0…t-1]) and other known quantities, because that way, we can iterate from t=0 (the result for which is already known, from above) and only require the the output of the last iteration each time. This is a common pattern in convolutional error correction codes, as it lets us take into account all previous observations without running out of memory by trying to store P(x[t] | a[0…t]) for all previous t (you can just store the position of each maximum).

The first thing we do is use Bayes’ rule to make sure that we only have terms that can be factorised using the product rule. We will substitute terms back in later.

P(x[t] | a[0…t]) = P(a[0…t] | x[t]) * P(x[t]) / P(a[0…t]) — [1]

To help you follow this, I should point out that in each line of working, I will expand a single term from the right hand side of the last line. First, take P(a[0…t] | x[t]) (graphical model below).

Graphical Model for P(a[0...t] | x[t])

D-Separation and Conditional Independance

In this model, we notice that x[t] is a given node, and that it blocks the path between a[t] and a[t-1]. We say that a[t] and a[t-1] are conditionally independent given x[t], so you can bring a[t] out of the expression, simply by multiplying, as you can do with any independent random variables. http://www.biomedcentral.com/1471-2105/8/S6/S5/figure/F3 is a useful summary of some situations where we can or can’t quickly infer conditional independence using d-separation.
P(a[0…t] | x[t]) = P(a[0…t-1] | x[t]) * P(a[t] | x[t]) — [2]

By the law of total probability (http://en.wikipedia.org/wiki/Law_of_total_probability):
P(a[0…t-1] | x[t]) = sum_position( P(a[0…t-1], x[t-1]=position | x[t]) ) — [3]

P(a[0…t-1], x[t-1] | x[t]) = P(a[0…t-1]| x[t-1], x[t]) * P(x[t-1] | x[t]) — [4]

P(a[0…t-1]| x[t-1], x[t]) looks like this:
Graphical Model for P(a[0...t-1] | x[t-1], x[t])
Notice that the path between a[t-1] and x[t] is blocked by x[t-1], so there is no point in conditioning on x[t]. i.e.
P(a[0…t-1]| x[t-1], x[t]) = P(a[0…t-1]| x[t-1]) — [5]
Notice that this is the same form as the left hand side of [2]. This will come in useful later.

Returning to [4],
P(x[t-1] | x[t]) = P(x[t] | x[t-1]) * P(x[t-1]) / P(x[t])
but, if not conditioned upon anything:
P(x[t]) = P(x[t-1]) , so:
P(x[t-1] | x[t]) = P(x[t] | x[t-1]) — [6]
Note that in our case, P(x[t] | x[t-1]) is 1 where x[t] = x[t-1] + 1, and 0 elsewhere, but we will keep this expression general, as it will be useful later.

Now substitute [6] and [5] into [4] into [3] into [2], and we get:
P(a[0…t] | x[t]) = P(a[t] | x[t]) * sum_position( P(a[0…t-1] | x[t-1]=position) * P(x[t] | x[t-1]=position) ) — [7]

Returning to [1], we need to find an expression for P(a[0…t]). Using the law of total probability, we find:
P(a[0…t]) = sum_position( P(a[0…t] | x[t]=position) * P(x[t]=position) )

But P(x[t]) is a constant, so we can take it outside the sum:
P(a[0…t]) = P(x[t]) *sum_position( P(a[0…t] | x[t]=position) ) — [8]

Now, we substitute [8] into [1] and cancel:
P(x[t] | a[0…t]) = P(a[0…t] | x[t]) / sum( P(a[0…t] | x[t]) ) — [9]

Consolidation

So we now have an expression [9] for P(x[t] | a[0…t]) in terms of P(a[0…t] | x[t]), and we have an expression [] for P(a[0…t] | x[t]) in terms of P(a[0…t-1] | x[t-1]).
Setting t=0, we find that P(a[0] | x[0]) is something that we have already evaluated (we call this the base case of the recursion). We can now go from t=1 onwards, and find values of P(a[0…t] | x[t]), using the formula in [7], then substitute into [9], and find a maximum, and we have a very sensible guess of where we are in the bar at time t.

Some Implementation Considerations

It should be noted that when we have very few observations, we will not have a single distinct maximum. In this toy example, we start with 4 after 1 observation, then 3, then 2, then 1. In more practical models, we tend to be even more unsure, and for even longer periods of time. Storing the full array of probabilities in this case is a possible idea, but can cause us to swallow memory. Trying to fit a probability distribution to the matrix will often let us capture the structure in just a few parameters.

It should also be noted that there are many tweaks that can be made to speed this calculation up for the toy model shown here. In this model, P(x[t] | x[t-1]) is a Kronecker delta function, so sums can be translated into index operations in C-inspired languages, or into slice operations in fortran-inspired languages. We have not made this optimisation here, because later, we will use non-deterministic state transitions, so the generality of being able to substitute an arbitrary state transition matrix will come in useful. The same goes for the emission probabilities: Currently, we are using a deterministic model, but in the next article, we will explore models which fit to real data.

Advertisements

3 Comments »

  1. Strange how things come up, I just looked up Hidden Markov Models yesterday (they occur in biology too). I read about as far as the third diagram, and it seemed fairly comprehensible, but I’ll leave the remainder for those in more technical subjects.

    Comment by Thomas K — July 31, 2008 @ 12:13 pm

  2. I’ve read through this and tried to make sense of it. The first step seems fairly obvious – if there are four positions that could be C#’s and at t=0 we hear a C# obviously there is a 1 in 4 chance for x[0] being each one of those.

    The calculations after that get slightly messier but as far as I can tell they are all correct. (except that you missed a ) off the end of [7].) I haven’t seen this notation with circles and arrows before but I think what you used it to justify makes sense anyway.

    “We can then go from t=1 onwards, and find values of P(a[0…t] | x[t]), using the formula in [7], then substitute into [8], and find a maximum, and we have a very sensible guess of where we are in the bar at time t.”

    In the sum in [7], all but one of the P(x[t] | x[t-1]=position) will be zero. [7] is simplified to P(a[0…t] | x[t]) = P(a[t] | x[t] = IMPLICIT) * P(a[0…t-1] | x[t-1]=IMPLICT-1), which expands immediately to P(a[0…t] | x[t]) = prod_i P(a[i] | x[t]), which is clearly 0 or 1. (When implementing this I guess you might store previous values and so the recursive definition is still useful.) Then you say you substitute into [8]. I believe you mean by this substitute into P(x[t] | a[0…t]) = P(a[0…t] | x[t]) / sum( P(a[0…t] | x[t]) ), the expression just after [8], as P(x[t] | a[0…t]) is what you want.

    The expression after [8] is directly analogous to the obvious expression for P(x[0] | a[0]). Bearing in mind that all the probabilities are either 0 or 1 this is just saying that all the values of x[t] that the observed behaviour could have occurred for are equally likely, which again seems fairly obvious (or it comes from Bayes’ Theorem as you showed). I don’t know how you would “find a maximum” when all the probabilities are the same. Unless I’ve misunderstood.

    Comment by Gavin S. — August 2, 2008 @ 8:05 pm

  3. 😀 thanks so much for looking over it. I probably should have made it clear that this is a toy model. All it’s there for is to introduce the framework. My current implementation has a second hidden state variable controlling the speed, which will speed up and slow down at random.

    It also doesn’t observe the notes, as determining the exact note is surprisingly expensive. What it actually does is model the samples as i.i.d. gaussian, with a variance taken from the “score”. Hopefully you can see that it quickly becomes a long way removed from just 0s and 1s.

    Really, you should just wait until my next post for all the fun details. I have blog writing as the only item on my todo list for tomorrow, so I will try to clean up your points in this post, and then outline what the model *actually* looks like in my next post.

    I will ping you next time I see you online to catch up.

    Comment by alsuren — August 2, 2008 @ 11:54 pm


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: