NOTE: Some equations or figures may not come out nicely in this HTML document. You can download a postscript as well as a pdf version from http://vision.ai.uiuc.edu/dugad/

A TUTORIAL ON HIDDEN MARKOV MODELS

Rakesh Dugad and U. B. Desai
Signal Processing and Artificial Neural Networks Laboratory
Department of Electrical Engineering
Indian Institute of Technology --- Bombay
Powai, Bombay 400 076, India
Contact Email : dugad@uiuc.edu, ubdesai@ee.iitb.ernet.in
Technical Report No. : SPANN-96.1

May 1996

Introduction [1]

Suppose a person has three coins and is sitting inside a closed room, tossing them in some sequence. All you are shown (on a display outside the room) is the outcome of the tosses, e.g. TTHTHHTT...; this will be called the observation sequence. You know neither the sequence in which he tosses the different coins, nor the bias of the various coins. To appreciate how much the outcome depends on the individual biases and on the order of tossing the coins, suppose you are told that the third coin is heavily biased towards heads and that all coins are tossed with equal probability. Then we naturally expect far more heads than tails in the output sequence. Now suppose that, in addition to the bias, the probability of going to the third coin (state) from either the first or the second coin (state) is zero; then, assuming we start in the first or second state, heads and tails will appear with almost equal probability in spite of the bias. So we see that the output sequence depends very much on the individual biases, on the transition probabilities between the various states, and on which state is chosen to begin the observations. The three sets, namely, the set of individual biases of the three coins, the set of transition probabilities from one coin to the next, and the set of initial probabilities of choosing the states, characterize what is called the HIDDEN MARKOV MODEL (HMM) for this coin-tossing experiment. We shall shortly see the problems, and their solutions, that come under the framework of HMMs.

Notations

In the above experiment the outcomes of tossing each coin, T or H, are called the observation symbols. Because we considered coins, only two observation symbols were possible. To be more general, consider a set of N urns, each containing a number of marbles. The marbles are of M distinct colors; within each urn the proportions of the colors may differ. Our experiment consists of drawing marbles from these urns in some sequence; only the sequence of marbles drawn is shown to us. Marbles here correspond to the T or H, and urns to the coins, of the previous experiment. We now define the following notation for our model:

N = number of states (urns) in the model
M = total number of distinct observation symbols (marbles of M distinct colors)
T = length of the observation sequence, i.e. the number of symbols observed
1, 2, ..., N will denote the N urns (states) respectively
q_t denotes the state in which we are at time t
V = {v_1, ..., v_M}, the discrete set of possible observation symbols
π_i = P(q_1 = i), the probability of being in state i at the beginning of the experiment, i.e. at t = 1
A = {a_ij}, where a_ij = P(q_{t+1} = j | q_t = i), the probability of being in state j at time t+1 given that we were in state i at time t. We assume that the a_ij's are independent of time.
B = {b_j(k)}, where b_j(k) = P(v_k at t | q_t = j), the probability of observing the symbol v_k given that we are in state j
O_t will denote the observation symbol observed at instant t
λ = (A, B, π) will be used as a compact notation to denote an HMM

Using the model, an observation sequence O = O_1, O_2, ..., O_T is generated as follows: we start the experiment by choosing one of the urns (according to the initial probability distribution π), then we choose a marble (observation symbol) from this urn. This beginning instant of time is taken as t = 1, and the state and observation symbol chosen at t = 1 are denoted by q_1 and O_1 respectively. After this we choose an urn (possibly the same one as at t = 1) according to the transition probability distribution A, and again select a marble (denoted by O_2) from this urn according to the observation symbol probability distribution b_j(k) for that urn (state). Continuing this up to time t = T generates the observation sequence O = O_1, O_2, ..., O_T.
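The generation procedure just described can be sketched in code. The following Python fragment uses a made-up two-state, two-symbol model; all of the numbers below are illustrative toy values, not taken from the text:

```python
import random

# A hypothetical 2-state ("urn") HMM with 2 observation symbols ("colors").
pi = [0.6, 0.4]               # initial state probabilities
A = [[0.7, 0.3],              # A[i][j] = P(state j at t+1 | state i at t)
     [0.4, 0.6]]
B = [[0.9, 0.1],              # B[j][k] = P(symbol k | state j)
     [0.2, 0.8]]

def sample(weights, rng):
    """Draw an index with the given probabilities."""
    r, acc = rng.random(), 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r < acc:
            return idx
    return len(weights) - 1

def generate(pi, A, B, T, seed=0):
    """Generate an observation sequence of length T from the HMM (pi, A, B)."""
    rng = random.Random(seed)
    q = sample(pi, rng)                # choose the initial urn
    obs = [sample(B[q], rng)]          # draw a marble from it
    for _ in range(T - 1):
        q = sample(A[q], rng)          # move to the next urn
        obs.append(sample(B[q], rng))  # draw again
    return obs

print(generate(pi, A, B, 10))
```

Only the observation list is returned; the state sequence, as in the closed room, stays hidden.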

The Three Problems for HMMs

Most applications of HMMs are finally reduced to solving three main problems. These are :

Problem 1
: Given the model λ, how do we compute P(O | λ), the probability of occurrence of the observation sequence O = O_1, O_2, ..., O_T?
Problem 2
: Given the model λ, how do we choose a state sequence I = q_1, q_2, ..., q_T so that P(O, I | λ), the joint probability of the observation sequence O = O_1, O_2, ..., O_T and the state sequence given the model, is maximized?
Problem 3
: How do we adjust the HMM parameters λ = (A, B, π) so that P(O | λ) (or P(O, I | λ)) is maximized?

Problems 1 and 2 can be viewed as analysis problems while Problem 3 is a typical synthesis (or model identification or training) problem.

Solutions to the Three Problems

Problem 1

The most straightforward way to determine P(O | λ) is to find P(O | I, λ) for a fixed state sequence I = q_1, q_2, ..., q_T, multiply it by P(I | λ), and then sum over all possible I's. We have

   P(O | I, λ) = b_{q_1}(O_1) b_{q_2}(O_2) ... b_{q_T}(O_T)   (1)

   P(I | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}   (2)

Hence we have:

   P(O | λ) = Σ_I P(O | I, λ) P(I | λ)   (3)

            = Σ_I π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)   (4)

where I = q_1, q_2, ..., q_T.

From Eq. (4) we see that the summand of this equation involves 2T - 1 multiplications and that there exist N^T distinct possible state sequences I. Hence a direct computation from (4) will involve of the order of 2T N^T multiplications. Even for small values, N = 5 and T = 100, this means approximately 10^72 multiplications, which would take aeons to complete even on a supercomputer. Hence we see that a more efficient procedure is required to solve Problem 1; such a procedure exists and is called the forward-backward procedure.
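For intuition, the direct computation of (4) can be written as an explicit enumeration over all N^T state sequences; it is only feasible for toy sizes, which is exactly the point. The model values below are illustrative, not from the text:

```python
from itertools import product

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = a_ij
B = [[0.9, 0.1], [0.2, 0.8]]   # B[j][k] = b_j(k)
O = [0, 1, 0]                  # a toy observation sequence
N, T = 2, len(O)

def direct_prob(O):
    """P(O | lambda) by brute force: sum P(O, I | lambda) over all N**T sequences I."""
    total = 0.0
    for I in product(range(N), repeat=T):
        p = pi[I[0]] * B[I[0]][O[0]]           # pi_{q1} b_{q1}(O_1)
        for t in range(1, T):
            p *= A[I[t-1]][I[t]] * B[I[t]][O[t]]
        total += p
    return total

print(direct_prob(O))   # feasible only because N and T are tiny here
```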

Forward-Backward Procedure

Consider the forward variable α_t(i) defined as:

   α_t(i) = P(O_1 O_2 ... O_t, q_t = i | λ)   (5)

i.e. the probability of the partial observation sequence up to time t and the state i at time t, given the model λ. α_t(i) can be computed inductively as follows:

  1. α_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N   (6)

  2. for t = 1, 2, ..., T-1,

     α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1}),  1 ≤ j ≤ N   (7)

  3. then we have:

     P(O | λ) = Σ_{i=1}^{N} α_T(i)   (8)

In Step 2 we want to compute the probability of the partial observation sequence up to time t+1 and state j at time t+1; state j can be reached (with probability a_ij) from any of the N states at time t. The summation in Eq. (7) refers to this fact. Also the summand α_t(i) a_ij accounts only for the observation sequence up to time t; hence the b_j(O_{t+1}) outside the brackets. In Step 3 we just sum up all possible (mutually exclusive) ways of realizing the given observation sequence. Let us illustrate this by taking the case of α_2(j):

   α_2(j) = [ Σ_{i=1}^{N} α_1(i) a_ij ] b_j(O_2)

(We have dropped the dependence on λ for convenience.) The sequence O_1, O_2, q_2 = j occurs as: first O_1, then state q_2 = j, then symbol O_2. But q_2 = j can occur through any of the following mutually exclusive and exhaustive ways: state 1 then q_2 = j, state 2 then q_2 = j, and so on up to state N. We know that if {E_i} is a set of mutually exclusive and exhaustive events, then for any event E we have

   P(E) = Σ_i P(E, E_i)   (9)

Hence here we can write

   P(O_1, O_2, q_2 = j) = Σ_{i=1}^{N} P(O_1, q_1 = i, O_2, q_2 = j)
                        = Σ_{i=1}^{N} P(O_1, q_1 = i) a_ij b_j(O_2)
                        = [ Σ_{i=1}^{N} α_1(i) a_ij ] b_j(O_2)

which is the same as (7) above. Hence this simple example gives an insight into the mathematical justification of (7).

As for (8), using (9) we can write

   P(O | λ) = Σ_{i=1}^{N} P(O, q_T = i | λ) = Σ_{i=1}^{N} α_T(i)

Hence proved.

Now let us examine the number of computations involved in this algorithm. Step 1 involves N multiplications. In Step 2 the summation involves N multiplications plus one for the out-of-bracket b_j(O_{t+1}) term; this has to be done for j = 1 to N and t = 1 to T-1, making the total number of multiplications in Step 2 equal to (N+1)N(T-1). Step 3 involves no multiplications. Hence the total number of multiplications is N + N(N+1)(T-1), i.e. of the order of N^2 T, as compared to the 2T N^T required for the direct method [i]. For N = 5 and T = 100 we need about 3000 computations for the forward method, as compared to about 10^72 required by the direct method -- a saving of about 69 orders of magnitude!
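The three steps above translate directly into a few lines of NumPy. This is a sketch with made-up toy parameters (the same illustrative values used earlier, not from the text); row t of alpha holds α_{t+1}(·):

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[j, k] = b_j(k)
O = [0, 1, 0]

def forward(O, pi, A, B):
    """Forward pass: alpha[t, i] = P(O_1 .. O_{t+1}, state i at time t+1 | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # Step 1: alpha_1(i) = pi_i b_i(O_1)
    for t in range(T - 1):                           # Step 2: induction
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha

alpha = forward(O, pi, A, B)
print(alpha[-1].sum())   # Step 3: P(O | lambda) = sum_i alpha_T(i)
```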

Backward Procedure

In a similar manner we may define a backward variable β_t(i) as:

   β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = i, λ)   (10)

i.e. the probability of the observation sequence from t+1 to T, given the state i at time t and the model λ. Note that here q_t = i has already been given (it wasn't in the case of the forward variable). This distinction has been made so that the forward and backward variables can be combined to produce useful results, as we shall see soon. We can easily solve for β_t(i) as was done for α_t(i):

  1. β_T(i) = 1,  1 ≤ i ≤ N   (11)

  2. for t = T-1, T-2, ..., 1,

     β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),  1 ≤ i ≤ N   (12)

  3. then we have:

     P(O | λ) = Σ_{i=1}^{N} π_i b_i(O_1) β_1(i)   (13)
The proof of (12) and (13) is similar to the one given for (7) and (8) and is left as an exercise for the reader.

The computation of P(O | λ) using β_t(i) also involves of the order of N^2 T calculations. Hence the forward and the backward methods are equally efficient for the computation of P(O | λ). This solves Problem 1.
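The backward recursion can be sketched analogously (same made-up toy model as before); combining β_1 with the initial probabilities recovers exactly the same P(O | λ) as the forward pass, which is a handy correctness check:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[j, k] = b_j(k)
O = [0, 1, 0]

def backward(O, pi, A, B):
    """Backward pass: beta[t, i] = P(O_{t+2} .. O_T | state i at time t+1, lambda)."""
    T, N = len(O), len(pi)
    beta = np.ones((T, N))                 # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):         # beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

beta = backward(O, pi, A, B)
# Termination: P(O | lambda) = sum_i pi_i b_i(O_1) beta_1(i)
print((pi * B[:, O[0]] * beta[0]).sum())
```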

Problem 2

Here we have to find a state sequence I = q_1, q_2, ..., q_T such that the probability of occurrence of the observation sequence O = O_1, O_2, ..., O_T from this state sequence is greater than that from any other state sequence. In other words, our problem is to find the I that maximizes P(O, I | λ). There is a famous algorithm for doing this, called the Viterbi Algorithm [2]. It is an inductive algorithm in which, at each instant, you keep the best (i.e. the maximum-probability) state sequence ending in each of the N states as an intermediate state for the desired observation sequence O = O_1, O_2, ..., O_T. In this way you finally have the best path ending in each of the N states at the last instant; out of these, we select the one which has the highest probability.

In order to get an idea of the Viterbi algorithm as applied to the optimum state estimation problem, a simple reformulation of the problem will be useful.

Consider the expression for P(O, I | λ); from (1)-(2) we have

   P(O, I | λ) = π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)   (14)

Now define

   U(q_1, q_2, ..., q_T) = - [ ln(π_{q_1} b_{q_1}(O_1)) + Σ_{t=2}^{T} ln(a_{q_{t-1} q_t} b_{q_t}(O_t)) ]   (15)

then it is easily seen that

   P(O, I | λ) = exp( -U(q_1, q_2, ..., q_T) )   (16)

consequently the problem of optimal state estimation, namely,

   max over q_1, ..., q_T of P(O, I | λ)   (17)

becomes equivalent to

   min over q_1, ..., q_T of U(q_1, q_2, ..., q_T)

This reformulation now enables us to view a term like -ln(a_{q_{t-1} q_t} b_{q_t}(O_t)) as the cost associated in going from state q_{t-1} to state q_t at time t.
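The equivalence of (16) can be checked numerically: the additive -log cost recovers the product of probabilities exactly. The parameters and the chosen state sequence below are illustrative toy values:

```python
import math

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
O = [0, 1, 0]

def joint_prob(I, O):
    """P(O, I | lambda) as a direct product of probabilities."""
    p = pi[I[0]] * B[I[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[I[t-1]][I[t]] * B[I[t]][O[t]]
    return p

def U(I, O):
    """The additive cost of (15): negative log of the same product."""
    u = -math.log(pi[I[0]] * B[I[0]][O[0]])
    for t in range(1, len(O)):
        u -= math.log(A[I[t-1]][I[t]] * B[I[t]][O[t]])
    return u

I = [0, 1, 0]
print(math.exp(-U(I, O)), joint_prob(I, O))  # identical up to rounding
```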

The Viterbi algorithm to find the optimum state sequence can now be described as follows:
Suppose we are currently in state i and are considering visiting state j next. We shall say that the weight on the path from state i to state j is -ln(a_ij b_j(O_{t+1})) (i.e. the negative of the logarithm of the probability of going from state i to state j and selecting the observation symbol O_{t+1} in state j), where O_{t+1} is the observation symbol selected after visiting state j -- this is the same symbol that appears in the given observation sequence O = O_1, O_2, ..., O_T. When the initial state is selected as state i, the corresponding weight is -ln(π_i b_i(O_1)) -- we shall call this the initial weight. We define the weight of a sequence of states as the sum of the weights on the transitions between adjacent states, plus the initial weight. Note that this actually corresponds to multiplying the corresponding probabilities. Now finding the optimum state sequence is merely a matter of finding the path (i.e. a sequence of states) of minimum weight through which the given observation sequence occurs.

Note that the Viterbi Algorithm is essentially a dynamic programming approach for minimizing U(q_1, q_2, ..., q_T).

We shall illustrate with an example how the minimization is done. Consider a three-state HMM. Fig. 1(a) shows the weights assigned to the various paths as described above. The numbers in circles show the initial weights assigned to the corresponding states. To simplify understanding of the algorithm we have assumed that the weights are symmetric (i.e. the weight from state i to state j is the same as the weight from state j to state i). We have also assumed that the weights do not change with time, which in general will not be true; but once this simplified case is understood, the extension to the practical case is straightforward.

  
Figure 1: (a) The state machine. The weights for transitions between the states have been shown. The circled numbers are the initial weights of the corresponding states. (b) How a path of minimum cost is traced out using the Viterbi algorithm. The cumulative weights are shown in bold and the final minimum cost path is also shown in bold.

Consider Fig. 1(b). We take an observation sequence O_1, ..., O_4 of length four. At time t=2, state 1 can be visited from any of the three states that we had at time t=1. We find the weight on each of these paths and add it to the corresponding initial weight (we shall call this cumulative weight at state 1 the cost at state 1 at time t=2). Thus going from state 1 to state 1 the cost at state 1 is 3+6=9. Similarly the cost at state 1 in going from state 2 to state 1 is 3+4=7, and the cost in going from state 3 to state 1 is 2+3=5. Of these the cost 5 is minimum, so we retain this cost at state 1 for further calculations. The minimum cumulative-weight paths are shown in the figure by arrowed lines, and the cumulative weights are shown in bold alongside the respective states at each time instant. We repeat the same procedure for state 2 and state 3, and see that the minimum costs at state 2 and state 3 are 6 (through state 3) and 6 (through state 1) respectively.

We repeat the above procedure for t=3, but now using the costs calculated above for each state rather than the initial weights (we used the initial weights above because we started at t=1). The same procedure is repeated for t=4. Now we select the state which has the minimum cost of all the states; we see that state 1 is the required state, with a cost of 11. Tracing back the sequence of states through which we arrived at state 1 at time t=4 gives the required sequence of states, the one through which the given observation sequence has the highest probability of occurrence. As can be seen from the figure, this state sequence is state 3, state 1, state 3, state 1. This sequence has been shown in bold in the figure.

To prove our point, suppose you were given that the length of the observation sequence is required to be two and that the last state is to be state 3. What path would you choose to minimize the cost (i.e. to maximize the probability of the observation sequence)? The procedure outlined above clearly shows that we would choose the path beginning at state 1 and ending at state 3 (at t=2), as that gave the minimum cost (viz. 6) at state 3; all other paths have a higher cost. A similar argument applies if any other state were required to be the last state. Similarly, if the observation sequence were to be of length three and to end in, say, state 1, we would choose the path state 1, state 3, state 1, as outlined in the figure and described above. This means that at any given instant the path tracked up to any state by the above procedure is the minimum-cost path if we were to stop at that instant in that state. Proceeding to t=4, we have the minimum-cost paths corresponding to stopping in state 1, state 2 or state 3 respectively. We simply pick the smallest of these three, because we are interested in the minimum cost, and hence we choose to stop in state 1, which gives us the minimum cost.

Now that we know how the Viterbi Algorithm works here is how it can be implemented [1,3] :

Note: δ_t(i) denotes the weight accumulated when we are in state i at time t as the algorithm proceeds, and ψ_t(j) represents the state at time t-1 which has the lowest cost corresponding to the state transition to state j at time t. For example, in Fig. 1(b), ψ_2(1) = 3 and ψ_2(3) = 1.

  1. Initialization
     For 1 ≤ i ≤ N,

     δ_1(i) = -ln(π_i b_i(O_1))   (18)
     ψ_1(i) = 0   (19)

  2. Recursive computation
     For 2 ≤ t ≤ T, for 1 ≤ j ≤ N,

     δ_t(j) = min over 1 ≤ i ≤ N of [ δ_{t-1}(i) - ln(a_ij) ] - ln(b_j(O_t))   (20)
     ψ_t(j) = argmin over 1 ≤ i ≤ N of [ δ_{t-1}(i) - ln(a_ij) ]   (21)

  3. Termination

     P* = exp( - min over 1 ≤ i ≤ N of δ_T(i) )   (22)
     q_T* = argmin over 1 ≤ i ≤ N of δ_T(i)   (23)

  4. Tracing back the optimal state sequence
     For t = T-1, T-2, ..., 1,

     q_t* = ψ_{t+1}(q_{t+1}*)   (24)

Hence P* gives the required state-optimized probability P(O, I* | λ), and I* = q_1*, q_2*, ..., q_T* is the optimal state sequence.

A little reflection over the above steps will show that computationally the Viterbi Algorithm is similar to the forward(-backward) procedure except for the comparisons involved in finding the minimum-cost (i.e. maximum-probability) value. Hence its complexity is also of the order of N^2 T. This solves Problem 2.
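The four steps above can be sketched in NumPy using the -log weights of the text. The model parameters are the same illustrative toy values used earlier, not from the text:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
O = [0, 1, 0]

def viterbi(O, pi, A, B):
    """Minimum-cost path through -log weights; returns (best path, P(O, I* | lambda))."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))           # delta[t, i]: min accumulated weight ending in state i
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]: best predecessor of state j at step t
    delta[0] = -(np.log(pi) + np.log(B[:, O[0]]))      # initial weights
    for t in range(1, T):
        cand = delta[t - 1][:, None] - np.log(A)       # cand[i, j]: cost reaching j via i
        psi[t] = cand.argmin(axis=0)
        delta[t] = cand.min(axis=0) - np.log(B[:, O[t]])
    path = [int(delta[-1].argmin())]                   # termination: cheapest final state
    for t in range(T - 1, 0, -1):                      # trace back the optimal states
        path.append(int(psi[t][path[-1]]))
    path.reverse()
    return path, float(np.exp(-delta[-1].min()))

print(viterbi(O, pi, A, B))
```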

Problem 3

This problem deals with training the HMM so that it encodes the observation sequence in such a way that if an observation sequence having many characteristics similar to the given one is encountered later, the HMM should be able to identify it. We shall discuss two methods, depending on which probability is chosen for identification:

The Segmental K-means Algorithm :
In this method the parameters of the model λ = (A, B, π) are adjusted to maximize P(O, I* | λ), where I* is the optimum state sequence as given by the solution to Problem 2 above.
The Baum-Welch re-estimation Formulas :
Here the parameters of the model λ = (A, B, π) are adjusted so as to increase P(O | λ) until a maximum value is reached. As seen before, calculating P(O | λ) involves summing P(O, I | λ) over all possible state sequences I. Hence here our focus is not on a particular state sequence.

The Segmental K-means Algorithm ([3], [4])

This method takes us from a model λ to a model λ̄ such that P(O, I*(λ̄) | λ̄) ≥ P(O, I*(λ) | λ), where I*(λ) denotes the optimum state sequence for O = O_1, O_2, ..., O_T and the model λ, found according to the solution to Problem 2. This criterion of optimization is called the maximum state-optimized likelihood criterion, and the function P(O, I* | λ) is called the state-optimized likelihood function. In this method, a number of (training) observation sequences are required to train the model. Let there be K such sequences available. Each sequence O = O_1, O_2, ..., O_T consists of T observation symbols. Instead of K separate sequences we might be given just one very long sequence; in that case we segment it into some conveniently chosen number of short sequences, each of length T. Each observation symbol O_t is assumed [ii] to be a vector of dimension D. The algorithm then consists of the following steps:

  1. Randomly choose N observation symbols (vectors of dimension D) and assign each observation vector to that one of these N vectors from which its Euclidean distance is minimum. We have thus formed N clusters, each called a state (1 to N). This initial choice of clustering vectors does not decide the final HMM that we get, but it can decide the number of iterations required for the HMM training. In our case we just pick N equally spaced training sequences of feature vectors and pick one vector from each of these sequences: the first vector is taken as the first vector of the first of these sequences, the second as the second vector of the second sequence, and so on. Of course, this is just to make the initial choice of clusters as widely distributed as possible, and one is at liberty to choose the vectors as needed.
  2. Calculate the initial probabilities and the transition probabilities. For 1 ≤ i ≤ N,

     π_i = (number of training sequences whose first vector is in state i) / K

     For 1 ≤ i ≤ N and 1 ≤ j ≤ N,

     a_ij = (number of vectors in state i whose successor is in state j) / (number of vectors in state i having a successor)

  3. Calculate the mean vector and the covariance matrix for each state. For 1 ≤ i ≤ N,

     μ_i = (1 / N_i) Σ_{O_t in state i} O_t

     Σ_i = (1 / N_i) Σ_{O_t in state i} (O_t - μ_i)(O_t - μ_i)^T

     where N_i is the number of vectors assigned to state i.

  4. Calculate the symbol probability distribution for each training vector for each state. We assume a Gaussian distribution -- there is no particular reason for this, just change the formula below for the particular probability distribution that suits your problem. For 1 ≤ i ≤ N,

     b_i(O_t) = (2π)^{-D/2} |Σ_i|^{-1/2} exp( -(1/2) (O_t - μ_i)^T Σ_i^{-1} (O_t - μ_i) )

  5. Find the optimal state sequence I* (as given by the solution to Problem 2) for each training sequence, using the model λ = (A, B, π) computed in Steps 2 to 4 above. A vector is reassigned a state if its original assignment differs from the corresponding estimated optimum state. For example, suppose O_2 of the 5th training sequence was assigned to state 3 (i.e. the 3rd cluster), and we now find that q_2* in the optimal state sequence corresponding to the 5th training sequence is not 3 but, say, 4; then we reassign O_2 of the 5th training sequence to state 4. In general, we assign O_t (of, say, the kth training sequence) to state i if q_t* (corresponding to the kth training sequence) is state i, and this is done for all training sequences (i.e. for k = 1 to K).
  6. If any vector is reassigned a new state in Step 5, use the new assignment and repeat Step 2 through Step 6; otherwise, stop.
It can be shown [4] that the segmental K-means algorithm converges to a local maximum of the state-optimized likelihood function for a wide range of observation density functions, including the Gaussian density we have assumed. Please refer to [4] for the proof.
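Steps 2 to 4 can be sketched as follows, assuming the vectors have already been assigned to states (Step 1); the sequences and assignments below are synthetic toy data, and the Viterbi reassignment loop (Steps 5 and 6) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2, 2                     # toy values: 2 states, 2-dimensional vectors
# Hypothetical training set: 3 sequences of 6 D-dimensional vectors each,
# with a fixed toy assignment of each vector to a state (cluster).
seqs = rng.normal(size=(3, 6, D))
assign = np.tile([0, 1, 0, 1, 1, 0], (3, 1))

# Step 2: initial and transition probabilities from the assignments.
pi = np.array([(assign[:, 0] == i).mean() for i in range(N)])
A = np.zeros((N, N))
for s in assign:
    for t in range(len(s) - 1):
        A[s[t], s[t + 1]] += 1            # count transitions i -> j
A /= A.sum(axis=1, keepdims=True)         # normalize rows

# Step 3: per-state mean vector and covariance matrix.
X = seqs.reshape(-1, D)
labels = assign.reshape(-1)
mu = np.array([X[labels == i].mean(axis=0) for i in range(N)])
cov = np.array([np.cov(X[labels == i].T, bias=True) for i in range(N)])

# Step 4: Gaussian emission density b_i(x) for an observation vector x.
def b(i, x):
    diff = x - mu[i]
    inv = np.linalg.inv(cov[i])
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov[i]))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

print(pi, A, b(0, X[0]))
```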

The Baum-Welch re-estimation Formulas [1]

This method assumes an initial HMM, which is improved upon by using the formulas given below so as to maximize P(O | λ). An initial HMM can be constructed in any way, but we may use the first five steps of the Segmental K-means Algorithm described above to give us a reasonable initial estimate of the HMM. Henceforth we shall assume that an initial HMM is known. This algorithm maximizes P(O | λ) by adjusting the parameters of λ. This optimization criterion is called the maximum likelihood criterion, and the function P(O | λ) is called the likelihood function. Before we get down to the actual formulas, we shall introduce some concepts and notation that will be required. Consider γ_t(i) defined as follows:

   γ_t(i) = P(q_t = i | O, λ)   (31)

i.e. the probability of being in state i at time t given the observation sequence O = O_1, O_2, ..., O_T and the model λ. By Bayes' law we have:

   γ_t(i) = P(q_t = i, O | λ) / P(O | λ) = α_t(i) β_t(i) / P(O | λ)   (32)

where α_t(i) and β_t(i) are defined by (5) and (10) respectively. α_t(i) accounts for O_1, ..., O_t and state i at time t, and β_t(i) accounts for O_{t+1}, ..., O_T given state i at time t.

We also define ξ_t(i, j) as:

   ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)   (33)

i.e. the probability of being in state i at time t and making a transition to state j at time t+1, given the observation sequence O = O_1, O_2, ..., O_T and the model λ.

Using Bayes' law it is seen that

   ξ_t(i, j) = P(q_t = i, q_{t+1} = j, O | λ) / P(O | λ)   (34)

But O = O_1, ..., O_t, O_{t+1}, ..., O_T. Hence

   P(q_t = i, q_{t+1} = j, O | λ) = P(O_1 ... O_t, q_t = i | λ) P(O_{t+1} ... O_T, q_{t+1} = j | q_t = i, λ)   (35)

Consider the second term. The observation symbol at time t+1 is O_{t+1}, at which time we are required to be in state j; this means that the observation symbol O_{t+1} has to be picked up from state j. Hence we have

   P(O_{t+1} ... O_T, q_{t+1} = j | q_t = i, λ) = P(q_{t+1} = j | q_t = i) P(O_{t+1} | q_{t+1} = j) P(O_{t+2} ... O_T | q_{t+1} = j, λ)   (36)
                                                = a_ij b_j(O_{t+1}) β_{t+1}(j)   (37)

Hence combining (5), (32), (34) and (37) we get

   ξ_t(i, j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)   (38)

Here α_t(i) accounts for O_1, ..., O_t, a_ij accounts for the transition to state j, b_j(O_{t+1}) picks up the symbol O_{t+1} from state j, and β_{t+1}(j) accounts for the remaining observation sequence O_{t+2}, ..., O_T.

If we sum γ_t(i) from t=1 to T, we get a quantity which can be viewed as the expected number of times state i is visited; if we sum only up to T-1, we get the expected number of transitions out of state i (as no transition is made at t=T). Similarly, if ξ_t(i, j) is summed from t=1 to T-1, we get the expected number of transitions from state i to state j. Hence

   Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions out of state i

   Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from state i to state j
Now we are prepared to present the Baum-Welch re-estimation formulas:

   π̄_i = γ_1(i)

   ā_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

   b̄_j(k) = Σ_{t : O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

The re-estimation formula for π̄_i is simply the probability of being in state i at time t=1. The formula for ā_ij is the ratio of the expected number of times of making a transition from state i to state j to the expected number of times of making a transition out of state i. The formula for b̄_j(k) is the ratio of the expected number of times of being in state j and observing the symbol v_k to the expected number of times of being in state j. Note that the summation in the denominator of the last formula goes up to t=T.

If we denote the initial model by λ and the re-estimated model by λ̄, consisting of the parameters estimated above, then it can be shown that either:

  1. the initial model λ is a critical point of the likelihood function, in which case λ̄ = λ, or
  2. P(O | λ̄) > P(O | λ), i.e. we have found a better model from which the observation sequence
    O = O_1, O_2, ..., O_T is more likely to have been produced.
Hence we can go on iteratively replacing λ by λ̄ until P(O | λ) is maximized. This solves Problem 3.
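One full re-estimation pass can be sketched with the forward and backward variables. The model and sequence below are illustrative toy values; note that the re-estimated model never decreases the likelihood, which is the guarantee stated above:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
O = [0, 1, 0, 0, 1]

def forward_backward(O, pi, A, B):
    T, N = len(O), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return alpha, beta

def reestimate(O, pi, A, B):
    """One Baum-Welch update of (pi, A, B); also returns P(O | old model)."""
    T, N = len(O), len(pi)
    alpha, beta = forward_backward(O, pi, A, B)
    prob = alpha[-1].sum()                       # P(O | lambda)
    gamma = alpha * beta / prob                  # gamma[t, i]
    xi = np.array([alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1]
                   for t in range(T - 1)]) / prob
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.array([[gamma[[t for t in range(T) if O[t] == k], j].sum()
                       for k in range(B.shape[1])] for j in range(N)])
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, prob

pi2, A2, B2, p0 = reestimate(O, pi, A, B)
p1 = forward_backward(O, pi2, A2, B2)[0][-1].sum()
print(p0, p1)   # the likelihood does not decrease
```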

We have seen that the Baum-Welch algorithm maximizes P(O | λ), given by (4). Each factor in the summand of this equation is between 0 and 1, and hence the summand (which is a product of many such factors) is going to be very small; and there are as many such small terms as the number of possible state sequences. For a computer to be able to compute such terms individually, it should be able to store within its precision the smallest of these terms. If T=100 there would be around 200 numbers (each between 0 and 1) to be multiplied, and hence each term could comfortably go beyond the precision of any computer. Taking logarithms does not help here, since P(O | λ) is given by a summation of such small terms. Perhaps some clever scaling procedure could be used to get over this problem.

On the other hand, the Segmental K-means algorithm maximizes P(O, I* | λ), which is calculated using the Viterbi Algorithm as given by (18)-(24). Here we can use logarithms to take care of the product of many small numbers, as no summation is involved. The Segmental K-means algorithm is also preferred because most of the time we are interested in the occurrence of the observation sequence from the best possible (i.e. the optimum) state sequence, and considering the occurrence of the observation sequence through all possible state sequences is not the intended representation of the problem in most cases. Moreover, the Segmental K-means algorithm requires much less computation than the Baum-Welch method. For all these reasons we have used the Segmental K-means algorithm in our work.

It is hoped that this tutorial will serve as a good introduction to HMMs for anyone wishing to venture further into their realm.

References

1
L. R. Rabiner and B. H. Juang, ``An introduction to hidden Markov models,'' IEEE ASSP Mag., pp. 4--16, Jun. 1986.

2
G. D. Forney, Jr., ``The Viterbi algorithm,'' Proc. IEEE, vol. 61, no. 3, pp. 263--278, Mar. 1973.

3
Y. He and A. Kundu, ``2-D shape classification using HMMs,'' IEEE Trans. Patt. Anal. Machine Intell., vol. 13, no. 11, pp. 1172--1184, Nov. 1991.

4
B. H. Juang and L. R. Rabiner, ``The segmental K-means algorithm for estimating the parameters of hidden Markov models,'' IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-38, no. 9, pp. 1639--1641, Sept. 1990.


Footnotes

[i] On similar lines it can be seen that the number of additions involved is N(N-1)(T-1) + (N-1).

[ii] If the observation symbols aren't actually given as numbers or vectors, we shall have to characterize each of them with a vector.


