Rakesh Dugad and U. B. Desai
Signal Processing and Artificial Neural Networks Laboratory
Department of Electrical Engineering
Indian Institute of Technology --- Bombay
Powai, Bombay 400 076,
India
Contact Email : dugad@uiuc.edu, ubdesai@ee.iitb.ernet.in
Technical Report No. : SPANN-96.1
May 1996
Suppose a person has, say, three coins and is sitting inside a closed room tossing them in some sequence; all you are shown (on a display outside the room) is the outcome of his tosses, e.g. TTHTHHTT... This will be called the observation sequence. You know neither the sequence in which he tosses the different coins nor the bias of each coin. To appreciate how much the outcome depends on the individual biases and on the order of tossing the coins, suppose you are told that the third coin is heavily biased towards heads and that all coins are tossed with equal probability. Then we naturally expect far more heads than tails in the output sequence. If, however, it is also given that, besides the bias, the probability of going to the third coin (state) from either the first or the second coin (state) is zero, then, assuming we begin in the first or second state, heads and tails will appear with almost equal probability in spite of the bias. So we see that the output sequence depends very much on the individual biases, on the transition probabilities between the various states, and on which state is chosen to begin the observations. These three sets, namely the individual biases of the three coins, the transition probabilities from one coin to the next, and the initial probabilities of choosing each state, characterize what is called a HIDDEN MARKOV MODEL (HMM) for this coin-tossing experiment. We shall shortly see the problems, and their solutions, that come under the framework of HMMs.
In the above experiment the outcomes of each coin toss, T or H, are called the observation symbols. Because we considered coins, only two observation symbols were possible. To be more general, consider a set of N urns, each containing a number of marbles. The marbles come in M distinct colors, and each urn contains marbles of these colors in (possibly) different proportions. Our experiment consists of drawing marbles from these urns in some sequence; only the sequence of marbles drawn is shown to us. Marbles here correspond to the T or H, and urns to the coins of the previous experiment. We now define the following notation for our model:
Using the model, an observation sequence $O = o_1, o_2, \ldots, o_T$ is generated as follows:
We start our experiment by choosing one of the urns (according to the initial probability distribution $\pi$), then we choose a marble (observation symbol) from this urn; this beginning instant of time is taken as $t=1$, and the state and observation symbol chosen at $t=1$ are denoted by $q_1$ and $o_1$ respectively. After this we choose an urn (possibly the same as the urn at $t=1$) according to the transition probability distribution $A$, and again select a marble (denoted by $o_2$) from this urn according to the observation symbol probability distribution $b_j(k)$ for that urn (state). Continuing this up to time $t=T$ generates the observation sequence $O = o_1, o_2, \ldots, o_T$.
Most applications of HMMs ultimately reduce to solving three main problems. These are:

Problem 1 : Given the model $\lambda$, compute the probability $P(O|\lambda)$ of a given observation sequence.

Problem 2 : Given the model $\lambda$ and an observation sequence $O$, find the state sequence that best explains the observations.

Problem 3 : Adjust the parameters of the model $\lambda = (A, B, \pi)$ so as to best account for the given observation sequence(s).
Problems 1 and 2 can be viewed as analysis problems while Problem 3 is a typical synthesis (or model identification or training) problem.
A most straightforward way to determine $P(O|\lambda)$ is to find $P(O|I,\lambda)$ for a fixed state sequence $I = q_1, q_2, \ldots, q_T$, multiply it by $P(I|\lambda)$, and then sum over all possible $I$'s. We have

$P(O|I,\lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T), \qquad P(I|\lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$

Hence we have:

$P(O|\lambda) = \sum_{I} P(O|I,\lambda)\, P(I|\lambda) = \sum_{q_1, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$   (4)

where $I = q_1, q_2, \ldots, q_T$.
From Eq. (4) we see that the summand of this equation involves $2T-1$ multiplications and there exist $N^T$ distinct possible state sequences $I$. Hence a direct computation from (4) will involve of the order of $2TN^T$ multiplications. Even for small values, $N=5$ and $T=100$, this means approximately $10^{72}$ multiplications, which would take aeons to complete even for a supercomputer. Hence we see that a more efficient procedure is required to solve Problem 1; such a procedure exists and is called the forward-backward procedure.
Consider the forward variable $\alpha_t(i)$ defined as:

$\alpha_t(i) = P(o_1, o_2, \ldots, o_t,\, q_t = i \mid \lambda)$   (5)

i.e. the probability of the partial observation sequence up to time $t$ and state $i$ at time $t$, given the model $\lambda$. $\alpha_t(i)$ can be computed inductively as follows (we have dropped the dependence on $\lambda$ for convenience):

1. Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$

2. Induction: $\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(o_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N$   (7)

3. Termination: $P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$   (8)
The sequence $o_1, \ldots, o_t, o_{t+1}$ with $q_{t+1} = j$ occurs as: first $o_1, \ldots, o_t$, then state $j$, then symbol $o_{t+1}$. But $(o_1, \ldots, o_t,\, q_{t+1} = j)$ can occur through any of the following mutually exclusive and exhaustive ways: state 1 at time $t$ then state $j$, state 2 at time $t$ then state $j$, and so on up to state $N$. We know that if $\{A_i\}$ is a set of mutually exclusive and exhaustive events, then for any event $E$ we have

$P(E) = \sum_i P(E \mid A_i)\, P(A_i)$   (9)

Hence here we can write

$P(o_1, \ldots, o_{t+1},\, q_{t+1} = j) = \sum_{i=1}^{N} P(o_1, \ldots, o_t,\, q_t = i)\, a_{ij}\, b_j(o_{t+1}) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(o_{t+1})$

which is the same as (7) above. Hence this simple example gives an insight into the mathematical justification of (7).
As for (8), using (9) we can write

$P(O) = \sum_{i=1}^{N} P(O \mid q_T = i)\, P(q_T = i) = \sum_{i=1}^{N} P(O,\, q_T = i) = \sum_{i=1}^{N} \alpha_T(i)$

Hence proved.
Now let us examine the number of computations involved in this algorithm. Step 1 involves $N$ multiplications. In Step 2 the summation involves $N$ multiplications, plus one for the out-of-bracket $b_j(o_{t+1})$ term; this has to be done for $j = 1$ to $N$ and $t = 1$ to $T-1$, making the total number of multiplications in Step 2 equal to $(N+1)N(T-1)$. Step 3 involves no multiplications. Hence the total number of multiplications is $N + N(N+1)(T-1)$, i.e. of the order of $N^2 T$, as compared to $2TN^T$ required for the direct method. For $N=5$ and $T=100$ we need about 3000 computations for the forward method as compared to $10^{72}$ required by the direct method, a saving of about 69 orders of magnitude!
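The three steps of the forward procedure can be sketched as follows. This is a minimal illustration in our own notation, not the report's code: `alpha[t][i]` stands for $\alpha_{t+1}(i)$ with 0-based indexing, and symbols are integer indices.

```python
def forward(pi, A, B, obs):
    """Forward procedure. Returns (alpha, P(O | model)).

    pi : list of N initial state probabilities
    A  : N x N transition matrix
    B  : N x M emission matrix
    obs: observation sequence as a list of symbol indices
    """
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    # Step 1, initialization: alpha_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    # Step 2, induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for t in range(T - 1):
        for j in range(N):
            alpha[t + 1][j] = sum(alpha[t][i] * A[i][j] for i in range(N)) * B[j][obs[t + 1]]
    # Step 3, termination: P(O) = sum_i alpha_T(i)
    return alpha, sum(alpha[T - 1])
```

For short sequences the result can be checked directly against the brute-force sum of Eq. (4) over all $N^T$ state sequences, which is exactly what the recursion avoids computing.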
In a similar manner we may define a backward variable $\beta_t(i)$ as:

$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i,\, \lambda)$   (10)

i.e. the probability of the observation sequence from $t+1$ to $T$, given state $i$ at time $t$ and the model $\lambda$. Note that here $q_t = i$ has already been given (it wasn't in the case of the forward variable). This distinction has been made so as to be able to combine the forward and the backward variables to produce useful results, as we shall see soon. We can solve for $\beta_t(i)$ just as easily as for $\alpha_t(i)$:

1. Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$

2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, T-2, \ldots, 1$   (12)

3. Termination: $P(O|\lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$   (13)

The proof of (12) and (13) is similar to the one given for (7) and (8) and is left as an exercise for the reader. The computation of $P(O|\lambda)$ using $\beta_t(i)$ also involves of the order of $N^2 T$ calculations. Hence the forward and the backward methods are equally efficient for the computation of $P(O|\lambda)$. This solves Problem 1.
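The backward recursion can be sketched in the same style as the forward one (again a minimal illustration in our own notation, with `beta[t][i]` standing for $\beta_{t+1}(i)$, 0-based):

```python
def backward(pi, A, B, obs):
    """Backward procedure. Returns (beta, P(O | model)).

    pi : list of N initial state probabilities
    A  : N x N transition matrix
    B  : N x M emission matrix
    obs: observation sequence as a list of symbol indices
    """
    N, T = len(pi), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    # Initialization: beta_T(i) = 1
    for i in range(N):
        beta[T - 1][i] = 1.0
    # Induction, running backwards from t = T-1 down to t = 1:
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # Termination: P(O) = sum_i pi_i * b_i(o_1) * beta_1(i)
    p = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
    return beta, p
```

The returned probability agrees with the forward procedure and with brute-force enumeration over all state sequences, confirming that either recursion solves Problem 1.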
Here we have to find a state sequence $I = q_1, q_2, \ldots, q_T$ such that the probability of occurrence of the observation sequence $O = o_1, o_2, \ldots, o_T$ from this state sequence is greater than that from any other state sequence. In other words, our problem is to find the $I$ that maximizes $P(O, I \mid \lambda)$. There is a famous algorithm to do this, called the Viterbi Algorithm [2]. It is an inductive algorithm in which at each instant you keep the best (i.e. the one giving maximum probability) state sequence ending in each of the $N$ states, as the intermediate states for the desired observation sequence $O = o_1, o_2, \ldots, o_T$. In this way you finally have the best path ending in each of the $N$ states as the last state for the desired observation sequence. Out of these, we select the one which has the highest probability.
In order to get an idea of the Viterbi algorithm as applied to the optimum state estimation problem, a simple reformulation of the problem will be useful. Consider the expression for $P(O, I \mid \lambda)$; from (1)-(2) we have

$P(O, I \mid \lambda) = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)$

Now define

$U(q_1, q_2, \ldots, q_T) = -\Big[ \ln \pi_{q_1} + \ln b_{q_1}(o_1) + \sum_{t=2}^{T} \big( \ln a_{q_{t-1} q_t} + \ln b_{q_t}(o_t) \big) \Big]$

then it is easily seen that

$P(O, I \mid \lambda) = \exp\big( -U(q_1, q_2, \ldots, q_T) \big)$

consequently the problem of optimal state estimation, namely

$\max_{I} P(O, I \mid \lambda)$

becomes equivalent to

$\min_{q_1, \ldots, q_T} U(q_1, q_2, \ldots, q_T)$

This reformulation now enables us to view terms like $-\big( \ln a_{q_{t-1} q_t} + \ln b_{q_t}(o_t) \big)$ as the cost associated with going from state $q_{t-1}$ to state $q_t$ at time $t$.
The Viterbi algorithm to find the optimum state sequence can now be described as follows. Suppose we are currently in state $i$ and we are considering visiting state $j$ next. We shall say that the weight on the path from state $i$ to state $j$ is $-\big( \ln a_{ij} + \ln b_j(o_{t+1}) \big)$ (i.e. the negative of the logarithm of the probability of going from state $i$ to state $j$ and selecting the observation symbol $o_{t+1}$ in state $j$), where $o_{t+1}$ is the observation symbol selected after visiting state $j$; this is the same symbol that appears in the given observation sequence $O = o_1, o_2, \ldots, o_T$. When the initial state is selected as state $i$, the corresponding weight is $-\big( \ln \pi_i + \ln b_i(o_1) \big)$; we shall call this the initial weight. We define the weight of a sequence of states as the sum of the weights on the adjacent transitions. Note that this actually corresponds to multiplying the corresponding probabilities. Now finding the optimum sequence is merely a matter of finding the path (i.e. a sequence of states) of minimum weight through which the given observation sequence occurs. Note that the Viterbi Algorithm is essentially a dynamic programming approach for minimizing $U(q_1, q_2, \ldots, q_T)$.
We shall illustrate with an example how the minimization is done. Consider a three-state HMM. Fig. 1(a) shows the weights assigned to the various paths as described above. The numbers in circles show the initial weights assigned to the corresponding states. To simplify understanding of the algorithm we have assumed that the weights are symmetric (i.e. the weight from state i to state j is the same as the weight from state j to state i). We have also assumed that the weights do not change with time, which in general will not be true; but once this simplified case is understood, the extension to the practical case is straightforward.
Figure 1: (a) The state machine. The weights for transitions between the
states have been shown. The circled numbers are the initial weights
of the corresponding states. (b) How a path of minimum cost is traced
out using the Viterbi algorithm. The cumulative weights are shown in
bold and the final minimum cost path is also shown in bold.
Consider Fig. 1(b). We take an observation sequence $o_1, \ldots, o_4$ of length four. At time $t=2$, state 1 can be visited from any of the three states that we had at time $t=1$. We find the weights on each of these paths and add the corresponding weights to the initial weights (we shall call this cumulative weight at state 1 the cost at state 1 at time $t=2$). Thus, going from state 1 to state 1, the cost at state 1 is 3+6=9. Similarly, the cost at state 1 in going from state 2 to state 1 is 3+4=7, and the cost in going from state 3 to state 1 is 2+3=5. Of these, the cost 5 is minimum. Hence we retain this cost at state 1 for further calculations. The minimum cumulative weight paths are shown in the figure by arrowed lines. The cumulative weights have been shown in bold alongside the respective states at each time instant.
We repeat the same procedure for state 2 and state 3. We see that the minimum costs at state 2 and state 3 are 6 (through state 3) and 6 (through state 1) respectively.
We repeat the above procedure again for $t=3$, but now using the costs calculated above for each state rather than the initial weights (we used those above because we started with $t=1$). The same procedure is repeated for $t=4$. Now select the state which has the minimum cost of all the states. We see that state 1 is the required state, with a cost of 11. Back-tracing the sequence of states through which we arrived at state 1 at time $t=4$ gives the required sequence of states through which the given observation sequence has the highest probability of occurrence. As can be seen from the figure, this state sequence is state 3, state 1, state 3, state 1. This sequence has been shown in bold in the figure.
To prove our point, suppose you were given that the length of the observation sequence is required to be two and that the last state is to be state 3. What path would you choose to minimize the cost (i.e. to maximize the probability of the observation sequence)? The procedure outlined above clearly shows that we would choose the path beginning at state 1 and ending at state 3 (at $t=2$), as that gave the minimum cost (viz. 6) at state 3. All other paths have a higher cost. A similar argument applies if any other state were required to be the last state. Similarly, if the observation sequence were to be of length three and ending in, say, state 1, we would choose the path state 1, state 3, state 1, as outlined in the figure and described above. This means that at any given instant, the path tracked up to any state by the above procedure is the minimum cost path if we were to stop at that instant in that state. Proceeding to $t=4$, we see that we have the minimum cost paths corresponding to stopping in state 1, state 2 or state 3 respectively. We just pick the smallest of these three, because we are interested in the minimum cost, and hence we choose to stop in state 1, which gives us the minimum cost.
Now that we know how the Viterbi Algorithm works here is how it can be implemented [1,3] :
Note: $\delta_t(i)$ denotes the weight accumulated when we are in state $i$ at time $t$ as the algorithm proceeds, and $\psi_t(j)$ represents the state at time $t-1$ which has the lowest cost corresponding to the state transition to state $j$ at time $t$. For example, in Fig. 1(b), $\psi_2(1) = 3$ and $\psi_2(3) = 1$. A little reflection over the above steps will show that computationally the Viterbi Algorithm is similar to the forward(-backward) procedure, except for the comparisons involved in finding the minimum-cost term. Hence its complexity is also of the order of $N^2 T$. This solves Problem 2.
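The weight-based formulation above can be sketched as follows. This is a minimal illustration, not the report's own implementation (which is given by its equations (18)-(24)); the variable names `delta` and `psi` mirror the $\delta_t(i)$ and $\psi_t(j)$ notation.

```python
import math

def viterbi(pi, A, B, obs):
    """Viterbi algorithm in the negative-log (weight) domain.

    Returns the minimum-weight state path, i.e. the sequence I that
    maximizes P(O, I | model), as a list of 0-based state indices.
    """
    N, T = len(pi), len(obs)
    INF = float("inf")

    def w(p):
        # Weight = negative log probability, guarding against zero.
        return -math.log(p) if p > 0 else INF

    # Initial weights: -(ln pi_i + ln b_i(o_1))
    delta = [w(pi[i]) + w(B[i][obs[0]]) for i in range(N)]
    psi = [[0] * N for _ in range(T)]   # psi[t][j]: best predecessor of j at time t
    for t in range(1, T):
        new = [INF] * N
        for j in range(N):
            costs = [delta[i] + w(A[i][j]) for i in range(N)]
            best = min(range(N), key=lambda i: costs[i])
            psi[t][j] = best
            new[j] = costs[best] + w(B[j][obs[t]])
        delta = new
    # Back-trace from the cheapest final state.
    q = min(range(N), key=lambda i: delta[i])
    path = [q]
    for t in range(T - 1, 0, -1):
        q = psi[t][q]
        path.append(q)
    path.reverse()
    return path
```

For a small model the returned path can be checked against an exhaustive search over all state sequences, which confirms that the minimum-weight path is indeed the maximum-probability one.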
This problem deals with training the HMM so that it encodes the observation sequence in such a way that if an observation sequence with many characteristics similar to the given one is encountered later, the model is able to identify it. There are two methods that we shall discuss, depending on which probability is chosen for identification:
The parameters of the model $\lambda = (A, B, \pi)$ are adjusted so as to increase $P(O|\lambda)$ until a maximum value is reached. As seen before, calculating $P(O|\lambda)$ involves summing up $P(O, I \mid \lambda)$ over all possible state sequences $I$. Hence here our focus is not on a particular state sequence.
This method takes us from $\lambda$ to $\bar{\lambda}$ such that

$P(O, \bar{I}^* \mid \bar{\lambda}) \ge P(O, I^* \mid \lambda)$

where $I^*$ is the optimum state sequence for $O = o_1, o_2, \ldots, o_T$ and $\lambda$, found according to the solution to Problem 2. This criterion of optimization is called the maximum state optimized likelihood criterion. The function $P(O, I^* \mid \lambda)$ is called the state optimized likelihood function.
In this method, a number of (training) observation sequences are required to train the model. Let there be, say, $L$ such sequences available. Each sequence $O = o_1, o_2, \ldots, o_T$ consists of $T$ observation symbols. Instead of $L$ separate sequences we might be given just one very long sequence; in that case we segment it into some conveniently chosen number of short sequences, each of length $T$. Each observation symbol $o_t$ is assumed to be a vector of dimension $D$ ($o_t \in \mathbb{R}^D$).
The algorithm then consists, in outline, of the following steps:

1. Cluster all the training vectors into $N$ clusters using the K-means algorithm; each cluster corresponds to one state.
2. Estimate the initial probabilities $\pi_i$ and the transition probabilities $a_{ij}$ from the state assignments of the training vectors.
3. Estimate the observation symbol probabilities $b_j(k)$ for each state from the vectors assigned to it.
4. For each training sequence, find the optimum state sequence under the current model according to the solution to Problem 2.
5. Reassign each vector to the state given by its optimum state sequence, and re-estimate the parameters as in steps 2 and 3.
6. If any vector changed its state assignment in step 5, go to step 4; otherwise stop.
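One re-estimation pass of this kind can be sketched for the simpler case of discrete observation symbols (our own simplification for illustration; the full Segmental K-means algorithm operates on $D$-dimensional vectors and uses K-means clustering for initialization). Each sequence is aligned with its optimum state path, and the parameters are re-estimated by relative frequencies along those paths.

```python
import math

def viterbi_path(pi, A, B, obs):
    """Minimum-weight (maximum-probability) state path, as in Problem 2."""
    N = len(pi)
    w = lambda p: -math.log(p) if p > 0 else float("inf")
    delta = [w(pi[i]) + w(B[i][obs[0]]) for i in range(N)]
    psi = []
    for t in range(1, len(obs)):
        step, new = [], []
        for j in range(N):
            best = min(range(N), key=lambda i: delta[i] + w(A[i][j]))
            step.append(best)
            new.append(delta[best] + w(A[best][j]) + w(B[j][obs[t]]))
        psi.append(step)
        delta = new
    q = min(range(N), key=lambda i: delta[i])
    path = [q]
    for step in reversed(psi):
        q = step[q]
        path.append(q)
    return path[::-1]

def segmental_kmeans_step(pi, A, B, sequences):
    """One re-estimation pass in the spirit of the Segmental K-means
    algorithm: align every training sequence with its optimum state path
    under the current model, then re-estimate pi, A and B by counting."""
    N, M = len(pi), len(B[0])
    eps = 1e-6  # smoothing so that no probability collapses to exactly zero
    pi_c = [eps] * N
    A_c = [[eps] * N for _ in range(N)]
    B_c = [[eps] * M for _ in range(N)]
    for obs in sequences:
        path = viterbi_path(pi, A, B, obs)
        pi_c[path[0]] += 1
        for t, (s, o) in enumerate(zip(path, obs)):
            B_c[s][o] += 1
            if t + 1 < len(path):
                A_c[s][path[t + 1]] += 1
    norm = lambda row: [x / sum(row) for x in row]
    return norm(pi_c), [norm(r) for r in A_c], [norm(r) for r in B_c]
```

Iterating this step until the state assignments stop changing corresponds to the loop between steps 4 and 6 above.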
This method assumes an initial HMM, which is improved upon by using the formulas given below so as to maximize $P(O|\lambda)$. An initial HMM can be constructed in any way, but we may use the first five steps of the Segmental K-means Algorithm described above to give a reasonable initial estimate of the HMM. Henceforth we shall assume that an initial HMM is known. This algorithm maximizes $P(O|\lambda)$ by adjusting the parameters of $\lambda$. This optimization criterion is called the maximum likelihood criterion. The function $P(O|\lambda)$ is called the likelihood function.
Before we get down to the actual formulas, we shall introduce some concepts and notation that will be required in the final formulas. Consider $\gamma_t(i)$ defined as follows:

$\gamma_t(i) = P(q_t = i \mid O, \lambda)$

i.e. the probability of being in state $i$ at time $t$, given the observation sequence $O = o_1, o_2, \ldots, o_T$ and the model $\lambda$. By Bayes' law we have:

$\gamma_t(i) = \frac{P(q_t = i,\, O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$   (32)

where $\alpha_t(i)$ and $\beta_t(i)$ are defined by (5) and (10) respectively. $\alpha_t(i)$ accounts for $o_1, \ldots, o_t$ and state $i$ at time $t$, and $\beta_t(i)$ accounts for $o_{t+1}, \ldots, o_T$ given state $i$ at time $t$.
We also define $\xi_t(i,j)$ as:

$\xi_t(i,j) = P(q_t = i,\, q_{t+1} = j \mid O, \lambda)$

i.e. the probability of being in state $i$ at time $t$ and making a transition to state $j$ at time $t+1$, given the observation sequence $O = o_1, o_2, \ldots, o_T$ and the model $\lambda$. Using Bayes' law it is seen that

$\xi_t(i,j) = \frac{P(q_t = i,\, q_{t+1} = j,\, O \mid \lambda)}{P(O \mid \lambda)}$   (34)

But $O = o_1, \ldots, o_t, o_{t+1}, \ldots, o_T$. Hence

$P(q_t = i,\, q_{t+1} = j,\, O \mid \lambda) = P(q_t = i,\, o_1, \ldots, o_t \mid \lambda)\; P(q_{t+1} = j,\, o_{t+1}, \ldots, o_T \mid q_t = i,\, \lambda)$

Consider the second term. The observation symbol at time $t+1$ is $o_{t+1}$, at which time we are required to be in state $j$; this means that the observation symbol $o_{t+1}$ has to be picked up from state $j$. Hence we have

$P(q_{t+1} = j,\, o_{t+1}, \ldots, o_T \mid q_t = i,\, \lambda) = a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$   (37)

Hence, combining (5), (32), (34) and (37), we get

$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$
Here $\alpha_t(i)$ accounts for $o_1, \ldots, o_t$, $a_{ij}$ accounts for the transition to state $j$, $b_j(o_{t+1})$ picks up the symbol $o_{t+1}$ from state $j$, and $\beta_{t+1}(j)$ accounts for the remaining observation sequence $o_{t+2}, \ldots, o_T$.
If we sum up $\gamma_t(i)$ from $t=1$ to $T$, we get a quantity which can be viewed as the expected number of times state $i$ is visited; if we sum only up to $T-1$, we get the expected number of transitions out of state $i$ (as no transition is made at $t=T$). Similarly, if $\xi_t(i,j)$ is summed from $t=1$ to $T-1$, we get the expected number of transitions from state $i$ to state $j$. Hence

$\sum_{t=1}^{T-1} \gamma_t(i) = \text{expected number of transitions out of state } i$

$\sum_{t=1}^{T-1} \xi_t(i,j) = \text{expected number of transitions from state } i \text{ to state } j$
Now we are prepared to present the Baum-Welch re-estimation formulas:

$\bar{\pi}_i = \gamma_1(i)$

$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

$\bar{b}_j(k) = \frac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$

The re-estimation formula for $\pi_i$ is simply the probability of being in state $i$ at time $t=1$. The formula for $\bar{a}_{ij}$ is the ratio of the expected number of transitions from state $i$ to state $j$ to the expected number of transitions out of state $i$. The formula for $\bar{b}_j(k)$ is the ratio of the expected number of times of being in state $j$ and observing symbol $v_k$ to the expected number of times of being in state $j$. Note that the summation in the last formula goes up to $t=T$.
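One full re-estimation pass can be sketched for a single discrete observation sequence as follows (a minimal illustration in our own variable names; the forward and backward recursions are inlined so the block is self-contained):

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation pass over a single observation sequence.

    Returns (new_pi, new_A, new_B) computed from the gamma and xi
    quantities defined in the text.
    """
    N, T, M = len(pi), len(obs), len(B[0])
    # Forward and backward variables (0-based indexing).
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
        beta[T - 1][i] = 1.0
    for t in range(T - 1):
        for j in range(N):
            alpha[t + 1][j] = sum(alpha[t][i] * A[i][j] for i in range(N)) * B[j][obs[t + 1]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    p = sum(alpha[T - 1])  # P(O | model)
    # gamma_t(i) = alpha_t(i) beta_t(i) / P(O);  xi_t(i,j) as in the text.
    gamma = [[alpha[t][i] * beta[t][i] / p for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # Re-estimation formulas.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B
```

By the result stated below, repeating this step can only leave the likelihood unchanged (at a critical point) or increase it.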
If we denote the initial model by $\lambda$ and the re-estimated model by $\bar{\lambda}$, consisting of the parameters estimated above, then it can be shown that either: (i) $\lambda = \bar{\lambda}$, i.e. $\lambda$ is a critical point of the likelihood function, or (ii) $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$, i.e. the re-estimated model is more likely to have produced the given observation sequence.
We have seen that the Baum-Welch algorithm maximizes $P(O|\lambda)$, given by (4). Each term in the summand of this equation is between 0 and 1, and hence the summand (which is a product of many such terms) is going to be very small; and there are as many such small terms as the number of possible state sequences. For a computer to be able to compute such terms individually, it should be able to store within its precision the smallest of these terms. If $T=100$, there would be around 200 numbers (each between 0 and 1) to be multiplied, and hence each term could comfortably go beyond the precision of any computer. Taking the logarithm does not help here, since $P(O|\lambda)$ is given by a summation of such small terms. Perhaps some clever scaling procedure could be used to get over this problem.
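One such scaling procedure, standard in the HMM literature (our own sketch, not the report's): normalize the forward variables at every time step and accumulate the logarithms of the normalizing constants. The sum inside each step then involves numbers of moderate size, while the log-likelihood is recovered exactly.

```python
import math

def forward_log_likelihood(pi, A, B, obs):
    """Scaled forward pass returning log P(O | model).

    At each step alpha is renormalized to sum to 1; the product of the
    scale factors equals P(O | model), so summing their logs avoids
    underflow even for very long sequences.
    """
    N, T = len(pi), len(obs)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    c = sum(alpha)
    alpha = [a / c for a in alpha]
    log_p = math.log(c)
    for t in range(1, T):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
        c = sum(alpha)
        alpha = [a / c for a in alpha]
        log_p += math.log(c)
    return log_p
```

For a sequence with $T=1000$ the unscaled forward variables would underflow to zero in double precision, yet the scaled version returns a finite log-likelihood, and for short sequences it agrees with the unscaled computation.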
On the other hand, the Segmental k-means algorithm maximizes the state optimized likelihood $P(O, I^* \mid \lambda)$, which is calculated using the Viterbi Algorithm as given by (18)-(24). Here we can use the logarithm to take care of the product of many small numbers, as no summation is involved. The Segmental k-means algorithm is also preferred because most of the time we are interested in the occurrence of the observation sequence through the best possible (i.e. the optimum) state sequence, and considering its occurrence through all possible state sequences is not the intended representation of the problem in most cases. Moreover, the Segmental k-means algorithm requires much less computation than the Baum-Welch method. For all these reasons we have used the Segmental k-means algorithm in our work.
It is hoped that this will serve as a good introduction to HMMs for anyone wishing to venture further into the realm of HMMs.
A TUTORIAL ON HIDDEN MARKOV MODELS