Suppose the observed state is j and the action taken is a. Two results ensue:
- A return r(j, a) is realized.
- The system moves to a new state.
Let:
- Y_n = the state in the nth period,
- A_n = the action taken on the basis of (Y_0, Y_1, ..., Y_n) [A_0 is the initial action, based on the initial state Y_0].

A policy π is a set of functions (π_0, π_1, ...), such that A_n = π_n(Y_0, Y_1, ..., Y_n).
The expected return under policy π, when Y_0 = j, is

  G(π, j) = E[R_0 + R_1 + ... + R_N | Y_0 = j]

where R_n = r(Y_n, A_n) is the return realized in the nth period. The goal is to determine π to maximize G(π, j).
Let Z_n = (Y_n, A_n) and W_n = (Z_0, Z_1, ..., Z_n). If the state sequence is Markov, then use of (CI9) and (CI11) shows that for any policy the Z-process is Markov. Hence

  P(Y_{n+1} = k | W_n) = P(Y_{n+1} = k | Y_n, A_n)

We assume time homogeneity in the sense that

  P(Y_{n+1} = k | Y_n = j, A_n = a) = p(k | j, a), invariant with n
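Time homogeneity means a single kernel p(k | j, a) drives every transition. The following is a minimal sketch with a hypothetical two-state, two-action kernel and a fixed Markov policy (none of these numbers come from the text):

```python
import numpy as np

# Hypothetical kernel: P[a][j][k] = p(k | j, a), the same at every step
# (time homogeneity).
P = np.array([
    [[0.9, 0.1],    # action 0, from state 0
     [0.4, 0.6]],   # action 0, from state 1
    [[0.2, 0.8],    # action 1, from state 0
     [0.7, 0.3]],   # action 1, from state 1
])

def sample_path(policy, y0, n_steps, rng):
    """Sample (Y_n, A_n) with A_n = policy(Y_n); same kernel P at every step."""
    path, y = [], y0
    for _ in range(n_steps):
        a = policy(y)
        k = rng.choice(2, p=P[a][y])   # next state drawn from p(. | y, a)
        path.append((y, a))
        y = k
    return path

rng = np.random.default_rng(1)
path = sample_path(lambda j: 1 - j, y0=0, n_steps=5, rng=rng)
```

The policy here is stationary as well as Markov; a general policy π_n could consult the whole history (Y_0, ..., Y_n).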
We make a dynamic programming approach. Define recursively the functions g_n as follows: g_N(j) = max_a r(j, a). For 0 ≤ n < N, set

  g_n(j) = max_a [ r(j, a) + Σ_k p(k | j, a) g_{n+1}(k) ]

Put

  Z_n = R_0 + R_1 + ... + R_{n-1} + g_n(Y_n), 0 ≤ n ≤ N

Then, by A4-2, under an optimal choice of actions {Z_n : 0 ≤ n ≤ N} is a MG, while under an arbitrary policy

  E[Z_{n+1} | W_n] = Z_n + r(Y_n, A_n) + Σ_k p(k | Y_n, A_n) g_{n+1}(k) − g_n(Y_n) ≤ Z_n

with equality iff A_n maximizes r(Y_n, a) + Σ_k p(k | Y_n, a) g_{n+1}(k). We may therefore assert

  E[Z_N] ≤ E[Z_{N-1}] ≤ ... ≤ E[Z_0] = g_0(j)

Hence, since g_N(Y_N) = max_a r(Y_N, a) ≥ R_N,

  G(π, j) ≤ E[Z_N | Y_0 = j] ≤ g_0(j), for any policy π

If a policy π* can be found which yields equality, then π* is an optimal policy.
The following procedure leads to such a policy.
- For each pair (n, j), let f_n(j) be the action which maximizes r(j, a) + Σ_k p(k | j, a) g_{n+1}(k) (for n = N, the action which maximizes r(j, a)). Thus, g_n(j) = r(j, f_n(j)) + Σ_k p(k | j, f_n(j)) g_{n+1}(k).
- Now, set A_n* = f_n(Y_n). This yields equality in the argument above. Thus,

  G(π*, j) = g_0(j)

and π* is optimal.

Note that π* is a Markov policy: A_n* = f_n(Y_n), a function of the current state only. The functions f_n depend on the future stages, but once determined, the policy is Markov.
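The backward recursion defining the optimal-value functions (written g_n here) and the maximizing actions f_n can be computed directly. A sketch under assumed data; the two-state reward matrix, kernel, and horizon below are hypothetical:

```python
import numpy as np

# Hypothetical data: 2 states, 2 actions, horizon N = 3.
N = 3
r = np.array([[1.0, 0.0],    # r(j, a): rows are states j, columns actions a
              [0.0, 2.0]])
P = np.array([               # P[a][j][k] = p(k | j, a)
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.2, 0.8], [0.7, 0.3]],
])

# g[n][j]: maximal expected remaining return from stage n in state j;
# f[n][j]: the maximizing action. g[N+1] = 0 makes g[N][j] = max_a r(j, a).
g = np.zeros((N + 2, 2))
f = np.zeros((N + 1, 2), dtype=int)
for n in range(N, -1, -1):      # backward induction
    for j in range(2):
        q = [r[j, a] + P[a, j] @ g[n + 1] for a in range(2)]
        f[n, j] = int(np.argmax(q))
        g[n, j] = max(q)
# g[0][j] bounds the expected return of every policy started at Y_0 = j,
# and the Markov policy A_n = f[n][Y_n] attains the bound.
```

Note that the loop runs from n = N down to n = 0: each g_n needs g_{n+1}, exactly as in the recursion above.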
A4-9 Doob's martingale
Let X be an integrable random variable and Z_N = {Z_n : n ∈ N} an arbitrary sequence of random vectors. For each n, let W_n = (Z_0, Z_1, ..., Z_n) and X_n = E[X | W_n]. Then {X_n : n ∈ N} is a MG.
A4-9a best mean-square estimators
If X_n = E[X | W_n], then X_n is the best mean-square estimator of X, given W_n = (Z_0, Z_1, ..., Z_n). The sequence {X_n : n ∈ N} of best estimators is a MG.
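The Doob construction can be checked exactly on a toy model. In this hypothetical setup (not from the text) Z_0, Z_1, Z_2 are independent fair ±1 coins, X = Z_0 + 2 Z_1 + 3 Z_2, and X_n = E[X | W_n] is computed by enumeration; the MG property E[X_{n+1} | W_n] = X_n is then verified outcome by outcome:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product([-1, 1], repeat=3))   # all 8 outcomes equally likely

def X(z):
    # hypothetical integrable random variable
    return z[0] + 2 * z[1] + 3 * z[2]

def cond_exp(f, m, prefix):
    """Exact E[f | (Z_0, ..., Z_{m-1}) = prefix], by averaging over the
    equally likely outcomes consistent with the observed prefix."""
    match = [z for z in outcomes if z[:m] == prefix]
    return Fraction(sum(f(z) for z in match), len(match))

# X_n = E[X | W_n] depends on the first n+1 coordinates.
checks = []
for n in range(2):                        # check E[X_{n+1} | W_n] = X_n
    for z in outcomes:
        x_n = cond_exp(X, n + 1, z[:n + 1])
        x_next = cond_exp(lambda w: cond_exp(X, n + 2, w[:n + 2]),
                          n + 1, z[:n + 1])
        checks.append(x_next == x_n)
mg_holds = all(checks)
```

Exact rational arithmetic (Fraction) avoids any floating-point tolerance in the comparison; the equality is the tower property of conditional expectation.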
A4-9b futures pricing
Let X_N be a sequence of "spot" prices for a commodity. Let t_0 be the present and t_0 + T be a fixed future. The agent can be expected to know the past history (X_0, X_1, ..., X_{t_0}), and will update as t increases beyond t_0. Put

  Y_t = E[X_{t_0+T} | W_t]

the expected futures price, given the history up to t. Then {Y_t : t_0 ≤ t ≤ t_0 + T} is a Doob's MG, with X = X_{t_0+T}, relative to Z_N, where

  Z_{t_0} = (X_0, X_1, ..., X_{t_0}) and Z_t = X_t for t > t_0
A4-9c discounted futures
Assume the rate of return is r per unit time. Then α = 1/(1 + r) is the discount factor. Let

  V_t = α^{t_0+T−t} Y_t, for t_0 ≤ t ≤ t_0 + T

Then

  E[V_{t+1} | W_t] = α^{t_0+T−t−1} E[Y_{t+1} | W_t] = (1/α) V_t ≥ V_t

Thus {V_t} is a SMG relative to Z_N.

The implication from martingale theory is that all methods to determine profitable patterns of prediction from past history are doomed to failure.
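Both claims can be verified exactly in a small binomial spot-price model, taking t_0 = 0 for simplicity. The up/down factors, discount rate, and horizon below are hypothetical; Y_t = E[X_T | W_t] is the expected futures price and V_t = α^{T−t} Y_t its discounted version:

```python
from itertools import product
from fractions import Fraction

# Hypothetical spot model: X_{t+1} = X_t * u or X_t * d, each with prob 1/2.
u, d = Fraction(3, 2), Fraction(3, 4)   # up/down factors
alpha = Fraction(10, 11)                # discount factor 1/(1+r), r = 10%
T = 3
x0 = Fraction(100)

paths = list(product([u, d], repeat=T))  # equally likely move sequences

def spot(path):
    x = x0
    for m in path:
        x *= m
    return x

def Y(t, prefix):
    """Expected futures price E[X_T | moves up to time t], exact."""
    match = [p for p in paths if p[:t] == prefix]
    return Fraction(sum(spot(p) for p in match), len(match))

def V(t, prefix):
    return alpha ** (T - t) * Y(t, prefix)

# MG: E[Y_{t+1} | W_t] = Y_t ;  SMG: E[V_{t+1} | W_t] = V_t / alpha >= V_t
mg_ok, smg_ok = True, True
for t in range(T):
    for pre in set(p[:t] for p in paths):
        ey = Fraction(Y(t + 1, pre + (u,)) + Y(t + 1, pre + (d,)), 2)
        mg_ok &= (ey == Y(t, pre))
        ev = Fraction(V(t + 1, pre + (u,)) + V(t + 1, pre + (d,)), 2)
        smg_ok &= (ev == V(t, pre) / alpha) and (ev >= V(t, pre))
```

The martingale property of Y_t holds for any spot model (it is a Doob MG); the submartingale drift of V_t comes entirely from α < 1, i.e., from discounting alone.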
A4-10 present discounted value of capital
If α is the discount factor, X_n is the dividend at time n, and V_n is the present value, at time n, of all future returns, then