I recently read *Reinforcement Learning: An Introduction* by Sutton and Barto, which motivated me to come up with this problem (which I hope is well posed):
The Problem
Some sort of reward maximizing agent finds itself in an environment with a finite number of states it can be in, each of which gives a certain reward, and a finite number of timesteps it can traverse these states in. The agent is given this information:
A number $m$ which denotes the total number of states it can explore.
A number $T$ which denotes the total number of discrete timesteps the agent can act in before the "game" is up.
A range $a$ to $b$ which denotes the minimum to maximum values that a reward for any given state can be.
(For simplicity, and specificity) The agent also knows that rewards are distributed uniformly between $a$ and $b$.
What the agent does not know:
Let $S$ denote the set of states (whose elements need not be numeric), and let $V$ denote the multiset of reward values, each an integer, such that $|S| = |V|$. $V$ may contain repeated values, since each reward is sampled independently from Uniform$(a, b)$ with the integer restriction.
The agent does not know the content of $V$ nor does it know the mapping $f(s_i) = v_j$, which gives the reward associated with each state.
The agent can only take two actions throughout the finite discrete time series $0$ to $T$:
(Explore) The agent can acquire the reward from $s_i$, where $s_i$ is the next unseen state in an ordered list of states (i.e., it both learns what $f(s_i)$ is and gains that value).
OR
(Maximize) The agent can return to or stay at the state, among the already "seen" or "explored" states, with the maximum reward. In other words, the agent can keep track of the set $S_s$ of states whose rewards it knows through "exploring", and select $s \in S_s \text{ s.t. } f(s) = v \text{ is maximized over } S_s$.
The agent can be given a plan in the form of a "choice vector", $P$. $P$ has length $T$, where each index denotes a timestep $t$ in the time series, and the content of $p_t$ is one of two symbols, telling the agent either to explore at time $t$ or to maximize at time $t$.
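To pin down the dynamics I have in mind, here is a minimal sketch of one episode under a given choice vector (the names `simulate`, `'E'`, `'M'` are my own, and I assume a "maximize" step taken before anything has been explored yields 0):

```python
def simulate(P, rewards):
    """Run one episode under choice vector P.

    P is a sequence of 'E' (explore) / 'M' (maximize) symbols of length T;
    rewards is the hidden value list V, indexed in the agent's fixed
    exploration order.
    """
    total = 0
    seen = []          # rewards of explored states, i.e. f over S_s
    next_unseen = 0
    for action in P:
        if action == 'E' and next_unseen < len(rewards):
            r = rewards[next_unseen]   # learn f(s_i) and gain it
            seen.append(r)
            next_unseen += 1
        else:
            # maximize: revisit the best known state
            # (assumed to give 0 if nothing has been explored yet)
            r = max(seen) if seen else 0
        total += r
    return total
```

For example, with hidden rewards `[3, 7, 5]` the plan `['E', 'E', 'M', 'M']` gains 3 and 7 while exploring, then 7 twice while maximizing.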
I seek to give this agent a function $h(T, m, a, b) = k$, where $k$ denotes the index of $P$ at which the agent should switch from "exploring" to "maximizing" in order to maximize its expected total reward, $E[\sum_{t=0}^T f(s_t)]$. One obvious assumption I am making is that a choice vector maximizing total reward exclusively explores first and then exclusively maximizes, without switching back and forth. This is only an intuition, and if it is wrong, let it be known.
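If the explore-then-maximize intuition holds, one way to write the objective (my own sketch, using the notation above and indexing the $T$ entries of $P$) is: switching at index $k$ gives expected total reward

$$E[R(k)] = k \cdot \frac{a+b}{2} + (T - k)\, E\big[\max(v_1, \dots, v_k)\big],$$

since each of the $k$ explored states contributes an independent Uniform$(a, b)$ reward with mean $(a+b)/2$, while each of the remaining $T - k$ steps contributes the best reward seen so far. Under this decomposition, $h(T, m, a, b) = \arg\max_{0 \le k \le \min(m, T)} E[R(k)]$.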
Some Thoughts and Work I Have Done
It seems that as $T - m$ grows larger, the agent would want to explore more and more to achieve a near-optimal total reward. In the extreme case, the agent can explore all states up to timestep $m$, and then take the global maximum reward at each timestep from $m+1$ to $T$. The larger $T$ is relative to $m$, the better this policy seems to be. The case where $m > T$ seems less intuitive.
I started attempting to solve this problem via Monte Carlo, but am still working on it. Decision trees may be useful too: the sort of choice vectors I am thinking about can all be thought of as a single turn, traversing down all left children until a certain level, and then switching to traversing all right children. But again, here I would ultimately do some sort of simulation. If you would like to view my code, please message me and we can talk about it.
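The Monte Carlo approach I am attempting can be sketched roughly as follows (function names and the default trial count are my own; this brute-forces $k$ rather than finding $h$ analytically):

```python
import random

def expected_total(k, T, m, a, b, trials=20000):
    """Monte Carlo estimate of the expected total reward when exploring
    for the first k timesteps and maximizing for the remaining T - k.
    Rewards are drawn as integers uniformly from [a, b]; an empty
    explored set is assumed to give reward 0 on a maximize step."""
    total = 0.0
    for _ in range(trials):
        rewards = [random.randint(a, b) for _ in range(m)]
        explored = rewards[:k]
        best = max(explored) if explored else 0
        total += sum(explored) + (T - k) * best
    return total / trials

def h(T, m, a, b):
    """Brute-force the switch index k that maximizes the estimate."""
    return max(range(min(m, T) + 1),
               key=lambda k: expected_total(k, T, m, a, b))
```

This only estimates $h$ by search; my question remains whether the maximizing $k$ can be derived in closed form.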
My question, ultimately: is there a way of finding the function $h(T, m, a, b) = k$ mathematically, without simulation? What is that function, and how would I go about finding it?
