I recently read *Reinforcement Learning: An Introduction* by Sutton and Barto, which motivated me to come up with this problem (which I hope is well posed):
The Problem
Some sort of reward maximizing agent finds itself in an environment with a finite number of states it can be in, each of which gives a certain reward, and a finite number of timesteps it can traverse these states in. The agent is given this information:
A number $m$ which denotes the total number of states it can explore.
A number $T$ which denotes the total number of discrete timesteps the agent can act in before the "game" is up.
A range $a$ to $b$ which denotes the minimum to maximum values that a reward for any given state can be.
(For simplicity, and specificity) The agent also knows that rewards are distributed uniformly between $a$ and $b$.
What the agent does not know:
Let $S$ denote the set of states (whose elements need not be numeric), and let $V$ denote the multiset of reward values, each an integer, such that $|S| = |V|$. $V$ may contain repeated values, since each reward is sampled independently from Uniform$(a, b)$ with the integer restriction.
The agent does not know the content of $V$ nor does it know the mapping $f(s_i) = v_j$, which gives the reward associated with each state.
The agent can only take two actions throughout the finite discrete time series $0$ to $T$:
(Explore) The agent can acquire the reward from $s_i$, where $s_i$ is the next unseen state in an ordered list of states (i.e., it both learns what $f(s_i)$ is and gains that value).
OR
(Maximize) The agent can return to or stay at the state, among the already "seen" or "explored" states, with the maximum reward. In other words, the agent can keep track of the set $S_s$ of states whose rewards it knows through "exploring", and select $s \in S_s \text{ s.t. } f(s) = v \text{ is maximized over } S_s$.
The agent can be given a plan in the form of a "choice vector", $P$. $P$ has length $T$, where each index denotes a timestep $t$ in the time series, and the content of $p_t$ is one of two symbols, telling the agent either to explore at time $t$ or to maximize at time $t$.
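To pin down the dynamics I have in mind, here is a minimal sketch of one episode under a given choice vector (the names `simulate`, `'E'`, `'M'` are my own, and I assume a "maximize" step taken before anything has been explored yields 0):

```python
def simulate(P, rewards):
    """Run one episode under choice vector P.

    P is a sequence of 'E' (explore) / 'M' (maximize) symbols of length T;
    rewards is the hidden value list V, indexed in the agent's fixed
    exploration order.
    """
    total = 0
    seen = []          # rewards of explored states, i.e. f over S_s
    next_unseen = 0
    for action in P:
        if action == 'E' and next_unseen < len(rewards):
            r = rewards[next_unseen]   # learn f(s_i) and gain it
            seen.append(r)
            next_unseen += 1
        else:
            # maximize: revisit the best known state
            # (assumed to give 0 if nothing has been explored yet)
            r = max(seen) if seen else 0
        total += r
    return total
```

For example, with hidden rewards `[3, 7, 5]` the plan `['E', 'E', 'M', 'M']` gains 3 and 7 while exploring, then 7 twice while maximizing.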
I seek to give this agent a function $h(T, m, a, b) = k$, where $k$ denotes the index of $P$ at which the agent should switch from "exploring" to "maximizing" in order to maximize its expected total reward, $E[\sum_{t=0}^T f(s_t)]$. One obvious assumption I am making is that a choice vector maximizing total reward exclusively explores first and then exclusively maximizes, without switching back and forth. This is only an intuition, and if it is wrong, let it be known.
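If the explore-then-maximize intuition holds, one way to write the objective (my own sketch, using the notation above and indexing the $T$ entries of $P$) is: switching at index $k$ gives expected total reward

$$E[R(k)] = k \cdot \frac{a+b}{2} + (T - k)\, E\big[\max(v_1, \dots, v_k)\big],$$

since each of the $k$ explored states contributes an independent Uniform$(a, b)$ reward with mean $(a+b)/2$, while each of the remaining $T - k$ steps contributes the best reward seen so far. Under this decomposition, $h(T, m, a, b) = \arg\max_{0 \le k \le \min(m, T)} E[R(k)]$.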
Some Thoughts and Work I Have Done
It seems that as $T - m$ grows larger, the agent would want to explore more and more to achieve a near-optimal total reward. In the extreme case, the agent can explore all states up to timestep $m$, and then take the global maximum reward at each timestep from $m+1$ to $T$. The larger $T$ is relative to $m$, the better this policy seems to be. The case where $m > T$ seems less intuitive.
I started attempting to solve this problem via Monte Carlo, but am still working on it. Decision trees may be useful too: the sort of choice vectors I am thinking about can all be thought of as a single turn, traversing down all left children until a certain level, and then switching to traversing all right children. But again, here I would ultimately do some sort of simulation. If you would like to view my code, please message me and we can talk about it.
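The Monte Carlo approach I am attempting can be sketched roughly as follows (function names and the default trial count are my own; this brute-forces $k$ rather than finding $h$ analytically):

```python
import random

def expected_total(k, T, m, a, b, trials=20000):
    """Monte Carlo estimate of the expected total reward when exploring
    for the first k timesteps and maximizing for the remaining T - k.
    Rewards are drawn as integers uniformly from [a, b]; an empty
    explored set is assumed to give reward 0 on a maximize step."""
    total = 0.0
    for _ in range(trials):
        rewards = [random.randint(a, b) for _ in range(m)]
        explored = rewards[:k]
        best = max(explored) if explored else 0
        total += sum(explored) + (T - k) * best
    return total / trials

def h(T, m, a, b):
    """Brute-force the switch index k that maximizes the estimate."""
    return max(range(min(m, T) + 1),
               key=lambda k: expected_total(k, T, m, a, b))
```

This only estimates $h$ by search; my question remains whether the maximizing $k$ can be derived in closed form.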
My question, ultimately: is there a way of finding the function $h(T, m, a, b) = k$ mathematically, without simulation? What is that function, and how would I go about finding it?
