
I'm working on the following reinforcement learning problem: I have a bottle of fixed capacity (say 5 liters). At the bottom of the bottle there is a tap to remove water. The distribution of the water removed is not fixed; any amount of water can be removed from the bottle, i.e. any continuous value in [0, 5].

At the top of the bottle a tap is mounted to fill the bottle. The RL agent can add [0, 1, 2, 3, 4] liters to the bottle. The initial bottle level is any value in [0, 5].

I want to train the agent in this environment to find an optimal sequence of actions such that the bottle neither empties nor overflows, which means the water demand is continuously supplied.

Action space = [0, 1, 2, 3, 4] (discrete)

Observation space = [0, capacity of the bottle], i.e. [0, 5] (continuous)

Reward logic: if the bottle empties as a result of an action, give a negative reward; if the bottle overflows as a result of an action, give a negative reward.

I have decided to use Python to create the environment:

    from gym import spaces
    import numpy as np


    class WaterEnv():
        def __init__(self, BottleCapacity=5):
            ## CONSTANTS
            self.MinLevel = 0                     # minimum water level
            self.BottleCapacity = BottleCapacity  # bottle capacity
            # action space
            self.action_space = spaces.Discrete(self.BottleCapacity)
            # observation space
            self.observation_space = spaces.Box(low=self.MinLevel,
                                                high=self.BottleCapacity,
                                                shape=(1,))
            # initial bottle level
            self.initBlevel = self.observation_space.sample()

        def step(self, action):
            # water qty to remove
            WaterRemoveQty = np.random.uniform(self.MinLevel, self.BottleCapacity, 1)
            # updated water level after removal of water
            UpdatedWaterLevel = (self.initBlevel - WaterRemoveQty)
            # add water - action taken
            UpdatedWaterLevel_ = UpdatedWaterLevel + action

            if UpdatedWaterLevel_ <= self.MinLevel:
                reward = -1
                done = True
            elif UpdatedWaterLevel_ > self.BottleCapacity:
                reward = -1
                done = True
            else:
                reward = 0.5
                done = False

            return UpdatedWaterLevel_, reward, done

        def reset(self):
            """Reset the initial bottle level."""
            self.initBlevel = self.observation_space.sample()
            return self.initBlevel


    import random
    from collections import deque
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import sgd


    class DQNAgent:
        def __init__(self, state_size, action_size):
            self.state_size = state_size
            self.action_size = action_size
            self.memory = deque(maxlen=2000)  # replay memory size
            self.gamma = 0.99                 # discount rate
            self.epsilon = 1.0                # exploration rate
            self.epsilon_min = 0.01           # minimum exploration rate
            self.epsilon_decay = 0.99         # exploration decay
            self.learning_rate = 0.001        # learning rate
            self.model = self._build_model()

        def _build_model(self):
            # Neural net for the Deep Q-learning model
            model = Sequential()
            model.add(Dense(256, input_dim=self.state_size, activation='relu'))
            model.add(Dense(256, activation='relu'))
            model.add(Dense(self.action_size, activation='linear'))
            model.compile(loss='mse', optimizer=sgd(lr=self.learning_rate))
            return model

        def remember(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def act(self, state):
            if np.random.rand() <= self.epsilon:
                return random.randrange(self.action_size)
            act_values = self.model.predict(state)
            return np.argmax(act_values[0])  # returns action

        def replay(self, batch_size):
            minibatch = random.sample(self.memory, batch_size)
            for state, action, reward, next_state, done in minibatch:
                target = reward
                if not done:
                    target = (reward + self.gamma *
                              np.amax(self.model.predict(next_state)[0]))
                target_f = self.model.predict(state)
                target_f[0][action] = target
                self.model.fit(state, target_f, epochs=1, verbose=0)
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay


    # create environment object
    env = WaterEnv()
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    minibatch = 32

    # initialize agent
    agent = DQNAgent(state_size, action_size)

    done = False
    lReward = []   # carry the rewards up to the end of the simulation
    rewardAll = 0
    XArray = []    # carry the actions up to the end of the simulation
    EPOCHS = 1000

    for e in range(EPOCHS):
        #state = np.reshape(state, [1, 1])
        # reset state at the beginning of each epoch
        state = env.reset()
        time_t = 0
        rewardAll = 0
        while True:
            # decide action
            #state = np.reshape(state, [1, 1])
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            #reward = reward if not done else -10
            # remember the previous state, action, reward and done flag
            #next_state = np.reshape(next_state, [1, state_size])
            agent.remember(state, action, reward, next_state, done)
            # remember the action for a performance check
            XArray.append(action)
            # assign next_state as the current state for the next step
            state = next_state
            if done:
                print("episode: {}/{}, score: {}, e: {:.2}"
                      .format(e, EPOCHS, time_t, agent.epsilon))
                break
            rewardAll += reward
            # experience replay
            if len(agent.memory) > minibatch:
                agent.replay(minibatch)
        lReward.append(rewardAll)  # append the rewards

After running 1000 epochs, I observed that the agent has not learned anything. I am unable to find out what is going wrong.

  • You are not running this as a continuous problem as I thought (and answered in your last question). The episodes end as soon as the agent "fails" by not supplying or over-filling. Is that what you intended? Your reward scheme is still fine for that, but the episodic version does support other reward schemes as well (since the length of an episode comes into play). Commented Aug 17, 2018 at 16:01
  • I felt that ending the episodes when the agent fails would help learning. Correct me if I'm wrong. I intended to use that because of my environment logic. I'm also not sure where to end the episodes; maybe after 500 steps, giving positive rewards while water is supplied and negative rewards for not supplying or over-filling. Commented Aug 17, 2018 at 17:15
  • You don't need episodes at all. For this example problem it is not a big deal though. Commented Aug 17, 2018 at 19:48
  • I have not understood how to remove the episodes. Where in the above code do I have to make changes to remove them? And do you think ending the episode after the agent fails is correct? Commented Aug 18, 2018 at 3:11
  • Neither is "correct", it depends on how the situation would play out in reality. From your first description, I was imagining a supply that continued regardless of failures (e.g. you wouldn't say "sorry" and pour in a random extra amount of water if one day the tank became empty). Making it episode-based means you reset it every time it fails. What happens in the imagined "reality" of your system? To remove episodes, stop using the "done" property, and clamp the state to only be valid after you have calculated the rewards. Commented Aug 18, 2018 at 6:46
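To illustrate the last comment: a minimal sketch of a continuing (non-episodic) step(), assuming the environment keeps its level in an attribute (called self.current_level here; the name is illustrative, not from the original code). The reward is computed on the unclamped level, the level is then clamped back into range, and no done flag is returned, so the training loop would simply run for a fixed number of steps.

    import numpy as np

    def step(self, action):
        # random demand for this step
        removed = np.random.uniform(self.MinLevel, self.BottleCapacity, 1)
        new_level = self.current_level - removed + action

        # judge the reward on the unclamped level, so failures are still penalised
        failed = (new_level <= self.MinLevel) or (new_level > self.BottleCapacity)
        reward = -1 if failed else 0.5

        # clamp the level back into the valid range and carry on (no episode end)
        self.current_level = np.clip(new_level, self.MinLevel, self.BottleCapacity)
        return self.current_level, reward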

1 Answer


I can see two issues:

  1. Your environment does not track changes to state; it just produces a random success/fail based on self.initBlevel, which is never modified to reflect changes. Although you calculate and return the new state (as the variable UpdatedWaterLevel_), it is not fed back into the environment. You store it as the next "state" in the DQN replay table, but you never store it in the environment as the current state. You should do that - without it the replay table will be filled with incorrect values. The environment needs a variable that holds the current state (see the sketch after this list).

  2. You are running the system as an episodic problem, but the environment is not properly re-set for the start of each new episode. This is a "hidden" bug due to the issue above, but it would immediately become a problem once you let the state go outside the bounds of the problem you have defined.
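As a minimal sketch of both fixes, keeping your class layout but storing the level in one attribute (I have called it self.current_level; the name is my choice, not from your code): step() writes the new level back into the environment, and reset() re-randomises it for the next episode.

    import numpy as np
    from gym import spaces

    class WaterEnv:
        def __init__(self, BottleCapacity=5):
            self.MinLevel = 0
            self.BottleCapacity = BottleCapacity
            self.action_space = spaces.Discrete(self.BottleCapacity)
            self.observation_space = spaces.Box(low=self.MinLevel,
                                                high=self.BottleCapacity,
                                                shape=(1,))
            self.current_level = self.observation_space.sample()  # the current state lives here

        def step(self, action):
            # random demand drawn each step
            removed = np.random.uniform(self.MinLevel, self.BottleCapacity, 1)
            new_level = self.current_level - removed + action
            if new_level <= self.MinLevel:
                reward, done = -1, True
            elif new_level > self.BottleCapacity:
                reward, done = -1, True
            else:
                reward, done = 0.5, False
            # feed the new level back into the environment so the next step uses it
            self.current_level = new_level
            return self.current_level, reward, done

        def reset(self):
            # put the environment into a fresh random state for the new episode
            self.current_level = self.observation_space.sample()
            return self.current_level

With this, the env.reset() call at the top of your training loop genuinely puts the environment back into a valid starting state.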

Given the problem setup, I would expect the agent to learn to always fill the container to the maximum possible capacity (and it would then get drained by the amount of the random request). That would lead to infinitely long episodes, so you do still need discounting.

Possibly your NN is over-complex for this simple task, which could make learning slower, but that is harder to tell. The relationship between the current state/action and the expected discounted future reward is not trivial, so you might still need a moderately sized network to capture it.
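If you want to check whether network size is part of the problem, a smaller model is quick to try. A minimal sketch, assuming the same Keras version as the question's code; the single 24-unit hidden layer and the switch to the Adam optimizer are untuned suggestions of mine, not part of the original code:

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Adam

    def build_small_model(state_size, action_size, learning_rate=0.001):
        # one modest hidden layer instead of two 256-unit layers
        model = Sequential()
        model.add(Dense(24, input_dim=state_size, activation='relu'))
        model.add(Dense(action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
        return model

You could swap this in for _build_model and compare learning curves against the 256-unit version.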

  • Thanks for your continuous follow-up and detailed answers. Commented Aug 17, 2018 at 17:41
  • @KrishnaNevase: No problem. If these have helped, then please consider accepting them by clicking on the tick mark - as the writer of the question, it is your choice which answer is most useful. You also get +2 rep for accepting an answer. Commented Aug 17, 2018 at 19:50
