Neural Contextual Bandits with UCB-based Exploration (NeuralUCB)

NeuralUCB leverages the representational power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) on the reward, enabling efficient exploration.
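As a rough sketch of the idea, in the notation of the paper (this describes the algorithm, not AgileRL's internals): at step t the agent scores each arm a with context x_{t,a} by

U_{t,a} = f(x_{t,a}; θ_{t-1}) + γ_{t-1} · sqrt( g(x_{t,a}; θ_{t-1})ᵀ Z_{t-1}⁻¹ g(x_{t,a}; θ_{t-1}) / m )

where f is the network's reward estimate, g its gradient with respect to the network parameters, Z_{t-1} is a regularized design matrix accumulated from the gradients of previously pulled arms, m is the network width, and γ is the exploration scaling factor. The arm with the largest U_{t,a} is pulled, so exploration concentrates on arms whose reward estimates remain uncertain.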

NeuralUCB is a contextual multi-armed bandit algorithm, meaning it is suited to RL problems with just a single timestep: the agent observes a context, selects an arm, and immediately receives a reward.

Example

from gymnasium import spaces
from tensordict import TensorDict
from ucimlrepo import fetch_ucirepo

from agilerl.algorithms.neural_ucb import NeuralUCB
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.wrappers.learning import BanditEnv

# Fetch data https://archive.ics.uci.edu/
iris = fetch_ucirepo(id=53)
features = iris.data.features
targets = iris.data.targets

# Create environment
env = BanditEnv(features, targets)
context_dim = env.context_dim
action_dim = env.arms

memory = ReplayBuffer(max_size=10000)

observation_space = spaces.Box(low=features.values.min(), high=features.values.max())
action_space = spaces.Discrete(action_dim)
agent = NeuralUCB(observation_space, action_space)  # Create NeuralUCB agent

context = env.reset()  # Reset environment at start of episode
for _ in range(500):
    # Get next action from agent
    action = agent.get_action(context)
    next_context, reward = env.step(action)  # Act in environment

    # Save experience to replay buffer
    transition = TensorDict(
        {
            "obs": context[action],
            "reward": reward,
        },
        batch_size=[1],
    )
    memory.add(transition)

    # Learn according to learning frequency
    if len(memory) >= agent.batch_size:
        for _ in range(agent.learn_step):
            experiences = memory.sample(agent.batch_size)  # Sample replay buffer
            agent.learn(experiences)  # Learn according to agent's RL algorithm

    context = next_context

Neural Network Configuration

To configure the architecture of the network’s encoder / head, pass a kwargs dict to the NeuralUCB net_config field. Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.

For discrete / vector observations:

NET_CONFIG = {
    "encoder_config": {'hidden_size': [32, 32]},  # Encoder hidden size
    "head_config": {'hidden_size': [32]}          # Network head hidden size
}

For image observations:

NET_CONFIG = {
    "encoder_config": {
        'channel_size': [32, 32],  # CNN channel size
        'kernel_size': [8, 4],     # CNN kernel size
        'stride_size': [4, 2],     # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

For dictionary / tuple observations containing any combination of image, discrete, and vector observations:

CNN_CONFIG = {
    "channel_size": [32, 32],  # CNN channel size
    "kernel_size": [8, 4],     # CNN kernel size
    "stride_size": [4, 2],     # CNN stride size
}

NET_CONFIG = {
    "encoder_config": {
        "latent_dim": 32,
        # Config for nested EvolvableCNN objects
        "cnn_config": CNN_CONFIG,
        # Config for nested EvolvableMLP objects
        "mlp_config": {
            "hidden_size": [32, 32]
        },
        "vector_space_mlp": True  # Process vector observations with an MLP
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
agent = NeuralUCB(observation_space, action_space, net_config=NET_CONFIG) # Create NeuralUCB agent 

Evolutionary Hyperparameter Optimization

AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.
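The population classmethod documented below provides the entry point for creating a population of agents to evolve. A minimal sketch (assuming the observation_space and action_space defined in the example above; tournament selection and mutations are configured separately, as described in the linked documentation):

from agilerl.algorithms.neural_ucb import NeuralUCB

# Create a population of 4 NeuralUCB agents to evolve during training
# (observation_space and action_space are assumed to be defined as above)
population = NeuralUCB.population(
    size=4,
    observation_space=observation_space,
    action_space=action_space,
)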

Saving and Loading Agents

To save an agent, use the save_checkpoint method:

from agilerl.algorithms.neural_ucb import NeuralUCB

agent = NeuralUCB(observation_space, action_space)  # Create NeuralUCB agent

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.neural_ucb import NeuralUCB

checkpoint_path = "path/to/checkpoint"
agent = NeuralUCB.load(checkpoint_path)

Parameters

class agilerl.algorithms.neural_ucb_bandit.NeuralUCB(*args, **kwargs)

Neural Upper Confidence Bound (UCB) algorithm.

Paper: https://arxiv.org/abs/1911.04462

Parameters:
  • observation_space (gym.spaces.Space) – Observation space of the environment

  • action_space (gym.spaces.Space) – Action space of the environment

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, in which case algorithm mutations are disabled.

  • net_config (dict, optional) – Network configuration, defaults to None

  • gamma (float, optional) – Positive scaling factor, defaults to 1.0

  • lamb (float, optional) – Regularization parameter lambda, defaults to 1.0

  • reg (float, optional) – Loss regularization parameter, defaults to 0.000625

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • normalize_images (bool, optional) – Flag to normalize images, defaults to True

  • lr (float, optional) – Learning rate for optimizer, defaults to 1e-3

  • learn_step (int, optional) – Learning frequency, defaults to 2

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • actor_network (EvolvableModule, optional) – Custom actor network, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True

clone(index: int | None = None, wrap: bool = True) SelfEvolvableAlgorithm

Creates a clone of the algorithm.

Parameters:
  • index (Optional[int], optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm
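A minimal usage sketch (assuming agent is an existing NeuralUCB instance):

agent_copy = agent.clone(index=1)  # copy of the agent, tracked under index 1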

static copy_attributes(agent: SelfEvolvableAlgorithm, clone: SelfEvolvableAlgorithm) SelfEvolvableAlgorithm

Copies the non-evolvable attributes of the algorithm to a clone.

Parameters:
  • agent (SelfEvolvableAlgorithm) – The agent whose attributes are copied.

  • clone (SelfEvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

SelfEvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModule | ModuleDict | Optimizer | dict[str, Optimizer] | OptimizerWrapper]

Returns the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

get_action(obs: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts], action_mask: ndarray | None = None) int

Returns the next action to take in the environment.

Parameters:
  • obs (numpy.ndarray[float]) – State observation, or multiple observations in a batch

  • action_mask (numpy.ndarray, optional) – Mask of legal actions 1=legal 0=illegal, defaults to None

Returns:

Action to take in the environment

Return type:

int
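A minimal usage sketch (the mask below is illustrative and assumes a three-armed bandit in which the second arm is currently unavailable):

import numpy as np

action_mask = np.array([1, 0, 1])  # 1 = legal arm, 0 = illegal arm
action = agent.get_action(context, action_mask=action_mask)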

static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Returns the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:

action_space (spaces.Space or list[spaces.Space].) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

tuple[int, …].

get_lr_names() list[str]

Returns the learning rates of the algorithm.

get_policy() EvolvableModule

Returns the policy network of the algorithm.

static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Returns the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:

observation_space (spaces.Space or list[spaces.Space].) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

tuple[int, …].

property index: int

Returns the index of the algorithm.

init_params() None

Initializes the parameters of the network.

static inspect_attributes(agent: SelfEvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts], ...]) float

Updates agent network parameters to learn from experiences.

Parameters:

experiences (tuple[numpy.ndarray, numpy.ndarray]) – Batched states, rewards in that order.

Returns:

Loss value from training step

Return type:

float

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) SelfEvolvableAlgorithm

Loads an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str) None

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from

property mut: Any

Returns the mutation object of the algorithm.

mutation_hook() None

Executes the hooks registered with the algorithm.

classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] = {}, **kwargs) list[SelfEvolvableAlgorithm | SelfAgentWrapper]

Creates a population of algorithms.

Parameters:

size (int.) – The size of the population.

Returns:

A list of algorithms.

Return type:

list[SelfEvolvableAlgorithm].

preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]

Preprocesses observations for forward pass through neural network.

Parameters:

observation (ObservationType) – Observation of the environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

recompile() None

Recompiles the evolvable modules in the algorithm with the specified torch compiler.

register_mutation_hook(hook: Callable) None

Registers a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) None

Sets the evaluation network for the algorithm.

Parameters:

group (NetworkGroup) – The network group to register with the algorithm.

reinit_optimizers(optimizer: OptimizerConfig | None = None) None

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters:

optimizer (Optional[OptimizerConfig], optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

save_checkpoint(path: str) None

Saves a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

set_training_mode(training: bool) None

Sets the training mode of the algorithm.

Parameters:

training (bool) – If True, set the algorithm to training mode.

test(env: str | Env | VectorEnv | AsyncVectorEnv, swap_channels: bool = False, max_steps: int = 100, loop: int = 1) float

Returns the mean test score of the agent in the environment.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to 100

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean over these tests. Defaults to 1

Returns:

Mean test score of agent in environment

Return type:

float
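A minimal usage sketch (assuming env is a Gym-style environment compatible with the agent's observation and action spaces):

mean_score = agent.test(env, max_steps=100, loop=1)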

to_device(*experiences: Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]) tuple[Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor], ...]

Moves experiences to the device.

Parameters:

experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

tuple[torch.Tensor[float], …]

unwrap_models() None

Unwraps the models in the algorithm from the accelerator.

wrap_models() None

Wraps the models in the algorithm with the accelerator.