3. Agent Classes

3.1. Agent Base Class Reference

class rlgraph.agents.agent.Agent(state_space, action_space, discount=0.98, preprocessing_spec=None, network_spec=None, internal_states_space=None, action_adapter_spec=None, exploration_spec=None, execution_spec=None, optimizer_spec=None, observe_spec=None, update_spec=None, summary_spec=None, saver_spec=None, auto_build=True, name='agent')[source]

Bases: rlgraph.utils.specifiable.Specifiable

Generic agent that defines the RLGraph API operations and parses and sanitizes configuration specs.


Builds this agent. This method can only be called if the agent parameter “auto_build” was set to False.

build_options (Optional[dict]): Optional build options, see build doc.
call_api_method(op, inputs=None, return_ops=None)[source]

Utility method to call any desired API method on the graph, identified via output socket. Delegates this call to the RLGraph graph executor.


op (str): Name of the api method.

inputs (Optional[dict,np.array]): Dict specifying the inputs provided to the API method (key=input space name, value=the values that should go into this space, e.g. numpy arrays).
any: Result of the op call.
define_api_methods(policy_scope, pre_processor_scope, optimizer_scope, *params)[source]

Can be used to specify and then self.define_api_method the Agent’s CoreComponent’s API methods. Each agent implements this to build its algorithm logic.

policy_scope (str): The global scope of the Policy within the Agent.
pre_processor_scope (str): The global scope of the PreprocessorStack within the Agent.
params (any): Params to be used freely by child Agent implementations.

Any algorithm defined as a full graph, as opposed to a mix of Python and graph control flow, should be able to export its graph for deployment.

filename (str): Export path. Depending on the backend, different filetypes may be required.
get_action(states, internals=None, use_exploration=True, apply_preprocessing=True, extra_returns=None)[source]

Returns action(s) for the passed state(s). If states is a single state, returns a single action, otherwise, returns a batch of actions, where batch-size = number of states passed in.


states (Union[dict,np.ndarray]): States dict/tuple or numpy array.
internals (Union[dict,np.ndarray]): Internal states dict/tuple or numpy array.

use_exploration (bool): If False, no exploration or sampling may be applied when retrieving an action.
apply_preprocessing (bool): If True, applies any configured state preprocessors before acting. Set to False if all pre-processing is handled externally both for acting and updating.
extra_returns (Optional[Set[str]]): Optional set of Agent-specific strings for additional return values (besides the actions). All Agents must support “preprocessed_states”.
any: Action(s) as dict/tuple/np.ndarray (depending on self.action_space).
Optional: The preprocessed states as a 2nd return value.
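
The return contract described above (single state in, single action out; batch in, batch out; optional extra return values) can be illustrated with a minimal sketch. StubAgent, its placeholder policy and its scaling preprocessor are hypothetical stand-ins written for this example. They are not part of RLGraph and use plain lists instead of numpy arrays.

```python
import random


class StubAgent:
    """Hypothetical stand-in that mimics only the documented get_action contract."""

    def __init__(self, num_actions, epsilon=0.1, seed=0):
        self.num_actions = num_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def _preprocess(self, state):
        # Placeholder preprocessing (assumption): scale raw values into [0, 1].
        return [x / 255.0 for x in state]

    def _act(self, state, use_exploration):
        # Epsilon-greedy exploration, only if exploration is allowed.
        if use_exploration and self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_actions)
        # Placeholder greedy policy: index of the largest feature.
        return max(range(len(state)), key=lambda i: state[i]) % self.num_actions

    def get_action(self, states, use_exploration=True,
                   apply_preprocessing=True, extra_returns=None):
        extra_returns = extra_returns or set()
        # A list of lists is treated as a batch, a flat list as a single state.
        is_batch = bool(states) and isinstance(states[0], (list, tuple))
        batch = list(states) if is_batch else [states]
        if apply_preprocessing:
            batch = [self._preprocess(s) for s in batch]
        actions = [self._act(s, use_exploration) for s in batch]
        if not is_batch:
            actions, batch = actions[0], batch[0]
        if "preprocessed_states" in extra_returns:
            return actions, batch
        return actions


agent = StubAgent(num_actions=3)
single_action = agent.get_action([0, 255, 10], use_exploration=False)
batch_actions = agent.get_action([[0, 255, 10], [255, 0, 0]], use_exploration=False)
action, preprocessed = agent.get_action(
    [0, 255, 10], use_exploration=False, extra_returns={"preprocessed_states"}
)
```

With use_exploration=False the stub is deterministic, which mirrors the documented guarantee that no exploration or sampling is applied in that mode.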

Returns all weights relevant for the agent’s policy for syncing purposes.

any: Weights and optionally weight meta data for this model.

Bulk imports observations, potentially using device pre-fetching. Can be optionally implemented by agents requiring pre-training.

observations (dict): Dict or list of observation data.

Load model from serialized format.

path (str): Path to checkpoint directory.
observe(preprocessed_states, actions, internals, rewards, next_states, terminals, env_id=None)[source]

Observes an experience tuple or a batch of experience tuples. Note: If configured, first uses buffers and then internally calls _observe_graph() to actually run the computation graph. If buffering is disabled, this just routes the call to the respective _observe_graph() method of the child Agent.


preprocessed_states (Union[dict, ndarray]): Preprocessed states dict or array.
actions (Union[dict, ndarray]): Actions dict or array containing actions performed for the given state(s).
internals (Union[list]): Internal state(s) returned by agent for the given states. Must be an empty list if no internals available.
rewards (float): Scalar reward(s) observed.
terminals (bool): Boolean indicating terminal.
next_states (Union[dict, ndarray]): Preprocessed next states dict or array.
env_id (Optional[str]): Environment id to observe for. When using vectorized execution and buffering, using environment ids is necessary to ensure correct trajectories are inserted. See SingleThreadedWorker for example usage.
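
Why environment ids matter for buffered observe calls can be shown with a small sketch. BufferedObserver is a hypothetical class written for this example. The flush rule (flush when the buffer is full or the episode ends) and the default environment id "env_0" are assumptions, not RLGraph's actual implementation.

```python
from collections import defaultdict


class BufferedObserver:
    """Hypothetical sketch of per-environment buffering in front of a graph call."""

    def __init__(self, buffer_size, default_env_id="env_0"):
        self.buffer_size = buffer_size
        self.default_env_id = default_env_id
        self.buffers = defaultdict(list)
        self.flushed = []  # Stand-in for what _observe_graph() would receive.

    def _observe_graph(self, records):
        # A real agent would run the computation graph here.
        self.flushed.append(list(records))

    def observe(self, preprocessed_states, actions, internals, rewards,
                next_states, terminals, env_id=None):
        env_id = self.default_env_id if env_id is None else env_id
        buf = self.buffers[env_id]
        buf.append((preprocessed_states, actions, internals,
                    rewards, next_states, terminals))
        # Assumed flush rule: full buffer or end of episode. Each environment
        # has its own buffer, so trajectories never interleave.
        if len(buf) >= self.buffer_size or terminals:
            self._observe_graph(buf)
            buf.clear()

    def reset_env_buffers(self, env_id=None):
        env_id = self.default_env_id if env_id is None else env_id
        self.buffers[env_id].clear()


observer = BufferedObserver(buffer_size=2)
observer.observe([0.1], 0, [], 1.0, [0.2], False, env_id="a")
observer.observe([0.5], 1, [], 0.0, [0.6], False, env_id="b")
observer.observe([0.2], 1, [], 1.0, [0.3], False, env_id="a")
```

After these three calls, environment "a" has flushed one contiguous two-step trajectory while the single step from environment "b" is still waiting in its own buffer.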

Applies the agent’s preprocessor to one or more states, e.g. to preprocess external data before inserting to memory without acting. Returns identity if no preprocessor defined.

states (np.array): State(s) to preprocess.
np.array: Preprocessed states.

Must be implemented to define some reset behavior (before starting a new episode). This could include resetting the preprocessor and other Components.


Resets an environment buffer for buffered observe calls.

env_id (Optional[str]): Environment id to reset. Defaults to a default environment if None provided.

Sets policy weights of this agent, e.g. for external syncing purposes.

weights (any): Weights and optionally meta data to update depending on the backend.
ValueError if weights do not match graph weights in shapes and types.
store_model(path=None, add_timestep=True)[source]

Store model using the backend’s check-pointing mechanism.


path (str): Path to model directory.

add_timestep (bool): Indicates if the current training step should be appended to the exported model. If False, may overwrite previous checkpoints.
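
The effect of add_timestep can be sketched as follows. checkpoint_path, the "model" prefix and the "-<step>" suffix format are hypothetical choices for illustration. The actual file naming depends on the backend's check-pointing mechanism.

```python
import os


def checkpoint_path(directory, timestep, add_timestep=True, prefix="model"):
    """Hypothetical helper: build a checkpoint path, optionally step-suffixed."""
    # With a step suffix every checkpoint gets a distinct name. Without it,
    # every save targets the same file and replaces the previous checkpoint.
    name = "{}-{}".format(prefix, timestep) if add_timestep else prefix
    return os.path.join(directory, name)


with_step = [checkpoint_path("ckpt", step) for step in (100, 200)]
without_step = [checkpoint_path("ckpt", step, add_timestep=False) for step in (100, 200)]
```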

Terminates the Agent, so it will no longer be usable. Things that need to be cleaned up should be placed into this function, e.g. closing sessions and other open connections.


Performs an update on the computation graph either via external experience or by sampling from an internal memory.

batch (Optional[dict]): Optional external data batch to use for update. If None, the
agent should be configured to sample internally.
float: The loss value calculated in this update.

3.2. DQN Agent

class rlgraph.agents.dqn_agent.DQNAgent(double_q=True, dueling_q=True, huber_loss=False, n_step=1, memory_spec=None, store_last_memory_batch=False, store_last_q_table=False, **kwargs)[source]

Bases: rlgraph.agents.agent.Agent

A collection of DQN algorithms published in the following papers:
[1] Human-level control through deep reinforcement learning. Mnih, Kavukcuoglu, Silver et al. - 2015
[2] Deep Reinforcement Learning with Double Q-learning. v. Hasselt, Guez, Silver - 2015
[3] Dueling Network Architectures for Deep Reinforcement Learning. Wang et al. - 2016
[4] https://en.wikipedia.org/wiki/Huber_loss
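
The options double_q, dueling_q and huber_loss correspond to well-known formulas from the cited sources. The sketch below writes them out for a single transition in plain Python. The function names are chosen for this example and are not RLGraph API; the agent computes the equivalent quantities inside its computation graph.

```python
def huber_loss(td_error, delta=1.0):
    """Huber loss: quadratic near zero, linear for large errors [4]."""
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_err - 0.5 * delta)


def dqn_target(reward, terminal, q_next_target, discount=0.98):
    """Standard DQN target [1]: bootstrap from the target network's max Q-value."""
    bootstrap = 0.0 if terminal else max(q_next_target)
    return reward + discount * bootstrap


def double_dqn_target(reward, terminal, q_next_online, q_next_target, discount=0.98):
    """Double DQN target [2]: online net picks the action, target net scores it."""
    if terminal:
        return reward
    best = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return reward + discount * q_next_target[best]


def dueling_q_values(state_value, advantages):
    """Dueling aggregation [3]: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


standard = dqn_target(1.0, False, [1.0, 2.0], discount=0.5)
double = double_dqn_target(1.0, False, [5.0, 0.0], [1.0, 2.0], discount=0.5)
```

In the example, the standard target bootstraps from the larger target-network value, while the double-Q target uses the target-network value of the action preferred by the online network, which reduces overestimation.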

define_api_methods(policy_scope, pre_processor_scope, optimizer_scope, *sub_components)[source]

Can be used to specify and then self.define_api_method the Agent’s CoreComponent’s API methods. Each agent implements this to build its algorithm logic.

policy_scope (str): The global scope of the Policy within the Agent.
pre_processor_scope (str): The global scope of the PreprocessorStack within the Agent.
params (any): Params to be used freely by child Agent implementations.
get_action(states, internals=None, use_exploration=True, apply_preprocessing=True, extra_returns=None)[source]
extra_returns (Optional[Set[str],str]): Optional string or set of strings for additional return values (besides the actions). Possible values are:
  • ‘preprocessed_states’: The preprocessed states after passing the given states through the preprocessor stack.
  • ‘internal_states’: The internal states returned by the RNNs in the NN pipeline.
  • ‘used_exploration’: Whether epsilon- or noise-based exploration was used or not.
tuple or single value depending on extra_returns:
  • action
  • the preprocessed states

Resets our preprocessor, but only if it contains stateful PreprocessLayer Components (meaning the PreprocessorStack has at least one variable defined).


Performs an update on the computation graph either via external experience or by sampling from an internal memory.

batch (Optional[dict]): Optional external data batch to use for update. If None, the
agent should be configured to sample internally.
float: The loss value calculated in this update.

3.3. ApeX Agent

class rlgraph.agents.apex_agent.ApexAgent(memory_spec=None, **kwargs)[source]

Bases: rlgraph.agents.dqn_agent.DQNAgent

Ape-X is a DQN variant designed for large-scale distributed execution where many workers share a distributed prioritized experience replay.

Paper: https://arxiv.org/abs/1803.00933

The main difference from standard DQN is that Ape-X provides additional operations to enable external updates of priorities. Ape-X also enables dueling and double DQN by default.


Utility method that just returns the td-loss from a batch without applying an update.

batch (dict): Input batch.
Tuple: Total loss and loss per item.
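
One way the per-item loss can feed an external priority update is proportional prioritization as used in prioritized experience replay. The sketch below is an assumption about how a caller might use these values, not RLGraph code. The function names and the alpha and eps defaults are illustrative.

```python
def priorities_from_losses(loss_per_item, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i = (|loss_i| + eps) ** alpha."""
    return [(abs(loss) + eps) ** alpha for loss in loss_per_item]


def sampling_probabilities(priorities):
    """Normalize priorities into a sampling distribution over the memory."""
    total = sum(priorities)
    return [p / total for p in priorities]


# Per-item losses as they might come back from a TD-loss call on a batch.
priorities = priorities_from_losses([0.0, 1.0, 4.0], alpha=0.5, eps=0.0)
probabilities = sampling_probabilities(priorities)
```

Items with a larger loss receive a larger priority and are therefore sampled more often by the shared replay memory.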

Performs an update on the computation graph either via external experience or by sampling from an internal memory.

batch (Optional[dict]): Optional external data batch to use for update. If None, the
agent should be configured to sample internally.
float: The loss value calculated in this update.

3.4. IMPALA Agent

class rlgraph.agents.impala_agent.IMPALAAgent(discount=0.99, fifo_queue_spec=None, architecture='large', environment_spec=None, weight_pg=None, weight_baseline=None, weight_entropy=None, worker_sample_size=100, dynamic_batching=False, **kwargs)[source]

Bases: rlgraph.agents.agent.Agent

An Agent implementing the IMPALA algorithm described in [1]. The Agent contains both learner and actor API-methods, which will be put into the graph depending on the type ().

[1] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures - Espeholt, Soyer, Munos et al. - 2018 (https://arxiv.org/abs/1802.01561)
default_environment_spec = {'frameskip': 4, 'level_id': 'seekavoid_arena_01', 'observations': ['RGB_INTERLEAVED', 'INSTR'], 'type': 'deepmind_lab'}
default_internal_states_space = Tuple(("Floatbox((256,) <class 'numpy.float32'> )", "Floatbox((256,) <class 'numpy.float32'> )"))

Can be used to specify and then self.define_api_method the Agent’s CoreComponent’s API methods. Each agent implements this to build its algorithm logic.

policy_scope (str): The global scope of the Policy within the Agent.
pre_processor_scope (str): The global scope of the PreprocessorStack within the Agent.
params (any): Params to be used freely by child Agent implementations.
define_api_methods_actor(env_stepper, env_output_splitter, internal_states_slicer, merger, states_dict_splitter, fifo_queue)[source]

Defines the API-methods used by an IMPALA actor. Actors only step through an environment (n-steps at a time), collect the results and push them into the FIFO queue. Results include: The actions actually taken, the discounted accumulated returns for each action, the probability of each taken action according to the behavior policy.

env_stepper (EnvironmentStepper): The EnvironmentStepper Component to step through the Env n steps in a single op call.

fifo_queue (FIFOQueue): The FIFOQueue Component used to enqueue env sample runs (n-step).
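
The actor's role can be sketched in plain Python: take an n-step rollout, compute the discounted accumulated returns, and push the result into a queue. discounted_returns and actor_step are hypothetical names, and Python's queue.Queue stands in for the FIFOQueue Component. The real actor performs these steps as graph operations.

```python
import queue


def discounted_returns(rewards, terminals, discount=0.99):
    """Discounted return-to-go per step, reset at episode boundaries."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if terminals[t]:
            # Step t ended an episode: do not carry returns across the boundary.
            running = 0.0
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


def actor_step(rollout, fifo, discount=0.99):
    """Attach returns to an n-step rollout and enqueue it for the learner."""
    record = dict(rollout)
    record["returns"] = discounted_returns(
        rollout["rewards"], rollout["terminals"], discount
    )
    fifo.put(record)


fifo = queue.Queue()
rollout = {
    "actions": [0, 1, 1],
    "action_probs": [0.5, 0.7, 0.6],
    "rewards": [1.0, 1.0, 1.0],
    "terminals": [False, False, False],
}
actor_step(rollout, fifo, discount=0.5)
enqueued = fifo.get()
```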

define_api_methods_learner(fifo_output_splitter, fifo_queue, states_dict_splitter, transpose_states, transpose_terminals, transpose_action_probs, staging_area, preprocessor, policy, loss_function, optimizer)[source]

Defines the API-methods used by an IMPALA learner. Its job is to pull a batch from the FIFOQueue, split it up into its components, and pass these through the loss function and into the optimizer for a learning update.


fifo_queue (FIFOQueue): The FIFOQueue Component used to enqueue env sample runs (n-step).

splitter (ContainerSplitter): The DictSplitter Component to split up a batch from the queue along its

policy (Policy): The Policy Component, which to update.
loss_function (IMPALALossFunction): The IMPALALossFunction Component.
optimizer (Optimizer): The optimizer that we use to calculate an update and apply it.
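
The first stages of the learner's data flow (pull, split, transpose) can be sketched as below. pull_batch, split_batch and transpose are hypothetical helpers for this example, and queue.Queue again stands in for the FIFOQueue Component. The loss and optimizer steps are omitted because they depend on the concrete Components.

```python
import queue


def pull_batch(fifo, batch_size):
    """Dequeue batch_size rollouts (blocks until enough are available)."""
    return [fifo.get() for _ in range(batch_size)]


def split_batch(records, keys):
    """Split a list of rollout dicts into one batch-major list per key."""
    return {key: [record[key] for record in records] for key in keys}


def transpose(batch_major):
    """Convert [batch][time] data into time-major [time][batch] layout."""
    return [list(step) for step in zip(*batch_major)]


rollout_queue = queue.Queue()
rollout_queue.put({"rewards": [1.0, 0.0], "actions": [0, 1]})
rollout_queue.put({"rewards": [0.5, 2.0], "actions": [1, 1]})

batch = split_batch(pull_batch(rollout_queue, 2), ["rewards", "actions"])
time_major_rewards = transpose(batch["rewards"])
```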

define_api_methods_single(fifo_output_splitter, fifo_queue, queue_runner, transpose_actions, transpose_rewards, transpose_terminals, transpose_action_probs, preprocessor, staging_area, concat, policy, loss_function, optimizer)[source]
get_action(states, internal_states=None, use_exploration=True, extra_returns=None)[source]

Returns action(s) for the passed state(s). If states is a single state, returns a single action, otherwise, returns a batch of actions, where batch-size = number of states passed in.


states (Union[dict,np.ndarray]): States dict/tuple or numpy array.
internals (Union[dict,np.ndarray]): Internal states dict/tuple or numpy array.

use_exploration (bool): If False, no exploration or sampling may be applied when retrieving an action.
apply_preprocessing (bool): If True, applies any configured state preprocessors before acting. Set to False if all pre-processing is handled externally both for acting and updating.
extra_returns (Optional[Set[str]]): Optional set of Agent-specific strings for additional return values (besides the actions). All Agents must support “preprocessed_states”.
any: Action(s) as dict/tuple/np.ndarray (depending on self.action_space).
Optional: The preprocessed states as a 2nd return value.

Performs an update on the computation graph either via external experience or by sampling from an internal memory.

batch (Optional[dict]): Optional external data batch to use for update. If None, the
agent should be configured to sample internally.
float: The loss value calculated in this update.