3. Agent Classes
3.1. Agent Base Class Reference
-
class rlgraph.agents.agent.Agent(state_space, action_space, discount=0.98, preprocessing_spec=None, network_spec=None, internal_states_space=None, action_adapter_spec=None, exploration_spec=None, execution_spec=None, optimizer_spec=None, observe_spec=None, update_spec=None, summary_spec=None, saver_spec=None, auto_build=True, name='agent')
Bases: rlgraph.utils.specifiable.Specifiable
Generic agent that defines the RLGraph API operations and parses and sanitizes configuration specs.
-
build(build_options=None)
Builds this agent. This method can only be called if the agent parameter “auto_build” was set to False.
- Args:
- build_options (Optional[dict]): Optional build options, see build doc.
-
call_api_method(op, inputs=None, return_ops=None)
Utility method to call any desired API method on the graph, identified via its output socket. Delegates the call to the RLGraph graph executor.
- Args:
- op (str): Name of the API method.
- inputs (Optional[dict,np.array]): Dict specifying the inputs provided to the API method (key=input space name, values=the values that should go into this space, e.g. numpy arrays).
- Returns:
- any: Result of the op call.
-
define_api_methods(policy_scope, pre_processor_scope, optimizer_scope, *params)
Can be used to specify the API methods of the Agent’s CoreComponent and then register them via self.define_api_method. Each agent implements this to build its algorithm logic.
- Args:
- policy_scope (str): The global scope of the Policy within the Agent.
- pre_processor_scope (str): The global scope of the PreprocessorStack within the Agent.
- params (any): Params to be used freely by child Agent implementations.
-
export_graph(filename=None)
Any algorithm defined as a full graph, as opposed to a mixed one (mixed Python and graph control flow), should be able to export its graph for deployment.
- Args:
- filename (str): Export path. Depending on the backend, different filetypes may be required.
-
get_action(states, internals=None, use_exploration=True, apply_preprocessing=True, extra_returns=None)
Returns action(s) for the passed state(s). If states is a single state, a single action is returned; otherwise, a batch of actions is returned, where the batch size equals the number of states passed in.
- Args:
- states (Union[dict,np.ndarray]): States dict/tuple or numpy array.
- internals (Union[dict,np.ndarray]): Internal states dict/tuple or numpy array.
- use_exploration (bool): If False, no exploration or sampling may be applied when retrieving an action.
- apply_preprocessing (bool): If True, applies any configured state preprocessors to the states before computing the action. Set to False if all preprocessing is handled externally, both for acting and updating.
- extra_returns (Optional[Set[str]]): Optional set of Agent-specific strings for additional return values (besides the actions). All Agents must support “preprocessed_states”.
- Returns:
- any: Action(s) as dict/tuple/np.ndarray (depending on self.action_space).
- Optional: The preprocessed states as a 2nd return value.
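The single-state versus batch contract described above can be illustrated with a small stand-in. This sketch does not use RLGraph: `ToyAgent`, its rescaling preprocessor and its greedy policy are hypothetical and only mimic the documented behavior (single state in, single action out; batch in, batch out; “preprocessed_states” as an optional extra return).

```python
class ToyAgent:
    """Hypothetical stand-in that mimics the documented get_action contract."""

    def __init__(self, scale=0.5):
        self.scale = scale  # assumption: preprocessing is a simple rescaling

    def preprocess_states(self, states):
        return [[x * self.scale for x in s] for s in states]

    def get_action(self, states, apply_preprocessing=True, extra_returns=None):
        extra_returns = set(extra_returns or ())
        # In this toy, a single state is a flat list of numbers and a batch
        # is a list of such lists (dict states are not handled here).
        single = not isinstance(states[0], (list, tuple))
        batch = [states] if single else list(states)
        if apply_preprocessing:
            batch = self.preprocess_states(batch)
        # Toy greedy policy: the action is the index of the largest feature.
        actions = [max(range(len(s)), key=lambda i: s[i]) for s in batch]
        result = actions[0] if single else actions
        if "preprocessed_states" in extra_returns:
            return result, (batch[0] if single else batch)
        return result


agent = ToyAgent()
single_action = agent.get_action([0.1, 0.9, 0.3])
batch_actions = agent.get_action([[0.1, 0.9, 0.3], [0.7, 0.2, 0.1]])
action, pre = agent.get_action([2.0, 4.0], extra_returns={"preprocessed_states"})
```

Requesting “preprocessed_states” is useful when the same preprocessed data should later be passed to observe without running the preprocessor a second time.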
-
get_policy_weights()
Returns all weights relevant for the agent’s policy for syncing purposes.
- Returns:
- any: Weights and optionally weight metadata for this model.
-
import_observations(observations)
Bulk imports observations, potentially using device pre-fetching. Can optionally be implemented by agents requiring pre-training.
- Args:
- observations (dict): Dict or list of observation data.
-
load_model(path=None)
Loads a model from a serialized format.
- Args:
- path (str): Path to checkpoint directory.
-
observe(preprocessed_states, actions, internals, rewards, next_states, terminals, env_id=None)
Observes an experience tuple or a batch of experience tuples. Note: If buffering is configured, the call first fills the buffers and then internally calls _observe_graph() to actually run the computation graph. If buffering is disabled, the call is routed directly to the _observe_graph() method of the child Agent.
- Args:
- preprocessed_states (Union[dict, ndarray]): Preprocessed states dict or array.
- actions (Union[dict, ndarray]): Actions dict or array containing actions performed for the given state(s).
- internals (Union[list]): Internal state(s) returned by the agent for the given states. Must be an empty list if no internals are available.
- rewards (float): Scalar reward(s) observed.
- next_states (Union[dict, ndarray]): Preprocessed next states dict or array.
- terminals (bool): Boolean indicating terminal.
- env_id (Optional[str]): Environment id to observe for. When using vectorized execution and buffering, environment ids are necessary to ensure that correct trajectories are inserted. See SingleThreadedWorker for example usage.
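Why environment ids matter for buffered observe calls can be seen in a minimal sketch. It is not RLGraph code: `BufferedObserver` is a hypothetical class, and the flush rule (flush on terminal or when the buffer is full) is an assumption made for illustration. The `flushed` list stands in for calls to _observe_graph().

```python
from collections import defaultdict


class BufferedObserver:
    """Hypothetical per-environment buffering in front of a graph call."""

    def __init__(self, buffer_size=3):
        self.buffer_size = buffer_size
        self.buffers = defaultdict(list)  # env_id -> list of experience tuples
        self.flushed = []                 # stands in for _observe_graph() calls

    def observe(self, state, action, reward, next_state, terminal, env_id="env_0"):
        buf = self.buffers[env_id]
        buf.append((state, action, reward, next_state, terminal))
        # Assumption: flush when the episode ends or the buffer is full.
        if terminal or len(buf) >= self.buffer_size:
            self.flushed.append((env_id, list(buf)))
            buf.clear()


obs = BufferedObserver(buffer_size=3)
# Two vectorized environments stepping in lockstep.
obs.observe(0, 1, 0.0, 1, False, env_id="a")
obs.observe(10, 0, 0.0, 11, False, env_id="b")
obs.observe(1, 1, 1.0, 2, True, env_id="a")   # episode in "a" ends, "a" is flushed
obs.observe(11, 0, 0.0, 12, False, env_id="b")
```

With a single shared buffer, the steps of environments "a" and "b" would be interleaved and the flushed trajectory would mix two episodes; keying the buffers by env_id keeps each trajectory contiguous.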
-
preprocess_states(states)
Applies the agent’s preprocessor to one or more states, e.g. to preprocess external data before inserting it into memory without acting. Returns the states unchanged if no preprocessor is defined.
- Args:
- states (np.array): State(s) to preprocess.
- Returns:
- np.array: Preprocessed states.
-
reset()
Must be implemented to define some reset behavior (before starting a new episode). This could include resetting the preprocessor and other Components.
-
reset_env_buffers(env_id=None)
Resets an environment buffer for buffered observe calls.
- Args:
- env_id (Optional[str]): Environment id to reset. If None, the default environment is used.
-
set_policy_weights(weights)
Sets the policy weights of this agent, e.g. for external syncing purposes.
- Args:
- weights (any): Weights and optionally metadata to update, depending on the backend.
- Raises:
- ValueError: If the weights do not match the graph weights in shapes and types.
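The kind of check that leads to the ValueError above might look like the following sketch. It is an illustration only: `shape_of` and `check_weights` are hypothetical helpers, weights are plain nested lists rather than backend tensors, and only names and shapes are compared (the type check mentioned above is left out).

```python
def shape_of(value):
    """Return the shape of a nested list as a tuple (rectangular input assumed)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def check_weights(graph_weights, new_weights):
    """Hypothetical helper: raise ValueError if names or shapes differ."""
    if set(graph_weights) != set(new_weights):
        raise ValueError("Weight names do not match the graph variables.")
    for name, current in graph_weights.items():
        if shape_of(current) != shape_of(new_weights[name]):
            raise ValueError(
                "Shape mismatch for {}: {} vs {}".format(
                    name, shape_of(current), shape_of(new_weights[name])
                )
            )
    return True


graph = {"dense/kernel": [[0.0, 0.0], [0.0, 0.0]], "dense/bias": [0.0, 0.0]}
ok = check_weights(
    graph, {"dense/kernel": [[1.0, 2.0], [3.0, 4.0]], "dense/bias": [0.1, 0.2]}
)
try:
    check_weights(graph, {"dense/kernel": [[1.0, 2.0]], "dense/bias": [0.1, 0.2]})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```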
-
store_model(path=None, add_timestep=True)
Stores the model using the backend’s checkpointing mechanism.
- Args:
- path (str): Path to model directory.
- add_timestep (bool): Indicates whether the current training step should be appended to the exported model. If False, previous checkpoints may be overwritten.
-
terminate()
Terminates the Agent, so it will no longer be usable. Things that need to be cleaned up should be placed into this function, e.g. closing sessions and other open connections.
-
update(batch=None)
Performs an update on the computation graph, either via external experience data or by sampling from an internal memory.
- Args:
- batch (Optional[dict]): Optional external data batch to use for the update. If None, the agent should be configured to sample internally.
- Returns:
- float: The loss value calculated in this update.
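The two update paths (external batch versus internal sampling) can be sketched as follows. `ToyUpdater` is a hypothetical stand-in, not an RLGraph class; its memory is a plain list of rewards and its "loss" is a mean squared reward chosen only so that the example returns a float.

```python
import random


class ToyUpdater:
    """Hypothetical stand-in for the two update paths."""

    def __init__(self, memory=None, batch_size=2, seed=0):
        self.memory = list(memory or [])
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def update(self, batch=None):
        if batch is None:
            # No external data given: sample from the internal memory.
            if len(self.memory) < self.batch_size:
                raise ValueError("Internal memory holds too few records to sample.")
            batch = {"rewards": self.rng.sample(self.memory, self.batch_size)}
        # Toy "loss": mean squared reward, standing in for the real loss value.
        rewards = batch["rewards"]
        return sum(r * r for r in rewards) / len(rewards)


external_loss = ToyUpdater().update({"rewards": [1.0, 3.0]})
internal_loss = ToyUpdater(memory=[2.0, 2.0, 2.0]).update()
```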
-
3.2. DQN Agent
-
class rlgraph.agents.dqn_agent.DQNAgent(double_q=True, dueling_q=True, huber_loss=False, n_step=1, memory_spec=None, store_last_memory_batch=False, store_last_q_table=False, **kwargs)
Bases: rlgraph.agents.agent.Agent
A collection of DQN algorithms published in the following papers:
- [1] Human-level control through deep reinforcement learning. Mnih, Kavukcuoglu, Silver et al. - 2015
- [2] Deep Reinforcement Learning with Double Q-learning. v. Hasselt, Guez, Silver - 2015
- [3] Dueling Network Architectures for Deep Reinforcement Learning. Wang et al. - 2016
- [4] https://en.wikipedia.org/wiki/Huber_loss
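The constructor flags double_q, dueling_q and huber_loss refer to well-known formulas from the references above. The following is a minimal sketch of those formulas for a single transition, written in plain Python. The function names are hypothetical, and this is not RLGraph's implementation, which builds the same quantities as graph operations.

```python
def q_target(reward, next_q_online, next_q_target, terminal, discount=0.98, double_q=True):
    """One-step TD target for a single transition."""
    if terminal:
        return reward
    if double_q:
        # Double Q-learning [2]: the online net selects, the target net evaluates.
        best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        bootstrap = next_q_target[best]
    else:
        # Standard DQN [1]: the target net both selects and evaluates.
        bootstrap = max(next_q_target)
    return reward + discount * bootstrap


def dueling_q(state_value, advantages):
    """Dueling aggregation [3]: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def huber_loss(td_error, delta=1.0):
    """Quadratic near zero, linear for |td_error| > delta [4]."""
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_err - 0.5 * delta)


online, target = [1.0, 2.0], [3.0, 0.5]
double_target = q_target(1.0, online, target, terminal=False, discount=0.5)
plain_target = q_target(1.0, online, target, terminal=False, discount=0.5, double_q=False)
```

In this example the plain target bootstraps from the largest target-network value (3.0), while the double-Q target uses the target-network value of the action preferred by the online network (0.5), which is the mechanism that reduces overestimation.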
-
define_api_methods(policy_scope, pre_processor_scope, optimizer_scope, *sub_components)
Can be used to specify the API methods of the Agent’s CoreComponent and then register them via self.define_api_method. Each agent implements this to build its algorithm logic.
- Args:
- policy_scope (str): The global scope of the Policy within the Agent.
- pre_processor_scope (str): The global scope of the PreprocessorStack within the Agent.
- params (any): Params to be used freely by child Agent implementations.
-
get_action(states, internals=None, use_exploration=True, apply_preprocessing=True, extra_returns=None)
- Args:
- extra_returns (Optional[Set[str],str]): Optional string or set of strings for additional return values (besides the actions). Possible values are:
- ‘preprocessed_states’: The preprocessed states after passing the given states through the preprocessor stack.
- ‘internal_states’: The internal states returned by the RNNs in the NN pipeline.
- ‘used_exploration’: Whether epsilon- or noise-based exploration was used or not.
- Returns:
- tuple or single value depending on extra_returns:
- action
- the preprocessed states
-
reset()
Resets our preprocessor, but only if it contains stateful PreprocessLayer Components (meaning the PreprocessorStack has at least one variable defined).
-
update(batch=None)
Performs an update on the computation graph, either via external experience data or by sampling from an internal memory.
- Args:
- batch (Optional[dict]): Optional external data batch to use for the update. If None, the agent should be configured to sample internally.
- Returns:
- float: The loss value calculated in this update.
-
3.3. ApeX Agent
-
class rlgraph.agents.apex_agent.ApexAgent(memory_spec=None, **kwargs)
Bases: rlgraph.agents.dqn_agent.DQNAgent
Ape-X is a DQN variant designed for large-scale distributed execution, where many workers share a distributed prioritized experience replay.
Paper: https://arxiv.org/abs/1803.00933
The main difference from standard DQN is that Ape-X needs to provide additional operations to enable external updates of priorities. Ape-X also enables dueling and double DQN by default.
-
get_td_loss(batch)
Utility method that returns the TD loss for a batch without applying an update.
- Args:
- batch (dict): Input batch.
- Returns:
- Tuple: Total loss and loss per item.
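A per-item loss is the quantity a prioritized replay memory typically needs in order to refresh its priorities. The sketch below shows proportional prioritization in plain Python under stated assumptions: the helper names and the values of `alpha` and `eps` are illustrative choices and are not taken from RLGraph's memory components.

```python
def priorities_from_losses(loss_per_item, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i = (|loss_i| + eps) ** alpha."""
    return [(abs(loss) + eps) ** alpha for loss in loss_per_item]


def sampling_probabilities(priorities):
    """Normalize priorities into a sampling distribution."""
    total = sum(priorities)
    return [p / total for p in priorities]


# Hypothetical per-item losses for a batch of three records.
loss_per_item = [0.5, 2.0, 0.0]
priorities = priorities_from_losses(loss_per_item)
probs = sampling_probabilities(priorities)
```

The small `eps` keeps records with zero loss from never being sampled again, and `alpha` controls how strongly sampling favors high-loss records (`alpha=0` would give uniform sampling).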
-
update(batch=None)
Performs an update on the computation graph, either via external experience data or by sampling from an internal memory.
- Args:
- batch (Optional[dict]): Optional external data batch to use for the update. If None, the agent should be configured to sample internally.
- Returns:
- float: The loss value calculated in this update.
-
3.4. IMPALA Agent
-
class rlgraph.agents.impala_agent.IMPALAAgent(discount=0.99, fifo_queue_spec=None, architecture='large', environment_spec=None, weight_pg=None, weight_baseline=None, weight_entropy=None, worker_sample_size=100, dynamic_batching=False, **kwargs)
Bases: rlgraph.agents.agent.Agent
An Agent implementing the IMPALA algorithm described in [1]. The Agent contains both learner and actor API-methods, which will be put into the graph depending on the type ().
- [1] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures - Espeholt, Soyer, Munos et al. - 2018 (https://arxiv.org/abs/1802.01561)
-
default_environment_spec = {'frameskip': 4, 'level_id': 'seekavoid_arena_01', 'observations': ['RGB_INTERLEAVED', 'INSTR'], 'type': 'deepmind_lab'}
-
default_internal_states_space = Tuple(("Floatbox((256,) <class 'numpy.float32'> )", "Floatbox((256,) <class 'numpy.float32'> )"))
-
define_api_methods(*sub_components)
Can be used to specify the API methods of the Agent’s CoreComponent and then register them via self.define_api_method. Each agent implements this to build its algorithm logic.
- Args:
- policy_scope (str): The global scope of the Policy within the Agent.
- pre_processor_scope (str): The global scope of the PreprocessorStack within the Agent.
- params (any): Params to be used freely by child Agent implementations.
-
define_api_methods_actor(env_stepper, env_output_splitter, internal_states_slicer, merger, states_dict_splitter, fifo_queue)
Defines the API-methods used by an IMPALA actor. Actors only step through an environment (n steps at a time), collect the results and push them into the FIFO queue. Results include the actions actually taken, the discounted accumulated returns for each action, and the probability of each taken action according to the behavior policy.
- Args:
- env_stepper (EnvironmentStepper): The EnvironmentStepper Component to step through the Env n steps in a single op call.
- fifo_queue (FIFOQueue): The FIFOQueue Component used to enqueue env sample runs (n-step).
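The "discounted accumulated returns" collected by an actor over an n-step rollout can be computed with a single backward pass. The function below is a plain-Python sketch of that standard calculation; `discounted_returns` and its `bootstrap_value` parameter are hypothetical names, and RLGraph performs the equivalent computation inside the graph.

```python
def discounted_returns(rewards, terminals, discount=0.99, bootstrap_value=0.0):
    """Backward pass computing G_t = r_t + discount * G_{t+1}, reset at terminals."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        if terminals[t]:
            running = 0.0  # no value flows back across an episode boundary
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


# A 3-step rollout in which the episode ends at the second step.
rollout_returns = discounted_returns(
    rewards=[1.0, 1.0, 1.0], terminals=[False, True, False], discount=0.5
)
```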
-
define_api_methods_learner(fifo_output_splitter, fifo_queue, states_dict_splitter, transpose_states, transpose_terminals, transpose_action_probs, staging_area, preprocessor, policy, loss_function, optimizer)
Defines the API-methods used by an IMPALA learner. Its job is essentially to pull a batch from the FIFOQueue, split it up into its components and pass these through the loss function and into the optimizer for a learning update.
- Args:
- fifo_queue (FIFOQueue): The FIFOQueue Component used to enqueue env sample runs (n-step).
- splitter (ContainerSplitter): The DictSplitter Component to split up a batch from the queue along its items.
- policy (Policy): The Policy Component to update.
- loss_function (IMPALALossFunction): The IMPALALossFunction Component.
- optimizer (Optimizer): The optimizer that we use to calculate an update and apply it.
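The actor-to-learner data flow described above (enqueue n-step rollouts, dequeue a batch, split it by key) can be sketched with Python's standard `queue` module. Everything here is a hypothetical stand-in: `actor_rollout` and `learner_step` are illustrative helpers, and the in-process `queue.Queue` replaces the FIFOQueue Component that RLGraph places into the graph.

```python
import queue


def actor_rollout(actor_id, n_steps):
    """Hypothetical actor output: one n-step rollout as a dict of lists."""
    return {
        "states": [[actor_id, t] for t in range(n_steps)],
        "actions": [t % 2 for t in range(n_steps)],
        "rewards": [1.0] * n_steps,
        "terminals": [False] * (n_steps - 1) + [True],
    }


def learner_step(fifo, batch_size):
    """Dequeue batch_size rollouts and split them into per-key batches."""
    rollouts = [fifo.get_nowait() for _ in range(batch_size)]
    return {key: [r[key] for r in rollouts] for key in rollouts[0]}


fifo = queue.Queue()
for actor_id in range(2):
    fifo.put(actor_rollout(actor_id, n_steps=3))

batch = learner_step(fifo, batch_size=2)
```

After the split, each entry of `batch` has the layout [rollout][time step], which is the point at which a loss function and optimizer would take over in a real learner.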
-
define_api_methods_single(fifo_output_splitter, fifo_queue, queue_runner, transpose_actions, transpose_rewards, transpose_terminals, transpose_action_probs, preprocessor, staging_area, concat, policy, loss_function, optimizer)
-
get_action(states, internal_states=None, use_exploration=True, extra_returns=None)
Returns action(s) for the passed state(s). If states is a single state, a single action is returned; otherwise, a batch of actions is returned, where the batch size equals the number of states passed in.
- Args:
- states (Union[dict,np.ndarray]): States dict/tuple or numpy array.
- internals (Union[dict,np.ndarray]): Internal states dict/tuple or numpy array.
- use_exploration (bool): If False, no exploration or sampling may be applied when retrieving an action.
- apply_preprocessing (bool): If True, applies any configured state preprocessors to the states before computing the action. Set to False if all preprocessing is handled externally, both for acting and updating.
- extra_returns (Optional[Set[str]]): Optional set of Agent-specific strings for additional return values (besides the actions). All Agents must support “preprocessed_states”.
- Returns:
- any: Action(s) as dict/tuple/np.ndarray (depending on self.action_space).
- Optional: The preprocessed states as a 2nd return value.
-
update(batch=None)
Performs an update on the computation graph, either via external experience data or by sampling from an internal memory.
- Args:
- batch (Optional[dict]): Optional external data batch to use for the update. If None, the agent should be configured to sample internally.
- Returns:
- float: The loss value calculated in this update.