getMaxQValue - Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete action space, given environment observations - MATLAB ([original](https://www.mathworks.com/help/reinforcement-learning/ref/rl.function.rlqvaluefunction.getmaxqvalue.html))
Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete action space, given environment observations
Since R2020a
Syntax
Description
[maxQ,maxActionIndex] = getMaxQValue(qValueFcnObj,obs)
evaluates the discrete-action-space Q-value function critic qValueFcnObj and returns the maximum estimated value over all possible actions, maxQ, along with the corresponding action index maxActionIndex, given environment observations obs.
[maxQ,maxActionIndex,nextState] = getMaxQValue(___)
also returns the updated state of qValueFcnObj when it contains a recurrent neural network.
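For instance, a minimal sketch of this workflow, assuming a critic that uses a recurrent network (here obtained through agent initialization options, which is an assumption for illustration, not part of this page):
obsInfo = rlNumericSpec([3 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
initOpts = rlAgentInitializationOptions(UseRNN=true);   % assumed option to request a recurrent network
agent = rlDQNAgent(obsInfo,actInfo,initOpts);
critic = getCritic(agent);
[maxQ,maxIdx,nextState] = getMaxQValue(critic,{rand(3,1)});  % evaluate one observation
critic.State = nextState;   % carry the recurrent state into the next call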
___ = getMaxQValue(___,UseForward=useForward)
allows you to explicitly call a forward pass when computing gradients.
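For example, a minimal sketch, assuming critic is a discrete-action Q-value critic such as the one created in the example below and the observation is arbitrary:
[maxQ,maxIdx] = getMaxQValue(critic,{rand(3,1)},UseForward=true);   % force a forward pass, e.g. inside a custom training loop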
Examples
Obtain Maximum Q-Value Function Estimates
Create observation and action specification objects (or alternatively use getObservationInfo
and getActionInfo
to extract the specification objects from an environment). For this example, define the observation space as a continuous three-dimensional space, and the action space as a finite set consisting of three possible values (named -1, 0, and 1).
obsInfo = rlNumericSpec([3 1]); actInfo = rlFiniteSetSpec([-1 0 1]);
Create a default DQN agent and extract its critic.
agent = rlDQNAgent(obsInfo,actInfo); critic = getCritic(agent);
Use getMaxQValue
to return the maximum value, among the possible actions, given a random observation. Also return the index corresponding to the action that maximizes the value.
[v,i] = getMaxQValue(critic,{rand(3,1)})
Create a batch set of 64 random independent observations. The third dimension is the batch size, while the fourth is the sequence length for any recurrent neural network used by the critic (in this case not used).
batchobs = rand(3,1,64,1);
Obtain maximum values for all the observations.
bv = getMaxQValue(critic,{batchobs}); size(bv)
Select the maximum value corresponding to the 44th observation.
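A minimal way to do this, assuming bv is the 1-by-64 array returned above, is to index it directly:
bv(44)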
Input Arguments
obs
— Environment observations
cell array
Environment observations, specified as a cell array with as many elements as there are observation input channels. Each element of obs contains an array of observations for a single observation input channel.
The dimensions of each element in obs are MO-by-LB-by-LS, where:
- MO corresponds to the dimensions of the associated observation input channel.
- LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations, specify LB > 1. If qValueFcnObj has multiple observation input channels, then LB must be the same for all elements of obs.
- LS specifies the sequence length for a recurrent neural network. If qValueFcnObj does not use a recurrent neural network, then LS = 1. If qValueFcnObj has multiple observation input channels, then LS must be the same for all elements of obs.
For more information on input and output formats for recurrent neural networks, see the Algorithms section of lstmLayer.
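As an illustration, a minimal sketch of building obs for the critic from the earlier example (a single 3-by-1 observation channel; the batch size of 10 is arbitrary):
obs = {rand(3,1)};              % single observation: LB = 1, LS = 1
obsBatch = {rand(3,1,10)};      % batch of 10 observations: LB = 10, LS = 1
maxQ = getMaxQValue(critic,obsBatch);   % returns a 1-by-10 array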
useForward
— Option to use forward pass
false
(default) | true
Option to use forward pass, specified as a logical value. When you specify UseForward=true, the function calculates its outputs using forward instead of predict. This allows layers such as batch normalization and dropout to appropriately change their behavior for training.
Example: true
Output Arguments
maxQ
— Maximum Q-value estimate
array
Maximum Q-value estimate across all possible discrete actions, returned as a 1-by-LB-by-LS array, where:
- LB is the batch size.
- LS specifies the sequence length for a recurrent neural network. If qValueFcnObj does not use a recurrent neural network, then LS = 1.
maxActionIndex
— Action index
array
Action index corresponding to the maximum Q value, returned as a 1-by-LB-by-LS array, where:
- LB is the batch size.
- LS specifies the sequence length for a recurrent neural network. If qValueFcnObj does not use a recurrent neural network, then LS = 1.
nextState
— Updated state of the critic
cell array
Updated state of qValueFcnObj, returned as a cell array. If qValueFcnObj does not use a recurrent neural network, then nextState is an empty cell array.
You can set the state of the critic to state
using dot notation. For example:
qValueFcnObj.State=state;
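For instance, a minimal sketch, assuming recurrentCritic is a Q-value critic that uses a recurrent network (such as the one sketched earlier) and a sequence of 5 observations for a single batch element:
seqObs = {rand(3,1,1,5)};        % LB = 1, LS = 5
[maxQ,maxIdx,nextState] = getMaxQValue(recurrentCritic,seqObs);
recurrentCritic.State = nextState;   % persist the updated state for the next call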
Version History
Introduced in R2020a