Adaptive Moment Estimation uses a step-dependent learning rate, a first moment $$a$$ and a second moment $$b$$, reminiscent of the momentum and velocity of a particle:

$x^{(t+1)} = x^{(t)} - \eta^{(t+1)} \frac{a^{(t+1)}}{\sqrt{b^{(t+1)}} + \epsilon },$

where the update rules for the three values are given by

$\begin{split}a^{(t+1)} &= \beta_1 a^{(t)} + (1-\beta_1)\nabla f(x^{(t)}),\\ b^{(t+1)} &= \beta_2 b^{(t)} + (1-\beta_2) ( \nabla f(x^{(t)}))^{\odot 2},\\ \eta^{(t+1)} &= \eta \frac{\sqrt{(1-\beta_2^{t+1})}}{(1-\beta_1^{t+1})}.\end{split}$

Above, $$( \nabla f(x^{(t)}))^{\odot 2}$$ denotes the element-wise square of the gradient, that is, each element of the gradient is multiplied by itself. The hyperparameters $$\beta_1$$ and $$\beta_2$$ can also be step-dependent. Initially, the first and second moments are zero.

The shift $$\epsilon$$ avoids division by zero.

For more details, see arXiv:1412.6980.
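As a rough illustration of the update rule, the following plain-Python sketch applies the standard bias-corrected form described in arXiv:1412.6980, with the correction folded into a step-dependent learning rate. The function name `adam_update`, its argument layout, and the default values are assumptions made for this example; they are not part of the optimizer's interface.

```python
import math


def adam_update(x, grad, a, b, t, stepsize=0.01, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam step on flat lists of floats (illustrative helper, not library API).

    x, grad, a and b are lists of equal length; t is the number of steps
    already taken. Returns the new (x, a, b).
    """
    # Exponential moving averages of the gradient and of its element-wise square.
    a = [beta1 * ai + (1 - beta1) * gi for ai, gi in zip(a, grad)]
    b = [beta2 * bi + (1 - beta2) * gi * gi for bi, gi in zip(b, grad)]
    # Bias correction expressed as a step-dependent learning rate.
    eta = stepsize * math.sqrt(1 - beta2 ** (t + 1)) / (1 - beta1 ** (t + 1))
    x = [xi - eta * ai / (math.sqrt(bi) + eps) for xi, ai, bi in zip(x, a, b)]
    return x, a, b


# Minimise f(x) = x_0^2 + x_1^2, whose gradient is 2x.
x = [1.0, -2.0]
a = [0.0, 0.0]  # the first moment starts at zero
b = [0.0, 0.0]  # the second moment starts at zero
for t in range(500):
    grad = [2 * xi for xi in x]
    x, a, b = adam_update(x, grad, a, b, t, stepsize=0.05)
```

On the very first step the moments cancel against the bias correction, so each coordinate moves by roughly one stepsize in the direction opposite to its gradient, whatever the gradient's magnitude.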

Parameters
• stepsize (float) – the user-defined hyperparameter $$\eta$$

• beta1 (float) – hyperparameter $$\beta_1$$ governing the update of the first moment $$a$$

• beta2 (float) – hyperparameter $$\beta_2$$ governing the update of the second moment $$b$$

• eps (float) – offset $$\epsilon$$ added for numerical stability

• apply_grad(grad, args) – Update the variables args to take a single optimization step.

• compute_grad(objective_fn, args, kwargs[, …]) – Compute gradient of the objective function at the given point and return it along with the objective function forward pass (if available).

• reset() – Reset optimizer by erasing memory of past steps.

• step(objective_fn, *args[, grad_fn]) – Update trainable arguments with one step of the optimizer.

• step_and_cost(objective_fn, *args[, grad_fn]) – Update trainable arguments with one step of the optimizer and return the corresponding objective function value prior to the step.

• update_stepsize(stepsize) – Update the initialized stepsize value $$\eta$$.

apply_grad(grad, args)

Update the variables args to take a single optimization step. Flattens and unflattens the inputs to maintain nested iterables as the parameters of the optimization.

Parameters
• grad (tuple[array]) – the gradient of the objective function at point $$x^{(t)}$$: $$\nabla f(x^{(t)})$$

• args (tuple) – the current value of the variables $$x^{(t)}$$

Returns

the new values $$x^{(t+1)}$$

Return type

list
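The flatten and unflatten round trip mentioned for apply_grad can be pictured with the sketch below. The helpers `flatten` and `unflatten` are hypothetical stand-ins written for this example (they are not the optimizer's internal functions), and the constant shift of 0.5 merely stands in for a real parameter update.

```python
def flatten(nested):
    """Yield the scalar leaves of arbitrarily nested lists/tuples, in order."""
    if isinstance(nested, (list, tuple)):
        for item in nested:
            yield from flatten(item)
    else:
        yield nested


def unflatten(flat, like):
    """Rebuild the nesting of `like` from an iterable of leaves."""
    leaves = iter(flat)

    def build(template):
        if isinstance(template, (list, tuple)):
            # Preserve the container type (list or tuple) at every level.
            return type(template)(build(item) for item in template)
        return next(leaves)

    return build(like)


params = ([1.0, 2.0], (3.0, [4.0]))
leaves = list(flatten(params))           # update rules act on this flat list
updated = [v - 0.5 for v in leaves]      # placeholder for an optimizer update
restored = unflatten(updated, params)    # same nesting as the input
```

Working on a flat sequence keeps the update rule independent of how the caller chose to group the parameters.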

compute_grad(objective_fn, args, kwargs[, …])

Compute gradient of the objective function at the given point and return it along with the objective function forward pass (if available).

Parameters
• objective_fn (function) – the objective function for optimization

• args (tuple) – tuple of NumPy arrays containing the current parameters for the objective function

• kwargs (dict) – keyword arguments for the objective function

• grad_fn (function) – optional gradient function of the objective function with respect to the variables args. If None, the gradient function is computed automatically. Must return the same shape of tuple [array] as the autograd derivative.

Returns

NumPy array containing the gradient $$\nabla f(x^{(t)})$$ and the objective function output. If grad_fn is provided, the objective function will not be evaluated and None will be returned instead.

Return type

tuple (array)
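The return contract described above (a gradient together with the forward value, or None in its place when grad_fn is supplied) can be sketched as follows. The function `compute_grad_sketch` is a hypothetical stand-in that handles only scalar arguments, and it uses central finite differences in place of automatic differentiation; the step `h` is likewise an assumption of this example.

```python
def compute_grad_sketch(objective_fn, args, kwargs, grad_fn=None, h=1e-6):
    """Return (gradient, forward) for a tuple of scalar arguments.

    Illustrative only: finite differences replace automatic differentiation.
    """
    if grad_fn is not None:
        # A user-supplied gradient is used as-is; the objective is not evaluated.
        return grad_fn(*args, **kwargs), None

    forward = objective_fn(*args, **kwargs)
    grad = []
    for i, value in enumerate(args):
        up = args[:i] + (value + h,) + args[i + 1:]
        down = args[:i] + (value - h,) + args[i + 1:]
        diff = objective_fn(*up, **kwargs) - objective_fn(*down, **kwargs)
        grad.append(diff / (2 * h))
    return tuple(grad), forward


def cost(x, y, scale=1.0):
    return scale * (x ** 2 + 3 * y)


# Without grad_fn: the gradient is estimated and the forward value is returned.
grad, forward = compute_grad_sketch(cost, (2.0, 1.0), {"scale": 2.0})

# With grad_fn: the objective is skipped, so the second entry is None.
manual, skipped = compute_grad_sketch(
    cost, (2.0, 1.0), {"scale": 2.0},
    grad_fn=lambda x, y, scale=1.0: (2 * scale * x, 3 * scale),
)
```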

reset()

Reset optimizer by erasing memory of past steps.

step(objective_fn, *args[, grad_fn])

Update trainable arguments with one step of the optimizer.

Parameters
• objective_fn (function) – the objective function for optimization

• *args – variable length argument list for objective function

• grad_fn (function) – optional gradient function of the objective function with respect to the variables *args. If None, the gradient function is computed automatically. Must return the same shape of tuple [array] as the autograd derivative.

• **kwargs – variable length of keyword arguments for the objective function

Returns

the new variable values $$x^{(t+1)}$$. If a single argument is provided, list [array] is replaced by array.

Return type

list [array]

step_and_cost(objective_fn, *args[, grad_fn])

Update trainable arguments with one step of the optimizer and return the corresponding objective function value prior to the step.

Parameters
• objective_fn (function) – the objective function for optimization

• *args – variable length argument list for objective function

• grad_fn (function) – optional gradient function of the objective function with respect to the variables *args. If None, the gradient function is computed automatically. Must return the same shape of tuple [array] as the autograd derivative.

• **kwargs – variable length of keyword arguments for the objective function

Returns

the new variable values $$x^{(t+1)}$$ and the objective function output prior to the step. If a single argument is provided, list [array] is replaced by array.

Return type

tuple[list [array], float]
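To make the calling convention concrete, here is a minimal sketch of the step and step_and_cost return values. `ToyOptimizer` is a hypothetical class written for this example: it swaps in plain gradient descent for the Adam update so the focus stays on the interface, it accepts scalar arguments only, and it requires grad_fn because no automatic differentiation is available in this sketch.

```python
class ToyOptimizer:
    """Gradient-descent stand-in exposing step and step_and_cost (illustrative)."""

    def __init__(self, stepsize=0.1):
        self.stepsize = stepsize

    def step_and_cost(self, objective_fn, *args, grad_fn, **kwargs):
        cost = objective_fn(*args, **kwargs)  # evaluated before the update
        grad = grad_fn(*args, **kwargs)       # one entry per positional argument
        new_args = [x - self.stepsize * g for x, g in zip(args, grad)]
        # A single argument is returned bare rather than wrapped in a list.
        if len(new_args) == 1:
            return new_args[0], cost
        return new_args, cost

    def step(self, objective_fn, *args, grad_fn, **kwargs):
        new_args, _ = self.step_and_cost(objective_fn, *args, grad_fn=grad_fn, **kwargs)
        return new_args


opt = ToyOptimizer(stepsize=0.1)

# One argument: the new value comes back directly, with the cost at the old point.
x, cost_before = opt.step_and_cost(lambda x: x ** 2, 1.0, grad_fn=lambda x: (2 * x,))

# Two arguments: the new values come back as a list.
new_values = opt.step(
    lambda x, y: x ** 2 + y ** 2, 1.0, 2.0, grad_fn=lambda x, y: (2 * x, 2 * y)
)
```

Because the cost is taken before the update, a training loop that logs it is always one step behind the parameters it holds.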

update_stepsize(stepsize)

Update the initialized stepsize value $$\eta$$.

This allows for techniques such as learning rate scheduling.

Parameters
• stepsize (float) – the user-defined hyperparameter $$\eta$$
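One way learning rate scheduling might be driven through update_stepsize is sketched below. `StubOptimizer` and `exponential_decay` are hypothetical names introduced for this example; the stub only stores the stepsize, and the decay rule is just one possible schedule.

```python
class StubOptimizer:
    """Holds a stepsize and nothing else (illustrative stand-in)."""

    def __init__(self, stepsize):
        self.stepsize = stepsize

    def update_stepsize(self, stepsize):
        self.stepsize = stepsize


def exponential_decay(initial, decay_rate, iteration):
    """Stepsize after `iteration` multiplicative decays."""
    return initial * decay_rate ** iteration


opt = StubOptimizer(stepsize=0.1)
history = []
for iteration in range(5):
    # Set the stepsize for this iteration before taking the optimization step.
    opt.update_stepsize(exponential_decay(0.1, 0.5, iteration))
    history.append(opt.stepsize)
    # ... an optimization step using the current stepsize would go here ...
```

Here the stepsize halves on every iteration, starting from its initial value of 0.1.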