1 Problems facing deep reinforcement learning

（1）Stochastic gradient descent optimisation requires the use of small learning rates, so learning is slow.

（2）Environments with a sparse reward signal can be difficult for a neural network to model, as there may be very few instances where the reward is non-zero.

（3）Reward signal propagation by value-bootstrapping techniques, such as Q-learning, results in reward information being propagated one step at a time through the history of previous interactions with the environment, so the feedback signal propagates slowly.

3 DND (Differentiable Neural Dictionary)

A DND holds, for each action a, an array of keys and an array of values. Looking up a query key h returns a weighted sum of the stored values, o = Σ_i w_i v_i, with weights w_i = k(h, h_i) / Σ_j k(h, h_j), where k(x, y) is a kernel between vectors x and y; the paper uses the inverse-distance kernel k(h, h_i) = 1 / (‖h − h_i‖² + δ).
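The lookup can be sketched as below. This is a minimal illustration, not the paper's implementation: the key/value arrays, the `delta` constant, and exhaustive (rather than approximate nearest-neighbour) search are assumptions.

```python
import numpy as np

def kernel(x, y, delta=1e-3):
    # Inverse-distance kernel: k(x, y) = 1 / (||x - y||^2 + delta).
    return 1.0 / (np.sum((x - y) ** 2) + delta)

def dnd_lookup(h, keys, values, delta=1e-3):
    # Output is a kernel-weighted sum of stored values; the weights are
    # the normalised kernel similarities between h and each stored key.
    k = np.array([kernel(h, h_i, delta) for h_i in keys])
    w = k / k.sum()
    return float(w @ np.array(values))
```

A query close to one stored key returns roughly that key's value, since its normalised weight dominates.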

4 ε-greedy policy

The NEC procedure: embed the observation s into a key h with a convolutional network, query each action's DND for an estimate of Q(s, a), and select an action using an ε-greedy policy over these estimates.

The ε-greedy policy: with probability ε choose a uniformly random action; otherwise choose the action with the highest estimated Q value.
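A minimal sketch of ε-greedy action selection; the function name and the list-of-Q-values interface are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon, explore: pick a uniformly random action.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest Q estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon = 0 this is pure greedy action selection; with epsilon = 1 it is uniform random exploration.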

5 Updating memory values

The N-step Q estimate is computed as

Q^(N)(s_t, a) = Σ_{j=0}^{N-1} γ^j r_{t+j} + γ^N max_{a'} Q(s_{t+N}, a'),

i.e. the first N rewards are taken from experience and the remainder is bootstrapped from the current value estimates. A value already stored in the DND is then updated toward this estimate in tabular Q-learning style: Q_i ← Q_i + α (Q^(N) − Q_i).
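The estimate and the tabular-style update can be sketched as follows, assuming the N rewards and the bootstrap value max_{a'} Q(s_{t+N}, a') are already available:

```python
def n_step_q(rewards, bootstrap_q, gamma):
    # Q^(N) = sum_{j=0}^{N-1} gamma^j * r_{t+j} + gamma^N * max_a' Q(s_{t+N}, a'),
    # where N = len(rewards) and bootstrap_q is the max over next-state Q values.
    n = len(rewards)
    ret = sum(gamma ** j * r for j, r in enumerate(rewards))
    return ret + gamma ** n * bootstrap_q

def memory_update(q_i, target, alpha):
    # Move a stored DND value toward the N-step target, tabular Q-learning style.
    return q_i + alpha * (target - q_i)
```

For example, with rewards [1, 1], gamma 0.5, and a bootstrap value of 10, the target is 1 + 0.5 + 0.25 × 10 = 4.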

6 Deriving the N-step Q estimate

Two common ways to estimate value in Q-learning are the Monte Carlo (MC) return, which sums actual discounted rewards until the end of an episode, and the temporal-difference (TD) target, which bootstraps from the learned Q function after a single step. The N-step estimate interpolates between them: it uses real rewards for the first N steps and bootstraps thereafter.

7 Training the model

D is a replay buffer that stores experience tuples and supplies samples for off-policy training.

（1）Sample mini-batches from the replay buffer D.
（2）Compute the predicted Q value with NEC.
（3）Minimise the L2 loss between the predicted Q value for the taken action and the Q^(N) estimate.
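The loss step above can be sketched for a single transition. This is a simplified assumption-laden illustration: only the stored DND values receive a gradient here, whereas in NEC gradients also flow into the keys and the convolutional embedding network.

```python
import numpy as np

def l2_loss(pred_q, target_q):
    # Squared error between the predicted Q(s, a) and the N-step estimate.
    return float((pred_q - target_q) ** 2)

def value_grad_step(values, weights, target_q, lr=0.1):
    # The DND prediction is pred = w . v, so for the squared-error loss
    # d(loss)/d(v_i) = 2 * (pred - target) * w_i; take one SGD step.
    pred = float(weights @ values)
    grad = 2.0 * (pred - target_q) * weights
    return values - lr * grad
```

One step moves the prediction toward the target without ever reaching past it for a small enough learning rate.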
