
强化学习中离策略算法的分析及研究

Analysis and Research on Off-policy Algorithms in Reinforcement Learning

【Author】 傅启明 (Fu Qiming)

【Supervisor】 刘全 (Liu Quan)

【Author Information】 苏州大学 (Soochow University), Computer Science and Technology, 2014, Ph.D.

【Abstract (摘要)】 Reinforcement learning is a learning method that searches, through "trial-and-error" interaction with the environment, for a policy that yields the maximal expected cumulative reward. Depending on whether the behavior policy and the target policy coincide during learning, reinforcement learning methods are divided into on-policy algorithms and off-policy algorithms. Compared with on-policy algorithms, off-policy algorithms have a wider range of applications and have become a research hotspot in reinforcement learning. This dissertation analyzes the main problems in current off-policy research, namely difficulty of convergence, slow convergence, and low convergence accuracy, and proposes a series of solutions. The main work comprises the following four parts:

(1) An off-policy Q(λ) algorithm based on linear function approximation is proposed. By introducing an associated importance factor, the algorithm unifies the on-policy and off-policy settings as the number of iterations grows, thereby ensuring convergence. Under the premise that the on-policy and off-policy sample data are consistent, a theoretical proof of convergence is given.

(2) Starting from the TD error, the concept of the n-th-order TD error is introduced and applied to the classical Q(λ) learning algorithm, yielding a fast Q(λ) learning algorithm based on the second-order TD error, SOE-FQ(λ). The algorithm corrects the Q-value function with the second-order TD error and propagates the TD error over the whole state-action space via eligibility traces, which speeds up convergence. On this basis, the convergence and convergence efficiency of the algorithm are analyzed; when only one-step updates are considered, the number of iterations T required by the algorithm depends mainly exponentially on 1/(1−γ) and 1/ε.

(3) Transferring value-function information during learning is proposed in order to reduce the number of samples required for convergence and to speed up convergence. Based on the learning framework of the classical off-policy Q-Learning algorithm, combined with the value-function transfer method and an optimized setting of the initial value function, a new fast Q-Learning algorithm based on value-function transfer, VFT-Q-Learning, is proposed. In its early stage, the algorithm introduces a bisimulation metric to measure, given identical state and action spaces, the distance between states of the target task and states of historical tasks; for states that are similar and satisfy certain conditions the value function is transferred, after which learning proceeds with the learning algorithm.

(4) For the problem of balancing exploration and exploitation in large-scale or continuous state spaces with deterministic environments, an off-policy approximate policy iteration algorithm based on Gaussian processes is proposed. The algorithm models the parameterized value function with a Gaussian process, constructs a generative model together with the associated importance factor, and obtains the posterior distribution of the value function by Bayesian inference. During learning, the value-of-information gain of each action is computed from the probability distribution of the value function and, combined with the expected value of the value function, the corresponding action is selected. To a certain extent, the algorithm resolves the exploration-exploitation trade-off and accelerates convergence.
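The importance-weighting idea in part (1) can be illustrated with a generic sketch of off-policy TD control using linear function approximation, where an importance-sampling ratio re-weights behavior-policy samples toward the (greedy) target policy. This is only an illustration of the general technique, not the dissertation's algorithm; the environment interface (`env.reset`, `env.step`), the feature map `phi`, the policy callbacks, and all hyper-parameters are hypothetical placeholders.

```python
# Minimal sketch, assuming a hypothetical environment/feature interface:
# generic off-policy Q(lambda)-style control with linear function
# approximation and an importance-sampling ratio (not the dissertation's
# off-policy Q(lambda) or SOE-FQ(lambda) algorithm).
import numpy as np

def off_policy_q_lambda(env, phi, n_features, n_actions,
                        behavior_policy, target_policy,
                        alpha=0.05, gamma=0.95, lam=0.8, episodes=200):
    """phi(s, a) -> 1-D feature vector of length n_features.
    behavior_policy(w, s) -> (action, probability under the behavior policy).
    target_policy(w, s, a) -> probability of `a` under the target policy."""
    w = np.zeros(n_features)                  # linear value-function weights
    for _ in range(episodes):
        s = env.reset()
        z = np.zeros(n_features)              # eligibility trace
        done = False
        while not done:
            a, b_prob = behavior_policy(w, s)
            s_next, r, done = env.step(a)
            # Greedy (target-policy) action value at the next state.
            q_next = 0.0 if done else max(
                w @ phi(s_next, a2) for a2 in range(n_actions))
            delta = r + gamma * q_next - w @ phi(s, a)   # one-step TD error
            # Importance ratio: decays or cuts the trace when the taken
            # action is unlikely under the target policy.
            rho = target_policy(w, s, a) / max(b_prob, 1e-8)
            z = rho * gamma * lam * z + phi(s, a)        # accumulating trace
            w += alpha * delta * z
            s = s_next
    return w
```

The ratio rho plays the same unifying role that the associated importance factor plays in the abstract: it reconciles data gathered under the behavior policy with updates aimed at the target policy.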

【Abstract】 Reinforcement learning is a kind of learning method which interacts with the environment in order to find the optimal policy with the maximal expected accumulated reward. According to whether the behavior policy and the target policy are identical in the learning process, reinforcement learning algorithms can be divided into two main classes: on-policy algorithms and off-policy algorithms. Compared with on-policy algorithms, off-policy algorithms have a much wider application range, and research related to off-policy algorithms has become more and more popular. With respect to the main problems in off-policy algorithms, such as non-convergence, slow convergence rate, and low convergence accuracy, this dissertation provides a series of solutions, which mainly include the following four parts:

(1) A novel off-policy Q(λ) algorithm based on linear function approximation is proposed. It introduces an associated importance factor, uses this factor to unify the on-policy and off-policy sample-data distributions during iteration, and thereby ensures convergence. Under the premise of sample-data consistency, a proof of convergence for the algorithm is given.

(2) From the perspective of the TD error, the n-th-order TD error is defined and applied to the traditional Q(λ) algorithm, yielding a fast Q(λ) algorithm based on the second-order TD error. The algorithm adjusts the Q-value with the second-order TD error and broadcasts the TD error to the whole state-action space, which speeds up convergence. In addition, the convergence rate is analyzed; under the condition of one-step updates, the result shows that the number of iterations mainly depends on 1/(1−γ) and 1/ε.

(3) Transferring the value function between similar learning tasks with the same state space and action space is proposed, which reduces the number of samples needed in the target task and speeds up convergence. Based on the framework of the off-policy Q-Learning algorithm, combined with the value-function transfer method, a novel fast Q-Learning algorithm based on value-function transfer, VFT-Q-Learning, is put forward. At the beginning, the algorithm uses a bisimulation metric to measure the distance between states in the target task and a historical task, on the condition that the two tasks have the same state space and action space; it transfers the value function when the distance meets certain conditions, and finally executes the learning algorithm.

(4) For the problem of balancing exploration and exploitation in large or continuous state spaces, a novel off-policy approximate policy iteration algorithm based on Gaussian processes is put forward. The algorithm uses a Gaussian process to model the action-value function, combines it with the associated importance factor to construct a generative model, and obtains the posterior distribution of the parameter vector of the action-value function by Bayesian inference. During the learning process, the value of perfect information is computed from the posterior distribution and, combined with the expected value of the action-value function, the appropriate action is selected. To a certain extent, the algorithm balances exploration and exploitation in the learning process and accelerates convergence.
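The value-function transfer idea in part (3) can likewise be sketched generically: initialize the target task's Q-values from a previously learned source task for states that a similarity measure judges close enough, then run ordinary ε-greedy Q-learning. In the sketch below, `state_distance` merely stands in for the bisimulation metric mentioned in the abstract, and `env.states`, `env.actions`, and `threshold` are hypothetical placeholders rather than the dissertation's VFT-Q-Learning specification.

```python
# Minimal sketch, assuming a hypothetical tabular environment interface:
# transfer Q-values from a source task, then run standard Q-learning
# (an illustration of value-function transfer, not VFT-Q-Learning itself).
import random
from collections import defaultdict

def transfer_then_q_learning(env, source_q, state_distance, threshold,
                             alpha=0.1, gamma=0.95, epsilon=0.1,
                             episodes=500):
    """source_q: dict[(state, action)] -> value learned in the source task.
    state_distance(s_target, s_source) -> distance between two states.
    Both tasks are assumed to share the same action set."""
    q = defaultdict(float)
    # 1) Transfer: copy values from the most similar source state when the
    #    distance falls below the (hypothetical) similarity threshold.
    source_states = {s for (s, _) in source_q}
    for s in env.states:
        nearest = min(source_states, key=lambda s_src: state_distance(s, s_src))
        if state_distance(s, nearest) <= threshold:
            for a in env.actions:
                q[(s, a)] = source_q.get((nearest, a), 0.0)
    # 2) Learn: ordinary epsilon-greedy Q-learning on the target task.
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(q[(s_next, act)]
                                             for act in env.actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q
```

The transferred values only change the starting point of learning; the subsequent Q-learning updates remain unchanged, which is what lets a good initialization reduce the number of samples needed to converge.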

  • 【Online Publication Contributor】 苏州大学 (Soochow University)
  • 【Online Publication Year and Issue】 2014, Issue 09