
Research on Model-Based Dynamic Hierarchical Reinforcement Learning Algorithms

【Author】 袁姣红

【Supervisors】 吴敏; 陈鑫

【Author Information】 Central South University (中南大学), Control Science and Engineering, 2011, Master's thesis

【Abstract (摘要)】 Reinforcement learning has become an important branch of machine learning owing to its favorable properties of self-learning and online learning. However, when an agent performs reinforcement learning in a large-scale, high-dimensional decision environment, it is plagued by the "curse of dimensionality" (the number of parameters to be learned grows exponentially with the dimensionality of the variables); the resulting low learning efficiency makes it difficult, or even impossible, to complete the learning task in time. Therefore, if the "curse of dimensionality" can be effectively alleviated and an efficient reinforcement learning method suited to unknown, large-scale, complex environments can be proposed, an effective solution for improving the adaptability of agents in practical applications will be provided, which is of great significance to the development of theory and technology in machine learning. To alleviate the "curse of dimensionality" in unknown large-scale environments and to improve learning efficiency, this thesis studies a method that combines dynamic hierarchical techniques with model-based self-learning, and proposes a dynamic hierarchical reinforcement learning algorithm based on adaptive clustering of exploration information within the model-based reinforcement learning process. The algorithm dynamically generates a MAXQ hierarchy that integrates state abstraction and temporal abstraction (also called action abstraction), and thereby accelerates learning significantly by restricting the policy search space of every subtask in the MAXQ hierarchy. First, during model-based reinforcement learning, an adaptive clustering algorithm driven by exploration information partitions the whole state space into several state subspaces, i.e., automatic task decomposition is achieved through state abstraction, and an improved action selection strategy is proposed based on the terminal state sets of the state subspaces. Second, temporal abstraction is performed according to the frequency with which each action is executed effectively, automatically generating a MAXQ-like hierarchy; each state subspace is then assigned to the corresponding MAXQ subtask according to its set of valid actions, so that a MAXQ hierarchy integrating state abstraction and temporal abstraction is generated automatically. Third, the recursively optimal policy of the task is searched for within this MAXQ hierarchical framework, and the MAXQ structure is adjusted dynamically during subsequent learning to reduce the limitations of an unreasonable initial hierarchy. Simulation experiments show that the proposed algorithm can significantly improve the learning efficiency of an agent in unknown environments and effectively alleviate the "curse of dimensionality", which verifies the effectiveness of the algorithm. Finally, the thesis is summarized and several issues for further research are presented.
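The abstract does not spell out the clustering criterion beyond "exploration information", so the following is only a minimal Python sketch of one plausible reading: states explored past a visit-count threshold are grouped into connected regions of the learned transition graph, and each region's exit states serve as its terminal state set. The function and parameter names (cluster_states, terminal_states, count_threshold) are illustrative assumptions, not taken from the thesis.

```python
from collections import defaultdict

def cluster_states(visit_counts, transitions, count_threshold=10):
    """Group sufficiently explored states into connected regions.

    visit_counts: {state: number of visits observed so far}
    transitions:  {(state, action): iterable of observed next states}
    count_threshold is an assumed exploration criterion; the thesis's
    actual clustering rule is not given in the abstract.
    """
    explored = {s for s, c in visit_counts.items() if c >= count_threshold}
    # Undirected adjacency over explored states, built from observed transitions.
    adjacency = defaultdict(set)
    for (s, _a), successors in transitions.items():
        for s_next in successors:
            if s in explored and s_next in explored:
                adjacency[s].add(s_next)
                adjacency[s_next].add(s)
    # Flood-fill connected components; each component is one state subspace.
    regions, assigned = [], set()
    for s in explored:
        if s in assigned:
            continue
        region, frontier = set(), [s]
        while frontier:
            cur = frontier.pop()
            if cur in region:
                continue
            region.add(cur)
            frontier.extend(adjacency[cur] - region)
        assigned |= region
        regions.append(region)
    return regions

def terminal_states(region, transitions):
    """Exit states of a region: states with an observed transition leaving it."""
    return {s for (s, _a), successors in transitions.items()
            if s in region and any(s_next not in region for s_next in successors)}
```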

【Abstract】 Reinforcement learning (RL) has been an important branch of machine learning for its good characteristics of self-learning and online learning. However, RL is perplexed by the problem of the "curse of dimensionality" (the number of parameters to be learned increases exponentially with the dimensionality of the variables) in large-scale environments, resulting in low learning efficiency, which means that the agent can hardly complete the task in time and may even fail to achieve the goal. Therefore, if a novel reinforcement learning method that can handle the "curse of dimensionality" in unknown large-scale worlds is presented, an effective solution for improving the adaptability of the agent in applications will be provided. More importantly, such a study is of great significance to the development of machine learning theory and technology. This thesis studies the combination of dynamic hierarchical RL and model-based RL, so as to address the "curse of dimensionality" in the unknown large-scale world and to improve the learning efficiency of the agent. During the process of model-based RL, a novel dynamic hierarchical reinforcement learning with adaptive clustering based on exploration information (DHRL-ACEI) algorithm is proposed. The DHRL-ACEI algorithm builds a MAXQ hierarchy in which state abstraction and temporal abstraction (also called action abstraction) are integrated, so it can accelerate learning remarkably by restricting the policy search space of every subtask in the MAXQ hierarchy. First, the whole state space is divided into several state subspaces (regions) by adaptive clustering based on exploration information during model-based RL, which constitutes state abstraction, and an improved action selection strategy is presented based on the sets of terminal states of these regions. Next, a MAXQ-like hierarchy is constructed automatically based on the frequency of successful execution of actions; the regions resulting from state abstraction are then incorporated into the relevant MAXQ subtasks according to their sets of valid actions, so that a MAXQ hierarchy combining state abstraction and temporal abstraction is built automatically. Then the recursively optimal hierarchical policy is derived on the MAXQ framework, and the hierarchy is updated dynamically during subsequent learning so as to reduce the limitation of an unreasonable hierarchy built at first. Simulation results show that the DHRL-ACEI algorithm can handle the "curse of dimensionality" and enhance the learning efficiency of the agent significantly in unknown large-scale environments, which demonstrates the effectiveness of the algorithm. Finally, this thesis draws conclusions and presents some issues for future research.
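The MAXQ value decomposition and the dynamic hierarchy adjustment are not detailed at abstract level. The sketch below therefore illustrates only the core space-restriction idea the abstract relies on: solving one subtask by value iteration confined to its own region and valid action set, over a tabular model assumed to have been estimated during model-based learning. The interface model[(s, a)] -> [(prob, next_state, reward), ...] and the name solve_subtask are assumptions, not the thesis's implementation.

```python
def solve_subtask(region, valid_actions, terminals, model, gamma=0.95, n_sweeps=200):
    """Value iteration confined to one state subspace.

    model: {(state, action): [(prob, next_state, reward), ...]} -- a tabular
    model assumed to have been estimated during model-based learning.
    valid_actions is assumed non-empty; restricting the backups to `region`
    and `valid_actions` is what shrinks the subtask's policy search space.
    """
    V = {s: 0.0 for s in region}
    policy = {}
    for _ in range(n_sweeps):
        for s in region:
            if s in terminals:
                continue  # exit states keep value 0 here; the parent task values them
            best_q, best_a = float("-inf"), None
            for a in valid_actions:
                q = sum(p * (r + gamma * V.get(s_next, 0.0))
                        for p, s_next, r in model.get((s, a), []))
                if q > best_q:
                    best_q, best_a = q, a
            V[s], policy[s] = best_q, best_a
    return policy, V
```

Confining the backup loop to region states and valid actions is the sketch's counterpart to restricting each MAXQ subtask's policy search space; a parent task would then only need to choose among the exit-reaching policies of its child regions.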

  • 【Online Publication Contributor】 Central South University (中南大学)
  • 【Online Publication Year and Issue】 2012, Issue 01