Research on Some Key Issues for Semantic Web Usage Mining (original Chinese title: 语义Web使用挖掘若干关键技术研究)

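【Author】 Sun Ming (孙明)

【Advisor】 Zhou Mingtian (周明天)

【Author information】 University of Electronic Science and Technology of China, Computer Application Technology, 2009, PhD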

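【Abstract (translated from the Chinese)】 With the rapid development of the Internet, the volume of data carried on the Web is growing at an astonishing rate. Web usage mining applies data mining techniques to help users quickly discover usage patterns in massive amounts of Web data. Because most data on the Web today is unstructured or semi-structured, software agents cannot understand or process it, so the results of Web usage mining are often unsatisfactory. The Semantic Web is an extension of the current Web in which information is given well-defined semantics; it helps computers process Web usage information automatically and can effectively improve the results of Web usage mining. Semantic Web usage mining has therefore become a frontier research area in Web mining. On the one hand, it extracts usage semantics from existing Web data to promote the construction of the Semantic Web; on the other hand, it uses Semantic Web data to improve the quality and efficiency of traditional Web usage mining.

This dissertation reviews the development of research on Semantic Web usage mining and the major results achieved at each stage, and explains its significance for advancing Web technology. It systematically summarizes the process and tasks of Semantic Web usage mining and identifies the main problems in current research. From the perspectives of (semi-)automatic construction of semantic usage knowledge and mining of Semantic Web usage, it studies in depth several key problems in log ontology learning and log ontology mining, and makes the following contributions:

(1) A layered architecture for log ontologies is proposed. With the event as the core concept, and using top-down analysis, formal definitions of the core log ontology, the application log ontology, and the semantic log are given in turn, from abstract to concrete, according to the semantics of user access behavior. This layered architecture remedies the overly flat definitions of log ontologies in related work, facilitates the semantic description of usage knowledge at different levels, and can improve the quality and efficiency of subsequent Semantic Web usage mining.

(2) A method is proposed for learning application log ontologies by combining Web content mining and Web usage mining. The method learns in steps, determining the main elements of the application log ontology in this order: atomic application event extraction, learning of taxonomic relations among atomic application events, composite application event mining, and learning of non-taxonomic domain relations among application events. On top of the top-level log ontology architecture, user requests are mapped to content application events or service application events according to the user's specific access purpose. The taxonomic relations of content application events are discovered by swarm-intelligence-based clustering of Web pages, and those of service application events by semantic classification of the request parameters along the user's access path. A transaction space is built on the part-whole relation between events, and the non-taxonomic domain relations of application events are discovered by hierarchical association rule mining. Experiments show that, in the Web usage domain, the application log ontologies learned by this method are clearly better in both precision and recall than those generated by current mainstream ontology learning tools.

(3) The hybrid log knowledge system DatalogSHIQ is presented, and on this basis a method for discovering frequent Web access patterns is proposed. DatalogSHIQ extends AL-log: it supports a more expressive description logic language and hybrid Datalog rules, and it relaxes the safeness constraints of the hybrid system. A set of application access rules is introduced to represent the dynamically changing semantics of Web usage information, which compensates for the weakness of log ontologies in representing dynamic access knowledge. An atomic refinement operator defined over DatalogSHIQ increases the expressiveness of candidate Web access patterns. An ILP method based on coverage testing of observations is proposed that effectively discovers frequent Web access patterns from the candidate set. Compared with existing work, the method adds the ability to reason about complex concepts and independent roles; the frequent Web access patterns it discovers carry richer semantic knowledge and can meet the needs of practical site applications.

(4) A method combining DL-safe rules is proposed for discovering frequent Web access patterns and association rules. On top of the log ontology, the hybrid rule language DL-safeL is given to describe application access rules, adding support for disjunctive rules. A node expansion algorithm over DL-safeL, based on a trie structure, generates frequent Web access patterns and association rules directly by computing sets of admissible predicates, without generating candidate patterns in advance. The method makes use of optimization principles proven for disjunctive databases: checks for semantically equivalent patterns and for taxonomic redundancy in patterns avoid the performance bottleneck caused by excessive logical reasoning. Experimental results show that, compared with SEMINTEC, which also uses DL-safe rules, the method adds coverage of application access rules and observation sets without increasing computational complexity, and supports Datalog atoms that express application semantics within patterns.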
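Contribution (2) relies on hierarchical association rule mining over a transaction space. As a rough illustration of that general idea only, and not of the dissertation's own algorithm, the Python sketch below adds the taxonomy ancestors of each event to every transaction before counting pair support, so that a pattern can become frequent at an abstract level even when no concrete pair is. The event names, the `PARENT` taxonomy, the sessions, and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taxonomy: child event -> parent event (illustrative names only).
PARENT = {
    "view_laptop": "view_product",
    "view_phone": "view_product",
    "add_to_cart": "purchase_step",
    "checkout": "purchase_step",
}

def ancestors(item):
    """Return all ancestors of an item by walking the PARENT map upward."""
    found = []
    while item in PARENT:
        item = PARENT[item]
        found.append(item)
    return found

def extend(transaction):
    """Return the transaction plus every ancestor of every item in it."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

def is_ancestor_pair(a, b):
    """True if one item is an ancestor of the other (a trivial co-occurrence)."""
    return a in ancestors(b) or b in ancestors(a)

def frequent_pairs(transactions, min_support):
    """Count support of item pairs over ancestor-extended transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(extend(t))
        for a, b in combinations(items, 2):
            if not is_ancestor_pair(a, b):
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical user sessions, each a set of observed events.
SESSIONS = [
    {"view_laptop", "add_to_cart"},
    {"view_phone", "checkout"},
    {"view_phone", "add_to_cart"},
]

if __name__ == "__main__":
    for pair, n in sorted(frequent_pairs(SESSIONS, min_support=3).items()):
        print(pair, n)
```

In this toy data no pair of concrete events occurs in all three sessions, but the abstract pair `purchase_step` and `view_product` does, which is the kind of relation that only becomes visible once the taxonomy is taken into account.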

【Abstract】 With the rapid development of the Internet, the volume of Web data is growing explosively. Web usage mining is the process of applying data mining techniques to discover usage patterns in large amounts of Web data. Most data on the Web is unstructured or semi-structured, so intelligent software agents cannot understand or process it, and the results of traditional Web usage mining are often unsatisfactory. The Semantic Web is an extension of the current Web in which information is given well-defined meaning. It helps computers process Web information automatically and can effectively improve the results of Web usage mining. Semantic Web usage mining has therefore become one of the frontier fields in Web mining. On the one hand, it extracts usage knowledge from current Web data to promote the construction of the Semantic Web; on the other hand, it improves the results and efficiency of traditional Web usage mining by making use of Semantic Web data.

In this dissertation, we review the progress and important achievements of research on Semantic Web usage mining and illustrate its significance for the development of Web techniques. We survey the tasks of Semantic Web usage mining and point out the main problems of current research in this field. From the viewpoint of the (semi-)automatic construction of semantic usage knowledge and the mining of Semantic Web usage, this dissertation addresses several key issues in log ontology learning and log ontology mining, and makes the following innovative contributions:

(1) A hierarchical architecture for log ontologies is proposed. The event is adopted as the core concept for describing user visiting behavior. From top to bottom, formal definitions of the core log ontology, the application log ontology, and the semantic log are given according to the semantics of user visits. Compared with related work, this architecture makes it easier to express the semantics of usage knowledge at different levels, and can improve the results and efficiency of subsequent Semantic Web usage mining.

(2) An approach to application log ontology learning based on Web content mining and Web usage mining is proposed. In this method, the main elements of the application log ontology are determined in turn by atomic application event extraction, learning of the taxonomy of atomic application events, complex application event mining, and learning of the non-taxonomic domain relations of application events. Based on the top-level architecture of the log ontology, user requests are mapped to content application events or service application events according to the goal of the user's visit. The taxonomy of content application events is discovered by swarm intelligence clustering of Web documents, and the taxonomy of service application events is discovered by classifying the semantics of request parameters along the visiting path. By constructing a transaction space based on the part-whole relation between events, the non-taxonomic domain relations are mined through hierarchical association rules. The experimental results show that, in the Web usage domain, both the precision and the recall of our method are better than those of the mainstream ontology learning tools.

(3) The hybrid DatalogSHIQ log ontology knowledge system is presented, and an approach for discovering frequent Web access patterns is proposed on top of it. DatalogSHIQ, which extends AL-log, supports a richer description logic language and hybrid Datalog rules. It adopts the most general Datalog safeness condition to increase expressivity. Application access rules are used to represent the dynamic semantics of Web usage information, which makes up for the weakness of log ontologies in expressing dynamic knowledge. An atomic refinement operator based on DatalogSHIQ is proposed to generate more expressive candidate patterns. An ILP algorithm based on coverage testing of observations is developed to select frequent Web access patterns from the candidates. Compared with related work, this method adds the ability to reason about complex concepts and independent roles; its results are semantically richer and can satisfy the needs of practical applications.

(4) An approach for discovering frequent Web access patterns and association rules with DL-safe rules is proposed. Based on log ontologies, the hybrid rule language DL-safeL is given to describe application access rules, including disjunctive forms. Based on a trie, a node expansion algorithm is presented that directly generates frequent Web access patterns and association rules by computing admissible predicates. This method makes use of the optimization principles established for disjunctive databases, and checks whether a pattern is semantically free or taxonomically redundant, to avoid the performance bottleneck caused by excessive logical reasoning. The experimental results show that, compared with SEMINTEC, this method adds coverage testing of application access rules and observations and supports patterns containing Datalog atoms, without increasing computational complexity.
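Contribution (4) is built on a trie. The Python sketch below shows only the underlying data-structure idea, in a much simpler setting than the dissertation's: sessions are plain sequences of page identifiers, each trie node counts how many sessions pass through it, and frequent access prefixes are read off by expanding only those nodes whose count meets the threshold. It does not model DL-safe rules, admissible predicates, or any logical reasoning, and the class name, page identifiers, and threshold are hypothetical.

```python
class TrieNode:
    """One node of a prefix tree over page-access sequences."""

    def __init__(self):
        self.children = {}  # page identifier -> TrieNode
        self.count = 0      # number of sessions whose path passes through this node

def build_trie(sessions):
    """Insert every session into the trie, counting visits per node."""
    root = TrieNode()
    for session in sessions:
        node = root
        for page in session:
            node = node.children.setdefault(page, TrieNode())
            node.count += 1
    return root

def frequent_prefixes(root, min_support):
    """Collect prefixes with count >= min_support by expanding frequent nodes only.

    A child's count can never exceed its parent's, so an infrequent node
    is never expanded and its whole subtree is skipped.
    """
    results = {}
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for page, child in node.children.items():
            if child.count >= min_support:
                path = prefix + (page,)
                results[path] = child.count
                stack.append((path, child))
    return results

# Hypothetical sessions: ordered page identifiers visited by three users.
SESSIONS = [
    ["home", "catalog", "item"],
    ["home", "catalog", "cart"],
    ["home", "search"],
]

if __name__ == "__main__":
    trie = build_trie(SESSIONS)
    for path, n in sorted(frequent_prefixes(trie, min_support=2).items()):
        print(" > ".join(path), n)
```

Because patterns are grown node by node from the trie itself, no separate candidate list is generated and then filtered, which is the property the abstract highlights for the node expansion algorithm.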
