Research on the Text Categorization PSE Problem Based on Web Service Composition
【Author】 Mei Jian (梅健);
【Advisor】 Zhang Wu (张武);
【Author information】 Shanghai University, Computer Application Technology, 2008, PhD
【Abstract】 As Web service technology converges with grid technology, Web services are being applied in more and more fields, and Web service-based problem solving environments (PSE) have become an active research topic in computer applications. Text categorization (TC) can be viewed as the problem of determining which category a text belongs to, and many classification algorithms exist for it. These algorithms, however, lack unified management, and their interfaces differ widely. Moreover, as classification precision requirements rise and text collections grow larger, traditional techniques cannot quickly supply the computing resources that the classification process needs. By wrapping classification algorithms as resources, Web service technology provides unified resource management and open standard interfaces and, more importantly, pools resources effectively to meet the demands of classification. To make classification algorithms easier to share and to improve research efficiency, this thesis proposes a Web service-based Problem Solving Environment for Text Categorization (PSE-TC), which offers researchers large-scale parallel computation, algorithm comparison, and result analysis. The main work of this thesis covers the following aspects:
1. Architecture of PSE-TC. Drawing on the Web Services Resource Framework (WSRF) and related PSE applications, and taking the characteristics of text categorization into account, the thesis proposes the concept of a service platform that integrates text classification algorithms. A four-layer PSE-TC architecture is designed, consisting of a resource provider layer, a Web service integration layer, a task execution layer, and a Web portal layer.
2. An extensible Web service architecture. The Web service integration layer uses Tomcat and JBoss as application servers to provide grid resource integration services. Services are published through the AXIS component, which provides an application programming interface suited to text classification algorithm services. Web service technology runs through the entire classification process, including the classifier construction service, the classification service, and the task execution status monitoring service.
3. Service security in the PSE-TC environment. To meet the access control requirements of user-published services, the thesis implements a lightweight access control service, the Uniform Security Authorization Service (USAS). USAS assigns all platform users to roles according to defined policies, establishes an authentication and authorization mechanism, and separates user certificate management from user role permissions, laying a foundation for later research on PSE security.
4. Workflow based on Web service composition. To improve resource utilization and the accuracy of resource scheduling, the thesis introduces the concepts of domain and domain member and builds a service workflow model on the hierarchical and ordering relations among domain members. On the basis of this model, it proposes an optimized service composition algorithm that addresses resource conflicts in workflow management, rigid execution patterns, and users being forced to handle work passively.
5. Feedback in text classification models. The thesis proposes and implements the use of feedback control to revise and rebuild text classification models. Taking the support vector machine (SVM) as an example, a feedback set is formed through human interaction; the support vectors in the feedback set are then optimized and de-duplicated to build the post-feedback classifier. With this method, a small number of feedback texts can considerably improve the performance of the classification model.
Finally, PSE-TC and the related text categorization application systems are tested. Comparison and analysis of the experimental results confirm the feasibility and correctness of the theories and techniques above.
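The separation of user management from role permissions described for USAS can be illustrated with a minimal sketch. This is not the thesis's implementation: the class `AccessControl`, its methods, and the role and service names (`researcher`, `guest`, `build_classifier`, `classify`) are all hypothetical, chosen only to show how a role-based check might keep the user-to-role mapping apart from the role-to-service mapping.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AccessControl:
    """Toy role-based access check; every name here is hypothetical."""

    # role -> services that the role may invoke
    role_permissions: dict[str, set[str]] = field(default_factory=dict)
    # user -> roles assigned to the user (stored apart from permissions)
    user_roles: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, role: str, service: str) -> None:
        """Allow every holder of `role` to invoke `service`."""
        self.role_permissions.setdefault(role, set()).add(service)

    def assign(self, user: str, role: str) -> None:
        """Give `user` the role `role`."""
        self.user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user: str, service: str) -> bool:
        """True if any of the user's roles permits the service."""
        return any(
            service in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )


def demo() -> AccessControl:
    """Build a small example policy with two assumed roles."""
    acl = AccessControl()
    acl.grant("researcher", "build_classifier")
    acl.grant("researcher", "classify")
    acl.grant("guest", "classify")
    acl.assign("alice", "researcher")
    acl.assign("bob", "guest")
    return acl
```

In this sketch, changing what a role may do never touches the user records, and reassigning a user never touches the permission table, which is the kind of separation the abstract attributes to USAS. Certificate handling and authentication are deliberately left out.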
【Key words】 Problem Solving Environment; Web Service; Text Categorization; Web Service Composition; Feedback Control;