电子商务Web数据库不精确查询方法研究
Studies on Answering Imprecise Queries Over E-commerce Web Databases
【Author】 Li Xin (李昕);
【Advisor】 Liu Jianhui (刘建辉);
【Author information】 Liaoning Technical University, Management Science and Engineering, 2010, Ph.D.
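【Abstract (translated)】 In recent years, the rapid expansion of the World Wide Web has been accompanied by rapid growth in e-commerce, and presenting a company's product information on a Web site has become an important part of e-commerce transactions. These Web sites are usually backed by an online database, called an E-commerce Web database, whose content can be accessed only through Web-form-based query interfaces. With the widespread use of the Internet and the rapid growth in the amount of information these databases hold, querying them has become an important way for large numbers of ordinary users to obtain product information.

Existing query processing models for E-commerce Web databases usually assume that users know their query intentions precisely, and they support only strict query matching. As the user population has expanded from professionals familiar with the domain to ordinary users who expect instant gratification, this exact-match model no longer suits their query habits. Many ordinary users know little about the structure and content of the database, and their query intentions may themselves be vague or imprecise, so the query conditions are only a partial or approximate description of what they want. Accordingly, besides the results that match the query exactly, results that are close to the query may also be what they need. Under the existing model, a user who wants more of these nearby results has to modify the query conditions again and again until the results are satisfactory or the user loses patience and gives up. Research on imprecise query methods is therefore of great importance for the many ordinary users who want to obtain more relevant results from a large E-commerce Web database in a single query, without adjusting the query conditions by hand.

This dissertation studies the imprecise query problem that urgently needs to be solved in querying E-commerce Web databases. Starting from the imprecise query needs of ordinary users, and following the order of imprecise querying, ranking of imprecise query results, and top-k retrieval of query results, it proposes an effective imprecise query solution for E-commerce Web databases and gives the concrete implementation techniques. The main contributions are:

(1) To solve the imprecise query problem, an imprecise query method based on approximate functional dependencies is proposed. For a Web database relation, maximal sets are derived from the concept of agree sets and the minimal nontrivial functional dependency sets are generated, from which the approximate functional dependencies between attributes are found. An attribute weight assessment method is then proposed, under which the basic query condition on the least important attribute is relaxed first and to the greatest degree. Based on the idea of association rules, a method for assessing the similarity between categorical (text-valued) attribute values is proposed. Using the attribute weights, the similarities between attribute values, and a relaxation threshold, a query relaxation rewriting algorithm is proposed. Experimental results show that the attribute weight assessment and categorical value similarity algorithms are reasonable and stable; user survey results show that the query relaxation method achieves relatively high recall and handles imprecise queries over E-commerce Web databases effectively.

(2) To solve the many-answers problem caused by imprecise queries, a ranking method for imprecise query results based on the Probabilistic Information Retrieval (PIR) model is proposed. Working from the original data and the query log, the method uses the PIR model to assess the correlations between the attribute values not specified by the query, the attribute values specified by the query, and the user preferences. It then constructs a scoring function for result tuples and uses it to rank the query results. Experimental results show that the ranking method meets users' needs and preferences well, which improves the effectiveness of ranking imprecise query results over E-commerce Web databases.

(3) To meet the efficiency requirements of the ranking algorithm, a top-k retrieval method based on the Threshold Algorithm (TA) is proposed. The method uses the PIR model to construct a monotonic scoring function for each distinct attribute value in the database. On this basis, it proposes a TA-based top-k retrieval solution and gives the corresponding algorithms for tuple list creation, tuple list clustering, and top-k tuple retrieval. Experimental results show that the tuple list clustering algorithm finds the cluster centers accurately, and that the top-k retrieval algorithm is accurate and substantially shortens execution time, which improves the efficiency of top-k retrieval over large-scale data.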
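The abstract states that the query condition on the least important attribute is relaxed first and most, and that the rewrite is driven by attribute weights, value similarities, and a relaxation threshold. The following is a minimal Python sketch of that idea only. The function name `relax_query`, the data layout, and in particular the linear per-attribute threshold formula are illustrative assumptions, not the dissertation's actual algorithm; the weights and similarities are taken as given inputs rather than derived from approximate functional dependencies or association rules.

```python
def relax_query(query, weights, similar, threshold):
    """Rewrite an exact-match query into a relaxed one (illustrative sketch).

    query     : {attribute: value} exact-match conditions
    weights   : {attribute: importance weight > 0}
    similar   : {attribute: {value: {other_value: similarity in [0, 1]}}}
    threshold : overall relaxation threshold in (0, 1]; 1.0 means no relaxation
    """
    w_min = min(weights[a] for a in query)
    relaxed = {}
    # Visit attributes from least to most important.
    for attr in sorted(query, key=lambda a: weights[a]):
        value = query[attr]
        # ASSUMPTION: a linear rule. The least important attribute gets the
        # overall threshold; more important ones get a value closer to 1.
        t = 1 - (1 - threshold) * (w_min / weights[attr])
        if isinstance(value, (int, float)):
            # Numeric condition: widen the point into a range.
            slack = (1 - t) * abs(value)
            relaxed[attr] = (value - slack, value + slack)
        else:
            # Categorical condition: admit values that are similar enough.
            candidates = similar.get(attr, {}).get(value, {})
            relaxed[attr] = {value} | {v for v, s in candidates.items() if s >= t}
    return relaxed


# Hypothetical used-car query, for illustration only.
query = {"make": "Toyota", "color": "red", "price": 10000}
weights = {"make": 0.5, "color": 0.2, "price": 0.3}
similar = {
    "make": {"Toyota": {"Honda": 0.9, "Ford": 0.7}},
    "color": {"red": {"orange": 0.65, "blue": 0.3}},
}
print(relax_query(query, weights, similar, threshold=0.6))
```

In this sketch the low-weight `color` condition admits any value whose similarity reaches the overall threshold, while the high-weight `make` condition admits only very similar values, which mirrors the "least important attribute is relaxed most" rule described above.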
【Abstract】 In recent years, with the rapid expansion of the World Wide Web, e-commerce has developed rapidly as well. Presenting product information on a Web site has become a key part of e-commerce transactions. Such a Web site is usually backed by an online database, referred to as an E-commerce Web database, which is accessible only through a Web-form-based query interface. With the widespread use of the Internet and the rapid growth in the size of E-commerce Web databases, querying them has become an important way for people to obtain product information.

Existing query processing models for E-commerce Web databases usually assume that users know exactly what they want, and they support only strict query matching. But as the user population shifts from professionals who know the application domain to lay users who demand "instant gratification", this precise query processing model no longer suits the users' query habits. Lay users have insufficient knowledge of the structure and content of the database, and their query intentions are often vague or imprecise, so the query conditions describe those intentions only approximately. Consequently, besides the results that match the query conditions exactly, items that are close to the query conditions may also be what the users need. To obtain such items, the user has to reformulate the query conditions repeatedly until she or he gets satisfactory answers or gives up. Research on answering imprecise queries over E-commerce Web databases is therefore important for the many users who want to obtain more relevant information from a large E-commerce Web database in a single query.

This dissertation investigates the imprecise query problem, which arises in querying E-commerce Web databases and urgently needs a solution. From the perspective of satisfying users' imprecise query needs, and following the order of imprecise query answering, query result ranking, and top-k retrieval, it proposes an effective imprecise query solution for E-commerce Web databases together with the corresponding techniques. The main contributions of this dissertation are summarized as follows:

(i) To deal with the imprecise query problem of E-commerce Web databases, an imprecise query answering approach based on approximate functional dependencies is proposed. Maximal sets are derived from the concept of agree sets, and the minimal nontrivial functional dependency sets are then generated, from which the approximate functional dependencies between attributes are found. Using these dependencies, an attribute weight measuring approach is proposed; the condition on the least important attribute is relaxed first and to the greatest degree. Next, based on the idea of association rules, a method for measuring the semantic similarity between categorical attribute values is proposed. Using the relaxation threshold, the attribute weights, and the semantic similarities between attribute values, an adaptive query relaxation rewriting algorithm is proposed. Experimental results show that the proposed attribute weight and attribute value similarity measuring methods are reasonable and stable, and that the proposed query relaxation method achieves relatively high recall and handles imprecise queries over E-commerce Web databases effectively.

(ii) To deal with the problem of too many answers being returned by an E-commerce Web database in response to an imprecise query, a query result ranking approach based on the probabilistic information retrieval (PIR) model is proposed. Based on the database and the query history, this approach uses the PIR model to capture the correlations between the unspecified attribute values and the specified attribute values as well as the user preferences. It then constructs a scoring function and ranks the query results by their scores. Experimental results show that the proposed ranking method meets users' needs and preferences well, and thus improves the ranking quality of imprecise query results over E-commerce Web databases.

(iii) To improve the efficiency of the query result ranking algorithm, a top-k retrieval method based on the threshold algorithm (TA) is proposed. Using the monotonic scoring functions that the PIR model constructs for the distinct attribute values, a TA-based top-k retrieval solution is proposed, and algorithms for tuple list creation, tuple list clustering, and top-k tuple retrieval are presented. Experimental results show that the tuple list clustering algorithm finds the cluster centers correctly, and that the top-k retrieval method has high precision and shorter execution time, which improves retrieval efficiency on large datasets.
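For readers unfamiliar with the threshold algorithm (TA) that contribution (iii) builds on, the sketch below shows the generic TA for top-k retrieval with a monotone aggregate (here, a plain sum). It is a textbook-style illustration, not the dissertation's method: the per-attribute score lists are supplied by hand rather than produced by a PIR model, the tuple list clustering step is omitted, and the name `ta_top_k` and the sample data are hypothetical. It assumes every tuple id appears in every list and that all lists have the same length.

```python
import heapq


def ta_top_k(sorted_lists, k):
    """Return the k best (tuple_id, total_score) pairs, best first.

    sorted_lists: one list per attribute, each holding (tuple_id, score)
    pairs sorted by score in descending order. The total score of a tuple
    is the sum of its per-attribute scores (a monotone aggregate).
    """
    # Random access: look up any tuple's score in any list.
    lookup = [dict(lst) for lst in sorted_lists]
    seen = set()
    top = []  # min-heap of (total_score, tuple_id), at most k entries
    for pos in range(len(sorted_lists[0])):
        last_scores = []
        # Sorted access: read one entry from each list, round-robin.
        for lst in sorted_lists:
            tid, score = lst[pos]
            last_scores.append(score)
            if tid in seen:
                continue
            seen.add(tid)
            total = sum(table[tid] for table in lookup)
            if len(top) < k:
                heapq.heappush(top, (total, tid))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, tid))
        # No unseen tuple can score above the sum of the last scores read.
        threshold = sum(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break  # early termination: the current top-k is final
    return [(tid, total) for total, tid in sorted(top, reverse=True)]


# Hypothetical per-attribute scores for four tuples.
by_price = [("t1", 9), ("t2", 8), ("t3", 3), ("t4", 1)]
by_brand = [("t2", 9), ("t3", 7), ("t1", 6), ("t4", 2)]
print(ta_top_k([by_price, by_brand], k=2))
```

The early-termination test is what makes TA attractive on large tables: once the k-th best total is at least the threshold, the remaining entries of the sorted lists never need to be read.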
【Key words】 E-commerce Web database; imprecise query; approximate functional dependence; query results ranking; top-k retrieval;