Research on Key Technologies of Vertical Search Engines for Recency-sensitive Objects (面向时间敏感对象的垂直搜索引擎关键技术研究)

Research on Vertical Search Engine of Recency-sensitive Objects

【Author】 Wu Yu (吴羽)

【Advisors】 Chen Gang (陈刚); Dong Jinxiang (董金祥)

【Author information】 Zhejiang University, Computer Software, 2011, PhD

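【Abstract, translated from the Chinese】 As search services become more widespread, users' search needs in specific domains are becoming clearer, and their demands for personalized and real-time results keep rising. Efficient information retrieval for vertical search domains has therefore become a focus of the search engine market. Vertical search engines use focused crawling, intelligent scheduling, high-dimensional indexing and related techniques, together with domain knowledge and users' search habits, to deliver results within a particular vertical domain that are more timely, more personalized and more specialized. However, most existing vertical search engines suffer from three problems: 1) the crawler works passively, so the delay between a user query and the crawl of the target is too long; 2) crawl scheduling is blind, so crawling resources are poorly utilized; and 3) the indexing system performs poorly and lacks effective algorithms for feature extraction and clustering of domain-specific text. These problems seriously hamper the healthy development of the vertical search engine market. This thesis attempts a systematic study of these problems and the key technologies behind them. Its main contributions and innovations are as follows.

1. Active focused crawling for the crawler system. To address passive crawling and the long delay between user query and target crawl, the thesis proposes a semantics-based, query-driven focused crawling technique. It interprets user queries using domain knowledge, translates each query semantically into target web pages, and crawls actively in response to user queries, which removes the long delay. Extensive experiments and an initial deployment in a real project show that query-driven focused crawling returns search results on the order of 10 seconds, sharply reducing delay and greatly improving the user experience.

2. Intelligent scheduling for the crawler system. To address blind scheduling and low utilization, the thesis models changes to web documents as a Poisson process, estimates the freshness of each individual object quantitatively, and on that basis proposes PoissonRank, a fine-grained, object-level resource scheduling algorithm. PoissonRank schedules crawls according to change and greatly improves the utilization of crawling resources. Simulation and deployment in commercial projects show that the model is effective: the scheduler raises resource utilization and captures object changes better. Extensive experiments in real environments confirm the object distribution patterns, the validity of the Poisson process model and the improvement in user experience. PoissonRank also adds very little overhead to the system and scales well.

3. Online update of high-dimensional indexes in the indexing system. To address the low efficiency of online updates to multimedia high-dimensional indexes, the thesis optimizes the LSH algorithm and proposes CB-LSH, a high-dimensional indexing technique based on compressed bitmaps (Compressed Bitmap). By expressing the LSH operators in Boolean algebra and then introducing a compressed bitmap index, CB-LSH improves insert, delete and update performance across the board and solves the performance problem of online updates. Theoretical analysis proves the improvement in space usage and time complexity. Experiments on large real data sets show that, compared with existing LSH algorithms, CB-LSH saves one third of the memory, improves deletion performance by nearly an order of magnitude, improves query performance several times over, and improves insertion performance by about half. A real project confirms that CB-LSH is effective and feasible in a massive multimedia object retrieval system with real-time online updates.

4. Result merging for text information in the indexing system. Text in vertical domains is short, highly specialized and noisy, so clustering in the indexing system performs poorly. The thesis proposes TrigSigs, a text clustering technique based on natural-language trigger pairs. Using first-order trigger pairs, it mines the associations among the implicit attributes of words, learns domain-specific vocabulary, removes noise words and extracts key feature words, yielding a fine-grained, object-level clustering technique. Simulation experiments show that the algorithm filters out the vast majority of noise words and assigns weights according to each word's discriminative power, which greatly improves the accuracy of the final clustering.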
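The Poisson-based freshness estimate mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's PoissonRank algorithm, whose details the abstract does not give. It only assumes the standard property of a Poisson change process: with change rate λ, the probability that an object has changed at least once within time t of its last crawl is 1 − exp(−λt). The names `TrackedObject`, `change_rate`, `stale_probability` and `schedule`, the naive rate estimator (detected changes divided by observation time) and the sample numbers are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TrackedObject:
    name: str
    changes_seen: int         # changes detected so far (illustrative data)
    observed_hours: float     # total time the object has been monitored
    hours_since_crawl: float  # time elapsed since the last crawl


def change_rate(obj: TrackedObject) -> float:
    """Naive estimate of the Poisson rate lambda, in changes per hour."""
    if obj.observed_hours <= 0:
        return 0.0
    return obj.changes_seen / obj.observed_hours


def stale_probability(obj: TrackedObject) -> float:
    """P(at least one change since last crawl) = 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-change_rate(obj) * obj.hours_since_crawl)


def schedule(objects, budget):
    """Pick the `budget` objects most likely to be stale (assumed policy)."""
    return sorted(objects, key=stale_probability, reverse=True)[:budget]


objects = [
    TrackedObject("hotel_page", changes_seen=2, observed_hours=48.0, hours_since_crawl=6.0),
    TrackedObject("static_page", changes_seen=0, observed_hours=100.0, hours_since_crawl=50.0),
    TrackedObject("flight_price", changes_seen=48, observed_hours=24.0, hours_since_crawl=0.5),
]
for obj in schedule(objects, budget=2):
    print(obj.name, round(stale_probability(obj), 3))
```

Ranking by staleness probability spends the crawl budget on objects that change often and have not been visited recently, which is the intuition behind change-aware scheduling. A production scheduler would also need a less biased rate estimator, since changes that occur between two crawls are only observed once.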

【Abstract】 As search engine services grow more popular, domain-specific search requests are becoming more clearly defined, and the requirements for personalized and recency-sensitive search are rising. As a result, efficient information retrieval based on vertical search engines has become a focal issue in the search engine field. Using focused crawling, intelligent scheduling and high-dimensional indexing techniques, together with domain knowledge and users' search habits, vertical search engines provide up-to-date, more personalized and more professional search results.

However, most vertical search engines share the following major problems: (1) the passive crawling mode of the crawler system causes a long delay between user query and result retrieval; (2) the scheduler of the crawler system schedules web page crawling blindly, which leads to very low utilization of crawling resources; (3) the indexing system does not perform well enough for online updates, and the merged results for certain unstructured text objects are poor. This thesis studies these problems and the related key technologies in depth. Its major contributions are as follows.

Firstly, it proposes a semantics-based query triggered crawling (QTC) technique to solve the long delay between user query and result retrieval caused by passive crawlers. Based on domain knowledge, QTC translates a user query into the request parameters of potential target results on domain web sites, and implements active crawling focused on current user queries. Extensive experiments and a beta test in real commercial applications show that QTC bridges the delay gap between user query and result retrieval and brings 10-second-level freshness to vertical search results.

Secondly, it proposes an object-level, change-aware resource scheduling technique to solve the low utilization of crawling resources caused by blind crawling. This technique, named PoissonRank, uses a Poisson process to model the sequence of times at which a web object changes. The Poisson process model provides a quantitative estimate of object-level freshness. By scheduling crawler resources according to estimated object freshness, the technique both improves resource utilization and captures the change patterns of objects more accurately. Extensive experiments on real data show that the Poisson process model estimates object freshness accurately and that resource utilization improves at nearly zero extra performance cost.

Thirdly, it proposes a more efficient high-dimensional indexing technique to address the performance problem of traditional high-dimensional indexing methods. This technique, named CB-LSH, combines a Compressed Bitmap (CB) index with a Locality-Sensitive Hashing (LSH) index. CB-LSH expresses each operator of the LSH index in Boolean form and brings CB into LSH. It greatly improves performance and solves the online update problem for high-dimensional indexing. Theoretical analysis proves the improvements. Extensive experiments show that CB-LSH uses one third less memory and achieves 10 times the index deletion performance, 4 times the query performance and 1.5 times the insert performance. Applications in real commercial projects show that CB-LSH is feasible for online updates in a large image retrieval system.

Fourthly, it proposes a text clustering technique inspired by trigger pairs in natural language to improve on the results of traditional text clustering algorithms for unstructured text data. Unstructured text in e-commerce is very short, noisy and full of professional vocabulary, which defeats traditional text clustering algorithms. The trigger-pair based clustering technique (TrigSigs) uncovers hidden relations between words, adapts to professional vocabulary and extracts key word features, enabling fine-grained, object-level clustering. Simulation experiments show that this technique filters out most noise, distributes weights sensibly among word features and greatly improves the clustering results.
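The idea of expressing LSH operations as Boolean operations on bitmaps, as described for CB-LSH, can be sketched as follows. This is not the thesis's implementation, which the abstract does not detail. The sketch assumes random-hyperplane LSH, stores each bucket as a bitmap of object ids, and uses an uncompressed Python `int` as a stand-in for a compressed bitmap. The class `BitmapLSH`, its parameters and the tombstone-style `alive` mask are hypothetical, and object ids are assumed to be small non-negative integers that are never reused.

```python
import random


class BitmapLSH:
    """Random-hyperplane LSH whose buckets are bitmaps of object ids."""

    def __init__(self, dim, num_tables=4, bits_per_key=6, seed=0):
        rng = random.Random(seed)
        # One independent set of random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits_per_key)]
            for _ in range(num_tables)
        ]
        self.tables = [dict() for _ in range(num_tables)]  # bucket key -> bitmap
        self.alive = 0  # bitmap of ids that have not been deleted

    def _key(self, table_no, vec):
        # The sign of each projection contributes one bit of the bucket key.
        key = 0
        for plane in self.planes[table_no]:
            dot = sum(p * v for p, v in zip(plane, vec))
            key = (key << 1) | (1 if dot >= 0 else 0)
        return key

    def insert(self, obj_id, vec):
        bit = 1 << obj_id
        for t, table in enumerate(self.tables):
            k = self._key(t, vec)
            table[k] = table.get(k, 0) | bit  # set membership via OR
        self.alive |= bit

    def delete(self, obj_id):
        # Clear one bit in the alive mask; no bucket has to be scanned.
        self.alive &= ~(1 << obj_id)

    def candidates(self, vec):
        hits = 0
        for t, table in enumerate(self.tables):
            hits |= table.get(self._key(t, vec), 0)  # union over tables
        hits &= self.alive  # drop deleted ids via AND
        return [i for i in range(hits.bit_length()) if (hits >> i) & 1]


index = BitmapLSH(dim=3)
index.insert(0, [1.0, 0.2, -0.5])
index.insert(1, [1.0, 0.2, -0.5])
print(index.candidates([1.0, 0.2, -0.5]))  # [0, 1]
index.delete(0)
print(index.candidates([1.0, 0.2, -0.5]))  # [1]
```

With buckets held as bitmaps, the union of candidates across tables and the filtering of deleted objects reduce to bitwise OR and AND, which is one plausible reading of why deletion benefits most in the reported results. A real system would replace the plain integers with a compressed bitmap format to obtain the memory savings.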

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication year and issue】 2011, Issue 07