汉语财经评论的修辞结构标注及篇章研究

Annotation and Analysis of Chinese Financial News Commentaries in Terms of Rhetorical Structure

【Author】 Yue Ming (乐明)

【Advisor】 Feng Zhiwei (冯志伟)

【Author information】 Communication University of China, Linguistics and Applied Linguistics, 2006, PhD

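【Abstract (translated from the Chinese)】 Discourse annotation is at the frontier of international language resource construction. Following established international methods for building discourse corpora, this dissertation first builds a fairly large corpus of Chinese financial news commentaries, with the whole text as the unit, and then, guided by Rhetorical Structure Theory (RST), carries out preprocessing, segmentation, annotation, checking, and statistical analysis of the corpus. It also examines various quantitative relationships between the rhetorical structure of Chinese discourse and surface linguistic information, with the aim of laying groundwork for contrastive linguistics and for larger, automatically processed discourse corpora in the future.

Before starting the concrete work of building a Chinese rhetorical structure treebank, we compared, at the theoretical level, RST (which grew out of research on English) with traditional Chinese research on complex sentences (fuju), sentence groups (juqun), discourse, and composition studies (wenzhangxue). We conclude that the two are very similar in their basic assumptions about discourse structure and in their conclusions on many specific questions. RST is nevertheless distinctive in several respects: it adheres to a communicative view of language, it stresses the link between the writer's communicative intention and the nuclear status of the rhetorical meaning of discourse units, it stresses the homogeneity of levels of linguistic structure, and it offers a formal representation of rhetorical structure. Therefore, after reviewing RST research on Chinese and the achievements of international rhetorical structure treebanks, we consider it both possible and necessary to use the theory for corpus-based empirical research on Chinese discourse.

To this end we built a corpus of Chinese financial news commentaries (Caijingpinglun, CJPL) containing 400 texts and about 800,000 characters. In its choice of material the corpus is reasonably comparable to the English WSJ-RST treebank and the German PCC treebank. Because the material was taken directly from web pages, however, it had problems with character encoding, text editing, and web page uploading. We therefore took a series of cautious preprocessing steps to convert all web documents into text documents with a uniform encoding, in order to ensure the accuracy and efficiency of later processing.

After preprocessing, the annotator first worked from the text documents (while consulting the original web pages) and, reading as an ordinary reader would, annotated basic information for every text in the corpus, including its genre, topic, title, lead, opening, ending, source, author, and publisher. This gave the annotator a good understanding of the corpus.

Next, relying on the selected boundary markers of the Elementary Unit of Discourse Analysis (EUDA), we had the corpus segmented uniformly by machine. Before selecting the full stop, question mark, exclamation mark, end-of-paragraph mark, semicolon, colon, ellipsis, and dash as discourse unit boundary markers, we analyzed the distribution of punctuation marks in the corpus. The analysis showed that these marks not only indicate discourse unit boundaries correctly in the vast majority of cases but also ensure a fairly fine granularity for the subsequent relation annotation. More importantly, once the corpus has been segmented by these boundary marks, the results need no further manual intervention apart from rejoining a very small number of segments, which ensures both efficiency and accuracy.

After segmentation, the annotator made a trial annotation of the rhetorical relations among the discourse units of all the texts, up to the construction of a rhetorical structure tree for each whole text, and so gained a deeper understanding of the corpus from the perspective of rhetorical structure. At the end of this stage we screened out 2 documents with serious editing problems and 3 long transcripts of television interviews consisting mainly of spoken dialogue.

Starting from the actual commentary material, we defined 47 Chinese rhetorical relations in 12 major groups and 19 organizational elements of news discourse, and drew up working guidelines for annotating Chinese discourse relations. The guidelines include principles of precedence among rhetorical relations where ambiguity may arise and procedures for handling certain special phenomena. In setting up and defining the relations, we consulted several English, German, and Chinese versions of rhetorical relation sets and definitions, as well as relevant results from Chinese research on complex sentences, sentence groups, and discourse. We also carried out a psycholinguistic survey on some potentially controversial segmentation markers and relation definitions, and on the basis of its results adjusted the definitions of some relations and the order of precedence among relations.

On this basis, we selected 197 texts by random, evenly distributed sampling and, in two passes, annotated the rhetorical relations at the level of the EUDA (roughly a semicolon-delimited clause) and above for the 97 shorter texts among them. For each text we built a rhetorical structure tree covering the whole text and checked the validity of the tree structure. From the two versions of the annotation we produced a unified final annotation (the third pass) and then ran an annotator-agreement test on a random sample. Without consulting the rhetorical structure annotation, we also separately annotated inter-sentential discourse cue markers for the 97 texts (including inter-sentential connectives, inter-sentential anaphoric demonstratives and anaphoric pronouns, and punctuation marks with a discourse function). We then extracted data from these annotations and analyzed the structural characteristics of the commentaries at each level, the distribution of rhetorical relations, and the rhetorical functions of discourse cues.

This corpus-driven data analysis shows the following:

1) Following certain principles, the great majority (93.1%) of Chinese financial news commentaries can be given an approximate formal representation as a tree structure.
2) Nearly all of the rhetorical relations we defined can be used repeatedly to connect discourse units at every level, showing that Chinese discourse has good homogeneity across structural levels.
3) The extended classical RST relation set (Mann and Thompson 1988, Mann 2005) covers 90.4% of the relations between discourse units in Chinese financial news commentaries, and the remaining relations can mostly be expressed as nuclearity variants of known relations.
4) Among overall tree shapes in the CJPL corpus, the most common is the "头并卫" structure, in which the later paragraphs elaborate on the first paragraph point by point (14.4%). It is followed by the "头降卫" structure, in which the later paragraphs elaborate on the first paragraph and gradually add further points (13.4%); the "中降卫" structure, in which statement comes first and comment follows (13.4%); and the "尾升卫" structure, which develops step by step and reaches a conclusion at the end (11.3%).
5) In the CJPL corpus, texts whose overall relation expresses justification and evaluation account for 53.6%, and texts whose overall relation expresses elaboration or explanation of information account for 46.4%. This indicates a certain difference between the definition of commentary used within the domestic news community and the theoretical definition of argumentative writing used by linguists.
6) Although many of the rhetorical relations in the body of financial commentaries are multinuclear, the mononuclear nucleus-satellite pattern still dominates, accounting for 64.6% of all relations.
7) Unlike Chinese complex sentences, in which the subordinate-before-main structure dominates, Chinese commentaries at the level of the semicolon-delimited clause and above show a ratio of satellite-nucleus to nucleus-satellite structures of 46.16% : 53.84%. There is no clear association between nuclearity and the order of discourse units.
8) The argument-oriented "media financial commentaries" and the report-oriented "Xinwen Lianbo" (新闻联播) news broadcasts differ somewhat in the frequency distribution of the various relations, which shows the influence of genre on the distribution of rhetorical relations.
9) Inter-sentential connectives are used at a rate of 28.5% in Chinese commentary texts, and the most frequent conjunction is "而". Inter-sentential connectives are used most often in the CONJUNCTION-M (并加-M) and LIST-M (罗列-M) relations.
10) Some relations, such as the addition-S (附加-S), CONCESSION-S/-N (让步-S/-N), and LIST-M (罗列-M) relations, are often marked by connectives. Others, such as the MEANS-S (方式-S), ATTRIBUTION-S (引述-S), EVALUATION-M (评价-M), and SOLUTION-M/-S (解答-M/-S) relations, are almost never marked by connectives.
11) Some common connectives occur both within and between sentences in the corpus and differ only in distribution: some occur mainly between sentences (e.g. "然而"), others mainly within sentences (e.g. "如果").
12) Inter-sentential connectives are sometimes used consecutively. These cases fall roughly into three types: strengthening (or softening) the tone, mutually restricting the relation, and governing different stretches of context. The last type is in effect an extension, to units above the sentence, of the capacity of relations in multiple complex sentences to embed one another.
13) The most common inter-sentential anaphoric demonstratives in Chinese financial commentaries are "这" and expressions containing "这".
14) Some punctuation marks, such as the question mark, semicolon, and colon, clearly serve to signal rhetorical relations between discourse units in Chinese, and they are highly correlated with the nuclearity of those relations.
15) Although some discourse cue markers (including connectives, anaphoric words, punctuation marks, and paragraph marks) are fairly strongly associated with certain rhetorical relations in Chinese, there is no one-to-one mapping between them.
16) Using data from RST research on English, German, Spanish, and other languages, we find that rhetorical relations are marked at a fairly low rate in many languages, and that marking mostly occurs between units at lower discourse levels. Some relations, such as concession and condition, are marked at a high rate in every language, whereas others, such as evaluation, background, elaboration, and solutionhood, are marked at a low rate. The exact proportions, and the types of relation that particular markers can constrain, differ slightly from language to language.
17) The local subtrees of Chinese discourse structure trees contain a rather special spiral-shaped structure, in which a discourse unit always enters into a rhetorical relation with a unit some distance away rather than with its immediate neighbor. If this is what Kaplan (1966) called the circular structure, and if the annotation of more texts shows that this local subtree form occurs with notable frequency, it would indicate that Kaplan's (1966) hypothesis about the circular structure of Chinese discourse is correct in part.
18) The homogeneity of levels in Chinese rhetorical structure, the dominance of the nucleus-satellite pattern in Chinese discourse structure, and the coverage of the classical RST relation set in Chinese all show empirically that RST can be transferred to Chinese.

Although the Chinese financial commentary treebank has so far made only interim progress, we believe that this research has practical significance for Chinese information processing, discourse theory, and sociocultural research. First, the treebank can supply the natural language engineering community with the various prior coefficients needed for discourse parsing, help improve existing models of Chinese automatic summarization, and provide a platform for training and testing existing algorithms for automatic Chinese discourse parsing. A Chinese RST treebank also provides the material basis for borrowing discourse processing techniques from English, German, and other languages, and will help Chinese information processing narrow the gap with information processing in other languages as quickly as possible. Second, our annotation study has tested, on fairly large-scale data, the transferability of Rhetorical Structure Theory and its formalization to Chinese. It has also broadened, from the perspective of rhetorical structure, the scope of research on Chinese discourse cue markers. Given well-matched comparable corpora, cross-language and cross-genre contrastive studies also become possible. In addition, although corpus construction is still rarely used to provide resources for the humanities and social sciences, we can foresee wide uses for it, such as mining pragmatic facts from large-scale corpora. Corpus-based linguistic research on Chinese news commentary is likewise a wide-open field.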
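
The punctuation-based segmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function name `segment_euda`, the regular expression, and the choice to keep each delimiter attached to the unit it closes are assumptions made for illustration. The thesis's own rules for rejoining segments, or for marks such as closing quotation marks after a full stop, are not reproduced here.

```python
import re

# Delimiters named in the abstract, written as escapes to avoid encoding
# ambiguity. This inventory is an assumption about the exact code points.
ELLIPSIS = "\u2026\u2026"        # Chinese ellipsis (two U+2026 characters)
DASH = "\u2014\u2014"            # Chinese dash (two U+2014 characters)
SINGLE_MARKS = "\u3002\uff1f\uff01\uff1b\uff1a"  # full stop ? ! ; : (full-width)

# One or more consecutive delimiters count as a single boundary.
_BOUNDARY = re.compile(
    "(?:" + re.escape(ELLIPSIS) + "|" + re.escape(DASH)
    + "|[" + SINGLE_MARKS + "])+"
)


def segment_euda(text):
    """Split text into candidate EUDAs (hypothetical helper).

    Each line break is treated as an end-of-paragraph delimiter, and every
    unit keeps the punctuation that closes it.
    """
    units = []
    for paragraph in text.splitlines():
        start = 0
        for match in _BOUNDARY.finditer(paragraph):
            unit = paragraph[start:match.end()].strip()
            if unit:
                units.append(unit)
            start = match.end()
        tail = paragraph[start:].strip()  # text with no closing delimiter
        if tail:
            units.append(tail)
    return units
```

Calling `segment_euda` on a two-paragraph string returns a flat list of units in reading order. A real pipeline would still need the occasional manual rejoining of segments that the abstract mentions.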

【Abstract】 The revival of the empirical paradigm and the application of machine learning have made the construction of linguistic resources a crucial task in natural language processing. Progress in character, word, and sentence processing, together with the ultimate goal of discourse processing, has made discourse annotation an international frontier. This dissertation reports my efforts to enrich Chinese language resources by building a Chinese news commentary treebank, with Rhetorical Structure Theory (RST) as its theoretical framework. Following internationally established methodology in corpus construction, I first carried out a pilot study and then, on the selected corpus, took the necessary steps of pre-processing, segmentation, relation annotation, validity checking, and a coder-agreement test to ensure the quality of the annotated texts. Using statistics obtained from the finished part of the corpus, I studied various correlations between rhetorical structure and surface linguistic forms. The study can provide a priori scores for automatic Chinese text parsers and summarizers, as well as data for quantitative linguistic studies.

Specifically, I did the following work.

Before turning to the detailed tasks of corpus construction, I compared the English-rooted RST with traditional Chinese linguistic studies of Sentence Complexes (Fuju), Sentence Groups (Juqun), Discourse, and Literary Composition (Wenzhang Xue), examining their similarities and differences. The evidence shows that the two traditions share common ground in their hypotheses about discourse structure and in many specific observations. RST, however, is more consistent in its communicative perspective on language, and therefore places more emphasis on the tie between the writer's intentions and the nuclear status of discourse units, insists more strongly on homogeneity among layers of discourse units, and puts more effort into formalization.
This comparison, together with a review of Chinese RST studies and of international RST treebank achievements, showed that a large-scale, corpus-based analysis of Chinese texts is both feasible and necessary.

For that purpose, I compiled a Chinese financial news commentary corpus (Caijingpinglun, CJPL) of 400 news texts totaling about 780,000 characters. Consisting mainly of financial news reports and commentaries, the CJPL corpus is fairly comparable to the English WSJ-RST treebank, built from Wall Street Journal articles, and to the German PCC treebank, built from Maerkische Allgemeine Zeitung commentary articles. After finishing the pre-processing steps, I first tagged every text in CJPL, reading as an ordinary reader, with basic documentary information, including Genre, Topic, Title, Lead, Opening, Ending, Source, Author, and Publisher.

I then carried out a semi-automatic segmentation procedure based on selected EUDA (Elementary Unit of Discourse Analysis) delimiters, namely the full stop, question mark, exclamation mark, end-of-paragraph sign, semicolon, colon, ellipsis, and dash. The selection of these delimiters was based on a corpus study of their distribution, which revealed that they not only signal the boundaries of discourse units in the majority of cases but also help keep the granularity of later discourse analysis small. The procedure yielded undisputed segments that need only occasional rejoining and no further manual segmentation.

After segmentation, I made a trial annotation of all the inter-EUDA relations in the 400 texts, up to the completion of a discourse tree covering each whole text. By that time I felt I had gained a fair understanding of my texts. I then excluded 2 texts of questionable integrity and 3 lengthy TV interview transcripts consisting mainly of oral exchanges.

Starting from the linguistic facts of my corpus, I drafted a Chinese rhetorical relation set of 47 relations, together with their definitions.
I also drafted an inventory of 19 scheme elements for news texts and a working manual on how to handle typical problems in relation tagging. While composing the definitions and the manual, I referred constantly to various rhetorical relation inventories and to traditional Chinese studies. In addition, I conducted a psycholinguistic study of native speakers' preferences for certain structures and relation definitions.

Building on this trial tagging, I annotated the 97 shortest documents among 197 randomly selected from the 395 qualified corpus texts, following the relation definitions and tagging conventions I had drafted. Each of the 97 documents was annotated twice and, when the whole set was finished, checked for tree-structure validity. A third round of annotation unified the choices made in the first and second rounds, followed by a coder-consistency test and the extraction of data for statistical analysis.

Apart from the rhetorical annotation, I also tagged inter-EUDA cue phrases (including inter-EUDA connectives and connectors, inter-EUDA deictic and pronominal anaphora, and punctuation marks with a discourse function).
This tagging was done without reference to the rhetorical annotation.

Data extracted from the completed portion of the CJPL corpus suggest the following points:

1) Following certain principles and conventions, the majority of Chinese news commentaries (93.1%) can be represented by a tree structure.
2) The rhetorical relations (RRs) defined can be applied recursively to different layers of Chinese discourse units, which demonstrates good homogeneity in Chinese text structure.
3) The extended relations of the classic RST set (Mann and Thompson 1988, Mann 2005) cover 90.4% of all cases in the Chinese Financial News Commentary Corpus, and the rest can be covered by variants of those known RRs.
4) The most common overall tree structure in CJPL is an opening sentence as Nucleus with a Satellite of multinuclear relations (14.4%), followed by an opening sentence as Nucleus with a Satellite of mononuclear nucleus (13.4%), an opening sentence as Satellite with a Nucleus of mononuclear nucleus (13.4%), and an opening sentence as Satellite with a Nucleus of mononuclear satellite and a closing nucleus (11.3%).
5) 53.6% of the root relations in the body of Chinese news commentaries are JUSTIFY, EVALUATION, and other presentational relations, which suggests a wide difference between the practical definition of commentary in the Chinese mass media community and the theoretical definition given by linguists.
6) Despite a high percentage (35.4%) of multinuclear relations, hypotactic mononuclear relations still form the majority.
7) In contrast to the satellite-before-nucleus (S-N) order generally assumed to dominate within Chinese sentence complexes, there is apparently no such dominant order between and above the units delimited by our selected markers.
8) The different distribution patterns of RRs in the commentary-dominated CJPL corpus and in a report-dominated corpus of Chinese TV news reports (Xinwenlianbo, XWLB) suggest that genre influences RR distribution.
9) About 28.5% of all inter-EUDA relations in CJPL are marked with conjunctions or connectors. The most frequently used is "而 (ER)", and the most frequently marked relations are CONJUNCTION-M and LIST-M.
10) Some relations, such as CONJUNCTION-M, CONCESSION-M/N/S, and LIST-M, are frequently marked with conjunctions, while others, such as MEANS-S, ATTRIBUTION-S, EVALUATION-M/N, INTERPRETATION-N, and SOLUTION-M/N, are rarely marked with conjunctions.
11) Some common conjunctions are used both below and above sentence level in CJPL, but with different distribution patterns: some are much more frequent below sentence level, some are much more frequent above it, and some show no significant difference.
12) Some inter-EUDA connectors are used consecutively in CJPL texts. Their functions are mainly of three types: mitigating or amplifying modality, restricting each other's rhetorical potential, and governing different discourse units.
13) The most frequently used inter-EUDA deictic expressions are "这 (ZHE)" and words or phrases beginning with "这 (ZHE)".
14) Some punctuation marks, such as the question mark, semicolon, and colon, correlate strongly with certain RRs and their nuclearity patterns.
15) Despite some strong correlations, there is no one-to-one mapping between discourse cue phrases (connectives, connectors, anaphoric deixis, punctuation marks) and RRs.
16) A characteristic spiral-shaped subtree structure has been identified in CJPL trees. In this structure, a discourse unit always relates not to its immediate neighbor but to the most distant unit in the subtree. If this is what Kaplan (1966) meant by the typical circular Chinese discourse structure, and if such cases are found in longer CJPL texts at a significant level, we could say that Kaplan held at least part of the truth.

Although this Chinese RST treebank project has been only partially completed, it promises practical value for discourse studies and cultural studies as well as for Chinese information processing.

First, it is the first attempt to build an RST-annotated Chinese discourse treebank. Once other layers of linguistic information are added in the near future, the corpus can be used to extract the a priori scores needed by Chinese summarizers and as a platform for training and testing statistics-based discourse parsers. The treebank will therefore serve as a testbed for Chinese computer scientists to catch up with their international competitors in discourse processing.

Second, our annotation efforts have demonstrated on a fairly large scale the cross-language transferability of RST and its formalization. They have also opened new territory for studies of Chinese cue phrases. Exciting findings can be expected in Chinese discourse studies once the treebank is complete. Given comparable corpora in other languages or other genres, the corpus could also serve as an empirical database for contrastive rhetorical studies.

Finally, we can also foresee uses for annotated Chinese discourse corpora in the social sciences, for instance in corpus-driven studies of journalism or pragmatics.
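
The abstract mentions a tree-structure validity check but does not specify it. The following Python sketch shows one plausible form of such a check, under the assumption (not taken from the thesis) that each node records the inclusive range of EUDA indices it spans. The root must cover the whole text, leaves must be single EUDAs, and the children of every internal node must be adjacent, non-overlapping, and together cover their parent. The names `Node` and `is_valid_tree` are hypothetical, as is the rule that an internal node needs at least two children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A span of EUDAs in a discourse tree (illustrative structure only)."""
    start: int                      # index of the first EUDA covered (inclusive)
    end: int                        # index of the last EUDA covered (inclusive)
    children: List["Node"] = field(default_factory=list)


def is_valid_tree(root: Node, n_units: int) -> bool:
    """Check that the tree tiles EUDAs 0..n_units-1 without gaps or overlaps."""
    if (root.start, root.end) != (0, n_units - 1):
        return False                # the root must span the whole text
    stack = [root]
    while stack:
        node = stack.pop()
        if node.start > node.end:
            return False            # empty or inverted span
        if not node.children:
            if node.start != node.end:
                return False        # a leaf must be exactly one EUDA
            continue
        if len(node.children) < 2:
            return False            # assumed: a relation joins two or more spans
        expected = node.start
        for child in sorted(node.children, key=lambda c: c.start):
            if child.start != expected:
                return False        # gap or overlap between sibling spans
            expected = child.end + 1
        if expected != node.end + 1:
            return False            # children do not reach the parent's end
        stack.extend(node.children)
    return True
```

A check of this kind only tests the span geometry. It says nothing about whether the relation labels or nuclearity assignments on the nodes are well chosen.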

  • 【Classification code】H15
  • 【Times cited】16
  • 【Downloads】1018