
语言形式化原理

A Study on Language Formalization

【Author】 Wang Mai (王迈)

【Advisor】 Wang Dechun (王德春)

【Author information】 Shanghai International Studies University (上海外国语大学), Foreign Linguistics and Applied Linguistics, 2011, PhD

【摘要】 论文主要从语言学和计算机科学的视角,探讨语言形式化的一般原理和方法。除绪论外,论文的主体还包括语音形式化、语义形式化、语法形式化、语用修辞形式化、文字形式化等,共六章。各章的主要内容及观点归纳如下:第一章绪论重点探讨语言形式与意义的关系问题,指出形式联系意义既是语言学研究的根本原则,也是语言形式化研究的根本原则,它是贯穿全文的指导思想。本章还探讨了形式化研究的学科支持、其在语言学体系中的地位和作用,以及语言形式化的层次和基本架构等。第二章为语音形式化。首先探讨语音的三种属性及其内在联系,这是语音形式化的基础,也是设计各种语音编码方案及压缩方案的重要参考。语音形式化的基本过程是采样、量化、编码;利用语音属性的不同特点,可以采取不均匀量化、差分量化、矢量量化、频域波形编码、参数编码等手段,以提高语音形式化的效率和质量。本章还分别探讨了语音压缩、语音合成的自然度以及语音识别的概率模型等问题。第三章语义形式化是全文的重点。首先探讨符号主义范式的基本架构及工具,包括图灵机、有限状态自动机、正则表达式等;以及基于符号主义的几种代表性的语义形式化方法,包括义素分析、逻辑语义分析、语义格分析、词性分析等;这些形式化方法的效果都不理想,其根本原因在于忽视语义系统无限性这一本质属性,而任何对语义系统的有限化改写都将造成语义缺失,破坏其完整性,最终导致失败。与此相对,联结主义从人的自然生理结构出发,把人脑看成由众多节点联结而成的开放式关系网络,具有并行处理、容错、自学习、遗忘、规则浮现等特征,这与人脑中的概念网络结构十分相似,是词汇语义形式化的理想模型。计算机语言作为典型的符号主义描写工具,伴随其智能化处理能力的严重不足,业已表现出明显的联结主义转向。模糊性是语义形式化的另一基本问题。语言的模糊性非源于语言单位的有限性,也非源于客观世界的模糊性,它源于人脑对客观世界的认知方式,其中比较和概念化过程是模糊性产生的关键节点,而模糊性的产生反又促进了人脑认知效率的大幅提升。符号主义范式对语义进行有限化改写的过程中所摒弃主要内容正是模糊性,而联结主义范式可以实现对语义清晰与模糊的全覆盖。第四章讨论语法形式化。概念意义是明示的、开放的,语法意义是暗示的、封闭的,概念意义抽象为语法意义的过程,就是从明示的到暗示、从无限到有限的过程,它受到语言发展经济规律的制约。概念关系是多维的、普遍联系的,从深层概念结构到表层句法结构,是一个降维的线性化过程,语法就是作为多维信息损失的补偿机制而产生的。语法单位的有限性决定了其较词汇语义更易于形式化,符号主义范式可以胜任这一工作。本章还讨论了语法形式化的一些具体问题和难点,包括上下文无关语法及N元语法、词类划分、汉语的分词及词性标注等。最后作为示例探讨了“把”字结构,指出其句型意义为“不同类个体之间竞争关系的表达”,在此基础上给出其句法结构的语义构成,包括优势竞争者、劣势竞争者、竞争方式、竞争结果四项。第五章探讨语用修辞形式化,其基础是语境的形式化,包括参与者信息、客观环境、上下文、语言知识、常识性知识、社会文化背景知识等六类。基于实用性考虑,形式语境的构成不再区分语言性和知识性,而是影响意义表达和意义理解的一切因素的总和。本章用C++程序构建了一个基本的语境类,并讨论了该语境类在具体言语交际中的运作模式,虽然很不完善,却是一次全新的尝试。本章还讨论了一类特殊的修辞格——通感。通感既是五种感觉之间的相通,同时也是内省的情绪、情感之间的交融。通感与比喻、比拟等传统辞格具有相同的认知心理基础,都是处在心智连续统上的不同区域间的彼此联通,因此可以把它们共同纳入广义的通感范畴。心智连续统是辞格形式化的重要参考模型。最后一章是文字形式化。首先探讨文字的信息量——熵的概念,指出汉字的诸多特点包括字形复杂、数量庞大、区别度高、信息量大等,都与其高熵值密切相关。进一步观察,还可以发现隐藏在信息熵之下的语言共性,而词汇概念体系的复杂程度是衡量一种语言发达程度的根本标准。第二部分阐述文字形式化的具体内容,主要围绕文字的内码、外码和形码展开,包括各种主要的形式化方案和各自的优缺点。最后探讨文字识别的基本原理及实现。

【Abstract】 This thesis examines the general principles and methods of language formalization, mainly from the perspectives of linguistics and computer science. It has six chapters in all: an introduction, followed by chapters on the formalization of phonetics, semantics, grammar, pragmatics and rhetoric, and writing. The main content and arguments of each chapter are summarized below.

Chapter 1, the introduction, focuses on the relation between linguistic form and meaning. It argues that linking form to meaning is the fundamental principle of linguistic research and equally of research on language formalization, and this principle guides the whole thesis. The chapter also discusses the disciplines that support formalization research, its place and role within linguistics, and the levels and basic architecture of language formalization.

Chapter 2 deals with the formalization of speech sounds. It first examines the three attributes of speech sounds and their interrelations, which form the basis of phonetic formalization and an important reference for designing speech coding and compression schemes. The basic process consists of sampling, quantization, and coding. By exploiting the different characteristics of these attributes, techniques such as non-uniform quantization, differential quantization, vector quantization, frequency-domain waveform coding, and parametric coding can improve the efficiency and quality of formalization. The chapter also discusses speech compression, the naturalness of synthesized speech, and probabilistic models for speech recognition.

Chapter 3, on the formalization of semantics, is the core of the thesis. It first examines the basic architecture and tools of the symbolist paradigm, including the Turing machine, finite-state automata, and regular expressions, and then several representative symbolist methods of semantic formalization, including sememe (componential) analysis, logical semantic analysis, semantic case analysis, and part-of-speech analysis.
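As an illustration of the symbolist tools named above, the following minimal sketch shows a finite-state automaton and a regular expression that accept the same toy language. The language (strings over "a" and "b" that end in "ab"), the transition table, and the function names are assumptions made for this example; they are not taken from the thesis.

```python
import re

# Hypothetical toy language: strings over {a, b} that end in "ab".
# Transition table of a deterministic finite-state automaton.
# State 0: start; state 1: last symbol was "a"; state 2: last two were "ab".
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
ACCEPTING = {2}


def dfa_accepts(text: str) -> bool:
    """Run the automaton over text; reject on any symbol outside {a, b}."""
    state = 0
    for symbol in text:
        if (state, symbol) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


# A regular expression describing the same language.
PATTERN = re.compile(r"[ab]*ab")


def regex_accepts(text: str) -> bool:
    """Accept only if the whole string matches the pattern."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for word in ["ab", "abab", "abb", "ba", ""]:
        print(repr(word), dfa_accepts(word), regex_accepts(word))
```

Both recognizers rely on a finite, fully enumerated rule set, which is the property the abstract attributes to symbolist tools in general.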
However, none of these methods works well. The root cause is that they ignore an essential property of the semantic system, its infinity: any finite rewriting of the semantic system loses meaning, destroys its integrity, and ultimately fails. Connectionism, by contrast, starts from human physiological structure and treats the brain as an open relational network of many interconnected nodes, characterized by parallel processing, fault tolerance, self-learning, forgetting, and rule emergence. This closely resembles the conceptual network in the human brain, which makes connectionism an ideal model for formalizing lexical semantics. Computer languages, the typical descriptive tools of symbolism, have already shown a clear turn toward connectionism as their serious lack of intelligent processing ability has become apparent.

Fuzziness is another basic problem of semantic formalization. The fuzziness of language arises neither from the finiteness of linguistic units nor from fuzziness in the objective world; it arises from the way the human brain cognizes the objective world. Comparison and conceptualization are the key points at which fuzziness emerges, and fuzziness in turn greatly improves cognitive efficiency. What the symbolist paradigm discards when it rewrites semantics in finite terms is mainly this fuzziness, whereas the connectionist paradigm can cover both the clear and the fuzzy parts of meaning.

Chapter 4 discusses the formalization of grammar. Conceptual meaning is explicit and open, whereas grammatical meaning is implicit and closed. The abstraction of conceptual meaning into grammatical meaning is a move from the explicit to the implicit and from the infinite to the finite, and it is constrained by the economy principle of language development. Conceptual relations are multidimensional and universally interconnected; the passage from deep conceptual structure to surface syntactic structure is a linearization that reduces dimensionality, and grammar arises as a mechanism that compensates for the resulting loss of multidimensional information. Because grammatical units are finite, grammar is easier to formalize than lexical semantics.
The symbolist paradigm is adequate for this task. The chapter also discusses specific problems and difficulties in grammar formalization, including context-free grammars and N-gram models, word-class division, and Chinese word segmentation and part-of-speech tagging. Finally, it takes the "ba" (把) construction as an example, arguing that the meaning of this sentence pattern is "the expression of a competitive relation between individuals of different kinds", and on that basis gives the semantic components of its syntactic structure: the superior competitor, the inferior competitor, the mode of competition, and the outcome of competition.

Chapter 5 examines the formalization of pragmatics and rhetoric. Its basis is the formalization of context, which comprises six categories: participant information, the objective environment, the co-text, linguistic knowledge, common-sense knowledge, and sociocultural background knowledge. For practical reasons, formal context no longer distinguishes linguistic from knowledge-based components; it is the sum of all factors that affect the expression and understanding of meaning. The chapter builds a basic context class in C++ and discusses how that class operates in concrete verbal communication; the attempt is far from complete, but it is a new one. The chapter also discusses a special figure of speech, synaesthesia. Synaesthesia is both the interconnection of the five senses and the blending of introspective moods and emotions. It shares the same cognitive-psychological basis as traditional figures such as metaphor and analogy: all are connections between different regions of a mental continuum, so they can be brought together under a broad category of synaesthesia. The mental continuum is an important reference model for formalizing figures of speech.

The last chapter discusses the formalization of writing. It first discusses the information content of writing, that is, the concept of entropy.
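The entropy notion mentioned here can be made concrete with a small sketch. It assumes the standard Shannon definition, H = -Σ p(x) log2 p(x), estimated from character frequencies in a sample string; the function name and the sample strings are hypothetical, and the abstract does not say how the thesis itself estimates entropy.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Estimate entropy in bits per character from relative frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Toy samples only; real estimates need large corpora.
    for sample in ["aaaa", "abab", "abcd"]:
        print(sample, shannon_entropy(sample))
```

For a script with N equally likely symbols the entropy reaches its maximum of log2(N) bits per symbol, so a larger character inventory permits a higher per-character value.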
It points out that many features of Chinese characters, including complex shapes, a very large inventory, high distinctiveness, and high information content, are closely related to their high entropy. Closer observation also reveals linguistic universals hidden beneath information entropy, and the complexity of a language's lexical-conceptual system is the fundamental measure of how developed the language is. The second part of the chapter sets out the specifics of formalizing writing, centering on internal codes, external codes, and graphemic codes, and covering the main formalization schemes and their respective strengths and weaknesses. Finally, it discusses the basic principles and implementation of character recognition.
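To give a concrete sense of what an internal code is, the sketch below prints the stored byte sequence of one Chinese character under two widely used encodings. It assumes that "internal code" refers to the machine-level encoding of a character; the choice of UTF-8 and GB 2312 and the helper name `internal_codes` are illustrative assumptions, since the abstract does not name the schemes the thesis covers.

```python
def internal_codes(char: str) -> dict:
    """Return the Unicode code point and encoded bytes (hex) of one character."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "utf-8": char.encode("utf-8").hex(),
        "gb2312": char.encode("gb2312").hex(),
    }


if __name__ == "__main__":
    # The character 汉 is stored as three bytes in UTF-8 and two in GB 2312.
    print(internal_codes("汉"))
```

The same character thus has different byte-level representations depending on the scheme.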

  • 【Classification code】 H0-02
  • 【Times cited】 1
  • 【Downloads】 434