中文分词文献列表 Bibliography of Chinese Word Segmentation
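Maintained by Zhang Kaixu (张开旭) (Chinese word segmentation experiment environment)
Comments and suggestions are welcome; feel free to contact the author :)
Page generated on 2012-09-28, produced automatically by the bibpage tool from a bib-format bibliography.

2012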
Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Joint Chinese Word Segmentation, POS Tagging and Parsing
    Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Unified Dependency Parsing of Chinese Morphological and Syntactic Structures
    Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation
    Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2011
Improving Chinese Word Segmentation and POS Tagging with Semi-supervised Methods Using Large Auto-Analyzed Data
    Proceedings of 5th International Joint Conference on Natural Language Processing

Enhancing Chinese Word Segmentation Using Unlabeled Data
    Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
    Note: Feature engineering; uses in-domain unlabeled data to help Chinese word segmentation. The added features are mutual information, accessor variety, punctuation-based features, and document-level features. Another finding is that real-valued feature values work worse than binary ones.

A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
    Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
    Note: Uses stacked learning, a meta-learning algorithm, with a mechanism that keeps the two layers from training on overlapping data while still making the most of the data. The first layer uses three models: word-based, character sequence labeling, and single-character classification.

Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
    Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
    Note: Combines lexical analysis with syntactic parsing, using different "constituent" labels within a single tree. Decodes with a parsing algorithm.

Syntactic Processing using the Generalized Perceptron and Beam Search
    Computational Linguistics
    Note: A summary of earlier work. Applies the averaged perceptron to Chinese lexical analysis and syntactic parsing, with beam search.

A New Unsupervised Approach to Word Segmentation
    Computational Linguistics

2010
A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model
    Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
    Note: Decoding speed rises from 2.24 to 24.94 sentences per second.

Joint Tokenization and Translation
    Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

A Character-Based Joint Model for Chinese Word Segmentation
    Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
    Note: Integrates a generative model with a discriminative model. Also finds that adjusting the weights of certain binary feature values improves results.

A Local Generative Model for Chinese Word Segmentation
    Information Retrieval Technology
    Note: Proposes a segmentation method based on local language models. Also proposes a way to build a binary segmentation tree to handle segmentation granularity; the tree can be built directly from CRF output as well.

Joint training and decoding using virtual nodes for cascaded segmentation and tagging tasks
    Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009
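Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study
    Proceedings of the 47th ACL
    Note: Perceptron; joint segmentation and POS tagging. Transfers parameters learned under one annotation standard for use under another.

Character-Level Dependencies in Chinese: Usefulness and Learning
    Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
    Note: Segments with character dependency trees. In the final system, dependencies inside a word are lexical character dependencies and dependencies between words are linear. The final results, unsurprisingly, fall short of the best existing systems.

基于字依存树的中文词法-句法一体化分析 (Integrated Chinese lexical and syntactic analysis based on character dependency trees)
    中国计算机语言学研究前沿进展 (2007-2009) (Frontiers of Chinese Computational Linguistics Research, 2007-2009)

基于 CRFs 的中文分词和短文本分类技术 (CRF-based Chinese word segmentation and short-text classification)
    Note: For segmentation, chi-square feature selection keeps performance intact with half of the features. The presence or absence of individual characters (such as "的", "和", "了") can help or hurt the correctness of the whole-sentence segmentation. Uses the CRF confidence output; low confidence goes with a high error rate. Adds rule-based and document-context-statistics-based post-processing for low-confidence output.

A Simple and Efficient Model Pruning Method for Conditional Random Fields
    Note: After CRF training, removing most features by parameter value does not reduce performance, which shows empirically that CRFs carry a great deal of redundancy.

Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures
    Expert Systems with Applications
    Note: Several ways of adding various features to improve CRF performance.

Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling
    Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
    Note: Uses Pitman-Yor processes to build a two-level language model, one level for words and one for sentences.

Punctuation as Implicit Annotations for Chinese Word Segmentation
    Computational Linguistics

An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging
    Proc. of ACL-IJCNLP 2009
    Note: Dictionary words and unknown words are handled separately.

2008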
Word Lattice Reranking for Chinese Word Segmentation and Part-of-Speech Tagging
    Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
    Note: Uses reranking. Unlike top-n reranking, it reranks over a word lattice of exponential size. Judging by the oracle at least, the latter beats the former. Problems addressed: how to build the lattice, how to compute the oracle, which features to use, and cube pruning during reranking.

Joint Word Segmentation and POS Tagging Using a Single Perceptron
    Proceedings of ACL-08: HLT

A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
    Proceedings of ACL-08: HLT

Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition
    The Sixth SIGHAN Workshop on Chinese Language Processing
    Note: Discretizes accessor variety (AV) results, distributes them onto characters, and feeds them to the CRF as input, which improves segmentation.

An Empirical Comparison of Goodness Measures for Unsupervised Chinese Word Segmentation with a Unified Framework
    The Third International Joint Conference on Natural Language Processing (IJCNLP-2008), Hyderabad, India
    Note: Describes four goodness measures for unsupervised Chinese word segmentation: Frequency of Substring with Reduction; Description Length Gain (DLG); Accessor Variety (AV); Boundary Entropy (Branching Entropy, BE).

Bayesian semi-supervised chinese word segmentation for statistical machine translation
    Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation
    Text, Speech and Dialogue

Information retrieval oriented word segmentation based on character associative strength ranking
    Proceedings of the Conference on Empirical Methods in Natural Language Processing
    Note: Segments with a Ranking SVM approach, for IR.

2007
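Chinese Segmentation with a Word-Based Perceptron Algorithm
    Note: Uses the averaged perceptron with a lazy update scheme. Because the features are word-based, decoding uses beam search; greedy search and dynamic programming cannot be used.

基于有效子串标注的中文分词 (Chinese word segmentation based on effective substring tagging)
    中文信息学报 (Journal of Chinese Information Processing)

中文分词十年回顾 (A ten-year review of Chinese word segmentation)
    中文信息学报 (Journal of Chinese Information Processing)
    Note: Agreement on what counts as a Chinese word. From the 863 and 973 evaluations to the SIGHAN bakeoffs. Corpus quality control (including rules for "psychological words"). Grammar-based and rule-based methods lose to word-based ones, which are in turn replaced by character-based ones. In large-scale real text, the accuracy loss caused by out-of-vocabulary words is at least 5 times the loss caused by segmentation ambiguity. Character-based methods: maximum entropy, SVM, CRF, and so on. Word-position transitions; 2-tag, 4-tag, and Microsoft's 6-tag schemes. A 5-character window is enough.

A dual-layer CRFs based joint decoding method for cascaded segmentation and labeling tasks
    Proceedings of IJCAI
    Note: Dual-layer CRFs for segmentation and POS tagging; solid but unremarkable. The first layer segments using character information; the second layer tags POS using word and character information. The two CRF layers are trained separately and tested jointly: the first layer produces an N-best list, which is then reranked by combining the results of both layers.

A hybrid approach to word segmentation and pos tagging
    ANNUAL MEETING-ASSOCIATION FOR COMPUTATIONAL LINGUISTICS
    Note: A lattice combining characters and words, followed by joint segmentation and tagging. Still uses a Markov model.

Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification
    Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
    Note: Does not use character tagging; looks directly at the gaps between characters (break or no break) and decides with a sliding window.

2006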
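Subword-Based Tagging for Confidence-Dependent Chinese Word Segmentation
    Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions
    Note: Subword-based tagging; for example, 北京市 is tagged as 北京/l 市/r, though the system still uses a three-tag scheme. Uses CRF confidence to combine the tagger with a dictionary-based method. The CRF tends toward a higher OOV F1 and a lower IV F1.

汉语词典的快速查询算法研究 (A study of fast lookup algorithms for Chinese dictionaries)
    中文信息学报 (Journal of Chinese Information Processing)
    Note: The double-array trie is a very efficient dictionary lookup algorithm and suits Chinese word segmentation well. Put simply, it hashes character by character with the trivial hash function f(x)=x and no collisions, so it is fast. Maintaining the double array is hard, though.

An improved Chinese word segmentation system with conditional random field
    Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing
    Note: 6-tag set; tone feature; assistant segmenters.

Discriminative pruning of language models for Chinese word segmentation
    Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics

Contextual Dependencies in Unsupervised Word Segmentation
    Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
    Note: Language model and lexical model based on the Dirichlet process; Gibbs sampling over two words at a time.

2005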
Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach
    Computational Linguistics
    Note: Learns a linear model with the perceptron. Unlike character tagging, it builds a word lattice before decoding, which in effect shrinks the set of possible character-tagging results in advance. Words are divided into several classes, and for each class some probability values are computed and used as the perceptron's parameters. The perceptron's parameters are all non-binary. Only word-class trigram probabilities are used; no specific characters are involved.

A conditional random field word segmenter for sighan bakeoff 2005
    Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing
    Note: One of the quite good systems in SIGHAN bakeoff 2005. Adds simple affix and reduplicated-character features to the CRF.

Perceptron Learning for Chinese Word Segmentation
    Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05)

The second international chinese word segmentation bakeoff
    Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

A Statistic Study of Three-character Unknown Words in Chinese
    Journal of Chinese Language and Computing

2004
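Chinese Segmentation and New Word Detection using Conditional Random Fields
    Proceedings of Coling 2004
    Note: Introduces CRFs to Chinese word segmentation.

Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?
    Proceedings of EMNLP 2004
    Note: Tries three approaches with a maximum entropy model: segmentation and tagging done separately or jointly, with character-based or word-based features for POS tagging. Joint and character-based is best but much slower. Separate and character-based is slightly worse but much faster. Separate and word-based has, naturally, the same segmentation performance as character-based, but much worse POS tagging and a slightly shorter total time. POS tagging suffers because the characters inside a word matter a lot for determining its POS. There is no joint word-based setting, presumably because it was computationally infeasible. There is also no experiment with word-based features at the segmentation stage.

基于无指导学习策略的无词表条件下的汉语自动分词 (Automatic Chinese word segmentation without a lexicon, based on an unsupervised learning strategy)
    计算机学报 (Chinese Journal of Computers)
    Note: Uses mutual information and the difference of t-test scores as two criteria for unsupervised segmentation at the character level. Per-character tagging accuracy reaches about 85%.

Applying conditional random fields to Japanese morphological analysis
    Proc. of EMNLP
    Note: Uses a modified CRF model for Japanese word segmentation. It is word-based, so the length of y does not necessarily equal that of x.

Adaptive Chinese word segmentation
    Proceedings of ACL-2004

Unsupervised segmentation of Chinese corpus using accessor variety
    Natural Language Processing IJCNLP 2004
    Note: How to build a segmenter with accessor variety, and how to design the objective function.

Accessor variety criteria for Chinese word extraction
    Computational Linguistics

2003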
HHMM-based Chinese lexical analyzer ICTCLAS
    Proceedings of the second SIGHAN workshop on Chinese language processing-Volume 17
    Note: Introductory paper for the practical segmentation toolkit ICTCLAS.

Chinese lexical analysis using hierarchical hidden markov model
    Proceedings of the second SIGHAN workshop on Chinese language processing-Volume 17

Chinese Word Segmentation as LMR Tagging
    Proceedings of the second SIGHAN workshop on Chinese language processing-Volume 17

Chinese Word Segmentation as Character Tagging
    Computational Linguistics and Chinese Language Processing

The first international Chinese word segmentation bakeoff
    Proceedings of the second SIGHAN workshop on Chinese language processing

A maximum entropy Chinese character-based parser

Improved source-channel models for Chinese word segmentation
    Proceedings of the 41st Annual Meeting on Association for Computational Linguistics

Chinese word segmentation using minimal linguistic knowledge
    Proceedings of the second SIGHAN workshop on Chinese language processing

Combining segmenter and chunker for Chinese word segmentation
    Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing

2002
Combining classifiers for Chinese word segmentation
    Proceedings of the 1st SIGHAN Workshop on Chinese Language Processing
    Note: A milestone; the first to propose a character-tagging model for segmentation.

Corpus-based methods in Chinese morphology
    Tutorial at the 19th COLING

Corpus-based methods in Chinese morphology and phonology
    COLING 2002

2001
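汉语自动分词研究评述 (A review of research on automatic Chinese word segmentation)
    当代语言学 (Contemporary Linguistics)
    Note: A fairly good review of and commentary on Chinese word segmentation research in the last century. Ambiguity (overlapping ambiguity and covering ambiguity); OOV.

Defining and automatically identifying words in Chinese

Self-supervised Chinese word segmentation
    Advances in Intelligent Data Analysis
    Note: Purely unsupervised segmentation with the EM algorithm. Self-supervised, split into two lexicons. Lexicon pruning by MI.

2000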
A compression-based algorithm for Chinese word segmentation
    Comput. Linguist.

1999
Discovering Chinese words from unsegmented text (poster abstract)
    Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
    Note: Purely unsupervised segmentation; EM algorithm; order-0 hidden Markov chain.

1998
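串频统计和词形匹配相结合的汉语自动分词系统 (An automatic Chinese word segmentation system combining string-frequency statistics and word-form matching)
    中文信息学报 (Journal of Chinese Information Processing)

Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
    Proceedings of the 17th international conference on Computational linguistics-Volume 2

A hybrid approach to word segmentation
    Lecture notes in computer science

1997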
中文信息处理中的分词问题 (The word segmentation problem in Chinese information processing)
    Applied Linguistics

An unsupervised iterative method for Chinese new lexicon extraction
    International Journal of Computational Linguistics & Chinese Language Processing

1996
A stochastic finite-state word-segmentation algorithm for Chinese
    Computational Linguistics

Useg: A retargetable word segmentation procedure for information retrieval
    Symposium on Document Analysis and Information Retrieval

1992
An efficient implementation of trie structures
    Software: Practice and Experience
    Note: Double-array trie.

Word identification for Mandarin Chinese sentences
    Proceedings of the 14th conference on Computational linguistics-Volume 1
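
The double-array trie mentioned for "An efficient implementation of trie structures" and in the 2006 dictionary-lookup note can be illustrated with a small sketch. This is a simplified static build under stated assumptions, not the algorithm from either paper: the names `build_double_array`, `contains`, and the `END` sentinel are hypothetical, characters are given small integer codes, and no insertion or deletion after the build is supported. A transition from state `s` on character code `c` is the direct index `base[s] + c`, accepted only when `check` records `s` as the owner of that cell, which is one way to read the "hash function f(x)=x, no collisions" remark.

```python
from collections import deque

END = "\0"  # hypothetical end-of-word sentinel, assumed absent from real words


def build_double_array(words):
    """Build (base, check, code) for a fixed word list (static build only)."""
    # Step 1: an ordinary nested-dict trie, used only during construction.
    trie = {}
    for word in words:
        node = trie
        for ch in word + END:
            node = node.setdefault(ch, {})

    # Step 2: give every character a small positive integer code.
    alphabet = sorted({ch for word in words for ch in word} | {END})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}

    # Step 3: place nodes breadth-first. State 0 is the root.
    # check[t] == -1 means cell t is free; otherwise it holds the parent state.
    base, check = [0], [0]
    queue = deque([(0, trie)])
    while queue:
        state, node = queue.popleft()
        if not node:
            continue  # leaf state: nothing to place
        codes = [code[ch] for ch in node]
        # Find the smallest offset b such that every cell b + c is free.
        b = 0
        while any(b + c < len(check) and check[b + c] != -1 for c in codes):
            b += 1
        grow = b + max(codes) + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([-1] * grow)
        base[state] = b
        for ch, child in node.items():
            t = b + code[ch]
            check[t] = state  # cell t now belongs to this parent
            queue.append((t, child))
    return base, check, code


def contains(double_array, word):
    """Return True if word was in the list passed to build_double_array."""
    base, check, code = double_array
    state = 0
    for ch in word + END:
        c = code.get(ch)
        if c is None:
            return False  # character never seen at build time
        t = base[state] + c  # direct indexing, no probing
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return True


da = build_double_array(["中文", "中文分词", "分词", "文献"])
print(contains(da, "中文分词"), contains(da, "中文分"))
```

Each lookup step costs one addition and one comparison. The linear search for a free offset in the build loop is the simplest possible placement strategy and hints at why keeping the arrays compact under updates is the hard part.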
Posted on 2013-03-18 17:05 by lexus