第一章_德塔自然语言图灵系统
基础应用: 元基催化与肽计算 编译机的语言分析机
知识来源, 作者对分词技术首先并不陌生, 感谢国内良好且扎实的免费义务教育和付费学历教育, 作者客观的系统学习了16年国学语文和英语基础. 对语文语言学的 节偏旁, 音, 字, 词, 句, 段, 章, 顶针, 搭配, 比拟修辞, 陈述, 议论, 散文, 古文, 诗词, 谚语, 成语, 语法结构, 多义, 多语意识, 辩论, 多态, 谐音, 归纳等语言学细节有基础的系统观和个人理解力.
其次感谢2009~2010法国ESIEE的Pascal教授. 在做MP6法语邮件项目的6个月中, 作者系统地学习了楔形文字的构造思想, 拉丁文的字意, 接触了Flesch弗莱士分词, 弗莱士元音词根搭配, 词缀的组合方式, 以及拉丁语系语言的意识理解密度. 还有欧娜法语培训关于法语名词的性别表达, 法语发音, 法语书写等细节课程.
有了这些基础, 作者感谢2011美国ELS的10个月学习. 作者从中国台湾(繁体), 中东, 日本, 朝鲜, 韩国和俄罗斯同学那里学习了这几个国家和地区的文字拼音组成方式, 发音方式, 书写方式和划分方式. 另外, 在ELS比较系统地重学了一遍美式英语的所有时态语法, 如美式发音, 英语文章的词汇语句搭配方式, 辩论, 形容词的描述序列, 谚语, 俚语和复句同位词的词序理解方式. 作者有一篇散文对这段记忆进行了叙述.
Story - ELS: Wearing a formal suit and a pair of black leather shoes, I stood stiffly in the hall of ELS. I felt happiness and some fear intertwined in my heart. A rectangular black table stood on the soft, gray, smooth carpet. I carried a package loaded with important documents. I could see many international students in fashionable clothes. I felt culture shock because the faces were not familiar to me. Ordinarily I was used to seeing people with black hair, not people with yellow or brown hair. In China all the people have dark eyes, but here the people have blue, brown or green eyes; as I looked at all the strange faces and eyes, I began to feel anxious. I was reading a newspaper when the phone rang. Glancing around the hall quickly, I found everyone focused on their work. My heart became calm after I put my cell phone on mute. To the left I could see through the window where the director's office was. The furnishings were a bit small but arranged in an orderly way in a small area. A corridor with access to the LTC was next to the door. I was happy to see some Chinese people waiting there for their interviews.
To the right was a gate to the small hall, where I saw a soft sofa; the ELS teachers' photos were hanging on the white wall next to it. I stood near the back gate, where I could see some magazines in the holding hook. Straight ahead was a vast globe map; I found my hometown immediately and pointed it out at once, feeling friendly and warm. When I went into the Asian room, I could remember very clearly that to my right was a new rectangular whiteboard where Mr. Joel wrote our English lessons. I quietly sat in the first row and looked straight ahead at two narrow windows with white blinds on them. Looking past the windows, I could see many tall, straight pine trees with lush green leaves. My mood brightened a lot. Everyone has a door deep in the heart, and a beautiful environment is often the best key; as a famous saying goes, "If you can't change the world, change yourself."
作者还要感谢2019年9月起北大杨叶. 作者持续3个月系统地学习了书法笔画研究, 关于蝌蚪文, 金文, 大篆, 小篆, 片假文, 竹节文, 汉简, 晋刻和隶书的书写方式, 叠加方式, 组合方式. 以及2018年3月起在浏阳三联的2个星期实习, 学习怎么以教师身份进行思维表达.
作者最后要感谢lucene分词和中科院基于lucene分词的上层隐马尔可夫中文概率分词分析插件, 2009年给作者启蒙了软件分词算法的意识, 以及阅读的华中科技等大学仅仅基于拼音概率分词的论文思维.
有了上述这6点基础, 再加上国外5年课程学历用英语教学, 于是作者在编码养疗经华瑞集系统的时候, 因为lucene中文分词速度逐渐跟不上亿字级的分词搜索需求, 有魄力于2018年8月起自己重新写一个分词软件并一直优化它.
作者以为没有学高三英语(当时从浏阳三中转校浏阳一中, 人家已经学完了在复习, 作者少学了很多科目)会有遗憾. 现在发现, 当年的这些遗憾因为这些年持续的作业, 都通过另一种价值方式体现了.
罗瑶光
测试速度: 单机联想Y7000笔记本win10 实测峰值每秒中文分词1630~1650万+中文字, 词库65000+, 函数准确率100%, 缺失语法函数0.3%-, 算法准确率99.7%+, 100%完整开放源码, 在api与书籍中.
测试效果: 输入: 如果从容易开始于是从容不迫天下等于是非常识时务必为俊杰沿海南方向逃跑他说的确实在理结婚的和尚未结婚的提高产品质量中外科学名著内科学是临床医学的基础内科学作为临床医学的基础学科重点论述人体各个系统各种疾病的病因发病机制临床表现诊断治疗与预防.
输出结果: 如果+从+容易+开始+于是+从容不迫+天下+等于+是非+常识+时务+必+为+俊杰+沿海+南+方向+逃跑+他+说+的+确实+在理+结婚+的+和+尚未+结婚+的+提高+产品质量+中外+科学+名著+内科学+是+临床+医学+的+基础+内科学+作为+临床+医学+的+基础+学科+重点+论述+人体+各个+系+统+各种+疾病+的+病因+发病+机制+临床+表现+诊断+治疗+与+预防+++++
定义: 德塔分词是一种 基于神经网络索引字典进行文章文字关联切割, 然后进行前序遍历其词性组合匹配, 按文学语法定义搭配 的规则切词引擎.
德塔分词的催化切词优化方式主要包含:
1 索引字典进行细化拆分加速. 细化微分能够有效地减少内存运算体积, 减少资源占用, 从而提高当前关于堆栈的搜索和操作速度.
2 函数进行使用频率统计排列加速优化. 函数的使用频率统计排列一旦有高频提前的操作, 便具备了队列优先意识, 可进行代谢.
3 动态类卷积遍历内核的关键字优化. 动态卷积内核的总数直接关联到计算复杂度, 计算越复杂, 成本便越高, 时间开销也越大, 当然自适应精度也相应提高.
4 函数文件和函数文件名进行新陈代谢, 二次新陈代谢优化索引编码加速. 函数越细化, 逻辑便越简洁, 那么单位的call计算便越均匀, 这种balanced操作越有条理.
5 文学切词语法函数的细化优化加速. 文学切词问题便更有针对性. 定义者 罗瑶光
Definition: Deta Parser word segmentation is a word-cutting engine based on the indexed forest dictionary of a neural-network map. It carries out associative word cutting, then circularly traverses the part-of-speech (POS) combination matching, with collocations defined according to Chinese literary grammar.
1 The index dictionary is refined and split for acceleration. Fine-grained differentiation effectively reduces the in-memory operation volume and the occupation of resources, so as to improve the current stack search and operation speed.
2 Function calls are accelerated by usage-frequency statistics and arrangement. Once high-frequency functions are moved forward, the queue gains a priority consciousness and can be metabolized. For example, the more frequent logic sections can be arranged at the top, following the von Neumann reading sequence (top to bottom, left to right).
3 The total number of dynamic convolution-like kernels is directly related to the computational complexity: the more complex the computation, the higher the cost and the larger the time overhead, though the adaptive accuracy also improves accordingly.
4 Function prototypes and function file names are metabolized by PDE, and a secondary metabolism via Initons accelerates the index encoding. The finer the functions, the more concise the logic, the more uniform each unit call, and the more orderly this balanced operation.
5 The refinement and optimization of the literary word-cutting grammar functions is accelerated, making the literary segmentation problem more targeted.
分词
1 德塔的分词是一种前序《排队论》逐字遍历文字索引, 通过索引中的词汇匹配按长度进行提取, 然后将提取的词汇串进行词性切分的过程. refer page 12~
2 德塔的分词文字索引采用关联分类生成小文件map集(词性map, 词长map, 词类map), 进行整体加速, 作为一个催化细化过程. refer page 44, 54, 92
3 德塔的词汇匹配目前有多个国家语言字符集, 可统一, 可拆分, 目前最大划分处理长度为4, 划分切词采用动态类CNN卷积(遍历POS函数语句的内核计算, 非卷积的积分叠加计算)的StringBuilder核做POS识别. refer page 45, 119, 120
4 德塔的词性切分按照4字词, 3字词, 2字词, 单字进行逐级按词汇的POS搭配语法模式进行归纳, 按文本的POS出现频率进行流水阀门方式优化. refer page 97, 116
Deta Parser, a sentence and word segmentation matching tool (NERO-NLP-POS), is based on in-order sequential verbal computation: the computer imitates a human reader, reading articles and sentences word by word as a river flows, then cuts the word linked list using stop methods such as index-length recognition and word-dictionary matching, extracting the pre-materials for the subsequent POS (part-of-speech) process.
First, the author built many association maps and classification sets that store the verbal data of all kinds of literary ortho-corpora (NERO), for instance POS maps, word-length maps and word-type maps, to better catalyze the acceleration tuning of the system.
Each lexical map supports combination and classification definitions. The author used the StringBuilder function to do the word segmentation with non-convolutional kernel computations. Like water flowing from top to bottom and left to right, the engine gathers usage-frequency statistics, makes queued optimizations at the same time, and catalyzes the system's acceleration tuning.
The word segmentation and its POS cutting check 4-character, 3-character and then 2-character Chinese words in a loop against the POS lexical dictionary-map storage. The kernel computation finds a matching verbal word, then returns the response to the NLP engine.
Yaoguang.Luo
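The first pass described above (scan the text character by character, trying dictionary windows from the maximum length of 4 downward) can be sketched as follows. This is an illustrative greedy longest-match sketch, not the actual DetaParser source; the class name and the tiny dictionary are invented for the example, and the real engine additionally corrects the cut with the POS gate rules described later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the first segmentation pass: scan the text left to right and,
// at each position, try dictionary windows from length 4 down to 1.
public class MaxMatchSketch {
    static final int MAX_LEN = 4; // the document fixes the longest match at 4 chars

    public static List<String> segment(String text, Set<String> dict) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(MAX_LEN, text.length() - i);
            // shrink the window until a dictionary word matches
            while (len > 1 && !dict.contains(text.substring(i, i + len))) {
                len--;
            }
            tokens.add(text.substring(i, i + len)); // single char falls through
            i += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("如果", "从容不迫", "天下", "容易");
        System.out.println(segment("如果从容不迫", dict)); // [如果, 从容不迫]
    }
}
```

A pure greedy pass like this cannot by itself reproduce outputs such as 如果+从+容易; that is exactly why the POS collocation stage exists.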
(德塔分词逻辑, 已经纠正红色字 ‘卷积’改为‘内核’, 因为第四修订版本已包含卷积两字, ppt所有书中的原图纠正内容统一更新在第5版, 罗瑶光)
排序
1 德塔分词排序思想原型采用Sir Charles Antony Richard Hoare 的 快速排序思想.
refer page 版权原因无文字收录 已经refer 快速排序算法_百度百科
2 德塔分词排序源码原型采用Introduction to Algorithms 的 快速排序4代源码.
refer page 版权原因无源码收录 已经refer https://github.com/yaoguangluo/Data_Processor/blob/master/DP/sortProcessor/Quick_4D_Sort.java
3 基于1 和 2原型, 德塔分词排序 采用Theory on YAOGUANG's Array Split Peak Defect 的微分催化算子优化思想 2013年开始优化. refer page 247, 248, 250, 529, 620,
4 优化过程为小高峰左右比对法, 波动算子过滤思想, 离散条件归纳微分思想(如德摩根计算, 流水阀门计算等), 目前为TopSort5D.
refer page 658, 下册134
5 德塔分词的函数优化方式和算法优化方式, 包括分词引擎, 读心术, NLP分析等核心组件均采用微分催化系统. refer page 661,
The ordinary mode of the Deta catalytic sorting function is based on quicksort theory following Sir Charles Antony Richard Hoare; for the 4th-generation source prototype the author referred to Introduction to Algorithms.
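For reference, Hoare's partition scheme that the text cites as its prototype is sketched below in its minimal textbook form. This is only the starting point named above, not the optimized Quick_4D_Sort or TopSort5D sources; the `sorted` helper is an invented convenience wrapper.

```java
import java.util.Arrays;

// A textbook Hoare-partition quicksort, illustrating the cited prototype.
public class HoareQuickSort {
    public static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[lo];
        int i = lo - 1, j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) break;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
        sort(a, lo, j);      // Hoare partition: j ends inside the range
        sort(a, j + 1, hi);
    }

    // convenience wrapper: sort a copy and return it
    public static int[] sorted(int[] a) {
        int[] copy = Arrays.copyOf(a, a.length);
        sort(copy, 0, copy.length - 1);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sorted(new int[]{5, 2, 9, 1, 5, 6})));
        // [1, 2, 5, 5, 6, 9]
    }
}
```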
神经网络索引
1 德塔分词的词汇字典用map进行索引, 因为JDK8+的map对象的key支持二分搜索, 搜索速度到了峰值. refer page 129, 131
2 德塔分词的索引不断地将大map进行细化分类, 如词长map, 词类map, 词性map, 让搜索再次加速. refer page 55
3 德塔分词的索引map支持二次组合计算, 支持分布式服务器进行索引cache. 关于二次组合计算作者不建议单机使用. refer page 92
4 德塔分词map的key用string的char对应ASCII int进行标识来执行find key, 方便二分搜索存储和StringBuilder高速计算, 实现底层核统一. refer page 92
Nero Network Index Forest
1 Deta Parser builds a word-segmentation index map from a humanoid semantic verbal dictionary. The reason for using the JDK8+ map for the search logic is that it already integrates the binary search tree, balanced tree arrangement and other technologies.
2 Deta Parser's balanced binary-search-tree method applies an observer mode of averaged classification to all types of reflective Java concurrent maps; those maps include the character word length, verbal types, part-of-speech corpus, etc. The author did this to accelerate the NERO matching speed when searching for words.
3 Deta Parser supports secondary index computing combinations, which suit distributed cache-searching systems. The author does not suggest using this technology on a single desktop.
4 For the computing logic, Deta Parser functions finally use StringBuilder to accelerate the search engine.
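The refinement of one large dictionary map into smaller classified sub-maps can be sketched as below. For brevity the sketch splits only by word length; the real engine, as points 1-2 describe, also splits by word class and POS, and the class and field names here are invented.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the refinement step: one large word list is split into
// per-length sub-maps so each lookup only touches words of one size.
public class LengthIndexSketch {
    final Map<Integer, Map<String, String>> byLength = new HashMap<>();

    void add(String word, String pos) {
        byLength.computeIfAbsent(word.length(), k -> new HashMap<>()).put(word, pos);
    }

    String lookup(String word) {
        Map<String, String> m = byLength.get(word.length());
        return m == null ? null : m.get(word); // only one small sub-map is probed
    }

    public static void main(String[] args) {
        LengthIndexSketch idx = new LengthIndexSketch();
        idx.add("如果", "连词");
        idx.add("从容不迫", "成语");
        System.out.println(idx.lookup("从容不迫")); // 成语
    }
}
```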
神经网络索引的价值主要体现在2个地方, 切词的关联索引上和词汇map索引上. 切词的关联索引价值, 主要体现在将词汇的文字进行链化提取, 这种链化计算方式将词库中本相对独立的海量词汇进行了按人类语言文学中的顶针方法进行了有效的前后长度关联(NERO), 其价值有利于大文本的文字进行有必要关联链的 小段小段的提取(NLP), 类似挤牙膏一样, 挤出来就刷牙用掉(POS).
词汇map索引价值, 主要体现在词汇的文字进行链化合理切分. 这种链化切分方式根据词库中不同属性的分类map来组合匹配, 按人类语言文学中的词汇词性和主谓宾搭配的严谨定义来切分. 其价值在于这些分类map可以自适应设计和多样化扩展, 增加切词准确度和灵活度, 适应各种不同的场景. 类似牙刷机制, 挤出牙膏后根据匹配不同的牙刷和刷牙方法(NERO + POS), 适应不同的口腔环境. 描述人 罗瑶光
The accomplishment of the neural-network index is mainly reflected in two places: 1 the associative index for word segmentation, and 2 the lexical index map. The associative index value of word segmentation is mainly reflected in the chained extraction of words. This chained calculation effectively correlates, front to back, the huge number of otherwise independent words in the thesaurus, following the thimble (顶针) device of human language and literature (NERO). Its value for big-data document processing is that the word chain is extracted in small, necessarily linked sections of at most 4 characters (NLP), similar to squeezing toothpaste: each squeezed-out portion is used at once for brushing (POS) by the Deta Parser matching engine.
The index value of the lexical map is mainly reflected in the reasonable chain segmentation of lexical characters. This chained segmentation combines and matches the classified maps in the thesaurus according to different attributes, then separates the words according to the rigorous definitions of lexical POS and SVO collocation in human literary languages. Its value is that these classified maps can be adaptively designed and diversely extended, increasing the accuracy and flexibility of word segmentation and adapting to different scenes. Similar to a toothbrush mechanism: the extruded toothpaste is matched with different toothbrushes and brushing methods (NERO + POS) to adapt to different oral environments.
Author: Yaoguang Luo
分词在线性文本搜索中的应用
1 德塔分词的搜索建立在map类的权重计算方法上, 不同的权重叠加产生的打分进行排序输出. refer page 下册64
2 权重的计算方法按词性的主谓宾如代名动形, 和POS如动名形谓介分类. refer page 下册66
3 权重与词长, 词频进行耦合bit叠加计算(bit位计算比乘法要快一个数量级), 生成最终输出结果. refer page 下册68
4 权重与词长的比值可以精度调节, 确定搜索的精确性和记录个人搜索偏好. refer page 下册68
The Deta Parser word segmentation and its applications in linear text-document environments.
1 Each indexed map carries a set of weights; based on these weights, Deta Parser runs a matching score system to compute the Chinese word-segmentation ranking.
2 The search weights follow the computing logic of subject-predicate-object (SVO) and part of speech (POS), for instance noun, verb, adjective, etc.
3 To accelerate the computation, the author injected combination factors into the matching logic, such as bit calculation, frequency statistics and word-length observations, similar in spirit to CountDownLatch and CyclicBarrier logic (define first then prove, or prove first then conclude).
4 Once all the above logic was ported to Java, the author exposed all global and local tuning scales to build the Foolishman self-controller components, keeping the algorithms easy and simple.
Author: Yaoguang Luo
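Point 3's coupled weight (bit shifts instead of multiplication, which the text says is about an order of magnitude faster) can be sketched as below. The shift amounts are illustrative assumptions, not the engine's tuned values.

```java
// Sketch of the coupled weight calculation: word length and frequency are
// combined with a bit shift instead of a multiplication, as the text describes.
public class BitScoreSketch {
    // weight ~= frequency * 2^(length-1), computed as a left shift
    public static long score(int wordLength, int frequency) {
        return (long) frequency << (wordLength - 1);
    }

    public static void main(String[] args) {
        System.out.println(score(2, 10)); // 20
        System.out.println(score(4, 10)); // 80
    }
}
```

The length/frequency coupling ratio mentioned in point 4 would then be a tunable precision knob on top of this score.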
动态 POS函数流水阀门细化遍历内核匹配
1 动态的核分为前序核和后序核两种, 根据词汇分析的位置进行实时变动更新. refer page 97
2 前序核主要缓存存储词汇的位置和词性, 用于POS词性搭配的POS函数流水阀门细化遍历计算. refer page 97
3 后序核主要缓存词汇切词链后面准备跟进的词语, 用于POS语法的修正计算, 如连词匹配. refer page 97
4 内核采用StringBuilder做核载体进行计算加速. refer page 97
Dynamic River-Flow Gate Function Matching and Circularly Looped POS Kernel Computing.
1 The dynamic kernel has two types, prefix and postfix. It reads the word tokens one by one and updates its computation dynamically at the same time.
2 The prefix kernel stores a POS cache buffer for each current word, with information such as position and frequency, to accelerate word matching.
3 The postfix kernel relates to the optimization of word matching and segmentation, for example checking conjunction relationships and continuing the word-token linked list.
4 The algorithm kernel uses StringBuilder as its carrier for faster computation.
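A minimal sketch of the two kernels with StringBuilder as the carrier, assuming invented field and method names (the real kernels cache considerably more state, such as positions and frequencies):

```java
// Sketch of the two dynamic kernels: the prefix kernel remembers the token
// just emitted (and its POS) while the postfix kernel buffers the characters
// still to be matched. StringBuilder is the carrier, as in the document.
public class KernelSketch {
    final StringBuilder prefix = new StringBuilder();  // last emitted token
    final StringBuilder postfix = new StringBuilder(); // lookahead buffer
    String prefixPos = "";

    void emit(String token, String pos) {
        prefix.setLength(0);   // reuse the builder instead of reallocating
        prefix.append(token);
        prefixPos = pos;
    }

    void feed(char c) { postfix.append(c); }

    // e.g. a conjunction-repair rule can consult the prefix kernel
    boolean prefixIsConjunction() { return "连词".equals(prefixPos); }

    public static void main(String[] args) {
        KernelSketch k = new KernelSketch();
        k.emit("如果", "连词");
        k.feed('是');
        System.out.println(k.prefixIsConjunction() + " " + k.postfix); // true 是
    }
}
```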
POS函数流水阀门细化遍历前序内核关系图. 图中举例用 '如果是非常理想' 来进行分词. 首先通过索引字典森林长度匹配可以切分出 '如果', '是非常', '理想' 3个索引关联词句(作者词库无'常理'词汇, 如果有, 可另行讨论). '如果' 和 '理想' 是比较稳定的词汇. '是非常' 属于三字词, 于是开始流水阀门切分: 3字词索引没有 '是非常' 这个词汇, 于是开始流水阀门自然语言计算处理(如果三字词索引有这个词汇, 就流水阀门计算三字词的词性词汇搭配, 匹配成功就return, 否则同样要更进一步细化成2字词来做流水阀门, 这是该算法的强大之处). 首先拆分为 '是非-常' 和 '是-非常' 这两种词汇, 于是开始分析两种搭配词汇的POS词性, 通过分析每个词汇的前后链接词汇的词性(如 '是非'的前链词汇是'如果', '非常'的前链是'是', '常'的前链是'是非'和'非', '理想'的前链包含'常'和'非常')来确定切词(这个词汇搭配是严谨固定的语法, 不含概率计算事件). 如果2字词搭配出现语法错误和无索引搜索关联, 则更进流水阀门至单字切词. 图中计算比较幸运得到2字切词计算结果: 按照流水阀门NERO-NLP-POS的水流计算, 在连副副 '如果-是-非常' 计算时便return了结果, 没有计算到连名副 '如果-是非-常', 是因为连副副的语法计算流水阀门优先级高, 优先计算并输出了. 描述人 罗瑶光
POS function gate river flows and their relationships.
For example, the author segments '如果是非常理想'. First, through the indexed forest dictionary, Deta Parser cuts the sentence into the three associated tokens '如果', '是非常' and '理想'. In this list, '如果' and '理想' are stable lexical words. '是非常' is a three-character token, so the engine runs an inner matching computation using the POS gate river-flow theory. The author's corpus base contains no word '是非常', so the engine continues with two-character matching. The strength of the algorithm lies in its Chinese literary grammar matching: '是非常' is split into the two candidates '是非-常' and '是-非常', which are contrasted by analyzing the POS of each word together with its prefix and postfix links (the prefix token of '是非' is '如果', the prefix of '非常' is '是', the prefixes of '常' are '是非' and '非', and the prefixes of '理想' are '常' and '非常'). This POS collocation is fixed and immutable grammar, containing no probabilistic events. If no associated character relationship were found at this stage, the engine would proceed to cutting the characters one by one. In the sample graph the result '如果-是-非常' is returned because the gate priority of (conjunction-adverb-adverb) is higher than that of (conjunction-noun-adverb).
Author: Yaoguang Luo
2019年3月18日之前, 该算法的函数编码框架已经出现在作者的Github上:
https://github.com/yaoguangluo/Deta_Parser/commit/25b90c9847d15df85c5c991448f2c271e0ad8106
注意: 链接中CNN关键词的历史记录属于作者用词错误. 作者当年基础学术积累不够, 关于卷积的知识仅仅学了计算机视觉的理论课, 以为带内核计算的都叫CNN卷积.
另外作者发现自己还有一个错误, 就是以为序列链表方式计算就叫隐马尔可夫链计算. 所以CNN+隐马尔可夫这两个技术词汇, 伴随作者10年之久. 今天进行ppt严谨定义, 翻阅大量定义文献资料, 才发现这些错误, 予以纠正. 作者的ANN和RNN出现的文本分析内核计算才是真正的CNN卷积计算.
POS
Deta Parser的分词词性基于自身的词性语料库, 格式为 词汇/词性, 举例如 香蕉/名词. deta的语料库录入系统函数作者的写法是用string的contains字符串来进行map索引登记, 于是这种格式有一个巨大的好处, 可以进行复合标注, 如 香蕉/水果名词, 浏阳/地理名词城市名词. 基于这种格式, 形容词谓词特指等复杂复合词性可以很好地被计算机理解. 德塔分词的词性基于每两个邻近词汇的固定搭配, 如主语后面必为谓语, 名词+连词+后面必为名词, 形容词+连词+后面必为形容词, 动词+后面必为宾语+宾语补足语. 这种来自人类语言文学的严谨固定搭配定义分词逐渐地取代了统计和概率论分词. 这些价值全部融入Deta分词api. 描述人 罗瑶光.
Deta POS
Deta Parser's word segmentation is based on its own corpus of POS classes. The base format is 'Word/POS', for example 'Banana/Noun': the parser engine reads the corpus sequentially, storing the word 'Banana' as a key and the POS as its value. The POS value can also be a compound annotation string: 'Banana/Noun' may become 'Banana/FruitNoun', and likewise 'LiuYang/CityNameGeographyNoun'. This means the computer can understand a complex grammar environment and parse words correctly, especially within stable grammar collocations such as Subject + Predicate, Noun + Conjunction + Noun, Adjective + Conjunction + Adjective, and Verb + Object (+ Object Complement). Because of these strict, stable definitions, Deta Parser does not rely on probabilistic statistics for word segmentation.
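The 'Word/POS' corpus format and the compound-tag lookup via String.contains can be sketched as below; the class and method names are invented for the example, and the real corpus loader handles much more bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the corpus format described above: each entry is "word/POS",
// and compound tags such as "水果名词" are queried with String.contains,
// so one entry answers both "is it a noun?" and "is it a fruit noun?".
public class PosCorpusSketch {
    final Map<String, String> pos = new HashMap<>();

    void load(String line) {                 // e.g. "香蕉/水果名词"
        String[] parts = line.split("/");
        pos.put(parts[0], parts[1]);
    }

    boolean hasPos(String word, String tag) {
        String p = pos.get(word);
        return p != null && p.contains(tag); // compound-tag matching
    }

    public static void main(String[] args) {
        PosCorpusSketch c = new PosCorpusSketch();
        c.load("香蕉/水果名词");
        System.out.println(c.hasPos("香蕉", "名词") + " " + c.hasPos("香蕉", "水果名词"));
    }
}
```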
1 德塔分词的核心类, 包含了词性的搭配切分所有函数. refer page 97, 116
NLP
Deta Parser的自然语言处理, 函数功能主要体现在基于词汇索引森林的长度裁剪上. 中文的词汇格式比较统一, 不像西方语言的元音搭配方式(如按一个词汇中的元音含量定义难度的Flesch弗莱士词汇难度). 中文一般表达为: 单字的文言词, 双字普通词汇, 三字的俗语, 4字的成语, 5字以上一般为谚语和特定短语词汇, 而中文5字以上的短语词汇某种意义上又可以进行1234字拆分. 举例 '巧媳妇难逃无米之炊' 这9个字如果作为谚语词汇出现, 其实也可以分词为 '巧+媳妇+难+逃+无米之炊', 于是罗瑶光先生将长度最大值设为4. 在保障分词精准度的前提下, 进行流水阀门的统计排列, 发现2字词和单字词在随机文章中频率比较高, 于是将2字, 1字词的处理函数靠前, 逐渐地deta的NLP流水阀门切词函数成型. 因为这种方式, Deta POS的流水阀门也继承了这种高频优先计算思维. 描述人 罗瑶光
Deta NLP
Deta Parser's natural language processing is based on its map forest indexed by lexicon length. Chinese word formation differs completely from European lexicons, where a Flesch-style parser rates difficulty by the ratio of vowels per word length. Chinese words commonly fall into four types: one-character classical words, two-character common words, three-character special words or slang, and four-character idioms, plus longer proverbs and set phrases. The longer phrases can themselves be split into 1- to 4-character pieces: '巧媳妇难逃无米之炊' is a nine-character proverb, but it can be segmented into the token list '巧' + '媳妇' + '难' + '逃' + '无米之炊'. So Deta Parser caps the maximum match length at 4 and recognizes such token lists with the dynamic river-flow gate matching loop. Since one- and two-character words dominate random text, the classes processing them are given higher priority than the idiom and slang classes; the author considers this an evolutionary principle of giving priority to high frequency, which the Deta POS gates also inherit.
Author: Yaoguang Luo
1 德塔分词的核心类, 包含了词性的词长切分所有函数. refer page 119, 120
ANN
德塔词性的卷积计算ANN, 主要包含意识比率算子, 环境比率算子, 动机比率算子, 情绪比率算子. 这四个算子的组合计算产生了一些高级决策, 如情感比重, 动机比重, 词权比重, 持续度, 趋势比重, 预测比重, 猜想比重, 意识综合. 这些决策在文本分析的领域可以拥有实际评估和决策的价值. 同时意识综合summing也是德塔DNN计算的一个输入参数组件, 用于文本中心思想词汇标识计算.
ANN: DetaParser ANN computing mainly contains a mind set, an environment set, a motivation set and an emotion set. These sets are combined into advanced decisions, emphasizing weights for trend, persistence, prediction, conjecture, mind summary, etc. In the text-mining and analysis domain such decisions carry real estimation value, and the mind summary is also an input component of the Deta DNN computation for locating the central-idea words of a text.
1词性卷积计算refer page 182
2用于确定文本的中心
1 算子组成
1.1 S SENSING 意识比率
1.2 E ENVIRONMENT 环境比率
1.3 M MOTIVATION 动机比率
1.4 E EMOTION 情绪比率
refer page 18
关于比率的描述:
罗瑶光先生个人认为比率的价值体现在比重. 举例如果100个词汇中有80个形容词, 则初步判断为文章形容词比重大, 文章属于表达比较细腻的散文文笔. 举例如果100个词汇中有80个动词, 则初步判断为文章动词比重大, 文章属于刻画生动活动状态的叙述文笔. 这个比重能够很好地解释一些文章中作者的动机, 行为习惯以及写作风格.
1 举例如动机比率: 如果文中出现菜刀, 砧板, 油锅, 五花肉, 香料这些词汇, 而这些词汇的动机map索引key对应的value中出现大量'烹饪'时, 那么计算机便能从这些比率中得到很多潜在的意识信息, 阅读者和计算机首先便能从文章中了解到这是描述烹饪过程的文章.
2 举例如环境比率: 如果文中出现菜刀, 砧板, 油锅, 五花肉, 香料这些词汇, 而这些词汇的环境map索引key对应的value中出现大量'厨房', '酒店'时, 那么计算机便能从这些比率中得到很多潜在的意识信息, 阅读者和计算机首先便能从文章中了解到这是描述酒店厨师烹饪过程的文章.
3 举例如文学性比率: 如果文中出现菜刀, 砧板, 油锅, 五花肉, 香料这些词汇, 而这些词汇大量属于名词, 名词比重大, 那么计算机便能从这些比率中得到很多潜在的信息, 阅读者和计算机首先便能从文章中了解到这是描述酒店厨师烹饪过程的技术类文章.
描述人罗瑶光
An Implementation of POS Ratios.
Mr. Yaoguang Luo takes the POS ratio to mean the proportion of lexicon types. For example, if a paper has 80 adjectives among 100 words, the lexicon proportion suggests the essay contains many descriptive strokes and reads more like a prose piece. Likewise, if a paper has 80 verbs among 100 words, the proportion suggests the essay contains many actions and reads more like a narrative story. The author therefore considers that POS ratios can describe writing activity well and help predict a writer's personal grammar. Examples follow.
1. Implements a ratio of motivation.
Suppose a paper contains the five words Kitchen Knife, Chopping Board, Wok, Streaky Meat and Spicy Condiment, and the motivation map shows that the dominant motivation behind these lexicons is cooking. Then the humanoid computer, reading the paper, can mine this potential information and easily recognize that the paper narrates how to cook food.
2. Implements a ratio of environment.
Suppose the same five words appear, and the environment map shows that the dominant environment is a kitchen. Then the humanoid computer can easily recognize that the paper is about cooking food somewhere such as a hotel, canteen, cafeteria, pizzeria, rosticceria or restaurant.
3. Implements a ratio of literature.
Even if the paper still contains the five words above, but the dominant POS ratio is Noun, the humanoid computer can easily recognize that the paper is a technical essay on cooking science and technology.
Author: Yaoguang Luo
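The 80-adjectives-in-100-words judgement above can be sketched as a tiny ratio classifier. The 0.5 threshold and the style labels are illustrative assumptions, not the engine's actual values.

```java
// Sketch of the POS-ratio judgement: count the share of adjectives and
// verbs and map the dominant ratio to a rough style label.
public class RatioSketch {
    public static String styleOf(int adjectives, int verbs, int total) {
        double adjRatio = (double) adjectives / total;
        double verbRatio = (double) verbs / total;
        if (adjRatio > 0.5) return "prose";      // descriptive, adjective-heavy
        if (verbRatio > 0.5) return "narrative"; // action-heavy
        return "neutral";
    }

    public static void main(String[] args) {
        System.out.println(styleOf(80, 5, 100)); // prose
        System.out.println(styleOf(5, 80, 100)); // narrative
    }
}
```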
RNN
德塔的词位卷积计算RNN, 主要包含词性比率, 词距比率算子和欧几里得熵算子. 这三个算子主要用于求解POS距离, CONVEX距离, 欧几里得距离. 这些权距在一篇文章中, 能够很清楚地计算每一个词汇的使用度, 出现的价值, 应用频率以及分布规律, 用于计算文本的主要描述语句的重心所在位置.
RNN: DetaParser RNN computing mainly contains a distance set and an entropy set. These sets compute observer weights for part of speech (POS), convex position and Euclidean KNN distance. In the text-mining and analysis domain they clearly reveal, for each lexicon, information such as frequency count, distribution regularity and trace weight, and together they implement the weighted-centre summary for the next steps.
1词位卷积计算refer page 178
2用于确定文本的重心
2.1 算子组成
2.1.1 P POS 词性比率
2.1.2 C CORRELATION 词距比率
2.1.3 E E-DISTANCE 欧几里得熵
refer page 18
关于距离的描述,
罗瑶光先生个人认为, 文中不同属性和不同类别的词汇的位置距离, 在计算出主要描述语句的重心所在位置后, 可以更好地归纳文章的中心思想. 我接着举例.
如果文中出现菜刀, 砧板, 油锅, 五花肉, 香料这些词汇, 且文中大量出现五花肉的词汇, 阅读者和计算机便能理解这篇文章描述的是酒店厨师烹饪食用肉类的技术类文章. 当然, 如果文中大量出现香料的词汇, 阅读者和计算机便能理解这篇文章描述的是酒店厨师烹饪过程中关于香料使用方法介绍的技术类文章.
接着举例, 如果相同的香料词汇, 如品牌陈醋这个词汇, 在全文1000字5段落的文章中, 出现在第1段, 第2段, 第4段, 第5段, 共出现了30多次, 其中第4段出现了20次, 这时候词距的作用可以提高品牌陈醋的重心价值, 说明这是酒店厨师烹饪过程中关于香料使用方法介绍的技术类文章, 且香料的具体使用方法在第四段.
欧几里得熵的价值能更好地观测这些品牌陈醋的词距关联的过程轨迹, 进行边缘囊括. 举例如果文中句型是 品牌陈醋+水饺+品牌陈醋+五花肉, 那么这个水饺(RNN比重虽然低)在词距的轨迹熵计算中, 于DNN中心计算的比重将会提高. 五花肉因为出现在末尾, 比重也比较大. (越靠末尾位置比重越大, 这里我设计的方法曾出过问题: 因为我在读ELS写作文时经常把conclusion写在最后面, 我个人认为最后的段落是用来总结的, 但这不代表全人类思想. 今天20200402又思考了这个问题, 觉得依旧有合情的价值: 在一些写作风格中, 如果一开始就用outlook进行中心论点表达, 然后分布论证, 最后用一个conclusion段落总结, 虽然outlook中出现的价值词汇RNN采集积分比较低, 但是词距也相应变得巨大, 最后的mean求解依旧占大比重, 不会轻易偏离预想结果.)
描述人 罗瑶光
An Implementation of POS Distance.
Mr. Yaoguang Luo takes the POS distance to mean the weight of lexicons. These factors, reflecting the positions of words of different attributes and classes, make it possible to calculate the centre of gravity of a text. Continuing the examples:
Suppose a paper contains the five words Kitchen Knife, Chopping Board, Wok, Streaky Meat and Spicy Condiment, and the lexicon with the highest frequency is Streaky Meat. Then the humanoid computer can mine this potential information and easily recognize a technical essay on cooking science and technology that mainly presents meat. Similarly, if the highest-frequency lexicon is Spicy Condiment, the essay mainly presents spice formulas. Continuing:
Suppose the spice words include 'mature vinegar', a high-frequency lexicon appearing in paragraphs 1, 2, 4 and 5, and especially often in paragraph 4. The distance relation among occurrences of the same lexicon can then raise the centre-of-gravity weight of 'Vinegar', so the humanoid computer easily knows the paper is a technical essay introducing spice usage in cooking, with the concrete usage in paragraph 4.
Euclidean entropy can further trace the distance trajectory of these 'Vinegar' occurrences and include the edges. For example, given the Deta RNN input sequence 'Vinegar', 'Dumpling', 'Vinegar', 'Streaky Meat': although the RNN weight of 'Dumpling' is low, its weight in the DNN centre calculation rises through the distance-trajectory entropy, and 'Streaky Meat', appearing at the end, is also weighted up.
Author: Yaoguang Luo
DNN
德塔的词汇深度计算 可以理解为 德塔词性的卷积计算ANN 与 德塔的词位卷积计算RNN 的前序笛卡尔卷积计算. 因为参数 由 文章中心思想 和 文章的重心词位 两类组成, 因此适用于分析和计算文章的 核心思想词汇的价值
DetaParser DNN computing mainly combines DetaParser ANN and DetaParser RNN through prefix Cartesian calculations. Because its inputs are the two types of ANN summary (central idea) and RNN position weights (weighted centre), DetaParser DNN can dig out the core-idea words of a text document and is especially suitable for text-mining systems.
1词汇深度计算refer page 183
2用于确定文本的核心
大文本中西医结合 极速中文分词进行 DNN 关联计算.
DNN关联应用扩展
DNN关联应用扩展的具体方式有很多, 作者可以举出一些比较有价值的搭配实例. 如将红色分为小红, 浅红, 中红, 深红, 按255色阶分出4个程度阶, 然后根据DNN的词汇计算打分将词汇分类用这4种颜色代替. 举例 香蕉和苹果都是水果, 进行DNN计算, 如果香蕉是30分, 苹果是40分, 水果是50分, 那么进行色阶表达即可用 水果 深红, 苹果 中红, 香蕉 浅红 来表达. 这是名词的; 当然, 如果形容词用紫色标识, 就是深紫, 中紫, 浅紫; 动词用黄色就是深黄, 中黄, 浅黄; 绿色同理, 等等. 这样德塔DNN的应用价值就灵活体现了. 因为属于工业应用, 作者在这里从略. 定义人 罗瑶光
Expanding an associational DNN applications
The author gives a demonstration of classifying DNN ranks with four colors, Red, Purple, Green and Yellow, each ranged over four depth levels such as faint, light, medium and deep on a 255-step scale. Suppose the input is '香蕉 和 苹果 都 是 水果', and the DNN scores are 30 for '香蕉', 40 for '苹果' and 50 for '水果'. First, all three are nouns, so marking them with the red hue is enough. Second, '水果' has the highest DNN value, so it is rendered deep red; '苹果' scores higher than '香蕉', so '苹果' is rendered medium red and '香蕉' light red, distinguishing them from other nouns rendered faint red. Other parts of speech are classified similarly: purple for adjectives, yellow for verbs, and so on.
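The colour-scale rendering can be sketched as below. The text ranks the three example scores relatively; here absolute thresholds on the 255 scale are assumed for simplicity, and the level boundaries are invented for the illustration.

```java
// Sketch of the colour-scale rendering: a POS picks the hue, and the DNN
// score (0..255) picks one of four depth levels on that hue.
public class ColorScaleSketch {
    public static String render(String posHue, int score) {
        String level;
        if (score < 64) level = "faint";        // 255 split into 4 equal bands
        else if (score < 128) level = "light";
        else if (score < 192) level = "medium";
        else level = "deep";
        return level + "-" + posHue;
    }

    public static void main(String[] args) {
        System.out.println(render("red", 200));    // deep-red
        System.out.println(render("purple", 100)); // light-purple
    }
}
```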
2.1 深度计算 (ANN sum核-> RNN PCE) refer page 18
为了方便大家的工程应用, 我组织一些简单的文字来描述上图. 有一定经验的数据算法工程师是很容易理解的, 如果是新手也不要着急, 因为真正的问题只是概念描述的问题.
Deta的DNN是一个前序比对累增积分过程的内核算法. 要做这个算法, 必要条件是ANN的最终运算集合以及RNN的卷积内核参照. ANN是比较基础的东西, 基础归基础, 应用领域非常强势, 2维的数据永远离不开它. 通过ANN的计算, 我们在处理文章的词汇计算中可以得到一些通用的信息集合, 比如文章的敏感度, 意识, 作者的精神状态, 动机, 作者当时的多语言环境因素等等. 为什么可以得到? 原因是比较通俗易懂的: 褒义, 贬义统计, 文章不同词性的比例, 词汇的转义猜测, 和名词的分类引申, 这些基础都是非常简单的信息进行普通处理.
RNN的内核矩阵就麻烦一点了. Deta的RNN内核矩阵主要是三个维度: 词性的统计值, 相同词汇的频率以及在文章中出现的欧几里得距离重心, 斜率关联等等, 这里需要严谨的算法公式来推导出内核.
有了ANN的最终数据集合和RNN的卷积核, 我们就可以做CNN轮询了. Deta的DNN计算定义就是基于德塔的ANN矩阵数据得到最终1维数列比, 然后用德塔的RNN内核做卷积处理的3层深度前序累增积分概率比CNN轮询运算. (为了追求更高的质量和精度, 小伙伴可以自由改写我的作品思想源码, 增加更多的维度皆可. 永久开源, 别担心著作权问题; 以后赠予对象如有进行出版社出版, 相关文字和内容的引用就要注意了. 当前采用开源协议为GPL2.0协议, 之前为APACHE2.0协议.)
上面介绍的是ANN, RNN, CNN的公式, 环境, 原理和初始过程, 关于Deta DNN的计算算法在图片中已经列出来了.
这个算法的相关实现代码的核心部分地址如下:
https://github.com/yaoguangluo/Data_Processor/blob/master/DP/NLPProcessor/DETA_DNN.java
Deta DNN(ANN summing kernel -> RNN PCE)
The picture above shows the foundation: a pre-order matching of incremental differentiations based on the ANN summary and the RNN convolutional kernel computing. The ANN-summing and RNN kernels belong to the domain of CNN convolutional kernels. For the definitions of Deta ANN and RNN, see the original pages. The author refers to Yann LeCun here for the invention of CNN.
Author: Yaoguang Luo
图灵机
1 文学分析refer page 168
关于图中的环境, 动机联想, 倾向探索, 决策发觉的推荐词汇描述.
Deta文学分析的推荐词汇来自于语料词库. 在分词处理文章之前, 先进行语料库的词汇map导入索引预处理. 于是, 在输入一篇需要分析的文章之后进行德塔分词, 切出的这些词汇通过预处理的map索引集, 依次遍历搜索进行key find来匹配映射, 将其结果统计展示. 举个例子, 如图中文字 上瘾, 烟瘾, 在map中能匹配到 化学, 于是环境属性行便出现了化学词汇, 其它行方法类似. 作者描述一下为什么会用环境, 动机, 倾向, 决策来分行: 一开始, 作者便想通过一种具有普遍概括性的规律来描述这个组件功能, 于是用了原始的词汇表达方法, 如名词, 动词, 形容词. 作者认为名词具有环境描述的包含能力, 动词具有动机描述的特征表达, 形容词具有情感的体现. 这些特定的搭配能够很好地解释一篇文章的意识思维. 描述人 罗瑶光
An implementation of suggested lexicons for Environment, Motivation Association, Trend Exploration and Decision Triggering.
Deta literary analysis is based on the Orthos map and lexical dictionary. Before the word segmentation of an essay, the Deta Parser engine initializes an index base that stores the reflective character sets for observation. After a string is input and segmented, the engine traverses the pre-processed map index, doing key lookups to map each cut word to its reflections for statistical display. For example, for the word 'Addicted' or 'Craving for Tobacco', the matching environmental corpus lists the word 'chemistry'; the motivation, trend and decision rows work similarly. The author considers these reflections a universally descriptive scheme built from ordinary categories such as nouns, verbs and adjectives: nouns carry environment description, verbs express motivation features, and adjectives embody emotion. These specific grammatical collocations explain the conscious thought of an essay well.
Author: Yaoguang Luo
德塔文学分析主要用于文章的思想分析和挖掘, 如确定多语意识的场景, 当时的环境, 动机, 意识形态倾向和决策思维表达等. (多语意识: 通过人物的对话方式, 语言特征, 模式场景等因素来 分析当时的人文情感, 大众思想, 从而了解所处时代的民族风情, 社会建筑, 时代背景. 作者当年引用马海良的人文建筑涉及了这个 ‘多语意识’词汇, 白育芳当时要作者写明词汇的refer出处, 教授人: 作者导师白育芳, 2007年.)
2 作品评估refer page 167
德塔作品评估可理解为教育程度评估, 如语法, 词汇的词性统计, 专业词汇的统计, 成语, 三字词等词长词汇的统计, 等等. 如一个句子中含有的高级词汇的比率, 4字名词的比率, 形容词的比率. (作者最早的意识出现在2009年, 在上海章鑫杰那里处理法国ESIEE亚眠大学的法语邮件项目时, Pascal教授曾传授作者关于Flesch法语元音比重单词分析的表述, 借助这个项目进行了灵感发散. 德塔图灵分词全文没有任何单词分析和非中文的语言分析, 不涉及Flesch任何思想和逻辑, 因此一直没有refer. 作者拥有完整著作权和版权.)
Portrait-assessment
Deta portrait assessment can be taken as an assessment of educational level, covering grammar, lexical processing, NLP, POS statistics, and counts of three-character lexicons, four-character idioms and longer slang, for example the ratio of advanced words in each sentence and the word length of each contained term, from which a higher educational level can be determined.
3 动机分析refer page 169
德塔动机分析 基于动机词典的map key匹配进行决策表达. 比较简单. 因为词典定义 带有作者个人主观思维特征. 所以没有太多描述.
Motivation assessment. Deta motivation assessment makes a trend decision based on the motivational dictionary.
4 情感分析refer page 159
德塔情感分析 基于 褒义词 贬义词 和中性词 的 map key匹配 进行决策表达. 比较简单. 因为词典定义带有作者个人主观思维特征. 所以没有太多描述.
Emotion assessment. Deta emotion assessment makes a sentiment decision based on the commendatory, derogatory and neutral dictionaries.
5 习惯分析refer page 169
德塔习惯分析 基于 褒义词 贬义词 和中性词, 动机词, 文学分析数据, 作品评估比率, 教育程度等数据的全文比重, 来确定一个人写作特征, 和写作习惯. 写作风格. 因为词典定义 带有作者个人主观思维特征. 所以没有太多描述.
Habit assessment. The Deta habit assessment makes a comprehensive decision over the portrait, motivation and emotion assessments.
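Combining the sub-assessments into one habit score can be sketched as a weighted average. The weights and the way of combining are assumptions for illustration; the text only says the sub-assessment data are weighted over the full text:

```java
import java.util.*;

// Sketch: habit assessment as a weighted combination of sub-assessment
// scores (portrait, motivation, emotion, ...). Weighting scheme is assumed.
public class HabitAssessment {
    // Returns the weight-normalized combination of the named scores.
    public static double combine(Map<String, Double> scores,
                                 Map<String, Double> weights) {
        double sum = 0.0, weightSum = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            double w = weights.getOrDefault(e.getKey(), 0.0);
            sum += w * e.getValue();
            weightSum += w;
        }
        return weightSum == 0.0 ? 0.0 : sum / weightSum;
    }
}
```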
6 Educational-level assessment (refer to page 168)
The Deta educational-level assessment is reflected in the ratios, over the full text and over each sentence, of effective words (e.g. longer than two characters) and valuable words (nouns, verbs, adjectives, predicates and adverbials) against the other POS categories, which determine the syntactic features of an essay. As a simple example, an essay whose sentences carry a high proportion of effective, valuable adjectives usually indicates that its author has strong analytical expression and prose-embellishment ability. This idea comes from the author's junior-high-school Chinese study.
Education assessment. The Deta education assessment is an industrial application of essay characterization, covering aspects such as valuable references, statement ability, narration level and tutoring assessment. It may combine all of the assessments above into a final summary collection, to express the fulfilment of an educational level.
The author Yaoguang Luo will polish the grammar later.
A textual description of essay-tutoring ability.
After concluding the ratios, weights and reflections, the author developed a deeper analysis. The purpose was to quickly recognize personal habits and features, writing style, emotional status and educational level. The corpus, speeches and definitions contain POS, distance, SVO, word length and weight, from which the NLP calculation can readily assess an educational level. For example, in the picture above the input string contains only a few adjectives, so the resulting prose ratio is rather low, 0.1578, while the scores clearly trend toward argumentation and statement, at 0.47368 and 0.40983 respectively. The Deta NLP algorithms can be optimized sustainably by adding more components: filters, determinations, definitions, PCA and article types.
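The prose / argumentation / statement ratios in the example behave like normalized scores over per-type evidence. A minimal sketch of such a normalization, assuming raw evidence counts per article type have already been gathered (the counts below are invented, not the ones behind 0.1578 / 0.47368 / 0.40983):

```java
import java.util.*;

// Sketch: turn raw per-type evidence counts into normalized article-type
// scores that sum to 1, like the prose/argumentation/statement ratios above.
public class ArticleTypeScores {
    public static Map<String, Double> normalize(Map<String, Integer> rawCounts) {
        int total = 0;
        for (int c : rawCounts.values()) total += c;
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : rawCounts.entrySet()) {
            scores.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return scores;
    }
}
```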
Applications
High-speed Chinese search
The high-speed Chinese search is currently applied as a whole in the main search-engine component of YangLiaoJing, chiefly in how the search content and the search subject are computed. For the search content, word segmentation and statistics are generally applied, covering keywords, word frequency, word length, POS and other features of the content, which are finally packaged into a map index for convenient calls. The search subject is a text file: the text is segmented at high speed (or by DNN segmentation), then scored against the formatted search-content data through modules such as POS, NLP and PCA, and the results are packaged, sorted and output. Some fixed intermediate variables in this process can be tuned by precision, satisfying different industrial computing scenarios with adaptive output. Defined by Yaoguang Luo.
About the high-speed Chinese search algorithm.
The algorithm is integrated mainly in the YangLiaoJing (YLJ) engine system. As an important component, it consists of two parts of searching: the content and the subject. For the content, it applies word segmentation and statistics, for instance keywords, word frequency, word length, parts of speech and so on, and finally builds a responding map to feed a functional callback. The subject can be a text file, a string document or a DNN mind stream, which is given a matching score against the search content; the matching can use POS (part of speech), NLP (natural language processing), PCA (principal component analysis) and other modules. After sorting and arranging the conclusions, the final result is output. Looking ahead, the engine may add more self-tuning scales for industrial environments.
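The two-part flow above (index the query content, then score segmented documents against it) can be sketched as follows. This is a bare-bones frequency-matching illustration under assumed names, not the YLJ engine's actual POS/NLP/PCA scoring:

```java
import java.util.*;

// Sketch of the two-part search flow: build a keyword->frequency map from
// the segmented query content, then score a candidate document by summing
// the query frequencies of its matching tokens.
public class SpeedSearch {
    // Part 1: package the segmented query content into a frequency map index.
    public static Map<String, Integer> buildIndex(List<String> queryTokens) {
        Map<String, Integer> index = new HashMap<>();
        for (String t : queryTokens) index.merge(t, 1, Integer::sum);
        return index;
    }

    // Part 2: score a segmented subject document against the index.
    public static int score(Map<String, Integer> index, List<String> docTokens) {
        int s = 0;
        for (String t : docTokens) s += index.getOrDefault(t, 0);
        return s;
    }
}
```

Candidate documents would be scored this way, sorted by score, and the sorted list returned as the search result.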
The author Yaoguang Luo will polish the grammar later.
Reflections
2021-11-12: I was pondering a key point. Why did my Deta segmentation, at nearly 23 million words per second, drop to 16+ million words per second after being rewritten in Sonar's internationally certified code style? This reveals one thing: Sonar's formatting strengthens the format that humans understand visually, not the format that computers understand quickly. Therefore metabase (元基) index encoding is the trend.
Related copyright documents:
1. Yaoguang Luo. 《德塔自然语言图灵系统 V10.6.1》. National Copyright Administration of the People's Republic of China, Software Copyright Registration No. 3951366. 2019.
2. Yaoguang Luo. 《Java数据分析算法引擎系统 V1.0.0》. National Copyright Administration of the PRC, Software Copyright Registration No. 4584594. 2014.
3. Yaoguang Luo, Rongwu Luo. 《类人DNA与 神经元基于催化算子映射编码方式 V_1.2.2》. National Copyright Administration of the PRC, Registration No. 国作登字-2021-A-00097017. 2021.
4. Yaoguang Luo, Rongwu Luo. 《DNA元基催化与肽计算第二卷养疗经应用研究20210305》. National Copyright Administration of the PRC, Registration No. 国作登字-2021-L-00103660. 2021.
5. Yaoguang Luo, Rongwu Luo. 《DNA 元基催化与肽计算 第三修订版V039010912》. National Copyright Administration of the PRC, Registration No. 国作登字-2021-L-00268255. 2021.
6. Yaoguang Luo. 《DNA元基索引ETL中文脚本编译机V0.0.2》. National Copyright Administration of the PRC, SD-2021R11L2844054. 2021. (Registration No. 2022SR0011067; Software Copyright Registration No. 8965266.)
7. 类人数据生命的DNA计算思想. GitHub [cited 2020-03-05]. https://github.com/yaoguangluo/Deta_Resource
8. Yaoguang Luo, Rongwu Luo. 《DNA元基催化与肽计算 第四修订版 V00919》. National Copyright Administration of the PRC, SD-2022Z11L0025809. 2022. Registration No. 国作登字-2022-L-10071310.
File resources
1 Jar: https://github.com/yaoguangluo/ChromosomeDNA/blob/main/BloomChromosome_V19001_20220108.jar
2 UML: DNA元基催化与肽计算 第四修订版V00919
3 PPT: https://github.com/yaoguangluo/ChromosomeDNA/tree/main/ppt
4 Book: 《DNA元基催化与肽计算 第四修订版 V00919》, volumes I and II
https://github.com/yaoguangluo/ChromosomeDNA/tree/main/元基催化与肽计算第四修订版本整理
5 Function sources on Git: Demos
GitHub: https://github.com/yaoguangluo/ChromosomeDNA/
Coding: public repository
Bitbucket: Bitbucket
Gitee: Liuyang Deta Software Development Co., Ltd., GPL 2.0 open-source big-data project (DetaChina) - Gitee.com
6 Other resource links:
ZHIHU: DNA元基催化与肽计算第四修订版
CSDN: DNA元基催化与肽计算UML集_罗瑶光19850525的博客-CSDN博客
CSDN: DNA元基催化与肽计算 第四修订版V00919