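LDA is computationally expensive and does not scale well to large datasets. word2vec maps each word to a vector, so the closeness of two words is easy to compute and synonyms and near-synonyms are easy to represent.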
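As a rough illustration of "closeness is easy to compute", the sketch below compares word vectors by cosine similarity. The three-dimensional vectors are made-up values for illustration only, not trained embeddings, and `most_similar` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example.
vectors = {
    "big":   [0.9, 0.1, 0.3],
    "large": [0.8, 0.2, 0.3],
    "fish":  [0.1, 0.9, 0.7],
}

def most_similar(word):
    """Return the other word whose vector is closest by cosine similarity."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("big"))
```

With these toy values, "large" ends up closer to "big" than "fish" does, which is the behavior one would hope for from near-synonyms.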
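Skip-gram takes one word as input and, through a D-dimensional projection layer and a classifier (softmax or log-linear), predicts the words within a window before and after it. The more context words are used, the more accurate the model, but the higher the computational cost. Two words that share the same contexts are considered similar.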
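A minimal sketch of how the (input word, context word) training pairs could be generated for a given window size. The function name `skipgram_pairs` and the toy sentence are assumptions for illustration; real implementations typically also randomize the window size and subsample words.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs for every position in the token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "down"]
for window in (1, 2):
    print(window, len(skipgram_pairs(sentence, window)))
```

A larger window produces more pairs per word, which is where the extra computation mentioned above comes from.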
Figure: skip-gram architecture
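The word2vec objective is, for each word, to maximize the probability that the word appears given its context words, that is, to maximize the log-likelihood over the corpus.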
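One standard way to write this objective, offered as a sketch rather than as the exact formula intended here: the symbols C (the corpus), Context(w), x_w (the projection-layer output for the context of w) and θ_u (the output vector of word u) are assumed notation that does not appear elsewhere in these notes.

```latex
\mathcal{L} = \sum_{w \in \mathcal{C}} \log p\bigl(w \mid \mathrm{Context}(w)\bigr),
\qquad
p\bigl(w \mid \mathrm{Context}(w)\bigr)
  = \frac{\exp\bigl(\theta_w^{\top} x_w\bigr)}
         {\sum_{u=1}^{V} \exp\bigl(\theta_u^{\top} x_w\bigr)}
```

The sum over all V words in the denominator is what makes the plain softmax expensive.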
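Suppose the vocabulary contains V words. Computing this probability with a full softmax then costs O(V) per prediction, because the normalization sums over every word. To reduce the cost, the usual approach is hierarchical softmax, which introduces a Huffman binary tree so that the complexity drops to O(log V). The price is that words become artificially more coupled: a change in the conditional probability of one word changes the probabilities at every non-leaf node on its path, which in turn indirectly affects the conditional probabilities of other words to varying degrees. Hierarchical softmax is therefore not equivalent to the original problem, but the approximation costs little in performance while making much larger models tractable.
The Huffman tree is built using each word's frequency count in the corpus as its weight.
The left subtree is labeled 1 and the right subtree -1.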
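The construction described above can be sketched with a priority queue. This is a minimal illustration, not the word2vec source: the word counts are invented, and which of the two merged nodes becomes the "left" child (label 1) is an assumption of this sketch.

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree from word counts and return {word: path labels}."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(count, next(tie), word) for word, count in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, n1 = heapq.heappop(heap)  # two least frequent nodes
        c2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), (n1, n2)))
    codes = {}

    def walk(node, path):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], path + [1])    # left subtree labeled 1
            walk(node[1], path + [-1])   # right subtree labeled -1
        else:                            # leaf: a word
            codes[node] = path

    walk(heap[0][2], [])
    return codes

# Hypothetical corpus counts.
counts = {"the": 50, "cat": 10, "sat": 8, "mat": 2}
for word, path in huffman_codes(counts).items():
    print(word, path)
```

Frequent words end up near the root with short paths, so the average number of binary decisions per prediction stays small.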
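Continuous bag-of-words (CBOW)
A log-linear classifier is trained so that the middle word is classified correctly. A typical setup uses the 4 preceding words and the 4 following words as input, and the 1 word in the middle is the word to classify.
The training complexity is:
Q = N*D + D*log2(V)
where N is the size of the input layer (the number of context words), D is the dimension of the projection layer, and V is the vocabulary size.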
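To get a feel for the formula, the sketch below plugs in N = 8 (4 words on each side), D = 300 and V = 2^20. These values are chosen only for illustration. The second function, using D*V for a full softmax output layer, is included for comparison and is an assumption of this sketch rather than something stated above.

```python
import math

def cbow_complexity(n_context, dim, vocab_size):
    """Q = N*D + D*log2(V): per-example cost with hierarchical softmax."""
    return n_context * dim + dim * math.log2(vocab_size)

def cbow_complexity_full_softmax(n_context, dim, vocab_size):
    """Comparison case: the output layer scores all V words."""
    return n_context * dim + dim * vocab_size

N, D, V = 8, 300, 2 ** 20
print(cbow_complexity(N, D, V))
print(cbow_complexity_full_softmax(N, D, V))
```

With these numbers the hierarchical version costs 8*300 + 300*20 operations, while the full softmax term alone is 300 times about a million.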
Figure: CBOW architecture
1.2 Negative Sampling
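Negative sampling draws noise words; the aim is to make the model more robust.
Its objective, in the form given in Mikolov's paper, is:
$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$
where \sigma(x) = 1/(1 + e^{-x}), v_{w_I} is the input vector of the input word, v'_{w_O} is the output vector of the observed context word, and K is the number of noise words drawn from P_n(w).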
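For small datasets, K is recommended to be 5-20; for large datasets, 2-5.
P_n(w) is the noise distribution, a smoothed, closer-to-uniform distribution such as U(w)^{3/4}/Z, where U(w) is the unigram distribution and Z is the normalizing constant.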
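A small sketch of the two ingredients above: building P_n(w) from raw counts with the 3/4 power, and evaluating the negative-sampling loss (the negated objective) for one training pair. All function names, the toy counts and the use of plain Python lists for vectors are assumptions for illustration.

```python
import math
import random

def noise_distribution(counts, power=0.75):
    """P_n(w) = count(w)^power / Z, with Z chosen so probabilities sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}

def sample_negatives(dist, k, exclude, rng):
    """Draw k noise words from dist, never returning the true context word."""
    words = [w for w in dist if w != exclude]
    return rng.choices(words, weights=[dist[w] for w in words], k=k)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Negated objective: one observed pair plus K noise words."""
    loss = -math.log(sigmoid(dot(v_out_pos, v_in)))
    for v_neg in v_out_negs:
        loss -= math.log(sigmoid(-dot(v_neg, v_in)))
    return loss

counts = {"the": 1000, "cat": 10, "sat": 8, "mat": 2}  # hypothetical counts
dist = noise_distribution(counts)
print(dist["cat"], counts["cat"] / sum(counts.values()))
print(sample_negatives(dist, k=5, exclude="cat", rng=random.Random(0)))
```

Raising counts to the 3/4 power gives rare words a somewhat larger share than their raw frequency, which is the "closer to uniform" effect described above.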
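Low-frequency words are removed (for example, with a threshold of 5). High-frequency words are discarded with the following probability:
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where f(w_i) is the frequency of word w_i and t is the high-frequency threshold, for example 10^{-5}. This is the formulation in Mikolov's paper. The actual word2vec code, and the later gensim implementation, instead use the following formula for the probability that word w_i is discarded:
$$P(w_i) = 1 - \left(\sqrt{\frac{v_{w_i}}{sample \cdot N_W}} + 1\right) \cdot \frac{sample \cdot N_W}{v_{w_i}}$$
or, written more compactly:
$$P(w_i) = 1 - \left(\sqrt{\frac{sample \cdot N_W}{v_{w_i}}} + \frac{sample \cdot N_W}{v_{w_i}}\right)$$
where N_W is the total number of words taking part in training, counting repeats (in effect the sum of all word counts), and v_{w_i} is the count of word w_i. The gensim implementation treats sample < 1 and sample ≥ 1 differently; see the gensim source for details.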
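The two formulas can be compared numerically. In the sketch below, the function names, the example counts and the clamping of negative values to 0 (so the result is a valid probability for rare words) are assumptions of this illustration, not taken from the word2vec or gensim source.

```python
import math

def discard_prob_paper(freq, t=1e-5):
    """Paper form: 1 - sqrt(t / f(w)), with f(w) a relative frequency."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def discard_prob_code(count, total_words, sample=1e-5):
    """Code form: 1 - (sqrt(v / (sample*N_W)) + 1) * (sample*N_W) / v."""
    scaled = sample * total_words
    keep = (math.sqrt(count / scaled) + 1) * scaled / count
    return max(0.0, 1.0 - keep)

def discard_prob_code_compact(count, total_words, sample=1e-5):
    """Compact form: 1 - (sqrt(x) + x), where x = sample*N_W / v."""
    x = sample * total_words / count
    return max(0.0, 1.0 - (math.sqrt(x) + x))

total = 100_000_000  # hypothetical corpus size
for count in (5, 10_000, 1_000_000):
    print(count,
          discard_prob_paper(count / total),
          discard_prob_code(count, total))
```

For a very frequent word the two versions differ only by the extra term sample*N_W / v_{w_i}, so the code form discards slightly less often than the paper form.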
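(1) Semantic and syntactic accuracy reaches roughly 50-60%; overall, skip-gram is more accurate than the continuous bag-of-words model.
(2) Skip-gram + RNNLMs gives the highest accuracy, about 10% higher than skip-gram alone, but training is roughly 10 times slower.
(3) The higher the dimension of the projected word vectors, the higher the accuracy, with a gain of about 5%.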
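doc2vec algorithm steps:
1) Obtain the word2vec word vectors.
2) Using word2vec, concatenate the vectors of the words in the document to obtain the document's initial vector.
3) Treat the document as if it were a word vector and train the document vector in the same way as word2vec. At this stage only the document vector is updated; the word vectors stay fixed. The training works as follows: for each word in a document, the classifier's input includes not only the context words but also the document vector.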
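The three steps above can be sketched in a few lines. This is a toy illustration under several assumptions that are not in the notes: the "pretrained" word vectors are random numbers, the initial document vector is the mean of the word vectors rather than a concatenation (so its dimension matches), the classifier is a full softmax instead of hierarchical softmax or negative sampling, and all names are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def doc_loss_and_grad(doc, doc_vec, w_in, w_out, window):
    """Mean prediction loss over the document and its gradient w.r.t. doc_vec."""
    dim = len(doc_vec)
    total, grad = 0.0, [0.0] * dim
    for i, target in enumerate(doc):
        lo, hi = max(0, i - window), min(len(doc), i + window + 1)
        ctx = [doc[j] for j in range(lo, hi) if j != i]
        n = len(ctx) + 1  # context words plus the document vector
        h = [(sum(w_in[c][k] for c in ctx) + doc_vec[k]) / n for k in range(dim)]
        probs = softmax([dot_row(row, h) for row in w_out])
        total -= math.log(probs[target])
        for u, p in enumerate(probs):
            err = p - (1.0 if u == target else 0.0)
            for k in range(dim):
                grad[k] += err * w_out[u][k] / n
    return total / len(doc), [g / len(doc) for g in grad]

def dot_row(row, h):
    return sum(a * b for a, b in zip(row, h))

def infer_doc_vector(doc, w_in, w_out, window=2, lr=0.5, steps=50):
    """Step 3: gradient descent on doc_vec only; w_in and w_out are never modified."""
    dim = len(w_in[0])
    # Step 2 (assumed variant): start from the mean of the document's word vectors.
    doc_vec = [sum(w_in[w][k] for w in doc) / len(doc) for k in range(dim)]
    for _ in range(steps):
        _, grad = doc_loss_and_grad(doc, doc_vec, w_in, w_out, window)
        doc_vec = [d - lr * g for d, g in zip(doc_vec, grad)]
    return doc_vec

# Step 1 stand-in: random "pretrained" vectors for a 6-word vocabulary.
rng = random.Random(0)
word_in = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(6)]
word_out = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(6)]
document = [0, 1, 2, 3, 1, 4]  # word ids
print(infer_doc_vector(document, word_in, word_out))
```

Because only `doc_vec` is updated, the same trained word vectors can be reused to infer a vector for any new document.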