Running Yang Liu's Topical Word Embeddings on Windows

Damn, it took me several days to get this working. Blame me for being slow. I'm writing it down now so I don't forget.

1. The paper already provides fully working code. Download it.

You don't even need to open this code in PyCharm; it runs straight from cmd.

2.

  • In all three models, the author has modified the gensim source code and bundled it, so you don't need to install gensim yourself.
  • Compile word2vec_inner.pyx. The word2vec_inner.pyx file has to be compiled before the code will run. This step cost me a lot of time, because at first I didn't know it had to be compiled before running.

3. Compile steps. The downloaded folder is named topical_word_embeddings-master. That name doesn't meet the naming rules for compiling, so I renamed it to topicalembeddings. Then:

On Windows, Python extensions can be compiled with Microsoft Visual C++ Compiler for Python 2.7, which has to be downloaded first.

1. Download and install it. On my machine, the install path is:

C:\Users\Administrator\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0

2. Edit Lib\distutils\msvc9compiler.py under the Python install directory (depending on the system, msvccompiler.py may need the same change). Find the get_build_version function and make it return 9.0 directly:

def get_build_version():
    """Return the version of MSVC that was used to build Python.

    For Python 2.3 and up, the version number is included in
    sys.version.  For earlier versions, assume the compiler is MSVC 6.
    """
    # Hard-coded: always report MSVC 9.0. The original detection logic
    # below is left in place but is never reached.
    return 9.0
    prefix = "MSC v."
    i = sys.version.find(prefix)
    if i == -1:
        return 6
    i = i + len(prefix)
    s, rest = sys.version[i:].split(" ", 1)
    majorVersion = int(s[:-2]) - 6
    minorVersion = int(s[2:3]) / 10.0
    # I don't think paths are affected by minor version in version 6
    if majorVersion == 6:
        minorVersion = 0
    if majorVersion >= 6:
        return majorVersion + minorVersion
    # else we don't know what version of the compiler this is
    return None
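
To see where the value 9.0 comes from, here is a minimal standalone sketch of the same parsing logic applied to an arbitrary version string. The function name parse_msc_version and the sample string are illustrative assumptions and are not part of distutils.

```python
def parse_msc_version(version_string):
    """Mirror the parsing done by get_build_version, for any given string."""
    prefix = "MSC v."
    i = version_string.find(prefix)
    if i == -1:
        return 6  # no "MSC v." tag: the original code assumes MSVC 6
    i += len(prefix)
    s, _rest = version_string[i:].split(" ", 1)
    major = int(s[:-2]) - 6       # "1500" -> 15 - 6 = 9
    minor = int(s[2:3]) / 10.0    # "1500" -> 0 / 10.0 = 0.0
    if major == 6:
        minor = 0
    if major >= 6:
        return major + minor
    return None


# Hypothetical sys.version text in the style of a Python 2.7 Windows build.
sample = "2.7.18 (default) [MSC v.1500 64 bit (AMD64)]"
print(parse_msc_version(sample))
```

A string tagged MSC v.1500 maps to 9.0, which is the version the compiler package above installs under its 9.0 directory, so hard-coding the return value simply skips the detection.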

Then find the find_vcvarsall function and make it return the path to vcvarsall.bat directly (use the path from your own installation):

def find_vcvarsall(version):
    """Find the vcvarsall.bat file
 
    At first it tries to find the productdir of VS 2008 in the registry. If
    that fails it falls back to the VS90COMNTOOLS env var.
    """
    return r'C:\Users\Administrator\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\vcvarsall.bat'
    vsbase = VS_BASE % version
    try:
        productdir = Reg.get_value(r"%s\Setup\VC" % vsbase,
                                   "productdir")
    except KeyError:
        productdir = None
 
    # trying Express edition
    if productdir is None:
        vsbase = VSEXPRESS_BASE % version
        try:
            productdir = Reg.get_value(r"%s\Setup\VC" % vsbase,
                                       "productdir")
        except KeyError:
            productdir = None
            log.debug("Unable to find productdir in registry")
 
    if not productdir or not os.path.isdir(productdir):
        toolskey = "VS%0.f0COMNTOOLS" % version
        toolsdir = os.environ.get(toolskey, None)
 
        if toolsdir and os.path.isdir(toolsdir):
            productdir = os.path.join(toolsdir, os.pardir, os.pardir, "VC")
            productdir = os.path.abspath(productdir)
            if not os.path.isdir(productdir):
                log.debug("%s is not a valid directory" % productdir)
                return None
        else:
            log.debug("Env var %s is not set or invalid" % toolskey)
    if not productdir:
        log.debug("No productdir found")
        return None
    vcvarsall = os.path.join(productdir, "vcvarsall.bat")
    if os.path.isfile(vcvarsall):
        return vcvarsall
    log.debug("Unable to find vcvarsall.bat")
    return None
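
Before hard-coding a path like this, it may be worth confirming that vcvarsall.bat really exists there. The following is a small sketch under that assumption; the helper name locate_vcvarsall is hypothetical, and the candidate directory is just the install path from my machine.

```python
import os


def locate_vcvarsall(candidate_dirs):
    """Return the first existing vcvarsall.bat among the given directories."""
    for directory in candidate_dirs:
        path = os.path.join(directory, "vcvarsall.bat")
        if os.path.isfile(path):
            return path
    return None  # nothing found: the hard-coded return would point nowhere


# Install directory from this post; adjust it to your own machine.
candidates = [
    r"C:\Users\Administrator\AppData\Local\Programs\Common"
    r"\Microsoft\Visual C++ for Python\9.0",
]
print(locate_vcvarsall(candidates))
```

If this prints None, the path passed to find_vcvarsall needs to be corrected before compiling.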

3. With the above done, Python C extensions compile normally on Windows.

First, create a setup.py file in the models directory under gensim:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy
extensions = [
    Extension("word2vec_inner", ["word2vec_inner.pyx"],
              include_dirs=[numpy.get_include()])
]
setup(
    name="word2vec_inner",
    ext_modules=cythonize(extensions),
)

In cmd, change to the models directory under gensim and run:

python setup.py install
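
A quick way to check whether the build worked is to try importing the compiled module. This is only a sketch: it assumes the module is importable under the name word2vec_inner, as given in the Extension entry of setup.py, and the helper name has_compiled_extension is made up for illustration.

```python
import importlib


def has_compiled_extension(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# Prints False when the extension was not built or not installed.
print(has_compiled_extension("word2vec_inner"))
```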

When compiling the *.pyx file, you may hit a scipy error on 'from scipy.... import fblas', which can depend on the scipy version. In that case, change 'from scipy.linalg.blas import fblas' in word2vec_inner.pyx to 'import scipy.linalg.blas as fblas'.
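
If you would rather not depend on one scipy version, the two import forms can be tried in turn. This is a plain-Python sketch of that idea, not the author's code; the function name load_fblas is hypothetical, and whether the same pattern fits inside the .pyx file is an assumption I have not tested.

```python
def load_fblas():
    """Return a BLAS wrapper module from scipy, or None if scipy is missing."""
    try:
        # The import as originally written in word2vec_inner.pyx.
        from scipy.linalg.blas import fblas
        return fblas
    except ImportError:
        pass
    try:
        # The replacement import described above.
        import scipy.linalg.blas as fblas
        return fblas
    except ImportError:
        return None


fblas = load_fblas()
print("BLAS wrappers available:", fblas is not None)
```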
4. Steps to run the code
1. Get GibbsLDA++, run it, and collect the tassign file and wordmap.txt.
2. Run TWE-3 with the command: python train.py wordmap_filename tassign_filename (this point is very important: it runs directly in cmd, with no need to open the code).
3. The output files word_vector.txt and topic_vector.txt are written under the output directory.
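
To get a feel for the two input files, here is a small sketch that pairs each word with its assigned topic. It assumes the GibbsLDA++ layout as I understand it: wordmap.txt starts with a count line followed by "word id" lines, and each tassign line is one document made of "wordid:topicid" tokens. The function names and the sample data are invented for illustration, and this is not the parsing used in train.py.

```python
def read_wordmap(lines):
    """Build a word id -> word mapping, skipping the leading count line."""
    id2word = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # e.g. the first line, which holds only the vocabulary size
        word, word_id = parts
        id2word[int(word_id)] = word
    return id2word


def read_tassign(lines, id2word):
    """Yield one list of (word, topic) pairs per document line."""
    for line in lines:
        doc = []
        for token in line.split():
            word_id, topic = token.split(":")
            doc.append((id2word[int(word_id)], int(topic)))
        yield doc


# Made-up sample content standing in for wordmap.txt and the tassign file.
wordmap = ["3", "apple 0", "bank 1", "river 2"]
tassign = ["0:1 1:0", "1:2 2:2"]
docs = list(read_tassign(tassign, read_wordmap(wordmap)))
print(docs)
```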



All right, after fighting with this code for so long I finally understand what's going on. Now I can write the paper.
