Summary of problems when training DL4J on a GPU

1. Conclusions first

1.1 Configurations tested and working

Configuration 1:

(1) GPU: GTX 1050 Ti

(2) System environment: Windows 10, CUDA 9.2
(3) pom dependencies: cuda=9.2, nd4j=1.0.0-beta6

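Configuration 2:

(1) GPU: RTX 3080

(2) System environment: Windows 10, CUDA 11.2 or CUDA 11.6
(3) pom dependencies: cuda=11.2, nd4j=1.0.0-M1.1. Do not use 1.0.0-M1 here: it fails with the error shown in section 1.2 below, a bug that no longer occurs in M1.1. Do not use 1.0.0-M2 either: nd4j-cuda-11.2-platform is published up to 1.0.0-M2, but deeplearning4j-cuda-11.2 is only published up to 1.0.0-M1.1.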

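Note: this shows that as long as the CUDA major version (the number before the first dot) is the same, the minor version installed on the system and the one used in pom.xml may differ.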
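The rule in the note above can be written down as a small check. This is a minimal sketch only: `CudaVersionCheck`, `majorOf`, and `sameMajor` are hypothetical names introduced for illustration and are not part of DL4J or ND4J, and the check reflects just the combinations tested in this post.

```java
/** Hypothetical helper (not part of DL4J): compares CUDA major versions. */
final class CudaVersionCheck {

    /** Returns the number before the first dot, e.g. 11 for "11.6". */
    static int majorOf(String version) {
        String trimmed = version.trim();
        int dot = trimmed.indexOf('.');
        String major = dot < 0 ? trimmed : trimmed.substring(0, dot);
        return Integer.parseInt(major);
    }

    /** True when the installed toolkit and the pom.xml artifact share a major version. */
    static boolean sameMajor(String installedCuda, String pomCuda) {
        return majorOf(installedCuda) == majorOf(pomCuda);
    }

    public static void main(String[] args) {
        // System CUDA 11.6 with cuda=11.2 in pom.xml: reported as working above.
        System.out.println(sameMajor("11.6", "11.2"));
        // System CUDA 11.6 with cuda=10.2 in pom.xml: reported as failing in section 1.2.
        System.out.println(sameMajor("11.6", "10.2"));
    }
}
```

A matching major version is necessary in the cases tested here, but it is not sufficient: the nd4j release also has to be one that works, as the 1.0.0-M1 failure shows.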

1.2 Configurations that fail

(1) System CUDA 11.2 or 11.6, with cuda=11.2 and nd4j=1.0.0-M1 in pom.xml

This was tested on a laptop with system CUDA 11.2 and again with system CUDA 11.6; in both cases pom.xml used cuda=11.2 and nd4j=1.0.0-M1. Error log:

[main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
[main] ERROR org.deeplearning4j.common.config.DL4JClassLoading - Cannot create instance of class 'org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper'.
java.lang.NoSuchMethodException: org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper.<init>(java.lang.Class, [Ljava.lang.Object;)
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.getDeclaredConstructor(Class.java:2178)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:103)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:89)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:74)
	at org.deeplearning4j.nn.layers.HelperUtils.createHelper(HelperUtils.java:57)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.initializeHelper(LSTM.java:53)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.<init>(LSTM.java:49)
	at org.deeplearning4j.nn.conf.layers.LSTM.instantiate(LSTM.java:78)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:714)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:604)
	at zj.rnn.effectiveness.train.wordvector.TestWordVector.main(TestWordVector.java:89)
Exception in thread "main" java.lang.RuntimeException: java.lang.NoSuchMethodException: org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper.<init>(java.lang.Class, [Ljava.lang.Object;)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:108)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:89)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:74)
	at org.deeplearning4j.nn.layers.HelperUtils.createHelper(HelperUtils.java:57)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.initializeHelper(LSTM.java:53)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.<init>(LSTM.java:49)
	at org.deeplearning4j.nn.conf.layers.LSTM.instantiate(LSTM.java:78)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:714)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:604)
	at zj.rnn.effectiveness.train.wordvector.TestWordVector.main(TestWordVector.java:89)
Caused by: java.lang.NoSuchMethodException: org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper.<init>(java.lang.Class, [Ljava.lang.Object;)
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.getDeclaredConstructor(Class.java:2178)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:103)
	... 9 more

Process finished with exit code 1
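The trace above ends in `Class.getDeclaredConstructor`, and the exception message lists the parameter types that were requested: `(java.lang.Class, [Ljava.lang.Object;)`. One reading, which is an assumption and not something this post confirms, is that the lookup asked for a constructor signature that the helper class does not declare. The sketch below reproduces only that reflection behavior with a stand-in class; `ConstructorLookupDemo` and `FakeHelper` are hypothetical and this is not the DL4J code.

```java
import java.lang.reflect.Constructor;

final class ConstructorLookupDemo {

    /** Stand-in for a helper class; its only constructor takes a String. */
    static final class FakeHelper {
        final String dataType;

        FakeHelper(String dataType) {
            this.dataType = dataType;
        }
    }

    /** Returns true when the class declares a constructor with exactly these parameter types. */
    static boolean hasConstructor(Class<?> type, Class<?>... parameterTypes) {
        try {
            Constructor<?> constructor = type.getDeclaredConstructor(parameterTypes);
            return constructor != null;
        } catch (NoSuchMethodException e) {
            // The message names the class and the parameter types that were requested.
            System.out.println("lookup failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Matches the declared constructor FakeHelper(String).
        System.out.println(hasConstructor(FakeHelper.class, String.class));
        // No constructor FakeHelper(Class, Object[]) exists, so the lookup fails.
        System.out.println(hasConstructor(FakeHelper.class, Class.class, Object[].class));
    }
}
```

Because the failure happens inside the library's own class loading, it cannot be fixed from application code; changing the library version, as described above, is the practical route.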

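(2) System CUDA 11.6, with cuda=10.2 and nd4j=1.0.0-beta7 in pom.xml

This error comes from the mismatch between the CUDA and cuDNN versions installed on the system and the versions declared in pom.xml. Some reports also attribute it to the RTX 3080's high compute capability, which CUDA 10.2 does not match.

Fix: upgrade to cuda=11.2 and nd4j=1.0.0-M1.1.

Error log:
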
[main] WARN org.nd4j.linalg.factory.Nd4jBackend - Skipped [JCublasBackend] backend (unavailable): java.lang.UnsatisfiedLinkError: C:\Users\A\.javacpp\cache\rnn-effective-0.0.1-bin.jar\org\bytedeco\cuda\windows-x86_64\jnicudart.dll: Can't find dependent libraries
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable$Builder.<init>(InMemoryLookupTable.java:637)
        at org.deeplearning4j.models.sequencevectors.SequenceVectors$Builder.presetTables(SequenceVectors.java:941)
        at org.deeplearning4j.models.word2vec.Word2Vec$Builder.build(Word2Vec.java:615)
        at zj.rnn.effectiveness.util.PrepareWordVector.trainWordVector(PrepareWordVector.java:133)
        at zj.rnn.effectiveness.train.wordvector.RnnClassifyWithTrainWordVector.main(RnnClassifyWithTrainWordVector.java:64)
Caused by: java.lang.RuntimeException: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5094)
        at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
        ... 5 more
Caused by: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
        at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:221)
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5091)
        ... 6 more
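The first line of this log says that `jnicudart.dll` could not find its dependent libraries, which points at a CUDA runtime DLL that Windows could not resolve. A quick way to see which runtime DLLs are actually reachable is to scan the directories on `PATH`. The sketch below uses only the JDK; `CudaRuntimeLocator` and `findOnPath` are hypothetical names, and the `cudart64_` file-name prefix is an assumption about how the CUDA runtime DLL is named on Windows.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class CudaRuntimeLocator {

    /** Lists files whose names start with the prefix in each directory of a PATH-style string. */
    static List<String> findOnPath(String pathValue, String filePrefix) {
        List<String> hits = new ArrayList<>();
        if (pathValue == null || pathValue.isEmpty()) {
            return hits;
        }
        String wanted = filePrefix.toLowerCase(Locale.ROOT);
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            File[] files = new File(dir).listFiles();
            if (files == null) {
                continue; // not a directory, or not readable
            }
            for (File file : files) {
                if (file.isFile() && file.getName().toLowerCase(Locale.ROOT).startsWith(wanted)) {
                    hits.add(file.getAbsolutePath());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed prefix of the CUDA runtime DLL; adjust if your installation differs.
        for (String hit : findOnPath(System.getenv("PATH"), "cudart64_")) {
            System.out.println(hit);
        }
    }
}
```

If the output lists only the runtime of a different CUDA major version than the one named in pom.xml, that is consistent with the mismatch described in this case.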

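(3) System CUDA 10.2, with cuda=10.2 and nd4j=1.0.0-beta7 in pom.xml

With this configuration, reading the word vectors fails, even though they were saved and loaded with the matching WordVectorSerializer methods (writeWord2VecModel and readWord2VecModel). Switching to the newer cuda=11.2 and nd4j=1.0.0-M1.1 resolved every problem.

Code that trains, saves, and then reads the word vectors (hanLpFilePath, wordVectorSize, and vectorPath are defined elsewhere in the project):

        import java.io.File;
        import java.io.FileNotFoundException;
        import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
        import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;
        import org.deeplearning4j.models.word2vec.Word2Vec;
        import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
        import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
        import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
        import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

        // 1. Train the word vectors
        try {
            SentenceIterator iter = new BasicLineIterator(hanLpFilePath);
            TokenizerFactory t = new DefaultTokenizerFactory();
            Word2Vec vec = new Word2Vec.Builder()
                    // Minimum number of times a word must occur in the whole training text
                    // (independent of the window size); for short texts, set it to 1 to keep every word.
                    .minWordFrequency(3)
                    .epochs(5) // number of training epochs
                    .layerSize(wordVectorSize) // dimensionality of each word vector
                    .seed(42)
                    // Context window: 8 words before and 8 words after each word;
                    // unrelated to the minimum word frequency.
                    .windowSize(8)
                    .iterate(iter).tokenizerFactory(t).build();
            vec.fit();
            // Save the word vectors
            WordVectorSerializer.writeWord2VecModel(vec, vectorPath);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        // 2. Read the word vectors
        WordVectors wordVectors = WordVectorSerializer.readWord2VecModel(new File(vectorPath));

Error log when reading the word vectors: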

[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
[main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [16]; Memory: [7.1GB];
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
[main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 10.2.89
[main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3080]; cc: [8.6]; Total memory: [10736893952]
[main] ERROR org.deeplearning4j.models.embeddings.loader.WordVectorSerializer - Cannot read binary model
U             syn0.txt\???[q??????χH??B     &??Rw?L?#,?#E??O?ZUk)q?7s?9???CZ?j??9????????k??9?????Zf???3??s??Yu?}V?{??U???~??[??g???m?y????m??????Y??z???z??_????r?~????[W?{?V????7?=G??L?????m?~{?]?????SN)k?>&???e???)s???Vj[?6}?,z????}?y[ie?~??zic???\K??G??????????/??N?E?X{???????????:???\????????Z??T????????f/?\???n|s??????????o?1?.???j??7k?1?V?????+u7?3???z?z?^J??q?v?/??j??u???;?E?(??U??V???/K+Z?,K???t?o{??E?d?it??g??7'*7u??G:??m?V??j?v??;??,?~??1"
        at java.lang.NumberFormatException.forInputString(Unknown Source)
        at java.lang.Integer.parseInt(Unknown Source)
        at java.lang.Integer.parseInt(Unknown Source)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readBinaryModel(WordVectorSerializer.java:278)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2444)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2426)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2413)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2372)
        at maotiao.train.wordvector.rnn.RnnClassifyWordVector.main(RnnClassifyWordVector.java:79)
[main] ERROR org.deeplearning4j.models.embeddings.loader.WordVectorSerializer - Unable to guess input file format
java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2447)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2426)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2413)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2372)
        at maotiao.train.wordvector.rnn.RnnClassifyWordVector.main(RnnClassifyWordVector.java:79)
Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2416) 
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2372) 
        at maotiao.train.wordvector.rnn.RnnClassifyWordVector.main(RnnClassifyWordVector.java:79)
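The unparsable input printed in the log begins with `syn0.txt`, which suggests the saved model file is a ZIP archive whose raw bytes ended up in the binary-format reader. That is an inference from the log, not something this post confirms. Before blaming the file itself, it can help to check that it is an intact archive. The sketch below uses only the JDK; `ModelFileInspector` and its methods are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ModelFileInspector {

    /** True when the file starts with the two-byte ZIP signature "PK". */
    static boolean looksLikeZip(File file) throws IOException {
        try (InputStream in = Files.newInputStream(file.toPath())) {
            return in.read() == 'P' && in.read() == 'K';
        }
    }

    /** Returns the entry names of a ZIP archive; throws if the archive cannot be opened. */
    static List<String> entryNames(File file) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(file)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ModelFileInspector <path-to-saved-model>");
            return;
        }
        File model = new File(args[0]);
        if (looksLikeZip(model)) {
            System.out.println("ZIP archive with entries: " + entryNames(model));
        } else {
            System.out.println("not a ZIP archive");
        }
    }
}
```

If the archive opens and lists its entries, the saved file is probably fine and the failure lies on the reading side, which fits the observation that changing the CUDA and nd4j versions made the problem go away.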

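2. Matching CUDA versions to GPUs

For how GPUs map to CUDA versions, see the article "NVIDIA GPU, CUDA, cuDNN, tensorflow-gpu, and torch-gpu version correspondence".

Note that the mappings on the official site give the highest matching version. For example, the RTX 3080 matches CUDA 11.7 at most, so any CUDA <= 11.7 can be installed. A version below 11, however, may not match the card's compute capability (CC, NVIDIA's version number for the feature set a GPU supports), and model training may then fail as well. The RTX 3080 reports cc 8.6 in the log above, and CUDA releases before 11 predate that architecture.

The author installed CUDA 11.6, 11.2, and 10.2 side by side on the RTX 3080 desktop, and CUDA 9.2 and 9.0 side by side on the GTX 1050 Ti machine.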
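The lower bound mentioned above can be made concrete with a small lookup. This is a sketch under stated assumptions: `ComputeCapabilityCheck` is a hypothetical class, and the table values (cc 6.1 first targeted by CUDA 8.0, cc 8.6 first targeted by CUDA 11.1) are given to the best of my knowledge and should be checked against NVIDIA's release notes before relying on them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ComputeCapabilityCheck {

    // Assumed values: compute capability -> first CUDA release that targets it natively.
    static final Map<String, String> MIN_CUDA = new LinkedHashMap<>();
    static {
        MIN_CUDA.put("6.1", "8.0");  // Pascal, e.g. GTX 1050 Ti
        MIN_CUDA.put("8.6", "11.1"); // Ampere, e.g. RTX 3080
    }

    /** Compares dotted version strings numerically, component by component. */
    static int compareVersions(String a, String b) {
        String[] x = a.trim().split("\\.");
        String[] y = b.trim().split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True when the CUDA version is at least the assumed minimum for the compute capability. */
    static boolean toolkitTargets(String cudaVersion, String computeCapability) {
        String min = MIN_CUDA.get(computeCapability);
        if (min == null) {
            throw new IllegalArgumentException("unknown compute capability: " + computeCapability);
        }
        return compareVersions(cudaVersion, min) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(toolkitTargets("11.2", "8.6")); // RTX 3080 with CUDA 11.2
        System.out.println(toolkitTargets("10.2", "8.6")); // RTX 3080 with CUDA 10.2
        System.out.println(toolkitTargets("9.2", "6.1"));  // GTX 1050 Ti with CUDA 9.2
    }
}
```

The comparison is numeric per component, so "11.10" sorts after "11.2", which a plain string comparison would get wrong.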

3. Dependencies needed to train DL4J on a GPU

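<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>maotiao-classify-gpu</artifactId>

    <properties>
        <nd4j.version>1.0.0-M1.1</nd4j.version>
        <dl4j.version>1.0.0-M1.1</dl4j.version>
        <cuda.version>11.2</cuda.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.7.25</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.7.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.13</version>
        </dependency>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.13</version>
        </dependency>

        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-${cuda.version}-platform</artifactId>
            <version>${nd4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-cuda-${cuda.version}</artifactId>
            <version>${dl4j.version}</version>
        </dependency>

        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-core</artifactId>
            <version>${dl4j.version}</version>
        </dependency>

        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-nlp</artifactId>
            <version>${dl4j.version}</version>
        </dependency>
    </dependencies>

    <version>0.0.1</version>
    <groupId>com.tianque</groupId>

    <build>
        <finalName>${project.artifactId}</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>2.7</version>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.4.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <shadedArtifactAttached>true</shadedArtifactAttached>
                    <shadedClassifierName>bin</shadedClassifierName>
                    <createDependencyReducedPom>true</createDependencyReducedPom>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>org/datanucleus/**</exclude>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                </configuration>

                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <!-- Transformer class names are assumed; the original listing lost these attributes. -->
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
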

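After changing the dependencies, a quick sanity check is to confirm that the CUDA classes actually ended up on the classpath, since a missing artifact otherwise surfaces only at training time. The sketch below uses only the JDK and the two class names that appear in the logs earlier in this post (`org.nd4j.linalg.jcublas.JCublasBackend` and `org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper`); `BackendPresenceCheck` itself is a hypothetical helper. Finding a class says nothing about whether the native libraries load, so treat it as a first check only.

```java
final class BackendPresenceCheck {

    /** True when the named class can be located without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false avoids running static initializers that load native code.
            Class.forName(className, false, BackendPresenceCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.nd4j.linalg.jcublas.JCublasBackend",            // from nd4j-cuda-*-platform
            "org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper"  // from deeplearning4j-cuda-*
        };
        for (String name : names) {
            System.out.println(name + " present: " + isOnClasspath(name));
        }
    }
}
```

Run it with the same classpath as the training job, for example from inside the shaded jar, so the result reflects what the application will actually see.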