Gene Data Processing (73): Reading a FASTA file from HDFS and saving it as an ADAM Parquet file

1. GRCh38 chr14:

hadoop@Master:~/xubo/project/load$ ./load.sh 
start:
1
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
load time:16550 ms
save time:72518 ms
run time:89068 ms
*************end*************

2. Whole genome:

hadoop@Master:~/xubo/project/load$ ./load.sh 
start:
456                                                                             
load time:314296 ms
[Stage 4:=============================>                          (13 + 11) / 25]16/06/07 22:35:00 ERROR TaskSchedulerImpl: Lost executor 0 on 219.219.220.215: remote Rpc client disassociated
[Stage 4:>                                                          (0 + 6) / 6]16/06/07 22:36:53 ERROR TaskSchedulerImpl: Lost executor 1 on 219.219.220.180: remote Rpc client disassociated
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".                
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
save time:853468 ms
run time:1167764 ms
*************end*************
16/06/07 22:46:39 WARN QueuedThreadPool: 3 threads could not be stopped

Code:

package org.gcdss.cli.load

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext._


object loadFastaFromfna {
  def main(args: Array[String]) {
    println("start:")
    // Submit to the standalone cluster; switch the master to "local[4]" for local tests.
    val conf = new SparkConf()
      .setAppName(this.getClass.getSimpleName.filter(_ != '$'))
      .setMaster("spark://219.219.220.149:7077")
    val sc = new SparkContext(conf)
    val ac = new ADAMContext(sc)
    val sqlContext = new SQLContext(sc)
    val startTime = System.currentTimeMillis()
    //    val path = "hdfs://219.219.220.149:9000/xubo/ref/GRCH38chr14/GRCH38chr14.fasta"
    val path = "hdfs://219.219.220.149:9000/xubo/ref/GRCH38Index/GCA_000001405.15_GRCh38_full_analysis_set.fna"
    // Load the FASTA file as contig fragments; the large maximum fragment
    // length keeps each contig in a single fragment.
    val rdd = sc.loadFasta(path, 1000000000L)
    println(rdd.count())
    val loadTime = System.currentTimeMillis()
    println("load time:" + (loadTime - startTime) + " ms")
    // Save the fragments in ADAM's Parquet format on HDFS.
    rdd.adamParquetSave("/xubo/ref/GRCH38Index/GCA_000001405.15_GRCh38_full_analysis_set.adam")
    val saveTime = System.currentTimeMillis()
    println("save time:" + (saveTime - loadTime) + " ms")
    println("run time:" + (saveTime - startTime) + " ms")
    println("*************end*************")
    sc.stop()
  }
}
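Because adamParquetSave writes ordinary Parquet files, the saved output can be inspected with plain Spark SQL, without any ADAM-specific loader. The following is a minimal read-back sketch of my own (not part of the original job); it assumes the same cluster, the output path used above, a Spark 1.x SQLContext, and that loadFastaFromfna has already completed:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object InspectAdamParquet {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("InspectAdamParquet")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // ADAM stores NucleotideContigFragment records as plain Parquet,
    // so Spark SQL can read the output directory directly.
    val df = sqlContext.read.parquet(
      "/xubo/ref/GRCH38Index/GCA_000001405.15_GRCh38_full_analysis_set.adam")
    df.printSchema()    // show the contig/fragment fields
    println(df.count()) // should match the fragment count printed by the load job
    sc.stop()
  }
}
```

This is only a sanity check; for downstream analysis the data would normally be loaded back through ADAMContext instead.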

Script:

    #!/usr/bin/env bash
    spark-submit \
        --class org.gcdss.cli.load.loadFastaFromfna \
        --master spark://219.219.220.149:7077 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \
        --jars /home/hadoop/cloud/adam/lib/adam-apis_2.10-0.18.3-SNAPSHOT.jar,/home/hadoop/cloud/adam/lib/adam-cli_2.10-0.18.3-SNAPSHOT.jar,/home/hadoop/cloud/adam/lib/adam-core_2.10-0.18.3-SNAPSHOT.jar,/home/hadoop/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem/BWAMEMSparkAll/gcdss-cli-0.0.3-SNAPSHOT.jar \
        --executor-memory 4096M \
        --total-executor-cores 20 \
        BWAMEMSparkAll.jar
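After the submit script finishes, a quick way to confirm the Parquet output landed on HDFS is to list the directory and its size. This is a sketch of my own, assuming the Hadoop CLI is on the PATH of the submitting user; the part-r-* files hold the data and _SUCCESS marks a completed save:

```shell
#!/usr/bin/env bash
# List the ADAM Parquet output directory and report its total size.
hdfs dfs -ls /xubo/ref/GRCH38Index/GCA_000001405.15_GRCh38_full_analysis_set.adam
hdfs dfs -du -h /xubo/ref/GRCH38Index/GCA_000001405.15_GRCh38_full_analysis_set.adam
```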

References:

【1】https://github.com/xubo245/AdamLearning
【2】https://github.com/bigdatagenomics/adam/ 
【3】https://github.com/xubo245/SparkLearning
【4】http://spark.apache.org
【5】http://stackoverflow.com/questions/28166667/how-to-pass-d-parameter-or-environment-variable-to-spark-job  
【6】http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver

Publications:

【1】[BIBM] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Chao Wang, and Xuehai Zhou, "Distributed Gene Clinical Decision Support System Based on Cloud Computing", in IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2017, CCF-B).
【2】[IEEE CLOUD] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Xuehai Zhou, "Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark" (CLOUD 2017, CCF-C).
【3】[CCGrid] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Jinhong Zhou, Xuehai Zhou, "DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions" (CCGrid 2017, CCF-C).
【4】more: https://github.com/xubo245/Publications

Help

If you have any questions or suggestions, please open an issue in this project or send me an e-mail: [email protected]
Wechat: xu601450868
QQ: 601450868
