1. First generate the data with ART:
See the previous post.
2. Upload the FASTQ to HDFS:
hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ spark-submit --class cs.ucla.edu.bwaspark.BWAMEMSpark --master local[2] /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar upload-fastq 0 1 fastq/G38L100c1Nhs20.fastq /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq
command: upload-fastq
Map('isPairEnd -> 0, 'filePartNum -> 1, 'inFilePath1 -> fastq/G38L100c1Nhs20.fastq, 'outFilePath -> /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq)
Upload FASTQ command line arguments: 0 1 fastq/G38L100c1Nhs20.fastq /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq 250000
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Upload FASTQ to HDFS Finished!!!
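Before running the alignment, the upload can be verified by listing the output path with the Hadoop FileSystem API. This is a minimal sketch: the `fs.defaultFS` value is an assumption taken from the `HDFS master: hdfs://Master:9000` line in the step 3 log, and the path comes from the upload command above; both must match your own cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CheckUpload {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // assumed NameNode address; replace with your own MasterIP
    conf.set("fs.defaultFS", "hdfs://Master:9000")
    val fs = FileSystem.get(conf)
    val out = new Path("/xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq")
    // print each part file under the uploaded FASTQ directory with its size
    fs.listStatus(out).foreach(s => println(s"${s.getPath}\t${s.getLen} bytes"))
  }
}
```

The command-line equivalent is `hdfs dfs -ls /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq`.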
3. Run the alignment:
hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ spark-submit --executor-memory 2g --class cs.ucla.edu.bwaspark.BWAMEMSpark --total-executor-cores 2 --master local[2] --conf spark.driver.host=**MasterIP** --conf spark.driver.cores=2 --conf spark.driver.maxResultSize=2g --conf spark.storage.memoryFraction=0.7 --conf spark.akka.threads=2 --conf spark.akka.frameSize=1024 /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar cs-bwamem -bfn 1 -bPSW 1 -sbatch 10 -bPSWJNI 1 -oChoice 2 -oPath hdfs://**MasterIP**:9000/xubo/11.adam -localRef 1 -isSWExtBatched 1 0 GRCH38BWAindex/GRCH38chr1L3556522.fasta /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq
command: cs-bwamem
Map('isPSWJNI -> 1, 'localRef -> 1, 'batchedFolderNum -> 1, 'isPSWBatched -> 1, 'subBatchSize -> 10, 'inFASTQPath -> /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq, 'inFASTAPath -> GRCH38BWAindex/GRCH38chr1L3556522.fasta, 'outputPath -> hdfs://**MasterIP**:9000/xubo/11.adam, 'isSWExtBatched -> 1, 'isPairEnd -> 0, 'outputChoice -> 2)
CS- BWAMEM command line arguments: false GRCH38BWAindex/GRCH38chr1L3556522.fasta /xubo/data/alignment/cs-bwamem/fastq/g38L100c1Nhs20upload.fastq 1 true 10 true ./target/jniNative.so 2 hdfs://**MasterIP**:9000/xubo/11.adam
HDFS master: hdfs://Master:9000
Input HDFS folder number: 1
Head line: @RG ID:foo SM:bar
Read Group ID: foo
Load Index Files
Load BWA-MEM options
Output choice: 2
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
[WARNING] Avro: Invalid default for field comment: null not a "bytes"
CS-BWAMEM Finished!!!
Jun 3, 2016 11:32:26 AM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
Jun 3, 2016 11:32:27 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jun 3, 2016 11:32:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
Jun 3, 2016 11:32:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jun 3, 2016 11:32:27 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 17 ms. row count = 1
**MasterIP** must be replaced with the actual IP of the master node.
4. Inspect the ADAM file:
cs-bwamem provides a merge command, but following the documented steps it did not succeed.
The output can instead be read directly with Spark SQL:
package org.bdgenomics.avocado.cli

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext._

/**
  * Created by xubo on 2016/5/27.
  * Reads the aligned data back from HDFS.
  * run: success
  */
object parquetRead2csbwamem {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]")
      .setAppName(this.getClass().getSimpleName().filter(!_.equals('$')))
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    println("start:")
    val file = "hdfs://**MasterIp**:9000/xubo/14.adam/0"
    // mergeSchema merges the Parquet schemas of all part files
    val df3 = sqlContext.read.option("mergeSchema", "true").parquet(file)
    // df3.printSchema()
    df3.show()
    println("end")
    sc.stop()
  }
}
Result:
+--------------------+---------+---------+----+--------------------+--------------------+--------------------+-----+---------------------+-------------------+----------+----------+----------+----------+-----------+------------+-------------------------+-------------+------------------+------------------+----------------+------------------+----------------------+--------------------+--------+--------------------+---------------+---------------------------+----------------------+-----------------------+--------------------+----------------------+------------------+------------------------------------+-------------------+-----------------------+-----------------+------------------+----------------+----------+
| contig| start| end|mapq| readName| sequence| qual|cigar|basesTrimmedFromStart|basesTrimmedFromEnd|readPaired|properPair|readMapped|mateMapped|firstOfPair|secondOfPair|failedVendorQualityChecks|duplicateRead|readNegativeStrand|mateNegativeStrand|primaryAlignment|secondaryAlignment|supplementaryAlignment|mismatchingPositions|origQual| attributes|recordGroupName|recordGroupSequencingCenter|recordGroupDescription|recordGroupRunDateEpoch|recordGroupFlowOrder|recordGroupKeySequence|recordGroupLibrary|recordGroupPredictedMedianInsertSize|recordGroupPlatform|recordGroupPlatformUnit|recordGroupSample|mateAlignmentStart|mateAlignmentEnd|mateContig|
+--------------------+---------+---------+----+--------------------+--------------------+--------------------+-----+---------------------+-------------------+----------+----------+----------+----------+-----------+------------+-------------------------+-------------+------------------+------------------+----------------+------------------+----------------------+--------------------+--------+--------------------+---------------+---------------------------+----------------------+-----------------------+--------------------+----------------------+------------------+------------------------------------+-------------------+-----------------------+-----------------+------------------+----------------+----------+
|[chr1,248956422,n...|225496693|225496793| 60|chr1-1 RG ID:foo ...|CATATTTACCAATTAAA...|@C@D@FFDFHHHHIJ.J...| 100M| 0| 0| false| false| true| false| false| false| false| false| false| false| true| false| false| 61A38| null|NM:i:1 AS:i:95 XS...| foo| null| null| null| null| null| null| null| null| null| bar| null| null| null|
+--------------------+---------+---------+----+--------------------+--------------------+--------------------+-----+---------------------+-------------------+----------+----------+----------+----------+-----------+------------+-------------------------+-------------+------------------+------------------+----------------+------------------+----------------------+--------------------+--------+--------------------+---------------+---------------------------+----------------------+-----------------------+--------------------+----------------------+------------------+------------------------------------+-------------------+-----------------------+-----------------+------------------+----------------+----------+
end
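Since the schema is very wide, the full `show()` above is hard to read; projecting only the core alignment columns makes the check easier. A minimal variant of the reader above (column names taken from the printed table header; the path placeholder is kept as in the original):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object parquetReadSelected {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]").setAppName("parquetReadSelected")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.option("mergeSchema", "true")
      .parquet("hdfs://**MasterIp**:9000/xubo/14.adam/0")
    // project only the alignment fields of interest instead of all ~38 columns
    df.select("start", "end", "mapq", "cigar", "mismatchingPositions").show()
    sc.stop()
  }
}
```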
Analysis:
The result is consistent with the alignments produced by bwa and SNAP, and with the reads generated by ART. See 【2】.
hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ cat G38L100c1Nhs20.sam
@SQ SN:chr1 LN:248956422
@PG ID:bwa PN:bwa VN:0.7.13-r1126 CL:bwa samse GRCH38chr1L3556522.fna G38L100c1Nhs20.sai G38L100c1Nhs20.fq
chr1-1	0	chr1	225496694	37	100M	*	0	0	CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGACAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT	@C@D@FFDFHHHHIJ.JBIJJGJGIJ:G47JHJ@IJJ91BJJIGHHHEIJDGD=IJJJBJJ'DG=3D)
>chr1	chr1-1	225496693	+
CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGAAAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT
CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGACAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT
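The single mismatch reported above (mismatchingPositions `61A38`, i.e. 61 matching bases, reference base A, then 38 matching bases) can be checked directly against the two 100 bp sequences above (the reference slice first, then the read) with a small self-contained sketch:

```scala
object MismatchCheck {
  def main(args: Array[String]): Unit = {
    // reference slice and read, copied from the ART output above
    val ref  = "CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGAAAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT"
    val read = "CATATTTACCAATTAAAGTCACAAAATATTTCTCATTATTTATTCATGCAGGTAACTGAGACAAAGATAGTGCAGAAATCAACTTTAAATAAAAAATTAT"
    // collect every position where reference and read disagree
    val mismatches = ref.zip(read).zipWithIndex.collect {
      case ((r, q), i) if r != q => (i, r, q)
    }
    mismatches.foreach { case (i, r, q) => println(s"pos $i: ref=$r read=$q") }
    // prints: pos 61: ref=A read=C -- consistent with the MD tag 61A38
  }
}
```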
References:
【1】https://github.com/ytchen0323/cloud-scale-bwamem
【2】http://blog.csdn.net/xubo245/article/details/51576880