Gene Data Processing 18: Installing and Using the Read Simulator wgsim


1. Download:

https://github.com/lh3/wgsim

You can either clone the repository with git or download it as a zip archive.
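
For example, cloning with git and entering the source directory:

git clone https://github.com/lh3/wgsim.git
cd wgsim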


2. Installation:

gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
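
Running the resulting binary with no arguments prints the usage text shown in section 4 below, which makes a quick smoke test; optionally copy it onto your PATH (the destination here is just an example):

./wgsim                          # prints the help/usage text
sudo cp wgsim /usr/local/bin/    # optional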


3. Reference data: the reference genome can be downloaded with bwakit:

https://github.com/lh3/bwa/tree/master/bwakit

Download the hs38DH reference:

bwa.kit/run-gen-ref hs38DH
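
Once run-gen-ref finishes, hs38DH.fa sits in the current directory; a quick way to peek at the contig names:

grep '^>' hs38DH.fa | head -5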

4. Usage and default options:

hadoop@Master:~/cloud/spark-1.5.2/examples/src/main/resources$ wgsim

Program: wgsim (short read simulator)
Version: 0.3.2
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.15]
         -X FLOAT      probability an indel is extended [0.30]
         -S INT        seed for random generator [0, use the current time]
         -A FLOAT      discard if the fraction of ambiguous bases higher than FLOAT [0.05]
         -h            haplotype mode
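
These options can be combined; as a sketch, a reproducible paired-end run with a fixed seed, 2 x 100 bp reads, and a 400 bp insert size (the output filenames are arbitrary):

wgsim -S 11 -N 10000 -1 100 -2 100 -d 400 -s 30 -e 0.01 hs38DH.fa sim_1.fq sim_2.fq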

5. Usage examples:

(1) Default paired-end simulation:

wgsim hs38DH.fa PE/hs38DHPE1LallF1.fq PE/hs38DHPE1LallF2.fq
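
Note that wgsim simply opens its output paths with fopen() and fails if a directory is missing, so create the output directories used in these examples first:

mkdir -p PE SE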


(2) Default single-end simulation (the second mate is discarded to /dev/null):

hadoop@Mcnode1:~/cloud/adam/xubo/data/hs38DH$ wgsim hs38DH.fa hs38DHSELallF1V2.fq /dev/null


(3) -N: the number of read pairs to generate

For example, -N 10000 generates 10,000 read pairs:

wgsim -N 10000 hs38DH.fa PE/hs38DHPE1L10000F1.fq PE/hs38DHPE1L10000F2.fq

Inspect the output:

Line count:

hadoop@Mcnode1:~/cloud/adam/xubo/data/hs38DH$ cat PE/hs38DHPE1L10000F1.fq |wc -l
39740

A FASTQ record is four lines per read, so this file contains 39740 / 4 = 9935 reads, slightly fewer than the 10,000 requested: wgsim discards reads drawn from regions whose fraction of ambiguous bases exceeds the -A threshold.
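
A quick sanity check of that count:

echo $((39740 / 4))    # 9935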


File contents. In each wgsim read name, the fields after the contig name are the 1-based start and end coordinates of the fragment, the error:substitution:indel counts for each end, a serial number, and the mate index (/1 or /2):

hadoop@Mcnode1:~/cloud/adam/xubo/data/hs38DH$ cat PE/hs38DHPE1L10000F1.fq |head -20
@chrUn_KN707606v1_decoy_29_523_2:0:0_1:0:0_0/1
ATGCCCAGCTGGTTTCTGATACTTCTAATCAAATGTCTTATCCCCCAAATTAGCCCTGGGAGTGAGAATA
+
2222222222222222222222222222222222222222222222222222222222222222222222
@chrUn_KN707606v1_decoy_657_1222_1:0:0_1:0:0_1/1
GTGGTGCACACCTGTAGTGCCTGTTCCTTGGGAGGCTGAGGCCGGAGGATCCCTTGAGCCCAGGAGTTCA
+
2222222222222222222222222222222222222222222222222222222222222222222222
@chrUn_KN707606v1_decoy_1052_1588_2:0:0_1:1:0_2/1
GTCCAAACACCACGTGACAAGCCCATTCTTCCATTTTCTCAGACCATAAACTGCACTGTCCTCTAACTGC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@chrUn_KN707607v1_decoy_1123_1686_1:0:0_2:0:0_0/1
GAGGATATTTTGTTTAGTCACTAGGATTTCTTAACATTCTGAAATTCTATTCACCTCTGATTTTGTCTAT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@chrUn_KN707607v1_decoy_877_1369_0:0:0_0:0:0_1/1
TATAGTTAACATAACATGGTCTATCTTTAGATAATCTCCATGCACAGTAAGATAATATTTTTTCTAGGAC
+
2222222222222222222222222222222222222222222222222222222222222222222222

(4) -1: the length of the first read

-1 10 sets the length of the first read to 10 bp:

wgsim -N 10000 -1 10 hs38DH.fa SE/hs38DHSE1N10000L10F1.fq /dev/null
Inspect the output:

hadoop@Mcnode1:~/cloud/adam/xubo/data/hs38DH$ cat SE/hs38DHSE1N10000L10F1.fq |wc -l
39740
hadoop@Mcnode1:~/cloud/adam/xubo/data/hs38DH$ cat SE/hs38DHSE1N10000L10F1.fq |head -20
@chrUn_KN707606v1_decoy_216_790_0:0:0_2:0:0_0/1
CATGTCTTTC
+
2222222222
@chrUn_KN707606v1_decoy_1191_1728_0:0:0_1:0:0_1/1
TTAACCTTAA
+
2222222222
@chrUn_KN707606v1_decoy_792_1284_1:0:0_0:0:0_2/1
CAGAACAAAA
+
2222222222
@chrUn_KN707607v1_decoy_1925_2441_0:0:0_1:0:0_0/1
TGCAGGTTTG
+
2222222222
@chrUn_KN707607v1_decoy_2305_2757_1:0:0_3:0:0_1/1
GGACAAGGGA
+
2222222222
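
To confirm that every simulated read is 10 bp, print the distinct lengths of the sequence lines (the second line of every four-line FASTQ record):

awk 'NR % 4 == 2 {print length($0)}' SE/hs38DHSE1N10000L10F1.fq | sort -u    # prints: 10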





6. Other notes:

(1) Alignment:

Build a BWA index on the reference, then align the simulated reads against it.
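
A minimal sketch using standard bwa commands (aln-pe.sam is an arbitrary output name; the FASTQ inputs are the pairs simulated in section 5):

bwa index hs38DH.fa
bwa mem hs38DH.fa PE/hs38DHPE1L10000F1.fq PE/hs38DHPE1L10000F2.fq > aln-pe.sam

After indexing, the directory contains the .amb/.ann/.bwt/.pac/.sa index files: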

hadoop@Master:~/cloud/adam/xubo/data/wgsim/hs38DH$ ll -h
total 22M
drwxrwxr-x 4 hadoop hadoop 4.0K Apr 15 15:48 ./
drwxrwxr-x 7 hadoop hadoop 4.0K Apr 11 17:10 ../
-rw-rw-r-- 1 hadoop hadoop 8.0M Apr 11 17:08 hs38DH.fa
-rw-r--r-- 1 hadoop hadoop 477K Apr 11 17:08 hs38DH.fa.alt
-rw-rw-r-- 1 hadoop hadoop   15 Apr 11 17:10 hs38DH.fa.amb
-rw-rw-r-- 1 hadoop hadoop 365K Apr 11 17:10 hs38DH.fa.ann
-rw-rw-r-- 1 hadoop hadoop 7.6M Apr 11 17:10 hs38DH.fa.bwt
-rw-rw-r-- 1 hadoop hadoop 1.9M Apr 11 17:10 hs38DH.fa.pac
-rw-rw-r-- 1 hadoop hadoop 3.8M Apr 11 17:10 hs38DH.fa.sa
drwxrwxr-x 2 hadoop hadoop 4.0K Apr 15 16:23 PE/
drwxrwxr-x 2 hadoop hadoop 4.0K Apr 15 15:48 SE/

(2) Converting to ADAM format:

hadoop@Master:~/cloud$ adam-submit fasta2adam /xubo/adam/hs38DH/hs38DH.fa /xubo/adam/hs38DH/adam/hs38DH.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/home/hadoop/cloud/spark-1.5.2//bin/spark-submit
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
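
To spot-check the conversion, ADAM releases of this era (around 0.18, matching the Spark 1.5.2 setup above) include a print subcommand that dumps the records of an ADAM file; treat the exact subcommand name as an assumption for your ADAM version:

adam-submit print /xubo/adam/hs38DH/adam/hs38DH.adam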


