在临床数据分析时,使用的人类参考基因组为UCSC上面的hg19,hg39版本,且常常将除1-22,X,Y,M,以外的染色体去除,避免数据的干扰。
This directory contains the Feb. 2009 assembly of the human genome (hg19,
GRCh37 Genome Reference Consortium Human Reference 37 (GCA_000001405.1)),
as well as repeat annotations and GenBank sequences.
The Feb. 2009 human reference sequence (GRCh37) was produced by the
Genome Reference Consortium:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
md5sum.txt - checksums of files in this directory
hg19.chrom.sizes - Two-column tab-separated text file containing assembly
sequence names and sizes.
# 打开码文件
cat md5sum.txt
## 将chromFa.tar.gz的码写入一个新文件
echo "ec3c974949f87e6c88795c17985141d3 chromFa.tar.gz" > check_md5sum.txt
## 检测
md5sum -c check_md5sum.txt
##显示 chromFa.tar.gz: OK
cat hg19.chrom.sizes |head -n 25
## 显示
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chrX 155270560
chr8 146364022
chr9 141213431
chr10 135534747
chr11 135006516
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
chr16 90354753
chr17 81195210
chr18 78077248
chr20 63025520
chrY 59373566
chr19 59128983
chr22 51304566
chr21 48129895
chr6_ssto_hap7 4928567
tar -zxf chromFa.tar.gz
mv chrUn_gl0002* chr*_random.fa chr*ctg* chr*apd* chr*cox* chr*hap* temp
#mkdir chrfa
#mv chromFa.tar.gz chrfa/
当面目录只剩下
chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
do
echo $i;
#cat $i >> hg19.fa
mv $i chrfa
done
bwa index -a bwtsw -p hg19.fa hg19.fa
索引构建完后显示
[BWTIncConstructFromPacked] 670 iterations done. 6154206942 characters processed.
[BWTIncConstructFromPacked] 680 iterations done. 6177547390 characters processed.
[bwt_gen] Finished constructing BWT in 687 iterations.
[bwa_index] 2423.30 seconds elapse.
[bwa_index] Update BWT... 19.05 sec
[bwa_index] Pack forward-only FASTA... 16.77 sec
[bwa_index] Construct SA from BWT and Occ... 1192.17 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw -p hg19.fa hg19.fa
[main] Real time: 3814.250 sec; CPU: 3676.297 sec
文件生成查看
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg19$ ls -lht
total 8.0G
-rwxrwxrwx 1 toucan toucan 3.0G Sep 27 09:53 hg19.fa
-rwxrwxrwx 1 toucan toucan 1.5G Sep 27 11:17 hg19.fa.sa
-rwxrwxrwx 1 toucan toucan 6.6K Sep 27 10:57 hg19.fa.amb
-rwxrwxrwx 1 toucan toucan 944 Sep 27 10:57 hg19.fa.ann
-rwxrwxrwx 1 toucan toucan 739M Sep 27 10:57 hg19.fa.pac
-rwxrwxrwx 1 toucan toucan 2.9G Sep 27 10:57 hg19.fa.bwt
-rwxrwxrwx 1 toucan toucan 3.5K Sep 27 13:16 work_index.log
-rwxrwxrwx 1 toucan toucan 685 Sep 27 10:06 work_index.sh
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:58 chrfa
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:17 temp
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:00 download
hg38.fa.gz - "Soft-masked" assembly sequence in one file.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
hg38.fa.gz - "Soft-masked" assembly sequence in one file.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
md5sum.txt - checksums of files in this directory
hg38.chrom.sizes - Two-column tab-separated text file containing assembly
sequence names and sizes.
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "1c9dcaddfa41027f17cd8f7a82c7293b hg38.fa.gz" >> check_md5sum.txt
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "a5aa5da14ccf3d259c4308f7b2c18cb0 hg38.chromFa.tar.gz" >> check_md5sum.txt
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat check_md5sum.txt
1c9dcaddfa41027f17cd8f7a82c7293b hg38.fa.gz
a5aa5da14ccf3d259c4308f7b2c18cb0 hg38.chromFa.tar.gz
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ md5sum -c check_md5sum.txt
hg38.fa.gz: OK
hg38.chromFa.tar.gz: OK
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat hg38.chrom.sizes |head -n 25
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chrX 156040895
chr8 145138636
chr9 138394717
chr11 135086622
chr10 133797422
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr20 64444167
chr19 58617616
chrY 57227415
chr22 50818468
chr21 46709983
tar -zxf hg38.chromFa.tar.gz
cd chroms/
mkdir chrfa
cat work.sh
for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
do
echo $i;
cat $i >> ../hg38.fa
mv $i ../chrfa
done
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ rm -rf chroms/
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ mv hg38.chromFa.tar.gz chrfa/
查看结果
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ cat hg38.fa |grep ">"|uniq -c
1 >chr1
1 >chr2
1 >chr3
1 >chr4
1 >chr5
1 >chr6
1 >chr7
1 >chr8
1 >chr9
1 >chr10
1 >chr11
1 >chr12
1 >chr13
1 >chr14
1 >chr15
1 >chr16
1 >chr17
1 >chr18
1 >chr19
1 >chr20
1 >chr21
1 >chr22
1 >chrX
1 >chrY
1 >chrM
bwa index -a bwtsw -p hg38.fa hg38.fa
输出显示:
[BWTIncConstructFromPacked] 630 iterations done. 6018708242 characters processed.
[BWTIncConstructFromPacked] 640 iterations done. 6055482898 characters processed.
[BWTIncConstructFromPacked] 650 iterations done. 6088164258 characters processed.
[BWTIncConstructFromPacked] 660 iterations done. 6117207522 characters processed.
[BWTIncConstructFromPacked] 670 iterations done. 6143017202 characters processed.
[BWTIncConstructFromPacked] 680 iterations done. 6165952882 characters processed.
[bwt_gen] Finished constructing BWT in 686 iterations.
[bwa_index] 2564.67 seconds elapse.
[bwa_index] Update BWT... 19.27 sec
[bwa_index] Pack forward-only FASTA... 16.36 sec
[bwa_index] Construct SA from BWT and Occ... 1230.34 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw -p hg38.fa hg38.fa
[main] Real time: 3969.074 sec; CPU: 3855.406 sec
输出文件
-rwxrwxrwx 1 toucan toucan 3150052305 Oct 5 15:44 hg38.fa*
-rwxrwxrwx 1 toucan toucan 16843 Oct 5 16:34 hg38.fa.amb*
-rwxrwxrwx 1 toucan toucan 954 Oct 5 16:34 hg38.fa.ann*
-rwxrwxrwx 1 toucan toucan 3088286508 Oct 5 16:33 hg38.fa.bwt*
-rwxrwxrwx 1 toucan toucan 772071602 Oct 5 16:34 hg38.fa.pac*
-rwxrwxrwx 1 toucan toucan 1544143256 Oct 5 16:54 hg38.fa.sa*
其他索引构建
samtools faidx hg19.fa
## 显示生成
hg19.fa.fai
samtools faidx hg38.fa
gatk CreateSequenceDictionary -R hg19.fa -O hg19.dict