###############################################################
####### #######
####### 1 reads比对 #######
####### 20220108 #######
###############################################################

基因组文件：reference.fasta
测序数据文件：example_R1.fq.gz example_R2.fq.gz

1 为参考基因组构建索引

samtools faidx reference.fasta
bwa index reference.fasta

2 将测序数据文件比对到参考基因组

-t 设置线程数,-R 添加reads的标签信息,view -Sb 将结果文件由sam转为bam，减少存储。

bwa mem -t 2 \ 
-R '@RG\tID:example\tPL:illumina\tSM:example' \ 
reference.fasta \ 
./example_R1.fq.gz ./example_R2.fq.gz \ 
| samtools view -Sb - > example.bam

3 比对结果文件进行排序

-@ 设置线程数，-m 每个线程可使用的最大内存，-o 指定输出文件名称

samtools sort -@ 2 -m 10G -o example.sort.bam example.bam

4 去除比对结果中的PCR重复

samtools rmdup -S example.sort.bam example.sort.rmdup.bam

5 过滤比对质量少于20的reads

samtools view example.sort.rmdup.bam -q 20 -O BAM -o example.sort.rmdup.filter.bam

###############################################################
####### #######
####### 2 变异检测 #######
####### 20220108 #######
###############################################################

基因组文件：reference.fasta
比对结果文件：example.bam

注：GATK3中需要单独对bam文件进行局部重比对，以提高Indel附近区域比对结果的准确性，但在GATK4中，

局部重比对的步骤已经自动包含在HaplotypeCaller模块中，无需单独进行。

1 分别对每个样品进行变异检测

用到的GATK模块为HaplotypeCaller，--java-options 设置JAVA的参数 -XX:+UseSerialGC 单线程运行，-ERC 输出GVCF

gatk --java-options "-XX:+UseSerialGC"
HaplotypeCaller -R reference.fasta
-ERC GVCF
-I example.bam
-O example.g.vcf.gz

2 对所有样品的gvcf文件进行合并

用到的GATK模块为CombineGVCFs, -Xmx400g -Xms400g 指定最大内存，-V 每个样品的gvcf文件 -O 指定输出文件名称

gatk --java-options "-Xmx400g -Xms400g -XX:+UseSerialGC"
CombineGVCFs -R reference.fasta
-V example1.g.vcf.gz -V example2.g.vcf.gz -V example3.g.vcf.gz
-O combine.g.vcf.gz

3 群体变异检测

用到的GATK模块为GenotypeGVCFs, -V gvcf文件 -O 指定输出的vcf文件

gatk --java-options "-Xmx400g -Xms400g -XX:+UseSerialGC"
GenotypeGVCFs -R reference.fasta
-V combine.g.vcf.gz
-O combine.call.vcf.gz

4 提取SNP

用到的GATK模块为SelectVariants, -select-type 指定提取类型

gatk --java-options "-Xmx400g -Xms400g -XX:+UseSerialGC"
SelectVariants -R reference.fasta
-V combine.call.vcf.gz
-select-type SNP
-O combine.SNP.vcf.gz

4 过滤SNP

用到的GATK模块为VariantFiltration, --filter-expression 设置过滤参数及阈值

--filter-name 为每个位点添加标记，大于过滤阈值的标记为PASS，否则标记为Filter

gatk --java-options "-Xmx400g -Xms400g -XX:+UseSerialGC"
VariantFiltration -R reference.fasta
-V combine.SNP.vcf.gz
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0
|| SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
--filter-name "Filter"
-O combine.SNP.filter.vcf.gz

5 提取过滤好的SNP

用到的GATK模块为SelectVariants，--exclude-filtered 去除被过滤掉的SNP，即标记为Filter的位点

gatk --java-options "-Xmx400g -Xms400g -XX:+UseSerialGC"
SelectVariants -R reference.fasta
-V combine.SNP.filter.vcf.gz
--exclude-filtered
-O combine.SNP.filtered.vcf.gz

###############################################################
####### #######
####### 3 变异过滤 #######
####### 20220109 #######
###############################################################

VCF文件：example.vcf

1 去除Indel附近7bp范围的SNP

-g 设置距离，单位bp， -O 指定输出文件格式 v表示未压缩的vcf文件

注：此步需要vcf文件中包含indel，过滤完后将indel去除

bcftools filter -g 7 -O v -o 1.vcf example.vcf

2 去除10bp范围内有大于3个SNP的SNP cluster

1 为vcf文件建立索引

gatk IndexFeatureFile -F 1.vcf

2 为符合要求的位点添加标记SNP cluster，-cluster 指定SNP数目 -window 指定窗口范围，单位bp

gatk VariantFiltration -V 1.vcf -cluster 3 -window 10 -O 2.vcf

3 去除SNP cluster

gatk --java-options "-XX:+UseSerialGC"
SelectVariants -V 2.vcf
-select "FILTER == SnpCluster"
--invertSelect
-O 2.Filtered.vcf

3 将质量值小于20，覆盖的reads数小于5的基因型设为miss即./.

vcftools --vcf 2.Filtered.vcf
--minDP 5 --minGQ 20
--recode --recode-INFO-all
--out 3.Filtered

4 去除非二等位基因、缺失率大于90%，次要等位基因频率小于5%的位点

vcftools --vcf 3.Filtered.recode.vcf
--max-allele 2 --min-allele 2
--max-missing 0.9 --maf 0.05
--recode --recode-INFO-all
--out 4.Filtered

5 过滤连锁不平衡位点

1 转换vcf文件为ped格式

plink --vcf 4.Filtered.recode.vcf
--allow-extra-chr --recode
--out 5.ld

2 将map文件中的第二列改为每个snp位点特有的标识符

awk '{print 1"_"3"\t"$4}' 5.ld.map > 5.ld1.map

3 修改相应的ped文件的名称

mv 5.ld.ped 5.ld1.ped

4 计算位点之间的ld值, --indep-pairwise 设置窗口、步长、r2值

plink --file 5.ld1
--allow-extra-chr
--indep-pairwise 50 10 0.2
--out ld

5 抽出低连锁的位点，--extract 抽取满足要求的位点，--make-bed 转换为bed格式

plink --file 5.ld1
--allow-extra-chr
--extract ld.prune.in
--make-bed --out filter_ld

6 将bed文件转为vcf文件

plink --bfile filter_ld
--allow-extra-chr
--recode vcf-iid --out filter_ld

###############################################################
####### #######
####### 4 系统树、Admixture和PCA #######
####### 20220109 #######
###############################################################

VCF文件：example.vcf
BED文件：example.bed

系统树

1 将SNP转为fasta格式，用到的软件为vcf2phylip.py(https://github.com/edgardomortiz/vcf2phylip)

python vcf2phylip.py -f -i example.vcf

2 megacc 构建NJ树

infer_NJ_nucleotide_NOld.mao为参数文件，常用参数如下：

[ AnalysisSettings ]

Analysis = Phylogeny Reconstruction

Scope = All Selected Taxa

Statistical Method = Neighbor-joining #指定建树方法

Phylogeny Test = ====================

Test of Phylogeny = Bootstrap method

No. of Bootstrap Replications = 1000 #指定Bootstrap次数

Substitution Model = ====================

Substitutions Type = Nucleotide

Model/Method = p-distance #指定模型

Number of Threads = 4 #指定线程数目

-a 指定参数文件，-d指定序列文件

megacc -a infer_NJ_nucleotide_NOld.mao -d example.fasta -o example_out

3 IQTREE 构建ML树

-s 指定序列文件，-o 指定外类群名称， -T 指定线程数目，-B 指定Bootstrap次数，-m MFP 自动搜索最适合模型并建树

iqtree -s example.fasta -o acoe -T 10 -B 1000 -m MFP

Admixture

1 依次计算K=2到K=20

for k in {2..20} ;do admixture --cv=10 example.bed {k}.out; done

2 处理admixture结果

1 提取出Q文件中的个体顺序

awk '{print $2}' example.fam > ID

2 给Q文件添加个体id

for i in *Q ;do paste ID i.id ;done

3 按照sortID文件中的指定顺序，给Q文件排序

for i in *id ;do python3 sortQ.py sortID i.sort ;done

3 绘图

a <- read.table("sort.5k",row.names = 1)
b <- t(as.matrix(a))
b.reverse <- b[,324:1]
p <- barplot(b.reverse,space = 0,border = NA,axisnames = F,
col = c("gold","firebrick","DarkGoldenrod4","IndianRed1",))
axis(side = 1, p, labels = F, at = c(0:324))
labs <- colnames(b.reverse)
text(cex=.3, x=p-0.25,y=-0.05,labs,xpd=TRUE,srt=45,adj = c(1,1))

PCA

--pca 324 计算324个PC的值，324为输入文件中的个体数

plink --allow-extra-chr --bfile example --pca 324

2022-01-10