WGS Analysis Notes (1): Data and Quality Control, 2021-12-15

I have been doing data analysis recently, so I am tidying up my notes along the way to catch anything I missed.

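Step1. Create a virtual environment with miniconda3

conda env list  # list environments; base is installed by default, * marks the active one
conda create -n your_env_name  # create an environment
conda activate your_env_name  # activate an environment
source activate your_env_name  # legacy activation syntax, for conda older than 4.4
conda deactivate  # leave the current environment
conda create --name your_env_name --clone old_env_name  # clone an existing environment into a new one
conda remove --name your_env_name --all  # delete an environment (--all removes every package in it)
conda create --name your_env_name python=3.6  # pin the Python version
conda env export > your_project_env.yaml  # export the active environment's package list
conda search -c bioconda samtools  # search the bioconda channel for a package (samtools as an example)
conda install -c bioconda blast=2.7.1 samtools=1.7  # install pinned versions; separate multiple packages with spaces
conda install -y -c bioconda fastqc=0.11.7  # -y skips the confirmation prompt
conda create -n dna -c conda-forge -c bioconda sra-tools fastqc cutadapt trimmomatic star hisat2 samtools subread htseq  # common tools for transcriptome analysis
conda create -n medaka -c conda-forge -c bioconda medaka  # create an environment named medaka and install medaka from bioconda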
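
Re-running conda create for an environment that already exists triggers a prompt or an error, so a small guard can help in setup scripts. The following is only a sketch: env_exists and ensure_env are helper names made up for this note, and the channel order (conda-forge, then bioconda) is an assumption.

```shell
# Return 0 if an environment named $1 appears in `conda env list`.
env_exists() {
    # Skip the "#" header lines; the first column is the environment name.
    conda env list | awk -v name="$1" '
        $0 !~ /^#/ && $1 == name { found = 1 }
        END { exit found ? 0 : 1 }'
}

# Create environment $1 with the remaining arguments as packages,
# unless an environment with that name already exists.
ensure_env() {
    env_name="$1"
    shift
    if env_exists "$env_name"; then
        echo "environment '$env_name' already exists, skipping"
    else
        conda create -y -n "$env_name" -c conda-forge -c bioconda "$@"
    fi
}

# Usage (commented out so that sourcing this file changes nothing):
# ensure_env wgs sra-tools fastqc
```

Environments that were created by path rather than by name show up without a name column, so this check will not find them.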

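Step2. Download the raw SRA data

Three options:

Aspera Connect
prefetch from sratoolkit
ftp

Note: downloading with wget or curl is not recommended; it is slow and the download is sometimes incomplete.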

prefetch SRRxxxxxxx
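
With more than a handful of runs it is easier to keep the accessions in a text file, one per line. A minimal sketch, assuming such a file exists; download_all and the DRY_RUN variable are names invented here, not part of sratoolkit.

```shell
# Run prefetch for every accession listed in file $1 (one per line).
# With DRY_RUN=1 the commands are only printed, not executed.
download_all() {
    list_file="$1"
    # "|| [ -n "$acc" ]" keeps the last line even without a trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines and lines starting with "#".
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "prefetch $acc"
        else
            prefetch "$acc"
        fi
    done < "$list_file"
}

# Usage:
# DRY_RUN=1 download_all accessions.txt   # check the list first
# download_all accessions.txt             # then download
```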

Step3. Convert SRA to FASTQ

fastq-dump --split-3 -O /your_path/ SRRxxxxxxx.1

If you run into this error:

An error occurred during processing.
A report was generated into the file '/root/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.

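This usually means the disk is full and the file cannot be written any further. Either free up space, or compress the FASTQ files as they are written:

fastq-dump --split-3 -O /your_path/ --gzip file.sra

If disk space is tight, compress the output and delete the original data once the conversion has finished.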
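
Since a full disk only shows up as a generic error halfway through, it may be worth checking the free space before converting. A rough sketch using df; free_kb and has_space are hypothetical helper names, and the 50 GiB threshold in the usage is an arbitrary assumption.

```shell
# Print the free space, in KiB, of the filesystem that holds directory $1.
free_kb() {
    # -P forces one line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if directory $1 has at least $2 KiB free.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Usage:
# if has_space /your_path $((50 * 1024 * 1024)); then
#     fastq-dump --split-3 --gzip -O /your_path/ file.sra
# else
#     echo "not enough free space in /your_path" >&2
# fi
```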

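Step4. Check data quality

Software: fastqc

mkdir qc
fastqc -o qc /your_path/SRRxxxxxx_1.fastq

Metrics to check:
- distribution of base quality at each read position
- overall distribution of base quality
- proportion of each base at each read position, to see how far the bases separate from one another
- GC content distribution
- N content at each read position
- whether the reads still contain sequencing adapters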
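
To get a feel for what two of these metrics mean, GC content and N content can be counted directly from a FASTQ file with awk. This is a simplified illustration, not what fastqc does internally; fastq_stats is a name made up here, and it assumes the usual four lines per record with no wrapped sequence lines.

```shell
# Print read count, total bases, GC percentage and N count for FASTQ input.
# Reads the files given as arguments, or standard input if there are none.
fastq_stats() {
    awk '
        NR % 4 == 2 {                       # line 2 of each record is the sequence
            reads++
            bases += length($0)
            seq = toupper($0)
            gc += gsub(/[GC]/, "", seq)     # gsub returns the number of replacements
            n  += gsub(/N/, "", seq)
        }
        END {
            if (bases == 0) { print "reads=0 bases=0 gc=NA n=0"; exit }
            printf "reads=%d bases=%d gc=%.1f%% n=%d\n", reads, bases, 100 * gc / bases, n
        }' "$@"
}

# Usage:
# fastq_stats /your_path/SRRxxxxxx_1.fastq
# gzip -dc /your_path/SRRxxxxxxx_1.fq.gz | fastq_stats
```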

1.jpg

As Illumina sequencing keeps improving, current sequencing data is generally of acceptable quality.

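Step5. Quality control

Software: fastp
Install: wget http://opengene.org/fastp/fastp

chmod 755 ./fastp
./fastp  # run it once to confirm the binary works
pwd  # print the directory that holds fastp
export PATH=/your_path:$PATH  # add that directory to PATH so fastp can be called from anywhere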
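
Running the export line repeatedly keeps prepending the same directory. A small guard avoids the duplicates; add_to_path is a helper name invented for this note.

```shell
# Prepend directory $1 to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                         # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

# Usage:
# add_to_path /your_path
# command -v fastp    # prints the full path if fastp is now found
```

The change only lasts for the current shell session; to make it permanent, the call (or the export line) would go into a startup file such as ~/.bashrc.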
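fastp -i /your_path/SRRxxxxxxx_1.fq.gz -I /your_path/SRRxxxxxxx_2.fq.gz -o /your_path/cleandata/cleanSRRxxxxxxx_1.fq.gz -O /your_path/cleandata/cleanSRRxxxxxxx_2.fq.gz -c -q 20 -w 8
fastp -i /your_path/SRRxxxxxxx_1.fq.gz -I /your_path/SRRxxxxxxx_2.fq.gz -o /your_path/cleandata/cleanSRRxxxxxxx_1.fq.gz -O /your_path/cleandata/cleanSRRxxxxxxx_2.fq.gz -c -q 20 -u 50 -n 15 -5 -3 -M 20 -w 8
    -c overlap-based base correction; only applies to paired-end reads
    -w number of threads; 8 is recommended
    -q quality value below which a base counts as low quality; default 15
    -u percentage of low-quality bases allowed in a read; default 40, meaning 40%. If either read of a pair fails, the whole pair is discarded
    -n discard reads with too many N bases; default 5. The 15 here is a base count, i.e. 10% of a typical 150 bp paired-end read
    -5 trim reads by quality from the 5' end with a sliding window; the resulting reads may differ in length. This is a switch and takes no value
    -3 trim reads by quality from the 3' end with a sliding window; the resulting reads may differ in length. This is a switch and takes no value
    -M mean quality required within the sliding window used by -5 and -3; default 20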
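
For several samples, typing the four file names by hand gets error-prone. The sketch below derives them from the R1 file name and prints the command instead of running it, so it can be checked first. fastp_cmd is a hypothetical helper; it assumes the _1.fq.gz / _2.fq.gz naming used above and paths without spaces, and it adds -j and -h so that each sample gets its own JSON and HTML report.

```shell
# Print the fastp command for one sample.
#   $1  R1 file, named <sample>_1.fq.gz
#   $2  output directory
fastp_cmd() {
    r1="$1"
    outdir="$2"
    r2="${r1%_1.fq.gz}_2.fq.gz"          # mate file: swap the _1 suffix for _2
    base="$(basename "$r1" _1.fq.gz)"    # sample name without directory and suffix
    echo "fastp -i $r1 -I $r2" \
         "-o $outdir/clean${base}_1.fq.gz -O $outdir/clean${base}_2.fq.gz" \
         "-c -q 20 -w 8 -j $outdir/${base}.fastp.json -h $outdir/${base}.fastp.html"
}

# Usage: print every command, then pipe to sh once they look right.
# for r1 in /your_path/*_1.fq.gz; do fastp_cmd "$r1" /your_path/cleandata; done
# for r1 in /your_path/*_1.fq.gz; do fastp_cmd "$r1" /your_path/cleandata; done | sh
```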

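See the official documentation for the full list of options.
Finally, inspect the clean data. fastp also generates a report of its own that can be viewed directly.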

2.jpg

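My knowledge is limited, so if you spot any mistakes, please point them out in the comments. Criticism and discussion are very welcome. Thank you!

References:
https://www.zhihu.com/question/26011991 Author: 黄树嘉
https://blog.csdn.net/weixin_42953727/article/details/90576214 Author: weixin_42953727
https://www.jianshu.com/p/817450b99461 Author: 十三而舍
https://www.jianshu.com/p/762601f91539 Author: wo_monic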