Biostar Study Notes (4): GenBank, FASTA, FASTQ, and downloading SRA files from NCBI

Sequence data formats

1. Common sequence data formats include GenBank, FASTA, and FASTQ. GenBank and FASTA files often hold curated sequence information, whereas FASTQ files usually hold experimentally obtained sequencing reads.

(1) GenBank file format

GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on the GenBank format can be found in the NCBI GenBank documentation.

When do we use the GenBank format?

The GenBank format can represent a wide variety of information while keeping it human-readable. It is not well suited to data analysis.

(2) FASTA format

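In bioinformatics, FASTA is a text-based format for recording nucleotide or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The format also allows a name and comments to precede each sequence. It was originally defined by the FASTA software package but has since become a standard in bioinformatics.
The simplicity of FASTA makes sequences easy to manipulate and analyze with text-processing tools and scripting languages such as Python, Ruby, and Perl.
FASTA represents nucleotide or protein sequences only; it does not contain sequence quality information.
Reference: Wikipedia, FASTA format (Chinese edition)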
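
To make the layout concrete, here is a minimal Python sketch of a FASTA reader. It assumes the usual convention that a line starting with ">" opens a record and that the sequence may wrap over several lines. The function name parse_fasta and the example records are made up for illustration; for real work a library such as Biopython is the safer choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record name: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: the name is the first word after ">",
            # anything after it is a free-text description.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the current record (sequences may wrap).
            records[name] += line
    return records


# Hypothetical two-record example: one nucleotide, one peptide sequence.
example = """>seq1 hypothetical example record
ACGTAC
GTNN
>seq2
MKV
"""
records = parse_fasta(example.splitlines())
print(records)
```

The header carries only a name and a description, which is why no quality information can be stored in this format.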

(3) FASTQ file format

FASTQ extends the FASTA format with a per-base sequencing quality score (Phred score).
Please refer to the following references:

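Interpreting FASTA and FASTQ format files (blog post, in Chinese)
Wikipedia: FASTQ format (Simplified Chinese or English)
In a FASTQ file, each sequence record normally consists of four lines:
Line 1 starts with @, followed by the sequence identifier and a description (similar to the FASTA description line).
Line 2 is the sequence itself.
Line 3 starts with +, optionally followed by the sequence identifier and description again.
Line 4 holds the quality scores, one character per base of line 2, so it must be exactly as long as line 2.
In the Sanger (Phred+33) encoding, the character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):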

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
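
The four-line layout and the character-to-score mapping can be sketched in Python as follows. This assumes the Sanger (Phred+33) encoding, in which the score is the ASCII code of the character minus 33, and the standard Phred definition of error probability, 10^(-Q/10). The function names and the example read are made up for illustration.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 encoding ('!' encodes score 0)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Phred definition: P(base call is wrong) = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier line, sequence, scores) for each four-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (l.rstrip("\n") for l in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], seq, decode_quality(qual)


# Hypothetical single-read example.
example = ["@read1 hypothetical", "GATTA", "+", "!5I~+"]
reads = list(read_fastq(example))
print(reads)
print(error_probability(20))
```

A score of 20 therefore corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.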

Further reading:
Differences between FASTA, FASTQ and SAM formats

2. Databases that contain gene sequencing data

NCBI GEO: search for datasets (sequencing data from a series of participants)
NCBI SRA: search for sequencing data from individual participants
ArrayExpress: Experiments are submitted directly to ArrayExpress or are imported from the NCBI Gene Expression Omnibus database. For high-throughput sequencing based experiments the raw data is brokered to the European Nucleotide Archive, while the experiment descriptions and processed data are archived in ArrayExpress.
European Nucleotide Archive: Learn more about how to use ENA by reading ENA: Guidelines and Tips.

I prefer NCBI GEO and SRA because I can download SRA files with Aspera, which is very fast. It is best to keep the Aspera Connect software up to date.

Install Aspera Connect on Ubuntu Linux

mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.
# Create a symbolic link so that ascp can be found on the PATH
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h # help
ascp -A # version

If an older version is installed, uninstall it before installing a newer version of Aspera Connect. In practice, this means deleting the related files in the following locations:

# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
# Replace {connect build #} with the build number in the installed plug-in's file name
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
rm -rf ~/.aspera/connect

3. How to download SRA files from NCBI SRA database?

The SRA group recommends the prefetch program provided with the SRA Toolkit. More details can be found in the Download Guide.

1. Download SRA files by using prefetch

I don't recommend installing the SRA Toolkit with sudo apt-get install sra-toolkit because the packaged version may be outdated. I personally prefer to install the latest release.
Downloaded SRA files are saved to the default folder ~/ncbi/public/sra.

# Install SRAtoolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# Add the toolkit's bin directory to PATH
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc

prefetch can download SRA files over several transports, and by default it tries Aspera first. To make prefetch use only Aspera, use the following commands.

mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
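# Option 1: generate the SRA run list manually
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222$i" >> GSE48240.txt; done
# Option 2: generate the SRA run list with esearch and efetch (NCBI Entrez Direct)
# Both options append to the same file, so use only one of them to avoid duplicate entries
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' | grep SRR >> GSE48240.txt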
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt
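
As an alternative to the cut and grep pipeline, the run accessions can be pulled out of the runinfo CSV by column name. The sketch below is in Python and assumes the runinfo output has a column called Run, with possible repeated header rows and blank lines when results are returned in batches. The function name and the abbreviated sample table are made up for illustration and are not real query results.

```python
import csv
import io
from typing import List


def runs_from_runinfo(runinfo_csv: str) -> List[str]:
    """Return unique run accessions from the 'Run' column, in input order."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    runs: List[str] = []
    for row in reader:
        run = (row.get("Run") or "").strip()
        # Keep only accession-like values; this also drops repeated header rows.
        if run[:3] in ("SRR", "ERR", "DRR") and run not in runs:
            runs.append(run)
    return runs


# Made-up, abbreviated runinfo text (real output has many more columns).
sample = """Run,spots,LibraryLayout
SRR922221,100,SINGLE
SRR922222,120,SINGLE

Run,spots,LibraryLayout
SRR922223,90,SINGLE
SRR922221,100,SINGLE
"""
run_list = runs_from_runinfo(sample)
print("\n".join(run_list))
```

Selecting by column name keeps working if the column order changes, and removing duplicates avoids listing the same run twice.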

Alternatively, you can use curl, wget, or ftp with the generated download links, but these are much slower.

2. Convert SRA files to FASTQ files on the fly

This is the better approach if you lack the disk space to keep the SRA files. fastq-dump converts SRA data to FASTQ files on the fly.

# xargs appends each accession to the command line; put echo before fastq-dump for a dry run
xargs -n 1 fastq-dump --split-files < GSE48240.txt
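
The same one-command-per-accession idea can be sketched in Python, which makes it easy to inspect the commands before running anything. The function name is made up for illustration and the accession lines are placeholders standing in for the contents of GSE48240.txt. The sketch only prints the commands and does not run them.

```python
import shlex
from typing import Iterable, List


def fastq_dump_commands(accessions: Iterable[str]) -> List[List[str]]:
    """Build one 'fastq-dump --split-files <accession>' command per non-empty line."""
    commands: List[List[str]] = []
    for line in accessions:
        accession = line.strip()
        if accession:
            commands.append(["fastq-dump", "--split-files", accession])
    return commands


# Placeholder lines, as they would be read from the accession list file.
accession_lines = ["SRR922221\n", "SRR922222\n", "\n"]
commands = fastq_dump_commands(accession_lines)
for argv in commands:
    # Dry run: print each command instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in argv))
    # To execute for real, one could call subprocess.run(argv, check=True) here.
```

Building each command as an argument list rather than a single string avoids shell quoting problems if it is later passed to subprocess.run.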

Other

Renaming individual variables in R with the names() function (the reshape package is an alternative)

names(leadership)

names(leadership)[2] <- "testDate"

names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")

How do I remove part of a string?
https://stackoverflow.com/questions/9704213/r-remove-part-of-string
gsub
sub_str

# install bioawk
sudo apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin

# Download and unzip the file on the fly. 
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa

# Look at the file
cat chr22.fa | head -4

# Count how many "N" are in chr22 sequence
cat chr22.fa | grep -o N  | wc -l

# Count how many bases are in Chr22?
cat chr22.fa | bioawk -c fastx '{ print length($seq) }'
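
For comparison, the same two counts can be sketched in plain Python without bioawk. The function name and the tiny in-memory example are made up for illustration. One difference from grep -o N is that this sketch uppercases the sequence first, so lowercase (soft-masked) n would also be counted as N.

```python
from collections import Counter
from typing import Iterable


def base_counts(lines: Iterable[str]) -> Counter:
    """Count each base over all sequence lines, skipping FASTA header lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        counts.update(line.strip().upper())
    return counts


# Tiny made-up example standing in for chr22.fa.
example = [">chrTest hypothetical\n", "NNNNacgt\n", "ACGTN\n"]
counts = base_counts(example)
total_bases = sum(counts.values())
print("N:", counts["N"])
print("total bases:", total_bases)
```

For the real file, the lines could come from open("chr22.fa"), or from gzip.open("chr22.fa.gz", "rt") to read the compressed download directly. Iterating line by line keeps memory use low even for a whole chromosome.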