Gene expression analysis based on RNA-seq

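    I have recently been analyzing the expression of some wheat genes, and decided to do the analysis bioinformatically with RNA-seq data, which also covers more tissues than I used in my own experiments.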

Read preprocessing

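    After downloading the data, the first step is to remove low-quality sequences and contaminants such as adapter sequences. I combined two programs here, AdapterRemoval and bbduk2; bbduk2 is a subprogram of bbmap.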

AdapterRemoval --file1 input1.fastq.gz --file2 input2.fastq.gz --qualitybase 33 --trimns --minlength 40 --threads 10 --adapter-list ~/adapterremoval-2.1.7/benchmark/adapters/adapters.fasta --output1 output1.fastq.gz --output2 output2.fastq.gz
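    With more than one library, the command above can be generated instead of retyped. The following is only a minimal sketch: the helper name adapterremoval_cmd and its default values are made up for illustration, while the flags are exactly the ones used in the command above.

```python
import os


def adapterremoval_cmd(in1, in2, out1, out2, adapters, threads=10, minlength=40):
    """Build the AdapterRemoval argument list for one paired-end library.

    Hypothetical helper: the name and defaults are illustrative only;
    the flags mirror the command shown in the text.
    """
    return [
        'AdapterRemoval',
        '--file1', in1, '--file2', in2,
        '--qualitybase', '33',
        '--trimns',
        '--minlength', str(minlength),
        '--threads', str(threads),
        '--adapter-list', os.path.expanduser(adapters),
        '--output1', out1, '--output2', out2,
    ]


cmd = adapterremoval_cmd(
    'input1.fastq.gz', 'input2.fastq.gz',
    'output1.fastq.gz', 'output2.fastq.gz',
    '~/adapterremoval-2.1.7/benchmark/adapters/adapters.fasta',
)
print(' '.join(cmd))
# To actually run it: subprocess.check_call(cmd)
```

    Passing the arguments as a list keeps file names with spaces intact and avoids going through a shell.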

    Typing AdapterRemoval in a terminal prints the detailed parameters, as follows:

AdapterRemoval ver. 2.1.7

This program searches for and removes remnant adapter sequences from
your read data.  The program can analyze both single end and paired end
data.  For detailed explanation of the parameters, please refer to the
man page.  For comments, suggestions  and feedback please contact Stinus
Lindgreen ([email protected]) and Mikkel Schubert ([email protected]).

If you use the program, please cite the paper:
    Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid
    adapter trimming, identification, and read merging.
    BMC Research Notes, 12;9(1):88.

    http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2


Arguments:                           Description:
  --help                             Display this message.
  --version                          Print the version string.

  --file1 FILE                       Input file containing mate 1 reads or single-ended reads [REQUIRED].
  --file2 FILE                       Input file containing mate 2 reads [OPTIONAL].

FASTQ OPTIONS:
  --qualitybase BASE                 Quality base used to encode Phred scores in input; either 33, 64, or solexa [current: 33].
  --qualitybase-output BASE          Quality base used to encode Phred scores in output; either 33, 64, or solexa. By default, reads will be written in the same format as the that specified using --qualitybase.
  --qualitymax BASE                  Specifies the maximum Phred score expected in input files, and used when writing output. ASCII encoded values are limited to the characters '!' (ASCII = 33) to'~' (ASCII = 126), meaning that possible scores are 0 - 93 with offset 33, and 0 - 62 for offset 64 and Solexa scores [default: 41].
  --mate-separator CHAR              Character separating the mate number (1 or 2) from the read name in FASTQ records [default: '/'].
  --interleaved                      This option enables both the --interleaved-input option and the
                                       --interleaved-output option [current: off].
  --interleaved-input                The (single) input file provided contains both the mate 1 and mate 2 reads, one pair after the other, with one mate 1 reads followed by one mate 2 read. This option is implied by the --interleaved option [current: off].
  --interleaved-output               If set, trimmed paired-end reads are written to a single file containing mate 1 and mate 2 reads, one pair after the other. This option is implied by the --interleaved option [current: off].

OUTPUT FILES:
  --basename BASENAME                Default prefix for all output files for which no filename was explicitly set [current: your_output].
  --settings FILE                    Output file containing information on the parameters used in the run as well as overall statistics on the reads after trimming [default: BASENAME.settings]
  --output1 FILE                     Output file containing trimmed mate1 reads [default: BASENAME.pair1.truncated (PE), BASENAME.truncated (SE), or BASENAME.paired.truncated (interleaved PE)]
  --output2 FILE                     Output file containing trimmed mate 2 reads [default: BASENAME.pair2.truncated (only used in PE mode, but not if --interleaved-output is enabled)]
  --singleton FILE                   Output file to which containing paired reads for which the mate has been discarded [default: BASENAME.singleton.truncated]
  --outputcollapsed FILE             If --collapsed is set, contains overlapping mate-pairs which have been merged into a single read (PE mode) or reads for which the adapter was identified by a minimum overlap, indicating that the entire template molecule is present. This does not include which have subsequently been trimmed due to low-quality or ambiguous nucleotides [default: BASENAME.collapsed]
  --outputcollapsedtruncated FILE    Collapsed reads (see --outputcollapsed) which were trimmed due the presence of low-quality or ambiguous nucleotides [default: BASENAME.collapsed.truncated]
  --discarded FILE                   Contains reads discarded due to the --minlength, --maxlength or --maxns options [default: BASENAME.discarded]

OUTPUT COMPRESSION:
  --gzip                             Enable gzip compression [current: off]
  --gzip-level LEVEL                 Compression level, 0 - 9 [current: 6]
  --bzip2                            Enable bzip2 compression [current: off]
  --bzip2-level LEVEL                Compression level, 0 - 9 [current: 9]

TRIMMING SETTINGS:
  --adapter1 SEQUENCE                Adapter sequence expected to be found in mate 1 reads [current: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG].
  --adapter2 SEQUENCE                Adapter sequence expected to be found in mate 2 reads [current: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT].
  --adapter-list FILENAME            Read table of white-space separated adapters pairs, used as if the first column was supplied to --adapter1, and the second column was supplied to --adapter2; only the first adapter in each pair is required SE trimming mode [current:<not set>].

  --mm MISMATCH_RATE                 Max error-rate when aligning reads and/or adapters. If > 1, the max error-rate is set to 1 / MISMATCH_RATE; if < 0, the defaults are used, otherwise the user-supplied value is used directly. [defaults: 1/3 for trimming; 1/10 when identifing adapters].
  --maxns MAX                        Reads containing more ambiguous bases (N) than this number after trimming are discarded [current: 1000].
  --shift N                          Consider alignments where up to N nucleotides are missing from the 5' termini [current: 2].

  --trimns                           If set, trim ambiguous bases (N) at 5'/3' termini [current: off]
  --trimqualities                    If set, trim bases at 5'/3' termini with quality scores <= to --minquality value [current: off]
  --minquality PHRED                 Inclusive minimum; see --trimqualities for details [current: 2]
  --minlength LENGTH                 Reads shorter than this length are discarded following trimming [current: 15].
  --maxlength LENGTH                 Reads longer than this length are discarded following trimming [current:4294967295].
  --collapse                         When set, paired ended read alignments of --minalignmentlength or more bases are combined into a single consensus sequence, representing the complete insert,and written to either basename.collapsed or basename.collapsed.truncated (if trimmed due to low-quality bases following collapse); for single-ended reads,putative complete inserts are identified as having at least --minalignmentlength bases overlap with the adapter sequence, and are written to the the same files [current: off].
  --minalignmentlength LENGTH        If --collapse is set, paired reads must overlap at least this number of bases to be collapsed, and single-ended reads must overlap at least this number of bases with the adapter to be considered complete template molecules [current:11].
  --minadapteroverlap LENGTH         In single-end mode, reads are only trimmed if the overlap between read and the adapter is at least X bases long, not counting ambiguous nucleotides (N); this is independant of the --minalignmentlength when using --collapse, allowing a conservative selection of putative complete inserts while ensuring that all possible adapter contamination is trimmed [current: 0].

DEMULTIPLEXING:
  --barcode-list FILENAME            List of barcodes or barcode pairs for single or double-indexed demultiplexing. Note that both indexes should be specified for both single-end and paired-end trimming, if double-indexed multiplexing was used, in order to ensure that the demultiplexed reads can be trimmed correctly [current: <not set>].
  --barcode-mm N                     Maximum number of mismatches allowed when counting mismatches in both the mate 1 and the mate 2 barcode for paired reads.
  --barcode-mm-r1 N                  Maximum number of mismatches allowed for the mate 1 barcode; if not set, this value is equal to the '--barcode-mm' value; cannot be higher than the '--barcode-mm value'.
  --barcode-mm-r2 N                  Maximum number of mismatches allowed for the mate 2 barcode; if not set, this value is equal to the '--barcode-mm' value; cannot be higher than the '--barcode-mm value'.

MISC:
  --identify-adapters                Attempt to identify the adapter pair of PE reads, by searching for overlapping reads [current: off].
  --seed SEED                        Sets the RNG seed used when choosing between bases with equal Phred scores when collapsing. Note that runs are not deterministic if more than one thread is used. If not specified, a seed is generated using the current time.
  --threads THREADS                  Maximum number of threads [current: 1]

The --identify-adapters option can be used to identify the adapter sequences in PE reads.
The bbduk2 command is as follows:

/data1/masw/bbmap/bbduk2.sh -da in=ATW_AKOSW_2_1_D0KD1ACXX.IND12.fastq_1.gz IN2=ATW_AKOSW_2_2_D0KD1ACXX.IND12.fastq_1.gz out=ATW_AKOSW_2_1_D0KD1ACXX.IND12.fastq_2.gz out2=ATW_AKOSW_2_2_D0KD1ACXX.IND12.fastq_2.gz stats=1.2.txt k=20 minlength=40 mink=8 hdist=2 ref=/data1/masw/bbmap/resources/sequencing_artifacts.fa.gz tbo entropy=0.5 entropywindow=50 entropyk=5

Likewise, running /data1/masw/bbmap/bbduk2.sh in a terminal without arguments prints the detailed parameters:

Written by Brian Bushnell
Last modified June 27, 2016

BBDuk2 is like BBDuk but can kfilter, kmask, and ktrim in a single pass.
It does not replace BBDuk, and is only provided to allow maximally efficient
pipeline integration when multiple steps will be performed.  The syntax is 
slightly different.

Description:  Compares reads to the kmers in a reference dataset, optionally 
allowing an edit distance. Splits the reads into two outputs - those that 
match the reference, and those that don't. Can also trim (remove) the matching 
parts of the reads rather than binning the reads.

Usage:  bbduk2.sh in=<file> out=<file> fref=<file>

Input may be stdin or a fasta or fastq file, compressed or uncompressed.
If you pipe via stdin/stdout, please include the file type; e.g. for gzipped 
fasta input, set in=stdin.fa.gz


Input parameters:
in=<file>           Main input. in=stdin.fq will pipe from stdin.
in2=<file>          Input for 2nd read of pairs in a different file.
fref=<file,file>    Comma-delimited list of fasta reference files for filtering.
rref=<file,file>    Comma-delimited list of fasta reference files for right-trimming.
lref=<file,file>    Comma-delimited list of fasta reference files for left-trimming.
mref=<file,file>    Comma-delimited list of fasta reference files for masking.
fliteral=<seq,seq>  Comma-delimited list of literal sequences for filtering.
rliteral=<seq,seq>  Comma-delimited list of literal sequences for right-trimming.
lliteral=<seq,seq>  Comma-delimited list of literal sequences for left-trimming.
mliteral=<seq,seq>  Comma-delimited list of literal sequences for masking.
touppercase=f       (tuc) Change all bases upper-case.
interleaved=auto    (int) t/f overrides interleaved autodetection.
qin=auto            Input quality offset: 33 (Sanger), 64, or auto.
reads=-1            If positive, quit after processing X reads or pairs.
copyundefined=f     (cu) Process non-AGCT IUPAC reference bases by making all
                    possible unambiguous copies.  Intended for short motifs
                    or adapter barcodes, as time/memory use is exponential.

Output parameters:
out=<file>          (outnonmatch) Write reads here that do not contain 
                    kmers matching the database.  'out=stdout.fq' will pipe 
                    to standard out.
out2=<file>         (outnonmatch2) Use this to write 2nd read of pairs to a 
                    different file.
outm=<file>         (outmatch) Write reads here that contain kmers matching
                    the database.
outm2=<file>        (outmatch2) Use this to write 2nd read of pairs to a 
                    different file.
outs=<file>         (outsingle) Use this to write singleton reads whose mate 
                    was trimmed shorter than minlen.
stats=<file>        Write statistics about which contamininants were detected.
refstats=<file>     Write statistics on a per-reference-file basis.
rpkm=<file>         Write RPKM for each reference sequence (for RNA-seq).
dump=<file>         Dump kmer tables to a file, in fasta format.
nzo=t               Only write statistics about ref sequences with nonzero hits.
overwrite=t         (ow) Grant permission to overwrite files.
showspeed=t         (ss) 'f' suppresses display of processing speed.
ziplevel=2          (zl) Compression level; 1 (min) through 9 (max).
fastawrap=80        Length of lines in fasta output.
qout=auto           Output quality offset: 33 (Sanger), 64, or auto.
statscolumns=3      (cols) Number of columns for stats output, 3 or 5.
                    5 includes base counts.
rename=f            Rename reads to indicate which sequences they matched.
refnames=f          Use names of reference files rather than scaffold IDs.
trd=f               Truncate read and ref names at the first whitespace.
ordered=f           Set to true to output reads in same order as input.

Histogram output parameters:
bhist=<file>        Base composition histogram by position.
qhist=<file>        Quality histogram by position.
qchist=<file>       Count of bases with each quality value.
aqhist=<file>       Histogram of average read quality.
bqhist=<file>       Quality histogram designed for box plots.
lhist=<file>        Read length histogram.
gchist=<file>       Read GC content histogram.
gcbins=100          Number gchist bins.  Set to 'auto' to use read length.

Histograms for sam files only (requires sam format 1.4 or higher):

ehist=<file>        Errors-per-read histogram.
qahist=<file>       Quality accuracy histogram of error rates versus quality 
                    score.
indelhist=<file>    Indel length histogram.
mhist=<file>        Histogram of match, sub, del, and ins rates by read location.
idhist=<file>       Histogram of read count versus percent identity.
idbins=100          Number idhist bins.  Set to 'auto' to use read length.

Processing parameters:
k=27                Kmer length used for finding contaminants.  Contaminants 
                    shorter than k will not be found.  k must be at least 1.
rcomp=t             Look for reverse-complements of kmers in addition to 
                    forward kmers.
maskmiddle=t        (mm) Treat the middle base of a kmer as a wildcard, to 
                    increase sensitivity in the presence of errors.
minkmerhits=1       (mkh) Reads need at least this many matching kmers 
                    to be considered as matching the reference.
hammingdistance=0   (hdist) Maximum Hamming distance for ref kmers (subs only).
                    Memory use is proportional to (3*K)^hdist.
qhdist=0            Hamming distance for query kmers; impacts speed, not memory.
editdistance=0      (edist) Maximum edit distance from ref kmers (subs 
                    and indels).  Memory use is proportional to (8*K)^edist.
hammingdistance2=0  (hdist2) Sets hdist for short kmers, when using mink.
qhdist2=0           Sets qhdist for short kmers, when using mink.
editdistance2=0     (edist2) Sets edist for short kmers, when using mink.
forbidn=f           (fn) Forbids matching of read kmers containing N.
                    By default, these will match a reference 'A' if 
                    hdist>0 or edist>0, to increase sensitivity.
removeifeitherbad=t (rieb) Paired reads get sent to 'outmatch' if either is 
                    match (or either is trimmed shorter than minlen).  
                    Set to false to require both.
findbestmatch=f     (fbm) If multiple matches, associate read with sequence 
                    sharing most kmers.  Reduces speed.
skipr1=f            Don't do kmer-based operations on read 1.
skipr2=f            Don't do kmer-based operations on read 2.
ecco=f              For overlapping paired reads only.  Performs error-
                    correction with BBMerge prior to kmer operations.
recalibrate=f       (recal) Recalibrate quality scores.  Requires calibration
                    matrices generated by CalcTrueQuality.
sam=<file,file>     If recalibration is desired, and matrices have not already
                    been generated, BBDuk will create them from the sam file.

Speed and Memory parameters:
threads=auto        (t) Set number of threads to use; default is number of 
                    logical processors.
prealloc=f          Preallocate memory in table.  Allows faster table loading 
                    and more efficient memory usage, for a large reference.
monitor=f           Kill this process if it crashes.  monitor=600,0.01 would 
                    kill after 600 seconds under 1% usage.
minrskip=1          (mns) Force minimal skip interval when indexing reference 
                    kmers.  1 means use all, 2 means use every other kmer, etc.
maxrskip=1          (mxs) Restrict maximal skip interval when indexing 
                    reference kmers. Normally all are used for scaffolds<100kb, 
                    but with longer scaffolds, up to maxrskip-1 are skipped.
rskip=              Set both minrskip and maxrskip to the same value.
                    If not set, rskip will vary based on sequence length.
qskip=1             Skip query kmers to increase speed.  1 means use all.
speed=0             Ignore this fraction of kmer space (0-15 out of 16) in both
                    reads and reference.  Increases speed and reduces memory.
Note: Do not use more than one of 'speed', 'qskip', and 'rskip'.

Trimming/Filtering/Masking parameters:
Note - for BBDuk2, kmer filtering, trimming, and masking are independent,
and all can be performed at the same time.

ktrim=f             Trim reads to remove bases matching reference kmers.
                    Values: 
                            f (don't trim), 
                            r (trim to the right), 
                            l (trim to the left)
kmask=f             Replace bases matching ref kmers with another symbol.
                    Allows any non-whitespace character other than t or f,
                    and processes short kmers on both ends.  'kmask=lc' will
                    convert masked bases to lowercase.
mink=0              Look for shorter kmers at read tips down to this length, 
                    when k-trimming or masking.  0 means disabled.  Enabling
                    this will disable maskmiddle.
qtrim=f             Trim read ends to remove bases with quality below trimq.
                    Performed AFTER looking for kmers.
                    Values: 
                            rl (trim both ends), 
                            f (neither end), 
                            r (right end only), 
                            l (left end only),
                            w (sliding window)
trimq=6             Regions with average quality BELOW this will be trimmed.
minlength=10        (ml) Reads shorter than this after trimming will be 
                    discarded.  Pairs will be discarded if both are shorter.
mlf=0               (minlengthfraction) Reads shorter than this fraction of 
                    original length after trimming will be discarded.
maxlength=          Reads longer than this after trimming will be discarded.
                    Pairs will be discarded only if both are longer.
minavgquality=0     (maq) Reads with average quality (after trimming) below 
                    this will be discarded.
maqb=0              If positive, calculate maq from this many initial bases.
chastityfilter=f    (cf) Discard reads with id containing ' 1:Y:' or ' 2:Y:'.
barcodefilter=f     Remove reads with unexpected barcodes if barcodes is set,
                    or barcodes containing 'N' otherwise.  A barcode must be
                    the last part of the read header.
barcodes=           Comma-delimited list of barcodes or files of barcodes.
maxns=-1            If non-negative, reads with more Ns than this 
                    (after trimming) will be discarded.
mcb=0               (minconsecutivebases) Discard reads without at least 
                    this many consecutive called bases.
ottm=f              (outputtrimmedtomatch) Output reads trimmed to shorter 
                    than minlength to outm rather than discarding.
tp=0                (trimpad) Trim this much extra around matching kmers.
tbo=f               (trimbyoverlap) Trim adapters based on where paired 
                    reads overlap.
strictoverlap=t     Adjust sensitivity for trimbyoverlap mode.
minoverlap=14       Require this many bases of overlap for detection.
mininsert=50        Require insert size of at least this for overlap. 
                    Should be reduced to 16 for small RNA sequencing.
tpe=f               (trimpairsevenly) When kmer right-trimming, trim both 
                    reads to the minimum length of either.
forcetrimleft=0     (ftl) If positive, trim bases to the left of this position
                    (exclusive, 0-based).
forcetrimright=0    (ftr) If positive, trim bases to the right of this position
                    (exclusive, 0-based).
forcetrimright2=0   (ftr2) If positive, trim this many bases on the right end.
forcetrimmod=0      (ftm) If positive, right-trim length to be equal to zero,
                    modulo this number.
restrictleft=0      If positive, only look for kmer matches in the 
                    leftmost X bases.
restrictright=0     If positive, only look for kmer matches in the 
                    rightmost X bases.
mingc=0             Discard reads with GC content below this.
maxgc=1             Discard reads with GC content above this.
gcpairs=t           Use average GC of paired reads.
                    Also affects gchist.

Entropy/Complexity parameters:
entropy=-1          Set between 0 and 1 to filter reads with entropy below
                    that value.  Higher is more stringent.
entropywindow=50    Calculate entropy using a sliding window of this length.
entropyk=5          Calculate entropy using kmers of this length.
minbasefrequency=0  Discard reads with a minimum base frequency below this.

Cardinality estimation:
cardinality=f           (loglog) Count unique kmers using the LogLog algorithm.
loglogk=31              Use this kmer length for counting.
loglogbuckets=1999      Use this many buckets for counting.

Java Parameters:

-Xmx                This will be passed to Java to set memory usage, overriding 
                    the program's automatic memory detection. -Xmx20g will 
                    specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  
                    The max is typically 85% of physical memory.

There is a changelog at /bbmap/docs/changelog_bbduk.txt
Please contact Brian Bushnell at [email protected] if you encounter any problems.

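    After removing the adapter sequences, check whether the mapping rate has improved; under normal circumstances it should be above 80%. If the mapping rate is still far too low, consider whether the sample itself has quality problems, since this may affect the accuracy of the results.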
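    The 80% check can be automated. hisat2 (used in the alignment step) writes an alignment summary to stderr that ends with a line of the form "NN.NN% overall alignment rate". The sketch below pulls that number out of saved summary text; the function name and the example text are made up for illustration and are not real results.

```python
import re


def overall_alignment_rate(summary_text):
    """Return the 'overall alignment rate' percentage from a hisat2 summary, or None."""
    match = re.search(r'([\d.]+)% overall alignment rate', summary_text)
    return float(match.group(1)) if match else None


# Made-up summary text for illustration only (not real results).
example = "1000 reads; of these:\n72.50% overall alignment rate\n"
rate = overall_alignment_rate(example)
if rate is not None and rate < 80.0:
    print('low mapping rate: %.2f%%' % rate)
```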

Genome index

hisat2-build-l -p 20 ./IWGSC_v1.0/Wheat_IWGSC_WGA_v1.0_pseudomolecules/161010_Chinese_Spring_v1.0_pseudomolecules.fasta IWGSCv1.0_hiast2

Aligning reads to the genome

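    This step uses hisat2. hisat2 aligns very quickly and needs relatively few resources, but the reference genome has to be indexed first. The mapping script I used is: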

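#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Run hisat2 once per line of hisat2_list.txt. Each line holds four
# whitespace-separated fields: mate 1 FASTQ, mate 2 FASTQ,
# novel splice-site output file, SAM output file.

import subprocess


with open('hisat2_list.txt', 'r') as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        input1, input2, output1, output2 = fields
        print(input1, input2)
        # check_call waits for hisat2 to finish and raises if it exits with an error
        subprocess.check_call([
            'hisat2', '-p', '20', '--dta',
            '-x', '../NRGenome_hisat2/NRGenome',
            '--known-splicesite-infile', '../annotation/1.ss',
            '--novel-splicesite-infile', 'all.ss',
            '--novel-splicesite-outfile', output1,
            '-t', '-1', input1, '-2', input2, '-S', output2,
        ])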
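    The script reads hisat2_list.txt, which needs four whitespace-separated columns per line: the two mate files, the file for novel splice sites, and the SAM output. A small sketch of how such rows could be produced; the file naming scheme and sample names here are hypothetical, so adapt them to your own files.

```python
def hisat2_list_row(sample):
    """Return one hisat2_list.txt row for a sample (file naming is hypothetical)."""
    return ' '.join([
        sample + '_1.fastq.gz',   # mate 1 reads, passed to -1
        sample + '_2.fastq.gz',   # mate 2 reads, passed to -2
        sample + '.novel.ss',     # written by --novel-splicesite-outfile
        sample + '.sam',          # written by -S
    ])


# Hypothetical sample names, for illustration only.
rows = [hisat2_list_row(s) for s in ['sampleA', 'sampleB']]
print('\n'.join(rows))
```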

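    The next step is to filter the SAM results, for example keeping only reads with a single hit, or only perfectly matching reads. If you are familiar with the SAM format this kind of filtering is easy, so I will not go into detail here. The detailed hisat2 parameters are listed below: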

No index, query, or output file specified!
HISAT2 version 2.0.4 by Daehwan Kim ([email protected], www.ccb.jhu.edu/people/infphilo)
Usage: 
  hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]

  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <SRA accession number>        Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

Options (defaults in parentheses):

 Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  --qseq             query input files are in Illumina's qseq format
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c                 <m1>, <m2>, <r> are sequences themselves, not files
  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)
  -u/--upto <int>    stop after first <int> reads/pairs (no limit)
  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)
  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)
  --phred33          qualities are Phred+33 (default)
  --phred64          qualities are Phred+64
  --int-quals        qualities encoded as space-delimited integers
  --sra-acc          SRA accession ID

 Alignment:
  --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
  --ignore-quals     treat all quality values as 30 on Phred scale (off)
  --nofw             do not align forward (original) version of read (off)
  --norc             do not align reverse-complement version of read (off)

 Spliced Alignment:
  --pen-cansplice <int>              penalty for a canonical splice site (0)
  --pen-noncansplice <int>           penalty for a non-canonical splice site (12)
  --pen-canintronlen <func>          penalty for long introns (G,-8,1) with canonical splice sites
  --pen-noncanintronlen <func>       penalty for long introns (G,-8,1) with noncanonical splice sites
  --min-intronlen <int>              minimum intron length (20)
  --max-intronlen <int>              maximum intron length (500000)
  --known-splicesite-infile <path>   provide a list of known splice sites
  --novel-splicesite-outfile <path>  report a list of splice sites
  --novel-splicesite-infile <path>   provide a list of novel splice sites
  --no-temp-splicesite               disable the use of splice sites found
  --no-spliced-alignment             disable spliced alignment
  --rna-strandness <string>          Specify strand-specific information (unstranded)
  --tmo                              Reports only those alignments within known transcriptome
  --dta                              Reports alignments tailored for transcript assemblers
  --dta-cufflinks                    Reports alignments tailored specifically for cufflinks

 Scoring:
  --ma <int>         match bonus (0 for --end-to-end, 2 for --local) 
  --mp <int>,<int>   max and min penalties for mismatch; lower qual = lower penalty <2,6>
  --sp <int>,<int>   max and min penalties for soft-clipping; lower qual = lower penalty <1,2>
  --np <int>         penalty for non-A/C/G/Ts in read/ref (1)
  --rdg <int>,<int>  read gap open, extend penalties (5,3)
  --rfg <int>,<int>  reference gap open, extend penalties (5,3)
  --score-min <func> min acceptable alignment score w/r/t read length
                     (L,0.0,-0.2)

 Reporting:
  (default)          look for multiple alignments, report best, with MAPQ
   OR
  -k <int>           report up to <int> alns per read; MAPQ not meaningful
   OR
  -a/--all           report all alignments; very slow, MAPQ not meaningful

 Paired-end:
  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
  --no-mixed         suppress unpaired alignments for paired reads
  --no-discordant    suppress discordant alignments for paired reads

 Output:
  -t/--time          print wall-clock time taken by search phases
  --un <path>           write unpaired reads that didn't align to <path>
  --al <path>           write unpaired reads that aligned at least once to <path>
  --un-conc <path>      write pairs that didn't align concordantly to <path>
  --al-conc <path>      write pairs that aligned concordantly at least once to <path>
  (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
  --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
  --quiet            print nothing to stderr except serious errors
  --met-file <path>  send metrics to file at <path> (off)
  --met-stderr       send metrics to stderr (off)
  --met <int>        report internal counters & metrics every <int> secs (1)
  --no-head          supppress header lines, i.e. lines starting with @
  --no-sq            supppress @SQ header lines
  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field
  --rg <text>        add <text> ("lab:value") to @RG line of SAM header.
                     Note: @RG line only printed when --rg-id is set.
  --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.

 Performance:
  -o/--offrate <int> override offrate of index; must be >= index's offrate
  -p/--threads <int> number of alignment threads to launch (1)
  --reorder          force SAM output order to match order of input reads
  --mm               use memory-mapped I/O for index; many 'bowtie's can share

 Other:
  --qc-filter        filter out reads that are bad according to QSEQ filter
  --seed <int>       seed for random number generator (0)
  --non-deterministic seed rand. gen. arbitrarily instead of using read attributes
  --remove-chrname   remove 'chr' from reference names in alignment
  --add-chrname      add 'chr' to reference names in alignment 
  --version          print version information and quit
  -h/--help          print this usage message
(ERR): hisat2-align exited with value 1
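    As an example of the single-hit filter mentioned before the parameter list: hisat2 adds an NH:i:<n> tag to each alignment giving the number of reported alignments for that read, so NH:i:1 marks a read with one hit. A minimal sketch, assuming that tag is present; the function name and the example records are made up for illustration.

```python
def keep_unique_hits(sam_lines):
    """Yield header lines and alignments tagged NH:i:1 (reported exactly once)."""
    for line in sam_lines:
        if line.startswith('@'):
            yield line  # header lines are always kept
            continue
        # columns 1-11 are mandatory SAM fields; optional tags start at column 12
        optional = line.rstrip('\n').split('\t')[11:]
        if 'NH:i:1' in optional:
            yield line


# Made-up SAM records for illustration only.
header = '@SQ\tSN:chr1A\tLN:1000\n'
r1 = 'r1\t0\tchr1A\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tNH:i:1\n'
r2 = 'r2\t0\tchr1A\t200\t1\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tNH:i:2\n'
kept = list(keep_unique_hits([header, r1, r2]))
print(''.join(kept), end='')
```

    Records without an NH tag, such as unaligned reads, are dropped by this sketch.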

Here I need to keep only the perfectly matching reads, filtered as follows:

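#!/usr/bin/env python
# -*- coding: utf-8 -*-
# For every SAM file listed in sam_file.txt (one path per line), keep the
# header lines and the alignments tagged NM:i:0 (edit distance 0).

import subprocess


with open('sam_file.txt', 'r') as f:
    for line in f:
        sam = line.strip()
        if not sam:
            continue  # skip blank lines
        print(sam)
        # x.sam -> x.perfectmatch.sam
        out = sam[:-3] + 'perfectmatch.sam'
        # '^@' anchors the header pattern to the start of the line; a bare '@'
        # would also keep any read whose quality string contains '@'
        cmd = 'grep -E "^@|NM:i:0" ' + sam + ' > ' + out
        subprocess.check_call(cmd, shell=True)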

Note that this filter overlooks the case of insertions and deletions.
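    One way to make the filter explicit about insertions, deletions and soft clipping is to check the CIGAR string as well as the NM tag. This is only a sketch under stated assumptions, not the method used above: it assumes the aligner writes matches as M and introns as N in the CIGAR (as hisat2 does), and the function name and example records are made up.

```python
import re


def is_perfect_match(sam_line):
    """True if an alignment has NM:i:0 and a CIGAR made only of M and N operations."""
    fields = sam_line.rstrip('\n').split('\t')
    cigar, optional = fields[5], fields[11:]
    # M = aligned bases, N = skipped intron; anything else (I, D, S, ...) is rejected
    only_match_or_intron = re.fullmatch(r'(\d+[MN])+', cigar) is not None
    return only_match_or_intron and 'NM:i:0' in optional


def sam_record(cigar, nm):
    """Build a made-up SAM record with the given CIGAR and NM value."""
    return 'r\t0\tchr1A\t100\t60\t%s\t*\t0\t0\tACGTA\tIIIII\tNM:i:%d\n' % (cigar, nm)


for cigar, nm in [('5M', 0), ('2M100N3M', 0), ('2M1I2M', 1), ('3M2S', 0)]:
    print(cigar, nm, is_perfect_match(sam_record(cigar, nm)))
```

    Header lines should be passed through separately, since they do not have a CIGAR column.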

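    With the SAM files we could assemble transcripts, but the goal of this study is to measure the expression of a gene given its transcript, so that step is not required. For how to assemble transcripts, see the paper "Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown".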

Counting reads per transcript

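    Expression is counted according to each gene's position on the reference genome. These positions are usually stored in gff3 or gtf format and can be obtained with tools such as blastn or gmap (the exon-intron boundaries must be accurate). The gff3 format looks like this: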

chr1A   NRGenome    mRNA    5946352 5946999 .   -   .   ID=UN044011.mrna1;Name=UN044011;Parent=UN044011.path1;coverage=100.0;identity=100.0;matches=648;mismatches=0;indels=0;unknowns=0    
chr1A   NRGenome    exon    5946352 5946999 100 -   .   ID=UN044011.mrna1.exon1;Name=UN044011;Parent=UN044011.mrna1;Target=UN044011 1   648 +   
chr1A   NRGenome    mRNA    9968301 9968632 .   +   .   ID=UN080299.mrna1;Name=UN080299;Parent=UN080299.path1;coverage=100.0;identity=100.0;matches=213;mismatches=0;indels=0;unknowns=0    
chr1A   NRGenome    exon    9968301 9968396 100 +   .   ID=UN080299.mrna1.exon1;Name=UN080299;Parent=UN080299.mrna1;Target=UN080299 1   96  +   
chr1A   NRGenome    exon    9968516 9968632 100 +   .   ID=UN080299.mrna1.exon2;Name=UN080299;Parent=UN080299.mrna1;Target=UN080299 97  213 +   
chr1A   NRGenome    mRNA    12807377    12808514    .   -   .   ID=UN129475.mrna1;Name=UN129475;Parent=UN129475.path1;coverage=100.0;identity=100.0;matches=156;mismatches=0;indels=0;unknowns=0    
chr1A   NRGenome    exon    12808501    12808514    100 -   .   ID=UN129475.mrna1.exon1;Name=UN129475;Parent=UN129475.mrna1;Target=UN129475 1   14  +   
chr1A   NRGenome    exon    12807377    12807518    100 -   .   ID=UN129475.mrna1.exon2;Name=UN129475;Parent=UN129475.mrna1;Target=UN129475 15  156 +   
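
As a quick sanity check on an annotation like the one above, the exon lengths can be summed per `Name` attribute and compared with the `matches=` value on the mRNA line. The sketch below is in Python 3 and assumes tab-delimited GFF3 lines; the function name `exon_lengths_by_name` is hypothetical, and overlapping exons would be double-counted.

```python
from collections import defaultdict


def exon_lengths_by_name(gff3_lines, feature_type="exon", attribute="Name"):
    """Sum the lengths of all features of feature_type, grouped by the given attribute."""
    totals = defaultdict(int)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t", 8)
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        start, end = int(fields[3]), int(fields[4])
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if attribute in attrs:
            # GFF3 coordinates are 1-based and inclusive.
            totals[attrs[attribute]] += end - start + 1
    return dict(totals)
```

For the records shown above, the two exons of UN080299 cover 96 and 117 bases, which adds up to the 213 matches reported on its mRNA line.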

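    With the position information in hand, use featureCounts to compute the expression counts. Only unique reads are counted here. The command is as follows (for each run, only the input gene-position file and the output file need to change):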

featureCounts -T 20 -t exon -g Name --readExtension5 70  --readExtension3 70 -p --donotsort -C -a ../Triticum_aestivum.TGACv1.cds.1.gff3 -o TGAC_unique_in_expression.txt ATW_AOSW_1.perfectmatch.sam ATW_AAOSW_6.perfectmatch.sam ATW_ANOSW_1.perfectmatch.sam ATW_LOSW_5.perfectmatch.sam ATW_ADOSW_1.perfectmatch.sam ATW_AEOSW_1.perfectmatch.sam ATW_DOSW_2.perfectmatch.sam ATW_POSW_6.perfectmatch.sam ATW_IOSW_4.perfectmatch.sam ATW_KOSW_4.perfectmatch.sam ATW_ROSW_7.perfectmatch.sam ATW_ALOSW_3.perfectmatch.sam ATW_TOSW_8.perfectmatch.sam ATW_VOSW_6.perfectmatch.sam ATW_MOSW_5.perfectmatch.sam ATW_NOSW_6.perfectmatch.sam ATW_COSW_1.perfectmatch.sam ATW_AGOSW_2.perfectmatch.sam ATW_GOSW_3.perfectmatch.sam ATW_HOSW_3.perfectmatch.sam ATW_ABOSW_7.perfectmatch.sam ATW_ACOSW_1.perfectmatch.sam ATW_QOSW_7.perfectmatch.sam ATW_AHOSW_3.perfectmatch.sam SRR1175868.perfectmatch.sam SRR1177760.perfectmatch.sam SRR1177761.perfectmatch.sam NG-5789_1A_lib7482.perfectmatch.sam NG-5789_1B_lib7486.perfectmatch.sam NG-5789_2A_lib7483.perfectmatch.sam NG-5789_2B_lib7487.perfectmatch.sam NG-5789_3A_lib7484.perfectmatch.sam NG-5789_3B_lib7488.perfectmatch.sam NG-5789_4A_lib7485.perfectmatch.sam NG-5789_4B_lib7489.perfectmatch.sam ATW_SOSW_8.perfectmatch.sam ATW_AFOSW_2.perfectmatch.sam ATW_AIOSW_2.perfectmatch.sam ATW_AKOSW_2.perfectmatch.sam ATW_FOSW_2.perfectmatch.sam ATW_AMOSW_4.perfectmatch.sam
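
Since only the annotation file and the output file change between runs, the command can also be assembled programmatically. This is a minimal Python 3 sketch using the same options as the command above; the helper name `build_featurecounts_cmd` is hypothetical, and it only builds the argument list without running anything.

```python
def build_featurecounts_cmd(annotation, output, sam_files, threads=20):
    """Build the featureCounts argument list with the options used in this post."""
    cmd = [
        "featureCounts",
        "-T", str(threads),          # number of threads
        "-t", "exon",                # feature type to count
        "-g", "Name",                # attribute used to group exons
        "--readExtension5", "70",
        "--readExtension3", "70",
        "-p",                        # count fragments instead of reads
        "--donotsort",
        "-C",                        # skip pairs mapping to different chromosomes/strands
        "-a", annotation,
        "-o", output,
    ]
    return cmd + list(sam_files)


# Example (not executed here):
#   import subprocess
#   subprocess.check_call(build_featurecounts_cmd(
#       "../Triticum_aestivum.TGACv1.cds.1.gff3",
#       "TGAC_unique_in_expression.txt",
#       ["ATW_AOSW_1.perfectmatch.sam", "ATW_AAOSW_6.perfectmatch.sam"]))
```

Passing a list to `subprocess` rather than a shell string avoids quoting problems when file names contain unusual characters.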

    As before, the full featureCounts options are listed here. Please look up the meaning of each option yourself.

Version 1.5.1

Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 

## Required arguments:

  -a <string>         Name of an annotation file. GTF/GFF format by default.
                      See -F option for more formats.

  -o <string>         Name of the output file including read counts. A separate
                      file including summary statistics of counting results is
                      also included in the output (`<string>.summary')

  input_file1 [input_file2] ...   A list of SAM or BAM format files.

## Options:
# Annotation

  -F <string>         Specify format of provided annotation file. Acceptable
                      formats include `GTF/GFF' and `SAF'. `GTF/GFF' by default.
                      See Users Guide for description of SAF format.

  -t <string>         Specify feature type in GTF annotation. `exon' by 
                      default. Features used for read counting will be 
                      extracted from annotation using the provided value.

  -g <string>         Specify attribute type in GTF annotation. `gene_id' by 
                      default. Meta-features used for read counting will be 
                      extracted from annotation using the provided value.

  -A <string>         Provide a chromosome name alias file to match chr names in
                      annotation with those in the reads. This should be a two-
                      column comma-delimited text file. Its first column should
                      include chr names in the annotation and its second column
                      should include chr names in the reads. Chr names are case
                      sensitive. No column header should be included in the
                      file.

# Level of summarization

  -f                  Perform read counting at feature level (eg. counting 
                      reads for exons rather than genes).

# Overlap between reads and features

  -O                  Assign reads to all their overlapping meta-features (or 
                      features if -f is specified).

  --minOverlap <int>  Minimum number of overlapping bases in a read that is
                      required for read assignment. 1 by default. Number of
                      overlapping bases is counted from both reads if paired
                      end. If a negative value is provided, then a gap of up
                      to specified size will be allowed between read and the
                      feature that the read is assigned to.

  --fracOverlap <float> Minimum fraction of overlapping bases in a read that is
                      required for read assignment. Value should be within range
                      [0,1]. 0 by default. Number of overlapping bases is
                      counted from both reads if paired end. Both this option
                      and '--minOverlap' option need to be satisfied for read
                      assignment.

  --largestOverlap    Assign reads to a meta-feature/feature that has the 
                      largest number of overlapping bases.

  --readExtension5 <int> Reads are extended upstream by <int> bases from their
                      5' end.

  --readExtension3 <int> Reads are extended upstream by <int> bases from their
                      3' end.

  --read2pos <5:3>    Reduce reads to their 5' most base or 3' most base. Read
                      counting is then performed based on the single base the 
                      read is reduced to.

# Multi-mapping reads

  -M                  Multi-mapping reads will also be counted. For a multi-
                      mapping read, all its reported alignments will be 
                      counted. The `NH' tag in BAM/SAM input is used to detect 
                      multi-mapping reads.

# Fractional counting

  --fraction          Assign fractional counts to features. This option must
                      be used together with '-M' or '-O' or both. When '-M' is
                      specified, each reported alignment from a multi-mapping
                      read (identified via 'NH' tag) will carry a fractional
                      count of 1/x, instead of 1 (one), where x is the total
                      number of alignments reported for the same read. When '-O'
                      is specified, each overlapping feature will receive a
                      fractional count of 1/y, where y is the total number of
                      features overlapping with the read. When both '-M' and
                      '-O' are specified, each alignment will carry a fraction
                      count of 1/(x*y).

# Read filtering

  -Q <int>            The minimum mapping quality score a read must satisfy in
                      order to be counted. For paired-end reads, at least one
                      end should satisfy this criteria. 0 by default.

  --splitOnly         Count split alignments only (ie. alignments with CIGAR
                      string containing 'N'). An example of split alignments is
                      exon-spanning reads in RNA-seq data.

  --nonSplitOnly      If specified, only non-split alignments (CIGAR strings do
                      not contain letter 'N') will be counted. All the other
                      alignments will be ignored.

  --primary           Count primary alignments only. Primary alignments are 
                      identified using bit 0x100 in SAM/BAM FLAG field.

  --ignoreDup         Ignore duplicate reads in read counting. Duplicate reads 
                      are identified using bit Ox400 in BAM/SAM FLAG field. The 
                      whole read pair is ignored if one of the reads is a 
                      duplicate read for paired end data.

# Strandness

  -s <int>            Perform strand-specific read counting. Acceptable values:
                      0 (unstranded), 1 (stranded) and 2 (reversely stranded).
                      0 by default.

# Exon-exon junctions

  -J                  Count number of reads supporting each exon-exon junction.
                      Junctions were identified from those exon-spanning reads
                      in the input (containing 'N' in CIGAR string). Counting
                      results are saved to a file named '<output_file>.jcounts'

  -G <string>         Provide the name of a FASTA-format file that contains the
                      reference sequences used in read mapping that produced the
                      provided SAM/BAM files. This optional argument can be used
                      with '-J' option to improve read counting for junctions.

# Parameters specific to paired end reads

  -p                  If specified, fragments (or templates) will be counted
                      instead of reads. This option is only applicable for
                      paired-end reads.

  -B                  Count read pairs that have both ends successfully aligned 
                      only.

  -P                  Check validity of paired-end distance when counting read 
                      pairs. Use -d and -D to set thresholds.

  -d <int>            Minimum fragment/template length, 50 by default.

  -D <int>            Maximum fragment/template length, 600 by default.

  -C                  Do not count read pairs that have their two ends mapping 
                      to different chromosomes or mapping to same chromosome 
                      but on different strands.

  --donotsort         Do not sort reads in BAM/SAM input. Note that reads from 
                      the same pair are required to be located next to each 
                      other in the input.

# Number of CPU threads

  -T <int>            Number of the threads. 1 by default.

# Miscellaneous

  -R                  Output detailed assignment result for each read. A text 
                      file will be generated for each input file, including 
                      names of reads and meta-features/features reads were 
                      assigned to. See Users Guide for more details.

  --tmpDir <string>   Directory under which intermediate files are saved (later
                      removed). By default, intermediate files will be saved to
                      the directory specified in '-o' argument.

  --maxMOp <int>      Maximum number of 'M' operations allowed in a CIGAR
                      string. 10 by default. Both 'X' and '=' are treated as 'M'
                      and adjacent 'M' operations are merged in the CIGAR
                      string.

  -v                  Output version of the program.


Normalizing expression levels

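    FPKM is used here because my data are paired-end (PE); for single-end sequencing data, RPKM can be used instead. I wrote my own Python script for the calculation; anyone else using it will need to adapt it.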

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
__author__ = 'shengwei ma'
__author_email__ = '[email protected]'

import numpy as np

raw_total = [('root_Z10_rep1', 49168553), ('root_Z10_rep2', 44047402), ('root_Z13_rep1', 78098556),
             ('root_Z13_rep2', 38474362), ('root_Z39_rep1', 79981030), ('root_Z39_rep2', 41041508),
             ('stem_Z30_rep1', 46935246), ('stem_Z30_rep2', 38803969), ('stem_Z32_rep1', 51627704),
             ('stem_Z32_rep2', 37219517), ('stem_Z65_rep1', 39849949), ('stem_Z65_rep2', 40299574),
             ('leaf_Z10_rep1', 38168988), ('leaf_Z10_rep2', 43073693), ('leaf_Z23_rep1', 44071613),
             ('leaf_Z23_rep2', 40380776), ('leaf_Z71_rep1', 32810256), ('leaf_Z71_rep2', 35749803),
             ('spike_Z32_rep1', 46203474), ('spike_Z32_rep2', 43612313), ('spike_Z39_rep1', 40406588),
             ('spike_Z39_rep2', 47596209), ('spike_Z65_rep1', 43071042), ('spike_Z65_rep2', 48443902),
             ('carpel', 57881099), ('carpel-like structure', 63914055), ('stamen', 72275259),
             ('latent_lepto_rep1', 31693600), ('latent_lepto_rep2', 40260140), ('diplo_dia_rep1', 56486977),
             ('diplo_dia_rep2', 43990501), ('zygo_pachy_rep1', 37037924), ('zygo_pachy_rep2', 37678253),
             ('metaphaseI_rep1', 26954435), ('metaphaseI_rep2', 32180104), ('grain_Z71_rep1', 44263291),
             ('grain_Z71_rep2', 36875603), ('grain_Z75_rep1', 47740143), ('grain_Z75_rep2', 51819168),
             ('grain_Z85_rep1', 36879170), ('grain_Z85_rep2', 31412470), ('Wheat_Room1_10DPA', 16712256),
             ('Wheat_Room1_10DPA_Rep', 22819483), ('Wheat_Room2_10DPA', 27121510), ('Wheat_Room2_10DPA_Rep', 29453109),
             ('Wheat_Room1_AL_20DPA', 30598515), ('Wheat_Room1_AL_20DPA_Rep', 28518937), ('Wheat_Room2_AL_20DPA', 24838220),
             ('Wheat_Room2_AL_20DPA_Rep', 27715580), ('Wheat_Room1_AL_20DPA_Extra1', 29978007), ('Wheat_Room1_AL_20DPA_Extra2', 30079461),
             ('Wheat_Room1_SE_20DPA', 25140145), ('Wheat_Room1_SE_20DPA_Rep', 24446796), ('Wheat_Room2_SE_20DPA', 21339690),
             ('Wheat_Room2_SE_20DPA_Rep', 22815780),
             ('Wheat_Room1_TC_20DPA', 16629117), ('Wheat_Room1_TC_20DPA_Rep', 27612315), ('Wheat_Room2_TC_20DPA', 25304622),
             ('Wheat_Room2_TC_20DPA_Rep', 25352139), ('Wheat_Room1_REF_20DPA', 29929219), ('Wheat_Room1_REF_20DPA_Rep', 26636425),
             ('Wheat_Room2_REF_20DPA', 24316737), ('Wheat_Room2_REF_20DPA_Rep', 29330096), ('Wheat_Room1_SE_30DPA', 22777481),
             ('Wheat_Room1_SE_30DPA_Rep', 22777481), ('Wheat_Room2_SE_30DPA', 30513836), ('Wheat_Room2_SE_30DPA_Rep', 21486098),
             ('Wheat_Room1_AL_SE_30DPA', 28821672), ('Wheat_Room1_AL_SE_30DPA_Rep', 20134665), ('Wheat_Room2_AL_SE_30DPA', 23721856),
             ('Wheat_Room2_AL_SE_30DPA_Rep', 24896811), ('wheat_23_1', 28444918), ('wheat_23_2', 67968193),
             ('wheat_23_3', 24321425), ('wheat_4_1', 35430306), ('wheat_4_2', 22527710), ('wheat_4_3', 16848204)]


with open('MLJ_unique_expression.txt', 'r') as f:
    # Header row: annotation columns, then one mean per tissue/stage, then one std per tissue/stage.
    print('\t'.join([
        'Geneid', 'Chr', 'Start', 'End', 'Strand', 'Length', 'root_Z10', 'root_Z13', 'root_Z39',
        'stem_Z30', 'stem_Z32', 'stem_Z65', 'leaf_Z10', 'leaf_Z23', 'leaf_Z71',
        'spike_Z32', 'spike_Z39', 'spike_Z65', 'carpel', 'carpel_like_structure',
        'stamen', 'latent_lepto', 'diplo_dia', 'zygo_pachy', 'metaphaseI',
        'grain_Z71', 'grain_Z75', 'grain_Z85', 'Wheat_10DPA', 'Wheat_AL_20DPA',
        'Wheat_SE_20DPA', 'Wheat_TC_20DPA', 'Wheat_REF_20DPA', 'Wheat_SE_30DPA',
        'Wheat_AL.SE_30DPA', 'wheat_23', 'wheat_4', 'root_Z10_std', 'root_Z13_std', 'root_Z39_std',
        'stem_Z30_std', 'stem_Z32_std', 'stem_Z65_std', 'leaf_Z10_std', 'leaf_Z23_std', 'leaf_Z71_std',
        'spike_Z32_std', 'spike_Z39_std', 'spike_Z65_std', 'carpel_std', 'carpel-like_std', 'stamen_std',
        'latent_lepto_std', 'diplo_dia_std', 'zygo_pachy_std', 'metaphaseI_std', 'grain_Z71_std',
        'grain_Z75_std', 'grain_Z85_std', 'Wheat_10DPA_std', 'Wheat_AL_20DPA_std', 'Wheat_SE_20DPA_std',
        'Wheat_TC_20DPA_std', 'Wheat_REF_20DPA_std', 'Wheat_SE_30DPA_std',
        'Wheat_AL.SE_30DPA_std', 'wheat_23_std', 'wheat_4_std']))
    for line in f:
        if line.startswith('#') or line.startswith('Geneid'):
            pass
        else:
            new = line.strip().split('\t')
            (Geneid, Chr, Start, End, Strand, Length, root_Z10_rep1, root_Z10_rep2, root_Z13_rep1, root_Z13_rep2,
             root_Z39_rep1, root_Z39_rep2, stem_Z30_rep1, stem_Z30_rep2, stem_Z32_rep1, stem_Z32_rep2, stem_Z65_rep1,
             stem_Z65_rep2, leaf_Z10_rep1, leaf_Z10_rep2, leaf_Z23_rep1, leaf_Z23_rep2, leaf_Z71_rep1, leaf_Z71_rep2,
             spike_Z32_rep1, spike_Z32_rep2, spike_Z39_rep1, spike_Z39_rep2, spike_Z65_rep1, spike_Z65_rep2, carpel,
             carpel_like_structure, stamen, latet_lepto_rep1, latent_lepto_rep2, diplo_dia_rep1, diplo_dia_rep2,
             zygo_pachy_rep1, zygo_pachy_rep2, metaphaseI_rep1, metaphaseI_rep2, grain_Z71_rep1, grain_Z71_rep2,
             grain_Z75_rep1, grain_Z75_rep2, grain_Z85_rep1, grain_Z85_rep2, Wheat_Room1_10DPA, Wheat_Room1_10DPA_Rep,
             Wheat_Room2_10DPA, Wheat_Room2_10DPA_Rep, Wheat_Room1_AL_20DPA, Wheat_Room1_AL_20DPA_Rep,
             Wheat_Room2_AL_20DPA, Wheat_Room2_AL_20DPA_Rep, Wheat_Room1_AL_20DPA_Extra1, Wheat_Room1_AL_20DPA_Extra2,
             Wheat_Room1_SE_20DPA, Wheat_Room1_SE_20DPA_Rep, Wheat_Room2_SE_20DPA, Wheat_Room2_SE_20DPA_Rep,
             Wheat_Room1_TC_20DPA, Wheat_Room1_TC_20DPA_Rep, Wheat_Room2_TC_20DPA, Wheat_Room2_TC_20DPA_Rep,
             Wheat_Room1_REF_20DPA, Wheat_Room1_REF_20DPA_Rep, Wheat_Room2_REF_20DPA, Wheat_Room2_REF_20DPA_Rep,
             Wheat_Room1_SE_30DPA, Wheat_Room1_SE_30DPA_Rep, Wheat_Room2_SE_30DPA, Wheat_Room2_SE_30DPA_Rep,
             Wheat_Room1_AL_SE_30DPA, Wheat_Room1_AL_SE_30DPA_Rep, Wheat_Room2_AL_SE_30DPA, Wheat_Room2_AL_SE_30DPA_Rep,
             wheat_23_1, wheat_23_2, wheat_23_3, wheat_4_1, wheat_4_2, wheat_4_3) = new
            new_root_Z10_rep1 = int(root_Z10_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[0][-1]))
            new_root_Z10_rep2 = int(root_Z10_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[1][-1]))
            new_root_Z13_rep1 = int(root_Z13_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[2][-1]))
            new_root_Z13_rep2 = int(root_Z13_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[3][-1]))
            new_root_Z39_rep1 = int(root_Z39_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[4][-1]))
            new_root_Z39_rep2 = int(root_Z39_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[5][-1]))
            new_stem_Z30_rep1 = int(stem_Z30_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[6][-1]))
            new_stem_Z30_rep2 = int(stem_Z30_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[7][-1]))
            new_stem_Z32_rep1 = int(stem_Z32_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[8][-1]))
            new_stem_Z32_rep2 = int(stem_Z32_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[9][-1]))
            new_stem_Z65_rep1 = int(stem_Z65_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[10][-1]))
            new_stem_Z65_rep2 = int(stem_Z65_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[11][-1]))
            new_leaf_Z10_rep1 = int(leaf_Z10_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[12][-1]))
            new_leaf_Z10_rep2 = int(leaf_Z10_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[13][-1]))
            new_leaf_Z23_rep1 = int(leaf_Z23_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[14][-1]))
            new_leaf_Z23_rep2 = int(leaf_Z23_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[15][-1]))
            new_leaf_Z71_rep1 = int(leaf_Z71_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[16][-1]))
            new_leaf_Z71_rep2 = int(leaf_Z71_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[17][-1]))
            new_spike_Z32_rep1 = int(spike_Z32_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[18][-1]))
            new_spike_Z32_rep2 = int(spike_Z32_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[19][-1]))
            new_spike_Z39_rep1 = int(spike_Z39_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[20][-1]))
            new_spike_Z39_rep2 = int(spike_Z39_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[21][-1]))
            new_spike_Z65_rep1 = int(spike_Z65_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[22][-1]))
            new_spike_Z65_rep2 = int(spike_Z65_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[23][-1]))
            new_carpel = int(carpel) * pow(10.0, 9) / (int(Length) * int(raw_total[24][-1]))
            new_carpel_like_structure = int(carpel_like_structure) * pow(10.0, 9) / (int(Length) * int(raw_total[25][-1]))
            new_stamen = int(stamen) * pow(10.0, 9) / (int(Length) * int(raw_total[26][-1]))
            new_latet_lepto_rep1 = int(latet_lepto_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[27][-1]))
            new_latet_lepto_rep2 = int(latent_lepto_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[28][-1]))
            new_diplo_dia_rep1 = int(diplo_dia_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[29][-1]))
            new_diplo_dia_rep2 = int(diplo_dia_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[30][-1]))
            new_zygo_pachy_rep1 = int(zygo_pachy_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[31][-1]))
            new_zygo_pachy_rep2 = int(zygo_pachy_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[32][-1]))
            new_metaphaseI_rep1 = int(metaphaseI_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[33][-1]))
            new_metaphaseI_rep2 = int(metaphaseI_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[34][-1]))
            new_grain_Z71_rep1 = int(grain_Z71_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[35][-1]))
            new_grain_Z71_rep2 = int(grain_Z71_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[36][-1]))
            new_grain_Z75_rep1 = int(grain_Z75_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[37][-1]))
            new_grain_Z75_rep2 = int(grain_Z75_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[38][-1]))
            new_grain_Z85_rep1 = int(grain_Z85_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[39][-1]))
            new_grain_Z85_rep2 = int(grain_Z85_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[40][-1]))
            Wheat_Room1_10DPA = int(Wheat_Room1_10DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[41][-1]))
            Wheat_Room1_10DPA_Rep = int(Wheat_Room1_10DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[42][-1]))
            Wheat_Room2_10DPA = int(Wheat_Room2_10DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[43][-1]))
            Wheat_Room2_10DPA_Rep = int(Wheat_Room2_10DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[44][-1]))
            Wheat_Room1_AL_20DPA = int(Wheat_Room1_AL_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[45][-1]))
            Wheat_Room1_AL_20DPA_Rep = int(Wheat_Room1_AL_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[46][-1]))
            Wheat_Room2_AL_20DPA = int(Wheat_Room2_AL_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[47][-1]))
            Wheat_Room2_AL_20DPA_Rep = int(Wheat_Room2_AL_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[48][-1]))
            Wheat_Room1_AL_20DPA_Extra1 = int(Wheat_Room1_AL_20DPA_Extra1) * pow(10.0, 9) / (int(Length) * int(raw_total[49][-1]))
            Wheat_Room1_AL_20DPA_Extra2 = int(Wheat_Room1_AL_20DPA_Extra2) * pow(10.0, 9) / (int(Length) * int(raw_total[50][-1]))
            Wheat_Room1_SE_20DPA = int(Wheat_Room1_SE_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[51][-1]))
            Wheat_Room1_SE_20DPA_Rep = int(Wheat_Room1_SE_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[52][-1]))
            Wheat_Room2_SE_20DPA = int(Wheat_Room2_SE_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[53][-1]))
            Wheat_Room2_SE_20DPA_Rep = int(Wheat_Room2_SE_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[54][-1]))
            Wheat_Room1_TC_20DPA = int(Wheat_Room1_TC_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[55][-1]))
            Wheat_Room1_TC_20DPA_Rep = int(Wheat_Room1_TC_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[56][-1]))
            Wheat_Room2_TC_20DPA = int(Wheat_Room2_TC_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[57][-1]))
            Wheat_Room2_TC_20DPA_Rep = int(Wheat_Room2_TC_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[58][-1]))
            Wheat_Room1_REF_20DPA = int(Wheat_Room1_REF_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[59][-1]))
            Wheat_Room1_REF_20DPA_Rep = int(Wheat_Room1_REF_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[60][-1]))
            Wheat_Room2_REF_20DPA = int(Wheat_Room2_REF_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[61][-1]))
            Wheat_Room2_REF_20DPA_Rep = int(Wheat_Room2_REF_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[62][-1]))
            Wheat_Room1_SE_30DPA = int( Wheat_Room1_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[63][-1]))
            Wheat_Room1_SE_30DPA_Rep = int(Wheat_Room1_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[64][-1]))
            Wheat_Room2_SE_30DPA = int(Wheat_Room2_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[65][-1]))
            Wheat_Room2_SE_30DPA_Rep = int(Wheat_Room2_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[66][-1]))
            Wheat_Room1_AL_SE_30DPA = int(Wheat_Room1_AL_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[67][-1]))
            Wheat_Room1_AL_SE_30DPA_Rep = int(Wheat_Room1_AL_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[68][-1]))
            Wheat_Room2_AL_SE_30DPA = int(Wheat_Room2_AL_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[69][-1]))
            Wheat_Room2_AL_SE_30DPA_Rep = int(Wheat_Room2_AL_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[70][-1]))
            wheat_23_1 = int(wheat_23_1) * pow(10.0, 9) / (int(Length) * int(raw_total[71][-1]))
            wheat_23_2 = int(wheat_23_2) * pow(10.0, 9) / (int(Length) * int(raw_total[72][-1]))
            wheat_23_3 = int(wheat_23_3) * pow(10.0, 9) / (int(Length) * int(raw_total[73][-1]))
            wheat_4_1 = int(wheat_4_1) * pow(10.0, 9) / (int(Length) * int(raw_total[74][-1]))
            wheat_4_2 = int(wheat_4_2) * pow(10.0, 9) / (int(Length) * int(raw_total[75][-1]))
            wheat_4_3 = int(wheat_4_3) * pow(10.0, 9) / (int(Length) * int(raw_total[76][-1]))

            root_Z10_mean = np.mean(np.array([new_root_Z10_rep1, new_root_Z10_rep2]))
            root_Z10_std = np.std(np.array([new_root_Z10_rep1, new_root_Z10_rep2]))
            root_Z13_mean = np.mean(np.array([new_root_Z13_rep1, new_root_Z13_rep2]))
            root_Z13_std = np.std(np.array([new_root_Z13_rep1, new_root_Z13_rep2]))
            root_Z39_mean = np.mean(np.array([new_root_Z39_rep1, new_root_Z39_rep2]))
            root_Z39_std = np.std(np.array([new_root_Z39_rep1, new_root_Z39_rep2]))
            stem_Z30_mean = np.mean(np.array([new_stem_Z30_rep1, new_stem_Z30_rep2]))
            stem_Z30_std = np.std(np.array([new_stem_Z30_rep1, new_stem_Z30_rep2]))
            stem_Z32_mean = np.mean(np.array([new_stem_Z32_rep1, new_stem_Z32_rep2]))
            stem_Z32_std = np.std(np.array([new_stem_Z32_rep1, new_stem_Z32_rep2]))
            stem_Z65_mean = np.mean(np.array([new_stem_Z65_rep1, new_stem_Z65_rep2]))
            stem_Z65_std = np.std(np.array([new_stem_Z65_rep1, new_stem_Z65_rep2]))
            leaf_Z10_mean = np.mean(np.array([new_leaf_Z10_rep1, new_leaf_Z10_rep2]))
            leaf_Z10_std = np.std(np.array([new_leaf_Z10_rep1, new_leaf_Z10_rep2]))
            leaf_Z23_mean = np.mean(np.array([new_leaf_Z23_rep1, new_leaf_Z23_rep2]))
            leaf_Z23_std = np.std(np.array([new_leaf_Z23_rep1, new_leaf_Z23_rep2]))
            leaf_Z71_mean = np.mean(np.array([new_leaf_Z71_rep1, new_leaf_Z71_rep2]))
            leaf_Z71_std = np.std(np.array([new_leaf_Z71_rep1, new_leaf_Z71_rep2]))
            spike_Z32_mean = np.mean(np.array([new_spike_Z32_rep1, new_spike_Z32_rep2]))
            spike_Z32_std = np.std(np.array([new_spike_Z32_rep1, new_spike_Z32_rep2]))
            spike_Z39_mean = np.mean(np.array([new_spike_Z39_rep1, new_spike_Z39_rep2]))
            spike_Z39_std = np.std(np.array([new_spike_Z39_rep1, new_spike_Z39_rep2]))
            spike_Z65_mean = np.mean(np.array([new_spike_Z65_rep1, new_spike_Z65_rep2]))
            spike_Z65_std = np.std(np.array([new_spike_Z65_rep1, new_spike_Z65_rep2]))
            latet_lepto_mean = np.mean(np.array([new_latet_lepto_rep1, new_latet_lepto_rep2]))
            latet_lepto_std = np.std(np.array([new_latet_lepto_rep1, new_latet_lepto_rep2]))
            diplo_dia_mean = np.mean(np.array([new_diplo_dia_rep1, new_diplo_dia_rep2]))
            diplo_dia_std = np.std(np.array([new_diplo_dia_rep1, new_diplo_dia_rep2]))
            zygo_pachy_mean = np.mean(np.array([new_zygo_pachy_rep1, new_zygo_pachy_rep2]))
            zygo_pachy_std = np.std(np.array([new_zygo_pachy_rep1, new_zygo_pachy_rep2]))
            metaphaseI_mean = np.mean(np.array([new_metaphaseI_rep1, new_metaphaseI_rep2]))
            metaphaseI_std = np.std(np.array([new_metaphaseI_rep1, new_metaphaseI_rep2]))
            grain_Z71_mean = np.mean(np.array([new_grain_Z71_rep1, new_grain_Z71_rep2]))
            grain_Z71_std = np.std(np.array([new_grain_Z71_rep1, new_grain_Z71_rep2]))
            grain_Z75_mean = np.mean(np.array([new_grain_Z75_rep1, new_grain_Z75_rep2]))
            grain_Z75_std = np.std(np.array([new_grain_Z75_rep1, new_grain_Z75_rep2]))
            grain_Z85_mean = np.mean(np.array([new_grain_Z85_rep1, new_grain_Z85_rep2]))
            grain_Z85_std = np.std(np.array([new_grain_Z85_rep1, new_grain_Z85_rep2]))

            Wheat_10DPA_mean = np.mean(np.array([Wheat_Room1_10DPA, Wheat_Room1_10DPA_Rep,Wheat_Room2_10DPA, Wheat_Room2_10DPA_Rep]))
            Wheat_10DPA_std = np.std(np.array([Wheat_Room1_10DPA, Wheat_Room1_10DPA_Rep,Wheat_Room2_10DPA, Wheat_Room2_10DPA_Rep]))
            Wheat_AL_20DPA_mean = np.mean(np.array([Wheat_Room1_AL_20DPA, Wheat_Room1_AL_20DPA_Rep,Wheat_Room2_AL_20DPA, Wheat_Room2_AL_20DPA_Rep, Wheat_Room1_AL_20DPA_Extra1, Wheat_Room1_AL_20DPA_Extra2]))
            Wheat_AL_20DPA_std = np.std(np.array([Wheat_Room1_AL_20DPA, Wheat_Room1_AL_20DPA_Rep,Wheat_Room2_AL_20DPA, Wheat_Room2_AL_20DPA_Rep, Wheat_Room1_AL_20DPA_Extra1, Wheat_Room1_AL_20DPA_Extra2]))
            Wheat_SE_20DPA_mean = np.mean(np.array([Wheat_Room1_SE_20DPA, Wheat_Room1_SE_20DPA_Rep, Wheat_Room2_SE_20DPA, Wheat_Room2_SE_20DPA_Rep]))
            Wheat_SE_20DPA_std = np.std(np.array([Wheat_Room1_SE_20DPA, Wheat_Room1_SE_20DPA_Rep, Wheat_Room2_SE_20DPA, Wheat_Room2_SE_20DPA_Rep]))
            Wheat_TC_20DPA_mean = np.mean(np.array([Wheat_Room1_TC_20DPA, Wheat_Room1_TC_20DPA_Rep, Wheat_Room2_TC_20DPA, Wheat_Room2_TC_20DPA_Rep]))
            Wheat_TC_20DPA_std = np.std(np.array([Wheat_Room1_TC_20DPA, Wheat_Room1_TC_20DPA_Rep, Wheat_Room2_TC_20DPA, Wheat_Room2_TC_20DPA_Rep]))
            Wheat_REF_20DPA_mean = np.mean(np.array([Wheat_Room1_REF_20DPA, Wheat_Room1_REF_20DPA_Rep, Wheat_Room2_REF_20DPA, Wheat_Room2_REF_20DPA_Rep]))
            Wheat_REF_20DPA_std = np.std(np.array([Wheat_Room1_REF_20DPA, Wheat_Room1_REF_20DPA_Rep, Wheat_Room2_REF_20DPA, Wheat_Room2_REF_20DPA_Rep]))
            Wheat_SE_30DPA_mean = np.mean(np.array([Wheat_Room1_SE_30DPA, Wheat_Room1_SE_30DPA_Rep, Wheat_Room2_SE_30DPA, Wheat_Room2_SE_30DPA_Rep]))
            Wheat_SE_30DPA_std = np.std(np.array([Wheat_Room1_SE_30DPA, Wheat_Room1_SE_30DPA_Rep, Wheat_Room2_SE_30DPA, Wheat_Room2_SE_30DPA_Rep]))
            Wheat_AL_SE_30DPA_mean = np.mean(np.array([Wheat_Room1_AL_SE_30DPA, Wheat_Room1_AL_SE_30DPA_Rep, Wheat_Room2_AL_SE_30DPA, Wheat_Room2_AL_SE_30DPA_Rep]))
            Wheat_AL_SE_30DPA_std = np.std(np.array([Wheat_Room1_AL_SE_30DPA, Wheat_Room1_AL_SE_30DPA_Rep, Wheat_Room2_AL_SE_30DPA, Wheat_Room2_AL_SE_30DPA_Rep]))
            wheat_23_mean = np.mean(np.array([wheat_23_1, wheat_23_2, wheat_23_3]))
            wheat_23_std = np.std(np.array([wheat_23_1, wheat_23_2, wheat_23_3]))
            wheat_4_mean = np.mean(np.array([wheat_4_1, wheat_4_2, wheat_4_3]))
            wheat_4_std = np.std(np.array([wheat_4_1, wheat_4_2, wheat_4_3]))

            # Emit one tab-separated row in the same column order as the header.
            # Single-sample tissues (carpel, carpel-like structure, stamen) have no std, hence 'null'.
            print('\t'.join(str(value) for value in (
                Geneid, Chr, Start, End, Strand, Length, root_Z10_mean, root_Z13_mean, root_Z39_mean, stem_Z30_mean,
                stem_Z32_mean, stem_Z65_mean, leaf_Z10_mean, leaf_Z23_mean, leaf_Z71_mean, spike_Z32_mean,
                spike_Z39_mean, spike_Z65_mean, new_carpel, new_carpel_like_structure, new_stamen, latet_lepto_mean,
                diplo_dia_mean, zygo_pachy_mean, metaphaseI_mean, grain_Z71_mean, grain_Z75_mean, grain_Z85_mean,
                Wheat_10DPA_mean, Wheat_AL_20DPA_mean, Wheat_SE_20DPA_mean, Wheat_TC_20DPA_mean, Wheat_REF_20DPA_mean,
                Wheat_SE_30DPA_mean, Wheat_AL_SE_30DPA_mean, wheat_23_mean, wheat_4_mean,
                root_Z10_std, root_Z13_std, root_Z39_std, stem_Z30_std, stem_Z32_std, stem_Z65_std, leaf_Z10_std,
                leaf_Z23_std, leaf_Z71_std, spike_Z32_std, spike_Z39_std, spike_Z65_std, 'null', 'null', 'null',
                latet_lepto_std, diplo_dia_std, zygo_pachy_std, metaphaseI_std, grain_Z71_std, grain_Z75_std,
                grain_Z85_std, Wheat_10DPA_std, Wheat_AL_20DPA_std, Wheat_SE_20DPA_std, Wheat_TC_20DPA_std,
                Wheat_REF_20DPA_std, Wheat_SE_30DPA_std, Wheat_AL_SE_30DPA_std, wheat_23_std, wheat_4_std)))
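
The per-sample arithmetic in the script above is the same expression repeated for every column: count × 10^9 / (Length × sample total). As a rough Python 3 sketch of how that could be written once and looped over replicate groups, assuming the counts and per-sample totals are held in dictionaries (the names `fpkm` and `summarize_groups` are hypothetical, and the standard library's `pstdev` stands in for `np.std`, which also computes the population standard deviation by default):

```python
from statistics import mean, pstdev


def fpkm(count, length, total_fragments):
    """Fragments per kilobase of feature per million fragments."""
    return count * 1e9 / (length * total_fragments)


def summarize_groups(counts, length, totals, groups):
    """Return {group: (mean FPKM, std FPKM or None)} for one gene.

    counts: sample name -> raw count for this gene
    totals: sample name -> total fragments for that sample (as in raw_total)
    groups: group name -> list of replicate sample names
    """
    summary = {}
    for group, samples in groups.items():
        values = [fpkm(counts[s], length, totals[s]) for s in samples]
        # A single-sample group has no meaningful spread.
        sd = pstdev(values) if len(values) > 1 else None
        summary[group] = (mean(values), sd)
    return summary
```

With this layout, adding or renaming a sample only touches the dictionaries, not the arithmetic.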

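Only FPKM can be used here, not TPM: because we do not have information on all transcripts, TPM cannot be computed. Alternative splicing is widespread, and second-generation sequencing cannot effectively distinguish the expression levels of alternatively spliced transcripts. In a sense, this approach can only measure expression at the transcriptional level, not at the post-transcriptional level.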
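
To illustrate why TPM needs the full transcript set, here is a small Python 3 sketch of the usual TPM definition: each transcript's count/length rate is divided by the sum of those rates over all transcripts in the sample. The function name `tpm` and the toy numbers are illustrative only.

```python
def tpm(counts, lengths):
    """Transcripts per million for one sample.

    counts, lengths: dicts keyed by transcript name.
    """
    # Length-normalised rate for every transcript in the sample.
    rates = {t: counts[t] / lengths[t] for t in counts}
    denom = sum(rates.values())
    if denom == 0:
        return {t: 0.0 for t in counts}
    # The denominator depends on every transcript, not just the ones of interest.
    return {t: rate / denom * 1e6 for t, rate in rates.items()}
```

Adding or removing transcripts changes the denominator and therefore every TPM value, so a count table restricted to a few target genes does not give meaningful TPMs.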
