Recently I have been doing expression analysis of some wheat genes, and it occurred to me to use public RNA-seq data for a bioinformatic analysis, which also covers more tissues than the ones I used in my own experiments.
After downloading the data, the first step is to clean it by removing low-quality sequences and contaminating sequences such as adapters. Here I combined two tools, AdapterRemoval and bbduk2 (bbduk2 is a subprogram of BBMap). The AdapterRemoval command I used is:
AdapterRemoval --file1 input1.fastq.gz --file2 input2.fastq.gz --qualitybase 33 --trimns --minlength 40 --threads 10 --adapter-list ~/adapterremoval-2.1.7/benchmark/adapters/adapters.fasta --output1 output1.fastq.gz --output2 output2.fastq.gz
Typing AdapterRemoval in the terminal prints the detailed parameters, shown below:
AdapterRemoval ver. 2.1.7
This program searches for and removes remnant adapter sequences from
your read data. The program can analyze both single end and paired end
data. For detailed explanation of the parameters, please refer to the
man page. For comments, suggestions and feedback please contact Stinus
Lindgreen ([email protected]) and Mikkel Schubert ([email protected]).
If you use the program, please cite the paper:
Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid
adapter trimming, identification, and read merging.
BMC Research Notes, 12;9(1):88.
http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2
Arguments: Description:
--help Display this message.
--version Print the version string.
--file1 FILE Input file containing mate 1 reads or single-ended reads [REQUIRED].
--file2 FILE Input file containing mate 2 reads [OPTIONAL].
FASTQ OPTIONS:
--qualitybase BASE Quality base used to encode Phred scores in input; either 33, 64, or solexa [current: 33].
--qualitybase-output BASE Quality base used to encode Phred scores in output; either 33, 64, or solexa. By default, reads will be written in the same format as the that specified using --qualitybase.
--qualitymax BASE Specifies the maximum Phred score expected in input files, and used when writing output. ASCII encoded values are limited to the characters '!' (ASCII = 33) to'~' (ASCII = 126), meaning that possible scores are 0 - 93 with offset 33, and 0 - 62 for offset 64 and Solexa scores [default: 41].
--mate-separator CHAR Character separating the mate number (1 or 2) from the read name in FASTQ records [default: '/'].
--interleaved This option enables both the --interleaved-input option and the
--interleaved-output option [current: off].
--interleaved-input The (single) input file provided contains both the mate 1 and mate 2 reads, one pair after the other, with one mate 1 reads followed by one mate 2 read. This option is implied by the --interleaved option [current: off].
--interleaved-output If set, trimmed paired-end reads are written to a single file containing mate 1 and mate 2 reads, one pair after the other. This option is implied by the --interleaved option [current: off].
OUTPUT FILES:
--basename BASENAME Default prefix for all output files for which no filename was explicitly set [current: your_output].
--settings FILE Output file containing information on the parameters used in the run as well as overall statistics on the reads after trimming [default: BASENAME.settings]
--output1 FILE Output file containing trimmed mate1 reads [default: BASENAME.pair1.truncated (PE), BASENAME.truncated (SE), or BASENAME.paired.truncated (interleaved PE)]
--output2 FILE Output file containing trimmed mate 2 reads [default: BASENAME.pair2.truncated (only used in PE mode, but not if --interleaved-output is enabled)]
--singleton FILE Output file to which containing paired reads for which the mate has been discarded [default: BASENAME.singleton.truncated]
--outputcollapsed FILE If --collapsed is set, contains overlapping mate-pairs which have been merged into a single read (PE mode) or reads for which the adapter was identified by a minimum overlap, indicating that the entire template molecule is present. This does not include which have subsequently been trimmed due to low-quality or ambiguous nucleotides [default: BASENAME.collapsed]
--outputcollapsedtruncated FILE Collapsed reads (see --outputcollapsed) which were trimmed due the presence of low-quality or ambiguous nucleotides [default: BASENAME.collapsed.truncated]
--discarded FILE Contains reads discarded due to the --minlength, --maxlength or --maxns options [default: BASENAME.discarded]
OUTPUT COMPRESSION:
--gzip Enable gzip compression [current: off]
--gzip-level LEVEL Compression level, 0 - 9 [current: 6]
--bzip2 Enable bzip2 compression [current: off]
--bzip2-level LEVEL Compression level, 0 - 9 [current: 9]
TRIMMING SETTINGS:
--adapter1 SEQUENCE Adapter sequence expected to be found in mate 1 reads [current: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG].
--adapter2 SEQUENCE Adapter sequence expected to be found in mate 2 reads [current: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT].
--adapter-list FILENAME Read table of white-space separated adapters pairs, used as if the first column was supplied to --adapter1, and the second column was supplied to --adapter2; only the first adapter in each pair is required SE trimming mode [current:<not set>].
--mm MISMATCH_RATE Max error-rate when aligning reads and/or adapters. If > 1, the max error-rate is set to 1 / MISMATCH_RATE; if < 0, the defaults are used, otherwise the user-supplied value is used directly. [defaults: 1/3 for trimming; 1/10 when identifing adapters].
--maxns MAX Reads containing more ambiguous bases (N) than this number after trimming are discarded [current: 1000].
--shift N Consider alignments where up to N nucleotides are missing from the 5' termini [current: 2].
--trimns If set, trim ambiguous bases (N) at 5'/3' termini [current: off]
--trimqualities If set, trim bases at 5'/3' termini with quality scores <= to --minquality value [current: off]
--minquality PHRED Inclusive minimum; see --trimqualities for details [current: 2]
--minlength LENGTH Reads shorter than this length are discarded following trimming [current: 15].
--maxlength LENGTH Reads longer than this length are discarded following trimming [current:4294967295].
--collapse When set, paired ended read alignments of --minalignmentlength or more bases are combined into a single consensus sequence, representing the complete insert,and written to either basename.collapsed or basename.collapsed.truncated (if trimmed due to low-quality bases following collapse); for single-ended reads,putative complete inserts are identified as having at least --minalignmentlength bases overlap with the adapter sequence, and are written to the the same files [current: off].
--minalignmentlength LENGTH If --collapse is set, paired reads must overlap at least this number of bases to be collapsed, and single-ended reads must overlap at least this number of bases with the adapter to be considered complete template molecules [current:11].
--minadapteroverlap LENGTH In single-end mode, reads are only trimmed if the overlap between read and the adapter is at least X bases long, not counting ambiguous nucleotides (N); this is independant of the --minalignmentlength when using --collapse, allowing a conservative selection of putative complete inserts while ensuring that all possible adapter contamination is trimmed [current: 0].
DEMULTIPLEXING:
--barcode-list FILENAME List of barcodes or barcode pairs for single or double-indexed demultiplexing. Note that both indexes should be specified for both single-end and paired-end trimming, if double-indexed multiplexing was used, in order to ensure that the demultiplexed reads can be trimmed correctly [current: <not set>].
--barcode-mm N Maximum number of mismatches allowed when counting mismatches in both the mate 1 and the mate 2 barcode for paired reads.
--barcode-mm-r1 N Maximum number of mismatches allowed for the mate 1 barcode; if not set, this value is equal to the '--barcode-mm' value; cannot be higher than the '--barcode-mm value'.
--barcode-mm-r2 N Maximum number of mismatches allowed for the mate 2 barcode; if not set, this value is equal to the '--barcode-mm' value; cannot be higher than the '--barcode-mm value'.
MISC:
--identify-adapters Attempt to identify the adapter pair of PE reads, by searching for overlapping reads [current: off].
--seed SEED Sets the RNG seed used when choosing between bases with equal Phred scores when collapsing. Note that runs are not deterministic if more than one thread is used. If not specified, a seed is generated using the current time.
--threads THREADS Maximum number of threads [current: 1]
The --identify-adapters option can be used to identify the adapter sequences in PE reads.
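For example, a run along these lines (the file names here are placeholders for illustration) reports the inferred adapter pair of a paired-end library, which can then be passed to --adapter1/--adapter2 or put into an adapter list:
AdapterRemoval --identify-adapters --file1 reads_1.fastq.gz --file2 reads_2.fastq.gz --threads 10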
The bbduk2 command is as follows:
/data1/masw/bbmap/bbduk2.sh -da in=ATW_AKOSW_2_1_D0KD1ACXX.IND12.fastq_1.gz IN2=ATW_AKOSW_2_2_D0KD1ACXX.IND12.fastq_1.gz out=ATW_AKOSW_2_1_D0KD1ACXX.IND12.fastq_2.gz out2=ATW_AKOSW_2_2_D0KD1ACXX.IND12.fastq_2.gz stats=1.2.txt k=20 minlength=40 mink=8 hdist=2 ref=/data1/masw/bbmap/resources/sequencing_artifacts.fa.gz tbo entropy=0.5 entropywindow=50 entropyk=5
Similarly, typing /data1/masw/bbmap/bbduk2.sh in the terminal shows the detailed parameters:
Written by Brian Bushnell
Last modified June 27, 2016
BBDuk2 is like BBDuk but can kfilter, kmask, and ktrim in a single pass.
It does not replace BBDuk, and is only provided to allow maximally efficient
pipeline integration when multiple steps will be performed. The syntax is
slightly different.
Description: Compares reads to the kmers in a reference dataset, optionally
allowing an edit distance. Splits the reads into two outputs - those that
match the reference, and those that don't. Can also trim (remove) the matching
parts of the reads rather than binning the reads.
Usage: bbduk2.sh in=<file> out=<file>
After removing the adapter sequences, check whether the mapping rate improves; normally the mapping rate should be above 80%. If the mapping rate is really too low, consider whether there is a quality problem with that sample, which may affect the accuracy of the results.
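To check the mapping rate, note that HISAT2 prints an overall alignment rate to stderr at the end of each run; alternatively, samtools flagstat reports the fraction of mapped reads, for example (the file name is a placeholder):
samtools view -bS sample.sam | samtools flagstat -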
For mapping I use HISAT2, which aligns very quickly and needs relatively few resources, but the reference genome has to be indexed first:
hisat2-build-l -p 20 ./IWGSC_v1.0/Wheat_IWGSC_WGA_v1.0_pseudomolecules/161010_Chinese_Spring_v1.0_pseudomolecules.fasta IWGSCv1.0_hisat2
The mapping command is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import subprocess

# hisat2_list.txt lists one sample per line: mate-1 fastq, mate-2 fastq,
# novel splice-site output file, and SAM output file.
with open('hisat2_list.txt', 'r') as f:
    for line in f:
        input1, input2, output1, output2 = line.strip().split()
        print input1, input2
        proc = subprocess.Popen(['hisat2', '-p', '20', '--dta',
                                 '-x', '../NRGenome_hisat2/NRGenome',
                                 '--known-splicesite-infile', '../annotation/1.ss',
                                 '--novel-splicesite-infile', 'all.ss',
                                 '--novel-splicesite-outfile', output1,
                                 '-t', '-1', input1, '-2', input2,
                                 '-S', output2], shell=False)
        proc.wait()
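For illustration, each line of hisat2_list.txt is expected to hold four whitespace-separated fields, e.g. (hypothetical file names):
sampleA_1.fastq.gz sampleA_2.fastq.gz sampleA_novel.ss sampleA.sam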
The next step is to filter the SAM output, for example keeping only reads with a single hit, or only perfectly matched reads, and so on. If you are familiar with the SAM format, this kind of filtering is easy to do, so I will not go into the details here. The detailed HISAT2 parameters are listed below:
HISAT2 version 2.0.4 by Daehwan Kim ([email protected], www.ccb.jhu.edu/people/infphilo)
Usage:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]
<ht2-idx>  Index filename prefix (minus trailing .X.ht2).
<m1>       Files with #1 mates, paired with files in <m2>.
           Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2>       Files with #2 mates, paired with files in <m1>.
           Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r>        Files with unpaired reads.
           Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<SRA accession number>  Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
<sam>      File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Input:
-q query input files are FASTQ .fq/.fastq (default)
--qseq query input files are in Illumina's qseq format
-f query input files are (multi-)FASTA .fa/.mfa
-r query input files are raw one-sequence-per-line
-c , , are sequences themselves, not files
-s/--skip <int> skip the first <int> reads/pairs in the input (none)
-u/--upto <int> stop after first <int> reads/pairs (no limit)
-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
--phred33 qualities are Phred+33 (default)
--phred64 qualities are Phred+64
--int-quals qualities encoded as space-delimited integers
--sra-acc SRA accession ID
Alignment:
--n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
--ignore-quals treat all quality values as 30 on Phred scale (off)
--nofw do not align forward (original) version of read (off)
--norc do not align reverse-complement version of read (off)
Spliced Alignment:
--pen-cansplice <int> penalty for a canonical splice site (0)
--pen-noncansplice <int> penalty for a non-canonical splice site (12)
--pen-canintronlen <func> penalty for long introns (G,-8,1) with canonical splice sites
--pen-noncanintronlen <func> penalty for long introns (G,-8,1) with noncanonical splice sites
--min-intronlen <int> minimum intron length (20)
--max-intronlen <int> maximum intron length (500000)
--known-splicesite-infile <path> provide a list of known splice sites
--novel-splicesite-outfile <path> report a list of splice sites
--novel-splicesite-infile <path> provide a list of novel splice sites
--no-temp-splicesite disable the use of splice sites found
--no-spliced-alignment disable spliced alignment
--rna-strandness <string> Specify strand-specific information (unstranded)
--tmo Reports only those alignments within known transcriptome
--dta Reports alignments tailored for transcript assemblers
--dta-cufflinks Reports alignments tailored specifically for cufflinks
Scoring:
--ma <int> match bonus (0 for --end-to-end, 2 for --local)
--mp <int>,<int> max and min penalties for mismatch; lower qual = lower penalty <2,6>
--sp <int>,<int> max and min penalties for soft-clipping; lower qual = lower penalty <1,2>
--np <int> penalty for non-A/C/G/Ts in read/ref (1)
--rdg <int>,<int> read gap open, extend penalties (5,3)
--rfg <int>,<int> reference gap open, extend penalties (5,3)
--score-min <func> min acceptable alignment score w/r/t read length
(L,0.0,-0.2)
Reporting:
(default) look for multiple alignments, report best, with MAPQ
OR
-k <int> report up to <int> alns per read; MAPQ not meaningful
OR
-a/--all report all alignments; very slow, MAPQ not meaningful
Paired-end:
--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
--no-mixed suppress unpaired alignments for paired reads
--no-discordant suppress discordant alignments for paired reads
Output:
-t/--time print wall-clock time taken by search phases
--un <path> write unpaired reads that didn't align to <path>
--al <path> write unpaired reads that aligned at least once to <path>
--un-conc <path> write pairs that didn't align concordantly to <path>
--al-conc <path> write pairs that aligned concordantly at least once to <path>
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
--quiet print nothing to stderr except serious errors
--met-file <path> send metrics to file at <path> (off)
--met-stderr send metrics to stderr (off)
--met <int> report internal counters & metrics every <int> secs (1)
--no-head suppress header lines, i.e. lines starting with @
--no-sq suppress @SQ header lines
--rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field
--rg <text> add <text> ("lab:value") to @RG line of SAM header.
Note: @RG line only printed when --rg-id is set.
--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.
Performance:
-o/--offrate <int> override offrate of index; must be >= index's offrate
-p/--threads <int> number of alignment threads to launch (1)
--reorder force SAM output order to match order of input reads
--mm use memory-mapped I/O for index; many 'bowtie's can share
Other:
--qc-filter filter out reads that are bad according to QSEQ filter
--seed <int> seed for random number generator (0)
--non-deterministic seed rand. gen. arbitrarily instead of using read attributes
--remove-chrname remove 'chr' from reference names in alignment
--add-chrname add 'chr' to reference names in alignment
--version print version information and quit
-h/--help print this usage message
Here I need to keep the perfectly matched reads, filtered as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import subprocess

# sam_file.txt lists one SAM file per line; keep the header lines and the
# alignments with zero mismatches (NM:i:0) in a new *.perfectmatch.sam file.
with open('sam_file.txt', 'r') as f:
    for line in f:
        line = line.strip()
        print line
        proc = subprocess.Popen('grep -E "@|NM:i:0" ' + line + ' > ' + line[:-3] + 'perfectmatch.sam', shell=True)
        proc.wait()
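For the other filter mentioned above, keeping only uniquely mapped reads, a minimal sketch could rely on HISAT2's NH tag (NH:i:1 means a single reported alignment). This is only an illustration with placeholder file names, not part of the pipeline used here:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Keep SAM header lines plus alignments whose NH tag equals 1 (unique hits).
with open('sample.sam') as fin, open('sample.unique.sam', 'w') as fout:
    for line in fin:
        fields = line.rstrip('\n').split('\t')
        if line.startswith('@') or 'NH:i:1' in fields:
            fout.write(line)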
With the SAM files we could also assemble transcripts, but the goal of this study is to measure expression for a given gene's transcripts, so that step is not required here. For how to assemble transcripts, see the paper "Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown".
Expression is then quantified according to the positions of the genes on the reference genome. Gene positions are usually stored in GFF3 or GTF format and can be obtained with tools such as blastn or GMAP (make sure the exon-intron structure is accurate; an example GMAP command is given after the GFF3 snippet below). The GFF3 format looks like this:
chr1A NRGenome mRNA 5946352 5946999 . - . ID=UN044011.mrna1;Name=UN044011;Parent=UN044011.path1;coverage=100.0;identity=100.0;matches=648;mismatches=0;indels=0;unknowns=0
chr1A NRGenome exon 5946352 5946999 100 - . ID=UN044011.mrna1.exon1;Name=UN044011;Parent=UN044011.mrna1;Target=UN044011 1 648 +
chr1A NRGenome mRNA 9968301 9968632 . + . ID=UN080299.mrna1;Name=UN080299;Parent=UN080299.path1;coverage=100.0;identity=100.0;matches=213;mismatches=0;indels=0;unknowns=0
chr1A NRGenome exon 9968301 9968396 100 + . ID=UN080299.mrna1.exon1;Name=UN080299;Parent=UN080299.mrna1;Target=UN080299 1 96 +
chr1A NRGenome exon 9968516 9968632 100 + . ID=UN080299.mrna1.exon2;Name=UN080299;Parent=UN080299.mrna1;Target=UN080299 97 213 +
chr1A NRGenome mRNA 12807377 12808514 . - . ID=UN129475.mrna1;Name=UN129475;Parent=UN129475.path1;coverage=100.0;identity=100.0;matches=156;mismatches=0;indels=0;unknowns=0
chr1A NRGenome exon 12808501 12808514 100 - . ID=UN129475.mrna1.exon1;Name=UN129475;Parent=UN129475.mrna1;Target=UN129475 1 14 +
chr1A NRGenome exon 12807377 12807518 100 - . ID=UN129475.mrna1.exon2;Name=UN129475;Parent=UN129475.mrna1;Target=UN129475 15 156 +
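For reference, GFF3 like the above can be produced with GMAP; a sketch of such a command (the database name and file names are placeholders, not the actual command used here; -f 2 requests GFF3 gene output):
gmap -D ./gmap_db -d NRGenome -t 20 -f 2 -n 1 cds.fasta > cds_on_genome.gff3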
With the position information in hand, featureCounts is used to compute the expression counts. Here only uniquely mapped reads are counted; the command is as follows (for each run you only need to change the input annotation file and the output file):
featureCounts -T 20 -t exon -g Name --readExtension5 70 --readExtension3 70 -p --donotsort -C -a ../Triticum_aestivum.TGACv1.cds.1.gff3 -o TGAC_unique_in_expression.txt ATW_AOSW_1.perfectmatch.sam ATW_AAOSW_6.perfectmatch.sam ATW_ANOSW_1.perfectmatch.sam ATW_LOSW_5.perfectmatch.sam ATW_ADOSW_1.perfectmatch.sam ATW_AEOSW_1.perfectmatch.sam ATW_DOSW_2.perfectmatch.sam ATW_POSW_6.perfectmatch.sam ATW_IOSW_4.perfectmatch.sam ATW_KOSW_4.perfectmatch.sam ATW_ROSW_7.perfectmatch.sam ATW_ALOSW_3.perfectmatch.sam ATW_TOSW_8.perfectmatch.sam ATW_VOSW_6.perfectmatch.sam ATW_MOSW_5.perfectmatch.sam ATW_NOSW_6.perfectmatch.sam ATW_COSW_1.perfectmatch.sam ATW_AGOSW_2.perfectmatch.sam ATW_GOSW_3.perfectmatch.sam ATW_HOSW_3.perfectmatch.sam ATW_ABOSW_7.perfectmatch.sam ATW_ACOSW_1.perfectmatch.sam ATW_QOSW_7.perfectmatch.sam ATW_AHOSW_3.perfectmatch.sam SRR1175868.perfectmatch.sam SRR1177760.perfectmatch.sam SRR1177761.perfectmatch.sam NG-5789_1A_lib7482.perfectmatch.sam NG-5789_1B_lib7486.perfectmatch.sam NG-5789_2A_lib7483.perfectmatch.sam NG-5789_2B_lib7487.perfectmatch.sam NG-5789_3A_lib7484.perfectmatch.sam NG-5789_3B_lib7488.perfectmatch.sam NG-5789_4A_lib7485.perfectmatch.sam NG-5789_4B_lib7489.perfectmatch.sam ATW_SOSW_8.perfectmatch.sam ATW_AFOSW_2.perfectmatch.sam ATW_AIOSW_2.perfectmatch.sam ATW_AKOSW_2.perfectmatch.sam ATW_FOSW_2.perfectmatch.sam ATW_AMOSW_4.perfectmatch.sam
Likewise, the detailed featureCounts parameters are listed here; please look up the meaning of each option yourself.
Version 1.5.1
Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
## Required arguments:
-a <string> Name of an annotation file. GTF/GFF format by default.
See -F option for more formats.
-o <string> Name of the output file including read counts. A separate
file including summary statistics of counting results is
also included in the output (`<string>.summary')
input_file1 [input_file2] ... A list of SAM or BAM format files.
## Options:
# Annotation
-F <string> Specify format of provided annotation file. Acceptable
formats include `GTF/GFF' and `SAF'. `GTF/GFF' by default.
See Users Guide for description of SAF format.
-t <string> Specify feature type in GTF annotation. `exon' by
default. Features used for read counting will be
extracted from annotation using the provided value.
-g <string> Specify attribute type in GTF annotation. `gene_id' by
default. Meta-features used for read counting will be
extracted from annotation using the provided value.
-A <string> Provide a chromosome name alias file to match chr names in
annotation with those in the reads. This should be a two-
column comma-delimited text file. Its first column should
include chr names in the annotation and its second column
should include chr names in the reads. Chr names are case
sensitive. No column header should be included in the
file.
# Level of summarization
-f Perform read counting at feature level (eg. counting
reads for exons rather than genes).
# Overlap between reads and features
-O Assign reads to all their overlapping meta-features (or
features if -f is specified).
--minOverlap <int> Minimum number of overlapping bases in a read that is
required for read assignment. 1 by default. Number of
overlapping bases is counted from both reads if paired
end. If a negative value is provided, then a gap of up
to specified size will be allowed between read and the
feature that the read is assigned to.
--fracOverlap <float> Minimum fraction of overlapping bases in a read that is
required for read assignment. Value should be within range
[0,1]. 0 by default. Number of overlapping bases is
counted from both reads if paired end. Both this option
and '--minOverlap' option need to be satisfied for read
assignment.
--largestOverlap Assign reads to a meta-feature/feature that has the
largest number of overlapping bases.
--readExtension5 <int> Reads are extended upstream by <int> bases from their
5' end.
--readExtension3 <int> Reads are extended upstream by <int> bases from their
3' end.
--read2pos <5:3> Reduce reads to their 5' most base or 3' most base. Read
counting is then performed based on the single base the
read is reduced to.
# Multi-mapping reads
-M Multi-mapping reads will also be counted. For a multi-
mapping read, all its reported alignments will be
counted. The `NH' tag in BAM/SAM input is used to detect
multi-mapping reads.
# Fractional counting
--fraction Assign fractional counts to features. This option must
be used together with '-M' or '-O' or both. When '-M' is
specified, each reported alignment from a multi-mapping
read (identified via 'NH' tag) will carry a fractional
count of 1/x, instead of 1 (one), where x is the total
number of alignments reported for the same read. When '-O'
is specified, each overlapping feature will receive a
fractional count of 1/y, where y is the total number of
features overlapping with the read. When both '-M' and
'-O' are specified, each alignment will carry a fraction
count of 1/(x*y).
# Read filtering
-Q <int> The minimum mapping quality score a read must satisfy in
order to be counted. For paired-end reads, at least one
end should satisfy this criteria. 0 by default.
--splitOnly Count split alignments only (ie. alignments with CIGAR
string containing 'N'). An example of split alignments is
exon-spanning reads in RNA-seq data.
--nonSplitOnly If specified, only non-split alignments (CIGAR strings do
not contain letter 'N') will be counted. All the other
alignments will be ignored.
--primary Count primary alignments only. Primary alignments are
identified using bit 0x100 in SAM/BAM FLAG field.
--ignoreDup Ignore duplicate reads in read counting. Duplicate reads
are identified using bit 0x400 in BAM/SAM FLAG field. The
whole read pair is ignored if one of the reads is a
duplicate read for paired end data.
# Strandness
-s <int> Perform strand-specific read counting. Acceptable values:
0 (unstranded), 1 (stranded) and 2 (reversely stranded).
0 by default.
# Exon-exon junctions
-J Count number of reads supporting each exon-exon junction.
Junctions were identified from those exon-spanning reads
in the input (containing 'N' in CIGAR string). Counting
results are saved to a file named '<output_file>.jcounts'
-G <string> Provide the name of a FASTA-format file that contains the
reference sequences used in read mapping that produced the
provided SAM/BAM files. This optional argument can be used
with '-J' option to improve read counting for junctions.
# Parameters specific to paired end reads
-p If specified, fragments (or templates) will be counted
instead of reads. This option is only applicable for
paired-end reads.
-B Count read pairs that have both ends successfully aligned
only.
-P Check validity of paired-end distance when counting read
pairs. Use -d and -D to set thresholds.
-d <int> Minimum fragment/template length, 50 by default.
-D <int> Maximum fragment/template length, 600 by default.
-C Do not count read pairs that have their two ends mapping
to different chromosomes or mapping to same chromosome
but on different strands.
--donotsort Do not sort reads in BAM/SAM input. Note that reads from
the same pair are required to be located next to each
other in the input.
# Number of CPU threads
-T <int> Number of the threads. 1 by default.
# Miscellaneous
-R Output detailed assignment result for each read. A text
file will be generated for each input file, including
names of reads and meta-features/features reads were
assigned to. See Users Guide for more details.
--tmpDir <string> Directory under which intermediate files are saved (later
removed). By default, intermediate files will be saved to
the directory specified in '-o' argument.
--maxMOp <int> Maximum number of 'M' operations allowed in a CIGAR
string. 10 by default. Both 'X' and '=' are treated as 'M'
and adjacent 'M' operations are merged in the CIGAR
string.
-v Output version of the program.
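One note on the featureCounts output that the script below parses: after a leading '#' comment line, the count table starts with the columns Geneid, Chr, Start, End, Strand and Length, followed by one count column per input SAM/BAM file, for example (the counts are made up for illustration):
Geneid    Chr    Start    End      Strand  Length  sample1.sam  sample2.sam
UN044011  chr1A  5946352  5946999  -       648     12           0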
The expression values are reported as FPKM here, since I used paired-end data; for single-end data RPKM can be used instead. I wrote my own Python script for the calculation; anyone else using it will need to adapt it.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'shengwei ma'
__author_email__ = '[email protected]'
import numpy as np
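# Per-sample totals used as the library-size denominator in the FPKM calculation
# below: (sample name, total count), in the same order as the count columns
# parsed from the featureCounts table.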
raw_total = [('root_Z10_rep1', 49168553), ('root_Z10_rep2', 44047402), ('root_Z13_rep1', 78098556),
('root_Z13_rep2', 38474362), ('root_Z39_rep1', 79981030), ('root_Z39_rep2', 41041508),
('stem_Z30_rep1', 46935246), ('stem_Z30_rep2', 38803969), ('stem_Z32_rep1', 51627704),
('stem_Z32_rep2', 37219517), ('stem_Z65_rep1', 39849949), ('stem_Z65_rep2', 40299574),
('leaf_Z10_rep1', 38168988), ('leaf_Z10_rep2', 43073693), ('leaf_Z23_rep1', 44071613),
('leaf_Z23_rep2', 40380776), ('leaf_Z71_rep1', 32810256), ('leaf_Z71_rep2', 35749803),
('spike_Z32_rep1', 46203474), ('spike_Z32_rep2', 43612313), ('spike_Z39_rep1', 40406588),
('spike_Z39_rep2', 47596209), ('spike_Z65_rep1', 43071042), ('spike_Z65_rep2', 48443902),
('carpel', 57881099), ('carpel-like structure', 63914055), ('stamen', 72275259),
('latent_lepto_rep1', 31693600), ('latent_lepto_rep2', 40260140), ('diplo_dia_rep1', 56486977),
('diplo_dia_rep2', 43990501), ('zygo_pachy_rep1', 37037924), ('zygo_pachy_rep2', 37678253),
('metaphaseI_rep1', 26954435), ('metaphaseI_rep2', 32180104), ('grain_Z71_rep1', 44263291),
('grain_Z71_rep2', 36875603), ('grain_Z75_rep1', 47740143), ('grain_Z75_rep2', 51819168),
('grain_Z85_rep1', 36879170), ('grain_Z85_rep2', 31412470), ('Wheat_Room1_10DPA', 16712256),
('Wheat_Room1_10DPA_Rep', 22819483), ('Wheat_Room2_10DPA', 27121510), ('Wheat_Room2_10DPA_Rep', 29453109),
('Wheat_Room1_AL_20DPA', 30598515), ('Wheat_Room1_AL_20DPA_Rep', 28518937), ('Wheat_Room2_AL_20DPA', 24838220),
('Wheat_Room2_AL_20DPA_Rep', 27715580), ('Wheat_Room1_AL_20DPA_Extra1', 29978007), ('Wheat_Room1_AL_20DPA_Extra2', 30079461),
('Wheat_Room1_SE_20DPA', 25140145), ('Wheat_Room1_SE_20DPA_Rep', 24446796), ('Wheat_Room2_SE_20DPA', 21339690),
('Wheat_Room2_SE_20DPA_Rep', 22815780),
('Wheat_Room1_TC_20DPA', 16629117), ('Wheat_Room1_TC_20DPA_Rep', 27612315), ('Wheat_Room2_TC_20DPA', 25304622),
('Wheat_Room2_TC_20DPA_Rep', 25352139), ('Wheat_Room1_REF_20DPA', 29929219), ('Wheat_Room1_REF_20DPA_Rep', 26636425),
('Wheat_Room2_REF_20DPA', 24316737), ('Wheat_Room2_REF_20DPA_Rep', 29330096), ('Wheat_Room1_SE_30DPA', 22777481),
('Wheat_Room1_SE_30DPA_Rep', 22777481), ('Wheat_Room2_SE_30DPA', 30513836), ('Wheat_Room2_SE_30DPA_Rep', 21486098),
('Wheat_Room1_AL_SE_30DPA', 28821672), ('Wheat_Room1_AL_SE_30DPA_Rep', 20134665), ('Wheat_Room2_AL_SE_30DPA', 23721856),
('Wheat_Room2_AL_SE_30DPA_Rep', 24896811), ('wheat_23_1', 28444918), ('wheat_23_2', 67968193),
('wheat_23_3', 24321425), ('wheat_4_1', 35430306), ('wheat_4_2', 22527710), ('wheat_4_3', 16848204)]
with open('MLJ_unique_expression.txt', 'r') as f:
print "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" \
"\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" \
"\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" \
"\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t" % \
('Geneid', 'Chr', 'Start', 'End', 'Strand', 'Length', 'root_Z10', 'root_Z13','root_Z39',
'stem_Z30', 'stem_Z32', 'stem_Z65', 'leaf_Z10', 'leaf_Z23', 'leaf_Z71',
'spike_Z32', 'spike_Z39', 'spike_Z65', 'carpel', 'carpel_like_structure',
'stamen', 'latet_lepto', 'diplo_dia', 'zygo_pachy', 'metaphaseI',
'grain_Z71', 'grain_Z75', 'grain_Z85', 'Wheat_10DPA', 'Wheat_AL_20DPA',
'Wheat_SE_20DPA', 'Wheat_TC_20DPA', 'Wheat_REF_20DPA', 'Wheat_SE_30DPA',
'Wheat_AL.SE_30DPA', 'wheat_23', 'wheat_4', 'root_Z10_std', 'root_Z13_std', 'root_Z39_std',
'stem_Z30_std', 'stem_Z32_std', 'stem_Z65_std', 'leaf_Z10_std', 'leaf_Z23_std', 'leaf_Z71_std',
'spike_Z32_std', 'spike_Z39_std', 'spike_Z65_std', 'carpel_std', 'carpel-like_std', 'stamen_std',
'latet_lepto_std', 'diplo_dia_std', 'zygo_pachy_std', 'metaphaseI_std', 'grain_Z71_std',
'grain_Z75_std', 'grain_Z85_std','Wheat_10DPA_std', 'Wheat_AL_20DPA_std','Wheat_SE_20DPA_std',
'Wheat_TC_20DPA_std', 'Wheat_REF_20DPA_std', 'Wheat_SE_30DPA_std',
'Wheat_AL.SE_30DPA_std', 'wheat_23_std', 'wheat_4_std')
for line in f:
if line.startswith('#') or line.startswith('Geneid'):
pass
else:
new = line.strip().split('\t')
(Geneid, Chr, Start, End, Strand, Length, root_Z10_rep1, root_Z10_rep2, root_Z13_rep1, root_Z13_rep2,
root_Z39_rep1, root_Z39_rep2, stem_Z30_rep1, stem_Z30_rep2, stem_Z32_rep1, stem_Z32_rep2, stem_Z65_rep1,
stem_Z65_rep2, leaf_Z10_rep1, leaf_Z10_rep2, leaf_Z23_rep1, leaf_Z23_rep2, leaf_Z71_rep1, leaf_Z71_rep2,
spike_Z32_rep1, spike_Z32_rep2, spike_Z39_rep1, spike_Z39_rep2, spike_Z65_rep1, spike_Z65_rep2, carpel,
carpel_like_structure, stamen, latet_lepto_rep1, latent_lepto_rep2, diplo_dia_rep1, diplo_dia_rep2,
zygo_pachy_rep1, zygo_pachy_rep2, metaphaseI_rep1, metaphaseI_rep2, grain_Z71_rep1, grain_Z71_rep2,
grain_Z75_rep1, grain_Z75_rep2, grain_Z85_rep1, grain_Z85_rep2, Wheat_Room1_10DPA, Wheat_Room1_10DPA_Rep,
Wheat_Room2_10DPA, Wheat_Room2_10DPA_Rep, Wheat_Room1_AL_20DPA, Wheat_Room1_AL_20DPA_Rep,
Wheat_Room2_AL_20DPA, Wheat_Room2_AL_20DPA_Rep, Wheat_Room1_AL_20DPA_Extra1, Wheat_Room1_AL_20DPA_Extra2,
Wheat_Room1_SE_20DPA, Wheat_Room1_SE_20DPA_Rep, Wheat_Room2_SE_20DPA, Wheat_Room2_SE_20DPA_Rep,
Wheat_Room1_TC_20DPA, Wheat_Room1_TC_20DPA_Rep, Wheat_Room2_TC_20DPA, Wheat_Room2_TC_20DPA_Rep,
Wheat_Room1_REF_20DPA, Wheat_Room1_REF_20DPA_Rep, Wheat_Room2_REF_20DPA, Wheat_Room2_REF_20DPA_Rep,
Wheat_Room1_SE_30DPA, Wheat_Room1_SE_30DPA_Rep, Wheat_Room2_SE_30DPA, Wheat_Room2_SE_30DPA_Rep,
Wheat_Room1_AL_SE_30DPA, Wheat_Room1_AL_SE_30DPA_Rep, Wheat_Room2_AL_SE_30DPA, Wheat_Room2_AL_SE_30DPA_Rep,
wheat_23_1, wheat_23_2, wheat_23_3, wheat_4_1, wheat_4_2, wheat_4_3) = new
new_root_Z10_rep1 = int(root_Z10_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[0][-1]))
new_root_Z10_rep2 = int(root_Z10_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[1][-1]))
new_root_Z13_rep1 = int(root_Z13_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[2][-1]))
new_root_Z13_rep2 = int(root_Z13_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[3][-1]))
new_root_Z39_rep1 = int(root_Z39_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[4][-1]))
new_root_Z39_rep2 = int(root_Z39_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[5][-1]))
new_stem_Z30_rep1 = int(stem_Z30_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[6][-1]))
new_stem_Z30_rep2 = int(stem_Z30_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[7][-1]))
new_stem_Z32_rep1 = int(stem_Z32_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[8][-1]))
new_stem_Z32_rep2 = int(stem_Z32_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[9][-1]))
new_stem_Z65_rep1 = int(stem_Z65_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[10][-1]))
new_stem_Z65_rep2 = int(stem_Z65_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[11][-1]))
new_leaf_Z10_rep1 = int(leaf_Z10_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[12][-1]))
new_leaf_Z10_rep2 = int(leaf_Z10_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[13][-1]))
new_leaf_Z23_rep1 = int(leaf_Z23_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[14][-1]))
new_leaf_Z23_rep2 = int(leaf_Z23_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[15][-1]))
new_leaf_Z71_rep1 = int(leaf_Z71_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[16][-1]))
new_leaf_Z71_rep2 = int(leaf_Z71_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[17][-1]))
new_spike_Z32_rep1 = int(spike_Z32_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[18][-1]))
new_spike_Z32_rep2 = int(spike_Z32_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[19][-1]))
new_spike_Z39_rep1 = int(spike_Z39_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[20][-1]))
new_spike_Z39_rep2 = int(spike_Z39_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[21][-1]))
new_spike_Z65_rep1 = int(spike_Z65_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[22][-1]))
new_spike_Z65_rep2 = int(spike_Z65_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[23][-1]))
new_carpel = int(carpel) * pow(10.0, 9) / (int(Length) * int(raw_total[24][-1]))
new_carpel_like_structure = int(carpel_like_structure) * pow(10.0, 9) / (int(Length) * int(raw_total[25][-1]))
new_stamen = int(stamen) * pow(10.0, 9) / (int(Length) * int(raw_total[26][-1]))
new_latet_lepto_rep1 = int(latet_lepto_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[27][-1]))
new_latet_lepto_rep2 = int(latent_lepto_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[28][-1]))
new_diplo_dia_rep1 = int(diplo_dia_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[29][-1]))
new_diplo_dia_rep2 = int(diplo_dia_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[30][-1]))
new_zygo_pachy_rep1 = int(zygo_pachy_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[31][-1]))
new_zygo_pachy_rep2 = int(zygo_pachy_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[32][-1]))
new_metaphaseI_rep1 = int(metaphaseI_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[33][-1]))
new_metaphaseI_rep2 = int(metaphaseI_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[34][-1]))
new_grain_Z71_rep1 = int(grain_Z71_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[35][-1]))
new_grain_Z71_rep2 = int(grain_Z71_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[36][-1]))
new_grain_Z75_rep1 = int(grain_Z75_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[37][-1]))
new_grain_Z75_rep2 = int(grain_Z75_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[38][-1]))
new_grain_Z85_rep1 = int(grain_Z85_rep1) * pow(10.0, 9) / (int(Length) * int(raw_total[39][-1]))
new_grain_Z85_rep2 = int(grain_Z85_rep2) * pow(10.0, 9) / (int(Length) * int(raw_total[40][-1]))
Wheat_Room1_10DPA = int(Wheat_Room1_10DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[41][-1]))
Wheat_Room1_10DPA_Rep = int(Wheat_Room1_10DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[42][-1]))
Wheat_Room2_10DPA = int(Wheat_Room2_10DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[43][-1]))
Wheat_Room2_10DPA_Rep = int(Wheat_Room2_10DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[44][-1]))
Wheat_Room1_AL_20DPA = int(Wheat_Room1_AL_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[45][-1]))
Wheat_Room1_AL_20DPA_Rep = int(Wheat_Room1_AL_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[46][-1]))
Wheat_Room2_AL_20DPA = int(Wheat_Room2_AL_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[47][-1]))
Wheat_Room2_AL_20DPA_Rep = int(Wheat_Room2_AL_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[48][-1]))
Wheat_Room1_AL_20DPA_Extra1 = int(Wheat_Room1_AL_20DPA_Extra1) * pow(10.0, 9) / (int(Length) * int(raw_total[49][-1]))
Wheat_Room1_AL_20DPA_Extra2 = int(Wheat_Room1_AL_20DPA_Extra2) * pow(10.0, 9) / (int(Length) * int(raw_total[50][-1]))
Wheat_Room1_SE_20DPA = int(Wheat_Room1_SE_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[51][-1]))
Wheat_Room1_SE_20DPA_Rep = int(Wheat_Room1_SE_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[52][-1]))
Wheat_Room2_SE_20DPA = int(Wheat_Room2_SE_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[53][-1]))
Wheat_Room2_SE_20DPA_Rep = int(Wheat_Room2_SE_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[54][-1]))
Wheat_Room1_TC_20DPA = int(Wheat_Room1_TC_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[55][-1]))
Wheat_Room1_TC_20DPA_Rep = int(Wheat_Room1_TC_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[56][-1]))
Wheat_Room2_TC_20DPA = int(Wheat_Room2_TC_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[57][-1]))
Wheat_Room2_TC_20DPA_Rep = int(Wheat_Room2_TC_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[58][-1]))
Wheat_Room1_REF_20DPA = int(Wheat_Room1_REF_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[59][-1]))
Wheat_Room1_REF_20DPA_Rep = int(Wheat_Room1_REF_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[60][-1]))
Wheat_Room2_REF_20DPA = int(Wheat_Room2_REF_20DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[61][-1]))
Wheat_Room2_REF_20DPA_Rep = int(Wheat_Room2_REF_20DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[62][-1]))
Wheat_Room1_SE_30DPA = int( Wheat_Room1_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[63][-1]))
Wheat_Room1_SE_30DPA_Rep = int(Wheat_Room1_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[64][-1]))
Wheat_Room2_SE_30DPA = int(Wheat_Room2_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[65][-1]))
Wheat_Room2_SE_30DPA_Rep = int(Wheat_Room2_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[66][-1]))
Wheat_Room1_AL_SE_30DPA = int(Wheat_Room1_AL_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[67][-1]))
Wheat_Room1_AL_SE_30DPA_Rep = int(Wheat_Room1_AL_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[68][-1]))
Wheat_Room2_AL_SE_30DPA = int(Wheat_Room2_AL_SE_30DPA) * pow(10.0, 9) / (int(Length) * int(raw_total[69][-1]))
Wheat_Room2_AL_SE_30DPA_Rep = int(Wheat_Room2_AL_SE_30DPA_Rep) * pow(10.0, 9) / (int(Length) * int(raw_total[70][-1]))
wheat_23_1 = int(wheat_23_1) * pow(10.0, 9) / (int(Length) * int(raw_total[71][-1]))
wheat_23_2 = int(wheat_23_2) * pow(10.0, 9) / (int(Length) * int(raw_total[72][-1]))
wheat_23_3 = int(wheat_23_3) * pow(10.0, 9) / (int(Length) * int(raw_total[73][-1]))
wheat_4_1 = int(wheat_4_1) * pow(10.0, 9) / (int(Length) * int(raw_total[74][-1]))
wheat_4_2 = int(wheat_4_2) * pow(10.0, 9) / (int(Length) * int(raw_total[75][-1]))
wheat_4_3 = int(wheat_4_3) * pow(10.0, 9) / (int(Length) * int(raw_total[76][-1]))
root_Z10_mean = np.mean(np.array([new_root_Z10_rep1, new_root_Z10_rep2]))
root_Z10_std = np.std(np.array([new_root_Z10_rep1, new_root_Z10_rep2]))
root_Z13_mean = np.mean(np.array([new_root_Z13_rep1, new_root_Z13_rep2]))
root_Z13_std = np.std(np.array([new_root_Z13_rep1, new_root_Z13_rep2]))
root_Z39_mean = np.mean(np.array([new_root_Z39_rep1, new_root_Z39_rep2]))
root_Z39_std = np.std(np.array([new_root_Z39_rep1, new_root_Z39_rep2]))
stem_Z30_mean = np.mean(np.array([new_stem_Z30_rep1, new_stem_Z30_rep2]))
stem_Z30_std = np.std(np.array([new_stem_Z30_rep1, new_stem_Z30_rep2]))
stem_Z32_mean = np.mean(np.array([new_stem_Z32_rep1, new_stem_Z32_rep2]))
stem_Z32_std = np.std(np.array([new_stem_Z32_rep1, new_stem_Z32_rep2]))
stem_Z65_mean = np.mean(np.array([new_stem_Z65_rep1, new_stem_Z65_rep2]))
stem_Z65_std = np.std(np.array([new_stem_Z65_rep1, new_stem_Z65_rep2]))
leaf_Z10_mean = np.mean(np.array([new_leaf_Z10_rep1, new_leaf_Z10_rep2]))
leaf_Z10_std = np.std(np.array([new_leaf_Z10_rep1, new_leaf_Z10_rep2]))
leaf_Z23_mean = np.mean(np.array([new_leaf_Z23_rep1, new_leaf_Z23_rep2]))
leaf_Z23_std = np.std(np.array([new_leaf_Z23_rep1, new_leaf_Z23_rep2]))
leaf_Z71_mean = np.mean(np.array([new_leaf_Z71_rep1, new_leaf_Z71_rep2]))
leaf_Z71_std = np.std(np.array([new_leaf_Z71_rep1, new_leaf_Z71_rep2]))
spike_Z32_mean = np.mean(np.array([new_spike_Z32_rep1, new_spike_Z32_rep2]))
spike_Z32_std = np.std(np.array([new_spike_Z32_rep1, new_spike_Z32_rep2]))
spike_Z39_mean = np.mean(np.array([new_spike_Z39_rep1, new_spike_Z39_rep2]))
spike_Z39_std = np.std(np.array([new_spike_Z39_rep1, new_spike_Z39_rep2]))
spike_Z65_mean = np.mean(np.array([new_spike_Z65_rep1, new_spike_Z65_rep2]))
spike_Z65_std = np.std(np.array([new_spike_Z65_rep1, new_spike_Z65_rep2]))
latet_lepto_mean = np.mean(np.array([new_latet_lepto_rep1, new_latet_lepto_rep2]))
latet_lepto_std = np.std(np.array([new_latet_lepto_rep1, new_latet_lepto_rep2]))
diplo_dia_mean = np.mean(np.array([new_diplo_dia_rep1, new_diplo_dia_rep2]))
diplo_dia_std = np.std(np.array([new_diplo_dia_rep1, new_diplo_dia_rep2]))
zygo_pachy_mean = np.mean(np.array([new_zygo_pachy_rep1, new_zygo_pachy_rep2]))
zygo_pachy_std = np.std(np.array([new_zygo_pachy_rep1, new_zygo_pachy_rep2]))
metaphaseI_mean = np.mean(np.array([new_metaphaseI_rep1, new_metaphaseI_rep2]))
metaphaseI_std = np.std(np.array([new_metaphaseI_rep1, new_metaphaseI_rep2]))
grain_Z71_mean = np.mean(np.array([new_grain_Z71_rep1, new_grain_Z71_rep2]))
grain_Z71_std = np.std(np.array([new_grain_Z71_rep1, new_grain_Z71_rep2]))
grain_Z75_mean = np.mean(np.array([new_grain_Z75_rep1, new_grain_Z75_rep2]))
grain_Z75_std = np.std(np.array([new_grain_Z75_rep1, new_grain_Z75_rep2]))
grain_Z85_mean = np.mean(np.array([new_grain_Z85_rep1, new_grain_Z85_rep2]))
grain_Z85_std = np.std(np.array([new_grain_Z85_rep1, new_grain_Z85_rep2]))
Wheat_10DPA_mean = np.mean(np.array([Wheat_Room1_10DPA, Wheat_Room1_10DPA_Rep,Wheat_Room2_10DPA, Wheat_Room2_10DPA_Rep]))
Wheat_10DPA_std = np.std(np.array([Wheat_Room1_10DPA, Wheat_Room1_10DPA_Rep,Wheat_Room2_10DPA, Wheat_Room2_10DPA_Rep]))
Wheat_AL_20DPA_mean = np.mean(np.array([Wheat_Room1_AL_20DPA, Wheat_Room1_AL_20DPA_Rep,Wheat_Room2_AL_20DPA, Wheat_Room2_AL_20DPA_Rep, Wheat_Room1_AL_20DPA_Extra1, Wheat_Room1_AL_20DPA_Extra2]))
Wheat_AL_20DPA_std = np.std(np.array([Wheat_Room1_AL_20DPA, Wheat_Room1_AL_20DPA_Rep,Wheat_Room2_AL_20DPA, Wheat_Room2_AL_20DPA_Rep, Wheat_Room1_AL_20DPA_Extra1, Wheat_Room1_AL_20DPA_Extra2]))
Wheat_SE_20DPA_mean = np.mean(np.array([Wheat_Room1_SE_20DPA, Wheat_Room1_SE_20DPA_Rep, Wheat_Room2_SE_20DPA, Wheat_Room2_SE_20DPA_Rep]))
Wheat_SE_20DPA_std = np.std(np.array([Wheat_Room1_SE_20DPA, Wheat_Room1_SE_20DPA_Rep, Wheat_Room2_SE_20DPA, Wheat_Room2_SE_20DPA_Rep]))
Wheat_TC_20DPA_mean = np.mean(np.array([Wheat_Room1_TC_20DPA, Wheat_Room1_TC_20DPA_Rep, Wheat_Room2_TC_20DPA, Wheat_Room2_TC_20DPA_Rep]))
Wheat_TC_20DPA_std = np.std(np.array([Wheat_Room1_TC_20DPA, Wheat_Room1_TC_20DPA_Rep, Wheat_Room2_TC_20DPA, Wheat_Room2_TC_20DPA_Rep]))
Wheat_REF_20DPA_mean = np.mean(np.array([Wheat_Room1_REF_20DPA, Wheat_Room1_REF_20DPA_Rep, Wheat_Room2_REF_20DPA, Wheat_Room2_REF_20DPA_Rep]))
Wheat_REF_20DPA_std = np.std(np.array([Wheat_Room1_REF_20DPA, Wheat_Room1_REF_20DPA_Rep, Wheat_Room2_REF_20DPA, Wheat_Room2_REF_20DPA_Rep]))
Wheat_SE_30DPA_mean = np.mean(np.array([Wheat_Room1_SE_30DPA, Wheat_Room1_SE_30DPA_Rep, Wheat_Room2_SE_30DPA, Wheat_Room2_SE_30DPA_Rep]))
Wheat_SE_30DPA_std = np.std(np.array([Wheat_Room1_SE_30DPA, Wheat_Room1_SE_30DPA_Rep, Wheat_Room2_SE_30DPA, Wheat_Room2_SE_30DPA_Rep]))
Wheat_AL_SE_30DPA_mean = np.mean(np.array([Wheat_Room1_AL_SE_30DPA, Wheat_Room1_AL_SE_30DPA_Rep, Wheat_Room2_AL_SE_30DPA, Wheat_Room2_AL_SE_30DPA_Rep]))
Wheat_AL_SE_30DPA_std = np.std(np.array([Wheat_Room1_AL_SE_30DPA, Wheat_Room1_AL_SE_30DPA_Rep, Wheat_Room2_AL_SE_30DPA, Wheat_Room2_AL_SE_30DPA_Rep]))
wheat_23_mean = np.mean(np.array([wheat_23_1, wheat_23_2, wheat_23_3]))
wheat_23_std = np.std(np.array([wheat_23_1, wheat_23_2, wheat_23_3]))
wheat_4_mean = np.mean(np.array([wheat_4_1, wheat_4_2, wheat_4_3]))
wheat_4_std = np.std(np.array([wheat_4_1, wheat_4_2, wheat_4_3]))
print "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" \
"\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" \
"\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" \
"\t%s\t\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % \
(Geneid, Chr, Start, End, Strand, Length, root_Z10_mean, root_Z13_mean,root_Z39_mean, stem_Z30_mean,
stem_Z32_mean, stem_Z65_mean, leaf_Z10_mean, leaf_Z23_mean, leaf_Z71_mean, spike_Z32_mean,
spike_Z39_mean, spike_Z65_mean, new_carpel, new_carpel_like_structure, new_stamen, latet_lepto_mean,
diplo_dia_mean, zygo_pachy_mean, metaphaseI_mean, grain_Z71_mean, grain_Z75_mean, grain_Z85_mean,
Wheat_10DPA_mean, Wheat_AL_20DPA_mean, Wheat_SE_20DPA_mean, Wheat_TC_20DPA_mean, Wheat_REF_20DPA_mean,
Wheat_SE_30DPA_mean, Wheat_AL_SE_30DPA_mean, wheat_23_mean, wheat_4_mean,
root_Z10_std, root_Z13_std, root_Z39_std, stem_Z30_std, stem_Z32_std, stem_Z65_std, leaf_Z10_std,
leaf_Z23_std, leaf_Z71_std, spike_Z32_std, spike_Z39_std, spike_Z65_std, 'null', 'null', 'null',
latet_lepto_std, diplo_dia_std, zygo_pachy_std, metaphaseI_std, grain_Z71_std, grain_Z75_std,
grain_Z85_std, Wheat_10DPA_std, Wheat_AL_20DPA_std, Wheat_SE_20DPA_std, Wheat_TC_20DPA_std,
Wheat_REF_20DPA_std, Wheat_SE_30DPA_std, Wheat_AL_SE_30DPA_std, wheat_23_std, wheat_4_std)
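Each per-sample value above is simply the standard FPKM formula: FPKM = fragments * 10^9 / (gene length in bp * library total). A minimal stand-alone version of that single calculation, with made-up numbers:
def fpkm(fragments, gene_length_bp, library_total):
    # FPKM = fragments * 1e9 / (gene length in bp * total fragments in the library)
    return fragments * 1e9 / (gene_length_bp * library_total)

print(fpkm(500, 1200, 49168553))  # e.g. 500 fragments on a 1.2 kb gene in a ~49 M fragment library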
Only FPKM, rather than TPM, can be used here: we do not have the complete set of transcripts, so TPM cannot be computed. Alternative splicing is widespread, and short-read sequencing cannot effectively distinguish the expression of alternatively spliced transcripts, so in a sense this only measures expression at the transcriptional level, not at the post-transcriptional level.
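For comparison, TPM normalizes each transcript's length-scaled count by the sum over all transcripts, which is why a complete transcript set is needed before TPM can be computed; a minimal sketch with made-up numbers:
def tpm(counts, lengths_bp):
    # counts and lengths_bp are parallel lists covering all transcripts
    rates = [c / float(l) for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

print(tpm([500, 300], [1200, 800]))  # roughly [526315.8, 473684.2]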