RNA-seq practice, part 2 (genome sequence download, annotation download, index download, alignment, alignment QC, htseq-count counting, exporting the count matrix)

The human reference genome is available from three main sources:
1. NCBI (https://www.ncbi.nlm.nih.gov/grc)
2. UCSC (http://hgdownload.soe.ucsc.edu/downloads.html)
3. Ensembl (http://asia.ensembl.org/index.html?redirect=no)

| NCBI | UCSC | Ensembl |
|---|---|---|
| NCBI36 (Build 36) | hg18 | Ensembl release_52 |
| GRCh37 | hg19 | Ensembl release_59/61/64/68/69/75 |
| GRCh38 | hg38 | Ensembl release_76/77/78/80/81/82 |


$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz # download the UCSC version of hg19
$ tar -zxvf chromFa.tar.gz # extract the per-chromosome FASTA files


$ cat chr*.fa > hg19.fa # merge the per-chromosome files into a single FASTA
$ rm -f chr*.fa # delete the per-chromosome files to save space
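A quick way to confirm that the merged FASTA contains the sequences you expect (ideally before the per-chromosome files are removed) is to list its header lines. This is a minimal sketch using only standard `grep` and `sed`; the function name `fasta_names` and the toy input are illustrative and not part of the original workflow.

```shell
# List the sequence names in a FASTA stream (header lines start with '>').
fasta_names() {
    grep '^>' | sed 's/^>//'
}

# Toy two-record FASTA standing in for hg19.fa (hypothetical content).
printf '>chr1\nACGT\nNNNN\n>chr2\nGGCC\n' | fasta_names
```

On the real file this would be run as `fasta_names < hg19.fa | head`.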


$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gtf.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gff3.gz
$ gzip -d gencode.v31lift37.annotation.gff3.gz # decompress
$ gzip -d gencode.v31lift37.annotation.gtf.gz  # decompress


For reference, here is what GTF and GFF3 files contain (https://www.jianshu.com/p/3e545b9a3c68). The two look nearly identical; the main practical difference is the syntax of the last (attribute) column:

GTF (Gene Transfer Format) is essentially GFF2. It is tab-delimited, with the following columns:

  1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
  2. source - name of the program that generated this feature, or the data source (database or project name)
  3. feature - feature type name (gene, exon, and so on), e.g. Gene, Variation, Similarity
  4. start - Start position of the feature on the chromosome, with sequence numbering starting at 1.
  5. end - End position of the feature on the chromosome, with sequence numbering starting at 1.
  6. score - A floating point value.
  7. strand - defined as + (forward) or - (reverse).
  8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
  9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

GFF3 (General Feature Format version 3) has the following columns:

  1. seqid - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
  2. source - name of the program that generated this feature, or the data source (database or project name)
  3. type - type of feature. Must be a term or accession from the SOFA sequence ontology
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A floating point value.
  7. strand - defined as + (forward) or - (reverse).
  8. phase - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
  9. attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent - see the GFF documentation for more details.
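The column that differs most between the two formats is the ninth: GTF writes attributes as `tag "value";` pairs, GFF3 as `tag=value` pairs. The sketch below pulls `gene_id` out of each style with `awk`. The function names and the two sample lines are made up for illustration (GENCODE-like in style, with toy coordinates); only the gene ID itself appears elsewhere in this post.

```shell
# Extract gene_id from column 9 of GTF lines (tag "value"; pairs).
gtf_gene_id() {
    awk -F '\t' '{ if (match($9, /gene_id "[^"]+"/)) print substr($9, RSTART + 9, RLENGTH - 10) }'
}

# Extract gene_id from column 9 of GFF3 lines (tag=value pairs).
gff3_gene_id() {
    awk -F '\t' '{
        if (match($9, /(^|;)gene_id=[^;]+/)) {
            v = substr($9, RSTART, RLENGTH)
            sub(/^;?gene_id=/, "", v)
            print v
        }
    }'
}

# Hypothetical sample lines; both calls print ENSG00000000003.14_2
printf 'chrX\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "ENSG00000000003.14_2"; gene_name "TSPAN6";\n' | gtf_gene_id
printf 'chrX\tHAVANA\tgene\t100\t200\t.\t-\t.\tID=ENSG00000000003.14_2;gene_id=ENSG00000000003.14_2;gene_name=TSPAN6\n' | gff3_gene_id
```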


$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz # download the prebuilt HISAT2 index; wget prints the progress log below
--2019-08-31 21:48:51--  ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
           => ‘hg19.tar.gz’
Resolving ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)...
Connecting to ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)||:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/infphilo/hisat2/data ... done.
==> SIZE hg19.tar.gz ... 4181115011
==> PASV ... done.    ==> RETR hg19.tar.gz ... done.
Length: 4181115011 (3.9G) (unauthoritative)

hg19.tar.gz                  100%[============================================>]   3.89G  1.27MB/s    in 58m 27s 

2019-08-31 22:47:18 (1.14 MB/s) - ‘hg19.tar.gz’ saved [4181115011]


$ tar -zxvf hg19.tar.gz # extract; the archive contains eight genome.*.ht2 index files and one shell script
$ rm -f hg19.tar.gz # delete the archive to save space

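RNA-seq data can be analysed in several ways, for example to find differentially expressed genes or to discover new alternative splicing events. If the goal is only differential gene expression, which needs nothing more than read counts, you can use aligners such as bowtie or bwa (typically against transcript sequences) or alignment-free tools such as salmon, and the latter are faster. If you want to find new isoforms or alternative splicing events, or to look at differential exon usage, you need a splice-aware tool such as TopHat, HISAT2 or STAR to locate the splice sites. The reason is that RNA-seq differs from DNA-seq: introns are removed when DNA is transcribed and processed into mRNA. A read from cDNA reverse-transcribed from mRNA that does not align contiguously to the reference is therefore split and aligned again, to test whether an intron lies in between.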

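Aligning to the transcriptome rather than to the genome has several limitations:
① Transcriptome alignment requires accurate sequences of known transcripts; reads from unknown transcripts (for example lncRNAs absent from the databases) or from transcripts with inaccurate sequences cannot be aligned correctly.
② Relatedly, transcriptome alignment cannot be used to analyse alternative splicing: reads spanning splice junctions that are not in the database are simply discarded.
③ Because one gene can have several transcripts, many reads align perfectly to more than one transcript. Their mapping scores are therefore low, and downstream quantification software may discard them, which affects later analysis (some tools handle this case).
④ Because the reference differs from the one used for DNA sequencing, integrating RNA and DNA data is harder.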



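HISAT2 replaces the Bowtie/TopHat programs and aligns RNA-seq reads to a genome quickly. HISAT uses a large set of FM indexes that together cover the whole genome. The index exists to speed up alignment: genomes are long, so comparing each read directly against the full genome sequence would be very slow, whereas searching the prebuilt index files saves a great deal of time. For the human genome it needs about 48,000 indexes, each representing a genomic region of ~64,000 bp. These small indexes, combined with several alignment strategies, make the alignment of RNA-seq reads efficient, particularly for reads that span multiple exons. Despite the number of indexes, HISAT needs only 4.3 GB of memory. It supports genomes of any size, including those larger than 4 billion bases. (http://www.biotrainee.com/thread-2073-1-1.html)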


$ hisat2 -h # first look at the usage; this prints a long help text explaining each option
HISAT2 version 2.1.0 by Daehwan Kim (www.ccb.jhu.edu/people/infphilo)
  hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

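-x <ht2-idx>: path prefix of the reference genome index files.
-1 <m1>: first file of a paired-end run. Separate multiple files with commas. Reads may differ in length.
-2 <m2>: second file of a paired-end run. Separate multiple files with commas, in the same order as the files given to -1. Reads may differ in length.
-U <r>: file of single-end reads. Separate multiple files with commas. Can be combined with -1 and -2. Reads may differ in length.
-S <sam>: path of the output SAM file.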


#!/bin/bash
for ((i=77;i<=80;i++))
do hisat2 -t -x /media/yanfang/FYWD/RNA_seq/ref_genome/index/hg19/genome -U /media/yanfang/FYWD/RNA_seq/fastq_files/SRR9576${i}_1.fastq.gz -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam
done

Now comes the waiting, which depends on your laptop's specification; mine has 8 GB of RAM and an i7 7500.

$ ./hisat2_map.sh # run the HISAT2 alignment script
Time loading forward index: 00:00:40
Time loading reference: 00:00:07
Multiseed full-index search: 00:12:04
20803937 reads; of these:
  20803937 (100.00%) were unpaired; of these:
    1198535 (5.76%) aligned 0 times
    17146459 (82.42%) aligned exactly 1 time
    2458943 (11.82%) aligned >1 times
94.24% overall alignment rate
Time searching: 00:12:11
Overall time: 00:12:51
Time loading forward index: 00:00:47
Time loading reference: 00:00:07
Multiseed full-index search: 00:05:04
8828013 reads; of these:
  8828013 (100.00%) were unpaired; of these:
    572582 (6.49%) aligned 0 times
    7275873 (82.42%) aligned exactly 1 time
    979558 (11.10%) aligned >1 times
93.51% overall alignment rate
Time searching: 00:05:12
Overall time: 00:05:59
Time loading forward index: 00:00:43
Time loading reference: 00:00:06
Multiseed full-index search: 00:11:37
19909740 reads; of these:
  19909740 (100.00%) were unpaired; of these:
    1256224 (6.31%) aligned 0 times
    16065546 (80.69%) aligned exactly 1 time
    2587970 (13.00%) aligned >1 times
93.69% overall alignment rate
Time searching: 00:11:43
Overall time: 00:12:26
Time loading forward index: 00:00:43
Time loading reference: 00:00:07
Multiseed full-index search: 00:13:38
24231941 reads; of these:
  24231941 (100.00%) were unpaired; of these:
    1348062 (5.56%) aligned 0 times
    20030375 (82.66%) aligned exactly 1 time
    2853504 (11.78%) aligned >1 times
94.44% overall alignment rate
Time searching: 00:13:45
Overall time: 00:14:28

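The most important statistic is the fraction of reads that align to the genome or transcriptome. For the human genome the expected alignment rate is 70-90%. Some reads align equally well to a limited number of locations; these are called multi-mapping reads. Alignment rates against the transcriptome are lower, because reads from unannotated transcripts are lost, and multi-mapping reads are more frequent, because isoforms of the same gene share exons.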
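The overall alignment rate that HISAT2 reports is the share of reads that aligned at least once, that is, the total minus the reads that aligned 0 times. A small sketch reproduces it from the two numbers in the log; the helper name `alignment_rate` is an illustrative assumption.

```shell
# Overall alignment rate (%) from the total read count and the number of
# reads that aligned 0 times.
alignment_rate() {
    awk -v total="$1" -v unaligned="$2" \
        'BEGIN { printf "%.2f\n", (total - unaligned) / total * 100 }'
}

alignment_rate 20803937 1198535   # first sample in the HISAT2 log: prints 94.24
```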

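SAM (Sequence Alignment/Map) is the standard format for storing alignments in high-throughput sequencing. BAM is the binary, compressed form of SAM and takes much less disk space. Why convert? BAM files are smaller and, once sorted and indexed, faster for programs to process. The tool for this is SAMtools:
view: convert between BAM and SAM, and extract subsets of alignments
sort: sort alignments, by coordinate by default, by read name with -n, by tag value with -t; -o names the output file
merge: merge several sorted alignment files
index: index a coordinate-sorted alignment file
faidx: index a FASTA file and extract subsequences
tview: text-based alignment viewer
pileup (mpileup in current releases): per-position pileup output, used for consensus/indel calling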

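for i in `seq 77 80`
do
  # Step 1: convert the SAM file to BAM. -b selects BAM output, which goes to stdout and is redirected into the .bam file; -S (input is SAM) is only needed by old samtools releases
  samtools view -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam -b > /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam
  # Step 2: sort by coordinate (the default) and write to the file given with -o
  samtools sort /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
  # Step 3: index the sorted BAM
  samtools index /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
done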


$ samtools flagstat SRR957677_sorted.bam
25783414 + 0 in total (QC-passed reads + QC-failed reads)
4979477 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24584879 + 0 mapped (95.35% : N/A)
0 + 0 paired in sequencing # 0 because this is single-end data
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
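Note that flagstat counts alignment records, not reads, so secondary alignments of multi-mapping reads inflate the total. Subtracting them recovers the number of input reads that HISAT2 reported. A minimal sketch with the numbers copied from the output above (the variable names are illustrative):

```shell
total=25783414      # "in total" from the flagstat output
secondary=4979477   # "secondary" from the flagstat output
mapped=24584879     # "mapped" from the flagstat output

primary=$((total - secondary))           # one primary record per input read
mapped_primary=$((mapped - secondary))   # reads with at least one alignment

echo "primary records: $primary"                # 20803937, the HISAT2 read count
echo "mapped primary records: $mapped_primary"  # 19605402 = 17146459 + 2458943
```

Dividing 19605402 by 20803937 gives the 94.24% that HISAT2 printed, rather than the 95.35% shown by flagstat.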


$ sudo -H pip install RSeQC
$ bam_stat.py -i SRR957677_sorted.bam
Load BAM file ...  Done
#All numbers are READ count
Total records:                          25783414
QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        4979477
Unmapped reads:                         1198535
mapq < mapq_cut (non-unique):           2458943
mapq >= mapq_cut (unique):              17146459
Read-1:                                 0
Read-2:                                 0
Reads map to '+':                       8670460
Reads map to '-':                       8475999
Non-splice reads:                       14136525
Splice reads:                           3009934
Reads mapped in proper pairs:           0
Proper-paired reads map to different chrom:0

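At the gene level, commonly used tools include HTSeq-count, featureCounts, BEDTools, Qualimap, Rsubread and GenomicRanges. Taking the widely used HTSeq-count as an example, what these tools have to decide, from the overlap between a read and the gene positions, is which gene each read belongs to. Note that tools treat multi-mapping reads differently: HTSeq-count by default ignores them altogether, while Qualimap gives every location a share, dividing the read equally.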

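At the transcript level, the usual tools are Cufflinks, its successor StringTie, and eXpress. The difficulty they face is that the isoforms of a gene usually overlap: when second-generation reads are shorter than the transcripts, how can the isoforms be told apart? Most of these tools use expectation maximization (EM). Fortunately, third-generation (long-read) sequencing now exists. All of the tools above are alignment-based; many alignment-free tools, such as kallisto, Sailfish and salmon, skip the alignment step and produce read counts directly, which makes them faster. A recent paper [1], however, reports that these methods show sample-specific and read-length biases when estimating abundance.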


The counting models defined by htseq-count:

[Figure 1: htseq-count counting models (image not preserved)]
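As a rough illustration of the default `union` mode: a read is assigned to a gene only if it overlaps exactly one gene, and otherwise ends up in `__no_feature` or `__ambiguous`. The sketch below is a simplification under stated assumptions: it ignores strand, exon structure and the two `intersection-*` modes, and the function names and toy gene table are hypothetical.

```shell
# assign_read START END: read a gene table (name, start, end; tab-separated)
# from stdin and print the gene the read is assigned to, or the name of the
# special counter it would fall into.
assign_read() {
    awk -F '\t' -v rs="$1" -v re="$2" '
        # closed-interval overlap test; count each gene name once
        $2 <= re && $3 >= rs && !($1 in seen) { seen[$1] = 1; n++; hit = $1 }
        END {
            if (n == 0)      print "__no_feature"
            else if (n == 1) print hit
            else             print "__ambiguous"
        }'
}

# Toy table with two overlapping genes (made-up coordinates).
toy_genes() { printf 'geneA\t100\t200\ngeneB\t150\t300\n'; }

toy_genes | assign_read 110 140   # overlaps geneA only
toy_genes | assign_read 160 180   # overlaps both genes
toy_genes | assign_read 400 450   # overlaps no gene
```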


for ((i=77;i<=80;i++))
do
  samtools sort -n /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_nsorted.bam # sort by read name for htseq-count
done


$ htseq-count --h
usage: htseq-count [options] alignment_file gff_file

This script takes one or more alignment files in SAM/BAM format and a feature
file in GFF format and calculates for each feature the number of reads mapping
to it. See http://htseq.readthedocs.io/en/master/count.html for details.

positional arguments:
  samfilenames          Path to the SAM/BAM files containing the mapped reads.
                        If '-' is selected, read from standard input
  featuresfilename      Path to the file containing the features

optional arguments:
  -h, --help            show this help message and exit
  -f {sam,bam}, --format {sam,bam}
                        type of <alignment_file> data, either 'sam' or 'bam'
                        (default: sam)
  -r {pos,name}, --order {pos,name}
# Sort the data with samtools sort, by read name or by position; the default is name
                        'pos' or 'name'. Sorting order of <alignment_file>
                        (default: name). Paired-end sequencing data must be
                        sorted either by position or by read name, and the
                        sorting order must be specified. Ignored for single-
                        end data.
  --max-reads-in-buffer MAX_BUFFER_SIZE
                        When <alignment_file> is paired end sorted by
                        position, allow only so many reads to stay in memory
                        until the mates are found (raising this number will
                        use more memory). Has no effect for single end or
                        paired end sorted by name
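  -s {yes,no,reverse}, --stranded {yes,no,reverse}
# Whether the data come from a strand-specific assay. The default is yes, so pass -s no for unstranded libraries.
# With no, each read is compared against features on both the sense and the antisense strand.
                        whether the data is from a strand-specific assay.
                        Specify 'yes', 'no', or 'reverse' (default: yes).
                        'reverse' means 'yes' with reversed strand
                        interpretation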
  -a MINAQUAL, --minaqual MINAQUAL
# Minimum alignment quality; reads below the threshold are skipped
                        skip all reads with alignment quality lower than the
                        given minimum value (default: 10)
  -t FEATURETYPE, --type FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
  -i IDATTR, --idattr IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
  --additional-attr ADDITIONAL_ATTR
                        Additional feature attributes (default: none, suitable
                        for Ensembl GTF files: gene_name). Use multiple times
                        for each different attribute
  -m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
                        mode to handle reads overlapping more than one feature
                        (choices: union, intersection-strict, intersection-
                        nonempty; default: union)
  --nonunique {none,all}
                        Whether to score reads that are not uniquely aligned
                        or ambiguously assigned to features
  --secondary-alignments {score,ignore}
                        Whether to score secondary alignments (0x100 flag)
  --supplementary-alignments {score,ignore}
                        Whether to score supplementary alignments (0x800 flag)
  -o SAMOUTS, --samout SAMOUTS
                        write out all SAM alignment records into SAM files
                        (one per input file needed), annotating each line with
                        its feature assignment (as an optional field with tag
                        'XF')
  -q, --quiet           suppress progress report


for i in `seq 77 80`
do
   htseq-count -r name -f bam /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_nsorted.bam /media/yanfang/FYWD/RNA_seq/ref_genome/gencode.v31lift37.annotation.gtf > /media/yanfang/FYWD/RNA_seq/matrix/SRR9576${i}.count
done


yanfang@YF-Lenovo:/media/yanfang/FYWD/RNA_seq/matrix$ wc -l *.count
  62297 SRR957677.count
  62297 SRR957678.count
  62297 SRR957679.count
  62297 SRR957680.count
 249188 total


$ head -n 4 SRR9576*.count
==> SRR957677.count <==
ENSG00000000003.14_2    807
ENSG00000000005.6_3     0
ENSG00000000419.12_4    389
ENSG00000000457.14_4    288

==> SRR957678.count <==
ENSG00000000003.14_2    357
ENSG00000000005.6_3     0
ENSG00000000419.12_4    174
ENSG00000000457.14_4    108

==> SRR957679.count <==
ENSG00000000003.14_2    800
ENSG00000000005.6_3     0
ENSG00000000419.12_4    405
ENSG00000000457.14_4    218

==> SRR957680.count <==
ENSG00000000003.14_2    963
ENSG00000000005.6_3     1
ENSG00000000419.12_4    509
ENSG00000000457.14_4    283


$ tail -n 4 SRR9576*.count
==> SRR957677.count <==
__ambiguous 341518
__too_low_aQual 0
__not_aligned   1198535
__alignment_not_unique  2458943

==> SRR957678.count <==
__ambiguous 138861
__too_low_aQual 0
__not_aligned   572582
__alignment_not_unique  979558

==> SRR957679.count <==
__ambiguous 360081
__too_low_aQual 0
__not_aligned   1256224
__alignment_not_unique  2587970

==> SRR957680.count <==
__ambiguous 411012
__too_low_aQual 0
__not_aligned   1348062
__alignment_not_unique  2853504
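The last lines of each file are htseq-count's special counters, not genes, so it helps to see how many reads were actually assigned to a gene. A minimal sketch, assuming the two-column layout shown above; the function name `summarize_counts` is illustrative, and the demo input is just the four gene rows and four counter rows printed for SRR957677, not the whole file.

```shell
# Summarise an htseq-count table read from stdin: reads assigned to genes
# versus reads that fell into the special "__" counters.
summarize_counts() {
    awk '
        /^__/ { special += $2; next }
        { assigned += $2 }
        END { printf "assigned %d\nnot_assigned %d\n", assigned, special }'
}

# Excerpt of SRR957677.count taken from the head/tail output above.
printf '%s\t%s\n' \
    ENSG00000000003.14_2 807 \
    ENSG00000000005.6_3 0 \
    ENSG00000000419.12_4 389 \
    ENSG00000000457.14_4 288 \
    __ambiguous 341518 \
    __too_low_aQual 0 \
    __not_aligned 1198535 \
    __alignment_not_unique 2458943 | summarize_counts
```

On a real file the call would be `summarize_counts < SRR957677.count`.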

First start RStudio and run R. Load the data and give the columns names:

> options(stringsAsFactors = FALSE) 
> control1<-read.table("SRR957677.count",sep= "\t",col.names = c("gene_id","control1"))
> head(control1) # show the first rows
               gene_id control1
1 ENSG00000000003.14_2      807
2  ENSG00000000005.6_3        0
3 ENSG00000000419.12_4      389
4 ENSG00000000457.14_4      288
5 ENSG00000000460.17_6      505
6 ENSG00000000938.13_4        0
> control2<-read.table("SRR957678.count",sep= "\t",col.names = c("gene_id","control2"))
> treat1<-read.table("SRR957679.count",sep= "\t",col.names = c("gene_id","treat1"))
> treat2<-read.table("SRR957680.count",sep= "\t",col.names = c("gene_id","treat2"))
> tail(control2) # show the last rows
                     gene_id control2
62292    ENSG00000288111.1_1        0
62293           __no_feature  3856774
62294            __ambiguous   138861
62295        __too_low_aQual        0
62296          __not_aligned   572582
62297 __alignment_not_unique   979558
> tail(treat2)
                     gene_id   treat2
62292    ENSG00000288111.1_1        0
62293           __no_feature 10430059
62294            __ambiguous   411012
62295        __too_low_aQual        0
62296          __not_aligned  1348062
62297 __alignment_not_unique  2853504
> raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
> head(raw_count) # merge() sorts by gene_id, so the row order has changed
                gene_id control1 control2  treat1   treat2
1 __alignment_not_unique  2458943   979558 2587970  2853504
2            __ambiguous   341518   138861  360081   411012
3           __no_feature  9096888  3856774 8247195 10430059
4          __not_aligned  1198535   572582 1256224  1348062
5        __too_low_aQual        0        0       0        0
6   ENSG00000000003.14_2      807      357     800      963
> tail(raw_count)
                  gene_id control1 control2 treat1 treat2
62292 ENSG00000288106.1_1        0        3      1      2
62293 ENSG00000288107.1_1        2        0      0      0
62294 ENSG00000288108.1_1        0        0      0      0
62295 ENSG00000288109.1_1        0        0      1      0
62296 ENSG00000288110.1_1        0        0      0      0
62297 ENSG00000288111.1_1        0        0      0      0
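As a cross-check outside R, the same matrix can be assembled on the command line. The sketch below assumes every `.count` file has the two-column layout shown earlier; the function name `count_matrix` is hypothetical, and the demo files are tiny excerpts written to a temporary directory, not the real outputs.

```shell
# count_matrix FILE...: combine htseq-count tables into one tab-separated
# matrix (one row per gene, one column per file), skipping the "__" counters.
count_matrix() {
    awk '
        FNR == 1 { nfile++ }
        /^__/    { next }
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++ngene] = $1 }
            count[$1, nfile] = $2
        }
        END {
            for (i = 1; i <= ngene; i++) {
                row = order[i]
                for (j = 1; j <= nfile; j++) {
                    key = order[i] SUBSEP j
                    row = row "\t" ((key in count) ? count[key] : "NA")
                }
                print row
            }
        }' "$@"
}

# Demo on two small excerpts (values taken from the head output above).
tmp=$(mktemp -d)
printf 'ENSG00000000003.14_2\t807\nENSG00000000005.6_3\t0\n__ambiguous\t341518\n' > "$tmp/SRR957677.count"
printf 'ENSG00000000003.14_2\t357\nENSG00000000005.6_3\t0\n__ambiguous\t138861\n' > "$tmp/SRR957678.count"
count_matrix "$tmp/SRR957677.count" "$tmp/SRR957678.count"
rm -r "$tmp"
```

Unlike `merge()` in R, this keeps the genes in the order of the first file.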

For reference, the start of the read.table signature: read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), ...)

> raw_count_filt <- raw_count[-1:-5,] # drop the first 5 rows (the htseq-count "__" summary counters)
> head(raw_count_filt) # first rows of the filtered matrix
               gene_id control1 control2 treat1 treat2
6  ENSG00000000003.14_2      807      357    800    963
7   ENSG00000000005.6_3        0        0      0      1
8  ENSG00000000419.12_4      389      174    405    509
9  ENSG00000000457.14_4      288      108    218    283
10 ENSG00000000460.17_6      505      208    451    543
11 ENSG00000000938.13_4        0        0      0      0


> ENSEMBL <- gsub("\\.\\d*\\_\\d*", "", raw_count_filt$gene_id) # strip the version suffix (the dot and everything after it) from gene_id
# another post uses this instead: ENSEMBL <- gsub("(.*?)\\.\\d*?_\\d", "\\1", raw_count_filt$gene_id)
> row.names(raw_count_filt) <- ENSEMBL # use the stripped IDs as the row names of raw_count_filt
> raw_count_filt1 <- cbind(ENSEMBL,raw_count_filt) # bind the ENSEMBL column onto raw_count_filt
> colnames(raw_count_filt1) <- c("ensembl_gene_id","gene_id","control1","control2","treat1","treat2")
> head(raw_count_filt1)
               ensembl_gene_id              gene_id control1 control2 treat1 treat2
ENSG00000000003 ENSG00000000003 ENSG00000000003.14_2      807      357    800    963
ENSG00000000005 ENSG00000000005  ENSG00000000005.6_3        0        0      0      1
ENSG00000000419 ENSG00000000419 ENSG00000000419.12_4      389      174    405    509
ENSG00000000457 ENSG00000000457 ENSG00000000457.14_4      288      108    218    283
ENSG00000000460 ENSG00000000460 ENSG00000000460.17_6      505      208    451    543
ENSG00000000938 ENSG00000000938 ENSG00000000938.13_4        0        0      0      0
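The same ID clean-up can be done in the shell, which is handy for checking a few IDs before moving to R. A minimal sketch using POSIX parameter expansion; the function name `strip_version` is illustrative.

```shell
# Drop everything from the first dot onward:
# ENSG00000000003.14_2 -> ENSG00000000003
strip_version() {
    printf '%s\n' "${1%%.*}"
}

strip_version ENSG00000000003.14_2
```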


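> library(biomaRt) # provides useMart(), useDataset() and getBM()
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl")) # the default Ensembl mart is GRCh38, so the positions returned below are GRCh38 coordinates, not hg19
> my_ensembl_gene_id <- row.names(raw_count_filt1)
> options(timeout = 4000000) # raise the connection timeout
> hg_symbols<- getBM(attributes=c('ensembl_gene_id','hgnc_symbol',"chromosome_name", "start_position","end_position", "band"), filters= 'ensembl_gene_id', values = my_ensembl_gene_id, mart = mart)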
> head(hg_symbols)
  ensembl_gene_id hgnc_symbol chromosome_name start_position end_position   band
1 ENSG00000000003      TSPAN6               X      100627109    100639991  q22.1
2 ENSG00000000005        TNMD               X      100584936    100599885  q22.1
3 ENSG00000000419        DPM1              20       50934867     50958555 q13.13
4 ENSG00000000457       SCYL3               1      169849631    169894267  q24.2
5 ENSG00000000460    C1orf112               1      169662007    169854080  q24.2
6 ENSG00000000938         FGR               1       27612064     27635185  p35.3


> readcount <- merge(raw_count_filt1, hg_symbols, by="ensembl_gene_id")
> head(readcount)
 ensembl_gene_id              gene_id control1 control2 treat1 treat2 hgnc_symbol chromosome_name start_position end_position   band
1 ENSG00000000003 ENSG00000000003.14_2      807      357    800    963      TSPAN6               X      100627109    100639991  q22.1
2 ENSG00000000005  ENSG00000000005.6_3        0        0      0      1        TNMD               X      100584936    100599885  q22.1
3 ENSG00000000419 ENSG00000000419.12_4      389      174    405    509        DPM1              20       50934867     50958555 q13.13
4 ENSG00000000457 ENSG00000000457.14_4      288      108    218    283       SCYL3               1      169849631    169894267  q24.2
5 ENSG00000000460 ENSG00000000460.17_6      505      208    451    543    C1orf112               1      169662007    169854080  q24.2
6 ENSG00000000938 ENSG00000000938.13_4        0        0      0      0         FGR               1       27612064     27635185  p35.3


> write.csv(readcount, file='readcount_all.csv')
> readcount<-raw_count_filt1[ ,-1:-2]
> write.csv(readcount, file='readcount.csv')
> head(readcount)
                 control1 control2 treat1 treat2
ENSG00000000003      807      357    800    963
ENSG00000000005        0        0      0      1
ENSG00000000419      389      174    405    509
ENSG00000000457      288      108    218    283
ENSG00000000460      505      208    451    543
ENSG00000000938        0        0      0      0

