Reference Human Genome DNA

GRCh38 vs hg38
Compared with NCBI's GRCh38, UCSC's hg38 lacks the EBV sequence, the decoy sequences, and the HLA sequences.

Outline

1. Introduction
2. File names and contents
3. Sequence names
4. Metadata tag-value pairs
5. Definitions

1. Introduction

The files in this directory provide the FASTA format sequences for a
genome assembly in a package convenient for use by various Next
Generation Sequence read alignment pipelines. The sequence names,
sequence order, and format of the sequence definition lines, were
developed in consultation with several developers and major users of
alignment pipelines.

2. File names and contents

GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz is a gzipped
file that contains FASTA format sequences for the following:
1. chromosomes from the GRCh38 Primary Assembly unit.
   Note: the two PAR regions on chrY have been hard-masked with Ns. The
   chromosome Y sequence provided therefore has the same coordinates as
   the GenBank sequence but it is not identical to the GenBank sequence.
   Similarly, duplicate copies of centromeric arrays and WGS on
   chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns.
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence
   Note: The EBV sequence is not part of the genome assembly but is
   included in the analysis set as a sink for alignment of reads that
   are often present in sequencing samples.

GCA_000001405.15_GRCh38_full_analysis_set.fna.gz is a gzipped file
that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
6. alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units.
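
To check which sequences a downloaded analysis set actually contains, the
gzipped FASTA can be streamed record by record. The following is a minimal
Python sketch, not part of the original README; the function names
read_fasta and summarize are hypothetical, and the file is assumed to be an
ordinary gzip-compressed FASTA as described above.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs from an open FASTA text handle.

    The leading '>' is stripped from the definition line and the sequence
    lines are joined into one string without line breaks.
    """
    defline = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline = line[1:]
            chunks = []
        elif line and defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def summarize(path: str) -> None:
    """Print name, length, and N count for each sequence in a gzipped FASTA.

    The N count covers every N in the sequence (assembly gaps as well as
    hard-masked regions); the two cannot be told apart from the bases alone.
    """
    with gzip.open(path, "rt") as handle:
        for defline, seq in read_fasta(handle):
            name = defline.split()[0]
            print(name, len(seq), seq.upper().count("N"))
```

Calling summarize("GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz") on
a local copy of the file would print one line per sequence. Each sequence is
held in memory whole, which is simple but not tuned for speed.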

3. Sequence names

The sequence names in the analysis sets follow UCSC-style naming
patterns.

Chromosomes:
chr{chromosome number or name}
e.g. chr1 or chrX
chrM for the mitochondrial genome.

Unlocalized scaffolds:
chr{chromosome number or name}_{sequence_accession}v{sequence_version}_random
e.g. chr17_GL000205v2_random

Unplaced scaffolds:
chrUn_{sequence_accession}v{sequence_version}
e.g. chrUn_GL000220v1

Alternate loci scaffolds:
chr{chromosome number or name}_{sequence_accession}v{sequence_version}_alt
e.g. chr6_GL000250v2_alt
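
The naming patterns above are regular enough to be recognized with a few
regular expressions. The sketch below is an illustration only: the function
name classify_name and the category labels are hypothetical, and the
accession is assumed to be uppercase letters followed by digits, as in the
examples given here.

```python
import re

# Accession and version as they appear in the names above, e.g. GL000205v2.
# Assumption: accessions are uppercase letters followed by digits.
_ACC = r"[A-Z]+\d+v\d+"
_CHROM = r"chr(?:\d+|X|Y)"

# More specific patterns come first so each name gets a single label.
_PATTERNS = [
    ("alt-scaffold", re.compile(rf"^{_CHROM}_{_ACC}_alt$")),
    ("unlocalized", re.compile(rf"^{_CHROM}_{_ACC}_random$")),
    ("unplaced", re.compile(rf"^chrUn_{_ACC}$")),
    ("mitochondrion", re.compile(r"^chrM$")),
    ("chromosome", re.compile(rf"^{_CHROM}$")),
]


def classify_name(name: str) -> str:
    """Return a label for a UCSC-style sequence name, or 'other' if none fits."""
    for label, pattern in _PATTERNS:
        if pattern.match(name):
            return label
    return "other"


for example in ("chr1", "chrX", "chrM", "chr17_GL000205v2_random",
                "chrUn_GL000220v1", "chr6_GL000250v2_alt"):
    print(example, classify_name(example))
```

Names that match none of the patterns, such as chrEBV, fall through to
"other" rather than raising an error.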

4. Metadata tag-value pairs

The FASTA definition lines contain sequence metadata in a series of
space-separated tag-value pairs.

Tag   Value
---   -----
AC:   sequence accession.version
gi:   sequence gi
LN:   sequence length
rg:   region
      - chromosome to which unlocalized scaffolds are assigned,
        e.g. chr1
      - region on chromosome within which alt-scaffolds or patch
        scaffolds are placed, e.g. chr6:28696604-33335493
      - not present for chromosomes, other replicons, or unplaced
        scaffolds
rl:   role of the sequence in the assembly
      - possible values are: Chromosome, Mitochondrion, unlocalized,
        unplaced, alt-scaffold, fix-patch, novel-patch, decoy
M5:   md5 checksum of the sequence as a single string of uppercase
      letters without line breaks (as produced by Samtools or Picard)
AS:   assembly-name
hm:   hard-masked regions, either a single span, two spans separated by
      a comma, or "multiple" if more than two spans were hard-masked
tp:   topology
      - circular for chrM and chrEBV
      - not present for linear chromosomes and scaffolds
5. Definitions

Unlocalized sequence:
A sequence found in an assembly that is associated with a specific
chromosome but cannot be ordered or oriented on that chromosome.

Unplaced sequence:
A sequence found in an assembly that is not associated with any
chromosome.

Alt-scaffold:
A scaffold that provides an alternate representation of a locus found
in the primary assembly. These sequences do not represent a complete
chromosome sequence although there is no hard limit on the size of the
alternate locus; currently these are less than 1 Mb.

Major release:
The formal release of a genome assembly, e.g. GRCh38.

Minor release:
A release of a genome assembly including patches that occurs between
major releases.

Genome Patch:
A sequence contig/scaffold that corrects sequence in a major release
of the genome, or adds sequence to it.

Fix-patch:
A patch that corrects sequence or reduces an assembly gap in a given
major release. FIX patch sequences are meant to be incorporated into
the primary or existing alt-loci assembly units at the next major
release.

Novel-patch:
A patch that adds sequence to a major release. Typically, NOVEL patch
sequences are meant to be incorporated into the assembly as new
alternate loci at the next major release.

Decoy:
A sequence that is not part of the genome assembly but is included in
an analysis set as a sink for alignment of reads that are often
present in sequencing samples.

######################################################################
Name Last modified Size Description


  [hg38.analysisSet.2bit](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.2bit)              27-Jan-2014 10:40  770M  
  [hg38.analysisSet.chroms.tar.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.chroms.tar.gz)     27-Jan-2014 11:02  905M  
  [hg38.fullAnalysisSet.2bit](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.fullAnalysisSet.2bit)          18-Mar-2014 13:23  797M  
  [hg38.fullAnalysisSet.chroms.tar.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.fullAnalysisSet.chroms.tar.gz) 18-Mar-2014 13:41  936M  
  [md5sum.txt](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/md5sum.txt)                         18-Mar-2014 13:41  250   

Reference:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/
