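As a company grows, its different business scenarios produce huge volumes of technical and business data in many forms. Extracting the useful data to solve a specific problem or uncover latent value has become one of the core challenges for data scientists. The industry raises their productivity with metadata, that is, data that describes data. (Readers unfamiliar with the concept should look it up separately.) Each use case usually has its own metadata definitions and relationships; the most common are user metadata, report metadata, and relationship metadata. We need a complete system that helps professionals collect, organize, access, and enrich metadata in support of data discovery and management (commonly called data governance, which matters even more once data is treated as an asset). How to design an effective system that makes enriching, querying, and using metadata easier and faster is a goal many large companies are pursuing. (Outside China, such a system is more often called a metadata hub or metadata catalog.)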
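As of this writing, metadata hub architectures have gone through three generations. Lyft's Amundsen and DataHub lead the field: Amundsen has the most active community, and DataHub is drawing more and more attention.
For a more detailed account of this evolution, see: LinkedIn DataHub series: The evolution of metadata hub system architecture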
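The third-generation architecture lets us integrate, store, and process metadata in the most scalable and flexible way. DataHub, the subject of this article, is built on it. Other systems with third-generation characteristics are Apache Atlas, Egeria, and Uber Databook (not open source). Atlas is tightly coupled to the Hadoop ecosystem, and Amundsen, the most active project, can now integrate with Atlas. Egeria supports events but is not yet feature-complete. Databook is close to DataHub but is not open source. DataHub evolved from WhereHows (second generation) and also exists as an internal version that differs from the open-source one. It is widely used inside LinkedIn, where it processes more than ten million entity and relationship change events per day, indexes more than 5 million entities and relationships in total, and answers queries in milliseconds, with a much improved user experience. LinkedIn's ambition is plain: to make DataHub the infrastructure for data assets.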
For the latest feature list, refer to the official online version: https://github.com/linkedin/datahub/blob/master/docs/features.md
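The data model of the open-source version supports only Datasets and People. Supported dataset sources are Hive, Kafka, and RDBMS (any additional source has to be defined programmatically). Supported storage backends are mainstream RDBMSs such as Oracle, Postgres, MySQL, and H2, plus Elasticsearch and Neo4j. Beyond the features listed here, more are planned, such as dashboards, metrics, a change history for metadata schemas, and execution records for data ingestion jobs.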
DataHub components
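Each entity, relationship, and metadata aspect is defined in its own Pegasus file (PDSC/PDL). The User entity (PDL file) and the OwnedBy relationship (PDL file) are shown below. (DataHub maintains two file types internally, pdl and avsc, the latter in JSON format. According to the official notes, all internal modeling will move to pdl, while network transport (MCE) uses avsc.)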
For details on PDSC/PDL and AVSC, see: https://linkedin.github.io/rest.li/pdl_schema
{
  "type" : "record",
  "name" : "OwnedBy",
  "namespace" : "com.linkedin.metadata.relationship",
  "doc" : "A generic model for the Owned-By relationship",
  "fields" : [ {
    "name" : "source",
    "type" : "string",
    "doc" : "Urn for the source of the relationship",
    "java" : {
      "class" : "com.linkedin.common.urn.Urn"
    }
  }, {
    "name" : "destination",
    "type" : "string",
    "doc" : "Urn for the destination of the relationship",
    "java" : {
      "class" : "com.linkedin.common.urn.Urn"
    }
  }, {
    "name" : "type",
    "type" : {
      "type" : "enum",
      "name" : "OwnershipType",
      "namespace" : "com.linkedin.common",
      "doc" : "Owner category or owner role",
      "symbolDocs" : {
        "CONSUMER" : "A person, group, or service that consumes the data",
        "DATAOWNER" : "A person or group that is owning the data",
        "DELEGATE" : "A person or a group that overseas the operation, e.g. a DBA or SRE.",
        "DEVELOPER" : "A person or group that is in charge of developing the code",
        "PRODUCER" : "A person, group, or service that produces/generates the data",
        "STAKEHOLDER" : "A person or a group that has direct business interest"
      }
    },
    "doc" : "The type of the ownership"
  } ],
  "pairings" : [ {
    "destination" : "com.linkedin.common.urn.CorpuserUrn",
    "source" : "com.linkedin.common.urn.DatasetUrn"
  }, {
    "destination" : "com.linkedin.common.urn.CorpuserUrn",
    "source" : "com.linkedin.common.urn.DataProcessUrn"
  } ]
}
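The pairings entry at the end of the schema reads as a list of the source and destination URN classes that an OwnedBy relationship may connect. The sketch below illustrates that idea with plain Java and no DataHub dependency. The class OwnedByPairings, its helper methods, and the assumption that URNs have the shape urn:li:<entityType>:<id> are all hypothetical and introduced only for this example; it is not how DataHub itself enforces pairings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of DataHub: checks a (source, destination)
// URN pair against the pairings declared for the OwnedBy relationship.
final class OwnedByPairings {

    // One allowed (source entity type, destination entity type) combination.
    static final class Pairing {
        final String sourceType;
        final String destinationType;

        Pairing(String sourceType, String destinationType) {
            this.sourceType = sourceType;
            this.destinationType = destinationType;
        }
    }

    // Mirrors the two pairings in the schema:
    // DatasetUrn -> CorpuserUrn and DataProcessUrn -> CorpuserUrn.
    // The lowercase type names are an assumption about the URN text form.
    static final List<Pairing> ALLOWED = Arrays.asList(
            new Pairing("dataset", "corpuser"),
            new Pairing("dataProcess", "corpuser"));

    // Assumes URNs look like "urn:li:<entityType>:<id>" and returns <entityType>.
    static String entityType(String urn) {
        String[] parts = urn.split(":", 4);
        if (parts.length < 4 || !"urn".equals(parts[0]) || !"li".equals(parts[1])) {
            throw new IllegalArgumentException("Not a urn:li URN: " + urn);
        }
        return parts[2];
    }

    // Returns true when the pair of entity types matches one declared pairing.
    static boolean isAllowed(String sourceUrn, String destinationUrn) {
        String source = entityType(sourceUrn);
        String destination = entityType(destinationUrn);
        for (Pairing p : ALLOWED) {
            if (p.sourceType.equals(source) && p.destinationType.equals(destination)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A dataset owned by a user matches the first pairing.
        System.out.println(isAllowed(
                "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
                "urn:li:corpuser:jdoe"));
        // A user "owned by" a user matches no pairing.
        System.out.println(isAllowed("urn:li:corpuser:jdoe", "urn:li:corpuser:jdoe"));
    }
}
```

Keeping the allowed combinations in data rather than in code means a new pairing only requires a new list entry, which is the same reason the schema declares them as an annotation.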
package com.linkedin.metadata.relationship;

import java.util.List;
import javax.annotation.Generated;
import com.linkedin.common.OwnershipType;
import com.linkedin.data.DataMap;
import com.linkedin.data.schema.PathSpec;
import com.linkedin.data.schema.RecordDataSchema;
import com.linkedin.data.schema.SchemaFormatType;
import com.linkedin.data.template.DataTemplateUtil;
import com.linkedin.data.template.GetMode;
import com.linkedin.data.template.RecordTemplate;
import com.linkedin.data.template.SetMode;

/**
 * A generic model for the Owned-By relationship
 */
@Generated(value = "com.linkedin.pegasus.generator.JavaCodeUtil", comments = "Rest.li Data Template. Generated from metadata-models/src/main/pegasus/com/linkedin/metadata/relationship/OwnedBy.pdl.")
public class OwnedBy
    extends RecordTemplate
{
    private final static OwnedBy.Fields _fields = new OwnedBy.Fields();
    private final static RecordDataSchema SCHEMA = ((RecordDataSchema) DataTemplateUtil.parseSchema("namespace com.linkedin.metadata.relationship/**A generic model for the Owned-By relationship*/@pairings=[{\"destination\":\"com.linkedin.common.urn.CorpuserUrn\",\"source\":\"com.linkedin.common.urn.DatasetUrn\"},{\"destination\":\"com.linkedin.common.urn.CorpuserUrn\",\"source\":\"com.linkedin.common.urn.DataProcessUrn\"}]record OwnedBy includes/**Common fields that apply to all relationships*/record BaseRelationship{/**Urn for the source of the relationship*/source:{namespace com.linkedin.common@java.class=\"com.linkedin.common.urn.Urn\"typeref Urn=string}/**Urn for the destination of the relationship*/destination:com.linkedin.common.Urn}{/**The type of the ownership*/type:{namespace com.linkedin.common/**Owner category or owner role*/enum OwnershipType{/**A person or group that is in charge of developing the code*/DEVELOPER/**A person or group that is owning the data*/DATAOWNER/**A person or a group that overseas the operation, e.g. a DBA or SRE.*/DELEGATE/**A person, group, or service that produces/generates the data*/PRODUCER/**A person, group, or service that consumes the data*/CONSUMER/**A person or a group that has direct business interest*/STAKEHOLDER}}}", SchemaFormatType.PDL));
    private final static RecordDataSchema.Field FIELD_Source = SCHEMA.getField("source");
    private final static RecordDataSchema.Field FIELD_Destination = SCHEMA.getField("destination");
    private final static RecordDataSchema.Field FIELD_Type = SCHEMA.getField("type");

    static {
    }

    public OwnedBy() {
        super(new DataMap(4, 0.75F), SCHEMA);
    }

    public OwnedBy(DataMap data) {
        super(data, SCHEMA);
    }

    public static OwnedBy.Fields fields() {
        return _fields;
    }

    /**
     * Existence checker for source
     * @see OwnedBy.Fields#source
     */
    public boolean hasSource() {
        return contains(FIELD_Source);
    }

    /**
     * Remover for source
     * @see OwnedBy.Fields#source
     */
    public void removeSource() {
        remove(FIELD_Source);
    }

    /**
     * Getter for source
     * @see OwnedBy.Fields#source
     */
    public com.linkedin.common.urn.Urn getSource(GetMode mode) {
        return obtainCustomType(FIELD_Source, com.linkedin.common.urn.Urn.class, mode);
    }

    /**
     * Getter for source
     * @return
     *     Required field. Could be null for partial record.
     * @see OwnedBy.Fields#source
     */
    public com.linkedin.common.urn.Urn getSource() {
        return obtainCustomType(FIELD_Source, com.linkedin.common.urn.Urn.class, GetMode.STRICT);
    }

    /**
     * Setter for source
     * @see OwnedBy.Fields#source
     */
    public OwnedBy setSource(com.linkedin.common.urn.Urn value, SetMode mode) {
        putCustomType(FIELD_Source, com.linkedin.common.urn.Urn.class, String.class, value, mode);
        return this;
    }

    /**
     * Setter for source
     * @param value
     *     Must not be null. For more control, use setters with mode instead.
     * @see OwnedBy.Fields#source
     */
    public OwnedBy setSource(
        com.linkedin.common.urn.Urn value) {
        putCustomType(FIELD_Source, com.linkedin.common.urn.Urn.class, String.class, value, SetMode.DISALLOW_NULL);
        return this;
    }

    /**
     * Existence checker for destination
     * @see OwnedBy.Fields#destination
     */
    public boolean hasDestination() {
        return contains(FIELD_Destination);
    }

    /**
     * Remover for destination
     * @see OwnedBy.Fields#destination
     */
    public void removeDestination() {
        remove(FIELD_Destination);
    }

    /**
     * Getter for destination
     * @see OwnedBy.Fields#destination
     */
    public com.linkedin.common.urn.Urn getDestination(GetMode mode) {
        return obtainCustomType(FIELD_Destination, com.linkedin.common.urn.Urn.class, mode);
    }

    /**
     * Getter for destination
     * @return
     *     Required field. Could be null for partial record.
     * @see OwnedBy.Fields#destination
     */
    public com.linkedin.common.urn.Urn getDestination() {
        return obtainCustomType(FIELD_Destination, com.linkedin.common.urn.Urn.class, GetMode.STRICT);
    }

    /**
     * Setter for destination
     * @see OwnedBy.Fields#destination
     */
    public OwnedBy setDestination(com.linkedin.common.urn.Urn value, SetMode mode) {
        putCustomType(FIELD_Destination, com.linkedin.common.urn.Urn.class, String.class, value, mode);
        return this;
    }

    /**
     * Setter for destination
     * @param value
     *     Must not be null. For more control, use setters with mode instead.
     * @see OwnedBy.Fields#destination
     */
    public OwnedBy setDestination(
        com.linkedin.common.urn.Urn value) {
        putCustomType(FIELD_Destination, com.linkedin.common.urn.Urn.class, String.class, value, SetMode.DISALLOW_NULL);
        return this;
    }

    /**
     * Existence checker for type
     * @see OwnedBy.Fields#type
     */
    public boolean hasType() {
        return contains(FIELD_Type);
    }

    /**
     * Remover for type
     * @see OwnedBy.Fields#type
     */
    public void removeType() {
        remove(FIELD_Type);
    }

    /**
     * Getter for type
     * @see OwnedBy.Fields#type
     */
    public OwnershipType getType(GetMode mode) {
        return obtainDirect(FIELD_Type, OwnershipType.class, mode);
    }

    /**
     * Getter for type
     * @return
     *     Required field. Could be null for partial record.
     * @see OwnedBy.Fields#type
     */
    public OwnershipType getType() {
        return obtainDirect(FIELD_Type, OwnershipType.class, GetMode.STRICT);
    }

    /**
     * Setter for type
     * @see OwnedBy.Fields#type
     */
    public OwnedBy setType(OwnershipType value, SetMode mode) {
        putDirect(FIELD_Type, OwnershipType.class, String.class, value, mode);
        return this;
    }

    /**
     * Setter for type
     * @param value
     *     Must not be null. For more control, use setters with mode instead.
     * @see OwnedBy.Fields#type
     */
    public OwnedBy setType(
        OwnershipType value) {
        putDirect(FIELD_Type, OwnershipType.class, String.class, value, SetMode.DISALLOW_NULL);
        return this;
    }

    public OwnedBy clone()
        throws CloneNotSupportedException
    {
        return ((OwnedBy) super.clone());
    }

    public OwnedBy copy()
        throws CloneNotSupportedException
    {
        return ((OwnedBy) super.copy());
    }

    public static class Fields
        extends PathSpec
    {
        public Fields(List<String> path, String name) {
            super(path, name);
        }

        public Fields() {
        }

        /**
         * Urn for the source of the relationship
         */
        public PathSpec source() {
            return new PathSpec(getPathComponents(), "source");
        }

        /**
         * Urn for the destination of the relationship
         */
        public PathSpec destination() {
            return new PathSpec(getPathComponents(), "destination");
        }

        /**
         * The type of the ownership
         */
        public PathSpec type() {
            return new PathSpec(getPathComponents(), "type");
        }
    }
}
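The generated class above follows one pattern throughout: raw values live in a generic map, and typed accessors convert on the way in and out (the source URN is stored as a string and handed back as a Urn object). The following is a minimal, dependency-free sketch of that pattern only. MapBackedOwnedBy and SimpleUrn are hypothetical names, and the choice to throw on a missing field is a simplification made for this example rather than a statement about Pegasus behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a generated record class: raw values
// are kept in a map, and typed accessors convert on the way in and out.
final class MapBackedOwnedBy {

    // Toy URN wrapper; the real com.linkedin.common.urn.Urn is richer.
    static final class SimpleUrn {
        final String value;

        SimpleUrn(String value) {
            if (!value.startsWith("urn:li:")) {
                throw new IllegalArgumentException("Not a urn:li URN: " + value);
            }
            this.value = value;
        }
    }

    private final Map<String, Object> data = new HashMap<>();

    boolean hasSource() {
        return data.containsKey("source");
    }

    // Stores the URN in string form, matching the string-typed field in the schema.
    MapBackedOwnedBy setSource(SimpleUrn urn) {
        if (urn == null) {
            throw new NullPointerException("source must not be null");
        }
        data.put("source", urn.value);
        return this;
    }

    // In this sketch a missing field is an error rather than a silent null.
    SimpleUrn getSource() {
        Object raw = data.get("source");
        if (raw == null) {
            throw new IllegalStateException("required field 'source' is not set");
        }
        return new SimpleUrn((String) raw);
    }

    void removeSource() {
        data.remove("source");
    }

    public static void main(String[] args) {
        MapBackedOwnedBy relationship = new MapBackedOwnedBy()
                .setSource(new SimpleUrn("urn:li:dataset:example"));
        System.out.println(relationship.hasSource());
        System.out.println(relationship.getSource().value);
        relationship.removeSource();
        System.out.println(relationship.hasSource());
    }
}
```

Because the typed class is only a view over the map, the same underlying data can be serialized or passed along without knowing the concrete record type.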
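DataHub's API is built on Rest.li, which uses Pegasus as its interface definition language, so the metadata models can be reused directly. Metadata Change Events (MCEs) are received through Kafka in Avro format (the avsc schemas are JSON and are generated automatically from Pegasus). Apache Samza serves as the stream processing framework: it converts the Avro data back to Pegasus and calls the corresponding API.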
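The ingestion path described above (events arrive on a topic, are converted from the wire format to the typed model, and are then handed to the API) can be pictured with a toy model. The sketch below uses no Kafka, Samza, Avro, or Rest.li: an in-memory queue stands in for the topic, a plain map for the Avro record, and every class, field, and method name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of the ingestion path; every class here is hypothetical.
final class MceFlowSketch {

    // Typed model, standing in for a Pegasus-generated record.
    static final class OwnershipChange {
        final String datasetUrn;
        final String ownerUrn;

        OwnershipChange(String datasetUrn, String ownerUrn) {
            this.datasetUrn = datasetUrn;
            this.ownerUrn = ownerUrn;
        }
    }

    // Stand-in for the metadata service that sits behind the API.
    static final class InMemoryMetadataService {
        final Map<String, String> ownerByDataset = new HashMap<>();

        void ingest(OwnershipChange change) {
            ownerByDataset.put(change.datasetUrn, change.ownerUrn);
        }
    }

    // Converts the wire format (a generic map here) to the typed model.
    static OwnershipChange fromWireFormat(Map<String, String> event) {
        String dataset = event.get("datasetUrn");
        String owner = event.get("ownerUrn");
        if (dataset == null || owner == null) {
            throw new IllegalArgumentException("event is missing datasetUrn or ownerUrn");
        }
        return new OwnershipChange(dataset, owner);
    }

    // Drains the queue (standing in for a topic), converts each event,
    // calls the service, and returns the number of events processed.
    static int process(Queue<Map<String, String>> topic, InMemoryMetadataService service) {
        int processed = 0;
        Map<String, String> event;
        while ((event = topic.poll()) != null) {
            service.ingest(fromWireFormat(event));
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        Queue<Map<String, String>> topic = new ArrayDeque<>();
        Map<String, String> event = new HashMap<>();
        event.put("datasetUrn", "urn:li:dataset:example");
        event.put("ownerUrn", "urn:li:corpuser:jdoe");
        topic.add(event);

        InMemoryMetadataService service = new InMemoryMetadataService();
        System.out.println(process(topic, service));
        System.out.println(service.ownerByDataset.get("urn:li:dataset:example"));
    }
}
```

Separating the conversion step from the service call is the point of the design: producers only need to agree on the wire schema, while the service keeps working with its own typed models.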