数据管理之特征仓库

鉴于特征仓库热度下降,特将历史库存放出

背景

数据工程师与算法工程师在数据存取上存在沟通成本,基于现状存在以下问题:

  • 提供训练使用的离线特征与在线服务使用的在线特征,构造方式存在差异,离线、在线特征数据一致性存疑
  • 算法工程师使用特征存在冗余,重复造轮子不仅会出现特征质量参差而且效率低下
  • 同步离线特征与在线特征能缩短训练到服务pipeline的开发时间,能提高模型迭代速度

简介

从业务背景、公司技术栈现状、框架重量、二次开发门槛、热度等方面考量,特征仓库的方案选型feast-spark(feast 0.9)

特征仓库的定位是管理,所以特征仓库既不研发离线数据仓库也不提供实时计算。

特征仓库提供特征元数据与离线、在线数据的关联,同时提供离线数据与在线数据的同步;最终实现特征复用、离线和在线特征数据一致性

原理

feast core存储、管理特征表的元数据

特征表的元数据,记录了特征的字段信息、关联的特征数据存储地址(如离线仓库在s3或hdfs,在线仓库在redis或es)

离线和在线特征的同步,保证模型训练与模型服务数据的一致性

feast-spark与数据处理和机器学习的关系可以参考如下的架构图:
数据管理之特征仓库_第1张图片
具体功能的实现细节:

1.注册特征表元数据

sdk通过grpc请求,将特征表结构元数据注册到feast core注册中心:

def _apply_feature_table(self, project: str, feature_table: FeatureTable):
    """
    Registers a single feature table with Feast

    Args:
        feature_table: Feature table that will be registered
    """

    feature_table.is_valid()
    feature_table_proto = feature_table.to_spec_proto()

    # Convert the feature table to a request and send to Feast Core
    try:
        apply_feature_table_response = self._core_service.ApplyFeatureTable(
            ApplyFeatureTableRequest(project=project, table_spec=feature_table_proto),  # type: ignore
            timeout=self._config.getint(opt.GRPC_CONNECTION_TIMEOUT),
            metadata=self._get_grpc_metadata(),
        )  # type: ApplyFeatureTableResponse
    except grpc.RpcError as e:
        raise grpc.RpcError(e.details())

feast core注册中心接口:

@Override
public void applyFeatureTable(
    ApplyFeatureTableRequest request,
    StreamObserver<ApplyFeatureTableResponse> responseObserver) {
String projectName = SpecService.resolveProjectName(request.getProject());
String tableName = request.getTableSpec().getName();

try {
    // Check if user has authorization to apply feature table
    authorizationService.authorizeRequest(SecurityContextHolder.getContext(), projectName);

    ApplyFeatureTableResponse response = specService.applyFeatureTable(request);
    responseObserver.onNext(response);
    responseObserver.onCompleted();
} catch (AccessDeniedException e) {
    log.info(
        String.format(
            "ApplyFeatureTable: Not authorized to access project to apply: %s", projectName));
    responseObserver.onError(
        Status.PERMISSION_DENIED
            .withDescription(e.getMessage())
            .withCause(e)
            .asRuntimeException());
}

/**
* Applies the given FeatureTable to the FeatureTable registry. Creates the FeatureTable if does
* not exist, otherwise updates the existing FeatureTable. Applies FeatureTable in project if
* specified, otherwise in default project.
*
* @param request Contains FeatureTable spec and project parameters used to create or update a
*     FeatureTable.
* @throws NoSuchElementException projects and entities referenced in request do not exist.
* @return response containing the applied FeatureTable spec.
*/
@Transactional
public ApplyFeatureTableResponse applyFeatureTable(ApplyFeatureTableRequest request) {
String projectName = resolveProjectName(request.getProject());

// Check that specification provided is valid
FeatureTableSpec applySpec = request.getTableSpec();
FeatureTableValidator.validateSpec(applySpec);

// Prevent apply if the project is archived.
Project project = projectRepository.findById(projectName).orElse(new Project(projectName));
if (project.isArchived()) {
    throw new IllegalArgumentException(
        String.format(
            "Cannot apply Feature Table to archived Project: (table: %s, project: %s)",
            applySpec.getName(), projectName));
}

// Create or update depending on whether there is an existing Feature Table
Optional<FeatureTable> existingTable =
    tableRepository.findFeatureTableByNameAndProject_Name(applySpec.getName(), projectName);
FeatureTable table = FeatureTable.fromProto(projectName, applySpec, entityRepository);
if (existingTable.isPresent() && table.equals(existingTable.get())) {
    // Skip update if no change is detected
    return ApplyFeatureTableResponse.newBuilder().setTable(existingTable.get().toProto()).build();
}
if (existingTable.isPresent()) {
    existingTable.get().updateFromProto(projectName, applySpec, entityRepository);
    table = existingTable.get();
}

// Commit FeatureTable to database and return applied FeatureTable
tableRepository.saveAndFlush(table);
return ApplyFeatureTableResponse.newBuilder().setTable(table.toProto()).build();
}

2.拽取批数据到离线特征仓库

将批特征数据,拽取到特征表关联的离线特征存地址,实现大致分为如下3个步骤:

a. 对拽取的数据格式做判断(目前只支持parquet格式的批数据)

# Check 1) Only parquet file format for FeatureTable batch source is supported
        if (
            feature_table.batch_source
            and issubclass(type(feature_table.batch_source), FileSource)
            and isinstance(
                type(feature_table.batch_source.file_options.file_format), ParquetFormat
            )
        ):
            raise Exception(
                f"No suitable batch source found for FeatureTable, {name}."
                f"Only BATCH_FILE source with parquet format is supported for batch ingestion."
            )

        # read raw data from local, we will support hdfs data later
        pyarrow_table, column_names = _read_table_from_source(source)

b. 以特征表为参照系,对拽取的数据作字段校验

# Check 2) Check if FeatureTable batch source field mappings can be found in provided source table
        _check_field_mappings(
            column_names,
            name,
            feature_table.batch_source.event_timestamp_column,
            feature_table.batch_source.field_mapping,
        )

c. 将批数据,拽取到特征表关联的离线特征存储地址

def _upload_to_file_source(
    file_url: str, with_partitions: bool, dest_path: str, config: Config
) -> None:
    """
    Uploads data into a FileSource. Currently supports GCS, S3, HDFS and Local FS.

    Args:
        file_url: file url of FileSource defined for FeatureTable
        with_partitions: whether to treat dest_path as dir with partitioned table
        dest_path: path to file or dir to be uploaded
        config: Config instance to configure FileSource
    """
    from urllib.parse import urlparse

    uri = urlparse(file_url)
    staging_client = get_staging_client(uri.scheme, config)

# supported offline storage middleware
storage_clients = {
    GS: _gcs_client,
    S3: _s3_client,
    S3A: _s3a_client,
    AZURE_SCHEME: _azure_blob_client,
    LOCAL_FILE: _local_fs_client,
    HDFS_FILE: _hdfs_fs_client
}

3.流数据拽取到在线仓库

由于目前只实现了redis作在线特征仓库,下面的实现也只以redis为例,实现大致分为如下3个步骤:

a. 取流式队列的消息数据(以kafka为例)

val input = config.source match {
    case source: KafkaSource =>
    sparkSession.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", source.bootstrapServers)
        .option("subscribe", source.topic)
        .load()
    case source: MemoryStreamingSource =>
    source.read
}

val parsed = config.source.asInstanceOf[StreamingSource].format match {
    case ProtoFormat(classPath) =>
    val parser = protoParser(sparkSession, classPath)
    input.withColumn("features", parser($"value"))
    case AvroFormat(schemaJson) =>
    input.select(from_avro($"value", schemaJson).alias("features"))
    case _ =>
    val columns = input.columns.map(input(_))
    input.select(struct(columns: _*).alias("features"))
}

val projected = parsed
    .select("features.*")
    .select(projection: _*)

b.对消息队列字段做校验

val query = projected.writeStream
    .foreachBatch { (batchDF: DataFrame, batchID: Long) =>
    val rowsAfterValidation = if (validationUDF.nonEmpty) {
        val columns = batchDF.columns.map(batchDF(_))
        batchDF.withColumn(
        "_isValid",
        rowValidator.allChecks && validationUDF.get(struct(columns: _*))
        )
    } else {
        batchDF.withColumn("_isValid", rowValidator.allChecks)
    }
    rowsAfterValidation.persist()
    implicit def rowEncoder: Encoder[Row] = RowEncoder(rowsAfterValidation.schema)

c.消息数据写到在线仓库

目前在线仓库是redis,特征数据的key都是经过加密处理;通过timestamp更新key对应的数据,保证特征数据是最新版本

rowsAfterValidation
    .map(metrics.incrementRead)
    .filter(if (config.doNotIngestInvalidRows) expr("_isValid") else rowValidator.allChecks)
    .write
    .format("feast.ingestion.stores.redis")
    .option("entity_columns", featureTable.entities.map(_.name).mkString(","))
    .option("namespace", featureTable.name)
    .option("project_name", featureTable.project)
    .option("timestamp_column", config.source.eventTimestampColumn)
    .option("max_age", config.featureTable.maxAge.getOrElse(0L))
    .save()
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
// repartition for deduplication
val dataToStore =
    if (config.repartitionByEntity && data.rdd.getNumPartitions > 1)
    data
        .repartition(data.rdd.getNumPartitions, config.entityColumns.map(col): _*)
        .localCheckpoint()
    else data

dataToStore.foreachPartition { partition: Iterator[Row] =>
    // grouped iterator to only allocate memory for a portion of rows
    partition.grouped(config.iteratorGroupingSize).foreach { batch =>

4.获取离线特征数据

目前离线特征仓库没有版本功能,离线仓库采用的是一种近似关联的处理策略:能获取entity对应的最新离线特征数据

获取离线特征数据,分为如下几步:

a. 从特征表批数据存放地址读取数据

feature_tables = [_feature_table_from_dict(dct) for dct in feature_tables_conf]
feature_tables_sources = [
    _source_from_dict(dct) for dct in feature_tables_sources_conf
]
entity_source = _source_from_dict(entity_source_conf)

entity_df = _read_and_verify_entity_df_from_source(spark, entity_source)

feature_table_dfs = [
    _read_and_verify_feature_table_df_from_source(spark, feature_table, source,)
    for feature_table, source in zip(feature_tables, feature_tables_sources)
]

b.筛选出时间范围内的数据

def _filter_feature_table_by_time_range(
    feature_table_df: DataFrame,
    feature_table: FeatureTable,
    feature_event_timestamp_column: str,
    entity_df: DataFrame,
    entity_event_timestamp_column: str,
):
    entity_max_timestamp = entity_df.agg(
        {entity_event_timestamp_column: "max"}
    ).collect()[0][0]
    entity_min_timestamp = entity_df.agg(
        {entity_event_timestamp_column: "min"}
    ).collect()[0][0]

    feature_table_timestamp_filter = (
        col(feature_event_timestamp_column).between(
            entity_min_timestamp - timedelta(seconds=feature_table.max_age),
            entity_max_timestamp,
        )
        if feature_table.max_age
        else col(feature_event_timestamp_column) <= entity_max_timestamp
    )

    time_range_filtered_df = feature_table_df.filter(feature_table_timestamp_filter)

    return time_range_filtered_df

c.近似特征表关联

近似特征表关联操作,是获取离线特征数据的核心部分,能保证关联的特征字段数据为当前最新数据

用户在定义特征表字段max_age为特征作用时间范围容忍度(近似偏差的范围)

entity_with_id = entity_df.withColumn("_row_nr", monotonically_increasing_id())

feature_event_timestamp_column_with_prefix = (
    f"{feature_table.name}__{EVENT_TIMESTAMP_ALIAS}"
)
feature_created_timestamp_column_with_prefix = (
    f"{feature_table.name}__{CREATED_TIMESTAMP_ALIAS}"
)

projection = [
    col(col_name).alias(f"{feature_table.name}__{col_name}")
    for col_name in feature_table_df.columns
]

aliased_feature_table_df = feature_table_df.select(projection)

join_cond = (
    entity_with_id[entity_event_timestamp_column]
    >= aliased_feature_table_df[feature_event_timestamp_column_with_prefix]
)
if feature_table.max_age:
    join_cond = join_cond & (
        aliased_feature_table_df[feature_event_timestamp_column_with_prefix]
        >= entity_with_id[entity_event_timestamp_column]
        - expr(f"INTERVAL {feature_table.max_age} seconds")
    )

for key in feature_table.entity_names:
    join_cond = join_cond & (
        entity_with_id[key]
        == aliased_feature_table_df[f"{feature_table.name}__{key}"]
    )

conditional_join = entity_with_id.join(
    aliased_feature_table_df, join_cond, "leftOuter"
)
for key in feature_table.entity_names:
    conditional_join = conditional_join.drop(
        aliased_feature_table_df[f"{feature_table.name}__{key}"]
    )

window = Window.partitionBy("_row_nr", *feature_table.entity_names).orderBy(
    col(feature_event_timestamp_column_with_prefix).desc(),
    col(feature_created_timestamp_column_with_prefix).desc(),
)
filter_most_recent_feature_timestamp = conditional_join.withColumn(
    "_rank", row_number().over(window)
).filter(col("_rank") == 1)

return filter_most_recent_feature_timestamp.select(
    entity_df.columns
    + [
        f"{feature_table.name}__{feature}"
        for feature in feature_table.feature_names
    ]
)

5.离线特征同步到在线特征仓库

将特征从离线仓库同步到在线仓库,实现大致分为如下3个步骤:

a. 读取特征表的离线特征

val input = config.source match {
    case source: BQSource =>
    BigQueryReader.createBatchSource(
        sparkSession.sqlContext,
        source,
        config.startTime,
        config.endTime
    )
    case source: FileSource =>
    FileReader.createBatchSource(
        sparkSession.sqlContext,
        source,
        config.startTime,
        config.endTime
    )
}

val projected = input.select(projection: _*).cache()

b.离线特征数据作数据校验

implicit def rowEncoder: Encoder[Row] = RowEncoder(projected.schema)

TypeCheck.allTypesMatch(projected.schema, featureTable) match {
    case Some(error) =>
    throw new RuntimeException(s"Dataframe columns don't match expected feature types: $error")
    case _ => ()
}

c.离线特征数据存储到在线仓库

val validRows = projected
    .map(metrics.incrementRead)
    .filter(rowValidator.allChecks)

validRows.write
    .format("feast.ingestion.stores.redis")
    .option("entity_columns", featureTable.entities.map(_.name).mkString(","))
    .option("namespace", featureTable.name)
    .option("project_name", featureTable.project)
    .option("timestamp_column", config.source.eventTimestampColumn)
    .option("max_age", config.featureTable.maxAge.getOrElse(0L))
    .save()

6.获取在线特征数据

TODO

demo实操

原生feast-spark仅支持local、k8s、gcp模式下的spark,在我的分支中已经实现了on yarn的模式,下例中的配置皆适用于on yarn模式

通过一个简单的python案例,介绍仓库使用的全流程:

初始化连接到feast

from feast import Client, Feature, Entity, ValueType, FeatureTable
import feast_spark
from feast.data_source import FileSource, KafkaSource
from feast.data_format import ParquetFormat, AvroFormat


# 连接到feast的组件:core、serving、redis
feast_client = Client(
    project="wbliu_august_eleven", # 特征分组
    core_url="feast-release-feast-core.ume-feast:6565", # 实体、特征表元数据注册中心
    serving_url="feast-release-feast-serving.ume-feast:6566", # 特征服务中心
    # 使用feast额外的配置
    options={"spark_staging_location": "hdfs://xxx/home/jovyan/spark_staging_location",
            "spark_launcher": "yarn",
            "spark_home": "/usr/hdp/current/spark3-client/",
            "redis_host": "feast-release-feast-redis.ume-feast",
            "redis_port": 6380}
)


# 使用feast-spark做批量特征的操作
client = feast_spark.Client(feast_client)

使用feast查看历史元数据信息
1.查看历史特征分组信息

feast_client.list_projects()

数据管理之特征仓库_第2张图片
2.查看当前分组entity信息

feast_client.list_entities()

数据管理之特征仓库_第3张图片
3.查看当前分组特征表信息

feast_client.list_feature_tables()

在这里插入图片描述

定义、注册entity和特征表到feast

1.定义entity

entity为特征表作表关联使用的主键

driver_id = Entity(name="driver_id", description="Driver identifier", value_type=ValueType.INT64)

2.定义特征

# Daily updated features
acc_rate = Feature("acc_rate", ValueType.FLOAT)
conv_rate = Feature("conv_rate", ValueType.FLOAT)
avg_daily_trips = Feature("avg_daily_trips", ValueType.INT32)
# Real-time updated features
trips_today = Feature("trips_today", ValueType.INT32)

3.定义特征表

driver_statistics = FeatureTable(
    name="driver_statistics", # 特征表名
    entities=["driver_id"], # 实体名
    features=[ # 特征对象
        acc_rate,
        conv_rate,
        avg_daily_trips
    ],
    batch_source=FileSource( # 特征表关联的批数据源信息
        event_timestamp_column="datetime", # 1.特征发生时间(必须有)
        created_timestamp_column="created", # 2.特征产生时间 (必须有)
        file_format=ParquetFormat(), # 批数据存储格式
        file_url="hdfs://xxxx/home/jovyan/test_data/driver_statistics", # 批数据存储地址
        date_partition_column="date" # 批数据存储划分partion的列
    )
)
 
driver_trips = FeatureTable(
    name="driver_trips",
    entities=["driver_id"],
    features=[
        trips_today
    ],
    batch_source=FileSource(
        event_timestamp_column="datetime",
        created_timestamp_column="created",
        file_format=ParquetFormat(),
        file_url="hdfs://xxx/home/jovyan/test_data/driver_trips",
        date_partition_column="date"
    )
)

4.注册特征表、entity到feast

# Registering entities and feature tables in Feast Core
feast_client.apply(driver_id)
feast_client.apply(driver_statistics)
feast_client.apply(driver_trips)

数据管理之特征仓库_第4张图片
5.准备特征数据

一般特征数据在特征处理之后生成,这个例子没有提前准备特征数据,所以临时生成一份特征数据:

def generate_entities():
    return np.random.choice(999999, size=100, replace=False)
 
def generate_trips(entities):
    df = pd.DataFrame(columns=["driver_id", "trips_today", "datetime", "created"])
    df['driver_id'] = entities
    df['trips_today'] = np.random.randint(0, 1000, size=100).astype(np.int32)
    df['datetime'] = pd.to_datetime(
        np.random.randint(
            datetime(2020, 10, 10).timestamp(),
            datetime(2020, 10, 20).timestamp(),
            size=100),
        unit="s"
    )
    df['created'] = pd.to_datetime(datetime.now())
    return df
 
def generate_stats(entities):
    df = pd.DataFrame(columns=["driver_id", "conv_rate", "acc_rate", "avg_daily_trips", "datetime", "created"])
    df['driver_id'] = entities
    df['conv_rate'] = np.random.random(size=100).astype(np.float32)
    df['acc_rate'] = np.random.random(size=100).astype(np.float32)
    df['avg_daily_trips'] = np.random.randint(0, 1000, size=100).astype(np.int32)
    df['datetime'] = pd.to_datetime(
        np.random.randint(
            datetime(2020, 10, 10).timestamp(),
            datetime(2020, 10, 20).timestamp(),
            size=100),
        unit="s"
    )
    df['created'] = pd.to_datetime(datetime.now())
    return df
 
entities = generate_entities()
stats_df = generate_stats(entities)
trips_df = generate_trips(entities)

拽取特征数据,到对应的特征表批数据源地址

feast_client.ingest(driver_statistics, stats_df)
feast_client.ingest(driver_trips, trips_df)

数据管理之特征仓库_第5张图片

获取离线训练特征数据

# Historical Retrieval For Training
# 准备需要抽取数据的范围
entities_with_timestamp = pd.DataFrame(columns=['driver_id', 'event_timestamp'])
entities_with_timestamp['driver_id'] = np.random.choice(entities, 10, replace=False)
entities_with_timestamp['event_timestamp'] = pd.to_datetime(np.random.randint(
    datetime(2020, 10, 18).timestamp(),
    datetime(2020, 10, 20).timestamp(),
    size=10), unit='s')
 
# get_historical_features will return immediately once the Spark job has been submitted succesfully.
job = client.get_historical_features(
    feature_refs=[# 需要使用的特征表和特征字段
        "driver_statistics:avg_daily_trips",
        "driver_statistics:conv_rate",
        "driver_statistics:acc_rate",
        "driver_trips:trips_today"
    ],
    entity_source=entities_with_timestamp, # 需要抽取的数据范围
    output_location="hdfs://xxx/home/jovyan/historical_feature_output" # 抽取的特征数据存放地址
)

离线特征数据同步到在线仓库

# offline to online ingestion
job = client.start_offline_to_online_ingestion(
    driver_statistics, # 需要同步的特征表
    datetime(2020, 10, 10), # 同步特征的事件时间范围
    datetime(2020, 10, 20)

拽取流数据到在线仓库

这个例子没有提前准备流特征数据,临时构建一份kafka数据:

def send_avro_record_to_kafka(topic, record):
    value_schema = avro.schema.parse(avro_schema_json)
    writer = DatumWriter(value_schema)
    bytes_writer = io.BytesIO()
    encoder = BinaryEncoder(bytes_writer)
    writer.write(record, encoder)
 
    producer = Producer({
        "bootstrap.servers": KAFKA_BROKER,
    })
    producer.produce(topic=topic, value=bytes_writer.getvalue())
    producer.flush()
 
 
# Note: depending on the Kafka configuration you may need to create the Kafka topic first, like below:
# from confluent_kafka.admin import AdminClient, NewTopic
# admin = AdminClient({'bootstrap.servers': KAFKA_BROKER})
# new_topic = NewTopic('driver_trips', num_partitions=1, replication_factor=3)
# admin.create_topics(new_topic)
 
for record in trips_df.drop(columns=['created']).to_dict('record'):
    record["datetime"] = (
        record["datetime"].to_pydatetime().replace(tzinfo=pytz.utc)
    )
 
    send_avro_record_to_kafka(topic="driver_trips", record=record)

添加流数据源到特征表,更新特征表元数据:

# ingest from streaming source
# Change this to any Kafka broker addresses which is accessible by the spark cluster
KAFKA_BROKER = os.getenv("DEMO_KAFKA_BROKERS", "kafka:9092")
avro_schema_json = json.dumps({
    "type": "record",
    "name": "DriverTrips",
    "fields": [
        {"name": "driver_id", "type": "long"},
        {"name": "trips_today", "type": "int"},
        {
            "name": "datetime",
            "type": {"type": "long", "logicalType": "timestamp-micros"},
        },
    ],
})
driver_trips.stream_source = KafkaSource(
    event_timestamp_column="datetime",
    created_timestamp_column="datetime",
    bootstrap_servers=KAFKA_BROKER,
    topic="driver_trips",
    message_format=AvroFormat(avro_schema_json)
)
client.apply(driver_trips)

拽取流数据到在线仓库:

# Start the streaming job and send avro record to Kafka
job = client.start_stream_to_online_ingestion(
    driver_trips
)

获取在线特征

response = feast_client.get_online_features(feature_refs=[ # 需要获取的特征列
        "driver_statistics:avg_daily_trips",
        "driver_statistics:conv_rate",
        "driver_statistics:acc_rate",
        "driver_trips:trips_today"
    ],
    entity_rows=[{"driver_id": 604155}], # 特征数据的主键
    project='wbliu')
 
response.to_dict()

数据管理之特征仓库_第6张图片

目前还存在的问题

feast-spark虽然解决了一部分特征数据管理的问题,但是仍存在一些问题:

  • 目前并未对离线、在线仓库做版本控制(和仓库的构建思路有一定关系)
  • 特征的发现只能靠用户主动触发、需要用户具有一定的特征先验知识
  • feast的使用、部署、环境等都有比较严苛的要求
    展望
    在feast 0.10+版本将会更轻量(对python开发者更友好)

使用方能继承实现更多个性化的特征仓库数据类型,帮助数据提供方(数仓)和数据使用方(算法)协作碰撞出更多火花

抛弃feast core、jobservice、serving等重组件,元数据可以采取本地(需要自己实现同步)或云的形式存储

参考:https://docs.feast.dev/project/feast-0.9-vs-feast-0.10+

你可能感兴趣的:(spark,python,机器学习,大数据,机器学习,python)