Dataset Upload and Management - FATE v1.8.0

I'm a beginner, just jotting down the pitfalls I ran into while using FATE~~~
This article walks through the FATE Flow dataset table upload process, based on the official documentation.
Don't underestimate the data upload step: getting it right avoids many errors in subsequent training jobs.

Preface

In fact, besides uploading data files directly, FATE also supports binding data that lives in a database or a distributed file system (table binding). See the fate-flow official documentation: Data Access - FATE Flow (federatedai.github.io)

Below is my deployment environment; in theory this article also applies to uploading data on FATE deployed in other modes (e.g. docker).
OS: CentOS 7 (multiple hosts)
FATE deployment mode: FATE on Spark (standalone Spark, no HDFS, RabbitMQ cluster)
Version: v1.8.0
Link: FATE/fate_on_spark_deployment_guide.md at master · FederatedAI/FATE · GitHub

Prerequisite

If you want to submit data remotely, first install the fate flow client: pip install fate-client
See the docs: FATE Flow Client - FATE Flow (federatedai.github.io)
First check that the network between the client and the deployment machine / docker container is open. (Docker container ports must be exposed; fate flow's HTTP port defaults to 9380.)
Replace the IP with your deployment host and run: flow init --ip 192.168.0.1 --port 9380
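
If you prefer to script that connectivity check, here is a minimal Python sketch (the IP is the same placeholder as above):

import socket

# True if the FATE Flow HTTP port accepts TCP connections
def fateflow_reachable(host, port=9380, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(fateflow_reachable("192.168.0.1"))  # expect True before running flow init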

If you upload files directly on the deployment machine, no extra pip install is needed. Don't forget to enter the python virtual environment; the exact path may differ across deployment methods and versions: source /data/projects/fate/bin/init_env.sh

Prepare your dataset file

FATE stores uploaded files as data tables; the storage engine can be the local file system or a distributed system such as HDFS. Only after the data is uploaded can it be loaded and transformed into data instances that serve as input for model training. I recommend finishing whatever preprocessing you can before uploading, in case FATE's built-in modules cannot cover your custom needs. The following points may help you avoid pitfalls (a sanity-check sketch follows the snippet below):
1. The header width matches the actual data: if there are n columns, the header has n labels.
    Especially if you split train/validation/test sets, keep the table formats consistent to avoid errors.
2. No extra spaces or comma delimiters in the file, especially at the start and end of lines.
3. Mind the format when exporting data: save as utf-8 with no BOM, otherwise you will hit a value error.
4. (Optional) Label the index column id and put it first. If the table has no id, the extend_sid upload option lets FATE generate a unique uuid for you.
5. (Optional) Name the prediction target y (replace string labels with integers 0, 1, 2, ...),
      or declare the label and its format in the data transform component of the specific job:

 "data_transform_0":{
          "with_label": true,
          "label_name": "y",
          "label_type": "int",
          "output_format": "dense"
        }
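
To catch the points above before FATE does, here is a minimal pre-upload check in Python, a sketch assuming pandas is installed and using an example file name:

# A minimal pre-upload sanity check, following the checklist above.
import pandas as pd

path = "upload_data_train_host.csv"

# point 3: utf-8 without BOM (a BOM shows up as '\ufeff' at the start)
with open(path, encoding="utf-8") as f:
    assert f.read(1) != "\ufeff", "file saved as utf-8 with BOM"

df = pd.read_csv(path)

# points 1-2: stray trailing commas in the header appear as 'Unnamed: n' columns,
# and column names should carry no leading/trailing spaces
assert not any(str(c).startswith("Unnamed") for c in df.columns), "extra delimiter in header?"
assert all(c == c.strip() for c in df.columns), "whitespace around a header label"

# point 4: id in the first column (otherwise consider the extend_sid option)
assert df.columns[0] == "id", "no id column"

# point 5: label named y, stored as integers
assert "y" in df.columns and pd.api.types.is_integer_dtype(df["y"]), "label y missing or not int"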

Config and submit your job for data uploading

Since version 1.5 there have been two main ways to submit FATE jobs: pipeline and dsl conf.
To enhance usability of FATE, starting at FATE-v1.5, FATE provides Pipeline Module. Users may develop federated learning models conveniently using the python API.
There are also other options, such as the HTTP API and the Flow SDK.

DSL (configuration file)

DSL is FATE's configuration-file-based language for building federated modeling tasks. Create a configuration file data_upload.json, for example:

{   "file": "/data/file/path/upload_data_train_host.csv", 
    "id_delimiter": ",", 
    "head": 1, 
    "partition": 4, 
    "namespace": "experiment", 
    "table_name": "data_train_host", 
    "storage_engine": "LOCALFS" 
} 

Detailed parameter descriptions: here.
Submit with: flow data upload -c data_upload.json

Pipeline (python scripts)  

Pipeline is friendlier for those who work in python/jupyter notebooks day to day. Under the hood, the pipeline is automatically converted into the dsl & conf JSON files and sent to the fate flow server for execution. Example:

import os 
from pipeline.backend.pipeline import PipeLine 

# initialize a pipeline instance 
pipeline_upload = PipeLine().set_initiator(role='guest',party_id=10000).set_roles(guest=10000) 

data_base = 'C:\\data\\file\\path' 

# add upload task into pipeline (you can upload multiple data files inside one pipeline) 
pipeline_upload.add_upload_data(
                file=os.path.join(data_base,"upload_data_train_host.csv"), 
                table_name="data_train_host",   # table name 
                namespace="experiment",         # namespace 
                head=1, # 1 as first row represents header 
                id_delimiter=",")               # data info 

# submit upload job  
pipeline_upload.upload(drop=1) 
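
As the comment above notes, one pipeline can carry multiple upload tasks. A sketch of adding the guest-side file (hypothetical file and table names) to the same job, before calling upload():

# a second upload task in the same pipeline, added before pipeline_upload.upload(drop=1)
pipeline_upload.add_upload_data(
                file=os.path.join(data_base, "upload_data_train_guest.csv"),  # hypothetical file
                table_name="data_train_guest",
                namespace="experiment",
                head=1,
                id_delimiter=",")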

Tips: the official naming rule is "{content}_{mode}_{size}_{role}_{role_index}", i.e. the dataset content, homo/hetero mode, size, role name, and split index joined by underscores. Details:

  • content: brief description of the data content
  • mode: how the original data is divided, either "homo" or "hetero"; some data sets do not have this information
  • size: includes the keyword "mini" if the data set is truncated from another, larger set
  • role: role name, either "host" or "guest"
  • role_index: if a data set is further divided and shared among multiple hosts in some examples, indices distinguish the different parties, starting at 1
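
To make the rule concrete, here is a tiny helper of my own (not part of FATE) that assembles such names:

def table_name(content, mode=None, size=None, role="guest", role_index=None):
    # join the non-empty parts with underscores, following the rule above
    parts = [content, mode, size, role,
             str(role_index) if role_index is not None else None]
    return "_".join(p for p in parts if p)

print(table_name("breast", mode="hetero", role="host", role_index=1))  # breast_hetero_host_1
print(table_name("vehicle_scale", mode="homo", role="guest"))          # vehicle_scale_homo_guest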

Other users have reported that the namespace must not be set to train or test. Please avoid those names.

After submitting the upload job, the returned message tells you whether it succeeded. The FATEBoard UI also keeps a record; upload jobs all have role local, so they are easy to spot.

{
    "data": {
        "board_url": "http://192.168.129.XXX:8080/index.html#/dashboard?job_id=202207211111189522680&role=local&party_id=0",
        "code": 0,
        "dsl_path": "/data/projects/fate/fateflow/jobs/202207211111189522680/job_dsl.json",
        "job_id": "202207211111189522680",
        "logs_directory": "/data/projects/fate/fateflow/logs/202207211111189522680",
        "message": "success",
        "model_info": {
            "model_id": "local-0#model",
            "model_version": "202207211111189522680"
        },
        "namespace": "experiment",
        "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202207211111189522680/pipeline_dsl.json",
        "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202207211111189522680/local/0/job_runtime_on_party_conf.json",
        "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202207211111189522680/job_runtime_conf.json",
        "table_name": "vehicle_scale_homo_guest",
        "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202207211111189522680/train_runtime_conf.json"
    },
    "jobId": "202207211111189522680",
    "retcode": 0,
    "retmsg": "success"
}
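
If you submit uploads from a script, you can parse this response instead of eyeballing it. A sketch, assuming the flow client is on PATH and prints the JSON above to stdout:

import json
import subprocess

result = subprocess.run(["flow", "data", "upload", "-c", "data_upload.json"],
                        capture_output=True, text=True, check=True)
resp = json.loads(result.stdout)
assert resp["retcode"] == 0, resp["retmsg"]
print("upload job id:", resp["jobId"])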

In FATEBoard, open the job with the matching id and click upload_0 to preview the uploaded data: summary shows the total row count, and data_output previews the first 100 rows.


More actions after uploading

Query table information

Use the fate flow client to query the table information:
flow table info -n experiment -t vehicle_scale_homo_guest

{
    "data": {
        "address": {
            "connector_name": null,
            "path": "/data/projects/fate/localfs/input/experiment/vehicle_scale_homo_gue                 st"
        },
        "count": 413,
        "enable": true,
        "exist": 1,
        "namespace": "experiment",
        "origin": "upload",
        "partition": 4,
        "schema": {
            "header": "y,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17",
            "sid": "id"
        },
        "table_name": "vehicle_scale_homo_guest"
    },
    "retcode": 0,
    "retmsg": "success"
}

Common errors to look for (a sketch automating these checks follows the list):
1. exist is 0 (the table does not exist) or enable is false (the table is disabled)
2. count does not match the source file
3. sid in the schema does not map to id, e.g. "sid": "x0"
4. the label y is missing from the schema
5. extra spaces or commas in the schema header, e.g. "header": ",y,x0,x1,x2,x3,...,x10,"
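
These checks are easy to automate. A sketch, assuming resp holds the parsed JSON returned by flow table info above:

data = resp["data"]
assert data["exist"] == 1 and data["enable"], "table missing or disabled"  # error 1

header = data["schema"]["header"].split(",")
assert all(col.strip() for col in header), "stray space or comma in header"  # error 5
assert data["schema"]["sid"] == "id", "sid is not mapped to id"  # error 3
assert "y" in header, "label y missing from schema"  # error 4
# for error 2, compare data["count"] against the row count of your source file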

Disable or delete a data table in fate flow

Run the commands:
flow table disable -n experiment -t breast_homo_guest 
flow table delete -n experiment -t breast_homo_guest 
