The flow is as follows:
(app-root) bash-4.2# pwd
/data/projects/fate/logs
(app-root) bash-4.2# ls
fate_flow
(app-root) bash-4.2# ls fate_flow/
DEBUG.log fate_flow_detect.log fate_flow_schedule.log fate_flow_stat.log INFO.log peewee.log
As you can see, since no job has been submitted yet, there is only a single fate_flow directory, which holds the logs written during fate_flow_server startup. Specifically:
* peewee.log: FATE uses peewee for its database access; this file records every database operation issued through peewee
* fate_flow_detect.log: detector logs
* fate_flow_schedule.log: scheduler logs
* fate_flow_stat.log: all remaining status logs not covered by the three files above
* DEBUG.log, INFO.log, WARNING.log, ERROR.log: per-level log files that collect entries of the corresponding level from the parts above (except fate_flow_detect.log, whose logic is explained separately later); a sketch of this per-level collection follows this list.
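The per-level collection described in the last bullet can be pictured with the standard logging module: one FileHandler per level, each filtering to exactly that level. This is only an illustrative sketch of the behavior, not FATE's actual LoggerFactory implementation:

```python
import logging

def add_level_file(logger: logging.Logger, level: int, path: str) -> None:
    """Attach a handler that captures records of exactly one level."""
    handler = logging.FileHandler(path)
    handler.setLevel(level)
    # keep only records of this exact level in this file
    handler.addFilter(lambda record: record.levelno == level)
    logger.addHandler(handler)

logger = logging.getLogger("fate_flow")
logger.setLevel(logging.DEBUG)
for level in (logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR):
    add_level_file(logger, level, f"{logging.getLevelName(level)}.log")

logger.info("lands in INFO.log only")
logger.error("lands in ERROR.log only")
```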
Since all logs from the fate_flow_server startup are written to the fate_flow directory, every log referenced in this article is the corresponding file under fate_flow.
Since this is a KubeFATE deployment, we can simply look at the container log directly:
+ mkdir -p /data/projects/fate/conf/
+ cp /data/projects/fate/conf1/transfer_conf.yaml /data/projects/fate/conf/transfer_conf.yaml
+ cp /data/projects/fate/conf1/service_conf.yaml /data/projects/fate/conf/service_conf.yaml
+ cp /data/projects/fate/conf1/pipeline_conf.yaml /data/projects/fate/conf/pipeline_conf.yaml
+ sed -i 's/host: fateflow/host: 10.200.96.237/g' /data/projects/fate/conf/service_conf.yaml
+ sed -i 's/ip: fateflow/ip: 10.200.96.237/g' /data/projects/fate/conf/pipeline_conf.yaml
+ cp -r /data/projects/fate/examples /data/projects/fate/examples-shared-for-client
+ sleep 5
+ python ./fate_flow/fate_flow_server.py
* Running on http://10.200.96.237:9380/ (Press CTRL+C to quit)
From the commands above, the startup sequence is: create the directory → copy (and patch) the configuration files → start fate_flow_server.py.
# fate_flow_server.py: each module's Flask app is mounted under a versioned
# URL prefix via werkzeug's DispatcherMiddleware
app = DispatcherMiddleware(
manager,
{
'/{}/data'.format(API_VERSION): data_access_app_manager,
'/{}/model'.format(API_VERSION): model_app_manager,
'/{}/job'.format(API_VERSION): job_app_manager,
'/{}/table'.format(API_VERSION): table_app_manager,
'/{}/tracking'.format(API_VERSION): tracking_app_manager,
'/{}/pipeline'.format(API_VERSION): pipeline_app_manager,
'/{}/permission'.format(API_VERSION): permission_app_manager,
'/{}/version'.format(API_VERSION): version_app_manager,
'/{}/party'.format(API_VERSION): party_app_manager,
'/{}/initiator'.format(API_VERSION): initiator_app_manager,
'/{}/tracker'.format(API_VERSION): tracker_app_manager,
'/{}/forward'.format(API_VERSION): proxy_app_manager
}
)
Each manager corresponds to a different module's functionality; see REF1 for details. A minimal runnable sketch of this prefix-based dispatch follows.
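In this sketch, job_app and its /submit route are hypothetical stand-ins; only DispatcherMiddleware and the '/{version}/{module}' prefix scheme come from the snippet above:

```python
from flask import Flask
from werkzeug.middleware.dispatcher import DispatcherMiddleware
from werkzeug.serving import run_simple

API_VERSION = "v1"

manager = Flask("manager")   # fallback app for unmatched prefixes
job_app = Flask("job_app")   # stand-in for job_app_manager

@job_app.route("/submit", methods=["POST"])
def submit_job():
    return {"retcode": 0, "retmsg": "success"}

# requests to /v1/job/* are stripped of the prefix and handed to job_app
app = DispatcherMiddleware(manager, {
    "/{}/job".format(API_VERSION): job_app,
})

if __name__ == "__main__":
    run_simple("127.0.0.1", 9380, app)  # POST /v1/job/submit -> submit_job()
```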
Next, the server initializes the database:
[INFO] [2021-07-26 07:14:02,439] [1:140691673888576] - db_models.py[line:60]: init mysql database on cluster mode successfully
This involves 10 tables: t_job, t_task, t_tracking_metric, t_tracking_output_data_info, t_machine_learning_model_info, t_model_tag, t_tags, t_component_summary, t_model_operation_log, and t_engine_registry. Table-creation and index-creation statements are executed; see the source code for each table's fields. These statements are logged in peewee.log, e.g.:
('CREATE TABLE IF NOT EXISTS `componentsummary` (`f_id` BIGINT AUTO_INCREMENT NOT NULL PRIMARY KEY, `f_create_time` BIGINT, `f_create_date` DATETIME, `f_update_time` BIGINT, `f_update_date` DATETIME, `f_job_id` VARCHAR(25) NOT NULL, `f_role` VARCHAR(25) NOT NULL, `f_party_id` VARCHAR(10) NOT NULL, `f_component_name` TEXT NOT NULL, `f_task_id` VARCHAR(50), `f_task_version` VARCHAR(50), `f_summary` LONGTEXT NOT NULL)', [])
Note: when debugging locally, the default is standalone mode, and a fate_flow_sqlite.db SQLite database file is generated in the working directory. The sketch below illustrates what this peewee-based table creation looks like.
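For a rough feel of the peewee side (fields abridged from the CREATE TABLE statement above; the real model definitions live in fate_flow's db_models.py), standalone-mode table creation amounts to something like:

```python
from peewee import SqliteDatabase, Model, BigIntegerField, CharField, TextField

db = SqliteDatabase("fate_flow_sqlite.db")  # standalone mode: local SQLite file

class ComponentSummary(Model):
    f_create_time = BigIntegerField(null=True)
    f_job_id = CharField(max_length=25)
    f_role = CharField(max_length=25)
    f_party_id = CharField(max_length=10)
    f_summary = TextField()

    class Meta:
        database = db
        table_name = "componentsummary"

db.connect()
# emits CREATE TABLE IF NOT EXISTS ...; with a DEBUG-level handler on the
# "peewee" logger, the statement shows up just like the peewee.log line above
db.create_tables([ComponentSummary])
```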
With the tables in place, fate_flow registers the supported backend engines, defined as:
# Storage engine is used for component output data
SUPPORT_BACKENDS_ENTRANCE = {
"fate_on_eggroll": {
EngineType.COMPUTING: (ComputingEngine.EGGROLL, "clustermanager"),
EngineType.STORAGE: (StorageEngine.EGGROLL, "clustermanager"),
EngineType.FEDERATION: (FederationEngine.EGGROLL, "rollsite"),
},
"fate_on_spark": {
EngineType.COMPUTING: (ComputingEngine.SPARK, "spark"),
EngineType.STORAGE: (StorageEngine.HDFS, "hdfs"),
EngineType.FEDERATION: (FederationEngine.RABBITMQ, "rabbitmq"),
},
}
fate_flow iterates over the engine_name of each EngineType above (the COMPUTING, STORAGE, and FEDERATION entries) and creates a registration record for each; it then replaces every engine_name with STANDALONE and iterates once more.
To create a record, it first queries the t_engine_registry table for a row matching the given f_engine_type and f_engine_name; if none exists it executes an INSERT, otherwise it UPDATEs the relevant fields (a sketch of this select-then-insert-or-update flow follows the log samples). For the configuration above, sample SQL execution appears in peewee.log:
[DEBUG] [2021-07-26 07:14:06,066] [1:140691673888576] - peewee.py[line:2863]: ('SELECT `t1`.`f_create_time`, `t1`.`f_create_date`, `t1`.`f_update_time`, `t1`.`f_update_date`, `t1`.`f_engine_type`, `t1`.`f_engine_name`, `t1`.`f_engine_entrance`, `t1`.`f_engine_config`, `t1`.`f_cores`, `t1`.`f_memory`, `t1`.`f_remaining_cores`, `t1`.`f_remaining_memory`, `t1`.`f_nodes` FROM `t_engine_registry` AS `t1` WHERE ((`t1`.`f_engine_type` = %s) AND (`t1`.`f_engine_name` = %s))', ['computing', 'EGGROLL'])
[DEBUG] [2021-07-26 07:14:06,072] [1:140691673888576] - peewee.py[line:2863]: ('INSERT INTO `t_engine_registry` (`f_create_time`, `f_create_date`, `f_update_time`, `f_update_date`, `f_engine_type`, `f_engine_name`, `f_engine_entrance`, `f_engine_config`, `f_cores`, `f_memory`, `f_remaining_cores`, `f_remaining_memory`, `f_nodes`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)', [1627283646068, datetime.datetime(2021, 7, 26, 7, 14, 6), 1627283646068, datetime.datetime(2021, 7, 26, 7, 14, 6), 'computing', 'EGGROLL', 'clustermanager', '{"cores_per_node": 20, "nodes": 1}', 20, 0, 20, 0, 1])
[DEBUG] [2021-07-26 07:14:06,242] [1:140691673888576] - peewee.py[line:2863]: ('UPDATE `t_engine_registry` SET `f_engine_config` = %s, `f_cores` = %s, `f_memory` = %s, `f_remaining_cores` = (`t_engine_registry`.`f_remaining_cores` + %s), `f_remaining_memory` = (`t_engine_registry`.`f_remaining_memory` + %s), `f_nodes` = %s WHERE ((`t_engine_registry`.`f_engine_type` = %s) AND (`t_engine_registry`.`f_engine_name` = %s))', ['{"nodes": 1, "cores_per_node": 20}', 20, 0, 0, 0, 1, 'storage', 'STANDALONE'])
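A hedged sketch of the select-then-insert-or-update flow those statements trace (EngineRegistry is a stand-in peewee model for t_engine_registry with most fields dropped; the real logic lives in resource_manager.py, per the stat log below):

```python
from peewee import SqliteDatabase, Model, CharField, IntegerField

db = SqliteDatabase(":memory:")

class EngineRegistry(Model):
    f_engine_type = CharField()
    f_engine_name = CharField()
    f_engine_entrance = CharField()
    f_cores = IntegerField(default=0)

    class Meta:
        database = db
        table_name = "t_engine_registry"

def register_engine(engine_type, engine_name, entrance, cores):
    match = ((EngineRegistry.f_engine_type == engine_type)
             & (EngineRegistry.f_engine_name == engine_name))
    if not EngineRegistry.select().where(match).exists():
        # no row yet -> INSERT ("create ... registration information")
        EngineRegistry.create(f_engine_type=engine_type, f_engine_name=engine_name,
                              f_engine_entrance=entrance, f_cores=cores)
    else:
        # row exists -> UPDATE ("update ... registration information")
        (EngineRegistry.update(f_engine_entrance=entrance, f_cores=cores)
         .where(match).execute())

db.connect()
db.create_tables([EngineRegistry])
register_engine("computing", "EGGROLL", "clustermanager", 20)  # first pass: INSERT
register_engine("computing", "EGGROLL", "clustermanager", 20)  # second pass: UPDATE
```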
fate_flow's own messages for this are written to fate_flow_stat.log, as follows:
[INFO] [2021-07-26 07:14:06,077] [1:140691673888576] - resource_manager.py[line:94]: create computing engine EGGROLL clustermanager registration information
[INFO] [2021-07-26 07:14:06,097] [1:140691673888576] - resource_manager.py[line:94]: create storage engine EGGROLL clustermanager registration information
[INFO] [2021-07-26 07:14:06,117] [1:140691673888576] - resource_manager.py[line:94]: create federation engine EGGROLL rollsite registration information
[INFO] [2021-07-26 07:14:06,139] [1:140691673888576] - resource_manager.py[line:94]: create computing engine SPARK spark registration information
[INFO] [2021-07-26 07:14:06,175] [1:140691673888576] - resource_manager.py[line:94]: create storage engine HDFS hdfs registration information
[INFO] [2021-07-26 07:14:06,199] [1:140691673888576] - resource_manager.py[line:94]: create federation engine RABBITMQ rabbitmq registration information
[INFO] [2021-07-26 07:14:06,207] [1:140691673888576] - resource_manager.py[line:94]: create computing engine STANDALONE fateflow registration information
[INFO] [2021-07-26 07:14:06,216] [1:140691673888576] - resource_manager.py[line:94]: create storage engine STANDALONE fateflow registration information
[INFO] [2021-07-26 07:14:06,227] [1:140691673888576] - resource_manager.py[line:94]: create federation engine STANDALONE fateflow registration information
[INFO] [2021-07-26 07:14:06,236] [1:140691673888576] - resource_manager.py[line:76]: update computing engine STANDALONE fateflow registration information takes no effect
[INFO] [2021-07-26 07:14:06,243] [1:140691673888576] - resource_manager.py[line:76]: update storage engine STANDALONE fateflow registration information takes no effect
[INFO] [2021-07-26 07:14:06,253] [1:140691673888576] - resource_manager.py[line:76]: update federation engine STANDALONE fateflow registration information takes no effect
Note that the detector logs below are emitted by calling the detect_log() method in log.py, not LoggerFactory.getLogger("fate_flow_detect") in settings.py (a sketch of such a periodic detector follows the excerpt):
[INFO] [2021-07-26 07:14:11,255] [1:140691103205120] - detector.py[line:38]: start to detect running task..
[INFO] [2021-07-26 07:14:11,264] [1:140691103205120] - detector.py[line:70]: finish detect 0 running task
[INFO] [2021-07-26 07:14:11,264] [1:140691103205120] - detector.py[line:74]: start detect running job
[INFO] [2021-07-26 07:14:11,272] [1:140691103205120] - detector.py[line:88]: finish detect running job
[INFO] [2021-07-26 07:14:11,273] [1:140691103205120] - detector.py[line:93]: start detect resource recycle
[INFO] [2021-07-26 07:14:11,280] [1:140691103205120] - detector.py[line:116]: finish detect resource recycle
[INFO] [2021-07-26 07:14:11,280] [1:140691103205120] - detector.py[line:120]: start detect expired session
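The detector appears to wake on a fixed interval (5 s here, judging by the timestamps relative to server start). A minimal sketch of such a periodic detector thread with its own dedicated logger, as just noted (class and method names are assumptions, not fate_flow's actual code):

```python
import logging
import threading
import time

# dedicated logger: goes to fate_flow_detect.log only, not the per-level files
detect_logger = logging.getLogger("fate_flow_detect")
logging.basicConfig(level=logging.INFO)

class Detector(threading.Thread):
    def __init__(self, interval: float = 5.0):
        super().__init__(daemon=True)
        self.interval = interval

    def run(self):
        while True:
            self.detect_running_task()
            time.sleep(self.interval)

    def detect_running_task(self):
        detect_logger.info("start to detect running task..")
        running = []  # would query t_task for tasks still in the running state
        detect_logger.info("finish detect %d running task", len(running))

if __name__ == "__main__":
    Detector().start()
    time.sleep(12)  # let two or three detection rounds run
```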
The scheduler (dag_scheduler.py) runs a similar periodic loop, logging to fate_flow_schedule.log:
[INFO] [2021-07-26 07:14:08,324] [1:140259369826048] - dag_scheduler.py[line:134]: start schedule waiting jobs
[INFO] [2021-07-26 07:14:08,339] [1:140259369826048] - dag_scheduler.py[line:136]: have 0 waiting jobs
[INFO] [2021-07-26 07:14:08,339] [1:140259369826048] - dag_scheduler.py[line:146]: schedule waiting jobs finished
[INFO] [2021-07-26 07:14:08,339] [1:140259369826048] - dag_scheduler.py[line:148]: start schedule running jobs
[INFO] [2021-07-26 07:14:08,348] [1:140259369826048] - dag_scheduler.py[line:150]: have 0 running jobs
[INFO] [2021-07-26 07:14:08,349] [1:140259369826048] - dag_scheduler.py[line:158]: schedule running jobs finished
[INFO] [2021-07-26 07:14:08,349] [1:140259369826048] - dag_scheduler.py[line:161]: start schedule ready jobs
[INFO] [2021-07-26 07:14:08,359] [1:140259369826048] - dag_scheduler.py[line:163]: have 0 ready jobs
[INFO] [2021-07-26 07:14:08,359] [1:140259369826048] - dag_scheduler.py[line:171]: schedule ready jobs finished
[INFO] [2021-07-26 07:14:08,359] [1:140259369826048] - dag_scheduler.py[line:173]: start schedule rerun jobs
[INFO] [2021-07-26 07:14:08,367] [1:140259369826048] - dag_scheduler.py[line:175]: have 0 rerun jobs
[INFO] [2021-07-26 07:14:08,367] [1:140259369826048] - dag_scheduler.py[line:183]: schedule rerun jobs finished
[INFO] [2021-07-26 07:14:08,368] [1:140259369826048] - dag_scheduler.py[line:185]: start schedule end status jobs to update status
[INFO] [2021-07-26 07:14:08,375] [1:140259369826048] - dag_scheduler.py[line:187]: have 0 end status jobs
[INFO] [2021-07-26 07:14:08,376] [1:140259369826048] - dag_scheduler.py[line:199]: schedule end status jobs finished
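Each scheduling round walks the job states in a fixed order: waiting → running → ready → rerun → end-status. A hedged sketch of one round (run_one_round and query_jobs are hypothetical stand-ins; the real implementation lives in dag_scheduler.py, and query_jobs corresponds to the t_job query shown in peewee.log below):

```python
import logging

schedule_logger = logging.getLogger("fate_flow_schedule")

def query_jobs(status: str) -> list:
    """Stand-in for SELECT ... FROM t_job WHERE f_is_initiator AND f_status = %s."""
    return []

def run_one_round():
    for status in ("waiting", "running", "ready", "rerun", "end status"):
        schedule_logger.info("start schedule %s jobs", status)
        jobs = query_jobs(status)
        schedule_logger.info("have %d %s jobs", len(jobs), status)
        for job in jobs:
            pass  # per-state handling (resource apply, task scheduling, ...) elided
        schedule_logger.info("schedule %s jobs finished", status)
```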
Each query for jobs in a given state hits the database, so the corresponding peewee.log entries look like:
[DEBUG] [2021-07-26 07:14:08,260] [1:140691094812416] - peewee.py[line:2863]: ('SELECT `t1`.`f_create_time`, `t1`.`f_create_date`, `t1`.`f_update_time`, `t1`.`f_update_date`, `t1`.`f_user_id`, `t1`.`f_job_id`, `t1`.`f_name`, `t1`.`f_description`, `t1`.`f_tag`, `t1`.`f_dsl`, `t1`.`f_runtime_conf`, `t1`.`f_runtime_conf_on_party`, `t1`.`f_train_runtime_conf`, `t1`.`f_roles`, `t1`.`f_work_mode`, `t1`.`f_initiator_role`, `t1`.`f_initiator_party_id`, `t1`.`f_status`, `t1`.`f_status_code`, `t1`.`f_role`, `t1`.`f_party_id`, `t1`.`f_is_initiator`, `t1`.`f_progress`, `t1`.`f_ready_signal`, `t1`.`f_ready_time`, `t1`.`f_cancel_signal`, `t1`.`f_cancel_time`, `t1`.`f_rerun_signal`, `t1`.`f_end_scheduling_updates`, `t1`.`f_engine_name`, `t1`.`f_engine_type`, `t1`.`f_cores`, `t1`.`f_memory`, `t1`.`f_remaining_cores`, `t1`.`f_remaining_memory`, `t1`.`f_resource_in_use`, `t1`.`f_apply_resource_time`, `t1`.`f_return_resource_time`, `t1`.`f_start_time`, `t1`.`f_start_date`, `t1`.`f_end_time`, `t1`.`f_end_date`, `t1`.`f_elapsed` FROM `t_job` AS `t1` WHERE ((`t1`.`f_is_initiator` = %s) AND (`t1`.`f_status` = %s)) ORDER BY `t1`.`f_create_time` ASC', [True, 'waiting'])
Finally, fate_flow_server starts its gRPC thread pool and then the HTTP server:
[INFO] [2021-07-26 07:14:06,255] [1:140691673888576] - fate_flow_server.py[line:107]: start grpc server thread pool by 40 max workers
[INFO] [2021-07-26 07:14:06,268] [1:140691673888576] - fate_flow_server.py[line:115]: FATE Flow grpc server start successfully
[INFO] [2021-07-26 07:14:06,269] [1:140691673888576] - fate_flow_server.py[line:118]: FATE Flow http server start...
The HTTP server is Flask's built-in server, which is where the "Running on http://10.200.96.237:9380/" banner in the container log above comes from. Also worth noting is the os.kill(pid, 0) idiom, which fate_flow's process checks use to test whether a job or task process is still alive; a short sketch follows.
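os.kill(pid, 0) delivers no signal at all; the kernel only performs the existence (and permission) check, raising an error if the pid is gone. A short self-contained sketch:

```python
import os

def process_alive(pid: int) -> bool:
    """Probe a pid without disturbing it: signal 0 performs only the existence check."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # exists, but owned by another user
    return True

print(process_alive(os.getpid()))  # True: the current process certainly exists
```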