Installing the GPU plugin for open-falcon

1. Install DCGM:

# rpm --install datacenter-gpu-manager-1.5.6-1.x86_64.rpm

# dcgmi --version

# nvvs --version

Start the DCGM host engine (the listener):

# nv-hostengine

List the GPU devices:

# dcgmi discovery -l
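
The start-up and discovery steps above can be wrapped in a small guard so that a second host engine is not launched by accident. This is only a sketch: the function names is_running and ensure_hostengine are made up for illustration, and it assumes pgrep is available on the machine.

```shell
#!/bin/sh
# Sketch: start the DCGM host engine only if it is not already running,
# then list the GPUs it can see. Function names are hypothetical.

# Returns 0 if a process with exactly this name is running.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

ensure_hostengine() {
    # Bail out early if DCGM does not appear to be installed.
    if ! command -v nv-hostengine >/dev/null 2>&1; then
        echo "nv-hostengine not found; is DCGM installed?" >&2
        return 1
    fi
    if is_running nv-hostengine; then
        echo "nv-hostengine already running"
    else
        nv-hostengine
    fi
    # Same discovery command as in the step above.
    dcgmi discovery -l
}
```

After sourcing the snippet, run ensure_hostengine as root; it returns a non-zero status when DCGM is missing.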

2. Install gpu-mon:

# go get -u github.com/open-falcon/gpu-mon

# pwd

/root/go/src/github.com/open-falcon/gpu-mon

# make

gofmt -s -w ./args.go ./fetch/metrics.go ./fetch/dcgm.go ./fetch/fetch.go ./common/config.go ./common/log.go ./common/log_test.go ./common/utils.go ./common/config_test.go ./common/common.go ./send/send_test.go ./send/send.go ./send/utils.go ./send/utils_test.go ./main.go

building gpu-mon ...

3. Use the plugin

The open-falcon plugin feature must be enabled.

Edit agent/config/cfg.json

and set "enabled" to true in the plugin section.
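
A quick way to confirm the edit took effect is to read the setting back. The sketch below assumes the agent's cfg.json is pretty-printed with the "plugin" block spread over several lines (as in the stock agent config, as far as I know); plugin_state is a hypothetical helper name, not part of open-falcon.

```shell
#!/bin/sh
# Sketch: report whether the "plugin" block of an agent cfg.json has
# "enabled" set to true. Assumes a multi-line, pretty-printed file.

plugin_state() {
    # Print only the lines from '"plugin":' up to the first line containing '}',
    # then look for '"enabled": true' inside that slice.
    if sed -n '/"plugin"[[:space:]]*:/,/}/p' "$1" |
        grep -Eq '"enabled"[[:space:]]*:[[:space:]]*true'; then
        echo enabled
    else
        echo disabled
    fi
}
```

For example, plugin_state agent/config/cfg.json should print enabled after the change. The agent most likely needs a restart before it picks up the new setting.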

# cp gpu-mon cfg.example.json 60_gpuMonitor.sh /root/open-falcon/agent/plugin/

# pwd

/root/open-falcon/agent/plugin

# mv cfg.example.json cfg.json

/root/open-falcon/plugin is the agent's plugin directory, so the files need to end up there:

# pwd

/root/open-falcon/plugin

# ls

60_gpuMonitor.sh  cfg.json  gpu-mon  logs
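
Before waiting for the agent to schedule the plugin, it can help to verify that the directory really contains the three files and that the script and binary are executable. The following is a minimal sketch; check_plugin_dir is a hypothetical helper, and the file names are the ones from the listing above.

```shell
#!/bin/sh
# Sketch: sanity-check a plugin directory for the gpu-mon files.
# Returns 0 if everything looks fine, 1 otherwise.

check_plugin_dir() {
    dir=$1
    rc=0
    # All three files from the listing must exist.
    for f in 60_gpuMonitor.sh cfg.json gpu-mon; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            rc=1
        fi
    done
    # The wrapper script and the binary must be executable.
    for f in 60_gpuMonitor.sh gpu-mon; do
        if [ -e "$dir/$f" ] && [ ! -x "$dir/$f" ]; then
            echo "not executable: $dir/$f" >&2
            rc=1
        fi
    done
    return $rc
}
```

Usage: check_plugin_dir /root/open-falcon/plugin. In open-falcon's plugin naming convention, as far as I understand it, the 60_ prefix is the run interval in seconds.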

4. Configuration file

For the configuration, refer to the cfg.json file. The options are described below:

{

    "falcon": {

        // Agent: address of the falcon agent that metrics are reported to

        "Agent": "http://127.0.0.1:1988/v1/agent"

    },

    "metric":{

        // ignoreMetrics: GPU metrics that should not be reported

        "ignoreMetrics": [

        ],

        // endpoint: endpoint value; defaults to the machine's hostname

        "endpoint": ""

    },

    "log":{

        // level: log level; supports Info, Warn, Error and Debug; defaults to Warn

        "level": "Warn",

        // dir: directory where logs are stored

        "dir": "./logs"

    }

}
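
Since the log level only supports four values, a typo there is easy to catch up front. The sketch below pulls the first "level" value out of a cfg.json and checks it against the list above; check_log_level is a hypothetical helper, and the check is deliberately strict about capitalisation, which may or may not match what gpu-mon itself accepts.

```shell
#!/bin/sh
# Sketch: validate the "level" entry of a gpu-mon cfg.json.
# Prints the level and returns 0 if it is one of the four documented values.

check_log_level() {
    # Extract the first quoted value that follows a "level" key.
    level=$(sed -n 's/.*"level"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1)
    case "$level" in
        Info|Warn|Error|Debug)
            echo "$level"
            ;;
        *)
            echo "unsupported log level: '$level'" >&2
            return 1
            ;;
    esac
}
```

Usage: check_log_level /root/open-falcon/plugin/cfg.json.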


References:

https://github.com/open-falcon/gpu-mon

https://blog.csdn.net/u010953692/article/details/103876660

https://developer.nvidia.com/dcgm
