(10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None) ("An existing connection was forcibly closed by the remote host"): a common problem with federated learning + Ray

Part 1: Problem description

(pid=24828) Files already downloaded and verified
2025-02-24 12:48:44,183    ERROR import_thread.py:89 -- ImportThread: Error while reading from socket: (10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None)
2025-02-24 12:48:44,184    ERROR worker.py:1074 -- listen_error_messages_raylet: Error while reading from socket: (10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None)
2025-02-24 12:48:44,196    ERROR worker.py:981 -- print_logs: Error while reading from socket: (10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None)
E0224 12:48:44.197309 23684 15716 task_manager.cc:323] Task failed: IOError: 2: Stream removed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=myserver, class_name=ParameterServer, function_name=_evaluate, function_hash=}, task_id=3106d80c4e3c2369df5a1a8201000000, task_name=ParameterServer._evaluate(), job_id=01000000, num_args=2, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=5}
F0224 12:48:45.209749 23684 24808 service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was :0
*** Check failure stack trace: ***
    @   00007FF92140174B  public: void __cdecl google::LogMessage::Flush(void) __ptr64
    @   00007FF9214004E2  public: __cdecl google::LogMessage::~LogMessage(void) __ptr64
    @   00007FF9213C94F8  public: virtual __cdecl google::NullStreamFatal::~NullStreamFatal(void) __ptr64
    @   00007FF92123FD48  PyInit__raylet
    @   00007FF92123F24F  PyInit__raylet
    @   00007FF9212509AB  PyInit__raylet
    @   00007FF9211D6D85  PyInit__raylet
    @   00007FF9211A5C73  PyInit__raylet
    @   00007FF9211E9FAC  PyInit__raylet
    @   00007FF921442E74  bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
    @   00007FF92144638F  bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
    @   00007FF92144574B  bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
    @   00007FF921178B44  PyInit__raylet
    @   00007FFAD1CD1BB2  _configthreadlocale
    @   00007FFAD2CC7374  BaseThreadInitThunk
    @   00007FFAD42DCC91  RtlUserThreadStart

Process finished with exit code -1073740791 (0xC0000409)
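
The negative exit code and the hexadecimal value in parentheses are the same 32-bit number, read once as signed and once as unsigned. On Windows, 0xC0000409 is the status reported when a native process aborts itself, which fits the fatal check failure in the log above better than an ordinary Python exception would. A small sketch of the conversion (the helper name `to_ntstatus` is made up for illustration):

```python
def to_ntstatus(exit_code: int) -> str:
    """Render a signed 32-bit Windows exit code as an unsigned hex status value."""
    # Masking with 0xFFFFFFFF reinterprets the negative number as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073740791))  # 0xC0000409
```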

In my case, this error meant that a single client was requesting 2 GPUs while only one GPU was still available.

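Part 2: Solution

Find the part of the code that manages the number of GPUs:

Change the num_gpus value there (it explicitly declares how many GPUs each worker requires) so that it no longer exceeds what is available: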

[Screenshot: the modified num_gpus setting]
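
Since the original screenshot is not available, here is a minimal sketch of what such a change could look like. It assumes the clients are Ray actors declared with `@ray.remote(num_gpus=...)`; the helper `gpus_per_worker`, the factory `make_client_actor`, and the class name `Client` are hypothetical and not taken from the original code. Ray accepts fractional values such as 0.5 for `num_gpus`, so several workers can share one physical GPU.

```python
def gpus_per_worker(available_gpus: float, num_workers: int) -> float:
    """Equal GPU share per worker so that all workers fit on the available GPUs."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if available_gpus <= 0:
        return 0.0  # no GPU left: workers have to run on CPU
    # Capped at 1 GPU per worker to keep the example simple.
    return min(1.0, available_gpus / num_workers)


def make_client_actor(num_gpus: float):
    """Build a Ray actor class that requests `num_gpus` GPUs per instance."""
    import ray  # imported lazily so the helper above works without Ray installed

    @ray.remote(num_gpus=num_gpus)
    class Client:  # hypothetical stand-in for the client/worker class
        def gpu_ids(self):
            # GPU IDs that Ray assigned to this actor.
            return ray.get_gpu_ids()

    return Client


# Possible usage with one GPU and two clients (requires Ray and a GPU):
#   ray.init(num_gpus=1)
#   Client = make_client_actor(gpus_per_worker(1, 2))  # 0.5 GPU per client
#   clients = [Client.remote() for _ in range(2)]
```

The point of the sketch is only that the sum of all `num_gpus` requests stays within the GPUs Ray can actually see; the exact value to use depends on the model size and on how many clients run at once.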

Run it again and it works!
