By default, Python uses the CPython interpreter, which introduces the GIL. See GlobalInterpreterLock:
In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.
In the CPython interpreter, the GIL (global interpreter lock) is a mutex that protects Python objects from being accessed by multiple threads at the same time.
In hindsight, the GIL is not ideal, since it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Luckily, many potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
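In other words, when a multithreaded program spends most of its time holding the GIL and interpreting CPython bytecode, the GIL prevents it from taking full advantage of a multicore system.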
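The "blocking operations happen outside the GIL" part of the quote can be observed from pure Python. The following is a minimal sketch, assuming that `time.sleep` stands in for a blocking I/O call; the helper names `blocking_task` and `run_in_threads` are made up for this illustration.

```python
import threading
import time


def blocking_task(seconds: float) -> None:
    # time.sleep() releases the GIL while it waits, much like blocking I/O.
    time.sleep(seconds)


def run_in_threads(n_threads: int, seconds: float) -> float:
    """Start n_threads blocking tasks and return the total wall-clock time."""
    threads = [
        threading.Thread(target=blocking_task, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = run_in_threads(4, 0.3)
# Run one after another, the four sleeps would need about 1.2 s.
print(f"4 threads x 0.3 s sleep took {elapsed:.2f} s")
```

Because the waits overlap, the total should stay close to a single 0.3 s sleep. CPU-bound Python bytecode would not overlap like this, which is the bottleneck the quote describes.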
Fortunately, pybind11 provides a mechanism for releasing the GIL. See pybind11 documentation - Global Interpreter Lock (GIL):
The classes gil_scoped_release and gil_scoped_acquire can be used to acquire and release the global interpreter lock in the body of a C++ function call. In this way, long-running C++ code can be parallelized using multiple Python threads,
pybind11 provides two classes, gil_scoped_release and gil_scoped_acquire, for releasing and reacquiring the GIL. They can be used to parallelize long-running C++ code.
Because gil_scoped_release and gil_scoped_acquire call PyEval_SaveThread, PyEval_RestoreThread, PyEval_AcquireThread, and PyThreadState_Clear, we first look at these low-level APIs.
PyEval_SaveThread and PyEval_RestoreThread are called by the constructor and destructor of gil_scoped_release, respectively. PyEval_AcquireThread is called by the constructor of gil_scoped_acquire, and PyThreadState_Clear is called from its destructor (through dec_ref).
See the official documentation for PyEval_SaveThread:
PyThreadState *PyEval_SaveThread()
Part of the Stable ABI.
Release the global interpreter lock (if it has been created) and reset the thread state to NULL, returning the previous thread state (which is not NULL). If the lock has been created, the current thread must have acquired it.
PyEval_SaveThread releases the GIL, resets the current thread state to NULL, and returns the previous thread state.
See the official documentation for PyEval_RestoreThread:
void PyEval_RestoreThread(PyThreadState *tstate)
Part of the Stable ABI.
Acquire the global interpreter lock (if it has been created) and set the thread state to tstate, which must not be NULL. If the lock has been created, the current thread must not have acquired it, otherwise deadlock ensues.
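PyEval_RestoreThread acquires the GIL and sets the thread state to tstate. From the gil_scoped_release code we can see that tstate is the return value of PyEval_SaveThread, that is, the thread state from before the GIL was released.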
PyEval_AcquireThread
void PyEval_AcquireThread(PyThreadState *tstate)
Part of the Stable ABI.
Acquire the global interpreter lock and set the current thread state to tstate, which must not be NULL. The lock must have been created earlier. If this thread already has the lock, deadlock ensues.
Note Calling this function from a thread when the runtime is finalizing will terminate the thread, even if the thread was not created by Python. You can use _Py_IsFinalizing() or sys.is_finalizing() to check if the interpreter is in process of being finalized before calling this function to avoid unwanted termination.
Changed in version 3.8: Updated to be consistent with PyEval_RestoreThread(), Py_END_ALLOW_THREADS(), and PyGILState_Ensure(), and terminate the current thread if called while the interpreter is finalizing.
PyEval_RestoreThread() is a higher-level function which is always available (even when threads have not been initialized).
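It acquires the GIL and sets the current thread state to tstate (which must not be NULL). The GIL must have been created beforehand. If the calling thread already holds the GIL, a deadlock occurs.
Its role is similar to that of PyEval_RestoreThread.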
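The deadlock warning follows from the lock being non-reentrant. As a rough illustration only, the sketch below uses a plain `threading.Lock` named `toy_gil` as a hypothetical stand-in (it is not the real GIL) and adds a timeout so the second acquisition fails instead of hanging forever.

```python
import threading

# Hypothetical stand-in for the GIL: a plain, non-reentrant mutex.
toy_gil = threading.Lock()

# The thread takes the lock once; this succeeds immediately.
first = toy_gil.acquire()

# The same thread tries to take it again. Without the timeout this call
# would block forever, which is the deadlock the documentation warns about.
second = toy_gil.acquire(timeout=0.1)

toy_gil.release()
print(first, second)  # True False
```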
PyThreadState_Clear
void PyThreadState_Clear(PyThreadState *tstate)
Part of the Stable ABI.
Reset all information in a thread state object. The global interpreter lock must be held.
Changed in version 3.9: This function now calls the PyThreadState.on_delete callback. Previously, that happened in PyThreadState_Delete().
It resets all information in the thread state object tstate. The GIL must be held.
See pybind11 documentation - Global Interpreter Lock (GIL):
but great care must be taken when any gil_scoped_release appear: if there is any way that the C++ code can access Python objects, gil_scoped_acquire should be used to reacquire the GIL.
After gil_scoped_release has been used, any C++ code that wants to access Python objects must first reacquire the GIL with gil_scoped_acquire.
This is the pattern shown in How to use pybind11 in multithreaded application:
pybind11::gil_scoped_release release;
while (true)
{
// do something and break
}
pybind11::gil_scoped_acquire acquire;
To verify this, look at the constructor of gil_scoped_release in pybind11/gil.h:
explicit gil_scoped_release(bool disassoc = false) : disassoc(disassoc) {
// `get_internals()` must be called here unconditionally in order to initialize
// `internals.tstate` for subsequent `gil_scoped_acquire` calls. Otherwise, an
// initialization race could occur as multiple threads try `gil_scoped_acquire`.
auto &internals = detail::get_internals();
// NOLINTNEXTLINE(cppcoreguidelines-prefer-member-initializer)
tstate = PyEval_SaveThread();
if (disassoc) {
// Python >= 3.7 can remove this, it's an int before 3.7
// NOLINTNEXTLINE(readability-qualified-auto)
auto key = internals.tstate;
PYBIND11_TLS_DELETE_VALUE(key);
}
}
It calls PyEval_SaveThread, which releases the GIL. The destructor is:
~gil_scoped_release() {
if (!tstate) {
return;
}
// `PyEval_RestoreThread()` should not be called if runtime is finalizing
if (active) {
PyEval_RestoreThread(tstate);
}
if (disassoc) {
// Python >= 3.7 can remove this, it's an int before 3.7
// NOLINTNEXTLINE(readability-qualified-auto)
auto key = detail::get_internals().tstate;
PYBIND11_TLS_REPLACE_VALUE(key, tstate);
}
}
The destructor calls PyEval_RestoreThread, which reacquires the GIL. This is the RAII idiom.
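The same release-on-entry, reacquire-on-exit shape can be sketched in Python with a context manager. This is only an analogy: `lock` is an ordinary `threading.Lock` standing in for the GIL, and `scoped_release` is a hypothetical helper, not part of pybind11 or the Python C API.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for the GIL; NOT the real interpreter lock.
lock = threading.Lock()


@contextmanager
def scoped_release(lk):
    """Release lk on entry and reacquire it on exit, even if an error occurs."""
    lk.release()       # plays the role of PyEval_SaveThread()
    try:
        yield
    finally:
        lk.acquire()   # plays the role of PyEval_RestoreThread()


lock.acquire()                    # the caller starts out holding the lock
states = [lock.locked()]
with scoped_release(lock):
    states.append(lock.locked())  # the lock is free inside the scope
states.append(lock.locked())      # and held again after the scope ends
lock.release()
print(states)  # [True, False, True]
```

As with the C++ destructor, the reacquisition happens on every exit path, so the caller cannot forget it.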
As explained on the pybind11 discussion forum:
Maybe this will help you: https://docs.python.org/2/c-api/init.html#thread-state-and-the-global-interpreter-lock
py::gil_scoped_release is an RAII version of Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS, while py::gil_scoped_acquire is an RAII version of PyGILState_Ensure/PyGILState_Release.
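gil_scoped_release is the RAII version of PyEval_SaveThread/PyEval_RestoreThread: when its destructor runs, it calls PyEval_RestoreThread and reacquires the GIL.
Note: the py namespace here is pybind11; see Header and namespace conventions.
The documentation gives the following example, in which only gil_scoped_release appears and gil_scoped_acquire does not: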
m.def("call_go", [](Animal *animal) -> std::string {
// GIL is held when called from Python code. Release GIL before
// calling into (potentially long-running) C++ code
py::gil_scoped_release release;
return call_go(animal);
});
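This works because the destructor of gil_scoped_release runs when the lambda returns and automatically reacquires the GIL, so there is no need to use gil_scoped_acquire.
The constructor of gil_scoped_acquire is: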
PYBIND11_NOINLINE gil_scoped_acquire() {
auto &internals = detail::get_internals();
tstate = (PyThreadState *) PYBIND11_TLS_GET_VALUE(internals.tstate);
if (!tstate) {
/* Check if the GIL was acquired using the PyGILState_* API instead (e.g. if
calling from a Python thread). Since we use a different key, this ensures
we don't create a new thread state and deadlock in PyEval_AcquireThread
below. Note we don't save this state with internals.tstate, since we don't
create it we would fail to clear it (its reference count should be > 0). */
tstate = PyGILState_GetThisThreadState();
}
if (!tstate) {
tstate = PyThreadState_New(internals.istate);
# if defined(PYBIND11_DETAILED_ERROR_MESSAGES)
if (!tstate) {
pybind11_fail("scoped_acquire: could not create thread state!");
}
# endif
tstate->gilstate_counter = 0;
PYBIND11_TLS_REPLACE_VALUE(internals.tstate, tstate);
} else {
release = detail::get_thread_state_unchecked() != tstate;
}
if (release) {
PyEval_AcquireThread(tstate);
}
inc_ref();
}
Here PyEval_AcquireThread acquires the GIL. The destructor is:
PYBIND11_NOINLINE ~gil_scoped_acquire() {
dec_ref();
if (release) {
PyEval_SaveThread();
}
}
where dec_ref is:
PYBIND11_NOINLINE void dec_ref() {
--tstate->gilstate_counter;
# if defined(PYBIND11_DETAILED_ERROR_MESSAGES)
if (detail::get_thread_state_unchecked() != tstate) {
pybind11_fail("scoped_acquire::dec_ref(): thread state must be current!");
}
if (tstate->gilstate_counter < 0) {
pybind11_fail("scoped_acquire::dec_ref(): reference count underflow!");
}
# endif
if (tstate->gilstate_counter == 0) {
# if defined(PYBIND11_DETAILED_ERROR_MESSAGES)
if (!release) {
pybind11_fail("scoped_acquire::dec_ref(): internal error!");
}
# endif
PyThreadState_Clear(tstate);
if (active) {
PyThreadState_DeleteCurrent();
}
PYBIND11_TLS_DELETE_VALUE(detail::get_internals().tstate);
release = false;
}
}
Here PyThreadState_Clear resets all information in the thread state object tstate, and PyEval_SaveThread releases the GIL.
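The interplay between the `release` flag and the counter is easier to follow in a toy model. The sketch below is a simplified Python analogy, not pybind11's implementation: `ToyGil` and `ToyScopedAcquire` are hypothetical names, a `threading.Lock` stands in for the GIL, and the thread-state cleanup that the real dec_ref performs when the counter reaches zero is left out.

```python
import threading


class ToyGil:
    """Hypothetical stand-in for the GIL plus per-thread bookkeeping."""

    def __init__(self):
        self.lock = threading.Lock()
        self.owner = None                # ident of the thread holding the lock
        self.local = threading.local()   # per-thread counter, like gilstate_counter


class ToyScopedAcquire:
    """Toy analogue of gil_scoped_acquire used as a context manager."""

    def __init__(self, gil):
        self.gil = gil
        self.release = False

    def __enter__(self):
        me = threading.get_ident()
        # Only take the lock if this thread does not hold it already;
        # taking it twice would deadlock.
        self.release = self.gil.owner != me
        if self.release:
            self.gil.lock.acquire()
            self.gil.owner = me
        # Counterpart of inc_ref().
        self.gil.local.counter = getattr(self.gil.local, "counter", 0) + 1
        return self

    def __exit__(self, *exc):
        self.gil.local.counter -= 1      # counterpart of dec_ref()
        if self.release:                 # only the outermost scope unlocks
            self.gil.owner = None
            self.gil.lock.release()
        return False


gil = ToyGil()
trace = []
with ToyScopedAcquire(gil) as outer:
    with ToyScopedAcquire(gil) as inner:
        trace.append((outer.release, inner.release,
                      gil.local.counter, gil.lock.locked()))
trace.append((gil.local.counter, gil.lock.locked()))
print(trace)  # [(True, False, 2, True), (0, False)]
```

Only the outer scope actually takes and gives back the lock; the nested scope just bumps the counter, which is why nesting does not deadlock.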
The forum answer also mentions Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS.
Py_BEGIN_ALLOW_THREADS:
Py_BEGIN_ALLOW_THREADS
Part of the Stable ABI.
This macro expands to { PyThreadState *_save; _save = PyEval_SaveThread();. Note that it contains an opening brace; it must be matched with a following Py_END_ALLOW_THREADS macro. See above for further discussion of this macro.
It declares a PyThreadState pointer, calls PyEval_SaveThread, and stores the returned thread state in that pointer. It must be paired with Py_END_ALLOW_THREADS.
Py_END_ALLOW_THREADS:
Py_END_ALLOW_THREADS
Part of the Stable ABI.
This macro expands to PyEval_RestoreThread(_save); }. Note that it contains a closing brace; it must be matched with an earlier Py_BEGIN_ALLOW_THREADS macro. See above for further discussion of this macro.
It calls PyEval_RestoreThread.
Releasing the GIL from extension code gives the following example:
Most extension code manipulating the GIL has the following simple structure:
Save the thread state in a local variable.
Release the global interpreter lock.
... Do some blocking I/O operation ...
Reacquire the global interpreter lock.
Restore the thread state from the local variable.
This is so common that a pair of macros exists to simplify it:
Py_BEGIN_ALLOW_THREADS
... Do some blocking I/O operation ...
Py_END_ALLOW_THREADS
That is, the C++ code that should run in parallel is wrapped between Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS.
gil_scoped_release calls PyEval_SaveThread in its constructor and PyEval_RestoreThread in its destructor, which is why the discussion describes it as the RAII version of Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS.
The discussion also says that gil_scoped_acquire is the RAII version of PyGILState_Ensure and PyGILState_Release.
PyGILState_Ensure
PyGILState_STATE PyGILState_Ensure()
Part of the Stable ABI.
Ensure that the current thread is ready to call the Python C API regardless of the current state of Python, or of the global interpreter lock. This may be called as many times as desired by a thread as long as each call is matched with a call to PyGILState_Release(). In general, other thread-related APIs may be used between PyGILState_Ensure() and PyGILState_Release() calls as long as the thread state is restored to its previous state before the Release(). For example, normal usage of the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros is acceptable.
The return value is an opaque “handle” to the thread state when PyGILState_Ensure() was called, and must be passed to PyGILState_Release() to ensure Python is left in the same state. Even though recursive calls are allowed, these handles cannot be shared - each unique call to PyGILState_Ensure() must save the handle for its call to PyGILState_Release().
When the function returns, the current thread will hold the GIL and be able to call arbitrary Python code. Failure is a fatal error.
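PyGILState_Ensure makes the current thread acquire the GIL and returns an opaque handle to the thread state at the time of the call. That handle must later be passed to the matching PyGILState_Release call.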
PyGILState_Release
void PyGILState_Release(PyGILState_STATE)
Part of the Stable ABI.
Release any resources previously acquired. After this call, Python’s state will be the same as it was prior to the corresponding PyGILState_Ensure() call (but generally this state will be unknown to the caller, hence the use of the GILState API).
Every call to PyGILState_Ensure() must be matched by a call to PyGILState_Release() on the same thread.
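It releases any resources previously acquired.
The constructor of gil_scoped_acquire calls PyEval_AcquireThread, which plays a role similar to PyGILState_Ensure; its destructor calls PyEval_SaveThread, which plays a role similar to PyGILState_Release.
PyTorch's torch/csrc/autograd/generated/python_torch_functions_0.cpp contains: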
auto dispatch_rand = [](c10::SymIntArrayRef size, at::TensorOptions options) -> at::Tensor {
pybind11::gil_scoped_release no_gil;
return torch::rand_symint(size, options);
};
return wrap(dispatch_rand(_r.symintlist(0), options));
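Note that only gil_scoped_release is used here, without gil_scoped_acquire. No code after torch::rand_symint inside the lambda needs to access Python objects, and because gil_scoped_release follows RAII, the GIL is reacquired automatically in its destructor when the lambda returns.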
Following Releasing the GIL from extension code and pybind11 multithreading parallellism in python, edit spam.c as follows:
#define PY_SSIZE_T_CLEAN
#include <Python.h>
long fib_raw(long n) {
if (n < 2) {
return 1;
}
return fib_raw(n-2) + fib_raw(n-1);
}
long fib_cpp(long n) {
long res = 0;
#ifdef SAVE_THREAD
PyThreadState *_save;
_save = PyEval_SaveThread();
#endif
res = fib_raw(n);
#ifdef SAVE_THREAD
PyEval_RestoreThread(_save);
#endif
return res;
}
static PyObject*
fib(PyObject* self, PyObject* args) {
long n;
if(!PyArg_ParseTuple(args, "l", &n))
return NULL;
return PyLong_FromLong(fib_cpp(n));
}
static PyMethodDef SpamMethods[] = {
{"fib",
fib,
METH_VARARGS,
"Compute a Fibonacci number."},
{NULL, NULL, 0, NULL}
};
PyDoc_STRVAR(spam_doc, "Spam module that calculates fib.");
static struct PyModuleDef spammodule = {
PyModuleDef_HEAD_INIT,
"spam",
spam_doc,
-1,
SpamMethods
};
PyMODINIT_FUNC
PyInit_spam(void){
PyObject* m;
m = PyModule_Create(&spammodule);
if(m == NULL)
return NULL;
return m;
}
int main(int argc, char* argv[]){
wchar_t* program = Py_DecodeLocale(argv[0], NULL);
if(program == NULL){
fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
exit(1);
}
if(PyImport_AppendInittab("spam", PyInit_spam) == -1){
fprintf(stderr, "Error: could not extend in-built modules table\n");
exit(1);
}
Py_SetProgramName(program);
Py_Initialize();
PyObject* pmodule = PyImport_ImportModule("spam");
if(!pmodule){
PyErr_Print();
fprintf(stderr, "Error: could not import module 'spam'\n");
}
PyMem_RawFree(program);
return 0;
}
g++ -DSAVE_THREAD -shared $(python3.8-config --includes) $(python3.8-config --ldflags) -o spam_save_shared.so -fPIC spam.c
g++ -shared $(python3.8-config --includes) $(python3.8-config --ldflags) -o spam_save_shared_raw.so -fPIC spam.c
Next, edit spam.c to use the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros instead:
#define PY_SSIZE_T_CLEAN
#include <Python.h>
long fib_raw(long n) {
if (n < 2) {
return 1;
}
return fib_raw(n-2) + fib_raw(n-1);
}
long fib_cpp(long n) {
long res = 0;
#ifdef ALLOW_THREADS
Py_BEGIN_ALLOW_THREADS
#endif
res = fib_raw(n);
#ifdef ALLOW_THREADS
Py_END_ALLOW_THREADS
#endif
return res;
}
static PyObject*
fib(PyObject* self, PyObject* args) {
long n;
if(!PyArg_ParseTuple(args, "l", &n))
return NULL;
return PyLong_FromLong(fib_cpp(n));
}
static PyMethodDef SpamMethods[] = {
{"fib",
fib,
METH_VARARGS,
"Compute a Fibonacci number."},
{NULL, NULL, 0, NULL}
};
PyDoc_STRVAR(spam_doc, "Spam module that calculates fib.");
static struct PyModuleDef spammodule = {
PyModuleDef_HEAD_INIT,
"spam",
spam_doc,
-1,
SpamMethods
};
PyMODINIT_FUNC
PyInit_spam(void){
PyObject* m;
m = PyModule_Create(&spammodule);
if(m == NULL)
return NULL;
return m;
}
int main(int argc, char* argv[]){
wchar_t* program = Py_DecodeLocale(argv[0], NULL);
if(program == NULL){
fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
exit(1);
}
if(PyImport_AppendInittab("spam", PyInit_spam) == -1){
fprintf(stderr, "Error: could not extend in-built modules table\n");
exit(1);
}
Py_SetProgramName(program);
Py_Initialize();
PyObject* pmodule = PyImport_ImportModule("spam");
if(!pmodule){
PyErr_Print();
fprintf(stderr, "Error: could not import module 'spam'\n");
}
PyMem_RawFree(program);
return 0;
}
g++ -DALLOW_THREADS -shared $(python3.8-config --includes) $(python3.8-config --ldflags) -o spam_allow_threads.so -fPIC spam.c
g++ -shared $(python3.8-config --includes) $(python3.8-config --ldflags) -o spam_raw.so -fPIC spam.c
Following pybind11 documentation - First steps and pybind11 documentation - Global Interpreter Lock (GIL), port the example above to pybind11 and use pybind11::gil_scoped_release:
#include <pybind11/pybind11.h>
long fib_raw(long n) {
if (n < 2) {
return 1;
}
return fib_raw(n-2) + fib_raw(n-1);
}
long fib(long n) {
long res = 0;
#ifdef NOGIL1
pybind11::gil_scoped_release no_gil;
#endif
res = fib_raw(n);
return res;
}
PYBIND11_MODULE(spam, m) {
m.doc() = "pybind11 fibonacci plugin"; // optional module docstring
#ifdef NOGIL
m.def("fib", [](long n) -> long {
// GIL is held when called from Python code. Release GIL before
// calling into (potentially long-running) C++ code
pybind11::gil_scoped_release release;
return fib(n);
}, "A function that calculates fibonacci");
#elif defined(NOGIL2)
m.def("fib", &fib, pybind11::call_guard<pybind11::gil_scoped_release>(), "A function that calculates fibonacci");
#else
m.def("fib", &fib, "A function that calculates fibonacci");
#endif
}
pybind11 is a header-only library, hence it is not necessary to link against any special libraries and there are no intermediate (magic) translation steps.
g++ -DNOGIL -shared $(python3.8-config --includes) $(python3 -m pybind11 --includes) $(python3.8-config --ldflags) -o spam_pybind11_nogil.so -fPIC spam.c
g++ -shared $(python3.8-config --includes) $(python3 -m pybind11 --includes) $(python3.8-config --ldflags) -o spam_pybind11_raw.so -fPIC spam.c
Pick one of the spam_xxx.so files, copy it to spam.so, and write run.py as follows:
import time
import spam
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
if __name__ == "__main__":
    max_workers = 5
    num = 40
    results = []
    start = time.time()
    if len(sys.argv) > 1 and sys.argv[1]:
        print("multiple thread")
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = executor.map(spam.fib, [num] * max_workers)
    else:
        print("single thread")
        for _ in range(max_workers):
            results.append(spam.fib(num))
    end = time.time()
    print(end - start)
    print(list(results))
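Depending on whether a command-line argument is passed, this script calls the C++ function from multiple threads or from a single thread.
The output looks like this: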
$ python run.py 1
multiple thread
1.391195297241211
[165580141, 165580141, 165580141, 165580141, 165580141]
$ python run.py
single thread
3.9591684341430664
[165580141, 165580141, 165580141, 165580141, 165580141]
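We have three ways of releasing the GIL. Their run times, in seconds, are summarized below:
Configuration | Single thread (s) | Multiple threads (s) |
---|---|---|
PyEval_SaveThread disabled | 3.976 | 4.018 |
PyEval_SaveThread enabled | 3.959 | 1.391 |
ALLOW_THREADS disabled | 4.036 | 4.079 |
ALLOW_THREADS enabled | 4.025 | 1.453 |
gil_scoped_release disabled | 4.088 | 4.041 |
gil_scoped_release enabled | 4.031 | 1.464 |
As the table shows, with PyEval_SaveThread, ALLOW_THREADS, or gil_scoped_release disabled, the single-threaded and multithreaded run times are nearly identical. With the GIL released and 5 threads running, the multithreaded run is about 2.8 times faster than the single-threaded one.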
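The speedup can be computed directly from the measured times. The short sketch below takes the single-threaded and multithreaded times of the three GIL-releasing builds from the measurements above; the dictionary layout and variable names are just for illustration.

```python
# (single-threaded seconds, multithreaded seconds) with the GIL released
timings = {
    "PyEval_SaveThread": (3.959, 1.391),
    "ALLOW_THREADS": (4.025, 1.453),
    "gil_scoped_release": (4.031, 1.464),
}

# Speedup = single-threaded time / multithreaded time, rounded to 2 decimals.
speedups = {
    name: round(single / multi, 2)
    for name, (single, multi) in timings.items()
}
print(speedups)
# {'PyEval_SaveThread': 2.85, 'ALLOW_THREADS': 2.77, 'gil_scoped_release': 2.75}
```

All three mechanisms land in the same range, which fits the earlier observation that they end up calling the same PyEval_SaveThread/PyEval_RestoreThread pair.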