Ingo Molnar, author of the O(1) scheduler [earlier story] and the original preemptive kernel patch, has provided a patch to make the O(1) scheduler fully aware of HyperThreading. Ingo explains:
"Symmetric multithreading (hyperthreading) is an interesting new concept that IMO deserves full scheduler support. Physical CPUs can have multiple (typically 2) logical CPUs embedded, and can run multiple tasks 'in parallel' by utilizing fast hardware-based context-switching between the two register sets upon things like cache-misses or special instructions. To the OSs the logical CPUs are almost undistinguishable from physical CPUs. In fact the current scheduler treats each logical CPU as a separate physical CPU - which works but does not maximize multiprocessing performance on SMT/HT boxes."
Read on for Ingo's full explanation.
From: Ingo Molnar
To: linux-kernel
Subject: [patch] "fully HT-aware scheduler" support, 2.5.31-BK-curr
Date: Tue, 27 Aug 2002 03:44:23 +0200 (CEST)

symmetric multithreading (hyperthreading) is an interesting new concept that IMO deserves full scheduler support. Physical CPUs can have multiple (typically 2) logical CPUs embedded, and can run multiple tasks 'in parallel' by utilizing fast hardware-based context-switching between the two register sets upon things like cache-misses or special instructions. To the OSs the logical CPUs are almost undistinguishable from physical CPUs. In fact the current scheduler treats each logical CPU as a separate physical CPU - which works but does not maximize multiprocessing performance on SMT/HT boxes.

The following properties have to be provided by a scheduler that wants to be 'fully HT-aware':
- HT-aware passive load-balancing: the irq-driven balancing has to be per-physical-CPU, not per-logical-CPU. Otherwise it might happen that one physical CPU runs 2 tasks, while another physical CPU runs no threads. The stock scheduler does not recognize this condition as 'imbalance' - to the scheduler it appears as if the first two CPUs had 1-1 task running, the second two CPUs had 0-0 tasks running. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU.
- 'active' load-balancing when a logical CPU goes idle and thus causes a physical CPU imbalance. This is a mechanism that simply does not exist in the stock 1:1 scheduler - the imbalance caused by an idle CPU can be solved via the normal load-balancer. In the HT case the situation is special because the source physical CPU might have just two tasks running, both runnable - this is a situation that the stock load-balancer is unable to handle - running tasks are hard to be migrated away. But it's essential to do this - otherwise a physical CPU can get stuck running 2 tasks, while another physical CPU stays idle.
- HT-aware task pickup. When the scheduler picks a new task, it should prefer all tasks that share the same physical CPU - before trying to pull in tasks from other CPUs. The stock scheduler only picked tasks that were scheduled to that particular logical CPU.
- HT-aware affinity. Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.
- HT-aware wakeup. again this is something completely new - the stock scheduler only knows about the 'current' CPU, it does not know about any sibling [== logical CPUs on the same physical CPU] logical CPUs. On HT, if a thread is woken up on a logical CPU that is already executing a task, and if a sibling CPU is idle, then the sibling CPU has to be woken up and has to execute the newly woken up task immediately.
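The imbalance and wakeup properties listed above can be illustrated with a toy model. This is only a sketch and not code from the patch: the topology (two physical CPUs with two siblings each), the `SIBLING_MAP` table, and every function name are assumptions made for illustration, and the "stock" check is a deliberate simplification of a per-logical-CPU balancer that can only pull a waiting task from a runqueue holding more than one.

```python
from collections import defaultdict

# Hypothetical topology (an assumption): logical CPU id -> physical CPU id.
SIBLING_MAP = {0: 0, 1: 0, 2: 1, 3: 1}


def physical_load(running, sibling_map=SIBLING_MAP):
    """Sum the per-logical-CPU task counts for each physical CPU."""
    load = defaultdict(int)
    for cpu, package in sibling_map.items():
        load[package] += running.get(cpu, 0)
    return dict(load)


def stock_sees_imbalance(running, sibling_map=SIBLING_MAP):
    """Simplified per-logical-CPU view: an idle CPU can only pull a waiting
    task, so some runqueue must hold 2+ tasks while another holds none."""
    counts = [running.get(cpu, 0) for cpu in sibling_map]
    return max(counts) >= 2 and min(counts) == 0


def ht_sees_imbalance(running, sibling_map=SIBLING_MAP):
    """Per-physical-CPU view: moving one task helps only if the busiest
    physical CPU has at least 2 more tasks than the least busy one."""
    loads = physical_load(running, sibling_map).values()
    return max(loads) - min(loads) >= 2


def pick_wakeup_cpu(current_cpu, running, sibling_map=SIBLING_MAP):
    """HT-aware wakeup: if the waking CPU is busy, prefer an idle sibling."""
    if running.get(current_cpu, 0) == 0:
        return current_cpu
    package = sibling_map[current_cpu]
    for cpu, pkg in sibling_map.items():
        if pkg == package and cpu != current_cpu and running.get(cpu, 0) == 0:
            return cpu
    return current_cpu  # no idle sibling: stay on the current CPU


# The 1-1 / 0-0 case from the first bullet: both tasks on one physical CPU.
example = {0: 1, 1: 1, 2: 0, 3: 0}
print(physical_load(example))         # {0: 2, 1: 0}
print(stock_sees_imbalance(example))  # False
print(ht_sees_imbalance(example))     # True
```

In this model the 1-1 / 0-0 placement looks balanced to the per-logical-CPU check, because no runqueue has a waiting task to pull, while the per-physical-CPU check flags it. Likewise, `pick_wakeup_cpu(0, {0: 1})` returns 1, the idle sibling of the busy CPU 0.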
    2.5.31-BK-curr:      - fluctuates between 2.60 secs and 4.6 seconds.
    BK-curr + sched-F3:  - stable 2.60 sec results.
the results under the stock scheduler depend on pure luck: which CPUs get
the tasks scheduled. In the HT-aware case each task gets scheduled on a
separate physical CPU, all the time.
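To give a rough feel for the "pure luck" remark, the sketch below counts how often two tasks would land on the same physical CPU. It rests on assumptions that are not in the email: two physical CPUs with two logical siblings each, exactly two CPU-bound tasks, and a uniformly random placement onto distinct logical CPUs. The real stock scheduler does not place tasks uniformly at random, so this is an idealized illustration only.

```python
from fractions import Fraction
from itertools import combinations


def shared_package_probability(sibling_map, n_tasks=2):
    """Probability that at least two of n_tasks tasks share a physical CPU,
    given a uniformly random choice of distinct logical CPUs (assumption)."""
    cpus = sorted(sibling_map)
    placements = list(combinations(cpus, n_tasks))
    colliding = sum(
        1
        for placement in placements
        if len({sibling_map[cpu] for cpu in placement}) < n_tasks
    )
    return Fraction(colliding, len(placements))


# Hypothetical topology: logical CPU id -> physical CPU id.
sibling_map = {0: 0, 1: 0, 2: 1, 3: 1}
print(shared_package_probability(sibling_map))  # 1/3
```

Under these assumptions, one placement in three puts both tasks on sibling CPUs, which is the slow case an HT-aware scheduler avoids.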
compiling the kernel source via "make -j2" [under-utilizes CPUs]:
    2.5.31-BK-curr:      45.3 sec
    BK-curr + sched-F3:  41.3 sec
ie. a ~10% improvement. The tests were the best results picked from lots
of (>10) runs. The no-HT numbers fluctuate much more (again the randomness
effect), so the average compilation time in the no-HT case is higher.
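The "~10%" figure can be checked directly from the two best-case timings quoted above. The helper names below are illustrative and not part of the patch; the point of the sketch is that the result depends on which baseline is used.

```python
def time_reduction(before, after):
    """Fraction of wall-clock time saved, relative to the original time."""
    return (before - after) / before


def speedup(before, after):
    """Throughput gain, relative to the new (shorter) time."""
    return before / after - 1.0


stock, ht_aware = 45.3, 41.3  # seconds, best "make -j2" runs from the email
print(f"{time_reduction(stock, ht_aware):.1%}")  # 8.8%
print(f"{speedup(stock, ht_aware):.1%}")         # 9.7%
```

Measured as time saved the gain is about 8.8%, and measured as speedup it is about 9.7%, so the rounded "~10%" corresponds to the speedup reading.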
saturated compilation "make -j5" results are roughly equivalent, as
expected - the one-runqueue-per-CPU concept works adequately when the
number of tasks is larger than the number of logical CPUs. The stock
scheduler works well on HT boxes in the boundary conditions: when there's
1 task running, and when there's more than nr_cpus tasks running.
the patch also unifies some of the other code and removes a few more
#ifdef CONFIG_SMP branches from the scheduler proper.
(the patch compiles/boots/works just fine on UP and SMP as well, on the P4
box and on another PIII SMP box as well.)
Testreports, comments, suggestions welcome,
Ingo