Linux 2.6 Supports Hyperthreading Awareness

Windows Server 2008 R2 began supporting NUMA and hyperthreading awareness, so I looked into Linux and found that Linux 2.6 supports it as well.

(Reposted from: http://kerneltrap.org/node/391)

Linux:  HyperThreading-Aware Scheduler

Submitted by Jeremy
on August 28, 2002 - 12:59pm

Ingo Molnar, author of the O(1) scheduler [earlier story] and the original preemptive kernel patch, has provided a patch to make the O(1) scheduler fully aware of HyperThreading.  Ingo explains:

"Symmetric multithreading (hyperthreading) is an interesting new concept that IMO deserves full scheduler support. Physical CPUs can have multiple (typically 2) logical CPUs embedded, and can run multiple tasks 'in parallel' by utilizing fast hardware-based context-switching between the two register sets upon things like cache-misses or special instructions. To the OSs the logical CPUs are almost undistinguishable from physical CPUs. In fact the current scheduler treats each logical CPU as a separate physical CPU - which works but does not maximize multiprocessing performance on SMT/HT boxes."

Read on for Ingo's full explanation.

From: Ingo Molnar
To: linux-kernel
Subject: [patch] "fully HT-aware scheduler" support, 2.5.31-BK-curr
Date: 	Tue, 27 Aug 2002 03:44:23 +0200 (CEST)

symmetric multithreading (hyperthreading) is an interesting new concept
that IMO deserves full scheduler support. Physical CPUs can have multiple
(typically 2) logical CPUs embedded, and can run multiple tasks 'in
parallel' by utilizing fast hardware-based context-switching between the
two register sets upon things like cache-misses or special instructions.
To the OSs the logical CPUs are almost undistinguishable from physical
CPUs. In fact the current scheduler treats each logical CPU as a separate
physical CPU - which works but does not maximize multiprocessing
performance on SMT/HT boxes.

The following properties have to be provided by a scheduler that wants to
be 'fully HT-aware':

  • HT-aware passive load-balancing: the irq-driven balancing has to be per-physical-CPU, not per-logical-CPU. Otherwise it might happen that one physical CPU runs 2 tasks, while another physical CPU runs no threads. The stock scheduler does not recognize this condition as 'imbalance' - to the scheduler it appears as if the first two CPUs had 1-1 task running, the second two CPUs had 0-0 tasks running. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU.
  • 'active' load-balancing when a logical CPU goes idle and thus causes a physical CPU imbalance. This is a mechanism that simply does not exist in the stock 1:1 scheduler - the imbalance caused by an idle CPU can be solved via the normal load-balancer. In the HT case the situation is special because the source physical CPU might have just two tasks running, both runnable - this is a situation that the stock load-balancer is unable to handle - running tasks are hard to be migrated away. But it's essential to do this - otherwise a physical CPU can get stuck running 2 tasks, while another physical CPU stays idle.
  • HT-aware task pickup. When the scheduler picks a new task, it should prefer all tasks that share the same physical CPU - before trying to pull in tasks from other CPUs. The stock scheduler only picked tasks that were scheduled to that particular logical CPU.
  • HT-aware affinity. Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.
  • HT-aware wakeup. again this is something completely new - the stock scheduler only knows about the 'current' CPU, it does not know about any sibling [== logical CPUs on the same physical CPU] logical CPUs. On HT, if a thread is woken up on a logical CPU that is already executing a task, and if a sibling CPU is idle, then the sibling CPU has to be woken up and has to execute the newly woken up task immediately.
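
The first bullet above (the "1-1 / 0-0" case) can be illustrated with a small toy model. This is only a sketch, not kernel code: the CPU numbering, the `package_of` table, and the rule "a gap of at least 2 counts as imbalance" are all assumptions made for illustration, since the mail does not spell out the stock balancer's exact criterion.

```python
# Toy model only: not kernel code. The CPU numbering and the threshold rule
# are assumptions made for illustration.

def package_loads(nr_running, package_of):
    """Sum the per-logical-CPU task counts into per-physical-CPU totals."""
    loads = {}
    for cpu, count in enumerate(nr_running):
        pkg = package_of[cpu]
        loads[pkg] = loads.get(pkg, 0) + count
    return loads

def looks_imbalanced(loads, threshold=2):
    """Assumed rule: only a gap of `threshold` or more is worth a migration."""
    values = list(loads)
    return max(values) - min(values) >= threshold

# Four logical CPUs; CPUs 0 and 1 are siblings, as are CPUs 2 and 3.
package_of = [0, 0, 1, 1]
nr_running = [1, 1, 0, 0]   # the "1-1 / 0-0" case from the first bullet

per_logical = looks_imbalanced(nr_running)
per_physical = looks_imbalanced(package_loads(nr_running, package_of).values())
print(per_logical, per_physical)  # prints: False True
```

Under these assumptions the per-logical view sees a gap of only one task and does nothing, while the per-physical view sees one package with two tasks and one with none.
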

the attached patch (against 2.5.31-BK-curr) implements all the above
HT-scheduling needs by introducing the concept of a shared runqueue:
multiple CPUs can share the same runqueue. A shared, per-physical-CPU
runqueue magically fulfills all the above HT-scheduling needs. Obviously
this complicates scheduling and load-balancing somewhat (see the patch for
details), so great care has been taken to not impact the non-HT schedulers
(SMP, UP). In fact the SMP scheduler is a compile-time special case of the
HT scheduler. (and the UP scheduler is a compile-time special case of the
SMP scheduler)

the patch is based on Jun Nakajima's prototyping work - the lowlevel
x86/Intel bits are still those from Jun, the sched.c bits are newly
implemented and generalized.

There's a single flexible interface for lowlevel boot code to set up
physical CPUs: sched_map_runqueue(cpu1, cpu2) maps cpu2 into cpu1's
runqueue. The patch also implements the lowlevel bits for P4 HT boxes for
the 2/package case.

(NUMA systems which have tightly coupled CPUs with a smaller cache and
protected by a large L3 cache might benefit from sharing the runqueue as
well - but the target for this concept is SMT.)

some numbers: compiling a standalone floppy.c in an infinite loop takes
2.55 seconds per iteration. Starting up two such loops in parallel, on a
2-physical, 2-logical (total of 4 logical CPUs) P4 HT box gives the
following numbers:

  2.5.31-BK-curr:     - fluctuates between 2.60 secs and 4.6 seconds.

  BK-curr + sched-F3: - stable 2.60 sec results.

the results under the stock scheduler depend on pure luck: which CPUs get
the tasks scheduled. In the HT-aware case each task gets scheduled on a
separate physical CPU, all the time.
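
To give a rough feel for the "pure luck" remark, the sketch below counts how often two tasks would end up on the same physical CPU if they were placed on two distinct logical CPUs uniformly at random. The uniform-placement assumption and the `package_of` layout are mine, not something the mail states; the real scheduler's placement is not random in this simple sense.

```python
from fractions import Fraction
from itertools import combinations

# Assumed layout: logical CPUs 0 and 1 share package 0, CPUs 2 and 3 share package 1.
package_of = [0, 0, 1, 1]

def same_package_probability(package_of, tasks=2):
    """Chance that at least two of `tasks` tasks share a package, assuming
    every choice of distinct logical CPUs is equally likely."""
    placements = list(combinations(range(len(package_of)), tasks))
    colliding = [p for p in placements
                 if len({package_of[cpu] for cpu in p}) < tasks]
    return Fraction(len(colliding), len(placements))

print(same_package_probability(package_of))  # prints: 1/3
```

Under that assumption one placement in three puts both compile loops on sibling CPUs, which is consistent with run times that swing between the good and the bad case.
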

compiling the kernel source via "make -j2" [under-utilizes CPUs]:

  2.5.31-BK-curr:              45.3 sec

  BK-curr + sched-F3:          41.3 sec

ie. a ~10% improvement. The tests were the best results picked from lots
of (>10) runs. The no-HT numbers fluctuate much more (again the randomness
effect), so the average compilation time in the no-HT case is higher.
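
As a quick check of the "~10%" figure, the two best-case timings above can be compared in two common ways. The variable names are mine; the numbers are the ones quoted in the mail.

```python
stock = 45.3      # seconds, 2.5.31-BK-curr
ht_aware = 41.3   # seconds, BK-curr + sched-F3

time_saved = (stock - ht_aware) / stock   # fraction of wall-clock time saved
speedup = stock / ht_aware - 1            # relative throughput gain

print(f"{time_saved:.1%} less time, {speedup:.1%} faster")
# prints: 8.8% less time, 9.7% faster
```

Either reading rounds to roughly 10%, matching the mail's estimate.
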

saturated compilation "make -j5" results are roughly equivalent, as
expected - the one-runqueue-per-CPU concept works adequately when the
number of tasks is larger than the number of logical CPUs. The stock
scheduler works well on HT boxes in the boundary conditions: when there's
1 task running, and when there's more than nr_cpus tasks running.
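
The difference between one runqueue per logical CPU and a shared per-physical-CPU runqueue can be sketched with a toy model. The method name `sched_map_runqueue` mirrors the interface named in the mail, but the `RunQueue` and `ToyScheduler` classes and everything inside them are hypothetical; this is not the patch's code and it ignores priorities, locking, and load balancing.

```python
from collections import deque

class RunQueue:
    """A bare FIFO of runnable task names; real runqueues are far richer."""
    def __init__(self):
        self.tasks = deque()

class ToyScheduler:
    def __init__(self, nr_cpus):
        # Start with the stock layout: one private runqueue per logical CPU.
        self.rq_of = [RunQueue() for _ in range(nr_cpus)]

    def sched_map_runqueue(self, cpu1, cpu2):
        """Toy analogue of the interface in the mail: cpu2 joins cpu1's runqueue."""
        self.rq_of[cpu1].tasks.extend(self.rq_of[cpu2].tasks)
        self.rq_of[cpu2] = self.rq_of[cpu1]

    def enqueue(self, cpu, task):
        self.rq_of[cpu].tasks.append(task)

    def pick_next(self, cpu):
        """Return the next task visible to this CPU, or None if it must idle."""
        rq = self.rq_of[cpu]
        return rq.tasks.popleft() if rq.tasks else None

def demo(shared):
    sched = ToyScheduler(4)
    if shared:
        sched.sched_map_runqueue(0, 1)   # CPUs 0 and 1 are assumed siblings
        sched.sched_map_runqueue(2, 3)   # CPUs 2 and 3 are assumed siblings
    sched.enqueue(0, "A")
    sched.enqueue(0, "B")                # both tasks were queued on logical CPU 0
    return sched.pick_next(0), sched.pick_next(1)

print(demo(shared=False))  # ('A', None): the sibling idles while B waits
print(demo(shared=True))   # ('A', 'B'): the sibling picks up B at once
```

With private runqueues the sibling has to wait for a balancing pass before it sees task B; with the shared runqueue it finds B on its own queue, which is the "HT-aware task pickup" property in miniature.
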

the patch also unifies some of the other code and removes a few more
#ifdef CONFIG_SMP branches from the scheduler proper.

(the patch compiles/boots/works just fine on UP and SMP as well, on the P4
box and on another PIII SMP box as well.)

Testreports, comments, suggestions welcome,

Ingo
