Excerpts from the Ingo Molnar interview (from KernelTrap)

Ingo Molnar: i think i first heard about Linux around 1993, but i truly got hooked on kernel development in 1995 when i bought the German edition of the 'Linux Kernel Internals' book. It might sound a bit strange but i installed my first Linux box for the sole purpose of looking at the kernel source - which i found (and still find) fascinating. So i guess i'm one of the few people who started out as a kernel developer, later on learned how to be a Linux admin and then finally learned their way around as a Linux user ;-)

JA: What was your first contribution to the kernel?

Ingo Molnar: my very first contribution was a trivial #ifdef bugfix to the networking code, which was reviewed and merged by Alan Cox. At that point i had been lurking on the kernel mailing list for a couple of months already. My first bigger patch was to arch/i386/kernel/time.c: i implemented timestamp-counter based gettimeofday() on Pentiums, which sped up the gettimeofday() syscall by a factor of ~4 - that code is still alive in current kernels. This patch too was first reviewed by Alan Cox.

I strongly believe that a positive 'first contact' between kernel newbies and kernel oldbies is perhaps the single most important factor in attracting new developers to Linux. Besides having the ability to code, kernel developers also need the ability to talk and listen to other developers.

JA: Your most recent contribution to the Linux kernel was the O(1) scheduler, merged into the 2.5 development tree in early January. When did you first start working on this project, and what was the inspiration?

Ingo Molnar: one of the core ideas used within the O(1) scheduler - the use of two sets of priority arrays to achieve fairness - in fact originates from around 1998; i even wrote a preliminary patch back then. I couldn't solve some O(N) problems at the time, so i stopped working on it. I started working on the current O(1) scheduler late last year (2001), sometime in December. The inspiration was what the name suggests: to create a good scheduler for Linux that is O(1) in its entirety - wakeup, context-switching and timer interrupt overhead.

JA: Did you base the design on any existing scheduler implementations or research papers?

Ingo Molnar: this might sound a bit arrogant, but i have only read (most of) the research papers after writing the scheduler. This i found to be a good approach in the area of Linux - knowing about too many well-researched details can often confuse the real direction we have to take. I like writing new code, and i prefer to approach things from the physics side: take a few elementary rules and build up the 'one correct' solution, no compromises. This might not be as effective as first reading all the available material and then cherry-picking a few ideas and thinking up the remaining things, but it sure gives me lots of fun :-)

[ One thing i always try to ensure: i take a look at all existing kernel patches that were announced on the linux-kernel mailing list in the same area, to make sure there's no duplication of effort or NIH syndrome. Since such kernel mailing-list postings are progress reports of active research, it can be said that i read a lot of bleeding-edge research. ]

JA: Can you explain how your O(1) scheduler improves upon the previous Linux scheduler?

Ingo Molnar: there are three main areas of improvement.

firstly, as the name suggests, it behaves pretty well independently of how many tasks there are in the system. A number of server workloads (eg. JVMs) actually triggered this inefficiency in the old scheduler.

secondly, scheduling on SMP got improved significantly: both performance and scalability are much better. The scheduler's decisions are also much more robust these days, because the core design is SMP-aware.

the third improvement is in the way interactive tasks are handled. This is actually the change that should be the most noticeable for ordinary users. Interactive tasks are now detected via a separate, usage-statistics-driven mechanism, which is decoupled from other scheduler mechanisms such as timeslice management. The end result is that interactive tasks stay snappy under heavy load, and CPU-intensive tasks are isolated from interactive tasks much better, so that they cannot monopolize CPU resources. This part is still being tweaked - the important thing is that it's decoupled; the old scheduler had lots of behavioral details integrated into it implicitly, which made tweaking harder. There's even a patch that makes all scheduler-internal constants (timeslice length, various deadlines and interactivity rules) runtime configurable.

the scheduler also enabled the implementation of a new scheduling policy: batch-scheduling of CPU-intensive tasks. This is a correct implementation of the SCHED_IDLE patches that are floating around - the end result is that batch-scheduled tasks do not disturb other tasks in the system in any way; if other tasks are running then batch-scheduled tasks take up zero CPU time. This can be used for things like SETI calculations, or long numeric calculations in university setups.
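To illustrate the constant-time behavior described above, here is a minimal, self-contained sketch of the underlying idea: one queue per priority level plus a bitmap of non-empty levels, so picking the next task never requires scanning the task list. The names (prio_array, pick_next and so on) and the single-array layout are simplifications for illustration; the real scheduler keeps two such arrays per CPU ('active' and 'expired') and swaps them when the active one empties.

    /* Toy model of an O(1) task pick: fixed-size bitmap + per-priority lists.
     * Illustrative only - not the kernel's code. */
    #include <stdio.h>

    #define MAX_PRIO      140                 /* 100 RT levels + 40 nice levels */
    #define BITS_PER_LONG (8 * sizeof(unsigned long))
    #define BITMAP_WORDS  ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

    struct task {
        const char *name;
        int prio;                             /* 0 = highest priority */
        struct task *next;                    /* simple per-priority list */
    };

    struct prio_array {
        unsigned long bitmap[BITMAP_WORDS];   /* bit set => level has tasks */
        struct task *queue[MAX_PRIO];         /* head of each priority list */
    };

    static void enqueue(struct prio_array *a, struct task *t)
    {
        t->next = a->queue[t->prio];
        a->queue[t->prio] = t;
        a->bitmap[t->prio / BITS_PER_LONG] |= 1UL << (t->prio % BITS_PER_LONG);
    }

    /* Constant amount of work no matter how many tasks are queued:
     * scan a fixed-size bitmap, then take the head of that list.
     * (__builtin_ctzl is a GCC/Clang builtin; the kernel has its own
     * find-first-bit helpers.) */
    static struct task *pick_next(struct prio_array *a)
    {
        for (unsigned i = 0; i < BITMAP_WORDS; i++)
            if (a->bitmap[i])
                return a->queue[i * BITS_PER_LONG + __builtin_ctzl(a->bitmap[i])];
        return NULL;                          /* runqueue is empty */
    }

    int main(void)
    {
        static struct prio_array rq;
        struct task number_cruncher = { "number-cruncher", 139, NULL };
        struct task editor          = { "editor",          110, NULL };

        enqueue(&rq, &number_cruncher);
        enqueue(&rq, &editor);
        printf("next task: %s\n", pick_next(&rq)->name);   /* prints "editor" */
        return 0;
    }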
JA: Is this batch-scheduling in the queue of patches waiting for inclusion into the 2.5 kernel?

Ingo Molnar: well, batch scheduling is a feature that is welcomed by a number of users, but is largely irrelevant to others. Right now it's a separate patch to the stock scheduler. There are also some conflicting requests about SCHED_BATCH semantics: some people would like priority levels to cause RT-like separation of execution times, while the current SCHED_BATCH patch uses priority levels [ie. nice values] to determine the percentage of CPU time shared between SCHED_BATCH tasks. Until such issues are decided (by actual use), it's not good to codify them by moving the patch into the stock kernel.

ie. in the first 'RT-alike priorities' model, if a 'nice level 15' SCHED_BATCH task is running at the same time as a 'nice level 10' task, the nice-10 task will get all the CPU time - always, until it exits. In the second priority model the nice-10 task will get more CPU time than the nice-15 task, but both of them will get CPU time.

another property of SCHED_BATCH scheduling is the use of much longer timeslices. Eg. right now it's 3 seconds for a default-priority SCHED_BATCH task, while normal tasks have 150 msec timeslices. For things like numeric calculations it's good to have as long timeslices as possible, to minimize the effect of cache thrashing. Eg. on a sufficiently powerful CPU with a 2 MB L2 cache, the 'population time' of the cache can be as high as 10 milliseconds. So if there are two numeric calculation tasks that both fully utilize the L2 cache (in nonobvious patterns), and which context-switch every 150 milliseconds, then they will waste 10 milliseconds on cache-rebuilding in the first ~6% of their timeslice. This shows up as a direct ~6% slowdown of the numeric calculation jobs. Now, if SCHED_BATCH is used, and each task has a 3000 millisecond timeslice, then the cache-rebuild overhead can be at most 0.3% - a far more acceptable number.

this is also one of the reasons why the default timeslice length got almost doubled over the 2.4 scheduler's timeslice length (there it was 80 msecs). We cannot do the same in the 2.4 scheduler because it has weaker interactivity-detection code, which would make things like an X desktop appear 'sluggish' while eg. a compilation job is running. I think in the longer term we want to have a more abstract timeslice management solution (something like the fairsched patch), which is more than possible with the O(1) scheduler.
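For reference, the cache-rebuild arithmetic above works out as follows - a tiny stand-alone calculation using only the figures quoted in the answer (a ~10 msec L2 repopulation time against 150 msec and 3000 msec timeslices):

    /* Worst-case share of a timeslice lost to refilling the L2 cache,
     * using the figures from the interview. */
    #include <stdio.h>

    int main(void)
    {
        const double cache_refill_ms  = 10.0;
        const double timeslices_ms[2] = { 150.0, 3000.0 };  /* normal, SCHED_BATCH */

        for (int i = 0; i < 2; i++) {
            double overhead = 100.0 * cache_refill_ms / timeslices_ms[i];
            printf("%6.0f msec timeslice -> %.1f%% worst-case cache-rebuild overhead\n",
                   timeslices_ms[i], overhead);
        }
        return 0;
    }
    /* Output: roughly 6.7% for 150 msec timeslices, 0.3% for 3000 msec. */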
JA: How do JVMs trigger an inefficiency in the old scheduler?

Ingo Molnar: the Java programming model prefers the use of many 'threads', which is a valid and popular application programming model. So JVMs under Linux tend to be amongst the applications that use the most processes/threads, which interact in complex ways. Schedulers usually have the most work to do when there are many tasks in the system, so JVMs tend to trigger scheduler inefficiencies sooner than perhaps any other Linux application.

JA: Are you aware of any areas where your O(1) scheduler doesn't perform as well as the 2.4 scheduler?

Ingo Molnar: not really - i tried to make sure we preserve all the good things from the 2.4 scheduler. If anyone manages to identify such an area then please mail me about it! :-)

well, there's one area where a difference can be felt: nice levels (priorities) are taken far more seriously in the new scheduler. So if one wants maximum X performance even during heavy X load which makes X a "CPU hog", the X server should be reniced to nice level -10 or -15:

    renice -15 -u root

the above command renices all currently existing root-owned processes to -15; on most distributions this includes admin shells and X as well. Some distributions that added the O(1) scheduler to their kernel also set X's priority to -10 or -15 in X's startup scripts - an obviously more robust solution.

similarly, audio playback code that uses up lots of CPU time (Ogg decoders for example) should use nice level -20 to get the best audio latencies and no 'skipping' of the soundtrack. Most of the audio playback applications already support the use of RT priorities for playback - nice level -20 under the O(1) scheduler is a far more secure solution. (If a task with RT priority locks up then that can cause a lockup of the system.) Obviously all of these operations are privileged, and can only be done as root.
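As an illustration of the advice above, this is roughly how a playback application could request nice -20 for itself via the standard setpriority() call, instead of switching to an RT priority; a sketch only, and, as noted, it needs root privileges to succeed:

    /* Sketch: lower our own nice value to -20 so the O(1) scheduler favors
     * us, without resorting to an RT priority that could lock up the
     * machine if we misbehave. */
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        /* PRIO_PROCESS with pid 0 means "the calling process". */
        if (setpriority(PRIO_PROCESS, 0, -20) == -1) {
            fprintf(stderr, "setpriority: %s (are we running as root?)\n",
                    strerror(errno));
            return 1;
        }
        /* ... start the audio decoding/playback loop here ... */
        return 0;
    }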
JA: What are the scalability limitations of the O(1) scheduler?

Ingo Molnar: i'm afraid there are none currently - the runqueues are perfectly isolated. The load-balancer is the only piece of code that has to look at the 'global' scheduling picture, but even that code first tries to figure out whether it has work in a 'lock-less' way; only then does it go and take the runqueue locks. And the load-balancer runs at a much lower frequency than other pieces of the scheduler - the 'big' load balancer runs every 250 msecs, which, in the kernel's timescale, is an eternity. The 'idle rebalancer' runs every 1 millisecond on every idle CPU, but since it uses up idle time its cost is essentially zero.

this property of the load-balancer also enables us to easily add more complex things like support for NUMA cache hierarchies - it's not a performance- or scalability-critical piece of code. It can support everything from '16 isolated groups of 4 CPUs' to '32 CPUs on a single chip', or any other future cache hierarchy. This is the beauty of the new multiprocessor scheduler: the handling of the cache hierarchy [ie. SMP or NUMA or CCNUMA, etc.] is decoupled from the actual per-CPU scheduling, so (hopefully) there's no radical redesign needed in the future.
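Below is a stand-alone sketch of the 'peek without locks, lock only when there is work to do' pattern described above. The data structures and the balancing rule here are simplified stand-ins, not the kernel's load_balance():

    /* Toy illustration: nr_running is read with no lock to decide whether
     * balancing is needed at all; the per-runqueue locks are taken only
     * when tasks are actually moved. */
    #include <pthread.h>
    #include <stdio.h>

    #define NR_CPUS 4

    struct runqueue {
        pthread_mutex_t lock;
        int nr_running;
    };

    static struct runqueue rq[NR_CPUS];

    /* Lock-less pass: just read the counters. */
    static int find_busiest(int this_cpu, int *imbalance)
    {
        int busiest = -1, max = rq[this_cpu].nr_running;
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (cpu != this_cpu && rq[cpu].nr_running > max) {
                max = rq[cpu].nr_running;
                busiest = cpu;
            }
        }
        if (busiest >= 0)
            *imbalance = (max - rq[this_cpu].nr_running) / 2;
        return busiest;
    }

    static void load_balance(int this_cpu)
    {
        int imbalance = 0;
        int busiest = find_busiest(this_cpu, &imbalance);
        if (busiest < 0 || imbalance == 0)
            return;                           /* nothing to do: no locks taken */

        /* Only now take the locks and "migrate" tasks.  (A real concurrent
         * implementation must take the two locks in a fixed order to avoid
         * deadlock; this single-threaded demo skips that.) */
        pthread_mutex_lock(&rq[this_cpu].lock);
        pthread_mutex_lock(&rq[busiest].lock);
        rq[busiest].nr_running  -= imbalance;
        rq[this_cpu].nr_running += imbalance;
        pthread_mutex_unlock(&rq[busiest].lock);
        pthread_mutex_unlock(&rq[this_cpu].lock);
    }

    int main(void)
    {
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            pthread_mutex_init(&rq[cpu].lock, NULL);
            rq[cpu].nr_running = cpu * 3;      /* uneven load: 0, 3, 6, 9 */
        }
        load_balance(0);
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            printf("cpu%d: %d tasks\n", cpu, rq[cpu].nr_running);
        return 0;
    }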
JA: Are the improvements of the O(1) scheduler mainly felt on large servers with multiple CPUs? For example, my home server is an aging PIII 650, and there's definitely a finite limit to the number of processes it can handle at one time.

Ingo Molnar: the improvements are noticeable for basically every workload where the CPU is 'overloaded': interactive tasks are actively detected and preferred, and the scheduler itself does not add to the load no matter how high the load is. While it's typically servers that are overloaded (mainly because server admins can size their own needs better than desktop users, and mainly because desktop use is largely dependent on the reaction speed of humans, which is quite lacking), it's still quite common for desktop systems to get into various sorts of CPU overload situations - so overload handling is important to both categories of use.

JA: What sort of tuning is left to be done?

Ingo Molnar: there are things left like support for non-SMP and non-UP cache hierarchies, like NUMA or SMT (HyperThreading), but the basic design makes the scheduler well-suited for such purposes as well. In most cases an alternative load-balancing algorithm solves the problems. Plus the tweaking of parameters will perhaps never end.

JA: Do you aim to have the O(1) scheduler eventually merged into the mainline 2.4 kernel?

Ingo Molnar: this largely depends on Marcelo. I'm (trying to) do periodic backports of the scheduler to 2.4, and feedback has been positive so far. Most distributions include the O(1) scheduler in their kernel tree, so the code gets a fair amount of testing. (In fact only Debian does not include it; this is due to a generic policy of shipping the default kernel as shipped by Linus.) If 2.6 is released soon enough then it might not be worth putting the O(1) scheduler into 2.4 - with so much stuff being backported to 2.4, i think 2.6 should have some new features by the time it's released! :-)

JA: Whether or not it actually happens, how much more testing needs to be done before you personally would be comfortable with the O(1) scheduler being merged into the stable kernel?

Ingo Molnar: it depends on what rule Marcelo uses to include stuff in 2.4. If the rule is 'include stuff that lots of people use and which works just fine' then the O(1) scheduler is ready. If the rule is 'include nonintrusive or must-have fixes only' then the O(1) scheduler should not be included, since the 2.4 scheduler works just fine for most workloads. The 2.4 scheduler is still actively maintained and has no major problems, so it's not like we are in a hurry.

JA: How up-to-date is the O(1) scheduler that's part of Alan Cox's 2.4-ac tree?

Ingo Molnar: it's in essence the same scheduler that we have in 2.5; it has one minor tweak missing. In the 2.5 scheduler we also have bits of code from other kernel features which are not present in 2.4 (and probably never will be): Rusty Russell's hotplug CPU framework, Robert Love's preemptible-kernel code, Dipankar Sarma's RCU code and Andrew Morton's autoremove wakeups. So the 2.5 sched.c cannot be directly compared to 2.4's sched.c.

JA: You've also recently been experimenting with making the O(1) scheduler aware of Hyper-Threading (aka simultaneous multithreading) capable CPUs. You explained in an email to the Linux kernel mailing list how you implemented this by introducing the concept of a shared runqueue. With future tuning, how much of a performance gain do you think you can get by adding this support?

Ingo Molnar: this patch makes a measurable impact when the HT-capable system is not fully utilized - eg. if a 2-CPU HT system (4 logical CPUs) has 2 tasks running. In this case the correct scheduling decision is to move the two tasks to two different physical CPUs. This alone can result in an up to 30% increase in the two tasks' performance, but for HT systems that are out on the market now it could have a bigger impact as well. It all depends on the tasks, how much cache they use and how well the SMT hardware switches between logical CPUs.

when the HT system is fully utilized then the 'stock' scheduler gets pretty close to the 'HT-aware' scheduler's performance, due to an existing feature of the scheduler, the so-called "cache-decay based affinity".

JA: Have you had much feedback regarding this patch?

Ingo Molnar: Intel is obviously interested, and so were a number of kernel developers, and users as well. But i do not expect the kind of feedback the O(1) scheduler itself produced - HT systems are fresh on the market, and the stock O(1) scheduler handles them reasonably well already.

JA: What processors currently support Hyper-Threading?

Ingo Molnar: only Intel, AFAIK - Hyper-Threading is an Intel trademark, iirc. I think there are some non-x86 CPUs that have SMT concepts included, perhaps PowerPCs or Alpha?

JA: You're also the author of the original kernel preemption patch. How did your patch differ from the more recent work Robert Love has done in this area?

Ingo Molnar: it was a small concept-patch from early 2000 that just showed that a preemptible kernel can indeed be done by using SMP spinlocks. The patch, while it booted and appeared to work to a certain degree, had bugs and did not handle the many cases that need special care, which Robert's patches and the current 2.5 kernel handle correctly.

otherwise the base approach is IMO very similar; it has things like:

    +   preempt_on();
        clear_highpage(page);
    +   preempt_off();

and:

    +   atomic_inc_local(&current->may_preempt);

which is quite similar to what we have in 2.5 today, with the difference that Robert and the kernel developer community actually did the other 95% of the work :-)
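For comparison, the shape this idea took in the 2.5 kernel is a per-task nesting counter: preemption is allowed only while the counter is zero. Below is a stand-alone model of that pattern; the real preempt_disable()/preempt_enable() additionally include compiler barriers and a reschedule check when the count drops back to zero.

    /* Minimal model of a nesting preemption counter, in the spirit of the
     * preempt_on()/preempt_off() snippet above.  Not kernel code. */
    #include <stdio.h>

    static int preempt_count;               /* 0 means "may be preempted" */

    static void preempt_disable(void) { preempt_count++; }

    static void preempt_enable(void)
    {
        preempt_count--;
        /* The real kernel checks here whether a reschedule was requested
         * while preemption was off, and calls the scheduler if the count
         * has just dropped back to zero. */
    }

    static int preemptible(void) { return preempt_count == 0; }

    int main(void)
    {
        printf("preemptible? %d\n", preemptible());   /* 1 */
        preempt_disable();                            /* eg. around per-CPU data */
        preempt_disable();                            /* nesting is fine */
        printf("preemptible? %d\n", preemptible());   /* 0 */
        preempt_enable();
        preempt_enable();
        printf("preemptible? %d\n", preemptible());   /* 1 */
        return 0;
    }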