From: Con Kolivas
Newsgroups: gmane.linux.kernel
Subject: [RFC] Orthogonal Interactivity Patches
Date: Mon, 25 Aug 2003 13:22:12 +1000
Lines: 285
Original-To: linux kernel mailing list

This is a description of the interactivity patches for 2.5/6 that started with O1int.

Background:

A long discussion has centred for some time on just what interactivity is. I did not set out to define interactivity and then modify the scheduler policy to match that definition. Instead I concentrated on improving the overall feel of using 2.6 on a desktop (gui or not). The area I concentrated on was tasks for which sufficient delay, or jitter, between the time they wake up and the time they are scheduled makes a noticeable difference. Such delays are palpable as skips in audio playback or jerky mouse movement. The effect of increasing load on the system, and the need to maintain fairness, must also be addressed when tackling these.

Introduction: how the O(1) scheduler deals with "interactive" tasks.

The O(1) cpu scheduler designed by Ingo Molnar and merged early in 2.5 development was adopted for its substantial improvement in scalability. The O(1) nature of the new scheduler means the overhead of increasing numbers of tasks on increasing numbers of cpus is negligible. Briefly, the scheduler maintains a runqueue for each cpu, split into an active and an expired array. It works thus: when a task is first woken up, it is added to the tail end of the active array, to be scheduled after all tasks of higher priority on that array and any tasks of the same priority already on the runqueue. If a task goes to sleep it is removed from the runqueue. If, however, it uses up a full timeslice, the scheduler has to decide whether to put it back on the end of the active array to be rescheduled again, or onto the expired array. Tasks on the active array are run round robin from highest to lowest priority, first come first served within each priority. Tasks on the expired array are not run unless the active array empties (all tasks go to sleep or are expired) or some predetermined starvation limit is reached.

The interactivity estimator in the 2.5 scheduler is designed to distinguish tasks that are interactive from those that are batch (pure cpu hogs). It works on the premise that batch tasks never sleep but use up all the cpu time offered to them, whereas interactive tasks sleep occasionally. A sleep_avg is stored for each task: on every jiffy tick (1 millisecond in 2.5) the task earns a sleep_avg point if it is sleeping, and loses one while it is running. Tasks with a high sleep_avg are considered interactive and those with a low sleep_avg are cpu hogs.

The scheduler takes the information from the interactivity estimator and assigns a dynamic priority to the task. The variation from the static priority depends on PRIORITY_BONUS, set to 25%, meaning a task's priority can vary +/- 5 from its static value. Interactive tasks get up to a 5 point bonus and cpu hogs up to a 5 point penalty. If a task uses up its full timeslice, the scheduler can choose not to expire it if it still satisfies TASK_INTERACTIVE, which works out to a priority bonus of about 3 or more for nice 0 tasks.

While this simple and intuitive method works quite well, it does have some intrinsic limitations.
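To make the estimator and bonus mapping above concrete, here is a minimal sketch in C. It is my own simplification, not the actual kernel/sched.c code: the constants follow the description above, and the structure and function names are illustrative only.

/* Simplified model of the 2.5 interactivity estimator. */
#define HZ              1000            /* 2.5 timer ticks per second on i386 */
#define MAX_SLEEP_AVG   (10 * HZ)       /* 10 seconds worth of ticks */
#define MAX_BONUS       5               /* +/- 5: 25% of the 40 priority levels */

struct task {
        int  static_prio;               /* set by the nice level */
        long sleep_avg;                 /* 0 .. MAX_SLEEP_AVG */
};

/* Per tick: sleeping earns a sleep_avg point, running burns one. */
static void update_sleep_avg(struct task *p, int sleeping)
{
        if (sleeping && p->sleep_avg < MAX_SLEEP_AVG)
                p->sleep_avg++;
        else if (!sleeping && p->sleep_avg > 0)
                p->sleep_avg--;
}

/* Map sleep_avg linearly to a -5 .. +5 bonus: full sleep_avg earns
 * the maximum bonus, zero sleep_avg the maximum penalty. */
static int effective_prio(struct task *p)
{
        int bonus = (int)(2 * MAX_BONUS * p->sleep_avg / MAX_SLEEP_AVG)
                        - MAX_BONUS;

        return p->static_prio - bonus;  /* lower value = better priority */
}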
Problems:

1. How much sleep avg needs to be accumulated before a task is interactive (MAX_SLEEP_AVG [MSA from here on]).

Decrease the value and it takes only a short time for a task to accumulate enough sleep avg to be considered interactive, and equally to burn off enough sleep avg to be seen as a cpu hog. The problem is that most real world interactive applications, while they sleep frequently, also use bursts of cpu at times. These bursts rapidly drop the interactive status of a task if the MSA is low, and such tasks suddenly get low priority and are scheduled with batch tasks or, worse, expired. The best example is X, which is interactive but not infrequently uses large bursts of cpu. A low MSA makes dragging a window smooth under no load, but as load increases, a window being dragged across the screen will stop and jerk and the mouse will become unresponsive, because X expires. The other problem with a low MSA is that most tasks, however interactive, start out using a lot of cpu. They are therefore seen as cpu hogs during startup, which translates into very slow application startup under load.

Increasing the value of MSA is a way to tackle the rapidly decaying interactive status and the slow startups, but it brings a different set of problems. In 2.6.0-test4 the default MSA is 10 seconds, meaning a task needs to have slept for ten seconds before it is considered maximally interactive. Because tasks are constantly waking up, burning cpu and going back to sleep, an MSA of 10 seconds means it can take up to a minute before the sleep_avg accurately represents the task's interactive status (an exponential function). This shows up as audio applications skipping like mad under any load until the music has been playing for over a minute, and each time a new song loads it starts all over again.

2. The idle task that turns into a cpu hog.

Tasks that idle a lot (shells etc.) can suddenly turn into cpu hogs very easily (fire off "while true ; do a=1 ; done"). Idle tasks are basically sleeping all the time, so they get maximum interactivity. With a high MSA, when an idle task becomes a cpu hog it can preempt almost everything until it gets flagged as non-interactive (which takes 3 seconds at an MSA of 10 seconds) and is expired.

3. The cpu hog that sleeps while waiting on disk i/o.

Batch tasks (like compilers) never sleep of their own accord. However, they are often forced to wait on disk i/o, and this registers as sleep. That artificially raises their interactive status, and the more such hogs there are waiting on i/o, the worse the scenario gets (witness make -j $largenumber). These tasks then swamp the cpu time.

4. Interactive tasks forking off cpu hogs.

A forked process inherits the sleep_avg of its parent, penalised by the tunable CHILD_PENALTY. The default is 95, meaning it inherits 95% of its parent's sleep avg. This is meant to stifle the ability of maximally interactive tasks to fork other maximally interactive tasks. The problem arises when an interactive task forks off a cpu hog: the cpu hog should not really start at a high interactive level, even if it drops later. Furthermore, the parent usually sleeps after it forks off a cpu hog, elevating its own priority again. This makes for a parent that can repeatedly fire off almost fully interactive children (make -j $largenumber again). Decreasing the child penalty helps, but incurs a fork startup slowdown. Applying a PARENT_PENALTY helps, but hurts the case where the parent is firing off lots of genuinely interactive tasks (eg a web server). A sketch of this inheritance follows.
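Roughly (reusing struct task from the earlier sketch; the tunables are the ones named above, the function itself is illustrative, not the kernel's):

#define CHILD_PENALTY   95      /* child keeps 95% of parent's sleep_avg */
#define PARENT_PENALTY  100     /* 100 leaves the parent's untouched */

/* At fork, the child starts with a scaled copy of the parent's
 * sleep_avg, so an interactive parent spawns near-interactive
 * children regardless of how they will actually behave. */
static void inherit_sleep_avg(struct task *parent, struct task *child)
{
        child->sleep_avg  = parent->sleep_avg * CHILD_PENALTY / 100;
        parent->sleep_avg = parent->sleep_avg * PARENT_PENALTY / 100;
}

With the defaults and a 10 second MSA, a maximally interactive parent hands each child 9.5 seconds of sleep_avg, which is why every compiler a make -j fires off starts life looking almost fully interactive.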
5. Intra jiffy wakeups.

The sleep avg of a task is incremented by the time spent sleeping, in jiffies, since it last went to sleep, and is decremented on each timer tick (1ms on i386). As hardware gets faster, many tasks may wake up, use cpu time and go back to sleep before a single jiffy has passed. If they wake often enough, or sleep for short enough periods, the resolution of the sleep avg will completely misjudge them.

6. Scheduling latency.

Two problems exist in this area. First, time spent on the runqueue is not credited as sleep time, and if the runqueue is busy this can be a not insignificant duration. This makes for worsening elevation of priority under load. Second, nice 0 tasks get 102ms of task_timeslice. If a task wakes up while another task is running and is not of higher priority, it will wait for the running task to finish its timeslice. Most audio apps wake up around every 50ms to do their work, which is often over in less than 2ms, but if they repeatedly wait long enough for other same-priority tasks to finish, it will eventually lead to audio dropouts. Also, some tasks need to run very frequently to minimise perceptible jitter (moving the mouse in X).

7. Priority inversion.

An interactive task that wakes up a cpu hog to do some work normally sleeps while waiting for the cpu hog to respond. A poorly written application can instead continually poll for a response from the cpu hog, and if it is higher priority than the cpu hog it will preempt it. A spiral can ensue of the interactive task waking up to look for information, preempting the cpu hog and never letting the cpu hog get any cpu time. Eventually the interactive task is seen to be wasting cpu, loses its priority bonus to the point where it is equal priority with the cpu hog, and finally the cpu hog gets scheduled. Cpu intensive games under wine/wineserver seem to exhibit this, as do blender v2.28 and kmail reading a pgp signed email.

While this is by no means an exhaustive list, it does describe some of the more common issues.

Current work:

Mike Galbraith did a lot of the excellent earlier work on tackling the sleep avg resolution issue by using high resolution timers with nanosecond resolution instead of milliseconds. Initially my efforts at interactivity improvement did not use this work, but Ingo Molnar wrote some nice infrastructure he considered essential that I then worked with. Ingo's patch tackled the sleep_avg resolution problem and added the on-runqueue bonus calculation. He went further, discriminating between on-runqueue tasks woken by interrupts and those woken by other tasks, giving more bonus to the interrupt-woken ones and weighting the others down; tasks woken from interrupts tend to be interactive ones. The other major addition was TIMESLICE_GRANULARITY [TG from here on]. This prevents a task from using up its full timeslice if there are other tasks of the same priority on the active array: each time TG passes (initially set to 50ms, later changed to 25ms) the task is requeued, with its remaining timeslice, at the end of the queue of tasks of the same priority (sketched below). Thus much of problems 5 and 6 above was already accounted for.

Approach:

What I set out to do was use the current infrastructure and not change any of the basic workings, so as not to alter the overall performance of the scheduler. Instead I focused on using the information presented to the interactivity estimator to choose the dynamic priority more carefully.
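Before going into the patches themselves, here is a rough sketch of the TG requeueing just mentioned. It is illustrative only: the real logic lives in the scheduler's timer tick path, the runqueue helpers below are invented for the example, and struct task is assumed to have gained int time_slice and int slice_ticks_used members (reset whenever a fresh timeslice is granted).

#define TIMESLICE_GRANULARITY   (25 * HZ / 1000)        /* 25ms of ticks */

struct runqueue;                        /* per-cpu queue, details elided */
int  nr_queued_at_prio(struct runqueue *rq, int prio);
void requeue_at_tail(struct runqueue *rq, struct task *p);
void mark_need_resched(void);

/* Timer tick for the running task: once a whole granularity period
 * has been used and a same-priority task is waiting, requeue the
 * task at the tail with whatever timeslice it has left. */
static void granularity_tick(struct task *p, struct runqueue *rq)
{
        p->time_slice--;
        p->slice_ticks_used++;
        if (p->time_slice > 0 &&
            !(p->slice_ticks_used % TIMESLICE_GRANULARITY) &&
            nr_queued_at_prio(rq, effective_prio(p)) > 1) {
                requeue_at_tail(rq, p);         /* keeps remaining slice */
                mark_need_resched();
        }
}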
Originally the patches were meant to be the O(1) interactivity patches, abbreviated to O1int; they were jokingly referred to as the orthogonal patches and the name stuck. I'll concentrate on the latest evolution (O18.1) and bypass discussion of bugs and failed approaches.

The MSA in my patches is much smaller (10 average timeslices, or ~1 second), allowing tasks to change interactive status rapidly and cpu hogs to be identified sooner. Other strategies were put into place to blunt the disadvantages of this smaller MSA.

Instead of the sleep avg being a simple counter going up and down, I use the current status of a task to determine how much sleep avg to add or subtract, in a sort of soft feedback loop. This introduces some autoregulation into the interactivity estimator. It determines the bonus earned from the sleep_avg to date (CURRENT_BONUS) and uses that to weight the sleep_avg credited to a task. As tasks drop in priority they are weighted proportionately more, so they recover rapidly from a low interactive state, and new tasks gain sleep_avg quickly after their initial burn of cpu.

A flag (interactive_credit) was introduced to detect truly interactive tasks so they would not be penalised too much during their bursts of cpu activity. If a task reached the MSA and was still accumulating sleep_avg, it would start earning interactive credits. Once these passed a cutoff (also the MSA) the task was flagged HIGH_CREDIT, which then affected how its sleep_avg burned off. The burn was weighted in the reverse direction to the gaining of sleep time, so such tasks slid down an ever steepening slope the more they behaved like cpu hogs. More subtle but helpful was limiting the sleep_avg a LOW_CREDIT task could obtain with each wakeup to just one priority bonus, preventing rapid swings back towards interactive once a task was a proven cpu hog.

Idle detection was implemented: where a task had slept for a predetermined time, a ceiling was imposed on the sleep avg it received, set to a value ensuring it would be considered just interactive, so it would not expire easily. Kernel threads were excluded, as they are often idle but need cpu time promptly when they wake up.

I/O wait by cpu hogs was difficult to detect directly (there is not enough information in the kernel), but indirectly most local disk i/o is associated with a special kind of sleep (uninterruptible sleep, seen as process state D). Tasks flagged as waiting in this kind of sleep that were not HIGH_CREDIT had their sleep_avg rise limited to just interactive. This prevents cpu hogs from reaching maximum interactive state during disk i/o, and prevents them from being needlessly expired after it.

Forking was handled in my patch by keeping the sleep_avg at a level that would not make forked children or their parents lose any priority bonus beyond what CHILD_PENALTY and PARENT_PENALTY impose. However, the sleep_avg is rounded down to just the amount necessary for that bonus. This means they keep the same priority but can easily drop priority if they then prove to be cpu hogs.
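Drawing these mechanisms together, the wakeup crediting might look something like this. Again a hedged sketch: the flag fields, helper functions and idle threshold below are inventions that follow the description, not the real O18.1 code, and with the patch MAX_SLEEP_AVG would be the smaller ~1 second value rather than the 10 seconds used earlier.

#define MIN(a, b)       ((a) < (b) ? (a) : (b))
#define IDLE_THRESHOLD  (2 * HZ)        /* stand-in for the "predetermined time" */

int  current_bonus(struct task *p);     /* bonus from sleep_avg so far */
long just_interactive_avg(void);        /* sleep_avg of a barely interactive task */
long one_bonus_of_avg(void);            /* sleep_avg worth one priority step */
int  is_kernel_thread(struct task *p);

/* Wakeup crediting: assumes struct task has gained in_uninterruptible,
 * high_credit and low_credit flags. */
static void credit_sleep(struct task *p, long slept)
{
        /* Soft feedback loop: the lower the task's current bonus,
         * the more each tick of sleep is worth, so demoted tasks
         * and fresh starters recover quickly. */
        long credit = slept * (MAX_BONUS - current_bonus(p) + 1);

        if (p->in_uninterruptible && !p->high_credit)
                /* Disk wait may not climb past "just interactive". */
                credit = MIN(credit, just_interactive_avg() - p->sleep_avg);
        else if (p->low_credit)
                /* Proven cpu hog: at most one bonus step per wakeup. */
                credit = MIN(credit, one_bonus_of_avg());
        else if (slept > IDLE_THRESHOLD && !is_kernel_thread(p))
                /* Long-idle task: ceiling at just interactive so a
                 * shell turning hog cannot preempt everything. */
                credit = MIN(credit, just_interactive_avg() - p->sleep_avg);

        if (credit > 0)
                p->sleep_avg = MIN(p->sleep_avg + credit, MAX_SLEEP_AVG);
}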
Priority inversion proved difficult to tackle. While a bug in earlier incarnations of the orthogonal patches intensified the problem, the generic mitigation in my patches now is that tasks that turn into cpu hogs become less interactive at an accelerated rate the lower their interactivity gets. This significantly reduces the time the inversion exists, but does not cure the problem; correcting the bugs in the application's design is needed for that. However, it is only the dependent task that gets starved during priority inversion, as the longest a single nice 0 fully interactive task can hold the cpu over another similarly interactive task with my patches is 27ms. Preventing inversion from occurring at all would require more infrastructure.

Other generic changes in my patches were added for throughput purposes. The requeuing of tasks at TIMESLICE_GRANULARITY was limited to interactive user tasks with at least MIN_TIMESLICE (10ms) remaining. This means batch (non-interactive) tasks are allowed to run out their full timeslice before another task of the same priority gets scheduled, and a task is never requeued with a timeslice so small that it induces cache thrashing. The default timeslice of nice 0 tasks is 102ms, so if they were requeued every 25ms they would at some point be requeued with only 2ms remaining (102 = 4 x 25 + 2); with the 10ms floor, the requeue at 100ms is skipped and the final stretch of 27ms runs uninterrupted, which is where the figure above comes from.

Preemption was originally allowed if a task was the same priority but had more than twice the remaining timeslice of the currently running process. This is unnecessary with TG in action and simply added to the number of context switches, so it was removed.

TG set to 25ms puts us close to the mouse movement of desktop operating systems, provided X gets rescheduled with that as the largest wait period (it usually is less). Most versions of windows update their mouse cursor at 50Hz, and earlier MacOS versions at 200Hz; I'm not sure about recent versions.

From the testing I've done to date, this approach works well in most settings and is an improvement in almost every palpable way over the default scheduler. The remaining issue is that application startup under load will be slower than with the vanilla scheduler (your mileage may vary), but I feel not unacceptably so as a tradeoff.

No further development is planned on these patches (apart from perhaps cosmesis) unless other issues arise. Thanks to the many contributors of ideas and previous patches, and to the numerous testers out there. I hope that even if an alternative approach is used in the final 2.6 kernel, my contributions and explanations have been helpful.

Con Kolivas
25 August 2003