From: Chris Mason

I compared the 2.6 pipetest results with the 2.4 suse kernel, and 2.6
was roughly 40% slower.  During the pipetest run, 2.6 generates
~600,000 context switches per second while 2.4 generates 30 or so.

aio-context-switch (attached) has a few changes that reduce our context
switch rate and bring performance back up to 2.4 levels.  These have
only really been tested against pipetest; they might make other
workloads worse.

The basic theory behind the patch is that it is better for the userland
process to call run_iocbs than it is to schedule away and let the
worker thread do it.

1) on io_submit, use run_iocbs instead of run_iocb
2) on io_getevents, call run_iocbs if no events were available.
3) don't let two procs call run_iocbs for the same context at the same
   time.  They just end up bouncing on spinlocks.

The first three optimizations got me down to 360,000 context switches
per second, and they help build a little structure to allow
optimization #4, which uses queue_delayed_work(HZ/10) instead of
queue_work.  That brings the number of context switches down to 2.4
levels.

On Tue, 2004-02-24 at 13:32, Suparna Bhattacharya wrote:
> On more thought ...
> The aio-splice-runlist patch runs counter-purpose to some of
> your optimizations. I put that one in to avoid starvation when
> multiple ioctx's are in use. But it means that ctx->running
> doesn't ensure that it will process the new request we just put on
> the run-list.

The ctx->running optimization probably isn't critical.  It should be
enough to call run_iocbs from io_submit_one and getevents, which will
help make sure the process does its own retries whenever possible.

Doing the run_iocbs from getevents is what makes the queue_delayed_work
possible, since someone waiting on an event won't have to wait the
extra HZ/10 for the worker thread to schedule in.

Wow, 15% slower with ctx->running removed, but the number of context
switches stays nice and low.  We can play with ctx->running variations
later; here's a patch without them.  It should be easier to apply with
the rest of your code.

Index: linux.lkcd/fs/aio.c
===================================================================
--- linux.lkcd.orig/fs/aio.c	2004-02-23 13:26:52.000000000 -0500
+++ linux.lkcd/fs/aio.c	2004-02-24 08:48:13.619874976 -0500
@@ -837,7 +837,7 @@
 	run = __queue_kicked_iocb(iocb);
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	if (run) {
-		queue_work(aio_wq, &ctx->wq);
+		queue_delayed_work(aio_wq, &ctx->wq, HZ/10);
 		aio_wakeups++;
 	}
 }
@@ -1073,13 +1073,14 @@
 	struct io_event		ent;
 	struct timeout		to;
 	int			event_loop = 0; /* testing only */
+	int			retry = 0;
 
 	/* needed to zero any padding within an entry (there shouldn't be
 	 * any, but C is fun!
 	 */
 	memset(&ent, 0, sizeof(ent));
+retry:
 	ret = 0;
-
 	while (likely(i < nr)) {
 		ret = aio_read_evt(ctx, &ent);
 		if (unlikely(ret <= 0))
@@ -1108,6 +1109,13 @@
 
 	/* End fast path */
 
+	/* racey check, but it gets redone */
+	if (!retry && unlikely(!list_empty(&ctx->run_list))) {
+		retry = 1;
+		aio_run_iocbs(ctx);
+		goto retry;
+	}
+
 	init_timeout(&to);
 	if (timeout) {
 		struct timespec	ts;
@@ -1498,11 +1506,9 @@
 		goto out_put_req;
 
 	spin_lock_irq(&ctx->ctx_lock);
-	ret = aio_run_iocb(req);
+	list_add_tail(&req->ki_run_list, &ctx->run_list);
+	__aio_run_iocbs(ctx);
 	spin_unlock_irq(&ctx->ctx_lock);
-
-	if (-EIOCBRETRY == ret)
-		queue_work(aio_wq, &ctx->wq);
 
 	aio_put_req(req);	/* drop extra ref to req */
 	return 0;
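
For reference, here's a minimal userspace libaio sketch of the
submit/getevents pattern the patch targets.  This is not pipetest
itself; the file name, buffer size, and queue depth are made up for
illustration.  The comments note where the patched kernel paths come
into play.

/*
 * Minimal libaio sketch of a submit/getevents round trip.
 * Build with: gcc demo.c -o demo -laio
 */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	char *buf;
	int fd, ret;

	ret = io_setup(32, &ctx);	/* creates the kioctx */
	if (ret < 0) {
		fprintf(stderr, "io_setup: %d\n", ret);
		return 1;
	}

	fd = open("testfile", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	buf = malloc(4096);
	if (!buf)
		return 1;

	io_prep_pread(&cb, fd, buf, 4096, 0);	/* one 4k read at offset 0 */

	/*
	 * With the patch, io_submit() queues the iocb on ctx->run_list and
	 * calls __aio_run_iocbs() in this process's context instead of
	 * kicking the worker thread for every retry.
	 */
	ret = io_submit(ctx, 1, cbs);
	if (ret != 1) {
		fprintf(stderr, "io_submit returned %d\n", ret);
		return 1;
	}

	/*
	 * With the getevents hunk, a caller that finds no completed events
	 * retries the run_list itself rather than sleeping until the HZ/10
	 * delayed work fires.
	 */
	ret = io_getevents(ctx, 1, 1, &ev, NULL);
	if (ret == 1)
		printf("read returned %ld bytes\n", (long)ev.res);

	io_destroy(ctx);
	return 0;
}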