why io_uring made our tail latency worse
we replaced a tidy little blocking thread-per-connection model with io_uring, expecting throughput to go up and latency to come down. throughput did go up. p99.9 latency doubled. this post is the bisect.
the short version: SQPOLL, a saturated CPU, and some unfortunate cgroup pinning interacted to produce a multi-millisecond stall on submission that nothing in our metrics caught for two weeks. the rest of this post is the long version.