Performance¶
anyserial's perf story is "low overhead on top of the kernel — anything
else is the wire's fault." The benchmarks below measure the parts we
control: the readiness loop, the configuration apply pipeline, and the
allocation profile of the receive path.
See DESIGN §26 and §28 for the full strategy and methodology.
Targets vs. observed¶
Targets from DESIGN §26.1:
| Metric | Target | Observed (asyncio + uvloop) | Status |
|---|---|---|---|
| pty single-byte receive p50 | < 200 µs | 99 µs | ✅ |
| pty single-byte send p50 | < 200 µs | 101 µs | ✅ |
| 64 KiB write throughput (pty) | ≥ 90% line rate | ≈110 MB/s effective | ✅¹ |
| Syscall rate per receive_available() | 1 os.read / call | 1 os.read / call (enforced) | ✅² |
| Allocation per receive_into() loop | ~zero payload alloc | < 16 KiB net for 200 calls | ✅ |
| Allocation per receive() loop | (reasonable headroom) | < 256 KiB net for 200 calls | ✅ |
| Allocation per receive_available() loop | (reasonable headroom) | < 64 KiB net for 200 calls | ✅ |
| Cancellation latency | < 1 ms | (covered by integration tests) | ✅ |
| Regression threshold | 10% from baseline | (advisory in CI today) | 🟡³ |
¹ pty has no real link rate to throttle against; "throughput" here is the per-call cost of the write-then-drain loop. Hardware adapter numbers will land once a self-hosted runner is wired up.
² Enforced by tests/integration/test_receive_syscall_budget.py, which counts read_nonblocking invocations during a drain and fails if receive_available triggers more than one. A sibling sanity test confirms receive(1) still costs N syscalls for an N-byte burst — the whole reason receive_available exists.
³ The nightly bench job records a JSON baseline per run and surfaces the delta in the GitHub Actions job summary. The hard 10% gate flips on once we've characterized the GHA noise floor with ~10 baselines.
First reference numbers¶
Recorded on a developer laptop (Intel Core Ultra 7 155H, 22 logical
cores, Linux 6.19, Python 3.13.13). Median of 200 rounds × 5 iterations
each via pytest-benchmark.pedantic. Numbers in microseconds (lower is
better):
Single-byte latency (115 200 baud, pty)¶
| Backend | Receive p50 | Receive max | Send p50 | Send max |
|---|---|---|---|---|
| asyncio + uvloop | 99 | 509 | 101 | 399 |
| asyncio (default) | 126 | 961 | 133 | 509 |
| trio | 124 | 781 | 135 | 415 |
uvloop wins on median by 20–30%; asyncio's tail latency is more variable because the default selector loop polls less aggressively.
Bulk send throughput (pty, microseconds per call)¶
| Payload | asyncio + uvloop | asyncio | trio |
|---|---|---|---|
| 256 B | 157 | 179 | 187 |
| 4 KiB | 156 | 195 | 203 |
| 64 KiB | 595 | 631 | 664 |
Per-call overhead at 256 B and 4 KiB is essentially identical — the cost
is paying one wait_writable + one os.write regardless of payload.
At 64 KiB the kernel pty's 4 KiB buffer drives 16 partial writes, and
wall-time scales accordingly.
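For orientation, here is a minimal sketch of the write-then-drain pattern being measured. It assumes anyio 4.7+'s `wait_writable` accepts a raw file descriptor; it is not anyserial's actual send path.

```python
import os

import anyio


async def send_all(fd: int, payload: bytes) -> None:
    # Wait for writability, write what the kernel accepts, repeat.
    # Against the pty's ~4 KiB buffer a 64 KiB payload degenerates
    # into ~16 partial writes, which is the scaling seen above.
    view = memoryview(payload)
    while view:
        await anyio.wait_writable(fd)  # assumes anyio >= 4.7
        written = os.write(fd, view)
        view = view[written:]
```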
Many-port fan-out (one round-trip per port, pty, microseconds)¶
| N ports | asyncio + uvloop | asyncio | trio |
|---|---|---|---|
| 8 | 307 | 412 | 358 |
| 32 | 828 | 1026 | 939 |
Sub-linear scale: 4× the ports take ≈2.7× the time, suggesting most of the per-port cost overlaps inside the event loop's readiness wait.
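The shape of that fan-out, as a sketch built on an anyio task group; the per-port `send`/`receive` calls are assumed to follow anyio's byte-stream conventions rather than a confirmed anyserial signature.

```python
import anyio


async def poll_all(ports) -> list[bytes]:
    # One round-trip per port, all overlapped inside a single task
    # group; the event loop multiplexes the readiness waits, which is
    # why wall time grows sub-linearly with port count.
    replies: list[bytes] = [b""] * len(ports)

    async def roundtrip(i: int, port) -> None:
        await port.send(b"\x05")            # one-byte request
        replies[i] = await port.receive(1)  # one-byte reply

    async with anyio.create_task_group() as tg:
        for i, port in enumerate(ports):
            tg.start_soon(roundtrip, i, port)
    return replies
```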
receive_available drain (single call + one-syscall drain)¶
| Queue depth | asyncio + uvloop | asyncio | trio |
|---|---|---|---|
| 64 B | 106 | 134 | 142 |
| 1 KiB | 111 | 132 | 149 |
| 4 KiB | 144 | 329 | 170 |
Per-call cost is flat from 64 B to 1 KiB because the single os.read
that follows the readiness wake-up handles all queued bytes at once —
that's the DESIGN §26.1 syscall-budget target in action. At 4 KiB the
asyncio selector loop starts paying an extra round-trip through the
kernel queue (its max latency, not shown in this table, sits well above
uvloop's), which is why the uvloop / trio numbers stay tight while
asyncio's median jumps.
Compare against receive(1) called 64 times for a 64 B burst: that
path costs ≥64 read_nonblocking syscalls by design (enforced by a
sibling integration test), so with ≈2 µs of effective per-byte latency
the burst costs ≈2 µs × 64 ≈ 128 µs, versus receive_available's 106 µs
for the whole burst — a ~1.2× factor at this depth that grows with
queue size.
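To make the two paths concrete, a hedged sketch of both, with `port` assumed to expose the `receive()` / `receive_available()` calls discussed above:

```python
async def drain_vs_byte_at_a_time(port) -> None:
    # Drain path: one readiness wake-up, one os.read, however many
    # bytes happen to be queued.
    burst = await port.receive_available()

    # Byte-at-a-time path: an N-byte burst costs N wake-ups and N
    # reads. Fine for framing logic, expensive as a drain.
    frame = bytearray()
    while not frame.endswith(b"\r"):
        frame += await port.receive(1)
```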
Hardware case study: Alicat MFC¶
The pty numbers above measure userland overhead. The numbers below
measure the same library against a real USB-serial device end-to-end,
and compare it head-to-head with pyserial and pyserial-asyncio on
that hardware. Script at
benchmarks/hardware/alicat_benchmark.py;
test rig and methodology in
benchmarks/hardware/README.md.
Rig: Alicat MCR-200SLPM-D (firmware 8v17.0-R23) on a Prolific PL2303 USB-serial adapter, 115200-8N1, Linux 6.19, Python 3.13.
Single-device poll → frame round-trip (500 iterations)¶
| Library / path | p50 | p90 | p99 |
|---|---|---|---|
| pyserial (sync) | 5.61 ms | 5.74 ms | 12.74 ms |
| anyserial async, no portal | 5.52 ms | 5.74 ms | 12.93 ms |
| anyserial + BufferedByteReceiveStream | 5.52 ms | 5.70 ms | 12.74 ms |
| anyserial async (portal-wrapped) | 5.99 ms | 6.40 ms | 11.05 ms |
| anyserial sync wrapper | 5.90 ms | 6.37 ms | 12.96 ms |
| pyserial-asyncio | 5.96 ms | 6.30 ms | 12.20 ms |
Two things to read here:
- Pure anyserial async ties pyserial at p50 within ~100 µs on real USB. Earlier drafts reported anyserial being ~500 µs slower on this workload; that gap was the `portal.call` thread hop in the benchmark harness, not the library. Time the work inside the coroutine and the gap disappears.
- `BufferedByteReceiveStream` is free. Hand-rolled `receive(128)` with CR detection and the buffered wrapper are indistinguishable on this workload. Use the buffered wrapper — it's idiomatic and reads like protocol code instead of I/O plumbing.
The ~5.5 ms p50 floor and ~11–13 ms p99 ceiling on all rows are the Prolific adapter's USB IRP turnaround plus device firmware processing; no amount of library work moves either number.
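The buffered-wrapper row corresponds to a poll roughly like the sketch below. The Alicat poll command and CR framing come from the benchmark script; the `port` object and its `send()` are assumed to behave like an anyio byte stream.

```python
from anyio.streams.buffered import BufferedByteReceiveStream


async def poll_once(port, unit: str = "A") -> str:
    # Wrap the receive side (reuse the wrapper across polls in a real
    # loop so buffered leftovers aren't dropped).
    buffered = BufferedByteReceiveStream(port)
    await port.send(f"{unit}\r".encode("ascii"))
    frame = await buffered.receive_until(b"\r", max_bytes=256)
    return frame.decode("ascii")
```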
Cancellation overshoot (10 ms deadline, 200 iterations)¶
The async paths meet the DESIGN §26.1 < 1 ms p99 cancellation target on real hardware; the sync wrapper does not (explained below):
| Library / path | p50 | p99 |
|---|---|---|
| anyserial asyncio | 247 µs | 742 µs |
| anyserial asyncio + uvloop | 405 µs | 777 µs |
| anyserial trio | 410 µs | 767 µs |
| pyserial-asyncio | 266 µs | 590 µs |
| pyserial (sync timeout=, not cancel) | 162 µs | 449 µs |
| anyserial sync wrapper | 994 µs | 2.60 ms |
The sync wrapper is ~3–4× slower because every cancellation pays one
portal hop. The pyserial `timeout=` row is included for reference
but measures a blocking read with a deadline, not true task
cancellation — the two aren't directly comparable.
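What the overshoot measurement captures, as a sketch: start a receive under a deadline and time how far past the deadline the task actually unwinds. `port.receive(1)` is assumed to be an anyio-style receive call.

```python
import anyio


async def cancellation_overshoot_us(port, deadline_s: float = 0.010) -> float:
    # Let the deadline fire while a receive is pending, then report how
    # long past the deadline the cancellation actually took to unwind.
    start = anyio.current_time()
    with anyio.move_on_after(deadline_s):
        await port.receive(1)  # no data arrives, so the deadline wins
    overshoot = anyio.current_time() - start - deadline_s
    return overshoot * 1e6  # microseconds
```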
Fan-out scaling (50 polls × N pty peers, seconds)¶
Where the library's architecture pays off:
| N devices | anyserial async | pyserial threaded | speedup |
|---|---|---|---|
| 1 | 0.010 s | 0.025 s | 2.5× |
| 4 | 0.020 s | 0.127 s | 6.4× |
| 16 | 0.084 s | 0.520 s | 6.2× |
One event loop handles 16 concurrent ports in 84 ms; the thread-per-port approach with pyserial takes 520 ms for the same work. Per-port cost: anyserial stays flat at ~5 ms/port from N=4 onward, while pyserial climbs to ~32 ms/port as GIL contention and thread dispatch scale with device count.
pty peers have no baud-rate throttling, so absolute numbers are library-overhead-only. The scaling law is what matters here: on real USB with 16 devices the per-port floor would be higher (hardware), but anyserial would remain roughly flat while pyserial would grow.
Takeaways¶
- Single-device request/response: parity with pyserial. Pick whichever fits your codebase.
- Many-device fan-out or structured cancellation: pick anyserial. That's what the library exists for.
- Know the portal cost. ~470 µs per `portal.call` hop on this machine — visible on tight request/response loops, irrelevant for I/O-bound multi-device workloads. See Sync wrapper.
- Use `BufferedByteReceiveStream` for line-framed protocols. No cost, better code.
Windows (Serial Pair)¶
Windows numbers are published nightly from the bench-windows job when
the repository variable ANYSERIAL_RUN_SELF_HOSTED_WINDOWS=true is set,
which enables the anyserial-windows-serial self-hosted Windows runner
with a pre-provisioned virtual or hardware COM-port pair. The pair
defaults to COM50,COM51 and can be overridden with the repository
variable ANYSERIAL_WINDOWS_PAIR. Target matrix from
design-windows-backend.md §11:
| Scenario | Target | Backend matrix |
|---|---|---|
| Single-port round-trip, 1 B request/reply | p99 ≤ 3× Linux p99 on same hardware | asyncio (Proactor) / trio |
| Throughput at 921600 baud, 4 KiB chunks | ≥ 90% of pyserial-asyncio POSIX | asyncio (Proactor) / trio |
| N-port fanout (8 / 32, optionally 128) | No thread growth; linear CPU scale | asyncio (Proactor) / trio |
| Open / close cycle | < 50 ms per cycle | asyncio (Proactor) / trio |
Measured numbers land in the CI job summary per run and in the
bench-results-windows-py3.13-N artifact (retention 90 days). The
uvloop column is absent — uvloop does not build on Windows.
A reference table will populate here once the nightly job has accumulated enough baselines to publish stable medians; the fundamental constraint is Windows host and driver noise, which is higher than the Linux pty runner's noise floor.
Windows-specific caveats¶
- Virtual COM != real USB-serial. Virtual-driver IRP turnaround adds latency even on an otherwise-idle system. Real FTDI / CP210x adapters add more, and an FTDI adapter running its default 16 ms latency timer adds a lot more — see Windows.
- No uvloop. The Windows matrix is asyncio (Proactor) and trio only. Per-backend comparisons against the Linux uvloop numbers are apples-to-oranges.
- Proactor only. The numbers don't include `SelectorEventLoop` — it's an explicitly unsupported configuration (see Windows / Supported runtimes).
Methodology¶
Async tests are tricky to micro-benchmark — anyio.run() startup is
~50 ms, which would swamp any sub-millisecond workload if invoked per
iteration. Instead each benchmark holds one persistent event loop via
anyio.from_thread.start_blocking_portal,
and each timed iteration is a single portal.call(coro_fn, *args) —
round-trip overhead in the tens of µs.
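In pytest-benchmark terms the harness pattern looks roughly like the sketch below; `roundtrip` stands in for whichever coroutine a given benchmark times.

```python
from anyio.from_thread import start_blocking_portal


def test_single_byte_roundtrip(benchmark):
    # One persistent event loop for the whole benchmark; each timed
    # iteration is a single portal.call() into it.
    with start_blocking_portal(backend="asyncio") as portal:

        async def roundtrip() -> None:
            ...  # open the pty pair, send one byte, await the echo

        benchmark.pedantic(portal.call, args=(roundtrip,), rounds=200, iterations=5)
```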
The portal is parametrized across the same backend matrix the rest of
the test suite uses (asyncio default, asyncio + uvloop, trio). Each pty
pair is opened in raw mode (cfmakeraw on the follower fd, O_NONBLOCK
on the controller) so the kernel doesn't translate \n → \r\n or
buffer waiting for a newline.
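The pty setup itself is a handful of stdlib calls; a sketch (assumes Python 3.12+ for `tty.cfmakeraw`):

```python
import os
import pty
import termios
import tty

controller_fd, follower_fd = pty.openpty()

# Raw mode on the follower so the line discipline doesn't rewrite
# \n as \r\n or hold bytes until it sees a newline.
mode = termios.tcgetattr(follower_fd)
tty.cfmakeraw(mode)
termios.tcsetattr(follower_fd, termios.TCSANOW, mode)

# Non-blocking controller side, matching the O_NONBLOCK note above.
os.set_blocking(controller_fd, False)
```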
See benchmarks/conftest.py
and benchmarks/README.md
for the full setup.
Reproducing locally¶
From a checkout of the repository:
uv sync --all-extras --group bench --group test
mkdir -p benchmarks/results
uv run pytest benchmarks/ --benchmark-only \
--benchmark-json=benchmarks/results/$(git rev-parse --short HEAD).json
To compare two runs:
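One way to diff two result files, assuming pytest-benchmark's compare subcommand; the `<sha>` placeholders are hypothetical file names for whichever commits you recorded:

```bash
uv run pytest-benchmark compare \
    benchmarks/results/<old-sha>.json \
    benchmarks/results/<new-sha>.json
```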
Caveats¶
- Pty != real serial. The kernel pty has no baud-rate throttling, so these numbers measure userland overhead, not link saturation.
- GHA shared runners are noisy. Numbers from the nightly job will vary ±20–30% between runs. Trust the median across multiple runs.
- uvloop sometimes regresses. It's optimized for sockets; serial fds use the same `wait_readable` / `wait_writable` plumbing, but the win isn't as large as it is for HTTP servers. Track both.
- Hardware numbers will be different. USB-serial adapters add a per-packet round-trip latency (FTDI's default `latency_timer` is 16 ms — see DESIGN §18 for how `low_latency=True` drops it to 1 ms).
- `low_latency=True` is Linux-only. macOS, BSD, and Windows have no equivalent kernel knob; the capability reads `UNSUPPORTED` and the request is routed through `UnsupportedPolicy` (see macOS / BSD / Windows). The headline Linux numbers above were recorded with the low-latency knob engaged; the Windows serial-pair section uses driver defaults.