Why Rust nextest is process-per-test

114 points by jicea 6 months ago

I prefer per process over the alternatives.

When you write code you have the choice to do per process, per thread, or sequential.

The problem is that doing multiple tests in a shared space doesn't necessarily match the world in which this code is run.

Per process testing allows you to design a test that matches the usage of your codebase. Per thread already constrains you.

For example: we might elect to write a job as a process that runs on demand, and the library we use has a memory leak, but it can't be fixed in reasonable time. Since we write it as a process that gets restarted we manage to constrain the behavior.

Doing multiple tests in multiple threads might not work here as there is a shared space that is retained and isn't representative of real world usage.

Concurrency is a feature of your software that you need to code for. So if you have multiple things happening, then that should be part of your test harness.

The test harness forcing you to think of it isn't always a desirable trait.

That said, I have worked on a codebase where we discovered bugs because the tests were run in parallel, in a shared space.

sunshowers - 6 months ago

There is definitely some value in shaking out bugs by running code in parallel in the same process space — someone on Lobsters brought this up too. I've been wondering if there's some kind of optional feature that can be enabled here.
- NewJazz - 6 months ago
  
  Can you do it deliberately in the test?
  - sunshowers - 6 months ago
    
    Yes, you can write your own meta-test that runs all of the actual tests in separate threads. But it's a bit inconvenient and you won't necessarily get separate reporting.
rad_gruchalski - 6 months ago

> The problem is that doing multiple tests in a shared space doesn't necessarily match the world in which this code is run.
For me that's a positive bonus. If it runs multiple times in parallel and works, it will work as a single instance deployed in a pod somewhere.
- OptionOfT - 6 months ago
  
  I get that, but if it doesn't work, do you spend the time on a use-case that doesn't exist?
  - sunshowers - 6 months ago
    
    Totally depends on if it's a use case you care about. Things like libraries vs leaf-node applications also play a factor.

o11c - 6 months ago

A much better model still is a mixture.

* Use multiple processes, but multiple tests per process as well.

* Randomly split and order the tests on every run, to encourage catching flakiness. Print the seed for this as part of the test results for reproducibility.

* Tag your tests a lot (this is one place where, as many languages provide, "test classes" or other grouping is very useful). Smoke tests should run before all other tests, and all run in one process (though still in random order). Known long-running tests should be tagged to use a dedicated process and mostly start early (longest first), except that a few cores should be reserved to work through the fast tests so they can fail early.

* If you need to kill a timed-out test even though other tests are still running in the same process - just kill the process anyway, and automatically run the other tests again.

* Have the harness provide fixtures like "this is a temporary directory, you don't have to worry about clearing it on failure", so tests don't have to worry about cleaning up if killed. Actually, why not just randomly kill a few tests regardless?

I wrote some more about tests here [1], but I'm not sure I'll update it any more because of Github's shitty 2FA-but-only-the-inconvenience-not-the-security requirement.

[1]: https://gist.github.com/o11c/ef8f0886d5967dfebc3d

cortesi - 6 months ago

Nextest is one of the very small handful of tools I use dozens or hundreds of times a day. Parallelism can reduce test suite execution time significantly, depending on your project, and has saved me cumulative days of my life. The output is nicer, test filtering is nicer, leak detection is great, and the developer is friendly and responsive. Thanks sunshowers!

The one thing we've had to be aware of is that the execution model means there can sometimes be differences in behaviour between nextest and cargo test. Very occasionally there are tests that fail in cargo test but succeed in nextest due to better isolation. In practice this just means that we run cargo test in CI.

sunshowers - 6 months ago

Thank you for the kind words!
The behavior differences mean some projects (like wgpu, and nextest itself) only support nextest these days. There's also support for setup scripts which can be used to pre-seed databases and stuff.

marky1991 - 6 months ago

I don't understand why he jumps straight from 'one test per process' to 'one test per thread' as the alternative.

I'm not actually clear what he means by 'test' to be honest, but I assume he means 'a single test function that can either pass or fail'

Eg in python (nose)

class TestSomething: def test_A(): ... def test_B(): ...

I'm assuming he means test_A. But why not run all of TestSomething in a process?

Honestly, I think the idea of having tests have shared state is bad to begin with (for things that truly matter, eg if the outcome of your test depends on the state of sys.modules, something else is horribly wrong), so I would never make this tradeoff to benefit a scenario that I never think should be done.

Even if we were being absolute purists, this still hasn't solved the problem, the second your process communicates with any other process (or server). And that problem seems largely unsolveable, short of mocking.

Basically, I'm not convinced this is a good tradeoff, because the idea of creating thousands and thousands of processes to run a test suite, even on linux, sounds like a bad idea. (And at work, would definitely be a bad idea, for performance reasons)

saghm - 6 months ago

I think most of the context that might explain your confusion is the way that tests work out of the box in Rust. The default test harness when invoking `cargo test` runs one test per thread (and by default parallelizes based on the number of cores available, although this is configurable with a flag). In Rust, there isn't any equivalent to the `TestSomething` class you give; each test is always a top-level function. Since `cargo nextest` is a mostly drop-in replacement for `cargo test`, I imagine the author is using one test per thread as an alternative because that's the paradigm that users will be switching from if they start using cargo nextest.
While enforcing no shared state in tests might be useful, that wouldn't be feasible in Rust without adding quite a lot of constraints that would be tough if not impossible to enforce in a drop-in replacement for cargo test. There's certainly room for alternatives in the testing ecosystem in Rust that don't try to maintain compatibility with the built-in test harness, but I don't think the intention of cargo nextest is to try to do that.
One other point that might not be obvious is that right now, there's no stable way to hook into Rust's libtest. The only options to provide an alternative testing harness in Rust are to either only support nightly rather than stable, break compatibility with tests written for the built-in test harness, or provide a separate harness that still supports existing tests. I'm sure there are arguments to be made for each of the other alternatives, but personally, I don't think there's any super realistic chance for adoption of anything that picks the first two options, so the approach cargo nextest is taking the most viable approach available (at least until libtest stabilizes, but it's not obvious when that will happen).
cbarrick - 6 months ago

> I'm not actually clear what he means by 'test' to be honest, but I assume he means 'a single test function that can either pass or fail'
I assume so as well.
Unit testing in Rust is based around functions annotated with #[test], so it's safe to assume that when the author says "test" they are referring to one such function.
It's up to the user to decide what they do in each function. For example, you could do a Go-style table-driven test, but the entire function would be a single "test", _not_ one "test" per table entry.
sunshowers - 6 months ago

FYI I use they/she pronouns (thanks jclulow!)
> Honestly, I think the idea of having tests have shared state is bad to begin with (for things that truly matter, eg if the outcome of your test depends on the state of sys.modules, something else is horribly wrong), so I would never make this tradeoff to benefit a scenario that I never think should be done.
I don't disagree as a matter of principle, but the reality really is different. Some of the first nextest users outside of myself and my workplace were graphical libraries and engines.
> Basically, I'm not convinced this is a good tradeoff, because the idea of creating thousands and thousands of processes to run a test suite, even on linux, sounds like a bad idea. (And at work, would definitely be a bad idea, for performance reasons)
With Rust or other native languages it really is quite fast. With Python I agree not as much, so this tradeoff wouldn't make as much sense there yes.
But note that things like test cancellation are a little easier to do in an interpreted model.
- vlovich123 - 6 months ago
  
  Have you considered having pre warmed zygote processes hot that you then fork for each test rather than launching the process from scratch? That might mitigate the performance issue even more since less has to be initialized.
  - sunshowers - 6 months ago
    
    Zygote processes for the test binary itself, you mean? That will probably require some coordination from the test because fork is not compatible with most Rust code. Nextest is designed to work with arbitrary Rust test binaries, though it imposes some conditions on them.
jclulow - 6 months ago

NB: the author's pronouns are they/she, not he.
hinkley - 6 months ago

> Honestly, I think the idea of having tests have shared state is bad to begin with
I blame this partially on our notions of code reuse. We conflate it with several other things, and in the case of tests we conflate it with state reuse.
And the availability of state reuse leads to people writing fakes when they should be using mocks, and people not being able to tell the difference between mocks and fakes and thus being incapable to have a rational discussion about them.
To my thinking, and the thinking of pretty much all of the test experts I’ve called Mentor (or even Brother), beforeEaches should be repeatable. Add a test, it repeats one more time. Delete a test, one less. And if they’re repeatable, they don’t have to share the same heap. One heap is as good as another.
Lots of languages can only do that segregation at the process level. In NodeJS it would be isolates (workers). If you’re very careful about global state you could do it per thread. But that doesn’t happen very often because “you” includes library writers, language designers, and your coworker Steve who is addicted to in-memory caching. I can say, “don’t be Steve” until I’m blue in the face but nearly every team hires at least one Steve, and some are rotten with them.

Ericson2314 - 6 months ago

This is good for an entirely different reason, which is running cross-compiled tests in an emulator.

That is especially good for bare metal. If you don't have global allocator, have limitted ram, etc., you might not be able to write the test harness as part of the guest program at all! so you want want to move as much logic to the host program as possible, and then run as little as a few instructions (!) in the guess program.

See https://github.com/gz/rust-x86 for an example of doing some of this.

sunshowers - 6 months ago

This is why Miri also works well with nextest, yeah!

pjc50 - 6 months ago

This will be horrendously slow on Windows.

sunshowers - 6 months ago

It's certainly slower than on Linux, but horrendous is a little bit of an overstatement. For many real-world test suites, process creation is a small part of overall time, and nextest still ends up being overall faster due to better handling of long-pole tests. The issues around thread termination and stuff are quite relevant on Windows. (And maybe posts like mine are going to get Microsoft and AV vendors to pay more attention to process creation performance! It really doesn't have to be slow, it just is in practice.)
Measured by weekly downloads (around 120k a week total last I checked), Windows is actually the number two platform nextest is used on, ahead of macOS. It's mostly CI, but clearly a lot of people are getting value out of nextest on Windows.

bfrog - 6 months ago

There’s a similar test library for c that does this and it’s great. I love the concept, it works well most of the time.

sedatk - 6 months ago

> Memory corruption in one test doesn’t cause others to behave erratically. One test segfaulting does not take down a bunch of other tests.

Is "memory corruption" an issue with Rust? Also, if one test segfaults, isn't it a reason to halt the run because something got seriously broken?

codetrotter - 6 months ago

> Is "memory corruption" an issue with Rust?
You can cause memory corruption if you opt out of memory safety guarantees by using Unsafe Rust.
https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html
Sometimes unsafe is necessary and the idea then is that the “dangerous” parts of the code remain isolated in explicitly marked “unsafe” blocks, where it can be closely reviewed.
Also, even if your own Rust code is doing nothing unsafe you might be using external libraries written in other languages and things might go wrong.
> if one test segfaults, isn't it a reason to halt the run because something got seriously broken?
Sometimes it’s still interesting and helpful to continue running other tests even if one fail. If several of them fail it might even help you pinpoint exactly what’s going wrong than just a single failure might. (Although having a bunch of failing tests can also be more noise.)
sunshowers - 6 months ago

In local runs it makes sense to halt the run, but in CI not as much — nextest makes this entirely configurable.

amelius - 6 months ago

According to some of these reasons every library should run in its own process too.

sunshowers - 6 months ago

I think the difference is that misbehavior isn't as expected of libraries as it is of tests. But yes if your libraries are prone to misbehavior, it's common to compartmentalize them in separate processes.
sujayakar - 6 months ago

that's roughly what the wasm component model is aiming for!
https://hacks.mozilla.org/2019/11/announcing-the-bytecode-al...
CJefferson - 6 months ago

Honestly, I often feel that’s the world we should be heading to. Why should my zlib have any access to my programs memory, other than an input and output buffer?

zbentley - 6 months ago

This article is a good primer on why process isolation is more robust/separated than threads/coroutines in general, though ironically I don't think it fully justifies why process isolation is better for tests as a specific usecase benefitting from that isolation.

For tests specifically, some considerations I found to be missing:

- Given speed requirements for tests, and representativeness requirements, it's often beneficial to refrain from too much isolation so that multiple tests can use/excercise paths that use pre-primed in memory state (caches, open sockets, etc.). It's odd that the article calls out that global-ish state mutation as a specific benefit of process isolation, given that it's often substantially faster and more representative of real production environments to run tests in the presence of already-primed global state. Other commenters have pointed this out.

- I wish the article were clearer about threads as an alternative isolation mechanism for sequential tests versus threads as a means of parallelizing tests. If tests really do need to be run in parallel, processes are indeed the way to go in many cases, since thread-parallel tests often test a more stringent requirement than production would. Consider, for example, a global connection pool which is primed sequentially on webserver start, before the webserver begins (maybe parallel) request servicing. That setup code doesn't need to be thread-safe, so using threads to test it in parallel may surface concurrency issues that are not realistic.

- On the other hand, enough benefits are lost when running clean-slate test-per-process that it's sometimes more appropriate to have the test harness orchestrate a series of parallel executors and schedule multiple tests to each one. Many testing frameworks support this on other platforms; I'm not as sure about Rust--my testing needs tend to be very simple (and, shamefully, my coverage of fragile code lower than it should be), so take this with a grain of salt.

- Many testing scenarios want to abort testing on the first failure, in which case processes vs. threads is largely moot. If you run your tests with a thread or otherwise-backgrounded routine that can observe a timeout, it doesn't matter whether your test harness can reliably kill the test and keep going; aborting the entire test harness (including all processes/threads involved) is sufficient in those cases.

- Debugging tools are often friendlier to in-process test code. It's usually possible to get debuggers to understand process-based test harnesses, but this isn't usually set up by default. If you want to breakpoint/debug during testing, running your tests in-process and on the main thread (with a background thread aborting the harness or auto-starting a debugger on timeout) is generally the most debugger-friendly practice. This is true on most platforms, not just Rust.

- fork() is a middle ground here as well, which can be slow, though mitigations exist, but can also speed things up considerably by sharing e.g. primed in-memory caches and socket state to tests when they run. Given fork()'s sharp edges re: filehandle sharing, this, too, works best with sequential rather than parallel test execution. Depending on the libraries in use in code-under-test, though, this is often more trouble than it's worth. Dealing with a mixture of fork-aware and fork-unaware code is miserable; better to do as the article suggests if you find yourself in that situation. How to set up library/reusable code to hit the right balance between fork-awareness/fork-safety and environment-agnosticism is a big and complicated question with no easy answers (and also excludes the easy rejoinder of "fork is obsolete/bad/harmful; don't bother supporting it and don't use it, just read Baumann et. al!").

- In many ways, this article makes a good case for something it doesn't explicitly mention: a means of annotating/interrogating in-memory global state, like caches/lazy_static/connections, used by code under test. With such an annotation, it's relatively easy to let invocations of the test harness choose how they want to work: reuse a process for testing and re-set global state before each test, have the harness itself (rather than tests by side-effect) set up the global state, run each test with and/or without pre-primed global state and see if behavior differs, etc. Annotating such global state interactions isn't trivial, though, if third-party code is in the mix. A robust combination of annotations in first-party code and a clear place to manually observe/prime/reset-if-possible state that isn't annotated is a good harness feature to strive for. Even if you don't get 100% of the way there, incremental progress in this direction yields considerable rewards.

sunshowers - 6 months ago

But that's exactly it, right — everything you've listed are valid points in the design space, but they require a lot of coordination between various actors in the system. Process per test solves a whole swath of coordination issues.
The post lists out what it would take to make most of nextest's feature set available in a shared-process model. There has been some interest in this, but it is a lot of work for things that come for free.

grayhatter - 6 months ago

Restating the exact same thing 4 different times in the first few paragraphs is an LLM feature right?

diggan - 6 months ago

Sounds more like someone is writing "Inverted pyramid" style, where you repeat information somewhat but go deeper and deeper every time.
sunshowers - 6 months ago

I had Claude do a mild review but this was entirely written by me. The sibling is right: the goal is to present information in increasing levels of detail. Repetition is a powerful tool to make ideas stick.
I do try and present a decent level of detail in the post.