Iterating on Testing in Rust

With the release of rust 1.70, there was some surprise and frustration that unstable test features now require nightly, like all other unstable features in Rust. One of the features most affected is --format json which has been in limbo for 5 years. This drew attention to a feeling I've had: the testing story in Rust has been stagnating. I've been gathering my thoughts on this for the last 3 months and recently had some downtime between tasks so I've started to look further into this.

The tl;dr is to think of this as finding right abstractions to stabilize parts of cargo_test_support and cargo nextest.

Testing today

Running cargo test will build and run all test binaries

$ cargo test
   Compiling cargo v0.72.0
    Finished test [unoptimized + debuginfo] target(s) in 0.62s
     Running  /home/epage/src/cargo/tests/testsuites/main.rs

  running 1 test
test some_case ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Test binaries are created from a package's [lib] and for each .rs file in tests/. test.

awesomeness-rs/
   Cargo.toml
   src/          # whitebox/unit tests go here
    lib.rs
    submodule.rs
    submodule/
      tests.rs
   tests/        # blackbox/integration tests go here
     is_awesome.rs

A test is as simple as adding the #[test] attribute and doing something that panics:

#[test]
fn some_case() {
    assert_eq!(1, 2);
}

And that is basically it. There are a couple more details (doc tests, other attributes) but testing is relatively simple in Rust.

Strengths

Before anything else, we should recognize the strengths of the existing story around testing; what we should make sure we protect. Scouring forums, some points I saw called out include:

Low friction for getting tests up and running
Being the standard way to test makes it easy to jump between projects
High value-to-ceremony ratio
Exclusively running tests in parallel puts pressure on tests being scalable

Problems

For some background, when you run cargo test, the logic is split between two key pieces:

cargo test command which enumerates, builds, and runs test binaries pretty much only caring about their exit code
libtest which is linked into each test binary and parses flags, enumerates tests, runs them, and prints out a report.

Conditional ignores

libtest is static. If you #[ignore] a test, that is it. You can make a test conditioned on a platform or the presence of feature flags, like #[cfg_attr(windows, ignore)]. However, you can't ignore tests based on runtime conditions.

In cargo, we have tests that require installed software. The naive approach is to return early:

#[test]
fn simple_hg() {
    if !has_command("hg") {
        return;
    }

    // ...
}

But that gives developers a misleading view of their test coverage:

$ cargo test
   Compiling cargo v0.72.0
    Finished test [unoptimized + debuginfo] target(s) in 0.62s
     Running  /home/epage/src/cargo/tests/testsuites/main.rs

  running 1 test
test new::simple_hg ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

In cargo, we've worked around this by providing a custom test macro that at compile time checks if hg is installed and adds an #[ignore] attribute:

#[cargo_test(requires_hr)]
fn simple_hg() {
    // ...
}

$ cargo test
   Compiling cargo v0.72.0
    Finished test [unoptimized + debuginfo] target(s) in 16.49s
     Running  /home/epage/src/cargo/tests/testsuites/main.rs

running 1 tests
test init::simple_hg::case ... ignored, hg not installed

test result: ok. 0 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out; finished in 0.10s

Having to wrap #[test] isn't ideal and requires you to bake in every runtime condition into your macro. This also then doesn't compose with other solutions.

Cargo is also unlikely to be able to recognize that it needs to recompile tests when the conditions change.

Test generation

Data driven tests are an easy way to cover a lot of cases (granted, property testing is even better). The most trivial way of doing this is just looping over your cases, like this code from toml_edit:

#[test]
fn integers() {
    let cases = [
        ("+99", 99),
        ("42", 42),
        ("0", 0),
        ("-17", -17),
        ("1_2_3_4_5", 1_2_3_4_5),
        ("0xF", 15),
        ("0o0_755", 493),
        ("0b1_0_1", 5),
        (&std::i64::MIN.to_string()[..], std::i64::MIN),
        (&std::i64::MAX.to_string()[..], std::i64::MAX),
    ];
    for &(input, expected) in &cases {
        let parsed = integer.parse(new_input(input));
        assert_eq!(parsed, Ok(expected));
    }
}

However,

You don't know which input was being processed on failure (without extra steps)
- Any debug output from prior iterations will flood the display when analyzing a failure
Its "fail-fast": a broken case prevents other cases from running, requiring careful ordering to ensure the more general case is first
You don't get the bigger picture of whats working and or not by seeing all of the failures at once
You can't select a specific case to run / debug

Some projects will create bespoke macros are created so you get a #[test] per data point. When this happens frequently enough across projects, people will write their own libraries to automate this, including:

Or libtest-mimic for the choose-your-own adventure route.

For example, with trybuild, you create a dedicated test binary with a test function and everything is then delegated to trybuild:

#[test]
fn ui() {
    let t = trybuild::TestCases::new();
    t.compile_fail("tests/ui/*.rs");
}

And you get this output: trybuild output

Alternatively, some testing libraries replace libtest as the test harness

# Cargo.toml ...

[[bench]]
name = "my_benchmark"
harness = false

use std::iter;

use criterion::BenchmarkId;
use criterion::Criterion;
use criterion::Throughput;

fn from_elem(c: &mut Criterion) {
    static KB: usize = 1024;

    let mut group = c.benchmark_group("from_elem");
    for size in [KB, 2 * KB, 4 * KB, 8 * KB, 16 * KB].iter() {
        group.throughput(Throughput::Bytes(*size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| iter::repeat(0u8).take(size).collect::<Vec<_>>());
        });
    }
    group.finish();
}

criterion_group!(benches, from_elem);
criterion_main!(benches);

Custom harnesses are a second class experience

Require their own test binary, distinct from other tests
Varying levels of support or extensions for how to interact with them
The cost for writing your own is high

Test Initialization and Cleanup

When talking about this, people generally think of the classic JUnit setup with its own downsides:

public class JUnitTestCase extends TestCase {
    @Override
    protected void setUp() throws Exception {
        // ...
    }

    public void testSomeSituation() {
        // ...
    }

    @Override
    protected void tearDown() throws Exception {
        // ...
    }
}

In Rust, we generally solve this with RAII:

fn cargo_add_lockfile_updated() {
    let scratch = tempfile::tempdir::new().unwrap();

    // ...


}

This has its own limitations, like some teardown errors being ignored. I've had bugs masked by this on Windows, requiring manual cleanup to catch them:

fn cargo_add_lockfile_updated() {
    let scratch = tempfile::tempdir::new().unwrap();

    // ...

    scratch.close().unwrap();
}

Sometimes generic libraries like tempfile aren't sufficient. Within cargo, we intentionally leak the temp directories, only cleaning them up on the next run so people can debug failures. This is also provided by #[cargo_test]. However, we have regularly hit CI storage limits and it would be a big help if the fixture tracked the size of these directories, much like tracking test times.

Cargo also has a lot of fixture initialization coupled to the directory managed by #[cargo_test], requiring a package to buy-in to the whole system to just use a little portion of it.

Sometimes a fixture is expensive to create and you want to be able to share it. For example in cargo, we sometimes put multiple "tests" in the same function to share the fixture, running into similar problems as we do with the lack of test generation.

The counter argument could be made that we just aren't composing things right. That is likely the case but I feel this organic growth is the natural outcome of not having better supporting tools and needing to prioritize our own development.

Having composable fixtures would go a long way towards making test code more reusable. Take for instance pytest. In a previous part of my career, I made a Python API for hardware that interacted with the CAN bus. This had to be tested at the system level and required access to hardware. With pytest, I could specify that a test required a can_in_interface resource. can_in_interface is a fixture that could be initialized from the command-line and skip all dependent tests, if not specified.

def pytest_addoption(parser):
    parser.addoption(
        "--can-in-interface", default="None",
        action="store",
        help="The CAN interface to use with the tests")

@pytest.fixture
def can_in_interface(request):
    interface = request.config.getoption("--can-in-interface")
    if interface.lower() == "none":
        pytest.skip("Test requires a CAN board")
    return interface

def test_wait_for_intf_communicating(can_in_interface):
    # ...

We have crates like rstest, but they are like #[cargo_test] and build on top of libtest. We can't extend the command-line, have fixtures skip tests, and so on

Scaling up development

As we've worked around limitations, we've lost the strength of transferability. These solutions are also not composable; if there isn't a custom test harness that works for your case, you have to build everything up from scratch. libtest-mimic reduces the work to build from scratch but it still requires you to do this for each scenario, rather than having a way to compose testing logic. This takes a toll on those projects until they say enough is enough and build something custom.

Friction between `cargo test` and libtest

So far I've mostly talked about test writing. There are also problems with test running and reporting. I think cargo nextest has helped highlight gaps in the current workflow. However, cargo nextest is working within the limitations of the existing system. For example, what would normally be attributes on the test function in other language's test libraries, you have to specify in a separate config file. cargo nextest also does process isolation for tests. While it has benefits, I'm concerned about what would lose by making this the default workflow. For example, you can't run cargo nextest on cargo today because of shared state between tests, in particular the creation of short identifiers for temp directories which allows us to have a stable set of directories to use and clean up from. Process isolation also gets in the way of trying to support shared fixtures in the future.

Going back over our backlog, we've problems related to cargo test and libtest's interactions include:

Wanting to run test binaries in parallel, like cargo nextest
Lack of summary across all binaries
Noisy test output (see also #5089)
Confusing command-line interactions (see also #8903, #10392)
Poor messaging when a filter doesn't match
Smarter test execution order (see also #8685, #10673)
JUnit output is incorrect when running multiple test binaries
Lack of failure when test binaries exit unexpectedly

Solution

To avoid optimizing for a local maxima, I like to focus on the ideal case and then step back to what is practical.

My ideal scenario

Before, I made a reference to pytest. That has been the best model for testing I've used so far. It provides a shared, composable convention for extending testing capabilities that I feel help in the scenarios I mapped out.

Runtime-conditional ignores:

@pytest.mark.skipif(!has_command("hg"), reason="requires `hg` CLI")
def test_simple_hg():
    pass

Case generation

@pytest.mark.parametrize("sample_rs", trybuild.find("tests/ui/*.rs"))
def ui(sample_rs):
    trybuild.verify(sample_rs)

Initialization and cleanup

def cargo_add_lockfile_updated(tmpdir):
    # ...

As for the UX, we can shift some of the responsibilities from libtest to cargo test if we formalize their relationship. Currently, cargo test hands off all responsibilities and makes no assumptions about command-line arguments, output formats, whats safe to run in parallel, etc.

Of course, this isn't all that simple or else it would have already been done. For libtest, its difficult to get feedback on unstable features which is one reason things have remained in limbo for so long. This also extends to stabilizing the json output for allowing tighter integration between cargo test and libtest. A naive approach to tighter integration would also be a breaking change as it changes expectations for custom test harnesses and even individual tests are run.

Prototype

I started prototyping the libtest-side of my ideal scenario while waiting for some FCPs to close out. My thought was to start here, rather than on cargo test as this would let me explore what the json output should look like before working to stabilize it for cargo test to use. I even went a step further and implemented all other output formats (pretty, terse and, junit) on top of the structures used for json output, helping to further refine its design and making sure its sufficient for cargo test to create the desired UX.

This prototype is still fairly early; we don't even have full parity with libtest-mimic. For this reason, none of the crates have been published yet.

The basis for the design is "what if this could replace the original libtest?". Even if we this does not become the basis for libtest, my hope is that core parts of the code can be shared to ensure consistent behavior, like serde types and the CLI.

So this is why I made yet another argument parser. I enjoy clap and generally recommend it to people (which is why I've taken on maintaining it). When someone needs something more lightweight, I generally point them to lexopt due to the simpleness of the design.

But.

When designing this prototype, I wanted to design-in support for users to extend the command-line like you can in pytest. This means it needs to be pluggable. If this were exposed in libtest's API, then it can't break compatibility. The easiest way to do this is to have as simple of an API as possible. Clap's API is too big. I was concerned even about the amount of policy in lexopt. I don't know if lexarg will go anywhere but it allowed me to get an idea of how a perma-1.0 test library could possibly have an extensible CLI.

Libs team meeting

After presenting on this at RustNL 2023 (video, slides), I had the opportunity to attend an in-person libs team meeting to discuss parts of this with them (along with OsStr).

Question: how much of this work will make it into libtest?

I went in with the audacious goal of "everything", initially working on extension points to allow out-of-tree experiments on a "pytest"-like API and then slowly pull pieces of that into libtest.

We also discussed experimenting with unstable features by publishing libtest to crates.io where we can break compatibility until we are satisfied and then add the #[stable] attribute to the version shipped with rust. This idea isn't new.

The biggest concern was with the compatibility surface that would have to be maintained and instead we went the opposite direction. Our aim is to make custom test harnesses first-class and shift the focus towards them rather than extending libtest. Hopefully, we can consolidate down on just a couple of frameworks with libtest providing us with the baseline API, reducing the chance that test writing between projects is too disjoint. I also hope we can share code between these projects to improve consistency and make it easier to conform to expectations.

Question: where do we draw the line for these custom test harnesses?

Today, you either get libtest with test enumeration, a main, etc, or you have to build it all up from scratch. Previously eRFC #2318 laid out a plan for rust to still own the #[test] macro and test enumeration but be accessible from custom test harnesses. For the cases I've enumerated and my gut feeling when prototyping, my suspicion is that I'll be wanting to allow custom #[test] macros so my test harness can control what code gets generated and can have nested attributes (like #[test(exclusive)]) rather than repeating our existing pattern of separate macros (e.g. #[ignore]).

To get parity with libtest, we'll need stable support for

Test enumeration (see below)
Disabling of the existing #[test] macro (no plans yet)
Custom preludes to pull in #[test] macro from a dependency (no plans yet)
Pulling in main from a dependency (no plans yet)
Capturing of println (see below)

Plus a low ceremony way to opt-in to all of this (like rust-lang/cargo#6945 ).

We didn't cover everything but we made enough progress to feel happy with this plan

Question: how will we do test enumeration?

I had hoped that inventory or linkme could be used for this but there was concern about supporting these across platforms without hiccups, including from dtolnay who is the maintainer of them.

Instead, we are looking at introducing a new language feature to replace libtest's use of internal compiler features. This would most likely be a #[distributed_slice] though we didn't hash out further details in the meeting.

Question: how will we capture println?

When a test fails, its great that the output is captured and reported back from println, dbg, and panic messages (like from assert).

How does it work though? I'm sorry you asked. At the start of a test, a buffer is passed to set_output_capture which puts it into thread-local storage. When you call println, print, eprintln, or eprint, they call _print and _eprint which calls print_to, writing to the buffer. At the end of the test, set_output_capture is called again to restore printing to stdout / stderr.

This means

If you write directly to std::io::{stdout,stderr}, use libc, or have C code using libc, it will not be captured
If the test launches a thread, its output will not be captured (as far as I can tell)
This API is only available in nightly which libtest has special access to on stable and custom test harnesses do not

Previously, this whole process was more more reusable but more complex and there is hesitance to go back in this direction. For this to be stabilized, this also needs to be more general This needs to cover std::io::stdout / std::io::stderr and likely libc. This needs to be more resilient, capturing across all threads or inherited when new threads are spawned. Then there is async where its not about what thread you are running on. Implicit contexts for stdout and stderr would cover most needs but that I'm doubt we'll get that any time soon, if ever (no matter how much I love the idea of it).

We could workaround this by running each test in its own process but that comes with its own downsides as already mentioned.

So we don't have a plan that can meet these needs yet.

Next Steps

Now is where you can help.

This, unfortunately, isn't my highest priority because I don't want want to leave a trail of incomplete projects. I've previously committed to MSRV, [lints], cargo script, and keeping up with my crates. Even if this was my highest priority, this is too much for one person and is spread across rust language design, the rust compiler, the standard library, and cargo. This will also take time to go through the RFC process for each part so the sooner we start on these, the better.

But I don't think that means we should give up.

For anyone who would like to help out, the parts I see that are unblocked include:

Preparing a #[distributed_slice] Pre-RFC and moving that forward so we have test enumeration
Finish up what can be done on the prototype for further json output feedback
Design a low ceremony way to opt-in to all of this (like rust-lang/cargo#6945)
Sketching out ideas how we might disable the existing #[test] macro
Researching where custom preludes are at and see what might be able to move forward so we can pull in the #[test] macro
Similarly, researching the possibility of pulling in main from a dependency

These are roughly in priority order based on a mixture of

The time it will take before its usable
How confident I am that it can be solved
The pay off from solving it, whether because its more generally useful or a higher pain point to not have (e.g. distributed_slice in both cases)

Alternatively, if your interests align with one of my higher priorities, I welcome the help with them so more of us can focus on this problem.

Discuss on reddit mastadon