toml v0.9
is a near-complete rewrite with dramatic performance improvements, no_std
support, and many other improvements.
This is where I've disappeared to over the last 3 months, trying to wrap up a years-old experiment that has been "almost done" for these last 3 months, with use case after use case building up encouraging me to finish this up rather than my regularly scheduled Cargo work.
Goal: performance
The primary goal of this effort was to experiment with a different parser architecture to see what the performance of it would be like. People are not normally dependent on how long a TOML file takes to parse. However, Cargo may parse a thousand TOML files on every invocation. Most people don't notice the impact because compile times can dwarf the parsing of these files. It's in the small things that this will be noticed, like
rustup
adding overhead to every the execution of every Rust toolchain command to parserust-toolchain.toml
(rustup#2626)- Higher latency when your shell looks up completions for
cargo
with Cargo's new completion system (Unstable: native-completions) - Cargo's overhead when running a cargo-script (Unstable: script)
- Other tools using cargo programmatically (zulip:
cargo metadata
performance)
Previously, I had proposed that cargo publish
include a Cargo.json
or Cargo.cbor
inside of the *.crate
file,
bypassing the overhead of parsing Cargo.toml
for the majority of packages.
There was hesitance due to risks associated with having two files for the same content,
one viewed by users and the other viewed by machines.
This was a last ditch effort to prove whether Cargo.toml
parsing could be sufficiently fast.
Comparing what Cargo would do to parse a Cargo.toml across past, present (bold), future, and theoretical:
library | type | baseline | preserve_order | preserve_order,fast_hash |
---|---|---|---|---|
toml v0.5 | Table | 939us | 895us | - |
toml_edit v0.22.27 | Document | - | 542us | - |
toml_edit v0.23.0 | Document | - | 478us | - |
toml v0.9.0 | DeTable<'static> | 501us | 320us | 304us |
toml v0.9.0 | DeTable<'s> | 398us | 283us | 267us |
serde_json v1.0.140 | Value | 259us | 158us | - |
For other and older views of the data, see #891, #922, #945 #956 #957
The drop in parse times is sufficient to justify the more complex code.
This may even be enough to not need to add support for Cargo.json
.
Format challenges
While JSON started as a subset of Javascript, it has a fairly simple syntax and structure and it is well understood how to make it fast due to its heavy use in machine-to-machine communication. On the other hand, TOML is more heavily oriented towards humans with less care for the impact on performance.
Some examples of this include:
TOML has order independent sub-tables which requires building up the result piecemeal which isn't compatible with models like serde
and requires additional bookkeeping of some sort.
For example, the following are equivalent:
[one]
[two]
[one.a]
[three]
[one.b]
{
"one": {
"a": {},
"b": {}
},
"two": {},
"three": {}
}
or
apple.type = "fruit"
orange.type = "fruit"
apple.skin = "thin"
orange.skin = "thick"
apple.color = "red"
orange.color = "orange"
{
"apple": {
"type": "fruit",
"skin": "thin",
"color": "red"
},
"orange": {
"type": "fruit",
"skin": "thick",
"color": "orange"
}
}
TOML has very specific rules about how different table syntaxes can be combined ([table]
, dotted.key
, { "inline" = "table" }
).
For example,
[fruit]
apple.color = "red"
apple.taste.sweet = true
# [fruit.apple] # INVALID
# [fruit.apple.taste] # INVALID
[fruit.apple.texture] # you can add sub-tables
smooth = true
toml v0.5 solved this problem by parsing tokens into an AST (a Vec<Table>
where Table
is not recursive) and generating lookup tables to make it easier to walk the logical structure among the syntactic structure.
toml v0.6 delegated parsing to toml_edit which
parses directly into a Document
representing the logical structure (a recursive Table
).
toml_edit did two lookups for every table for simplicity of matching TOML's rules for merging of syntactic tables.
When starting to parse a syntactic table,
we remove the Table
and directly insert into it.
When we find the next table header,
we then re-insert the Table
we were just editing.
toml v0.9 continues with the approach of toml_edit by providing a logical DeTable
that it parses into.
Cargo requirements
Originally, cargo
had been using toml v0.5 to parse Cargo.toml
files.
This changed with the introduction of cargo add
which used toml_edit for its ability to make changes to Cargo.toml
without losing the users formatting.
When proposing to merge cargo add
, the Cargo team was concerned about having two different TOML parsers inside of Cargo which may have slight incompatibilities and different errors.
I had proposed that we solve this problem by having one parser, toml_edit, and extending it to support tomls API.
This brought on the requirement of "no loss in performance".
toml_edit was then relied on to parse --config key=false
TOML expressions as it made it easy to validate that only a single key-value expression was present and not a full document, including comments and newlines.
Lately, the Cargo team is experimenting with using the richer information reported from toml_edit to provide rustc-like diagnostics for Cargo.toml
.
To summarize, the traits we need to preserve while making toml faster include:
- Consistent parsing behavior between
serde
-like reading and writing of TOML and format-preserving edits of user TOML files - Validation of
key=value
expressions - Spans for keys and values
To ensure consistent behavior, we cleaned up the tests in toml and toml_edit, isolating crate-specifics, and synced them into each other. This puts a burden on us to ensure that any changes to toml or toml_edits shared tests get synced back to the other.
As a second line of defence, parsing of the AST is shared in the toml_parser crate.
As a third-line of defence, the conversion from the AST to the logical structure and the serde
implementations are maintained in such a way that we can diff the implementations.
However, this is noisy and requires us to remember to sync even changes, much like the tests.
Validating of key=value
expressions can still be done with toml_edit.
As for providing Spans,
we leveraged the fact that we need to provide a logical DeTable
for serde
and included Spans in it (i.e. Spanned<DeTable>
).
Like toml_edit::Document
, this requires DeTable
to be an owned type so that cargo
can keep it around until a diagnostic needs it.
However, it should be rare that cargo
needs spans for dependencies.
To prepare for this possibility, we made DeTable<'s>
track all strings using
Cow
to avoid allocations when escaping isn't used.
To make this owned, you can call DeTable::make_owned
which switches all Cow::Borrowed
s to Cow::Owned
.
At that point, it is safe to
transmute
a DeTable<'s>
to a DeTable<'static>
as lifetimes do not affect layout (or so I'm told).
Implementation
Internally, toml v0.5 has a fallible tokenizer that iterated through char
s in a &str
, requiring UTF-8 decoding logic in the inner loop.
As mentioned before, it then parsed into a Vec<Table>
(non-recursive) and created lookup tables to speed up access.
toml v0.6 (via toml_edit) directly parsed bytes, avoiding the UTF-8 decoding, directly into a Table
(recursive).
For parsing bytes, it helps that TOML syntax only distinguishes 7-bit ASCII characters from each other and UTF-8 is left opaque.
unsafe
was heavily used to avoid re-validating that the &[u8]
s returned were actually &str
.
The parser is written using parser combinators via winnow.
This made it very easy to compare the implementation to the
grammar
and got fairly good performance results compared to the old parser.
However, this required backtracking and carrying state in Result
s which likely helped spill return values over from registers to the stack
(see winnow#72).
To preserve formatting,
the size of an Item
returned by a parser is also quite large.
By doing direct parsing, we had deeper call stacks for this all to accumulate in and higher costs, both in terms of the call stack and parsed result, when backtracking.
toml v0.9 (via toml_parser) has an infallible tokenizer that we collect into a Vec<Token>
,
assuming the capacity based on the length of the &str
we are tokenizing.
The tokenizer processes bytes within each individual step to calculate indices to take a sub-slice of a &str
.
We made unsafe
optional in toml_parser to avoid the UTF-8 boundary checks but found the overhead of the safe option to be low enough that we didn't use it.
We then report parse events and errors through callbacks,
with toml collecting them into a Vec<Event>
with the capacity inferred from Vec<Token>
s length.
In a couple of places (table headers, scalar values),
the parser does do some lookahead but only needs to compare TokenKind
s and not iterate through a sequence of bytes.
The parser only looks at TokenKind
s and not the &str
, meaning that scalar values are not validated at this stage.
All structural syntactic errors are expected to be reported at this stage.
toml walks these events to build up a DeTable
, validating the logical structure and decoding and validating scalars along the way.
rustc's tokenizer tries to minimize the size of a token by only keeping the token length and storing it in a u32
.
We were unsure how much value this would provide toml and didn't experiment with this.
Secondary goals
While the focus was on performance for Cargo, the architecture we were going with provided an opportunity to help with areas of interest.
Error recovery
Like rustc
, we wanted to allow toml to report as many errors as possible, not just the first.
The core of error recovery is that all errors reported through a &mut dyn ErrorSink
parameter.
The caller can choose whether they collected into a ()
, Option<ParseError>
, Vec<ParseError>
, or something else entirely.
This was another important reason to avoid backtracking because we didn't want to have to revert errors that had been reported.
To keep the logic in toml as simple as possible,
we also made the parser create synthetic Event
s wherever possible
(e.g. reporting a closed table header with an empty span when a ]
was missing).
Currently, we only expose error recovery in toml through DeTable::parse_recoverable
.
Bignum support
While I figure an i64
is good enough for any config file,
that didn't appear to be the case.
While we previously parsed integers into an i64
, toml now tracks a validated, normalized string that it can later parse into the type requested via serde
.
Users can also directly parse into a DeTable
and get the raw parts of a DeInteger
to handle it themselves.
no_std
While TOML is unlikely a good format for systems so constrained that they lack alloc,
it could still work for systems that lack std.
In fact, tomling
was created for this purpose but the author would rather not maintain that.
In designing toml_parser for alloc, I found it wasn't too much harder to support core, so I did so as a challenge to myself.
For resource constrained systems or for general performance,
a caller may not care about parsing the entire document.
We made it so that an EventReceiver
can say "don't care" when recursing into an array or inline table.
We also deferred scalar value validation to access by users, giving them control over which scalars they bother to validate.
In no_std
environments, the size of the stack may further constrained.
Instead of offering built-in recursion protection to toml_parser,
it is built on top as an EventReceiver
decorator that callers can control.
All of this is then collected up into toml which provides one way of handling all of these decisions.
facet
facet is an effort towards reflection in Rust,
including offering parallels to what is supported in serde.
facet-toml is currently built on top of toml_edit, much like toml is.
They were considering writing their own but I encouraged them to wait on this effort (facet#148).
Our EventReceiver
design was taking into account that they hope that they can parse TOML directly to the end-users type without an intermediate like Document
or DeTable
.
We'll see how that goes.
To encourage sharing and best practices,
we also split out toml_writer to abstract away the lowest level details of writing out TOML.
They still have to deal with quirks of writing out tables correctly.
toml was updated to directly translate serde
into toml_writer,
without an intermediate like DeTable
or Document
,
completely dropping the dependency on toml_edit.
Some extra buffers were used to avoid the TableAfterValue
errors that users regularly saw with toml v0.5
(basic-toml#3).
Reducing the proliferation of TOML parsers
Speaking of alternative TOML parsers, we have:
- toml_edit which is focused on format-preserving edits
- toml which is focused on
serde
compatibility - basic-toml which is a deprecated, non-conformant fork of toml v0.5
- toml-span which is a fork of
basic-toml
(inheriting its problems) but tracks spans - boml, which is a dependency-light, non-conformant independent parser
- taplo which is a deprecated, non-conformant, LSP focused parser
- tombi which is focused on being an LSP
(for statements on conformance, see toml-test-matrix)
Based on the requirements, parser implementations, and lack of a maintainer for toml, transitioning toml to be on top of toml_edit was the right call even if it was heavier than some people preferred. With where we are at in testing, code structure, and parser implementations, we can better fulfill some of the needs that led to these other parsers but it took a lot of work to get there. Granted, we won't be "zero dependencies" but that is a proxy for other care abouts like compile times, binary size, web of trust, etc which were were either already doing well on or have improved with this release.
It might not make sense for everyone to switch to toml but maybe we can collaborate on some of the lower level layers in toml_parser, like the parser or even the lexer.
On parser combinators
In reflecting back on this rewriting a parser combinator-based parser into a hand written parser, I still think the default choice should be parser combinators (source). With parser combinators, there was generally a 1:1 mapping between code and the ABNF grammar. Now, that relationship is less clear. Each section of code is fuzzy in how it maps to the grammar and implements a subset of it, either assuming a layer above or below takes care of it or there is not-quite duplicated code within the same layer that handles it. Maintaining this in the future will be quite a burden, requiring knowledge of how the parts interact to make sure all of the pieces get updated correctly.
In this case, it was worth it because of user-facing goals it met. In most cases, including toml in the past, the most important goal is being able to delivery something.