Diagnostics with Tracing
August 14, 2019
Effectively developing systems and operating them in production requires visibility into their behavior at runtime. While conventional logging can provide some of this visibility, asynchronous software — like applications using the Tokio runtime — introduces new challenges.
tracing is a collection of libraries that provide a framework for instrumenting Rust programs to collect structured, context-aware, event-driven diagnostics. Note that tracing was originally released under the name tokio-trace; the name was changed to reflect that, although it is part of the Tokio project, the tokio runtime is not required to use tracing.
Why do we need another logging library?
Rust already has a robust logging ecosystem based around the log crate's logging facade, with libraries such as env_logger and fern in widespread use. This raises reasonable questions: why is tracing necessary, and what benefit does it provide that existing libraries don't? To answer these questions, we need to consider the challenges introduced by diagnostics in asynchronous systems.
In synchronous code, we can simply log individual messages as the program executes, and expect them to be printed in order. A programmer can interpret the logs fairly easily, since the log records are output sequentially. For example, in a synchronous system, if I see a group of log messages like this:
DEBUG server: accepted connection from 106.42.126.8:56975
DEBUG server::http: received request
WARN server::http: invalid request headers
DEBUG server: closing connection
I can infer that the request from the client with the IP address 106.42.126.8 was the one that failed, and that the connection from that client was then closed by the server. The context is implied by previous messages: because the synchronous server must serve each request before accepting the next connection, we can determine that any log records occurring after an "accepted connection..." message and before a "closing connection" message refer to that connection.
However, in asynchronous systems like Tokio, interpreting traditional log messages can often be quite challenging. A single thread in an asynchronous system might be executing any number of tasks, switching between them as IO resources become available, and an application might consist of a number of such worker threads running concurrently. In this world, we can no longer rely on the ordering of log messages to determine context or cause and effect. A task might log some messages and yield, allowing the executor to poll another task that logs its own unrelated messages. Log messages from threads running concurrently might be printed out interleaved. To understand asynchronous systems, the contextual and causal relationships must be recorded explicitly, rather than implied by sequential ordering.
If the log lines in the above example were emitted by an asynchronous application, the server task may continue accepting new connections while previously-accepted ones are being processed by other tasks; multiple requests might be processed concurrently by worker threads. We might see something like this:
DEBUG server: accepted connection from 106.42.126.8:56975
DEBUG server: closing connection
DEBUG server::http: received request
DEBUG server: accepted connection from 11.103.8.9:49123
DEBUG server::http: received request
DEBUG server: accepted connection from 102.12.37.105:51342
WARN server::http: invalid request headers
TRACE server: closing connection
We can no longer tell that the request with the invalid headers came from 106.42.126.8. Since multiple connections are being processed concurrently, the invalid headers might have been sent on another connection while we waited to receive more data from that client. What can we do to make sense of this mess?
Conventional logging often captures static contexts about the program — such as what file, module, or function an event was recorded in — but that's of limited use to us in understanding the program's runtime behavior. Instead, visibility into asynchronous code requires diagnostics that track the dynamic runtime contexts in which events occur.
Application-Level Tracing
tracing is more than a logging library: it implements scoped, contextual, and structured diagnostic instrumentation. This allows users to trace logical contexts in the application through time, even as the actual flow of execution moves between those contexts.
Spans
To record the flow of execution, tracing introduces the concept of spans. Unlike a log line that represents a moment in time, a span models a period of time with a beginning and an end. When a program begins executing in a context or performing a unit of work, it enters a span, and when it stops executing in that context, it exits the span.
Any events that occur between when a span is entered and when it is exited are considered to have occurred within that span. Similarly, spans may be nested: when a thread enters a span inside of another span, it is in both spans, with the newly-entered span considered the child and the outer span the parent. We can then construct a tree of nested spans and follow them throughout different parts of a program.
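For illustration, here is a minimal sketch (not taken from the demo discussed below) of creating and entering nested spans with the span! macro:

use tracing::{span, Level};

// Create a span representing some unit of work...
let conn_span = span!(Level::DEBUG, "conn");
// ...and enter it. The returned guard exits the span when it is dropped.
let _conn_guard = conn_span.enter();

// A span entered while "conn" is active is recorded as its child.
let request_span = span!(Level::DEBUG, "request");
let _request_guard = request_span.enter();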
tracing also supports events, which model instantaneous points in time. Events are similar to traditional log messages, but exist within the tree of spans as well. When we record an event, we can pinpoint the context in which it occurred.
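As a small sketch, recording an event while a span is entered attaches it to that span:

use tracing::{debug, span, Level};

let span = span!(Level::DEBUG, "server");
let _enter = span.enter();

// Because the "server" span is currently entered, this event is
// recorded within that span's context.
debug!("accepted connection");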
Structure
By attaching structured data to spans, we can model contexts. Rather than simply recording unstructured, human-readable messages, tracing instrumentation points record typed key-value data called fields. For example, in an HTTP server, a span representing an accepted connection might record fields such as the client's IP address, the requested path, request method, headers, and so on. If we revisit the example above, with the addition of spans, we might see something like this:
DEBUG server{client.addr=106.42.126.8:56975}: accepted connection
DEBUG server{client.addr=82.5.70.2:53121}: closing connection
DEBUG server{client.addr=89.56.1.12:55601} request{path="/posts/tracing" method=GET}: received request
DEBUG server{client.addr=111.103.8.9:49123}: accepted connection
DEBUG server{client.addr=106.42.126.8:56975} request{path="/" method=PUT}: received request
DEBUG server{client.addr=113.12.37.105:51342}: accepted connection
WARN server{client.addr=106.42.126.8:56975} request{path="/" method=PUT}: invalid request headers
TRACE server{client.addr=106.42.126.8:56975} request{path="/" method=PUT}: closing connection
Notice how the events are annotated with spans that record the client IP address, and the request's path and method. Although multiple events are happening concurrently in different contexts, we can now follow the flow of the request from 106.42.126.8 through the system, and determine that it was the request containing the invalid headers that generated the warning.
This machine-readable structured data also gives us the ability to consume diagnostic data in more sophisticated ways than simply formatting it to be read by a human. For example, we might also consume the above data by counting the number of requests received for different paths or HTTP methods. By looking at the structure of the span tree as well as at key-value data, we can even do things like recording the entire lifespan of a request only when it ended with an error.
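As a hedged sketch (the address below is made up), instrumentation that records fields like the ones shown above might look like this:

use std::net::SocketAddr;
use tracing::{debug, span, Level};

// A hypothetical client address standing in for a real accepted connection.
let client_addr: SocketAddr = "106.42.126.8:56975".parse().unwrap();

// Record the address as a typed field on the span; the `%` sigil records
// the value using its Display implementation.
let span = span!(Level::DEBUG, "server", client.addr = %client_addr);
let _enter = span.enter();

// Events recorded inside the span are annotated with its fields.
debug!("accepted connection");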
A Worked Example
To demonstrate the value of this kind of diagnostics, let's take a look at an example. In this example, we've written a small HTTP service, using tower, that implements "character repetition as a service". The service receives requests for paths consisting of a single ASCII character, and responds with that character repeated a number of times equal to the value of the Content-Length header.
We've instrumented the example service using tracing, and used the tracing-subscriber crate to implement an admin endpoint that lets us change, at runtime, the filter configuration determining which tracing instrumentation is enabled. A load generator runs in the background, constantly sending requests for random alphabetic characters to the demo service.
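The demo's exact wiring isn't reproduced here, but as a rough sketch, a runtime-reloadable filter can be built with tracing-subscriber's reload module (the names below follow present-day tracing-subscriber and may differ from the version available when this post was written):

use tracing_subscriber::{filter::EnvFilter, fmt, prelude::*, reload};

// Wrap a filter in a reload layer; the returned handle can swap it out later.
let (filter, reload_handle) = reload::Layer::new(EnvFilter::new("info"));

tracing_subscriber::registry()
    .with(filter)
    .with(fmt::layer())
    .init();

// Later, e.g. from an admin HTTP handler, replace the active directives:
reload_handle
    .reload(EnvFilter::new("gen=debug"))
    .expect("failed to reload filter");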
With a version of the example service instrumented using the log and env_logger crates, we get output like this:
...
[DEBUG load_log] accepted connection from [::1]:55257
[DEBUG load_log] received request for path "/z"
[DEBUG load_log] accepted connection from [::1]:55258
[DEBUG load_log] received request for path "/Z"
[ERROR load_log] error received from server! status: 500
[DEBUG load_log] accepted connection from [::1]:55259
[DEBUG load_log] accepted connection from [::1]:55260
[DEBUG load_log] received request for path "/H"
[DEBUG load_log] accepted connection from [::1]:55261
[DEBUG load_log] received request for path "/S"
[DEBUG load_log] received request for path "/C"
[DEBUG load_log] accepted connection from [::1]:55262
[DEBUG load_log] received request for path "/x"
[DEBUG load_log] accepted connection from [::1]:55263
[DEBUG load_log] accepted connection from [::1]:55264
[WARN load_log] path "/" not found; returning 404
[DEBUG load_log] accepted connection from [::1]:55265
[ERROR load_log] error received from server! status: 404
[DEBUG load_log] received request for path "/Q"
[DEBUG load_log] accepted connection from [::1]:55266
[DEBUG load_log] received request for path "/l"
...
Because the service is under load, these messages scroll past very fast. I've included just a small sample of the output here.
Although the errors do show up in the logs, it's difficult to associate them with any context that might help us determine their cause. For the 404 error, we're lucky that whoever added the warning line logged by the server thought to include the path in the human-readable log message, but for the 500 error, we're flying blind.
Now, let's switch to a version of the demo application that's instrumented with tracing:
...
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60891}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60890}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60891}:request{req.method=GET req.path="/I"}: load: received request. req.headers={"content-length": "18", "host": "[::1]:3000"} req.version=HTTP/1.1
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60890}:request{req.method=GET req.path="/T"}: load: received request. req.headers={"content-length": "4", "host": "[::1]:3000"} req.version=HTTP/1.1
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60892}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60892}:request{req.method=GET req.path="/x"}: load: received request. req.headers={"content-length": "6", "host": "[::1]:3000"} req.version=HTTP/1.1
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60893}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60895}:request{req.method=GET req.path="/"}: load: received request. req.headers={"content-length": "13", "host": "[::1]:3000"} req.version=HTTP/1.1
WARN server{name=serve local=[::1]:3000}:conn{remote=[::1]:60895}:request{req.method=GET req.path="/"}: load: rsp.status=404
ERROR gen: error received from server! status=404
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60893}:request{req.method=GET req.path="/a"}: load: received request. req.headers={"content-length": "11", "host": "[::1]:3000"} req.version=HTTP/1.1
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60894}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60894}:request{req.method=GET req.path="/V"}: load: received request. req.headers={"content-length": "12", "host": "[::1]:3000"} req.version=HTTP/1.1
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60896}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60998}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60996}:request{req.method=GET req.path="/z"}: load: received request. req.headers={"content-length": "17", "host": "[::1]:3000"} req.version=HTTP/1.1
ERROR gen: error received from server! status=500
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60987}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60911}: load: accepted connection
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60911}:request{req.method=GET req.path="/m"}: load: received request. req.headers={"content-length": "7", "host": "[::1]:3000"} req.version=HTTP/1.1
DEBUG server{name=serve local=[::1]:3000}:conn{remote=[::1]:60912}: load: accepted connection
...
Notice that, in addition to the individual events, the span contexts in which they occurred are also printed. In particular, each connection and each request creates its own span. However, these events are recorded very frequently, so the logs still scroll by quite fast.
If we send a request to the demo app's admin endpoint, we can reload the filter configuration to look only at the events recorded by the load generators:
:; curl -X PUT localhost:3001/filter -d "gen=debug"
These filters are specified using a syntax similar to that of the env_logger crate.
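Each directive generally takes the form target=level, and multiple directives can be comma-separated. For instance (illustrative examples, not taken from the demo):

gen=debug
load=info,gen=warn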
The rate at which the logs scroll by slows down significantly, and we see output like this:
...
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
...
Looking at the request spans output by the load generator, we start to notice a pattern: the value of the req.path field is always either "/" or "/z". We can reload the filter configuration again, setting the verbosity to maximum only when we're inside a span that has a req.path field with the value "/":
:; curl -X PUT localhost:3001/filter -d "[{req.path=\"/\"}]=trace"
The [] indicates that we wish to filter on tracing spans rather than on static targets, and the {} indicates that we want to match span fields.
...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: sending request...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: tower_buffer::service: sending request to buffer worker
DEBUG request{req.method=GET req.path="/"}: load: received request. req.headers={"content-length": "21", "host": "[::1]:3000"} req.version=HTTP/1.1
TRACE request{req.method=GET req.path="/"}: load: handling request...
TRACE request{req.method=GET req.path="/"}: load: rsp.error=path must be a single ASCII character
WARN request{req.method=GET req.path="/"}: load: rsp.status=404
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: response complete. rsp.body=path must be a single ASCII character
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: sending request...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: tower_buffer::service: sending request to buffer worker
DEBUG request{req.method=GET req.path="/"}: load: received request. req.headers={"content-length": "18", "host": "[::1]:3000"} req.version=HTTP/1.1
TRACE request{req.method=GET req.path="/"}: load: handling request...
TRACE request{req.method=GET req.path="/"}: load: rsp.error=path must be a single ASCII character
WARN request{req.method=GET req.path="/"}: load: rsp.status=404
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: error received from server! status=404
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: response complete. rsp.body=path must be a single ASCII character
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: gen: sending request...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/"}: tower_buffer::service: sending request to buffer worker
DEBUG request{req.method=GET req.path="/"}: load: received request. req.headers={"content-length": "2", "host": "[::1]:3000"} req.version=HTTP/1.1
...
Now we see what's going on. The service doesn't support requests to / and is, quite correctly, responding with a 404. But what about those 500 errors? We remember that they seemed to occur only when the requested path was "/z"...
:; curl -X PUT localhost:3001/filter -d "[{req.path=\"/z\"}]=trace"
...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: sending request...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: tower_buffer::service: sending request to buffer worker
DEBUG request{req.method=GET req.path="/z"}: load: received request. req.headers={"content-length": "0", "host": "[::1]:3000"} req.version=HTTP/1.1
TRACE request{req.method=GET req.path="/z"}: load: handling request...
TRACE request{req.method=GET req.path="/z"}: load: error=i don't like this letter. letter="z"
TRACE request{req.method=GET req.path="/z"}: load: rsp.error=unknown internal error
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: response complete. rsp.body=unknown internal error
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: sending request...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: tower_buffer::service: sending request to buffer worker
DEBUG request{req.method=GET req.path="/z"}: load: received request. req.headers={"content-length": "16", "host": "[::1]:3000"} req.version=HTTP/1.1
TRACE request{req.method=GET req.path="/z"}: load: handling request...
TRACE request{req.method=GET req.path="/z"}: load: error=i don't like this letter. letter="z"
TRACE request{req.method=GET req.path="/z"}: load: rsp.error=unknown internal error
ERROR load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: error received from server! status=500
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: response complete. rsp.body=unknown internal error
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: gen: sending request...
TRACE load_gen{remote.addr=[::1]:3000}:request{req.method=GET req.path="/z"}: tower_buffer::service: sending request to buffer worker
DEBUG request{req.method=GET req.path="/z"}: load: received request. req.headers={"content-length": "24", "host": "[::1]:3000"} req.version=HTTP/1.1
TRACE request{req.method=GET req.path="/z"}: load: handling request...
TRACE request{req.method=GET req.path="/z"}: load: error=i don't like this letter. letter="z"
TRACE request{req.method=GET req.path="/z"}: load: rsp.error=unknown internal error
...
We can now trace the entire lifecycle of a request to /z, ignoring everything else that's happening concurrently. And, looking at the logged events, we see that the service records that "I don't like this letter" and returns an internal error. We've found the (admittedly, entirely fake) bug!
This kind of dynamic filtering is only possible with context-aware, structured diagnostic instrumentation. Because nesting spans and events lets us construct a tree of related contexts, we can interpret the diagnostics in terms of events that are logically or causally linked (e.g., they occurred while handling the same request) rather than those that are temporally linked (they occurred near each other in time).
If you're interested in seeing the code that produced these logs, or interactively experimenting with this demo, you can check out the example here.
Getting Started with Tracing
tracing is available on crates.io:
tracing = "0.1.5"
The easiest way to get started with tracing is to use the tracing::instrument attribute on a function. This attribute will instrument the function to create and enter a new span when the function is called, with the function's arguments recorded as fields on that span. For example:
use std::{io, net::SocketAddr}; // assumed: std types used in the signature
use tokio::net::TcpStream;      // assumed: Tokio's TcpStream
use tracing::instrument;

#[instrument]
pub async fn connect_to(remote: SocketAddr) -> io::Result<TcpStream> {
    // ...
}
tracing also provides a set of function-like macros for constructing spans and events. Users of the log crate should note that tracing's trace!, debug!, info!, warn!, and error! macros are a superset of the similarly-named macros in log and should be drop-in compatible:
use log::info;
info!("hello world!");
use tracing::info;
info!("hello world!");
The more idiomatic style, however, is to use these macros to record structured data rather than unstructured messages. For example:
use tracing::trace;

// Hypothetical values standing in for the results of real work:
let bytes_read = 1024;
let num_processed = 3;
// ...
trace!(bytes_read, messages = num_processed);
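Fields can also record non-primitive values. As a small sketch (the values here are made up), the % and ? sigils record a field using its Display or Debug implementation, respectively:

use std::net::SocketAddr;
use tracing::info;

let remote: SocketAddr = "[::1]:3000".parse().unwrap();
let err = std::io::Error::new(std::io::ErrorKind::Other, "connection reset");

// `%` records `remote` via Display; `?` records `err` via Debug.
info!(remote.addr = %remote, error = ?err, "request failed");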
A Subscriber implementation collects and records trace data, similarly to a logger in conventional logging. Applications must set up a default subscriber.
The Subscriber interface is tracing's main extension point; different methods and policies for recording and processing trace data can be represented as Subscriber implementations. Currently, the tracing-fmt crate provides a Subscriber implementation that logs trace data to the console, and more implementations are soon to come.
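As a minimal sketch (assuming tracing-fmt's default constructor; the exact builder API may differ), installing the console subscriber as the program-wide default might look like this:

use tracing::info;
use tracing_fmt::FmtSubscriber;

fn main() {
    // Construct the console-logging subscriber provided by tracing-fmt...
    let subscriber = FmtSubscriber::new();

    // ...and install it as the default for the whole program.
    tracing::subscriber::set_global_default(subscriber)
        .expect("setting the default subscriber failed");

    info!("this event will be recorded by the default subscriber");
}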
More API documentation is available on docs.rs, and examples are provided in the tracing GitHub repository.
Building an Ecosystem
The tracing ecosystem is centered around the tracing crate, which provides the API used to instrument libraries and applications, and the tracing-core crate, which provides the minimal, stable kernel of functionality necessary to connect that instrumentation with Subscribers.
However, this is just the tip of the iceberg. The tokio-rs/tracing repository contains a number of additional crates, in varying degrees of stability. These crates include:
- Compatibility layers with other libraries, such as tracing-tower and tracing-log.
- Subscriber implementations, such as tracing-fmt.
- The tracing-subscriber crate, which provides utilities for implementing and composing Subscribers.
Stable releases of the central crates have been published to crates.io, and tracing is already seeing adoption by projects like Linkerd 2 and Vector. However, there is a lot of future work, including:
- Integrating with distributed tracing systems such as OpenTelemetry or Jaeger.
- Building out richer instrumentation in the Tokio runtime.
- Integration with more libraries and frameworks.
- Writing Subscribers to implement more ways of collecting trace data, such as metrics, profiling, et cetera.
- Helping to stabilize experimental crates.
Contributions in all these areas will be welcomed eagerly. We're all looking forward to seeing what the community will build on top of the platform that tracing provides!
If you're interested, check out tracing on GitHub or join the Gitter chat channel!