Performance Monitoring: Sentry SDK API Evolution

The objective of this document is to contextualize the evolution of the Performance Monitoring features in Sentry SDKs. We start with a summary of how Performance Monitoring was added to Sentry and to SDKs, and, later, we discuss lessons learned in the form of identified issues and the initiatives to address those issues.

Introduction

Back in early 2019, Sentry started experimenting with adding tracing to SDKs. The Python and JavaScript SDKs were the test bed where the first concepts were designed and developed. A proof-of-concept was released on April 29, 2019, and shipped to Sentry on May 7, 2019. Python and JavaScript were obvious choices, because they allowed us to experiment with instrumenting Sentry’s own backend and frontend.

Note that the aforementioned work was contemporaneous with the merger of OpenCensus and OpenTracing to form OpenTelemetry. Sentry’s API and SDK implementations borrowed inspiration from pre-1.0 versions of OpenTelemetry, combined with our own ideas. For example, our list of span statuses closely matches the one found in the OpenTelemetry specification around the end of 2019.

After settling on an API, performance monitoring support was then expanded to other SDKs. Sentry's Performance Monitoring solution became Generally Available in July 2020. OpenTelemetry's Tracing Specification version 1.0 was released in February 2021.

Our initial implementation reused the mechanisms we had in place for error reporting:

  • The Event type was extended with new fields. Instead of designing and implementing a whole new ingestion pipeline, we could save time and quickly start sending "events" to Sentry, this time not errors but a new "transaction" event type.
  • Since we were just sending a new type of event, the SDK transport layer was also reused.
  • And since we were sharing the ingestion pipeline, we were also sharing storage and much of the processing that applies to all events.

Our implementation evolved such that there was a clear emphasis on the distinction between Transactions and Spans. Part of that was a side effect of reusing the Event interface.

Transactions resonated well with customers. They allowed important chunks of work in their code to be highlighted, such as a browser page load or an HTTP server request. Customers can see and navigate through a list of transactions, and within a transaction the spans provide detailed timing for more granular units of work.

In the next section, we’ll discuss some of the shortcomings with the current model.

Identified Issues

While the reuse of the Unified SDK architecture (hubs, clients, scopes) and the transaction ingestion model have merits, experience revealed some issues that we categorize into two groups.

The first group has to do with scope propagation, in essence the ability to determine what the “current scope” is. This operation is required for both manual instrumentation in user code as well as for automatic instrumentation in SDK integrations.

The second group is for issues related to the wire format used to send transaction data from SDKs to Sentry.

Scope Propagation

This issue is tracked by getsentry/sentry-javascript#3751.

The Unified SDK architecture is fundamentally based on the existence of a hub per unit of concurrency, each hub having a stack of pairs of client and scope. A client holds configuration and is responsible for sending data to Sentry by means of a transport, while a scope holds contextual data that gets appended to outgoing events, such as tags and breadcrumbs.

Every hub knows what the current scope is. It is always the scope on top of the stack. The difficult part is having a hub “per unit of concurrency”.
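
As a rough illustration, the sketch below shows how a hub, its scope stack, and contextual data relate in the JavaScript SDK (names follow the public API of @sentry/browser at the time; other SDKs differ in details):

// A minimal sketch of the hub/scope relationship. Every unit of concurrency
// is supposed to have its own hub; each hub holds a stack of (client, scope) pairs.
import { getCurrentHub } from '@sentry/browser';

const hub = getCurrentHub();            // the hub for the current unit of concurrency
hub.getScope().setTag('plan', 'pro');   // contextual data lives on the current scope

// withScope pushes a clone of the current scope, runs the callback, then pops it,
// so temporary context does not leak into later events.
hub.withScope(scope => {
  scope.setTag('section', 'checkout');
  scope.addBreadcrumb({ message: 'user clicked "Pay"' });
  hub.captureMessage('payment started'); // inherits the tags and breadcrumbs above
});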

JavaScript, for example, is single-threaded with an event loop and async code execution. There is no standard way to carry contextual data that works across async calls. So for JavaScript browser applications, there is only one global hub shared for sync and async code.

A similar situation appears in Mobile SDKs. There is a user expectation that contextual data like tags, the current user, breadcrumbs, and other information stored on the scope be available and settable from any thread. Therefore, in those SDKs there is only one global hub.

In both cases, everything was relatively fine while the SDK only had to report errors. With the added responsibility of tracking transactions and spans, the scope became a poor fit for storing the current span, because a single shared scope cannot represent multiple concurrent spans.

For Browser JavaScript, a possible solution is the use of Zone.js, part of the Angular framework. The main challenge is that it increases bundle size and may inadvertently impact end-user apps, as it monkey-patches key parts of the JavaScript runtime.
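
A rough sketch of that idea, assuming a hub-per-Zone scheme (the 'sentryHub' zone property and the helper names are hypothetical, not an existing Sentry API):

// Sketch only: carry a hub per Zone so async continuations can find the
// "current" hub. Zone.js forwards zone properties to child zones.
import 'zone.js';
import { getCurrentHub, Hub } from '@sentry/browser';

function runWithOwnHub(name, fn) {
  const hub = new Hub(getCurrentHub().getClient()); // fresh hub, shared client
  return Zone.current
    .fork({ name, properties: { sentryHub: hub } })  // 'sentryHub' is a made-up key
    .run(fn); // fn and its async continuations execute inside this zone
}

function getZoneHub() {
  // Fall back to the global hub for code running outside an instrumented zone.
  return Zone.current.get('sentryHub') || getCurrentHub();
}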

The scope propagation problem became especially apparent when we tried to create a simpler API for manual instrumentation. The idea was to expose a Sentry.trace function that would implicitly propagate tracing and scope data, and support deep nesting with sync and async code.

As an example, let’s say someone wanted to measure how long searching through a DOM tree takes. Tracing this operation would look something like this:

await Sentry.trace(
  {
    op: 'dom',
    description: 'Walk DOM Tree',
  },
  async () => await walkDomTree()
);

With the Sentry.trace function, users wouldn’t have to worry about keeping the reference to the correct transaction or span when adding timing data. Users are free to create child spans within the walkDomTree function and spans would be ordered in the correct hierarchy.

The implementation of the actual trace function is relatively simple (see a PR with an example implementation). Knowing what the current span is, however, both in async code and in global integrations, is a challenge yet to be overcome.
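
For reference, such a trace helper could look roughly like the sketch below. It assumes getCurrentSpan and setCurrentSpan helpers that resolve the span for the current flow of execution, and that is precisely the part that is hard to implement today:

// Sketch of a possible Sentry.trace implementation. getCurrentSpan and
// setCurrentSpan are assumed helpers that require working scope propagation.
import { getCurrentHub } from '@sentry/browser';

async function trace(context, callback) {
  const parent = getCurrentSpan();
  const span = parent
    ? parent.startChild(context)                  // nest under the current span
    : getCurrentHub().startTransaction(context);  // or start a new transaction
  setCurrentSpan(span);
  try {
    return await callback(span);
  } catch (e) {
    span.setStatus('internal_error');
    throw e;
  } finally {
    span.finish();
    setCurrentSpan(parent); // restore the previous "current" span
  }
}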

The following two examples synthesize the scope propagation issues.

Cannot Determine Current Span

Consider some auto-instrumentation code that needs to get a reference to the current span, a case in which manual scope propagation is not available.

// SDK code
const originalFetch = window.fetch; // keep a reference to the unwrapped fetch
function fetchWrapper(...args) {
  /*
    ... some code omitted for simplicity ...
  */
  const parent = getCurrentHub().getScope().getSpan(); // <1>
  const span = parent.startChild({
    data: { type: 'fetch' },
    description: `${method} ${url}`,
    op: 'http.client',
  });
  try {
    // ...
    // return originalFetch(...args);
  } finally {
    span.finish();
  }
}
window.fetch = fetchWrapper;

// User code
async function f1() {
  const hub = getCurrentHub();
  let t = hub.startTransaction({ name: 't1' });
  hub.getScope().setSpan(t);
  try {
    await fetch('https://example.com/f1');
  } finally {
    t.finish();
  }
}
async function f2() {
  const hub = getCurrentHub();
  let t = hub.startTransaction({ name: 't2' });
  hub.getScope().setSpan(t);
  try {
    await fetch('https://example.com/f2');
  } finally {
    t.finish();
  }
}
Promise.all([f1(), f2()]); // run f1 and f2 concurrently

In the example above, several concurrent fetch requests trigger the execution of the fetchWrapper helper. Line <1> must be able to observe a different span depending on the current flow of execution, leading to two span trees as below:

t1
\
  |- http.client GET https://example.com/f1
t2
\
  |- http.client GET https://example.com/f2

That means that, when f1 is running, parent must refer to t1 and, when f2 is running, parent must be t2. Unfortunately, all code above is racing to update and read from a single hub instance, and thus the observed span trees are not deterministic. For example, the result could incorrectly be:

t1
t2
\
  |- http.client GET https://example.com/f1
  |- http.client GET https://example.com/f2

As a side effect of not being able to correctly determine the current span, the present implementation of the fetch integration (and others) in the JavaScript Browser SDK chooses to create flat transactions, where all child spans are direct children of the transaction (instead of forming a proper multi-level tree).

Note that other tracing libraries face the same kind of challenge. At the time of writing, OpenTelemetry for JavaScript had several open issues related to determining the parent span and proper context propagation (including in async code).

Conflicting Data Propagation Expectations

There is a conflict of expectations that appears whenever we add a trace function as discussed earlier, or simply try to address scope propagation with Zones.

The fact that the current span is stored in the scope, along with tags, breadcrumbs and more, makes data propagation messy, as some parts of the scope are intended to propagate only into inner function calls (for example, tags), while others are expected to propagate back into callers (for example, breadcrumbs), especially when there is an error.

Here is one example:

function a() {
  trace((span, scope) => {
    scope.setTag('func', 'a');
    scope.setTag('id', '123');
    scope.addBreadcrumb('was in a');
    try {
      b();
    } catch(e) {
      // How to report the SpanID from the span in b?
    } finally {
      captureMessage('hello from a');
      // tags: {func: 'a', id: '123'}
      // breadcrumbs: ['was in a', 'was in b']
    }
  })
}

function b() {
  trace((span, scope) => {
    const fail = Math.random() > 0.5;
    scope.setTag('func', 'b');
    scope.setTag('fail', fail.toString());
    scope.addBreadcrumb('was in b');
    captureMessage('hello from b');
    // tags: {func: 'b', id: '123', fail: ?}
    // breadcrumbs: ['was in a', 'was in b']
    if (fail) {
      throw Error('b failed');
    }
  });
}

In the example above, if an error bubbles up the call stack, we want to be able to report in which span (by referring to a SpanID) the error happened. We want breadcrumbs that describe everything that happened, no matter which Zones were executing, and we want a tag set in an inner Zone to override a tag with the same name from a parent Zone, while inheriting all other tags from the parent Zone. Every Zone has its own "current span."

All those different expectations make it hard to reuse, in an understandable way, the current notion of scope, how breadcrumbs are recorded, and how those different concepts interact.

Finally, it is worth noting that the changes to restructure scope management most likely cannot be done without breaking existing SDK APIs. Existing SDK concepts - like hubs, scopes, breadcrumbs, user, tags, and contexts - would all have to be remodeled.

Span Ingestion Model

Consider a trace depicted by the following span tree:

F*
├─ B*
│  ├─ B
│  ├─ B
│  ├─ B
│  │  ├─ S*
│  │  ├─ S*
│  ├─ B
│  ├─ B
│  │  ├─ S*
│  ├─ B
│  ├─ B
│  ├─ B
│  │  ├─ S*

where
F: span created on frontend service
B: span created on backend service
S: span created on storage service

This trace illustrates 3 services instrumented such that when a user clicks a button on a Web page (F), a backend (B) performs some work, which then requires making several queries to a storage service (S). Spans that are at the entry point to a given service are marked with a * to denote that they are transactions.

We can use this example to compare and understand the difference between Sentry's span ingestion model and the model used by OpenTelemetry and other similar tracing systems.

In Sentry's span ingestion model, all spans that belong to a transaction must be sent all together in a single request. That means that all B spans must be kept in memory for the whole duration of the B* transaction, including time spent on downstream services (the storage service in the example).

In OpenTelemetry's model, spans are batched together as they are finished, and batches are sent as soon as either a) there are a certain number of spans in the batch or b) a certain amount of time has passed. In our example, it could mean that the first 3 B spans would be batched together and sent while the first S* transaction is still in progress in the storage service. Subsequently, other B spans would be batched together and sent as they finish, until eventually the B* transaction span is also sent.
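
For comparison, this is roughly how that batching behavior is configured in OpenTelemetry's JavaScript SDK (option names from @opentelemetry/sdk-trace-base 1.x; the values shown are only illustrative):

// OpenTelemetry JS: finished spans are exported in batches, by size or by time,
// regardless of whether an enclosing span is still in progress.
import { BatchSpanProcessor, ConsoleSpanExporter } from '@opentelemetry/sdk-trace-base';
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';

const provider = new WebTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(new ConsoleSpanExporter(), {
    maxExportBatchSize: 512,    // flush when this many finished spans are queued
    scheduledDelayMillis: 5000, // or after this much time has passed
  })
);
provider.register();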

While transactions are notably useful as a way to group together spans and to explore operations of interest in Sentry, the form in which they currently exist imposes extra cognitive burden. Both SDK maintainers and end users have to understand and choose between a transaction or a span when writing instrumentation code.

The issues that follow in the next few sections have been identified in the current ingestion model, and are all related to this dichotomy.

Complex JSON Serialization of Transactions

In OpenTelemetry's model, all spans follow the same logical format. Users and instrumentation libraries can provide more meaning to any span by attaching key-value attributes to it. The wire protocol uses lists of spans to send data from one system to another.

Sentry's model, unlike OpenTelemetry's, makes a hard distinction between two types of span: transaction spans (often referred to as "transactions") and regular spans.

In memory, transaction spans and regular spans have one distinction: transaction spans have one extra attribute, the transaction name.

When serialized as JSON, though, the differences are greater. Sentry SDKs serialize regular spans to JSON in a format that directly resembles the in-memory spans. By contrast, the serialization of a transaction span requires mapping its span attributes to a Sentry Event (originally used to report errors, expanded with new fields exclusively used for transactions), with all child spans embedded as a list in the Event.
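
To make the difference concrete, the two payload shapes look roughly like this (field names follow the Sentry span and event protocols; the listing is abbreviated and the IDs are placeholders):

// A regular span is serialized more or less as it exists in memory:
const spanJSON = {
  trace_id: '<32-char hex id>',
  span_id: '<16-char hex id>',
  parent_span_id: '<16-char hex id>',
  op: 'http.client',
  description: 'GET https://example.com/f1',
  start_timestamp: 1629811200.0,
  timestamp: 1629811200.5,
};

// A transaction span is mapped onto an Event, with its children embedded:
const transactionJSON = {
  type: 'transaction',
  event_id: '<32-char hex id>',
  transaction: 't1', // the extra attribute: the transaction name
  contexts: {
    trace: { trace_id: '<32-char hex id>', span_id: '<16-char hex id>', op: 'pageload' },
  },
  start_timestamp: 1629811199.0,
  timestamp: 1629811201.0,
  spans: [ /* child spans, shaped like spanJSON above */ ],
  // ...plus Event-only fields such as release, environment, user, tags, ...
};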

Transaction Spans Gain Event Attributes

When a transaction is transformed from its in-memory representation to an Event, it gains additional attributes that cannot be assigned to regular spans, such as breadcrumbs, extra, contexts, event_id, fingerprint, release, environment, and user.

Lifecycle Hooks

Sentry SDKs expose a BeforeSend hook for error events, which allows users to modify and/or drop events before they are sent to Sentry.

When the new transaction type event was introduced, it was soon decided that such events would not go through the BeforeSend hook, essentially for two reasons:

  • To prevent user code from relying on the dual form of transactions (sometimes looking like a span, sometimes like an event, as discussed in earlier sections);
  • To prevent existing BeforeSend functions that were written with only errors in mind from interfering with transactions, be it mutating them accidentally, dropping them altogether, or causing some other unexpected side effect.

However, it was also clear that some form of lifecycle hook was necessary, to allow users to do things like updating a transaction's name.

We ended up with the middle ground of allowing the mutation/dropping of transaction events through the use of an EventProcessor (a more general form of BeforeSend). This solves problems by giving users immediate access to their data before it goes out of the SDK, but it also has drawbacks in that it's more complicated to use than BeforeSend and also exposes the transaction duality, which was never intended to leak.
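
For illustration, renaming or dropping a transaction with an event processor looks roughly like this (using the JavaScript SDK's addGlobalEventProcessor; note how the transaction duality leaks into the type check):

import { addGlobalEventProcessor } from '@sentry/browser';

// An EventProcessor sees every outgoing event, so user code has to check the
// event type itself to avoid accidentally touching error events.
addGlobalEventProcessor(event => {
  if (event.type !== 'transaction') {
    return event; // leave error events alone
  }
  if (event.transaction === '/users/123') {
    event.transaction = '/users/:id'; // e.g. normalize a high-cardinality name
  }
  return event; // returning null here would drop the transaction instead
});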

By contrast, in OpenTelemetry spans go through span processors, which provide two lifecycle hooks: one when a span is started and one when it is ended.

Nested Transactions

Sentry's ingestion model was not designed for nested transactions within a service. Transactions were meant to mark service transitions.

In practice, SDKs have no way of preventing transactions from becoming nested. The end result is possibly surprising to users, as each transaction starts a new tree. The only way to relate those trees is through the trace_id.

Sentry's billing model is per event, be it an error event or a transaction event. That means that a transaction within a transaction generates two billable events.

In SDKs, having a transaction within a transaction will cause inner spans to be "swallowed" by the innermost transaction surrounding them. In these situations, the code creating the spans will only add them to one of the two transactions, causing instrumentation gaps in the other.

Sentry's UI is not designed to deal with nested transactions in a useful way. When looking at any one transaction it is as if all the other transactions in the trace did not exist (no other transaction is directly represented on the tree view). There is a trace view feature to visualize all transactions that share a trace_id, but the trace view only gives an overview of the trace by showing transactions and no child spans. There is no way to navigate to the trace view without first visiting some transaction.

There is also user confusion about what one would expect in the UI for a situation such as this one (pseudocode):

# if do_a_database_query returns 10 results, is the user
#   - seeing 11 transactions in the UI?
#   - billed for 11 transactions?
#   - seeing spans within create_thumbnail in the innermost transaction only?
with transaction("index-page"):
    results = do_a_database_query()
    for result in results:
        if result["needs_thumbnail"]:
            with transaction("create-thumbnail", {"resource": result["id"]}):
                create_thumbnail(result)

Spans Cannot Exist Outside of a Transaction

Sentry's tracing experience is centered entirely around the part of a trace that exists inside transactions. Span data recorded outside of a transaction is not captured, even though it is logically part of a trace.

If the SDK does not have a transaction going, regular spans created by instrumentation are entirely lost. That said, this is less of a concern on web servers, where automatically instrumented transactions start and finish with every incoming request.

The requirement of a transaction is especially challenging on frontends (browser, mobile, and desktop applications), because in those cases auto-instrumented transactions less reliably capture all spans, as they only last for a limited time before being automatically finished.

Another problem arises in situations where the trace starts with an operation which is only instrumented as a span, not a transaction. In our example trace, the first span that originates the trace is due to a button click. If the button click F* were instrumented as a regular span rather than a transaction, most likely no data from the frontend would be captured. The B and S spans would still be captured, however, leading to an incomplete trace.

In Sentry's model, if a span is not a transaction and there is no ancestor span that is a transaction, then the span won't be ingested. This, in turn, means there are many situations where a trace is missing crucial information that can help debug issues, particularly on the frontend where transactions need to end at some point but execution might continue.

Automatic and manual instrumentation have the challenge of deciding whether to start a span or a transaction, and the decision is particularly difficult considering that:

  • If there is no transaction, then the span is lost.
  • If there is already a transaction, then there is the nested transactions issue.

Missing Web Vitals Measurements

Sentry's browser instrumentation collects Web Vitals measurements. But, because those measurements are sent along to Sentry using the automatically instrumented transaction as a carrier, measurements that are made available by the browser after the automatic transaction has finished are lost.

This causes transactions to be missing some Web Vitals or to have non-final measurements for metrics like LCP.

Unreliable Frontend Transaction Duration

Because all data must go in a transaction, Sentry's browser SDK creates a transaction for every page load and every navigation. Those transactions must end at some point.

If the browser tab is closed before the transaction is finished and sent to Sentry, all collected data is lost. Therefore, the SDK needs to balance the risk of losing all data with the risk of collecting incomplete and potentially inaccurate data.

Transactions are finished after a set time spent idle after the last activity (such as an outgoing HTTP request) is observed. This means that the duration of a page load or navigation transaction is a rather arbitrary value that can't necessarily be improved or compared to that of other transactions, as it doesn't accurately represent the duration of any concrete and understandable process.
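
The idle time is configurable, but tuning it only shifts the trade-off rather than removing it. Roughly (using the BrowserTracing integration options of the JavaScript SDK at the time):

import * as Sentry from '@sentry/browser';
import { Integrations } from '@sentry/tracing';

Sentry.init({
  dsn: '__YOUR_DSN__',
  tracesSampleRate: 1.0,
  integrations: [
    new Integrations.BrowserTracing({
      // Finish pageload/navigation transactions after this much idle time.
      // Larger values capture more spans but risk losing everything if the
      // tab is closed first; smaller values cut the transaction short.
      idleTimeout: 1000, // milliseconds
    }),
  ],
});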

We counter this limitation by focusing on the LCP Web Vital as the default performance metric for browsers. But, as discussed above, the LCP value may be sent before it's final, making this a less than ideal solution.

In-memory Buffering Affects Servers

As discussed earlier, the current ingestion model requires Sentry SDKs to observe complete span trees in memory. Applications that operate with a constant flow of concurrent transactions will require considerable system resources to collect and process tracing data. Web servers are the typical case that exhibit this problem.

This means that recording 100% of spans and 100% of transactions is not feasible for many server-side applications, because the overhead incurred is just too high.

Inability to Batch Transactions

Sentry's ingestion model does not support ingesting multiple events at once. In particular, SDKs cannot batch multiple transactions into a single request.

As a result, when multiple transactions finish at nearly the same time, SDKs are required to make a separate request for each transaction. This behavior is at best highly inefficient and at worst a significant and problematic drain on resources such as network bandwidth and CPU cycles.

Compatibility

The special treatment of transaction spans is incompatible with OpenTelemetry. Users with existing applications instrumented with OpenTelemetry SDKs cannot easily use Sentry to ingest and analyze their data.

Sentry does provide a Sentry Exporter for the OpenTelemetry Collector, but, due to the current ingestion model, the Sentry Exporter has a major correctness limitation.

Summary

We have learned a lot through building the current tracing implementation at Sentry. This document is an attempt to capture many of the known limitations, in order to serve as the basis for future improvement.

Tracing is a complex subject, and taming that complexity is no easy feat.

Issues in the first group - those related to scope propagation - are a concern exclusive to SDKs and how they are designed. Addressing them will require internal architecture changes to all SDKs, including the redesign of old features like breadcrumbs, but making such changes is a prerequisite for implementing simple-to-use tracing helpers like a trace function that works in any context and captures accurate and reliable performance data. Note that such changes would almost certainly mean releasing new major versions of SDKs that break compatibility with existing versions.

Issues in the second group - those related to the span ingestion model - are a lot more complex, as any changes made to solve them would affect more parts of the product and require a coordinated effort from multiple teams.

Nonetheless, making changes to the ingestion model would have an immeasurable, positive impact on the product, as doing so would improve efficiency, allow us to collect more data, and reduce the cognitive burden of instrumentation.
