Performance Monitoring: Sentry SDK API Evolution

The objective of this document is to contextualize the evolution of the Performance Monitoring features in Sentry SDKs. We start with a summary of how Performance Monitoring was added to Sentry and to SDKs, and, later, we discuss lessons learned in the form of identified issues and the initiatives to address those issues.

Introduction

Back in early 2019, Sentry started experimenting with adding tracing to SDKs. The Python and JavaScript SDKs were the test bed where the first concepts were designed and developed. A proof-of-concept was released on April 29, 2019 and shipped to Sentry on May 7, 2019. Python and JavaScript were obvious choices, because they allowed us to experiment with instrumenting Sentry’s own backend and frontend.

Note that this work was contemporary with the merger of OpenCensus and OpenTracing to form OpenTelemetry. Sentry’s API and SDK implementations borrowed inspiration from pre-1.0 versions of OpenTelemetry, combined with our own ideas. For example, our list of span statuses openly matches those found in the OpenTelemetry specification around the end of 2019.

After we settled on an API, performance monitoring support was expanded to other SDKs. Sentry's Performance Monitoring solution became Generally Available in July 2020. OpenTelemetry's Tracing Specification version 1.0 was released in February 2021.

Our initial implementation reused the mechanisms we had in place for error reporting:

  • The Event type was extended with new fields. Instead of designing and implementing a whole new ingestion pipeline, we could save time and quickly start sending "events" to Sentry; this time they were a new "transaction" event type rather than errors.
  • Since we were just sending a new type of event, the SDK transport layer was also reused.
  • And since we were sharing the ingestion pipeline, we were also sharing storage and much of the processing that happens to all events.

Our implementation evolved such that there was a clear emphasis on the distinction between Transactions and Spans. Part of that was a side effect of reusing the Event interface.

Transactions resonated well with customers. They allowed for important chunks of work in their code to be highlighted, like a browser page load or an HTTP server request. Customers can see and navigate through a list of transactions, while within a transaction the spans give detailed timing for more granular units of work.

In the next section, we’ll discuss some of the shortcomings with the current model.

Identified Issues

While the reuse of the Unified SDK architecture (hubs, clients, scopes) and the transaction ingestion model have merits, experience revealed some issues that we categorize into two groups.

The first group has to do with scope propagation, in essence the ability to determine what the “current scope” is. This operation is required both for manual instrumentation in user code and for automatic instrumentation in SDK integrations.

The second group is for issues related to the wire format used to send transaction data from SDKs to Sentry.

Scope Propagation

This issue is tracked by getsentry/sentry-javascript#3751.

The Unified SDK architecture is fundamentally based on the existence of a hub per unit of concurrency, each hub having a stack of pairs of client and scope. A client holds configuration and is responsible for sending data to Sentry by means of a transport, while a scope holds contextual data that gets appended to outgoing events, such as tags and breadcrumbs.
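
The relationship between hub, client, and scope described above can be pictured with a small model. This is only a sketch under simplifying assumptions: the class shapes, the `transport` callback, and the `sent` array are hypothetical and chosen for illustration; they are not the actual Sentry SDK classes.

```javascript
// Hypothetical, simplified model; not the actual Sentry SDK classes.
class Scope {
  constructor(parent) {
    // Copy contextual data so changes in a pushed scope do not leak to the parent.
    this.tags = parent ? { ...parent.tags } : {};
    this.breadcrumbs = parent ? [...parent.breadcrumbs] : [];
  }
  setTag(key, value) {
    this.tags[key] = value;
  }
  addBreadcrumb(message) {
    this.breadcrumbs.push(message);
  }
}

class Client {
  constructor(transport) {
    this.transport = transport; // function that "sends" an event
  }
  captureEvent(event, scope) {
    // Append contextual data from the scope to the outgoing event.
    this.transport({
      ...event,
      tags: { ...scope.tags },
      breadcrumbs: [...scope.breadcrumbs],
    });
  }
}

class Hub {
  constructor(client) {
    // A stack of (client, scope) pairs; the top entry holds the current scope.
    this.stack = [{ client, scope: new Scope() }];
  }
  top() {
    return this.stack[this.stack.length - 1];
  }
  getScope() {
    return this.top().scope;
  }
  pushScope() {
    const { client, scope } = this.top();
    this.stack.push({ client, scope: new Scope(scope) });
  }
  popScope() {
    if (this.stack.length > 1) this.stack.pop();
  }
  captureMessage(message) {
    const { client, scope } = this.top();
    client.captureEvent({ message }, scope);
  }
}

// Usage: a tag set in a pushed scope is gone once that scope is popped.
const sent = [];
const hub = new Hub(new Client((event) => sent.push(event)));
hub.getScope().setTag('release', '1.0');
hub.pushScope();
hub.getScope().setTag('request', 'abc');
hub.captureMessage('inside');
hub.popScope();
hub.captureMessage('outside');
```

In this model the event captured as 'inside' carries both tags, while the one captured as 'outside' carries only `release`. The model says nothing about how each unit of concurrency gets its own hub, which is the hard part discussed next.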

Every hub knows what the current scope is. It is always the scope on top of the stack. The difficult part is having a hub “per unit of concurrency”.

JavaScript, for example, is single-threaded with an event loop and async code execution. There is no standard way to carry contextual data that works across async calls. So for JavaScript browser applications, there is only one global hub, shared by sync and async code.

A similar situation appears on Mobile SDKs. Users expect contextual data such as tags, the current user, breadcrumbs, and other information stored on the scope to be available and settable from any thread. Therefore, in those SDKs there is only one global hub.

In both cases, everything was relatively fine when the SDK had to deal with reporting errors. With the added responsibility to track transactions and spans, the scope became a poor fit to store the current span, because it limits the existence of concurrent spans.

For Browser JavaScript, a possible solution is the use of Zone.js, part of the Angular framework. The main challenge is that it increases bundle size and may inadvertently impact end user apps, as it monkey-patches key parts of the JavaScript runtime engine.

The scope propagation problem became especially apparent when we tried to create a simpler API for manual instrumentation. The idea was to expose a Sentry.trace function that would implicitly propagate tracing and scope data, and support deep nesting with sync and async code.

As an example, let’s say someone wanted to measure how long searching through a DOM tree took. Tracing this operation would look something like this:

await Sentry.trace(
  {
    op: 'dom',
    description: 'Walk DOM Tree',
  },
  async () => await walkDomTree()
);

With the Sentry.trace function, users wouldn’t have to worry about keeping the reference to the correct transaction or span when adding timing data. Users are free to create child spans within the walkDomTree function and spans would be ordered in the correct hierarchy.

The implementation of the actual trace function is relatively simple (see a PR which has an example implementation). However, knowing what the current span is in async code and global integrations is a challenge yet to be overcome.
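
To make that point concrete, here is a minimal sketch of what such a trace helper could look like. It is not the implementation from the PR mentioned above: `currentSpan`, `startSpan`, and the span object shape are hypothetical, and a single module-level variable stands in for the span stored on the one global scope.

```javascript
// Hypothetical sketch; names and shapes are assumptions, not the Sentry SDK API.
let currentSpan = null; // single global slot, like the span on a single global scope

function startSpan(context, parent) {
  const span = { ...context, parent, children: [], finished: false };
  if (parent) parent.children.push(span);
  return span;
}

function trace(context, callback) {
  const parent = currentSpan;
  const span = startSpan(context, parent);
  currentSpan = span; // nested trace() calls will pick this up as their parent

  const finish = () => {
    span.finished = true;
    currentSpan = parent; // restore the previous span
  };

  let result;
  try {
    result = callback(span);
  } catch (error) {
    finish();
    throw error;
  }
  if (result && typeof result.then === 'function') {
    // Async callback: finish only when the promise settles.
    // With concurrent async work, other code may have replaced currentSpan
    // in the meantime, so this restore step is where the model breaks down.
    return result.then(
      (value) => {
        finish();
        return value;
      },
      (error) => {
        finish();
        throw error;
      }
    );
  }
  finish();
  return result;
}

// Usage with synchronous nesting, where the hierarchy comes out right.
const root = trace({ op: 'task', description: 'outer' }, (outer) => {
  trace({ op: 'dom', description: 'Walk DOM Tree' }, () => {});
  return outer;
});
```

For purely synchronous nesting the sketch produces the expected parent and child relationship. The single `currentSpan` slot is exactly what fails once several async callbacks are in flight at the same time.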

The following two examples synthesize the scope propagation issues.

1. Cannot Determine Current Span

Consider some auto-instrumentation code that needs to get a reference to the current span, a case in which manual scope propagation is not available.

// SDK code
function fetchWrapper(/* ... */) {
  /*
    ... some code omitted for simplicity ...
  */
  const parent = getCurrentHub().getScope().getSpan(); // <1>
  const span = parent.startChild({
    data: { type: 'fetch' },
    description: `${method} ${url}`,
    op: 'http.client',
  });
  try {
    // ...
    // return fetch(...);
  } finally {
    span.finish();
  }
}
window.fetch = fetchWrapper;

// User code
async function f1() {
  const hub = getCurrentHub();
  let t = hub.startTransaction({ name: 't1' });
  hub.getScope().setSpan(t);
  try {
    await fetch('https://example.com/f1');
  } finally {
    t.finish();
  }
}
async function f2() {
  const hub = getCurrentHub();
  let t = hub.startTransaction({ name: 't2' });
  hub.getScope().setSpan(t);
  try {
    await fetch('https://example.com/f2');
  } finally {
    t.finish();
  }
}
Promise.all([f1(), f2()]); // run f1 and f2 concurrently

In the example above, two concurrent fetch requests trigger the execution of the fetchWrapper helper. Line <1> must be able to observe a different span depending on the current flow of execution, leading to two span trees as below:

t1
\
  |- http.client GET https://example.com/f1
t2
\
  |- http.client GET https://example.com/f2

That means that, when f1 is running, parent must refer to t1 and, when f2 is running, parent must be t2. Unfortunately, all code above is racing to update and read from a single hub instance, and thus the observed span trees are not deterministic. For example, the result could incorrectly be:

t1
t2
\
  |- http.client GET https://example.com/f1
  |- http.client GET https://example.com/f2
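
The incorrect outcome can be reproduced without any SDK in a small simulation. The sketch below is an assumption-laden stand-in, not SDK code: `currentSpan` plays the role of the span on the single global scope, `instrumentedCall` plays the role of the wrapped fetch, and an extra `await` is placed before the instrumented call so that the interleaving becomes deterministic.

```javascript
// Simplified simulation; every name here is hypothetical, not SDK code.
let currentSpan = null; // stands in for the span stored on the single global scope

function startTransaction(name) {
  return { name, children: [] };
}

// Stands in for an instrumented call such as the wrapped fetch.
async function instrumentedCall(description) {
  const parent = currentSpan; // reads whatever was set last
  parent.children.push(description);
  await Promise.resolve(); // pretend to do some async work
}

async function run(name, description) {
  const transaction = startTransaction(name);
  currentSpan = transaction;
  await Promise.resolve(); // any await before the call lets other code run
  await instrumentedCall(description);
  return transaction;
}

// Run both flows concurrently and inspect the resulting trees.
const done = Promise.all([
  run('t1', 'GET /f1'),
  run('t2', 'GET /f2'),
]).then(([t1, t2]) => {
  console.log(t1.name, t1.children); // t1 ends up with no children
  console.log(t2.name, t2.children); // t2 ends up with both children
  return [t1, t2];
});
```

Both flows set `currentSpan` before either instrumented call runs, so both calls observe t2 as their parent. In real applications the interleaving depends on timing, which is why the observed trees are not deterministic.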

As a side effect of not being able to correctly determine the current span, the present implementation of the fetch integration (and others) in the JavaScript Browser SDK chooses to create flat transactions, where all child spans are direct children of the transaction (instead of having a proper multi-level tree structure).

Note that other tracing libraries face the same kind of challenge. At the time of writing, there were several open issues in OpenTelemetry for JavaScript related to determining the parent span and proper context propagation (including async code).

2. Conflicting Data Propagation Expectations

There is a conflict of expectations that appears whenever we add a trace function as discussed earlier, or simply try to address scope propagation with Zones.

The fact that the current span is stored in the scope, along with tags, breadcrumbs and more, makes data propagation messy. Some parts of the scope are intended to propagate only into inner function calls (for example, tags), while others are expected to propagate back into callers (for example, breadcrumbs), especially when there is an error.

Here is one example:

function a() {
  trace((span, scope) => {
    scope.setTag('func', 'a');
    scope.setTag('id', '123');
    scope.addBreadcrumb('was in a');
    try {
      b();
    } catch(e) {
      // How to report the SpanID from the span in b?
    } finally {
      captureMessage('hello from a');
      // tags: {func: 'a', id: '123'}
      // breadcrumbs: ['was in a', 'was in b']
    }
  })
}

function b() {
  trace((span, scope) => {
    const fail = Math.random() > 0.5;
    scope.setTag('func', 'b');
    scope.setTag('fail', fail.toString());
    scope.addBreadcrumb('was in b');
    captureMessage('hello from b');
    // tags: {func: 'b', id: '123', fail: ?}
    // breadcrumbs: ['was in a', 'was in b']
    if (fail) {
      throw Error('b failed');
    }
  });
}

In the example above, if an error bubbles up the call stack, we want to be able to report in which span (by referring to a SpanID) the error happened. We want to have breadcrumbs that describe everything that happened, no matter which Zones were executing, and we want a tag set in an inner Zone to override a tag with the same name from a parent Zone, while inheriting all other tags from the parent Zone. Every Zone has its own "current span".

All those different expectations make it hard to reuse, in an understandable way, the current notion of scope, how breadcrumbs are recorded, and how those different concepts interact.
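
One way to picture the three expectations side by side is a small model in which each nested context is created explicitly from its parent. This is only an illustration of the desired behavior: `createRootContext` and `fork` are hypothetical names, and passing contexts around by hand sidesteps the propagation problem rather than solving it.

```javascript
// Hypothetical model of the expectations; not an SDK API.
function createRootContext() {
  return { tags: {}, breadcrumbs: [], span: null };
}

function fork(parent, spanName) {
  return {
    // Tags flow inward only: the child starts from a copy and may override.
    tags: { ...parent.tags },
    // Breadcrumbs flow both ways: every context shares the same array.
    breadcrumbs: parent.breadcrumbs,
    // Each context has its own current span, linked to the parent's span.
    span: { name: spanName, parent: parent.span },
  };
}

const root = createRootContext();

const a = fork(root, 'a');
a.tags.func = 'a';
a.tags.id = '123';
a.breadcrumbs.push('was in a');

const b = fork(a, 'b');
b.tags.func = 'b'; // overrides the inherited value inside b only
b.breadcrumbs.push('was in b');

console.log(a.tags); // func is still 'a', id is '123'
console.log(b.tags); // func is 'b', id is inherited as '123'
console.log(a.breadcrumbs); // contains both breadcrumbs
```

The model shows why a single scope object is an awkward container: tags need copy-on-fork semantics, breadcrumbs need shared semantics, and the current span needs to be per context.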

Span Ingestion Model

Coming soon.
