Backpressure Management

Backend SDKs that are typically used in server environments are expected to implement a component for backpressure management.

This component periodically inspects the SDK for signs of excessive throughput and, when it finds them, temporarily downsamples transactions by halving the sample rate. Once the system has recovered to a healthy state, the SDK reverts to the sample rate set by the user.

The SDK should expose a boolean config parameter called enable_backpressure_handling that controls whether this logic is active or not.

The backpressure component has two main responsibilities:

Periodically schedule a health check in an asynchronous way and update the unhealthy status.
Use this unhealthy status to dynamically halve the effective sample rate for transactions before making the initial sampling decision.

By default, the health check runs once every 10 seconds. Your SDK may expose this interval as a config parameter.

The health check on most SDKs currently tests the following conditions:

The background worker queue is full.
Any rate limits are currently active.

You can add more conditions of high throughput or wasted work if available and easily measurable on your platform.
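As an illustration, the two conditions above can be combined into a single health check. This is a minimal sketch, not the code of any particular SDK: `SdkState`, its fields, and the helper names are hypothetical stand-ins for whatever your SDK exposes internally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SdkState:
    """Hypothetical snapshot of the SDK internals the health check inspects."""
    queue_size: int          # events currently waiting in the background worker queue
    queue_capacity: int      # maximum number of events the queue can hold
    active_rate_limits: int  # number of rate limits currently in effect


def queue_is_full(state: SdkState) -> bool:
    return state.queue_size >= state.queue_capacity


def is_rate_limited(state: SdkState) -> bool:
    return state.active_rate_limits > 0


# Platform-specific conditions (assumed, not prescribed) can be appended here.
UNHEALTHY_CONDITIONS: List[Callable[[SdkState], bool]] = [queue_is_full, is_rate_limited]


def is_healthy(state: SdkState) -> bool:
    """The SDK counts as healthy only if no unhealthy condition holds."""
    return not any(check(state) for check in UNHEALTHY_CONDITIONS)
```

Keeping the conditions in a list makes it straightforward to add further checks without touching `is_healthy`.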

The monitor should run asynchronously. It can use a new thread if the language supports threads, or a setTimeout in runtimes without them, such as Node.js.

See the Python implementation as a reference.
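For languages with threads, the asynchronous monitor might look roughly like the sketch below. The class name `HealthMonitor`, its methods, and the `check_health` callback are assumptions made for illustration; only the 10-second default interval comes from this page.

```python
import threading
from typing import Callable


class HealthMonitor:
    """Hypothetical monitor that runs a health check on a background thread."""

    def __init__(self, check_health: Callable[[], bool], interval: float = 10.0) -> None:
        self._check_health = check_health
        self._interval = interval
        self._healthy = True  # assume healthy until the first check says otherwise
        self._stop = threading.Event()
        # A daemon thread does not keep the process alive on shutdown.
        self._thread = threading.Thread(
            target=self._run, name="backpressure-monitor", daemon=True
        )

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def is_healthy(self) -> bool:
        return self._healthy

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop runs one check per interval until the monitor is stopped.
        while not self._stop.wait(self._interval):
            self._healthy = self._check_health()
```

Waiting on an `Event` rather than calling `time.sleep` lets `stop()` interrupt the wait immediately instead of blocking for up to a full interval.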

The monitor should update its internal health status and expose a downsample_factor that doubles at every health check (every 10 seconds by default) for as long as the system remains unhealthy. Typically the factor is doubled at most 10 times, because by then the effective sample rate has already been divided by 1024.

This creates an exponential backoff behavior and reduces load in the transaction pipeline.
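One way to model the doubling is sketched below. It assumes the factor starts at 1 (no downsampling), doubles on each unhealthy check up to 2**10, and resets to 1 as soon as a check reports healthy. The class and method names are hypothetical, and an SDK may represent the factor differently (for example, as an exponent).

```python
MAX_DOUBLINGS = 10  # after 10 doublings the factor is 2**10 = 1024


class DownsampleState:
    """Hypothetical holder for the health status and the downsample factor."""

    def __init__(self) -> None:
        self.healthy = True
        self.downsample_factor = 1  # 1 means "no downsampling"

    def record_health_check(self, healthy: bool) -> None:
        self.healthy = healthy
        if healthy:
            # Recovery: revert to the user-configured sample rate.
            self.downsample_factor = 1
        elif self.downsample_factor < 2 ** MAX_DOUBLINGS:
            # Still unhealthy: back off exponentially, up to the cap.
            self.downsample_factor *= 2


state = DownsampleState()
factors = []
for healthy in [False, False, False, True]:
    state.record_health_check(healthy)
    factors.append(state.downsample_factor)
print(factors)  # [2, 4, 8, 1]
```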

In your SDK's set_initial_sampling_decision, which is called as part of the start_transaction API, apply this downsample_factor immediately before making the random-number-based sampling decision.

See the Python implementation as a reference.
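A minimal sketch of that step follows, assuming the downsample factor is a plain divisor (1, 2, 4, ...) applied to the configured sample rate. The function name and the injectable `rng` parameter are illustrative choices, not part of any SDK API.

```python
import random


def make_sampling_decision(
    sample_rate: float, downsample_factor: int, rng: random.Random
) -> bool:
    """Return True if the transaction should be sampled."""
    # Apply backpressure first, then draw the random number.
    effective_rate = sample_rate / downsample_factor
    # rng.random() is in [0.0, 1.0), so a rate of 1.0 always samples
    # and a rate of 0.0 never does.
    return rng.random() < effective_rate
```

For example, with a configured sample rate of 0.5 and a factor of 4, the effective rate is 0.125, so roughly one transaction in eight is kept.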

If possible, in transaction.finish, also record a client report with reason backpressure instead of sample_rate when the transaction is dropped so that we can track these backpressure outcome statistics.

See the Python implementation as a reference.
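Choosing the discard reason could be sketched as follows. `ClientReports` and `report_unsampled_transaction` are hypothetical helpers, and the rule "a factor greater than 1 means backpressure was active" assumes the factor is a divisor that rests at 1 when the system is healthy.

```python
from collections import Counter


class ClientReports:
    """Hypothetical in-memory tally of discarded events by reason and category."""

    def __init__(self) -> None:
        self.discarded = Counter()

    def record_lost_event(self, reason: str, data_category: str = "transaction") -> None:
        self.discarded[(reason, data_category)] += 1


def report_unsampled_transaction(reports: ClientReports, downsample_factor: int) -> None:
    # Attribute the drop to backpressure only while downsampling is active;
    # otherwise it was an ordinary sample-rate decision.
    reason = "backpressure" if downsample_factor > 1 else "sample_rate"
    reports.record_lost_event(reason)


reports = ClientReports()
report_unsampled_transaction(reports, downsample_factor=1)
report_unsampled_transaction(reports, downsample_factor=4)
report_unsampled_transaction(reports, downsample_factor=8)
print(reports.discarded[("backpressure", "transaction")])  # 2
print(reports.discarded[("sample_rate", "transaction")])   # 1
```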
