
Conversation

@jack-berg
Collaborator

Was working on improving the performance of opentelemetry-java metrics under high contention, and realized that the same strategy I identified to help over there helps for the prometheus implementation as well!

The idea here is recognizing that Buffer.observationCount is the bottleneck under contention. In contrast to the other histogram / summary LongAdder fields, Buffer.observationCount is an AtomicLong, which performs much worse than LongAdder under high contention. It's necessary that the type is AtomicLong because the CAS APIs accommodate the two-way communication that the record / collect paths need to signal that a collection has started and that all records have successfully completed (preventing partial writes).
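To make that constraint concrete, here is a minimal sketch of the signaling pattern that rules out LongAdder (illustrative only: the class and method names are invented, and the real client_java Buffer differs in detail):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a high bit on a single AtomicLong carries a signal
// while the low bits keep counting observations. LongAdder cannot do
// this, because its sum is spread across cells with no atomic snapshot.
class SignalingCounter {
    static final long ACTIVE_BIT = 1L << 63;
    final AtomicLong state = new AtomicLong();

    // Record path: count the observation and learn whether a collection
    // has started (i.e. whether this record must go into the buffer).
    boolean recordAndCheckActive() {
        long value = state.incrementAndGet();
        return (value & ACTIVE_BIT) != 0;
    }

    // Collect path: set the bit and, in the same atomic step, read how
    // many records preceded the signal.
    long startCollection() {
        return state.getAndAdd(ACTIVE_BIT); // low bits = count before signal
    }
}
```

The key property is that `startCollection()` both flips the signal and captures the pre-signal count in one atomic operation, which is what lets the collector later verify that every in-flight record completed.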

However, we can "have our cake and eat it too" by striping Buffer.observationCount into many instances, such that the contention on any one instance is reduced. This is actually what LongAdder does under the covers. This implementation stripes it into Runtime.getRuntime().availableProcessors() instances, and uses Thread.currentThread().getId() % stripedObservationCounts.length to select which instance any particular record thread should use.
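As a standalone sketch, the striping scheme looks roughly like this (the class name is illustrative, and the real Buffer also folds the active-bit signaling into each stripe):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the striping idea: one AtomicLong per processor,
// each recording thread picks a stripe from its thread id, and
// collection sums (and resets) all stripes.
class StripedCounter {
    private final AtomicLong[] stripes;

    StripedCounter() {
        stripes = new AtomicLong[Runtime.getRuntime().availableProcessors()];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new AtomicLong();
        }
    }

    long increment() {
        // floorMod guards against a negative index if the id overflows int.
        int index = Math.floorMod((int) Thread.currentThread().getId(), stripes.length);
        return stripes[index].incrementAndGet();
    }

    long sumAndReset() {
        long sum = 0;
        for (AtomicLong stripe : stripes) {
            sum += stripe.getAndSet(0);
        }
        return sum;
    }
}
```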

Performance increase is substantial. Here's the before and after of HistogramBenchmark on my machine (Apple M4 Mac Pro w/ 48gb RAM):

Before:

Benchmark                                     Mode  Cnt      Score      Error  Units
HistogramBenchmark.openTelemetryClassic      thrpt   25   1138.465 ±  165.921  ops/s
HistogramBenchmark.openTelemetryExponential  thrpt   25    677.483 ±   28.765  ops/s
HistogramBenchmark.prometheusClassic         thrpt   25   5126.048 ±  153.878  ops/s
HistogramBenchmark.prometheusNative          thrpt   25   3854.323 ±  107.789  ops/s
HistogramBenchmark.simpleclient              thrpt   25  13285.351 ± 1784.506  ops/s

After:

Benchmark                                     Mode  Cnt      Score      Error  Units
HistogramBenchmark.openTelemetryClassic      thrpt   25    925.528 ±   13.744  ops/s
HistogramBenchmark.openTelemetryExponential  thrpt   25    584.404 ±   32.762  ops/s
HistogramBenchmark.prometheusClassic         thrpt   25  14623.971 ± 2117.588  ops/s
HistogramBenchmark.prometheusNative          thrpt   25   7405.672 ±  857.611  ops/s
HistogramBenchmark.simpleclient              thrpt   25  13102.822 ± 3081.096  ops/s

@jack-berg jack-berg force-pushed the stripe-buffer-observation-counts branch from 8cd7e50 to 4c2146c on January 21, 2026 at 14:35
…bservationCount

Signed-off-by: Jack Berg <34418638+jack-berg@users.noreply.github.com>
@jack-berg jack-berg force-pushed the stripe-buffer-observation-counts branch from 4c2146c to ad1e7cb on January 21, 2026 at 14:39
long count = observationCount.incrementAndGet();
AtomicLong observationCountForThread =
stripedObservationCounts[
((int) Thread.currentThread().getId()) % stripedObservationCounts.length];
Collaborator Author


One thing to consider is whether there is a better algorithm for accessing one of these striped instances. With this algorithm based on thread id, it's conceivable that the thread ids could be arranged in such a way that there end up being "hot spots" (i.e. AtomicLong instances that receive a disproportionate share of the updates). LongAdder solves this by using a random number generator to select which "cell" to access.

Collaborator Author


An alternative to using the thread id to stripe is to use ThreadLocalRandom.current().nextInt() % stripedObservationCounts.length. Is there any issue with using ThreadLocalRandom here?
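One caveat with that exact expression: ThreadLocalRandom.current().nextInt() spans the full int range, and Java's % keeps the sign of the dividend, so the raw modulo can produce a negative array index. The bounded overload nextInt(bound) sidesteps this (a sketch, not the PR's code):

```java
import java.util.concurrent.ThreadLocalRandom;

public class StripeIndex {
    // Raw nextInt() can be negative, and Java's % keeps the dividend's
    // sign, so nextInt() % stripeCount may yield a negative index.
    static int riskyIndex(int stripeCount) {
        return ThreadLocalRandom.current().nextInt() % stripeCount;
    }

    // nextInt(bound) always returns a value in [0, stripeCount).
    static int safeIndex(int stripeCount) {
        return ThreadLocalRandom.current().nextInt(stripeCount);
    }
}
```

Beyond the sign issue, ThreadLocalRandom.current() is cheap (it reads per-thread state), so the main trade-off versus thread-id selection is losing stripe affinity: a thread no longer hits the same cell consistently, which can cost some cache locality.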

@jack-berg jack-berg requested review from fstab and zeitlinger and removed request for fstab January 21, 2026 15:50
@zeitlinger zeitlinger requested a review from Copilot January 26, 2026 08:36
Contributor

Copilot AI left a comment


Pull request overview

This pull request improves the performance of Histogram and Summary metrics under high contention by implementing striping for the observationCount field in the Buffer class. The optimization replaces a single AtomicLong with an array of AtomicLong instances, distributing contention across multiple counters based on thread ID.

Changes:

  • Introduced stripedObservationCounts array sized by available processors to distribute contention
  • Modified append() method to select a stripe based on thread ID for lock-free counting
  • Updated run() method to aggregate counts across all stripes during collection


Comment on lines +38 to +40
AtomicLong observationCountForThread =
stripedObservationCounts[
((int) Thread.currentThread().getId()) % stripedObservationCounts.length];

Copilot AI Jan 26, 2026


Casting thread ID to int before taking modulo can cause issues. Thread.getId() returns a long value, and casting it to int can result in negative values due to overflow, which will cause ArrayIndexOutOfBoundsException when used as an array index after modulo operation with negative dividend. Use Math.abs() on the casted int value or use a different approach like (int) (threadId & Integer.MAX_VALUE) to ensure a non-negative index, or better yet, use Long.hashCode(Thread.currentThread().getId()) which properly handles the conversion.

Suggested change
AtomicLong observationCountForThread =
stripedObservationCounts[
((int) Thread.currentThread().getId()) % stripedObservationCounts.length];
int stripeIndex =
(Long.hashCode(Thread.currentThread().getId()) & Integer.MAX_VALUE)
% stripedObservationCounts.length;
AtomicLong observationCountForThread = stripedObservationCounts[stripeIndex];
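A quick check of the behavior the suggestion guards against (the thread id value here is hypothetical; real ids only exceed Integer.MAX_VALUE on very long-lived JVMs):

```java
public class ThreadIdModulo {
    public static void main(String[] args) {
        long id = 2147483649L;      // 2^31 + 1: a long id that overflows int
        // (int) id == -2147483647, and Java's % keeps the sign: naive == -7,
        // which would throw ArrayIndexOutOfBoundsException as an index.
        int naive = ((int) id) % 8;
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // result is always in [0, 8) — here it is 1.
        int masked = (Long.hashCode(id) & Integer.MAX_VALUE) % 8;
        System.out.println(naive + " " + masked);
    }
}
```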

Collaborator Author


Yeah, I should fix that and was planning to before merging.

expectedBufferSize =
(int) ((observationCount.getAndSet(0) & ~bufferActiveBit) - expectedCount);
for (AtomicLong observationCount : stripedObservationCounts) {
expectedBufferSize += (int) (observationCount.getAndSet(0) & ~bufferActiveBit);

Copilot AI Jan 26, 2026


Unnecessary and potentially lossy cast to int. The observationCount values are being cast to int before being added to expectedBufferSize (which is a long). This cast truncates the value to 32 bits, which could cause incorrect calculations if any individual stripe's count exceeds Integer.MAX_VALUE. Although unlikely in practice due to striping, the cast should be removed to maintain correctness. The masking operation (~bufferActiveBit) already produces the correct count value as a long.
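The truncation described here is easy to demonstrate (the count value is illustrative):

```java
public class LossyCast {
    public static void main(String[] args) {
        long count = 5_000_000_000L; // exceeds Integer.MAX_VALUE
        long lossy = (int) count;    // truncated to the low 32 bits: 705032704
        long exact = count;          // no cast: value preserved
        System.out.println(lossy + " " + exact);
    }
}
```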

Collaborator Author


Existing but will fix

} else {
expectedBufferSize = (int) (observationCount.addAndGet(bufferActiveBit) - expectedCount);
for (AtomicLong observationCount : stripedObservationCounts) {
expectedBufferSize += (int) observationCount.addAndGet(bufferActiveBit);

Copilot AI Jan 26, 2026


Unnecessary and potentially lossy cast to int. The observationCount values are being cast to int before being added to expectedBufferSize (which is a long). This cast truncates the value to 32 bits, which could cause incorrect calculations if any individual stripe's count exceeds Integer.MAX_VALUE. Although unlikely in practice due to striping, the cast should be removed to maintain correctness.

try {
// Signal that the buffer is active.
Long expectedCount = observationCount.getAndAdd(bufferActiveBit);
Long expectedCount = 0L;

Copilot AI Jan 26, 2026


The variable 'expectedCount' is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Long'.

Suggested change
Long expectedCount = 0L;
long expectedCount = 0L;

Collaborator Author


Existing but will fix
