Improve histogram, summary performance under contention by striping observationCount #1794
Conversation
Signed-off-by: Jack Berg <34418638+jack-berg@users.noreply.github.com>
```diff
- long count = observationCount.incrementAndGet();
+ AtomicLong observationCountForThread =
+     stripedObservationCounts[
+         ((int) Thread.currentThread().getId()) % stripedObservationCounts.length];
```
One thing to consider is whether there is a better algorithm for accessing one of these striped instances. With this algorithm based on thread id, it's conceivable that the thread ids could be arranged in such a way that there end up being "hot spots" (i.e. AtomicLong instances that receive a disproportionate share of the updates). LongAdder solves this by using a random number generator to select which "cell" to access.
An alternative to using the thread id to stripe is to use ThreadLocalRandom.current().nextInt() % stripedObservationCounts.length. Is there any issue with using ThreadLocalRandom here?
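The ThreadLocalRandom alternative could be sketched as below. One caveat worth noting: `nextInt()` may return a negative value, and Java's `%` preserves the sign of the dividend, so this sketch uses `Math.floorMod` to guarantee a valid index. The `stripes` array here is hypothetical, standing in for `stripedObservationCounts`:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class StripeSelection {
    public static void main(String[] args) {
        // Hypothetical stripe array standing in for stripedObservationCounts.
        AtomicLong[] stripes = new AtomicLong[Runtime.getRuntime().availableProcessors()];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new AtomicLong();
        }

        // nextInt() may be negative, and Java's % keeps the dividend's sign,
        // so floorMod is used to guarantee an index in [0, stripes.length).
        int index = Math.floorMod(ThreadLocalRandom.current().nextInt(), stripes.length);
        stripes[index].incrementAndGet();

        // Collection path: sum across all stripes.
        long total = 0;
        for (AtomicLong stripe : stripes) {
            total += stripe.get();
        }
        System.out.println(total); // 1
    }
}
```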
Pull request overview
This pull request improves the performance of Histogram and Summary metrics under high contention by implementing striping for the observationCount field in the Buffer class. The optimization replaces a single AtomicLong with an array of AtomicLong instances, distributing contention across multiple counters based on thread ID.
Changes:
- Introduced `stripedObservationCounts` array sized by available processors to distribute contention
- Modified `append()` method to select a stripe based on thread ID for lock-free counting
- Updated `run()` method to aggregate counts across all stripes during collection
```java
AtomicLong observationCountForThread =
    stripedObservationCounts[
        ((int) Thread.currentThread().getId()) % stripedObservationCounts.length];
```
Copilot (AI) · Jan 26, 2026
Casting thread ID to int before taking modulo can cause issues. Thread.getId() returns a long, and casting it to int can yield a negative value due to overflow; since Java's % operator preserves the sign of the dividend, the resulting negative index causes an ArrayIndexOutOfBoundsException. Use Math.abs() on the cast value, or a different approach like (int) (threadId & Integer.MAX_VALUE) to ensure a non-negative index, or better yet use Long.hashCode(Thread.currentThread().getId()), which properly handles the conversion.
Suggested change:

```java
int stripeIndex =
    (Long.hashCode(Thread.currentThread().getId()) & Integer.MAX_VALUE)
        % stripedObservationCounts.length;
AtomicLong observationCountForThread = stripedObservationCounts[stripeIndex];
```
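A small sketch of why the sign-bit mask matters: when the low 32 bits of the thread id happen to have the sign bit set, the `(int)` cast goes negative and `%` stays negative, while `& Integer.MAX_VALUE` clears the sign bit first. The specific id value below is made up for illustration:

```java
public class NegativeModuloDemo {
    public static void main(String[] args) {
        // A hypothetical long thread id whose low 32 bits have the sign bit set.
        long threadId = 0xFFFF_FFF9L;
        int cast = (int) threadId;
        System.out.println(cast);     // -7: the cast overflows to a negative int
        System.out.println(cast % 4); // -3: Java's % keeps the dividend's sign,
                                      // so this is invalid as an array index

        // Masking clears the sign bit before the modulo, keeping the index valid.
        System.out.println((cast & Integer.MAX_VALUE) % 4); // 1
    }
}
```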
Yeah, should and was planning to fix before merging.
```diff
- expectedBufferSize =
-     (int) ((observationCount.getAndSet(0) & ~bufferActiveBit) - expectedCount);
+ for (AtomicLong observationCount : stripedObservationCounts) {
+   expectedBufferSize += (int) (observationCount.getAndSet(0) & ~bufferActiveBit);
```
Copilot (AI) · Jan 26, 2026
Unnecessary and potentially lossy cast to int. The observationCount values are being cast to int before being added to expectedBufferSize (which is a long). This cast truncates the value to 32 bits, which could cause incorrect calculations if any individual stripe's count exceeds Integer.MAX_VALUE. Although unlikely in practice due to striping, the cast should be removed to maintain correctness. The masking operation (~bufferActiveBit) already produces the correct count value as a long.
Existing but will fix
```diff
  } else {
-   expectedBufferSize = (int) (observationCount.addAndGet(bufferActiveBit) - expectedCount);
+   for (AtomicLong observationCount : stripedObservationCounts) {
+     expectedBufferSize += (int) observationCount.addAndGet(bufferActiveBit);
```
Copilot (AI) · Jan 26, 2026
Unnecessary and potentially lossy cast to int. The observationCount values are being cast to int before being added to expectedBufferSize (which is a long). This cast truncates the value to 32 bits, which could cause incorrect calculations if any individual stripe's count exceeds Integer.MAX_VALUE. Although unlikely in practice due to striping, the cast should be removed to maintain correctness.
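The truncation this comment warns about can be demonstrated in isolation. A sketch (the variable names mirror the PR's code but the count value is made up to exceed `Integer.MAX_VALUE`):

```java
public class LossyCastDemo {
    public static void main(String[] args) {
        long count = 5_000_000_000L; // hypothetical stripe count > Integer.MAX_VALUE
        long expectedBufferSize = 0;

        // Lossy: the (int) cast truncates count to its low 32 bits
        // before the addition, even though the target is a long.
        expectedBufferSize += (int) count;
        System.out.println(expectedBufferSize); // 705032704, not 5000000000

        // Without the cast, the long value is added intact.
        expectedBufferSize = 0;
        expectedBufferSize += count;
        System.out.println(expectedBufferSize); // 5000000000
    }
}
```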
```diff
  try {
    // Signal that the buffer is active.
-   Long expectedCount = observationCount.getAndAdd(bufferActiveBit);
+   Long expectedCount = 0L;
```
Copilot (AI) · Jan 26, 2026
The variable 'expectedCount' is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Long'.
Suggested change:

```java
long expectedCount = 0L;
```
Existing but will fix
Was working on improving the performance of opentelemetry-java metrics under high contention, and realized that the same strategy I identified to help over there helps for the prometheus implementation as well!

The idea here is recognizing that `Buffer.observationCount` is the bottleneck under contention. In contrast to the other histogram / summary `LongAdder` fields, `Buffer.observationCount` is an `AtomicLong`, which performs much worse than `LongAdder` under high contention. It's necessary that the type is `AtomicLong` because the CAS APIs accommodate the two-way communication that the record / collect paths need to signal that a collection has started and all records have successfully completed (preventing partial writes).

However, we can "have our cake and eat it too" by striping `Buffer.observationCount` into many instances, so that the contention on any one instance is reduced. This is actually what `LongAdder` does under the covers. This implementation stripes it into `Runtime.getRuntime().availableProcessors()` instances, and uses `Thread.currentThread().getId() % stripedObservationCounts.length` to select which instance any particular record thread should use.

The performance increase is substantial. Here's the before and after of `HistogramBenchmark` on my machine (Apple M4 Mac Pro w/ 48GB RAM):

Before:

After:
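The striping idea described above can be sketched as a minimal, self-contained counter. This is illustrative only, not the actual `Buffer` implementation: the class name is hypothetical, and the real code also packs a `bufferActiveBit` flag into the counter for record/collect signaling, which this sketch omits. `Math.floorMod` is used here to keep the index valid even if the cast thread id is negative:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: spread increments across several AtomicLongs
// indexed by thread id, then sum over all stripes on collection.
public class StripedCounter {
    private final AtomicLong[] stripes;

    public StripedCounter() {
        int n = Runtime.getRuntime().availableProcessors();
        stripes = new AtomicLong[n];
        for (int i = 0; i < n; i++) {
            stripes[i] = new AtomicLong();
        }
    }

    public void increment() {
        // floorMod guards against a negative value after the int cast.
        int index = Math.floorMod((int) Thread.currentThread().getId(), stripes.length);
        stripes[index].incrementAndGet();
    }

    public long sum() {
        long total = 0;
        for (AtomicLong stripe : stripes) {
            total += stripe.get();
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        StripedCounter counter = new StripedCounter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    counter.increment();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(counter.sum()); // 40000
    }
}
```

Each thread hammers on (at most) one stripe, so CAS retries fall roughly in proportion to the stripe count, while the collect path pays only a cheap sum over `availableProcessors()` cells.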