November 26, 2025, 5:59pm 1
Uniformization of bounds checking
This post documents my plans for improving the checkers that report out-of-bounds errors.
@NoQ @Xazax-hun @steakhal and anybody else:
If you have any high-level feedback (suggestions, concerns, plans etc.), please share it here. If this rough overview looks acceptable, I’ll dive into creating concrete commits and uploading them for review.
I also have one concrete question: What should the analyzer do if it sees pointer arithmetic that produces an out-of-bounds pointer (which isn’t the legitimate past-the-end pointer, but a “really” out-of-bounds one)? This is undefined behavior according to the standard, but intuitively isn’t as serious as actually accessing out-of-bounds memory.
Overview of relevant checkers
The checker security.ArrayBound (previously known as alpha.security.ArrayBoundV2) has sophisticated logic for determining whether the used memory location is out-of-bounds, but only applies this in the “basic” context where “use” means the subscript, dereference or arrow operator.
We also have several other out-of-bounds checkers:
- alpha.security.ReturnPtrRange reports situations when a function returns an out-of-bounds pointer.
- alpha.unix.cstring.OutOfBounds reports out-of-bounds access by C string functions (including raw memory handling like memcpy).
- alpha.unix.cstring.BufferOverlap reports use of overlapping buffers in memcpy and similar functions. This is not exactly an out-of-bounds checker, but it needs to perform similar checks, so perhaps it’s worth thinking about sharing logic with it.
- alpha.core.PointerArithm is currently NOT a bounds checker (it just reports non-array locations participating in pointer arithmetic) but does similar things, and there may be plans to generalize it for some sort of bounds checking.
- alpha.cplusplus.IteratorRange reports iterators used outside their valid ranges. (I’m not familiar with this checker, and AFAIK all the iterator checkers have serious technical debt.)
- osx.coreFoundation.containers.OutOfBounds is presumably another checker that does bounds checking, but I know nothing about it.
- Honourable mention goes to the function isArrayIndexOutOfBounds() within UndefResultChecker.cpp, which is only used to add " due to array index out of bounds" to the “The left/right operand of ‘X’ is a garbage value” reports.
Currently (November 2025) none of these checkers use the logic of security.ArrayBound; and many of them use the “traditional” logic, which was used in the old alpha.security.ArrayBound (V1) before its removal. (See appendix 1 for a description of this “traditional” logic and its drawbacks.)
Anatomy of a bounds check
Bounds checking can happen in two kinds of situations:
- Actual access, where an out-of-bounds offset triggers significant issues (e.g. pointer dereference, array subscripting, access through memcpy and similar functions).
- Hypothetical access, where the out-of-bounds offset is suspicious but doesn’t cause immediate problems (e.g. performing pointer arithmetic, returning a pointer, &array[idx] expressions).
  - In these situations it’s well-defined to create a past-the-end pointer.
  - Forming other out-of-bounds pointers is still undefined behavior (but is intuitively “less severe” than actually dereferencing them).
The bounds check can emit three kinds of output:
- A. Bug report: when it finds an error, it should create a bug report:
- the message should describe the accessed memory region, the index/offset (if concrete) and the extent of the memory region (if concrete and the error is an overflow)
- the relevant values (index/offset, extent) should be marked as interesting if they are symbolic
- if the memory area is not dynamically allocated, we should mark its declaration with a note tag (security.OutOfBounds doesn’t do this yet, but alpha.security.ReturnPtrRange has this feature and I think we should adapt it)
- the error node may be fatal or non-fatal
  - actual out-of-bounds access is definitely fatal; and although hypothetical out-of-bounds access is also undefined behavior, perhaps it should be a non-fatal error node
- B. Assumption: In some ambiguous situations (access may be legal, may be out of bounds), it is useful to assume that the access is legal and:
- produce a new state where the index/offset is definitely in bounds;
- attach a note tag that describes this assumption.
- (see Appendix 2 for two code examples where the assumptions are helpful).
- C. Nothing: In other cases (e.g. the access is definitely in bounds) the bounds check returns without doing anything.
To produce this output, the bounds checking algorithm consumes lots of input parameters:
- The ProgramState.
- The access kind: actual or hypothetical.
- Message fragment strings that describe the access operation (e.g. to distinguish a simple pointer dereference from a memcpy call).
- Strings that describe the accessed memory region (for message generation).
- The extent/size of the accessed memory region: either an SVal or a concrete number.
- The offset/index that is being accessed: either an SVal or a concrete number.
In many cases (e.g. dereferencing a pointer) the last three parameters (the accessed region, its extent, and the offset) are all calculated from a single location SVal (i.e. pointer) by decomposing it into a {base region, offset} pair. However, there are also situations (e.g. applying the array subscript operator to an array with a statically known size) where these parameters are known separately, as e.g. PR#159357 by Alejandro Alvarez proposes.
The bounds checking algorithm (as implemented in security.ArrayBound (V2)) is composed of two or three subroutine calls:
- the first one checks the lower bound (index >= 0),
- the second checks the upper bound (index < size),
- finally – if needed – the third one checks whether the pointer is a past-the-end pointer (index == size).
This comparison subroutine is itself nontrivial – in fact, I’d say that it is the core innovation of the V2 array bounds checker. The main difficulty is that it needs to approximate a proper mathematical comparison with evalBinOp – and evalBinOp messes up everything with overflows and type conversions. The current implementation is “good enough” for use in production, but I have several long-term plans that would improve accuracy in a few corner cases.
The algorithm is complicated by several additional aspects:
- The offset/index and the extent/size may be measured either in bytes (char units) or as element counts. The current implementation converts everything to bytes for the calculations, then converts back to element counts when it generates the messages (when possible), but it would be better to do the calculations/comparisons on element counts when that’s possible.
  - In many cases it would be better to e.g. assume x < y instead of 4*x < 4*y (where the ‘4’ is sizeof(int)). The analyzer cannot see the connection between these two relationships (because strictly speaking, they are not equivalent due to overflow) – and x < y is the one that would probably occur e.g. in a conditional statement.
  - If we convert the byte offsets to element counts (instead of the other way around), it would be more natural to handle cases where the extent is not a multiple of the element size (round down in the division). By the way, currently the analyzer doesn’t report overflows if the beginning of the accessed object fits into the memory area.
  - Avoiding the back-and-forth conversion might also simplify the code.
- When the offset or the extent is tainted, the analyzer should switch to a “pessimistic” mode where ambiguous situations are also reported. These reports need different messages and we are planning to reassign them to new separate checker(s) under optin.taint.*.
- As the comparisons are made, the code must record the assumptions. This is needed for two things:
  - composing the message when the check finishes with an assumption (output B);
  - distinguishing “underflow or overflow” reports from “definitely overflow” reports.
- There are also a few tricks that make the message easier to read.
Suggested architecture
As each bounds checker needs most of the complications (which are described in the previous section), it is important to reuse the same code in all of them. There are several “customization points” where the various checkers need to introduce different behavior, so the natural architecture would be a base class and several subclasses.
As a rough draft I propose the following interface:
struct PtrDiff {
std::variant<NonLoc, int64_t> Amount; // or maybe just a NonLoc if we wrap known integers into nonloc::ConcreteInt
std::optional<Type> ElemType; // nullopt means byte offsets
bool isTainted(/* suitable context */) const;
};
// When a checker performs a bounds check, it constructs an instance of a
// subclass of this class, then calls the "main" method
// `performCheckAndTransition()` on it.
class BoundsValidatorBase {
protected:
BoundsValidatorBase(CheckerContext &C, ProgramStateRef State, std::string
AccessedRegionName, PtrDiff Extent, PtrDiff Offset)
: /* ... save everything into data members ... */ {}
////// Customization points -- defined by subclasses /////
virtual bool isActualAccess() const = 0;
// Probably there will be several methods of this kind
virtual std::string describeAccessOperation() const = 0;
virtual const BugType *getBugType(bool isTaintReport) const = 0;
// The constructor is also a customization point: subclasses should define
// their own public constructors with different parametrizations.
////// Shared logic /////
public:
// The "main" function: performs the bounds check, then emits the bug report
// or assumption (if needed).
void performCheckAndTransition(ExplodedNode *N);
protected:
// The comparison subroutine which already exists in ArrayBoundChecker, but
// should be generalized a bit to handle the PtrDiff type.
static std::pair<ProgramStateRef, ProgramStateRef>
compareValueToThreshold(ProgramStateRef State, PtrDiff Value,
PtrDiff Threshold, SValBuilder &SVB,
bool CheckEquality=false);
StateUpdateReporter SUR;
};
I’m planning to introduce this gradually:
- Add a new source file and header in the Core (analyzer engine) directory and move the logic from ArrayBoundChecker.cpp to these new files. (NFC)
- Generalize and improve this new bounds checking logic to make it ready for use in other checkers (e.g. support doing the calculations with indices and not byte offsets).
  - These may slightly improve the quality of security.ArrayBound results as well.
- Reimplement the other bounds checkers as wrappers around the shared bounds checking library.
Appendix 1: The “traditional” logic
Several checkers (ReturnPtrRange, cstring.OutOfBounds, isArrayIndexOutOfBounds()) use the “traditional” logic, which is copy-pasted from the old alpha.security.ArrayBound (V1) checker. Note that this V1 checker produced too many false positives, so it was deleted from the codebase when security.ArrayBound (originally alpha.security.ArrayBoundV2) became stable.
These checkers use the method ProgramState::assumeInBoundDual() as the “backend” of bounds checking, which relies on a weird mathematical trick: to check 0 <= Idx < UpperBound, it assume()s that Idx + MIN < UpperBound + MIN, where MIN is the minimal value of the array index type (which is always long long).
- This exploits that the analyzer models signed integer overflow as if it were well-defined.
  - I had vague plans for e.g. introducing an --assume-no-signed-overflow flag for the analyzer (because that would get rid of lots of false positives), but this abuse of the overflow modeling prevents that.
- The comparison may be a signed comparison (by default) or an unsigned comparison (when either Idx or UpperBound is an unsigned long long, which takes precedence).
- The mathematical trick is logically correct with a signed comparison, but with the unsigned comparison it sometimes produces incorrect results: e.g. Idx == -3 and UpperBound == 5ULL would produce an “in bounds” result.
- If one of Idx and UpperBound is concrete, while the other is symbolic, then the analyzer (hopefully) can follow the additions and deduce the correct range for the symbol.
- However, if Idx and UpperBound are both symbolic, then I’m pretty sure that the analyzer won’t see the connection between e.g. Idx < UpperBound (which could appear e.g. in an if) and Idx + MIN < UpperBound + MIN (which is assumed during the bounds checking).
I think it’s important to switch to using the approach of security.ArrayBound (V2) with these checkers. We should also consider eventually eliminating ProgramState::assumeInBoundDual() from the codebase because its tricky approach causes too many problems.
Appendix 2. Why emit assumptions?
Assuming that the access is in bounds (when it may be in bounds and the index is not tainted) can improve the quality of results. The following code shows two concrete examples:
int TenElements[10];
int example1(int arg) {
// We need to save our assumptions about the value of 'arg' to deduce that
// these two can't be in bounds simultaneously:
return TenElements[arg] + TenElements[arg + 10];
}
int example2(int arg, int arg2, bool flag) {
if (flag) {
TenElements[arg] = arg2;
}
// ...
if (arg > 20) {
// If an execution path where flag was true reaches this point, then the
// program is already in an undefined corrupted state (after overflowing
// TenElements) and this overflow is not modeled by the analyzer.
// Results from these paths would be inaccurate, so it is better if
// TenElements[arg] produces an assumption that prunes these paths.
}
  return 0; // the return value is irrelevant to the example
}
If I recall correctly, @steakhal mentioned that the additional assumptions previously caused slowdowns in some cases (by slowing down the Z3 refutation in outlier cases), but if I understand correctly that issue has since been mitigated in other ways, so it isn’t relevant anymore.
Xazax-hun November 27, 2025, 11:14am 2
Deduplicating and centralizing bounds checking sounds like a good idea to me. But is reusing a library in different checks the right way? Or could we keep all logic a single modeling checker that could potentially raise some events that other checks could respond to? Similar to how InnerPointerChecker and MallocChecker interacts. E.g., a checker can emit “hypothetical” or “actual” accesses with all the necessary inputs that would be validated by a single modeling checker.
Deduplicating and centralizing bounds checking sounds like a good idea to me. But is reusing a library in different checks the right way? Or could we keep all logic a single modeling checker that could potentially raise some events that other checks could respond to?
I’m confident that reusing a library is the most natural approach in this situation, because the common logic is in the “middle” of the bounds checking procedure:
- different bounds checkers react to different situations with different callbacks,
- then they call the same “is it in bounds?” logic,
- finally, if they find an error, then the bug report creation procedure (and the bug type and the checker reporting the error) is again different.
However, if you highlight relevant advantages of some other approach, then I’m open to switching to it.
Similar to how InnerPointerChecker and MallocChecker interacts. E.g., a checker can emit “hypothetical” or “actual” accesses with all the necessary inputs that would be validated by a single modeling checker.
I’m familiar with the logic of InnerPointerChecker and I’d say that it is more complex than the library that I’m suggesting. The “trick” of InnerPointerChecker is that in fact there are two separate checker frontends that are both called InnerPointerChecker:
- One is the standalone checker in InnerPointerChecker.cpp, which registers some callbacks and (within those callbacks) invokes some functions defined in MallocChecker.cpp – but never creates any bug reports.
- The second is a checker frontend within the checker family MallocChecker, which does the actual error reporting.
If we copied this model with bounds checking, we would introduce:
- one standalone checker class for each bounds checking checker (ArrayBound, ReturnPtrRange etc.) – these register callbacks to perform step 1, but never report errors
- a central checker family that contains the shared code (step 2.)
- and within this family one frontend for each bounds checking checker (to actually report the errors, step 3.).
For use-after-free errors (which include inner pointer invalidation), IIUC there are some checker callbacks that are used by multiple subcheckers and therefore naturally belong to the central checker family. (And it happens that, except for InnerPointerChecker, the other use-after-free checkers are defined “inline” within the checker family and don’t have separate standalone parts.)
On the other hand, detecting out-of-bounds access doesn’t need custom state tracking (it only relies on numerical constraints and dynamic extent info), so each bounds checking checker relies on its own distinct callbacks and there wouldn’t be any shared callbacks that need to be in the central checker family of the bounds checking checkers.
This means that in the bounds-checking case the central checker family can be simplified into a plain library (like the one that I propose) if the error reporting simply uses the checkers that are already needed for their callbacks (instead of introducing the hacky “duplicated” checker frontends that share the name of the standalone checker).
steakhal November 28, 2025, 11:54am 4
IMO hoisting all of these implementations into something more generic makes sense.
Landing fixes would benefit all the rest.
I also liked that you want to provide some knobs to tune the behaviour.
Speaking of the convoluted behaviour of actual and hypothetical bounds checks, I don’t have much to say. I probably lack the context right now to form a sound opinion.
What I definitely agree with is that it’s messy.
I also favour the removal of assumeInBoundDual().
I think the emission of the assumptions also needs to be adjustable. We probably want to emit them, but I could see reasons not to. In other words, there are situations where the use of a given construct (such as a subscript) shouldn’t be used to infer constraints, because it’s only a weak hint from the user. A tainted index might be an example: after the subscript I’d prefer not to assume bounds on the tainted index.
That’s right. They are mitigated.
Xazax-hun November 28, 2025, 12:56pm 5
The main reason I have mixed feelings about the proposed architecture is because we do not really use inheritance for sharing implementation across checkers that much at the moment. We have a slightly different model where the checkers subscribe to interesting events and observe the program state for interesting patterns. I think it would be great if we could come up with an API that does not feel very different from how we deal with the existing APIs.
So my idea would be more along the lines of:
- Some accesses might not be clear from the source code, so checkers need to be able to signal it to the analyzer that an access at a certain offset is happening into some memory region (and whether it is actual or hypothetical in your terminology).
- The analyzer has to be able to signal back whether the access is safe, unsafe, or provably wrong, and the checker needs to be able to act accordingly (e.g., generating sinks, reporting bugs).
For example, we do not have a base class for dealing with taints. There are some APIs for the checkers to query the taint information instead.
I just want to be sure if we diverge from the existing patterns there is a strong reason to do so and first we considered trying to build something that feels at home as in the style of existing APIs.
necto November 28, 2025, 4:04pm 6
Thank you for raising the initiative, Donat!
I agree with the premise. I have one concern and one opinion to share.
I believe that in general, if we can sensibly continue simulation after encountering a bug, we should. This comes from many of our customers who are surprised to see the first bug detected and the second one “missed”.
In this case, I agree with you: even if it is strictly speaking UB, continuing with the invalid pointer until it is actually de-referenced sounds sensible and should not lead to FPs. In my opinion, we should emit a non-fatal bug report and continue analysis with the invalid pointer.
Now my concern is with the architecture you propose. Admittedly, I did not spend enough time deeply considering your proposal, but I can’t easily see the need for inheritance here. On the other hand, inheritance makes coupling tighter and execution less intuitive. Why would a set of library functions for sharing the bounds-checking logic not suffice?
Thinking superficially, I imagine we could have:
- the bounds checking function, similar to assumeInBounds but with the better implementation as found in ArrayBoundChecker now,
- something like a “StateRef introduce-in-bounds-assumptions(State, …)” function that interested checkers could call when relevant,
- a bug-report visitor that adds the notes for the bounds-related assumptions.
My intuition is that if nothing is tainted, then for a memory access event X the following two logical properties are equivalent:
- if X is definitely out of bounds, then we want to produce a bug report;
- if X is ambiguous (may be in bounds, may be out of bounds), then we want to emit an assumption that X is in bounds.
I do not know about any clear counterexample X where one of these is true but the other is not (if we assume that nothing is tainted), so I don’t see a reason to adjust the emission of the assumptions independently of the emission of bug reports.
However, if we find concrete use-cases where it is useful to adjust the emission of assumptions, I’m open to adding customization logic (and it wouldn’t be too difficult to do so).
Taint is a completely separate aspect, which I didn’t detail in the original write-up. I’m sorry if this caused confusion.
If the index is tainted, the ambiguous bounds checks (may be in bounds, may be out of bounds) will produce “tainted index may be out of bounds” reports instead of emitting “assuming index is in bounds” assumptions – so there is “no place” for producing assumptions. (This behavior is already present in security.ArrayBound and I don’t plan to change it.)
<offtopic>
There are some corner cases, e.g. a situation where the extent of the memory area is tainted, so we want to emit an assumption that the (non-tainted) index is non-negative, then emit an “index may be out of bounds and the extent is tainted” warning (which will be a non-fatal error node, because taint warnings are customarily non-fatal). This is not yet implemented, but it seems to be the logical continuation of the existing logic.</offtopic>
Thanks for the update!
Perhaps I wasn’t clear enough, but I’m not proposing inheritance between checker classes (which have singleton instances and are derived from Checker and related classes); instead I’m planning to introduce small utility classes and inheritance between them.
I think what I’m proposing is roughly analogous to the existing patterns. Perhaps the clearest analogy is the BugReporterVisitor API: if a checker needs to use it, it defines a custom subclass of a utility class, implements some virtual methods to describe the custom behavior, then instantiates this subclass to use its functionality.
I think going in this direction naturally leads to my proposal (or something very similar to it):
- As you say, we want to let the checkers call some “is this access out of bounds?” API that – very roughly – looks like validateBounds(MemRegion *Region, NonLoc Offset, bool IsActualAccess, /*some other arguments*/).
  - In fact, for the sake of uniformity, this API should handle all bounds checking, not just those that are “not clear from the source code”.
  - The name of this function is just a placeholder, I’m not proposing it for actual use!
- The various bounds checkers will need to emit the assumptions and bug reports in an almost identical fashion – the only difference is that each uses its own bug type, and some parts of the messages (which name the access operation) are specific to each checker.
  - Steps that need to be done in an identical fashion include: composing the messages, marking SVals as interesting, generating the error node or the node with the new assumption etc.
- So instead of “signaling back” and “the checker needs to be able to act accordingly”, I propose that the checker just passes its BugTypes and message fragments to the library function validateBoundsForMemoryAccess, which will arrange everything.
- The logic of validateBoundsForMemoryAccess is fairly complex, so it would be inelegant to put it into a single monolithic function; therefore I’m planning to split its logic into cooperating methods of a utility class.
  - I suggest methods instead of standalone functions because I expect that it would be cumbersome to pass context and state between functions.
- Instances of this utility class are lightweight and ephemeral: a fresh one is created for each memory access.
- There will be some logic (e.g. the “is this subscript expression wrapped in an address-of like &arr[idx]” check) which is only relevant for one particular bounds checking checker, and I think it will be natural to place these into methods of subclasses of the bounds validator utility class. (There are other approaches, e.g. passing lambdas to the constructor of the utility class, but I expect that this will be simpler than those.)
The tight coupling between the steps is intentional. After identifying the memory access, all the bounds checking checkers need to follow the same rigid workflow until the very end (up to and including the creation of the bug report). I usually don’t like inheritance, but I think it is still the most idiomatic tool for representing this rigid skeleton that has a few customization spots (represented by virtual methods that are overridden in subclasses).
However, the workflow is so rigid that the inheritance is not absolutely necessary: the API could consist of one big fat function with roughly a dozen parameters. (Half of those parameters would be “real” ones that change from call to call, while the rest are callbacks or message fragment strings that encode the “static” differences between the checkers.)
Right now I expect that the solution with inheritance would be more elegant, but the “one big function” is also acceptable, and it may turn out to be the more appealing approach when I dig into the details of the implementation.
This is completely infeasible, because these results (“is it in bounds?”, the updated State, the note that describes the bounds-related assumption) are all outputs of the same monolithic algorithm, and even if you ask for just one of them, you need to run the full algorithm.
(Running the algorithm multiple times is not feasible for performance reasons. While profiling for other reasons, I accidentally saw an example translation unit where security.ArrayBound was responsible for more than 1% of the full analysis runtime – which is similar to other checkers that check commonly used language features, but too much to be tripled.)
Okay, this sounds a bit better. That being said, I am still a bit unsure about this inversion of control. In your proposed model one needs to do subclassing to override getBugType, describeAccessOperation and co to decide what to do with the result of the bounds check. I think this is a very rigid design where the base class hardcodes a lot of decisions (e.g., to emit exactly one bug report per access, not zero, not more).
I think it is way more flexible if these utilities are not using inheritance for customization points but return as much information to the caller as possible. Based on the returned information the caller should be able to:
- Add assumptions or introduce state splits based on the condition that would make the access safe
- Emit or ignore warnings, as many as they want
- Do repeated queries with slightly different inputs for the same expression
Just a made up example for the last one, imagine an API where we do not know which array would be accessed:
int access(int *buffer1, int* buffer2, int idx);
Maybe the checker modelling this API wants to assume this call is safe if the index is in bound for any of the arrays.
My point is, the inheritance-based architecture seems to be a bit too rigid and opinionated, and I think this is the reason why we try to avoid that most of the time. Returning the information to the checker without doing any side effects (like producing bug reports) is way more flexible.
It is really hard to anticipate what checks will need in the future so hardcoding anything that is not strictly related to bounds calculations (like error reporting) is a design mistake in my opinion.
I think that is slightly different. We already are doing a side effect in that case and we are not coupling different behaviors (like reporting and bounds checking) together.
I don’t think it is a good idea coupling the addition of assumptions and reporting together with the calculations. If we really want to reuse code on that part we can add some additional utilities but checkers should have more flexibility than that, see my previous example.
I’m proposing this rigid API because the glue logic between the bounds calculations (which have complex output) and the error reporting etc. is itself nontrivial, and all the existing bounds checking checkers share the same glue logic. Even if we already had the bounds checking pure function that you propose, the DRY principle would still demand the introduction of the rigid API to avoid duplicating this glue logic.
As the rigid API is sufficient for all the existing checkers, I’m planning to implement it in a simple and straightforward way. (I estimate that separating the pure and impure part would increase the code complexity by +50-100%.)
If later we want to introduce a new checker for which the rigid API is insufficient, we can and should still define a flexible API then – however the rigid API (as a wrapper around the flexible API) would still remain useful, because it would represent common behavior patterns that are shared by several checkers (the ones which exist now). I agree that “It is really hard to anticipate what checks will need in the future” but to me this means that we shouldn’t blindly add flexibility without seeing what sort of flexibility will be actually useful.
The assumption/error reporting logic (which needs to be shared between the bounds checking checkers) is at least as complex as the actual bounds checking calculations, so we would definitely need at least some “additional utilities” for that.
However, there is no clear boundary between the calculations and the assumption/error reporting, because the calculations can produce lots of information in various configurations, which all needs to be handled and fed into the reporting utilities.
Even if we implemented the calculations and the assumption/error reporting in standalone functions, the interface between them wouldn’t be stable: most new bounds checking checkers would introduce new distinctions which need to be calculated in the calculation function, forwarded in the glue code and then reported by the reporting utilities.
The whole situation screams for encapsulation – this is an example where the object-oriented approach is clearly better than pure functions and pattern matching on complicated internals of a data structure. (I would emulate the object-oriented approach even in Haskell.)
I do not entirely understand this part. Do I understand it correctly that you imagined code that looks like the following?
// IN THE LIBRARY
class BoundsChecker {
virtual const BugType &getBugType() const = 0;
virtual std::string describeAccessOperation() const = 0;
};
void performBoundsCheck(MemRegion *Reg, NonLoc Offset, CheckerContext &C, BoundsChecker *Chk) {
// ...
const BugType &BT = Chk->getBugType();
std::string Msg = llvm::formatv("{0} {1} ...",
Chk->describeAccessOperation(), /*...*/);
// ...
}
// IN ONE SPECIFIC BOUNDS CHECKING CHECKER
class ArrayBoundChecker : public BoundsChecker,
public Checker<check::PostStmt, /*...*/> {
//...
void checkPostStmt(const ArraySubscriptExpr *E, CheckerContext &C) const {
auto [Region, Offset] = /* decompose return value of E */;
performBoundsCheck(Region, Offset, C, this);
}
const BugType &getBugType() const override;
std::string describeAccessOperation() const override;
};
This is perhaps an implementation detail, but I would like to clarify that I want to structure the code in a different and perhaps more flexible fashion:
// IN THE LIBRARY
class BoundsValidator {
BoundsValidator(MemRegion *Reg, NonLoc Offset, /*perhaps others*/)
: /* save arguments in data members */ {}
void runCalculations(CheckerContext &C, ProgramStateRef InitialState) {
// runs the calculations and saves the results in many data members
}
virtual std::string describeAccessOperation() const = 0;
void reportBugIfNeeded(const BugType &BT, ExplodedNode *Prev) {
assert(/*this->*/CalculationsDone);
if (!/*this->*/FoundBugToReport) return;
std::string Msg = llvm::formatv("{0} {1} ...",
describeAccessOperation(), /*...*/);
// create the error node, format the message and report the bug
}
//...
};
// IN ONE SPECIFIC BOUNDS CHECKING CHECKER
class ArrayBoundsValidator: public BoundsValidator {
ArrayBoundsValidator(/*...*/) {
// pre-processes the arguments, then calls the constructor of BoundsValidator
}
std::string describeAccessOperation() const override { return "Array access "; }
};
class ArrayBoundChecker : public Checker<check::PostStmt, /*...*/> {
//...
void checkPostStmt(const ArraySubscriptExpr *E, CheckerContext &C) const {
ArrayBoundsValidator ABV{ /* ... */ };
ABV.runCalculations(C, C.getState());
ABV.reportBugIfNeeded(/*this->*/BT, C.getPred());
}
};
Xazax-hun December 9, 2025, 11:16am 12
Could you elaborate on what is non-trivial about the reporting that needs to be deduplicated? I’d expect the reporting to be a couple of lines at most, based on the information that can be returned from a generic bounds-checking utility. On the other hand, if you need to introduce a new utility class in each checker that uses this utility and override some functions, that will cost you more lines. So I am afraid that the cure for the duplication might be worse than the couple of lines of trivial code that one needs to write.
Also, in case the reporting itself is indeed non-trivial, we could still have another helper function, and we could still get the deduplication without the close coupling.
Maybe at this point it is better to have this discussion over concrete PRs where we see the concrete code; that would make it clearer where the complexity is and how much duplication we actually have. I’d prefer us to start with a flexible solution and add abstractions to make it easier to use when the need naturally arises, rather than prematurely starting with something that introduces a lot of coupling right at the start.
Just look at the source code of ArrayBoundChecker.cpp and scroll through the several hundred lines of code that deal with message formatting. (Lines 60-145, 369-574 and 733-786 deal with the message formatting and bug report creation – and I’m planning some improvements that would slightly increase the total size of the logic.)
As you can see, this code is already organized into the utility class StateUpdateReporter and functions like
static Messages getNonTaintMsgs(const ASTContext &ACtx,
const MemSpaceRegion *Space,
const SubRegion *Region, NonLoc Offset,
std::optional<NonLoc> Extent, SVal Location,
BadOffsetKind Problem);
static Messages getTaintMsgs(const MemSpaceRegion *Space,
const SubRegion *Region, const char *OffsetName,
bool AlsoMentionUnderflow);
With your proposal each bounds checking checker would need to perform logic like
std::variant<std::monostate, NonTaintInfo,
TaintInfo, AssumptionInfo> Res
= genericBoundsCheckingUtility(Ctx, Region, Index, /*...*/);
if (Res.index() == 1) {
auto [Space, Region, Offset, Extent, Location, Problem] =
std::get<NonTaintInfo>(Res);
Messages Msgs = getNonTaintMsgs(Ctx.getASTContext(), Space, Region,
Offset, Extent, Location, Problem);
//... use this to report an error
} else if (Res.index() == 2) {
auto [Space, Region, OffsetName, AMUf] = std::get<TaintInfo>(Res);
Messages Msgs = getTaintMsgs(Space, Region, OffsetName, AMUf);
//... use this to report an error
} else if (Res.index() == 3) {
// Unwrap the assumption info and add a transition based on it
}
The close coupling is inherent in the task – the bounds checking algorithm produces lots of outputs that need to be fed into the code that emits the report/assumption. If you forcefully separate the bounds checking and the report/assumption generation, you just get a large and ugly interface that will be wildly unstable, because many small incremental improvements will need to touch both the bounds checking (to gather new information) and the report/assumption emission (to show the new information).
I would prefer to start with the straightforward and concrete approach, and introduce “flexibility” only if it is actually needed.
I agree that this abstract discussion is a bit pointless. When I have time, I’ll start developing patches based on my approach. I’ll be open to concrete proposals that would make the code shorter and easier to maintain, but I don’t want to bring work “forward in time” unless it significantly increases the overall effectiveness.
Xazax-hun December 9, 2025, 2:39pm 14
There is no reason why we could not reuse these utility classes. A portion of this code could be part of some generic visitors.
I am still not 100% sure this would be as bad as you describe with the right API, but as I also mentioned, we can totally introduce abstractions like:
auto Res = genericBoundsCheckingUtility(Ctx, Region, Index, ...);
genericReportingFacility(Res, ...);
I think an API like this would be more concise as one does not need to override base classes and it does not have the strong coupling, checkers that need different behavior can do their own logic.
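For concreteness, here is a minimal, self-contained sketch of what such a function-based API might look like. Only the names `genericBoundsCheckingUtility` and `genericReportingFacility` come from the discussion above; `MockRegion`, `Verdict` and `BoundsCheckResult` are simplified stand-ins invented for this sketch (the real code would operate on `MemRegion`, `NonLoc` etc.). The key point is that the checking utility returns a plain result object, a generic reporting helper consumes it, and a checker that needs different behavior can inspect the result directly instead of calling the helper:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Simplified stand-in for the real analyzer types (MemRegion, NonLoc, ...).
struct MockRegion { int Extent; };

enum class Verdict { InBounds, Underflow, Overflow };

// Plain result object returned by the generic checking utility.
struct BoundsCheckResult {
  Verdict Kind;
  int Offset;
  int Extent;
};

// The generic bounds-checking utility: pure calculation, no reporting.
BoundsCheckResult genericBoundsCheckingUtility(const MockRegion &Reg,
                                               int Offset) {
  if (Offset < 0)
    return {Verdict::Underflow, Offset, Reg.Extent};
  if (Offset >= Reg.Extent)
    return {Verdict::Overflow, Offset, Reg.Extent};
  return {Verdict::InBounds, Offset, Reg.Extent};
}

// The generic reporting facility: turns a result into a diagnostic message.
// Returns std::nullopt when there is nothing to report.
std::optional<std::string>
genericReportingFacility(const BoundsCheckResult &Res,
                         const std::string &AccessDesc) {
  switch (Res.Kind) {
  case Verdict::InBounds:
    return std::nullopt;
  case Verdict::Underflow:
    return AccessDesc + " precedes the beginning of the region";
  case Verdict::Overflow:
    return AccessDesc + " exceeds the extent (" +
           std::to_string(Res.Extent) + ")";
  }
  return std::nullopt;
}
```

In this shape, a checker with unusual needs simply skips `genericReportingFacility` and branches on `Res.Kind` itself, so the coupling stays loose.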
In general, we very rarely take away the full control of how warnings are presented from the checks, and I don’t think we have a strong enough reason here to diverge from that direction.
gb-sonar December 9, 2025, 3:02pm 15
First of all, thank you for the initiative. It would certainly be wonderful to have a shared core that implements robust bounds checking, and that could be safely used by different checkers. The various outdated comments “copied from ArrayBoundChecker” are a clear indication of this.
Regarding your question:
I agree with @necto that hypothetical cases should not sink the execution, as they are not that critical and may suppress some real bug down the line. Nor should they affect assumptions. These can be reported as non-fatal, even making them optional through configuration if they turn out to be too noisy.
I also agree with your general approach. Since bounds checking is an immediate check based on certain information modeled elsewhere, I see no compelling reason to have a central modeling checker that the actual checkers subscribe to. The library approach seems more natural to me.
However, I also have my doubts about the proposed design of a base class with extension points. I believe the extension points in this case are so numerous that we will very soon need to implement some tricks to use the main utility (the bounds checking) together with some custom logic for creating reports and/or assumptions.
Although the following could address that…
In any case, I also believe it will be easier to discuss this based on specific patches.
I will keep an eye on the progress.
Hmm… this is very close to what I imagine. In fact my proposal could be summarized as
auto Res = genericBoundsCheckingUtility(Ctx, Region, Index, ...);
Res.genericReportingMethod(...);
and the difference between a function and a method is only syntactic here.
I understand this feeling and in general I agree with this. However, I feel that these bounds checking checkers are so similar that following the older coding style they would have been implemented as sub-checkers of a single checker family – and there are several large checker families where 3-5 sub-checkers share the presentation of warnings.
I’ll try to make the interface as flexible as possible.
I will probably introduce a utility class whose instances (roughly speaking) represent the results of a bounds check (because it seems to be useful to handle them as a single object). However, I’m not sure that we’ll need inheritance: it is possible that instead of implementing their own local subclass, the individual checkers can use the same utility class (which is defined in a library) and customize its behavior by calling different methods with different arguments.
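As a rough illustration of that last idea, here is a sketch of a shared utility class where the checker-specific pieces arrive as ordinary method arguments rather than virtual overrides. All names here (`BoundsCheckReport`, `runCalculations`, `formatMessage`) are hypothetical stand-ins invented for this sketch, and the int-based bounds logic is a placeholder for the real `MemRegion`/`NonLoc` calculations:

```cpp
#include <cassert>
#include <string>

// One concrete utility class shared by all bounds checking checkers.
// Checker-specific behavior is injected through arguments, not subclassing.
class BoundsCheckReport {
  bool FoundBug = false;
  std::string Message;

public:
  // Placeholder calculation step; real code would work on the analyzer's
  // symbolic offsets and extents.
  void runCalculations(int Offset, int Extent) {
    FoundBug = (Offset < 0 || Offset >= Extent);
  }

  // The checker passes its own description of the access operation here,
  // instead of overriding a virtual describeAccessOperation().
  void formatMessage(const std::string &AccessDesc) {
    if (FoundBug)
      Message = AccessDesc + " is out of bounds";
  }

  bool foundBug() const { return FoundBug; }
  const std::string &message() const { return Message; }
};
```

A checker would then write something like `Rep.formatMessage("Array access")` while a C string checker passes its own description, so both reuse the same class without defining subclasses.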