Support query-based clang-tidy external check (original) (raw)

Background

clang-tidy is becoming more and more popular, but its static configuration makes it difficult to implement customized checks for public third-party libraries.
For example: [clang-tidy] Add modernize-nlohmann-json-explicit-conversions check by mikecrowe · Pull Request #126425 · llvm/llvm-project · GitHub
It is difficult to adapt plugins implemented in the form of dynamic link libraries to different versions of clang-tidy, and because it relies on the llvm ecosystem, compiling so consumes a lot of CPU and time. This is very unfriendly to many open source projects or small and medium-sized teams, who do not have the resources to maintain a corresponding tool chain

If we make clang-tidy more flexible and configurable. It will help more C++ projects improve their quality.

Concept

Query Based Custom Check
clang-query is tool to dynamically execute ast-matcher and bind to results. we can reuse it to implement more flexible plugin system in clang-tidy.

High Level Requirement

I want these checks can be treat as built-in checks. They should be enable / disable by command line and config.
User should provide query string and corresponded diagnostics message.

I discussed this idea with Yitzhak Mandelbaum in personal email based on his work with the Transformer library.

In the following replies, I’ll repeat our conversation from 2022:

Me:

Since the last time I hacked on clang-tidy, the Transformer library has
been added. From what I can tell in the commit history and email list
archives, you are the primary author.

Thanks for adding this library!

Having written a bunch of clang-tidy checks[*], I have come to the
conclusion that many refactorings we’d like to write could be handled
by a scriptable tool. The DSLs for expressing edits and rewrite rules
in the Transformer lib puts us almost all the way towards having such
a tool. Imagine if we could write the edits and the rewrite rules as
easily as we write matchers in clang-query.

For many organizations, I think writing C++ that directly manipulates
the AST and the source code transformations is too high a burden for
writing their own transformations. Particularly if those
transformations are of a migratory nature and are “fire once and
forget” for their code base.

If we could get a script-like language mechanism in place for writing
transformations, I think it would make this accessible to many more
people.

It appears that you wrote some parsing support in Transformer already.
I’d like to collaborate with you on adding more parsing support for
the remainder of the DSLs so that we can move closer to a scripted
clang-tidy check.

What do you think?

Yitzhak Mandelbaum:

Thanks for reaching out. I’d be happy to work on this together. I already
have a language and parser that covers many of the available Stencil
combinators, I just haven’t had the chance to make the patches for
committing it. This could be a good forcing function. :) That said, you
might find you prefer a different language. This one is designed in “format
string” style, vs the clang-query language and my own range language which
mimic the C++ libraries they’re based on. I’ve pasted the 1-page guide
below so you can see what I mean.

That said, I should note that we’ve had this language available for general
use for > 2 years, integrated into a web UI, and we haven’t seen much
adoption. In practice, I think a lot of migrations end up needing at least
some custom matchers and/or Stencils (more the former than the latter).
Also, learning the AST and matchers remains a very high barrier for many
newcomers

Clang’s Stencil abstraction provides a code-generating object parameterized
by named references to (bound) AST nodes. Stencils are specified in Zwingli
with a specialized format string language. For example, if we wish to
transform code of the form

if (condition) { body; }

to

if (!condition) { LOG(ERROR) << "condition failed"; } else { body; }

Given the matcher that binds the condition to cond and the body to body,
you can express the output as:

if (!($cond)) { LOG(ERROR) << "condition failed"; } else $body

Here, the $ tells the parser to treat the following alphanumeric characters
as an identifier bound to an AST node by the matcher. The format strings
provide additional built-in operators to help you construct your output.
The full grammar for the format strings can be found below.

StencilDescription
$id Add the code segment represented by the AST node bound to the reference
id
$id.member Constructs code that accesses the named member of the object
bound to id. The access is constructed idiomatically: if id is bound to e,
then constructs e->member, when e is a pointer, and e.member otherwise.
Wraps e in parentheses, if needed. Similarly, it will do some basic
simplifications to avoid creating expressions like (&x)->member. Members
can be identified by raw text or operator calls (e.g. $name()).
$name(id) Given a reference to a named declaration id (that is, a node of
type clang::NamedDecl or one of its derived classes), generates the
name. id must
have an identifier name (that is, constructors are not valid arguments to
the name operation).
$callArgs(id) Given a reference to call expression node, generates the
source text of the arguments (all source between the call’s parentheses).
$initListElements(id) Given a reference to an initializer-list expression
node, generates the source text of the elements (all source between the
braces).
$*(id) Renders a node’s source as a value, even if the node is a pointer.
For the following examples, assume something is bound to the x in the
expression foo(x). The rewrite will depend on the type of the bound
variable.

T* → foo(*x)

T& → foo(x)

T → foo(x)

It will also do some basic simplifications to avoid creating expressions
like *&x.
$&(id) Renders a node’s source as an address, even if the node is an
lvalue. For the following examples, assume something is bound to the x in
the expression foo(x). The rewrite will depend on the type of the bound
variable.

T* → foo(x)

T& → foo(&x)

T → foo(&x)

It will also do some basic simplifications to avoid creating expressions
like &*x.
$(id) Given a reference to a node e, generates (e) if e may parse
differently depending on context. For example, a binary operation is always
wrapped, while a variable reference is never wrapped.
$includeHeader(foo/bar.h) Add an include statement (of the form #include
“foo/bar.h”) at the beginning of the file. Can only appear at the beginning
of the stencil.
Grammar

The grammar for the format strings is as follows
FormatString = IncludeOp* Part*
Part = Text | IdOp | UnaryOp | MemberOp
Text = Literal+
Literal = UNESCAPED|ESCAPED
IdOp = '$' (ID | '(' ID ')')
IncludeOp = '$includeHeader(' PATH ')'
UnaryOp = '$' ('*' | '&' | 'name' | 'callArgs' | 'initListElements') '(' ID ')'
MemberOp = IdOp '.' (Text | Part)

UNESCAPED = [^$]
ESCAPED = '\'.
ID = [a-zA-Z0-9_]+
PATH = [a-zA-Z0-9/.]+

Me:

Yitzhak wrote:

Given the matcher that binds the condition to cond and the body to body=
,
you can express the output as:

if (!($cond)) { LOG(ERROR) << “condition failed”; } else $body

This is actually very close for what I had in mind! Naturally you
can’t express deletions with a format-string like approach, but it
works great for the most common case of “replace X with Y where Y has
values substituted in from X and/or the outer environment”.

That said, I should note that we’ve had this language available for gene=
ral
use for > 2 years, integrated into a web UI, and we haven’t seen much
adoption.

I assume you’re referring to adoption internally at Google?

In practice, I think a lot of migrations end up needing at least
some custom matchers and/or Stencils (more the former than the latter)=

Right now the barrier to entry is so high that I would like to get
some more practice with a scripted approach before I conclude that
most things are going to need some custom code. My gut feeling is
that with more experience using such a tool we’ll identify more and
more bits of scripting infrastructure (matchers, stencils, what have
you) that fill in the missing pieces.

My expectation is that there will always be some transformations that
are going to require some custom code, because the logic is too
complex to be expressed in script.

Also, learning the AST and matchers remains a very high barrier for many
newcomers.

Yes, I think this is a barrier. Even I have a hard time figuring out
how to write a matcher just from looking at the docs. I think there
are improved UIs for developing checks that could improve here. For
instance, if I had IDE support for “write a matcher expression that
selects the highlighted code” and a clang-query window in my IDE (I
use VS, but you get the idea) that was integrated with my editor.

I don’t think we’re quite at the point for GUI integration though, I
think the parser/script infrastructure needs to be solid first.

Yitzhak:

This is actually very close for what I had in mind! Naturally you
can’t express deletions with a format-string like approach, but it
works great for the most common case of “replace X with Y where Y has
values substituted in from X and/or the outer environment”.

Right – this only covers the replacement itself. We need to extend with
operators for “insert”,“change”,“remove”. Probably worth just sticking to
the syntax of the library, but open to other ideas.

I assume you’re referring to adoption internally at Google?

Ah. Yes, sorry.

Right now the barrier to entry is so high that I would like to get
some more practice with a scripted approach before I conclude that
most things are going to need some custom code. My gut feeling is
that with more experience using such a tool we’ll identify more and
more bits of scripting infrastructure (matchers, stencils, what have
you) that fill in the missing pieces.

Fair enough. In some sense, we’ve been doing that for the libraries.
Regardless, its worth a try. Might still be easier to have scripting
language w/ facilty for user-defined operations, vs having to write
full-blown C++ programs.

Yes, I think this is a barrier. Even I have a hard time figuring out
how to write a matcher just from looking at the docs. I think there
are improved UIs for developing checks that could improve here. For
instance, if I had IDE support for “write a matcher expression that
selects the highlighted code” and a clang-query window in my IDE (I
use VS, but you get the idea) that was integrated with my editor.

Agreed. This would be very cool. We’ve made some progress in this
direction, too, with libraries that take an AST and output potential
matchers. But, here too, I haven’t time to get those committed upstream.

Me:

This is all great to hear, it sounds like we’re very much thinking
along the same lines!

I had reviewed a google docs with some comments along these lines a
couple years ago; that wouldn’t happened to have been your doc was it?

I tried finding the link to it, but I couldn’t find it easily.

I think I’m going to take the existing checks that I’ve created and
refactor them into Transformer checks as a way to get some experience.

Yitzhak:

Not sure if you saw, but I
recently added a tutorial to the clang docs.
Clang Transformer Tutorial — Clang 21.0.0git documentation

olologin March 19, 2025, 9:53pm 6

Thanks for creating this discussion @HerrCai0907

Original proposal is here [clang-tidy] RFC: Add generic check that takes match expression like clang-query · Issue #107680 · llvm/llvm-project · GitHub
Maybe someone finds that proposal easier to understand.

But in short: We just want to have some way to quickly add clang-query -based check and use it with existing clang-tidy infrastructure.
I have already tested this solution from [clang-tidy] Add ClangQueryChecks config option by DeNiCoN · Pull Request #123734 · llvm/llvm-project · GitHub and it is good enough for our usage at work, but since @HerrCai0907 volunteered to improve that PR - I am happy to support his intention.

Basically I just want to express my support of this PR and I hope other people also support this. In fact I am surprised nobody else came up with this patch before, it is really useful

Thanks for bringing up the RFC!

At a high level, I think it is a good idea to try to find a way to make it easier for folks to add small, “one-off” checks to clang-tidy. Things like “don’t call function foo” are a pretty common need for various projects, and writing a whole clang-tidy plugin and integrating that can be a challenge (especially because plugin support for Windows is not really there yet).

One concern I have is that clang-query uses the dynamic AST matchers which are different than the C++ DSL used by clang-tidy. The intent is for them to be the same, but the nature of the beast is that the C++ DSL gets way better diagnostics when you mess up the matchers. For example, sometimes clang-query just quietly does nothing instead of telling you about matcher type mismatches. Some things are possible in one that aren’t possible in the other.

It’s also unclear as to whether a clang-query based interface will allow for adding fixits to make code transformations easier.

Some questions I have:

How do you plan to handle invalid query strings? Duplicate check names?
Do you have plans to support fix-its in the initial implementation? If not, how sure are you that they’re possible to support in the future?
What is the performance difference between a clang-tidy plugin and a query string-based approach for the same check? e.g., is clang-query’s performance noticeably different?

CC @ymand as well

I’ve seen this too and the divergence certainly makes for a more difficult development/debug experience for check authors as I usually use clang-query to prototype my matcher.

Give configuration error, since it is the reuse clang-query’s grammar, it will not be too hard to debug in clang-query

In current implement I add custom- module prefix for each check. But for the same name custom check, I am not sure they should be fine-grained override in which way. But at least, we can override the whole check.

Consider the complexity for fix-its in current check, it needs to use API from parser / lexers. I don’t think it is easy to implement a powerful replacement framework. IMO, if someone wants to provide fix-its, maybe dynamical library is more suitable.

Parsing string to dynamically create matcher cost some time.
Creating diagnose from binding also waste some time because we need to search binding name in configuration for each matched results. If there are lots of diagnostics pattern in configuration (I hope no one write check like this), it will cost some time (maybe times log(n) since we always need to search string in map).

Thank you for raising this.

I was actually surprised how well at least Debian-based Linux distributions supported compiling clang-tidy plugins (see my nlohmann/json CI commit).

The annoying part for me was testing the plugin outside the LLVM repository. I needed to copy check_clang_tidy.py from the LLVM repository in order to make the checks testable. I suppose this could be improved by including the script in distribution packages too.

In discussion of PR [clang-tidy] support query based custom check by HerrCai0907 · Pull Request #131804 · llvm/llvm-project · GitHub, we meet some divergences about how to pass the custom linter rules

There are 3 options. Please correct me if I miss somethings

in .clang-tidy config file, with structured options

easy to write, can use schema to check error during writing

in .clang-tidy config file, with flatten options

easy to overwritten.

in another config file and pass it by command line.

avoid to rely on stable interface, configuration in subproject will not influence parent project.

I think we need to come to a decision on this as it’s pretty integral to the RFC.

I personally don’t have strong opinions, but I weakly favor structured input over unstructured input, but I do wonder why we might want to avoid having a stable interface? I thought the goal for this was for people to easily write one-off checks they didn’t want to write plugins for. If the driving force for the feature is effectively “we don’t want folks to do a lot of work to support this”, being unstable seems like we’re going to eventually force folks to do a lot of work to support this.

I think they might be worried that this feature might hinder subsequent break changes and thus become a historical burden? For clang-tidy, usually a problematic sub-repository will not disrupt the configuration of the parent repository. However, if we introduce this feature, we must find a way to avoid this issue, because ast-matcher might suddenly have some breaking changes.
Actually all of solutions can avoid this cases, but the third solution is more natural.

olologin April 30, 2025, 11:24am 14

I am fine with any solution as well

I have finished the whole dev at least for MVP. Hope someone can review this PR https://github.com/llvm/llvm-project/pull/131804. Thanks!

design of configuration

After some struggling attempts, I still want to keep configuration as options 1.

There are 3 options. Please correct me if I miss somethings

in .clang-tidy config file, with structured options

easy to write, can use schema to check error during writing

in .clang-tidy config file, with flatten options

easy to overwritten.

in another config file and pass it by command line.

The reason why not option 3:

The options 3 use additional file introduces complexity for dependences analysis in tooling, and the benefit is focus on compatibility. But actually there are no compatibility in the other options.

when we talk about compatibility, actually there are 2 kinds of compatibility in custom check

config compatibility:
1. config parse is in the early stage of clang-tidy. so we must make sure the clang-tidy must can parse and understand config which created for the other version clang-tidy.
2. to resolve it, I think we only need to keep several key config fields (name, query, diagnostic) stable and make the other config optional.
3. when we add some new fields in later version, the clang-tidy can ignore these unknown fields (guaranteed by yaml parser). some checks maybe broken in this cases by user can disable these broken check in Checks field like disable other clang-tidy check.
query compatibility:
1. query string is parsed in check itself. so use invalid query string will not harm the basic functional of clang-tidy.
2. query string may be changed in each version, because ast-matcher is changing rapidly. It is impossible to let query really stable.
3. when outdated / invalid query in huge repo, user can avoid issue by disabling the custom check in Checks filed also.

The choice of option 1 and option 2:

I personally prefer option 1 because structured options make write rules easier. Maybe we can support both in the future since there are no conflict.

5chmidti August 22, 2025, 10:13pm 16

Sorry this took me a while to get to, here are my thoughts:

I’d like to see this functionality in clang-tidy, as it can help a lot of projects to do simple checking against the usage of their own APIs without heavy buy-in via distributing their own extended clang-tidy.

The config file stability and maintenance cost is a real concern, and I don’t think we will find a solution that does not have a cost in some way, but I guess that is obvious. The long-term stability is IMO probably maintained with the structure that @HerrCai0907 is proposing, and this even allows us to extend in the future. A very forward-looking example would be if somehow there are CIR based matching/analysis added to clang-tidy and that can be parsed from a string like the query matchers (so something that would happen in a few years I’d guess). If something like that were to be requested, we could add a type/engine field to differentiate. (I’ll look at the contained config members in the review.)

Bring-up idea:

Nevertheless, I think it makes sense if we add this functionality in the following way to begin with, e.g., at least for one release to see what people are reporting over a longer lifetime (e.g., for package managers to catch up at least a bit) than to the next release (which would be used by fewer people than an official release). And this could be deployed by: keeping the config inside .clang-tidy, error-ing out when this custom check config is there in the normal invocation, and only enable with something like –-experimental-custom-checks. Then, after some time and feedback, we can enable this for the main invocations. This avoids broken configs because of the error without the special flag, and also allows people to buy-in for something that could change.

I initially thought about going with @HerrCai0907’s option 3, but I also think that an additional config file would be undesirable, even for something like an experimental enablement. Especially since it would incur more maintenance as well. Option 1 sounds like the natural way to go, I don’t see flat options being better than structured options.

I did an initial pass over the PR and had no immediate concerns, but I’ll also try out the feature in a bit to see what e.g., failures look like in the config.

TLDR:

I agree that this is a very useful feature to have,
we should also think very carefully on how to word the documentation and explain limitations and also suggest upstreaming checks if appropriate
I agree with the config being structured in the .clang-tidy file making the most sense
I’d prefer it if this initially became an experimental-flag-enabled-only feature that errors on configs specifying this config without the experimental flag to avoid people committing those configs without their explicit buy-in