Str.count to allow counting of overlapping substrings (original) (raw)

Substring counting could surely be made easier by allowing the existing str.count method to support counting of overlapping substrings, via an optional arg. e.g. overlap=True. This would mean that, for example, the problem of counting 'AAAA' in a string such as 'AAAAAAC' would be extremely simple:

>>> s = 'AAAAAAC'
>>> s.count('AAAA', overlap=True)
3

rather than having to write your own solution:

sum(1 for i in range(n - k + 1) if s[i: i + k] == b)

where s is a string, b the substring to be counted, and n and k are the lengths of s and b respectively.

To be clear: my proposal is that <str>.count(<substr>, overlapping=True) would result in a count of all occurrences of <substr> (non-overlapping and overlapping) in the given string. So that 'AAABAA'.count('AA', overlapping=True) will produce a value of 3 (2 overlapping occurrences at the start, and the one at the end of the string), whereas currently count will produce 2. But count without overlapping=True (overlapping=False, the default) will continue to work the same way, so there would be no side effects.

This problem comes up naturally in many contexts, and I’m aware of previous posters who’ve bumped into this problem when working on their own problems.

I think it would be a relatively simple but useful extension of str.count.

dg-pb (dg-pb) June 15, 2025, 6:49pm 2

Simple? maybe. Useful? Uncertain. Is this causing bottlenecks or is it simply for convenience?

I noted this and will consider it together with other potential extensions for string search when the time comes.

By the way, you could do it a bit better.
Especially if you are dealing with long strings with sparse hits.

def count_nonoverlaping(s, b):
    c = 0
    i = 0
    while 1:
        i = s.find(b, i)
        if i == -1:
            break
        c += 1
        i += 1
    return c
s = 'AAAAAAC'
count_nonoverlaping(s, 'AAAA')

sr-murthy (SR Murthy) June 15, 2025, 7:02pm 3

I noted this and will consider it together with other potential extensions for string search when the time comes.

Thanks, it’s just an idea for consideration. I don’t claim that counting overlapping substrings is the most pressing problem, but it certainly comes up in different domains (bioinformatics, NLP, CS, information theory, to think of a few), and surely a built-in, standard solution is preferable to each person writing their own code. I see at least some prior discussion on this.

dg-pb (dg-pb) June 15, 2025, 7:21pm 4

In str.find/count domain, it rarely something is if it can be done with re. Didn’t even think of that at first - apparently I never had such need myself.

len(re.findall(r'(?=(AAAA))', 'AAAAAAC'))

sr-murthy (SR Murthy) June 15, 2025, 7:28pm 5

Sure, you can do it with re, but it seems more natural to use something like str.count(overlap=True). As you’ve mentioned str.find I wonder if you could have str.findall that gives you all starting indices of the substring (in an overlapping way): an example:

>>> 'AAACAAG'.findall('AA')
[0, 1, 4]

Could be an alternative.

While I dont have a strong opposition to this, given the problem space, I would still suggest using a library, both for the kinds of optimizations unlikely to be made in CPython, and for having access to it today.

In particular, I’d recommend anyone doing frequent string-based operations, especially those in bioinformatics, look into use of stringzilla. It’s heavily optimized for basically this specific use case.

If it were going to be added, I do think it might be better to have functions in the string module than expand the method API of the builtin str type

dg-pb (dg-pb) June 15, 2025, 7:40pm 7

Not likely. People would get confused why re.findall has such inconsistency with str.findall.

I think the str.count(..., overlapping=True) would be ok if this turned out to be useful enough. But this would be weird because this would be 2 steps ahead - there is not str.findall/finditer, but there is a count which counts overlapping matches. So it would be a bit strange that substring counting is so much more advanced than string search on which substring counting is based on.

So I wouldn’t hold my breath. It is ok candidate for extension, but it will have to wait for its time. Also, maybe others will chip in in the meantime - it would be helpful to gauge if there is any actual need for this.

sr-murthy (SR Murthy) June 15, 2025, 7:42pm 8

thanks, tis stringzilla looks fast, which is important.

sr-murthy (SR Murthy) June 15, 2025, 7:45pm 9

Thanks for your consideration. Would I be able to submit a PR? No problem if this is typically done internally, but it would be nice to.

Stefan2 (Stefan) June 15, 2025, 9:46pm 10

You forgot to mention another reason for it: soooo many people using

for i in range(n - k)

instead of the correct

for i in range(n - k + 1)

sr-murthy (SR Murthy) June 15, 2025, 10:07pm 11

Thanks, typo corrected. The iterative find solution is much better, of course.

dg-pb (dg-pb) June 16, 2025, 7:42am 12

In theory you can do whatever you want.

But as I said, currently it isn’t a good time for such and you would be wasting your time. The first item that needs to go in is https://github.com/python/cpython/pull/120025 .

This will be a certain completion of underlying search engine and prepare a bit better ground for further extensions, such as str.finditer/findall. PRs in this area before this is merged would introduce unnecessary complications/conflicts/etc. So it is simply not practical.

But it might take some time as this is not priority (same as this proposal) and also it is a bit difficult to find willing reviewers.

So my best advice would be patience. If this is something that is desirable and the time is right I am sure you could do a PR - just have notifications turned on for this thread and double check once in 6 months / a year. It might take some time and also there is no certainty that this will go through in the end.

sr-murthy (SR Murthy) June 16, 2025, 8:07am 13

Not a problem. I would only submit a PR if there was a sponsor, otherwise not, and with prior agreement.