T85696 Allow action=purge to recalculate the number of pages/subcats/files in a category (original) (raw)

There are a very large number of changes, so older changes are hidden. Show Older Changes

Comment Actions

the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying- with InnoDB, while inside a transaction, you do not need to lock the rows- it should get consistent reads automatically.

Without locking, pages getting added and removed from the category while the SELECT gets the new counts will not be counted correctly.

There is a way to address this concern, and I already have a patch (not quite ready for review, so not uploaded yet) that does the following:

  1. Acquires a named lock on the category to prevent concurrent counts refresh updates (normal links updates are unaffected)
  2. Ensures that subsequent queries will not read a stale snapshot (one taken before the lock was acquired)
  3. Counts all categorylinks rows of each type, and retrieves the old set of counts, in a single query (thus using a consistent snapshot, regardless of READ COMMITTED vs. REPEATABLE READ, assuming MVCC)
  4. If old and new counts are not the same, applies the necessary change(s) as difference(s) between old and new counts
  5. Releases the named lock

Here's an analogy:

You are a professor who teaches in a gigantic lecture hall. You notice that a clock high on the wall reads 11:00, but it is actually 11:05, according to your wristwatch, which is synchronized to a national time standard. You call maintenance and ask them to fix the time on the clock. At 12:05, a maintenance worker arrives and sets up his slow and rather noisy lift. Unfortunately, the worker's watch is broken, so he needs some help.

New method: You compare the clock's time (12:00) with your wristwatch's time (12:05). Because the clock's time is five minutes earlier than your wristwatch's time, you tell the worker to advance the clock by five minutes, making sure no other worker arrives until he is done. He takes one minute to go up in the lift, then adjusts the clock (to 12:06), then goes back down.

Current method: You stop time, because otherwise the time might change (after you check your wristwatch, but before the clock is adjusted). You check your wristwatch, then tell the worker to set the clock to 12:05. He starts going up in the lift. One minute later, he reaches the clock, though because you had stopped time, it is still 12:05. You only restart time once the worker is done adjusting the clock.

This would need to be limited to categories with, say, fewer than 5000 entries; for pathological cases like Category:All stub articles on en.wp with nearly two million pages, the query (see Category::refreshCounts()) takes several minutes to run (I tried on labs and it took 6m18s to count 1,877,668 pages). I also tried Category:Year of birth unknown and it took only 4.5s for 20,125 pages, so only doing this for categories where the current count is <5000 should be safe?

Currently, the query joins against the page table, checking page_namespace to determine what is already available as cl_type. The following query (a simplified version of that from my patch):

select cl_type, count(*) from categorylinks where cl_to = 'All_stub_articles' group by cl_type;

when run against an October 2015 dump of enwiki, takes only 1.35 sec on my desktop PC (running MariaDB 5.5.47).

However, I first need to work on improving the consistency between the two fields, at the very least by making it possible for users to fix inconsistencies themselves (e.g. by purging the affected pages, or the affected category).