Loren James Rittle - Re: libstdc++/7445: poor performance of std::locale::classic() in multi- (original) (raw)

This is the mail archive of the libstdc++@gcc.gnu.orgmailing list for the libstdc++ project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

I'm not excited by this patch. There has got to be a better way. Loren, can you commet on this? I should have just waited for you before I checked this in anyway.

Hi, I am sorry Benjamin but, unless I have reported otherwise and am just now forgetting it, I have never reviewed any of the threading issues/code in the locale part of the library. I have now looked at this one tight issue (but not all the locale code).

I agree that src/locale.cc 1.62 is flawed. I agree with Andrew Pollard that your proposed patch (dated 2 Aug 2002 16:20:20 -0000 against 1.62) is flawed. Andrew has implemented the so-called Double-Checked Lock Pattern. This is a performance speedup first proposed/widely written about by Dr. Douglas C. Schmidt and his group. I have read that (and I think Doug's paper itself discussed these issues), for some CPU/memory architectures, it is possible that the Double-Checked Lock Pattern will not work. E.g. on any system that reorders writes especially in light of multiple CPUs.

The POSIX thread system has the property that when all accesses of a particular memory region (which may be disjointed and need not be expressed in code) are guarded by a particular mutex, the memory region is guaranteed to appear consistent in a threaded, multiprocessor system. In particular, to make this happen the mutex implementation may force a flush of all pending writes at the mutex lock/unlock points (or, they may especially mark all reads and writes between those boundary points). The DC locking violates this rule (but can usually by made to work for any one particular set of hardware - the trick is making it widely portable). So does the code segment in 1.62 (there is a reference to _S_classic outside the otherwise mutex-protected region). So does your first proposed patch of 1.62.

To shift focus for a moment, this PR smacks of a problem found by "code review" not actual profiling. The code as written before the PR was right (if non-optimal). The patch provided with the PR is flawed (for same reason as above). Is std::locale::classic() actually called so often in nominal cases as to matter? A mutex lock/unlock operation takes a certain amount of time. At quick glance, the guarded init code would appear to require well over that amount of time for modern platforms. Thus (and I can't judge this myself), if std::locale::classic() is nominally called zero or just a few times, the code as written before the PR patch was probably near optimal in handling various tradeoffs. If std::locale::classic() will be called many times in nominal cases, then I agree that we might be non-optimal here. Back to the optimization discussion...

OK, does this mean we can't optimize this situation at all? Well, it turns out that POSIX makes a very similar guarantee related to memory consistency whenever a new thread is started. Thus, if we initialize something which will later be read-only when there is exactly one thread running, then we are completely within the rules to avoid the cost of a mutex acquisition once we go threaded (I hereby state my assumption that other thread models supported by our abstraction layer were written for simpler hardware and/or support POSIX semantics since it is near impossible to reason about the system without this rule).

Is there a reason why locale::c_locale should not be produced while the system is still running library init code in a single thread? I assume the answer will be either: (a) if a particular program run wouldn't have called locale::classic(), then we are not allowed to do related setup work per standard section X.X.X.X'' or, more likely, (b) as an optimization, we didn't want to do the work or allocate memory in any case where locale::classic() would not be called by the program''.

If (a) then we might be stuck with a slow locale::classic() implementation (i.e. ``well-wrapped'' ;-) unless we want to sort out all the issues related to DC locking for all environments to which libstdc++-v3 ports.

If (b) then perhaps we could, as an optimization for platforms that support WEAK (I already hear the moaning from some quarters ;-), convert the init code for c_locale to run while we are single-threaded guarded by a check of a weak symbol ``&locale::classic()''. Does the standard say anything about making such symbols weak? And/or abusing their reference or lack thereof to support optimizing library init code?

I hereby propose that the PR patch and all follow-on patches be unrolled until someone produces a real test case that performs poorly. Performance patches produced by code inspection should hold no merit with us.

Regards, Loren

Loren J. Rittle, Principal Staff Engineer, Motorola Labs (IL02/2240) rittle@labs.mot.com, KeyID: 2048/ADCE34A5, FDC0292446937F2A240BC07D42763672


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]