[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
Bill Wendling via llvm-dev llvm-dev at lists.llvm.org
Tue May 26 17:16:29 PDT 2020
On Tue, May 26, 2020 at 4:38 PM Craig Topper <craig.topper at gmail.com> wrote:
> I spent some time looking at this back in March when Bill mentioned it
> on IRC. I think I saw a write to one bit in one of the 8-bit pieces and
> then a read of that bit and a bit from the adjacent byte. So we used a
> narrow store and then a wider load due to the 2 bits.

I think you're referring to same_flow and free in the structure below.
Those both have stores, as does most of the rest of the bitfield (it's
an initialization, which it seems could be done with a few bitwise
operations on the whole bitfield, but I digress). But yeah, in the case
where we have consecutive accesses of bitfields in adjacent bytes, a
bigger read & store are better.
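For concreteness, the access pattern being described reduces to
something like the following sketch. The field names are hypothetical,
not the kernel's, and the comments describe what a compiler may do,
per the behavior discussed in this thread:

    /* Illustrative sketch with hypothetical fields (not the kernel
     * layout). Writing a single bit can be narrowed to a byte-wide
     * read-modify-write, while a later read of bits that straddle the
     * byte boundary may be emitted as one wider load -- the
     * narrow-store-then-wide-load pattern discussed above. */
    #include <stdint.h>

    struct bits {
        uint8_t a : 1;   /* bit 0 of byte 0 */
        uint8_t b : 7;   /* bits 1-7 of byte 0 */
        uint8_t c : 1;   /* bit 0 of byte 1 */
    };

    void set_a(struct bits *p)
    {
        p->a = 1;   /* may compile to a byte-wide "orb" */
    }

    int read_a_and_c(struct bits *p)
    {
        return p->a && p->c;   /* may compile to a single 16-bit load */
    }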
> ~Craig
>
> On Tue, May 26, 2020 at 4:32 PM John McCall via cfe-dev
> <cfe-dev at lists.llvm.org> wrote:
>> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:
>>> We're running into an interesting issue with the Linux kernel, and
>>> wanted advice on how to proceed.
>>>
>>> Example of what we're talking about: https://godbolt.org/z/ABGySq
>>>
>>> The issue is that when working with a bitfield a load may happen
>>> quickly after a store. For instance:
>>>
>>> struct napi_gro_cb {
>>>     void *frag;
>>>     unsigned int frag_len;
>>>     u16 flush;
>>>     u16 flush_id;
>>>     u16 count;
>>>     u16 gro_remcsum_start;
>>>     unsigned long age;
>>>     u16 proto;
>>>     u8 same_flow : 1;
>>>     u8 encap_mark : 1;
>>>     u8 csum_valid : 1;
>>>     u8 csum_cnt : 3;
>>>     u8 free : 2;
>>>     u8 is_ipv6 : 1;
>>>     u8 is_fou : 1;
>>>     u8 is_atomic : 1;
>>>     u8 recursive_counter : 4;
>>>     __wsum csum;
>>>     struct sk_buff *last;
>>> };
>>>
>>> void dev_gro_receive(struct sk_buff *skb)
>>> {
>>>     ...
>>>     same_flow = NAPI_GRO_CB(skb)->same_flow;
>>>     ...
>>> }
>>>
>>> Right before the "same_flow = ... ->same_flow;" statement is
>>> executed, a store is made to the bitfield at the end of a called
>>> function:
>>>
>>>     NAPI_GRO_CB(skb)->same_flow = 1;
>>>
>>> The store is a byte:
>>>
>>>     orb $0x1,0x4a(%rbx)
>>>
>>> while the read is a word:
>>>
>>>     movzwl 0x4a(%r12),%r15d
>>>
>>> The problem is that between the store and the load the value hasn't
>>> been retired / placed in the cache. One would expect store-to-load
>>> forwarding to kick in, but on x86 that doesn't happen because x86
>>> requires the store to be of equal or greater size than the load. So
>>> instead the load takes the slow path, causing unacceptable
>>> slowdowns.
>>>
>>> GCC gets around this by using the smallest load for a bitfield. It
>>> seems to use a byte for everything, at least in our examples. From
>>> the comments, this is intentional, because according to the comments
>>> (which are never wrong) C++0x doesn't allow one to touch bits
>>> outside of the bitfield. (I'm not a language lawyer, but take this
>>> to mean that gcc is trying to minimize which bits it's accessing by
>>> using byte stores and loads whenever possible.)
>>>
>>> The question I have is what should we do to fix this issue? Once we
>>> get to LLVM IR, the information saying that we're accessing a
>>> bitfield is gone. We have a few options:
>>>
>>> * We can glean this information from how the loaded value is used
>>>   and fix this during DAG combine, but it seems a bit brittle.
>>>
>>> * We could correct the size of the load during the front-end's code
>>>   generation. This benefits from using all of LLVM's passes on the
>>>   code.
>>>
>>> * We could perform the transformation in another place, possibly in
>>>   MIR or MC.
>>>
>>> What do people think?
>>
>> Clang used to generate narrower loads and stores for bit-fields, but
>> a long time ago it was intentionally changed to generate wider loads
>> and stores, IIRC by Chandler. There are some cases where I think the
>> "new" code goes overboard, but in this case I don't particularly have
>> an issue with the wider loads and stores.
>>
>> I guess we could make a best-effort attempt to stick to the
>> storage-unit size when the bit-fields break evenly on a boundary. But
>> mostly I think the frontend's responsibility ends with it generating
>> same-size accesses in both places, and if inconsistent access sizes
>> trigger poor performance, the backend should be more careful about
>> intentionally changing access sizes.
Fair enough.
-bw
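To make the "same-size accesses in both places" point concrete, here is
a sketch of the source-level workaround of replacing the bit-field with
an explicit flags word. The names and bit assignments are invented for
illustration, not taken from the kernel or this thread; and note that,
as the thread points out, the backend may still narrow such a store:

    /* Hypothetical workaround sketch (names and bit assignments are
     * invented). An explicit 16-bit flags word makes the store and the
     * load the same width at the source level; the backend may still
     * narrow the |= with a constant mask to a byte "orb", which is
     * exactly the kind of size change the thread says the backend
     * should be careful about. */
    #include <stdint.h>

    #define GRO_SAME_FLOW  (1u << 0)
    #define GRO_ENCAP_MARK (1u << 1)

    struct gro_flags {
        uint16_t bits;
    };

    static inline void gro_set(struct gro_flags *f, unsigned mask)
    {
        f->bits |= (uint16_t)mask;      /* 16-bit read-modify-write */
    }

    static inline int gro_test(const struct gro_flags *f, unsigned mask)
    {
        return (f->bits & mask) != 0;   /* 16-bit load, matching width */
    }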