[llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long
Michael Clark via llvm-dev llvm-dev at lists.llvm.org
Fri Apr 21 00:30:03 PDT 2017
- Previous message: [llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long
- Next message: [llvm-dev] Permissions for llvm-mirror - Setting up Libc++ Appveyor builders
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 21 Apr 2017, at 2:23 PM, Michael Clark <michaeljclark at mac.com> wrote:
On 21 Apr 2017, at 12:30 PM, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:
I think it’s generally true that whenever branches can reliably be predicted, branching is faster than a cmov that involves speculative execution, and I would guess that your assessment regarding looping on input values is probably correct.
Yes, it’s based on the assumption that val <= LLONG_MAX, i.e. that the branch is predicted successfully. That may not always be the case, but breaking branch prediction would require an unpredictable sequence of values <= LLONG_MAX and > LLONG_MAX. I was curious and microbenchmarked it:
Best of 10 runs on a MacBookPro Ivy Bridge Intel Core i7-3740QM
$ time fcvt-branch
real 0m0.208s
user 0m0.201s
sys 0m0.002s
$ time fcvt-cmov
real 0m0.241s
user 0m0.235s
sys 0m0.002s
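The sources of fcvt-branch and fcvt-cmov are not included in this message, so the following is only a sketch of what the branching form of the float to unsigned 64-bit conversion might look like when built on top of a signed conversion. The function name `fptoui_via_fptosi` is hypothetical. A cmov-style variant would compute both arms unconditionally and then pick one, which is where the extra subtraction on the common path comes from.

```cpp
#include <cstdint>

// Hypothetical helper: convert a non-negative float to uint64_t using only a
// signed 64-bit conversion, in the branching style discussed above.
std::uint64_t fptoui_via_fptosi(float x) {
    const float two63 = 9223372036854775808.0f;   // 2^63, exactly representable
    const std::uint64_t sign_bit = 1ULL << 63;

    if (x < two63) {
        // Common case: the value fits in a signed 64-bit integer.
        return static_cast<std::uint64_t>(static_cast<std::int64_t>(x));
    }
    // Large case: shift into signed range, convert, then restore the top bit.
    std::int64_t shifted = static_cast<std::int64_t>(x - two63);
    return static_cast<std::uint64_t>(shifted) ^ sign_bit;
}
```

With a predictable input stream (all values below 2^63), only the first arm executes, so the subtraction is never performed.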
I believe the code that actually creates most of the transformation you’re interested in here is in SelectionDAGLegalize::ExpandNode() in LegalizeDAG.cpp. The X86 backend sets a table entry indicating that FP_TO_UINT should be expanded for these value types, but the actual expansion is in target-independent code. This is what it looks like in the version I last fetched:
  case ISD::FP_TO_UINT: {
    SDValue True, False;
    EVT VT = Node->getOperand(0).getValueType();
    EVT NVT = Node->getValueType(0);
    APFloat apf(DAG.EVTToAPFloatSemantics(VT),
                APInt::getNullValue(VT.getSizeInBits()));
    APInt x = APInt::getSignBit(NVT.getSizeInBits());
    (void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);
    Tmp1 = DAG.getConstantFP(apf, dl, VT);
    Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),
                        Node->getOperand(0), Tmp1, ISD::SETLT);
    True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));
    // TODO: Should any fast-math-flags be set for the FSUB?
    False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,
                        DAG.getNode(ISD::FSUB, dl, VT,
                                    Node->getOperand(0), Tmp1));
    False = DAG.getNode(ISD::XOR, dl, NVT, False,
                        DAG.getConstant(x, dl, NVT));
    Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);
    Results.push_back(Tmp1);
    break;
  }

The tricky bit here is that this code is asking for a Select, and then something else will decide whether that select should be implemented as a branch or a cmov.

Good. I had found ISD::FP_TO_UINT but had not found the target-independent code, as I was digging in llvm/lib/Target/X86. I had in fact just started looking at the target-independent code after realising it was likely not target-specific.

This issue could potentially affect any hard-float target with IEEE-754 accrued exceptions and conditional moves, as the unconditional FSUB will set INEXACT. I can see comments in LowerSELECT in lib/Target/X86/X86ISelLowering.cpp regarding selection of branch or cmov, and wonder if the DAG can be matched there or whether the fix is in target-independent code. It seems like a SELECT node with any sufficiently large number of child nodes should use a branch instead of a conditional move.

I wonder about the cost model for predicate logic and cmov. Modern branch predictors are actually pretty good, so if LLVM X86 is using predication when the cost of a branch is less, it could result in a loss of performance.
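The INEXACT effect of the unconditional FSUB described above can be observed in isolation. The sketch below is not LLVM code; it performs only the subtraction of 2^63 from a float at run time and reports whether FE_INEXACT was raised. The function name `fsub_raises_inexact` is made up for illustration, and the `volatile` qualifiers are an assumption about how to keep the compiler from folding or moving the subtraction, since strict floating-point environment access is not guaranteed by every compiler.

```cpp
#include <cfenv>

// Hypothetical probe: does computing (x - 2^63) in float raise FE_INEXACT?
// This mirrors the FSUB on the "False" arm, which a cmov-style lowering
// executes even when x is small and the "True" arm is selected.
bool fsub_raises_inexact(float x) {
    volatile float vx = x;                            // force a run-time load
    volatile float bias = 9223372036854775808.0f;     // 2^63

    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float diff = vx - bias;                  // the unconditional FSUB
    (void)diff;
    return std::fetestexcept(FE_INEXACT) != 0;
}
```

For an input such as 1.0f the difference 1 - 2^63 is not representable in a float and has to be rounded, so the flag is expected to be set even though converting 1.0f to an integer is itself exact. For an input of exactly 2^63 the difference is zero and no rounding occurs.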
I’m now curious about the more general possibility of controlling whether SELECT is lowered to branches or to predication using cmov. Can this be controlled? Anecdotally, the RISC-V CPU architects recommend branches over predicate logic, as in their case (Rocket) a branch mispredict costs only 3 cycles.

BTW - semi off-topic. The RISC-V interpreter I am working on seems to be a pathological test case for the LLVM/Clang optimiser (-O3) compared with GCC (-O3), with LLVM/Clang producing code that runs nearly twice as slow as GCC's. I don’t know exactly what I’ve done for this to happen; too many switch statements, I suspect. Branchy code versus predication, perhaps? Branchiness might also explain GCC’s lead on SciMark Monte Carlo, assuming Monte Carlo is branchy. Now I am guessing, although after some googling I see that clang generates x86_64 asm that prefers predication, versus branches in gcc.

Note this CPU simulator test requires the RISC-V GCC toolchain to be installed. Here is a step by step for anyone interested in a pathological optimiser test case for Clang:

- https://github.com/riscv/riscv-gnu-toolchain/
- https://github.com/michaeljclark/riscv-meta/

$ git clone https://github.com/riscv/riscv-gnu-toolchain.git
$ git clone https://github.com/michaeljclark/riscv-meta.git
$ cd riscv-gnu-toolchain
$ export RISCV=/opt/riscv-gnu-toolchain
$ ./configure --prefix=$RISCV
$ make
$ cd ..
$ cd riscv-meta
$ git submodule update --init --recursive
$ export RISCV=/opt/riscv-gnu-toolchain
$ make -j4 CXX=g++ V=1
$ make test-build
$ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
real 0m28.280s
user 0m28.280s
sys 0m0.000s

$ make clean
$ make -j4 CXX=clang++-3.9 V=1
$ make test-build
$ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
real 0m52.533s
user 0m52.532s
sys 0m0.000s

$ g++ --version
g++ (Debian 6.3.0-6) 6.3.0 20170205
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ clang++-3.9 --version
clang version 3.9.0-6 (tags/RELEASE_390/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

There is also a RISC-V -> x86_64 JIT engine (the x86_64 JIT currently covers the RISC-V integer ISA, hard float coming soon…):

$ time ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
real 0m0.838s
user 0m0.840s
sys 0m0.000s

For typical native code, Clang and GCC produce binaries that perform the same:
$ clang -O3 src/test/test-sha512.c -o test-sha512
$ time ./test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
real 0m0.285s
user 0m0.280s
sys 0m0.004s

$ gcc -O3 src/test/test-sha512.c -o test-sha512
$ time ./test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
real 0m0.285s
user 0m0.284s
sys 0m0.000s

Michael.