18 events
Dec 20, 2017 at 8:56 comment added Clearer No, I haven't done any real GPU work, but as you point out yourself -- we're talking about CPUs, so I don't really see why you need to bring up GPUs.
Dec 20, 2017 at 3:31 comment added Peter Cordes Related: "Which opcodes are faster at the CPU level?" has some hand-wavy relative-cost numbers for different operations, and links to real CPU performance-analysis info.
Dec 20, 2017 at 2:29 comment added Krupip @PeterCordes You've convinced me; that is a pretty bad benchmark. I had just assumed Intel attached addition to the multiply circuit or something to get results like that.
Dec 20, 2017 at 2:26 history edited Krupip CC BY-SA 3.0
deleted 139 characters in body
Dec 20, 2017 at 0:56 comment added Peter Cordes Anyway, the only thing you learn from that benchmark is that they somehow wrote / compiled the FP benchmarks badly so they didn't vectorize well for 32-bit float, and that div / mod are so slow they can't keep up with memory.
Dec 20, 2017 at 0:52 comment added Peter Cordes Like I said, it's mostly a memory benchmark for Haswell. With SSE2 (the benchmark didn't use -march=native; update: or did it? they mention AVX), Haswell has a max throughput of 3 PAND instructions per clock, each doing 16 bytes of bitwise AND (regardless of element size, of course; AND is bitwise, so it doesn't need element boundaries). But 32-bit SIMD multiplication throughput is much lower: SSE4.1 PMULLD takes 2 uops on Haswell with one-per-2-cycle throughput, and SSE2 doesn't even have that instruction. These numbers make no sense at all unless everything was purely memory bottlenecked.
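(For reference, a minimal C intrinsics sketch of the two instructions Peter names; the function names are illustrative, not from the benchmark. PAND is exposed as _mm_and_si128 and SSE4.1 PMULLD as _mm_mullo_epi32; the throughput gap between them only shows up when a loop isn't memory-bound.)

    /* Sketch only: wrappers around the intrinsics for the instructions
     * discussed above. Function names are made up for illustration. */
    #include <smmintrin.h>   /* SSE4.1; also pulls in SSE2 */

    __m128i and_lanes(__m128i a, __m128i b) { return _mm_and_si128(a, b); }   /* PAND   */
    __m128i mul_lanes(__m128i a, __m128i b) { return _mm_mullo_epi32(a, b); } /* PMULLD */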
Dec 20, 2017 at 0:42 comment added Peter Cordes @Christoph is correct. The benchmark you link is for a loop over an array, c[i] = a[i] OP b[i] (i.e. 2 loads and 1 store per operation), so the times are dominated by memory bandwidth because of the very low computational intensity. The array size isn't shown, so IDK if it fit in L1D. (gcc 4.9 -Ofast very likely auto-vectorized those loops, so you're not even measuring the cost of normal scalar operations as part of complex integer code.) The first line of that page is: "IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way."
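(A hedged sketch of the kind of loop being described; the function name and element type are assumptions, not taken from the benchmark source. With two loads and one store per element, the operation in the middle is largely hidden behind memory traffic once the arrays exceed cache.)

    #include <stddef.h>
    #include <stdint.h>

    /* c[i] = a[i] OP b[i]: 2 loads + 1 store per operation, so very low
     * computational intensity. For arrays that don't fit in cache, swapping
     * the '*' for '&' or '+' barely changes the measured time. */
    void bench_op(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }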
Dec 19, 2017 at 14:57 history edited Krupip CC BY-SA 3.0
added 12 characters in body
Dec 19, 2017 at 14:41 history edited Krupip CC BY-SA 3.0
added 66 characters in body
Dec 19, 2017 at 14:40 comment added Krupip @Christoph Going off of benchmarks, for the programmer, no, they actually are the same speed on Intel: nicolas.limare.net/pro/notes/2014/12/12_arit_speed
Dec 19, 2017 at 14:38 comment added Krupip @Clearer I know we're talking about CPUs here, but you've never programmed for GPUs, have you? The same code produces such wildly different performance that in CUDA you are often forced to take hardware capabilities into consideration. That's where I was coming from with this: cache size (shared memory, the managed L1 cache) actually needs to be taken into consideration in how you code something in CUDA.
Dec 19, 2017 at 9:27 comment added Clearer I want to downvote this for the simple reason that if you have to do these things more than once, you've done all of it wrong in the first place. Different amounts of cache shouldn't change how you use it, only how much you can keep in it. If you're not using all your cores optimally when you have 5 cores, you're doing it wrong from the beginning and might as well throw out your code. It's virtually impossible to take more than one (maybe two) levels of cache into consideration, especially if you're not in full control of the CPU (i.e. you're not the kernel).
Dec 19, 2017 at 7:47 comment added Christoph "like multiplication taking more time than addition, where as today on modern intel and amd CPUS it takes the same amount of time" That's not at all true. In pipelined architectures you have to differentiate between latency (when the result is ready) and throughput (how many you can do per cycle). Integer addition on modern Intel processors has a throughput of 4 per cycle and a latency of 1; multiplication has a throughput of 1 and a latency of 3 (or 4). These are the things that change with each architecture and need optimization. E.g. pdep takes 1 cycle on Intel but 6 on Ryzen, so you might not want to use it on Ryzen.
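(A rough C sketch, my own and not from the thread, of why latency and throughput have to be measured separately on a pipelined, out-of-order core: a serial dependency chain runs at the multiplier's latency, while independent chains approach its per-cycle throughput.)

    #include <stdint.h>

    /* Latency-bound: each multiply needs the previous result, so the loop
     * runs at roughly (multiply latency) cycles per iteration. */
    uint64_t mul_latency_chain(uint64_t x, uint64_t n) {
        for (uint64_t i = 0; i < n; i++)
            x = x * 3 + 1;
        return x;
    }

    /* Throughput-bound: four independent chains can be in flight at once,
     * so the loop approaches the multiplier's per-cycle throughput instead. */
    uint64_t mul_throughput_chains(uint64_t a, uint64_t b, uint64_t c,
                                   uint64_t d, uint64_t n) {
        for (uint64_t i = 0; i < n; i++) {
            a = a * 3 + 1;
            b = b * 3 + 1;
            c = c * 3 + 1;
            d = d * 3 + 1;
        }
        return a ^ b ^ c ^ d;
    }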
Dec 19, 2017 at 5:58 comment added slebetman @MaciejPiechotka: Worrying about cache coherency does not necessarily mean worrying that the cache would ever be in an incoherent state. It's more like worrying about the penalty of the CPU maintaining cache coherency: there are scenarios where updating the cache stalls computation. Of course, this depends not only on cache size but also on the architecture.
Dec 18, 2017 at 21:05 history edited Krupip CC BY-SA 3.0
added 44 characters in body
Dec 18, 2017 at 20:59 comment added Margaret Bloom Maciej was just picking up on your statement about cache coherency :) You probably meant "cache optimization" or something. Cache coherence is the ability of a system to keep a consistent view of memory, transparently to the software, even in the presence of N independent caches. This is completely orthogonal to the cache size. TBH the statement is not really relevant, but your answer (especially points 5 & 6) addresses the question better than the accepted one, IMO :) Maybe stressing the difference between architecture and microarchitecture will make it stand out more.
Dec 18, 2017 at 18:34 comment added Maja Piechotka "Increased or decreased levels of cache means you need to worry less about cache coherency." - virtually any CPU is cache coherent. Do you mean false sharing? Even then, the cache line on virtually any CPU is almost always 64 B...
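(A hypothetical C11 sketch, not from the discussion, of the false sharing Maciej mentions: two per-thread counters that land on the same 64-byte line make that line bounce between cores, while aligning each counter to its own 64-byte line avoids the penalty. The struct names are made up for illustration.)

    #include <stdalign.h>
    #include <stdint.h>

    /* Both counters likely share one 64-byte cache line: writes from two
     * threads ping-pong the line between cores (false sharing). */
    struct counters_shared {
        uint64_t thread0_hits;
        uint64_t thread1_hits;
    };

    /* Each counter gets its own 64-byte line, so the threads don't contend. */
    struct counters_padded {
        alignas(64) uint64_t thread0_hits;
        alignas(64) uint64_t thread1_hits;
    };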
Dec 18, 2017 at 16:10 history answered Krupip CC BY-SA 3.0