Coding for Cache Optimization
ECE 565
Performance Optimization & Parallelism
Duke University, Fall 2024
Motivation
• Memory Wall
– CPU speed and memory speed have grown at disparate rates
• CPU frequencies are much faster than memory frequencies
• Memory access takes many CPU cycles
– Hundreds, in fact!
– The latency of a load from memory will be in the 60-80ns range
• Cache hierarchy
– Caches are an integral part of current processor designs
• To reduce the impact of long memory latencies
– Cache hierarchies are often multiple levels today
• L1, L2, L3, sometimes L4
• Levels are larger and slower further down the hierarchy
Motivation (2)
• Cache hierarchy works on principle of locality
– Temporal locality – recently referenced memory addresses are
likely to be referenced again soon
– Spatial locality – memory addresses near recently referenced
memory addresses are likely to be referenced soon
• Locality allows a processor to retrieve a large portion of
memory references from the cache hierarchy
• Thus the programs we write should exhibit good locality!
– Good news – this is typically true of well-written programs
Locality Example
• Similar to the loop scenarios we’ve discussed –
int A[N];
int sum = 0;
for (int i = 0; i < N; i++) {
  sum = sum + A[i];
}
• Data locality for sum and elements of array A[]
– Temporal locality for sum, spatial locality for A[]
• Code locality for program instructions
– Loops create significant inherent code temporal locality
– Sequential streams of instructions create spatial locality
Single Slide Cache Reminder
(Diagram: Core with L1 $, L2 Cache, L3 Cache, and Memory below it; cache blocks, e.g. 64B, move between levels.)
uint64_t A[N];
for (int i = 0; i < N; i++) {
  A[i] = seed;
}
• What happens when the CPU loads A[0] for the first time?
– Assume 32B cache blocks: fetch A[0-3] (4 x 8B values) from memory
• What happens if the CPU next loads A[5]? Fetch A[4-7]
– Cache blocks are aligned
Important Cache Performance Metrics
• Miss Ratio
– Ratio of cache misses to total cache references
– Typically less than 10% for L1 cache, < 1% for an L2 cache
• Hit Time
– Time to deliver a line in the cache to the processor
– 2-3 CPU cycles for L1, 15-20 cycles for L2, ~40 cycles for L3
– 60-80ns for main memory (hundreds of cycles)
– Related concept is “load-to-use” time
• # of CPU cycles from the execution of a load instruction until
execution of an instruction that depends on the load value
• Miss Penalty
– Time required to access a line from the next level of the hierarchy
• Average access time = hit time + (miss rate * miss penalty)
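• Worked example (illustrative numbers in line with the times above): 3-cycle L1 hit time, 5% L1 miss rate, 20-cycle miss penalty to L2
– Average access time = 3 + (0.05 * 20) = 4 cycles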
Cache Friendly Code
• Strongly related to the benefits we discussed for certain
loop transformations
• Examples:
– Cold cache, 4-byte words, 4-word cache blocks
// Row-wise traversal (i outer, j inner)
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum += a[i][j];
  }
}
Miss rate = 1/4 = 25%

// Column-wise traversal (j outer, i inner)
for (j = 0; j < N; j++) {
  for (i = 0; i < N; i++) {
    sum += a[i][j];
  }
}
Miss rate = 100%
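• A minimal timing sketch of the two traversals above (illustrative only: N, the timing helper, and the compile flags, e.g. gcc -O1 traverse.c, are assumptions, not course-provided code):
#include <stdio.h>
#include <time.h>

#define N 4096
static int a[N][N];                    /* 64 MB, much larger than the caches */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    long long sum = 0;
    double t0 = seconds();
    for (int i = 0; i < N; i++)        /* row-wise: walks memory sequentially */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    double t1 = seconds();
    for (int j = 0; j < N; j++)        /* column-wise: strides N*4 bytes per access */
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    double t2 = seconds();
    printf("row-wise: %.3fs  column-wise: %.3fs  (sum=%lld)\n", t1 - t0, t2 - t1, sum);
    return 0;
}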
Reverse-engineering a Cache
• Assume you have a machine
• You do not know its
– Cache sizes (or number of levels of cache)
– Cache block sizes
– Cache associativity
– Latencies
– Data bandwidth
• We can find these out through test programs!
– Write targeted code
– Measure performance characteristics
– Analyze the measurements
In-Class Exercise
(Diagram: processor P with an L1 $, L2 $, L3 $, and Main Memory.)
• Code a targeted test program to determine
– # of caches in our machine
– Size of each cache
– Latency of each cache
• Assumptions
– LRU replacement policy in use for each cache
– Know cache block size
– Each level of cache hierarchy has a different access latency
Test Code Summary
• (See links to code files on the class schedule page)
• Repeatedly access elements of a data set of some size
– E.g. an array
– Each memory access should depend on value from prior access
• E.g. pointer chasing
• This exposes memory latency of each access
• Record execution time for this set of memory accesses
– Calculate average latency from
• measured time
• known # of accesses
• When data set size grows larger than the size of a cache level –
– We will see a step in the measured average access latency
– Latency will stay constant while the data set size fits within a cache level
• Loop of repeated memory accesses can be unrolled
– To reduce the interference from loop management instructions
• Access data set w/ fixed stride so each access touches a new cache block
• Can randomly cycle through data set elements to defeat prefetch
– Prefetch could disrupt measurement and blur the transition between cache levels
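• A minimal sketch of the pointer-chasing idea above (illustrative, not the course code on the schedule page; the default data-set size, access count, and shuffle are assumptions):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    /* Data set size in bytes from the command line, default 8 MB */
    size_t bytes = (argc > 1) ? strtoull(argv[1], NULL, 0) : (8u << 20);
    size_t n = bytes / sizeof(void *);
    void **chain = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));

    /* Build a random cyclic permutation so every load depends on the previous
       one and the prefetcher cannot guess the next address */
    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        chain[order[i]] = &chain[order[(i + 1) % n]];

    const size_t accesses = 100000000;       /* 100M dependent loads */
    void **p = &chain[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; i++)
        p = (void **)*p;                     /* each load depends on the previous load */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%zu B data set: %.2f ns/access (p=%p)\n",
           n * sizeof(void *), ns / accesses, (void *)p);   /* print p so the loop is not optimized away */
    free(chain); free(order);
    return 0;
}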
Example Results
(Plot: measured average access latency in CPU cycles vs. data set size in bytes, from 16KB to 2MB; the latency steps up through distinct L1, L2, and L3 regions as the data set outgrows each cache.)
* Measured on an Intel Core i7 CPU @ 2.4 GHz
* Program compiled with gcc -O3
Cache Access Patterns
(Diagram: a data set of 24 words accessed with a stride of 4 words.)
• Vary the data set accessed by our code
– As data set grows larger than a cache level, performance drops
• Performance can be measured as latency or bandwidth
• Vary the stride of data accessed by our code
– Affects spatial locality provided by cache blocks
– If stride is less than size of a cache line –
• Initial access may cause cache miss, sequential accesses are fast
– If stride is greater than size of a cache line –
• Sequential accesses are slow
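• A minimal sketch of the stride experiment (illustrative; the 64MB data set, stride range, and repetition count are assumptions); latency per access should jump once the stride reaches the cache block size:
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SET_BYTES (64 * 1024 * 1024)       /* 64 MB data set, larger than the caches */
static char data[SET_BYTES];

int main(void) {
    volatile char sink = 0;
    for (size_t stride = 8; stride <= 512; stride *= 2) {
        memset(data, 1, sizeof data);      /* touch everything once, outside the timed region */
        struct timespec t0, t1;
        size_t touches = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < 4; rep++)
            for (size_t i = 0; i < SET_BYTES; i += stride) {
                sink += data[i];           /* one access every "stride" bytes */
                touches++;
            }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Note: hardware prefetch of this sequential pattern may smooth the results */
        printf("stride %4zu B: %.2f ns/access\n", stride, ns / touches);
    }
    (void)sink;
    return 0;
}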
Memory Access Latency vs. Bandwidth
• Thus far, we’ve focused mostly on latency
– Hit time for a cache level or memory
– E.g. we put together example code using pointer chasing
• Stresses the latency of each access
• Only one memory access in flight at a given time
• Cache and memory bandwidth is also important
– Bandwidth is a rate
• Bytes per cycle
• GB per second
– Bandwidth gets smaller at lower levels of the memory hierarchy (P → L1 $ → L2 $ → L3 $ → Main Memory)
• Just as latency grows larger
– Some code is throughput sensitive, not latency sensitive
• Code performance would improve with higher access bandwidth
– Even if access latency increased
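• A minimal sketch of the difference (illustrative sizes; the dependent loop mirrors the pointer-chasing idea, the independent loop lets the core overlap many misses):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M longs = 128 MB, larger than the caches */
static long a[N];                   /* values double as pseudo-random indices */
static int idx[N];                  /* precomputed random indices */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    srand(1);
    for (int i = 0; i < N; i++) { a[i] = rand() % N; idx[i] = rand() % N; }

    /* Latency-sensitive: the address of each load depends on the value returned
       by the previous load, so only one miss is outstanding at a time */
    long pos = 0;
    double t0 = seconds();
    for (int i = 0; i < N; i++) pos = a[(pos + i) % N];
    double t1 = seconds();

    /* Throughput-sensitive: the random loads are independent of one another,
       so the core can keep many misses in flight and is limited by bandwidth */
    long sum = 0;
    for (int i = 0; i < N; i++) sum += a[idx[i]];
    double t2 = seconds();

    printf("dependent: %.2fs  independent: %.2fs  (%ld %ld)\n",
           t1 - t0, t2 - t1, pos, sum);
    return 0;
}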
Matrix Multiplication
• Common operation in scientific applications
• Significant interaction with cache & memory subsystem
(Example: two 4x4 matrices; the first element of the product is the dot product of row 1 of A and column 1 of B: 1*1 + 2*2 + 3*3 + 4*4 = 30.)
• Recall our memory layout discussion
– E.g. C/C++ uses row-major order
– 2D array is allocated as a linear array in memory
Matrix Multiplication Implementation
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}
• 3 Loops – i, j, k
– 6 ways to arrange the loops and multiply the matrices
• O(N^3) total operations
– N reads for each element of A and B
– N values to sum for each output element of C
Cache Analysis for Matrix Multiplication
• Each matrix element is 64 bits (a double)
• Assumptions:
– N is very large (cache cannot fit more than one row/column)
– Cache block size = 64 bytes (8 matrix elements per block)
• Consider access pattern for i,j,k loop structure
(Diagram: in the i,j,k order the inner loop walks A along row i, walks B down column j, and repeatedly updates the single element C[i][j].)
– A=good spatial locality; C=good temporal locality; B=poor locality
Matrix Multiplication
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}
Misses per inner-loop iteration: A = 0.125, B = 1, C = 0 → 1.125 total
• i-j-k
– Memory accesses for each inner loop iteration
• 2 loads: element A[i][k] and element B[k][j]
– A[i][k] access will be a cache miss once every 8 iterations (8B element / 64B block = 0.125 misses/iteration)
– B[k][j] access will be cache miss every iteration
• j-i-k cache miss behavior same as i-j-k
Matrix Multiplication
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;
for (k = 0; k < N; k++) {
  for (i = 0; i < N; i++) {
    tmp = A[i][k];
    for (j = 0; j < N; j++) {
      C[i][j] += tmp * B[k][j];
    }
  }
}
Misses per inner-loop iteration: A = 0, B = 0.125, C = 0.125 → 0.25 total
• k-i-j
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element B[k][j]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss once every 8 iterations (0.125 misses/iteration)
– B[k][j] access will be a cache miss once every 8 iterations (0.125 misses/iteration)
• i-k-j cache miss behavior same as k-i-j
Matrix Multiplication
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;
for (j = 0; j < N; j++) {
  for (k = 0; k < N; k++) {
    tmp = B[k][j];
    for (i = 0; i < N; i++) {
      C[i][j] += tmp * A[i][k];
    }
  }
}
Misses per inner-loop iteration: A = 1, B = 0, C = 1 → 2 total
• j-k-i
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element A[i][k]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss every iteration (stride-N, column-wise access)
– A[i][k] access will be a cache miss every iteration (stride-N, column-wise access)
• k-j-i cache miss behavior same as j-k-i
Matrix Multiplication Summary
• k is innermost loop (i-j-k, j-i-k)
– A = good spatial locality
– C = good temporal locality
– Misses per iteration: 1 + (element sz / block sz)
• i is innermost loop (j-k-i, k-j-i)
– B = good temporal locality
– Misses per iteration: 2
• j is innermost loop (k-i-j, i-k-j)
– B, C = good spatial locality
– A = good temporal locality
– Misses per iteration: 2 * (element sz / block sz)
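• A minimal sketch (illustrative: N, the timing helper, and flags such as gcc -O2 mm.c are assumptions) comparing the worst order above (i innermost, j-k-i) with the best (j innermost, k-i-j):
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 1024
static double A[N][N], B[N][N], C[N][N];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* j-k-i: A and C are walked down columns -> ~2 misses per inner iteration */
    memset(C, 0, sizeof C);
    double t0 = seconds();
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double tmp = B[k][j];
            for (int i = 0; i < N; i++)
                C[i][j] += tmp * A[i][k];
        }
    double t1 = seconds();

    /* k-i-j: B and C are walked along rows -> ~0.25 misses per inner iteration */
    memset(C, 0, sizeof C);
    double t2 = seconds();
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double tmp = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += tmp * B[k][j];
        }
    double t3 = seconds();

    printf("j-k-i: %.2fs  k-i-j: %.2fs  (C[0][0]=%.0f)\n", t1 - t0, t3 - t2, C[0][0]);
    return 0;
}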
Other Types of Caching
• Main memory is a cache for disk
– Operates at a physical page granularity
• TLB is a cache for page table
– Translation Lookaside Buffer
– Operates at a page table entry granularity
Virtual Address Space
(Diagram: virtual address space layout from 0x00000000 to 0xffffffff: Text, User Heap, Shared Libraries, User Stack, and Kernel Space.)
• Each process thinks it has access to the full address space provided by the machine
– 4GB on 32-bit computers
– ~16-256TB on 64-bit computers
– Illusion provided by OS
• Every process has its own virtual address space
– Contents visible only to that process
• Address space for even one process is larger than physical memory
• How does it fit?
Physical Memory as a Cache for Disk
(Diagram: virtual address spaces of Process 0 through Process N mapped onto physical addresses.)
Physical Memory as a Cache for Disk
(Diagram: virtual address spaces of Process 0 through Process N mapped onto physical addresses.)
Two Key Performance Aspects:
1) Which addresses from various processes should we keep in physical memory?
2) How do we do this mapping between a virtual address and its address in physical memory?
What To Keep in Physical Memory?
• Want to service memory accesses from DRAM, not disk
– That is, if the access misses in all caches already
– Disk is thousands of times slower than DRAM
• ~60-80ns for DRAM vs. several milliseconds for disk
• Locality still rules the day
– Just as we discussed for caches
• Want to capture spatial and temporal locality in memory
– Temporal: retain recently accessed addresses in physical mem
– Spatial: fetch nearby addresses into physical mem on disk access
What To Keep in Physical Memory?
• OS manages the physical memory and disk accesses
– Physical memory management is software-based
– Operates on physical memory in units of pages
• Pages have some size of 2^P bytes (e.g. 4KB)
• Much larger than cache block size due to huge latency of disk
– OS treats physical memory as a fully associative cache
• A virtual page can be placed in any physical page
– Page replacement policies may be complex
• SW can maintain and track much more state than HW
• Since accessing a page from disk is so slow there is lots of time!
• Physical memory holds the large-scale working set of a program
– Working set size less than physical mem size
• Good performance for a process after memory is warmed up
– Working set sizes of all active processes greater than physical mem size
• Can cause a severe performance problem
• Called thrashing: pages are continuously swapped between memory and disk
Metrics
• Page hit
– Memory reference to an address that is stored in physical mem
• Page miss (page fault)
– Reference to an address that is not in physical memory
– Misses are expensive
• Access to disk
• Software is involved in managing the process
Physical Memory as a Cache for Disk
(Diagram: virtual address spaces of Process 0 through Process N mapped onto physical addresses. This looks complicated!)
What is Stored Where in Physical Memory?
• Need to remember current location for every virtual page
– Since physical mem is managed as fully associative cache
• Solution is called a page table
– Maps virtual pages to physical pages
– Per-process software data structure
(Diagram: a per-process Page Table with entries PTE 0 through PTE 5. Each entry holds a valid bit plus a physical page number or disk address; valid entries point to virtual pages resident in Physical memory, while invalid entries point to copies on Disk or are null.)
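• A purely illustrative sketch of what one entry holds (field names and widths are assumptions; real PTEs, e.g. x86, pack this into a single hardware-defined word):
#include <stdint.h>

struct pte {
    unsigned valid : 1;       /* 1 = the virtual page is resident in physical memory */
    unsigned dirty : 1;       /* the page has been written since it was brought in */
    uint64_t ppn_or_disk;     /* physical page number if valid, otherwise a disk address */
};

/* One table per process, indexed by virtual page number (VPN) */
struct pte page_table[1 << 20];   /* e.g. 2^20 entries for a 32-bit VA with 4KB pages */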
Page Table Management
• Page tables are very large
– One entry per page
– For 32-bit address space:
• Assume 4 GB total virtual memory (2^32 bytes)
• Assume 4KB pages (2^12 bytes)
• 2^32 / 2^12 = 2^20 entries
• PTE is 4B in x86 architecture: 2^20 entries * 4B = 2^22 bytes (4MB)
• And that’s just for one process!
• Keep portions of the page table in memory; rest on disk
– The frequently and recently accessed portions, that is
But That’s Not Quite Enough
(Diagram: on the processor chip, the CPU sends a virtual address to the MMU, which produces the physical address used by the L1 I$/D$, the L2 $, and physical memory.)
• Physical addresses are needed for cache lookups
– Beginning at the L1 cache
• PTE is needed to turn a VA into PA
– PTE is located in memory (at best)
– Memory access required for every load or store?
Translation Lookaside Buffer (TLB)
• TLB is a very fast cache of PTEs
– Located inside the MMU of a processor
– Go directly from virtual page to physical page address
• Hierarchy of TLBs in current processor designs
– Separate instruction & data L1 TLBs, L2 TLB
(Diagram: same as before, but the MMU now contains a TLB; the CPU's virtual address is translated to the physical address used by the L1 I$/D$, the L2 $, and physical memory.)
TLB Reach
• TLB Reach
– Amount of memory accessible from the TLB
– Should cover the working set size of a process
– (# TLB entries) * (Page size)
• For example
– 64 TLB entries in L1 DTLB * 64KB pages = 4MB reach
Increasing TLB Reach
• What if we need to cover larger working sets?
• We can’t increase the number of TLB entries
– Well, we could wait a few years for a newer processor
• We can increase the page size
– Most modern architectures support a set range of page sizes
– From 4KB to 16MB
• Example - hugepages
– Consult your favorite OS manual to turn on hugepages
– E.g. libhugetlbfs
– RHEL example:
$ grep Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
$ grep HugePages_Total /proc/meminfo
HugePages_Total: 0
$ sysctl -w vm.nr_hugepages=128
$ grep HugePages_Total /proc/meminfo
HugePages_Total: 128
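• Once pages are reserved as above, a Linux program can also request huge pages explicitly; a minimal sketch (the mapping size and error handling are illustrative):
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL * 1024 * 1024;   /* 64 MB: a multiple of the 2 MB hugepage size */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    /* ... use buf; each 2 MB hugepage now needs only one TLB entry ... */
    munmap(buf, len);
    return 0;
}
• Libraries such as libhugetlbfs (mentioned above) can arrange this without source changes.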
Memory Systems
Server Diagram
(Diagram: a server board, with the memory DIMM slots highlighted.)
Typical Server Architecture
(Diagram: a CPU (multi-core chip + caches) attached to its Main Memory, with PCIe I/O links to external components (network, storage, etc.) and links to the other CPUs in the same server (often proprietary I/O). Everything runs under a single OS image that sees all CPU cores and memories.)
Memory Bandwidth (DRAM)
(Diagram: CPU (multi-core chip + caches) with an on-chip Mem Controller connected to Main Memory.)
• Currently, main memory is often DRAM
– DDR standards defined by JEDEC
• DDR = Double Data Rate
– DDR4 common in current server CPUs
– New servers offer DDR5
– Various speed grades available
• More on this in a minute
• Connected to CPU w/ channels
– e.g. 4, 8, 16 (uses chip I/O pins)
• Memory controller schedules requests from the CPU
– Reads from cache misses
– Writes from cache evictions of dirty cache blocks
DRAM Performance
• Latency: we’ve discussed before
– 60-70ns load-to-use latency in current server CPUs is common
– This would be an uncontended load (no other traffic on the chip)
• Bandwidth (peak): defined by several factors
– Number of off-chip DDR channels (e.g. 4, 8, …)
– Bit width of each channel (e.g. 8 bytes)
– Channel frequency (e.g. 3200 MHz)
• Bandwidth (effective)
– Defined by ability of the memory controller to schedule requests
to utilize the DRAM devices and DDR channels efficiently
– 80% of peak B/W is a typical reasonable utilization
– Could be a bit better for some controllers or programs
• E.g. mostly read traffic compared to an even mix of read-write
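• Putting the factors together (same form as the worked example on the next slide):
– Peak B/W = (# channels) * (channel width in bytes) * (transfer rate in GT/s)
– Effective B/W ≈ 0.8 * peak for a typically well-utilized memory controller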
Server DRAM B/W Example
(Figure: example server DRAM configuration.)
Server DRAM B/W Example (2)
• 6 DDR4 memory channels per chip; 8B channel width
• 2666 = 2.666 Giga-transfers per second (GT/s)
• Peak DDR4 B/W per-chip:
– 6 channels * 8 B/channel * 2.666 GT/s = 127.97 GB/s
• DDR4 supports up to 3200 MT/s
• What would be B/W with 8 channels and 3200 DDR4?
– 8 * 8 * 3.2 = 204.8 GB/s
• Remember this is peak; Program would see ~80% of this
More on Memory B/W
• Speaking of what a program would see…
• Remember B/W is a rate (# of bytes or accesses per time)
• Programs that generate many outstanding memory
requests can stress bandwidth more than latency
• Would we see peak Mem B/W w/ a program on 1 core?
– No
– Memory system is provisioned to provide bandwidth to all cores
• What defines B/W we could get from 1 core?
– Max number of in-flight memory requests a core can sustain
– Typically will be defined by size of Cache Miss handling structures
• e.g. size of Read Queue or MSHRs (Miss Status Holding Registers)
Bandwidth from 1 Core
• Let’s look at an example:
– Suppose cache block size is 64 bytes, chip frequency is 2.4 GHz
– Suppose per-core L2 can support 32 pending cache misses
– Suppose average latency is 200 cycles
• Remember there is contention now so each access may take longer
– B/W = ((32 misses * 64 B/miss) / 200 cycles) * 2.4 GHz
• 24.576 GB/s would be max B/W for a single core
Other Memory Technologies
• Other memory technologies are emerging
• For use along with DRAM or as a replacement
• E.g. GPUs today use High Bandwidth Memory (HBM)
– Also defined by standards, similarly to DRAM
– Provides higher bandwidth (via wide interfaces and die stacking)
– But maximum capacity is much lower
• HBM2 / HBM2E
– Rates quoted in terms of "giga-transfers per second"
– HBM2: 128B interface (wide!), up to 2 GT/s = 256 GB/s per “stack”
– e.g. NVIDIA Volta GPU:
• 4 HBM stacks per GPU (16 GB total) for 900 GB/s per GPU
• HBM3 is current standard