Coding for Cache Optimization
ECE 565
Performance Optimization & Parallelism
Duke University, Fall 2024
Motivation
• Memory Wall
– CPU speed and memory speed have grown at disparate rates
• CPU frequencies are much faster than memory frequencies
• Memory access takes many CPU cycles
– Hundreds, in fact!
– The latency of a load from memory will be in the 60-80ns range
• Cache hierarchy
– Caches are an integral part of current processor designs
• To reduce the impact of long memory latencies
– Cache hierarchies are often multiple levels today
• L1, L2, L3, sometimes L4
• Levels are larger and slower further down the hierarchy
Motivation (2)
• Cache hierarchy works on principle of locality
– Temporal locality – recently referenced memory addresses are
likely to be referenced again soon
– Spatial locality – memory addresses near recently referenced
memory addresses are likely to be referenced soon
• Locality allows a processor to retrieve a large portion of
memory references from the cache hierarchy
• Thus the programs we write should exhibit good locality!
– Good news – this is typically true of well-written programs
Locality Example
• Similar to the loop scenarios we’ve discussed –
int A[N];
int sum = 0;
for (int i = 0; i < N; i++) {
  sum = sum + A[i];
}
• Data locality for sum and elements of array A[]
– Temporal locality for sum, spatial locality for A[]
• Code locality for program instructions
– Loops create significant inherent code temporal locality
– Sequential streams of instructions create spatial locality
Single Slide Cache Reminder
(Diagram: Core with L1 $, L2 Cache, L3 Cache, and Memory below it; cache blocks, e.g. 64B, move between levels.)
uint64_t A[N];
for (int i = 0; i < N; i++) {
  A[i] = seed;
}
• What happens when the CPU loads A[0] for the first time?
– Assume 32B cache blocks: fetch A[0-3] (4 x 8B values) from memory
• What happens if the CPU next loads A[5]? Fetch A[4-7]
– Cache blocks are aligned
Important Cache Performance Metrics
• Miss Ratio
– Ratio of cache misses to total cache references
– Typically less than 10% for L1 cache, < 1% for an L2 cache
• Hit Time
– Time to deliver a line in the cache to the processor
– 2-3 CPU cycles for L1, 15-20 cycles for L2, ~40 cycles for L3
– 60-80ns for main memory (hundreds of cycles)
– Related concept is “load-to-use” time
• # of CPU cycles from the execution of a load instruction until
execution of an instruction that depends on the load value
• Miss Penalty
– Time required to access a line from the next level of the hierarchy
• Average access time = hit time + (miss rate * miss penalty)
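• Worked example (illustrative numbers in line with the times above): 3-cycle L1 hit time, 5% L1 miss rate, 20-cycle miss penalty to L2
– Average access time = 3 + (0.05 * 20) = 4 cycles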
Cache Friendly Code
• Strongly related to the benefits we discussed for certain
loop transformations
• Examples:
– Cold cache, 4-byte words, 4-word cache blocks
// Row-wise traversal (i outer, j inner)
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum += a[i][j];
  }
}
Miss rate = 1/4 = 25%

// Column-wise traversal (j outer, i inner)
for (j = 0; j < N; j++) {
  for (i = 0; i < N; i++) {
    sum += a[i][j];
  }
}
Miss rate = 100%
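• A minimal timing sketch of the two traversals above (illustrative only: N, the timing helper, and the compile flags, e.g. gcc -O1 traverse.c, are assumptions, not course-provided code):
#include <stdio.h>
#include <time.h>

#define N 4096
static int a[N][N];                    /* 64 MB, much larger than the caches */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    long long sum = 0;
    double t0 = seconds();
    for (int i = 0; i < N; i++)        /* row-wise: walks memory sequentially */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    double t1 = seconds();
    for (int j = 0; j < N; j++)        /* column-wise: strides N*4 bytes per access */
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    double t2 = seconds();
    printf("row-wise: %.3fs  column-wise: %.3fs  (sum=%lld)\n", t1 - t0, t2 - t1, sum);
    return 0;
}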
Reverse-engineering a Cache
• Assume you have a machine
• You do not know its
– Cache sizes (or number of levels of cache)
– Cache block sizes
– Cache associativity
– Latencies
– Data bandwidth
• We can find these out through test programs!
– Write targeted code
– Measure performance characteristics
– Analyze the measurements
In-Class Exercise
(Diagram: processor P with an L1 $, L2 $, L3 $, and Main Memory.)
• Code a targeted test program to determine
– # of caches in our machine
– Size of each cache
– Latency of each cache
• Assumptions
– LRU replacement policy in use for each cache
– Know cache block size
– Each level of cache hierarchy has a different access latency
Test Code Summary
• (See links to code files on the class schedule page)
• Repeatedly access elements of a data set of some size
– E.g. an array
– Each memory access should depend on value from prior access
• E.g. pointer chasing
• This exposes memory latency of each access
• Record execution time for this set of memory accesses
– Calculate average latency from
• measured time
• known # of accesses
• When data set size grows larger than the size of a cache level –
– We will see a step in the measured average access latency
– Latency will stay constant while the data set size fits within a cache level
• Loop of repeated memory accesses can be unrolled
– To reduce the interference from loop management instructions
• Access data set w/ fixed stride so each access touches a new cache block
• Can randomly cycle through data set elements to defeat prefetch
– Prefetch could disrupt measurement and blur the transition between cache levels
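• A minimal sketch of the pointer-chasing idea above (illustrative, not the course code on the schedule page; the default data-set size, access count, and shuffle are assumptions):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    /* Data set size in bytes from the command line, default 8 MB */
    size_t bytes = (argc > 1) ? strtoull(argv[1], NULL, 0) : (8u << 20);
    size_t n = bytes / sizeof(void *);
    void **chain = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));

    /* Build a random cyclic permutation so every load depends on the previous
       one and the prefetcher cannot guess the next address */
    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        chain[order[i]] = &chain[order[(i + 1) % n]];

    const size_t accesses = 100000000;       /* 100M dependent loads */
    void **p = &chain[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; i++)
        p = (void **)*p;                     /* each load depends on the previous load */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%zu B data set: %.2f ns/access (p=%p)\n",
           n * sizeof(void *), ns / accesses, (void *)p);   /* print p so the loop is not optimized away */
    free(chain); free(order);
    return 0;
}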
Example Results
(Plot: measured average access latency in CPU cycles vs. data set size in bytes, from 16KB to 2MB; the latency steps up through distinct L1, L2, and L3 regions as the data set outgrows each cache.)
* Measured on an Intel Core i7 CPU @ 2.4 GHz
* Program compiled with gcc -O3
Cache Access Patterns
(Diagram: a data set of 24 words accessed with a stride of 4 words.)
• Vary the data set accessed by our code
– As data set grows larger than a cache level, performance drops
• Performance can be measured as latency or bandwidth
• Vary the stride of data accessed by our code
– Affects spatial locality provided by cache blocks
– If stride is less than size of a cache line –
• Initial access may cause cache miss, sequential accesses are fast
– If stride is greater than size of a cache line –
• Sequential accesses are slow
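• A minimal sketch of the stride experiment (illustrative; the 64MB data set, stride range, and repetition count are assumptions); latency per access should jump once the stride reaches the cache block size:
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SET_BYTES (64 * 1024 * 1024)       /* 64 MB data set, larger than the caches */
static char data[SET_BYTES];

int main(void) {
    volatile char sink = 0;
    for (size_t stride = 8; stride <= 512; stride *= 2) {
        memset(data, 1, sizeof data);      /* touch everything once, outside the timed region */
        struct timespec t0, t1;
        size_t touches = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < 4; rep++)
            for (size_t i = 0; i < SET_BYTES; i += stride) {
                sink += data[i];           /* one access every "stride" bytes */
                touches++;
            }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Note: hardware prefetch of this sequential pattern may smooth the results */
        printf("stride %4zu B: %.2f ns/access\n", stride, ns / touches);
    }
    (void)sink;
    return 0;
}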
Memory Access Latency vs. Bandwidth
• Thus far, we’ve focused mostly on latency
– Hit time for a cache level or memory
– E.g. we put together example code using pointer chasing
• Stresses the latency of each access
• Only one memory access in flight at a given time
• Cache and memory bandwidth is also important
– Bandwidth is a rate
• Bytes per cycle
• GB per second
– Bandwidth gets smaller at lower levels of the memory hierarchy (P → L1 $ → L2 $ → L3 $ → Main Memory)
• Just as latency grows larger
– Some code is throughput sensitive, not latency sensitive
• Code performance would improve with higher access bandwidth
– Even if access latency increased
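• A minimal sketch of the difference (illustrative sizes; the dependent loop mirrors the pointer-chasing idea, the independent loop lets the core overlap many misses):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M longs = 128 MB, larger than the caches */
static long a[N];                   /* values double as pseudo-random indices */
static int idx[N];                  /* precomputed random indices */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    srand(1);
    for (int i = 0; i < N; i++) { a[i] = rand() % N; idx[i] = rand() % N; }

    /* Latency-sensitive: the address of each load depends on the value returned
       by the previous load, so only one miss is outstanding at a time */
    long pos = 0;
    double t0 = seconds();
    for (int i = 0; i < N; i++) pos = a[(pos + i) % N];
    double t1 = seconds();

    /* Throughput-sensitive: the random loads are independent of one another,
       so the core can keep many misses in flight and is limited by bandwidth */
    long sum = 0;
    for (int i = 0; i < N; i++) sum += a[idx[i]];
    double t2 = seconds();

    printf("dependent: %.2fs  independent: %.2fs  (%ld %ld)\n",
           t1 - t0, t2 - t1, pos, sum);
    return 0;
}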
Matrix Multiplication
• Common operation in scientific applications
• Significant interaction with cache & memory subsystem
(Example: two 4x4 matrices; the first element of the product is the dot product of row 1 of A and column 1 of B: 1*1 + 2*2 + 3*3 + 4*4 = 30.)
• Recall our memory layout discussion
– E.g. C/C++ uses row-major order
– 2D array is allocated as a linear array in memory
Matrix Multiplication Implementation
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}
• 3 Loops – i, j, k
– 6 ways to arrange the loops and multiply the matrices
• O(N^3) total operations
– N reads for each element of A and B
– N values to sum for each output element of C
Cache Analysis for Matrix Multiplication
• Each matrix element is 64 bits (a double)
• Assumptions:
– N is very large (cache cannot fit more than one row/column)
– Cache block size = 64 bytes (8 matrix elements per block)
• Consider access pattern for i,j,k loop structure
(Diagram: in the i,j,k order the inner loop walks A along row i, walks B down column j, and repeatedly updates the single element C[i][j].)
– A=good spatial locality; C=good temporal locality; B=poor locality
Matrix Multiplication
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}
Misses per inner-loop iteration: A = 0.125, B = 1, C = 0 → 1.125 total
• i-j-k
– Memory accesses for each inner loop iteration
• 2 loads: element A[i][k] and element B[k][j]
– A[i][k] access will be a cache miss once every 8 iterations (8B element / 64B block = 0.125 misses/iteration)
– B[k][j] access will be cache miss every iteration
• j-i-k cache miss behavior same as i-j-k
Matrix Multiplication
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;
for (k = 0; k < N; k++) {
  for (i = 0; i < N; i++) {
    tmp = A[i][k];
    for (j = 0; j < N; j++) {
      C[i][j] += tmp * B[k][j];
    }
  }
}
Misses per inner-loop iteration: A = 0, B = 0.125, C = 0.125 → 0.25 total
• k-i-j
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element B[k][j]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss once every 8 iterations (0.125 misses/iteration)
– B[k][j] access will be a cache miss once every 8 iterations (0.125 misses/iteration)
• i-k-j cache miss behavior same as k-i-j
Matrix Multiplication
double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;
for (j = 0; j < N; j++) {
  for (k = 0; k < N; k++) {
    tmp = B[k][j];
    for (i = 0; i < N; i++) {
      C[i][j] += tmp * A[i][k];
    }
  }
}
Misses per inner-loop iteration: A = 1, B = 0, C = 1 → 2 total
• j-k-i
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element A[i][k]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss every iteration (stride-N, column-wise access)
– A[i][k] access will be a cache miss every iteration (stride-N, column-wise access)
• k-j-i cache miss behavior same as j-k-i
Matrix Multiplication Summary
• k is innermost loop (i-j-k, j-i-k)
– A = good spatial locality
– C = good temporal locality
– Misses per iteration: 1 + (element sz / block sz)
• i is innermost loop (j-k-i, k-j-i)
– B = good temporal locality
– Misses per iteration: 2
• j is innermost loop (k-i-j, i-k-j)
– B, C = good spatial locality
– A = good temporal locality
– Misses per iteration: 2 * (element sz / block sz)
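• A minimal sketch (illustrative: N, the timing helper, and flags such as gcc -O2 mm.c are assumptions) comparing the worst order above (i innermost, j-k-i) with the best (j innermost, k-i-j):
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 1024
static double A[N][N], B[N][N], C[N][N];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* j-k-i: A and C are walked down columns -> ~2 misses per inner iteration */
    memset(C, 0, sizeof C);
    double t0 = seconds();
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double tmp = B[k][j];
            for (int i = 0; i < N; i++)
                C[i][j] += tmp * A[i][k];
        }
    double t1 = seconds();

    /* k-i-j: B and C are walked along rows -> ~0.25 misses per inner iteration */
    memset(C, 0, sizeof C);
    double t2 = seconds();
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double tmp = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += tmp * B[k][j];
        }
    double t3 = seconds();

    printf("j-k-i: %.2fs  k-i-j: %.2fs  (C[0][0]=%.0f)\n", t1 - t0, t3 - t2, C[0][0]);
    return 0;
}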
Other Types of Caching
• Main memory is a cache for disk
– Operates at a physical page granularity
• TLB is a cache for page table
– Translation Lookaside Buffer
– Operates at a page table entry granularity
Virtual Address Space
(Diagram: virtual address space layout from 0x00000000 to 0xffffffff: Text, User Heap, Shared Libraries, User Stack, and Kernel Space.)
• Each process thinks it has access to the full address space provided by the machine
– 4GB on 32-bit computers
– ~16-256TB on 64-bit computers
– Illusion provided by OS
• Every process has its own virtual address space
– Contents visible only to that process
• Address space for even one process is larger than physical memory
• How does it fit?
Physical Memory as a Cache for Disk
(Diagram: virtual address spaces of Process 0 through Process N mapped onto physical addresses.)
Physical Memory as a Cache for Disk
(Diagram: virtual address spaces of Process 0 through Process N mapped onto physical addresses.)
Two Key Performance Aspects:
1) Which addresses from various processes should we keep in physical memory?
2) How do we do this mapping between a virtual address and its address in physical memory?
What To Keep in Physical Memory?
• Want to service memory accesses from DRAM, not disk
– That is, if the access misses in all caches already
– Disk is thousands of times slower than DRAM
• ~60-80ns for DRAM vs. several milliseconds for disk
• Locality still rules the day
– Just as we discussed for caches
• Want to capture spatial and temporal locality in memory
– Temporal: retain recently accessed addresses in physical mem
– Spatial: fetch nearby addresses into physical mem on disk access
What To Keep in Physical Memory?
• OS manages the physical memory and disk accesses
– Physical memory management is software-based
– Operates on physical memory in units of pages
• Pages have some size of 2^P bytes (e.g. 4KB)
• Much larger than cache block size due to huge latency of disk
– OS treats physical memory as a fully associative cache
• A virtual page can be placed in any physical page
– Page replacement policies may be complex
• SW can maintain and track much more state than HW
• Since accessing a page from disk is so slow there is lots of time!
• Physical memory holds the large-scale working set of a program
– Working set size less than physical mem size
• Good performance for a process after memory is warmed up
– Working set sizes of all active processes greater than physical mem size
• Can cause a severe performance problem
• Called thrashing: pages are continuously swapped between memory and disk
Metrics
• Page hit
– Memory reference to an address that is stored in physical mem
• Page miss (page fault)
– Reference to an address that is not in physical memory
– Misses are expensive
• Access to disk
• Software is involved in managing the process
Physical Memory as a Cache for Disk
(Diagram: virtual address spaces of Process 0 through Process N mapped onto physical addresses. This looks complicated!)
What is Stored Where in Physical Memory?
• Need to remember current location for every virtual page
– Since physical mem is managed as fully associative cache
• Solution is called a page table
– Maps virtual pages to physical pages
– Per-process software data structure
(Diagram: a per-process Page Table with entries PTE 0 through PTE 5. Each entry holds a valid bit plus a physical page number or disk address; valid entries point to virtual pages resident in Physical memory, while invalid entries point to copies on Disk or are null.)
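• A purely illustrative sketch of what one entry holds (field names and widths are assumptions; real PTEs, e.g. x86, pack this into a single hardware-defined word):
#include <stdint.h>

struct pte {
    unsigned valid : 1;       /* 1 = the virtual page is resident in physical memory */
    unsigned dirty : 1;       /* the page has been written since it was brought in */
    uint64_t ppn_or_disk;     /* physical page number if valid, otherwise a disk address */
};

/* One table per process, indexed by virtual page number (VPN) */
struct pte page_table[1 << 20];   /* e.g. 2^20 entries for a 32-bit VA with 4KB pages */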
Page Table Management
• Page tables are very large
– One entry per page
– For 32-bit address space:
• Assume 4 GB total virtual memory (2^32 bytes)
• Assume 4KB pages (2^12 bytes)
• 2^32 / 2^12 = 2^20 entries
• PTE is 4B in x86 architecture: 2^20 entries * 4B = 2^22 bytes (4MB)
• And that’s just for one process!
• Keep portions of the page table in memory; rest on disk
– The frequently and recently accessed portions, that is
But That’s Not Quite Enough
(Diagram: on the processor chip, the CPU sends a virtual address to the MMU, which produces the physical address used by the L1 I$/D$, the L2 $, and physical memory.)
• Physical addresses are needed for cache lookups
– Beginning at the L1 cache
• PTE is needed to turn a VA into PA
– PTE is located in memory (at best)
– Memory access required for every load or store?
Translation Lookaside Buffer (TLB)
• TLB is a very fast cache of PTEs
– Located inside the MMU of a processor
– Go directly from virtual page to physical page address
• Hierarchy of TLBs in current processor designs
– Separate instruction & data L1 TLBs, L2 TLB
(Diagram: same as before, but the MMU now contains a TLB; the CPU's virtual address is translated to the physical address used by the L1 I$/D$, the L2 $, and physical memory.)
TLB Reach
• TLB Reach
– Amount of memory accessible from the TLB
– Should cover the working set size of a process
– (# TLB entries) * (Page size)
• For example
– 64 TLB entries in L1 DTLB * 64KB pages = 4MB reach
Increasing TLB Reach
• What if we need to cover larger working sets?
• We can’t increase the number of TLB entries
– Well, we could wait a few years for a newer processor
• We can increase the page size
– Most modern architectures support a set range of page sizes
– From 4KB to 16MB
• Example - hugepages
– Consult your favorite OS manual to turn on hugepages
– E.g. libhugetlbfs
– RHEL example:
$ grep Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
$ grep HugePages_Total /proc/meminfo
HugePages_Total: 0
$ sysctl -w vm.nr_hugepages=128
$ grep HugePages_Total /proc/meminfo
HugePages_Total: 128
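• Once pages are reserved as above, a Linux program can also request huge pages explicitly; a minimal sketch (the mapping size and error handling are illustrative):
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL * 1024 * 1024;   /* 64 MB: a multiple of the 2 MB hugepage size */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    /* ... use buf; each 2 MB hugepage now needs only one TLB entry ... */
    munmap(buf, len);
    return 0;
}
• Libraries such as libhugetlbfs (mentioned above) can arrange this without source changes.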
Memory Systems
Server Diagram
(Diagram: a server board, with the memory DIMM slots highlighted.)
Typical Server Architecture
(Diagram: a CPU (multi-core chip + caches) attached to its Main Memory, with PCIe I/O links to external components (network, storage, etc.) and links to the other CPUs in the same server (often proprietary I/O). Everything runs under a single OS image that sees all CPU cores and memories.)
Memory Bandwidth (DRAM)
(Diagram: CPU (multi-core chip + caches) with an on-chip Mem Controller connected to Main Memory.)
• Currently, main memory is often DRAM
– DDR standards defined by JEDEC
• DDR = Double Data Rate
– DDR4 common in current server CPUs
– New servers offer DDR5
– Various speed grades available
• More on this in a minute
• Connected to CPU w/ channels
– e.g. 4, 8, 16 (uses chip I/O pins)
• Memory controller schedules requests from the CPU
– Reads from cache misses
– Writes from cache evictions of dirty cache blocks
DRAM Performance
• Latency: we’ve discussed before
– 60-70ns load-to-use latency in current server CPUs is common
– This would be an uncontended load (no other traffic on the chip)
• Bandwidth (peak): defined by several factors
– Number of off-chip DDR channels (e.g. 4, 8, …)
– Bit width of each channel (e.g. 8 bytes)
– Channel frequency (e.g. 3200 MHz)
• Bandwidth (effective)
– Defined by ability of the memory controller to schedule requests
to utilize the DRAM devices and DDR channels efficiently
– 80% of peak B/W is a typical reasonable utilization
– Could be a bit better for some controllers or programs
• E.g. mostly read traffic compared to an even mix of read-write
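• Putting the factors together (same form as the worked example on the next slide):
– Peak B/W = (# channels) * (channel width in bytes) * (transfer rate in GT/s)
– Effective B/W ≈ 0.8 * peak for a typically well-utilized memory controller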
Server DRAM B/W Example
(Figure: example server DRAM configuration.)
Server DRAM B/W Example (2)
• 6 DDR4 memory channels per chip; 8B channel width
• 2666 = 2.666 Giga-transfers per second (GT/s)
• Peak DDR4 B/W per-chip:
– 6 channels * 8 B/channel * 2.666 GT/s = 127.97 GB/s
• DDR4 supports up to 3200 MT/s
• What would be B/W with 8 channels and 3200 DDR4?
– 8 * 8 * 3.2 = 204.8 GB/s
• Remember this is peak; Program would see ~80% of this
More on Memory B/W
• Speaking of what a program would see…
• Remember B/W is a rate (# of bytes or accesses per time)
• Programs that generate many outstanding memory
requests can stress bandwidth more than latency
• Would we see peak Mem B/W w/ a program on 1 core?
– No
– Memory system is provisioned to provide bandwidth to all cores
• What defines B/W we could get from 1 core?
– Max number of in-flight memory requests a core can sustain
– Typically will be defined by size of Cache Miss handling structures
• e.g. size of Read Queue or MSHRs (Miss Status Holding Registers)
Bandwidth from 1 Core
• Let’s look at an example:
– Suppose cache block size is 64 bytes, chip frequency is 2.4 GHz
– Suppose per-core L2 can support 32 pending cache misses
– Suppose average latency is 200 cycles
• Remember there is contention now so each access may take longer
– B/W = ((32 misses * 64 B/miss) / 200 cycles) * 2.4 GHz
• 24.576 GB/s would be max B/W for a single core
Other Memory Technologies
• Other memory technologies are emerging
• For use along with DRAM or as a replacement
• E.g. GPUs today use High Bandwidth Memory (HBM)
– Also defined by standards, similarly to DRAM
– Provides higher bandwidth (via wide interfaces and die stacking)
– But maximum capacity is much lower
• HBM2 / HBM2E
– Rates quoted in terms of "giga-transfers per second"
– HBM2: 128B interface (wide!), up to 2 GT/s = 256 GB/s per “stack”
– e.g. NVIDIA Volta GPU:
• 4 HBM stacks per GPU (16 GB total) for 900 GB/s per GPU
• HBM3 is current standard