I drew 42 frames to show how a GPU speeds up an operation on an 8-element array by running 4 threads in parallel, finishing in 2 clock cycles.
CPU
• It has one core.
• Its global memory has 120 locations (0-119).
• To use the GPU, it needs to copy data from the global memory to the GPU.
• After the GPU is done, it copies the results back.
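The copy-in, compute, copy-back flow can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function name `run_on_gpu`, the doubling operation, and the placement of the array at locations 0-7 are all made up for illustration, since the post does not say which operation is applied or where the data lives.

```python
# Toy model of the CPU/GPU flow. Names and the doubling op are assumptions.
GLOBAL_MEM_SIZE = 120  # CPU global memory locations 0-119 (from the post)
NUM_THREADS = 4        # GPU threads 0-3 (from the post)


def run_on_gpu(global_mem, start, count, op):
    """Copy data to the GPU, apply op with 4 threads per cycle, copy back."""
    # 1. Copy the input from CPU global memory to the GPU.
    gpu_data = list(global_mem[start:start + count])

    # 2. In each cycle, every thread handles one element. The inner loop
    #    stands in for work that the real hardware does simultaneously.
    cycles = 0
    for base in range(0, count, NUM_THREADS):
        for tid in range(NUM_THREADS):
            idx = base + tid
            if idx < count:
                gpu_data[idx] = op(gpu_data[idx])
        cycles += 1

    # 3. Copy the results back to CPU global memory.
    global_mem[start:start + count] = gpu_data
    return cycles


global_mem = [0] * GLOBAL_MEM_SIZE
global_mem[0:8] = [1, 2, 3, 4, 5, 6, 7, 8]  # the 8-element array
cycles = run_on_gpu(global_mem, 0, 8, lambda x: x * 2)
print(cycles, global_mem[0:8])  # prints: 2 [2, 4, 6, 8, 10, 12, 14, 16]
```

With 8 elements and 4 threads, the outer loop runs twice, which matches the 2 clock cycles in the frames. A single core doing one element per cycle would need 8.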
GPU
• It has four cores to run four threads (0-3).
• It has a register file of 28 locations (0-27).
• This register file has four banks (0-3).
• All threads share the same register file.
• But every read or write must go through one of the four banks.
• Each bank allows 2 reads (Read 0, Read 1) and 1 write in a single clock cycle.
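To make the bank limits concrete, here is a minimal Python sketch that checks whether a set of register accesses fits into one clock cycle. The interleaved mapping (location `i` lives in bank `i % 4`) is an assumption, as the post does not say how the 28 locations are split across the banks. The names `bank_of` and `fits_in_one_cycle` are hypothetical too.

```python
from collections import Counter

NUM_LOCATIONS = 28       # register file locations 0-27 (from the post)
NUM_BANKS = 4            # banks 0-3 (from the post)
MAX_READS_PER_BANK = 2   # Read 0 and Read 1 (from the post)
MAX_WRITES_PER_BANK = 1  # one write per cycle (from the post)


def bank_of(location):
    """Map a register location to a bank. ASSUMPTION: interleaved layout."""
    if not 0 <= location < NUM_LOCATIONS:
        raise ValueError(f"location {location} is outside 0-27")
    return location % NUM_BANKS


def fits_in_one_cycle(reads, writes):
    """Return True if no bank gets more than 2 reads or 1 write."""
    read_counts = Counter(bank_of(loc) for loc in reads)
    write_counts = Counter(bank_of(loc) for loc in writes)
    reads_ok = all(n <= MAX_READS_PER_BANK for n in read_counts.values())
    writes_ok = all(n <= MAX_WRITES_PER_BANK for n in write_counts.values())
    return reads_ok and writes_ok


# Each thread t reads two operands (locations t and t + 4) and writes one
# result (location t + 8). Under the assumed mapping all three land in bank t.
reads = [t for t in range(4)] + [t + 4 for t in range(4)]
writes = [t + 8 for t in range(4)]

print(fits_in_one_cycle(reads, writes))   # prints: True
print(fits_in_one_cycle([0, 4, 8], []))   # prints: False (3 reads hit bank 0)
```

Under this layout, four threads that each read two operands and write one result use every bank at exactly its limit. That is one way to see why 2 reads and 1 write per bank is enough to keep all four threads busy in the same cycle.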
Do you have an example with tiled matrix multiplication?