Stanford CS149, Fall 2021
PARALLEL COMPUTING
From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course will cover both parallel hardware and software design.
Basic Info
Tues/Thurs 3:15-4:45pm
All lectures are virtual
See the course info page for more info on policies and logistics.
Fall 2021 Schedule
| Date | Lecture | Topics |
| ------ | --------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Sep 21 | Why Parallelism? Why Efficiency? | Challenges of parallelizing code, motivations for parallel chips, processor basics |
| Sep 23 | A Modern Multi-Core Processor | Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth |
| Sep 28 | Parallel Programming Abstractions | Ways of thinking about parallel programs, their corresponding hardware implementations, ISPC programming |
| Sep 30 | Parallel Programming Basics | Thought process of parallelizing a program in data-parallel and shared address space models |
| Oct 05 | Performance Optimization I: Work Distribution and Scheduling | Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing |
| Oct 07 | Performance Optimization II: Locality, Communication, and Contention | Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention |
| Oct 12 | GPU Architecture and CUDA Programming | CUDA programming abstractions, and how they are implemented on modern GPUs |
| Oct 14 | Data-Parallel Thinking | Data-parallel operations like map, reduce, scan, prefix sum, groupByKey |
| Oct 19 | Distributed Computing Using Spark | Producer-consumer locality, RDD abstraction, Spark implementation and scheduling |
| Oct 21 | Cache Coherence | Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing |
| Oct 26 | Memory Consistency | Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics |
| Oct 28 | Locks, Fine-Grained Synchronization, and Lock-Free Programming | Implementation of locks, fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers |
| Nov 02 | No class (Stanford Election Day Holiday) | Go vote! |
| Nov 04 | Transactional Memory | Motivation for transactions, design space of transactional memory implementations |
| Nov 09 | Transactional Memory 2 | Finishing up transactional memory, focusing on implementations of STM and HTM |
| Nov 11 | Heterogeneous Parallel Processing | Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs |
| Nov 16 | Domain-Specific Programming Languages (Case Study: Halide) | Performance/productivity motivations for DSLs, case study on the Halide image processing DSL |
| Nov 18 | Parallel Graph Processing Frameworks + How DRAM Works | Domain-specific frameworks for graph processing, streaming graph processing, graph compression, DRAM basics |
| Nov 30 | Programming for Hardware Specialization | Programming reconfigurable hardware like FPGAs and CGRAs |
| Dec 02 | Efficiently Evaluating DNNs (+ Course Wrap-Up) | Scheduling conv layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU) |
Programming Assignments
Written Assignments