Distributed Training: mainstream techniques

  1. PyTorch's official path -> FSDP https://docs.pytorch.org/docs/stable/fsdp.html
  2. DeepSpeed ZeRO Stage 1/2/3 -> the theory is a must-learn; the engineering details come second
  3. Tensor Parallel (Megatron-LM) -> for training very large models, 70B+ parameters
  4. Pipeline Parallel
  5. Sequence Parallel / Activation Parallel

So, for the near term, the main things to study are FSDP, ZeRO Stage 3, and Megatron-LM's TP / PP. A minimal FSDP sketch follows.
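
A minimal FSDP sketch, assuming PyTorch >= 2.0 and a `torchrun --nproc_per_node=<gpus>` launch; the model, sizes, and learning rate are placeholders:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")  # torchrun supplies RANK / WORLD_SIZE / MASTER_ADDR
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(  # placeholder model
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks;
    # conceptually this is the same idea as ZeRO Stage 3.
    model = FSDP(model)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")  # dummy batch
    loss = model(x).square().mean()          # dummy loss
    loss.backward()  # grads are reduce-scattered back to their owning shards
    opt.step()
    opt.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With the default full-sharding policy, each of N ranks holds roughly a 1/N slice of parameters and optimizer state, which is exactly the memory win ZeRO Stage 3 promises.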

High-Performance Kernels + Compilers

  1. CUDA Kernel optimization: FlashAttention v1/v2/v3
  2. Triton Kernel
  3. Fused Kernels
  4. torch.compile
  5. XLA / PJIT / SPMD -> heavily used mainly at DeepMind
  6. Mixed Precision -> a fundamental; quick to pick up (see the sketch after this list)
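
A minimal single-GPU sketch combining torch.compile and mixed precision, assuming PyTorch >= 2.0 and an Ampere-or-newer card for bf16; the model and sizes are placeholders:

```python
import torch

model = torch.nn.Sequential(  # placeholder model
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# torch.compile traces the model and (via the Inductor backend) emits fused
# Triton kernels, collapsing chains of elementwise ops into single launches.
compiled = torch.compile(model)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = compiled(x).square().mean()  # matmuls run in bf16, reductions in fp32
loss.backward()
opt.step()
opt.zero_grad()
```

If you train in fp16 instead of bf16, add a GradScaler to avoid gradient underflow; bf16 has enough exponent range that scaling is usually unnecessary.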

Training Platform + Orchestration

The engineering layer:

  1. Sharded Checkpointing (a sketch follows this list)
  2. Streaming Dataset + Global Shuffling
  3. Job Orchestration / Scheduler
  4. Fault Tolerance
  5. Scaling & Throughput Optimization
  6. Monitoring / Profiling / Telemetry
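
A minimal sharded-checkpointing sketch using torch.distributed.checkpoint, assuming a recent PyTorch (2.3+) and a torchrun launch; the model, optimizer, and the path "ckpt/step_1000" are placeholders. Each rank writes and reads only its own shard, so no single rank ever materializes the full model:

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # launched via torchrun, one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = FSDP(torch.nn.Linear(1024, 1024).cuda())  # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Run one step so optimizer state actually exists before we checkpoint it.
model(torch.randn(8, 1024, device="cuda")).sum().backward()
opt.step()
opt.zero_grad()

# Save: get_state_dict returns sharded views; dcp.save writes per-rank shard files.
model_sd, optim_sd = get_state_dict(model, opt)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_1000")

# Restore: dcp.load fills the dict in place; set_state_dict pushes it back
# into the live model and optimizer.
state = {"model": model_sd, "optim": optim_sd}
dcp.load(state, checkpoint_id="ckpt/step_1000")
set_state_dict(model, opt, model_state_dict=state["model"], optim_state_dict=state["optim"])
dist.destroy_process_group()
```

Because the checkpoint stores shards plus metadata rather than one monolithic file, it can in principle be reloaded at a different world size, which is what makes restarts after node failures practical at scale.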