Distributed Training: Mainstream Techniques
- PyTorch's official direction -> FSDP (see the sketch after this list) https://docs.pytorch.org/docs/stable/fsdp.html
- DeepSpeed ZeRO Stage 1/2/3 -> the theory is a must-learn; the engineering comes second (config sketch below)
- Tensor Parallel (Megatron-LM) -> for training very large models, 70B+ (minimal sketch below)
- Pipeline Parallel
- Sequence Parallel / Activation Parallel
So for the near term, the main things to learn are FSDP, ZeRO Stage 3, and Megatron-LM's TP / PP.
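A minimal FSDP sketch, assuming a torchrun-launched job and a toy `nn.Sequential` standing in for a real model; all shapes and hyperparameters are illustrative, not a recommended setup.

```python
import os
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launched via torchrun; LOCAL_RANK is set by the launcher.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model standing in for a real transformer (illustrative only).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).sum()  # dummy loss just to drive a backward pass
loss.backward()
optim.step()
```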
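For ZeRO, the knob that matters is `zero_optimization.stage`. A minimal sketch of handing a Stage 3 config dict to `deepspeed.initialize`; the model and every numeric value are placeholders, not tuned settings.

```python
import deepspeed
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in model; replace with a real network

# Illustrative ZeRO Stage 3 config; values are placeholders, not tuned.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # Stage 3: shard params + grads + optimizer state
        "overlap_comm": True,  # overlap all-gather/reduce-scatter with compute
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```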
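And a conceptual column-parallel linear in plain torch.distributed, to show what Megatron-LM's TP actually shards: each rank owns a slice of the output features. Note that `dist.all_gather` is not autograd-aware, so this sketch illustrates the forward pass only; real TP implementations use differentiable collectives.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Megatron-style column parallelism: each rank holds out_features / world_size columns."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        self.local = nn.Linear(in_features, out_features // world)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_local = self.local(x)  # [batch, out_features / world_size]
        shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, y_local)  # NOTE: not autograd-aware; forward-only sketch
        return torch.cat(shards, dim=-1)
```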
High-Performance Kernels + Compilers
- CUDA kernel optimization, FlashAttention v1/v2/v3
- Triton Kernel (fused-kernel sketch after this list)
- Fused Kernels
- torch.compile
- XLA / pjit / SPMD -> used heavily mainly inside the Google/DeepMind JAX ecosystem
- Mixed Precision -> fundamentals; pick it up quickly (torch.compile + AMP sketch after this list)
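A minimal Triton kernel that fuses an elementwise add and ReLU into a single memory pass, the same idea behind the "Fused Kernels" item above; the function names and block size are arbitrary choices for illustration.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile of the flat tensors.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, tl.maximum(x + y, 0.0), mask=mask)  # add + ReLU fused

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```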
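And a sketch combining torch.compile with automatic mixed precision: fp16 with a GradScaler is shown here, while bf16 usually drops the scaler. The model and shapes are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
compiled = torch.compile(model)       # Dynamo captures the graph; Inductor emits fused kernels
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # fp16 needs loss scaling; bf16 generally does not

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = compiled(x).sum()          # matmuls run in fp16 under autocast
scaler.scale(loss).backward()         # scale up to avoid fp16 gradient underflow
scaler.step(optim)                    # unscales grads, skips the step on inf/nan
scaler.update()
```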
Training Platform + Orchestration
Engineering-side work:
- Sharded Checkpointing (sketch below)
- Streaming Dataset + Global Shuffling
- Job Orchestration / Scheduler
- Fault Tolerance
- Scaling & Throughput Optimization
- Monitoring / Profiling / Telemetry
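A sketch of sharded checkpointing with torch.distributed.checkpoint (PyTorch 2.x; API details vary by version). Here `model` and `optim` are assumed to be an FSDP-wrapped module and its optimizer, as in the FSDP sketch above; the checkpoint path is illustrative.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

# model / optim are assumed to be an FSDP-wrapped module and its optimizer.
model_sd, optim_sd = get_state_dict(model, optim)  # collects sharded state dicts
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_1000")
# Each rank writes only its own shards, so save time scales with (model size / world size);
# dcp.load() can later restore into a different world size by resharding on load.
```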