Getting Started With Training Infra

Distributed Training

  Mainstream techniques:
    PyTorch official route -> FSDP (https://docs.pytorch.org/docs/stable/fsdp.html)
    DeepSpeed ZeRO Stage 1/2/3 -> the theory is a must-learn; the engineering is secondary
    Tensor Parallel (Megatron-LM) -> for training very large models, 70B+ parameters
    Pipeline Parallel
    Sequence Parallel / Activation Parallel

  So, for the near future, the main things to learn are FSDP, ZeRO Stage 3, and Megatron-LM's TP / PP.

High-Performance Kernels + Compilers

    CUDA kernel optimization, FlashAttention v1/v2/v3
    Triton kernels
    Fused kernels
    torch.compile
    XLA / PJIT / SPMD -> in heavy use mainly at Google / DeepMind (JAX stack)
    Mixed Precision -> fundamentals, pick up quickly

Training Platform + Orchestration Engineering

    Sharded Checkpointing
    Streaming Dataset + Global Shuffling
    Job Orchestration / Scheduler
    Fault Tolerance
    Scaling & Throughput Optimization
    Monitoring / Profiling / Telemetry
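
To make the "ZeRO Stage 1/2/3" theory item above more concrete, here is a minimal sketch of the per-GPU memory arithmetic for model states. It assumes mixed-precision Adam with fp16 parameters (2 bytes), fp16 gradients (2 bytes), and fp32 optimizer states (12 bytes) per parameter, and it ignores activations, temporary buffers, and fragmentation. The function name `zero_model_state_bytes` is hypothetical and not part of DeepSpeed; the example sizes (7.5B parameters, 64 GPUs) are chosen only for illustration.

```python
def zero_model_state_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU bytes of model states under a given ZeRO stage.

    Assumption: mixed-precision Adam, i.e. per parameter
      2 bytes fp16 weights + 2 bytes fp16 gradients
      + 12 bytes fp32 optimizer state (master weights, momentum, variance).
    Stage 0 means plain data parallelism (nothing is sharded).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 optimizer states

    if stage >= 1:            # Stage 1 shards optimizer states
        optim /= num_gpus
    if stage >= 2:            # Stage 2 additionally shards gradients
        grads /= num_gpus
    if stage >= 3:            # Stage 3 additionally shards parameters
        params /= num_gpus

    return params + grads + optim


if __name__ == "__main__":
    # Illustrative sizes only: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = zero_model_state_bytes(7_500_000_000, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB of model states per GPU")
```

Under these assumptions, only Stage 3 makes the model-state footprint shrink in proportion to the number of GPUs, which is one reason Stage 3 (and FSDP, which shards in a similar way) sits at the top of the learning list.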