Distributed Training: mainstream techniques

  1. PyTorch's official path -> FSDP https://docs.pytorch.org/docs/stable/fsdp.html
  2. DeepSpeed ZeRO Stage 1/2/3 -> the theory is a must-learn; the engineering details come second
  3. Tensor Parallel (Megatron-LM) -> for training very large models, 70B+ parameters
  4. Pipeline Parallel
  5. Sequence Parallel / Activation Parallel

So, for the near term, the main things to study are FSDP, ZeRO Stage 3, and Megatron-LM's TP / PP. A minimal FSDP sketch follows.
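
A minimal FSDP sketch, assuming PyTorch >= 2.0 and a `torchrun --nproc_per_node=<gpus>` launch; the model, sizes, and learning rate are placeholders:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")  # torchrun supplies RANK / WORLD_SIZE / MASTER_ADDR
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(  # placeholder model
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks;
    # conceptually this is the same idea as ZeRO Stage 3.
    model = FSDP(model)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")  # dummy batch
    loss = model(x).square().mean()          # dummy loss
    loss.backward()  # grads are reduce-scattered back to their owning shards
    opt.step()
    opt.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With the default full-sharding policy, each of N ranks holds roughly a 1/N slice of parameters and optimizer state, which is exactly the memory win ZeRO Stage 3 promises.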

High-Performance Kernels + Compilers

  1. CUDA Kernel optimization: FlashAttention v1/v2/v3
  2. Triton Kernel
  3. Fused Kernels
  4. torch.compile
  5. XLA / PJIT / SPMD -> heavily used mainly at DeepMind
  6. Mixed Precision -> a fundamental; quick to pick up (see the sketch after this list)
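
A minimal single-GPU sketch combining torch.compile and mixed precision, assuming PyTorch >= 2.0 and an Ampere-or-newer card for bf16; the model and sizes are placeholders:

```python
import torch

model = torch.nn.Sequential(  # placeholder model
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# torch.compile traces the model and (via the Inductor backend) emits fused
# Triton kernels, collapsing chains of elementwise ops into single launches.
compiled = torch.compile(model)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = compiled(x).square().mean()  # matmuls run in bf16, reductions in fp32
loss.backward()
opt.step()
opt.zero_grad()
```

If you train in fp16 instead of bf16, add a GradScaler to avoid gradient underflow; bf16 has enough exponent range that scaling is usually unnecessary.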

Training Platform + Orchestration

The engineering layer:

  1. Sharded Checkpointing (a sketch follows this list)
  2. Streaming Dataset + Global Shuffling
  3. Job Orchestration / Scheduler
  4. Fault Tolerance
  5. Scaling & Throughput Optimization
  6. Monitoring / Profiling / Telemetry
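
A minimal sharded-checkpointing sketch using torch.distributed.checkpoint, assuming a recent PyTorch (2.3+) and a torchrun launch; the model, optimizer, and the path "ckpt/step_1000" are placeholders. Each rank writes and reads only its own shard, so no single rank ever materializes the full model:

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # launched via torchrun, one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = FSDP(torch.nn.Linear(1024, 1024).cuda())  # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Run one step so optimizer state actually exists before we checkpoint it.
model(torch.randn(8, 1024, device="cuda")).sum().backward()
opt.step()
opt.zero_grad()

# Save: get_state_dict returns sharded views; dcp.save writes per-rank shard files.
model_sd, optim_sd = get_state_dict(model, opt)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_1000")

# Restore: dcp.load fills the dict in place; set_state_dict pushes it back
# into the live model and optimizer.
state = {"model": model_sd, "optim": optim_sd}
dcp.load(state, checkpoint_id="ckpt/step_1000")
set_state_dict(model, opt, model_state_dict=state["model"], optim_state_dict=state["optim"])
dist.destroy_process_group()
```

Because the checkpoint stores shards plus metadata rather than one monolithic file, it can in principle be reloaded at a different world size, which is what makes restarts after node failures practical at scale.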