Distributed Training

Distributed operations

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html

  • NCCL
    • point-to-point communication
      • sendrecv
      • one-to-all (scatter)
      • all-to-one (gather)
      • all-to-all
      • neighbor exchange
    • collective communication
      • broadcast
      • all-gather
      • reduce
      • reduce-scatter
      • all-reduce (reduce-scatter + all-gather)
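
The point-to-point patterns in the outline can all be composed from plain send and recv calls. Here is a minimal sketch of that idea in pure Python, not NCCL itself: `Mailbox` and the four helper functions are hypothetical names made up for illustration, ranks are simulated as list indices inside one process, and the neighbor exchange is assumed to be a one-directional ring (send to the right neighbor, receive from the left).

```python
from collections import defaultdict


class Mailbox:
    """Toy in-process stand-in for send/recv between ranks (hypothetical helper)."""

    def __init__(self):
        # (src, dst) -> FIFO list of messages in flight
        self._queues = defaultdict(list)

    def send(self, src, dst, value):
        self._queues[(src, dst)].append(value)

    def recv(self, src, dst):
        return self._queues[(src, dst)].pop(0)


def scatter(box, root, chunks):
    """One-to-all: root sends chunk i to rank i. Returns what each rank receives."""
    n = len(chunks)
    for dst in range(n):
        box.send(root, dst, chunks[dst])
    return [box.recv(root, dst) for dst in range(n)]


def gather(box, root, values):
    """All-to-one: every rank sends its value to root, collected in rank order."""
    n = len(values)
    for src in range(n):
        box.send(src, root, values[src])
    return [box.recv(src, root) for src in range(n)]


def all_to_all(box, send_bufs):
    """Rank i sends send_bufs[i][j] to rank j. Returns each rank's received list."""
    n = len(send_bufs)
    for src in range(n):
        for dst in range(n):
            box.send(src, dst, send_bufs[src][dst])
    return [[box.recv(src, dst) for src in range(n)] for dst in range(n)]


def neighbor_exchange(box, values):
    """Ring exchange (assumed topology): send to the right, receive from the left."""
    n = len(values)
    for rank in range(n):
        box.send(rank, (rank + 1) % n, values[rank])
    return [box.recv((rank - 1) % n, rank) for rank in range(n)]


if __name__ == "__main__":
    box = Mailbox()
    print(scatter(box, 0, ["a", "b", "c"]))
    print(gather(box, 0, [1, 2, 3]))
    print(all_to_all(box, [[0, 1], [10, 11]]))
    print(neighbor_exchange(box, ["a", "b", "c"]))
```

Note that all-to-all comes out as a transpose of the send buffers: the item rank i addressed to rank j ends up in slot i of rank j's result.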
[Figures: per-process diagrams of broadcast, all-gather, reduce, reduce-scatter, and all-reduce]
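
To make the data movement of the five collectives concrete, here is a rough single-process simulation in pure Python. It is a sketch under assumptions, not NCCL's API: the function names are hypothetical, `bufs[r]` stands for the buffer held by rank `r`, the reduction is assumed to be an element-wise sum, and for reduce-scatter the buffer length is assumed to be divisible by the number of ranks.

```python
def broadcast(bufs, root):
    """Every rank ends up with a copy of root's buffer."""
    return [list(bufs[root]) for _ in bufs]


def reduce(bufs, root):
    """Element-wise sum across ranks; only root holds the result."""
    total = [sum(col) for col in zip(*bufs)]
    # None marks ranks that receive nothing in this toy model.
    return [total if rank == root else None for rank in range(len(bufs))]


def all_gather(bufs):
    """Every rank ends up with all buffers concatenated in rank order."""
    flat = [x for buf in bufs for x in buf]
    return [list(flat) for _ in bufs]


def reduce_scatter(bufs):
    """Element-wise sum, then rank r keeps only the r-th chunk of the result."""
    n = len(bufs)
    total = [sum(col) for col in zip(*bufs)]
    chunk = len(total) // n  # assumes len(total) is divisible by n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_reduce(bufs):
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(bufs))


if __name__ == "__main__":
    bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two simulated ranks
    print(broadcast(bufs, root=0))
    print(reduce(bufs, root=0))
    print(all_gather(bufs))
    print(reduce_scatter(bufs))
    print(all_reduce(bufs))
```

The last function mirrors the decomposition noted in the outline: after reduce-scatter each rank owns one reduced chunk, and all-gather then hands every rank the full reduced buffer.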

PyTorch Distributed Training

https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html

!todo