Distributed Training
Distributed operations
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html
- NCCL
  - point-to-point communication
    - sendrecv
    - one-to-all (scatter)
    - all-to-one (gather)
    - all-to-all
    - neighbor exchange
  - collective communication
    - broadcast
    - all-gather
    - reduce
    - reduce-scatter
    - all-reduce (reduce-scatter + all-gather)
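
The point-to-point patterns listed above can all be expressed as paired send and receive calls. The following is a minimal pure-Python sketch that simulates ranks as list indices; it is not NCCL code, and `Mailbox`, `scatter`, `gather`, `all_to_all`, and `ring_exchange` are hypothetical names chosen only for illustration.

```python
from collections import defaultdict, deque


class Mailbox:
    """Toy stand-in for the network: one FIFO queue per (src, dst) pair."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, src, dst, value):
        self._queues[(src, dst)].append(value)

    def recv(self, src, dst):
        return self._queues[(src, dst)].popleft()


def scatter(box, root, chunks):
    """One-to-all: root sends chunk i to rank i."""
    n = len(chunks)
    for dst in range(n):
        box.send(root, dst, chunks[dst])
    return [box.recv(root, dst) for dst in range(n)]


def gather(box, root, values):
    """All-to-one: every rank sends its value to root."""
    n = len(values)
    for src in range(n):
        box.send(src, root, values[src])
    return [box.recv(src, root) for src in range(n)]


def all_to_all(box, data):
    """data[src][dst] is what src sends to dst; returns out[dst][src]."""
    n = len(data)
    for src in range(n):
        for dst in range(n):
            box.send(src, dst, data[src][dst])
    return [[box.recv(src, dst) for src in range(n)] for dst in range(n)]


def ring_exchange(box, values):
    """Neighbor exchange on a ring: send to rank + 1, receive from rank - 1."""
    n = len(values)
    for rank in range(n):
        box.send(rank, (rank + 1) % n, values[rank])
    return [box.recv((rank - 1) % n, rank) for rank in range(n)]
```

In this sketch all sends are posted before any receive, which loosely mirrors how grouped send/receive calls avoid deadlock; a real implementation would run each rank in its own process.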
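| operation | result |
|---|---|
| broadcast | the root rank's buffer is copied to every rank |
| all-gather | every rank receives the buffers of all ranks, concatenated in rank order |
| reduce | buffers are combined element-wise (e.g. sum) and the result is stored on the root rank only |
| reduce-scatter | buffers are combined element-wise and each rank receives one equal block of the result |
| all-reduce | buffers are combined element-wise and every rank receives the full result |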
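
The semantics of the collectives in the table can be checked with a small simulation. This is a sketch in plain Python, not NCCL or PyTorch code: each rank's buffer is a list, the reduction is assumed to be a sum, and the buffer length is assumed to be divisible by the number of ranks. The function names are illustrative only.

```python
def broadcast(inputs, root):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(inputs[root]) for _ in inputs]


def all_gather(inputs):
    """Every rank ends up with all buffers concatenated in rank order."""
    flat = [x for buf in inputs for x in buf]
    return [list(flat) for _ in inputs]


def reduce(inputs, root):
    """Element-wise sum across ranks; only the root holds the result."""
    total = [sum(col) for col in zip(*inputs)]
    return [total if rank == root else None for rank in range(len(inputs))]


def reduce_scatter(inputs):
    """Element-wise sum across ranks; rank r keeps only block r of the result."""
    n = len(inputs)
    total = [sum(col) for col in zip(*inputs)]
    chunk = len(total) // n  # assumes len(total) is divisible by n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_reduce(inputs):
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(inputs))
```

With two ranks holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, `reduce_scatter` leaves rank 0 with `[11, 22]` and rank 1 with `[33, 44]`, and `all_reduce` leaves both ranks with `[11, 22, 33, 44]`.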
PyTorch Distributed Training
https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html
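
FSDP is built on the collectives above (all-gather for parameters, reduce-scatter for gradients), so it can help to see them called through `torch.distributed` first. The following is a minimal sketch, assuming a CPU-only machine, the `gloo` backend, and a single process (`world_size=1`) so that it runs without a launcher. The function name `run_single_process_demo` and the file-based rendezvous path are illustrative choices, not part of the tutorial.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def run_single_process_demo():
    """Initialise a one-rank process group and call two collectives."""
    with tempfile.TemporaryDirectory() as tmp:
        # File-based rendezvous avoids picking a free TCP port (assumption
        # made for this sketch; multi-process jobs usually use env://).
        init_file = os.path.join(tmp, "rendezvous")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{init_file}",
            rank=0,
            world_size=1,
        )
        try:
            t = torch.tensor([1.0, 2.0, 3.0])
            # In-place sum across all ranks; with one rank t is unchanged.
            dist.all_reduce(t, op=dist.ReduceOp.SUM)

            # Each rank receives one tensor per rank in the group.
            gathered = [torch.zeros(3) for _ in range(dist.get_world_size())]
            dist.all_gather(gathered, t)
            return t.tolist(), [g.tolist() for g in gathered]
        finally:
            dist.destroy_process_group()
```

For a real multi-process run, the usual approach is to launch the script with `torchrun`, which sets `RANK` and `WORLD_SIZE` in the environment, and to use the `nccl` backend when the tensors live on GPUs.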

!todo
