
[D] How do you run distributed training?

I used DataParallel training with PyTorch some time back. I recently started looking at DistributedDataParallel to train some models on my 2-GPU home computer. I also want to try training on multiple nodes in the cloud.

I came across a bunch of third-party libraries like Ray, Horovod, and torchelastic. If I use PyTorch DistributedDataParallel with torchelastic to run it on the cloud, why would I need Ray or Horovod? Some benchmarks showed Ray/Horovod to be faster than DataParallel, but there was no comparison with DistributedDataParallel. My impression is that Ray and Horovod were useful some time back, before PyTorch natively supported distributed training, and not so much now. Is this accurate? Also, it's unclear whether Horovod and Ray complement or substitute for each other.
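For context, here is a minimal sketch of what native DistributedDataParallel usage looks like, run as a single CPU process with the gloo backend so it stands in for one rank. The model, sizes, and hyperparameters are illustrative assumptions, not anyone's actual setup; in a real multi-GPU launch the rank, world size, and master address come from the launcher (e.g. torchrun), not hard-coded values.

```python
# Minimal single-node DistributedDataParallel sketch (CPU, gloo backend).
# All names and sizes here are toy assumptions for illustration.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_worker(rank: int, world_size: int) -> float:
    # With a real launcher (torchrun / torchelastic) these env vars and
    # the rank are provided for you; we set defaults here to stay runnable.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(8, 1)      # toy model
    ddp_model = DDP(model)       # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # One toy training step; each rank would normally see its own shard
    # of the data via a DistributedSampler.
    x = torch.randn(4, 8)
    y = torch.randn(4, 1)
    loss = nn.functional.mse_loss(ddp_model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    # Single process standing in for one rank; with 2 GPUs you would
    # spawn 2 processes, e.g. `torchrun --nproc_per_node=2 script.py`.
    print(run_worker(rank=0, world_size=1))
```

The point of the sketch is that DDP itself is just a wrapper plus a process group; the launcher (torchrun/torchelastic, or alternatively Horovod's `horovodrun` or Ray) is what decides how the worker processes get started and supervised.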

Also, out of curiosity, what does distributed training look like in TensorFlow?

submitted by /u/mlvpj
