
[D] How do you run distributed training?

I have used DataParallel training with PyTorch some time back. Recently I started looking at DistributedDataParallel to train some models on my 2-GPU home computer. I also want to try training on multiple nodes in the cloud.
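For context, the DDP setup I'm experimenting with looks roughly like this (toy model and data, gloo backend so it also runs on a CPU-only box, one process per GPU in the real version):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process joins the same group; rank 0 hosts the rendezvous.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)       # toy model standing in for the real one
    ddp_model = DDP(model)               # wraps the model; gradients get all-reduced
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(8, 10)               # placeholder for a DistributedSampler batch
    y = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                      # backward triggers the cross-process all-reduce
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    # One process per GPU on a single machine; world_size=2 for my 2-GPU box.
    mp.spawn(worker, args=(2,), nprocs=2)
```

In the real run each process would pin its own GPU (`device_ids=[rank]`) and the DataLoader would use a `DistributedSampler` so the processes see disjoint shards.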

I came across a bunch of third-party libraries like Ray, Horovod, and torchelastic. If I use PyTorch DistributedDataParallel with torchelastic to run it in the cloud, why would I need Ray or Horovod? Some benchmarks showed Ray/Horovod to be faster than DataParallel, but there was no comparison with DistributedDataParallel. My impression is that Ray and Horovod were useful some time back, before PyTorch natively supported distributed training, and not so much now. Is this accurate? Also, it's unclear whether Horovod and Ray complement or substitute for each other.
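For the multi-node case, the launcher invocation I had in mind is roughly this (hostnames and the script name are placeholders; this is the stock `torch.distributed.launch` entry point, which torchelastic extends with fault tolerance):

```shell
# Run the same command on each node, changing only --node_rank (0, 1, ...).
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --nproc_per_node=2 \
    --master_addr=node0.example.com \
    --master_port=29400 \
    train.py
```

The launcher sets RANK/WORLD_SIZE env vars for each worker process, so the training script itself stays the same as the single-machine DDP version.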

Also, out of curiosity, what does distributed training look like in TensorFlow?
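From skimming the docs, the single-machine data-parallel equivalent seems to be `tf.distribute.MirroredStrategy` (toy model here; with no GPUs it just creates one replica) — is this right?

```python
import tensorflow as tf

# Replicates the model across local GPUs and all-reduces gradients,
# analogous to DDP on a single machine.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():  # variables created here are mirrored per replica
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")

# model.fit(...) then splits each batch across the replicas.
```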

submitted by /u/mlvpj
