Configuring an AMD EPYC workstation with multiple GPUs to run PyTorch

How to do a clean PyTorch install and run a test program on an AMD EPYC system.

Pritam Nayak
4 min read · May 15, 2021

Hello folks. Hope you are doing well. :)
In this short post we will configure a new AMD machine to run the PyTorch framework. Before getting into the setup, let's have a look at the configuration of the machine I used for this post.

Hardware
CPU: 2x AMD EPYC 7302 (16 cores, 3.0 GHz, 128 MB cache, PCIe 4.0)
GPU: 4x NVIDIA RTX A6000 (connected via NVLink, 48 GB of VRAM per GPU)
Memory: 256 GB
Storage: 1.92TB + 3.84TB
Software
Linux: Ubuntu 20.04 (focal)

Before getting started, a quick tip: uninstall/remove conda if it is installed. Conda's default builds link against Intel's MKL library, which does not take full advantage of AVX2 on AMD CPUs. If for some reason you need a conda setup, please drop a comment so that I can assist you. We will be using plain pip for this.
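If you want to double-check which BLAS backend a given PyTorch build links against (MKL vs. OpenBLAS), you can print the build configuration once the install below is done:

python3 -c "import torch; print(torch.__config__.show())"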

For my deep learning projects I use a multi-workstation setup. Intel is a good fit for these tasks, but the current generation of NVIDIA graphics cards uses PCIe 4.0, which most Intel processors do not support. Since these cards have many more CUDA cores and much higher bandwidth, it would be unwise to restrict the lanes and run them on the slower PCIe 3.0. So we picked AMD for the setup. Most of our R&D machines have AMD Threadripper processors, as we don't run them for long stretches (running a Threadripper at full load for days tends to crash). Recently we configured an AMD EPYC system for our production workload.
We followed the steps below to set up the machine for deep learning. As we use PyTorch and TF most of the time, we picked Lambda Stack because it is easier to install.

LAMBDA_REPO=$(mktemp) && \
wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
sudo apt-get update && sudo apt-get --yes upgrade && \
sudo apt-get install --yes --no-install-recommends lambda-server && \
sudo apt-get install --yes --no-install-recommends lambda-stack-cuda

This installs the latest stable versions of TF, PyTorch, Theano and Caffe. At the time of writing the versions are TensorFlow (tensorflow-gpu 2.4.1), PyTorch (1.8.1), Theano (1.0.4) and Caffe (1.0.0). Once the install is done, we only need to make a few GPU-related changes.
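A quick sanity check after the install, to confirm that PyTorch sees the CUDA runtime and all four cards:

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"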
Let's create a PyTorch DataParallel test to check whether all the GPUs are used.

Next we need to define a few global variables. I know there are better ways and places to define them, but let's not waste time decorating the page.
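Here is a minimal sketch of those globals. The sizes are placeholder values for a synthetic test, so feel free to change them:

import torch

# Placeholder sizes for the synthetic test data and the model
input_size = 5      # features per sample
output_size = 2     # outputs per sample
batch_size = 30     # samples per batch handed to DataParallel
data_size = 100     # total number of random samples

# Primary device; DataParallel fans the work out to the remaining GPUs
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")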

Now that we have the variables set, let's move ahead with creating a data loader.
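A loader over random tensors is enough to drive the GPUs; a sketch, continuing the same script as above:

from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    # A dataset of random vectors, just enough to feed the model
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)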

Below is a simple model consisting of a fully connected layer.
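Something along these lines, with a print inside forward so we can see the shape of the slice each GPU receives:

import torch.nn as nn

class Model(nn.Module):
    # One fully connected layer that logs the per-GPU input/output shapes
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())
        return output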

And finally the below code is used to run the model in parallel.
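The piece that actually spreads the work across the cards is nn.DataParallel, which replicates the model and splits each batch over all visible GPUs:

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    model = nn.DataParallel(model)  # replicate the model, split each batch
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output size", output.size())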

We are all set to use the model and the data loader to test the multi-GPU setup. Before running the code, let's take a look at the attached cards.
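You can check the attached cards and the driver/CUDA versions straight from the terminal:

nvidia-smi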

As the nvidia-smi output shows, I am using the latest CUDA drivers (11.2). The code will work on 11.1 as well. The RTX 30 series and the A6000 use the sm_86 architecture, which is supported from CUDA 11.1 onwards, so it's better to pick the latest available CUDA packages. Below is a sample command that installs the latest CUDA drivers.

LAMBDA_REPO=$(mktemp) && \
wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
sudo apt-get update && sudo apt-get --yes upgrade && \
sudo apt-get install --yes --no-install-recommends lambda-server && \
sudo apt-get install --yes --no-install-recommends nvidia-headless-450 && \
sudo apt-get install --yes --no-install-recommends lambda-stack-cuda

There is only one step left. We need to configure the OS to support multi-GPU training. Sometimes it works without this step, but in my setup with 4 GPUs the OS assumed the cards were connected through PLX switches, so I had to add this additional step.

Open a terminal and type:

sudo vim /etc/default/grub

Comment out the line

GRUB_CMDLINE_LINUX=""

and add a new line

GRUB_CMDLINE_LINUX="iommu=soft"

Save the file and run

sudo update-grub

for the changes to take effect.
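After a reboot (the new kernel parameter only takes effect on boot), you can confirm the flag is active; the output should include iommu=soft:

cat /proc/cmdline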

All set. Now let's check whether the setup is complete and all the GPUs are usable from torch.
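With the placeholder sizes from the sketch above, the output after running the code looks roughly like this: DataParallel splits each batch of 30 across the four cards, so the model sees chunks of 8, 8, 8 and 6 samples while the outer loop always sees the full batch of 30.

Using 4 GPUs
    In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
    In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
    In Model: input size torch.Size([8, 5]) output size torch.Size([8, 2])
    In Model: input size torch.Size([6, 5]) output size torch.Size([6, 2])
Outside: input size torch.Size([30, 5]) output size torch.Size([30, 2])
...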

Summary

The takeaway from this blog is that the newer GPUs use the sm_86 architecture, which is not supported by older versions of CUDA; support starts with CUDA 11.1. Lambda Labs configures and sells this hardware, so they keep their packages up to date for these architectures. And because these graphics cards are double-width, we sometimes need riser connectors to attach more than two cards; if we do so, we need to update GRUB as described above.

Are you planning to install PyTorch from source, or to use Anaconda to set up the environment, and need help doing so? Please add a comment and I will write a post to assist you.
