A few days ago, a fellow Deep Learning practitioner expressed interest in my latest hardware build and asked me to write a blog post about it: here we are. So please, indulge me!
You will find a lot of blog posts about the same topic (the vast majority of them right here on medium).
They are all interesting and quite instructive, but mine is precisely aimed at showing you how to squeeze the most from your next build in terms of bang for buck (read: TFLOPS per money unit).
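To make the metric concrete, here is a quick sketch of "TFLOPS per euro" for a few Pascal cards. The FP32 TFLOPS figures are approximate vendor specs, and the prices other than the 1080 Ti's are hypothetical placeholders: plug in whatever your local shops charge.

```python
# Illustrative "bang for buck" metric: FP32 GFLOPS per euro.
# TFLOPS are approximate vendor specs; only the 1080 Ti price (736 EUR)
# is the one from this build -- the others are placeholder assumptions.
cards = {
    # name: (approx. FP32 TFLOPS, assumed price in EUR)
    "GTX 1070":    (6.5,  430),
    "GTX 1080":    (8.9,  550),
    "GTX 1080 Ti": (11.3, 736),
}

for name, (tflops, price) in cards.items():
    print(f"{name}: {tflops / price * 1000:.1f} GFLOPS/EUR")
```

Of course, raw throughput per euro is not the whole story (memory size matters too, as we'll see), but it's a useful first filter.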
1. How I chose the components
Let’s start by saying what I didn’t want, piece by piece.
- CPU and PCIe lanes
I wanted at least two GPUs, and I didn’t want to be limited in terms of PCI Express lanes. Every card had to get full 16X Gen3 bandwidth.
Tim Dettmers, author of an outstanding (although a bit outdated) DL hardware guide (http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/), experimented a lot with PCIe bandwidth and reported that running a modern card at half bandwidth (8X Gen3) results in a roughly 10% worst-case penalty. Although ~10% may seem acceptable, I wanted to avoid it, since it can make quite a difference over long experiments.
Furthermore, one has to take into account other devices: for example, a modern NVMe drive takes four gen3 lanes for itself.
So, if you plan to buy some shiny 8700K-ish processor (16 lanes in total) for a dual GPU 8X/8X setup, be prepared to sacrifice something in terms of storage speed.
The matter is somewhat less pesky with the new Ryzens, since they have 20 lanes: as long as the motherboard allows it, you can go for an 8X/8X/4X configuration.
Finally, you can always resort to a mighty Threadripper: with its 64 lanes, it would leave you with a lot of room for expansion. More importantly, looking at its size, you can use it as self-defence weapon once it becomes obsolete.
For me, it was not an option, due to its price and its monstrously high power draw. When I said I wanted a cheap build, I meant the power supply and the electricity bill as well.
All these considerations led me to a single option: a used Xeon E5 (more below).
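The lane arithmetic behind this choice can be sketched as follows. The helper is a toy of my own, not any real API; the lane counts are the usable Gen3 lanes of each CPU family as discussed above, and an NVMe drive is assumed to take 4 lanes.

```python
# PCIe lane budgeting for a dual-GPU build (a toy helper for illustration).
def leftover_lanes(cpu_lanes, gpu_links):
    """Lanes left for NVMe/other devices after allocating the GPU links."""
    return cpu_lanes - sum(gpu_links)

print(leftover_lanes(16, [8, 8]))    # 8700K-ish: 0 left, nothing for NVMe
print(leftover_lanes(20, [8, 8]))    # Ryzen: 4 left, one NVMe drive at 4X
print(leftover_lanes(40, [16, 16]))  # Xeon E5: 8 left, full 16X/16X plus storage
```

Only the Xeon (and the Threadripper, with its 64 lanes) lets both cards run at 16X while keeping lanes free for fast storage.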
- Memory
I didn’t want desktop (that is, non-ECC) memory.
In my opinion, people often underestimate the importance of having error-correcting RAM. I won’t bother you with the details; let’s just say that a (quite common) bit flip in RAM could potentially corrupt any file on your hard drive. It is one of the main reasons OSes tend to go kablooey in a year or two and “have to be reinstalled”.
ECC apart, I didn’t want to buy DDR4 memory. For whatever market-related reason I don’t even want to understand, a 32GB DDR4 kit ranges from 400 to 500 euros here in the EU. Frankly, a fraud.
Solution: good old DDR3. Obviously, ECC.
- Mainboard and Chipset
I needed a socket 2011 board capable of supporting ECC memory and a Xeon E5 v1/v2, with at least two 16X Gen3 slots, adequately spaced. That meant the Intel C602 chipset, the workstation equivalent of the X79.
I ended up with a (used, yet new! See below) Fujitsu D-3128-A2 mainboard.
- GPUs
You cannot save any money here, at least if you want a Pascal card. Used ones are almost as expensive as new ones, and they are invariably clogged up with dust, so I decided to go for a new 1080 Ti, to be installed alongside my previous 1070.
- Hard Drives
I’ll leave the hard drives out of the build’s main description, since I already had them. Let’s just say I’m using a couple of seasoned Samsung 850 Pros, although I plan to add an NVMe drive soon.
As a general guideline, don’t forget that minibatches have to be moved in and out, and your drive is the one that has to serve them to the GPU. In brief, a spinning drive won’t be sufficient: SSDs are essential, NVMe is optional.
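If you want a rough idea of what a given drive can deliver, here is a minimal sequential-read probe, a sketch of my own rather than a proper benchmark. Note that the OS page cache will inflate the number for a file that was just written, so treat the result as an optimistic ballpark (tools like `fio` do this properly).

```python
# Rough sequential-read probe: can this drive feed minibatches fast enough?
# Caveat: the OS page cache can inflate the figure -- a ballpark only.
import os
import tempfile
import time

def sequential_read_mbps(path, size_mb=64, chunk=1 << 20):
    """Write a scratch file of size_mb, read it back, return MB/s."""
    with open(path, "wb") as f:
        f.write(os.urandom(size_mb * 1024 * 1024))
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):      # stream through the file in 1 MB chunks
            pass
    elapsed = time.perf_counter() - start
    os.remove(path)               # clean up the scratch file
    return size_mb / elapsed

probe = os.path.join(tempfile.gettempdir(), "dl_io_probe.bin")
print(f"~{sequential_read_mbps(probe):.0f} MB/s sequential read")
```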
Once I made my choices, I started searching eBay for the appropriate hardware at reasonable prices.
Any Xeon E5 would have been sufficient for my purposes, as long as it supported 40 Gen3 lanes (32 for the GPUs, the rest for storage and/or other devices). Indeed, benchmarks show that once you have one core and two threads per GPU, at least the Sandy Bridge generation, and clocks above 1–1.5 GHz, you will be OK (please refer to good ol’ Dettmers for detailed benchmarks).
If you want something really beefy (who doesn’t want to leave hundreds of Chrome tabs open while running Jupyter notebooks?), the best price/performance option is the E5-2680v2 (ten cores, twenty threads). You can find them for less than 200 euros, since a lot of them have been pulled from decommissioned servers. For me, something like that would have been wasted; as I said, I sought less power-hungry options. 4–6 cores would be sufficient.
I found a barebone Fujitsu M730 workstation (case, PSU, mobo, an Nvidia NVS295 and a Xeon E5-1650v2 six-core CPU) for the incredible price of 119 euros, shipped. It should be noted that all this hardware is specifically designed for 24/7 continuous operation.
I expected something incredibly worn out. Instead, I received brand new hardware. I suspect the computer was bought, and never used.
Looking at the picture below, you’ll note that the mainboard has the right slots for a solid DL build:
- It has 2 full-fledged 16X gen3 slots for the GPUs (second and fifth from top).
- It has two 8X mechanical, 4X electrical gen3 slots, ideal for hosting a couple of NVMe drives on appropriate cards.
- It has one 16X mechanical, 4X electrical gen2 slot, that I will use for the Quadro NVS 295 (which will control the monitor).
- It even has two PCI slots, if you want to add more USB ports or other crappy legacy stuff.
Was I amazingly lucky? Not quite.
The mainboard had proprietary PSU connectors. Looking at the pic below, you’ll notice it lacks the standard ATX connector and has two awkward 12V sockets instead.
It could only be used in conjunction with the original Fujitsu PSU, which in turn was just 500W and utterly lacking in PCIe connectors (just one 6-pin, while a 1080 Ti and a 1070 together need two 8-pin and one 6-pin connectors).
I was aware that you can use two different PSUs inside the same PC, provided you accept jumpering the second PSU so that it starts together with the primary one. Incredibly untidy.
But then, the same guy who told me to write this blog post suggested that a two-PSU solution was doable in a more orderly way by means of a so-called add2psu, a little contraption originally aimed at cryptominers, who need a lot of PSUs to feed their mining hardware.
Having solved the interconnection problem, I threw away the original case and ordered an Anidees AI8 full tower case. Specifically engineered to house two power supplies, it was incredibly cheap for its specifications: 106 euros (shipped) on Amazon.
The second PSU I chose, mainly for its unbeatable price/quality ratio, was the XFX P1–650G-TS3X. Boasting four 6+2 PCIe plugs, a terrific ripple suppression capability, and 80+ Gold compliance, it has excellent reviews all around the internet and is cheap (65 euros, local store) since it’s not modular. Who cares about modularity? The chassis has plenty of space for all those wires.
Now I had a combined power of 1150W, 650 of which for the GPUs.
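A back-of-the-envelope power budget shows why the split works. The GPU TDPs and the CPU TDP below are approximate published specs, and the allowance for drives, fans and mainboard is a rough guess of mine, not a measurement.

```python
# Back-of-the-envelope power budget against the two PSUs.
# TDPs are approximate published specs; "rest" is a rough guess.
psu_watts = {"Fujitsu (original)": 500, "XFX P1-650G-TS3X": 650}

gpu_tdp = {"GTX 1080 Ti": 250, "GTX 1070": 150}                  # fed by the XFX
rest    = {"Xeon E5-1650v2": 130, "drives, fans, RAM, mobo": 80}  # fed by the Fujitsu

gpu_load  = sum(gpu_tdp.values())
rest_load = sum(rest.values())

assert gpu_load  <= psu_watts["XFX P1-650G-TS3X"]     # 400W vs 650W: plenty of headroom
assert rest_load <= psu_watts["Fujitsu (original)"]   # 210W vs 500W: plenty of headroom
print(f"GPUs: {gpu_load}W / 650W; rest of the system: {rest_load}W / 500W")
```

Both PSUs end up comfortably below their ratings, which also keeps them in their most efficient load range.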
What about the GPUs? Here I strongly advise against any non-standard solution. Unless you have specific space requirements, you should go for blower-fan versions (called ‘Founders’, ‘Turbo’, etc., depending on the manufacturer): unlike dual- and triple-fan designs, they exhaust the hot air out of the case, and, even more importantly, they can be installed quite close to one another without any fuss. I grabbed a 1080 Ti Founders Edition. At 736 euros, it was the most expensive piece in the rig, vastly exceeding all the rest in price.
It weighs a ton and almost entered the slot under its own weight. The sturdy backplate ensures torsional resistance, though I’m worried the card could tear the slot off the board sooner or later.
Why did I choose a 1080ti over its less expensive siblings? Memory. I wanted the ability to manage large batches with complex NN architectures.
I already had an MSI GTX 1070 Aero ITX.
Wait, didn’t I just say to avoid non-blower cards? Right, but I added “unless you have specific needs”: I need to be able to put this card into a mini-ITX case and take it with me when I travel.
Let’s finish with the memory. It is recommended to get at least twice as much RAM as you have video memory; otherwise your computer may resort to swap files, with obvious performance losses (although NVMe mitigates the issue). I borrowed 32GB of ECC UDIMMs from another machine of mine. Once I finish an appropriate cycle of tests and benchmarks, I’ll get at least 48/64GB of ECC RDIMMs (you’ll find them in abundance, and dirt cheap, on eBay).
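Applying that rule of thumb to this build makes the numbers concrete (VRAM sizes are the published specs of the two cards):

```python
# Rule of thumb from above: system RAM >= 2x total video memory.
vram_gb = {"GTX 1080 Ti": 11, "GTX 1070": 8}

total_vram = sum(vram_gb.values())   # 19 GB of combined VRAM
min_ram = 2 * total_vram             # 38 GB minimum by the rule

print(f"total VRAM: {total_vram} GB -> aim for >= {min_ram} GB of system RAM")
```

The borrowed 32GB falls just short of the 38GB the rule suggests, hence the plan to move to 48/64GB of RDIMMs.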
RDIMMs add a useful buffer with respect to UDIMMs, which helps stabilize the command signals.
Not surprisingly, RDIMMs are the cheapest kind of memory modules you can find on the market, while at the same time being the most reliable. That’s because they cannot be used on desktop hardware (and this in turn leaves out 90% of potential buyers). If you have server-grade hardware, go for them.
Below is the finished build. Note the original PSU on top.
You can also see the 1080 Ti sitting beside an old 750 Ti. In the definitive version of the build, you will see the 1070 in place of the 750 Ti, with the NVS295 left to manage the video outputs. I had to use the 750 Ti for testing purposes, since I haven’t removed the 1070 from the previous build yet, the NVS only has DisplayPort outputs, and I didn’t have any DP cable (one is on order).