Geek Stuff: GPU Computing

Posted: August 3rd, 2009 | Filed under: GPGPU

I started this blog after learning that you can now buy a 4 TFLOP supercomputer for under $10,000.  This post will examine how this is possible and the implications and challenges for the future.

History: CPUs and GPUs

The CPU (Central Processing Unit) is the chip that runs a computer. Over time CPUs have grown faster, smarter, and capable of doing more complex things. GPUs (Graphics Processing Units) are specialized chips focused on a much narrower set of tasks: doing all the things necessary to draw fancy graphics on your screen. Over time, GPUs also got faster, but did so in large part by becoming highly parallel. CPUs have also gone multi-core, but a multi-core CPU may have 4 cores, while a current GPU may have 240 smaller, more specialized processing units.

Recently, GPUs have made their specialized functions more programmable. And while they still are not capable of doing everything needed to run your computer (ie: they are not CPU replacements), they can now run tasks beyond merely drawing things on your screen.  It turns out that the hardware required for drawing on your screen is also good for generalized mathematical processing. By making the GPUs programmable, manufacturers are now opening up these chips to developers.

Limitations and Challenges of GPU Computing

GPU Computing (or General Purpose computing on GPUs) holds great potential, but it also has significant limitations.  For instance:

  • GPUs are massively parallel. A 4 TFLOP NVIDIA computer has almost 1000 cores. Programming massively parallel computers is very difficult, even for smart people. (This was driven home to me when I spoke to two super-smart CS professors, both of whom said parallel programming is hard, even for them.)
  • GPU programming is highly constrained. First, you need a task that can be parallelized. Not all can. But even then you need tasks that are “embarrassingly parallel”: tasks that can be split up and processed without a lot of interaction between the atomic tasks. GPUs can be really fast with these sorts of problems, but if threads need to share data with each other, a low-end GPU computer will bog down. (The memory architecture isn’t up to the task.)
  • The tools and skillsets needed to support this sort of programming are nascent at best.  We’re early in the world of broadly adopted parallel programming.
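
To make the “embarrassingly parallel” distinction in the list above a bit more concrete, here is a minimal Python sketch. It is not GPU code, and the function names (`brighten`, `iterate_logistic`) and the data are invented purely for illustration. The first function does work where every element is independent; the second does work where each step needs the previous result.

```python
# Hypothetical illustration only: names and data are invented for this sketch
# and do not come from any GPU toolkit.

def brighten(pixels, amount):
    """Embarrassingly parallel: each output depends only on its own input,
    so every element could in principle be handed to a different core."""
    return [min(255, p + amount) for p in pixels]


def iterate_logistic(x, r, steps):
    """Serial as written: step n needs the result of step n-1,
    so the loop cannot simply be dealt out to independent cores."""
    for _ in range(steps):
        x = r * x * (1 - x)
    return x


if __name__ == "__main__":
    print(brighten([10, 200, 250], 20))
    print(iterate_logistic(0.5, 4.0, 2))
```

Image-style work like `brighten` is the kind of task the many small cores on a GPU are suited to; a chain like `iterate_logistic` gains nothing from extra cores unless the algorithm itself is rethought.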

Opportunity

Despite these challenges, it seems inevitable to me that industry will find a way to overcome them and take advantage of this new processing power. First, the switch to parallel programming has to happen. Main CPUs have already gone multi-core, and soon will have 8, then 16, 32, 64, then 128 cores. To take advantage of this, parallel programming will be a must. GPUs (with 240 cores) are ahead of the curve, but the industry will catch up. And second, industry always finds ways to take advantage of new technologies. It’s not obvious why a construction company needs 4 TFLOPs of power, but eventually they will have it. (Ok, maybe it’s not that hard to imagine a construction company having very sophisticated computer models of the building they will construct, rather than mere blueprints.)

Right now, much of the potential of GPU computing is being applied to traditional supercomputer or computationally intensive applications: scientific modeling, Computer Aided Design, Computer Aided Diagnosis (eg: having a computer automatically read a CAT scan), oil & gas exploration, etc. These are arenas where people already had difficult, computationally oriented challenges, and applying the new technology is more straightforward (ie: do the same thing, just better / cheaper / in more situations). NVIDIA has seen speedups of 3x to 40x for some of these applications.

More interesting to me is which general business applications could benefit from this processing power. And here, for applications that are not already optimized for parallel processing, NVIDIA has seen speedups of as much as 100x. That’s amazing: two orders of magnitude is transformational. So applications will arise, and it is up to the entrepreneur to find them.


The computing revolution no one knows about

Posted: July 24th, 2009 | Filed under: GPGPU

I’m a pretty technical and informed person when it comes to information technology. I’m not an engineer (any more), but compared to most business people working in the industry, I know a fair bit. But I had no idea that there is a massive revolution going on in the computing industry. The fundamental paradigm that has powered the industry for the last 20 years has changed. It amazes me I didn’t know this. And I’m guessing that many techno-savvy people don’t either.

The Past

Since the mid-’80s, the computer industry has been built on the fact that the speed of computers doubles every 18 months. This is widely described as “Moore’s Law,” but Moore’s Law is slightly different. Moore’s Law states that the density of transistors in a chip doubles about every 18 months to two years. For nearly 20 years microprocessor companies could squeeze more transistors into their chips and run them twice as fast, doubling the frequency the chips run at. This doubling of frequency directly led to a doubling of performance. (Frequency ~ the number of instructions that can be run in a second.) Thus the speed of a single processor grew at about 52% per year, year after year. This predictable increase in processing power has driven the growth of the computer industry. (Think about how awesome your iPhone is. That wasn’t possible 4 years ago.)
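
As a quick sanity check on how the “doubles every 18 months” and “about 52%” figures above relate, here is a small Python sketch of the compound-growth arithmetic. The function names are my own, and the only assumption is steady compound growth.

```python
import math


def annual_growth(months):
    """Yearly growth rate implied by doubling once every `months` months."""
    return 2 ** (12 / months) - 1


def months_to_double(annual_rate):
    """Months needed to double at a steady yearly growth rate."""
    return 12 * math.log(2) / math.log(1 + annual_rate)


if __name__ == "__main__":
    print(f"Doubling every 18 months = {annual_growth(18):.1%} per year")
    print(f"52% per year = doubling every {months_to_double(0.52):.1f} months")
```

Doubling every 18 months works out to roughly 59% a year, and 52% a year corresponds to doubling a little under every 20 months, so the two figures are in the same ballpark rather than exactly equal.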

The Change

People have been predicting the demise of Moore’s Law for years, but even with existing technology projections, Moore’s Law seems to have at least a few cycles left. That said, the predictable implication of Moore’s Law – that single processor performance doubles every 18 months – has already broken.

This is a huge huge point and suggests a fundamental shift in the computer industry. But before I discuss the implications, let me explain what is going on.

For 20 years, Intel kept doubling the frequency of their chips, thereby doubling performance. But in 2004, Intel hit “the power wall.” (Source) The power consumption of a chip is directly related to the frequency it runs at. So every time you double the frequency (other things being equal), you double the power required to run the chip. This gets costly in terms of straight electricity cost, but you also have to spend a lot on expensive air conditioning to suck the excess heat away. And the further you push the chips, the more these power costs dominate the benefit of increased performance. And on top of that, it gets difficult to cool these chips even if you wanted to. So single-processor performance stalled.
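
For readers who want the relationship behind “double the frequency, double the power,” the standard first-order approximation for the switching (dynamic) power of a CMOS chip is sketched below. It is a textbook simplification, not something taken from the source linked above, and it ignores leakage power.

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} f
```

Here $\alpha$ is the fraction of the circuit switching each cycle, $C$ is the switched capacitance, $V$ is the supply voltage, and $f$ is the clock frequency. With everything else held fixed, power is proportional to $f$, which is the “other things being equal” case. In practice, running at a higher frequency usually also needs a higher voltage, and because voltage enters squared, power tends to climb faster than frequency does.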

But Moore’s Law keeps trucking, so what can you do with the extra transistors? Well, instead of building a processor that is twice as fast, just build two of them.  When transistor density doubles again, build 4. Then 8. Etc.  You can see this taking place already in dual core and now quad core processors.

The Hidden Revolution

Anyone who’s been paying attention to computers knows that we’ve moved to a multi-core world. And maybe in the back of our minds we’ve wondered about the implications. Is 2 really twice as good as 1? But I could imagine one processor is doing my antivirus check while one lets me surf the net and do email. And when we get to 4? And 8? 16? … 128 cores? What in the world would my laptop do with 128 cores? And would it really be 128 times as fast as 1 core?

There’s the hidden revolution: the answer is, basically, “no.”  Or, at least, a 128 core computer is not going to be 128 times as fast as a 1 core computer without some serious changes in the industry.  That’s the revolution.

In the past, if you wrote a program and didn’t touch it, in 7 generations (about 18 months × 7 = 10.5 years), your program would run 128 times faster. (Ignoring I/O issues, which are obviously crucial.) With many-core, to take advantage of the 128 cores, you would have to entirely rewrite your code to do parallel processing. And this is assuming it is even possible to parallelize your code. If your program is fundamentally serial, then 7 microprocessor generations down the road it might not run any faster.
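
One common way to put numbers on this is Amdahl’s law, a standard model that the post itself does not name. The Python sketch below assumes a program in which some fixed fraction of the work can be spread perfectly across cores and the rest must run serially. The function name and the sample fractions are my own choices for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: best-case speedup when only `parallel_fraction`
    of the work can be spread across `cores` processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9, 0.99, 1.0):
        speedup = amdahl_speedup(p, 128)
        print(f"parallel fraction {p:.2f}: {speedup:6.1f}x on 128 cores")
```

Under this model a fully serial program gets no speedup at all from 128 cores, a program that is half parallel tops out just under 2x, and only a perfectly parallel program reaches the full 128x.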

What This Means

There is a fundamental shift that needs to take place in the computing industry, from serial programming to parallel programming. This is a very non-trivial change. Some companies, like Intel, are betting that it is too big a change, and their goal is to hide the complexity and paradigm shift from users beneath smarter processors, compilers, and OSes. My gut feel, admittedly knowing little, is that this is a short-term solution at best. If the future of computing is parallel, then the future of programming will be parallel too, and all new students coming out of college or grad school will be well versed in parallel programming models.

Another implication may be that for many applications, speed may simply not improve very much. It is not obvious that all programs (or algorithms) are parallelizable. (In fact, many probably aren’t.) So if you have a set of tasks that require serial processing, the doubling of performance every 18 months will no longer apply.

Conclusion

This is just a taste of the issues involved. (And I apologize for any mistakes or simplifications I’ve made – my knowledge is days old.) But the implications are huge.


Supercool: The personal supercomputer

Posted: July 17th, 2009 | Filed under: GPGPU

The discovery that got me re-energized about starting something, and that is providing my jump-off point for investigations, is the following:

You can now buy a 4 Teraflop supercomputer for under $10,000.

Why this is amazing

For someone like me (ie: know enough to be amazed, but not enough to be blasé), this is extraordinary. In 1999, the world’s fastest supercomputer clocked in at 2 TFLOPS (Source). It was a typical supercomputer, taking up over 2,000 sq. ft. of space and probably costing close to $100 million (I’m just guessing on the cost). And now you can buy something TWICE as fast for 1/10,000th the cost, plug it in, and put it under your desk.

The thing that’s incredible to me is that this isn’t a comparison with some top-of-the-line computer from 1950, but from 1999. Sure, my cell phone has more processing power than the fastest supercomputer in the world from the distant past, but that comparison has lost meaning to me. Back then people were still driving around in horse-drawn carriages and playing Pong. But 1999 is CURRENT. That’s AFTER the internet exploded. That’s the modern era. And that supercomputer - that top-of-the-line, hardcore, crazy technology - is now available for $10k. Wow.  [Actually, this comparison is a bit of apples vs. oranges, but it's close enough to true to be startling.]

How they do it

I think there is increasing activity in the field of personal supercomputing, but the one that got me started is the NVIDIA Tesla (Nvidia, Wikipedia). NVIDIA makes graphics accelerator chips - the chips in your computer used to render graphics (GPU - Graphics Processing Unit). These chips are designed to do all the things needed to display stuff on your computer screen. If you’re just reading text, that’s not that big a deal. But if you are playing a high-end video game on a wide-screen monitor, that IS a big deal.

It turns out that GPUs, with some work, can be used to perform general purpose and floating point (ie: math) processing (General Purpose Computing on GPU). NVIDIA put 240 of these processing cores on a card, strung 4 of those cards together (960 cores in all), and voila: you’ve got a 4 teraflop supercomputer for under $10k.

What are the implications?

This is where things get a bit murkier. The way these HPC (High Performance Computing) computers work is that they string together hundreds or thousands of individual processors (or cores). You can only get a single processor to be so fast, but if you string 1000 together, then it’s a thousand times faster.

Actually, it’s not that easy. It’s only a thousand times faster if you can split up your problem into 1000 smaller problems that can be worked on in parallel. This is fine for some hard math problems like weather simulation, nuclear modeling, and other traditional supercomputing tasks. But it doesn’t mean you can just throw a Tesla computer under your desk and have Windows boot in 1/10th of a second. (Even disregarding I/O issues.) So one big issue is finding problems that are parallelizable.
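
Here is a rough Python sketch of what “splitting a problem into smaller problems” looks like in code. It is only an illustration of the pattern: the function names are invented, it runs on ordinary CPU threads rather than a Tesla, and in standard CPython threads will not actually speed up number-crunching like this. The point is the shape: cut the data into chunks, work on each chunk independently, then combine the partial answers.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut `data` into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The 'smaller problem': needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Split the data, solve each chunk independently, combine the results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    numbers = list(range(1000))
    print(parallel_sum_of_squares(numbers) == sum_of_squares(numbers))
```

This works because a sum can be computed in any order. Booting an operating system is the opposite kind of job: most steps have to wait for earlier ones, so there is little to hand out to a thousand workers.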

Also, what really needs that sort of computing power? I’m particularly interested in general-purpose business applications. Sure, experimental physicists and the defense department are always going to need hardcore computing power. But does your average company? A standard desktop computer that you can buy for under $1,000 probably has more computing power than most people currently ever use (gaming aside). And while businesses often need big computing power, it is currently more about I/O, databases, transactions, and web serving. This is cool stuff, no doubt, but different from raw processing power.

All that said, there will be uses for teraflop computing for the masses. I don’t know what they will be, but it sounds revolutionary to me and my initial research suggests that others also see this as an upcoming technology transition.

Next Steps

This is cool stuff. My goal now is to start pulling on the HPC thread and see what I can discover. The question I’m trying to answer is:

What becomes possible when you can buy a 4 TFLOP computer for under $10k?