CUDA 101

September 10, 2018

Helper

The paper: Ian Buck's thesis
The book: CUDA by Example, plus the official documents
The course: CS344 on Udacity and the corresponding GitHub repo

From Graphics to General Computing

This is the first post in a series of CUDA notes meant to help engineers tame thousands of threads on GPU devices for their work. It is written in English so that more experts can help correct any mistakes :)

The MythBusters GPU vs. CPU demo video might be the best illustration of why a GPU is faster on parallel work.

Back in 2006, people in Stanford's Graphics Lab tried to bring the compute power of the GPU to general-purpose computing and to ship it as a product; part of that work is recorded in Ian Buck's PhD thesis. The work later influenced hardware design and led to CUDA (Compute Unified Device Architecture). Dr. David Kirk ROCKS! From a software perspective, CUDA extends the standard C programming language and exposes the GPU's power to software engineers without requiring expertise in graphics. The nvcc compiler splits a program in two: host (CPU) code is handed to a host compiler such as gcc or Visual Studio, while GPU functions (kernels) are compiled into device code for the CUDA Runtime (cudart), which communicates with the device driver to schedule and run compute jobs.

More Threads->Need for Speed

The idea is simple: use more threads to run the application in parallel. As long as the work can be split up, the more threads you have, the more time you save. And it's easy to start writing such a program:

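// from CS344
// build and run (assuming the file is saved as hello.cu): nvcc hello.cu -o hello && ./hello
#include <stdio.h>
#define NUM_BLOCKS 1
#define BLOCK_WIDTH 256
// kernel: runs on the GPU once for every thread in the launch
__global__ void hello()
{
 printf("Hello world! I'm thread %d\n", threadIdx.x);
}
int main(int argc,char **argv)
{
 // launch the kernel with NUM_BLOCKS blocks of BLOCK_WIDTH threads each
 hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();
 // wait for the kernel to finish, which also forces the printf()s to flush
 cudaDeviceSynchronize();
 printf("That's all!\n");
 return 0;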
}
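
The hello kernel runs in a single block, so threadIdx.x alone identifies a thread. With more than one block, each thread usually derives a global index from blockIdx, blockDim and threadIdx. Below is a minimal sketch of that pattern, not code from CS344: the kernel add_one, the helper run_add_one and the array size N are hypothetical names and values picked for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>

#define N 1000            // illustrative array size (an assumption)
#define BLOCK_WIDTH 256   // threads per block, as in the hello example

// Hypothetical kernel: each thread increments exactly one array element.
__global__ void add_one(int *data, int n)
{
    // global index = offset of this block + index inside the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // the last block may contain spare threads
        data[i] += 1;
}

// Hypothetical helper: runs add_one over n zero-initialized ints and
// returns how many elements ended up equal to 1, or -1 on failure.
int run_add_one(int n)
{
    size_t bytes = (size_t)n * sizeof(int);
    int *h = (int *)calloc(n, sizeof(int));
    int *d = NULL;
    if (h == NULL) return -1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { free(h); return -1; }

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // round up so that every element is covered by some thread
    int blocks = (n + BLOCK_WIDTH - 1) / BLOCK_WIDTH;
    add_one<<<blocks, BLOCK_WIDTH>>>(d, n);

    // this blocking copy also waits for the kernel to finish
    cudaError_t err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    int ok = 0;
    if (err == cudaSuccess)
        for (int i = 0; i < n; ++i)
            if (h[i] == 1) ++ok;

    cudaFree(d);
    free(h);
    return (err == cudaSuccess) ? ok : -1;
}

int main(void)
{
    printf("%d of %d elements updated\n", run_add_one(N), N);
    return 0;
}
```

Rounding the block count up and guarding with i < n is a common way to handle sizes that are not a multiple of the block width; other layouts, such as a grid-stride loop, would work as well.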

No Free Lunch

On the other hand, as you may know, when you try to solve a problem with multiple threads, you now have two problems :) Besides figuring out how to make the algorithm friendly to parallelization, you also face a memory hierarchy that is more complex in order to serve such devices; the cost of cudaMalloc compared to its host-side counterparts is one example. A single GPU thread is relatively weak, and taming thousands of them is challenging but also rewarding: you gain a deeper understanding of computer architecture and can deploy lower-cost, higher-throughput services.
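
One way to get a feel for the allocation cost mentioned above is to measure it. The following is a rough sketch, assuming malloc is the host-side counterpart in question; the helpers time_host_alloc and time_device_alloc, the 1 MiB buffer size and the iteration count are all hypothetical choices. The numbers depend heavily on the GPU, driver and host allocator, so treat them as a hint, not a benchmark.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <chrono>

// Hypothetical helper: average microseconds per host malloc/free pair.
double time_host_alloc(size_t bytes, int iters)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        volatile char *p = (volatile char *)malloc(bytes);
        if (p) p[0] = 1;          // touch the buffer so the pair is not optimized away
        free((void *)p);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iters;
}

// Hypothetical helper: average microseconds per cudaMalloc/cudaFree pair,
// or -1.0 if an allocation fails.
double time_device_alloc(size_t bytes, int iters)
{
    cudaFree(0);                  // warm-up: create the CUDA context before timing
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        void *p = NULL;
        if (cudaMalloc(&p, bytes) != cudaSuccess) return -1.0;
        cudaFree(p);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iters;
}

int main(void)
{
    const size_t bytes = 1 << 20; // 1 MiB, an arbitrary choice
    const int iters = 100;        // arbitrary number of repetitions
    printf("malloc/free:         %.2f us per pair\n", time_host_alloc(bytes, iters));
    printf("cudaMalloc/cudaFree: %.2f us per pair\n", time_device_alloc(bytes, iters));
    return 0;
}
```

If the device pair turns out to be much slower on your machine, a common response is to allocate device buffers once and reuse them across kernel launches.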

Take a Sip

If you are interested in CUDA programming and need the skill, please start by reading the official docs, mainly the Programming Guide and the Best Practices Guide, and do some quick labs to explore. Some links are provided below to help you get started.

# please run this script with Python
import webbrowser

links = ['http://graphics.stanford.edu/~ianbuck/thesis.pdf',
         'https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html',
         'https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html',
         'https://developer.nvidia.com/cuda-example',
         'https://classroom.udacity.com/courses/cs344',
         'https://github.com/udacity/cs344/']

# open each link in the default browser
for link in links:
    webbrowser.open(link)