Start up:¶
- read the lecture notes on GPU/CUDA programming with Python+Numba (see Moodle)
- open a VPN connection to open.unice.fr (use `login@MIAGEIA`, with `login` replaced with your own login)
- create an account on http://miage-gpu2.unice.fr:8001
- create a Python notebook like this one, and start coding with Python
Kernel Hello world¶
- Copy and execute the code below
- Modify it so as to change the number of threads per block and/or the number of blocks. Find the values that stop Numba from issuing the `NumbaPerformanceWarning`.
- Print `threadIdx.x`, `threadIdx.y`, `threadIdx.z` and `blockIdx.x`, `blockIdx.y`, `blockIdx.z` (a sketch follows the code cell below).
- Try with a 2D grid.
- Test the limitations mentioned in Fabrice Huet's lecture on `print` in a kernel and on the arguments passed to the kernel.
In [ ]:
import numba
from numba import cuda

@cuda.jit
def monKernel():
    # cuda.grid(1) is the absolute index of this thread in the 1D grid
    print("from", cuda.grid(1), "hello, world!")

gridSize = 3
nbThreadsPerBlock = 4
monKernel[gridSize, nbThreadsPerBlock]()
cuda.synchronize()  # wait for the kernel so that its output is flushed
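A possible sketch for the next two bullets (printing the thread/block indices and launching a 2D grid); the kernel name and the launch dimensions are my choices, not imposed by the exercise:
In [ ]:
from numba import cuda

@cuda.jit
def indicesKernel():
    # Coordinates of this thread inside its block
    print("threadIdx:", cuda.threadIdx.x, cuda.threadIdx.y, cuda.threadIdx.z)
    # Coordinates of this block inside the grid
    print("blockIdx:", cuda.blockIdx.x, cuda.blockIdx.y, cuda.blockIdx.z)

# 1D launch, as in the cell above
indicesKernel[3, 4]()
# 2D launch: the tuples give the blocks per grid and the threads per block along x and y
indicesKernel[(2, 3), (4, 2)]()
cuda.synchronize()  # flush the device-side prints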
Hardware limitations¶
- Execute the code below to see what the hardware limitations are
In [ ]:
from numba import cuda
gpu = cuda.get_current_device()
print("name = %s" % gpu.name)
print("maxThreadsPerBlock = %s" % str(gpu.MAX_THREADS_PER_BLOCK))
print("maxBlockDimX = %s" % str(gpu.MAX_BLOCK_DIM_X))
print("maxBlockDimY = %s" % str(gpu.MAX_BLOCK_DIM_Y))
print("maxBlockDimZ = %s" % str(gpu.MAX_BLOCK_DIM_Z))
print("maxGridDimX = %s" % str(gpu.MAX_GRID_DIM_X))
print("maxGridDimY = %s" % str(gpu.MAX_GRID_DIM_Y))
print("maxGridDimZ = %s" % str(gpu.MAX_GRID_DIM_Z))
print("maxSharedMemoryPerBlock = %s" % str(gpu.MAX_SHARED_MEMORY_PER_BLOCK))
print("asyncEngineCount = %s" % str(gpu.ASYNC_ENGINE_COUNT))
print("canMapHostMemory = %s" % str(gpu.CAN_MAP_HOST_MEMORY))
print("multiProcessorCount = %s" % str(gpu.MULTIPROCESSOR_COUNT))
print("warpSize = %s" % str(gpu.WARP_SIZE))
print("unifiedAddressing = %s" % str(gpu.UNIFIED_ADDRESSING))
print("pciBusID = %s" % str(gpu.PCI_BUS_ID))
print("pciDeviceID = %s" % str(gpu.PCI_DEVICE_ID))
- See what happens when you run a kernel without respecting these limitations
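As an illustration of the previous point, a minimal sketch of a launch that violates one of these limits (here `MAX_THREADS_PER_BLOCK`); the kernel name is mine, and since the exact exception type may vary with the Numba version, the sketch catches a generic exception:
In [ ]:
from numba import cuda

@cuda.jit
def emptyKernel():
    pass

gpu = cuda.get_current_device()
too_many = gpu.MAX_THREADS_PER_BLOCK + 1   # one thread more than the hardware allows

try:
    emptyKernel[1, too_many]()
    cuda.synchronize()
except Exception as e:   # the launch is rejected by the CUDA driver
    print("launch rejected:", type(e).__name__, e)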
Kernels that access an array¶
- Write a kernel that takes as argument a 1D numpy array of 32-bit signed integers, initialised with zeros and of size `gridSize * nbThreadsPerBlock`, and that writes a 1 in each array cell (each thread takes care of one cell). A sketch follows this list.
- Optimize your code with an explicit transfer of the array from host to device before the kernel launch, and from device to host afterwards.
- Write a kernel that takes as argument a 1D numpy array `t` of unsigned 32-bit integers initialised with `t[i] == i` for every index `i`. The kernel copies `t[i+1]` into `t[i]` (i.e. `for i in range(len(t)), in parallel: t[i] = t[i+1]`). What do you observe? Try with different grid sizes and numbers of threads per block. A sketch also follows this list.
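A possible sketch for the first exercise; the kernel name `fillOnes`, the guard on the array size and the grid dimensions are mine. The second launch shows the explicit transfers with `cuda.to_device` and `copy_to_host`:
In [ ]:
import numpy as np
from numba import cuda

@cuda.jit
def fillOnes(t):
    i = cuda.grid(1)           # absolute index of this thread in the 1D grid
    if i < t.shape[0]:         # guard: the grid may be larger than the array
        t[i] = 1

gridSize = 8
nbThreadsPerBlock = 128
a = np.zeros(gridSize * nbThreadsPerBlock, dtype=np.int32)

# Naive version: Numba transfers `a` to the device and back automatically
fillOnes[gridSize, nbThreadsPerBlock](a)

# Explicit version: one transfer to the device, the launch, one transfer back
d_a = cuda.to_device(np.zeros_like(a))
fillOnes[gridSize, nbThreadsPerBlock](d_a)
result = d_a.copy_to_host()
print(result.sum() == result.size)   # True if every cell was set to 1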
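For the second exercise, one possible kernel (names and launch configuration are mine); since every thread reads `t[i+1]` while its neighbour may already have overwritten that cell, the result depends on scheduling, which is what the question asks you to observe:
In [ ]:
import numpy as np
from numba import cuda

@cuda.jit
def shiftLeft(t):
    i = cuda.grid(1)
    if i < t.shape[0] - 1:
        t[i] = t[i + 1]        # unsynchronised reads and writes on shared data: a race

n = 1024
t = np.arange(n, dtype=np.uint32)     # t[i] == i
d_t = cuda.to_device(t)
shiftLeft[4, 256](d_t)
print(d_t.copy_to_host()[:16])        # compare with the expected [1, 2, 3, ...]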
Image manipulation¶
- Based on this introduction to the manipulation of images with PIL, write a function that takes a colour RGB image and returns its grayscale version.
- Optimise your function with a kernel (a sketch follows). Start with a small image, like lena.png, then move on to larger images, like mona-lisa.jpg, making sure you respect all hardware limitations.
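A possible sketch of the kernel version, with one thread per pixel on a 2D grid; the luminance weights (ITU-R BT.601), the 16×16 block size and the output file name are my choices:
In [ ]:
import numpy as np
from PIL import Image
from numba import cuda

@cuda.jit
def grayKernel(rgb, gray):
    x, y = cuda.grid(2)                       # one thread per pixel
    if x < gray.shape[0] and y < gray.shape[1]:
        r, g, b = rgb[x, y, 0], rgb[x, y, 1], rgb[x, y, 2]
        gray[x, y] = 0.299 * r + 0.587 * g + 0.114 * b   # BT.601 luminance

def to_grayscale(path):
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    gray = np.zeros(rgb.shape[:2], dtype=np.float32)
    threads = (16, 16)                                    # 256 threads per block, under the limits above
    blocks = ((gray.shape[0] + 15) // 16, (gray.shape[1] + 15) // 16)
    grayKernel[blocks, threads](rgb, gray)
    return Image.fromarray(gray.astype(np.uint8))

to_grayscale("lena.png").save("lena-gray.png")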
Game of life¶
- Write a function `life(initial_configuration, nb_iterations)` that takes an array of integers (either 0 or 1) and a number of iterations of the rules of Life, and returns the final configuration.
- Optimize your function with a kernel (a sketch follows). Remember that `cuda.synchronize()` introduces a synchronization barrier.
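A minimal sketch of a CUDA version of `life`, assuming a 2D configuration, borders treated as dead cells, and two device buffers swapped between generations; the kernel name and block size are mine:
In [ ]:
import numpy as np
from numba import cuda

@cuda.jit
def lifeStep(src, dst):
    x, y = cuda.grid(2)
    h, w = src.shape
    if x < h and y < w:
        # Count the live neighbours of (x, y); cells outside the grid count as dead
        n = 0
        for dx in range(-1, 2):
            for dy in range(-1, 2):
                if (dx != 0 or dy != 0) and 0 <= x + dx < h and 0 <= y + dy < w:
                    n += src[x + dx, y + dy]
        # Conway's rules: a live cell survives with 2 or 3 neighbours, a dead one is born with 3
        dst[x, y] = 1 if n == 3 or (n == 2 and src[x, y] == 1) else 0

def life(initial_configuration, nb_iterations):
    d_src = cuda.to_device(np.asarray(initial_configuration, dtype=np.int32))
    d_dst = cuda.device_array_like(d_src)
    threads = (16, 16)
    blocks = ((d_src.shape[0] + 15) // 16, (d_src.shape[1] + 15) // 16)
    for _ in range(nb_iterations):
        lifeStep[blocks, threads](d_src, d_dst)
        cuda.synchronize()              # barrier: finish one generation before starting the next
        d_src, d_dst = d_dst, d_src     # the result becomes the source of the next iteration
    return d_src.copy_to_host()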