Lab on parallelism M1 info 2023-2024

GPU programming with Python and CUDA

Getting started:

  • read the lecture notes on GPU/CUDA programming with Python+Numba (see Moodle)
  • open a VPN connection to open.unice.fr (use login@MIAGEIA, replacing login with your own login)
  • create an account on http://miage-gpu2.unice.fr:8001
  • create a Python notebook like this one, and start coding with Python

Alternative to MIAGE IA: a third-party VM

If your VPN access does not work (or the above fails for any reason), you may consider using a third-party VM. Sadly, this usually requires creating an account, often leaving a credit card number, etc., so this is not the recommended solution.

Here is how to do it with Google Colab:

  • log in to colab.google.com
  • create a new Python notebook
  • use the Connect button and change the runtime type to select a GPU-backed VM (e.g. T4 GPU).

Kernel Hello world

  1. Copy and execute the code below
  2. Modify it so as to change the number of threads per block and/or the number of blocks. Find values for which Numba stops emitting the NumbaPerformanceWarning.
  3. Print threadIdx.x, threadIdx.y, threadIdx.z and blockIdx.x, blockIdx.y, blockIdx.z.
  4. Try with a 2D grid (a possible sketch for tasks 3 and 4 follows the example output below).
  5. Test the limitations mentioned in Fabrice Huet's lecture on printing from a kernel and on the arguments passed to a kernel.
In [ ]:
import numba
from numba import cuda

@cuda.jit
def monKernel():
    # cuda.grid(1) returns this thread's global index in a 1D grid
    print("from", cuda.grid(1), "hello, world!")

gridSize = 3           # number of blocks
nbThreadsPerBlock = 4  # threads per block
monKernel[gridSize, nbThreadsPerBlock]()
/opt/conda/lib/python3.10/site-packages/numba/cuda/dispatcher.py:538: NumbaPerformanceWarning: Grid size 3 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
from 8 hello, world!
from 9 hello, world!
from 10 hello, world!
from 11 hello, world!
from 4 hello, world!
from 5 hello, world!
from 6 hello, world!
from 7 hello, world!
from 0 hello, world!
from 1 hello, world!
from 2 hello, world!
from 3 hello, world!
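Note that the blocks are not scheduled in any particular order. For tasks 3 and 4, here is a minimal sketch (the kernel name and launch shapes are my own choices, not an imposed solution): passing tuples as launch parameters makes the grid and the blocks multi-dimensional.
In [ ]:
from numba import cuda

@cuda.jit
def monKernel2D():
    # each thread prints its position inside its block and its block's position
    print("threadIdx:", cuda.threadIdx.x, cuda.threadIdx.y, cuda.threadIdx.z,
          "blockIdx:", cuda.blockIdx.x, cuda.blockIdx.y, cuda.blockIdx.z)

gridSize = (2, 3)           # a 2D grid of 2 x 3 blocks
nbThreadsPerBlock = (4, 2)  # 4 x 2 threads per block
monKernel2D[gridSize, nbThreadsPerBlock]()
cuda.synchronize()          # wait for the kernel so the prints are flushed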

Hardware limitations

  1. Execute the code below to see the hardware limitations of the GPU you are using
In [ ]:
from numba import cuda
gpu = cuda.get_current_device()
print("name = %s" % gpu.name)
print("maxThreadsPerBlock = %s" % str(gpu.MAX_THREADS_PER_BLOCK))
print("maxBlockDimX = %s" % str(gpu.MAX_BLOCK_DIM_X))
print("maxBlockDimY = %s" % str(gpu.MAX_BLOCK_DIM_Y))
print("maxBlockDimZ = %s" % str(gpu.MAX_BLOCK_DIM_Z))
print("maxGridDimX = %s" % str(gpu.MAX_GRID_DIM_X))
print("maxGridDimY = %s" % str(gpu.MAX_GRID_DIM_Y))
print("maxGridDimZ = %s" % str(gpu.MAX_GRID_DIM_Z))
print("maxSharedMemoryPerBlock = %s" % str(gpu.MAX_SHARED_MEMORY_PER_BLOCK))
print("asyncEngineCount = %s" % str(gpu.ASYNC_ENGINE_COUNT))
print("canMapHostMemory = %s" % str(gpu.CAN_MAP_HOST_MEMORY))
print("multiProcessorCount = %s" % str(gpu.MULTIPROCESSOR_COUNT))
print("warpSize = %s" % str(gpu.WARP_SIZE))
print("unifiedAddressing = %s" % str(gpu.UNIFIED_ADDRESSING))
print("pciBusID = %s" % str(gpu.PCI_BUS_ID))
print("pciDeviceID = %s" % str(gpu.PCI_DEVICE_ID))
  2. See what happens when you run a kernel that does not respect these limitations (a possible test is sketched below).
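A minimal sketch of such a test (the kernel name is my own choice, and the exact exception type raised is an assumption, so it is caught broadly):
In [ ]:
from numba import cuda

@cuda.jit
def emptyKernel():
    pass

gpu = cuda.get_current_device()
tooMany = gpu.MAX_THREADS_PER_BLOCK + 1  # one thread more than the hardware allows
try:
    emptyKernel[1, tooMany]()
    cuda.synchronize()
except Exception as e:  # Numba surfaces a CUDA driver error here
    print("launch failed:", type(e).__name__, e)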

Kernels that access an array

  1. Write a kernel that takes as argument a 1D numpy array of 32-bit signed integers, initialised with zeros and of size gridSize * nbThreadsPerBlock, and that writes a 1 in each array cell (each thread takes care of one cell).
  2. Optimize your code with explicit transfers of the array from host to device before the kernel launch, and from device back to host afterwards (see the first sketch after this list).
  3. Write a kernel that takes as argument a 1D numpy array t of unsigned 32-bit integers initialised with t[i] == i for every index i. The kernel copies t[i+1] into t[i] (i.e. for i in range(len(t) - 1), in parallel: t[i] = t[i+1]). What do you observe? Try with different grid sizes and numbers of threads per block (the second sketch after this list shows one way to set this up).
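A possible sketch for tasks 1 and 2 (all names and launch shapes are my own choices): cuda.to_device and copy_to_host make the transfers explicit, so the array is copied once in each direction instead of at every kernel call.
In [ ]:
import numpy as np
from numba import cuda

@cuda.jit
def writeOnes(t):
    i = cuda.grid(1)  # global index of this thread
    if i < t.size:    # guard: never index past the end of the array
        t[i] = 1

gridSize = 8
nbThreadsPerBlock = 128
t = np.zeros(gridSize * nbThreadsPerBlock, dtype=np.int32)

d_t = cuda.to_device(t)                      # host -> device, once
writeOnes[gridSize, nbThreadsPerBlock](d_t)  # the kernel works on the device array
d_t.copy_to_host(t)                          # device -> host, once
print(t.sum() == t.size)                     # True if every cell was written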
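For task 3, one possible setup (again, the names and launch shape are mine); observe that the read of t[i+1] and the write to t[i] are not synchronized between threads.
In [ ]:
import numpy as np
from numba import cuda

@cuda.jit
def shiftLeft(t):
    i = cuda.grid(1)
    if i < t.size - 1:
        t[i] = t[i + 1]  # another thread may have already overwritten t[i+1]

n = 1024
t = np.arange(n, dtype=np.uint32)  # t[i] == i
shiftLeft[4, 256](t)               # Numba copies t to the device and back
print(t[:16])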

Image manipulation

  1. Based on this introduction to the manipulation of images with PIL, write a function that takes a color RGB image and returns its grayscale version.
  2. Optimize your function with a kernel (a possible sketch follows this list). Start with a small image, like lena.png, and then move on to larger images, like mona-lisa.jpg, making sure you respect all hardware limitations.
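A minimal sketch, assuming lena.png is in the working directory and using the usual luminance weights (the kernel name, block shape, and weights are my choices, not an imposed solution):
In [ ]:
import numpy as np
from numba import cuda
from PIL import Image

@cuda.jit
def grayKernel(rgb, gray):
    x, y = cuda.grid(2)  # one thread per pixel
    if x < gray.shape[0] and y < gray.shape[1]:
        # weighted sum of the three channels, truncated to an 8-bit value
        gray[x, y] = np.uint8(0.299 * rgb[x, y, 0]
                              + 0.587 * rgb[x, y, 1]
                              + 0.114 * rgb[x, y, 2])

img = np.asarray(Image.open("lena.png").convert("RGB"))
gray = np.zeros(img.shape[:2], dtype=np.uint8)

threads = (16, 16)  # 256 threads per block, within MAX_THREADS_PER_BLOCK
blocks = ((gray.shape[0] + threads[0] - 1) // threads[0],
          (gray.shape[1] + threads[1] - 1) // threads[1])
grayKernel[blocks, threads](img, gray)
Image.fromarray(gray).save("lena-gray.png")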

Game of life

  1. Write a function life(initial_configuration, nb_iterations) that takes an array of integers (each 0 or 1) and a number of iterations of the rule of life, and returns the final configuration.
  2. Optimize your function with a kernel (a possible sketch follows). Remember that cuda.synchronize() introduces a synchronization barrier.
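A minimal sketch, assuming a 2D configuration whose border cells stay dead; double buffering plus cuda.synchronize() separates the generations (all names and launch shapes are my own choices):
In [ ]:
import numpy as np
from numba import cuda

@cuda.jit
def lifeStep(current, nxt):
    x, y = cuda.grid(2)
    # interior cells only: the border is assumed to stay dead
    if 0 < x < current.shape[0] - 1 and 0 < y < current.shape[1] - 1:
        # count the 8 neighbours
        n = (current[x-1, y-1] + current[x-1, y] + current[x-1, y+1]
             + current[x, y-1]                   + current[x, y+1]
             + current[x+1, y-1] + current[x+1, y] + current[x+1, y+1])
        # a cell is born with 3 neighbours, survives with 2 or 3
        nxt[x, y] = 1 if n == 3 or (current[x, y] == 1 and n == 2) else 0

def life(initial_configuration, nb_iterations):
    current = cuda.to_device(initial_configuration)
    nxt = cuda.to_device(np.zeros_like(initial_configuration))
    threads = (16, 16)
    blocks = ((initial_configuration.shape[0] + threads[0] - 1) // threads[0],
              (initial_configuration.shape[1] + threads[1] - 1) // threads[1])
    for _ in range(nb_iterations):
        lifeStep[blocks, threads](current, nxt)
        cuda.synchronize()           # barrier: finish this generation first
        current, nxt = nxt, current  # swap buffers instead of copying
    return current.copy_to_host()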