Lab on parallelism M1 info 2023-2024

GPU programming with Python and CUDA

Start up:

Read the lecture notes on GPU/CUDA programming with Python+Numba (see Moodle).
Open a VPN connection to open.unice.fr (use login@MIAGEIA, replacing login with your own login).
Create an account on http://miage-gpu2.unice.fr:8001
Create a Python notebook like this one, and start coding with Python.

Kernel Hello world

Copy and execute the code below.
Modify it to change the number of threads per block and/or the number of blocks. Find values for which Numba no longer emits the NumbaPerformanceWarning.
Print threadIdx.x, threadIdx.y, threadIdx.z and blockIdx.x, blockIdx.y, blockIdx.z.
Try with a 2D grid.
Test the limitations mentioned in Fabrice Huet's lecture on print inside a kernel and on the arguments passed to a kernel.

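from numba import cuda

@cuda.jit
def monKernel():
    # cuda.grid(1) is the global 1D index of the calling thread
    print("from", cuda.grid(1), "hello, world!")

gridSize = 3            # number of blocks in the grid
nbThreadsPerBlock = 4   # number of threads in each block
monKernel[gridSize, nbThreadsPerBlock]()
cuda.synchronize()      # wait for the kernel to finish so that its output is flushed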

/opt/conda/lib/python3.10/site-packages/numba/cuda/dispatcher.py:538: NumbaPerformanceWarning: Grid size 3 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))

from 8 hello, world!
from 9 hello, world!
from 10 hello, world!
from 11 hello, world!
from 4 hello, world!
from 5 hello, world!
from 6 hello, world!
from 7 hello, world!
from 0 hello, world!
from 1 hello, world!
from 2 hello, world!
from 3 hello, world!
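
In the output above, the blocks print in the order 8-11, 4-7, 0-3: the order in which blocks run is not guaranteed. The index itself is deterministic, since for a 1D launch cuda.grid(1) equals threadIdx.x + blockIdx.x * blockDim.x. The sketch below reproduces that computation on the CPU only, without any GPU; the function name global_indices_1d is a hypothetical helper, not part of Numba.

```python
# CPU-only emulation of the index returned by cuda.grid(1) in a 1D launch.
# global_indices_1d is a hypothetical helper used for illustration.

def global_indices_1d(grid_size, threads_per_block):
    """Return the cuda.grid(1) value of every (block, thread) pair."""
    indices = []
    for block_idx in range(grid_size):               # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            # In a 1D launch, blockDim.x is the number of threads per block.
            indices.append(thread_idx + block_idx * threads_per_block)
    return indices

# Same launch configuration as the kernel above: 3 blocks of 4 threads.
print(global_indices_1d(3, 4))
```

The twelve values match the twelve indices printed by the kernel; only their order differs.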

Hardware limitations

Execute the code below to display the hardware limits of the GPU.

from numba import cuda
gpu = cuda.get_current_device()
print("name = %s" % gpu.name)
print("maxThreadsPerBlock = %s" % str(gpu.MAX_THREADS_PER_BLOCK))
print("maxBlockDimX = %s" % str(gpu.MAX_BLOCK_DIM_X))
print("maxBlockDimY = %s" % str(gpu.MAX_BLOCK_DIM_Y))
print("maxBlockDimZ = %s" % str(gpu.MAX_BLOCK_DIM_Z))
print("maxGridDimX = %s" % str(gpu.MAX_GRID_DIM_X))
print("maxGridDimY = %s" % str(gpu.MAX_GRID_DIM_Y))
print("maxGridDimZ = %s" % str(gpu.MAX_GRID_DIM_Z))
print("maxSharedMemoryPerBlock = %s" % str(gpu.MAX_SHARED_MEMORY_PER_BLOCK))
print("asyncEngineCount = %s" % str(gpu.ASYNC_ENGINE_COUNT))
print("canMapHostMemory = %s" % str(gpu.CAN_MAP_HOST_MEMORY))
print("multiProcessorCount = %s" % str(gpu.MULTIPROCESSOR_COUNT))
print("warpSize = %s" % str(gpu.WARP_SIZE))
print("unifiedAddressing = %s" % str(gpu.UNIFIED_ADDRESSING))
print("pciBusID = %s" % str(gpu.PCI_BUS_ID))
print("pciDeviceID = %s" % str(gpu.PCI_DEVICE_ID))

See what happens when you launch a kernel that does not respect these limits.
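
One way to stay within these limits is to compute the launch configuration from the problem size and check it before launching. The following is a minimal CPU-only sketch; launch_config_1d is a hypothetical helper, and the limit values passed in the example (1024 and 2**31 - 1) are illustrative assumptions that should be replaced with gpu.MAX_THREADS_PER_BLOCK and gpu.MAX_GRID_DIM_X as printed above.

```python
# Hypothetical helper: choose a 1D grid size and validate it against device limits.

def launch_config_1d(n, threads_per_block, max_threads_per_block, max_grid_dim_x):
    """Return (grid_size, threads_per_block) covering n elements, or raise ValueError."""
    if not 1 <= threads_per_block <= max_threads_per_block:
        raise ValueError("threads_per_block exceeds the device limit")
    # Ceiling division: enough blocks so that every element gets a thread.
    grid_size = (n + threads_per_block - 1) // threads_per_block
    if grid_size > max_grid_dim_x:
        raise ValueError("grid size exceeds the device limit")
    return grid_size, threads_per_block

# Illustrative limits only (assumed values, not read from a real device).
print(launch_config_1d(1000, 256, 1024, 2**31 - 1))
```

Because the grid may contain more threads than elements (here 4 * 256 = 1024 threads for 1000 elements), a kernel launched this way needs to test its index against the array length before accessing the array.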

Kernels that access an array

Write a kernel that takes as argument a 1D numpy array of 32-bit signed integers, initialised with zeros and of size gridSize * nbThreadsPerBlock, and that writes a 1 in each array cell (each thread takes care of one cell).
Optimize your code by explicitly transferring the array from host to device before the launch, and from device back to host afterwards.
Write a kernel that takes as argument a 1D numpy array t of unsigned 32-bit integers initialised with t[i]==i for every index i. The kernel copies t[i+1] into t[i] (i.e. for i in range(len(t)) in parallel: t[i] = t[i+1]). What do you observe? Try with different grid sizes and numbers of threads per block.
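
To reason about the last exercise before running it, it can help to model on the CPU what happens when the updates t[i] = t[i+1] are applied in different orders. This is only a loose model of a GPU, where the relative order of threads in different warps or blocks is not specified; the function names are hypothetical, and the last index is skipped here as an assumption to avoid reading past the end of the array.

```python
# CPU-only model of the in-place update t[i] = t[i+1] under different orders.

def shift_in_order(t, order):
    """Apply t[i] = t[i+1] in place (on a copy), visiting indices in `order`."""
    out = list(t)
    for i in order:
        if i + 1 < len(out):      # assumption: the last index has nothing to read
            out[i] = out[i + 1]
    return out

def shift_with_snapshot(t):
    """Read everything from the untouched input, write to a separate output."""
    return [t[i + 1] if i + 1 < len(t) else t[i] for i in range(len(t))]

t = list(range(8))
print(shift_in_order(t, range(len(t))))            # ascending order
print(shift_in_order(t, reversed(range(len(t)))))  # descending order
print(shift_with_snapshot(t))                      # independent of any order
```

The ascending and descending orders give different arrays, which is the signature of a read/write race. Reading from one array and writing to another, as in shift_with_snapshot, removes the dependence on execution order.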

Image manipulation

Based on this introduction to the manipulation of images with PIL, write a function that takes a color RGB image and returns its grayscale version.
Optimize your function with a kernel. Start with a small image, like lena.png, then move on to larger images, like mona-lisa.jpg, making sure you respect all hardware limitations.
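
A CPU reference is useful for checking the output of the kernel. The sketch below is one possible reference, assuming the image has already been loaded into an (H, W, 3) uint8 numpy array and assuming the ITU-R BT.601 luma weights (0.299, 0.587, 0.114); other weightings, such as a plain average of the three channels, are equally valid choices for this exercise. The function name to_grayscale is hypothetical.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) uint8 RGB array to an (H, W) uint8 grayscale array."""
    # Assumed weights: ITU-R BT.601 luma coefficients for R, G and B.
    weights = np.array([0.299, 0.587, 0.114])
    # Weighted sum over the last (channel) axis, computed in floating point.
    gray = rgb[..., :3].astype(np.float64) @ weights
    # Round to the nearest integer level and return 8-bit values.
    return np.rint(gray).astype(np.uint8)

# Tiny 1x3 test image: a black, a white and a pure red pixel.
image = np.array([[[0, 0, 0], [255, 255, 255], [255, 0, 0]]], dtype=np.uint8)
print(to_grayscale(image))
```

In a kernel version, each thread would typically compute one output pixel using the same weighted sum, with a 2D grid covering the image height and width.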

Game of life

Write a function life(initial_configuration, nb_iterations) that takes an array of integers (each either 0 or 1) and a number of iterations of the rule of life, and returns the final configuration.
Optimize your function with a kernel. Remember that cuda.synchronize() makes the host wait until the kernels launched so far have finished, so it can serve as a synchronization barrier between two iterations.
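
A possible CPU reference for life is sketched below with numpy, so that the kernel version can be compared against it. It assumes a 2D array and treats every cell outside the array as dead; a wrap-around (toroidal) border would be another reasonable convention. The helper life_step is hypothetical.

```python
import numpy as np

def life_step(grid):
    """Return the next generation; the input grid is left untouched."""
    h, w = grid.shape
    padded = np.pad(grid, 1)  # assumption: cells outside the grid are dead (0)
    # Sum the eight shifted views of the padded grid to count live neighbours.
    neighbours = sum(
        padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth with exactly 3 neighbours, survival with 2 or 3.
    alive = (neighbours == 3) | ((grid == 1) & (neighbours == 2))
    return alive.astype(grid.dtype)

def life(initial_configuration, nb_iterations):
    grid = np.asarray(initial_configuration)
    for _ in range(nb_iterations):
        grid = life_step(grid)  # each step reads the old grid, writes a new one
    return grid

# A "blinker": three cells in a row, which oscillates with period 2.
blinker = np.zeros((5, 5), dtype=np.int32)
blinker[2, 1:4] = 1
print(life(blinker, 1))
```

Each step reads only the previous generation and writes a fresh array. A kernel version would likewise need two device arrays, one read and one written, with all threads finishing an iteration before the next one starts.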