Hands-on With CUDA-C on an Nvidia Jetson Nano GPU
Mohammed Billoo - EOC 2024 - Duration: 42:14
The explosion in silicon density over the past decade has packed massive compute capability into power-efficient, modern-day microprocessors. Processors capable of trillions of floating-point operations per second (TFLOPS) are now widely available in small form factors, and software paradigms have been created to leverage the underlying hardware. In this talk, Mohammed Billoo will introduce Nvidia's CUDA framework and CUDA-C, a variant of the C programming language. Care must be taken when implementing applications in CUDA-C: the underlying hardware, specifically the memory architecture and layout, needs to be considered to ensure that the final implementation is performant. Mohammed will cover the following topics in this talk:
- Architectural differences between GPUs and CPUs
- Relevant applications for GPUs
- CUDA-C overview and build process
- CUDA-C code structure
Mohammed will conclude the talk with a hands-on demonstration of an image processing algorithm implemented in CUDA-C on an Nvidia Jetson Nano development kit. He will compare the performance of the CUDA-C implementation on the GPU against a naive implementation on a CPU.
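To give a concrete flavor of the CUDA-C code structure and the GPU-versus-CPU comparison described above, here is a minimal, hypothetical sketch (function names and parameters are illustrative, not the demo code from the talk). A simple brightness-adjustment kernel runs one thread per pixel, alongside the equivalent naive CPU loop:

// Hypothetical sketch: brighten an 8-bit grayscale image on the GPU
// and on the CPU. Names are illustrative, not from the talk.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void brighten_gpu(unsigned char *img, int n, int offset)
{
    // Each thread handles one pixel; the global index is derived
    // from the block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        img[i] = min(img[i] + offset, 255);
}

static void brighten_cpu(unsigned char *img, int n, int offset)
{
    // Naive sequential baseline for comparison.
    for (int i = 0; i < n; i++) {
        int v = img[i] + offset;
        img[i] = v > 255 ? 255 : v;
    }
}

int main(void)
{
    const int n = 1920 * 1080;  // one grayscale frame
    unsigned char *h_img = (unsigned char *)calloc(n, 1);
    unsigned char *d_img;

    cudaMalloc(&d_img, n);
    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);

    // 256 threads per block; enough blocks to cover all n pixels.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    brighten_gpu<<<blocks, threads>>>(d_img, n, 32);
    cudaDeviceSynchronize();

    cudaMemcpy(h_img, d_img, n, cudaMemcpyDeviceToHost);
    cudaFree(d_img);

    brighten_cpu(h_img, n, 32);  // CPU baseline on the same data
    free(h_img);
    return 0;
}

Note how the GPU version replaces the loop with a grid of threads: the per-element work is identical, but the iteration is expressed through the launch configuration.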
Hi,
Great questions! Ultimately, vectors in the C++ STL are essentially C arrays under the hood with some extra features (e.g., automatically re-allocating memory as needed and optimized element access). I would actually caution against using C++ vectors initially, since it's important to ensure that memory is aligned properly when passing data to the GPU; otherwise, you risk losing the benefits of the GPU. CUDA support does exist in The Yocto Project. You can check out the meta-tegra layer here: https://github.com/OE4T/meta-tegra. I'll look to author another blog post on adding CUDA support, along with cross-compiling CUDA-C applications, when using The Yocto Project.
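To illustrate the alignment and data-transfer point (my own sketch, not code from the talk): device memory from cudaMalloc is always suitably aligned, and host staging buffers can be page-locked with cudaMallocHost for faster transfers. A std::vector's contiguous storage can still be handed to cudaMemcpy via data(), but that allocation is ordinary pageable memory you don't control:

// Hypothetical sketch: two ways to stage host data for the GPU.
#include <cuda_runtime.h>
#include <vector>

int main(void)
{
    const int n = 1 << 20;

    // Option 1: page-locked (pinned) host memory from the CUDA runtime,
    // properly aligned and eligible for faster DMA transfers.
    float *h_pinned;
    cudaMallocHost(&h_pinned, n * sizeof(float));

    // Option 2: a std::vector; its contiguous storage is passed via
    // data(), but the backing memory is ordinary pageable host memory.
    std::vector<float> h_vec(n, 1.0f);

    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaMemcpy(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf, h_vec.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}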
Thanks for a very informative presentation, as always. I particularly liked the "gotchas" one should keep in mind to leverage efficient parallelisation.
In your opinion, would you consider C++'s built-in vector-manipulation facilities to be a generally better fit than C when interfacing with CUDA, or perhaps when passing the results back to the host?
Do you know how available CUDA is when it comes to integration with The Yocto Project?
Looking forward to the follow-up article :)