| _______ __ _______ | |
| | | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----. | |
| | || _ || __|| < | -__|| _| | || -__|| | | ||__ --| | |
| |___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____| | |
| on Gopher (unofficial)
| Visit Hacker News on the Web | |
| COMMENT PAGE FOR: | |
| Ask HN: How to learn CUDA to professional level | |
| FilosofumRex wrote 18 hours 35 min ago: | |
| If you're in it for the money, then forget about HPC and the mathy
| stuff. Unless you have a PhD in the application domain, no one will
| bother with you, even if you write CUDA at 120 wpm.
| The real money is in mastering PTX, nvcc, cuobjdump, Nsight Systems,
| and Nsight Compute. CUTLASS is a good open-source code base to explore -
| start here [1]. Most importantly, stay off HN and get on the GPU MODE
| Discord [2], where the real coders are:
| [1]: https://christianjmills.com/series/notes/cuda-mode-notes.html | |
| [2]: https://discord.com/invite/gpumode | |
| MoonGhost wrote 3 hours 43 min ago: | |
| It may be cool and real, but it sounds like a very niche domain, which
| means there are very few people and places doing it - mostly the gaming
| industry and drivers. Starting from zero and getting there in one step
| will be hard. One has to be really, really smart for this.
| lacker wrote 20 hours 25 min ago: | |
| If you're experienced in C++ you can basically just jump in. I found
| this YouTube series to be really helpful: [1] After watching it, I was
| able to implement a tiled version of a kernel that was the bottleneck
| of our production data analysis pipeline and improve performance by
| over 2x. There's much more to learn, but I found this video series to
| be a great place to start.
| [1]: https://www.youtube.com/playlist?list=PLxNPSjHT5qvtYRVdNN1yDcd... | |
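| (The kernel mentioned above isn't shown; as a rough sketch of what a
| shared-memory "tiled" kernel looks like, a minimal tiled matrix multiply
| might read as follows, assuming square N x N matrices with N a multiple
| of TILE; all names are illustrative.)
|
|   #define TILE 16
|
|   // Each block computes one TILE x TILE tile of C, staging tiles of A
|   // and B in shared memory so each global element is loaded once per tile.
|   __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
|       __shared__ float As[TILE][TILE];
|       __shared__ float Bs[TILE][TILE];
|
|       int row = blockIdx.y * TILE + threadIdx.y;
|       int col = blockIdx.x * TILE + threadIdx.x;
|       float acc = 0.0f;
|
|       for (int t = 0; t < N / TILE; ++t) {
|           // Each thread loads one element of the current A and B tiles.
|           As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
|           Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
|           __syncthreads();
|
|           for (int k = 0; k < TILE; ++k)
|               acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
|           __syncthreads();
|       }
|       C[row * N + col] = acc;
|   }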
| SonOfLilit wrote 21 hours 26 min ago: | |
| Prefix scan is a great intro to GPU programming: [1] After this you | |
| should be able to tell whether you enjoy this kind of work. | |
| If you do, try to do a reasonably optimized GEMM, and then try to | |
| follow the FlashAttention paper and implement a basic version of what | |
| they're doing. | |
| [1]: https://developer.download.nvidia.com/compute/cuda/2_2/sdk/web... | |
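| (Not the linked SDK sample itself - just a minimal sketch of the naive
| Hillis-Steele formulation, for a single block with n <= blockDim.x, to
| give a feel for the pattern.)
|
|   // Naive single-block inclusive scan (Hillis-Steele), n <= blockDim.x.
|   // Launch as: inclusive_scan<<<1, n, n * sizeof(float)>>>(d_in, d_out, n);
|   __global__ void inclusive_scan(const float* in, float* out, int n) {
|       extern __shared__ float temp[];
|       int tid = threadIdx.x;
|       if (tid < n) temp[tid] = in[tid];
|       __syncthreads();
|
|       for (int offset = 1; offset < n; offset *= 2) {
|           float addend = 0.0f;
|           if (tid >= offset && tid < n) addend = temp[tid - offset];
|           __syncthreads();               // all reads must finish before writes
|           if (tid >= offset && tid < n) temp[tid] += addend;
|           __syncthreads();
|       }
|       if (tid < n) out[tid] = temp[tid];
|   }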
| brudgers wrote 1 day ago: | |
| For better or worse, direct professional experience in a professional | |
| setting is the only way to learn anything to a professional level. | |
| That doesn't mean one-eyed-king knowledge can't solve that
| chicken-and-egg problem: you only have to be good enough to get the job.
| But if you haven't done it on the job, you don't have work experience | |
| and you are either lying to others or lying to yourself...and any | |
| sophisticated organization won't fall for it... | |
| ...except of course, knowingly. And the best way to get someone to | |
| knowingly ignore obvious dunning-kruger and/or horseshit is to know | |
| that someone personally or professionally. | |
| Which is to say that the best way to get a good job is to have a good | |
| relationship with someone who can hire you for a good job (nepotism | |
| trumps technical ability, always). And the best way to find a good job | |
| is to know a lot of people who want to work with you. | |
| To put it another way, looking for a job is the only way to find a job,
| and looking for a job is also much, much harder than all the things
| (like studying CUDA) that avoid looking for a job while pretending to
| be preparation...because again, studying CUDA won't ever give you
| professional experience.
| Don't get me wrong, there's nothing wrong with learning CUDA all on | |
| your own. But it is not professional experience and it is not looking | |
| for a job doing CUDA. | |
| Finally, if you want to learn CUDA just learn it for its own sake | |
| without worrying about a job. Learning things for their own sake is the | |
| nature of learning once you get out of school. | |
| Good luck. | |
| alecco wrote 1 day ago: | |
| Ignore everybody else. Start with CUDA Thrust. Study its examples
| carefully. See how other projects use Thrust. After a year or two, go
| deeper into CUB.
| Do not implement algorithms by hand. On recent architectures it is
| extremely hard to reach decent occupancy and the like. Thrust and CUB
| solve 80% of the cases with reasonable trade-offs, and they do most of
| the work for you.
| [1]: https://developer.nvidia.com/thrust | |
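| (A minimal sketch of the kind of Thrust usage meant here - sorting and
| reducing a vector without writing a single kernel; sizes and values are
| arbitrary.)
|
|   #include <thrust/device_vector.h>
|   #include <thrust/host_vector.h>
|   #include <thrust/sort.h>
|   #include <thrust/reduce.h>
|   #include <cstdio>
|   #include <cstdlib>
|
|   int main() {
|       thrust::host_vector<int> h(1 << 20);
|       for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;
|
|       thrust::device_vector<int> d = h;                 // copy to the GPU
|       thrust::sort(d.begin(), d.end());                 // device-side sort
|       int sum = thrust::reduce(d.begin(), d.end(), 0);  // device-side reduction
|
|       printf("min = %d, sum = %d\n", (int)d[0], sum);
|       return 0;
|   }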
| bee_rider wrote 1 day ago: | |
| It looks quite nice just from skimming the link. | |
| But, I don't understand the comparison to TBB. Do they have a
| version of TBB that runs on the GPU natively? If the TBB
| implementation is on the CPU... that's just comparing two different
| pieces of hardware. Which would be confusing, bordering on dishonest.
| alecco wrote 12 hours 15 min ago: | |
| The TBB comparison is a marketing leftover from 10 years ago when | |
| they were trying to convince people that NVIDIA GPUs were much | |
| faster than Intel CPUs for parallel problems. | |
| matt3210 wrote 1 day ago: | |
| Just make cool stuff. Find people to code review. I learn way more | |
| during code reviews than anything else. | |
| canyp wrote 1 day ago: | |
| My 2 cents: "Learning CUDA" is not the interesting bit. Rather, you want
| to learn two things: 1) GPU hardware architecture, 2) parallelizing
| algorithms. For CUDA specifically, there is the CUDA Programming
| Guide from Nvidia, which will teach you the basics of the language. But
| what these jobs typically require is that you know how to parallelize
| an algorithm and squeeze the most out of the hardware.
| gdubs wrote 1 day ago: | |
| I like to learn through projects, and as a graphics guy I love the GPU | |
| Gems series. Things like: [1] As an Apple platforms developer I | |
| actually worked through those books to figure out how to convert the | |
| CUDA stuff to Metal, which helped the material click even more. | |
| Part of why I did it (and this was some years back) was that I wanted
| to sharpen my thinking around parallel approaches to problem solving,
| given how central those algorithms and ways of thinking are to things
| like ML and not just game development, etc.
| [1]: https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-s... | |
| fifilura wrote 1 day ago: | |
| I am not a CUDA programmer, but when looking at this, I think I can see
| the parallels to Spark and SQL [1] So my tip would be: start getting
| used to programming without for loops.
| [1]: https://gfxcourses.stanford.edu/cs149/fall24/lecture/dataparal... | |
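| (A minimal sketch of what "no for loops" means in CUDA terms: the
| serial loop body becomes the kernel body and each thread handles one
| index; names and sizes are illustrative.)
|
|   // Serial:   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
|   // Parallel: the loop disappears; one thread per element.
|   __global__ void saxpy(int n, float a, const float* x, float* y) {
|       int i = blockIdx.x * blockDim.x + threadIdx.x;
|       if (i < n) y[i] = a * x[i] + y[i];
|   }
|
|   // Launch (hypothetical sizes):
|   //   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);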
| sremani wrote 1 day ago: | |
| The book - PMPP - Programming Massively Parallel Processors | |
| The YouTube Channel - CUDA_MODE - it is based on PMPP | |
| I could not find the channel, but here is the playlist [1] Once done,
| you will be on a solid foundation.
| [1]: https://www.youtube.com/watch?v=LuhJEEJQgUM&list=PLVEjdmwEDkgW... | |
| math_dandy wrote 1 day ago: | |
| Are there any GPU emulators you can use to run simple CUDA programs on
| a commodity laptop, just to get comfortable with the mechanics, the
| toolchain, etc.?
| throwaway81523 wrote 22 hours 16 min ago: | |
| You can get a VPS with a GPU these days; not super cheap, but
| affordable for those in the industry.
| corysama wrote 1 day ago: | |
| [1] emulates running simple CUDA programs in a web page with zero
| setup. It's a good way to get your toes wet.
| [1]: https://leetgpu.com/ | |
| gkbrk wrote 1 day ago: | |
| Commodity laptops can just use regular non-emulated CUDA if they have | |
| an Nvidia GPU. It's not just for datacenter GPUs, a ton of regular | |
| consumer GPUs are also supported. | |
| bee_rider wrote 1 day ago: | |
| A commodity laptop doesn't have a discrete GPU these days; iGPUs are
| good enough for basic tasks.
| SoftTalker wrote 1 day ago: | |
| It's 2025. Get with the times, ask Claude to do it, and then ask it to | |
| explain it to you as if you're an engineer who needs to convince a | |
| hiring manager that you understand it. | |
| rakel_rakel wrote 1 day ago: | |
| Might work in 2025, 2026 will demand more. | |
| mekpro wrote 1 day ago: | |
| To professionals in the field, I have a question: what jobs, positions, | |
| and companies are in need of CUDA engineers? My current understanding | |
| is that while many companies use CUDA's by-products (like PyTorch), | |
| direct CUDA development seems less prevalent. I'm therefore seeking to | |
| identify more companies and roles that heavily rely on CUDA. | |
| kloop wrote 1 day ago: | |
| My team uses it for geospatial data. We rasterize slippy map tiles | |
| and then do a raster summary on the gpu. | |
| It's a weird case, but the pixels can be processed independently for | |
| most of it, so it works pretty well. Then the rows can be summarized | |
| in parallel and rolled up at the end. The copy onto the gpu is our | |
| current bottleneck however. | |
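| (Their pipeline isn't shown; a toy sketch in the same spirit - every
| thread handles one pixel independently and the per-row summary is
| rolled up with atomics - might look like this. All names and the
| threshold are made up.)
|
|   // Toy per-pixel classification with a per-row roll-up (illustrative only).
|   __global__ void summarize_rows(const unsigned char* pixels, float* row_sums,
|                                  int width, int height) {
|       int x = blockIdx.x * blockDim.x + threadIdx.x;
|       int y = blockIdx.y * blockDim.y + threadIdx.y;
|       if (x >= width || y >= height) return;
|
|       float value = pixels[y * width + x] > 128 ? 1.0f : 0.0f;  // per-pixel work
|       atomicAdd(&row_sums[y], value);                           // per-row summary
|   }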
| indianmouse wrote 1 day ago: | |
| As a very early CUDA programmer who took part in the NVIDIA CUDA
| contest back in 2008, with what I believe was one of the only entries
| submitted from India (I'm not claiming that, though), and who received
| a Black Edition card as a consolation/participation prize, I can vouch
| for the method I followed.
| - Look up the CUDA Programming Guide from NVidia | |
| - CUDA Programming books from NVidia from | |
| developer.nvidia.com/cuda-books-archive link | |
| - Start creating small programs based on the existing implementations
| (strong C knowledge is required, so brush up if needed)
| - Install the required toolchains and compilers; I am assuming you
| have the necessary hardware to play around with
| - GitHub repositories with CUDA projects. Read the code; you can now use
| an LLM to explain the code in whatever way you need
| - Start creating smaller, yet parallel, programs, etc.
| And in about a month or two, you should have enough to start writing | |
| CUDA programs. | |
| I'm not aware of the skill / experience level you have, but whatever
| it might be, there are many more sources and resources available now
| than there were in 2007/08.
| Create a 6-8 week study plan and you should be flying soon!
| Hope it helps. | |
| Feel free to comment and I can share whatever I could to guide. | |
| edge17 wrote 1 day ago: | |
| What environment do you use? Is it still the case that Windows is the | |
| main development environment for cuda? | |
| hiq wrote 1 day ago: | |
| > I am assuming you have the necessary hardware to play around | |
| Can you expand on that? Is it enough to have an Nvidia graphics card
| that's about 5 years old, or do you need something more specific?
| adrian_b wrote 9 hours 0 min ago: | |
| A 5-year old card, i.e. an NVIDIA Ampere RTX 30xx from 2020, is | |
| perfectly fine. | |
| Even 7-year old cards, i.e. NVIDIA Turing RTX 20xx from 2018, are | |
| still acceptable. | |
| Older GPUs than Turing should be avoided, because they lack many | |
| capabilities of the newer cards, e.g. "tensor cores", and their | |
| support in the newer CUDA toolkits will be deprecated in a not very | |
| distant future, but very slowly, so for now you can still create | |
| programs for Maxwell GPUs from 10 years ago. | |
| Among the newer GPUs, the RTX 40xx SUPER series (i.e. the SUPER | |
| variants, not the original RTX 40xx series) has the best energy | |
| efficiency. The newest RTX 50xx GPUs have worse energy efficiency | |
| than RTX 40xx SUPER, so they achieve a somewhat higher performance | |
| only by consuming a disproportionately greater power. Instead of | |
| that, it is better to use multiple RTX 40xx SUPER. | |
| indianmouse wrote 20 hours 41 min ago: | |
| That is sufficient. | |
| slt2021 wrote 1 day ago: | |
| each nVidia GPU has a certain Compute Capability ( [1] ). | |
| Depending on the model and age of your GPU, it will have a certain | |
| capability that will be the hard ceiling for what you can program | |
| using CUDA | |
| [1]: https://developer.nvidia.com/cuda-gpus | |
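| (A small sketch for checking what you have: the runtime API reports
| each device's compute capability directly.)
|
|   #include <cstdio>
|   #include <cuda_runtime.h>
|
|   int main() {
|       int count = 0;
|       cudaGetDeviceCount(&count);
|       for (int i = 0; i < count; ++i) {
|           cudaDeviceProp prop;
|           cudaGetDeviceProperties(&prop, i);
|           printf("GPU %d: %s, compute capability %d.%d\n",
|                  i, prop.name, prop.major, prop.minor);
|       }
|       return 0;
|   }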
| sanderjd wrote 22 hours 25 min ago: | |
| Recognizing that this won't result in any useful benchmarks, is | |
| there a way to emulate an nvidia gpu? In a docker container, for | |
| instance? | |
| dpe82 wrote 1 day ago: | |
| When you're just getting started and learning that won't matter | |
| though. Any Nvidia card from the last 10 years should be fine. | |
| rahimnathwani wrote 1 day ago: | |
| I'm not a CUDA programmer, but AIUI: | |
| - you will want to install the latest version of CUDA Toolkit | |
| (12.9.1) | |
| - each version of CUDA Toolkit requires the card driver to be above | |
| a certain version (e.g. toolkit depends on driver version 576 or | |
| above) | |
| - older cards often have recent drivers, e.g. the current version | |
| of CUDA Toolkit will work with a GTX 1080, as it has a recent | |
| (576.x) driver | |
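| (If in doubt, the runtime API can report both versions; a minimal
| sketch:)
|
|   #include <cstdio>
|   #include <cuda_runtime.h>
|
|   int main() {
|       int driverVer = 0, runtimeVer = 0;
|       cudaDriverGetVersion(&driverVer);    // highest CUDA version the driver supports
|       cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime this program was built against
|       printf("driver supports CUDA %d.%d, runtime is %d.%d\n",
|              driverVer / 1000, (driverVer % 100) / 10,
|              runtimeVer / 1000, (runtimeVer % 100) / 10);
|       return 0;
|   }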
| sputknick wrote 1 day ago: | |
| I used this to teach high school students. Probably not sufficient to | |
| get what you want, but it should get you off the ground and you can run | |
| from there. | |
| [1]: https://youtu.be/86FAWCzIe_4?si=buqdqREWASNPbMQy | |
| tkuraku wrote 1 day ago: | |
| I think you just pick a problem you want to solve with gpu programming | |
| and go for it. Learning what you need along the way. Nvidia blog posts | |
| are great for learning things along the way such as | |
| [1]: https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernel... | |
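| (The linked pro-tip appears to be the one about grid-stride loops; a
| minimal example of that pattern, which works for any n and any launch
| configuration:)
|
|   __global__ void scale(float* data, float factor, int n) {
|       // Grid-stride loop: each thread strides over the array in steps of
|       // the total number of threads in the grid.
|       for (int i = blockIdx.x * blockDim.x + threadIdx.x;
|            i < n;
|            i += blockDim.x * gridDim.x) {
|           data[i] *= factor;
|       }
|   }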
| majke wrote 1 day ago: | |
| I had a bit of limited exposure to CUDA. It was before the AI boom,
| during Covid.
| I found it easy to start. Then there was a pretty nice learning curve | |
| to get to warps, SMs, and basic concepts. Then I was able to dig deeper
| into the integer opcodes, which was super cool. I was able to optimize
| the compute part pretty well, without many roadblocks.
| However, getting memory loads perfect and then getting closer to hw | |
| (warp groups, divergence, the L2 cache split thing, scheduling), was | |
| pretty hard. | |
| I'd say CUDA is pretty nice/fun to start with, and it's possible to get | |
| quite far for a novice programmer. However getting deeper and achieving | |
| real advantage over CPU is hard. | |
| Additionally, there is a problem with Nvidia segmenting the market -
| some opcodes are present in _old_ GPUs (the CUDA arch is _not_ forwards
| compatible), and some opcodes are reserved for "AI" chips (like the
| H100). So getting code that is fast on both an H100 and an RTX 5090 is
| super hard. Add to that the fact that each card has a different SM
| count, memory capacity, and bandwidth... and you end up with an
| impossible compatibility matrix.
| TLDR: The beginnings are nice and fun, and you can get quite far on the
| compute-optimization part. But getting compatibility across different
| chips, and getting memory access right, is hard. When you start, choose
| a specific problem, a specific chip, and a specific instruction set.
| epirogov wrote 1 day ago: | |
| I bought a P106-90 for $20 and started porting my data apps to parallel
| processing with it.
| izharkhan wrote 1 day ago: | |
| How to do hacking?
| rramadass wrote 1 day ago: | |
| CUDA GPGPU programming was invented to solve certain classes of | |
| parallel problems. So studying these problems will give you greater | |
| insight into CUDA based parallel programming. I suggest reading the | |
| following old book along with your CUDA resources. | |
| Scientific Parallel Computing by L. Ridgway Scott et al. -
| [1]: https://press.princeton.edu/books/hardcover/9780691119359/scie... | |
| weinzierl wrote 1 day ago: | |
| Nvidia itself has a paid course series. It is a bit older, but I believe
| it is still relevant. I have bought it but not yet started it; I intend
| to do so during the summer holidays.
| imjonse wrote 1 day ago: | |
| These should keep you busy for months: [1] resources and discord | |
| community | |
| Book: Programming massively parallel processors | |
| nvidia cuda docs are very comprehensive too | |
| [1]: https://www.gpumode.com/ | |
| [2]: https://github.com/srush/GPU-Puzzles | |
| mdaniel wrote 1 day ago: | |
| Wowzers, the line noise | |
| [1]: https://github.com/HazyResearch/ThunderKittens#:~:text=here%... | |
| amelius wrote 1 day ago: | |
| This follows a "winner takes all" scenario. I see the differences | |
| between the submissions are not so large, often smaller than 1%. Kind | |
| of pointless to work on this, if you ask me. | |
| imjonse wrote 14 hours 5 min ago: | |
| the main site is confusing indeed with all those leaderboards, but | |
| follow the discord and resources links for the actual learning | |
| material. | |
| amelius wrote 9 hours 10 min ago: | |
| Thanks, looks interesting indeed. | |
| elashri wrote 1 day ago: | |
| I will give you personal experience learning CUDA that might be | |
| helpful. | |
| Disclaimer: I don't claim that this is actually a systematic way to
| learn it; it is more geared toward academic work.
| I got assigned to a project that required learning CUDA as part of my
| PhD. There was no one in my research group who had any experience with
| or knew CUDA. I started with the standard NVIDIA courses (Getting
| Started with Accelerated Computing with CUDA C/C++; there is a Python
| version too). This gave me a good introduction to the concepts and
| basic ideas, but after that I did most of my learning by trial and
| error. I tried a couple of online tutorials for specific things, and
| some books, but there was always a deprecated function here or there,
| or an API change that made things obsolete. Or things had simply
| changed for your GPU, and you have to be careful because you might be
| using a GPU version not compatible with what you develop for in
| production, and you need things to work for both.
| I think learning CUDA for me was an endeavor of pain and of going
| through "compute-sanitizer" and Nsight, because you will find that most
| of your time goes into debugging why things are running slower than
| you expect.
| Take things slowly. Take a simple project that you know how to do
| without CUDA, then port it to CUDA, benchmark it against the CPU, and
| try to optimize different aspects of it.
| The one advice that can be helpful is not to think about optimization | |
| at the beginning. Start with correct, then optimize. A working slow | |
| kernel beats a fast kernel that corrupts memory. | |
| korbip wrote 1 day ago: | |
| I can share a similar PhD story (the result being visible here: [1] | |
| ). Back then I didn't find any tutorials that covered anything beyond
| the basics (which are still important).
| Once you have understood the basic working mode and architecture
| of a GPU, I would recommend the following workflow:
| 1. First create an environment so that you can actually test your | |
| kernels against baselines written in a higher-level language. | |
| 2. If you don't have an urgent project already, try to | |
| improve/re-implement existing problems (MatMul being the first | |
| example). Don't get caught up in wanting to handle all size cases.
| Take an example just to learn a certain functionality, rather than
| solving the whole problem, if it's just about learning.
| 3. Write the functionality you want to have in increasing complexity. | |
| Write loops first, then parallelize these loops over the grid. Use | |
| global memory first, then put things into shared memory and | |
| registers. Use plain matrix multiplication first, then use mma | |
| (TensorCore) primitives to speed things up. | |
| 4. Iterate over the CUDA C Programming Guide. It covers all (most) of | |
| the functionality that you want to learn - but it can't just be read
| and memorized. You learn it when you apply it.
| 5. Might depend on your use-case, but also consider using higher-level
| abstractions like CUTLASS or ThunderKittens. Also, if your environment
| is JAX/Torch, use Triton first before going down to the CUDA level.
| Overall, it will be some pain for sure. And mastering it, including
| PTX etc., will take a lot of time.
| [1]: https://github.com/NX-AI/flashrnn | |
| kevmo314 wrote 1 day ago: | |
| > I think learning CUDA for me was an endeavor of pain and of going
| through "compute-sanitizer" and Nsight, because you will find that most
| of your time goes into debugging why things are running slower than
| you expect.
| This is so true it hurts. | |
| ForgotIdAgain wrote 1 day ago: | |
| I have not tried it yet, but seems nice : | |
| [1]: https://leetgpu.com/ | |
| Onavo wrote 1 day ago: | |
| Assuming you are asking this because of the deep learning/ChatGPT hype, | |
| the first question you should ask yourself is, do you really need to? | |
| The skills needed for CUDA are completely unrelated to building machine | |
| learning models. It's like learning to make a TLS library so you can | |
| get a full stack web development job. The skills are completely | |
| orthogonal. CUDA belongs to the domain of game developers, graphics | |
| people, high performance computing and computer engineers (hardware). | |
| From the point of view of machine learning development and research, | |
| it's nothing more than an implementation detail. | |
| Make sure you are very clear on what you want. Most HR departments cast | |
| a wide net (it's like how every junior role requires "3-5 years of | |
| experience" when in reality they don't really care). Similarly when | |
| hiring, most companies pray for the unicorn developer who can | |
| understand the entire stack from the GPU to the end user product domain | |
| when the day to day is mostly in Python. | |
| throwaway81523 wrote 1 day ago: | |
| I looked at the CUDA code for Leela Chess Zero and found it pretty | |
| understandable, though that was back when Leela used a DCNN instead of | |
| transformers. DCNN's are fairly simple and are explained in fast.ai | |
| videos that I watched a few years ago, so navigating the Leela code | |
| wasn't too difficult. Transformers are more complicated and I want to | |
| bone up on them, but I haven't managed to spend any time understanding | |
| them. | |
| CUDA itself is just a minor departure from C++, so the language itself | |
| is no big deal if you've used C++ before. But, if you're trying to get | |
| hired programming CUDA, what that really means is they want you | |
| implementing AI stuff (unless it's game dev). AI programming is a much | |
| wider and deeper subject than CUDA itself, so be ready to spend a bunch | |
| of time studying and hacking to come up to speed in that. But if you | |
| do, you will be in high demand. As mentioned, the fast.ai videos are a | |
| great introduction. | |
| In the case of games, that means 3D graphics which these days is | |
| another rabbit hole. I knew a bit about this back in the day, but it | |
| is fantastically more sophisticated now and I don't have any idea where | |
| to even start. | |
| robotnikman wrote 1 day ago: | |
| >But if you do, you will be in high demand | |
| So I'm guessing that finding a job as a CUDA programmer is nowhere
| near as big a headache as other software engineering jobs right now?
| I'm thinking learning CUDA and more about AI might be a good pivot
| from my current position as a Java middleware developer.
| randomNumber7 wrote 5 hours 18 min ago: | |
| It is likely much more focused on mathematics compared to what a | |
| usual java dev does. | |
| upmind wrote 1 day ago: | |
| This is a great idea! This is the code, right? [1] I have two beginner
| (and probably very dumb) questions: why do they have heavy C++/CUDA
| usage rather than using only PyTorch/TensorFlow? Are those too slow
| for training Leela? Second, why is there TensorFlow code?
| [1]: https://github.com/leela-zero/leela-zero | |
| henrikf wrote 1 day ago: | |
| That's Leela Zero (plays Go instead of Chess). It was good for its | |
| time (~2018) but it's quite outdated now. It also uses OpenCL | |
| instead of Cuda. I wrote a lot of that code including Winograd | |
| convolution routines. | |
| Leela Chess Zero ( [1] ) has a much more optimized CUDA backend
| targeting modern GPU architectures, and it's written by much more
| knowledgeable people than me. That would be a much better source to
| learn from.
| [1]: https://github.com/LeelaChessZero/lc0 | |
| throwaway81523 wrote 1 day ago: | |
| As I remember, the CUDA code was about 3x faster than the | |
| tensorflow code. The tensorflow stuff is there for non-Nvidia | |
| GPU's. This was in the era of the GTX 1080 or 2080. No idea about | |
| now. | |
| upmind wrote 1 day ago: | |
| Ah I see, thanks a lot! | |
| lokimedes wrote 1 day ago: | |
| There are a couple of "concerns" you may separate to make this a
| bit more tractable:
| 1. Learning CUDA - the framework, libraries and high-layer wrappers. | |
| This is something that changes with times and trends. | |
| 2. Learning high-performance computing approaches. While a GPU and the | |
| Nvlink interfaces are Nvidia specific, working in a massively-parallel | |
| distributed computing environment is a general branch of knowledge that | |
| is translatable across HPC architectures. | |
| 3. Application specifics. If your thing is Transformers, you may just | |
| as well start from Torch, Tensorflow, etc. and rely on the current | |
| high-level abstractions, to inspire your learning down to the | |
| fundamentals. | |
| I'm no longer active in any of the above, so I can't be more
| specific, but if you want to master CUDA, I would say that learning how
| massively-parallel programming works is the foundation that may
| translate into transferable skills.
| david-gpu wrote 21 hours 52 min ago: | |
| Former GPU guy here. Yeah, that's exactly what I was going to suggest | |
| too, with emphasis on #2 and #3. What kind of jobs are they trying to | |
| apply for? Is it really CUDA that they need to be familiar with, or | |
| CUDA-based libraries like cuDNN, cuBLAS, cuFFT, etc? | |
| Understanding the fundamentals of parallel programming comes first, | |
| IMO. | |
| chanana wrote 20 hours 1 min ago: | |
| > Understanding the fundamentals of parallel programming comes | |
| first, IMO. | |
| Are there any good resources you'd recommend for that?
| rramadass wrote 15 hours 5 min ago: | |
| I am not the person you asked the question of, but you might find | |
| the following useful (in addition to the ones mentioned in my | |
| other comments); | |
| Foundations of Multithreaded, Parallel, and Distributed | |
| Programming by Gregory Andrews - an old classic, but still offers very
| good explanations of concurrent algorithmic concepts.
| Parallel Programming: Concepts and Practice by Bertil Schmidt | |
| et.al. - A relatively recent book with comprehensive coverage. | |
| rramadass wrote 1 day ago: | |
| This is the right approach. Without (2) trying to learn (1) will just | |
| lead to "confusion worse confounded". I also suggest a book | |
| recommendation here - | |
| [1]: https://news.ycombinator.com/item?id=44216478 | |
| jonas21 wrote 1 day ago: | |
| I think it depends on your learning style. For me, learning | |
| something with a concrete implementation and code that you can play | |
| around with is a lot easier than trying to study the abstract | |
| general concepts first. Once you have some experience with the | |
| code, you start asking why things are done a certain way, and that | |
| naturally leads to the more general concepts. | |
| rramadass wrote 15 hours 19 min ago: | |
| It has got nothing to do with "learning styles". Parallel | |
| Computing needs knowledge of three things; a) Certain crucial | |
| architectural aspects (logical and physical) of the hardware b) | |
| Decomposing a problem correctly to map to that hardware c) | |
| Algorithms using a specific language/framework to combine the | |
| above two. CUDA (and other similar frameworks) only come in the | |
| last step and so a knowledge of the first two is a prerequisite. | |
| lokimedes wrote 1 day ago: | |
| This one was my go-to for HPC, but it may be a bit dated by now: | |
| [1]: https://www.amazon.com/Introduction-Performance-Computing-... | |
| rramadass wrote 1 day ago: | |
| That's a good book too (i have it) but more general than the | |
| Ridgway Scott book which uses examples from Numerical Computation | |
| domains. Here is an overview of the chapters; example domains | |
| start from chapter 10 onwards - [1] These sort of books are only | |
| "dated" when it comes to specific languages/frameworks/libraries. | |
| The methods/techniques are evergreen and often conceptually | |
| better explained in these older books. | |
| For recent up to date works on HPC, the free multi-volume The Art | |
| of High Performance Computing by Victor Eijkhout can't be beat - | |
| [1]: https://www.jstor.org/stable/j.ctv1ddcxfs | |
| [2]: https://news.ycombinator.com/item?id=38815334 | |
| dist-epoch wrote 1 day ago: | |
| As they typically say: Just Do It (tm). | |
| Start writing some CUDA code to sort an array or find the maximum
| element.
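| (For the "find the maximum element" suggestion, a minimal sketch: each
| block reduces its chunk in shared memory and one atomic per block
| combines the partial results; *result must be initialized to INT_MIN
| before the launch, and the block size is assumed to be a power of two.)
|
|   #include <climits>
|
|   // Launch as: max_kernel<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_result, n);
|   __global__ void max_kernel(const int* in, int* result, int n) {
|       extern __shared__ int sdata[];
|       int tid = threadIdx.x;
|       int i = blockIdx.x * blockDim.x + threadIdx.x;
|       sdata[tid] = (i < n) ? in[i] : INT_MIN;
|       __syncthreads();
|
|       // Tree reduction within the block.
|       for (int s = blockDim.x / 2; s > 0; s >>= 1) {
|           if (tid < s) sdata[tid] = max(sdata[tid], sdata[tid + s]);
|           __syncthreads();
|       }
|       if (tid == 0) atomicMax(result, sdata[0]);
|   }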
| the__alchemist wrote 1 day ago: | |
| I concur with this. Then supplement with resources as required.
| Ideally, find some tasks in your programs that are parallelizable
| (learning what these are is important too!) and switch them to CUDA.
| If you don't have any, make a toy case, e.g. an n-body simulation.
| amelius wrote 1 day ago: | |
| I'd rather learn to use a library that works on any brand of GPU. | |
| If that is not an option, I'll wait! | |
| moralestapia wrote 1 day ago: | |
| K, bud. | |
| Perhaps you haven't noticed, but you're in a thread that asked | |
| about CUDA, explicitly. | |
| uecker wrote 1 day ago: | |
| GCC / clang also have support for offloading. | |
| latchkey wrote 1 day ago: | |
| Then learn PyTorch. | |
| The hardware between brands is fundamentally different. There isn't | |
| a standard like x86 for CPUs. | |
| So, while you may use something like HIPIFY to translate your code | |
| between APIs, at least with GPU programming, it makes sense to | |
| learn how they differ from each other or just pick one of them and | |
| work with it knowing that the others will just be some variation of | |
| the same idea. | |
| horsellama wrote 1 day ago: | |
| The jobs requiring CUDA experience exist, most of the time, because
| torch is not good enough.
| Cloudef wrote 1 day ago: | |
| Both Zig and Rust are aiming to compile to GPUs natively. What CUDA
| and HIP provide is a heterogeneous computing runtime, i.e. hiding the
| boilerplate of executing code on the CPU and GPU seamlessly.
| pjmlp wrote 1 day ago: | |
| If only Khronos and the competition cared about the developer | |
| experience.... | |
| the__alchemist wrote 1 day ago: | |
| This is continuously a point of frustration! Vulkan compute is... | |
| suboptimal. I use Cuda because it feels like the only practical | |
| option. I want Vulkan or something else to compete seriously, but | |
| until that happens, I will use Cuda. | |
| corysama wrote 1 day ago: | |
| Is [1] + [2] getting there? | |
| Runs on anything + auto-differentiatation. | |
| [1]: https://github.com/KomputeProject/kompute | |
| [2]: https://shader-slang.org/ | |
| pjmlp wrote 1 day ago: | |
| It took until Vulkanised 2025 to acknowledge that Vulkan had become
| the same mess as OpenGL, and to put an action plan in place to
| try to correct this.
| Had it not been for Apple's initial OpenCL contribution (regardless
| of how it went from there), AMD's Mantle as the starting point for
| Vulkan, and NVidia's Vulkan-Hpp and Slang, the ecosystem of Khronos
| standards would be much worse.
| Also Vulkan isn't as bad as OpenGL tooling, because LunarG | |
| exists, and someone pays them for the whole Vulkan SDK. | |
| The attitude "we put paper standards" and the community should | |
| step in for the implementations and tooling, hardly comes to | |
| the productivity from private APIs tooling. | |
| Also all GPU vendors, including Intel and AMD, also rather push | |
| their own compute APIs, even if based on top of Khronos ones. | |
| david-gpu wrote 21 hours 40 min ago: | |
| > The attitude "we put paper standards" and the community | |
| should step in for the implementations and tooling | |
| Khronos is a consortium financed by its members, who either | |
| implement the standards on their own hardware or otherwise | |
| depend on the ecosystem around them. For example, competing | |
| GPU vendors typically implement the standards in parallel | |
| with the committee meetings. The very people who represent | |
| their company in Khronos are typically leads of the teams who | |
| implement the standards. | |
| Source: used to represent my employers at Khronos. It was a | |
| difficult, thankless job, that required almost as much | |
| diplomacy as technical expertise. | |
| pjmlp wrote 12 hours 58 min ago: | |
| I know, and the way those members implemented Khronos | |
| standards, versus their own proprietary alternatives, shows | |
| how it actually works in practice, regarding developer | |
| tooling and ergonomics. | |
| <- back to front page |