Tag feed

#cuda

99 posts · 24 followers


Nakonakokawaii · ai-change-brief.hashnode.dev · 6d ago · 9 min read

What I Learned While Exploring Local AI Tools

Recently, I have been spending more time exploring local AI tools and learning how different parts of the ecosystem work together. Until recently, I mostly thought about AI through web services and ch


Raghul M · blog.raghul.in · Jul 8 · 5 min read

The Hidden Softwares Behind Every AI GPU : Why Hardware Alone Isn't Enough

Most people think NVIDIA, AMD, or Intel won the AI race by building faster GPUs. That's only half the story. The real winner isn't just the company with the most powerful hardware; it's the company wit


nidhinkumar · blog.nidhin.dev · Jun 7 · 5 min read

Is C++ the Real Engine Behind AI?

If you follow modern technology trends, you have likely been led to believe that artificial intelligence is entirely built on Python. Every tutorial, machine learning framework documentation, and open


Pralisha Tripathy · pralishablog1.hashnode.dev · Jun 6 · 5 min read

Python Doesn't Power LLMs. So What Actually Does?

Like many aspiring AI engineers, I spent months building machine learning models in Python. Every tutorial I followed used Python. Every notebook I opened contained Python. Every machine learning proj


WingEdge777 · wingedge777.hashnode.dev · May 29 · 27 min read

[CUDA in Practice] SGEMM — Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++

0. Preface — The Last Stand of Scalar Compute Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. Target audience: R


WingEdge777 · wingedge777.hashnode.dev · May 24 · 8 min read

[CUDA in Practice] RoPE — Why Kernel Fusion in Hand-Written Operators Matters: Reducing Memory Traffic and Launch Overhead

AI compilers are evolving fast. In many cases, torch.compile plus JIT optimization in PyTorch can deliver striking speedups, to the point that people often say, "hand-written operators are no longer n


Marcus Rowe · techsifted.hashnode.dev · Apr 28 · 6 min read

Stable Diffusion SDXL Turbo Errors: Black Images, VRAM Crashes and Model Loading Fixes

SDXL Turbo is genuinely impressive for what it does — real-time or near-real-time image generation in 1-4 steps. But it has a specific set of requirements and failure modes that are different from base SDXL, and if you're bringing over your SDXL setu...


WingEdge777 · wingedge777.hashnode.dev · Apr 9 · 33 min read

[CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Access Optimization

Note: Text translated by AI. Code crafted by human. Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap


Vinayak Mali · malivinayak.hashnode.dev · Apr 8 · 2 min read

CUDA Configuration for Windows

Step 1: NVIDIA Video Driver. You should install the latest version of your GPU's driver. Check which GPU is present in your system. You can download drivers here: NVIDIA GPU Drive Download. How to che


Kartik Thakore · blog.hammer.ai · Apr 6 · 11 min read

Implementing TurboQuant in llama.cpp: CUDA Scars and What Actually Ships

Part 1 of 2. Why We Did This: Hammer.ai runs an industrial research lab hyper-focused on regulated-domain document understanding at extremely efficient margins. Private equity self funded companies like f
