Shannon Dias in fitservers.hashnode.dev · 3d ago · 1 min read
The Enterprise Blueprint: Installing NVIDIA Drivers and CUDA on Ubuntu Dedicated Servers
Deploying machine learning models, graphics rendering workflows, or high-performance computing (HPC) jobs requires rock-solid underlying infrastructure. Unfortunately, many online tutorials recommend...
nidhinkumar in blog.nidhin.dev · Jun 7 · 5 min read
Is C++ the Real Engine Behind AI?
If you follow modern technology trends, you have likely been led to believe that artificial intelligence is entirely built on Python. Every tutorial, machine learning framework documentation, and open...
Pralisha Tripathy in pralishablog1.hashnode.dev · Jun 6 · 5 min read
Python Doesn't Power LLMs. So What Actually Does?
Like many aspiring AI engineers, I spent months building machine learning models in Python. Every tutorial I followed used Python. Every notebook I opened contained Python. Every machine learning proj...
WingEdge777 in wingedge777.hashnode.dev · May 29 · 27 min read
[CUDA in Practice] SGEMM - Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++
0. Preface - The Last Stand of Scalar Compute. Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. Target audience: R...
WingEdge777 in wingedge777.hashnode.dev · May 24 · 8 min read
[CUDA in Practice] RoPE - Why Kernel Fusion in Hand-Written Operators Matters: Reducing Memory Traffic and Launch Overhead
AI compilers are evolving fast. In many cases, torch.compile plus JIT optimization in PyTorch can deliver striking speedups, to the point that people often say, "hand-written operators are no longer n...
Marcus Rowe in techsifted.hashnode.dev · Apr 28 · 6 min read
Stable Diffusion SDXL Turbo Errors: Black Images, VRAM Crashes and Model Loading Fixes
SDXL Turbo is genuinely impressive for what it does: real-time or near-real-time image generation in 1-4 steps. But it has a specific set of requirements and failure modes that are different from base SDXL, and if you're bringing over your SDXL setu...
WingEdge777 in wingedge777.hashnode.dev · Apr 9 · 33 min read
[CUDA in Practice] Matrix Transpose - From Padding to XOR Swizzle: The Art of Shared Memory Access Optimization
Note: Text translated by AI. Code crafted by human. Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap...
Vinayak Mali in malivinayak.hashnode.dev · Apr 8 · 2 min read
CUDA Configuration for Windows
Step 1: NVIDIA Video Driver. You should install the latest version of your GPU's driver. - Check which GPU is present in your system. You can download drivers here: NVIDIA GPU Driver Download. How to che...
Kartik Thakore in blog.hammer.ai · Apr 6 · 11 min read
Implementing TurboQuant in llama.cpp: CUDA Scars and What Actually Ships
Part 1 of 2. Why We Did This: Hammer.ai runs an industrial research lab hyper-focused on regulated-domain document understanding at extremely efficient margins. Private equity self-funded companies like f...
WingEdge777 in wingedge777.hashnode.dev · Apr 5 · 9 min read
A Deep Dive into DeviceQuery: Understanding Your GPU Hardware
0. Preface. Before writing a single line of high-performance CUDA code, you must know your silicon. deviceQuery is often the first command a developer runs, yet its output is usually ignored. This post...