Coroutine series 3) Coroutines for LLM inference (sjun.hashnode.dev) · Feb 8 · 13 min read
This is the third post in the series "Coroutine, IO bound and Asyncio for AI"; see the series index. Introduction: In this post, I will briefly introduce how to use coroutines for LLM inference. Using asyncio for LLM inference is straightfor...
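The excerpt's point, that asyncio makes concurrent LLM calls straightforward, can be sketched with `asyncio.gather`. The `generate` coroutine below is a hypothetical stand-in for an async LLM client call (in practice it would await a network request), so the sleep only simulates inference latency:

```python
import asyncio

# Hypothetical stand-in for an async LLM call; a real client would
# await network I/O here instead of sleeping.
async def generate(prompt: str) -> str:
    await asyncio.sleep(0.1)  # simulated inference latency
    return f"response to {prompt!r}"

async def main() -> list[str]:
    prompts = ["a", "b", "c"]
    # gather() runs all three awaitables concurrently on one thread,
    # so total wall time is ~0.1 s rather than ~0.3 s.
    return await asyncio.gather(*(generate(p) for p in prompts))

results = asyncio.run(main())
print(results)
```

Because each call spends its time waiting on I/O, the event loop interleaves them without threads or processes.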
Coroutine series 2) Useful Asyncio Functions (sjun.hashnode.dev) · Jan 31 · 14 min read
This is the second post in the series "Coroutine, IO bound and Asyncio for AI"; see the series index. Introduction: I explained coroutines and asyncio in the previous post: https://sjun.hashnode.dev/1-what-are-coroutine-asyncio-io-bound...
Coroutine series 1) What are Coroutine, Asyncio, I/O bound? (sjun.hashnode.dev) · Jan 21 · 9 min read
Introduction: In many cases, we have to run several jobs concurrently. Most developers are likely familiar with multi-threading or multi-processing, both of which Python supports through ThreadPoolExecutor and ProcessPoolExecutor. However, there is an...
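The ThreadPoolExecutor approach the excerpt mentions can be sketched as follows; `fetch` is a hypothetical I/O-bound task where a sleep stands in for a network call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical I/O-bound task: the sleep stands in for a network call.
def fetch(i: int) -> int:
    time.sleep(0.1)
    return i * 2

# Four 0.1 s "calls" overlap across threads, so wall time is
# roughly 0.1 s rather than 0.4 s.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, range(4)))
print(results)
```

Swapping in ProcessPoolExecutor has the same interface but pays process-startup and pickling costs, which is why it suits CPU-bound rather than I/O-bound work.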
KV Cache and Prompt Caching: How to Leverage Them to Cut Time and Costs (sjun.hashnode.dev) · Jan 16 · 10 min read
Introduction: A Problem of LLM Inference. In the transformer architecture, the model computes the \(\mathbf{K}, \mathbf{V}\) matrices using weight matrices \(\mathbf{W}\). When an input vector \(\mathbf{x}_0\) enters the model, it is first multiplied by...
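The projection step the excerpt describes can be written out explicitly; this is a sketch in standard transformer notation (the subscripted weight matrices \(\mathbf{W}_K, \mathbf{W}_V\) are assumed, not quoted from the post):

```latex
\mathbf{k}_0 = \mathbf{x}_0 \mathbf{W}_K, \qquad
\mathbf{v}_0 = \mathbf{x}_0 \mathbf{W}_V
```

Since \(\mathbf{W}_K\) and \(\mathbf{W}_V\) are fixed at inference time, the rows \(\mathbf{k}_i, \mathbf{v}_i\) for already-processed tokens never change; caching them means each decoding step only computes the projections for the newest token.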
Why LoRA? Understanding the Representative PEFT Method (sjun.hashnode.dev) · Jan 11 · 6 min read
Why LoRA? Low-Rank Adaptation (LoRA) has revolutionized the way we approach large language models (LLMs). As the most prominent Parameter-Efficient Fine-Tuning (PEFT) method, LoRA allows developers to adapt massive models like Llama 3 or GPT-4 to spe...