Jun Bae in sjun.hashnode.dev · Feb 8 · 13 min read
Coroutine series 3) Coroutines for LLM inference
This is the third post in the series Coroutine, IO bound and Asyncio for AI. Introduction: In this post, I will briefly introduce how to utilize coroutines for LLMs. Using asyncio for LLM inference is straightfor...
Jun Bae in sjun.hashnode.dev · Jan 31 · 14 min read
Coroutine series 2) Useful Asyncio Functions
This is the second post in the series Coroutine, IO bound and Asyncio for AI. Introduction: I explained coroutines and asyncio in the previous post: https://sjun.hashnode.dev/1-what-are-coroutine-asyncio-io-bound...
Jun Bae in sjun.hashnode.dev · Jan 21 · 9 min read
Coroutine series 1) What are Coroutine, Asyncio, I/O bound?
Introduction: In many cases, we have to run several jobs concurrently. Most developers are likely familiar with multi-threading or multi-processing, both of which Python supports through ThreadPoolExecutor and ProcessPoolExecutor. However, there is an...
Jun Bae in sjun.hashnode.dev · Jan 16 · 10 min read
KV Cache and Prompt Caching: How to Leverage them to Cut Time and Costs
Introduction: A Problem of LLM Inference. In the transformer architecture, the model calculates the \(\mathbf{K}, \mathbf{V}\) matrices using weight matrices \(\mathbf{W}\). When an input vector \(\mathbf{x}_0\) enters the model, it is first multiplied by...
Jun Bae in sjun.hashnode.dev · Jan 11 · 6 min read
Why LoRA? Understanding the representative PEFT.
Why LoRA? Low-Rank Adaptation (LoRA) has revolutionized the way we approach Large Language Models (LLMs). As the most prominent Parameter-Efficient Fine-Tuning (PEFT) method, LoRA allows developers to adapt massive models like Llama 3 or GPT-4 to spe...