@coffeenwhiskers

Exploring and Finetuning Sa2VA (Segment Anything 2 + Vision Assistant) by Bytedance

Feb 26 · 8 min read · So existing MLLMs are pretty good at one thing. Either they do vision-language chat (LLaVA, InternVL, the usual suspects) or they do segmentation (SAM2, SEEM). Combining them usually means running sep

Technical considerations for using GenAI to make pixel art for games

Feb 8 · 7 min read · I’ve always loved RPGs, and alongside my hobby of writing stories, I was considering using RPGs as a medium to convey these stories in more engaging detail. I experimented a lot with the art style using g

Research Tidbit #1 - ViT Depth MoE

Aug 12, 2025 · 5 min read · ** i am a very silly goose so please take my very amateurish research experiment here with a large adult serving of salt. Experiment 1.1: Using the NYU Depth V2 dataset with controlled indoor environments and predictable depth range allowed the foc...

When Words Win: How Language Blinds Multimodal AI

Aug 7, 2025 · 11 min read · Having used Qwen2.5-VL extensively over the past month or so, I’ve identified two distinct failure modes that expose fundamental architectural limitations in current VLM systems. These failures reveal critical weaknesses in visual grounding mec...

Crash Course on VLM Training Phases

Jul 16, 2025 · 10 min read · I’ve recently been working on VLMs and, surprisingly, couldn’t really find accessible resources on VLM training or reliable information on dataset size (it’s relatively new information, so a lot of models hallucinate on this), so I’m writing this in the hopes that down...