Exploring and Finetuning Sa2VA (Segment Anything 2 + Vision Assistant) by ByteDance
So existing MLLMs are pretty good at one thing. Either they do vision-language chat (LLaVA, InternVL, the usual suspects) or they do segmentation (SAM2, SEEM). Combining them usually means running sep
coffeenwhiskers.hashnode.dev · 8 min read