February 9, 2024

Multimodal Model LLaVa 1.6 Is Now Good

The new LLaVa 1.6 model was released and it's actually good.

by

John Paul Ada

We already knew open source multimodal models like LLaVa existed, but if you’re like me, they probably weren’t your first choice for any task.

That changes with this update, for me at least.

Introduction

The new LLaVa 1.6 model was released recently as the successor to LLaVa 1.5, which came out in October of last year.
So, what changed since then?

If you’ve used LLaVA-1.5 before, you’ll definitely feel the difference.

But how does it compare to the other models out there?

Expanded stats from the announcement post.

As you can see, LLaVA’s scores have gone up by leaps and bounds with this new model. It’s even better than some commercial models, like Gemini Pro!

Let’s see it in action!

Let’s imagine we have a very simple product identification task.

Let’s see how LLaVa-1.6 performs. First let’s see the 34B model in action. We can check out their demo for that (mostly coz my laptop can’t handle it 🥲).

With a very simple prompt, I’m asking it to tell me the name of the product. It’s able to get “Australian Pork & Beef Bolognese Mince”. That’s already pretty good! I would like it to add the variations as well, but a little prompting should be enough to get it to perform the way I want.

This time, I’ll try it locally on my laptop using Ollama. But since my laptop is weak, I can only use the 13B model. I’ll be using the same image to test as well. Let’s see how this goes!

LLaVa-1.6 13B model result (Ollama — M1 Air)

Using the 13B model, we get “Australia’s Finest Pork Mince”, which is a bit far from what I expected. Still a bit better compared to when I used the LLaVa-1.5 7B 😅

Now, let’s see how GPT-4V (via ChatGPT) performs in this case!

In this particular case, we get:

The name of the product in the image is “Australian Pork & Beef Bolognese Mince”.

That is actually NOT what I expected: I expected it to respond with just the product name, especially because it’s a premium commercial model.

Maybe for a lot of cases GPT-4V is better, but for this specific case and prompt, LLaVa-1.6 34B actually performed better considering my expectations.

How about Gemini Pro? In their benchmarks, it’s already been established that LLaVa-1.6 performs better than Gemini Pro, but how does it perform in this specific task? Let’s find out!

The free Google Bard uses Gemini Pro so it seems like the best way to try it out. Its response is:

The product in the image is Coles Australian Pork & Beef Bolognaise Mince..

I guess there’s a couple of issues here:

It included the explanation even though I asked it not to
It misspelled Bolognese as Bolognaise
What’s with the .. at the end?

For this specific task, Gemini Pro via Google Bard does not perform as well as LLaVa-1.6 34B and GPT-4V.

📺 Try their DEMO!

LLaVA-1.6 Demo: try out the new LLaVA-1.6 34B model at llava.hliu.cc

⚒️ How To Use In Your Projects

If you want to use this in your projects, you can use an API like Replicate, but if you’re going to use an API, honestly you might as well use GPT-4V or Gemini Ultra instead.

As with most open source AI models, I tend to use Ollama, as I’ve shown earlier.

You can try it out on the console with this:

# 13B model
ollama run llava:13b-v1.6

# 34B model
ollama run llava:34b-v1.6

or using the Ollama API or the Python/JS libraries.
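
For the API route, here is a minimal sketch in Python that sends an image and a prompt to a locally running Ollama server using only the standard library. It assumes Ollama is listening on its default port (11434) and that the model has already been pulled. The helper names `build_payload` and `ask_llava` and the file name `product.jpg` are made up for illustration, so treat this as a starting point rather than the one right way to do it.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes, prompt, model="llava:13b-v1.6"):
    """Build the JSON body for a single, non-streaming generate request."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for one complete JSON response instead of a stream of chunks.
        "stream": False,
    }


def ask_llava(image_path, prompt, model="llava:13b-v1.6"):
    """Send an image file plus a prompt to Ollama and return the text reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt, model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example usage (hypothetical file name, requires a running Ollama server):
# print(ask_llava("product.jpg", "What is the name of the product? Reply with the name only."))
```

Swapping the `model` argument for `llava:34b-v1.6` should work the same way if your machine can handle it.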

I also like trying out LLMs in a single file using llamafile, like I did with LLaVa-1.5, but unfortunately, there’s no LLaVa-1.6 llamafile yet. Hopefully somebody makes one soon!

✅ Summary

Based on the stats and the actual performance:

LLaVa-1.6 34B is a big step for open source multimodal models.

OCR-wise, and reasoning-wise (check out their demo!), LLaVa-1.6 34B performs better than I expected, close to the current commercial models (and even better in some cases)! Currently only the 34B model can claim this, as the 13B model isn’t up to par with expectations just yet.

🧠 My Thoughts

So would I actually use this model? That’s a BIG YES.
But would I actually use this in Production? No — at least not yet.

For the current use cases I have in mind, its current capabilities, which are LEAGUES beyond the previous version, LLaVA-1.5, are good enough for me.

The problem is the time it takes to respond.
Despite the speedup in inference time, which I like a lot, it’s still not fast enough for production.
Even though the inference time is on par with the likes of GPT-4V, I wouldn’t use any model that takes this long to process, even commercial ones. If you need to process images in real time, taking around 5 seconds to respond will be a significant bottleneck.
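
If you want to check whether response time is a problem for your own use case, a rough timing harness is enough. The sketch below isn’t tied to any particular model. `measure_latency` and `fits_budget` are hypothetical helper names, and the 1-second budget in the usage comment is just an assumed example, not a number from my tests.

```python
import statistics
import time


def measure_latency(fn, runs=5):
    """Call fn() `runs` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fits_budget(latency_s, budget_s):
    """Return True if the measured latency is within the allowed budget."""
    return latency_s <= budget_s


# Example usage (call_model is a placeholder for whatever function hits your model):
# latency = measure_latency(call_model, runs=5)
# print(latency, fits_budget(latency, budget_s=1.0))
```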

But for my personal tools and side projects — I would definitely use LLaVA-1.6.