January 30, 2024

OpenAI Embeddings vs Open Source

OpenAI text embeddings are amazing, but not the best in every category

by

John Paul Ada

OpenAI just released their new embedding models, and it was a long time coming. Their previous embedding model was clearly lagging behind the times.

Introduction

If you’ve read the announcement post, you’ll know they released two new embedding models: text-embedding-3-small and text-embedding-3-large.

They come with better pricing, and even better performance.

But we all know they’re not the only text embedding models out there. It’s a wild world out there. As someone who prefers open source if I can help it (mostly coz I’m broke as hell), I wanted to compare these new OpenAI models with the crème de la crème of open source text embeddings.

For this comparison, we’ll look at the following:

and we will be comparing them on the following criteria/stats:

MTEB Score (embedding benchmark for English)
Embedding Dimensions
Max Input Length (in tokens)
Cost / 1k tokens

Comparison

With the criteria and contenders out of the way, here is the comparison:

MTEB Score

One thing we can notice immediately is that OpenAI’s new text-embedding-3-large model is only the second best performing model in this list, with a score of 64.59. The best performing model here is actually the E5 model, with a score of 66.63. Meanwhile, Jina has the lowest score at 60.38.

Winner: E5

Embedding Dimensions

We can see the same type of result with embedding dimensions as well, with text-embedding-3-large being the second best with 3072 dimensions, the E5 model being the top again with 4096 dimensions, and the Jina model being the worst of the bunch with 768 dimensions, which makes you wonder why I even put it in this list?

That’s because …

Max Input Length

The Jina model is actually the first open source text embedding model with an 8k max input token length. That’s why I felt it deserved a spot on this list. On this criterion, it’s on par with OpenAI’s new text embedding models.

But then again, they’re tied for second place, as the E5 model once again reigns supreme in this category with a 32k max input token length.

Winner: E5
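To make the max input length a bit more concrete, here is a minimal sketch of what happens when a document is longer than the limit: you end up splitting it into chunks and embedding each one separately. The whitespace "tokenizer" and the exact limits (8,192 and 32,768 as readings of "8k" and "32k") are assumptions for illustration; each real model counts tokens with its own tokenizer.

```python
def chunk_tokens(tokens, max_len):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Whitespace splitting is only a stand-in for a real tokenizer (assumption).
document = "some long document " * 5000
tokens = document.split()

# Assumed limits: "8k" read as 8,192 tokens and "32k" as 32,768 tokens.
for limit in (8_192, 32_768):
    chunks = chunk_tokens(tokens, limit)
    print(f"limit={limit}: {len(chunks)} chunk(s)")
```

The bigger the limit, the fewer chunks you have to manage, which is the practical reason this criterion matters for long documents.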

Cost

Less explanation needed here. The open source models cost nothing to use, but the pricing for the new OpenAI embedding models is actually pretty good! Jina also has an API, but unless you bulk buy tokens, the OpenAI text embeddings are actually cheaper.
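For a rough feel of what "pretty good" means, here is a small sketch that estimates the bill for embedding a corpus. The per-1k-token prices are the launch prices as I understand them from OpenAI's announcement, so treat them as assumptions and check the current pricing page; the corpus size and the function name are made up for illustration.

```python
# Launch prices in USD per 1k tokens (assumed; verify against current pricing).
PRICE_PER_1K = {
    "text-embedding-3-small": 0.00002,
    "text-embedding-3-large": 0.00013,
}


def estimate_cost(num_tokens, model):
    """Return the estimated cost in USD to embed num_tokens with the given model."""
    return num_tokens / 1000 * PRICE_PER_1K[model]


# Hypothetical corpus of 10 million tokens.
corpus_tokens = 10_000_000
for model in PRICE_PER_1K:
    print(f"{model}: ${estimate_cost(corpus_tokens, model):.2f}")
```

This only covers API charges. Running an open source model yourself costs nothing per token, but you still pay for the hardware.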

Summary

At a glance, based on the comparisons, E5 is the best model we have looked at today. But there’s another consideration:

The e5-mistral-7b-instruct is a heckin chonk of a model.

It doesn’t matter if it’s a really good model if you can’t use it. I’m unable to run this model at all on my smol M1 MacBook Air.
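A back-of-the-envelope calculation shows why. The sketch below estimates how much memory the weights alone would need at different precisions. The parameter count of roughly 7 billion is an assumption taken from the "7b" in the model name, and the estimate ignores activations and runtime overhead, so the real requirement is higher.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def weights_gib(num_params, bytes_per_param):
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


# Assumed from the "7b" in e5-mistral-7b-instruct; not an exact count.
ASSUMED_PARAMS = 7_000_000_000

for label, width in (("float32", 4), ("float16", 2), ("int8", 1)):
    print(f"{label}: ~{weights_gib(ASSUMED_PARAMS, width):.1f} GiB")
```

Under these assumptions, even half precision needs about 13 GiB just for the weights. The M1 MacBook Air shipped with 8 GB or 16 GB of unified memory, shared with everything else on the machine, so it is not a comfortable fit.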

Recommendations

If you have the budget: the new OpenAI embeddings
If you have big documents: Jina
If you have small documents: BGE
If you have a massive GPU lying around: E5

My Thoughts

I have never used the OpenAI embeddings because of the great open source options out there, but I’m actually interested to try the new models because of their improved cost and performance, especially for commercial work.

But also, there’s…