AI news
October 3, 2024

OpenAI Drops Realtime API, Prompt Caching, and More on DevDay 2024

Great news for developers! Advanced voice mode is now accessible via API.

by Jim Clyde Monge

It’s been merely a few days since some big names, including former CTO Mira Murati, left OpenAI. So seeing Sam Altman up on stage at DevDay, talking about new products, feels a bit odd.

With all these changes at the company, you can’t help but wonder: should we still trust him?

But that’s not the focus here. Let’s set the drama aside for a second and look at what DevDay was really about—the new tools OpenAI just announced for developers.

OpenAI has certainly packed a lot into this year’s event, and while the leadership changes are concerning, it’s clear the company is still pushing forward. In fact, there’s quite a bit of progress worth unpacking.

In case you missed DevDay 2023, here’s a quick update on the progress OpenAI has made since then:

  • 98% decrease in cost per token from GPT-4 to GPT-4o mini
  • 50x increase in token volume across their systems
  • Significant model intelligence progress

Realtime API

The highlight of DevDay 2024 was undoubtedly the Realtime API.

This API enables developers to build low-latency, multimodal conversational capabilities into their applications, supporting text, audio, and function calling.

Here’s a sample JavaScript call to the API, which adds a user message to the conversation and asks the model to respond:

// Add a user message to the conversation over an open WebSocket connection.
const event = {
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Hey, how are you doing?'
      }
    ]
  }
};
ws.send(JSON.stringify(event));
// Ask the model to generate a response to the conversation so far.
ws.send(JSON.stringify({type: 'response.create'}));
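
The snippet above assumes an already-open WebSocket connection (ws). For context, here’s a minimal sketch of how that connection might be established in Node.js using the ws package. The model name and beta header are the ones OpenAI documented at launch, so treat them as assumptions that may have changed by the time you read this:

import WebSocket from 'ws';

// Assumes Node.js, the `ws` npm package, and the preview Realtime model
// announced at DevDay; check OpenAI's docs for the current model string.
const url = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01';

const ws = new WebSocket(url, {
  headers: {
    Authorization: 'Bearer ' + process.env.OPENAI_API_KEY,
    'OpenAI-Beta': 'realtime=v1',
  },
});

ws.on('open', () => {
  // Once the socket is open, send the conversation.item.create and
  // response.create events shown above.
});

ws.on('message', (raw) => {
  // Server events (text deltas, audio chunks, etc.) arrive as JSON strings.
  console.log(JSON.parse(raw.toString()));
});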

So, why should developers care about this?

  1. Native speech-to-speech: No text intermediary means low latency, nuanced output.
  2. Natural, steerable voices: The models have a natural inflection and can laugh, whisper, and adhere to tone direction.
  3. Simultaneous multimodal output: Text is useful for moderation, while faster-than-realtime audio ensures stable playback.

Back in July, I wrote a post about OpenAI’s advanced voice mode, showing its ability to detect and react to different human vocal tones and how impressed I was with the feature. Check out the full article below:

OpenAI’s Advanced Voice Mode Is Now Available to Select Users
OpenAI’s Advanced Voice Mode is now available to select ChatGPT Plus users and will rollout to all paying users in…generativeai.pub

Now, thousands of developers can integrate this feature into their apps, opening the door for a new wave of voice-powered applications.

Here’s the Realtime API pricing:

  • Text input: $5 per 1 million tokens
  • Text output: $20 per 1 million tokens
  • Audio input: $100 per 1 million tokens (around $0.06 per minute)
  • Audio output: $200 per 1 million tokens (around $0.24 per minute)
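
For a rough sense of cost: at those per-minute approximations, one minute of user audio in plus one minute of model audio out comes to about $0.06 + $0.24 = $0.30, before any text tokens or re-processing of earlier conversation turns. Treat that as back-of-the-envelope math based purely on the figures above.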

Prompt Caching

Next on the list is Prompt Caching, a feature that significantly reduces the cost and time needed to process repeated prompts.

OpenAI will now route API requests to servers that recently processed the same prompt prefix, so redundant computation can be skipped. This is particularly useful for developers working with long or complex prompts that are frequently reused.

This can reduce latency by up to 80% and cost by 50% for long prompts.

Prompt caching isn’t a completely new concept. In fact, Anthropic introduced a similar feature not too long ago, which allowed developers to cache frequently used contexts and reduce costs by up to 90%.

OpenAI’s Prompt Caching is enabled for the following models:

  • gpt-4o
  • gpt-4o-mini
  • o1-preview
  • o1-mini

When you make an API request, here’s what happens:

Prompt Caching, OpenAI DevDay 2024
Image from OpenAI
  1. Cache Lookup: The system checks if the initial portion (prefix) of your prompt is stored in the cache.
  2. Cache Hit: If a matching prefix is found, the system uses the cached result. This significantly decreases latency and reduces costs.
  3. Cache Miss: If no matching prefix is found, the system processes your full prompt. After processing, the prefix of your prompt is cached for future requests.

Cached prefixes typically remain active for 5 to 10 minutes of inactivity, though during off-peak periods they may persist for up to one hour.
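
There’s nothing to switch on: caching kicks in automatically once a prompt is long enough (OpenAI’s documentation puts the threshold at 1,024 tokens), and the main trick is keeping the stable, reusable part of the prompt at the beginning. Here’s a rough sketch with the official Node SDK; the long system prompt is a placeholder, and my assumption is that the usage.prompt_tokens_details field in the response is where cache hits show up:

import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Placeholder: a long (1,024+ token), unchanging system prompt. Putting it
// first lets the prefix be cached; keep per-request text at the end.
const LONG_STATIC_INSTRUCTIONS = '...';

const completion = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: LONG_STATIC_INSTRUCTIONS },
    { role: 'user', content: 'Summarize this support ticket: ...' },
  ],
});

// Reports how many prompt tokens were served from the cache on this request.
console.log(completion.usage.prompt_tokens_details);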

Pricing for prompt caching is as follows:

OpenAI prompt caching pricing
Image from OpenAI

Vision Fine-Tuning

Another great feature introduced at DevDay was Vision Fine-Tuning.

This feature allows users to fine-tune models with images alongside text in JSONL files. This opens up the possibility of training models not only with textual inputs but also with visual data.

Here’s an example of an image message on a line of your JSONL file. Below, the JSON object is expanded for readability, but typically this JSON would appear on a single line in your data file:

{
  "messages": [
    { "role": "system", "content": "You are an assistant that identifies uncommon cheeses." },
    { "role": "user", "content": "What is this cheese?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
          }
        }
      ] 
    },
    { "role": "assistant", "content": "Danbo" }
  ]
}
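
Once the JSONL file is ready, kicking off a vision fine-tune looks just like a text-only fine-tune: upload the file, then create a job against a GPT-4o snapshot. Here’s a hedged sketch with the Node SDK; the file name and model snapshot string are assumptions, so check the fine-tuning docs for the exact identifiers:

import fs from 'fs';
import OpenAI from 'openai';

const client = new OpenAI();

// Upload the training data (each line is one example like the one above).
const file = await client.files.create({
  file: fs.createReadStream('cheese-training.jsonl'), // hypothetical file name
  purpose: 'fine-tune',
});

// Start a fine-tuning job on a GPT-4o snapshot that supports image inputs.
const job = await client.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-2024-08-06', // assumed snapshot; verify in OpenAI's docs
});

console.log(job.id, job.status);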

Why is this useful?

OpenAI partnered with leading tech companies like Grab to showcase the power of vision fine-tuning in real-world applications. Grab, a major food delivery and rideshare service, used this feature to enhance its GrabMaps platform, which relies on street-level imagery collected from drivers to support operations across Southeast Asia.

By fine-tuning GPT-4o with just 100 examples, Grab improved its ability to localize traffic signs and count lane dividers.

OpenAI vision fine-tuning with Grab
Image from OpenAI

This resulted in a 20% increase in lane count accuracy and a 13% improvement in speed limit sign localization, streamlining their mapping processes and reducing the need for manual intervention.

Note: Your training images cannot contain pictures of people, faces, CAPTCHAs, or images that violate OpenAI’s terms of use. Datasets containing such images will be automatically rejected.

As for pricing, OpenAI is currently offering 1 million free training tokens per day through October 31, 2024, to fine-tune GPT-4o with images.

After October 31, 2024, GPT-4o fine-tuning training will cost $25 per 1M tokens, with inference at $3.75 per 1M input tokens and $15 per 1M output tokens. Image inputs are first tokenized based on image size and then priced at the same per-token rate as text inputs.
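
As a rough worked example using those numbers: a training file that tokenizes to around 400K tokens (text plus images) run for three epochs amounts to roughly 1.2M training tokens, or about $30 at $25 per 1M tokens. That assumes image tokens count toward training volume the same way text tokens do, which is how I read the per-token pricing.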

Let’s Talk About Safety

As cool as these new features are, they do come with safety concerns, especially the Realtime API.

We’re getting closer to a world where fake phone calls could be indistinguishable from real ones. Imagine getting a call from someone who sounds exactly like your boss or a family member, only to later find out it was an AI.

It’s not hard to see how bad actors could abuse this technology.

In fact, a few days ago, the Federal Communications Commission fined a political consultant $6 million for using AI to mimic the voice of President Joe Biden in robocalls earlier this year.

To prevent misuse, OpenAI’s Realtime API can’t call restaurants or shops directly. However, the AI doesn’t disclose on its own that it isn’t human, so it can be difficult to tell whether you’re talking to an AI. For now, it seems to be the developers’ responsibility to add that kind of disclosure.

OpenAI has tried to mitigate these risks. For voice interactions, it relies on its existing audio safety infrastructure, which it says has been effective in limiting abuse for deceptive purposes, such as misleading phone calls or voice manipulation.

When it comes to vision fine-tuning, fine-tuned models remain entirely under the control of the user, ensuring full ownership of business data. OpenAI does not train models on any inputs or outputs used for fine-tuning without explicit permission, ensuring that the data remains private and secure.

Final Thoughts

A lot was announced today, but the standout for me is the Realtime API. Essentially, it’s an API version of ChatGPT’s advanced voice mode, and I expect to see hundreds of apps built on this voice API in the coming weeks.

According to OpenAI, there are now over 3 million developers building with its technology. These newly announced products, especially the Realtime voice API, could help grow that crucial developer base and OpenAI’s revenue.

At this point, it’s hard to gauge just how intuitive these APIs are or how cost-effective they’ll be in real-world applications. I plan to build some proof-of-concept (PoC) apps to test them out and will share my findings in a separate post. Until then, I’d love to hear your thoughts on this year’s DevDay.

What product announcement excited you the most? Let me know in the comments!
