GPT-4o API Guide: Harnessing Text, Image, and Video Processing for Intelligent Automation

GPT-4o, OpenAI's multimodal model, processes text and images natively and can reason over video through sampled frames. Discover its capabilities, API usage, and real-world examples in this guide.

OpenAI's GPT-4o marks a significant leap forward in AI technology by integrating text, vision, and audio processing within a single model. This unified approach ensures cohesive understanding and generation across multiple modalities, opening up new possibilities for developers and businesses alike.

Introduction to GPT-4o

GPT-4o ("o" for "omni") is designed to handle a combination of text, audio, image, and video inputs, and can generate outputs in text, audio, and image formats.

Background

Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model that's trained across text, vision, and audio. This unified approach ensures that all inputs—whether text, visual, or auditory—are processed cohesively by the same neural network.

Current API Capabilities

Currently, the API supports only text and image inputs with text outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o for text, image, and video understanding.

Getting Started

Install OpenAI SDK for Python

%pip install --upgrade openai --quiet

Configure the OpenAI client and submit a test request

To set up the client, we first need an API key. Skip these steps if you already have one.

You can get an API key by following these steps:

  1. Create a new project
  2. Generate an API key in your project
  3. (RECOMMENDED, BUT NOT REQUIRED) Set up your API key for all projects as an environment variable (an optional check follows this list)
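If you go the environment-variable route, a quick optional check from inside the notebook can confirm the key is visible to your kernel. This is a minimal sketch and assumes the variable is named OPENAI_API_KEY, which is what the client code below expects.

import os

# Optional: confirm the key is visible before creating the client
if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set -- export it or pass api_key explicitly when creating the client.")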

Once we have this set up, let's start with a simple text input to the model for our first request. We'll use both system and user messages, and we'll receive a response from the assistant role.

from openai import OpenAI
import os

## Set the API key and model name
MODEL="gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))

completion = client.chat.completions.create(
  model=MODEL,
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},
    {"role": "user", "content": "Hello! Could you solve 2+2?"}
  ]
)

print("Assistant: " + completion.choices[0].message.content)

Image Processing

GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats:

  • Base64 Encoded
  • URL

Let's first view the image we'll use, then try sending it to the API both as a Base64-encoded string and as a URL.

from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "data/triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))

Base64 Image Processing

# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

URL Image Processing

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
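Both input formats accept an optional detail field on the image_url object, which trades image fidelity for token usage; the video examples below set it to "low" to keep costs down when sending many frames. A minimal sketch of a content part with detail set explicitly (not required for the requests above):

# Same kind of content part as above, with the optional `detail` field set
image_part = {
    "type": "image_url",
    "image_url": {
        "url": f"data:image/png;base64,{base64_image}",
        "detail": "low",  # "low", "high", or "auto" (the default)
    },
}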

Video Processing

While it's not possible to directly send a video to the API, GPT-4o can understand videos if you sample frames and then provide them as images. It performs better at this task than GPT-4 Turbo.

Since GPT-4o in the API does not yet support audio input (as of May 2024), we'll use a combination of GPT-4o and Whisper to process both the audio and the visuals of a provided video, and showcase two use cases:

  1. Summarization
  2. Question and Answering

Setup for Video Processing

We'll use two Python packages for video processing, opencv-python and moviepy. These require ffmpeg, so make sure it is installed beforehand (an optional check follows the installs below).

%pip install opencv-python --quiet
%pip install moviepy --quiet
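Before processing the video, an optional sanity check can confirm that ffmpeg is discoverable on your PATH. This is only a convenience sketch; depending on how moviepy was installed, it may also be able to use a bundled ffmpeg binary.

import shutil

# Warn (rather than fail) if ffmpeg isn't on the PATH
if shutil.which("ffmpeg") is None:
    print("Warning: ffmpeg not found on PATH -- the audio extraction step below may fail.")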

Process the video into two components: frames and audio

import cv2
from moviepy.editor import VideoFileClip
import time
import base64
import os  # used by os.path.splitext below

# We'll be using the OpenAI DevDay Keynote Recap video
VIDEO_PATH = "data/keynote_recap.mp4"

def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame=0

    # Loop through the video and extract frames at specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path

# Extract 1 frame per second. Adjust `seconds_per_frame` to change sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)

## Display frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)

Audio(audio_path)

Example 1: Summarization

Now that we have both the video frames and the audio, let's run a few tests to generate a video summary and compare the results of using the model with different modalities:

  1. Visual Summary
  2. Audio Summary
  3. Visual + Audio Summary

# Visual Summary
response = client.chat.completions.create(
    model=MODEL,
    messages=[
      {"role":"system","content":"You are generating a video summary. Please provide a summary of the video. Respond in Markdown."},
      {"role":"user", "content":[
          "These are the frames from the video.",
          *map(lambda x:{"type":"image_url",
                         "image_url":{"url":f'data:image/jpg;base64,{x}',"detail":"low"}}, base64Frames)
        ],
      }
    ],
    temperature=0,
)
print(response.choices[0].message.content)

# Audio Summary
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(audio_path, "rb"),
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
      {"role":"system","content":"""You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown."""},
      {"role":"user","content":[
          {"type":"text","text":f"The audio transcription is: {transcription.text}"}
        ],
      }
    ],
    temperature=0,
)
print(response.choices[0].message.content)

# Audio + Visual Summary
response = client.chat.completions.create(
    model=MODEL,
    messages=[
      {"role":"system","content":"""You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown"""},
      {"role":"user","content":[
          "These are the frames from the video.",
          *map(lambda x:{"type":"image_url",
                         "image_url":{"url":f'data:image/jpg;base64,{x}',"detail":"low"}}, base64Frames),
          {"type":"text","text":f"The audio transcription is: {transcription.text}"}
        ],
      }
    ],
    temperature=0,
)
print(response.choices[0].message.content)
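Each of these requests builds the same list of frame content parts inline. If you prefer to avoid repeating that pattern, a small helper can tidy it up. This is a refactoring sketch rather than part of the original walkthrough, and frames_as_content is a hypothetical name:

def frames_as_content(frames, detail="low"):
    """Turn base64-encoded JPEG frames into chat content parts."""
    return [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpg;base64,{frame}", "detail": detail}}
        for frame in frames
    ]

# Usage: replace the inline *map(...) expressions with *frames_as_content(base64Frames)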

Example 2: Question and Answering

For the Q&A, we'll use the same approach as before to ask questions about our processed video, running the same three tests.

QUESTION = "Question: Why did Sam Altman have an example about raising windows and turning the radio on?"

# Visual Q&A
qa_visual_response = client.chat.completions.create(
    model=MODEL,
    messages=[
      {"role":"system","content":"Use the video to answer the provided question. Respond in Markdown."},
      {"role":"user","content":[
          "These are the frames from the video.",
          *map(lambda x:{"type":"image_url","image_url":{"url":f'data:image/jpg;base64,{x}',"detail":"low"}}, base64Frames),
          QUESTION
        ],
      }
    ],
    temperature=0,
)
print("Visual QA:\n" + qa_visual_response.choices[0].message.content)

# Audio Q&A
qa_audio_response = client.chat.completions.create(
    model=MODEL,
    messages=[
      {"role":"system","content":"""Use the transcription to answer the provided question. Respond in Markdown."""},
      {"role":"user","content":f"The audio transcription is: {transcription.text}. \n\n {QUESTION}"},
    ],
    temperature=0,
)
print("Audio QA:\n" + qa_audio_response.choices[0].message.content)

# Visual + Audio Q&A
qa_both_response = client.chat.completions.create(
    model=MODEL,
    messages=[
      {"role":"system","content":"""Use the video and transcription to answer the provided question."""},
      {"role":"user","content":[
          "These are the frames from the video.",
          *map(lambda x:{"type":"image_url",
                         "image_url":{"url":f'data:image/jpg;base64,{x}',"detail":"low"}}, base64Frames),
          {"type":"text","text":f"The audio transcription is: {transcription.text}"},
          QUESTION
        ],
      }
    ],
    temperature=0,
)
print("Both QA:\n" + qa_both_response.choices[0].message.content)

Conclusion

Integrating multiple input modalities, such as audio, visual, and textual data, significantly enhances the model's performance on a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, mirroring more closely how humans perceive and process information.

Currently, GPT-4o in the API supports text and image inputs, with audio capabilities coming soon.


