
Native Multimodal Ingestion

Mastering the Gemini API | 12 min | 250 BASE XP

Beyond Text Prompts

Gemini was built from the ground up to be multimodal. You don't need to convert videos into image frames or transcribe audio before sending them to the API.

import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-3.1-pro")

# Pass a raw video file directly from Cloud Storage
video_part = Part.from_uri("gs://your-bucket/meeting.mp4", mime_type="video/mp4")

# Combine media and text in a single request
response = model.generate_content([
    video_part,
    "Summarize the key decisions made in this meeting video.",
])
print(response.text)

Gemini processes the raw video frames and the audio track natively, in a single request; no separate transcription or frame-extraction step is needed.
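The same `Part.from_uri` pattern extends to audio or image files; only the MIME type changes. Here is a minimal sketch with a small helper that derives the MIME type from the file extension. Note that `guess_mime_type` is a hypothetical convenience function written for this example, not part of the Vertex AI SDK, and the Cloud Storage paths are placeholders.

```python
import mimetypes


def guess_mime_type(uri: str) -> str:
    """Guess the MIME type of a Cloud Storage URI from its file extension.

    Hypothetical helper for this example; the SDK itself requires you to
    pass mime_type explicitly to Part.from_uri.
    """
    mime, _ = mimetypes.guess_type(uri)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {uri!r}")
    return mime


# Example usage with the model from the snippet above
# (requires the google-cloud-aiplatform package and an initialized project):
#
# audio_uri = "gs://your-bucket/support-call.mp3"
# audio_part = Part.from_uri(audio_uri, mime_type=guess_mime_type(audio_uri))
# response = model.generate_content([
#     audio_part,
#     "Transcribe this call and list the customer's complaints.",
# ])
# print(response.text)
```

Keeping media in Cloud Storage and referencing it by `gs://` URI avoids uploading large files in the request body itself.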

SYNAPSE VERIFICATION
QUERY 1 // 2
How does Gemini 3.1 handle video files?
It requires a separate transcription API first
It natively ingests the video and audio simultaneously
It only looks at the first frame
It converts the video to a text description before processing
Google Vertex AI Academy | Free Interactive Course | Infinity AI