Gemini was built from the ground up to be multimodal. You don't need to convert videos into images or transcribe audio before sending them to the API.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Initialize the Vertex AI SDK with your project and region.
vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-3.1-pro")

# Pass a raw video file directly from Cloud Storage.
video_part = Part.from_uri("gs://your-bucket/meeting.mp4", mime_type="video/mp4")

response = model.generate_content([
    video_part,
    "Summarize the key decisions made in this meeting video.",
])
print(response.text)
Gemini processes the raw audio track and video frames natively, so no client-side preprocessing is needed before the call above.
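Since Part.from_uri requires an explicit mime_type, it can be convenient to derive one from the file name. Below is a minimal sketch using only Python's standard-library mimetypes module; the helper name guess_media_mime_type is our own, not part of the Vertex AI SDK:

```python
import mimetypes


def guess_media_mime_type(uri: str) -> str:
    # Hypothetical helper: strip the gs:// scheme so mimetypes
    # sees a plain file path, then guess from the extension.
    path = uri.removeprefix("gs://")
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot infer MIME type for {uri}")
    return mime


print(guess_media_mime_type("gs://your-bucket/meeting.mp4"))  # video/mp4
```

The guessed value could then be passed as the mime_type argument to Part.from_uri; for unusual extensions you would still specify the type by hand.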