Karen Zhang from Google explains how the rise of multimodal large language models has opened up an entirely new set of practical, real-world use cases, especially around video and visual content rather than just text and structured data.
Zhang describes how these models are now being applied in areas that were previously expensive, slow, and operationally complex. One standout example comes from a customer responsible for more than 20% of the UK's fleet management services, which also works with Amazon Logistics. Traditionally, producing driver safety training videos required full film crews, casting, on-site shoots, road closures, and significant time and cost. For a business operating at that scale, the overhead was enormous.
Using Google's video generation model Veo 3 alongside Gemini models, the team has transformed that process. Instead of physical production, it now generates safety training content through prompting; Zhang notes this has already produced more than 50 production-grade videos that are actively being rolled out to drivers. What was once highly labour-intensive has become faster, more flexible, and far more scalable.
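To make that prompting workflow concrete, here is a minimal sketch of prompt-driven video generation using the google-genai Python SDK. The model ID, prompt, and configuration values below are illustrative assumptions, not details from Zhang's talk; the key pattern is that Veo requests run as long-running operations that are polled until complete.

```python
import time

from google import genai
from google.genai import types

# Assumption: GOOGLE_API_KEY is set in the environment. The model ID is
# illustrative and may differ from the Veo 3 version actually used.
client = genai.Client()

# Kick off a video generation request from a text prompt.
operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",
    prompt=(
        "Dashcam-style footage of a delivery van approaching a zebra "
        "crossing in light rain; the driver slows early and yields to "
        "a pedestrian, demonstrating safe stopping distance."
    ),
    config=types.GenerateVideosConfig(aspect_ratio="16:9"),
)

# Video generation is asynchronous: poll the operation until it finishes.
while not operation.done:
    time.sleep(20)
    operation = client.operations.get(operation)

# Download each generated clip for review before rolling it out to drivers.
for i, generated in enumerate(operation.response.generated_videos):
    client.files.download(file=generated.video)
    generated.video.save(f"safety_clip_{i}.mp4")
```

In a production pipeline of the kind Zhang describes, a Gemini model could plausibly sit upstream of this step, for example turning incident reports or safety guidelines into the scene descriptions used as prompts; that division of labour is an assumption, not something specified in the talk.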
Beyond efficiency gains, Zhang highlights the broader impact: these AI-generated videos are helping improve road safety and, in very real terms, contributing to saving lives. It's a use case she says would have been hard to imagine even a year ago, when AI applications were still heavily focused on structured data. Today, with models that can work across unstructured inputs like video and images, Google sees enormous potential for creating meaningful, high-impact content across industries, from financial services to logistics and beyond.