VLOGGER Is Google's Image-To-Video AI Tool That Can Be Controlled By Voice
Even though Google VLOGGER is not yet accessible for testing, as per multiple reports, the demonstration hints at its potential to enable users to create and command avatars using voice commands.
With Artificial Intelligence (AI) being the hottest buzzword in tech, search engine giant Google's researchers have been busy unveiling a string of innovative models and concepts. Their latest creation, which follows their recent advancements in game-playing AI, transforms a static image into a manipulable avatar. Even though Google VLOGGER is not yet accessible for testing, as per multiple reports, the demonstration hints at its potential to enable users to create and command avatars using voice commands.
However, a user named Madni Aghadi (@hey_madni) posted on X, formerly Twitter: "Google just dropped VLOGGER, and it's crazy. This is going to transform the future of VIDEO forever. Here’s everything you need to stay ahead of the curve: 🧵 👇."
It should be noted that the image posted by Aghadi is a mockup and not real. VLOGGER is the tech giant's research project that may be able to "make photos come alive" via AI in the future. While existing tools like Pika Labs' lip sync, HeyGen's video translation services and Synthesia offer similar functionality to some degree, Google VLOGGER appears to offer a more straightforward, bandwidth-friendly alternative.
What Is VLOGGER?
Currently, VLOGGER remains a research endeavour featuring a few entertaining demo videos. However, should it evolve into a product, it has the potential to revolutionise communication on platforms like Teams or Slack.
This AI model can generate a dynamic avatar from a single static image while preserving the photorealistic appearance of the individual throughout every frame of the resulting video.
Moreover, the model integrates an audio file of the individual speaking, orchestrating body and lip movements to mirror the natural gestures and expressions that the person would exhibit if they were speaking in real life. This also includes generating head movements, facial expressions, eye movements, blinking, as well as hand gestures and upper body motions, all without relying on any additional references beyond the provided image and audio.
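To make that input/output contract concrete, here is a minimal Python sketch of what such an interface might look like. VLOGGER's code and API are not public, so the function name, signatures and placeholder logic below are purely illustrative assumptions; only the contract mirrors the description above: one still image plus one speech waveform in, a stack of full video frames out, with no extra references.

```python
import numpy as np

def animate_avatar(image: np.ndarray, audio: np.ndarray,
                   fps: int = 25, sample_rate: int = 16000) -> np.ndarray:
    """Hypothetical interface: one still image + one speech waveform in,
    a stack of video frames out. No driving video, no face crop,
    no per-person training -- mirroring the behaviour described above."""
    n_frames = int(len(audio) / sample_rate * fps)
    # Placeholder "animation": the real model would synthesise head, lip,
    # eye and upper-body motion per frame; here we simply tile the input.
    return np.repeat(image[np.newaxis], n_frames, axis=0)

# Example: a 512x512 portrait and 3 seconds of audio -> 75 frames.
portrait = np.zeros((512, 512, 3), dtype=np.uint8)
speech = np.zeros(3 * 16000, dtype=np.float32)
frames = animate_avatar(portrait, speech)
print(frames.shape)  # (75, 512, 512, 3)
```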
A GitHub page further explains VLOGGER in its abstract: "We propose VLOGGER, a method for text and audio-driven talking human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion based architecture that augments text-to-image models with both temporal and spatial controls. This approach enables the generation of high quality videos of variable length, that are easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g., visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate."
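Read as a recipe, that abstract describes a two-stage pipeline: audio is first mapped to a per-frame sequence of 3D body and face motion parameters by one diffusion model, and a second, temporally-aware diffusion model then renders complete frames conditioned on those parameters and the single reference photo. The Python sketch below is one interpretation of that structure; the toy denoising loops, tensor shapes and stage names are stand-ins for illustration, not Google's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage1_audio_to_motion(audio_feats: np.ndarray, motion_dim: int = 128,
                           steps: int = 10) -> np.ndarray:
    """Stage 1 -- the 'stochastic human-to-3d-motion diffusion model':
    start from noise and iteratively denoise into a per-frame sequence
    of 3D face/body motion parameters, conditioned on the audio."""
    t = audio_feats.shape[0]
    motion = rng.standard_normal((t, motion_dim))
    cond = np.tile(audio_feats.mean(axis=1, keepdims=True), (1, motion_dim))
    for _ in range(steps):
        # Toy denoiser: blend towards the conditioning signal. A real model
        # would use a learned network to predict and remove noise.
        motion = 0.8 * motion + 0.2 * cond
    return motion

def stage2_motion_to_frames(ref_image: np.ndarray, motion: np.ndarray,
                            steps: int = 10) -> np.ndarray:
    """Stage 2 -- a temporally-aware image diffusion model that renders the
    complete frame (torso included), driven by the motion controls and the
    single reference photo; no face detection or cropping involved."""
    t = motion.shape[0]
    frames = rng.standard_normal((t,) + ref_image.shape)
    for _ in range(steps):
        # Toy denoiser: pull every noisy frame towards the reference image,
        # modulated per-frame by the motion parameters.
        gain = 0.2 * (1 + np.tanh(motion.mean(axis=1)))[:, None, None, None]
        frames = (1 - gain) * frames + gain * ref_image[np.newaxis]
    return frames

# Wire the two stages together: 2 seconds of audio features at 25 fps.
audio_feats = rng.standard_normal((50, 64))         # (frames, feature_dim)
ref_image = rng.random((64, 64, 3))                 # single input photo
motion = stage1_audio_to_motion(audio_feats)        # (50, 128)
video = stage2_motion_to_frames(ref_image, motion)  # (50, 64, 64, 3)
print(motion.shape, video.shape)
```

Splitting the problem this way is what lets the paper claim control "through high-level representations of human faces and bodies": the intermediate motion sequence is the controllable handle between the audio and the rendered pixels.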