Google has recently introduced its latest AI model, the Gemini 1.5 Pro, representing the next generation of its artificial intelligence line. Built on a Mixture-of-Experts (MoE) architecture, the new model brings significant advances over its predecessors, and Google positions it as a notably more capable successor to earlier releases. As the inaugural model in the Gemini 1.5 family, the 1.5 Pro is currently undergoing early testing. Google characterises it as a mid-size multimodal model, optimised to scale across a diverse array of tasks.


What Makes Gemini 1.5 Pro Stand Out?


What distinguishes the Gemini 1.5 Pro is its long-context understanding across different modalities. Google asserts that the Gemini 1.5 Pro achieves results comparable to the recently launched Gemini 1.0 Ultra while using significantly less compute. Its standout feature is the ability to consistently process up to one million tokens of context, the longest context window of any large-scale foundation model to date. For comparison, the Gemini 1.0 models offer a context window of up to 32,000 tokens, GPT-4 Turbo offers 128,000 tokens, and Claude 2.1 offers 200,000 tokens.


The model ships with a standard 128,000-token context window, but Google is allowing a select group of developers and enterprise customers to experiment with a context window of up to one million tokens. The Gemini 1.5 Pro is currently in preview, and developers can test it through Google's AI Studio and Vertex AI.
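For developers with preview access, a single-turn text request can be sketched against Google's Generative Language REST API. The endpoint shape and the model name `gemini-1.5-pro-latest` below are assumptions based on Google's public v1beta API, not confirmed by the article; check the AI Studio documentation for your account before relying on them.

```python
# Sketch: querying the Gemini 1.5 Pro preview via the Generative Language
# REST API. Endpoint shape and model name are assumptions (v1beta API);
# consult Google AI Studio docs for the exact values for your account.
import json
import os
import urllib.request

ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-1.5-pro-latest:generateContent"
)

def build_request(prompt: str) -> dict:
    """Assemble the JSON body for a single-turn text prompt."""
    return {"contents": [{"parts": [{"text": prompt}]}]}

def generate(prompt: str, api_key: str) -> str:
    """Send the prompt and return the first candidate's text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["candidates"][0]["content"]["parts"][0]["text"]

# Only fires when an API key is configured in the environment.
if os.environ.get("GOOGLE_API_KEY"):
    print(generate("Summarise this article in three sentences.",
                   os.environ["GOOGLE_API_KEY"]))
```

The same call can also be made through Vertex AI or Google's official client SDKs; the raw REST form is shown here only to make the request structure explicit.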


Use Cases of Gemini 1.5 Pro


The Gemini 1.5 Pro is said to be capable of processing approximately 700,000 words or around 30,000 lines of code in a single prompt, roughly 35 times more than Gemini 1.0 Pro can handle. It can also process 11 hours of audio or one hour of video across various languages. Demonstration videos shared on Google's official YouTube channel illustrated this long-context understanding using a 402-page PDF as a prompt; in the live interaction, the model responded to a prompt of 326,658 tokens, 256 of them image tokens, for a reported total of 327,309 tokens.
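The reported figures can be sanity-checked with back-of-envelope arithmetic, assuming the commonly quoted ratio of roughly 0.7 English words per token (an illustrative approximation, not Google's tokeniser):

```python
# Back-of-envelope check of the reported capacities, assuming ~0.7 English
# words per token (an illustrative ratio, not an official tokeniser figure).
CONTEXT_TOKENS = 1_000_000

# ~700,000 words fit in the one-million-token window, matching the claim.
words = int(CONTEXT_TOKENS * 0.7)
print(f"~{words:,} words fit in the window")

# The Gemini 1.0 models' 32,000-token window is about 31x smaller, close
# to the "roughly 35 times" comparison quoted for Gemini 1.0 Pro.
ratio = CONTEXT_TOKENS / 32_000
print(f"1M tokens is ~{ratio:.0f}x the Gemini 1.0 window")
```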



Another demonstration highlighted the Gemini 1.5 Pro working with a 44-minute video, a silent film recording of Sherlock Jr., alongside various multimodal prompts. The video came to 696,161 tokens in total, 256 of them image tokens. In the demo, a user asked the model to locate specific moments and associated information in the film, and the model returned timestamps and details for the corresponding scenes.



Meanwhile, a separate demonstration showed the model reasoning over 100,633 lines of code through a series of multimodal prompts.