Gemini: Google strengthens its search engine and humanizes it with natural dialogue and vision skills | Technology
Just 24 hours after the presentation of GPT-4o, the most advanced version of OpenAI's conversational bot, Google matched and upped the ante this Tuesday by presenting similar improvements to its search engine, which are rolling out first in the United States before spreading to the rest of the world. The new search platform reproduces the skills of what the company calls “agents”, with the ability to plan and execute actions on behalf of the user, but humanizes them to the point of emulating an interaction with a person. Gemini, as the multinational's artificial intelligence and search engine is called, can be interrupted to redirect the conversation, and the phone's camera becomes its eyes, letting it describe what it sees, solve the problems it observes, or recall where an object it registered during the conversation was left. Where did you put the keys? What is the solution to this problem? What is this? Ask Gemini.
Google has pulled out all the stops to counter OpenAI and defend its hegemony in search. The company’s chief executive officer, Sundar Pichai, presented the latest advances in artificial intelligence this Tuesday at the annual Google I/O conference in Mountain View (California). They will apply to all products (Gmail, Photos, Drive, Meet and any Workspace tool), but especially, as Pichai stated, to the platform that is the company's stronghold: “The most exciting transformation with Gemini, of course, is in Google Search. We have radically modified how it works.”
“Gemini can maintain a personalized and interactive conversation, mixing and combining inputs and outputs,” Pichai explains about the humanization of interaction with the search engine, which is no longer linear (successive queries and responses) but emulates a personal exchange. These are skills the company already presented with its agents last April at Google Next in Las Vegas, where robots that plan and execute actions on behalf of the user were launched. “They are intelligent systems that show reasoning, planning and memory. They are able to think several steps ahead and work across programs and systems or do something on behalf of the user and, more importantly, with their supervision. We are thinking a lot about how to do it in a way that is private, safe and works for everyone,” said the executive in response to the ethical risks identified by the company’s own research group (DeepMind).
The conventional search engine, which returns web pages more or less related to the user’s query, becomes a thing of the past with Gemini. Liz Reid, head of Google Search, says that, although the tool has been “incredibly powerful,” it requires “a lot of work”: fine-tuning the descriptors and sifting the relevant information out of the thousands of results returned. “Searching has been one question after another,” she admits.
The new skills, as she explains, understand “what you really have in mind,” contextualize, know where you are interacting from, and “reason” to offer a result that combines findings from various domains and presents a plan and advice. She illustrates with a practical example: while the traditional search engine could be asked about restaurants in the area, thanks to Gemini’s AI Overview you can now request “a place to celebrate an anniversary,” and the search engine offers different categories of plans, prices, locations and suggestions. It can also assemble a complex travel itinerary for a family whose members have different interests. “Google can do the brainstorming for you,” Reid points out.
But Gemini goes beyond conversation, reasoning and planning, which already represent a radical advance. The next step is the greatest possible humanization: in addition to hearing, it acquires another fundamental sense, sight. Demis Hassabis, director of DeepMind, explains: “We always wanted to build a universal agent that would be useful in everyday life. That’s why we made Gemini multimodal from the beginning. We are now processing a different stream of sensory information. These agents can better see and hear what we do, understand the context in which we find ourselves, and respond quickly in the conversation, making the pace and quality of the interaction much more natural.”
Hassabis demonstrates these skills, which will be available in the Live app for Advanced plan subscribers, in a single-take video recorded in real time. The assistant uses the phone’s camera to register the real context of a user who asks it what it sees, what a specific part of an object she points to is called, how to solve a mathematical problem written on paper, and how to improve a data-distribution process sketched on a whiteboard. Finally, she asks: “Where did you leave my glasses?” Gemini, which has recorded everything it has seen during the interaction, even when not relevant to the conversation so far, reviews the perceived images and answers exactly where it saw them. From there, the demonstration continues through glasses running Gemini.
“Gemini is much more than a chatbot. It is designed to be your personal assistant,” explains Sissie Hsiao, vice president of Google and general manager of Gemini, in reference to the Astra project led by her colleague Hassabis. It is what Sam Altman, head of OpenAI, a Google competitor and developer of the comparable GPT-4o, calls a “super-competent colleague.”
The responses are personalized (you can choose among 10 voices, and the system adjusts to the user’s speech pattern) and intuitive, sustaining a real back-and-forth conversation with the model. “Gemini is able to provide information more succinctly and respond in a more conversational manner than, for example, when interacting only with text,” Hsiao specifies.
There has also been progress in power, not only with new hardware, such as Google's own processors (the Axion chip and the Trillium TPU), but also in input capacity. Gemini 1.5 Pro subscribers will be able to manage up to one million tokens, which Hsiao describes as “the largest context window.” A token is the basic unit of information: a word, number, symbol or any other individual element that makes up part of the program’s input or output. With this capacity, Gemini can upload and analyze a PDF of up to 1,500 pages, 30,000 lines of code or an hour of video, or review and summarize multiple files. Google hopes to eventually offer a window of two million tokens.
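To make these figures concrete, here is a minimal sketch of how a context window can be reasoned about. It does not use Gemini's actual tokenizer; it relies on the common rule of thumb that one token corresponds to roughly three quarters of an English word, and assumes an average of 500 words per PDF page — both numbers are illustrative assumptions, not Google specifications.

```python
def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate: ~0.75 English words per token (rule of thumb)."""
    return round(word_count / words_per_token)

def fits_in_window(word_count: int, window: int = 1_000_000) -> bool:
    """Check whether a text of `word_count` words fits in a token window."""
    return estimate_tokens(word_count) <= window

# Hypothetical 1,500-page PDF at an assumed 500 words per page:
pdf_words = 1_500 * 500          # 750,000 words
print(estimate_tokens(pdf_words))   # ≈ 1,000,000 tokens
print(fits_in_window(pdf_words))    # just fits in a 1M-token window
```

Under these assumptions, a 1,500-page document lands at about one million tokens, which is consistent with the capacity the article describes for Gemini 1.5 Pro.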
To bring these skills to devices with less capacity, such as mobile phones, Google has updated its systems for those terminals and developed Gemini 1.5 Flash, a high-performance model that offers speed, efficiency and lower consumption.
And although they have not been the main development of this edition of Google I/O, Google has also presented improvements in its generative programs for images, with version 3 of Imagen; for video creation, with Veo; and for music, with Lyria, alongside the SynthID watermarking system. Ask Photos, a search tool that will begin operating in the summer, will be able to locate and group images by theme at the user’s request and create an album with all the related images.
You can follow EL PAÍS Technology on Facebook and X, or sign up here to receive our weekly newsletter.