ChatGPT Breaks Boundaries, From Text to Voice and Vision

Until now, ChatGPT has relied primarily on text-based prompts to generate essays, poems, and summaries. The new update lets users converse with the chatbot through voice and images as well.

This evolution marks a significant milestone for generative AI, as OpenAI combines the capabilities of voice-based assistants with its powerful large language models (LLMs). For example, users can ask ChatGPT to create an impromptu bedtime story using voice commands, guiding the narrative with spoken prompts. Alternatively, users can ask questions, and ChatGPT will respond in spoken language.

In addition, while traveling, users can snap a picture of a landmark and have a live conversation with ChatGPT about its significance.

Image-based queries

ChatGPT users can now upload pictures and ask the chatbot to explain an image's contents or to provide instructions for achieving a particular goal.

This feature enables users to seek assistance with various tasks, such as figuring out why their grill isn't working, assessing the items in the refrigerator to plan a meal, or analyzing intricate graphs related to work data. If users want to highlight a particular area of the image, they can utilize the drawing tool in the mobile app.

Image understanding is powered by multimodal versions of GPT-3.5 and GPT-4. These models apply their language processing capabilities to a wide range of visual content, including photographs, screenshots, and documents containing both text and images.

Collaborating with Spotify and professional voice actors

OpenAI has introduced a voice feature powered by a novel text-to-speech model capable of generating human-like voices from text and a short sample of spoken language. To create this feature, OpenAI collaborated with professional voice actors to develop five distinct voices. The company has also used its open-source Whisper speech recognition system to convert spoken words into text.

Spotify has also joined as a launch partner: it is using the voice technology to pilot a feature that lets podcasters translate their shows from English into Spanish, French, or German while keeping their own voice.

OpenAI acknowledges that this new voice technology, capable of creating realistic synthetic voices from brief real speech samples, opens up numerous creative and accessibility-oriented applications. However, the company is transparent about the limitations of its models and discourages the use of ChatGPT for high-risk tasks without proper verification.

The company also notes that the model performs best with English text and is less proficient in other languages, especially those with non-Latin scripts.

The rollout of these new features will begin in the next two weeks for paying Plus and Enterprise subscribers. Initially, voice capabilities will be available on an opt-in beta basis in the ChatGPT Android and iOS apps, while image input will be available by default on all platforms.

OpenAI recently unveiled ChatGPT Enterprise, a business-oriented version of its AI-driven chatbot.