Can ChatGPT “see”?

Not only can ChatGPT create images, it also appears to understand them, correctly identifying the objects they contain

Will Mayall
4 min read · Jul 18, 2023

In March 2023, Microsoft researchers released a paper titled “Sparks of Artificial General Intelligence: Early experiments with GPT-4”. The paper is a captivating exploration of GPT-4’s abilities. Intriguingly, the authors found that GPT-4 could draw images by writing graphics code, despite not being specifically trained for this task. Not only could it construct an image, but it also seemed able to “perceive” and interpret one. This piqued my curiosity. I set out to see whether I could use ChatGPT-4 to create a basic illustration, and then whether it could accurately identify the components of that picture. The results were impressive. Note that the images in this article are described in SVG (Scalable Vector Graphics), a text-based format for vector images that modern web browsers can display directly.

Creating the image

The first step was to ask ChatGPT-4 to generate an image: a scene featuring mountains.

Original image created by ChatGPT via the request: “Use SVG to create an image of a mountain under a cloud-filled sky. Make the sky look like a sunset. Put snow on the mountain. Put foothills in front of the mountain. Put a rolling green meadow between a road and the foothills and mountain.”

The results generated by ChatGPT-4 are genuinely striking. But how did it create this image? Its capabilities stem from its training data, or “corpus.” What the model learns from that corpus functions as its memory. Much like human memory and unlike conventional computer storage, a Large Language Model (LLM) does not retain a precise copy of its inputs. Instead, it builds a generalized, compressed “idea” of what it learns, picking up context and relationships in the process. For instance, it might learn that a sidewalk often accompanies a road.

While the exact image crafted by ChatGPT may not exist within its corpus, the model has probably learned about the objects in the image and how to express them in SVG code. Although it has likely encountered image code describing some or all of the objects in the picture, it doesn’t store direct copies of this code. Instead, it internalizes the concept of each object. For instance, it appears to have learned that mountains are often depicted as brown triangles, with white, triangular snowcaps positioned higher up.
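To make the “brown triangle with a white snowcap” idea concrete, here is a minimal sketch in Python that writes such an SVG by hand. The function name, coordinates, and color names are my own illustrative assumptions; this is not the code ChatGPT produced.

```python
# Hypothetical illustration: the coordinates and colors are made up,
# not taken from ChatGPT's output.
def mountain_svg(width=400, height=300):
    peak_x, peak_y = width // 2, 60
    base_y = height - 60

    sky = f'<rect width="{width}" height="{height}" fill="lightskyblue"/>'
    # The mountain is a single brown triangle: peak, left base, right base.
    mountain = (
        f'<polygon points="{peak_x},{peak_y} 60,{base_y} {width - 60},{base_y}" '
        f'fill="saddlebrown"/>'
    )
    # The snowcap is a smaller white triangle that shares the peak.
    snowcap = (
        f'<polygon points="{peak_x},{peak_y} {peak_x - 35},{peak_y + 45} '
        f'{peak_x + 35},{peak_y + 45}" fill="white"/>'
    )
    # SVG paints elements in document order, so the snowcap comes last
    # and is drawn on top of the mountain.
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f"{sky}{mountain}{snowcap}</svg>"
    )


svg = mountain_svg()
```

Saving `svg` to a file with an `.svg` extension and opening it in a browser should show the scene. Nothing in the markup says “mountain” or “snow”; the meaning comes only from shape, color, and position.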

Some elements of the composition, such as the green meadow or the white clouds, are likely represented frequently within the corpus. Other aspects, like the precise positioning and layering of elements, suggest a deeper grasp of the depicted objects. The overall composition hints at an integrated understanding of all components involved.

The final image is almost certainly unique. Like a human, ChatGPT rarely repeats itself; multiple attempts at generating the image produced markedly different outputs each time. I selected my favorite among them. Although the corpus probably contains a description of each element of the original image, the overall composition remains noteworthy. It’s unlikely that any existing SVG code features this exact image, though it’s not wholly outside the realm of possibility. After extensive online searching, I found very few SVG landscape images. My conclusion is that SVG is seldom used for such images and that this image is very likely unique.

“Seeing” the image

In a separate dialogue, I prompted ChatGPT to assign labels to the objects represented in the raw SVG code. There were no explicit “clues” embedded in the code to denote what the image contained. It’s important to note that, because the new dialogue shared no context with the one that created the image, ChatGPT had never “encountered” the image before being asked to label it.
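One way to check that SVG code carries no textual clues is to scan it for comments, descriptive elements, and naming attributes. The sketch below is a hypothetical helper; which tags and attributes count as “clues” is my own assumption, and it is not the procedure used in this experiment.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: these are the places where a human-readable hint could hide.
CLUE_TAGS = {"title", "desc", "text", "metadata"}
CLUE_ATTRS = {"id", "class", "aria-label"}


def find_clues(svg_source):
    """Return a list of places where the SVG names or describes its contents."""
    clues = []
    # XML comments are dropped by the default parser, so look for them in the raw text.
    if re.search(r"<!--.*?-->", svg_source, flags=re.DOTALL):
        clues.append("comment")
    root = ET.fromstring(svg_source)
    for element in root.iter():
        tag = element.tag.split("}")[-1]  # drop the XML namespace, if any
        if tag in CLUE_TAGS:
            clues.append(f"<{tag}> element")
        for name in element.attrib:
            if name in CLUE_ATTRS:
                clues.append(f"{name} attribute on <{tag}>")
    return clues


clean = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<polygon points="200,60 60,240 340,240" fill="saddlebrown"/></svg>'
)
labeled = (
    '<svg xmlns="http://www.w3.org/2000/svg"><!-- mountain -->'
    '<polygon id="mountain" points="200,60 60,240 340,240" fill="saddlebrown"/></svg>'
)

print(find_clues(clean))    # no clues found
print(find_clues(labeled))  # reports the comment and the id attribute
```

Fill colors are deliberately left alone here, since color is part of the image itself rather than a label attached to it.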

The image with ChatGPT’s labels as a result of the request: “Using SVG, copy the below image [see above] and add labels for each object and area: [SVG code]”

The results were intriguing. The image’s shapes are simple and could represent many things. So how did the AI determine that one triangle represents a mountain and another a snowcap?
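To see how little the raw code offers, consider what can be read from each shape directly: a fill color, a vertex count, and a position. The Python sketch below lists exactly that for every polygon. The helper name and the sample SVG are hypothetical, and the parsing assumes points are written as space-separated `x,y` pairs.

```python
import xml.etree.ElementTree as ET


def describe_polygons(svg_source):
    """List the fill, vertex count, and vertical extent of each polygon."""
    root = ET.fromstring(svg_source)
    rows = []
    for element in root.iter():
        if element.tag.split("}")[-1] != "polygon":
            continue
        # Assumption: points look like "x1,y1 x2,y2 x3,y3".
        points = [
            tuple(float(value) for value in pair.split(","))
            for pair in element.get("points").split()
        ]
        ys = [y for _, y in points]
        rows.append(
            {
                "fill": element.get("fill"),
                "vertices": len(points),
                "top": min(ys),      # smaller y is higher up in SVG
                "bottom": max(ys),
            }
        )
    return rows


# Hypothetical two-triangle scene, not ChatGPT's actual output.
example = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="400" height="300">'
    '<polygon points="200,60 60,240 340,240" fill="saddlebrown"/>'
    '<polygon points="200,60 165,105 235,105" fill="white"/>'
    "</svg>"
)

for row in describe_polygons(example):
    print(row)
```

Both shapes are plain triangles. Only the colors and the fact that the white one sits at the top of the brown one hint at “snow on a mountain,” which is the kind of contextual inference the labeling task requires.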

When asked to describe the image, ChatGPT explained:

The image, in general, appears to represent an abstract landscape, encompassing elements like mountains, terrain, and a gray surface that could be a road.

These findings suggest that the AI grasps the image as a whole, and in particular the context in which each shape appears. This can be compared to a visually impaired person describing an image based on a verbal account of the objects in it.

In conclusion, the capabilities of GPT-4, as shown in this experiment, are impressive and surprising. It’s not just that GPT-4 can generate SVG code and create an image; what’s striking is the apparent “understanding” and context it applies when interpreting the elements of the image. Although this is not visual perception as humans experience it, the AI’s ability to infer the meaning of simple shapes and their arrangement in a given context opens up exciting possibilities. While we should tread carefully in interpreting them, these findings underscore the potential of AI systems like GPT-4 to redefine the boundaries of machine capabilities in unexpected ways.
