1/4 from the “… and a nice rug” series | DALL·E output | image author’s own

I Used The New ChatGPT Vision Feature, and Nothing Will Ever Be The Same Again

Artificial intelligence (AI) just got another upgrade, and it’s ahead of schedule

--

In February this year, Sam Altman proposed that we might need to start thinking about an update to Moore’s law: “the amount of intelligence in the universe doubles every 18 months”. That proposal came before the release of GPT-4, and this week — just 7 months later — we find ourselves through the looking glass once more.

For many nascent users of AI (that is, people who might use AI as a cooler version of a traditional search engine), being able to prompt with images certainly adds an element of fun to interactions with large language models (LLMs). But the paradigm shift this marks is so profound that it’s worth understanding how you can partner with ChatGPT to completely change the way you learn, work, and create … right now.

In this article I’m using the latest version of DALL·E (with credits) and a subscription to ChatGPT Plus (GPT-4 Vision, Advanced Data Analysis beta).

Using GPT-4 to explain a complex image

When it comes to expediting your learning and productivity, there is no greater tool than an AI assistant. Gone are the days of tirelessly navigating diagrams and flowcharts to try to root out the key message, whether academically, professionally, or personally. I’ve created a couple of examples below that show just how game-changing this feature really is.

First, using the tool as an educational assistant. You’ll see below that I’ve written a prompt letting the AI know that I’m (hypothetically) in high school, and I’ve given it an image that I’m struggling with.

ChatGPT output | image author’s own

It identifies the image and provides a short introduction. It’s a well-labelled and fairly simple image, so a good start. We then get a detailed summary explaining, at an appropriate level, what each element of the diagram means.

I’m getting horrible flashbacks to my master’s | ChatGPT output | image author’s own

I notice that it’s actually missed the “nucleus”, but no bother; we’re all friends here, and the AI is happy to take feedback.

ChatGPT output | image author’s own

By starting out with a clear prompt (e.g. “I’m in high school”, “I’m a PhD student”, “please explain in no more than three sentences”), you’ll get more effective responses, and you’ll be able to tailor the conversation to make it work for you.

But what if we look at something more complicated that requires significantly more textual and logical analysis? What about something like the notoriously intimidating Tokyo Metro System?

Terrifying | ChatGPT output | image author’s own

It’s obviously a tough image for a Tokyo newbie to decipher quickly, but it’s a walk through the cherry blossom for our AI friend. Not only can it easily identify the image, but it can help us navigate it too. Let’s ask for some help getting around.

ChatGPT output | image author’s own

Okay, this could be really helpful. I’ve not had any time to review cool spots though, so I’d also appreciate insight when it comes to the itinerary. Let’s see how it makes recommendations when given some specific criteria.

ChatGPT output | image author’s own

The Imperial Palace East Gardens sound nice.

So does Cat Street.

Using GPT-4 to identify an artistic style using AI-generated images as the input

It’s one thing to work with an AI as, in effect, an improved search engine; indeed, in the examples above I’ve used ChatGPT to help me do something I could certainly have done myself had I not been so lazy.

However, when we use the tool to produce novel information and creative suggestions, it feels like something altogether more impressive (if a little frightening sometimes).

Let’s look at a completely different example. To begin with, I used the ‘creative co-pilot’ DALL·E to generate images without including stylistic prompts. I kept the prompt simple: “a room that’s never been seen before, with a sofa, a window, a houseplant, and a nice rug, digital art”.

2/4 from the “… and a nice rug” series | DALL·E output | image author’s own

Next, I generated variations of the image without changing the prompt. I selected two images that most closely shared the style.

3/4 from the “… and a nice rug” series | DALL·E output | image author’s own
4/4 from the “… and a nice rug” series | DALL·E output | image author’s own

Doesn’t that look lovely.

Next, I took the images and uploaded them to ChatGPT with a prompt:

ChatGPT output | image author’s own

… and here is the response:

ChatGPT output | image author’s own

Okay, this makes sense. But the next part is where it gets particularly interesting (and helpful). As a starting point, the AI will talk to you in a relatively straightforward way, unless you’ve previously provided information about yourself to set expectations, e.g. “talk to me like I’m 12 years old”.

The initial response was great (it did what I asked it to do), but sometimes we don’t want the straightforward answer… we want the fancy, intimidating one.

ChatGPT output | image author’s own

“… a tabula rasa if you will.” Sublime!

Finally, just for fun:

ChatGPT output | image author’s own

Yeah, now you’re speaking my language.

Using GPT-4 to create code based on an input image

Now that we’ve finished messing about with discerningly restrained neoclassical modernist architecture, we can look at the use case that most radically challenges how we approach professions like software engineering.

I’ve not written code in a long time (though I promise I can! See my other work on Medium); these days I overwhelmingly don’t need to. Let’s look at this final example of how an image prompt can fast-track the programming process.

To start, I uploaded an image of a colour (specifically #FF91AF Baker-Miller pink, my favourite colour) and provided a prompt with specific instructions*. I’ve intentionally declared that we need to be able to run the code in a notebook to limit the number of libraries available to work with. It’s a challenging task!

ChatGPT output | image author’s own

It tells us the steps it’s going to take to complete the task, and starts by accurately extracting the colour from the image. It explains what the code will try to achieve, and gives us the code.
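For context, the colour-extraction step on its own might look something like the minimal sketch below. This is my own illustration rather than ChatGPT’s actual code, and it assumes the uploaded swatch has been saved locally as swatch.png with Pillow installed:

# Sample the centre pixel of a flat colour swatch and print its hex code
from PIL import Image

img = Image.open("swatch.png").convert("RGB")
r, g, b = img.getpixel((img.width // 2, img.height // 2))
print(f"#{r:02X}{g:02X}{b:02X}")  # expect #FF91AF for Baker-Miller pink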

ChatGPT output | image author’s own

Here’s the code as provided by ChatGPT (I’ve added only the IPython import it needs to run in a fresh notebook):

# Inlining the main color directly into the HTML/CSS for the dog representation
from IPython.display import display, HTML  # import added so this runs standalone

dog_html = """
<div style="
    width: 200px;
    height: 200px;
    background-color: #ff91af;
    border-radius: 50%;
    position: relative;
    margin: 40px auto;
">
    <!-- Ears of the dog -->
    <div style="
        width: 70px;
        height: 100px;
        background-color: #ff91af;
        clip-path: polygon(50% 0%, 0% 100%, 100% 100%);
        position: absolute;
        top: -30px;
        left: 0;
    "></div>
    <div style="
        width: 70px;
        height: 100px;
        background-color: #ff91af;
        clip-path: polygon(50% 0%, 0% 100%, 100% 100%);
        position: absolute;
        top: -30px;
        right: 0;
    "></div>

    <!-- Eyes of the dog -->
    <div style="
        width: 30px;
        height: 40px;
        background-color: #000;
        border-radius: 50%;
        position: absolute;
        top: 70px;
        left: 40px;
    "></div>
    <div style="
        width: 30px;
        height: 40px;
        background-color: #000;
        border-radius: 50%;
        position: absolute;
        top: 70px;
        right: 40px;
    "></div>

    <!-- Nose of the dog -->
    <div style="
        width: 30px;
        height: 20px;
        background-color: #000;
        border-radius: 50%;
        position: absolute;
        top: 100px;
        left: 85px;
    "></div>
</div>
"""

# Display the stylized dog face in the notebook
display(HTML(dog_html))

I ran this in the notebook, and here is the haunting image of the dog that it creates. I won’t show you the cat/peacock hybrid monster it created in my first attempt. Nightmare fuel.

image (unfortunately) author’s own

Now, this is clearly a very straightforward and controlled example. To replicate a dashboard, a website, or another front-end, we’d need fewer environmental constraints (blame my current set-up), but it is possible.

When I asked ChatGPT afterwards what it would ideally want access to in order to complete the dog task more effectively, it pointed out that it would like more flexibility with HTML/CSS (e.g. <canvas> with JavaScript), the option to work with SVG, and/or Python graphics libraries like Pygame, PIL, and Turtle.
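To give a flavour of that SVG route, here’s a minimal sketch of my own (under the same notebook constraint; this is not ChatGPT’s output) drawing the same pink face with vector primitives:

# A rough SVG counterpart to the HTML/CSS dog (my own illustration)
from IPython.display import display, SVG

dog_svg = """
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
    <circle cx="100" cy="100" r="90" fill="#FF91AF"/>        <!-- head -->
    <ellipse cx="65" cy="85" rx="12" ry="16" fill="#000"/>    <!-- left eye -->
    <ellipse cx="135" cy="85" rx="12" ry="16" fill="#000"/>   <!-- right eye -->
    <ellipse cx="100" cy="125" rx="15" ry="10" fill="#000"/>  <!-- nose -->
</svg>
"""
display(SVG(dog_svg))

Shapes like these are arguably easier for a model to position correctly than absolutely positioned divs, which is presumably why it asked.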

If you’ve the time, I’d encourage you to experiment with the new feature in ChatGPT and give DALL·E a whirl too. It’s a great time to learn more about AI and make it a partner in your post-digital world.

*Side note: why do we tell large language models (LLMs) to “think step by step” or “take a deep breath”? Well, a study by researchers at Google DeepMind found that specific encouragements like these can improve the accuracy of a model’s responses by nudging it towards better-“reasoned” answers. For a good summary of why this works, check out this article: Telling AI model to “take a deep breath” causes math scores to soar in study | Ars Technica
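If you want to try this via the API, the idea is simply to prepend the encouragement to an ordinary prompt. A minimal sketch, assuming the openai Python library as it stood at the time of writing and a placeholder API key:

# Prepend the DeepMind-discovered encouragement to a normal question
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Take a deep breath and work on this problem step-by-step: "
                   "If a train leaves at 15:40 and the journey takes 2 hours "
                   "and 35 minutes, what time does it arrive?",
    }],
)
print(response.choices[0].message.content)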

--

Niamh Kingsley

Passionate about technology, AI, & neuroscience. You can generally find me @nifereum, @niamhkingsley or connect via https://www.linkedin.com/in/niamhkingsley