I Used The New ChatGPT Vision Feature, and Nothing Will Ever Be The Same Again
Artificial intelligence (AI) just got another upgrade, and it’s ahead of schedule
In February this year, Sam Altman proposed that we might need to start thinking about an update to Moore’s law: “the amount of intelligence in the universe doubles every 18 months”. That proposal came before the release of GPT-4, and this week — just 7 months later — we find ourselves through the looking glass once more.
For many newer users of AI (that is, people who might treat AI as a cooler version of a traditional search engine), being able to prompt with images certainly adds an element of fun to interactions with large language models (LLMs). But the paradigm shift this marks is so profound that it’s worth understanding how you can partner with ChatGPT to completely change the way you learn, work, and create … right now.
In this article I’m using the latest versions of DALL·E with credits, and a subscription to ChatGPT Plus (GPT-4 Vision, Advanced Data Analysis Beta).
Using GPT-4 to explain a complex image
When it comes to expediting your learning and productivity, few tools rival an AI assistant. Gone are the days of tirelessly navigating diagrams and flowcharts to root out the key message, whether academic, professional, or personal. I’ve created a couple of examples below that show just how game-changing this feature really is.
First, let’s use the tool as an educational assistant. You’ll see below that I’ve written a prompt letting the AI know that I’m (hypothetically) in high school, and I’ve given it an image that I’m struggling with.
It identifies the image and provides an introduction. It’s a well-labelled and fairly simple image, so a good start. We then get a detailed summary explaining, at an appropriate level, what each element of the diagram means.
I notice that it’s actually missed the “nucleus”, but no bother; we’re all friends here, and the AI is happy to take feedback.
By starting out with a clear prompt (e.g. “I’m in high school”, “I’m a PhD student”, “please explain in no more than three sentences”), you’ll get more effective responses, and you’ll be able to tailor the conversation to make it work for you.
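If you find yourself reusing the same framing in prompt after prompt, it can help to template it. Here’s a minimal sketch (the helper function and wording below are my own illustration, not a ChatGPT feature) that bakes the audience and length constraints into every prompt:

```python
def build_prompt(audience: str, max_sentences: int, question: str) -> str:
    """Prefix a question with audience and length constraints."""
    return (
        f"I'm {audience}. Please explain in no more than "
        f"{max_sentences} sentences: {question}"
    )

print(build_prompt("in high school", 3, "What does this diagram show?"))
# → I'm in high school. Please explain in no more than 3 sentences: What does this diagram show?
```

Once the framing lives in one place, you can swap “in high school” for “a PhD student” without rewriting the rest of your prompt.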
But what if we look at something more complicated that requires significantly more textual and logical analysis? What about something like the notoriously intimidating Tokyo Metro System?
It’s obviously a tough image for a Tokyo newbie to decipher quickly, but it’s a walk through the cherry blossom for our AI friend. Not only can it easily identify the image, but it can help us navigate it too. Let’s ask for some help getting around.
Okay, this could be really helpful. I’ve not had any time to review cool spots though, so I’d also appreciate insight when it comes to the itinerary. Let’s see how it makes recommendations when given some specific criteria.
The Imperial Palace East Gardens sound nice.
So does Cat Street.
Using GPT-4 to identify an artistic style using AI-generated images as the input
It’s one thing to work with an AI as, effectively, an improved search engine; indeed, in the examples above I’ve used ChatGPT to help me do something I could definitely have done myself had I not been so lazy.
However, when we use the tool to produce novel information and creative suggestions, it feels like something altogether more impressive (if a little frightening sometimes).
Let’s look at a completely different example. To begin with, I used the ‘creative co-pilot’ DALL·E to generate images without including stylistic prompts. I kept the prompt simple: “a room that’s never been seen before, with a sofa, a window, a houseplant, and a nice rug, digital art”.
Next, I generated variations of the image without changing the prompt. I selected two images that most closely shared the style.
Doesn’t that look lovely?
Next, I took the images and uploaded them to ChatGPT with a prompt:
… and here is the response:
Okay, this makes sense. But the next part is where it gets particularly interesting (and helpful). As a starting point, the AI will talk to you in a relatively straightforward way, unless you’ve previously provided information about yourself to set expectations, e.g. “talk to me like I’m 12 years old”.
The initial response was great (it did what I asked it to do), but sometimes we don’t want the straightforward answer… we want the fancy, intimidating one.
“… a tabula rasa if you will.” Sublime!
Finally, just for fun:
Yeah, now you’re speaking my language.
Using GPT-4 to create code based on an input image
Now that we’ve finished messing about with discerningly restrained neoclassical modernist architecture, we can look at the use case that most radically challenges how we approach professions like software engineering.
I’ve not written code in a long time (though I promise I can! See my other work on Medium), and now I overwhelmingly don’t need to. Let’s look at this final example of how an image prompt can fast-track the programming process.
To start, I uploaded an image of a colour (specifically #FF91AF Baker-Miller pink, my favourite colour) and provided a prompt with specific instructions*. I’ve intentionally declared that we need to be able to run the code in a notebook to limit the number of libraries available to work with. It’s a challenging task!
It tells us the steps it’s going to take to complete the task, and starts by accurately extracting the colour from the image. It explains what the code will try to achieve, and gives us the code.
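The colour-extraction step it describes boils down to finding the most common RGB value in the image and formatting it as a hex code. Here’s a minimal, hypothetical sketch of that idea (my own illustration, not ChatGPT’s actual code):

```python
from collections import Counter

def dominant_hex(pixels):
    """Return the most frequent (R, G, B) pixel as a hex colour code."""
    r, g, b = Counter(pixels).most_common(1)[0][0]
    return f"#{r:02X}{g:02X}{b:02X}"

# In a notebook you'd feed it real pixel data, e.g. with Pillow:
#   pixels = Image.open("swatch.png").convert("RGB").getdata()
swatch = [(255, 145, 175)] * 99 + [(0, 0, 0)]  # mostly Baker-Miller pink
print(dominant_hex(swatch))  # → #FF91AF
```

For a flat colour swatch like mine, any pixel would do; counting frequencies just makes the sketch robust to photos with a bit of noise.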
Here’s the code (reconstructed from ChatGPT’s output):
# Inlining the main color directly into the HTML/CSS for the dog representation
main_color = "#FF91AF"  # the colour extracted from the image
dog_html = f"""
<div style="position: relative; width: 200px; height: 160px; background: {main_color}; border-radius: 50% 50% 45% 45%; margin: 40px auto;">
  <!-- Ears of the dog -->
  <div style="position: absolute; top: -30px; left: 10px; width: 50px; height: 50px; background: {main_color}; clip-path: polygon(50% 0%, 0% 100%, 100% 100%);"></div>
  <div style="position: absolute; top: -30px; right: 10px; width: 50px; height: 50px; background: {main_color}; clip-path: polygon(50% 0%, 0% 100%, 100% 100%);"></div>
  <!-- Eyes of the dog -->
  <div style="position: absolute; top: 55px; left: 55px; width: 18px; height: 18px; background: #000; border-radius: 50%;"></div>
  <div style="position: absolute; top: 55px; right: 55px; width: 18px; height: 18px; background: #000; border-radius: 50%;"></div>
  <!-- Nose of the dog -->
  <div style="position: absolute; top: 100px; left: 88px; width: 24px; height: 18px; background: #000; border-radius: 50%;"></div>
</div>
"""
# Display the stylized dog face in the notebook
from IPython.display import HTML
HTML(dog_html)
I ran this in the notebook, and here is the haunting image of the dog that it creates. I won’t show you the cat/peacock hybrid monster it created in my first attempt. Nightmare fuel.
Now, this is clearly a very straightforward and controlled example. To replicate a dashboard, a website, or another front-end, we’d need fewer environmental constraints (blame my current set-up), but it is possible.
If you’ve the time, I’d encourage you to experiment with the new feature in ChatGPT and give DALL·E a whirl too. It’s a great time to learn more about AI and make it a partner in your post-digital world.
*Side note: why do we tell large language models (LLMs) to “think step by step” or “take a deep breath”? A study by researchers at Google DeepMind found that specific phrases of encouragement like these can improve the accuracy of a model’s responses by steering it toward better-“reasoned” answers. For a good summary of why this works, check out this article: Telling AI model to “take a deep breath” causes math scores to soar in study | Ars Technica
Hope you enjoyed this foray into ChatGPT Vision. Let me know what you think, and feel free to check out some of my previous work for Towards Data Science below.
Monte Carlo Tree Search (MCTS) AI Gameplay in Swift
Plus setting up Swift in Jupyter Notebooks!