Hi everyone 👋! In this article, we will be generating our own art using an AI. You can refer to my YouTube video for a working tutorial and a step-by-step guide of the same.
Table of contents
- Link to Google Colaboratory
Introduction
Over the past couple of years, there has been a lot of research and development on AI models that can generate images from a given text prompt. You can think of such a model as a personal artist who tries to create an artwork by following the very words of your instruction. Now, who wouldn't want a personal Picasso? As that's impossible, we can settle for the next best thing: an AI Picasso 😃. Tag along as we unpack this unique family of models, in terms of what they are made of and why they work. Next, we will look at some examples of images that I generated, and finally conclude by shedding some light on how you can create images of your own!
Theory — what is it?
For this article, we are going to use the CLIP + {GAN} family of models, where {GAN} can be any GAN variant. At the very heart of these models is a GAN, i.e. a Generative Adversarial Network. It is a combination of two sub-modules: (1) a generator, which tries to create a realistic artificial image, and (2) a discriminator, which tries to distinguish between artificial/fake and real images. While training a GAN, we train both sub-modules together, and over time each becomes exceptionally good at its job. For us, the generator is the interesting piece, as it has now learned to create very realistic images! A minimal sketch of the two sub-modules is shown below.
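To make the two sub-modules concrete, here is a minimal sketch in PyTorch. The layer sizes and the 64x64 image shape are illustrative choices for this article, not taken from any particular GAN paper.

```python
import torch
import torch.nn as nn

latent_dim = 128          # size of the random noise vector
img_dim = 3 * 64 * 64     # a flattened 64x64 RGB image (illustrative)

# (1) Generator: maps random noise to an artificial image.
generator = nn.Sequential(
    nn.Linear(latent_dim, 512),
    nn.ReLU(),
    nn.Linear(512, img_dim),
    nn.Tanh(),            # pixel values in [-1, 1]
)

# (2) Discriminator: maps an image to a single real-vs-fake logit.
discriminator = nn.Sequential(
    nn.Linear(img_dim, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, 1),
)

# One generator step: try to make the discriminator label the
# fake images as real (target label = 1).
z = torch.randn(16, latent_dim)
fake_logits = discriminator(generator(z))
g_loss = nn.BCEWithLogitsLoss()(fake_logits, torch.ones_like(fake_logits))
g_loss.backward()   # this gradient is the "feedback" the generator learns from
```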
The basic architecture of a GAN. Taken from "GAN tutorials [..]" by Razvan Marinescu. Source: Link
Over the years, there has been a lot of work on improving the performance of GANs. The architecture shared above is very generic and can be considered a very basic GAN. Recently, people have developed much better architectures, some of which are listed below (a detailed comparison of different types of GANs, with code, can be found here):
- ProGAN: the generator and discriminator are first trained on low-resolution images, and then new layers are added incrementally to handle higher resolutions.
- BigGAN: it mostly follows the architecture of SAGAN, which incorporates attention layers. As the name suggests, it is a bigger version of SAGAN, with twice the channel size and 8 times the batch size.
- VQGAN: considered the state of the art (SotA) at the time of writing, it combines transformers (widely tested in the text domain) with CNNs (widely used in the image domain). The final model has more expressive power locally (due to the CNN) and globally (due to the transformers).
The next important component is the CLIP model which, given a set of captions, finds the best matching caption for an image. For this, it has a sub-module that computes the similarity between an image and a text description; after exhaustive training, this module has a lot of expressive power. The snippet below shows this caption-matching in action.
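As a quick illustration, this is roughly how you would score an image against a few candidate captions with OpenAI's open-source clip package (installable via `pip install git+https://github.com/openai/CLIP`). The image path and captions below are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate captions
image = preprocess(Image.open("my_art.png")).unsqueeze(0).to(device)
captions = ["a unicorn wearing a black armor", "a bowl of fruit", "a city at night"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # logits_per_image[0][i] = similarity of the image to caption i
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{prob:.3f}  {caption}")
```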
The image and text encoders along with the similarity comparison in CLIP (for a batch of data). Source: Link
The final image-generation solution can be created by combining CLIP and a GAN (you can pick one from above) in this manner: (1) use the GAN to generate an image, (2) use CLIP to find the similarity between the text prompt and the image, (3) over multiple steps, tune the combination (typically the latent code fed to the GAN) to maximize the similarity score generated by CLIP. In this way, we can think of CLIP as a teacher who gives the student (the GAN) homework to draw an image, where the only hint from the teacher is a textual description. The student first returns with a very bad drawing, on which the teacher provides some feedback. The student goes back and later returns with a better version. These interactions (iterations) are nothing but the passing and processing of feedback between the teacher (CLIP) and the student (GAN). Over time, the images become better, and we can stop when we think the results are good enough. A rough sketch of this optimization loop follows.
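Below is a rough sketch of that loop in PyTorch. A tiny placeholder network stands in for a real pretrained generator (e.g. VQGAN's decoder); the resolution, learning rate, and step count are arbitrary, and CLIP's usual input normalization is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()   # keep everything in fp32 for backprop

# Placeholder generator; in practice this would be a pretrained
# GAN generator (e.g. VQGAN), kept frozen.
latent_dim = 128
generator = nn.Sequential(nn.Linear(latent_dim, 3 * 224 * 224), nn.Sigmoid()).to(device)
for p in generator.parameters():
    p.requires_grad_(False)

# The teacher's "hint": encode the text prompt once.
prompt = clip.tokenize(["a unicorn wearing a black armor | 4k"]).to(device)
with torch.no_grad():
    text_feat = F.normalize(clip_model.encode_text(prompt), dim=-1)

# We optimize only the latent code, i.e. the student's "drawing".
latent = torch.randn(1, latent_dim, device=device, requires_grad=True)
optimizer = torch.optim.Adam([latent], lr=0.05)

for step in range(300):
    image = generator(latent).view(1, 3, 224, 224)             # (1) generate
    img_feat = F.normalize(clip_model.encode_image(image), dim=-1)
    loss = -(img_feat * text_feat).sum()                       # (2) -cosine similarity
    optimizer.zero_grad()
    loss.backward()                                            # teacher's feedback
    optimizer.step()                                           # (3) student improves
```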
Examples
We will follow a bottom-up approach: first we will look at the images I generated, and later talk about the practical aspects. While the variety of images we can generate is infinite (as it depends on the text prompts), for this article I tried to generate images across three broad areas: (1) paintings, (2) posters, and (3) entities. For each of these types, I added a type-specific suffix to the prompts to convey my intention to the model (a small helper for composing such prompts is sketched after the examples). We will discuss the text prompts in detail later, so enough talk for now; let us look at some of the images that can be generated using nothing but a Google Colab notebook and text prompts:
- Prompt for this piece: “two guys with what looks like some kind of supernatural power; game poster; trending in artstation” 🕹️
- Prompt for this piece: “Shiva the destroyer god | 4k | trending in artstation” 🕉️
- Prompt for this piece: “a unicorn wearing a black armor | 4k | deviantart | artstation” That's trippy 🦄
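To give an idea of how the type-specific suffixes were composed, here is a tiny, purely illustrative helper. The function name and suffix table are mine; the suffix strings come from the example prompts above.

```python
# Illustrative helper for composing prompts; the suffix strings
# are taken from the example prompts above.
SUFFIXES = {
    "poster": "; game poster; trending in artstation",
    "entity": "| 4k | trending in artstation",
}

def build_prompt(subject: str, kind: str) -> str:
    """Attach a type-specific suffix to steer the model's style."""
    return f"{subject} {SUFFIXES[kind]}"

print(build_prompt("Shiva the destroyer god", "entity"))
# -> Shiva the destroyer god | 4k | trending in artstation
```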
Alright, guys! I hope this article was helpful for you; if it was, leave your comments below. I will meet you in another article. Until then, KEEP CODING 🤗.