I trained an open-source version of DALL·E with thousands of images of cups to understand the relation between training data, input and output. Then I compared the results to existing models.
You can follow my experiments in this section. There are many possibilities yet to explore. Let’s collect comments, ideas and thoughts together!
There are many different angles (far beyond tech) from which you can judge and question text-to-image models. Some key concepts from multiple disciplines are collected here. Plus you can find helpful links to dive in deeper. I am happy about any additional links or expertise.
an algorithm designing our future
given the market’s near total saturation of our image repertoire, artistic practice can no longer revolve around the construction of objects to be consumed by a passive bystander. Instead, there must be an art of action, interfacing with reality, taking steps – however small – to repair the social bond.
not only every human experience but also every content expressed by means of other semiotic devices can be translated into the terms of verbal language
once you pass the picture of a cup through an algorithm it loses its need for functionality and becomes an aestheticized representation of cultural connotations
participation is important as a project: it rehumanises a society rendered numb and fragmented by the repressive instrumentality of capitalist production.
similarity cannot be a necessary condition for depictions. the drawing is not similar to what it is supposed to depict but it still is a picture
when neural networks are able to generate an image that cannot be distinguished from a human creation, it will be because images created by humans, in their utmost banality and instrumentality, have been transformed into a default aesthetic.
the aesthetic object, aesthetic judgement and aesthetic existence
it is obscured that the translation of a text into images belongs to a long Western theological tradition of making images express a sacred text
they don’t see that the defects, the metamorphoses, the amorphous are so many aesthetic potentialities, that the strange familiarity between human and technical productions is also made of distances and differences constituting an anthropo-technological gray zone: human and technical have always influenced each other, and the imagination will have been the name of their meeting through a material support.
Model name, date:
try.pt, April 2022
Model type:
Dalle-pytorch trained on a dataset of 374 text-image-pairs
Model Parameters:
default settings with pretrained OpenAI VAE and 20 - 60 epochs
Licence:
Not provided; the model is not shared due to its poor output quality
Intended Usecase:
Test training
Intended Users:
none
Product images of cups available on amazon.de. The product descriptions were used as image descriptions.
1. Far more data and a longer training time are required to generate realistic images. Training time can be controlled via the number of epochs, i.e. the number of iterations over the data.
2. The image descriptions need to make sense, i.e. they have to actually describe what is visible in the image.
3. In the repository it is recommended to use the pretrained VQ-VAE (--taming) instead of the one by OpenAI (see the sketch after this list).
4. Most of the cups displayed in the training data can be put into the categories frustration, sexism and cringe. If the training had worked well, the model would have reproduced that. Therefore it is not too bad that it was not able to generate realistic outputs.
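For orientation, a minimal sketch of what such a training run looks like with dalle-pytorch; the dataset path is a placeholder, and the exact flags may differ between repository versions:

    # sketch: test training with default settings
    # (without further flags the script falls back to OpenAI's pretrained VAE)
    python train_dalle.py --image_text_folder ./cups_dataset --epochs 60

As far as I understand the repository, the folder is expected to contain image files and .txt caption files with matching names; --epochs controls the number of iterations over the data, and adding --taming (used for the later models) switches to the pretrained VQ-VAE from taming-transformers.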
Model name, date:
cups.pt, May 2022
Model type:
Dalle-pytorch trained on a dataset of 3445 text-image-pairs
Model Parameters:
default settings with pretrained VQ-VAE (--taming) and 800 epochs
Licence:
Free to use trained model for further experiments
Intended Usecase:
Generation of images of cups from a formal description. Research whether it is possible to generate configurations of attributes the dataset did not contain.
Intended Users:
Researchers, Creatives / Artistic work
Images of cups in a studio environment, from different perspectives and with different fillings.
Description structure cups:
a [shape] cup [direction] with [side]
a [color] cup [pattern|print]
a cup [filling]|[usage] [location]
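To make the slot structure concrete, here is a minimal Python sketch of how such templates expand into captions; the attribute values are invented examples, not the actual vocabulary of the dataset:

    import itertools

    # example slot values (placeholders, not the real dataset vocabulary)
    shapes = ["round", "conical"]                 # [shape]
    directions = ["to the left", "to the right"]  # [direction] of the handle
    colors = ["blue", "green"]                    # [color]
    patterns = ["with stripes", "with dots"]      # [pattern|print]

    captions = []
    for shape, direction in itertools.product(shapes, directions):
        captions.append(f"a {shape} cup {direction}")  # a [shape] cup [direction] ...
    for color, pattern in itertools.product(colors, patterns):
        captions.append(f"a {color} cup {pattern}")    # a [color] cup [pattern|print]

    print(captions)  # 8 generated captions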
1. The more words in the prompt matched an image description, the better the reproduction of that image.
2. The longer the prompt, the more detailed the output.
3. Probably every single word was translated to a token. For example the prompts ‘used as a plant pot’ and ‘used as a pencil holder’ had three tokens in common, so with the prompt containing ‘used as a plant pot’ it sometimes also generated images of pencils (illustrated below).
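The overlap can be illustrated with naive whitespace tokens; note that dalle-pytorch actually uses a BPE tokenizer, so the real token boundaries can differ:

    # shared tokens between the two example prompts
    a = set("used as a plant pot".split())
    b = set("used as a pencil holder".split())
    print(a & b)  # {'used', 'as', 'a'} -> three tokens in common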
1. A cup often seems to be related to football (‘cup’ as in trophy), so here we find different semantic connections. The fonts in the images might originate from memes, stock footage or other images including text on the internet.
2. Images on the internet rarely seem to be described by the direction of the handle or by the cup lying on its side.
3. Adding the word ‘illustration’ or ‘painting’ to the prompt gets misinterpreted as a stylistic feature. This can be reversed by adding ‘a photo of…’ to the prompt.
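The prompt experiments were run through the generate.py script that ships with dalle-pytorch, roughly like this; the checkpoint path and the prompt are placeholders, and as far as I understand, --taming has to match the VAE the model was trained with:

    python generate.py --dalle_path ./cups.pt --taming --text 'a photo of a blue cup with stripes'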
Model name, date:
concept.pt, June 2022
Model type:
Dalle-pytorch trained on a dataset of 2138 text-image-pairs
Model Parameters:
default settings with pretrained VQ-VAE (--taming) and ~1800 epochs
Licence:
Free to use trained model for further experiments
Intended Usecase:
Generation of images of cups and matching mood images. Research whether it is possible to enhance the stylistic features of the cup by combining the cups and the mood images.
Intended Users:
Researchers, Creatives / Artistic work
Images of cups in a studio environment, sorted by their styles + matching mood images.
Description structure cups:
[mood] [shape] [colour] cup [feature] [filling] handle [direction]
Description structure mood images:
[mood] [type/object] [colour]
1. By using the exact image description as a prompt, the output was very realistic and fitted well. This was true both for the images of the cups and for the mood images, which were far fewer and more diverse.
2. As soon as the order of the words was disrupted or more words were added, the output became blurred.
3. The order of the words in the input seems to be important: I tried combining the image descriptions of a cup and a matching mood image into a single prompt. When I entered the description of the cup first, the output resembled a cup more, and vice versa.
4. In some cases it seemed like the colours could be transferred, but with the output still being quite unclear (in some cases aesthetically interesting), it is hard to say whether the goal of enhancing the moods of the cups was achieved.
1. The usage of long and precise prompts provided realistic images that fitted the description well, in the sense of a very literal translation.
2. Because the text description was translated very literally, none of the outputs seems very surprising, ‘original’ or ‘artistic’.
3. Regarding the model I trained myself, I can say that it is not capable of producing anything it has not ‘seen’ before. In that sense it cannot generate a cup distorted in an interesting way to enhance its style, like ‘no one has seen before’. Therefore I think calling these models ‘creative’ is far from reality.
Dalle-pytorch offers more settings, which I have not tried yet. I used the pretrained VQ-VAE, but you can also train your own. The relation between the VAE and the actual training still needs to be explored.
As dalle-pytorch still follows the original autoregressive DALL·E architecture (a transformer over VQ-VAE/VQGAN image tokens, not a diffusion model), it is already quite old, and newer architectures based on diffusion models are already out. I hope to find repositories that are accessible for non-coders as well.
There are options to use a model trained on a big dataset as a starting point and train on top of that with your own data. That method is called finetuning (see the sketch below).
Prompt learning works similarly, only you fill the empty spaces in the latent space with your own data instead of manipulating the latent space with further training.
These two methods seem promising for generating images with my own content while also being efficient, with lower training costs.
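A minimal sketch of the finetuning idea with dalle-pytorch, assuming (as I understand the repository) that training can be resumed from an existing checkpoint via --dalle_path; all paths are placeholders:

    # continue training a pretrained checkpoint on your own data
    python train_dalle.py --dalle_path ./pretrained_dalle.pt --taming --image_text_folder ./my_cups --epochs 100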
A mystified umbrella term for studies in machine learning, self-learning algorithms and more, commonly used for statistically driven decision-making, recommendation systems or output generation.
A part of the field of AI research: feeding a machine with large amounts of data with the goal of having it find rule sets and (re)produce output accordingly.
A set of rules for a computer to follow, written in code.
PyTorch is an open-source framework used for machine learning, based on the programming language Python.
The result of an algorithm that was trained on data and derived a rule set from it, stored in so-called ‘weights’ and ‘biases’.
A self-learning algorithm trained on image-text pairs to generate new images from a text prompt.
Working across multiple research fields without limiting one’s work to the traditional borders of one discipline.
“participation is important as a project: it rehumanises a society rendered numb and fragmented by the repressive instrumentality of capitalist production. Given the market’s near total saturation of our image repertoire, […] artistic practice can no longer revolve around the construction of objects to be consumed by a passive bystander. Instead, there must be an art of action, interfacing with reality, taking steps – however small – to repair the social bond.”*
Statistically, a system “is biased if it’s consistently wrong in one direction or another”*
Commonly referred to as “prejudiced against a certain group or characteristic.”*
A branch of philosophy which engages with sense perception, often simplified as ‘beauty’. It investigates what generates the sensory significance of e.g. a piece of art.
Art history knows an established canon, a selection of works that are universally known as the most important works of art. It has been criticised for not representing artists from diverse backgrounds. Inspiration for dealing with data could be drawn from this fight.
A system to ”clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited”*
The data fed into a machine learning model; can refer either to the data used for training or to the prompt.
The result a machine learning model produces for a given task or input.
The sentence used as input to a trained model; the instruction for the model to generate the output.
Strategies to control the output of a model, including the usage of subject and style keywords and semantics. For pretrained models these strategies are developed in a trial-and-error approach.
The process in which a machine learning model processes data to acquire the rule sets of the given data.
Mostly applied in natural language processing, finetuning means using a pretrained model and training it further with more data fitted to a use case, to enhance the quality of the output for a certain task.
Research about how communication is influenced by the intersection of different modalities. A modality can be images, sound, spoken or written language, etc.
Bachelor Thesis by Katharina Mumme
supervised by Prof. Peter Kabel
and Prof. Aram Bartholl
HAW Hamburg, Department Design, Visual Communication
April – July 2022
The repository can be found here
Due to their large file sizes, the .pt files of the models cannot be uploaded here. In case you are interested, feel free to message me.
Big shoutout to Phil Wang for recreating the architectures and making code available.
Fonts used on this website, thanks to the creators for making their work openly accessible:
Saira Condensed (Omnibus Type)
Office Code Pro (@nathco)
Special thanks to Fabian and Max for their amazing (coding) support
and to Katharina, Olivia, Lisa, Linda, Rebecca and Janne for letting me take pictures of their beautiful dishes.