Article Five: My text-to-img Journey - Checkpoints, PickleTensors, SafeTensors, LORA, and Textual Embeddings

This is the fifth post in a series of articles I have been writing about text-to-image software. In this series, I will talk about how the technology works, ways you can use it, how to set up your own system to batch out images, technical advice on hardware and software, advice on how you can get the images you want, and touch on some of the legal, ethical, and cultural challenges that we are seeing around this technology. My goal is to keep it practical so that anyone with a little basic computer knowledge can understand and use these articles to add another tool to their art toolbox, as cheaply and practically as possible. In this fifth post, we will discuss how to use Automatic1111 stable-diffusion-webui in more detail, focusing on the differences between Checkpoints, PickleTensors, SafeTensors, LoRA, and Textual Embeddings, and cover a few basic prompt concepts that will help you to get the images you want.

Checkpoint and Extra Network Files

In our third article we asked you to download a checkpoint https://civitai.com/models/41928/8buffgen to test if your installation of the webUI (Automatic1111-Stable-Diffusion-webUI) worked. Now we are going to go into more detail about the checkpoints and extra networks that are available in the webUI and how you can use them to get the pictures you want with more consistency. Some of this information can be found in the Automatic1111-stable-diffusion-webui wiki pages here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features

At the end of this section we have included a table that shows where each file needs to be stored for the webUI to use it, along with the common file types.

Model Checkpoints

There are two file types that can be used by the webUI to generate images: PickleTensors (*.ckpt) and SafeTensors (*.safetensors). Both hold trained models that tend to focus on certain styles or content. The difference between a checkpoint and a safetensors file is in how the model data is stored on disk, and each format plays a slightly different role around the training process.


A PickleTensors (*.ckpt) file is a snapshot of the trained model at a specific point in the training process. It contains the weights and biases of the neural network and other relevant parameters, which can be used to resume training from the same point or to generate images using the current state of the model. Checkpoint files are saved periodically during the training process to track the progress of the model and to prevent loss of data in case of errors or crashes. The format is based on Python's pickle serialization, which is flexible but can execute arbitrary code when a file is loaded, so it should only be used with files from sources you trust.


A SafeTensors (*.safetensors) file, on the other hand, is a file format used to store tensors, the multi-dimensional arrays of weights that make up the model. Safetensors files are designed to be efficient and secure: they contain only raw tensor data plus a small header, so loading one cannot execute arbitrary code the way unpickling a .ckpt file can, and they load quickly. A safetensors file typically holds the same model weights you would find in a checkpoint, just in a safer container.


In summary, a checkpoint file is a snapshot of the trained model saved with Python pickling, while a safetensors file stores the same kind of tensor data in a safer, faster format. Both file types show up constantly around text-to-image software, but they are starting to serve different purposes. I expect the use of these files to diverge more in the future as AI tools mature: SafeTensors (*.safetensors) traded and sold as a final product, since they are more secure, and PickleTensors (*.ckpt) traded and shared as works in progress by people who are training and mixing models. For our purpose using Automatic1111-Stable-Diffusion-webUI, we can use both file types at this time, but keep in mind that SafeTensors are what you will want to use unless you are going to be doing a lot of training and merging.
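
To make that security difference concrete, here is a minimal Python sketch (not something the webUI requires, and the file names are placeholders) of how the two formats are typically loaded. Loading a .ckpt goes through Python's pickle machinery, while the safetensors loader only reads raw tensor data.

import torch                               # loads *.ckpt files through Python's pickle machinery
from safetensors.torch import load_file    # loads *.safetensors without executing any code

# PickleTensors: unpickling can run code a malicious author embedded in the
# file, so only load .ckpt files from sources you trust.
ckpt_state = torch.load("some_model.ckpt", map_location="cpu")

# SafeTensors: the file holds only raw tensor data plus a small JSON header,
# so there is nothing executable to run when it is read.
safe_state = load_file("some_model.safetensors", device="cpu")

print("Loaded", len(safe_state), "tensors from the safetensors file")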


To install PickleTensors (*.ckpt) and SafeTensors (*.safetensors) models, move the downloaded file to the <home directory>/stable-diffusion-webui/models/Stable-diffusion directory on your Linux server and then reload the webUI by scrolling to the bottom of the page and clicking the "Reload UI" button. This may take up to a minute to reload. If it fails, just refresh the page until it reloads.
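
If you prefer to script that step, here is a minimal sketch, assuming a default install under your home directory and a hypothetical download filename, that moves a model into the folder the webUI scans for checkpoints:

from pathlib import Path
import shutil

downloaded = Path.home() / "Downloads" / "8buff_gen.safetensors"   # hypothetical filename
models_dir = Path.home() / "stable-diffusion-webui" / "models" / "Stable-diffusion"

models_dir.mkdir(parents=True, exist_ok=True)                      # create the folder if it is missing
shutil.move(str(downloaded), str(models_dir / downloaded.name))    # move the model into place

# List everything the webUI will now offer as a checkpoint.
print(sorted(p.name for p in models_dir.iterdir() if p.suffix in (".ckpt", ".safetensors")))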

Textual Inversion

Textual inversion is a useful technique for generating more consistent images of a specific subject or style. Instead of describing the subject in words every time, textual inversion trains a small embedding from a handful of example images; the embedding is saved under a name (its filename), and using that name in a prompt tells the model to reproduce the learned concept. For example, you can use a Textual Inversion to consistently generate a particular person's face.


You can find details and other uses on the authors' site here: https://textual-inversion.github.io/ and a longer explanation of Textual Inversion in the Automatic1111-stable-diffusion-webUI can be found here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Textual-Inversion


Current Textual Inversion files use the .pt file type, but older ones may be .bin files. We have used both in the webUI with no issues, but I suggest focusing on .pt files. To use a textual inversion, place the .pt file into the embeddings directory and use its filename in the prompt. You don't have to restart the webUI for this to work, but you may need to click the refresh button in the webUI.
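
For example, if you downloaded an embedding saved as my_face_style.pt (a hypothetical filename), your positive prompt could read:

photo of my_face_style standing on a city street,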

LoRA

LoRA, or Low-Rank Adaptation of Large Language Models, is a technique that could potentially improve the performance of text-to-image software by enabling the language model component to more effectively process textual input and generate corresponding images. LoRA was initially proposed by Microsoft Research in a paper focused on the challenge of fine-tuning large language models. For more information see here: https://pub.towardsai.net/hugging-face-lora-is-a-simple-framework-for-fine-tuning-text-to-image-models-746e7615420

While initially proposed for large language models and demonstrated on transformer blocks, LoRA is also suitable for use in other contexts, such as Stable Diffusion fine-tuning.

What does this mean for those of us who want to make cool images? LoRA files add certain objects or features that may not be in the model we are using. To use the example of our 8Buff_gen model: we trained it to put people in jars. This was not part of the original Stable Diffusion models, and we got the idea from a LoRA called "girls in jars". That LoRA could be used with other models, but the results were inconsistent depending on which model was used. The advantage of training the concept into the model is more consistency; the advantage of using separate LoRAs is that GPU time and drive space can be saved. This will be an interesting area to watch as it is adapted for generative platforms in the future.

To use them, download LoRA files (*.safetensors or *.pt) and place them in the ./stable-diffusion-webui/models/Lora directory. You will need to reload the webUI interface.


A LoRA is added to the prompt by putting the following text into the prompt: <lora:filename:multiplier>, where filename is the name of the LoRA file on disk, excluding the extension, and multiplier is a number, generally from 0 to 1, that lets you choose how strongly the LoRA will affect the output. LoRA cannot be added to the negative prompt and can't be used for prompt matrix (more on that later).
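
For example, with a LoRA file saved as girls_in_jars_v1.safetensors (a hypothetical filename), a prompt could look like:

woman standing on city street, <lora:girls_in_jars_v1:0.7>

Here 0.7 is the multiplier; values closer to 0 reduce the LoRA's influence and values closer to 1 increase it.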

Hypernetworks

A hypernetwork is a type of neural network architecture that can be used in text-to-image software to help generate images from textual input. Unlike traditional neural networks, which use fixed weights and biases to transform input data into output, a hypernetwork generates the weights and biases of another neural network, called the target network, which is responsible for generating the image from the textual input. In Stable Diffusion, hypernetworks fine-tune the weights for CLIP and the U-Net, the language model and the actual image de-noiser, a technique generously donated to the world by our friends at Novel AI in autumn 2022.


Overall, hypernetworks offer a promising approach for text-to-image software, enabling the generation of high-quality images from textual input more efficiently and effectively. We will get into hypernetworks in future articles, but for now we wanted to mention them here since you will undoubtedly see the word in the webUI interface. 


Download hypernetwork files (*.safetensors or *.pt) and place them in the ./stable-diffusion-webui/models/hypernetworks directory. They work in the same way as LoRA in the prompt: the multiplier can be used to choose how strongly the hypernetwork will affect the output, and the same rules for adding them to the prompt apply as for LoRA: <hypernet:filename:multiplier>.


Below is a brief chart of all the files we mentioned above, along with the directory each file should be moved to for use, the common file types, and the basic use in the prompt within the webUI.


Checkpoints and Extra Networks | File Placement in the /stable-diffusion-webui Directory | File Types | How to Use in AUTOMATIC1111-stable-diffusion-webUI Prompts
Checkpoint | models/Stable-diffusion/ | *.ckpt, *.safetensors | Select from the drop-down at the upper left of the page labeled "Stable Diffusion checkpoint"
Textual Inversion | embeddings/ | *.pt, *.bin (deprecated) | Place the embedding's filename into the prompt
LoRA | models/Lora/ | *.safetensors, *.pt | Use the following format in the prompt (may require additional keyword triggers): <lora:filename:multiplier>
Hypernetworks | models/hypernetworks/ | *.pt, *.ckpt, *.safetensors | Use the following format in the prompt (may require additional keyword triggers): <hypernet:filename:multiplier>

Size Matters

Pausing for a minute, we want to discuss image size. Most models are trained for a certain pixel size. The default is 512x512, meaning the image is 512 pixels in height and 512 pixels in width. Newer models support 768x768 (768 pixels in height and 768 pixels in width).


This doesn't mean that you can't generate other sizes with most models; most will try to produce images as large as your resources allow. Size matters because you get the most consistent and best images when you use the size the model was trained for. We recommend using the model's recommended size or, if that is unknown, 512x512 or 768x768 for all testing, then expanding the size after you have baselined the model.

Stable Diffusion Models 

We do love our model and recommend it (8Buff-gen, available in .ckpt and .safetensors at Huggingface or Civitai), but we know that as a creative person you will want more models. Before you download bunches of them, we recommend starting with some of the basics. Download the original Stable Diffusion checkpoint files from Huggingface. Once downloaded, move the files into the ./stable-diffusion-webui/models/Stable-diffusion/ directory on your system.

Once you have played with the Stable Diffusion models, you will see why so many other models have been mixed, merged, and trained from them. They are a great baseline, but they lack the fine tuning most people will want when they have an idea in their head. Once you have an idea of the starting point where many models begin, you can look at where they are going.

You can search for models, LoRA, Textual Inversion, and Hypernetwork files on both Huggingface and Civitai, or do some searches for other sites. Be careful, though, and remember that you are downloading large files that could contain anything, even malware (that is another reason we think *.safetensors files will be used more in the long term).

Using All of These Files in Automatic1111-Stable-Diffusion-webUI

Now that you have all these files downloaded and moved to the proper folders, let's discuss how you can use them in the webUI interface. The primary point of having a User Interface (UI) is to simplify and speed up tasks, and thanks to all the contributors at Automatic1111, we have that.

Now let's go through the interface:



Stable Diffusion Checkpoint: In the upper left of your window, select the model you want to use. We, surprise, are using our 8Buff_GEN_FP16_v1.safetensor for this demo.


The top text box is for positive prompts, the things you want in your image:

woman standing on city street, 


The bottom text box is for negative prompts. These are things we want to avoid in our image. Below are some common negative prompts we use:

blurry, low quality, NSFW,


The width and height settings control the size of the output image. We recommend using 512x512 for initial testing.


Batch size controls the number of images generated each time. We usually set this to 4 so we can see whether our prompt needs more tweaking or whether we are just getting a one-off image.


Finally, hit the Generate button. After a short wait, you will get your images!
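
If you later want to batch out images without clicking, the webUI can also be driven over HTTP when it is started with the --api command line flag. Below is a minimal Python sketch using the same settings as above; the address and output filenames assume a default local install and are only an example.

import base64
import requests

payload = {
    "prompt": "woman standing on city street",
    "negative_prompt": "blurry, low quality, NSFW",
    "width": 512,
    "height": 512,
    "batch_size": 4,
    "steps": 20,
}

# txt2img endpoint exposed when the webUI is launched with --api.
response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()

# Each generated image is returned as a base64-encoded PNG string.
for i, image_b64 in enumerate(response.json()["images"]):
    with open(f"generated_{i}.png", "wb") as f:
        f.write(base64.b64decode(image_b64))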




Now we will use some of the Textual Inversion files. Click on the Extra networks button (the small button with red on it just below the 'Generate' button) and the UI will show a set of cards, each corresponding to a file you moved into the directories we discussed earlier. Clicking a card adds it to the prompt, where it will affect generation.



Let's click on the "katana_v9_shurik" Textual Inversion; you will notice it adds the title to the positive prompt box. This Textual Inversion renders faces to look like Katara from the "Avatar: The Last Airbender" animated television show. Go ahead and click Generate.



Now we have four images of a woman in a city who looks like Katara. If you like one of the images and want to use it as a preview, go back to the Extra networks button (the small button with red on it just below the 'Generate' button) and hover your mouse over the card. A "replace preview" link will appear. Click on the words "replace preview".



Clicking replace preview adds the image currently in the viewer to the card.



You can use the same process for LoRA and Hypernetworks to add them to prompts. Note that the prompt format for LoRAs and Hypernetworks is similar, and each has its own challenges when used in images. We will get into some of those issues and challenges in future articles.

To summarize, we have discussed some of the common files for image generation, what they do, and how to use them in the webUI. Next time we will go into more detail on prompts: why some models require longer prompts, and how to add attention and emphasis to your prompts so you can generate consistent images.

