TurboCharge your AI Training for GenAI



GenAI + Stable Diffusion
Tips and tricks from the training we've implemented.
In terms of MLOps we used MosaicML, which comes with the Composer module. Composer makes PyTorch training up to 48x faster than traditional PyTorch by combining some of the most renowned speed-up methods, all embedded in the library (a sketch of a Composer setup is shown at the end of this post):
1. No Mixed Opinions on Mixed Precision Training.
Training in FP16 while keeping FP32 master weights yields roughly a 1.7x end-to-end speedup over training entirely in FP32.
This is one of the reasons I used MosaicML with the Composer functionality.
Courtesy: NVIDIA AMP, the Automatic Mixed Precision library.
Model quality is unaffected by mixed precision training (training acceleration of 2-4.5x).
Memory footprint is reduced by roughly 2x.
The speedup comes from the fact that the cost of multiplying two numbers grows roughly with the square of the number of bits.
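Below is a minimal sketch of mixed-precision training in plain PyTorch with torch.cuda.amp, which is roughly what Composer enables internally when you select an AMP precision. The model, data, and hyperparameters here are placeholders for illustration.

```python
import torch
import torch.nn as nn

# Placeholder model and data; in practice these come from your own pipeline.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Flatten(), nn.Linear(16 * 32 * 32, 10)
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # keeps small FP16 gradients from underflowing

images = torch.randn(8, 3, 32, 32, device="cuda")
labels = torch.randint(0, 10, (8,), device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():        # forward pass runs in FP16 where it is safe
        loss = loss_fn(model(images), labels)
    scaler.scale(loss).backward()          # scale the loss before backprop
    scaler.step(optimizer)                 # unscale gradients, update FP32 master weights
    scaler.update()
```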
2. The Old Friend: Channels Last vs. Channels First
CNN training operates on 4-D tensors, and the convolution kernels at the NVIDIA GPU level are implemented to perform best in the channels-last (NHWC) memory format. Every time you use channels-first (NCHW) instead, an extra transpose operation has to be performed. NVIDIA reports 1.15x to 1.6x TFLOPS improvements.
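In plain PyTorch the switch boils down to two calls to .to(memory_format=torch.channels_last), one for the model and one for the input batch; the model and batch below are just placeholders.

```python
import torch
import torchvision

# Any 4-D CNN workload benefits the same way; ResNet-50 is just an example.
model = torchvision.models.resnet50().cuda().to(memory_format=torch.channels_last)
images = torch.randn(16, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

# With both the weights and the input stored as NHWC, cuDNN can pick
# channels-last kernels and skip the extra transpose mentioned above.
with torch.cuda.amp.autocast():
    output = model(images)
```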
3. Go Easy on the Image Augmentations.
In the first wave of AI I would always do augmentation with Albumentations, which is a good library but runs on the CPU. I would either pre-augment the images or send the whole folder of images to the GPU. Depending on how you set up the pipeline, this becomes a bottleneck: if at every epoch you apply a defined list of transformations and then run a random Python operation on top of it, you pay both for slow augmentation math and for CPU-to-GPU transfer.
To mitigate this, train with larger batches (less GPU-CPU communication and transfer), and
perform the augmentations directly on the GPU using NVIDIA DALI or Kornia (https://kornia.github.io/).
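Here is a minimal sketch of GPU-side augmentation with Kornia; the exact transform names and arguments may differ slightly between Kornia versions, and the batch is a random placeholder.

```python
import torch
import torch.nn as nn
import kornia.augmentation as K

# Kornia augmentations are nn.Modules that operate on whole batches of tensors.
augment = nn.Sequential(
    K.RandomHorizontalFlip(p=0.5),
    K.RandomAffine(degrees=15.0, p=0.5),
).cuda()

# A batch that already lives on the GPU, values in [0, 1], shape (B, C, H, W).
images = torch.rand(32, 3, 224, 224, device="cuda")

# The transforms run on the GPU for the whole batch at once, so there is no
# extra CPU-to-GPU round trip for the augmented images.
augmented = augment(images)
```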
4. Image Processing Using Pillow-SIMD
* PyTorch and Keras usually come with Pillow under the hood. This can be improved by swapping in Pillow-SIMD, which is a drop-in replacement and is much faster.
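Because Pillow-SIMD is a drop-in replacement, existing Pillow code runs unchanged; the snippet below only sanity-checks which build is installed (the image path is a placeholder).

```python
import PIL
from PIL import Image

# After installing pillow-simd in place of pillow, the import stays the same.
print(PIL.__version__)  # Pillow-SIMD builds typically carry a ".postN" suffix

# "example.jpg" is a placeholder path; the Pillow API itself is unchanged.
img = Image.open("example.jpg").convert("RGB").resize((224, 224))
```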
5. Stepwise Learning Rate.
Adapting the learning rate per step within an epoch, rather than only after each epoch, is generally more beneficial. The larger the dataset and the training batches, the more effective this strategy becomes.
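A minimal sketch in plain PyTorch: OneCycleLR is designed to be stepped once per batch, so scheduler.step() sits inside the inner loop rather than at the end of the epoch. Model, data, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model, optimizer, and loop sizes for illustration.
model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
epochs, steps_per_epoch = 3, 100

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=steps_per_epoch
)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(64, 10, device="cuda")
        y = torch.randint(0, 2, (64,), device="cuda")
        optimizer.zero_grad(set_to_none=True)
        loss_fn(model(x), y).backward()
        optimizer.step()
        scheduler.step()  # learning rate adapted per step, within the epoch
```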
All of these methods are combined in the MosaicML Composer functionality, which offers a 7x speedup compared to other training approaches.
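To give an idea of how these pieces come together, here is a rough sketch of a Composer training setup; class names and arguments may differ across Composer versions, and the model and dataset are random placeholders rather than an actual Stable Diffusion pipeline.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from composer import Trainer
from composer.algorithms import BlurPool, ChannelsLast
from composer.models import ComposerClassifier

# Placeholder CNN and dataset; in practice plug in your own model and data pipeline.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Flatten(), nn.Linear(16 * 32 * 32, 10)
)
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
train_dataloader = DataLoader(dataset, batch_size=64)

# Composer applies the speed-up methods as composable "algorithms":
# ChannelsLast handles the NHWC switch (tip 2) and precision="amp_fp16"
# turns on mixed precision (tip 1).
trainer = Trainer(
    model=ComposerClassifier(net, num_classes=10),
    train_dataloader=train_dataloader,
    max_duration="2ep",
    algorithms=[ChannelsLast(), BlurPool()],
    precision="amp_fp16",
    device="gpu",
)
trainer.fit()
```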