So, I spent about $200 in total on this model. I trained it on both videos and images, using booru tags for the descriptions.
What I noticed:
- The model learns really well from images. I tested this with `batch_size` 256 and `lr` 6e-5. This, in turn, boosts the model's understanding of what should move in a video and how.
- In just a couple of epochs on a 90k-image dataset, the model had already learned the styles of the authors in the dataset pretty well (I specifically collected a dataset with different authors for testing).
- For video training, I used a `batch_size` of 16, and the dataset consisted of 4.7k video clips.
- Towards the end, I trained the model a bit specifically for image2video.
- But the results are still far from ideal. No matter how much I trained, there was a constant feeling that it was just about to get there, but it never did. Yes, the model improves, but more slowly than it seems it should, as if progress slows down the closer it gets.
- Maybe I'm just making this up, but I didn't train for CFG (classifier-free guidance), and perhaps that's a problem.
- Also, the model seems very sensitive to the learning rate (lr).
- When training on images with a small batch size + small lr, the model barely learned.
- But when I trained on video the same way (small batch size + small lr), it did learn.
- The dataset had both real life and anime content.
- Resolutions ranged from 512x768 to 768x512.
- Video clips had a maximum of 121 frames.
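For reference, the settings mentioned above can be collected into one place. This is only a summary of the numbers from this post; the variable and field names are my own and not an actual LTX-Video training config schema:

```python
# Hypothetical summary of the two training runs described above.
# Field names are illustrative, not a real config format.
image_run = {
    "batch_size": 256,       # images learn well at this scale
    "lr": 6e-5,
    "dataset_size": 90_000,  # ~90k booru-tagged images
}

video_run = {
    "batch_size": 16,
    "dataset_size": 4_700,   # 4.7k video clips
    "max_frames": 121,       # per-clip frame cap
    "resolutions": [(512, 768), (768, 512)],  # portrait to landscape range
}
```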
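On the CFG point: classifier-free guidance is normally enabled at training time by randomly dropping the conditioning (e.g., replacing the caption with an empty string) for a small fraction of samples, so the model also learns an unconditional branch. A minimal sketch of that caption-dropout step, with an assumed 10% drop probability (the function name and rate are mine, not from any training script):

```python
import random

def maybe_drop_caption(caption, p_drop=0.1, rng=None):
    """Replace the caption with an empty string with probability p_drop,
    so the model learns an unconditional mode usable for CFG at inference."""
    rng = rng or random
    return "" if rng.random() < p_drop else caption

# With p_drop=0.0 the caption is always kept; with p_drop=1.0 it is
# always dropped (rng.random() is in [0, 1), so it is always < 1.0).
```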
Overall, the model is unlikely to be usable in practice (it's not as good as WAN, etc.), but it occasionally produces decent results. On the bright side, it generates very quickly.
If you want to support me, here's the link. I'll try to train the model on a larger dataset of images and videos, and maybe something will come of it. ➡️ https://boosty.to/muinez
Model tree for Muinez/ltxvideo-2b-nsfw
Base model: Lightricks/LTX-Video-0.9.5