So, I spent about $200 in total on this model. I trained it on both videos and images, using booru tags for the descriptions.

What I noticed:

  • The model learns really well from images (I tested this with batch_size 256 and lr 6e-5), and that image knowledge in turn improves its understanding of what should move in a video and how.
  • In just a couple of epochs on a 90k image dataset, the model already learned the styles of the authors in the dataset pretty well (I specifically collected a dataset with different authors for testing).
  • For video training, I used a batch_size of 16, and the dataset consisted of 4.7k video clips.
  • Towards the end, I trained the model a bit specifically for image2video.
  • But the results are still far from ideal. No matter how long I trained, there was a constant feeling that it was just about to get there, yet it never quite did. The model does improve, but more slowly than it appears to, as if progress decelerates the closer it gets.
  • Maybe I'm just imagining it, but I didn't train with CFG (i.e., no caption dropout for classifier-free guidance), and perhaps that's a problem.
  • Also, it seems the model is very sensitive to lr (learning rate).
  • When training on images with a small batch size + small lr, the model barely learned.
  • But when I trained on video the same way (small batch size + small lr), it did learn.
  • The dataset had both real life and anime content.
  • Resolutions ranged between 512x768 (portrait) and 768x512 (landscape).
  • Video clips had a maximum of 121 frames.
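The hyperparameters scattered through the bullets above can be collected into a single config sketch. The values are the ones reported in the post; the dict layout and key names are mine, not taken from any actual training script:

```python
# Training settings as reported; structure and names are hypothetical.
TRAIN_CONFIG = {
    "image": {
        "batch_size": 256,
        "lr": 6e-5,
        "dataset_size": 90_000,   # ~90k images, multiple artists
    },
    "video": {
        "batch_size": 16,
        "dataset_size": 4_700,    # 4.7k clips
        "max_frames": 121,
    },
    "resolutions": [(512, 768), (768, 512)],  # (width, height) buckets
    "captions": "booru tags",
    "content": ["real life", "anime"],
}
```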

Overall, the model is unlikely to be usable (it's not as good as WAN, etc.), though it occasionally produces decent results. On the bright side, it generates very quickly.

If you want to support me, here's the link. I'll try to train the model on a larger dataset of images and videos, and maybe something will come of it. ➡️ https://boosty.to/muinez
