Can't run locally.
Getting the following error:
Cannot access gated repo for url https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev/resolve/main/model_index.json.
Make sure you go to https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev, fill out the form and accept the terms and conditions with your account. Also make sure you create a token on HF from your account and provide it when duplicating.
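For example, once the terms are accepted, something along these lines should let the download go through (the token value below is just a placeholder for your own):

import torch
from huggingface_hub import login
from diffusers import FluxKontextPipeline

# log in with the access token created under your HF account settings
# (placeholder value, use your own token)
login(token="hf_xxx")

flux_pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev",
    torch_dtype=torch.bfloat16,
)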
Thank you for the response.
I did download it locally using a generated API key, into a folder FLUX.1-Kontext-dev next to the other models.
Now here are the lines in question:
flux_pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)
flux_pipe.load_lora_weights("alexnasa/Claymation-Kontext-Dev-Lora")
flux_pipe.to("cuda")
I think for the first one I just need to point it to the downloaded folder, correct (something like the snippet below)? However, my next question here is related to the size of this model. Is there another, smaller model that I can use?
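For the first line I am assuming something like this, with the downloaded FLUX.1-Kontext-dev folder sitting next to the script:

import torch
from diffusers import FluxKontextPipeline

# load from the local folder instead of the Hub repo id
# (the folder needs to contain model_index.json and the weight files)
flux_pipe = FluxKontextPipeline.from_pretrained(
    "./FLUX.1-Kontext-dev",
    torch_dtype=torch.bfloat16,
)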
On the second line, the 'alexnasa/Claymation-Kontext-Dev-Lora' repo does not seem to be available. Can you please help me?
I made https://huggingface.co/alexnasa/Claymation-Kontext-Dev-Lora public, you can download and use it if needed
Thank you very much,
Now about the FLUX model size: is there another, smaller model that can still be used?
Not that I know of. You can either train one or search for an SD2.1 or SDXL LoRA for that; a LoRA doesn't need lots of examples anyway, 10-20 pairs of images does the job.
SD2.1 is the smallest model you can technically find that would still give you the best results when it comes to a single character or one-person image.
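As a rough sketch (not what this Space uses), an SD2.1 pipeline with your own trained LoRA would look something like this; the LoRA path is just a placeholder for whatever you train:

import torch
from diffusers import StableDiffusionPipeline

# SD2.1 base is much smaller than FLUX.1-Kontext-dev
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)
# placeholder: a claymation LoRA you train yourself on 10-20 images
pipe.load_lora_weights("path/to/your-claymation-lora")
pipe.to("cuda")
image = pipe("claymation style portrait of a person").images[0]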
I am also trying to run only the flux_pipe in CPU memory instead of on CUDA, but then I get a GPU memory allocation error even though there is no GPU memory usage at all.
What I have done is change it to flux_pipe.to("cpu"); is there something else I need to change?
I am trying to learn this by firehose and run this locally.
I would appreciate any pointers, please.
Currently it seems that a mix of CPU and CUDA across the models (the big one on CPU and the rest on CUDA) is not working. Any reason why?
For now I have switched to loading all models in memory.
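Presumably diffusers' built-in offloading could replace the manual .to("cpu") / .to("cuda") split here; a rough sketch, assuming accelerate is installed:

import torch
from diffusers import FluxKontextPipeline

flux_pipe = FluxKontextPipeline.from_pretrained(
    "./FLUX.1-Kontext-dev",
    torch_dtype=torch.bfloat16,
)
flux_pipe.load_lora_weights("alexnasa/Claymation-Kontext-Dev-Lora")

# keep components in CPU RAM and move each one to the GPU only while it runs,
# instead of calling flux_pipe.to("cuda") or flux_pipe.to("cpu") manually
flux_pipe.enable_model_cpu_offload()

# for even lower VRAM (much slower), offload layer by layer instead:
# flux_pipe.enable_sequential_cpu_offload()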
WanInferencePipeline.load_model() seems to just exit the entire app, with no errors.
My understanding so far of the entire process is:
- convert a normal image into a clay image using the FLUX.1-Kontext-dev and Claymation-Kontext-Dev-Lora models
- convert the audio into feature vectors (or text) using the wav2vec2-base-960h model (a rough sketch of this step is below)
- generate the final video using the Wan2.1-T2V-1.3B and OmniAvatar-1.3B
Please let me know if my understanding is correct. I am trying to split the entire app into steps so I can run them separately one by one.
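For the audio step, a minimal sketch of pulling feature vectors out of wav2vec2-base-960h with transformers (16 kHz mono audio assumed; how OmniAvatar actually consumes these features may differ):

import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# wav2vec2-base-960h expects 16 kHz mono audio
audio, _ = librosa.load("input.wav", sr=16000, mono=True)

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_features = model(**inputs).last_hidden_state  # shape: (1, frames, 768)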
I was able to generate the Clay_ images.
In the WanInferencePipeline.add_lora_to_model(..) method I get the following error:
RuntimeError: Error(s) in loading state_dict for WanModel:
size mismatch for patch_embedding.weight: copying a param with shape torch.Size([1536, 33, 1, 2, 2]) from checkpoint, the shape in current model is torch.Size([1536, 16, 1, 2, 2]).
Looks like the sizes are not matching.
I am using the following parameters:
pretrained_lora_path = 'pretrained_models/OmniAvatar-1.3B/pytorch_model.pt'
lora_rank = 128
lora_alpha = 64
lora_target_modules = 'q,k,v,o,ffn.0,ffn.2'
init_lora_weights = "kaiming"
Please advise.
Here is the console output:
Loading models from: ['pretrained_models/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors']
model_name: wan_video_dit model_class: WanModel
This model is initialized with extra kwargs: {'has_image_input': False, 'patch_size': [1, 2, 2], 'in_dim': 16, 'dim': 1536, 'ffn_dim': 8960, 'freq_dim': 256, 'text_dim': 4096, 'out_dim': 16, 'num_heads': 12, 'num_layers': 30, 'eps': 1e-06}
The following models are loaded: ['wan_video_dit'].
Loading models from: pretrained_models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth
model_name: wan_video_vae model_class: WanVideoVAE
The following models are loaded: ['wan_video_vae'].
Loading models from: pretrained_models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth
model_name: wan_video_text_encoder model_class: WanTextEncoder
The following models are loaded: ['wan_video_text_encoder'].
Using wan_video_text_encoder from pretrained_models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth.
Using wan_video_dit from ['pretrained_models/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors'].
Using wan_video_vae from pretrained_models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth.
No wan_video_image_encoder models available.
Use LoRA: lora rank: 128, lora alpha: 64.0
From what I can see, the OmniAvatar-1.3B model does not match the Wan2.1-T2V-1.3B dimensions, or are there some parameters that I have wrong?
Any pointers on how to move this forward?
Any update on this issue?
Looks like the code is trying to load a LoRA/checkpoint whose patch_embedding expects 33 input channels on top of a model built with 16.
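One way to confirm that is to inspect the OmniAvatar checkpoint directly; a quick sketch (the key name comes from the error message, it may be nested differently depending on how the checkpoint was saved):

import torch

# load the OmniAvatar checkpoint on CPU and check the patch embedding shape
state_dict = torch.load(
    "pretrained_models/OmniAvatar-1.3B/pytorch_model.pt", map_location="cpu"
)
print(state_dict["patch_embedding.weight"].shape)
# expected from the error: torch.Size([1536, 33, 1, 2, 2]),
# while the log shows the Wan2.1-T2V-1.3B DiT is built with in_dim=16,
# so the base WanModel has to be constructed with the extra input channels
# before these weights can be loaded on top of it.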