RuntimeError: The size of tensor a (48) must match the size of tensor b (64) at non-singleton dimension 0

Hi there,
I have fine-tuned the phi-4-mini-instruct model and tried to deploy it on a Hugging Face Inference Endpoint, but the endpoint failed with an error.
The error is:
[Server message]Endpoint failed to start
Exit code: 1. Reason: │\n[rank1]: │ │ 1., 1., 1., 1., 1., 1., │ │\n[rank1]: │ │ │ │ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., │ │\n[rank1]: │ │ 1.], device=‘cuda:1’) │ │\n[rank1]: │ ╰──────────────────────────────────────────────────────────────────────────╯ │\n[rank1]: ╰──────────────────────────────────────────────────────────────────────────────╯\n[rank1]: RuntimeError: The size of tensor a (48) must match the size of tensor b (64) at \n[rank1]: non-singleton dimension 0"},“target”:“text_generation_launcher”,“span”:{“rank”:1,“name”:“shard-manager”},“spans”:[{“rank”:1,“name”:“shard-manager”}]}
{“timestamp”:“2025-04-29T07:29:42.246550Z”,“level”:“ERROR”,“fields”:{“message”:“Shard 1 failed to start”},“target”:“text_generation_launcher”}
{“timestamp”:“2025-04-29T07:29:42.246595Z”,“level”:“INFO”,“fields”:{“message”:“Shutting down shards”},“target”:“text_generation_launcher”}
{“timestamp”:“2025-04-29T07:29:42.269764Z”,“level”:“INFO”,“fields”:{“message”:“Terminating shard”},“target”:“text_generation_launcher”,“span”:{“rank”:0,“name”:“shard-manager”},“spans”:[{“rank”:0,“name”:“shard-manager”}]}
{“timestamp”:“2025-04-29T07:29:42.269807Z”,“level”:“INFO”,“fields”:{“message”:“Waiting for shard to gracefully shutdown”},“target”:“text_generation_launcher”,“span”:{“rank”:0,“name”:“shard-manager”},“spans”:[{“rank”:0,“name”:“shard-manager”}]}
{“timestamp”:“2025-04-29T07:29:42.470177Z”,“level”:“INFO”,“fields”:{“message”:“shard terminated”},“target”:“text_generation_launcher”,“span”:{“rank”:0,“name”:“shard-manager”},“spans”:[{“rank”:0,“name”:“shard-manager”}]}
Error: ShardCannotStart

Then I tried to deploy the base microsoft-phi-4-mini-instruct model and got the same error.

Can anyone help me resolve this issue?

This seems to be an unresolved issue in TGI?
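
One plausible reading of the 48 vs 64 in the error, offered as a guess rather than a confirmed diagnosis: if the model applies rotary embeddings to only part of each attention head, the rotary frequency tensor has fewer entries than a loader that assumes the full head dimension expects. The sketch below only does the arithmetic. The config values (hidden_size 3072, 24 attention heads, partial_rotary_factor 0.75) are assumptions about the Phi-4-mini config and should be checked against the model's config.json, and `rotary_half_dims` is a hypothetical helper, not part of TGI or transformers.

```python
def rotary_half_dims(hidden_size: int, num_heads: int, partial_rotary_factor: float):
    """Return (half of full head dim, half of the rotary-only dim).

    Rotary frequency tensors have one entry per pair of dimensions,
    so their length is half of however many dimensions are rotated.
    """
    head_dim = hidden_size // num_heads
    # Number of dimensions per head that actually receive rotary embeddings.
    rotary_dim = int(head_dim * partial_rotary_factor)
    return head_dim // 2, rotary_dim // 2


# Assumed values; verify them in the model's config.json.
full_half, partial_half = rotary_half_dims(
    hidden_size=3072, num_heads=24, partial_rotary_factor=0.75
)
print(full_half, partial_half)
```

With those assumed values the two lengths come out as 64 and 48, which matches the sizes in the error message. That would also be consistent with the base model failing in the same way, since the mismatch would come from the model config rather than from the fine-tuning.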

rdaya
In case anyone runs into this, the workaround (a bad one) is to set the environment variable USE_FLASH_ATTENTION=false.