Egyptian Arabic Qwen3-TTS — Custom Voice

A fine-tuned Qwen3-TTS 1.7B model specialized in generating Egyptian Arabic (Masri) speech with a natural, conversational tone.

The model was trained on Egyptian Arabic speech data to better capture the dialectal prosody, pronunciation, and conversational rhythm that are not well represented in the base multilingual Qwen3-TTS model.


Model Details

Property           Value
Base Model         Qwen/Qwen3-TTS-12Hz-1.7B-Base
Task               Text-to-Speech (Speech Synthesis)
Architecture       Transformer TTS
Parameters         1.7B
Speech Codec       12Hz
Voice Type         Custom Voice
Speaker Name       egyptian_speaker
Primary Language   Egyptian Arabic (عربي مصري)
Training Data      ~25 hours clean Egyptian speech
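
As a rough illustration of what the codec rate in the table implies, the sketch below estimates how many codec frames are needed for a clip of a given length. It assumes the nominal 12 frames per second suggested by the model name; the tokenizer's exact rate may differ slightly, and `estimate_codec_frames` is a hypothetical helper, not part of `qwen_tts`.

```python
import math

def estimate_codec_frames(duration_s: float, frame_rate_hz: float = 12.0) -> int:
    """Estimate the number of codec frames for a clip of duration_s seconds.

    frame_rate_hz defaults to the nominal 12 Hz taken from the model name
    (an assumption; pass the exact tokenizer rate if you know it).
    """
    if duration_s < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    return math.ceil(duration_s * frame_rate_hz)

# A 10-second utterance at the nominal rate.
print(estimate_codec_frames(10.0))
```

Under this assumption, each second of speech costs only about a dozen generation steps, which is why low-rate codecs keep autoregressive TTS relatively cheap.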

Motivation

While the base Qwen3-TTS model supports multiple languages, the Egyptian Arabic dialect is significantly underrepresented in it. The base model produced speech that sounded foreign, with inconsistent pronunciation and an unnatural conversational rhythm for the Egyptian dialect.

This fine-tune directly addresses those issues:

  • Natural Egyptian dialect pronunciation
  • Conversational prosody and tone
  • Clear, clean speech output
  • Retains the original model's multilingual capability

Example Usage

Installation

pip install qwen-tts soundfile torch

Basic Inference

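from qwen_tts import Qwen3TTSModel
import soundfile as sf
import torch

model_id = "itshamdi404/Egy_Arabic_Qwen3-TTS-12Hz-1.7B-Base"

# Load the fine-tuned checkpoint onto GPU 0 in half precision.
tts = Qwen3TTSModel.from_pretrained(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
)

# Synthesize one Egyptian Arabic sentence with the fine-tuned speaker.
wavs, sr = tts.generate_custom_voice(
    text="إزيك يا صاحبي عامل إيه النهاردة",
    speaker="egyptian_speaker",
    language="auto",
)

# Write the first generated waveform to disk at the returned sample rate.
sf.write("speech.wav", wavs[0], sr)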
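
The snippet above hardcodes GPU 0 and float16, so it will not load on a machine without CUDA. A minimal sketch of one way to choose the placement at run time follows. `pick_device_and_dtype` is a hypothetical helper, and the assumption that `from_pretrained` accepts string device maps such as `"cpu"` or `"cuda:0"` should be checked against the `qwen_tts` version you have installed.

```python
def pick_device_and_dtype(cuda_available: bool, bf16_supported: bool = False):
    """Return (device_map, dtype_name) for loading the model.

    Hypothetical helper: half precision on GPU, float32 on CPU.
    """
    if not cuda_available:
        # CPU inference: stay in float32, since half precision is slow or unsupported there.
        return "cpu", "float32"
    # Prefer bfloat16 when the GPU supports it, otherwise fall back to float16.
    dtype_name = "bfloat16" if bf16_supported else "float16"
    return "cuda:0", dtype_name

# Usage with torch (left as comments so this sketch runs without torch installed):
#   import torch
#   cuda = torch.cuda.is_available()
#   device_map, dtype_name = pick_device_and_dtype(
#       cuda, cuda and torch.cuda.is_bf16_supported()
#   )
#   tts = Qwen3TTSModel.from_pretrained(
#       model_id, device_map=device_map, torch_dtype=getattr(torch, dtype_name)
#   )
print(pick_device_and_dtype(cuda_available=False))
```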

Optional Sampling Parameters

wavs, sr = tts.generate_custom_voice(
    text="النهاردة الجو جميل جدا في القاهرة",
    speaker="egyptian_speaker",
    language="auto",
    temperature=0.8,   # Controls variation (lower = more consistent)
    top_p=0.9,         # Nucleus sampling threshold
)

sf.write("speech.wav", wavs[0], sr)
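
To give some intuition for the two parameters above, here is a toy sketch of temperature scaling and nucleus (top-p) filtering on a small list of logits. It follows the textbook definitions; that `qwen_tts` applies them in exactly this way is an assumption, and the function names and logit values are illustrative only.

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by the temperature and return softmax probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

logits = [2.0, 1.0, 0.1, -1.0]          # toy values, not real model outputs
probs = apply_temperature(logits, 0.8)  # lower temperature sharpens the distribution
print(top_p_filter(probs, 0.9))         # unlikely tail tokens are zeroed out
```

Lower values of either parameter make the output more consistent between runs, at the cost of less prosodic variety.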

Dataset

The model was fine-tuned on approximately 90 hours of clean Egyptian Arabic speech collected from real spoken Egyptian sources.

The dataset includes a variety of speakers and natural conversational language covering a wide range of topics.


Limitations

  • The model is specialized for Egyptian Arabic and may perform worse on other Arabic dialects.
  • Performance may degrade on:
    • Uncommon or rare vocabulary
    • Regional Egyptian sub-dialect variations
  • Only a single Egyptian speaker voice is currently available.
  • As with most TTS systems, output quality may vary on very long or complex sentences.
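
One common workaround for the long-sentence limitation is to split the input at sentence boundaries and synthesize each piece separately. The sketch below is a minimal version of that idea, not something shipped with the model: `split_into_chunks` and the `max_chars` default are hypothetical, and the punctuation set (Latin marks plus the Arabic question mark and semicolon) is an assumption you may want to extend.

```python
import re

# Split after sentence-ending punctuation: . ! ? plus Arabic ؟ and ؛
_SENTENCE_END = re.compile(r"(?<=[.!?؟؛])\s+")

def split_into_chunks(text: str, max_chars: int = 120) -> list[str]:
    """Split text at sentence boundaries and pack sentences into chunks of up to max_chars.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

for chunk in split_into_chunks("A b. C d? E f!", max_chars=5):
    print(chunk)
```

Each chunk can then be passed to `tts.generate_custom_voice` as in the usage examples, and the resulting waveforms joined, for instance with `numpy.concatenate` if the returned audio is a 1-D NumPy array (an assumption about the return type).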

Future Improvements

Possible future improvements include:

  • Adding more speaker voices for greater diversity
  • Training on larger Egyptian Arabic datasets
  • Improving robustness across regional Egyptian sub-dialects
  • Evaluating across multiple Arabic dialects

Author

Hamdi Mohamed — AI Engineer specializing in:

  • Large Language Models (LLMs)
  • Speech AI
  • Computer Vision



Citation

If you use this model in your research or project, please cite:

@misc{hamdi2026egyptianqwen3tts,
  author    = {Hamdi Mohamed},
  title     = {Egyptian Arabic Qwen3-TTS: Fine-tuning Large TTS Models for Regional Dialects},
  year      = {2026},
  url       = {https://huggingface.co/itshamdi404/Egy_Arabic_Qwen3-TTS-12Hz-1.7B-Base}
}