Egyptian Arabic Qwen3-TTS — Custom Voice

A fine-tuned Qwen3-TTS 1.7B model specialized in generating Egyptian Arabic (Masri) speech with a natural, conversational tone.

The model was trained on Egyptian Arabic speech data to better capture the dialectal prosody, pronunciation, and conversational rhythm that are not well represented in the base multilingual Qwen3-TTS model.

Model Details

Property	Value
Base Model	`Qwen/Qwen3-TTS-12Hz-1.7B-Base`
Task	Text-to-Speech (Speech Synthesis)
Architecture	Transformer TTS
Parameters	1.7B
Speech Codec	12Hz
Voice Type	Custom Voice
Speaker Name	`egyptian_speaker`
Primary Language	Egyptian Arabic (عربي مصري)
Training Data	~25 hours clean Egyptian speech

Motivation

While the base Qwen3-TTS model supports multiple languages, the Egyptian Arabic dialect is significantly underrepresented. On Egyptian Arabic text, the base model produced speech that sounded foreign, with inconsistent pronunciation and an unnatural conversational rhythm.

This fine-tune directly addresses those issues:

- Natural Egyptian dialect pronunciation
- Conversational prosody and tone
- Clear, clean speech output
- Retains the original model's multilingual capability

Example Usage

Installation

pip install qwen-tts soundfile torch
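
Before loading the model, it can help to confirm that the packages installed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the import names are taken from the examples in this card.

```python
import importlib.util


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used by the examples in this card
required = ["qwen_tts", "soundfile", "torch"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

An empty result only means the packages were found; it does not check versions or GPU availability.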

Basic Inference

from qwen_tts import Qwen3TTSModel
import soundfile as sf
import torch

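# Hugging Face repository ID of the fine-tuned checkpoint
model_id = "itshamdi404/Egy_Arabic_Qwen3-TTS-12Hz-1.7B-Base"

# Load the whole model onto GPU 0 in half precision
tts = Qwen3TTSModel.from_pretrained(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
)

# Synthesize one utterance with the fine-tuned speaker
wavs, sr = tts.generate_custom_voice(
    text="إزيك يا صاحبي عامل إيه النهاردة",
    speaker="egyptian_speaker",
    language="auto",
)

# Save the first waveform as a WAV file at the returned sample rate
sf.write("speech.wav", wavs[0], sr)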

Optional Sampling Parameters

wavs, sr = tts.generate_custom_voice(
    text="النهاردة الجو جميل جدا في القاهرة",
    speaker="egyptian_speaker",
    language="auto",
    temperature=0.8,   # Controls variation (lower = more consistent)
    top_p=0.9,         # Nucleus sampling threshold
)

sf.write("speech.wav", wavs[0], sr)
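
To give a feel for what `temperature` and `top_p` control, here is a toy sketch of temperature scaling followed by nucleus (top-p) filtering over a made-up four-token distribution. It illustrates the general technique only; it is not the sampling code used inside `qwen_tts`, and the function names and logit values are assumptions chosen for the example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token index from the temperature-scaled, top-p-filtered distribution."""
    probs = nucleus_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
print(nucleus_filter(apply_temperature(toy_logits, 0.8), 0.9))
print(sample_token(toy_logits, rng=random.Random(0)))
```

Lowering `temperature` concentrates probability on the top candidates, and lowering `top_p` removes more of the unlikely tail, so both push the output toward more consistent speech at the cost of variation.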

Dataset

The model was fine-tuned on approximately 90 hours of clean Egyptian Arabic speech collected from real spoken Egyptian sources.

The dataset includes a variety of speakers and natural conversational language covering a wide range of topics.

Limitations

- The model is specialized for Egyptian Arabic and may perform worse on other Arabic dialects.
- Performance may degrade on:
  - Uncommon or rare vocabulary
  - Regional Egyptian sub-dialect variations
- Only a single Egyptian speaker voice is currently available.
- Like most TTS systems, performance may vary on very long or complex sentences.
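
One common workaround for the long-sentence limitation is to split the input at sentence boundaries and synthesize each piece separately. The sketch below is one possible approach, not something this model card prescribes: the helper names, the `max_chars` default, and the choice of punctuation marks are all assumptions, and `synthesize` stands for any callable that maps a text chunk to a sequence of audio samples.

```python
import re

# Split after Latin . ! ? and after the Arabic question mark (؟) and semicolon (؛)
_SENTENCE_END = re.compile(r"(?<=[.!?؟؛])\s+")


def split_into_chunks(text, max_chars=120):
    """Split text at sentence boundaries, then pack sentences into chunks of at most max_chars."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(text, synthesize, max_chars=120):
    """Call synthesize(chunk) for each chunk and concatenate the returned samples."""
    samples = []
    for chunk in split_into_chunks(text, max_chars):
        samples.extend(synthesize(chunk))
    return samples


print(split_into_chunks("One. Two? Three!", max_chars=9))
```

With the model loaded as in Basic Inference, `synthesize` could be a small wrapper such as `lambda chunk: tts.generate_custom_voice(text=chunk, speaker="egyptian_speaker", language="auto")[0][0]`. A single sentence longer than `max_chars` is kept as one chunk rather than cut mid-sentence.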

Future Improvements

Possible future improvements include:

- Adding more speaker voices and diversity
- Training on larger Egyptian Arabic datasets
- Improving robustness across regional Egyptian sub-dialects
- Evaluating across multiple Arabic dialects

Author

Hamdi Mohamed — AI Engineer specializing in:

- Large Language Models (LLMs)
- Speech AI
- Computer Vision

Citation

If you use this model in your research or project, please cite:

@misc{hamdi2026egyptianqwen3tts,
  author    = {Hamdi Mohamed},
  title     = {Egyptian Arabic Qwen3-TTS: Fine-tuning Large TTS Models for Regional Dialects},
  year      = {2026},
  url       = {https://huggingface.co/itshamdi404/Egy_Arabic_Qwen3-TTS-12Hz-1.7B-Base}
}
