baseline: initial working version

commit 27cfe2a3f5
2026-03-30 18:18:41 +01:00
19 changed files with 3878 additions and 0 deletions

8
.gitignore vendored Normal file

@@ -0,0 +1,8 @@
__pycache__/
*.pyc
.cache/
temp/
output/
*.mp4
*.wav
*.mp3

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 Nguyen Cong Thuan Huy (mangodxd)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

45
LM_STUDIO_MIGRATION.md Normal file

@@ -0,0 +1,45 @@
# LM Studio Migration Notes
## Summary
This repo originally translated subtitle chunks through a Google Translate scraper wired directly into `src/engines.py`. That backend has been replaced with a dedicated LM Studio client that talks to an OpenAI-compatible `/v1/chat/completions` endpoint.
## New Runtime Defaults
- `LM_STUDIO_BASE_URL=http://127.0.0.1:1234/v1`
- `LM_STUDIO_API_KEY=lm-studio`
- `LM_STUDIO_MODEL=gemma-3-4b-it`
- `--translation-backend lmstudio`
## Commands Used In This Checkout
```powershell
uv venv --clear --python "C:\pinokio\bin\miniconda\python.exe" .venv
uv pip install --python .venv\Scripts\python.exe -r requirements.txt pytest
```
Validation commands:
```powershell
.venv\Scripts\python.exe -m pytest
.venv\Scripts\python.exe main.py --help
.venv\Scripts\python.exe -c "from src.translation import TranslationConfig, LMStudioTranslator; print(TranslationConfig.from_env().model)"
```
## Files Touched
- `main.py`
- `requirements.txt`
- `README.md`
- `src/engines.py`
- `src/translation.py`
- `tests/conftest.py`
- `tests/test_main_cli.py`
- `tests/test_translation.py`
## Notes
- Translation remains segment-by-segment for deterministic subtitle ordering.
- The CLI now supports `--lmstudio-base-url` and `--lmstudio-model`.
- The argument parser and help text now load before the heavy runtime imports, which makes `main.py --help` faster and more reliable.
- `src/googlev4.py` was removed from the active codebase because LM Studio is now the only supported translation backend.
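The segment-by-segment loop noted above can be reduced to a small sketch. This is illustrative only, not the actual `LMStudioTranslator` implementation; `translate_one` is a hypothetical stand-in for whatever issues a single chat-completions request:

```python
from typing import Callable, List


def translate_segments(texts: List[str], translate_one: Callable[[str], str]) -> List[str]:
    """Translate each subtitle segment independently, preserving input order.

    Because every input segment maps to exactly one output slot, subtitle
    timing never depends on the model returning a well-formed batch.
    """
    results: List[str] = []
    for text in texts:
        # Skip empty/whitespace-only segments instead of sending them to the model.
        results.append(translate_one(text) if text.strip() else "")
    return results
```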

167
README.md Normal file

@@ -0,0 +1,167 @@
# YouTube Auto Dub
YouTube Auto Dub is a Python pipeline that downloads a YouTube video, transcribes its speech with Whisper, translates the subtitle text through a local LM Studio server, and renders a subtitled output video.
## What Changed
- Translation now uses an LM Studio OpenAI-compatible `/v1/chat/completions` endpoint.
- Google Translate scraping has been removed from the active runtime path.
- LM Studio is now the default and only supported translation backend.
- Translation settings can be configured with environment variables or CLI flags.
## Requirements
- Python 3.10+
- [uv](https://docs.astral.sh/uv/)
- FFmpeg and FFprobe available on `PATH`
- LM Studio running locally with an OpenAI-compatible server enabled
## Setup
Create a UV-managed virtual environment in a repo subfolder and install dependencies:
```powershell
uv venv --python "C:\pinokio\bin\miniconda\python.exe" .venv
uv pip install --python .venv\Scripts\python.exe -r requirements.txt
```
Verify the local toolchain:
```powershell
.venv\Scripts\python.exe --version
ffmpeg -version
ffprobe -version
.venv\Scripts\python.exe main.py --help
```
## LM Studio Configuration
Start LM Studio's local server and load a translation-capable model. The default model name in this repo is:
```text
gemma-3-4b-it
```
If your local LM Studio model name differs, set it with an environment variable or `--lmstudio-model`.
### Environment Variables
```powershell
$env:LM_STUDIO_BASE_URL="http://127.0.0.1:1234/v1"
$env:LM_STUDIO_API_KEY="lm-studio"
$env:LM_STUDIO_MODEL="gemma-3-4b-it"
```
Defaults if unset:
- `LM_STUDIO_BASE_URL=http://127.0.0.1:1234/v1`
- `LM_STUDIO_API_KEY=lm-studio`
- `LM_STUDIO_MODEL=gemma-3-4b-it`
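The precedence is: CLI flag, then environment variable, then built-in default. A minimal sketch of that resolution (illustrative only; the real logic lives in `TranslationConfig.from_env`, and `resolve_lmstudio_settings` is a hypothetical name):

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LMStudioSettings:
    base_url: str
    api_key: str
    model: str


def resolve_lmstudio_settings(
    cli_base_url: Optional[str] = None,
    cli_model: Optional[str] = None,
) -> LMStudioSettings:
    """Resolve settings: a CLI flag wins, then the env var, then the default."""
    return LMStudioSettings(
        base_url=cli_base_url or os.getenv("LM_STUDIO_BASE_URL") or "http://127.0.0.1:1234/v1",
        api_key=os.getenv("LM_STUDIO_API_KEY") or "lm-studio",
        model=cli_model or os.getenv("LM_STUDIO_MODEL") or "gemma-3-4b-it",
    )
```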
## Usage
Basic example:
```powershell
.venv\Scripts\python.exe main.py "https://youtube.com/watch?v=VIDEO_ID" --lang es
```
Override the LM Studio endpoint or model from the CLI:
```powershell
.venv\Scripts\python.exe main.py "https://youtube.com/watch?v=VIDEO_ID" `
--lang fr `
--translation-backend lmstudio `
--lmstudio-base-url http://127.0.0.1:1234/v1 `
--lmstudio-model gemma-3-4b-it
```
Authentication options for restricted videos still work as before:
```powershell
.venv\Scripts\python.exe main.py "https://youtube.com/watch?v=VIDEO_ID" --lang ja --browser chrome
.venv\Scripts\python.exe main.py "https://youtube.com/watch?v=VIDEO_ID" --lang de --cookies cookies.txt
```
## CLI Options
| Option | Description |
| --- | --- |
| `url` | YouTube video URL to process |
| `--lang`, `-l` | Target language code |
| `--browser`, `-b` | Browser name for cookie extraction |
| `--cookies`, `-c` | Path to exported cookies file |
| `--gpu` | Prefer GPU acceleration when CUDA is available |
| `--whisper_model`, `-wm` | Override Whisper model |
| `--translation-backend` | Translation backend, currently `lmstudio` |
| `--lmstudio-base-url` | Override LM Studio base URL |
| `--lmstudio-model` | Override LM Studio model name |
## Translation Behavior
The LM Studio translator is tuned for subtitle-like text:
- preserves meaning, tone, and intent
- keeps punctuation natural
- returns translation text only
- preserves line and segment boundaries
- leaves names, brands, URLs, emails, code, and proper nouns unchanged unless transliteration is clearly needed
- avoids commentary, summarization, and censorship
Translation is currently performed segment-by-segment to keep subtitle ordering deterministic and reduce the risk of malformed batched output corrupting timing alignment.
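A rough sketch of what one such request and response might look like (illustrative only; the actual client lives in `src/translation.py`, and the system prompt wording here is an assumption):

```python
from typing import Any, Dict

# Illustrative system prompt; the repo's real prompt may differ.
SYSTEM_PROMPT = (
    "You are a subtitle translator. Return only the translated text and "
    "preserve line breaks, names, URLs, and code unchanged."
)


def build_translation_payload(text: str, target_lang: str, model: str) -> Dict[str, Any]:
    """Build a non-streaming chat-completions request for one subtitle segment."""
    return {
        "model": model,
        "stream": False,
        "temperature": 0.2,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Translate to {target_lang}:\n{text}"},
        ],
    }


def parse_translation_response(data: Dict[str, Any]) -> str:
    """Extract the translated text, rejecting empty or malformed responses."""
    try:
        content = data["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"Malformed chat-completions response: {data!r}") from exc
    if not content or not content.strip():
        raise ValueError("Empty translation returned by the model")
    return content.strip()
```

The payload is POSTed to `{base_url}/chat/completions` with any HTTP client (the repo ships `httpx`).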
## Testing
Run the focused validation suite:
```powershell
.venv\Scripts\python.exe -m pytest
.venv\Scripts\python.exe main.py --help
```
The tests cover:
- LM Studio request payload construction
- response parsing
- retry handling for transient HTTP failures
- empty or malformed response handling
- CLI and environment config precedence
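The retry behavior those tests exercise can be sketched as a small backoff wrapper (attempt counts and delays here are assumptions, not the repo's exact values):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientHTTPError(Exception):
    """Stand-in for a retryable failure such as HTTP 429/5xx or a timeout."""


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Retry a callable on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientHTTPError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the transient error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")  # attempts is always >= 1 in practice
```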
## Troubleshooting
### LM Studio connection errors
- Make sure LM Studio's local server is running.
- Confirm the base URL ends in `/v1`.
- Check that the loaded model name matches `LM_STUDIO_MODEL` or `--lmstudio-model`.
### Empty or malformed translations
- Try a stronger local instruction-tuned model if your current model ignores formatting.
- Keep LM Studio in non-streaming OpenAI-compatible mode.
- Review the server logs for model-side failures.
### FFmpeg missing
If startup reports missing `ffmpeg` or `ffprobe`, install FFmpeg and add it to your system `PATH`.
## Project Layout
```text
youtube-auto-dub/
|-- main.py
|-- requirements.txt
|-- language_map.json
|-- README.md
|-- LM_STUDIO_MIGRATION.md
|-- src/
| |-- core_utils.py
| |-- engines.py
| |-- media.py
| |-- translation.py
| `-- youtube.py
`-- tests/
|-- conftest.py
|-- test_main_cli.py
`-- test_translation.py
```

999
language_map.json Normal file

@@ -0,0 +1,999 @@
{
"af": {
"name": "af-ZA",
"voices": {
"male": [
"af-ZA-WillemNeural"
],
"female": [
"af-ZA-AdriNeural"
]
}
},
"sq": {
"name": "sq-AL",
"voices": {
"male": [
"sq-AL-IlirNeural"
],
"female": [
"sq-AL-AnilaNeural"
]
}
},
"am": {
"name": "am-ET",
"voices": {
"male": [
"am-ET-AmehaNeural"
],
"female": [
"am-ET-MekdesNeural"
]
}
},
"ar": {
"name": "ar-DZ",
"voices": {
"male": [
"ar-DZ-IsmaelNeural",
"ar-BH-AliNeural",
"ar-EG-ShakirNeural",
"ar-IQ-BasselNeural",
"ar-JO-TaimNeural",
"ar-KW-FahedNeural",
"ar-LB-RamiNeural",
"ar-LY-OmarNeural",
"ar-MA-JamalNeural",
"ar-OM-AbdullahNeural",
"ar-QA-MoazNeural",
"ar-SA-HamedNeural",
"ar-SY-LaithNeural",
"ar-TN-HediNeural",
"ar-AE-HamdanNeural",
"ar-YE-SalehNeural"
],
"female": [
"ar-DZ-AminaNeural",
"ar-BH-LailaNeural",
"ar-EG-SalmaNeural",
"ar-IQ-RanaNeural",
"ar-JO-SanaNeural",
"ar-KW-NouraNeural",
"ar-LB-LaylaNeural",
"ar-LY-ImanNeural",
"ar-MA-MounaNeural",
"ar-OM-AyshaNeural",
"ar-QA-AmalNeural",
"ar-SA-ZariyahNeural",
"ar-SY-AmanyNeural",
"ar-TN-ReemNeural",
"ar-AE-FatimaNeural",
"ar-YE-MaryamNeural"
]
}
},
"az": {
"name": "az-AZ",
"voices": {
"male": [
"az-AZ-BabekNeural"
],
"female": [
"az-AZ-BanuNeural"
]
}
},
"bn": {
"name": "bn-BD",
"voices": {
"male": [
"bn-BD-PradeepNeural",
"bn-IN-BashkarNeural"
],
"female": [
"bn-BD-NabanitaNeural",
"bn-IN-TanishaaNeural"
]
}
},
"bs": {
"name": "bs-BA",
"voices": {
"male": [
"bs-BA-GoranNeural"
],
"female": [
"bs-BA-VesnaNeural"
]
}
},
"bg": {
"name": "bg-BG",
"voices": {
"male": [
"bg-BG-BorislavNeural"
],
"female": [
"bg-BG-KalinaNeural"
]
}
},
"my": {
"name": "my-MM",
"voices": {
"male": [
"my-MM-ThihaNeural"
],
"female": [
"my-MM-NilarNeural"
]
}
},
"ca": {
"name": "ca-ES",
"voices": {
"male": [
"ca-ES-EnricNeural"
],
"female": [
"ca-ES-JoanaNeural"
]
}
},
"zh": {
"name": "zh-HK",
"voices": {
"male": [
"zh-HK-WanLungNeural",
"zh-CN-YunjianNeural",
"zh-CN-YunxiNeural",
"zh-CN-YunxiaNeural",
"zh-CN-YunyangNeural",
"zh-TW-YunJheNeural"
],
"female": [
"zh-HK-HiuGaaiNeural",
"zh-HK-HiuMaanNeural",
"zh-CN-XiaoxiaoNeural",
"zh-CN-XiaoyiNeural",
"zh-CN-liaoning-XiaobeiNeural",
"zh-TW-HsiaoChenNeural",
"zh-TW-HsiaoYuNeural",
"zh-CN-shaanxi-XiaoniNeural"
]
}
},
"hr": {
"name": "hr-HR",
"voices": {
"male": [
"hr-HR-SreckoNeural"
],
"female": [
"hr-HR-GabrijelaNeural"
]
}
},
"cs": {
"name": "cs-CZ",
"voices": {
"male": [
"cs-CZ-AntoninNeural"
],
"female": [
"cs-CZ-VlastaNeural"
]
}
},
"da": {
"name": "da-DK",
"voices": {
"male": [
"da-DK-JeppeNeural"
],
"female": [
"da-DK-ChristelNeural"
]
}
},
"nl": {
"name": "nl-BE",
"voices": {
"male": [
"nl-BE-ArnaudNeural",
"nl-NL-MaartenNeural"
],
"female": [
"nl-BE-DenaNeural",
"nl-NL-ColetteNeural",
"nl-NL-FennaNeural"
]
}
},
"en": {
"name": "en-AU",
"voices": {
"male": [
"en-AU-WilliamMultilingualNeural",
"en-CA-LiamNeural",
"en-HK-SamNeural",
"en-IN-PrabhatNeural",
"en-IE-ConnorNeural",
"en-KE-ChilembaNeural",
"en-NZ-MitchellNeural",
"en-NG-AbeoNeural",
"en-PH-JamesNeural",
"en-US-AndrewNeural",
"en-US-BrianNeural",
"en-SG-WayneNeural",
"en-ZA-LukeNeural",
"en-TZ-ElimuNeural",
"en-GB-RyanNeural",
"en-GB-ThomasNeural",
"en-US-AndrewMultilingualNeural",
"en-US-BrianMultilingualNeural",
"en-US-ChristopherNeural",
"en-US-EricNeural",
"en-US-GuyNeural",
"en-US-RogerNeural",
"en-US-SteffanNeural"
],
"female": [
"en-AU-NatashaNeural",
"en-CA-ClaraNeural",
"en-HK-YanNeural",
"en-IN-NeerjaExpressiveNeural",
"en-IN-NeerjaNeural",
"en-IE-EmilyNeural",
"en-KE-AsiliaNeural",
"en-NZ-MollyNeural",
"en-NG-EzinneNeural",
"en-PH-RosaNeural",
"en-US-AvaNeural",
"en-US-EmmaNeural",
"en-SG-LunaNeural",
"en-ZA-LeahNeural",
"en-TZ-ImaniNeural",
"en-GB-LibbyNeural",
"en-GB-MaisieNeural",
"en-GB-SoniaNeural",
"en-US-AnaNeural",
"en-US-AriaNeural",
"en-US-AvaMultilingualNeural",
"en-US-EmmaMultilingualNeural",
"en-US-JennyNeural",
"en-US-MichelleNeural"
]
}
},
"et": {
"name": "et-EE",
"voices": {
"male": [
"et-EE-KertNeural"
],
"female": [
"et-EE-AnuNeural"
]
}
},
"fil": {
"name": "fil-PH",
"voices": {
"male": [
"fil-PH-AngeloNeural"
],
"female": [
"fil-PH-BlessicaNeural"
]
}
},
"fi": {
"name": "fi-FI",
"voices": {
"male": [
"fi-FI-HarriNeural"
],
"female": [
"fi-FI-NooraNeural"
]
}
},
"fr": {
"name": "fr-BE",
"voices": {
"male": [
"fr-BE-GerardNeural",
"fr-CA-ThierryNeural",
"fr-CA-AntoineNeural",
"fr-CA-JeanNeural",
"fr-FR-RemyMultilingualNeural",
"fr-FR-HenriNeural",
"fr-CH-FabriceNeural"
],
"female": [
"fr-BE-CharlineNeural",
"fr-CA-SylvieNeural",
"fr-FR-VivienneMultilingualNeural",
"fr-FR-DeniseNeural",
"fr-FR-EloiseNeural",
"fr-CH-ArianeNeural"
]
}
},
"gl": {
"name": "gl-ES",
"voices": {
"male": [
"gl-ES-RoiNeural"
],
"female": [
"gl-ES-SabelaNeural"
]
}
},
"ka": {
"name": "ka-GE",
"voices": {
"male": [
"ka-GE-GiorgiNeural"
],
"female": [
"ka-GE-EkaNeural"
]
}
},
"de": {
"name": "de-AT",
"voices": {
"male": [
"de-AT-JonasNeural",
"de-DE-FlorianMultilingualNeural",
"de-DE-ConradNeural",
"de-DE-KillianNeural",
"de-CH-JanNeural"
],
"female": [
"de-AT-IngridNeural",
"de-DE-SeraphinaMultilingualNeural",
"de-DE-AmalaNeural",
"de-DE-KatjaNeural",
"de-CH-LeniNeural"
]
}
},
"el": {
"name": "el-GR",
"voices": {
"male": [
"el-GR-NestorasNeural"
],
"female": [
"el-GR-AthinaNeural"
]
}
},
"gu": {
"name": "gu-IN",
"voices": {
"male": [
"gu-IN-NiranjanNeural"
],
"female": [
"gu-IN-DhwaniNeural"
]
}
},
"he": {
"name": "he-IL",
"voices": {
"male": [
"he-IL-AvriNeural"
],
"female": [
"he-IL-HilaNeural"
]
}
},
"hi": {
"name": "hi-IN",
"voices": {
"male": [
"hi-IN-MadhurNeural"
],
"female": [
"hi-IN-SwaraNeural"
]
}
},
"hu": {
"name": "hu-HU",
"voices": {
"male": [
"hu-HU-TamasNeural"
],
"female": [
"hu-HU-NoemiNeural"
]
}
},
"is": {
"name": "is-IS",
"voices": {
"male": [
"is-IS-GunnarNeural"
],
"female": [
"is-IS-GudrunNeural"
]
}
},
"id": {
"name": "id-ID",
"voices": {
"male": [
"id-ID-ArdiNeural"
],
"female": [
"id-ID-GadisNeural"
]
}
},
"iu": {
"name": "iu-Latn-CA",
"voices": {
"male": [
"iu-Latn-CA-TaqqiqNeural",
"iu-Cans-CA-TaqqiqNeural"
],
"female": [
"iu-Latn-CA-SiqiniqNeural",
"iu-Cans-CA-SiqiniqNeural"
]
}
},
"ga": {
"name": "ga-IE",
"voices": {
"male": [
"ga-IE-ColmNeural"
],
"female": [
"ga-IE-OrlaNeural"
]
}
},
"it": {
"name": "it-IT",
"voices": {
"male": [
"it-IT-GiuseppeMultilingualNeural",
"it-IT-DiegoNeural"
],
"female": [
"it-IT-ElsaNeural",
"it-IT-IsabellaNeural"
]
}
},
"ja": {
"name": "ja-JP",
"voices": {
"male": [
"ja-JP-KeitaNeural"
],
"female": [
"ja-JP-NanamiNeural"
]
}
},
"jv": {
"name": "jv-ID",
"voices": {
"male": [
"jv-ID-DimasNeural"
],
"female": [
"jv-ID-SitiNeural"
]
}
},
"kn": {
"name": "kn-IN",
"voices": {
"male": [
"kn-IN-GaganNeural"
],
"female": [
"kn-IN-SapnaNeural"
]
}
},
"kk": {
"name": "kk-KZ",
"voices": {
"male": [
"kk-KZ-DauletNeural"
],
"female": [
"kk-KZ-AigulNeural"
]
}
},
"km": {
"name": "km-KH",
"voices": {
"male": [
"km-KH-PisethNeural"
],
"female": [
"km-KH-SreymomNeural"
]
}
},
"ko": {
"name": "ko-KR",
"voices": {
"male": [
"ko-KR-HyunsuMultilingualNeural",
"ko-KR-InJoonNeural"
],
"female": [
"ko-KR-SunHiNeural"
]
}
},
"lo": {
"name": "lo-LA",
"voices": {
"male": [
"lo-LA-ChanthavongNeural"
],
"female": [
"lo-LA-KeomanyNeural"
]
}
},
"lv": {
"name": "lv-LV",
"voices": {
"male": [
"lv-LV-NilsNeural"
],
"female": [
"lv-LV-EveritaNeural"
]
}
},
"lt": {
"name": "lt-LT",
"voices": {
"male": [
"lt-LT-LeonasNeural"
],
"female": [
"lt-LT-OnaNeural"
]
}
},
"mk": {
"name": "mk-MK",
"voices": {
"male": [
"mk-MK-AleksandarNeural"
],
"female": [
"mk-MK-MarijaNeural"
]
}
},
"ms": {
"name": "ms-MY",
"voices": {
"male": [
"ms-MY-OsmanNeural"
],
"female": [
"ms-MY-YasminNeural"
]
}
},
"ml": {
"name": "ml-IN",
"voices": {
"male": [
"ml-IN-MidhunNeural"
],
"female": [
"ml-IN-SobhanaNeural"
]
}
},
"mt": {
"name": "mt-MT",
"voices": {
"male": [
"mt-MT-JosephNeural"
],
"female": [
"mt-MT-GraceNeural"
]
}
},
"mr": {
"name": "mr-IN",
"voices": {
"male": [
"mr-IN-ManoharNeural"
],
"female": [
"mr-IN-AarohiNeural"
]
}
},
"mn": {
"name": "mn-MN",
"voices": {
"male": [
"mn-MN-BataaNeural"
],
"female": [
"mn-MN-YesuiNeural"
]
}
},
"ne": {
"name": "ne-NP",
"voices": {
"male": [
"ne-NP-SagarNeural"
],
"female": [
"ne-NP-HemkalaNeural"
]
}
},
"nb": {
"name": "nb-NO",
"voices": {
"male": [
"nb-NO-FinnNeural"
],
"female": [
"nb-NO-PernilleNeural"
]
}
},
"ps": {
"name": "ps-AF",
"voices": {
"male": [
"ps-AF-GulNawazNeural"
],
"female": [
"ps-AF-LatifaNeural"
]
}
},
"fa": {
"name": "fa-IR",
"voices": {
"male": [
"fa-IR-FaridNeural"
],
"female": [
"fa-IR-DilaraNeural"
]
}
},
"pl": {
"name": "pl-PL",
"voices": {
"male": [
"pl-PL-MarekNeural"
],
"female": [
"pl-PL-ZofiaNeural"
]
}
},
"pt": {
"name": "pt-BR",
"voices": {
"male": [
"pt-BR-AntonioNeural",
"pt-PT-DuarteNeural"
],
"female": [
"pt-BR-ThalitaMultilingualNeural",
"pt-BR-FranciscaNeural",
"pt-PT-RaquelNeural"
]
}
},
"ro": {
"name": "ro-RO",
"voices": {
"male": [
"ro-RO-EmilNeural"
],
"female": [
"ro-RO-AlinaNeural"
]
}
},
"ru": {
"name": "ru-RU",
"voices": {
"male": [
"ru-RU-DmitryNeural"
],
"female": [
"ru-RU-SvetlanaNeural"
]
}
},
"sr": {
"name": "sr-RS",
"voices": {
"male": [
"sr-RS-NicholasNeural"
],
"female": [
"sr-RS-SophieNeural"
]
}
},
"si": {
"name": "si-LK",
"voices": {
"male": [
"si-LK-SameeraNeural"
],
"female": [
"si-LK-ThiliniNeural"
]
}
},
"sk": {
"name": "sk-SK",
"voices": {
"male": [
"sk-SK-LukasNeural"
],
"female": [
"sk-SK-ViktoriaNeural"
]
}
},
"sl": {
"name": "sl-SI",
"voices": {
"male": [
"sl-SI-RokNeural"
],
"female": [
"sl-SI-PetraNeural"
]
}
},
"so": {
"name": "so-SO",
"voices": {
"male": [
"so-SO-MuuseNeural"
],
"female": [
"so-SO-UbaxNeural"
]
}
},
"es": {
"name": "es-AR",
"voices": {
"male": [
"es-AR-TomasNeural",
"es-BO-MarceloNeural",
"es-CL-LorenzoNeural",
"es-CO-GonzaloNeural",
"es-CR-JuanNeural",
"es-CU-ManuelNeural",
"es-DO-EmilioNeural",
"es-EC-LuisNeural",
"es-SV-RodrigoNeural",
"es-GQ-JavierNeural",
"es-GT-AndresNeural",
"es-HN-CarlosNeural",
"es-MX-JorgeNeural",
"es-NI-FedericoNeural",
"es-PA-RobertoNeural",
"es-PY-MarioNeural",
"es-PE-AlexNeural",
"es-PR-VictorNeural",
"es-ES-AlvaroNeural",
"es-US-AlonsoNeural",
"es-UY-MateoNeural",
"es-VE-SebastianNeural"
],
"female": [
"es-AR-ElenaNeural",
"es-BO-SofiaNeural",
"es-CL-CatalinaNeural",
"es-CO-SalomeNeural",
"es-ES-XimenaNeural",
"es-CR-MariaNeural",
"es-CU-BelkysNeural",
"es-DO-RamonaNeural",
"es-EC-AndreaNeural",
"es-SV-LorenaNeural",
"es-GQ-TeresaNeural",
"es-GT-MartaNeural",
"es-HN-KarlaNeural",
"es-MX-DaliaNeural",
"es-NI-YolandaNeural",
"es-PA-MargaritaNeural",
"es-PY-TaniaNeural",
"es-PE-CamilaNeural",
"es-PR-KarinaNeural",
"es-ES-ElviraNeural",
"es-US-PalomaNeural",
"es-UY-ValentinaNeural",
"es-VE-PaolaNeural"
]
}
},
"su": {
"name": "su-ID",
"voices": {
"male": [
"su-ID-JajangNeural"
],
"female": [
"su-ID-TutiNeural"
]
}
},
"sw": {
"name": "sw-KE",
"voices": {
"male": [
"sw-KE-RafikiNeural",
"sw-TZ-DaudiNeural"
],
"female": [
"sw-KE-ZuriNeural",
"sw-TZ-RehemaNeural"
]
}
},
"sv": {
"name": "sv-SE",
"voices": {
"male": [
"sv-SE-MattiasNeural"
],
"female": [
"sv-SE-SofieNeural"
]
}
},
"ta": {
"name": "ta-IN",
"voices": {
"male": [
"ta-IN-ValluvarNeural",
"ta-MY-SuryaNeural",
"ta-SG-AnbuNeural",
"ta-LK-KumarNeural"
],
"female": [
"ta-IN-PallaviNeural",
"ta-MY-KaniNeural",
"ta-SG-VenbaNeural",
"ta-LK-SaranyaNeural"
]
}
},
"te": {
"name": "te-IN",
"voices": {
"male": [
"te-IN-MohanNeural"
],
"female": [
"te-IN-ShrutiNeural"
]
}
},
"th": {
"name": "th-TH",
"voices": {
"male": [
"th-TH-NiwatNeural"
],
"female": [
"th-TH-PremwadeeNeural"
]
}
},
"tr": {
"name": "tr-TR",
"voices": {
"male": [
"tr-TR-AhmetNeural"
],
"female": [
"tr-TR-EmelNeural"
]
}
},
"uk": {
"name": "uk-UA",
"voices": {
"male": [
"uk-UA-OstapNeural"
],
"female": [
"uk-UA-PolinaNeural"
]
}
},
"ur": {
"name": "ur-IN",
"voices": {
"male": [
"ur-IN-SalmanNeural",
"ur-PK-AsadNeural"
],
"female": [
"ur-IN-GulNeural",
"ur-PK-UzmaNeural"
]
}
},
"uz": {
"name": "uz-UZ",
"voices": {
"male": [
"uz-UZ-SardorNeural"
],
"female": [
"uz-UZ-MadinaNeural"
]
}
},
"vi": {
"name": "vi-VN",
"voices": {
"male": [
"vi-VN-NamMinhNeural"
],
"female": [
"vi-VN-HoaiMyNeural"
]
}
},
"cy": {
"name": "cy-GB",
"voices": {
"male": [
"cy-GB-AledNeural"
],
"female": [
"cy-GB-NiaNeural"
]
}
},
"zu": {
"name": "zu-ZA",
"voices": {
"male": [
"zu-ZA-ThembaNeural"
],
"female": [
"zu-ZA-ThandoNeural"
]
}
}
}

98
latest_langmap_generate.py Normal file

@@ -0,0 +1,98 @@
"""
Language Map Generator for YouTube Auto Dub.
This script fetches the latest available voices from Microsoft Edge TTS
and generates a `language_map.json` file compatible with the
Multi-Speaker Diarization system.
It groups voices into 'male' and 'female' lists (pools) for every language,
enabling the engine to rotate voices for different speakers automatically.
Usage: python latest_langmap_generate.py
"""
import asyncio
import json
import edge_tts
from pathlib import Path
from typing import Dict, List, Any
# Define path relative to project root (assuming this script is in root or src)
# Adjust BASE_DIR if you move this script.
BASE_DIR = Path(__file__).resolve().parent
LANG_MAP_FILE = BASE_DIR / "language_map.json"
async def generate_lang_map() -> None:
print("[*] Connecting to Microsoft Edge TTS API...")
try:
# Fetch all available voices
voices = await edge_tts.list_voices()
except Exception as e:
print(f"[!] CRITICAL: Failed to fetch voices: {e}")
return
print(f"[*] Processing {len(voices)} raw voice entries...")
# Structure: { "vi": { "name": "vi-VN", "voices": { "male": [], "female": [] } } }
lang_map: Dict[str, Any] = {}
for v in voices:
# 1. FILTER: Strict quality control - Neural voices only
if "Neural" not in v["ShortName"]:
continue
# 2. EXTRACT: Parse metadata
short_name = v["ShortName"] # e.g., "vi-VN-NamMinhNeural"
locale = v["Locale"] # e.g., "vi-VN"
gender = v["Gender"].lower() # "male" or "female"
# ISO Language Code (e.g., 'vi' from 'vi-VN')
lang_code = locale.split('-')[0]
# 3. INITIALIZE: Create structure if language not seen before
if lang_code not in lang_map:
lang_map[lang_code] = {
"name": locale, # Store locale as a friendly name reference
"voices": {
"male": [],
"female": []
}
}
# 4. POPULATE: Add voice to the specific gender pool
# This creates the "List" structure required by engines.py
target_list = lang_map[lang_code]["voices"].get(gender)
# Handle case where gender might be undefined or new
if target_list is None:
lang_map[lang_code]["voices"][gender] = []
target_list = lang_map[lang_code]["voices"][gender]
if short_name not in target_list:
target_list.append(short_name)
# 5. OPTIMIZE: Remove languages with empty voice lists (optional cleanup)
final_map = {
k: v for k, v in lang_map.items()
if v["voices"]["male"] or v["voices"]["female"]
}
# 6. SAVE: Write to JSON
try:
with open(LANG_MAP_FILE, "w", encoding="utf-8") as f:
json.dump(final_map, f, ensure_ascii=False, indent=2)
print(f"\n[+] SUCCESS! Generated configuration for {len(final_map)} languages.")
print(f" File saved to: {LANG_MAP_FILE}")
# Preview a specific language (e.g., Vietnamese)
if "vi" in final_map:
print("\n[*] Preview (Vietnamese):")
print(json.dumps(final_map["vi"], indent=2))
except Exception as e:
print(f"[!] ERROR: Failed to write JSON file: {e}")
if __name__ == "__main__":
asyncio.run(generate_lang_map())

Binary file not shown.

364
main.py Normal file

@@ -0,0 +1,364 @@
#!/usr/bin/env python3
"""YouTube Auto Dub command-line entrypoint."""
from __future__ import annotations
import argparse
import asyncio
import shutil
import time
from src.core_utils import ConfigurationError
from src.translation import TranslationConfig
def build_parser() -> argparse.ArgumentParser:
"""Build the command-line parser."""
parser = argparse.ArgumentParser(
description="YouTube Auto Dub - Automated Video Subtitling",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""\
Examples:
python main.py "https://youtube.com/watch?v=VIDEO_ID" --lang es
python main.py "https://youtube.com/watch?v=VIDEO_ID" --lang fr --gpu
python main.py "https://youtube.com/watch?v=VIDEO_ID" --lang ja --browser chrome
python main.py "https://youtube.com/watch?v=VIDEO_ID" --whisper_model large-v3
python main.py "https://youtube.com/watch?v=VIDEO_ID" --lmstudio-model gemma-3-4b-it
""",
)
parser.add_argument("url", help="YouTube video URL to subtitle")
parser.add_argument(
"--lang",
"-l",
default="es",
help="Target language ISO code (e.g., es, fr, ja, vi).",
)
parser.add_argument(
"--browser",
"-b",
help="Browser to extract cookies from (chrome, edge, firefox). Close browser first!",
)
parser.add_argument(
"--cookies",
"-c",
help="Path to cookies.txt file (Netscape format) for YouTube authentication",
)
parser.add_argument(
"--gpu",
action="store_true",
help="Use GPU acceleration for Whisper when CUDA is available.",
)
parser.add_argument(
"--whisper_model",
"-wm",
help="Whisper model to use (tiny, base, small, medium, large-v3). Default: auto-select based on VRAM",
)
parser.add_argument(
"--translation-backend",
default="lmstudio",
choices=["lmstudio"],
help="Translation backend to use. Currently only 'lmstudio' is supported.",
)
parser.add_argument(
"--lmstudio-base-url",
help="Override the LM Studio OpenAI-compatible base URL (default: env or http://127.0.0.1:1234/v1).",
)
parser.add_argument(
"--lmstudio-model",
help="Override the LM Studio model name (default: env or gemma-3-4b-it).",
)
return parser
def _check_deps() -> None:
"""Verify critical runtime dependencies."""
from shutil import which
missing = []
if not which("ffmpeg"):
missing.append("ffmpeg")
if not which("ffprobe"):
missing.append("ffprobe")
if missing:
print(f"[!] CRITICAL: Missing dependencies: {', '.join(missing)}")
print(" Please install FFmpeg and add it to your System PATH.")
print(" Download: https://ffmpeg.org/download.html")
raise SystemExit(1)
try:
import torch
print(f"[*] PyTorch {torch.__version__} | CUDA Available: {torch.cuda.is_available()}")
except ImportError:
print("[!] CRITICAL: PyTorch not installed.")
print(" Install with your UV env, for example:")
print(" uv pip install --python .venv\\Scripts\\python.exe -r requirements.txt")
raise SystemExit(1)
def _cleanup() -> None:
"""Clean up the temp directory with retries for Windows file locks."""
import src.engines
max_retries = 5
for attempt in range(max_retries):
try:
if src.engines.TEMP_DIR.exists():
shutil.rmtree(src.engines.TEMP_DIR)
src.engines.TEMP_DIR.mkdir(parents=True, exist_ok=True)
return
except PermissionError:
wait_time = 0.5 * (2 ** attempt)
print(f"[-] File locked (attempt {attempt + 1}/{max_retries}). Retrying in {wait_time}s...")
time.sleep(wait_time)
print(f"[!] WARNING: Could not fully clean temp directory after {max_retries} attempts.")
print(f" Files may persist in: {src.engines.TEMP_DIR}")
def _detect_device() -> str:
"""Detect the best available inference device."""
import torch
if torch.backends.mps.is_available():
return "mps"
if torch.cuda.is_available():
return "cuda"
return "cpu"
def _build_translation_config(args: argparse.Namespace) -> TranslationConfig:
"""Resolve translation configuration from env vars plus CLI overrides."""
return TranslationConfig.from_env(
backend=args.translation_backend,
base_url=args.lmstudio_base_url,
model=args.lmstudio_model,
)
def _get_source_language_hint() -> str:
"""Read an optional source language override from the environment."""
import os
return (os.getenv("SOURCE_LANGUAGE_HINT") or "").strip()
async def _synthesize_dub_audio(engine, chunks, target_lang: str, media_module, temp_dir) -> None:
"""Generate and fit dubbed audio clips for each translated chunk."""
total = len(chunks)
for index, chunk in enumerate(chunks, start=1):
translated_text = chunk.get("trans_text", "").strip()
target_duration = max(0.0, chunk["end"] - chunk["start"])
if not translated_text or target_duration <= 0:
chunk["processed_audio"] = None
continue
raw_audio_path = temp_dir / f"tts_{index:04d}.mp3"
rate = engine.calcRate(
text=translated_text,
target_dur=target_duration,
original_text=chunk.get("text", ""),
)
await engine.synthesize(
text=translated_text,
target_lang=target_lang,
out_path=raw_audio_path,
rate=rate,
)
chunk["processed_audio"] = media_module.fit_audio(raw_audio_path, target_duration)
if index == 1 or index % 10 == 0 or index == total:
print(f"[-] Dub synthesis progress: {index}/{total}")
def main() -> None:
"""Run the full YouTube Auto Dub pipeline."""
parser = build_parser()
args = parser.parse_args()
import src.engines
import src.media
import src.youtube
print("\n" + "=" * 60)
print("YOUTUBE AUTO DUB - INITIALIZING")
print("=" * 60)
_check_deps()
try:
translation_config = _build_translation_config(args)
except ConfigurationError as exc:
print(f"[!] INVALID TRANSLATION CONFIG: {exc}")
raise SystemExit(1) from exc
_cleanup()
device = _detect_device()
print(f"[*] Using device: {device.upper()}")
print(f"[*] Translation backend: {translation_config.backend}")
print(f"[*] LM Studio endpoint: {translation_config.base_url}")
print(f"[*] LM Studio model: {translation_config.model}")
if args.whisper_model:
src.engines.ASR_MODEL = args.whisper_model
print(f"[*] Using specified Whisper model: {args.whisper_model}")
else:
print(f"[*] Auto-selected Whisper model: {src.engines.ASR_MODEL} (based on VRAM)")
try:
source_language_hint = _get_source_language_hint()
if source_language_hint:
print(f"[*] Source language hint: {source_language_hint}")
engine = src.engines.Engine(
device,
translation_config=translation_config,
source_language_hint=source_language_hint,
)
print(f"\n{'=' * 60}")
print("STEP 1: DOWNLOADING CONTENT")
print(f"{'=' * 60}")
print(f"[*] Target URL: {args.url}")
print(f"[*] Target Language: {args.lang.upper()}")
try:
video_path = src.youtube.downloadVideo(
args.url,
browser=args.browser,
cookies_file=args.cookies,
)
audio_path = src.youtube.downloadAudio(
args.url,
browser=args.browser,
cookies_file=args.cookies,
)
print(f"[+] Video downloaded: {video_path}")
print(f"[+] Audio extracted: {audio_path}")
except Exception as exc:
print(f"\n[!] DOWNLOAD FAILED: {exc}")
print("\n[-] TROUBLESHOOTING TIPS:")
print(" 1. Close all browser windows if using --browser")
print(" 2. Export fresh cookies.txt and use --cookies")
print(" 3. Check if video is private/region-restricted")
print(" 4. Verify YouTube URL is correct")
return
print(f"\n{'=' * 60}")
print("STEP 2: SPEECH TRANSCRIPTION")
print(f"{'=' * 60}")
print(f"[*] Transcribing audio with Whisper ({src.engines.ASR_MODEL})...")
raw_segments = engine.transcribeSafe(audio_path)
print(f"[+] Transcription complete: {len(raw_segments)} segments")
if raw_segments:
print(f"[*] Sample segment: '{raw_segments[0]['text'][:50]}...'")
print(f"\n{'=' * 60}")
print("STEP 3: INTELLIGENT CHUNKING")
print(f"{'=' * 60}")
chunks = src.engines.smartChunk(raw_segments)
print(f"[+] Optimized {len(raw_segments)} raw segments into {len(chunks)} chunks")
print(f"[*] Average chunk duration: {sum(c['end'] - c['start'] for c in chunks) / len(chunks):.2f}s")
print(f"\n{'=' * 60}")
print(f"STEP 4: TRANSLATION ({args.lang.upper()})")
print(f"{'=' * 60}")
texts = [chunk["text"] for chunk in chunks]
print(f"[*] Translating {len(texts)} text segments...")
translated_texts = engine.translateSafe(texts, args.lang)
for index, chunk in enumerate(chunks):
chunk["trans_text"] = translated_texts[index]
print("[+] Translation complete")
if chunks:
original = chunks[0]["text"][:50]
translated = chunks[0]["trans_text"][:50]
print(f"[*] Sample: '{original}' -> '{translated}'")
print(f"\n{'=' * 60}")
print("STEP 5: DUB AUDIO SYNTHESIS")
print(f"{'=' * 60}")
print(f"[*] Synthesizing dubbed speech for {len(chunks)} translated chunks...")
asyncio.run(_synthesize_dub_audio(engine, chunks, args.lang, src.media, src.engines.TEMP_DIR))
concat_manifest_path = src.engines.TEMP_DIR / "dub_audio_manifest.txt"
silence_ref_path = src.engines.TEMP_DIR / "silence_ref.wav"
src.media.create_concat_file(chunks, silence_ref_path, concat_manifest_path)
print(f"[+] Dub audio manifest generated: {concat_manifest_path}")
print(f"\n{'=' * 60}")
print("STEP 6: SUBTITLE GENERATION")
print(f"{'=' * 60}")
subtitle_path = src.engines.TEMP_DIR / "subtitles.srt"
src.media.generate_srt(chunks, subtitle_path)
print(f"[+] Subtitles generated: {subtitle_path}")
print(f"\n{'=' * 60}")
print("STEP 7: FINAL VIDEO RENDERING")
print(f"{'=' * 60}")
try:
video_name = video_path.stem
output_name = f"dubbed_{args.lang}_{video_name}.mp4"
final_output = src.engines.OUTPUT_DIR / output_name
print("[*] Rendering final video with dubbed audio and subtitles...")
print(f" Source: {video_path}")
print(f" Output: {final_output}")
print(f" Dub audio manifest: {concat_manifest_path}")
print(f" Subtitles: {subtitle_path}")
src.media.render_video(
video_path,
concat_manifest_path,
final_output,
subtitle_path=subtitle_path,
)
if final_output.exists():
file_size = final_output.stat().st_size / (1024 * 1024)
print("\n[+] SUCCESS! Video rendered successfully.")
print(f" Output: {final_output}")
print(f" Size: {file_size:.1f} MB")
else:
print(f"\n[!] ERROR: Output file not created at {final_output}")
except Exception as exc:
print(f"\n[!] RENDERING FAILED: {exc}")
print("[-] This may be due to:")
print(" 1. Corrupted audio chunks")
print(" 2. FFmpeg compatibility issues")
print(" 3. Insufficient disk space")
return
finally:
if "engine" in locals():
engine.translator.close()
print(f"\n{'=' * 60}")
print("YOUTUBE AUTO SUB - PIPELINE COMPLETE")
print(f"{'=' * 60}")
if __name__ == "__main__":
try:
main()
except KeyboardInterrupt:
print("\n[!] Process interrupted by user")
raise SystemExit(1)
except Exception as exc:
print(f"\n[!] UNEXPECTED ERROR: {exc}")
print("[-] Please report this issue with the full error message")
raise SystemExit(1) from exc
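Step 6 above writes `temp/subtitles.srt` via `src.media.generate_srt`. Independent of that implementation, the timestamp format SRT players expect can be sketched as follows; this is a minimal illustration of the standard `HH:MM:SS,mmm` convention, not the project's own code:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a float second count as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

print(srt_timestamp(3661.5))  # 01:01:01,500
```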

11
requirements.txt Normal file
View File

@@ -0,0 +1,11 @@
yt-dlp
faster-whisper
torch
edge-tts
httpx
librosa
numpy
soundfile
tqdm
typing-extensions
pytest

127
run-auto-dub.ps1 Normal file
View File

@@ -0,0 +1,127 @@
param(
[string]$DefaultVideoUrl = "https://youtu.be/EExM3dueOeM",
[string]$DefaultOutputLanguage = "es",
[string]$DefaultInputLanguage = "",
[string]$DefaultLmStudioBaseUrl = "http://127.0.0.1:1234/v1",
[string]$DefaultLmStudioApiKey = "lm-studio",
[string]$DefaultLmStudioModel = "gemma-3-4b-it"
)
$ErrorActionPreference = "Stop"
function Read-Value {
param(
[Parameter(Mandatory = $true)]
[string]$Prompt,
[string]$DefaultValue = "",
[switch]$Required
)
if ($DefaultValue) {
$value = Read-Host "$Prompt [$DefaultValue]"
if ([string]::IsNullOrWhiteSpace($value)) {
$value = $DefaultValue
}
}
else {
$value = Read-Host $Prompt
}
if ($Required -and [string]::IsNullOrWhiteSpace($value)) {
throw "A value is required for: $Prompt"
}
return $value.Trim()
}
$repoRoot = Split-Path -Parent $MyInvocation.MyCommand.Path
$pythonExe = Join-Path $repoRoot ".venv\Scripts\python.exe"
$mainPy = Join-Path $repoRoot "main.py"
$logsDir = Join-Path $repoRoot "logs"
$timestamp = Get-Date -Format "yyyyMMdd-HHmmss"
$logFile = Join-Path $logsDir "auto-dub-$timestamp.log"
if (-not (Test-Path $pythonExe)) {
throw "Python executable not found at $pythonExe. Create the UV environment first."
}
if (-not (Test-Path $mainPy)) {
throw "main.py not found at $mainPy."
}
New-Item -ItemType Directory -Force -Path $logsDir | Out-Null
Write-Host ""
Write-Host "YouTube Auto Dub Launcher" -ForegroundColor Cyan
Write-Host "Repo: $repoRoot"
Write-Host "Log file: $logFile"
Write-Host ""
Write-Host "Leave input language blank to let Whisper auto-detect it." -ForegroundColor Yellow
Write-Host ""
$videoUrl = Read-Value -Prompt "Video URL" -DefaultValue $DefaultVideoUrl -Required
$outputLanguage = Read-Value -Prompt "Output language code" -DefaultValue $DefaultOutputLanguage -Required
$inputLanguage = Read-Value -Prompt "Input language code (optional)" -DefaultValue $DefaultInputLanguage
$lmStudioBaseUrl = Read-Value -Prompt "LM Studio base URL" -DefaultValue $DefaultLmStudioBaseUrl -Required
$lmStudioApiKey = Read-Value -Prompt "LM Studio API key" -DefaultValue $DefaultLmStudioApiKey -Required
$lmStudioModel = Read-Value -Prompt "LM Studio model" -DefaultValue $DefaultLmStudioModel -Required
$env:LM_STUDIO_BASE_URL = $lmStudioBaseUrl
$env:LM_STUDIO_API_KEY = $lmStudioApiKey
$env:LM_STUDIO_MODEL = $lmStudioModel
$commandArgs = @(
$mainPy,
$videoUrl,
"--lang",
$outputLanguage
)
if (-not [string]::IsNullOrWhiteSpace($inputLanguage)) {
$env:SOURCE_LANGUAGE_HINT = $inputLanguage
Write-Host "Using input language hint: $inputLanguage" -ForegroundColor Yellow
}
else {
Remove-Item Env:SOURCE_LANGUAGE_HINT -ErrorAction SilentlyContinue
}
Write-Host ""
Write-Host "Running with:" -ForegroundColor Cyan
Write-Host " Video URL: $videoUrl"
Write-Host " Output language: $outputLanguage"
Write-Host " LM Studio URL: $lmStudioBaseUrl"
Write-Host " LM Studio model: $lmStudioModel"
if ($inputLanguage) {
Write-Host " Input language hint: $inputLanguage"
}
else {
Write-Host " Input language hint: auto-detect"
}
Write-Host ""
Push-Location $repoRoot
try {
$commandLine = @($pythonExe) + $commandArgs
"[$(Get-Date -Format s)] Starting run" | Tee-Object -FilePath $logFile -Append | Out-Null
"[$(Get-Date -Format s)] Command: $($commandLine -join ' ')" | Tee-Object -FilePath $logFile -Append | Out-Null
"[$(Get-Date -Format s)] LM_STUDIO_BASE_URL=$lmStudioBaseUrl" | Tee-Object -FilePath $logFile -Append | Out-Null
"[$(Get-Date -Format s)] LM_STUDIO_MODEL=$lmStudioModel" | Tee-Object -FilePath $logFile -Append | Out-Null
if ($inputLanguage) {
"[$(Get-Date -Format s)] SOURCE_LANGUAGE_HINT=$inputLanguage" | Tee-Object -FilePath $logFile -Append | Out-Null
}
& $pythonExe @commandArgs 2>&1 | Tee-Object -FilePath $logFile -Append
if ($LASTEXITCODE -ne 0) {
throw "main.py exited with code $LASTEXITCODE"
}
}
catch {
Write-Host ""
Write-Host "The run failed." -ForegroundColor Red
Write-Host $_.Exception.Message -ForegroundColor Red
"[$(Get-Date -Format s)] Launcher error: $($_.Exception.Message)" | Tee-Object -FilePath $logFile -Append | Out-Null
}
finally {
Pop-Location
Write-Host ""
Write-Host "Run log saved to: $logFile" -ForegroundColor Cyan
Read-Host "Press Enter to close"
}

4
src/__init__.py Normal file
View File

@@ -0,0 +1,4 @@
"""YouTube Auto Dub - Automated Video Translation and Dubbing"""
__version__ = "1.0.0"
__author__ = "Nguyen Cong Thuan Huy (mangodxd)"

181
src/core_utils.py Normal file
View File

@@ -0,0 +1,181 @@
"""Core utilities and exceptions for YouTube Auto Sub.
This module consolidates shared utilities, exceptions, and helper functions
used across the entire pipeline to reduce code duplication.
Author: Nguyen Cong Thuan Huy (mangodxd)
Version: 1.0.0
"""
import subprocess
import traceback
from pathlib import Path
from typing import List
class YouTubeAutoSubError(Exception):
"""Base exception for all YouTube Auto Sub errors."""
pass
class ModelLoadError(YouTubeAutoSubError):
"""Raised when AI/ML model fails to load."""
pass
class AudioProcessingError(YouTubeAutoSubError):
"""Raised when audio processing operations fail."""
pass
class TranscriptionError(YouTubeAutoSubError):
"""Raised when speech transcription fails."""
pass
class TranslationError(YouTubeAutoSubError):
"""Raised when text translation fails."""
pass
class TTSError(YouTubeAutoSubError):
"""Raised when text-to-speech synthesis fails."""
pass
class VideoProcessingError(YouTubeAutoSubError):
"""Raised when video processing operations fail."""
pass
class ConfigurationError(YouTubeAutoSubError):
"""Raised when configuration is invalid or missing."""
pass
class DependencyError(YouTubeAutoSubError):
"""Raised when required dependencies are missing."""
pass
class ValidationError(YouTubeAutoSubError):
"""Raised when input validation fails."""
pass
class ResourceError(YouTubeAutoSubError):
"""Raised when system resources are insufficient."""
pass
def _handleError(error: Exception, context: str = "") -> None:
"""Centralized error handling with context.
Args:
error: The exception that occurred.
context: Additional context about where the error occurred.
Returns:
None
"""
if context:
print(f"[!] ERROR in {context}: {error}")
else:
print(f"[!] ERROR: {error}")
print(f" Full traceback: {traceback.format_exc()}")
def _runFFmpegCmd(cmd: List[str], timeout: int = 300, description: str = "FFmpeg operation") -> None:
"""Run FFmpeg command with consistent error handling.
Args:
cmd: FFmpeg command to run.
timeout: Command timeout in seconds.
description: Description for error messages.
Raises:
RuntimeError: If FFmpeg command fails.
"""
try:
subprocess.run(cmd, check=True, timeout=timeout)
except subprocess.TimeoutExpired:
raise RuntimeError(f"{description} timed out")
except subprocess.CalledProcessError as e:
raise RuntimeError(f"{description} failed: {e}")
except Exception as e:
raise RuntimeError(f"Unexpected error during {description}: {e}")
def _validateAudioFile(file_path: Path, min_size: int = 1024) -> bool:
"""Validate that audio file exists and has minimum size.
Args:
file_path: Path to audio file.
min_size: Minimum file size in bytes.
Returns:
True if file is valid, False otherwise.
"""
if not file_path.exists():
return False
if file_path.stat().st_size < min_size:
return False
return True
def _safeFileDelete(file_path: Path) -> None:
"""Safely delete file with error handling.
Args:
file_path: Path to file to delete.
Returns:
None
"""
try:
if file_path.exists():
file_path.unlink()
except Exception as e:
print(f"[!] WARNING: Could not delete file {file_path}: {e}")
class ProgressTracker:
"""Simple progress tracking for long operations."""
def __init__(self, total: int, description: str = "Processing", update_interval: int = 10):
"""Initialize progress tracker.
Args:
total: Total number of items to process.
description: Description for progress messages.
update_interval: How often to update progress (every N items).
"""
self.total = total
self.description = description
self.update_interval = update_interval
self.current = 0
def update(self, increment: int = 1) -> None:
"""Update progress counter.
Args:
increment: Number of items processed.
Returns:
None
"""
self.current += increment
if self.current % self.update_interval == 0 or self.current >= self.total:
progress = (self.current / self.total) * 100
print(f"[-] {self.description}: {self.current}/{self.total} ({progress:.1f}%)", end='\r')
if self.current >= self.total:
print()
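`ProgressTracker` is dependency-free, so its update cadence can be checked in isolation. A small usage sketch (the totals below are illustrative; the class body is a verbatim mirror of the one above):

```python
class ProgressTracker:
    """Mirror of the tracker above: prints every Nth update and on completion."""
    def __init__(self, total: int, description: str = "Processing", update_interval: int = 10):
        self.total = total
        self.description = description
        self.update_interval = update_interval
        self.current = 0

    def update(self, increment: int = 1) -> None:
        self.current += increment
        if self.current % self.update_interval == 0 or self.current >= self.total:
            progress = (self.current / self.total) * 100
            print(f"[-] {self.description}: {self.current}/{self.total} ({progress:.1f}%)", end='\r')
            if self.current >= self.total:
                print()


tracker = ProgressTracker(total=25, description="Chunks")
for _ in range(25):
    tracker.update()
# The console line is rewritten at items 10, 20, and 25 (completion).
```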

547
src/engines.py Normal file
View File

@@ -0,0 +1,547 @@
"""
AI/ML Engines Module for YouTube Auto Dub.
This module provides the core AI/ML functionality including:
- Device and configuration management
- Whisper-based speech transcription
- LM Studio translation integration
- Edge TTS synthesis
- Pipeline orchestration and chunking
Author: Nguyen Cong Thuan Huy (mangodxd)
Version: 1.0.0
"""
import torch
import asyncio
import edge_tts
import gc
import json
import os
from abc import ABC
import numpy as np
from pathlib import Path
from typing import List, Dict, Optional, Union, Any
# Local imports
from src.core_utils import (
ModelLoadError, TranscriptionError, TranslationError, TTSError,
AudioProcessingError, _handleError, _runFFmpegCmd, ProgressTracker,
_validateAudioFile, _safeFileDelete
)
from src.translation import LMStudioTranslator, TranslationConfig
# =============================================================================
# CONFIGURATION
# =============================================================================
# Base directory of the project
BASE_DIR = Path(__file__).resolve().parent.parent
# Working directories
CACHE_DIR = BASE_DIR / ".cache"
OUTPUT_DIR = BASE_DIR / "output"
TEMP_DIR = BASE_DIR / "temp"
# Configuration files
LANG_MAP_FILE = BASE_DIR / "language_map.json"
# Ensure directories exist
for directory_path in [CACHE_DIR, OUTPUT_DIR, TEMP_DIR]:
directory_path.mkdir(parents=True, exist_ok=True)
# Audio processing settings
SAMPLE_RATE = 24000
AUDIO_CHANNELS = 1
def _select_optimal_whisper_model(device: str = "cpu") -> str:
"""Select optimal Whisper model based on available VRAM and device.
Args:
device: Device type ('cuda' or 'cpu').
Returns:
Optimal Whisper model name.
"""
if device == "cpu":
return "base" # CPU works best with base model
try:
if not torch.cuda.is_available():
return "base"
# Get VRAM information
gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3) # GB
if gpu_memory < 4:
return "tiny" # < 4GB VRAM
elif gpu_memory < 8:
return "base" # 4-8GB VRAM
elif gpu_memory < 12:
return "small" # 8-12GB VRAM
elif gpu_memory < 16:
return "medium" # 12-16GB VRAM
else:
return "large-v3" # > 16GB VRAM - use latest large model
except Exception:
return "base" # Fallback to base if detection fails
ASR_MODEL = _select_optimal_whisper_model(device="cuda" if torch.cuda.is_available() else "cpu")
DEFAULT_VOICE = "en-US-AriaNeural"
# Load language configuration
try:
with open(LANG_MAP_FILE, "r", encoding="utf-8") as f:
LANG_DATA = json.load(f)
print(f"[*] Loaded language configuration for {len(LANG_DATA)} languages")
except (FileNotFoundError, json.JSONDecodeError) as e:
print(f"[!] WARNING: Could not load language map from {LANG_MAP_FILE}: {e}")
LANG_DATA = {}
class DeviceManager:
"""Centralized device detection and management."""
def __init__(self, device: Optional[str] = None):
"""Initialize device manager.
Args:
device: Device type ('cuda' or 'cpu'). If None, auto-detects.
"""
if device is None:
if torch.backends.mps.is_available(): #macOS
device = "mps"
elif torch.cuda.is_available():
device = "cuda"
else:
device = "cpu"
self.device = device
self._logDeviceInfo()
def _logDeviceInfo(self) -> None:
"""Log device information to console.
Args:
None
Returns:
None
"""
print(f"[*] Device initialized: {self.device.upper()}")
if self.device == "cuda":
gpu_name = torch.cuda.get_device_name(0)
gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
print(f" GPU: {gpu_name} | VRAM: {gpu_memory:.1f} GB")
def getMemoryInfo(self) -> Dict[str, float]:
"""Get GPU memory usage information.
Args:
None
Returns:
Dictionary with allocated and reserved memory in GB.
"""
if self.device != "cuda":
return {"allocated": 0.0, "reserved": 0.0}
return {
"allocated": torch.cuda.memory_allocated(0) / (1024**3),
"reserved": torch.cuda.memory_reserved(0) / (1024**3)
}
def clearCache(self) -> None:
"""Clear GPU cache and run garbage collection.
Args:
None
Returns:
None
"""
if self.device == "cuda":
torch.cuda.empty_cache()
gc.collect()
class ConfigManager:
"""Centralized configuration access with validation."""
def getLanguageConfig(self, lang_code: str) -> Dict[str, Any]:
"""Get language configuration by language code.
Args:
lang_code: ISO language code.
Returns:
Language configuration dictionary.
"""
return LANG_DATA.get(lang_code, {})
def extractVoice(self, voice_data) -> str:
"""Extract voice string from various data formats.
Args:
voice_data: Voice data in list, string, or other format.
Returns:
Voice string for TTS.
"""
if isinstance(voice_data, list):
return voice_data[0] if voice_data else DEFAULT_VOICE
if isinstance(voice_data, str):
return voice_data
return DEFAULT_VOICE
def getVoicePool(self, lang_code: str, gender: str) -> list:
"""Get pool of available voices for language and gender.
Args:
lang_code: ISO language code.
gender: Voice gender (male/female).
Returns:
List of available voice strings.
"""
lang_config = self.getLanguageConfig(lang_code)
voices = lang_config.get('voices', {})
pool = voices.get(gender, [DEFAULT_VOICE])
if isinstance(pool, str):
pool = [pool]
return pool
class PipelineComponent(ABC):
"""Base class for pipeline components with shared utilities."""
def __init__(self, device_manager: DeviceManager, config_manager: ConfigManager):
"""Initialize pipeline component.
Args:
device_manager: Device management instance.
config_manager: Configuration management instance.
"""
self.device_manager = device_manager
self.config_manager = config_manager
self.device = device_manager.device
def _validateFileExists(self, file_path: Path, description: str = "File") -> None:
"""Validate that a file exists.
Args:
file_path: Path to validate.
description: Description for error messages.
Raises:
FileNotFoundError: If file doesn't exist.
"""
if not file_path.exists():
raise FileNotFoundError(f"{description} not found: {file_path}")
def _ensureDirectory(self, directory: Path) -> None:
"""Ensure directory exists, create if necessary.
Args:
directory: Directory path to ensure exists.
Returns:
None
"""
directory.mkdir(parents=True, exist_ok=True)
# =============================================================================
# MAIN AI/ML ENGINE
# =============================================================================
class Engine(PipelineComponent):
"""Central AI/ML engine for YouTube Auto Dub pipeline."""
def __init__(
self,
device: Optional[str] = None,
translation_config: Optional[TranslationConfig] = None,
source_language_hint: Optional[str] = None,
):
device_manager = DeviceManager(device)
config_manager = ConfigManager()
super().__init__(device_manager, config_manager)
self._asr = None
self.source_language_hint = (source_language_hint or os.getenv("SOURCE_LANGUAGE_HINT") or "").strip()
self.detected_source_lang = self.source_language_hint or "auto"
self.translation_config = translation_config or TranslationConfig.from_env()
self.translator = LMStudioTranslator(self.translation_config)
print("[+] AI Engine initialized successfully")
@property
def asrModel(self):
"""Lazy-load Whisper ASR model.
Returns:
Loaded Whisper model instance.
Raises:
ModelLoadError: If model fails to load.
"""
if not self._asr:
print(f"[*] Loading Whisper model ({ASR_MODEL}) on {self.device}...")
try:
from faster_whisper import WhisperModel
# CTranslate2 (faster-whisper) supports only CPU and CUDA, so fall back to CPU on MPS.
asr_device = self.device if self.device == "cuda" else "cpu"
compute_type = "float16" if asr_device == "cuda" else "int8"
self._asr = WhisperModel(ASR_MODEL, device=asr_device, compute_type=compute_type)
print("[+] Whisper model loaded successfully")
except Exception as e:
raise ModelLoadError(f"Failed to load Whisper model: {e}") from e
return self._asr
def _getLangConfig(self, lang: str) -> Dict:
"""Get language configuration.
Args:
lang: Language code.
Returns:
Language configuration dictionary.
"""
return self.config_manager.getLanguageConfig(lang)
def _extractVoiceString(self, voice_data: Union[str, List[str], None]) -> str:
"""Extract voice string from data.
Args:
voice_data: Voice data in various formats.
Returns:
Voice string for TTS.
"""
return self.config_manager.extractVoice(voice_data)
def releaseMemory(self, component: Optional[str] = None) -> None:
"""Release VRAM and clean up GPU memory.
Args:
component: Specific component to release ('asr').
If None, releases all components.
Returns:
None
"""
if component in [None, 'asr'] and self._asr:
del self._asr
self._asr = None
print("[*] ASR VRAM cleared")
self.device_manager.clearCache()
def transcribeSafe(self, audio_path: Path) -> List[Dict]:
"""Transcribe audio with automatic memory management.
Args:
audio_path: Path to audio file.
Returns:
List of transcription segments with timing.
Raises:
TranscriptionError: If transcription fails.
"""
try:
res = self.transcribe(audio_path)
self.releaseMemory('asr')
return res
except Exception as e:
_handleError(e, "transcription")
raise TranscriptionError(f"Transcription failed: {e}") from e
def translateSafe(self, texts: List[str], target_lang: str) -> List[str]:
"""Translate texts safely with memory management.
Args:
texts: List of text strings to translate.
target_lang: Target language code.
Returns:
List of translated text strings.
"""
self.releaseMemory()
return self.translate(texts, target_lang)
def transcribe(self, audio_path: Path) -> List[Dict]:
"""Transcribe audio using Whisper model.
Args:
audio_path: Path to audio file.
Returns:
List of transcription segments with start/end times and text.
"""
language = self.source_language_hint or None
segments, info = self.asrModel.transcribe(str(audio_path), word_timestamps=False, language=language)
detected = getattr(info, "language", "auto") or "auto"
self.detected_source_lang = self.source_language_hint or detected
print(f"[*] Detected source language: {self.detected_source_lang}")
return [{'start': s.start, 'end': s.end, 'text': s.text.strip()} for s in segments]
def translate(self, texts: List[str], target_lang: str) -> List[str]:
"""Translate texts to target language.
Args:
texts: List of text strings to translate.
target_lang: Target language code.
Returns:
List of translated text strings.
Raises:
TranslationError: If translation fails.
"""
if not texts: return []
print(f"[*] Translating {len(texts)} segments to '{target_lang}'...")
source_lang = self.detected_source_lang or "auto"
try:
return self.translator.translate_segments(
texts=texts,
target_language=target_lang,
source_language=source_lang,
)
except Exception as e:
_handleError(e, "translation")
raise TranslationError(f"Translation failed: {e}") from e
def calcRate(self, text: str, target_dur: float, original_text: str = "") -> str:
"""Calculate speech rate adjustment for TTS with dynamic limits.
Args:
text: Text to be synthesized (translated text).
target_dur: Target duration in seconds.
original_text: Original text for length comparison (optional).
Returns:
Rate adjustment string (e.g., '+10%', '-5%').
"""
words = len(text.split())
if words == 0 or target_dur <= 0: return "+0%"
# Estimate duration at a typical speaking pace (~2.5 words/second) and
# compare it against the slot length.
TYPICAL_WPS = 2.5
estimated_time = words / TYPICAL_WPS
if estimated_time <= target_dur:
return "+0%"
ratio = estimated_time / target_dur
speed_percent = int((ratio - 1) * 100)
# Dynamic speed limits based on text length comparison
if original_text:
orig_len = len(original_text.split())
trans_len = words
# If translated text is significantly longer, allow more slowdown
if trans_len > orig_len * 1.5:
# Allow up to -25% slowdown for longer translations
speed_percent = max(-25, min(speed_percent, 90))
elif trans_len < orig_len * 0.7:
# If translation is shorter, be more conservative with speedup
speed_percent = max(-15, min(speed_percent, 50))
else:
# Normal case: -10% to 90%
speed_percent = max(-10, min(speed_percent, 90))
else:
# Fallback to original limits
speed_percent = max(-10, min(speed_percent, 90))
return f"{speed_percent:+d}%"
async def synthesize(
self,
text: str,
target_lang: str,
out_path: Path,
gender: str = "female",
rate: str = "+0%"
) -> None:
if not text.strip(): raise ValueError("Text empty")
out_path.parent.mkdir(parents=True, exist_ok=True)
try:
voice_pool = self.config_manager.getVoicePool(target_lang, gender)
voice = voice_pool[0] if voice_pool else DEFAULT_VOICE
communicate = edge_tts.Communicate(text, voice=voice, rate=rate)
await communicate.save(str(out_path))
if not out_path.exists() or out_path.stat().st_size < 1024:
raise RuntimeError("TTS file invalid")
except Exception as e:
if out_path.exists(): out_path.unlink(missing_ok=True)
_handleError(e, "TTS synthesis")
raise TTSError(f"TTS failed: {e}") from e
def smartChunk(segments: List[Dict]) -> List[Dict]:
n = len(segments)
if n == 0: return []
# Calculate segment durations and gaps for dynamic analysis
durations = [s['end'] - s['start'] for s in segments]
gaps = [segments[i]['start'] - segments[i-1]['end'] for i in range(1, n)]
# Dynamic parameters based on actual video content
avg_seg_dur = sum(durations) / n
avg_gap = sum(gaps) / len(gaps) if gaps else 0.5
# Dynamic min/max duration based on content characteristics
min_dur = max(1.0, avg_seg_dur * 0.5) # Minimum 1s, or 50% of average
max_dur = np.percentile(durations, 90) if n > 5 else min(15.0, avg_seg_dur * 3)
max_dur = max(5.0, min(30.0, max_dur)) # Clamp between 5-30 seconds
# Hard threshold for gap-based splitting (1.5x average gap)
gap_threshold = max(0.4, avg_gap * 1.5)
path = []
curr_chunk_segs = [segments[0]]
for i in range(1, n):
prev = segments[i-1]
curr = segments[i]
gap = curr['start'] - prev['end']
# Dynamic splitting criteria:
# 1. Gap exceeds threshold (natural pause)
# 2. Current chunk exceeds safe duration
# 3. Dynamic lookback: consider context but don't go too far back
current_dur = curr['end'] - curr_chunk_segs[0]['start']
if gap > gap_threshold or current_dur > max_dur:
# Close current chunk
path.append({
'start': curr_chunk_segs[0]['start'],
'end': curr_chunk_segs[-1]['end'],
'text': " ".join(s['text'] for s in curr_chunk_segs).strip()
})
curr_chunk_segs = [curr]
else:
curr_chunk_segs.append(curr)
# Add final chunk
if curr_chunk_segs:
path.append({
'start': curr_chunk_segs[0]['start'],
'end': curr_chunk_segs[-1]['end'],
'text': " ".join(s['text'] for s in curr_chunk_segs).strip()
})
print(f"[+] Smart chunking: {len(path)} chunks (Dynamic: min={min_dur:.1f}s, max={max_dur:.1f}s, gap_thr={gap_threshold:.2f}s)")
return path
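The splitting rule in `smartChunk` — start a new chunk when the inter-segment gap exceeds the threshold, or when the running chunk passes `max_dur` — can be exercised on synthetic timings. A condensed standalone sketch of the same rule (the thresholds here are illustrative constants, not the dynamically computed ones):

```python
def split_on_gaps(segments, gap_threshold, max_dur):
    """Group contiguous segments, starting a new chunk on a large gap or overlong chunk."""
    if not segments:
        return []
    chunks, current = [], [segments[0]]
    for prev, seg in zip(segments, segments[1:]):
        gap = seg['start'] - prev['end']
        if gap > gap_threshold or seg['end'] - current[0]['start'] > max_dur:
            chunks.append(current)
            current = [seg]
        else:
            current.append(seg)
    chunks.append(current)
    return chunks


segs = [
    {'start': 0.0, 'end': 2.0}, {'start': 2.2, 'end': 4.0},  # 0.2s gap: merged
    {'start': 6.0, 'end': 8.0},                              # 2.0s gap: new chunk
]
print([len(c) for c in split_on_gaps(segs, gap_threshold=0.6, max_dur=30.0)])  # [2, 1]
```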

410
src/media.py Normal file
View File

@@ -0,0 +1,410 @@
"""Media Processing Module for YouTube Auto Dub.
This module handles all audio/video processing operations using FFmpeg.
It provides functionality for:
- Audio duration detection and analysis
- Silence generation for gap filling
- Audio time-stretching and duration fitting (PADDING logic added)
- Video concatenation and rendering (Volume Mixing fixed)
- Audio synchronization and mixing
Author: Nguyen Cong Thuan Huy (mangodxd)
Version: 1.1.0 (Patched)
"""
import subprocess
from pathlib import Path
from typing import List, Dict, Optional
from src.engines import SAMPLE_RATE, AUDIO_CHANNELS
def _build_subtitle_filter(subtitle_path: Path) -> str:
"""Build a Windows-safe FFmpeg subtitles filter expression."""
escaped_path = str(subtitle_path.resolve()).replace("\\", "/").replace(":", "\\:")
return f"subtitles=filename='{escaped_path}'"
def _render_with_soft_subtitles(video_path: Path, output_path: Path, subtitle_path: Path) -> None:
"""Fallback render path that muxes subtitles instead of hard-burning them."""
cmd = [
'ffmpeg', '-y', '-v', 'error',
'-i', str(video_path),
'-i', str(subtitle_path),
'-map', '0:v',
'-map', '0:a?',
'-map', '1:0',
'-c:v', 'copy',
'-c:a', 'copy',
'-c:s', 'mov_text',
str(output_path)
]
subprocess.run(cmd, check=True, timeout=None)
def _render_mixed_with_soft_subtitles(
video_path: Path,
concat_file: Path,
output_path: Path,
subtitle_path: Path,
filter_complex: str,
) -> None:
"""Fallback render path that muxes subtitles while preserving mixed dubbed audio."""
cmd = [
'ffmpeg', '-y', '-v', 'error',
'-i', str(video_path),
'-f', 'concat', '-safe', '0', '-i', str(concat_file),
'-i', str(subtitle_path),
'-filter_complex', filter_complex,
'-map', '0:v',
'-map', '[outa]',
'-map', '2:0',
'-c:v', 'copy',
'-c:a', 'aac', '-b:a', '192k',
'-ar', str(SAMPLE_RATE),
'-ac', str(AUDIO_CHANNELS),
'-c:s', 'mov_text',
'-shortest',
str(output_path),
]
subprocess.run(cmd, check=True, timeout=None)
def _get_duration(path: Path) -> float:
"""Get the duration of an audio/video file using FFprobe."""
if not path.exists():
print(f"[!] ERROR: Media file not found: {path}")
return 0.0
try:
cmd = [
'ffprobe', '-v', 'error',
'-show_entries', 'format=duration',
'-of', 'default=noprint_wrappers=1:nokey=1',
str(path)
]
result = subprocess.run(
cmd,
capture_output=True,
text=True,
check=True,
timeout=60 # Increased from 30s to 60s for better reliability
)
duration_str = result.stdout.strip()
if duration_str:
return float(duration_str)
else:
return 0.0
except Exception as e:
print(f"[!] ERROR: Getting duration failed for {path}: {e}")
return 0.0
def _generate_silence_segment(duration: float, silence_ref: Path) -> Optional[Path]:
"""Generate a small silence segment for the concat list."""
if duration <= 0:
return None
# Use the parent folder of the reference silence file
output_path = silence_ref.parent / f"gap_{duration:.4f}.wav"
if output_path.exists():
return output_path
try:
cmd = [
'ffmpeg', '-y', '-v', 'error',
'-f', 'lavfi', '-i', f'anullsrc=r={SAMPLE_RATE}:cl=mono',
'-t', f"{duration:.4f}",
'-c:a', 'pcm_s16le',
str(output_path)
]
subprocess.run(cmd, check=True)
return output_path
except Exception:
return None
def _analyze_audio_loudness(audio_path: Path) -> Optional[float]:
"""Analyze audio loudness using FFmpeg volumedetect filter.
Args:
audio_path: Path to audio file to analyze.
Returns:
Mean volume in dB, or None if analysis fails.
"""
if not audio_path.exists():
return None
try:
cmd = [
'ffmpeg', '-y', '-hide_banner',
'-i', str(audio_path),
'-filter:a', 'volumedetect',
'-f', 'null', '-'
]
# volumedetect reports its statistics at info level on stderr, so do not
# pass '-v error' here or the mean_volume line is suppressed.
result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
# Parse a line like "[Parsed_volumedetect_0 @ ...] mean_volume: -15.2 dB"
for line in result.stderr.split('\n'):
if 'mean_volume:' in line:
try:
return float(line.split('mean_volume:')[1].split()[0])
except (ValueError, IndexError):
continue
return None
except Exception:
return None
def fit_audio(audio_path: Path, target_dur: float) -> Path:
if not audio_path.exists() or target_dur <= 0:
return audio_path
actual_dur = _get_duration(audio_path)
if actual_dur == 0.0:
return audio_path
out_path = audio_path.parent / f"{audio_path.stem}_fit.wav"
# Increased tolerance from 0.05s to 0.15s for more natural audio
if actual_dur > target_dur + 0.15:
ratio = actual_dur / target_dur
filter_chain = []
current_ratio = ratio
# Dynamic speed limit: max 1.5x instead of 2.0x to avoid chipmunk effect
max_speed_ratio = 1.5
while current_ratio > max_speed_ratio:
filter_chain.append(f"atempo={max_speed_ratio}")
current_ratio /= max_speed_ratio
if current_ratio > 1.0:
filter_chain.append(f"atempo={current_ratio:.4f}")
filter_complex = ",".join(filter_chain)
cmd = [
'ffmpeg', '-y', '-v', 'error',
'-i', str(audio_path),
'-filter:a', f"{filter_complex},aresample={SAMPLE_RATE}",
'-t', f"{target_dur:.4f}",
'-c:a', 'pcm_s16le',
str(out_path)
]
else:
cmd = [
'ffmpeg', '-y', '-v', 'error',
'-i', str(audio_path),
'-filter:a', f"apad,aresample={SAMPLE_RATE}",
'-t', f"{target_dur:.4f}",
'-c:a', 'pcm_s16le',
str(out_path)
]
print(f"[*] Fitting audio: {actual_dur:.4f}s -> {target_dur:.4f}s")
try:
subprocess.run(cmd, check=True, timeout=120)
return out_path
except Exception:
return audio_path
def create_concat_file(segments: List[Dict], silence_ref: Path, output_txt: Path) -> None:
if not segments:
return
try:
with open(output_txt, 'w', encoding='utf-8') as f:
current_timeline = 0.0
for segment in segments:
start_time = segment['start']
end_time = segment['end']
audio_path = segment.get('processed_audio')
gap = start_time - current_timeline
if gap > 0.01:
silence_gap = _generate_silence_segment(gap, silence_ref)
if silence_gap:
f.write(f"file '{silence_gap.resolve().as_posix()}'\n")
current_timeline += gap
if audio_path and audio_path.exists():
f.write(f"file '{audio_path.resolve().as_posix()}'\n")
current_timeline += (end_time - start_time)
else:
dur = end_time - start_time
silence_err = _generate_silence_segment(dur, silence_ref)
if silence_err:
f.write(f"file '{silence_err.resolve().as_posix()}'\n")
current_timeline += dur
except Exception as e:
raise RuntimeError(f"Failed to create concat manifest: {e}")
def render_video(
video_path: Path,
concat_file: Optional[Path],
output_path: Path,
subtitle_path: Optional[Path] = None,
) -> None:
"""Render final video with Dynamic Volume Mixing."""
if not video_path.exists():
raise FileNotFoundError("Source video for rendering is missing")
if concat_file is not None and not concat_file.exists():
raise FileNotFoundError("Concat audio manifest for rendering is missing")
output_path.parent.mkdir(parents=True, exist_ok=True)
try:
print(f"[*] Rendering final video...")
if concat_file is None:
video_codec = 'copy'
cmd = [
'ffmpeg', '-y', '-v', 'error',
'-i', str(video_path),
'-map', '0:v',
'-map', '0:a?',
]
if subtitle_path:
video_codec = 'libx264'
cmd.extend(['-vf', _build_subtitle_filter(subtitle_path)])
cmd.extend([
'-c:v', video_codec,
'-c:a', 'copy',
])
cmd.append(str(output_path))
try:
subprocess.run(cmd, check=True, timeout=None, capture_output=True, text=True)
except subprocess.CalledProcessError as exc:
if subtitle_path and "No such filter: 'subtitles'" in (exc.stderr or ""):
print("[!] FFmpeg subtitles filter is unavailable. Falling back to soft subtitles.")
_render_with_soft_subtitles(video_path, output_path, subtitle_path)
else:
raise
if not output_path.exists():
raise RuntimeError("Output file not created")
print(f"[+] Video rendered successfully: {output_path}")
return
# DYNAMIC VOLUME MIXING STRATEGY:
# Analyze original audio loudness to determine optimal background volume
original_loudness = _analyze_audio_loudness(video_path)
if original_loudness is not None:
# Calculate background volume based on loudness analysis
# Target: voice should be 10-15dB louder than background
if original_loudness > -10: # Very loud audio
bg_volume = 0.08 # 8% - reduce more for loud content
elif original_loudness > -20: # Normal audio
bg_volume = 0.15 # 15% - standard reduction
else: # Quiet audio
bg_volume = 0.25 # 25% - reduce less for quiet content
print(f"[*] Dynamic volume mixing: original={original_loudness:.1f}dB, bg_volume={bg_volume*100:.0f}%")
else:
# Fallback to default if analysis fails
bg_volume = 0.15
print(f"[*] Using default volume mixing: bg_volume={bg_volume*100:.0f}%")
filter_complex = (
f"[0:a]volume={bg_volume}[bg]; "
"[bg][1:a]amix=inputs=2:duration=first:dropout_transition=0[outa]"
)
video_codec = 'copy'
cmd = [
'ffmpeg', '-y', '-v', 'error',
'-i', str(video_path),
'-f', 'concat', '-safe', '0', '-i', str(concat_file),
'-filter_complex', filter_complex,
]
# Handle Hard Subtitles (Requires re-encoding)
if subtitle_path:
video_codec = 'libx264'
cmd.extend(['-vf', _build_subtitle_filter(subtitle_path)])
cmd.extend([
'-map', '0:v',
'-map', '[outa]',
'-c:v', video_codec,
'-c:a', 'aac', '-b:a', '192k',
'-ar', str(SAMPLE_RATE),
'-ac', str(AUDIO_CHANNELS),
'-shortest'
])
cmd.append(str(output_path))
# Run rendering
try:
subprocess.run(cmd, check=True, timeout=None, capture_output=True, text=True)
except subprocess.CalledProcessError as exc:
if subtitle_path and "No such filter: 'subtitles'" in (exc.stderr or ""):
print("[!] FFmpeg subtitles filter is unavailable. Falling back to soft subtitles.")
_render_mixed_with_soft_subtitles(
video_path=video_path,
concat_file=concat_file,
output_path=output_path,
subtitle_path=subtitle_path,
filter_complex=filter_complex,
)
else:
raise
if not output_path.exists():
raise RuntimeError("Output file not created")
print(f"[+] Video rendered successfully: {output_path}")
except subprocess.CalledProcessError as e:
# capture_output=True swallows FFmpeg's diagnostics; surface stderr when available
raise RuntimeError(f"FFmpeg rendering failed: {e.stderr or e}")
except Exception as e:
raise RuntimeError(f"Rendering error: {e}")
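The loudness thresholds inside `render_video` can be summarized as a small pure function. This is a standalone sketch for illustration only; `pick_bg_volume` is a hypothetical name, not part of the module, and the thresholds simply mirror the code above.

```python
from typing import Optional

# Sketch of the dynamic volume mapping used in render_video: louder originals
# get a lower background volume so the dubbed voice sits roughly 10-15 dB
# above the bed. Thresholds mirror the source; they are not re-tuned here.
def pick_bg_volume(mean_loudness_db: Optional[float]) -> float:
    if mean_loudness_db is None:   # loudness analysis failed -> default mix
        return 0.15
    if mean_loudness_db > -10:     # very loud source audio
        return 0.08
    if mean_loudness_db > -20:     # typical source audio
        return 0.15
    return 0.25                    # quiet source audio

print(pick_bg_volume(-5.0), pick_bg_volume(-15.0), pick_bg_volume(None))
```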
def generate_srt(segments: List[Dict], output_path: Path) -> None:
"""Generate SRT subtitle file."""
if not segments: return
output_path.parent.mkdir(parents=True, exist_ok=True)
try:
with open(output_path, 'w', encoding='utf-8') as f:
for i, segment in enumerate(segments, 1):
start = _format_timestamp_srt(segment['start'])
end = _format_timestamp_srt(segment['end'])
text = segment.get('trans_text', '').strip()
f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
print(f"[+] SRT subtitles generated: {output_path}")
except Exception as e:
print(f"[!] Warning: SRT generation failed: {e}")
def _format_timestamp_srt(seconds: float) -> str:
"""Convert seconds to HH:MM:SS,mmm."""
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
secs = int(seconds % 60)
millis = int((seconds % 1) * 1000)
return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
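The SRT timestamp math above is easy to check in isolation. This sketch mirrors `_format_timestamp_srt` as a free function (the name here is illustrative, not an import of the project module):

```python
# Standalone mirror of _format_timestamp_srt: seconds -> "HH:MM:SS,mmm",
# the comma-separated millisecond form required by the SRT format.
def format_timestamp_srt(seconds: float) -> str:
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(format_timestamp_srt(3723.5))  # 1h 2m 3.5s
```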

src/translation.py Normal file

@@ -0,0 +1,358 @@
"""LM Studio translation client for YouTube Auto Dub."""
from __future__ import annotations
import os
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from urllib.parse import urlparse
import httpx
from src.core_utils import ConfigurationError, TranslationError
DEFAULT_LM_STUDIO_BASE_URL = "http://127.0.0.1:1234/v1"
DEFAULT_LM_STUDIO_API_KEY = "lm-studio"
DEFAULT_LM_STUDIO_MODEL = "gemma-3-4b-it"
DEFAULT_TRANSLATION_BACKEND = "lmstudio"
def _normalize_base_url(base_url: str) -> str:
"""Normalize LM Studio base URLs to the OpenAI-compatible /v1 root."""
if not base_url or not isinstance(base_url, str):
raise ConfigurationError("LM Studio base URL must be a non-empty string.")
normalized = base_url.strip().rstrip("/")
if normalized.endswith("/chat/completions"):
normalized = normalized[: -len("/chat/completions")]
if not normalized.endswith("/v1"):
normalized = f"{normalized}/v1"
parsed = urlparse(normalized)
if parsed.scheme not in {"http", "https"} or not parsed.netloc:
raise ConfigurationError(
"LM Studio base URL must be a valid http(s) URL, for example "
"'http://127.0.0.1:1234/v1'."
)
return normalized
@dataclass(frozen=True)
class TranslationConfig:
"""Runtime configuration for the translation backend."""
backend: str = DEFAULT_TRANSLATION_BACKEND
base_url: str = DEFAULT_LM_STUDIO_BASE_URL
api_key: str = DEFAULT_LM_STUDIO_API_KEY
model: str = DEFAULT_LM_STUDIO_MODEL
timeout_seconds: float = 45.0
max_retries: int = 3
retry_backoff_seconds: float = 1.0
@classmethod
def from_env(
cls,
backend: Optional[str] = None,
base_url: Optional[str] = None,
model: Optional[str] = None,
api_key: Optional[str] = None,
) -> "TranslationConfig":
"""Build config from environment variables plus optional overrides."""
config = cls(
backend=(backend or os.getenv("TRANSLATION_BACKEND") or DEFAULT_TRANSLATION_BACKEND).strip().lower(),
base_url=_normalize_base_url(base_url or os.getenv("LM_STUDIO_BASE_URL") or DEFAULT_LM_STUDIO_BASE_URL),
api_key=api_key or os.getenv("LM_STUDIO_API_KEY") or DEFAULT_LM_STUDIO_API_KEY,
model=model or os.getenv("LM_STUDIO_MODEL") or DEFAULT_LM_STUDIO_MODEL,
)
config.validate()
return config
@property
def chat_completions_url(self) -> str:
return f"{_normalize_base_url(self.base_url)}/chat/completions"
def validate(self) -> None:
"""Validate the translation configuration."""
if self.backend != DEFAULT_TRANSLATION_BACKEND:
raise ConfigurationError(
f"Unsupported translation backend '{self.backend}'. "
f"Only '{DEFAULT_TRANSLATION_BACKEND}' is supported."
)
if not self.model or not isinstance(self.model, str):
raise ConfigurationError("LM Studio model must be a non-empty string.")
if not self.api_key or not isinstance(self.api_key, str):
raise ConfigurationError("LM Studio API key must be a non-empty string.")
if self.timeout_seconds <= 0:
raise ConfigurationError("LM Studio timeout must be greater than zero.")
if self.max_retries < 1:
raise ConfigurationError("LM Studio max retries must be at least 1.")
if self.retry_backoff_seconds < 0:
raise ConfigurationError("LM Studio retry backoff cannot be negative.")
_normalize_base_url(self.base_url)
def _build_system_prompt(source_language: str, target_language: str) -> str:
source_descriptor = source_language or "auto"
return (
"You are a professional audiovisual translator.\n"
f"Translate the user-provided text from {source_descriptor} to {target_language}.\n"
"Preserve meaning, tone, style, and intent as closely as possible.\n"
"Keep punctuation natural and keep subtitle-like lines concise when the source is concise.\n"
"Return only the translation.\n"
"Do not explain anything.\n"
"Do not add notes, headings, metadata, or commentary.\n"
"Do not add quotation marks unless they are part of the source.\n"
"Preserve line breaks and segment boundaries exactly.\n"
"Keep names, brands, URLs, emails, code, and proper nouns unchanged unless transliteration "
"is clearly appropriate.\n"
"Expand abbreviations only when needed for a natural translation.\n"
"Do not censor, summarize, or omit content."
)
class LMStudioTranslator:
"""OpenAI-style chat completions client for LM Studio."""
def __init__(
self,
config: TranslationConfig,
client: Optional[httpx.Client] = None,
sleeper=time.sleep,
) -> None:
self.config = config
self.config.validate()
self._client = client or httpx.Client(timeout=httpx.Timeout(self.config.timeout_seconds))
self._owns_client = client is None
self._sleeper = sleeper
def build_payload(self, text: str, source_language: str, target_language: str) -> Dict[str, Any]:
"""Build the OpenAI-compatible chat completions payload."""
return {
"model": self.config.model,
"messages": [
{"role": "system", "content": _build_system_prompt(source_language, target_language)},
{"role": "user", "content": text},
],
"temperature": 0.1,
"top_p": 1,
"stream": False,
}
def build_user_only_payload(
self,
text: str,
source_language: str,
target_language: str,
) -> Dict[str, Any]:
"""Build a fallback payload for models that require the first turn to be user."""
instructions = _build_system_prompt(source_language, target_language)
merged_prompt = f"{instructions}\n\nText to translate:\n{text}"
return {
"model": self.config.model,
"messages": [
{"role": "user", "content": merged_prompt},
],
"temperature": 0.1,
"top_p": 1,
"stream": False,
}
def build_structured_translation_payload(
self,
text: str,
source_language: str,
target_language: str,
) -> Dict[str, Any]:
"""Build a payload for custom translation models with structured user content."""
return {
"model": self.config.model,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"source_lang_code": source_language or "auto",
"target_lang_code": target_language,
"text": text,
"image": None,
}
],
}
],
"temperature": 0.1,
"top_p": 1,
"stream": False,
}
@staticmethod
def parse_response_content(payload: Dict[str, Any]) -> str:
"""Extract translated text from an OpenAI-compatible response payload."""
try:
content = payload["choices"][0]["message"]["content"]
except (KeyError, IndexError, TypeError) as exc:
raise TranslationError("LM Studio response did not contain a chat completion message.") from exc
if isinstance(content, list):
parts = []
for item in content:
if isinstance(item, str):
parts.append(item)
elif isinstance(item, dict) and item.get("type") == "text":
parts.append(str(item.get("text", "")))
content = "".join(parts)
if not isinstance(content, str):
raise TranslationError("LM Studio response content was not a text string.")
translated = content.strip()
if not translated:
raise TranslationError("LM Studio returned an empty translation.")
return translated
def _headers(self) -> Dict[str, str]:
return {
"Authorization": f"Bearer {self.config.api_key}",
"Content-Type": "application/json",
}
def _should_retry(self, exc: Exception) -> bool:
if isinstance(exc, (httpx.ConnectError, httpx.ReadTimeout, httpx.WriteTimeout, httpx.PoolTimeout)):
return True
if isinstance(exc, httpx.HTTPStatusError):
return exc.response.status_code in {408, 409, 429, 500, 502, 503, 504}
return False
@staticmethod
def _should_retry_with_user_only_prompt(exc: Exception) -> bool:
if not isinstance(exc, httpx.HTTPStatusError):
return False
if exc.response.status_code != 400:
return False
response_text = exc.response.text.lower()
return "conversations must start with a user prompt" in response_text
@staticmethod
def _should_retry_with_structured_translation_prompt(exc: Exception) -> bool:
if not isinstance(exc, httpx.HTTPStatusError):
return False
if exc.response.status_code != 400:
return False
response_text = exc.response.text.lower()
return "source_lang_code" in response_text and "target_lang_code" in response_text
def _post_chat_completion(self, payload: Dict[str, Any]) -> str:
response = self._client.post(
self.config.chat_completions_url,
headers=self._headers(),
json=payload,
)
response.raise_for_status()
return self.parse_response_content(response.json())
def translate_text(
self,
text: str,
target_language: str,
source_language: str = "auto",
) -> str:
"""Translate a single text segment."""
if not text.strip():
return ""
payload = self.build_payload(text, source_language, target_language)
last_error: Optional[Exception] = None
for attempt in range(1, self.config.max_retries + 1):
try:
return self._post_chat_completion(payload)
except (httpx.HTTPError, ValueError, TranslationError) as exc:
last_error = exc
if self._should_retry_with_user_only_prompt(exc):
try:
fallback_payload = self.build_user_only_payload(text, source_language, target_language)
return self._post_chat_completion(fallback_payload)
except (httpx.HTTPError, ValueError, TranslationError) as fallback_exc:
last_error = fallback_exc
if self._should_retry_with_structured_translation_prompt(last_error):
try:
structured_payload = self.build_structured_translation_payload(
text,
source_language,
target_language,
)
return self._post_chat_completion(structured_payload)
except (httpx.HTTPError, ValueError, TranslationError) as structured_exc:
last_error = structured_exc
if attempt >= self.config.max_retries or not self._should_retry(exc):
break
self._sleeper(self.config.retry_backoff_seconds * attempt)
if isinstance(last_error, TranslationError):
raise last_error
if isinstance(last_error, ValueError):
raise TranslationError("LM Studio returned a non-JSON response.") from last_error
raise TranslationError(f"LM Studio request failed: {last_error}") from last_error
def translate_segments(
self,
texts: List[str],
target_language: str,
source_language: str = "auto",
) -> List[str]:
"""Translate an ordered list of subtitle-like segments."""
results: List[str] = []
for text in texts:
results.append(
self.translate_text(
text=text,
target_language=target_language,
source_language=source_language,
)
)
return results
def close(self) -> None:
if self._owns_client:
self._client.close()
def translate_text(
text: str,
target_language: str,
source_language: str = "auto",
config: Optional[TranslationConfig] = None,
client: Optional[httpx.Client] = None,
) -> str:
"""Translate a single text string using LM Studio."""
translator = LMStudioTranslator(config or TranslationConfig.from_env(), client=client)
try:
return translator.translate_text(text, target_language, source_language)
finally:
translator.close()
def translate_segments(
texts: List[str],
target_language: str,
source_language: str = "auto",
config: Optional[TranslationConfig] = None,
client: Optional[httpx.Client] = None,
) -> List[str]:
"""Translate a list of text strings using LM Studio."""
translator = LMStudioTranslator(config or TranslationConfig.from_env(), client=client)
try:
return translator.translate_segments(texts, target_language, source_language)
finally:
translator.close()
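The base-URL handling above accepts several spellings of the same endpoint. The following standalone sketch restates the `_normalize_base_url` rule so the accepted inputs are explicit (`normalize_base_url` is an illustrative copy, and `ValueError` stands in for the project's `ConfigurationError`):

```python
from urllib.parse import urlparse

# Mirror of _normalize_base_url: trim a trailing slash, strip a pasted
# /chat/completions suffix, ensure the /v1 root, then validate the scheme.
def normalize_base_url(base_url: str) -> str:
    normalized = base_url.strip().rstrip("/")
    if normalized.endswith("/chat/completions"):
        normalized = normalized[: -len("/chat/completions")]
    if not normalized.endswith("/v1"):
        normalized = f"{normalized}/v1"
    parsed = urlparse(normalized)
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise ValueError("expected an http(s) URL such as http://127.0.0.1:1234/v1")
    return normalized

print(normalize_base_url("http://127.0.0.1:1234"))
print(normalize_base_url("http://127.0.0.1:1234/v1/chat/completions"))
```

Both calls resolve to the same `/v1` root, which is why pasting the full chat-completions URL into `LM_STUDIO_BASE_URL` still works.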

src/youtube.py Normal file

@@ -0,0 +1,329 @@
"""YouTube Content Download Module for YouTube Auto Dub.
This module provides a robust interface for downloading YouTube content
using yt-dlp. It handles:
- Video and audio extraction from YouTube URLs
- Authentication via cookies or browser integration
- Format selection and quality optimization
- Error handling and retry logic
- Metadata extraction and validation
Author: Nguyen Cong Thuan Huy (mangodxd)
Version: 1.0.0
"""
import yt_dlp
from pathlib import Path
from typing import Optional, Dict, Any
from src.engines import CACHE_DIR
def _format_minutes_seconds(total_seconds: float) -> str:
"""Format seconds as M:SS for logging."""
seconds = int(round(total_seconds))
minutes, remaining_seconds = divmod(seconds, 60)
return f"{minutes}:{remaining_seconds:02d}"
def _getOpts(browser: Optional[str] = None,
cookies_file: Optional[str] = None,
quiet: bool = True) -> Dict[str, Any]:
"""Generate common yt-dlp options with authentication configuration.
Args:
browser: Browser name for cookie extraction (chrome, edge, firefox).
If provided, cookies will be extracted from this browser.
cookies_file: Path to cookies.txt file in Netscape format.
Takes priority over browser extraction if both provided.
quiet: Whether to suppress yt-dlp output messages.
Returns:
Dictionary of yt-dlp options.
Raises:
ValueError: If invalid browser name is provided.
Note:
Priority order: cookies_file > browser > no authentication.
"""
opts = {
'quiet': quiet,
'no_warnings': True,
'extract_flat': False,
}
if cookies_file:
cookies_path = Path(cookies_file)
if not cookies_path.exists():
raise FileNotFoundError(f"Cookies file not found: {cookies_file}")
opts['cookiefile'] = str(cookies_path)
print(f"[*] Using cookies file: {cookies_file}")
elif browser:
valid_browsers = ['chrome', 'firefox', 'edge', 'safari', 'opera', 'brave']
browser_lower = browser.lower()
if browser_lower not in valid_browsers:
raise ValueError(f"Invalid browser '{browser}'. Supported: {', '.join(valid_browsers)}")
opts['cookiesfrombrowser'] = (browser_lower,)
print(f"[*] Extracting cookies from browser: {browser}")
else:
print("[*] No authentication configured (public videos only)")
return opts
def getId(url: str,
browser: Optional[str] = None,
cookies_file: Optional[str] = None) -> str:
"""Extract YouTube video ID from URL with authentication support.
Args:
url: YouTube video URL to extract ID from.
browser: Browser name for cookie extraction.
cookies_file: Path to cookies.txt file.
Returns:
YouTube video ID as string.
Raises:
ValueError: If URL is invalid or video ID cannot be extracted.
RuntimeError: If yt-dlp fails to extract information.
Note:
This function validates the URL and extracts metadata
without downloading the actual content.
"""
if not url or not isinstance(url, str):
raise ValueError("URL must be a non-empty string")
if not any(domain in url.lower() for domain in ['youtube.com', 'youtu.be']):
raise ValueError(f"Invalid YouTube URL: {url}")
try:
print(f"[*] Extracting video ID from: {url[:50]}...")
opts = _getOpts(browser=browser, cookies_file=cookies_file)
with yt_dlp.YoutubeDL(opts) as ydl:
try:
info = ydl.extract_info(url, download=False)
video_id = info.get('id')
if not video_id:
raise RuntimeError("No video ID found in extracted information")
title = info.get('title', 'Unknown')
duration = info.get('duration', 0)
uploader = info.get('uploader', 'Unknown')
print(f"[+] Video ID extracted: {video_id}")
print(f" Title: {title[:50]}{'...' if len(title) > 50 else ''}")
print(f" Duration: {duration}s ({_format_minutes_seconds(duration)})")
print(f" Uploader: {uploader}")
return video_id
except yt_dlp.DownloadError as e:
if "Sign in to confirm" in str(e) or "private video" in str(e).lower():
raise ValueError(f"Authentication required for this video. Please use --browser or --cookies. Original error: {e}")
else:
raise RuntimeError(f"yt-dlp extraction failed: {e}")
except Exception as e:
if isinstance(e, (ValueError, RuntimeError)):
raise
raise RuntimeError(f"Failed to extract video ID: {e}") from e
def downloadVideo(url: str,
browser: Optional[str] = None,
cookies_file: Optional[str] = None) -> Path:
"""Download the best quality video with audio from YouTube.
Args:
url: YouTube video URL to download.
browser: Browser name for cookie extraction.
cookies_file: Path to cookies.txt file.
Returns:
Path to the downloaded video file.
Raises:
ValueError: If URL is invalid or authentication is required.
RuntimeError: If download fails or file is corrupted.
Note:
This function downloads both video and audio in a single file.
If the video already exists in cache, it returns the existing file.
"""
try:
video_id = getId(url, browser=browser, cookies_file=cookies_file)
except Exception as e:
raise ValueError(f"Failed to validate video URL: {e}") from e
out_path = CACHE_DIR / f"{video_id}.mp4"
if out_path.exists():
file_size = out_path.stat().st_size
if file_size > 1024 * 1024:
print(f"[*] Video already cached: {out_path}")
return out_path
else:
print(f"[!] WARNING: Cached video seems too small ({file_size} bytes), re-downloading")
out_path.unlink()
try:
print(f"[*] Downloading video: {video_id}")
opts = _getOpts(browser=browser, cookies_file=cookies_file)
opts.update({
'format': (
'bestvideo[ext=mp4][vcodec^=avc]+bestaudio[ext=m4a]/'
'best[ext=mp4]/'
'best'
),
'outtmpl': str(out_path),
'merge_output_format': 'mp4',
'postprocessors': [],
})
with yt_dlp.YoutubeDL(opts) as ydl:
ydl.download([url])
if not out_path.exists():
raise RuntimeError(f"Video file not created after download: {out_path}")
file_size = out_path.stat().st_size
if file_size < 1024 * 1024:
raise RuntimeError(f"Downloaded video file is too small: {file_size} bytes")
print("[+] Video downloaded successfully:")
print(f" File: {out_path}")
print(f" Size: {file_size / (1024*1024):.1f} MB")
return out_path
except yt_dlp.DownloadError as e:
error_msg = str(e).lower()
if "sign in to confirm" in error_msg or "private video" in error_msg:
raise ValueError(
f"Authentication required for this video. Please try:\n"
f"1. Close all browser windows and use --browser\n"
f"2. Export fresh cookies.txt and use --cookies\n"
f"3. Check if video is public/accessible\n"
f"Original error: {e}"
)
else:
raise RuntimeError(f"Video download failed: {e}")
except Exception as e:
if out_path.exists():
out_path.unlink()
raise RuntimeError(f"Video download failed: {e}") from e
def downloadAudio(url: str,
browser: Optional[str] = None,
cookies_file: Optional[str] = None) -> Path:
"""Download audio-only from YouTube for transcription processing.
Args:
url: YouTube video URL to extract audio from.
browser: Browser name for cookie extraction.
cookies_file: Path to cookies.txt file.
Returns:
Path to the downloaded WAV audio file.
Raises:
ValueError: If URL is invalid or authentication is required.
RuntimeError: If audio download or conversion fails.
Note:
The output is always converted to WAV format for consistency
with the transcription pipeline.
"""
try:
video_id = getId(url, browser=browser, cookies_file=cookies_file)
except Exception as e:
raise ValueError(f"Failed to validate video URL: {e}") from e
temp_path = CACHE_DIR / video_id
final_path = CACHE_DIR / f"{video_id}.wav"
if final_path.exists():
file_size = final_path.stat().st_size
if file_size > 1024 * 100:
print(f"[*] Audio already cached: {final_path}")
return final_path
else:
print(f"[!] WARNING: Cached audio seems too small ({file_size} bytes), re-downloading")
final_path.unlink()
try:
print(f"[*] Downloading audio: {video_id}")
opts = _getOpts(browser=browser, cookies_file=cookies_file)
opts.update({
'format': 'bestaudio/best',
'outtmpl': str(temp_path),
'postprocessors': [{
'key': 'FFmpegExtractAudio',
'preferredcodec': 'wav',
'preferredquality': '192',
}],
})
with yt_dlp.YoutubeDL(opts) as ydl:
ydl.download([url])
if not final_path.exists():
temp_files = list(CACHE_DIR.glob(f"{video_id}.*"))
if temp_files:
print(f"[!] WARNING: Expected {final_path} but found {temp_files[0]}")
final_path = temp_files[0]
else:
raise RuntimeError(f"Audio file not created after download: {final_path}")
file_size = final_path.stat().st_size
if file_size < 1024 * 100:
raise RuntimeError(f"Downloaded audio file is too small: {file_size} bytes")
print("[+] Audio downloaded successfully:")
print(f" File: {final_path}")
print(f" Size: {file_size / (1024*1024):.1f} MB")
try:
from src.media import _get_duration
duration = _get_duration(final_path)
if duration > 0:
print(f" Duration: {duration:.1f}s ({_format_minutes_seconds(duration)})")
else:
print("[!] WARNING: Could not determine audio duration")
except Exception as e:
print(f"[!] WARNING: Audio validation failed: {e}")
return final_path
except yt_dlp.DownloadError as e:
error_msg = str(e).lower()
if "sign in to confirm" in error_msg or "private video" in error_msg:
raise ValueError(
f"Authentication required for this video. Please try:\n"
f"1. Close all browser windows and use --browser\n"
f"2. Export fresh cookies.txt and use --cookies\n"
f"3. Check if video is public/accessible\n"
f"Original error: {e}"
)
else:
raise RuntimeError(f"Audio download failed: {e}")
except Exception as e:
for path in [temp_path, final_path]:
if path.exists():
path.unlink()
raise RuntimeError(f"Audio download failed: {e}") from e
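The authentication priority documented in `_getOpts` (cookies_file, then browser, then none) can be sketched without importing yt-dlp. `auth_opts` below is an illustrative mirror, not a project function; the yt-dlp option keys `cookiefile` and `cookiesfrombrowser` are the real ones used above.

```python
from pathlib import Path
from typing import Any, Dict, Optional

VALID_BROWSERS = {"chrome", "firefox", "edge", "safari", "opera", "brave"}

# Mirror of the _getOpts priority: an explicit cookies file wins over
# browser extraction, and with neither the request is unauthenticated.
def auth_opts(browser: Optional[str] = None,
              cookies_file: Optional[str] = None) -> Dict[str, Any]:
    opts: Dict[str, Any] = {}
    if cookies_file:
        if not Path(cookies_file).exists():
            raise FileNotFoundError(cookies_file)
        opts["cookiefile"] = cookies_file
    elif browser:
        if browser.lower() not in VALID_BROWSERS:
            raise ValueError(f"Invalid browser '{browser}'")
        opts["cookiesfrombrowser"] = (browser.lower(),)
    return opts

print(auth_opts(browser="Edge"))
```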

tests/conftest.py Normal file

@@ -0,0 +1,11 @@
"""Pytest configuration for local imports."""
from __future__ import annotations
import sys
from pathlib import Path
ROOT = Path(__file__).resolve().parents[1]
if str(ROOT) not in sys.path:
sys.path.insert(0, str(ROOT))

tests/test_main_cli.py Normal file

@@ -0,0 +1,61 @@
"""Tests for CLI parser and translation config wiring."""
from __future__ import annotations
from main import _build_translation_config, build_parser
def test_parser_accepts_lmstudio_flags():
parser = build_parser()
args = parser.parse_args(
[
"https://youtube.com/watch?v=demo",
"--translation-backend",
"lmstudio",
"--lmstudio-base-url",
"http://localhost:1234/v1",
"--lmstudio-model",
"gemma-custom",
]
)
assert args.translation_backend == "lmstudio"
assert args.lmstudio_base_url == "http://localhost:1234/v1"
assert args.lmstudio_model == "gemma-custom"
def test_translation_config_prefers_cli_over_env(monkeypatch):
monkeypatch.setenv("LM_STUDIO_BASE_URL", "http://env-host:1234/v1")
monkeypatch.setenv("LM_STUDIO_MODEL", "env-model")
parser = build_parser()
args = parser.parse_args(
[
"https://youtube.com/watch?v=demo",
"--lmstudio-base-url",
"http://cli-host:1234/v1",
"--lmstudio-model",
"cli-model",
]
)
config = _build_translation_config(args)
assert config.base_url == "http://cli-host:1234/v1"
assert config.model == "cli-model"
def test_translation_config_uses_env_defaults(monkeypatch):
monkeypatch.setenv("LM_STUDIO_BASE_URL", "http://env-host:1234/v1")
monkeypatch.setenv("LM_STUDIO_MODEL", "env-model")
monkeypatch.setenv("LM_STUDIO_API_KEY", "env-key")
parser = build_parser()
args = parser.parse_args(["https://youtube.com/watch?v=demo"])
config = _build_translation_config(args)
assert config.base_url == "http://env-host:1234/v1"
assert config.model == "env-model"
assert config.api_key == "env-key"

tests/test_translation.py Normal file

@@ -0,0 +1,136 @@
"""Tests for the LM Studio translation layer."""
from __future__ import annotations
import httpx
import pytest
from src.core_utils import TranslationError
from src.translation import LMStudioTranslator, TranslationConfig
def _mock_client(handler):
return httpx.Client(transport=httpx.MockTransport(handler))
def test_translation_config_normalizes_base_url():
config = TranslationConfig.from_env(base_url="http://127.0.0.1:1234")
assert config.base_url == "http://127.0.0.1:1234/v1"
assert config.chat_completions_url == "http://127.0.0.1:1234/v1/chat/completions"
assert config.model == "gemma-3-4b-it"
def test_build_payload_includes_model_and_prompt():
translator = LMStudioTranslator(TranslationConfig(), client=_mock_client(lambda request: None))
payload = translator.build_payload("Hello world", "en", "es")
assert payload["model"] == "gemma-3-4b-it"
assert payload["messages"][0]["role"] == "system"
assert "Translate the user-provided text from en to es." in payload["messages"][0]["content"]
assert payload["messages"][1]["content"] == "Hello world"
def test_translate_segments_preserves_order_and_blank_segments():
def handler(request: httpx.Request) -> httpx.Response:
text = request.read().decode("utf-8")
if "first" in text:
content = "primero"
elif "third" in text:
content = "tercero"
else:
content = "desconocido"
return httpx.Response(200, json={"choices": [{"message": {"content": content}}]})
translator = LMStudioTranslator(TranslationConfig(), client=_mock_client(handler))
translated = translator.translate_segments(["first", "", "third"], target_language="es", source_language="en")
assert translated == ["primero", "", "tercero"]
def test_retry_on_transient_http_error_then_succeeds():
attempts = {"count": 0}
def handler(request: httpx.Request) -> httpx.Response:
attempts["count"] += 1
if attempts["count"] == 1:
return httpx.Response(503, json={"error": {"message": "busy"}})
return httpx.Response(200, json={"choices": [{"message": {"content": "hola"}}]})
translator = LMStudioTranslator(
TranslationConfig(max_retries=2),
client=_mock_client(handler),
sleeper=lambda _: None,
)
translated = translator.translate_text("hello", target_language="es", source_language="en")
assert translated == "hola"
assert attempts["count"] == 2
def test_parse_response_content_rejects_empty_content():
with pytest.raises(TranslationError, match="empty translation"):
LMStudioTranslator.parse_response_content({"choices": [{"message": {"content": " "}}]})
def test_translate_text_raises_on_malformed_response():
def handler(request: httpx.Request) -> httpx.Response:
return httpx.Response(200, json={"choices": []})
translator = LMStudioTranslator(TranslationConfig(), client=_mock_client(handler))
with pytest.raises(TranslationError, match="did not contain a chat completion message"):
translator.translate_text("hello", target_language="es", source_language="en")
def test_translate_text_falls_back_to_user_only_prompt_for_template_error():
attempts = {"count": 0}
def handler(request: httpx.Request) -> httpx.Response:
attempts["count"] += 1
body = request.read().decode("utf-8")
if attempts["count"] == 1:
return httpx.Response(
400,
text='{"error":"Error rendering prompt with jinja template: \\"Conversations must start with a user prompt.\\""}',
)
assert '"role":"user"' in body
return httpx.Response(200, json={"choices": [{"message": {"content": "hola"}}]})
translator = LMStudioTranslator(TranslationConfig(), client=_mock_client(handler))
translated = translator.translate_text("hello", target_language="es", source_language="en")
assert translated == "hola"
assert attempts["count"] == 2
def test_translate_text_falls_back_to_structured_prompt_for_custom_template():
attempts = {"count": 0}
def handler(request: httpx.Request) -> httpx.Response:
attempts["count"] += 1
body = request.read().decode("utf-8")
if attempts["count"] == 1:
return httpx.Response(
400,
text='{"error":"Error rendering prompt with jinja template: \\"Conversations must start with a user prompt.\\""}',
)
if attempts["count"] == 2:
return httpx.Response(
400,
text='{"error":"Error rendering prompt with jinja template: \\"User role must provide `content` as an iterable with exactly one item. That item must be a mapping(type:\'text\' | \'image\', source_lang_code:string, target_lang_code:string, text:string | none, image:string | none).\\""}',
)
assert '"source_lang_code":"en"' in body
assert '"target_lang_code":"es"' in body
return httpx.Response(200, json={"choices": [{"message": {"content": "hola"}}]})
translator = LMStudioTranslator(TranslationConfig(), client=_mock_client(handler))
translated = translator.translate_text("hello", target_language="es", source_language="en")
assert translated == "hola"
assert attempts["count"] == 3