Show HN: KVoiceWalk – Voice cloning for Kokoro TTS using random walk algorithms
The scoring mechanism uses Resemblyzer to calculate two things: similarity to the target audio, and similarity to another segment of audio the model generates itself (self similarity). This self similarity was key in keeping the model stable and the audio consistent across inputs, but it was not enough to prevent overfitting to Resemblyzer.
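To make the two Resemblyzer-based scores concrete, here is a minimal sketch. It assumes the speaker embeddings have already been computed and are passed in as plain lists of floats; the function names and the dictionary keys are hypothetical, not taken from the KVoiceWalk code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate embedding: treat as no similarity
    return dot / (norm_a * norm_b)

def score_candidate(candidate_emb, second_emb, target_emb):
    """Score one candidate voice (hypothetical helper).

    candidate_emb: embedding of audio generated with the candidate voice
    second_emb:    embedding of a second clip generated with the same voice
    target_emb:    embedding of the target speaker's audio
    """
    return {
        "target_similarity": cosine_similarity(candidate_emb, target_emb),
        "self_similarity": cosine_similarity(candidate_emb, second_emb),
    }
```

A voice that scores high on target similarity but low on self similarity would sound different from one input text to the next, which is the instability the second score is meant to catch.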
I had to create a third metric, which takes the normalized difference between a variety of audio features and the corresponding target features. Summing those gives a feature similarity metric, which is useful for keeping audio quality from degrading too much and for preventing overfitting.
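One way a feature similarity metric like this could be computed is sketched below. The feature names, the clamping of each difference to 1.0, and the averaging into a 0 to 1 score are all assumptions made for illustration; the post does not say which features are used or how they are scaled.

```python
def feature_similarity(generated, target, eps=1e-8):
    """Return a score in [0, 1]; 1.0 means every feature matches the target.

    generated, target: dicts mapping a feature name to a float value.
    The feature names are hypothetical placeholders.
    """
    if not target:
        raise ValueError("target must contain at least one feature")
    total = 0.0
    for name, target_value in target.items():
        # Difference relative to the target value, so features on
        # different scales contribute comparably.
        diff = abs(generated[name] - target_value) / (abs(target_value) + eps)
        total += min(diff, 1.0)  # clamp so one feature cannot dominate
    return 1.0 - total / len(target)

# Usage with made-up feature values:
target_features = {"pitch_mean": 180.0, "spectral_centroid": 2400.0}
generated_features = {"pitch_mean": 171.0, "spectral_centroid": 2640.0}
similarity = feature_similarity(generated_features, target_features)
```

Because the features are measured directly on the audio, this score does not depend on the speaker encoder, which is why it can act as a check against overfitting to it.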
The last challenge was weighting the score while keeping it flexible enough to explore the complex text-to-speech style space. Using a weighted harmonic mean allowed backsliding on some metrics in exchange for significant improvement in others, which reduced stagnation and worked well enough for the random walk to make progress.
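The combination step and the walk itself can be sketched roughly as follows. The weights, step size, Gaussian perturbation, and the greedy accept rule are assumptions for illustration, and `score_fn` is a hypothetical stand-in for generating audio with a candidate style vector and scoring it.

```python
import random

def weighted_harmonic_mean(scores, weights, eps=1e-8):
    """Weighted harmonic mean of positive scores.

    A low value on any one metric pulls the result down sharply, but a
    small drop on one metric can still be outweighed by a large gain
    on another.
    """
    total_weight = sum(weights)
    return total_weight / sum(w / max(s, eps) for s, w in zip(scores, weights))

def random_walk(score_fn, start, steps=200, step_size=0.05, seed=0):
    """Greedy random walk: perturb the vector, keep it if the score improves."""
    rng = random.Random(seed)
    best = list(start)
    best_score = score_fn(best)
    for _ in range(steps):
        # Propose a nearby candidate by adding Gaussian noise to each value.
        candidate = [v + rng.gauss(0.0, step_size) for v in best]
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Backsliding example: the second candidate is slightly worse on the
# second metric but better on the other two, and scores higher overall.
before = weighted_harmonic_mean([0.80, 0.90, 0.70], [1.0, 1.0, 1.0])
after = weighted_harmonic_mean([0.90, 0.88, 0.72], [1.0, 1.0, 1.0])
```

A rule that required every metric to improve at once would reject the second candidate above, which is the kind of stagnation the combined score avoids.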
The results are fairly good. I would say it ends up in the uncanny valley of similarity rather than producing a proper clone of the target voice: it sounds like it might be the target voice. Still, it does well enough to improve similarity from 70% to around 90%. Kokoro's architecture probably limits how close it can sound to other voices, but there is probably more progress to be made with a more advanced genetic algorithm.
Check out the code, make some new voices, and let me know if you have any ideas on ways to improve.