Week 04 Progress Report by Mebin J Thattil

Things are starting to take shape!
These blogs can also be found on the SugarLabs website; the content is the same in both places.

Project: Speak Activity
Mentors: Chihurumnaya Ibiam, Kshitij Shah
Assisting Mentors: Walter Bender, Devin Ulibarri
Reporting Period: 2025-06-22 - 2025-06-29

Goals for This Week

This Week’s Achievements

Note: I was on leave until the 26th this week due to my final exams, but I still managed to get a bunch of cool stuff done after that.

  1. Kokoro meets Speak - A new chapter

  2. One of the three major parts of my proposal was to integrate a more modern, natural-sounding TTS model into Speak.

  3. I used Kokoro and integrated it with the activity.
  4. We now have access to the entire catalog of voices that Kokoro comes with. This will be helpful for our idea of having different personas—each persona could have a different voice.
  5. The current implementation of the code is a rather hacky way of integrating Kokoro. I say this because the audio pipeline currently looks like this:

    Text → Kokoro → Outputs a temporary WAV file → Read by GStreamer → Audio output can be heard

  6. This is not ideal for obvious reasons. We don't want Kokoro to save an audio file every time and then read from it again. This is slow because Kokoro has to process the entire text, convert it to a WAV, and then GStreamer has to read and output it. For smaller text inputs it's still fine, but it’s not optimal.

  7. The better approach would be to have Kokoro stream audio directly to GStreamer, which could then play it as it arrives. This would reduce perceived latency significantly. Kokoro currently has no function/API that works this way, so I would have to write one.
  8. But for now, this is just an initial implementation to get feedback from mentors and peers; optimization can come later.
  9. Kokoro also uses the espeak-ng engine as a fallback. Since Speak already uses espeak, I’ll try to go under the hood and tweak Kokoro to use espeak instead. This would reduce additional dependencies.
  10. So far, I've been able to get this working with just 125KB of additional dependencies.
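To make the temp-WAV flow in step 5 concrete, here is a minimal, runnable sketch. `synthesize` is only a stand-in for the actual Kokoro call (it just generates a tone here), and the GStreamer playback is shown as a comment rather than the activity's real pipeline code:

```python
import math
import struct
import tempfile
import wave

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz mono audio

def synthesize(text: str) -> bytes:
    """Stand-in for the Kokoro call: returns raw 16-bit mono PCM.

    Here we just generate half a second of a 440 Hz tone so the
    sketch is runnable; the real code would run the TTS model on
    `text` instead."""
    n = SAMPLE_RATE // 2
    return b"".join(
        struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
        for i in range(n)
    )

def speak_via_tempfile(text: str) -> str:
    """Text -> TTS -> temporary WAV file -> handed to GStreamer."""
    pcm = synthesize(text)
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    with wave.open(tmp.name, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    # Speak would now point GStreamer at the file, roughly:
    #   gst-launch-1.0 playbin uri=file://<tmp.name>
    return tmp.name
```

The cost is visible in the structure: every call pays for full synthesis plus a disk round-trip before a single sample can be played.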
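And a sketch of the streaming approach from step 7. Kokoro has no such API today, so `synthesize_streaming` below is hypothetical: it yields chunks of silence in place of model output, and the GStreamer `appsrc` call is shown only as a comment:

```python
import time
from typing import Iterator

SAMPLE_RATE = 24000
CHUNK_SAMPLES = SAMPLE_RATE // 10  # 100 ms of audio per chunk

def synthesize_streaming(text: str) -> Iterator[bytes]:
    """Hypothetical streaming TTS: yield PCM as it is generated
    rather than after the whole utterance is done. Here every word
    becomes one chunk of 16-bit mono silence; a real model would
    yield real audio frame-by-frame or sentence-by-sentence."""
    for _word in text.split():
        yield b"\x00\x00" * CHUNK_SAMPLES

def play_streaming(text: str) -> float:
    """Push chunks to the audio sink as they arrive. With GStreamer
    this would be an `appsrc` element receiving buffers, so playback
    can start at the first chunk instead of after full synthesis.
    Returns the time until the first audio is available."""
    start = time.monotonic()
    first_audio = -1.0
    for chunk in synthesize_streaming(text):
        if first_audio < 0:
            first_audio = time.monotonic() - start
        # appsrc.emit("push-buffer", Gst.Buffer.new_wrapped(chunk))
    return first_audio
```

The win is that time-to-first-audio depends only on how long the first chunk takes, not on the length of the whole text.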

Video demo:

Note that the recording has a slight echo, but that's an issue with the recording itself; it sounds perfectly fine inside Speak.

  1. Quantization Pipeline

I put together a script to quantize models to GGUF. Its model configuration looks like this:

```sh
# Model config
MODEL_REPO="hfusername/modelname"
GGUF_OUT="output_model_name.gguf"
GGUF_QUANT="output_model_name-q4.gguf"
N_CTX=2048
BUILD_DIR="build"
SAVED_DIR_NAME_HF="output_dir_name"

# Another thing to note: the URL of the plugin inference script
RAW_URL="https://raw.githubusercontent.com/mebinthattil/template_llama_chat_python/main/chatapp.py"
```
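The config above implies the rest of the pipeline. Here is a sketch of the likely steps, assuming llama.cpp's standard tools (`convert_hf_to_gguf.py` and `llama-quantize`) rather than the script's exact commands:

```python
def quantization_steps(repo: str, saved_dir: str, gguf_out: str, gguf_quant: str):
    """Return the pipeline's commands in order. This is a sketch of
    the typical HF -> GGUF -> 4-bit workflow with llama.cpp's tools,
    not the script's actual contents."""
    return [
        # 1. Download the HF model locally
        ["huggingface-cli", "download", repo, "--local-dir", saved_dir],
        # 2. Convert the HF checkpoint to GGUF (converter ships with llama.cpp)
        ["python", "convert_hf_to_gguf.py", saved_dir, "--outfile", gguf_out],
        # 3. Quantize to 4-bit (Q4_K_M is a common choice for a "-q4" file)
        ["llama-quantize", gguf_out, gguf_quant, "Q4_K_M"],
    ]

steps = quantization_steps(
    "hfusername/modelname", "output_dir_name",
    "output_model_name.gguf", "output_model_name-q4.gguf",
)
```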

This script tries to be OS-agnostic and attempts to detect which OS you're on to run the right commands. It's not fully comprehensive yet, but it works well on macOS, which is the only platform I've tested it on so far.
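The detection boils down to a pattern like the following (illustrative only; the tag names and the command table are assumptions, not the script's actual code):

```python
import platform

def detect_os() -> str:
    """Map platform.system() to a simple OS tag so the right
    per-OS commands can be picked. Tag names are illustrative."""
    return {
        "Darwin": "macos",
        "Linux": "linux",
        "Windows": "windows",
    }.get(platform.system(), "unknown")

# Example per-OS command table, e.g. installing cmake for the build step:
INSTALL_CMAKE = {
    "macos": "brew install cmake",
    "linux": "sudo apt-get install -y cmake",
    "windows": "winget install Kitware.CMake",
}
```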

Next Week’s Roadmap

Acknowledgments

Thank you to my mentors, the Sugar Labs community, and fellow GSoC contributors for their ongoing support.



