Sunday, October 19, 2014

Voice control your Roku player with Python

In this post, I describe a simple way to implement a voice-controlled 'remote control' for a Roku streaming player over a telnet connection with Python.

The complete code (about 100 lines) is available on GitHub.

Roku on Telnet

Roku streaming players offer an interesting unofficial control interface over a telnet connection on the local network. To connect to Roku over telnet:
  1. Put Roku in development mode by pressing the following button sequence on your Roku remote: Home (three times), up (twice), right, left, right, left, right.
  2. Follow the on-screen instructions and Roku will tell you its IP address: e.g. 123.456.7.8.90
  3. Using any computer on the same network as Roku, open a terminal and telnet into Roku's port 8080:
    telnet 123.456.7.8.90 8080
  4. If all goes well, you should see Roku's device ID, ethernet and wifi MAC addresses, and command prompt printed on the console:
    Trying 123.456.7.8.90...
    Connected to 123.456.7.8.90.
    Escape character is '^]'.
    ETHMAC 00:00:00:00:00:0a
    WIFIMAC 00:00:00:00:00:0b
  5. From the telnet command line, we can now type simple Roku control commands like 'press play' and 'press right'.
I successfully connected over telnet to Roku SD, Roku HD, and Roku 3 players.
  • Supported commands include: right, left, up, down, select, pause, play, forward, reverse
  • All models support abbreviated commands, e.g. "press r" for "press right".
  • Roku 3 requires abbreviated commands. It can also accept multiple button presses in a single command, such as "press dls" (down, left, select), and appears to support some additional button presses such as "press a" and "press c".
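
Because Roku 3 only takes the abbreviated form, it can help to build the command string from full button names. The sketch below is a hypothetical helper, not part of the original code. It only knows the four abbreviations that appear in the examples above (r, l, d, s) and refuses to guess the letters for any other button.

```python
# Abbreviations taken only from the examples in this post; the letters for
# other buttons (up, play, pause, ...) are not confirmed here.
ABBREVIATIONS = {"right": "r", "left": "l", "down": "d", "select": "s"}


def build_press_command(buttons):
    """Build one abbreviated command, e.g. ['down', 'left'] -> 'press dl'."""
    letters = []
    for button in buttons:
        name = button.lower()
        if name not in ABBREVIATIONS:
            # Refuse to guess rather than send an unknown key to the player.
            raise ValueError(f"no known abbreviation for {button!r}")
        letters.append(ABBREVIATIONS[name])
    return "press " + "".join(letters)
```

For example, build_press_command(["down", "left", "select"]) produces the "press dls" string mentioned above.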

Adding Voice Control

To have some more fun with this feature, I used Python's SpeechRecognition library to develop a simple voice-enabled interface for Roku. The SpeechRecognition module is a Python wrapper around Google's speech recognition API. The module uses PyAudio to record WAV audio from the microphone:
import speech_recognition as sr
recognizer = sr.Recognizer()
with sr.Microphone() as source:
    audio = recognizer.listen(source, timeout=2)
These files are then streamed to Google's recognition API for transcription:
command = recognizer.recognize(audio)
If recognition is successful, SpeechRecognition should return a string transcription. We can then process this transcription to control the Roku. To manage the processing of transcriptions or "concept mapping", we use a simple approach based on keywords. We define a concept class to hold a list of keywords and a related action. For example, Roku needs a SelectConcept that should request the select button when the user says either of the keywords 'select' or 'ok':
SelectConcept = Concept(['select','ok'], 'press select')
Each concept has a matching method, conceptMatch, which takes the input message as a list of words and returns the concept's action if the message contains one of its keywords:
# Returns the action string 'press select'
SelectConcept.conceptMatch('please press select button'.split(" "))
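The Concept class itself isn't shown in this post. Below is a minimal sketch of how it might look, assuming simple case-insensitive keyword matching. PlayConcept and map_to_action are hypothetical names added for illustration, and the actual code on GitHub may differ.

```python
class Concept:
    """Hypothetical sketch: a list of keywords tied to one action string."""

    def __init__(self, keywords, action):
        self.keywords = [k.lower() for k in keywords]
        self.action = action

    def conceptMatch(self, words):
        """Return the action if any word is a keyword, otherwise None."""
        # Lowercase and strip simple punctuation so "OK," still matches "ok".
        normalized = {w.lower().strip(".,!?") for w in words}
        if any(k in normalized for k in self.keywords):
            return self.action
        return None


SelectConcept = Concept(['select', 'ok'], 'press select')
PlayConcept = Concept(['play'], 'press play')


def map_to_action(message, concepts):
    """Return the action of the first concept that matches, or None."""
    words = message.split(" ")
    for concept in concepts:
        action = concept.conceptMatch(words)
        if action is not None:
            return action
    return None
```

Returning None for an unmatched message lets the caller ignore speech that doesn't map to any button.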
Once we've defined all of our concept data, we need a way to pass the action requests to the Roku from within Python. For this, we can use Python's telnetlib to connect to Roku's 8080 port and pass message strings like 'press select'.
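Note that telnetlib was removed from the standard library in Python 3.13, so the sketch below uses a plain TCP socket in its place. It is not the code from the repository. It assumes the Roku reads newline-terminated commands and that a command can be sent without first reading the banner, neither of which I have verified on every model. format_command and send_action are hypothetical names.

```python
import socket

ROKU_PORT = 8080  # the port used in the telnet steps above


def format_command(action):
    """Encode an action string such as 'press select' as one line of bytes."""
    # Assumption: the Roku console expects a newline after each command.
    return (action.strip() + "\n").encode("ascii")


def send_action(host, action, port=ROKU_PORT, timeout=5.0):
    """Open a TCP connection, send one command, then close the connection."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(format_command(action))
```

With the Roku's IP address in a variable roku_ip, a call like send_action(roku_ip, 'press select') would then play the role of typing the command at the telnet prompt.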

See the full code on GitHub.

Tuning and other details

If you are lucky, PyAudio and SpeechRecognition will work well out of the box. More likely, however, you will encounter a few library dependency problems and hardware-specific issues. On my Linux systems, I needed to add the following packages to get PyAudio working:

Note that even when PyAudio is working, you may still see a number of warning/diagnostic messages printed to the console, e.g.:
ALSA lib pcm_dsnoop.c:618:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
The call:
with sr.Microphone() as source:
    audio = recognizer.listen(source, timeout=2)
will listen to the microphone channel for up to two seconds waiting for a signal above a particular energy threshold. If the signal passes over the threshold value, recording begins and will continue until a pause is encountered. The threshold value and stopping pause length are controlled by the parameters:
recognizer.energy_threshold = 2000
recognizer.pause_threshold = 0.5
The values of these parameters will need to be tuned for the particular microphone hardware, system settings, and level of ambient noise during recording. It may also help to adjust the system microphone input level to improve recognition performance. Experimenting with several different microphones, I found that the best energy threshold for this application varied widely, from about 200 to 2000. More details about recognizer parameters are available in the SpeechRecognition documentation.
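To build some intuition for how the two parameters interact, here is a simplified sketch of the start/stop logic described above, run over a list of per-chunk loudness values. It is only an illustration, not the library's actual implementation, and find_phrase and its arguments are hypothetical.

```python
def find_phrase(energies, chunk_seconds, energy_threshold=2000, pause_threshold=0.5):
    """Return (start, end) chunk indices of the first phrase, or None.

    energies: one loudness value per audio chunk.
    Recording starts at the first chunk above energy_threshold and stops once
    the signal has stayed at or below it for pause_threshold seconds.
    """
    start = None
    quiet_seconds = 0.0
    for i, energy in enumerate(energies):
        if start is None:
            if energy > energy_threshold:
                start = i  # loud enough: begin recording
            continue
        if energy > energy_threshold:
            quiet_seconds = 0.0  # still speaking: reset the pause timer
        else:
            quiet_seconds += chunk_seconds
            if quiet_seconds >= pause_threshold:
                return start, i  # pause was long enough: stop recording
    # Ran out of input: nothing was heard, or the phrase never paused.
    return None if start is None else (start, len(energies) - 1)
```

Feeding the same loudness values through with different thresholds shows why tuning matters: a threshold that is too high never triggers, while one that is too low treats background noise as speech and delays the stop.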

Even with tuning, SpeechRecognition is not very reliable for this toy application, and its latency is too high. A better implementation would use a custom language model and a more robust concept mapping system. An interesting alternative to SpeechRecognition might be blather.