Research Demonstration for ISMIR 2004

The PDF files of this paper and slides are available at this link.

Speech-Recognition Interfaces for Music Information Retrieval: ``Speech Completion'' and ``Speech Spotter''

Masataka Goto†, Katunobu Itou‡, Koji Kitayama††, and Tetsunori Kobayashi††

† National Institute of Advanced Industrial Science and Technology (AIST) Ibaraki 305-8568, Japan
‡ Nagoya University. Aichi 464-8603, Japan
†† Waseda University. Tokyo 169-8555, Japan


Paper Abstract

This paper describes music information retrieval (MIR) systems featuring automatic speech recognition. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. We propose two different speech-recognition interfaces for MIR, speech completion and speech spotter, and describe two MIR-based hands-free jukebox systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. The first is a music-retrieval system with the speech-completion interface that is suitable for music stores and car driving situations. When a user can remember only part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface which can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music-playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Our experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces.


Screen Snapshots of Music-Retrieval System with the Speech-Completion Interface

Forward Speech Completion

A user who does not remember the last part of a name can invoke this completion by uttering the first part while intentionally lengthening its last syllable (making a filled pause).

[Entering the phrase ``maikeru jakuson'' (``Michael Jackson'') when its last part (``jakuson'') is uncertain.]

  1. Uttering ``maikeru--.''
  2. A pop-up window containing completion candidates appears.
    Speech Completion Snapshot
  3. Uttering ``No. 2.''
  4. The second candidate is highlighted and bounces.
    Speech Completion Snapshot
  5. The selected candidate ``maikeru jakuson'' is determined as the recognition result.
    Speech Completion Snapshot
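
The forward-completion steps above can be sketched in Python as prefix matching over a vocabulary of names. This is only an illustrative sketch under several assumptions: the real system works on speech audio and detects the filled pause acoustically, whereas here the utterance is already transcribed and the lengthened syllable is written as ``--'', as in the example. The function names, the vocabulary entries other than ``maikeru jakuson'', and the candidate order (plain vocabulary order) are all hypothetical.

```python
import re

# Text marker for a lengthened final syllable (filled pause), written as in
# the example "maikeru--". Assumption: the real system detects this in audio.
FILLED_PAUSE = "--"

# Hypothetical vocabulary; only "maikeru jakuson" appears in the source.
ARTISTS = ["maikeru boruton", "maikeru jakuson", "jon renon"]


def normalize(utterance):
    """Strip surrounding whitespace and a trailing sentence period."""
    return utterance.strip().rstrip(".").strip()


def forward_complete(utterance, vocabulary):
    """Return numbered candidates whose names start with the uttered fragment.

    Completion is invoked only when the fragment ends with a filled pause.
    Candidate order follows vocabulary order (an arbitrary choice here).
    """
    text = normalize(utterance)
    if not text.endswith(FILLED_PAUSE):
        return []
    fragment = text[: -len(FILLED_PAUSE)].strip()
    if not fragment:
        return []
    matches = [name for name in vocabulary if name.startswith(fragment)]
    return list(enumerate(matches, start=1))


def select_by_number(candidates, utterance):
    """Resolve an utterance such as "No. 2." to the candidate with that number."""
    match = re.fullmatch(r"No\.\s*(\d+)", normalize(utterance))
    if match is None:
        return None
    wanted = int(match.group(1))
    for number, name in candidates:
        if number == wanted:
            return name
    return None


candidates = forward_complete("maikeru--.", ARTISTS)
chosen = select_by_number(candidates, "No. 2.")
```

With this hypothetical vocabulary, ``maikeru--.'' yields two candidates, and ``No. 2.'' selects ``maikeru jakuson'', mirroring steps 1 to 5.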

Backward Speech Completion

A user who does not remember the first part of a name can invoke this completion by first uttering a predefined special keyword, called the wildcard keyword, while intentionally lengthening its last syllable, and then uttering the last part of the name.

[Entering the phrase ``maikeru jakuson'' (``Michael Jackson'') when its first part (``maikeru'') is uncertain.]

  1. Uttering ``nantoka--.'' (wildcard keyword)
  2. A pop-up window with colorful flying decorations appears.
    Speech Completion Snapshot
  3. Uttering ``jakuson.''
  4. A window containing completion candidates appears.
    Speech Completion Snapshot
  5. Uttering ``No. 1.''
  6. The first candidate ``maikeru jakuson'' is determined as the recognition result.
    Speech Completion Snapshot
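
The backward-completion steps above can be sketched in the same way as suffix matching that is armed by the wildcard keyword. As before, this is a sketch under assumptions rather than the system's actual method: utterances are assumed to be already transcribed, the filled pause is written as ``--'', and the function names, the extra vocabulary entries, and the candidate order are hypothetical.

```python
# Wildcard keyword and filled-pause marker as written in the example
# "nantoka--". Assumption: the real system detects both in speech audio.
WILDCARD_KEYWORD = "nantoka"
FILLED_PAUSE = "--"

# Hypothetical vocabulary; only "maikeru jakuson" appears in the source.
ARTISTS = ["maikeru jakuson", "janetto jakuson", "maikeru boruton"]


def normalize(utterance):
    """Strip surrounding whitespace and a trailing sentence period."""
    return utterance.strip().rstrip(".").strip()


def is_wildcard_trigger(utterance):
    """True if the utterance is the wildcard keyword with a filled pause."""
    return normalize(utterance) == WILDCARD_KEYWORD + FILLED_PAUSE


def backward_complete(trigger, last_part, vocabulary):
    """Return numbered candidates whose names end with the uttered last part.

    Completion is invoked only if the preceding utterance was the wildcard
    trigger. Candidate order follows vocabulary order (an arbitrary choice).
    """
    if not is_wildcard_trigger(trigger):
        return []
    suffix = normalize(last_part)
    if not suffix:
        return []
    matches = [name for name in vocabulary if name.endswith(suffix)]
    return list(enumerate(matches, start=1))


candidates = backward_complete("nantoka--.", "jakuson.", ARTISTS)
```

With this hypothetical vocabulary, ``nantoka--.'' followed by ``jakuson.'' yields ``maikeru jakuson'' as the first candidate, which the user would then select by uttering ``No. 1.''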

Music Playback

After the artist's name is identified by either forward or backward speech completion, the system shows a numbered list of that artist's titles in a music database, and the user can select a title by uttering either the title itself or its number. When the musical piece is identified, the system plays back its sound file.

[Playing back a musical piece by the artist ``maikeru jakuson'' (``Michael Jackson''), whose name was determined by the speech-completion interface.]

  1. Continued from the above figures.
  2. A pop-up window containing a list of musical pieces appears.
    Speech Completion Snapshot
  3. Uttering ``No. 1.''
  4. The first musical piece is highlighted and played back.
    Speech Completion Snapshot
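
The title-selection step described above, where a piece is chosen by uttering either its title or its number, can be sketched as a small lookup. This is a minimal sketch, assuming transcribed utterances and an in-memory database; the database layout, the titles, the file paths, and the function names are all placeholders invented for illustration, and the sketch returns the sound-file path instead of playing it.

```python
import re

# Hypothetical in-memory music database: artist -> list of (title, sound file).
# Titles and paths are placeholders, not entries from any real database.
MUSIC_DB = {
    "maikeru jakuson": [
        ("piece a", "audio/piece_a.wav"),
        ("piece b", "audio/piece_b.wav"),
    ],
}


def normalize(utterance):
    """Strip surrounding whitespace and a trailing sentence period."""
    return utterance.strip().rstrip(".").strip()


def list_titles(artist):
    """Return the numbered title list shown to the user for an artist."""
    pieces = MUSIC_DB.get(artist, [])
    return [(number, title) for number, (title, _path) in enumerate(pieces, start=1)]


def resolve_piece(artist, utterance):
    """Resolve a spoken number ("No. 1.") or a spoken title to (title, path)."""
    pieces = MUSIC_DB.get(artist, [])
    text = normalize(utterance)
    match = re.fullmatch(r"No\.\s*(\d+)", text)
    if match is not None:
        index = int(match.group(1))
        if 1 <= index <= len(pieces):
            return pieces[index - 1]
        return None
    for title, path in pieces:
        if title == text:
            return (title, path)
    return None


titles = list_titles("maikeru jakuson")
by_number = resolve_piece("maikeru jakuson", "No. 1.")
by_title = resolve_piece("maikeru jakuson", "piece b.")
```

Accepting both the number and the title in one resolver keeps the two selection styles interchangeable; a caller would pass the returned path to whatever audio player is available.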

Acknowledgments:

This research utilized the RWC Music Database "RWC-MDB-P-2001" (Popular Music) and "RWC-MDB-G-2001" (Music Genre).


Masataka GOTO <m.goto [at] aist.go.jp>

All pages are copyrighted by the author. Unauthorized reproduction is strictly prohibited.