Speech-Recognition Interfaces for Music Information Retrieval: ``Speech Completion'' and ``Speech Spotter''



This project was proposed and researched by Masataka Goto, Katunobu Itou (Nagoya University), Koji Kitayama (Waseda University), and Tetsunori Kobayashi (Waseda University).


box Introduction:

We developed two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces.



box Screen Snapshots of Music-Retrieval System with the Speech-Completion Interface:

Forward Speech Completion

A user who does not remember the last part of a name can invoke this completion by uttering the first part while intentionally lengthening its last syllable (making a filled pause).

[Entering the phrase ``maikeru jakuson'' (``Michael Jackson'') when its last part (``jakuson'') is uncertain.]

  1. Uttering ``maikeru--.''
  2. A pop-up window containing completion candidates appears.
  3. Uttering ``No. 2.''
  4. The second candidate is highlighted and bounces.
  5. The selected candidate ``maikeru jakuson'' is determined as the recognition result.

Backward Speech Completion

A user who does not remember the first part of a name can invoke this completion by uttering the last part after intentionally lengthening the last syllable of a predefined special keyword --- called the wildcard keyword.

[Entering the phrase ``maikeru jakuson'' (``Michael Jackson'') when its first part (``maikeru'') is uncertain.]

  1. Uttering ``nantoka--.'' (wildcard keyword)
  2. A pop-up window with colorful flying decorations appears.
  3. Uttering ``jakuson.''
  4. A window containing completion candidates appears.
  5. Uttering ``No. 1.''
  6. The first candidate ``maikeru jakuson'' is determined as the recognition result.

Music Playback

After the artist's name has been identified by either forward or backward speech completion, the system shows a numbered list of that artist's titles in the music database, and the user can select a title by uttering either the title itself or its number (see the sketch after the steps below). When the musical piece is identified, the system plays back its sound file.

[Playing back a musical piece by the artist ``maikeru jakuson'' (``Michael Jackson''), whose name has been determined by the speech-completion interface.]

  1. Continued from the completion steps above.
  2. A pop-up window containing a list of musical pieces appears.
  3. Uttering ``No. 1.''
  4. The first musical piece is highlighted and played back.
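
The number-or-title selection above amounts to one small resolution step once the recognizer returns a text hypothesis. The fragment below is a hypothetical Python sketch of that step; the function name and the ``No. N'' string convention are our assumptions, not the system's actual recognition grammar.

  def select_title(recognized, titles):
      """Resolve a recognized utterance to one title in the numbered list.

      `recognized` is either a spoken number such as ``No. 2'' or a
      title itself; `titles` is the list shown in the pop-up window.
      """
      if recognized.lower().startswith("no. "):
          return titles[int(recognized[4:]) - 1]  # ``No. 2'' -> index 1
      # Otherwise look the utterance up as a title (None if absent).
      return next((t for t in titles if t.lower() == recognized.lower()),
                  None)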

box Background:

The purpose of this research is to build a music-retrieval system with a speech-recognition interface that facilitates both identification of a musical piece and music playback in everyday life. We think a speech-recognition interface is well-suited to music information retrieval (MIR), especially retrieval of a musical piece by entering its title or the artist's name. At home or in a car, for example, an MIR-based jukebox system with a speech-recognition interface would allow users to change background music just by saying the name of a musical piece or an artist. At music-listening stations in music stores or on karaoke machines, a speech-recognition interface could also help users find musical pieces they have been looking for without using any input device other than a microphone.

Most previous MIR research, however, has not explored how speech recognition can be used for retrieving music information, although various MIR interfaces using text, symbols, MIDI, or audio signals have been proposed. To retrieve a musical piece, a typical approach is to use a text query related to bibliographic information. This approach requires the use of hand-operated input devices, such as a computer keyboard, mouse, or stylus pen. Another approach is to use a melody-related query given through symbols, MIDI, or audio signals. In particular, music retrieval through a sung melody is called query by humming (QBH), and this approach is considered promising because it requires only a microphone and can easily be used by a novice. However, even though this approach uses a microphone, speech recognition of the names of musical pieces and artists has not been considered.

Against the above background, we developed two original speech-recognition interfaces, speech completion and speech spotter, which are suitable for MIR.


box Technology:

The speech-completion function requires real-time detection of a filled pause and the generation of a list of completion candidates:

  1. Real-time detection of a filled pause
    To meet the first requirement, we use our robust filled-pause detection method. This is a language-independent, bottom-up method that can detect a lengthened vowel in any word through signal processing alone. It determines the beginning and end of each filled pause by finding two acoustical features of filled pauses: small F0 (voice pitch) transitions and small spectral-envelope deformations (see the sketch after this list).
  2. Generation of a list of completion candidates
    To meet the second requirement, we extended a typical HMM-based speech recognizer to provide a list of completion candidates whenever a filled pause is detected (even within a word). Because single phonemes cannot be recognized accurately enough, up-to-date speech recognizers do not determine a word's phoneme sequence phoneme by phoneme; instead, they choose the maximum-likelihood (ML) hypothesis while pursuing multiple hypotheses on a vocabulary tree in which all vocabulary words are stored. When the beginning of a filled pause is detected, the recognizer determines which completion method to invoke (forward or backward). Forward speech completion derives from the vocabulary tree the candidates that share a prefix with each plausible, still-incomplete word hypothesis for the uttered fragment. Backward speech completion recognizes a last-part fragment uttered after the wildcard keyword; since such a word fragment cannot be registered as a vocabulary word in advance, the recognizer dynamically starts its search from every syllable in the middle of every vocabulary word just after the wildcard keyword (see the toy sketch after this list).
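
As a concrete illustration of the detection criterion in item 1, the following Python sketch marks frames whose F0 transition and spectral-envelope deformation are both small, and reports sufficiently long runs of such frames as filled-pause candidates. It is a minimal sketch under our own assumptions (semitone-scaled F0 contours, per-frame envelope vectors, and all threshold values are hypothetical), not the actual implementation.

  import numpy as np

  def detect_filled_pauses(f0, envelopes, f0_tol=0.5, env_tol=0.05,
                           min_frames=20):
      """Return (begin, end) frame spans of filled-pause candidates.

      f0        : per-frame F0 in semitones (0 = unvoiced frame)
      envelopes : (frames, bins) array of smoothed spectral envelopes
      """
      voiced = f0 > 0
      # Small F0 transition: frame-to-frame pitch change stays tiny.
      df0 = np.abs(np.diff(f0, prepend=f0[0]))
      # Small spectral-envelope deformation: average per-bin change.
      denv = np.linalg.norm(np.diff(envelopes, axis=0,
                                    prepend=envelopes[:1]),
                            axis=1) / envelopes.shape[1]
      stable = voiced & (df0 < f0_tol) & (denv < env_tol)

      spans, start = [], None
      for i, flag in enumerate(stable):
          if flag and start is None:
              start = i                      # a stable run begins
          elif not flag and start is not None:
              if i - start >= min_frames:    # long enough to be a lengthened vowel
                  spans.append((start, i))
              start = None
      if start is not None and len(stable) - start >= min_frames:
          spans.append((start, len(stable)))
      return spans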
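
Likewise, the two completion directions in item 2 can be conveyed by a toy string-based stand-in for the recognizer's vocabulary tree. The class, the example vocabulary, and the suffix-matching shortcut for backward completion are our own simplifications; the real system searches phoneme-level hypotheses and starts backward matches from every mid-word syllable.

  class CompletionVocabulary:
      """Toy string-based stand-in for the decoder's vocabulary tree."""

      def __init__(self, words):
          self.words = words

      def forward(self, fragment):
          # Forward completion: the uttered fragment is the first part
          # of a name, so list every word sharing that prefix.
          return [w for w in self.words if w.startswith(fragment)]

      def backward(self, fragment):
          # Backward completion: the fragment is the last part of a
          # name uttered after the wildcard keyword, so match it
          # against the tail of every word.
          return [w for w in self.words
                  if w != fragment and w.endswith(fragment)]

  vocab = CompletionVocabulary(["maikeru jakuson", "maikeru bolton",
                                "janetto jakuson"])
  print(vocab.forward("maikeru"))   # after uttering ``maikeru--''
  print(vocab.backward("jakuson"))  # after ``nantoka--'' then ``jakuson''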

The speech-spotter function requires real-time detection of a filled pause, determination of the endpoints of an utterance, and judgment as to whether the pitch of an utterance is intentionally raised:

  1. Real-time detection of a filled pause
    We use the same filled-pause detection method as in the speech-completion function.
  2. Determination of the endpoints of an utterance
    The end of each utterance is automatically determined by using an intermediate speech-recognition result, namely the ML hypothesis in the HMM-based speech recognizer. The recognizer monitors the ML hypothesis at every frame and stops decoding (determines the end of the utterance) when the hypothesis reaches a silent pause or when no other recognition result remains possible.
  3. Judgment as to whether the pitch of an utterance is intentionally raised
    Because the pitch range of voices differs among individuals, it is difficult to judge whether the pitch of an utterance is intentionally shifted (raised). We therefore introduced a speaker-specific pitch reference, called the base fundamental frequency (base F0), which represents the pitch of the speaker's natural voice. We estimate the base F0 with an original method that averages the voice pitch during filled pauses: we found that the pitch during filled pauses is stable and close to the pitch of the natural voice. After estimating the base F0, we can deal with pitch values relative to the base F0, which compensates for the wide variety of voice-pitch ranges. If the relative pitch of an utterance, calculated by subtracting the base F0 from the pitch averaged over the utterance, is higher than a threshold, the utterance is judged to be intentionally shifted (see the sketch after this list).
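
The relative-pitch judgment in item 3 reduces to a few lines once per-frame F0 contours are available. The sketch below assumes semitone-scaled F0 values and a hypothetical 2-semitone threshold; neither is taken from the actual system.

  import numpy as np

  def is_intentionally_raised(utterance_f0, filled_pause_f0,
                              threshold=2.0):
      """Judge whether an utterance's pitch is intentionally raised.

      Both arguments are per-frame F0 contours in semitones
      (0 marks unvoiced frames).
      """
      # Base F0: average voice pitch during the speaker's filled
      # pauses, which is stable and close to the natural voice.
      base_f0 = filled_pause_f0[filled_pause_f0 > 0].mean()
      mean_f0 = utterance_f0[utterance_f0 > 0].mean()
      # Relative pitch compensates for speaker-dependent pitch ranges.
      return (mean_f0 - base_f0) > threshold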


box Contribution:

The main contribution of this research is to propose two novel speech-recognition interfaces suitable for MIR, speech completion and speech spotter, and demonstrate their usefulness in two different music jukebox systems. The music retrieval system with the speech-completion interface enables a user to listen to a musical piece even if part of its name cannot be recalled. The background-music playback system with the speech-spotter interface enables users to share music playback on the telephone as if they were talking in the same room with background music. As far as we know, this is the first system that people can use to obtain speech-based music information assistance in the midst of a telephone conversation.

We believe that practical speech-recognition interfaces for MIR cannot be achieved by simply applying current automatic speech recognition to MIR: retrieving musical pieces just by uttering entire titles or artist names is not sufficient. Our two interfaces can be considered an important first step toward the ultimate speech-capable MIR interface. Exploring various speech-recognition interfaces for MIR, alongside traditional MIR interfaces, will become more and more important.


box Acknowledgments:

This research utilized the RWC Music Database: ``RWC-MDB-P-2001'' (Popular Music) and ``RWC-MDB-G-2001'' (Music Genre).


box References:

  1. Masataka Goto, Katunobu Itou, Koji Kitayama, and Tetsunori Kobayashi: Speech-Recognition Interfaces for Music Information Retrieval: ``Speech Completion'' and ``Speech Spotter'', Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), pp.403-408, October 2004.

Masataka GOTO <m.goto [at] aist.go.jp>

All pages are copyrighted by the author. Unauthorized reproduction is strictly prohibited.

last update: October 1, 2005