Speech-Recognition Interfaces for Music Information Retrieval:
``Speech Completion'' and ``Speech Spotter''
This project was proposed and is being researched by
Masataka Goto,
Katunobu Itou (Nagoya University),
Koji Kitayama (Waseda University), and
Tetsunori Kobayashi (Waseda University).
Introduction:
We developed two hands-free music information retrieval (MIR) systems
that enable a user to retrieve and play back a musical piece
by saying its title or the artist's name.
Although various interfaces for MIR have been proposed,
speech-recognition interfaces suitable for retrieving musical pieces
have not been studied.
Our MIR-based jukebox systems employ
two different speech-recognition interfaces for MIR,
speech completion and
speech spotter,
which exploit intentionally controlled nonverbal speech information
in original ways.
The first is
a music retrieval system with the speech-completion interface
that is suitable for music stores and car-driving situations.
When a user only remembers part of the name of a musical piece or an artist
and utters only a remembered fragment,
the system helps the user recall and enter the name
by completing the fragment.
The second is
a background-music playback system with the speech-spotter interface
that can enrich human-human conversation.
When a user is talking to another person,
the system allows the user to enter
voice commands for music playback control
by spotting a special voice-command utterance
in face-to-face or telephone conversations.
Experimental results from use of these systems
have demonstrated the effectiveness of
the speech-completion and speech-spotter interfaces.
Video Clips:
- Demonstration of Music-Retrieval System with the Speech-Completion Interface
In this video,
a user can retrieve
a musical piece or a list of musical pieces by an artist
even if the user can remember only part of the name of the piece or artist.
[Details]:
The speech-completion function is invoked
by an intentional filled pause
(a vowel-lengthening hesitation like ``er...'').
A user who does not remember the last part of a name
can invoke this completion
by uttering the first part
while intentionally lengthening its last syllable (making a filled pause).
Here, the user can insert a filled pause at
an arbitrary position (even within a word).
The user then gets a numbered list of completion candidates
whose beginnings acoustically resemble the uttered fragment.
On the other hand,
a user who does not remember the first part of a name
can invoke this completion
by uttering the last part
after intentionally lengthening the last syllable of
a predefined special keyword --- called the wildcard keyword
(in the current implementation,
we use the Japanese wildcard ``nantoka'' (in English, ``something'')).
The user then gets a numbered list of completion candidates
whose endings acoustically resemble the uttered last part.
Completion candidates are
generated by replacing the wildcard keyword (filling in the first part),
as if a wildcard search had been performed.
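As a rough illustration of the two completion directions,
here is a toy Python sketch; it is not the actual recognizer
(which matches acoustically over syllable sequences rather than strings),
and the name list and function names are made up for the example.

names = ["maikeru jakuson", "maikeru boruton", "janetto jakuson"]

def forward_candidates(fragment):
    # Fragment uttered before a filled pause: match name beginnings.
    return [n for n in names if n.startswith(fragment)]

def backward_candidates(fragment):
    # Fragment uttered after the wildcard keyword ``nantoka'':
    # match name endings, filling in the first part as in a wildcard search.
    return [n for n in names if n.endswith(fragment)]

print(forward_candidates("maikeru"))   # -> the two names beginning with ``maikeru''
print(backward_candidates("jakuson"))  # -> ``maikeru jakuson'' and ``janetto jakuson''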
Demonstration of Music-Retrieval System with the Speech-Completion Interface
- (11,506,124 bytes, 1 min 3 sec, MPEG-1 file)
- (Short excerpt version: 3,058,384 bytes, 16 sec, MPEG-1 file)
[Video caption]
Forward Speech Completion: Music retrieval by uttering part of artist's name
- Michael- (Michael, uh...)
- (*) A pop-up window containing completion candidates appears.
- Jackson
- (*) A pop-up window containing a list of musical pieces appears.
- No. 1
- (*) The first song is highlighted and played back.
Forward Speech Completion: Music retrieval by uttering part of musical-piece title
- The Way- (The Way, er...)
- (*) A pop-up window containing completion candidates appears.
- No. 1
- (*) The song of the selected title is played back.
Backward Speech Completion: Music retrieval by uttering part of artist's name
- Something- (wildcard keyword)
- (*) A pop-up window with colorful flying decorations appears.
- Jackson
- (*) A pop-up window containing completion candidates appears.
- No. 1
- (*) A pop-up window containing a list of musical pieces appears.
- No. 3
- (*) The third song is highlighted and played back.
This demonstration featured RWC-MDB-G-2001 Nos. 10, 24, and 26 from the
RWC Music Database (Music Genre).
- Demonstration of Music Playback System with the Speech-Spotter Interface
In this video,
a user can listen to background music
by uttering the name of a musical piece or artist
while talking to another person.
The video shows that users can share music playback on the telephone
as if they were talking in the same room with background music.
[Details]:
The speech-spotter interface
regards a user utterance as a command utterance
only when
it is intentionally uttered with a high pitch just after a filled pause
such as ``er...'' or ``uh...''.
In other words,
the system accepts
only this specially designed unnatural utterance and
ignores all other, normal utterances in human-human conversation.
We deliberately exploit the unnaturalness of nonverbal speech information ---
in this case
an intentional filled pause and a subsequent high-pitch utterance ---
because this combination is not normally uttered
in human-human conversation
but nevertheless can be easily uttered.
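The acceptance rule can be pictured as a simple filter over
segmented speech, as in the hypothetical Python sketch below;
the Segment fields and helper names are illustrative assumptions,
since the real system makes both decisions from the acoustic signal.

from dataclasses import dataclass

@dataclass
class Segment:
    text: str                # recognized words (or a hesitation)
    is_filled_pause: bool    # detected lengthened vowel such as ``er...''
    pitch_raised: bool       # pitch intentionally above the speaker's base F0

def spot_commands(segments):
    # Accept an utterance only when it directly follows a filled pause
    # and is spoken with an intentionally raised pitch.
    prev_was_filled_pause = False
    for seg in segments:
        if not seg.is_filled_pause and prev_was_filled_pause and seg.pitch_raised:
            yield seg.text   # treated as a voice command
        prev_was_filled_pause = seg.is_filled_pause

conversation = [
    Segment("what's up?", False, False),
    Segment("er...", True, False),
    Segment("Fly Away", False, True),          # spotted: playback command
    Segment("this is a good song", False, False),
]
print(list(spot_commands(conversation)))       # -> ['Fly Away']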
Demonstration of Music Playback System with the Speech-Spotter Interface
- (12,793,620 bytes, 1 min 10 sec, MPEG-1 file)
- (Short excerpt version: 3,760,232 bytes, 20 sec, MPEG-1 file)
[Video caption]
B calls A on the telephone.
- A: Yes...
- B: Hello?
- A: Uh..., what's up?
- B: Thanks for all your help last time.
- A: No problem. How have you been since?
- B: Whew! I've been super busy writing that paper... I'm beat.
(Several minutes later)
- A: Uh..., that reminds me,
the song called ``Fly Away'' that we heard at that place, wasn't that good?
- B: Oh, what song was that?
- A: Shall we try listening to it?
- B: What? We can hear it now?
- A: Sure. This is a phone with a music-playback system.
We can listen to that song like this...
Er..., ``Fly Away''!
(*) The system plays the song of that name on both of their handsets.
- B: Wow, amazing! You can listen to a song by just saying its name!
Um..., this is a good song.
- A: That's right!
(**) In this caption, underlining indicates
that the pitch of the underlined words is intentionally raised.
This demonstration featured RWC-MDB-P-2001 No. 28 from the
RWC Music Database (Popular Music).
Screen Snapshots of Music-Retrieval System with the Speech-Completion Interface:
Forward Speech Completion:
A user who does not remember the last
part of a name can invoke this completion by uttering
the first part while intentionally lengthening its last syllable
(making a filled pause).
[Entering the phrase ``maikeru jakuson'' (``Michael
Jackson'') when its last part (``jakuson'') is uncertain.]
- Uttering ``maikeru--.''
- A pop-up window containing completion candidates appears.
- Uttering ``No. 2.''
- The second candidate is highlighted and bounces.
- The selected candidate ``maikeru jakuson'' is determined as the recognition result.
Backward Speech Completion:
A user who does not remember the first
part of a name can invoke this completion
by uttering the last part
after intentionally lengthening the last syllable of
a predefined special keyword --- called the wildcard keyword.
[Entering the phrase ``maikeru jakuson'' (``Michael Jackson'')
when its first part (``maikeru'') is uncertain.]
- Uttering ``nantoka--.'' (wildcard keyword)
- A pop-up window with colorful flying decorations appears.
- Uttering ``jakuson.''
- A window containing completion candidates appears.
- Uttering ``No. 1.''
- The first candidate ``maikeru jakuson'' is determined as the recognition result.
After the artist's name is identified
by either the forward or backward speech completion,
the system shows
a numbered list of titles for the specified artist in a music database,
and a user can select an appropriate title
by uttering either the title or its number.
When the musical piece is identified,
the system plays back its sound file.
[Playing back a musical piece by the artist ``maikeru jakuson''
(``Michael Jackson''),
whose name was determined through the speech-completion interface.]
- Continued from the above figures.
- A pop-up window containing a list of musical pieces appears.
- Uttering ``No. 1.''
- The first musical piece is highlighted and played back.
Background:
The purpose of this research is
to build a music-retrieval system with
a speech-recognition interface that
facilitates both identification of a musical piece and
music playback in everyday life.
We think a speech-recognition interface
is well-suited to music information retrieval (MIR),
especially retrieval of a musical piece
by entering its title or the artist's name.
At home or in a car, for example,
an MIR-based jukebox system with a speech-recognition interface
would allow users to change background music
just by saying the name of a musical piece or an artist.
At music-listening stations in music stores or on karaoke machines,
a speech-recognition interface could also
help users find musical pieces they have been looking for
without using any input device other than a microphone.
Most previous MIR research, however, has not explored
how speech recognition can be used for retrieving music information,
although various MIR interfaces using text, symbols, MIDI, or audio signals
have been proposed.
To retrieve a musical piece,
a typical approach is to use a text query
related to bibliographic information.
This approach requires the use of hand-operated input devices,
such as a computer keyboard, mouse, or stylus pen.
Another approach is to use a melody-related query
given through symbols, MIDI, or audio signals.
In particular,
music retrieval through a sung melody is called query by humming (QBH),
and this approach is considered promising
because it requires only a microphone and can easily be used by a novice.
However, even though this approach uses a microphone,
speech recognition of the names of musical pieces and artists
has not been considered.
Against the above background,
we developed two original speech-recognition interfaces,
speech completion and
speech spotter,
which are suitable for MIR.
Technology:
The speech-completion function requires
real-time detection of a filled pause and
the generation of a list of completion candidates:
- Real-time detection of a filled pause
-
To meet the first requirement,
we use our robust filled-pause detection method.
This is a language-independent bottom-up method
that can detect a lengthened vowel in any word
through a sophisticated signal-processing technique.
It determines the beginning and end of each filled pause
by finding two acoustical features of filled pauses ---
small F0 (voice pitch) transitions and
small spectral envelope deformations.
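As a minimal sketch of this bottom-up idea
(assuming the librosa library for F0 and MFCC extraction;
the thresholds and the semitone-based F0 delta are illustrative choices,
not the authors' actual signal processing):

import numpy as np
import librosa

def filled_pause_spans(y, sr, hop=512, f0_tol=0.5, env_tol=2.0, min_frames=20):
    # Per-frame F0 (NaN where unvoiced) and a spectral-envelope proxy (MFCCs).
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    n = min(len(f0), mfcc.shape[1])
    # Frame-to-frame changes: small values suggest a stable, lengthened vowel.
    semitones = 12 * np.log2(np.where(np.isnan(f0[:n]), 1.0, f0[:n]))
    f0_delta = np.abs(np.diff(semitones))
    env_delta = np.linalg.norm(np.diff(mfcc[:, :n], axis=1), axis=0)
    stable = (f0_delta < f0_tol) & (env_delta < env_tol) & voiced[1:n]
    # Keep runs of stable voiced frames that last long enough.
    spans, start = [], None
    for i, ok in enumerate(stable):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_frames:
                spans.append((start, i))
            start = None
    if start is not None and len(stable) - start >= min_frames:
        spans.append((start, len(stable)))
    return spans   # (begin_frame, end_frame) pairs of candidate filled pauses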
- Generation of a list of completion candidates
-
To meet the second requirement,
we extended a typical HMM-based speech recognizer
to provide a list of completion candidates
whenever a filled pause is detected (even within a word).
Because single phonemes cannot be recognized accurately enough,
modern speech recognizers
do not determine a word's phoneme sequence phoneme by phoneme.
Instead, they choose the maximum likelihood (ML) hypothesis
while pursuing multiple hypotheses on a vocabulary tree
where all vocabulary words are stored.
When the beginning of a filled pause is detected,
the recognizer determines which completion method is to be invoked
(forward or backward).
The forward speech completion is achieved by deriving from the vocabulary tree
the completion candidates that share a prefix with
each plausible but incomplete word hypothesis for the uttered fragment.
The backward speech completion is achieved
by recognizing a last-part fragment uttered after the wildcard keyword;
since such a word fragment cannot be registered as a vocabulary word in advance,
decoding just after the wildcard keyword is dynamically allowed to start
from any mid-word syllable of the vocabulary words.
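To make the forward case concrete,
here is a toy vocabulary-tree sketch in Python;
real hypotheses are acoustic and scored by likelihood,
whereas this version walks the tree with exact syllable matches,
and the registered names are only examples.

class TrieNode:
    def __init__(self):
        self.children = {}   # syllable -> TrieNode
        self.word = None     # set at nodes where a vocabulary word ends

def insert(root, syllables, word):
    node = root
    for s in syllables:
        node = node.children.setdefault(s, TrieNode())
    node.word = word

def forward_complete(root, prefix_syllables):
    # Walk the tree along the recognized prefix, then collect every
    # vocabulary word stored in the subtree below the reached node.
    node = root
    for s in prefix_syllables:
        if s not in node.children:
            return []
        node = node.children[s]
    words, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.word:
            words.append(n.word)
        stack.extend(n.children.values())
    return words

root = TrieNode()
insert(root, ["ma", "i", "ke", "ru", "ja", "ku", "so", "n"], "maikeru jakuson")
insert(root, ["ma", "i", "ru", "su", "de", "i", "bi", "su"], "mairusu deibisu")
print(forward_complete(root, ["ma", "i", "ke"]))   # -> ['maikeru jakuson']

Backward completion would instead let the walk start at any mid-word
syllable node, which is why it must be handled dynamically at
recognition time rather than by registering fragments in advance.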
The speech-spotter function requires
real-time detection of a filled pause,
determination of the endpoints of an utterance,
and
judgment as to whether the pitch of an utterance is
intentionally raised:
- Real-time detection of a filled pause
-
We use the same filled-pause detection method
as in the speech-completion function.
- Determination of the endpoints of an utterance
-
The end of each utterance is automatically determined
by using an intermediate speech-recognition result,
which is the ML hypothesis in
the HMM-based speech recognizer.
The recognizer monitors the ML hypothesis at every frame
and
stops decoding (determines the end of the utterance)
when the ML hypothesis reaches a silent pause or
when there is no possibility of other recognition results.
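Schematically, the monitoring loop can be sketched as follows;
the Decoder interface is a hypothetical stand-in for the real HMM decoder,
which would advance its hypothesis lattice in step().

from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    ends_in_silence: bool   # the ML hypothesis has reached a silent pause

class Decoder:              # hypothetical stand-in, not a real decoder
    def step(self, frame): ...
    def best_hypothesis(self) -> Hypothesis: ...
    def active_count(self) -> int: ...   # number of competing hypotheses

def detect_endpoint(decoder, frames, min_silence_frames=30):
    # Stop decoding once the ML hypothesis stays in a silent pause long
    # enough, or no other recognition result remains possible.
    silence_run = 0
    for t, frame in enumerate(frames):
        decoder.step(frame)
        silence_run = silence_run + 1 if decoder.best_hypothesis().ends_in_silence else 0
        if silence_run >= min_silence_frames or decoder.active_count() == 1:
            return t        # frame index of the utterance end
    return len(frames) - 1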
- Judgment as to whether the pitch of an utterance is intentionally raised
-
Because the pitch range of voices differs among individuals,
it is difficult to judge
whether the pitch of an utterance is intentionally shifted (raised).
We therefore introduced a unique pitch reference for each speaker,
called the base fundamental frequency (base F0),
which represents the pitch of the speaker's natural voice.
We use an original method of estimating the base F0
by averaging the voice pitch during a filled pause:
we found that the pitch during filled pauses is stable and
is close to the pitch of the natural voice.
After estimating the base F0,
we can deal with the pitch value relative to the base F0,
which compensates for a wide variety of voice pitch ranges.
If the relative pitch value of an utterance,
calculated by subtracting the base F0
from the pitch averaged over the utterance,
is higher than a threshold,
the pitch of the utterance is judged to be intentionally shifted (raised).
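Numerically, the judgment reduces to a comparison against the base F0,
as in the small sketch below; the threshold value and the use of Hz
as the unit are assumptions for illustration
(a log-frequency scale would work the same way).

import numpy as np

def pitch_intentionally_raised(filled_pause_f0_hz, utterance_f0_hz, threshold_hz=30.0):
    # Base F0: the speaker's natural pitch, averaged over a filled pause.
    base_f0 = np.nanmean(filled_pause_f0_hz)
    # Relative pitch: the utterance's average pitch minus the base F0.
    relative = np.nanmean(utterance_f0_hz) - base_f0
    return relative > threshold_hz

# e.g. ``er...'' around 120 Hz, then a command utterance around 160 Hz:
print(pitch_intentionally_raised([119.0, 120.0, 121.0], [158.0, 162.0, 160.0]))  # True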
Contribution:
The main contribution of this research
is to propose two novel speech-recognition interfaces suitable for MIR,
speech completion and
speech spotter,
and demonstrate their usefulness
in two different music jukebox systems.
The music retrieval system with the speech-completion interface
enables a user to listen to a musical piece
even if part of its name cannot be recalled.
The background-music playback system with the speech-spotter interface
enables users to share music playback on the telephone
as if they were talking in the same room with background music.
As far as we know, this is the first system
that people can use to obtain speech-based music information assistance
in the midst of a telephone conversation.
We believe that
practical speech-recognition interfaces for MIR
cannot be achieved by simply applying
current automatic speech recognition to MIR:
retrieval of musical pieces
just by uttering entire titles or artist names is not sufficient.
Our two interfaces can be considered an important first step
toward building the ultimate speech-capable MIR interface.
It will become increasingly important to explore
various speech-recognition interfaces for MIR
alongside traditional MIR interfaces.
Acknowledgments:
This research
utilized the
RWC Music Database "RWC-MDB-P-2001" (Popular Music)
and "RWC-MDB-G-2001" (Music Genre).
References:
- Masataka Goto, Katunobu Itou, Koji Kitayama, and Tetsunori Kobayashi:
Speech-Recognition Interfaces for Music Information Retrieval:
``Speech Completion'' and ``Speech Spotter'',
Proceedings of
the 5th International Conference on Music Information Retrieval
(ISMIR 2004),
pp. 403-408, October 2004.