Research Demonstration for ISMIR 2004
The PDF files of this paper and slides are available at
this link.
Speech-Recognition Interfaces for Music Information Retrieval:
``Speech Completion'' and ``Speech Spotter''
Masataka Goto†,
Katunobu Itou‡,
Koji Kitayama††, and
Tetsunori Kobayashi††
†
National Institute of Advanced Industrial Science and Technology (AIST)
Ibaraki 305-8568, Japan
‡
Nagoya University
Aichi 464-8603, Japan
††
Waseda University
Tokyo 169-8555, Japan
Paper abstract
This paper describes music information retrieval (MIR) systems
featuring automatic speech recognition.
Although various interfaces for MIR have been proposed,
speech-recognition interfaces suitable for retrieving musical pieces
have not been studied.
We propose two different speech-recognition interfaces for MIR,
speech completion and
speech spotter,
and describe two MIR-based hands-free jukebox systems
that enable a user to retrieve and play back a musical piece
by saying its title or the artist's name.
The first is
a music-retrieval system with the speech-completion interface
that is suitable for music stores and car driving situations.
When a user can remember only part of the name of a musical piece or an artist
and utters only a remembered fragment,
the system helps the user recall and enter the name
by completing the fragment.
The second is
a background-music playback system with the speech-spotter interface
which can enrich human-human conversation.
When a user is talking to another person,
the system allows the user to enter
voice commands for music-playback control
by spotting a special voice-command utterance
in face-to-face or telephone conversations.
Our experimental results from use of these systems
have demonstrated the effectiveness of
the speech-completion and speech-spotter interfaces.
Video Clips
Demonstration of Music-Retrieval System with the Speech-Completion Interface
In this video,
a user can retrieve
a musical piece or a list of musical pieces by an artist
even if the user can remember only part of the name of the piece or artist.
Video file: 11,506,124 bytes, 1 min 3 sec, MPEG-1 (short excerpt version: 3,058,384 bytes, 16 sec, MPEG-1)
[Video caption]
Forward Speech Completion: Music retrieval by uttering part of artist's name
- Michael- (Michael, uh...)
- (*) A pop-up window containing completion candidates appears.
- Jackson
- (*) A pop-up window containing a list of musical pieces appears.
- No. 1
- (*) The first song is highlighted and played back.
Forward Speech Completion: Music retrieval by uttering part of musical-piece title
- The Way- (The Way, er...)
- (*) A pop-up window containing completion candidates appears.
- No. 1
- (*) The song of the selected title is played back.
Backward Speech Completion: Music retrieval by uttering part of artist's name
- Something- (wildcard keyword)
- (*) A pop-up window with colorful flying decorations appears.
- Jackson
- (*) A pop-up window containing completion candidates appears.
- No. 1
- (*) A pop-up window containing a list of musical pieces appears.
- No. 3
- (*) The third song is highlighted and played back.
This demonstration featured RWC-MDB-G-2001 No.10, 24, 26 from the
RWC Music Database (Music Genre).
Demonstration of Music Playback System with the Speech-Spotter Interface
In this video,
a user can listen to background music
by uttering the name of a musical piece or artist
while talking to another person.
The video shows that users can share music playback on the telephone
as if they were talking in the same room with background music.
Video file: 12,793,620 bytes, 1 min 10 sec, MPEG-1 (short excerpt version: 3,760,232 bytes, 20 sec, MPEG-1)
[Video caption]
B calls A on the telephone.
- A: Yes...
- B: Hello?
- A: Uh..., what's up?
- B: Thanks for all your help last time.
- A: No problem. How have you been since?
- B: Whew! I've been super busy writing that paper... I'm beat.
(Several minutes later)
- A: Uh..., that reminds me, the song called ``Fly Away'' that we heard at that place, wasn't that good?
- B: Oh, what song was that?
- A: Shall we try listening to it?
- B: What? We can hear it now?
- A: Sure. This is a phone with a music-playback system. We can listen to that song like this... Er..., ``Fly Away''!
(*) The system plays the song of that name on both of their handsets.
- B: Wow, amazing! You can listen to a song by just saying its name! Um..., this is a good song.
- A: That's right!
(**)
In this caption, underlining indicates
that the pitch of the underlined words is intentionally raised.
This demonstration featured RWC-MDB-P-2001 No.28 from the
RWC Music Database (Popular Music).
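The caption above states only that the voice command is uttered with intentionally raised pitch; how the speech spotter actually detects this is not described on this page. As a rough illustration, the following minimal sketch assumes that each utterance already comes with a mean pitch estimate (from some pitch tracker, not shown) and flags an utterance as a command when its pitch is well above the speaker's typical pitch. The `Utterance` type, the `spot_commands` function, the `ratio` threshold, and all pitch values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Utterance:
    text: str
    mean_f0_hz: float  # assumed output of an external pitch tracker


def spot_commands(utterances: List[Utterance], ratio: float = 1.3) -> List[str]:
    """Return the texts of utterances whose mean pitch is at least
    `ratio` times the median pitch of the conversation.

    The median serves as a crude per-speaker baseline; the threshold
    is an assumption, not a value taken from the described system.
    """
    if not utterances:
        return []
    baseline = median(u.mean_f0_hz for u in utterances)
    return [u.text for u in utterances if u.mean_f0_hz >= ratio * baseline]


# Made-up pitch values for a few lines of the conversation.
conversation = [
    Utterance("Shall we try listening to it?", 120.0),
    Utterance("Sure. This is a phone with a music-playback system.", 125.0),
    Utterance("We can listen to that song like this...", 118.0),
    Utterance("Fly Away", 190.0),  # spoken with intentionally raised pitch
]

commands = spot_commands(conversation)
```

With these illustrative values, only the raised-pitch utterance is passed on as a command; ordinary conversational speech is ignored, so the users do not need a push-to-talk button.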
Screen Snapshots of Music-Retrieval System with the Speech-Completion Interface
Forward Speech Completion
A user who does not remember the last
part of a name can invoke this completion by uttering
the first part while intentionally lengthening its last syllable
(making a filled pause).
[Entering the phrase ``maikeru jakuson'' (``Michael
Jackson'') when its last part (``jakuson'') is uncertain.]
- Uttering ``maikeru--.''
- A pop-up window containing completion candidates appears.
- Uttering ``No. 2.''
- The second candidate is highlighted and bounces.
- The selected candidate ``maikeru jakuson'' is determined as the recognition result.
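The candidate-generation step in the forward case can be pictured as a prefix lookup over the vocabulary of names. The sketch below works on text rather than audio and is not the recognizer used by the system: it assumes that the filled pause has already been detected and that the fragment is available as a string ending in ``--''. The vocabulary `NAMES` and the functions `forward_complete` and `select` are hypothetical.

```python
from typing import List

# Hypothetical vocabulary; the system's actual lexicon is not given here.
NAMES = ["maikeru boruton", "maikeru jakuson", "janetto jakuson"]


def forward_complete(fragment: str, names: List[str]) -> List[str]:
    """Return the names that start with the uttered fragment.

    Trailing hyphens stand for the lengthened last syllable
    (the filled pause) and are stripped before matching.
    """
    prefix = fragment.rstrip("- ").lower()
    return [n for n in names if n.lower().startswith(prefix)]


def select(candidates: List[str], spoken_number: int) -> str:
    """Pick a candidate by its 1-based number, as in uttering ``No. 2''."""
    if not 1 <= spoken_number <= len(candidates):
        raise ValueError("no candidate with that number")
    return candidates[spoken_number - 1]


candidates = forward_complete("maikeru--", NAMES)
chosen = select(candidates, 2)
```

With this made-up vocabulary, the fragment ``maikeru--'' yields two candidates, and uttering ``No. 2'' selects ``maikeru jakuson'', mirroring the steps listed above.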
Backward Speech Completion
A user who does not remember the first
part of a name can invoke this completion
by uttering the last part
after intentionally lengthening the last syllable of
a predefined special keyword --- called the wildcard keyword.
[Entering the phrase ``maikeru jakuson'' (``Michael Jackson'')
when its first part (``maikeru'') is uncertain.]
- Uttering ``nantoka--.'' (wildcard keyword)
- A pop-up window with colorful flying decorations appears.
- Uttering ``jakuson.''
- A window containing completion candidates appears.
- Uttering ``No. 1.''
- The first candidate ``maikeru jakuson'' is determined as the recognition result.
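The backward case can be sketched as a two-step interaction: the lengthened wildcard keyword arms the completion, and the next utterance is matched against the ends of the names. As before, this is a text-level illustration under stated assumptions, not the system's implementation; the class `BackwardCompleter`, its `hear` method, and the vocabulary are hypothetical, while the wildcard keyword ``nantoka'' is the one shown on this page.

```python
from typing import List, Optional

WILDCARD = "nantoka"  # wildcard keyword shown in the steps above

# Hypothetical vocabulary for illustration only.
NAMES = ["maikeru jakuson", "janetto jakuson", "maikeru boruton"]


class BackwardCompleter:
    """Tracks whether the wildcard keyword has just been uttered."""

    def __init__(self, names: List[str]) -> None:
        self.names = names
        self.armed = False

    def hear(self, utterance: str) -> Optional[List[str]]:
        """Process one utterance.

        Returns None while waiting for the last part of a name, or
        the list of names ending with the uttered last part.
        """
        text = utterance.strip().lower()
        # The wildcard counts only when its last syllable is lengthened,
        # written here as trailing hyphens.
        if text.endswith("-") and text.rstrip("-") == WILDCARD:
            self.armed = True
            return None
        if self.armed:
            self.armed = False
            return [n for n in self.names if n.lower().endswith(text)]
        return None


completer = BackwardCompleter(NAMES)
first = completer.hear("nantoka--")   # arms the completion
candidates = completer.hear("jakuson")
```

Here the first call returns nothing to select from (the stage at which the decorated pop-up window appears), and the second returns the candidates whose names end in ``jakuson''.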
After the artist's name is identified
by either the forward or backward speech completion,
the system shows
a numbered list of titles for the specified artist in a music database,
and a user can select an appropriate title
by uttering either the title or its number.
When the musical piece is identified,
the system plays back its sound file.
[Playing back a musical piece of the artist ``maikeru jakuson''
(``Michael Jackson'')
whose name is determined by the speech-completion interface.]
- Continued from the above figures.
- A pop-up window containing a list of musical pieces appears.
- Uttering ``No. 1.''
- The first musical piece is highlighted and played back.
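Selecting a piece ``by uttering either the title or its number'' can be sketched as follows. The database contents, the placeholder titles, and the function `choose_title` are hypothetical; the sketch operates on recognized text and leaves actual playback out.

```python
import re
from typing import Dict, List, Optional

# Hypothetical music database: artist name -> numbered list of titles.
DATABASE: Dict[str, List[str]] = {
    "maikeru jakuson": ["title a", "title b", "title c"],
}


def choose_title(artist: str, utterance: str,
                 database: Dict[str, List[str]]) -> Optional[str]:
    """Resolve an utterance to one of the artist's titles.

    Accepts either a 1-based number such as ``No. 1'' or the title
    itself; returns None when nothing matches.
    """
    titles = database.get(artist, [])
    text = utterance.strip().lower()
    match = re.fullmatch(r"no\.?\s*(\d+)", text)
    if match:
        index = int(match.group(1))
        return titles[index - 1] if 1 <= index <= len(titles) else None
    for title in titles:
        if title.lower() == text:
            return title
    return None


by_number = choose_title("maikeru jakuson", "No. 1", DATABASE)
by_title = choose_title("maikeru jakuson", "title c", DATABASE)
```

Once a title is resolved, the system described above plays back the corresponding sound file.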
Acknowledgments:
This research
utilized the
RWC Music Database "RWC-MDB-P-2001" (Popular Music)
and "RWC-MDB-G-2001" (Music Genre).
Masataka GOTO
<m.goto [at] aist.go.jp>
All pages are copyrighted by the author.
Unauthorized reproduction is strictly prohibited.