Speech Spotter:
On-demand Speech Recognition in Human-Human Conversation
on the Telephone or in Face-to-Face Situations
This project was proposed and carried out by
Masataka Goto,
Koji Kitayama,
Katunobu Itou, and
Tetsunori Kobayashi.
Introduction
This paper describes a novel speech-interface function,
called ``speech spotter'',
which enables a user to enter voice commands into a speech recognizer
in the midst of natural human-human conversation.
Until now,
it has been difficult to use automatic speech recognition
in human-human conversation
because it is not easy to judge, from microphone input alone,
whether a user is speaking to another person or to a speech recognizer.
We solve this problem by using two kinds of nonverbal speech information:
a filled pause (a vowel-lengthening hesitation like ``er...'')
and voice pitch.
Only when a user utters a voice command with a high pitch
just after a filled pause
is the voice command accepted by the speech recognizer.
By using this speech-spotter function,
we have built two application systems:
an on-demand information system for assisting human-human conversation
and a music-playback system for enriching telephone conversation.
The results from using these systems have shown that
the speech-spotter function is robust and convenient enough
to be used in face-to-face or cellular-phone conversations.
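
The acceptance rule described above (a command is accepted only when it is spoken with a high pitch just after a filled pause) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Utterance` and `FilledPause` types, the `accept_command` function, the per-speaker baseline pitch, and both threshold values are hypothetical and chosen only for illustration. The sketch also assumes that filled-pause detection and pitch estimation have already been done elsewhere.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the source gives no numeric values.
MAX_GAP_SEC = 0.5            # assumed: command must start this soon after the filled pause
PITCH_RATIO_THRESHOLD = 1.2  # assumed: "high pitch" = mean F0 at least 20% above baseline


@dataclass
class Utterance:
    text: str          # recognized words of the candidate command
    start_sec: float   # time at which the utterance starts
    mean_f0_hz: float  # mean voice pitch (F0) over the utterance


@dataclass
class FilledPause:
    end_sec: float     # time at which the filled pause (e.g. "er...") ends


def accept_command(utt: Utterance,
                   pause: Optional[FilledPause],
                   baseline_f0_hz: float) -> bool:
    """Accept the utterance as a voice command only if it both
    follows a filled pause closely and is spoken with a raised pitch."""
    if pause is None:
        return False  # no filled pause detected: treat as ordinary conversation
    gap = utt.start_sec - pause.end_sec
    follows_pause = 0.0 <= gap <= MAX_GAP_SEC
    high_pitch = utt.mean_f0_hz >= PITCH_RATIO_THRESHOLD * baseline_f0_hz
    return follows_pause and high_pitch


if __name__ == "__main__":
    baseline = 120.0  # assumed normal pitch of this speaker, in Hz
    pause = FilledPause(end_sec=2.0)
    # Raised pitch right after "er...": accepted as a command.
    print(accept_command(Utterance("today's weather", 2.1, 170.0), pause, baseline))
    # Same words in a normal tone of voice: ignored.
    print(accept_command(Utterance("today's weather", 2.1, 125.0), pause, baseline))
    # Raised pitch but no preceding filled pause: ignored.
    print(accept_command(Utterance("today's weather", 2.1, 170.0), None, baseline))
```

Requiring both cues together is what lets ordinary conversation pass through untouched: a hesitation alone or a raised voice alone never triggers the recognizer in this sketch.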
Video Clips
- Introduction of Speech Spotter
In this video,
a user demonstrates both
the on-demand information system for assisting human-human conversation
and
the music-playback system for enriching telephone conversation.
Introduction of Speech Spotter (8,798,664 bytes, 48 sec, MPEG-1 file)
[Video caption]
Er..., in this way, er..., if I say ``today's weather'' in a normal
tone of voice, nothing is detected by the speech recognition system.
However, if I say, ``Er..., today's weather?'' (*1) after pausing with
a sound like ``er'' or ``uh,'' that is, if I intentionally say
``Today's weather?'' with a raised voice, it will be recognized by the
system. You can also listen to background music by saying the name of
a song. Let's try playing the song, uh..., ``it's all right.'' (*2)
(*1) After this utterance, the system displays the answer ``Clear.''
(*2) The system plays the song with that title.
(**) In this caption, underlining indicates
that the pitch of the underlined words is intentionally raised.
This demonstration featured RWC-MDB-P-2001 No.24 from the
RWC Music Database (Popular Music).
- Demonstration of On-Demand Information System for Assisting Human-Human Conversation
In this video,
users chatting in front of the microphone can easily obtain information on
the date and weather through speech-spotter utterances.
Demonstration of On-Demand Information System for Assisting Human-Human Conversation (5,549,712 bytes, 30 sec, MPEG-1 file)
[Video caption]
A: Hey, I've suddenly forgotten... What is the date today?
B: Yes, what is today's date?
   Well, shall we ask the On-Demand Conversation Assistance System?
   Er..., what's the date today?
   (*) The system displays the current date and time:
   ``August 22, 2003, Friday, 23:51:10 JST''
A: Uh, it's already the 22nd!
B: Oh really? Well, that means our excursion is tomorrow.
   I hope it doesn't rain.
A: Shall we ask about the weather too?
   Er..., what's tomorrow's weather?
   (*) The system checks tomorrow's weather report and displays the result: ``Clear''
B: Uh, no rain. Great!
A: That's good!
(**) In this caption, underlining indicates
that the pitch of the underlined words is intentionally raised.
- Demonstration of Music-Playback System for Enriching Telephone Conversation
In this video,
users can share music playback on the telephone
as if they were talking in the same room with background music.
Demonstration of Music-Playback System for Enriching Telephone Conversation (12,793,620 bytes, 1 min 10 sec, MPEG-1 file)
[Video caption]
B calls A on the telephone.
A: Yes...
B: Hello?
A: Uh..., what's up?
B: Thanks for all your help last time.
A: No problem. How have you been since?
B: Whew! I've been super busy writing that paper... I'm beat.
(Several minutes later)
A: Uh..., that reminds me,
   the song called ``Fly Away'' that we heard at that place, wasn't that good?
B: Oh, what song was that?
A: Shall we try listening to it?
B: What? We can hear it now?
A: Sure. This is a phone with a music-playback system.
   We can listen to that song like this...
   Er..., ``Fly Away''!
   (*) The system plays the song of that name on both of their handsets.
B: Wow, amazing! You can listen to a song by just saying its name!
   Um..., this is a good song.
A: That's right!
(**) In this caption, underlining indicates
that the pitch of the underlined words is intentionally raised.
This demonstration featured RWC-MDB-P-2001 No.28 from the
RWC Music Database (Popular Music).
References:
- Masataka Goto, Katunobu Itou, Koji Kitayama, and Tetsunori Kobayashi:
Speech-Recognition Interfaces for Music Information Retrieval:
``Speech Completion'' and ``Speech Spotter'',
Proceedings of
the 5th International Conference on Music Information Retrieval
(ISMIR 2004),
pp.403-408, October 2004.
- Masataka Goto, Koji Kitayama, Katunobu Itou, and Tetsunori Kobayashi:
Speech Spotter:
On-demand Speech Recognition in Human-Human Conversation
on the Telephone or in Face-to-Face Situations,
Proceedings of
the 8th International Conference on Spoken Language Processing
(ICSLP-2004),
pp.1533-1536, October 2004.
Acknowledgments:
This research
utilized the
RWC Music Database "RWC-MDB-P-2001" (Popular Music).
Masataka GOTO
<m.goto [at] aist.go.jp>
All pages are copyrighted by the author.
Unauthorized reproduction is strictly prohibited.
last update: September 15, 2004