This paper presents the results of two experiments on singing skill evaluation, in which human subjects (raters) judged the subjective quality of singing of previously unheard melodies. The results serve as a preliminary basis for developing an automatic singing skill evaluation method for unknown melodies. Such an evaluation system can be a useful tool for improving singing skills, and can also be applied to broaden the scope of music information retrieval and singing voice synthesis. Previous research on singing skill evaluation for unknown melodies has focused on analyzing the characteristics of the singing voice, but its findings have not been directly applied to automatic evaluation or compared with evaluations by human subjects.
The two experiments used the rank ordering method, in which the subjects ordered a given group of stimuli according to their preference. Experiment 1 was intended to explore the criteria that human subjects use in judging singing skill and the stability of their judgments, using unaccompanied singing sequences (solo singing) as the stimuli. Experiment 2 used F0 sequences (F0 singing) that were extracted from the solo singing and resynthesized as sinusoidal waves; it was intended to identify the contribution of F0 to the judgment. In experiment 1, six key features were extracted from the introspective reports of the subjects as being significant for judging singing skill. The results of experiment 1 show that 88.9% of the correlations between the subjects' evaluations were significant at the 5% level. This figure drops to 48.6% in experiment 2, suggesting that the contribution of F0 alone is relatively low, although the median ratings of stimuli evaluated as good were higher than the median ratings of stimuli evaluated as poor in all cases.
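As an illustration of how agreement between raters in a rank ordering task can be quantified, the following is a minimal sketch. It assumes Spearman's rank correlation and a one-sided exact permutation test, neither of which is stated above as the statistic actually used; the function names and the toy rankings are hypothetical.

```python
from itertools import combinations, permutations


def sum_sq_diff(rank_a, rank_b):
    """Sum of squared rank differences between two rankings."""
    return sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same n stimuli (ranks 1..n, no ties)."""
    n = len(rank_a)
    return 1.0 - 6.0 * sum_sq_diff(rank_a, rank_b) / (n * (n * n - 1))


def exact_p_value(rank_a, rank_b):
    """One-sided exact p-value: the share of all possible orderings that agree
    with rank_a at least as well as rank_b does. Only feasible for small n,
    because all n! orderings are enumerated."""
    n = len(rank_a)
    observed = sum_sq_diff(rank_a, rank_b)
    count = total = 0
    for perm in permutations(range(1, n + 1)):
        total += 1
        # A smaller sum of squared differences means a higher rho.
        if sum_sq_diff(rank_a, perm) <= observed:
            count += 1
    return count / total


def share_significant(rankings, alpha=0.05):
    """Fraction of rater pairs whose rank correlation is significant at alpha."""
    pairs = list(combinations(rankings, 2))
    hits = sum(1 for a, b in pairs if exact_p_value(a, b) < alpha)
    return hits / len(pairs)


# Hypothetical data: three raters each rank the same five stimuli.
rankings = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 5, 4],
    [5, 4, 3, 2, 1],
]
print(share_significant(rankings))
```

Counting significant pairs in this way yields a single percentage per experiment, in the same form as the figures reported above.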
The results show that human subjects evaluate singing skill for unknown melodies consistently. This suggests that their evaluation relies on easily discernible features that are independent of the particular singer or melody. The approach presented in the paper uses pitch interval accuracy and vibrato (intentional, periodic fluctuation of F0), both of which are independent of the specific characteristics of the singer or melody. These features were tested in a 2-class (good/poor) classification test with 600 song sequences, and achieved a classification rate of 83.5%.
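To make the notion of vibrato as a periodic fluctuation of F0 concrete, the following is a minimal sketch that estimates a vibrato rate (Hz) and extent (cents) from an F0 sequence. It is not the feature extraction used in the paper: the zero-crossing approach, the function names, the 440 Hz reference, and the synthetic test signal are all assumptions chosen for illustration.

```python
import math


def hz_to_cents(f0_hz, ref_hz=440.0):
    """Convert F0 values in Hz to cents relative to ref_hz (assumed reference)."""
    return [1200.0 * math.log2(f / ref_hz) for f in f0_hz]


def vibrato_rate_and_extent(f0_hz, frame_rate):
    """Estimate vibrato rate (Hz) and extent (cents) of a voiced F0 segment.

    The mean pitch in cents is removed, zero crossings of the remaining
    deviation are located by linear interpolation, and the rate is derived
    from the time spanned by the half periods between the first and last
    crossing. Samples that are exactly zero are not counted as crossings.
    Returns (0.0, 0.0) when fewer than two crossings are found.
    """
    cents = hz_to_cents(f0_hz)
    mean = sum(cents) / len(cents)
    dev = [c - mean for c in cents]

    crossings = []  # crossing times in seconds
    for i in range(len(dev) - 1):
        a, b = dev[i], dev[i + 1]
        if a * b < 0:
            # Fractional position of the zero between frame i and frame i + 1.
            crossings.append((i + a / (a - b)) / frame_rate)

    if len(crossings) < 2:
        return 0.0, 0.0

    half_periods = len(crossings) - 1
    rate = half_periods / (2.0 * (crossings[-1] - crossings[0]))
    extent = (max(dev) - min(dev)) / 2.0
    return rate, extent


# Synthetic example (assumed values): 6 Hz vibrato of +/-50 cents around 440 Hz,
# sampled at 100 F0 frames per second for one second.
frame_rate = 100
f0 = [
    440.0 * 2.0 ** (50.0 * math.sin(0.3 + 2.0 * math.pi * 6.0 * k / frame_rate) / 1200.0)
    for k in range(100)
]
rate, extent = vibrato_rate_and_extent(f0, frame_rate)
print(round(rate, 2), round(extent, 1))
```

Working in cents rather than Hz makes the measure independent of the absolute pitch of the singer or melody, which is in keeping with the requirement stated above. A real F0 track would additionally need voiced/unvoiced handling and a local rather than global mean.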
Following the results of the subjective evaluation, MiruSinger, a singing skill visualization interface, was implemented. MiruSinger provides real-time visual feedback on the singing voice and focuses on visualizing two key features: F0 (for improving pitch accuracy) and vibrato sections (for improving singing technique). Unlike previous systems, it uses real-world music CD recordings as reference data. The F0 of the vocal part is estimated automatically from the music CD recordings, and the estimate can then be corrected by hand interactively using a graphical interface on the MiruSinger screen.
This research used the RWC Music Database "RWC-MDB-P-2001" (Popular Music) and the AIST Humming Database.
The authors would like to thank Mr. Hirokazu Kameoka (the University of Tokyo) for valuable discussions and Dr. Elias Pampalk (CREST/AIST) for proofreading an earlier version.