Keywords: Music information processing, Music information retrieval, Music understanding, Metadata of musical pieces, End-user interfaces
PACS number: 43.75.-z [DOI: 10.1250/ast.25.419]
Music information processing has been widely deployed in music industries over the years. Of course, technologies oriented to musicians have long been studied, including sound synthesis on music synthesizers, desktop music production based on MIDI sequencers, and various kinds of support for composing, performing, and recording music. Such tools have already become an essential part of the music-production process. More recently, however, the focus has shifted from these conventional tools to new technologies that target the direct enjoyment of music by end users who are not musicians. For example, it has become relatively easy to ``rip'' audio signals from compact discs (CDs), compress them, and manage many musical pieces on a personal computer. It has also become possible to load a huge number of songs onto a portable music player (e.g., Apple iPod), enabling anyone to carry their personal collection of music anywhere and to listen to it at any time.
A variety of factors can be given for this trend, including advances in computer hardware (high processing speeds and large-capacity, small-size memory and hard disks), the spread of the Internet, and the provision of low-cost audio input/output devices as standard equipment. The standardization of MPEG Audio Layer 3 (MP3) in 1992, its spread in the latter half of the 1990s, and the establishment of MP3-based businesses in response to end-user demand have also played a role. This trend is accelerating all the more in the first half of the 2000s with the proposal of Ogg Vorbis, MPEG-4 AAC, Windows Media Audio (WMA), and other compression systems following on the heels of MP3. Enterprises for delivering music via the Internet are also appearing in rapid succession.
End users who are not musicians are not generally proficient in music --- their knowledge of notes, harmony, and other elements of music is usually limited. Furthermore, they generally have little desire to create music. They are quite interested, however, in retrieving and listening to their favorite music or a portion of a musical piece in a convenient and flexible way. Recent research themes of music information processing reflect such end-user demand. The target of processing is expanding from the internal content of individual musical pieces (notes, chords, etc.) to entire musical pieces and even sets of musical pieces. Accordingly, research is becoming active in music systems that can be used by people with no musical knowledge. Typical technologies driving this trend are technology for computing similarity between musical pieces and for retrieving and classifying music; technology for referring to what music friends and other people listen to and for selecting music accordingly; and technology for creating advanced music-handling interfaces.
Focusing on this emerging research trend, this paper introduces recent studies on music information processing from a unique perspective.
In contrast to past research that focused on the internal contents of individual musical pieces, the past ten years have seen the growth of a new research field targeting the retrieval, classification, and management of large sets of music in which a single musical piece is treated as a unit. This field, which is called Music Information Retrieval (MIR), has become quite active, and since 2000, the International Conference on Music Information Retrieval (ISMIR) has been held annually. A variety of topics are being researched in this field, but in the following, we introduce three ways of retrieving music based on audio signals as opposed to text searches based on bibliographic information (such as titles and artist names from CDDB, an online database of CD information).
As the name implies, Query by Humming (QBH) enables one to retrieve the title of a musical piece by humming or singing its melody using sounds like ``la-la-la...'' In other words, humming or singing a melody becomes the search key for finding a musical piece with that melody. The use of such search keys raises some issues, however, such as how to deal with errors when singing off key and how to absorb differences in key and tempo. Specific methods differ in terms of the database used, which may consist of melodies only [1, 2, 3, 4, 5], standard MIDI files (SMF) of entire musical pieces [6, 7, 8], or audio signals of entire musical pieces [9, 10]. If we use a melody-only database, similarity with a search key can be directly computed, but for SMF, the track containing a melody must be identified before computing similarity. In the case of audio signals, similarity with a melody included in a mixture of sounds must be computed, which is even harder to achieve.
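To illustrate the basic idea in the simplest case of a melody-only database, the following Python sketch (not any of the cited methods; the function names and the toy database are hypothetical) compares a sung or hummed pitch sequence with stored melodies. Differences in key are absorbed by comparing pitch intervals rather than absolute pitches, and differences in tempo and small singing errors are absorbed by dynamic time warping (DTW).

# Illustrative sketch of query by humming against a melody-only database.
def intervals(pitches):
    """Convert absolute MIDI note numbers into successive pitch intervals (key-invariant)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def dtw_distance(query, reference):
    """Plain DTW over two interval sequences; smaller means more similar."""
    n, m = len(query), len(reference)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # skip a query note
                                 d[i][j - 1],      # skip a reference note
                                 d[i - 1][j - 1])  # match
    return d[n][m]

def query_by_humming(hummed_pitches, melody_database):
    """Return database titles ranked by similarity to the hummed melody."""
    q = intervals(hummed_pitches)
    scored = [(dtw_distance(q, intervals(mel)), title)
              for title, mel in melody_database.items()]
    return [title for _, title in sorted(scored)]

# Hypothetical toy database of melodies given as MIDI note numbers.
database = {
    "Twinkle Twinkle": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy":      [64, 64, 65, 67, 67, 65, 64, 62],
}
# A hummed query transposed up by two semitones and slightly shortened;
# "Twinkle Twinkle" should be ranked first.
print(query_by_humming([62, 62, 69, 69, 71, 69], database))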
For someone who would like to know the title of a musical piece that is currently playing on the street or elsewhere, this retrieval method enables that title to be identified based on a fragment of that piece, which can be recorded on a cellular phone. The fragment is therefore the search key and the method searches for the musical piece containing that fragment. Important issues here are how to achieve efficient searching and how to absorb acoustic fluctuations caused by noise and distortion on the transmission path. Proposed methods include the time-series active search method based on histograms of vector-quantized power-spectrum shapes [11] and a method based on patterns of power-spectrum peaks [12].
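The following sketch conveys the general flavor of peak-based identification; it is an illustration rather than a reimplementation of [11] or [12], and the parameters and function names are illustrative assumptions. Prominent spectrogram peaks are paired into hashes and stored in an inverted index, and a noisy recorded fragment is identified by voting on consistent time offsets.

# Generic illustration of fingerprint-style music identification.
import numpy as np
from collections import defaultdict

def spectral_peaks(signal, frame=1024, hop=512, peaks_per_frame=3):
    """Return (frame_index, frequency_bin) pairs of the strongest spectral bins."""
    out = []
    for t, start in enumerate(range(0, len(signal) - frame, hop)):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * np.hanning(frame)))
        for b in np.argsort(spectrum)[-peaks_per_frame:]:
            out.append((t, int(b)))
    return out

def hashes(peaks, fan_out=5):
    """Pair each peak with a few later peaks; each hash encodes both bins and the time gap."""
    peaks = sorted(peaks)
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            yield (f1, f2, t2 - t1), t1

def build_index(database):
    index = defaultdict(list)          # hash -> [(song_id, anchor_time), ...]
    for song_id, signal in database.items():
        for h, t in hashes(spectral_peaks(signal)):
            index[h].append((song_id, t))
    return index

def identify(fragment, index):
    votes = defaultdict(int)           # (song_id, time_offset) -> vote count
    for h, t_query in hashes(spectral_peaks(fragment)):
        for song_id, t_song in index.get(h, []):
            votes[(song_id, t_song - t_query)] += 1
    return max(votes, key=votes.get)[0] if votes else None

Robustness to noise and distortion comes from the voting step: even if many peaks are lost on the transmission path, the surviving hash matches still tend to agree on a single time offset for the correct piece.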
Given that one likes certain musical pieces, this method searches for other musical pieces having a similar feeling. Here, the search key is a musical piece itself, and the method searches for similar pieces. To this end, similarity must be defined based on various features such as timbral texture within a piece (power-spectrum shape) [13, 14], rhythm [14, 15, 16, 17], modulation spectrum [18], and singer voice [19]. Similarity is also important for purposes other than retrieval. For example, the use of similarity to automatically classify musical pieces (into genres, music styles, etc.) is also being researched [14, 17, 18, 19, 20]. It is difficult, however, to compute an appropriate similarity between musical pieces that takes such various factors into account. This issue, together with the understanding of musical audio signals introduced in the following section, will need further research in the years to come.
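As a minimal illustration of similarity based on timbral texture alone, the following sketch summarizes the frame-wise power-spectrum shape of each piece and compares the summaries by Euclidean distance; the particular features and parameters are chosen for brevity and are not those of the cited studies.

# Minimal sketch of timbral-texture similarity between two mono signals.
import numpy as np

def spectral_shape_features(signal, sr=44100, frame=2048, hop=1024):
    """Summarize the frame-wise power-spectrum shape by feature means and deviations."""
    feats = []
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    for start in range(0, len(signal) - frame, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * np.hanning(frame)))
        power = spectrum ** 2 + 1e-12
        centroid = np.sum(freqs * power) / np.sum(power)            # brightness
        rolloff = freqs[np.searchsorted(np.cumsum(power), 0.85 * np.sum(power))]
        flatness = np.exp(np.mean(np.log(power))) / np.mean(power)  # noisiness
        feats.append([centroid, rolloff, flatness])
    feats = np.array(feats)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])  # texture summary

def timbral_distance(piece_a, piece_b):
    """Smaller distance means more similar timbral texture."""
    a, b = spectral_shape_features(piece_a), spectral_shape_features(piece_b)
    return float(np.linalg.norm(a - b))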
Research related to the understanding of musical audio signals has developed significantly over the last ten years. Before that, it was common to research the segregation and extraction of individual sound components making up an audio signal (sound source segregation) and to use that information to automatically generate a musical score (automatic transcription). But in 1997, in reconsideration of what it means for human beings to understand music, a new research approach was proposed on music understanding (Sect. 3.2, Music scene description [21, 22, 23]) based on the viewpoint that listeners understand music without segregating sound sources and without mentally representing audio signals as musical scores [24]. Research themes conforming to this approach, including beat tracking, melody extraction, and music structure analysis, have also been proposed.
This major development on the understanding of musical audio signals has been supported by advances in hardware and in techniques for processing audio signals. Ten years ago, it was still difficult to calculate a Fast Fourier Transform (FFT) in real time, but nowadays, it can be performed so fast that the time required for its computation can essentially be ignored. This jump in processing performance has let researchers devise computationally intensive approaches that could not be considered in the past, and has also promoted the use of a wide range of statistical techniques. For example, techniques based on probabilistic models such as the Hidden Markov Model (HMM) and various techniques making use of maximum likelihood estimation and Bayes estimation have been proposed.
Automatic transcription has a long history as a research theme going back to the 1970s, and has progressed steadily as the difficulty of the target music has increased from monophonic sounds of melodies to polyphonic sounds from a single instrument and a mixture of sounds from several instruments. This progression has been accompanied by a shift toward more specialized research topics, namely, sound source segregation and estimation of fundamental frequency (F0, perceived as pitch).
Because space does not allow an exhaustive introduction to the many studies in this research field, we here focus on new approaches that first appeared in the past ten years. In 1994, Kashino et al. introduced a method based on a probabilistic model and implemented as a process model called OPTIMA [25, 26]. This method was novel in its use of a graphical model to describe the hierarchical structure of frequency components, musical notes, and chords and in determining the most likely interpretation based on this hierarchical relationship. Then, in 1999, Goto proposed a predominant-F0 estimation method (PreFEst) that does not assume the number of sound sources [21, 23, 27]. This method prepares probability distributions that represent the shape of harmonic structures for all possible F0s, and models input frequency components as a mixture (weighted sum) of those distributions. It then estimates the parameters of this model --- the amplitude (weight) of each component sound in the input sound mixture and the shape of its harmonic structure --- by using Maximum A Posteriori Probability (MAP) estimation executed by the Expectation-Maximization (EM) algorithm. This method can be extended, in principle, to an inharmonic structure [23, 27], and as such, can be considered a framework for understanding general sound mixtures.
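The following is a greatly simplified sketch in the spirit of PreFEst, under strong assumptions: the harmonic-structure shapes are fixed rather than estimated, no prior distributions are used (maximum likelihood rather than MAP), and the frequency axis is linear. Only the mixture weights, which indicate how predominant each candidate F0 is, are estimated by the EM algorithm; the actual method is considerably richer.

# Simplified PreFEst-like estimation of F0 weights in a sound mixture.
import numpy as np

def tone_model(f0, freqs, n_harmonics=8, width=10.0):
    """Fixed harmonic-structure PDF: Gaussians at f0, 2*f0, ... with 1/h amplitudes."""
    pdf = np.zeros_like(freqs)
    for h in range(1, n_harmonics + 1):
        pdf += (1.0 / h) * np.exp(-0.5 * ((freqs - h * f0) / width) ** 2)
    return pdf / pdf.sum()

def estimate_f0_weights(observed_spectrum, freqs, candidate_f0s, iterations=50):
    obs = observed_spectrum / observed_spectrum.sum()            # spectrum treated as a PDF
    models = np.array([tone_model(f0, freqs) for f0 in candidate_f0s])
    weights = np.full(len(candidate_f0s), 1.0 / len(candidate_f0s))
    for _ in range(iterations):
        mixture = weights @ models + 1e-12                       # p(x) under current weights
        responsibility = (weights[:, None] * models) / mixture   # E-step
        weights = responsibility @ obs                           # M-step: updated weights
    return weights   # larger weight = more predominant candidate F0

# Toy example: a spectrum containing harmonics of 200 Hz and (weaker) 300 Hz.
freqs = np.arange(0.0, 4000.0, 5.0)
spectrum = tone_model(200.0, freqs) + 0.5 * tone_model(300.0, freqs)
candidates = np.arange(100.0, 600.0, 10.0)
w = estimate_f0_weights(spectrum, freqs, candidates)
print(candidates[np.argmax(w)])   # should be close to 200 Hz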
Other proposed methods include a method for sequentially determining the components in a sound mixture by repeatedly estimating the predominant F0 and removing its harmonic components [28]; a method for estimating model parameters such as the number of simultaneous sounds, the number of frequency components making up each sound, F0s, and amplitude by modeling the signal as a weighted sum of sound waveforms in the time domain and applying the Markov Chain Monte Carlo (MCMC) algorithm [29]; a method for estimating notes, tempo, and waveforms by associating them with a graphical model that models the waveform-generation process when performing a musical score at a certain (local) tempo [30]; and a method that formalizes the problem as the clustering of frequency components under harmonic-structure constraints and determines the number of clusters (sound sources) that minimizes the Akaike Information Criterion (AIC) so as to estimate the median (F0) and weight (amplitude) of each cluster [31].
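As a rough sketch of the first of these ideas, estimating the predominant F0 by a harmonic-salience measure and then removing its harmonic components from the residual spectrum before repeating, the following illustration uses a simple salience function and illustrative parameters; the details of the actual method [28] differ.

# Rough sketch of iterative predominant-F0 estimation and harmonic removal.
import numpy as np

def harmonic_salience(spectrum, freqs, f0, n_harmonics=8, tol=15.0):
    """Sum of spectral energy found near the harmonics of a candidate F0."""
    total = 0.0
    for h in range(1, n_harmonics + 1):
        mask = np.abs(freqs - h * f0) < tol
        if mask.any():
            total += spectrum[mask].max() / h
    return total

def iterative_f0s(spectrum, freqs, candidates, n_sources=3, n_harmonics=8, tol=15.0):
    """Repeatedly pick the most salient F0 and zero out its harmonics in the residual."""
    residual = spectrum.copy()
    found = []
    for _ in range(n_sources):
        best = max(candidates,
                   key=lambda f0: harmonic_salience(residual, freqs, f0, n_harmonics, tol))
        found.append(best)
        for h in range(1, n_harmonics + 1):
            residual[np.abs(freqs - h * best) < tol] = 0.0   # remove the chosen F0's harmonics
    return found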
Music scene description [21, 22, 23] aims to achieve an understanding of musical audio signals at the level of untrained listeners. This contrasts with most studies in the past that aimed to achieve it at the level of trained musicians by identifying all musical notes forming a musical score or obtaining segregated signals from sound mixtures. Music scene description features the description of ``scenes'' that occur within a musical performance, such as melody, bass, beat, chorus and phrase repetition, structure of the musical piece, and timbre of musical instruments. The following sections introduce methods for obtaining descriptions of such scenes.
To respond directly to end-user demands for flexible retrieval of music for one's listening pleasure, much research has been targeting the extraction and usage of metadata to enhance the listening of musical pieces or to facilitate their retrieval. Such metadata include information on the composers and performers of musical pieces and the listener's preferences with regard to those pieces.
End users find it convenient if they can refer to a list of music that other people have listened to as a basis for selecting their own music. On the Internet, music-sales sites (e.g., Amazon) and music-review sites (e.g., Allmusic) collect metadata daily, including user evaluations, impressions, and purchase histories. This information can be subjected to collaborative filtering to achieve services that promote the purchasing of music by recommending artists and albums to users [59, 60, 61] and proposing playlists [62, 63, 64, 65]. (The original meaning of ``playlist'' is a broadcast or concert program of musical pieces, but here it means a list of musical pieces to be played back on a media player or other device.)
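A minimal sketch of user-based collaborative filtering of the kind such services build on (the actual systems cited above are far more elaborate) is shown below: pieces are recommended when they appear in the histories of users whose listening or purchase histories overlap most with the target user's. The toy data and names are hypothetical.

# Minimal user-based collaborative filtering over listening histories.
from collections import Counter

def recommend(target_user, histories, top_n=5):
    """histories: dict mapping user -> set of piece identifiers."""
    target = histories[target_user]
    scores = Counter()
    for user, pieces in histories.items():
        if user == target_user:
            continue
        overlap = len(target & pieces) / (len(target | pieces) or 1)  # Jaccard similarity
        for piece in pieces - target:        # candidate pieces the target has not heard
            scores[piece] += overlap
    return [piece for piece, _ in scores.most_common(top_n)]

histories = {
    "alice": {"song_a", "song_b", "song_c"},
    "bob":   {"song_b", "song_c", "song_d"},
    "carol": {"song_a", "song_e"},
}
print(recommend("alice", histories))   # -> ["song_d", "song_e"]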
Because collaborative filtering by itself cannot easily deal with unknown new musical pieces, it can be reinforced with content-based filtering [61]. This makes the music-understanding methods described in Section 3 all the more important. Using the results of those methods can aid in the generation of more appropriate recommendations and playlists based on the acoustical features and content of musical pieces.
As an example of using both metadata and acoustical features, Whitman et al. have developed methods for detecting artist styles [20] and identifying artists [66] by combining acoustical features of an artist's musical pieces with statistical data on words or phrases on WWW pages that include that artist's name. Ellis et al., moreover, have investigated an automatic measure of the similarity between artists by extracting metadata from lists of similar artists on a music-review site, from end-user music collections, and from statistical data on words or phrases on WWW pages that include artist names [67]. In addition, Berenzweig et al. have compared similarity based on such metadata with similarity based on audio signals [68].
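As a toy illustration of the text side of such approaches (the cited studies combine this with acoustical features and use far richer term weighting), each artist can be represented by word statistics gathered from pages that mention the artist, and artist similarity can be taken as the cosine similarity between the resulting term vectors; the data below are invented for illustration.

# Toy artist similarity from word statistics of associated web pages.
import math
from collections import Counter

def term_vector(documents):
    """Bag-of-words counts over all pages associated with one artist."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

pages = {
    "artist_x": ["energetic guitar rock with distorted riffs"],
    "artist_y": ["melodic rock guitar and driving drums"],
    "artist_z": ["ambient electronic textures and slow synth pads"],
}
vectors = {name: term_vector(docs) for name, docs in pages.items()}
print(cosine_similarity(vectors["artist_x"], vectors["artist_y"]))  # higher
print(cosine_similarity(vectors["artist_x"], vectors["artist_z"]))  # lower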
While methods for encoding only musical-score information have already reached a sufficiently practical level [69], there are currently several XML-based proposals for music-description methods (including metadata) and their standardization. For example, MusicXML [70] and WEDELMUSIC [71] have been proposed for describing music at a symbol level including musical-score information. Likewise, MPEG-7 Audio [72] has been standardized for describing metadata related to musical audio signals such as melody contour and statistical data on the power spectrum. We expect various research and development activities conforming to these proposals to appear.
To enable end users without detailed knowledge of music to deal with music on their own terms, it is important that new types of interfaces be developed since existing tools designed for musicians are not sufficient for this purpose.
To achieve interfaces that can be used naturally and without hassle, we can consider the use of real-world objects themselves as interfaces. Here, it is important that such an approach conforms to conventional usage. The musicBottles [73] and the FieldMouse [74] are two examples of this real-world approach. The musicBottles is a music-playback interface that associates each musical-instrument part in a musical piece with a different glass bottle and enables a user to play the sound of any part only when that bottle's cap is open. The FieldMouse is an input device that combines an ID-tag detector such as a barcode reader with a relative-position detector such as a mouse. A user can then select musical pieces and adjust the playback volume, for example, by moving the FieldMouse through space to read ID-tags that correspond to those operations.
In conventional listening stations and media players, an end user who would like to listen to only the chorus section of a song must search for it manually by pressing the fast-forward button repeatedly. To make this task easier, SmartMusicKIOSK [75] adds a ``NEXT CHORUS'' button by employing the RefraiD method [48] described earlier. Pressing this button forces a jump to the next chorus section in that song through automatic chorus detection. This makes it easy for a user to immediately skip a section (part) of no interest within a song, much like the ``NEXT TRACK'' button on a CD player makes it easy to skip a song (track) of no interest.
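At the interface level, the operation of such a button can be sketched as follows, assuming that a chorus-detection method such as RefraiD has already produced a list of chorus sections as start and end times; the function and data names are illustrative, not taken from SmartMusicKIOSK itself.

# Sketch of a "next chorus" jump given pre-detected chorus sections.
def next_chorus_position(current_time, chorus_sections):
    """Return the start time of the next chorus section after the current playback position."""
    upcoming = [start for start, _ in chorus_sections if start > current_time]
    return min(upcoming) if upcoming else None

chorus_sections = [(45.2, 62.0), (110.5, 127.3), (178.9, 205.4)]   # hypothetical detection result
print(next_chorus_position(70.0, chorus_sections))   # -> 110.5: the jump destination in seconds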
Considering that end users will be faced with various types of music systems in the future, they should be able to input musical information (such as melody and rhythm) associated with specific musical pieces. Effective means to this end include humming (Sect. 2.1) and onomatopoeia as introduced in the following. A music notation system ``Sutoton Music'' [76] enables a melody to be described in text form in the manner of ``do re mii so-mmi re do'' for playback on a computer. A drum-pattern retrieval method by voice percussion (beatboxing) [77] aims to recognize the drum part of a musical piece that the user utters using natural sounds like ``dum ta dum-dum ta'' and to search for that piece on the basis of that drum part.
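As a toy example of the general idea of entering a melody as text (the actual Sutoton Music notation differs and is richer), solfege syllables can be mapped to note numbers so that the melody can be played back or used as a search key; the mapping and function below are illustrative.

# Toy text-to-melody conversion: solfege syllables to MIDI note numbers.
SOLFEGE = {"do": 60, "re": 62, "mi": 64, "fa": 65, "so": 67, "la": 69, "ti": 71}

def parse_melody(text):
    """Turn a space-separated solfege string into MIDI note numbers, ignoring unknown tokens."""
    return [SOLFEGE[token] for token in text.lower().split() if token in SOLFEGE]

print(parse_melody("do re mi so mi re do"))   # -> [60, 62, 64, 67, 64, 62, 60]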
It was mentioned earlier that end users do not generally have much interest in creating music. Nevertheless, if easy-to-use support functions for creating music could be embedded in social networking tools (e.g., Orkut), music might also become a means of communication for end users.
One example of music systems that place particular emphasis on inter-user communication is CosTune [78]. In this system, pads that control different sounds are attached to a user's jacket or pants and touching these pads in a rhythmical manner enables the user to jam with other nearby users via a wireless network. Another example is Music Resonator [79], which enables a user to process and edit annotated fragments of musical pieces and to share that music with other users for collaborative music productions. In addition, RemoteGIG [80] enables remotely located users to jam together along a repetitive chord progression like 12-bar blues in real time over the Internet despite its relatively large latency. This system overcomes the network latency by having users listen to each other's performance delayed by just one cycle of the chord progression (several tens of seconds).
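The latency compensation rests on simple arithmetic: one cycle of the repeated chord progression must be longer than the network latency, as the following back-of-the-envelope sketch (with illustrative tempo and meter values) shows.

# Length of one chord-progression cycle, i.e., the mutual listening delay.
def cycle_delay_seconds(bars=12, beats_per_bar=4, tempo_bpm=120):
    """Duration of one cycle of the repeated progression (e.g., a 12-bar blues) in seconds."""
    return bars * beats_per_bar * 60.0 / tempo_bpm

print(cycle_delay_seconds())   # 24.0 s at 120 bpm; several tens of seconds at slower tempos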
There are other interesting themes not introduced in this paper that are being actively researched in the field of music information processing. While we have focused here on topics related to audio signal processing, research into symbol-level processing including musical scores and MIDI has also been progressing. For example, there has been much work on computing similarity between melodies at a symbol level [81] and research on musical structure and expression in musical performances [82]. The fusion of symbol processing and audio signal processing is still far from sufficient and will be an important issue in the future. Such fusion will bridge the gap between them and enable symbol processing to be based on proper symbol grounding and audio signal processing to cover abstract semantic computing, eventually achieving music computing that reflects the manifold meanings of music.
The research environment for music information processing is also expanding. The years 2000 and 2001 saw the construction of the world's first copyright-cleared music database, the ``RWC Music Database,'' that can be used in common for research purposes [83]. This database makes it easier to use music for comparing and evaluating various methods, for corpus-based machine learning, and for publishing research and making presentations without conventional copyright restrictions. Considering that shared databases of various kinds have long been constructed in other research fields and have made significant contributions to their advancement, we expect the RWC Music Database to contribute to the advancement of music information processing in a similar way.
Ten years ago, it was necessary for us to explain that music information processing was not ``amusement'' but a real research topic. Today, it is common sense to treat it as an important research field. This field is now experiencing the birth of large-scale projects one after another, an increase in international conferences year by year, and an ever increasing number of researchers. We look forward to further advances in music information processing research.
Masataka Goto received his Doctor of Engineering degree in Electronics, Information and Communication Engineering from Waseda University, Japan, in 1998. He then joined the Electrotechnical Laboratory (ETL; reorganized as the National Institute of Advanced Industrial Science and Technology (AIST) in 2001), where he has been engaged as a researcher ever since. He served concurrently as a researcher in Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Corporation (JST) from 2000 to 2003. His research interests include music information processing and spoken language processing. Dr. Goto received the IPSJ Yamashita SIG Research Awards (MUS and SLP) from the Information Processing Society of Japan (IPSJ), the Best Paper Award for Young Researchers from the Kansai-Section Joint Convention of Institutes of Electrical Engineering, the WISS 2000 Best Paper Award and Best Presentation Award, the Awaya Prize for Outstanding Presentation and the Award for Outstanding Poster Presentation from the Acoustical Society of Japan (ASJ), the Award for Best Presentation from the Japanese Society for Music Perception and Cognition (JSMPC), and the Interaction 2003 Best Paper Award. He is a member of the ASJ, IPSJ, JSMPC, the Institute of Electronics, Information and Communication Engineers (IEICE), and the International Speech Communication Association (ISCA).
Keiji Hirata received his Doctor of Engineering degree in Information Engineering from the University of Tokyo, Japan, in 1987. He then joined NTT Basic Research Laboratories. He spent 1990 to 1993 at the Institute for New Generation Computer Technology (ICOT), where he was engaged in the research and development of parallel inference machines. In 1999, he joined NTT Communication Science Laboratories, where he has been engaged as a researcher ever since. His research interests include musical knowledge programming and interaction. Dr. Hirata received the IPSJ Best Paper Award from the Information Processing Society of Japan (IPSJ) in 2001, and is a member of the IPSJ, the Japanese Society for Artificial Intelligence (JSAI), and the Japan Society for Software Science and Technology.