Demo page

Supplemental webpage for
Unsupervised Melody Style Conversion
by Eita Nakamura, Kentaro Shibata, Ryo Nishikimi, Kazuyoshi Yoshii
(Paper submitted to ICASSP 2019)

1. Data

We use three sets of data belonging to different music categories.

(Western) classical music category
The dataset consists of soprano melodies composed by Wolfgang Amadeus Mozart (7133 bars).
J-pop category
The dataset consists of vocal melodies composed by the band Mr. Children (3878 bars).
Enka music category
Enka is a genre of Japanese popular music. The dataset consists of vocal melodies by various artists (37032 bars).

2. Unsupervised learning of music styles

We train a transposition-symmetric torus Markov mixture model (TSTMMixM; see the paper) on the dataset of each music category. Each component model is expected to represent one music style within that category. Some examples of the learned styles are visualized below.
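To make the role of the mixture components concrete, here is a minimal sketch of how a mixture of Markov chains can assign a melody to its style components via Bayes' rule. It is not the TSTMMixM itself: it assumes each component is a plain first-order Markov chain over integer-coded states (omitting the torus state space and the transposition symmetry), it does not show training, and the names `sequence_log_likelihood` and `component_posterior` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation: (initial probabilities, transition matrix) over states 0..S-1.
# All probabilities are assumed strictly positive so that math.log is defined.
MarkovChain = Tuple[List[float], List[List[float]]]


def sequence_log_likelihood(seq: Sequence[int], chain: MarkovChain) -> float:
    """Log-probability of a state sequence under one first-order Markov chain."""
    init, trans = chain
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp


def component_posterior(seq: Sequence[int], chains: List[MarkovChain],
                        weights: List[float]) -> List[float]:
    """P(component k | seq) for a mixture of Markov chains, computed by Bayes' rule."""
    log_joint = [math.log(w) + sequence_log_likelihood(seq, c)
                 for w, c in zip(weights, chains)]
    m = max(log_joint)  # log-sum-exp trick for numerical stability
    total = sum(math.exp(x - m) for x in log_joint)
    return [math.exp(x - m) / total for x in log_joint]


# Usage: a "sticky" toy style vs. an "alternating" toy style over two states.
sticky = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])
alternating = ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]])
posterior = component_posterior([0, 0, 0, 0], [sticky, alternating], [0.5, 0.5])
print(posterior)  # the sticky component receives almost all of the probability mass
```

In a full model the component parameters would be learned from data; here they are fixed by hand only to show how a melody is scored against each style.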

2.1. Styles in pitch organization

Pitch-class transition probabilities obtained by integrating out the rhythmic variables. Transition (bigram) probabilities are represented by bands.
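Integrating out the rhythmic variables can be illustrated as follows, assuming a joint first-order transition table over (pitch class, metrical position) states and a weighting distribution over those states (for example, empirical state frequencies or the chain's stationary distribution). The page does not say how the states are weighted, so that weighting is an assumption, and the function name `marginal_pitch_transitions` is hypothetical. The computation is P(p2 | p) = Σ_b Σ_b2 π(p, b) T((p, b) → (p2, b2)) / Σ_b π(p, b).

```python
from typing import List

# trans[p][b][p2][b2] = P(next state (p2, b2) | current state (p, b)),
# where p is a pitch class and b a metrical position.
# state_prob[p][b] = weight of state (p, b) (assumption: empirical or stationary probabilities).


def marginal_pitch_transitions(trans: List[List[List[List[float]]]],
                               state_prob: List[List[float]]) -> List[List[float]]:
    """Pitch-class bigram P(p2 | p) with metrical positions b and b2 summed out."""
    n_pc, n_pos = len(trans), len(trans[0])
    result = [[0.0] * n_pc for _ in range(n_pc)]
    for p in range(n_pc):
        norm = sum(state_prob[p])  # total weight of pitch class p over all positions
        if norm == 0:
            continue  # pitch class never visited: leave its row at zero
        for p2 in range(n_pc):
            mass = sum(state_prob[p][b] * trans[p][b][p2][b2]
                       for b in range(n_pos) for b2 in range(n_pos))
            result[p][p2] = mass / norm
    return result


# Usage: 12 pitch classes, 4 positions, uniform joint transitions and uniform state weights.
n_pc, n_pos = 12, 4
uniform = [[[[1.0 / (n_pc * n_pos)] * n_pos for _ in range(n_pc)]
            for _ in range(n_pos)] for _ in range(n_pc)]
weights = [[1.0 / (n_pc * n_pos)] * n_pos for _ in range(n_pc)]
bigram = marginal_pitch_transitions(uniform, weights)
print(bigram[0][7])  # each entry should be close to 1/12
```

The same summation with the roles of pitch class and position swapped yields the metrical transition probabilities shown in Section 2.2.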

Classical music style
(Major diatonic scale)

Classical music style
(Minor diatonic scale)

J-pop style 1
(Major diatonic scale)

J-pop style 2
(Minor diatonic scale)

Enka music style 1
(Major pentatonic scale)

Enka music style 2
(Minor pentatonic scale)

It is worth emphasizing that these musical scales were inferred in an unsupervised manner, without any annotation of the tonic or mode!

2.2. Styles in rhythmic organization

Metrical transition probabilities obtained by integrating out the pitch-class variables. Transition (bigram) probabilities are represented by bands. As on a clock face, each metrical (beat) position within a bar (4/4 time) is represented on a circle; 0 o'clock, 3 o'clock, 6 o'clock, and 9 o'clock indicate the quarter-note beats.
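The clock-face layout can be sketched as a simple mapping from metrical position to angle. This assumes a 4/4 bar divided into 16 sixteenth-note positions (an assumption; the page does not state the resolution), so that positions 0, 4, 8, and 12 (the quarter-note beats) land at 0, 3, 6, and 9 o'clock. The names `position_to_clock` and `position_to_xy` are hypothetical.

```python
import math
from typing import Tuple


def position_to_clock(position: int, positions_per_bar: int = 16) -> float:
    """Map a metrical position in a 4/4 bar to an 'o'clock' reading on a 12-hour dial."""
    if not 0 <= position < positions_per_bar:
        raise ValueError("position must lie in [0, positions_per_bar)")
    return 12.0 * position / positions_per_bar


def position_to_xy(position: int, positions_per_bar: int = 16,
                   radius: float = 1.0) -> Tuple[float, float]:
    """Point on the circle with 0 o'clock at the top and positions running clockwise."""
    angle = 2.0 * math.pi * position / positions_per_bar
    return (radius * math.sin(angle), radius * math.cos(angle))


# Usage: the quarter-note beats of a 16-position bar.
print([position_to_clock(k) for k in (0, 4, 8, 12)])  # [0.0, 3.0, 6.0, 9.0]
```

Using sine for x and cosine for y puts position 0 at the top and makes positions advance clockwise, matching the clock analogy.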

Classical music style

J-pop style 1
(8th-note rhythm)

J-pop style 2
(16th-note rhythm)

Enka music style 1
(Mixed 8th-note rhythm)

Enka music style 2
(Dotted 8th-note rhythm)

3. Sound examples

Each original melody (8 bars long) is converted to the target style. For comparison, we tested the following three methods (see the paper for details):

TMM+SEM: Torus Markov model + simple edit model
TSTMMixM+SEM: TSTMMixM + simple edit model
TSTMMixM+REM: TSTMMixM + refined edit model

3.1. Classical to J-pop / Enka

Original melody (Classical music style)

Target: J-pop (TMM+SEM)

Target: J-pop (TSTMMixM+SEM)

Target: J-pop (TSTMMixM+REM)

Target: Enka (TMM+SEM)

Target: Enka (TSTMMixM+SEM)

Target: Enka (TSTMMixM+REM)

3.2. Classical to J-pop / Enka

Original melody (Classical music style)

Target: J-pop (TMM+SEM)

Target: J-pop (TSTMMixM+SEM)

Target: J-pop (TSTMMixM+REM)

Target: Enka (TMM+SEM)

Target: Enka (TSTMMixM+SEM)

Target: Enka (TSTMMixM+REM)

3.3. J-pop to Classical / Enka

Original melody (J-pop style)

Target: Classical (TMM+SEM)

Target: Classical (TSTMMixM+SEM)

Target: Classical (TSTMMixM+REM)

Target: Enka (TMM+SEM)

Target: Enka (TSTMMixM+SEM)

Target: Enka (TSTMMixM+REM)

3.4. J-pop to Classical / Enka

Original melody (J-pop style)

Target: Classical (TMM+SEM)

Target: Classical (TSTMMixM+SEM)

Target: Classical (TSTMMixM+REM)

Target: Enka (TMM+SEM)

Target: Enka (TSTMMixM+SEM)

Target: Enka (TSTMMixM+REM)

3.5. Enka to Classical / J-pop

Original melody (Enka music style)

Target: Classical (TMM+SEM)

Target: Classical (TSTMMixM+SEM)

Target: Classical (TSTMMixM+REM)

Target: J-pop (TMM+SEM)

Target: J-pop (TSTMMixM+SEM)

Target: J-pop (TSTMMixM+REM)

3.6. Enka to Classical / J-pop

Original melody (Enka music style)

Target: Classical (TMM+SEM)

Target: Classical (TSTMMixM+SEM)

Target: Classical (TSTMMixM+REM)

Target: J-pop (TMM+SEM)

Target: J-pop (TSTMMixM+SEM)

Target: J-pop (TSTMMixM+REM)

4. Subjective evaluation

4.1. Setup

We conducted a subjective evaluation test to measure the quality of melody style conversion. Here is the page used for the listening evaluation.

Ten evaluators who listen to music for more than one hour a day participated in the experiment. They listened to the arranged melodies above and rated each arrangement on the following scales:

Style match: Does the arranged melody match the target style? (1: very poorly, ..., 6: very well)
Similarity: Do you feel the original melody in the arrangement? (1: very poorly, ..., 6: very well)
Naturalness: Is the melody natural? (1: very unnatural, ..., 6: very natural)
Attractiveness: Is the melody attractive? (1: not attractive at all, ..., 6: very attractive)

4.2. Results

The results demonstrate the effectiveness of the TSTMMixM and the refined edit model: the mean scores of all the metrics improve as the method is refined. In particular, the style match score improved by 0.5 (p-value < 10⁻⁵, t-test) with the refined language model (M1: TMM+SEM vs M2: TSTMMixM+SEM), and the similarity score improved by 0.28 (p-value = 3.3 × 10⁻³, t-test) with the refined edit model (M2: TSTMMixM+SEM vs M3: TSTMMixM+REM). The improvements in the naturalness and attractiveness scores are also statistically significant.