TY - JOUR
T1 - Controlling the strength of emotions in speech-like emotional sound generated by WaveNet
AU - Matsumoto, Kento
AU - Hara, Sunao
AU - Abe, Masanobu
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - This paper proposes a method to enhance the controllability of a Speech-like Emotional Sound (SES). In our previous study, we proposed an algorithm that generates SES by employing WaveNet as a sound generator and confirmed that SES can successfully convey emotional information. The algorithm generates SES from emotional IDs alone, so the output carries no linguistic information. We call the generated sounds “speech-like” because they sound as if they were uttered by human beings despite containing no linguistic information. By making the best use of WaveNet, we could synthesize natural-sounding acoustic signals that are clearly different from vocoder sounds. To flexibly control the strength of emotions, this paper proposes using voiced, unvoiced, and silence (VUS) states as auxiliary features. Three types of emotional speech, namely neutral, angry, and happy, were generated and subjectively evaluated. Experimental results reveal the following: (1) VUS can control the strength of SES by changing the durations of VUS states, (2) VUS with a narrow F0 distribution can express stronger emotions than VUS with a wide F0 distribution, and (3) the smaller the unvoiced percentage, the stronger the emotional impression.
AB - This paper proposes a method to enhance the controllability of a Speech-like Emotional Sound (SES). In our previous study, we proposed an algorithm that generates SES by employing WaveNet as a sound generator and confirmed that SES can successfully convey emotional information. The algorithm generates SES from emotional IDs alone, so the output carries no linguistic information. We call the generated sounds “speech-like” because they sound as if they were uttered by human beings despite containing no linguistic information. By making the best use of WaveNet, we could synthesize natural-sounding acoustic signals that are clearly different from vocoder sounds. To flexibly control the strength of emotions, this paper proposes using voiced, unvoiced, and silence (VUS) states as auxiliary features. Three types of emotional speech, namely neutral, angry, and happy, were generated and subjectively evaluated. Experimental results reveal the following: (1) VUS can control the strength of SES by changing the durations of VUS states, (2) VUS with a narrow F0 distribution can express stronger emotions than VUS with a wide F0 distribution, and (3) the smaller the unvoiced percentage, the stronger the emotional impression.
KW - Emotional speech
KW - Speech synthesis
KW - WaveNet
UR - http://www.scopus.com/inward/record.url?scp=85098129233&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098129233&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2064
DO - 10.21437/Interspeech.2020-2064
M3 - Conference article
AN - SCOPUS:85098129233
SN - 2308-457X
VL - 2020-October
SP - 3421
EP - 3425
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -