

INTRODUCTION

Recent advances in multimedia interfaces between humans and machines have greatly increased interest in realizing and recognizing emotions conveyed by speech. Consequently, a rather large number of analyses have been conducted on emotional speech, mainly from the viewpoint of prosodic features. One line of investigation examined how the prosodic features of emotional speech change depending on emotion levels. The analysis results on fundamental frequency (F0) contours and speech rates implied that humans have several ways to express emotions and use them rather randomly. Another line of investigation addressed which acoustic features are important for expressing emotions. Perceptual experiments using synthetic speech with copied acoustic features of target speech indicated the importance of segmental features beyond the prosodic ones, especially in the case of happiness.

Many TTS systems implement prosody control, but such systems have fundamentally been designed to output speech with a standard pitch and speech rate. This research aims to construct a high-quality Japanese TTS (Text-to-Speech) system with high flexibility in treating prosody. In this study, we employ a unit selection-concatenation method and also introduce an analysis-synthesis process to provide precisely controlled prosody in the output speech.

In such a unit selection system, a target cost for prosody is set to evaluate the prosodic difference between the target prosody and candidate speech segments, because speech quality degrades in proportion to the amount of prosody modification. However, the conventional cost ignores the original prosody of the speech segments, although the tendency of quality deterioration presumably varies with the pitch and speech rate of the original speech. In this paper, we therefore propose a novel cost function design based on the prosody of speech segments. First, we recorded nine databases of Japanese speech with different prosodic characteristics. Then, for these databases, we investigated the relationship between the amount of prosody modification and the resulting perceptual degradation. The results indicate that the tendency of perceptual degradation differs according to the prosodic features of the original speech. On the basis of these results, we propose a new cost function design that changes the cost function according to the prosody of the speech database. Results of preference tests on synthetic speech show that the proposed cost functions generate speech of higher quality than the conventional method.
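The target-cost idea above can be sketched in code. This is a minimal illustration, not the paper's actual formulation: the `Unit` structure, the log-scale distances, the weights, and the `severity` term that makes the cost depend on a segment's original prosody are all assumptions introduced here for exposition.

```python
import math
from dataclasses import dataclass

@dataclass
class Unit:
    """A candidate speech segment annotated with its original prosody.
    (Hypothetical structure; a real system stores richer features.)"""
    f0: float        # mean fundamental frequency in Hz
    duration: float  # segment duration in seconds

def conventional_target_cost(cand, target, w_f0=1.0, w_dur=1.0):
    """Conventional prosody target cost: weighted log-scale distance between
    candidate and target prosody, ignoring the candidate's own prosody."""
    d_f0 = abs(math.log(cand.f0) - math.log(target.f0))
    d_dur = abs(math.log(cand.duration) - math.log(target.duration))
    return w_f0 * d_f0 + w_dur * d_dur

def prosody_dependent_target_cost(cand, target, db_mean_f0=150.0):
    """Illustrative prosody-dependent variant: modifying a segment whose
    original F0 lies far from the database mean is penalized more heavily,
    reflecting the finding that perceptual degradation depends on the
    original prosody. The severity term is invented for this sketch."""
    severity = 1.0 + abs(math.log(cand.f0 / db_mean_f0))
    return severity * conventional_target_cost(cand, target)

# Pick the cheapest candidate for a given prosody target.
candidates = [Unit(120.0, 0.10), Unit(180.0, 0.09), Unit(150.0, 0.12)]
target = Unit(150.0, 0.10)
best = min(candidates, key=lambda u: prosody_dependent_target_cost(u, target))
```

In a full unit selection system this per-segment target cost would be combined with a concatenation cost and minimized over the whole utterance (e.g., by dynamic programming); the sketch only shows how the penalty can be made sensitive to a candidate's original prosody.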
