Multi-granularity Semantic and Acoustic Stress Prediction for Expressive TTS

Wenjiang Chi, Xiaoqin Feng, Liumeng Xue, Yunlin Chen, Lei Xie, Zhifei Li
Shanghai Mobvoi Information Technology Co., Ltd, China
Audio, Speech and Language Processing Group (ASLP@NPU)

0. Contents

Abstract
Demos -- Comparison with different methods
Demos -- Contribution of coarse-grained information
Demos -- Stress demo on paragraph TTS
Conclusion

1. Abstract

Stress, as the perceptual prominence within sentences, plays a key role in expressive text-to-speech (TTS). It can be either the semantic focus in text or the acoustic prominence in speech. However, stress labels are always annotated by listening to the speech, lacking semantic information in the corresponding text, which may degrade the accuracy of stress prediction and the expressivity of TTS. This paper proposes a multi-granularity stress prediction method for expressive TTS. Specifically, we first build Chinese Mandarin datasets with both coarse-grained semantic stress and fine-grained acoustic stress. Then, the proposed model progressively predicts semantic stress and acoustic stress. Finally, a TTS model is adopted to synthesize speech with the predicted stress. Experimental results on the proposed model and synthesized speech show that our proposed model achieves good accuracy in stress prediction and improves the expressiveness and naturalness of the synthesized speech.

1.1 Data Construction

In consideration of the nature of multi-grained stress, we construct the dataset with coarse-grained semantic stress and fine-grained acoustic stress to incorporate both the semantic knowledge in the text and the acoustic information in the speech. Compared with standard stress corpus collection, our approach is more flexible and better suited to real-world scenarios; it also enhances the reliability and diversity of stress labels.
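As a rough illustration of what a sample with both label granularities might look like, the sketch below stores coarse-grained semantic stress as character spans and fine-grained acoustic stress as one flag per character. The class name, field names, and this particular encoding are assumptions made for illustration; they are not the dataset format used in the paper, and the labels shown are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StressSample:
    """Hypothetical record holding both stress granularities for one sentence."""
    text: str
    # Coarse-grained semantic stress: half-open character spans [start, end).
    semantic_spans: List[Tuple[int, int]]
    # Fine-grained acoustic stress: one 0/1 flag per character of `text`.
    acoustic_flags: List[int]

    def __post_init__(self) -> None:
        if len(self.acoustic_flags) != len(self.text):
            raise ValueError("acoustic_flags needs one entry per character")
        for start, end in self.semantic_spans:
            if not 0 <= start < end <= len(self.text):
                raise ValueError(f"span ({start}, {end}) is out of range")

    def semantic_phrases(self) -> List[str]:
        """Return the text covered by each coarse-grained span."""
        return [self.text[s:e] for s, e in self.semantic_spans]

    def acoustic_chars(self) -> List[str]:
        """Return the characters flagged as acoustically stressed."""
        return [c for c, flag in zip(self.text, self.acoustic_flags) if flag == 1]


# Illustrative labels only; they are not taken from the real corpus.
sample = StressSample(
    text="这衣服是好，就是贵了点。",
    semantic_spans=[(6, 11)],
    acoustic_flags=[0] * 8 + [1] + [0] * 3,
)
print(sample.semantic_phrases())
print(sample.acoustic_chars())
```

Keeping the two label sets in one record makes it easy to check how often the acoustically stressed units fall inside a semantically stressed phrase.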

1.2 System View

The overall architecture of our proposed multi-granularity stress prediction model is presented in the figure. The training process of the model is divided into two stages: stage 1 predicts coarse-grained semantic stress and stage 2 predicts fine-grained acoustic stress. The models of the two stages have similar structures but are trained on the datasets with semantic stress and acoustic stress, respectively. Compared with a model trained on the dataset with acoustic stress only, our model incorporates stress from both semantic and acoustic information, improving the diversity of stress and also benefiting expressive speech synthesis.
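The two-stage flow described above can be sketched as follows. This is a minimal sketch, not the paper's model: the function names are hypothetical, both "models" are toy stand-ins returning hand-set probabilities, and the way stage 2 consumes the stage 1 output (a fixed weighted average) is an assumption chosen only to show the data flow.

```python
from typing import Callable, List

CoarseModel = Callable[[List[str]], List[float]]
FineModel = Callable[[List[str], List[float]], List[float]]


def predict_stress(tokens: List[str], coarse_model: CoarseModel,
                   fine_model: FineModel, threshold: float = 0.5) -> List[int]:
    """Run stage 1 (coarse, semantic), then stage 2 (fine, acoustic)."""
    coarse_probs = coarse_model(tokens)            # stage 1 output
    fine_probs = fine_model(tokens, coarse_probs)  # stage 2 sees stage 1 output
    return [1 if p >= threshold else 0 for p in fine_probs]


def toy_coarse_model(tokens: List[str]) -> List[float]:
    # Stand-in for stage 1: a tiny hand-written "semantic focus" lexicon.
    focus_words = {"好", "贵"}
    return [0.9 if t in focus_words else 0.1 for t in tokens]


def toy_fine_model(tokens: List[str], coarse_probs: List[float]) -> List[float]:
    # Stand-in for stage 2: its own score mixed with the coarse-grained prior.
    own = [0.8 if t == "贵" else 0.2 for t in tokens]
    return [0.7 * o + 0.3 * c for o, c in zip(own, coarse_probs)]


tokens = ["这", "衣服", "是", "好", "就是", "贵", "了", "点"]
stress = predict_stress(tokens, toy_coarse_model, toy_fine_model)
print(list(zip(tokens, stress)))
```

In a real system each stand-in would be replaced by a trained network; the predicted per-token flags would then be passed to the TTS model as an additional input.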

2. Demos -- Comparison with Different Methods

Corresponding to section 4.2 in the paper, several samples synthesized by the proposed FSM and other compared methods on the stress task are listed below.

number	Original	FSM (Proposed)	CGM
demo1	大哥出国家里没钱，你宁可卖房子也要支持，苏明成结婚，你为这两个儿子砸锅卖铁也心甘情愿，那我呢?你们为我呢? When our eldest brother went abroad and the family had no money, you would rather sell the house to support him. When Su Mingcheng got married, you were willing to do anything for these two sons. What about me? What will you do for me?	大哥出国家里没钱，你宁可卖房子也要支持。苏明成结婚，你为这两个儿子砸锅卖铁也心甘情愿,那我呢?你们为我呢?	大哥出国家里没钱，你宁可卖房子也要支持。苏明成结婚，你为这两个儿子砸锅卖铁也心甘情愿,那我呢?你们为我呢?
demo2	广告魔音魔声京东影音娱乐，轻薄之躯，澎湃能量，告别繁杂走线，让您的心情“无线愉快”。 Advertisement - enchanting sounds and voices, JD Video & Entertainment, slim body, powerful energy, saying goodbye to complicated wiring, making your mood 'wirelessly happy'	广告魔音魔声京东影音娱乐,轻薄之躯，澎湃能量，告别繁杂走线，让您的心情“无线愉快”	广告魔音魔声京东影音娱乐,轻薄之躯，澎湃能量，告别繁杂走线，让您的心情“无线愉快”
demo3	强势的母亲暴跳如雷,我们给你吃给你穿养你这么大，我们有罪了是不是?你要是有能耐，你就别用我们的钱哪！ The overbearing mother is raging and saying: 'We have fed you, clothed you, and raised you to this age. Are we guilty? If you can, don't use our money!'	强势的母亲暴跳如雷,我们给你吃给你穿养你这么大,我们有罪了是不是?你要是有能耐,你就别用我们的钱哪!	强势的母亲暴跳如雷,我们给你吃给你穿养你这么大,我们有罪了是不是?你要是有能耐,你就别用我们的钱哪!
demo4	有本事你再说一遍 If you can, say it again.	有本事你再说一遍	有本事你再说一遍
demo5	拥有一颗感恩的心我们就会拥有欢乐、拥有幸福，拥有一个完美的未来！ By having a grateful heart, we will have joy, happiness, and a perfect future!	拥有一颗感恩的心，我们就会拥有欢乐、拥有幸福，拥有一个完美的未来!	拥有一颗感恩的心，我们就会拥有欢乐、拥有幸福，拥有一个完美的未来!
demo6	所爱隔山海，山海皆可平 Love can conquer any distance, be it across mountains or oceans.	所爱隔山海，山海皆可平	所爱隔山海，山海皆可平

Short summary: CGM performs poorly, producing unnatural and abnormal speech. This can be attributed to the fact that CGM focuses solely on semantic stress annotated from the text, without considering the acoustic features of the speech. Additionally, the final stress should be at the fine-grained level, which is why CGM_random outperforms CGM. The proposed FSM improves the naturalness and expressiveness of TTS by leveraging both semantic and acoustic information.

3. Demos -- Contribution of Coarse-Grained Information

Corresponding to section 4.3 in the paper, samples synthesized by FSM and by the FSM variant without coarse-grained supervised information (w/o L_cgce) on the stress task are listed below.

number	FSM (w/o L_cgce)	FSM (Proposed)
demo1	这衣服是好，就是贵了点。 This clothing is good, but it's just a bit expensive.	这衣服是好，就是贵了点。
demo2	远征星辰大海，归来仍是少年 Embarking on a journey across the vast expanse of stars, returning still a youth at heart.	远征星辰大海，归来仍是少年
demo3	虽然代价巨大，可依然有很多人。为了这一时的欢愉，而献出灵魂。 Despite the enormous cost, there are still many people who in pursuit of a momentary pleasure, would sacrifice their souls.	虽然代价巨大，可依然有很多人。为了这一时的欢愉，而献出灵魂。
demo4	陈娇的父母为何如此的憎恨女儿,竟然对他们下此狠手? Why do Chen Jiao's parents hate their daughter so much that they resorted to such a cruel act against her?	陈娇的父母为何如此的憎恨女儿,竟然对他们下此狠手?
demo5	你为什么会出现在这里！ Why did you appear here?	你为什么会出现在这里！

Corresponding to section 4.3 in the paper, samples synthesized by FSM and by the FSM variant without coarse-grained supervised information and without CGM (w/o L_cgce & w/o CGM) on the stress task are listed below.

number	FSM (w/o L_cgce & w/o CGM)	FSM (Proposed)
demo1	听了孩子如此忤逆的话，强势的母亲暴跳如雷。 Upon hearing her child's disobedient words, the authoritarian mother flew into a rage.	听了孩子如此忤逆的话，强势的母亲暴跳如雷。
demo2	母亲轻飘飘的说了句，这是大人的事儿，根本懒得搭理她。 The mother casually said, 'This is a matter for adults,' and didn't bother to pay attention to her.	母亲轻飘飘的说了句，这是大人的事儿，根本懒得搭理她。
demo3	能折磨你的，从来不是别人的绝情，而是你心中的幻想和期待。 What can torment you is never the heartlessness of others, but the illusions and expectations in your own mind.	能折磨你的，从来不是别人的绝情，而是你心中的幻想和期待。
demo4	一个没踩稳飞了出去。 One didn't step firmly and flew out.	一个没踩稳飞了出去。
demo5	成为了他一生中最正确的决定。 It became the most correct decision of his life.	成为了他一生中最正确的决定。
demo6	我警告你，你要是敢胡言乱语，别怪我对你动粗 I'm warning you, if you dare to talk nonsense, don't blame me for getting physical with you.	我警告你，你要是敢胡言乱语，别怪我对你动粗

Short summary: As expected, introducing coarse-grained information improves the ability of our fine-grained stress model. Even when abundant semantic information is already available, the CGM still contributes notable stress-related information about the sentence, which helps the FSM detect fine-grained stress keywords. Overall, the results demonstrate that incorporating coarse-grained information supervision is an effective way to prevent stress weight dispersion during training.

4. Demos -- Stress Demo on Paragraph TTS

The following samples are synthesized with stress prediction for long paragraphs.

number	Original	FSM (Proposed)
demo1	Text: 女人刚离开姐妹家，就看见加班的丈夫出现在姐妹楼下。她懵逼了？昨天还好心劝导姐妹，不要像亲妹妹一样做傻事。无论靠偷，靠抢，也要当上正宫，可居然抢的是自己的丈夫。为了证实猜测，她来到姐妹门前。事实证明，巧合太多就不是巧合。她强装镇定打开大门，摸了进来，地上果然摆着丈夫的鞋，屋里到处都是狗男女的嬉笑声，桌上还剩着没吃完的桃，但她的目标却在刀上，她起了杀心。她无法接受这个现实，哪怕是幻想，突然一个电话提醒了她，是她的母亲。她回家后就拿着姐妹的手链问丈夫，果然是老戏骨，干啥啥不行，演戏第一名。虽然看似天衣无缝，但女人心里已种下怀疑的种子。妻子的这种行为也引起了丈夫的警惕，立马打电话问小三，手链还在吗？小三心虚了，但为了不被对象发现她接近原配的事，她告诉对象，自己一直戴着。不料第二天一大早，原配就找上了门，不是姐妹叙旧，而是找丈夫的踪迹。	Text: 女人刚离开姐妹家，就看见加班的丈夫出现在姐妹楼下。她懵逼了。昨天还好心劝导姐妹，不要像亲妹妹一样做傻事。无论靠偷，靠抢，也要当上正宫，可居然抢的是自己的丈夫。为了证实猜测，她来到姐妹门前。事实证明，巧合太多就不是巧合。她强装镇定打开大门，摸了进来。地上果然摆着丈夫的鞋。屋里到处都是狗男女的嬉笑声。桌上还剩着没吃完的桃，但她的目标却在刀上，她起了杀心。她无法接受这个现实。哪怕是幻想。突然一个电话提醒了她，是她的母亲。她回家后就拿着姐妹的手链问丈夫。果然是老戏骨，干啥啥不行，演戏第一名。虽然看似天衣无缝，但女人心里已种下怀疑的种子。妻子的这种行为也引起了丈夫的警惕，立马打电话问小三，手链还在吗。小三心虚了。但为了不被对象发现她接近原配的事，她告诉对象，自己一直戴着。不料第二天一大早。原配就找上了门。不是姐妹叙旧，而是找丈夫的踪迹。
demo2	北国风光，千里冰封，万里雪飘。望长城内外，惟余莽莽；大河上下，顿失滔滔。山舞银蛇，原驰蜡象，欲与天公试比高。须晴日，看红装素裹，分外妖娆。江山如此多娇，引无数英雄竞折腰。惜秦皇汉武，略输文采；唐宗宋祖，稍逊风骚。一代天骄，成吉思汗，只识弯弓射大雕。俱往矣，数风流人物，还看今朝。	北国风光，千里冰封，万里雪飘。望长城内外，惟余莽莽；大河上下，顿失滔滔。山舞银蛇，原驰蜡象，欲与天公试比高。须晴日，看红装素裹，分外妖娆。江山如此多娇，引无数英雄竞折腰。惜秦皇汉武，略输文采；唐宗宋祖，稍逊风骚。一代天骄，成吉思汗，只识弯弓射大雕。俱往矣，数风流人物，还看今朝。

Short summary: With predicted stress integrated, the overall expressiveness of the paragraph is enhanced. Stress also improves the rendering of the poem. (Note: the TTS model uses a non-poetry corpus.)

5. Conclusion

In this work, we propose a multi-granularity stress prediction model to improve the naturalness and expressiveness of TTS. In consideration of the nature of multi-grained stress, we construct the dataset with coarse-grained semantic stress and fine-grained acoustic stress to incorporate both the semantic knowledge in the text and the acoustic information in the speech. Then, the two-stage stress prediction model progressively predicts the coarse-grained stress and fine-grained stress from text, where a pre-trained language model is adopted to extract contextualized word representations from the text. Objective and subjective experimental results show that our proposed stress prediction model achieves a higher F1 score and produces more natural and expressive synthetic speech. In the future, we will further improve the multi-stage stress prediction model to close the performance gap between human speech and synthetic speech.
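For readers unfamiliar with the F1 score mentioned above, the sketch below computes it for binary stress labels. It assumes one 0/1 label per unit and scores only the stressed class; the paper's exact evaluation setup (unit, averaging) may differ, and the label sequences here are made up for illustration.

```python
from typing import List


def stress_f1(gold: List[int], pred: List[int]) -> float:
    """F1 of the stressed class (label 1) for two aligned 0/1 sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correct stressed prediction, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: one hit, one false alarm, one miss.
gold = [0, 1, 0, 0, 1, 0]
pred = [0, 1, 1, 0, 0, 0]
print(stress_f1(gold, pred))
```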

The End, Thank You!