Multi-granularity Semantic and Acoustic Stress Prediction for Expressive TTS

Wenjiang Chi, Xiaoqin Feng, Liumeng Xue, Yunlin Chen, Lei Xie, Zhifei Li
Shanghai Mobvoi Information Technology Co., Ltd, China.
Audio, Speech and Language Processing Group (ASLP@NPU).

0. Contents

  1. Abstract
  2. Demos -- Comparison with Different Methods
  3. Demos -- Contribution of Coarse-Grained Information
  4. Demos -- Stress Demo on Paragraph TTS
  5. Conclusion

1. Abstract

Stress, as the perceptual prominence within sentences, plays a key role in expressive text-to-speech (TTS). It can be either the semantic focus in text or the acoustic prominence in speech. However, stress labels are always annotated by listening to the speech, lacking semantic information in the corresponding text, which may degrade the accuracy of stress prediction and the expressivity of TTS. This paper proposes a multi-granularity stress prediction method for expressive TTS. Specifically, we first build Chinese Mandarin datasets with both coarse-grained semantic stress and fine-grained acoustic stress. Then, the proposed model progressively predicts semantic stress and acoustic stress. Finally, a TTS model is adopted to synthesize speech with the predicted stress. Experimental results on the proposed model and synthesized speech show that our proposed model achieves good accuracy in stress prediction and improves the expressiveness and naturalness of the synthesized speech.

1.1 Data Construction

Because stress is multi-grained by nature, we construct the dataset with both coarse-grained semantic stress and fine-grained acoustic stress, incorporating the semantic knowledge in the text as well as the acoustic information in the speech. Compared with standard stress corpus collection, our approach is more flexible and better suited to real-world scenarios. It also enhances the reliability and diversity of the stress labels.
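
As a rough illustration of what one entry in such a dataset might look like, the sketch below stores both label granularities for a single utterance. The schema is an assumption made for illustration: the class name `StressExample`, the choice of character-level labels, and the span encoding are hypothetical and not taken from the paper, and the labels shown are not real annotations.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StressExample:
    """One utterance with two granularities of stress labels (hypothetical schema)."""

    chars: List[str]                     # text split into characters
    coarse_spans: List[Tuple[int, int]]  # [start, end) spans of semantic stress, from the text
    fine_labels: List[int]               # 1 = acoustically stressed character, from the speech

    def coarse_labels(self) -> List[int]:
        """Expand the coarse spans into one 0/1 label per character."""
        labels = [0] * len(self.chars)
        for start, end in self.coarse_spans:
            for i in range(start, end):
                labels[i] = 1
        return labels


# Illustrative labels only, not actual annotations from the dataset.
example = StressExample(
    chars=list("有本事你再说一遍"),
    coarse_spans=[(4, 8)],                  # the phrase "再说一遍"
    fine_labels=[0, 0, 0, 0, 1, 0, 0, 0],   # the single character "再"
)
print(example.coarse_labels())
```

Keeping both label sets aligned to the same character sequence makes it straightforward to compare the coarse semantic focus with the finer acoustic prominence inside it.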

1.2 System View

The overall architecture of our proposed multi-granularity stress prediction model is presented in the figure. Training is divided into two stages: stage 1 predicts coarse-grained semantic stress, and stage 2 predicts fine-grained acoustic stress. The models of the two stages have similar structures, but they are trained on the datasets with semantic stress and acoustic stress, respectively. Compared with a model trained on acoustic stress only, our model incorporates stress from both semantic and acoustic information, which improves the diversity of stress and also benefits expressive speech synthesis.
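
The two-stage data flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `predict_two_stage`, the character-level labels, the 0.5 threshold, and in particular the way stage 2 receives the stage-1 labels as an extra input are all hypothetical. The two `toy_*` scorers are hand-written stand-ins for trained models.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (characters, hint labels) to one score in [0, 1] per character.
Scorer = Callable[[Sequence[str], Sequence[int]], List[float]]


def predict_two_stage(
    chars: Sequence[str],
    coarse_scorer: Scorer,
    fine_scorer: Scorer,
    threshold: float = 0.5,
) -> Tuple[List[int], List[int]]:
    """Run stage 1 (coarse), then stage 2 (fine) with the coarse labels as a hint."""
    no_hint = [0] * len(chars)
    coarse = [int(s >= threshold) for s in coarse_scorer(chars, no_hint)]
    fine = [int(s >= threshold) for s in fine_scorer(chars, coarse)]
    return coarse, fine


def toy_coarse_scorer(chars: Sequence[str], _hint: Sequence[int]) -> List[float]:
    """Stand-in for the stage-1 model: flags one hand-picked phrase."""
    phrase = "再说一遍"
    start = "".join(chars).find(phrase)
    return [
        0.9 if start != -1 and start <= i < start + len(phrase) else 0.1
        for i in range(len(chars))
    ]


def toy_fine_scorer(chars: Sequence[str], coarse: Sequence[int]) -> List[float]:
    """Stand-in for the stage-2 model: keeps the first character of each coarse run."""
    return [
        0.8 if c == 1 and (i == 0 or coarse[i - 1] == 0) else 0.1
        for i, c in enumerate(coarse)
    ]


coarse, fine = predict_two_stage(
    list("有本事你再说一遍"), toy_coarse_scorer, toy_fine_scorer
)
print(coarse)
print(fine)
```

In a real system each scorer would be a trained network, and the final fine-grained labels would be passed to the TTS model.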

2. Demos -- Comparison with Different Methods

Corresponding to section 4.2 in the paper, several samples synthesized by the proposed FSM and other compared methods on the stress task are listed below.

demo1
  Original: 大哥出国家里没钱,你宁可卖房子也要支持,苏明成结婚,你为这两个儿子砸锅卖铁也心甘情愿,那我呢?你们为我呢?
  Translation: When brother went abroad but we have no money, you would rather sell the house to support him. When Su Mingcheng got married, you were willing to do anything for these two sons. What about me? What will you do for me?
  FSM (Proposed): 大哥出国里没钱,你宁可子也要支持。明成结婚,你为这两个儿子锅卖铁也甘情愿,那呢?你们呢?
  CGM: 大哥出国家里,你宁可卖房子也要。苏明成,你为这两个儿子砸锅卖铁也,那我呢?你们为我呢?

demo2
  Original: 广告魔音魔声京东影音娱乐,轻薄之躯,澎湃能量,告别繁杂走线,让您的心情“无线愉快”。
  Translation: Advertisement - enchanting sounds and voices, JD Video & Entertainment, slim body, powerful energy, saying goodbye to complicated wiring, making your mood 'wirelessly happy'
  FSM (Proposed): 广告魔音魔声音娱乐,薄之躯,能量,告别杂走线,让您的心情“线愉快”
  CGM: 广告魔音,,告别线,让您的线

demo3
  Original: 强势的母亲暴跳如雷,我们给你吃给你穿养你这么大,我们有罪了是不是?你要是有能耐,你就别用我们的钱哪!
  Translation: The overbearing mother is raging and saying: 'We have fed you, clothed you, and raised you to this age. Are we guilty? If you can, don't use our money!'
  FSM (Proposed): 强势的母亲跳如雷,我们给你吃给你穿养你么大,我们有了是不是?你要是有,你就用我们的钱哪!
  CGM: 强势的母亲,我们穿,我们?你要是,你就别用我们的钱哪!

demo4
  Original: 有本事你再说一遍
  Translation: If you can, say it again.
  FSM (Proposed): 有本事你说一遍
  CGM: 有本事你再说一遍

demo5
  Original: 拥有一颗感恩的心我们就会拥有欢乐、拥有幸福,拥有一个完美的未来!
  Translation: By having a grateful heart, we will have joy, happiness, and a perfect future!
  FSM (Proposed): 拥有一颗恩的心,我们就会拥有欢乐、拥有福,拥有一个的未来!
  CGM: 拥有一颗,我们就会拥有欢乐、,拥有一个完美的未来!

demo6
  Original: 所爱隔山海,山海皆可平
  Translation: Love can conquer any distance, be it across mountains or oceans.
  FSM (Proposed): 爱隔山海,可平
  CGM: 所爱隔山海,山海皆可

Short summary: CGM performs poorly, producing unnatural and abnormal speech. This can be attributed to the fact that CGM focuses solely on semantic stress annotated from the text, without considering the acoustic features of the speech. Additionally, the final stress should be at the fine-grained level, which is why CGMrandom outperforms CGM. Our proposed model, FSM, improves the naturalness and expressiveness of TTS by leveraging both semantic and acoustic information.


3. Demos -- Contribution of Coarse-Grained Information

Corresponding to section 4.3 in the paper, samples synthesized by FSM and by the FSM variant trained without coarse-grained supervised information (w/o Lcgce) on the stress task are listed below.

demo1
  FSM (w/o Lcgce): 这衣服好,就是贵了点。
  Translation: This clothing is good, but it's just a bit expensive.
  FSM (Proposed): 这衣服是好,就是了点。

demo2
  FSM (w/o Lcgce): 辰大海,归来
  Translation: Embarking on a journey across the vast expanse of stars, returning still a youth at heart.
  FSM (Proposed): 远征辰大海,归来少年

demo3
  FSM (w/o Lcgce): 虽然代价巨大,可然有很多人。为了这一时的欢愉,而魂。
  Translation: Despite the enormous cost, there are still many people who, in pursuit of a momentary pleasure, would sacrifice their souls.
  FSM (Proposed): 虽然代价巨大,可然有很多人。为了这一时的欢愉,而献出魂。

demo4
  FSM (w/o Lcgce): 陈娇的父母为何此的憎恨女儿,竟然对他们此狠手?
  Translation: Why do Chen Jiao's parents hate their daughter so much that they resorted to such a cruel act against her?
  FSM (Proposed): 陈娇的父母为何此的憎恨女儿,竟然对他们手?

demo5
  FSM (w/o Lcgce): 什么会出现在
  Translation: Why did you appear here?
  FSM (Proposed): 什么会出现在里!

Corresponding to section 4.3 in the paper, samples synthesized by FSM and by the FSM variant without both coarse-grained supervised information and CGM (w/o Lcgce & CGM) on the stress task are listed below.

demo1
  FSM (w/o Lcgce & w/o CGM): 听了孩子如此忤逆的话,强势的母亲跳如雷。
  Translation: Upon hearing her child's disobedient words, the authoritarian mother flew into a rage.
  FSM (Proposed): 听了孩子此忤逆的话,强势的母亲跳如雷。

demo2
  FSM (w/o Lcgce & w/o CGM): 母亲轻飘飘的说了句,这是大人的事儿,本懒得搭理她。
  Translation: The mother casually said: 'This is an adult matter,' and didn't bother to pay attention to her.
  FSM (Proposed): 母亲飘飘的说了句,这是大人的事儿,根本得搭理她。

demo3
  FSM (w/o Lcgce & w/o CGM): 折磨你的,来不是人的绝情,而是你心中的想和待。
  Translation: What can torment you is never the heartlessness of others, but the illusions and expectations in your own mind.
  FSM (Proposed): 磨你的,不是别人的绝情,而是你心中的想和待。

demo4
  FSM (w/o Lcgce & w/o CGM): 个没踩稳飞了出去。
  Translation: One didn't step firmly and flew out.
  FSM (Proposed): 个没踩稳了出去。

demo5
  FSM (w/o Lcgce & w/o CGM): 成为了他一生中最正确的决定。
  Translation: It became the most correct decision of his life.
  FSM (Proposed): 成为了他一生中正确的决定。

demo6
  FSM (w/o Lcgce & w/o CGM): 告你,你要是敢言乱语,我对你动粗
  Translation: I'm warning you, if you dare to talk nonsense, don't blame me for getting physical with you.
  FSM (Proposed): 你,你要是敢胡言乱怪我对你动粗

Short summary: As expected, introducing coarse-grained information improves the ability of our fine-grained stress model. Even where abundant semantic information is already available, the CGM still supplies notable stress-related information about the sentence, which helps the FSM detect fine-grained stress keywords. Overall, the results demonstrate that supervising with coarse-grained information is an effective way to prevent stress weights from dispersing during training.


4. Demos -- Stress Demo on Paragraph TTS

The following samples are synthesized with stress prediction on long paragraphs.

demo1
  Original: 女人刚离开姐妹家,就看见加班的丈夫出现在姐妹楼下。她懵逼了?昨天还好心劝导姐妹,不要像亲妹妹一样做傻事。无论靠偷,靠抢,也要当上正宫,可居然抢的是自己的丈夫。为了证实猜测,她来到姐妹门前。事实证明,巧合太多就不是巧合。她强装镇定打开大门,摸了进来,地上果然摆着丈夫的鞋,屋里到处都是狗男女的嬉笑声,桌上还剩着没吃完的桃,但她的目标却在刀上,她起了杀心。她无法接受这个现实,哪怕是幻想,突然一个电话提醒了她,是她的母亲。她回家后就拿着姐妹的手链问丈夫,果然是老戏骨,干啥啥不行,演戏第一名。虽然看似天衣无缝,但女人心里已种下怀疑的种子。妻子的这种行为也引起了丈夫的警惕,立马打电话问小三,手链还在吗?小三心虚了,但为了不被对象发现她接近原配的事,她告诉对象,自己一直戴着。不料第二天一大早,原配就找上了门,不是姐妹叙旧,而是找丈夫的踪迹。
  FSM (Proposed): 女人离开姐妹家,就看见加班的丈夫出现在姐妹楼下。她逼了。昨天还好心劝导姐妹,不要像亲妹妹一样做事。无论靠,靠,也要当上正宫,可居然抢的是己的丈夫。为了实猜测,她来到姐妹门前。事实证明,巧合太多就是巧合。她装镇定打开大门,了进来。地上然摆着丈夫的鞋。屋里处都是男女的嬉笑声。桌上还剩着吃完的,但她的目标却在上,她起了心。她法接受这个现实。怕是想。一个电话提醒了她,是她的母亲。她回家后就拿着姐妹的手链丈夫。然是戏骨,啥啥不行,演戏第名。虽然衣无缝,但女人心里种下怀疑的种子。妻子的这种行为也引起了丈夫的警马打电话问小三,手链吗。小三虚了。但为了不被对象发现她接近原配的事,她告诉对象,自己直戴着。不料第二天大早。原配就上了门。不是姐妹叙,而是找夫的迹。

demo2
  Original: 北国风光,千里冰封,万里雪飘。 望长城内外,惟余莽莽;大河上下,顿失滔滔。 山舞银蛇,原驰蜡象,欲与天公试比高。 须晴日,看红装素裹,分外妖娆。 江山如此多娇,引无数英雄竞折腰。 惜秦皇汉武,略输文采;唐宗宋祖,稍逊风骚。 一代天骄,成吉思汗,只识弯弓射大雕。 俱往矣,数风流人物,还看今朝。
  FSM (Proposed): 北国风里冰里雪飘。望长城内外,莽莽;上下,失滔滔。舞银蛇,驰蜡象,欲与天公试比,看红装素裹,外妖娆。江山此多娇,无数英雄竞折腰。秦皇武,略输文采;唐宗宋祖,风骚。一代天骄,成吉思汗,只识弓射大雕。看今朝。

Short summary: Integrating stress enhances the expressiveness of the paragraph as a whole. Stress also makes the poem sound better, even though the TTS model was trained on a non-poetry corpus.

5. Conclusion

In this work, we propose a multi-granularity stress prediction model to improve the naturalness and expressiveness of TTS. In consideration of the nature of multi-grained stress, we construct the dataset with coarse-grained semantic stress and fine-grained acoustic stress to incorporate both the semantic knowledge in the text and the acoustic information in the speech. Then, the two-stage stress prediction model progressively predicts the coarse-grained stress and fine-grained stress from text, where the pre-trained language model is adopted to extract contextualized word representation from the text. Experimental results objectively and subjectively show that our proposed stress prediction model gets a higher F1 score and achieves more natural and expressive synthetic speech. In the future, we will further improve the multi-stage stress prediction model to close the performance gap between human speech and synthetic speech.
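
For reference, the F1 score mentioned above can be computed from binary stress labels as in the sketch below. The function name `stress_f1` and the per-character label unit are assumptions made for illustration; the paper's exact evaluation protocol is not described on this page.

```python
from typing import Sequence


def stress_f1(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 of the stressed class (label 1) over aligned 0/1 label sequences."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # stressed and predicted stressed
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # predicted stressed but not stressed
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # stressed but missed

    if tp == 0:
        return 0.0  # no correct stressed prediction, so precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only: one hit, one false alarm, one miss.
reference = [0, 0, 0, 0, 1, 0, 0, 1]
predicted = [0, 0, 0, 0, 1, 1, 0, 0]
print(stress_f1(reference, predicted))
```

Because stressed units are usually a small minority in a sentence, F1 on the stressed class is more informative than plain accuracy, which a model could inflate by predicting no stress at all.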

The End, Thank You!