INTERSPEECH 2021 | Two AISHELL (希尔贝壳) Papers Accepted at the World's Top Speech Conference

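As a top international conference in the speech field, INTERSPEECH has long been a focus of both academia and industry. It covers all aspects of speech and language processing and applications, as well as frontier advances in speech-related fields. INTERSPEECH 2021 is held from August 30 to September 3 and is organized by the International Speech Communication Association (ISCA). This year the conference runs in a hybrid format, online plus in person in Brno, Czech Republic. To make exchange easier for researchers around the world, every paper accepted this year can be presented by video.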




Two AISHELL Papers Accepted

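Every INTERSPEECH receives submissions from over a thousand research institutions and companies worldwide, yet only a limited number are accepted. At INTERSPEECH 2021, the two papers submitted by AISHELL, "AISHELL-3: A Multi-speaker Mandarin TTS Corpus" and "AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario", were accepted by the conference.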

01
Paper 1

  Title:

"AISHELL-3: A Multi-speaker Mandarin TTS Corpus"


  Download:

https://arxiv.org/abs/2010.11567


  Authors:

Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li


  Affiliations:

  • School of Computer Science, Wuhan University, Wuhan, China

  • Data Science Research Center, Duke Kunshan University, Kunshan, China

  • Beijing Shell Shell Technology Co., Ltd., Beijing, China


  Abstract:

In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers. Their auxiliary attributes such as gender, age group and native accents are explicitly marked and provided in the corpus. Accordingly, transcripts in Chinese character-level and pinyin-level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension on Tacotron-2 where a speaker verification model and a corresponding loss regarding voice similarity are incorporated as the feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well on speakers that are never seen in the training process. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity concerning both speaker embedding similarity and equal error rate measurement. The dataset, baseline system code and generated samples are available online.


  INTERSPEECH presentation information:


02
Paper 2

  Title:

"AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario"


  Download:

https://arxiv.org/abs/2104.03603


  Authors:

Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen


  Affiliations:

  • Northwestern Polytechnical University, Xi’an, China

  • Microsoft Corporation, USA

  • Microsoft Corporation, China

  • Beijing Shell Shell Technology Co., Ltd., Beijing, China

  • University of Science and Technology of China, Hefei, China



  Abstract:

In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given that most open source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversation speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.


  INTERSPEECH presentation information:


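AISHELL's open-source projects have become a benchmark for open data in speech technology. Together they now form an open-source matrix of intelligent speech technology plus data, covering speech recognition, speaker (voiceprint) recognition, speech synthesis, and scenario-based intelligent speech application solutions.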

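AISHELL will keep investing in open source, using technology to lead the growth of its data business and using data to help the technology industry mature. Going forward, it will serve developers and researchers with cutting-edge databases and lower the cost companies face in putting algorithms into production. It also intends to combine more open-source data with education, R&D, and products so that the technology reaches more scenarios. AISHELL still has more work to do toward the democratization of artificial intelligence.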


AISHELL: aiming to democratize artificial intelligence