Research

I am a research scientist and engineer with 15 years of experience in speech technologies, deep neural networks and representation learning. My research is focused on detecting patterns in speech, including phoneme recognition, continuous speech recognition and voice activity detection. I’ve also worked on paralinguistic detection tasks, specifically speech emotion recognition and non-verbal vocalisation detection, focusing on the generalisation capability of such models. Lately I’m interested in self-supervision and acoustic-to-articulatory inversion.

I have been a principal research scientist and tech lead at Speech Graphics since 2017. My work focuses on developing and improving the company's audio-driven facial animation technology, which is a standard in the game industry. I have experience in leading long-term research directions and technical solutions, software development (C++, Python), project and product management, operations and infrastructure.

I obtained a Ph.D. from EPFL in Switzerland in 2016, where I worked on applying deep learning methods to speech recognition. I carried out my doctoral studies at the Idiap Research Institute under the supervision of Ronan Collobert, Mathew Magimai Doss and Hervé Bourlard.

Awards

2023 EURASIP Best Paper Award for Speech Communication Journal

“End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition”, Palaz D., Magimai-Doss M., and Collobert R., Volume 108, April, 2019

https://eurasip.org/best-paper-awards/

ISCA Award for Best Paper Published in Speech Communication (2018-2022)

“End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition”, Palaz D., Magimai-Doss M., and Collobert R., Volume 108, April, 2019

Publications

E. Stanley, E. DeMattos, A. Klementiev, P. Ozimek, G. Clarke, M. Berger, and D. Palaz, Emotion Label Encoding Using Word Embeddings for Speech Emotion Recognition, Proceedings of Interspeech 2023, pp. 2418-2422. [pdf]

J. Parry, E. DeMattos, A. Klementiev, A. Ind, D. Morse-Kopp, G. Clarke, and D. Palaz, Speech Emotion Recognition in the Wild using Multi-task and Adversarial Learning, Proceedings of Interspeech 2022, pp. 1158-1162. [pdf]

S. Condron, G. Clarke, A. Klementiev, D. Morse-Kopp, J. Parry, D. Palaz, Non-Verbal Vocalisation and Laughter Detection Using Sequence-to-Sequence Models and Multi-Label Training, Proceedings of Interspeech 2021, pp. 2506-2510. [pdf]

J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, G. Hofer, Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition, Proceedings of Interspeech 2019, pp. 1656-1660. [pdf]

D. Palaz, M. Magimai-Doss, R. Collobert, End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition, Speech Communication 108, pp. 15-32, 2019. [pdf]

D. Palaz, Towards end-to-end speech recognition, Ph.D. Thesis, EPFL, 2016. [pdf]

D. Palaz, G. Synnaeve, R. Collobert, Jointly Learning to Locate and Classify Words Using Convolutional Network, Proceedings of Interspeech 2016, pp. 2741-2745. [pdf]

D. Palaz, M. Magimai-Doss, R. Collobert, Convolutional neural networks-based continuous speech recognition using raw speech signal, Proceedings of ICASSP 2015. [pdf]

D. Palaz, M. Magimai-Doss, R. Collobert, Analysis of CNN-based Speech Recognition System using Raw Speech as Input, Proceedings of Interspeech 2015, pp. 11-15. [pdf]

D. Palaz, M. Magimai-Doss, R. Collobert, Learning linearly separable features for speech recognition using convolutional neural networks, ICLR 2015 workshop. [pdf]

D. Palaz, M. Magimai-Doss, R. Collobert, Joint phoneme segmentation inference and classification using CRFs, Proceedings of IEEE Global Conference on Signal and Information Processing (GlobalSIP) 2014. [pdf]

D. Palaz, R. Collobert, M. Magimai-Doss, End-to-end phoneme sequence recognition using convolutional neural networks, NIPS 2013 Deep Learning workshop. [pdf]

D. Palaz, R. Collobert, M. Magimai-Doss, Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks, Proceedings of Interspeech 2013, pp. 1766-1770. [pdf]

D. Palaz, I. Tošić, P. Frossard, Sparse stereo image coding with learned dictionaries, IEEE International Conference on Image Processing 2011, pp. 133-136.

M. Borgeaud, D. Palaz, P. Deleglise, Monitoring of Land Cover Change Using SAR and Optical Data from the ESA Rolling Archives, Proceedings of ESA Living Planet Symposium, 2010.