Sept 11, 2013

DATE:	Wed, Sept 25, 2013
TIME:	11:45 am
PLACE:	Council Room (SITE 5-084)
TITLE:	Making Sense in Translation: Lexical Choice Errors When Translating Across Domains
PRESENTER:	Marine Carpuat, NRC
ABSTRACT: While statistical machine translation has made significant progress in recent years, state-of-the-art systems cannot yet be trusted to convey the correct semantics of the source language. Performance is particularly poor when systems are applied to test domains that differ from their training domain. In this talk, I will present an analysis of lexical choice errors observed when porting a French-English system trained on the Canadian Hansard to very different new domains (e.g., scientific papers or movie subtitles). I will show that many errors fall into a category that has not been addressed in the machine translation literature: French words that acquire new senses in the new domain. For instance, the word "régime" is frequently used in the "political regime" sense in the Hansard, while the previously unseen "diet" sense is more frequent in scientific articles. I will introduce a novel approach for detecting such words automatically, using cues inspired by word sense disambiguation and induction models. This case study highlights the potential for future research at the intersection of machine translation and lexical semantics.

Joint work with Hal Daumé III, Alex Fraser, Chris Quirk, Fabienne Braune, Ann Clifton, Ann Irvine, Jagadeesh Jagarlamudi, John Morgan, Majid Razmara, Aleš Tamchyna, Katharine Henry and Rachel Rudinger.

BIO: Marine Carpuat is a Research Officer at the National Research Council Canada, where she works on natural language processing and statistical machine translation. Before joining the NRC, Marine was a postdoctoral researcher at Columbia University in New York. She received a PhD in Computer Science from the Hong Kong University of Science & Technology (HKUST) in 2008, an MPhil in Electrical Engineering, also from HKUST, in 2002, and a Diplôme d'Ingénieur from the French Grande École Supélec.