The project and its objectives

HyghTra (Hybrid High-Quality Translation System) is a collaborative FP7 Marie Curie Industry-Academia Partnership and Pathways project between the Centre for Translation Studies of the University of Leeds and Lingenio GmbH, a language engineering company based in Heidelberg, Germany. The project ran in two parts, 2010-12 and 2012-14. The project website is:

Objectives: The project's principal goal is a technology for the fast development of high-quality Machine Translation (MT) systems that translate texts between different languages, such as German, French, Dutch, Spanish, Russian, Ukrainian and English. The project team has developed a novel architecture for building MT systems, designed to overcome the technological limitations of current approaches to MT. For instance, the architecture allows developers to combine wider coverage of linguistic phenomena with higher accuracy of linguistic analysis, while achieving a faster development cycle and keeping the computational requirements of MT systems low, which can potentially lead to smaller solutions that also run off-line on mobile devices. Overcoming these limitations allowed the industrial partner (Lingenio) to create translation systems and dictionaries for new translation directions, as well as a new range of innovative translation solutions and services for new markets and applications.

Traditionally, MT systems have been built within one of two major architectures: rule-based machine translation (RBMT) and statistical machine translation (SMT). RBMT systems explicitly represent and process linguistic knowledge about languages (grammar, lexicon) and about translation equivalents between source and target (translation dictionaries, corresponding grammatical structures). These systems offer higher accuracy of linguistic analysis (e.g., they can successfully handle multiword constructions, word order rearrangements, long-distance dependencies between words and the overall syntactic structure of sentences), are smaller in size and use less computational power. However, they have a slower development cycle and need manually built dictionaries, grammars and processing algorithms, which seriously limits the number of supported languages. SMT systems, on the other hand, are built from large collections of previously translated texts (parallel text corpora), which are automatically aligned at the sentence and word levels and stored as large databases of phrases that are translations of each other; translations of new text are generated by search algorithms that recombine segments from the database into a faithful and natural translation. SMT systems have a faster development cycle and are more accurate in resolving ambiguities, but have more problems handling sentence-level linguistic phenomena. In addition, they require more storage space and more computational power: typically they run as web services on powerful servers or computer clusters, which limits their use for off-line mobile applications. More recently, researchers have attempted to combine the RBMT and SMT approaches (so-called Hybrid MT), usually by adding linguistic rules, features and sentence structure representations on top of SMT systems.

The major scientific contribution of the HyghTra project is a new way of building Hybrid MT systems, in which statistical techniques are added on top of an existing wide-coverage RBMT system: statistical methods support the rapid development of grammars and dictionaries for new translation directions and perform run-time disambiguation, but the core system architecture remains rule-based, preserving the accuracy of the linguistic analysis and the smaller computational footprint of the system. The project has developed a methodology for rapidly creating adequate linguistic resources for rule-based MT systems on the basis of statistical analysis of linguistic data. The creation of resources for new languages (dictionaries, tools for linguistic annotation and analysis, transfer between different languages and resolution of linguistic ambiguities) has so far been the major obstacle for developers of RBMT systems. HyghTra has filled this gap, which allowed Lingenio to speed up the development cycle and to enhance the quality of its rule-based MT systems with statistical analysis and disambiguation techniques. From the commercial perspective, these novel Hybrid MT solutions have resulted in an increased range of products and services currently offered by Lingenio.
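The division of labour described above, with rules proposing analyses and statistics choosing among them at run time, can be illustrated with a minimal sketch. It is not Lingenio's implementation: the lexicon `RULE_CANDIDATES`, the toy corpus `CORPUS`, the co-occurrence scoring and the function names are all assumptions made for illustration, and word order is simply copied from the source.

```python
import math
from collections import Counter

# Hypothetical rule-based lexicon (an assumption for this sketch): each source
# word maps to the target candidates that hand-written rules would license.
RULE_CANDIDATES = {
    "Bank": ["bank", "bench"],
    "Geld": ["money"],
    "Park": ["park"],
}

# Toy target-language corpus standing in for large-scale corpus statistics.
CORPUS = [
    "the bank lends money",
    "money in the bank",
    "a bench in the park",
    "the park has a bench",
]


def cooccurrence_counts(sentences):
    """Count unordered word pairs that appear in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        for i, left in enumerate(words):
            for right in words[i + 1:]:
                pairs[(left, right)] += 1
    return pairs


PAIRS = cooccurrence_counts(CORPUS)


def score(candidate, context):
    """Sum of add-one smoothed log co-occurrence counts with the context."""
    total = 0.0
    for other in context:
        key = tuple(sorted((candidate, other)))
        total += math.log(PAIRS[key] + 1)
    return total


def translate(source_words):
    """Rules propose candidates; corpus statistics choose among them."""
    # Unambiguous words supply the context used to score ambiguous ones.
    context = [RULE_CANDIDATES[w][0] for w in source_words
               if len(RULE_CANDIDATES[w]) == 1]
    output = []
    for word in source_words:
        candidates = RULE_CANDIDATES[word]
        output.append(max(candidates, key=lambda c: score(c, context)))
    return output
```

In this sketch the rule component alone cannot decide between "bank" and "bench" for German "Bank"; the statistical component resolves the choice from the neighbouring words, while the set of admissible candidates stays under the control of the rules.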

Performed work

The project team has created a methodology and a set of computational tools and resources for rapidly integrating new languages and translation directions into Lingenio's rule-based MT system using statistical MT techniques. This work also led to a modular development infrastructure, which resulted in a new range of products and services that Lingenio now offers to new markets beyond the traditional users of MT. The team also worked on novel uses of MT technology in other areas, such as language learning and translator training, and proposed pedagogically grounded methods and scenarios for using MT to support the learning process of advanced language learners.

Main results of the project

  • A range of new products and services offered by Lingenio, such as modules for rich linguistic analysis and generation, terminology extraction, and support for the collaborative translation process (Text Simplifier, Standardizer, Summarizer, Intelligent Concordancer, Dictionary Builder, Translation Templates Builder, TM Multiplier, TM Standardization, Dictionary Standardization, Lemmatizer, Morphological Annotator, Syntactic Annotator, Semantic Annotator, Discourse Analyzer, Text Generation).

    A full description of the services is available on Lingenio's website: Configuration of Tools

  • New languages and translation directions developed for Lingenio's flagship rule-based MT products (Translate Pro/Plus/Quick)

  • A methodology for induction of richly annotated dictionaries and grammars from large text collections (text corpora)

  • A methodology for extracting databases of translation equivalents for Lingenio's rule-based MT systems from parallel and comparable corpora

  • A methodology for bootstrapping electronic dictionaries and grammars for new closely related languages from existing Lingenio resources

  • A methodology for statistical disambiguation (evaluation) of competing applications of parsing rules

  • Pedagogically motivated scenarios for using MT to generate negative linguistic evidence for advanced language learning and translator training, which were tested in teaching a University-level module, English for Translators.

  • An ongoing series of HyTra workshops (HyTra-2 at ACL-2013, Sofia; HyTra-3 at EACL-2014, Gothenburg; HyTra-4 at ACL-2015, Beijing) co-located with leading international conferences on Computational Linguistics, which bring together a community of MT researchers and industrial MT developers interested in hybrid approaches to machine translation.
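As an illustration of the kind of corpus-based extraction of translation equivalents listed among the results above, the following sketch ranks candidate word pairs from a sentence-aligned parallel corpus by the Dice coefficient. This is a standard association measure chosen here for brevity, not necessarily the method used in the project; the corpus `PARALLEL` and the function names are invented for the example.

```python
from collections import Counter

# Toy sentence-aligned German-English corpus (invented for illustration).
PARALLEL = [
    ("das haus ist alt", "the house is old"),
    ("das haus ist klein", "the house is small"),
    ("das auto ist alt", "the car is old"),
    ("ein auto", "a car"),
]


def dice_table(pairs):
    """Dice coefficient for every source/target word pair that co-occurs
    in at least one aligned sentence pair."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_freq.update(s_words)
        tgt_freq.update(t_words)
        joint.update((s, t) for s in s_words for t in t_words)
    return {
        (s, t): 2 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in joint.items()
    }


def best_equivalent(word, table):
    """Return the target word with the highest Dice score for `word`."""
    scored = {t: v for (s, t), v in table.items() if s == word}
    return max(scored, key=scored.get)


TABLE = dice_table(PARALLEL)
```

On this toy data, content words such as "haus" and "auto" receive their highest score with "house" and "car" respectively. Function words can tie, and on real corpora such scores would typically serve only as candidates to be filtered and annotated before entering a rule-based dictionary.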

Socio-economic impact and wider implications of the project: The HyghTra project has two main socio-economic impacts. Firstly, HyghTra has moved the technological boundaries of Hybrid Machine Translation beyond the established paradigm. The project's innovative technology focused specifically on the needs of industrial developers and users of MT systems. The new technology enables developers to fill an existing market gap with products that combine SMT's superior disambiguation techniques and rapid development cycle with RBMT's accuracy of linguistic analysis, smaller system size and lower computational cost. This enables the development of MT systems for new markets, such as the market for mobile devices, where highly accurate off-line translation with a small computational footprint is needed (with applications for emergency services, security, social support, tourism, etc., where a stable internet connection to translation web services is either too expensive or cannot be guaranteed).

Secondly, HyghTra has brought Lingenio's new range of products and the underlying MT technology into new areas beyond the two traditional markets for MT systems, namely professional translation automation and end-user MT. Specifically, Lingenio's modular development infrastructure for RBMT systems packages new combinations of individual workflow components into products and services for text analytics, bilingual terminology extraction, dictionary creation, intelligent text processing, intelligent linguistic search in large corpora, language teaching and translator training. These new products and services address a much broader range of markets, from foreign language teaching to intelligent big data mining for industry, government, defense and security. The development of the technological foundations for these innovative RBMT-based solutions is one of the major successes of the project: such development is currently unique in the MT industry, and we expect that within 5-10 years it will be widely used by other companies and will have an important impact on the Language Technology industry as a whole.