Open Source Machine Translation Projects

October 7, 2016 at 10:30 am 1 comment

Why I want to contribute to an open-source project

I’m looking for an open source project for machine translation of natural languages that I can contribute to. My motivation comes partly from my sabbatical project: one part of the project is to manage an open source project. No, I’m not looking for a project to take over. I just want to get some experience as a contributor before I manage a project. Another part of my motivation comes from my long-term interest in linguistics and translation; my master’s degree is in Linguistics. And I’ve started getting sucked in by the exciting stuff happening now in machine learning.

What I’m looking for in a project

I’m a bit picky. I’m looking for a project that has these characteristics:

  • Being actively developed. There should be recent commits, open pull requests, and pull requests that have recently been merged.
  • Uses up-to-date development methodologies: Git instead of Svn, unit tests, issue tracking, maybe even some elements of Agile project management.
  • Written in a programming language I want to use. I’m done with C and C++; I’d rather work in C#, Java, Python or some other modern language. (No offense to good old C and C++, they’re old friends.)
  • Cross-platform: OS X, Windows and Linux would be great. And mobile + web platforms would be even better!
  • Not too hard to build and test on my own machine.
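The "actively developed" criterion can be turned into a quick calculation. The following is only a rough sketch: the 90-day threshold and the function names are my own assumptions, and the dates are the ones reported for each project further down in this post (where only a month or year is given, the first day of that period is assumed).

```python
from datetime import date


def days_since(last_activity: date, today: date) -> int:
    """Whole days between the last recorded activity and today."""
    return (today - last_activity).days


def looks_active(last_commit: date, today: date, max_age_days: int = 90) -> bool:
    """Hypothetical rule of thumb: a commit within max_age_days counts as active."""
    return days_since(last_commit, today) <= max_age_days


TODAY = date(2016, 10, 7)  # publication date of this post

# Last commit dates as noted in the project write-ups in this post.
LAST_COMMITS = {
    "Phrasal": date(2016, 8, 31),
    "Mitzuli": date(2016, 4, 1),    # "April of 2016"; the day is assumed
    "OpenLogos": date(2010, 1, 1),  # "since 2010"; month and day are assumed
}

if __name__ == "__main__":
    for name, last_commit in LAST_COMMITS.items():
        age = days_since(last_commit, TODAY)
        status = "active" if looks_active(last_commit, TODAY) else "stale"
        print(f"{name}: last commit {age} days ago, {status}")
```

A date check like this only covers the first criterion; build complexity and tooling still need a look at the repository itself.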

What I’ve found

Apertium
Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs. Here’s how it stacks up against my criteria:

  • Actively being developed
  • Uses svn, has test code, and uses the SourceForge ticketing system as well as Bugzilla; discussions happen on IRC, and code is shared via pastebin
  • There are multiple languages used in different parts of the project:
    • The original Apertium toolbox appears to be written mainly in C
    • Apertium-caffeine is a multi-platform Java application built on top of lttoolbox-java that translates as you type.

This looks like a nice, active and well managed project, but they are using svn. Since one of my goals is to get more experience using modern development tools, I’d rather contribute to a project using Git.

Anusaaraka
Anusaaraka is an English-Hindi language accessing software. It is a machine translation tool with insights from Panini’s Ashtadhyayi (Grammar rules); and aims at the fusion of traditional Indian shastras and advanced modern technologies. -quote from the official web site.

Source code is in a BitBucket repository.

  • There are quite a few recent commits, but no recent pull requests (neither open nor merged).
  • Appears to be written mainly in Java
  • No issues in the issue tracker.
  • No apparent documentation for contributors.

OpenLogos
OpenLogos is the open source version of LOGOS. The article B. Scott: The Logos Model: An Historical Perspective. In: Machine Translation 18 (2003), pp. 1-72 presents an excellent overview of the Logos approach to machine translation.

The C++ source code is in a repository on SourceForge that hasn’t been updated since 2010. This is not what I would call an active project!

The LOGOS Machine translation system is one of the largest and most powerful among the commercial machine translation systems. Various text documents in different formats can be submitted to the system and within a short amount of time are translated into different target languages. The result, a raw translation, is already of high language quality. -quote from the official web site.

Moses
Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices. -quote from the official web site.

  • The GitHub repository has quite a few very recent commits (yesterday). The last pull request was merged 26 days ago.
  • The code is written mainly (entirely?) in C++
  • Issues are being tracked using a mailing list rather than the issue tracker. There are regression tests.
  • It is cross-platform: Windows, OS X, and Linux

This looks like a very mature, sophisticated and well documented project.
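To make the "search for the highest probability translation" idea from the Moses description a little more concrete, here is a toy sketch. It is not Moses code and not its algorithm: the phrase table, its probabilities, and the function name are invented for illustration, and it only covers left-to-right (monotone) translation with no language model or reordering, which a real system such as Moses goes well beyond.

```python
import math
from functools import lru_cache

# Hypothetical toy phrase table: source phrase -> [(translation, probability)].
# In a real system these entries would be learned from a parallel corpus.
PHRASE_TABLE = {
    ("das",): [("the", 0.7), ("that", 0.3)],
    ("haus",): [("house", 0.8), ("home", 0.2)],
    ("das", "haus"): [("the house", 0.9)],
    ("ist",): [("is", 1.0)],
    ("klein",): [("small", 0.6), ("little", 0.4)],
}


def best_translation(source_words):
    """Return (log-probability, translation) for the best monotone segmentation.

    The number of ways to split the sentence into phrases grows exponentially,
    so the best score from each position is memoized (dynamic programming).
    """
    words = tuple(source_words)

    @lru_cache(maxsize=None)
    def best_from(i):
        if i == len(words):
            return (0.0, ())  # nothing left to translate
        best = (-math.inf, ())
        for j in range(i + 1, len(words) + 1):
            for target, prob in PHRASE_TABLE.get(words[i:j], []):
                rest_score, rest = best_from(j)
                score = math.log(prob) + rest_score
                if score > best[0]:
                    best = (score, (target,) + rest)
        return best

    score, phrases = best_from(0)
    return score, " ".join(phrases)


if __name__ == "__main__":
    score, text = best_translation("das haus ist klein".split())
    print(text, round(math.exp(score), 2))
```

With this made-up table, translating "das haus" as one phrase (0.9) beats translating the two words separately (0.7 × 0.8 = 0.56), so the search prefers the longer phrase.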

Mitzuli
Mitzuli is an open source translator app for Android featuring a full offline mode, voice input (ASR), camera input (OCR), voice output (TTS), and more! -from the Google Play Store description.

  • The GitHub repository shows the most recent commit to have been in April of 2016 and the last pull request was closed in June of 2015. There are 8 open issues.

 

Matxin
Matxin is a machine translation engine based on a dependency grammar and an XML interchange format. The Spanish-Basque (es-eu) translation is ready.

Last updated 10/27/2014

Joshua
Joshua is a statistical machine translation toolkit for both phrase-based (new in version 6.0) and syntax-based decoding. It can be run with pre-built language packs available for download, and can also be used to build models for new language pairs. Among the many features of Joshua are:

  • Support for both phrase-based and syntax-based decoding models
  • Translation of weighted input lattices
  • Thrax: a Hadoop-based, scalable grammar extractor
  • A sparse feature architecture supporting an arbitrary number of features

-quote from the README file in the GitHub repository.

This is a very active and well managed project. The code is written mainly in Java. It looks like development needs to be done on Linux and you need to install Hadoop. It could be a bit complex to set up the development environment.

Phrasal
Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java. At its core, it provides much the same functionality as the core of Moses. Distinctive features include: providing an easy to use API for implementing new decoding model features, the ability to translate using phrases that include gaps (Galley et al. 2010), and conditional extraction of phrase-tables and lexical reordering models.

  • The GitHub repository shows fairly recent activity: the last commit was on August 31, 2016, and the last pull request was closed on the same date. There are no open issues.
  • Code is mainly written in Java
  • The project is cross-platform: Linux, OS X, Windows
  • Uses the Gradle build system

Which one will I contribute to?

After looking at all these projects, I concluded that the really interesting ones are also somewhat complex. If I were going to pick one, I’d pick Joshua, but it looks like there’s a significant learning curve just to build and test it. I’ll put this idea on the back burner for now. But maybe someone else will find a project here that they want to contribute to!

 

 
