Computational linguistics can manifest itself in many ways, and in a modern society where many cultures and languages are exchanging information, the need for unambiguous machine translation has never been greater. As part of the CS124: Natural Language Processing course, I built a rudimentary machine translation system to translate Spanish into English. While many advanced techniques and language models exist, my program was limited to a very basic, unigram-based model that translated each word by "looking up" the definition in a dictionary, trying out several different definitions, and choosing the best one based on a massive English corpus. In other words, the process is rule-based, as opposed to a statistical learning system, because it attempts to generate translations by applying pre-defined syntactic patterns to natural language.
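To make the lookup-and-score idea concrete, here is a minimal sketch in Python. It is not the actual course code: the dictionary, the corpus, and the function names are all toy assumptions chosen for illustration, and the "massive English corpus" is replaced by a single sentence.

```python
from collections import Counter

# Hypothetical toy data: a real system would load a full bilingual
# dictionary and count words over a large English corpus.
DICTIONARY = {
    "el": ["the"],
    "banco": ["bench", "bank"],
    "abre": ["opens"],
}
CORPUS = "the bank opens at nine and the bank closes at five near the bench".split()
UNIGRAM_COUNTS = Counter(CORPUS)


def translate_word(word):
    """Return the candidate translation that is most frequent in the corpus."""
    candidates = DICTIONARY.get(word.lower())
    if not candidates:
        return word  # unknown words are passed through untranslated
    # Counter returns 0 for unseen words, so unseen candidates simply lose.
    return max(candidates, key=lambda c: UNIGRAM_COUNTS[c])


def translate(sentence):
    """Translate word by word, with no reordering and no context."""
    return " ".join(translate_word(w) for w in sentence.split())


print(translate("el banco abre"))
```

Because each word is scored in isolation, a word like "se" that has no dictionary entry, or no single correct entry, has nothing sensible to map to.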
While this straightforward method tends to produce meaningful sentences, it would break down significantly whenever the pronoun "se" was used, typically in combination with a verb. The pronoun "se" is a special case in Spanish because it is used in a wide variety of ways, and each meaning has a very specific grammatical usage. For example, consider "se" as an indirect object. This immediately breaks the unigram model, since the full meaning of the phrase is expressed through the bigram "se [verb]". However, Spanish allows the verb and the pronoun to be separated within the sentence -- "se lo di", "I gave it to them" -- so even a bigram model using consecutive words would fail to capture the full meaning of this "se" as an indirect object.
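The separation problem can be illustrated with a short Python sketch. It pairs each "se" with the first following token that is not another clitic pronoun, on the simplifying assumption that this token is the verb; a real system would need part-of-speech tagging. The clitic list and the function name are hypothetical.

```python
# Assumed (incomplete) set of Spanish clitic pronouns that can sit
# between "se" and its verb.
CLITICS = {"me", "te", "lo", "la", "le", "los", "las", "les", "nos", "os"}


def se_verb_pairs(tokens):
    """Pair each "se" with the next non-clitic token.

    Returns (pronoun, assumed_verb, number_of_skipped_clitics) tuples.
    """
    pairs = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "se":
            continue
        j = i + 1
        # Skip over any clitics standing between "se" and the verb.
        while j < len(tokens) and tokens[j].lower() in CLITICS:
            j += 1
        if j < len(tokens):
            pairs.append((tok, tokens[j], j - i - 1))
    return pairs


tokens = "se lo di".split()
print(list(zip(tokens, tokens[1:])))  # adjacent bigrams: "se" is paired with "lo"
print(se_verb_pairs(tokens))          # skipping clitics: "se" is paired with "di"
```

The adjacent bigrams never place "se" next to "di", which is why a consecutive-word bigram model misses the relationship.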
A specialized, and often subtle, case of this is the "Reflexive 'se'". This is a use case in which applying the "se" pronoun makes the verb reflexive: "levantarse" literally means "to raise oneself", that is, "to get up". However, the "self" need not be explicit in the translation; "He gets up at eight o'clock" is more common in English than "He gets himself up at eight o'clock". So, in this case, the "se" is not explicitly translated at all, but its meaning is included with the verb, causing more problems for a machine translation system.
Finally, we can use the pronoun "se" in an impersonal or passive sense. For example, "Se dice que ella es muy inteligente" translates to "It is said that she is very intelligent." In this case, the "se" is telling us that there is no specific subject to the sentence, but this use can often only be recognized in the context of the statement. This again frustrates the machine translation system, because many machine translation "rules" search for specific syntax trees with specific necessary parts, like subject and verb and so on. Without a subject, the machine translation system will not recognize this use of "se."
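As a rough illustration of what a hand-written rule for this case might look like, the Python sketch below rewrites a sentence-initial "se [verb] que" into "it is [participle] that". Everything here is an assumption for illustration: the participle table is a hypothetical three-entry lookup, and the pattern only catches the most rigid form of the impersonal "se", which is exactly the limitation described above.

```python
# Hypothetical lookup from third-person singular Spanish verbs
# to English past participles.
PARTICIPLES = {"dice": "said", "cree": "believed", "sabe": "known"}


def rewrite_impersonal_se(tokens):
    """Rewrite a leading "se <verb> que" as "it is <participle> that".

    Tokens that do not match the pattern are returned unchanged, and the
    rest of the sentence is left untranslated.
    """
    if (
        len(tokens) >= 3
        and tokens[0].lower() == "se"
        and tokens[1].lower() in PARTICIPLES
        and tokens[2].lower() == "que"
    ):
        return ["it", "is", PARTICIPLES[tokens[1].lower()], "that"] + tokens[3:]
    return tokens


print(rewrite_impersonal_se("Se dice que ella es muy inteligente".split()))
```

A rule this narrow does nothing for an impersonal "se" that appears mid-sentence or without "que", so recognizing the construction in general still depends on context.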
More possible meanings of the "se" pronoun exist, and each causes unique problems for a machine translation system that focuses on a pure rule-based method of translation. Over the years, services like Google Translate have integrated both rule-based systems and statistical models into their translation methods, which improves on the baseline of a pure rule-based system. In addition, Google Translate provides a way for users to correct bad translations, or to offer alternative ones. In this way, Google is able to make use of the most accurate natural language processor available today -- human translators.
Sounds like you had a lot of fun coding up a translator program, which sounds very interesting and also relevant to the class. This thought is kind of a tangent from what you described. "Credible" sources, as per traditions in academia, have always been rule-based. For example, Wikipedia is often criticized because it relies on user input and correction. It's eerie to think that to make its translator more accurate, Google pulls data from the web and its users, which have a lot of subjectivity and errors, to use for its translator's machine learning process.
On the other hand, I guess language is very unique in that it changes because of its usage. If people start using expressions that diverge from a grammatical "rule", then the statistical model is crucial in detecting these expressions that cannot be categorized by rules yet still make sense to people.
Another thought I had was the effect Google Translate has on people's understanding of language. If Google translates in a certain way, how much does it influence its users' understanding of the languages?