Is data the only thing that differs here? Meaning, if you have written and spoken data for some language, how would you go about implementing the training, the models, etc.? Any resources?
No. Different languages have different grammatical and semantic structures, so the same algorithm will not work unchanged across languages. Language recognition and parsing rely on the plain old grammar of the language under consideration; you actually need linguistic knowledge to design an effective method (an algorithm, in this case) for a specific language.
Consider two languages: English and Hindi. A normal English sentence is ordered Subject, Verb, Object, while a normal Hindi sentence is ordered Subject, Object, Verb, so an algorithm that assumes one order will not work on the other.
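To make the word-order point concrete, here is a minimal sketch in Python. It assumes the sentence has already been analysed into subject, verb and object (which is the hard part and needs the linguistic knowledge mentioned above). The `Clause` type, the `WORD_ORDER` table and the `linearize` function are made up for illustration, and the Hindi is given in rough romanization.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    subject: str
    verb: str
    obj: str


# Hypothetical per-language ordering rules: English is SVO, Hindi is SOV.
WORD_ORDER = {
    "en": ("subject", "verb", "obj"),
    "hi": ("subject", "obj", "verb"),
}


def linearize(clause: Clause, lang: str) -> str:
    """Join the parts of a clause in the order the given language expects."""
    return " ".join(getattr(clause, role) for role in WORD_ORDER[lang])


english = Clause(subject="Ram", verb="eats", obj="an apple")
hindi = Clause(subject="Ram", verb="khata hai", obj="seb")

print(linearize(english, "en"))  # Ram eats an apple
print(linearize(hindi, "hi"))    # Ram seb khata hai
```

Even in this toy version the ordering rule has to be supplied per language; nothing in the English rule tells you what the Hindi one should be.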
These are basics of NLP; learnt them in college, so my knowledge on the subject is limited and might be outdated... Hope that helps...
Just as Aakash Mallik said, languages can differ enormously, so one algorithm to rule them all is not gonna work. For example, there are even languages that rely far less on word order. Japanese, to name one, uses a concept called "particles", which are attached to words and mark the role those words play in the sentence. Example:
それは俺のパソコンです。

それ は 俺 の パソコン です
Sore-wa ore-no pasokon desu

sore: that (close to you)
wa: topic of sentence _particle_
ore: me/I (male)
no: possessive _particle_ (me -> mine)
pasokon: computer (PC)
desu: is

Word for word: that close to you mine computer is.

However, a complete and correct translation would be:
=> I am male and the thing, which is close to you, is my computer.
Though it should be translated to (dropping JP-specific info):
=> That is my computer.

Now, that's just a very simple sentence. Think about the grammatical differences and concepts across many different languages that this one algorithm would have to handle, while also being able to process much more complicated samples. Think about the algorithm having to translate the English back to Japanese. How would it get the information I dropped and put it back into the sentence? It's not available, so you would have to add a routine that takes a guess from context, if there is any. If no context is available, you will most likely end up with a wrong translation anyway (for example "sore-wa watashi-no konpyuuta desu", taken from Google Translator, which uses the neutral form of me/I, hence is missing information, and takes a guess at the location). Try putting the complete translation into Google Translator and you will get even more gibberish, because it is not able to shorten the translation back to the original compact version that Japanese is able to provide.
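Going back to the particle example, here is a rough sketch of how a program could read roles off particles instead of off word order. It assumes the sentence is already split into tokens (real Japanese has no spaces, so segmentation is a separate problem) and it only knows the two particles from the example. `PARTICLE_ROLES` and `tag_roles` are hypothetical names, not part of any library.

```python
# Hypothetical particle table; only the two particles from the example above.
PARTICLE_ROLES = {"は": "topic", "の": "possessive"}


def tag_roles(tokens):
    """Pair each word with the role marked by the particle that follows it."""
    tagged = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if following in PARTICLE_ROLES:
            # The particle belongs to the preceding word, so consume both.
            tagged.append((word, PARTICLE_ROLES[following]))
            i += 2
        else:
            tagged.append((word, None))
            i += 1
    return tagged


tokens = ["それ", "は", "俺", "の", "パソコン", "です"]
for word, role in tag_roles(tokens):
    print(word, role)
```

Note that nothing in the output says "male speaker" or "close to the listener"; that information sits inside the words 俺 and それ themselves, which is exactly what gets lost when translating to English and back.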
So, Japanese <-> English alone is already hard. And there are hundreds of languages out there, each with its own grammar.
Let me throw in another language, which you would have to handle: French!
Ça, c'est mon ordinateur.

Ça: that
c' (ce): that
est: is
mon: my
ordinateur: computer

Word for word: that, that is my computer

French is a very beautiful language, which puts twirls everywhere. However, as opposed to English (and German), which are Germanic languages, it stems from Latin. Geographically speaking, Germanic and Latin countries were always close in Europe. As a result, French has many sentences that are rather similar to their English counterparts and easy to translate. However, since the roots are different, there are cases in which French (and other Latin-based languages, like Italian and Spanish) differs a lot from English.
And even languages with the same origin differ a lot:
Wir treffen uns morgen im Gebäude B.

A literal translation would be:
=> We meet us tomorrow in building B.
The correct translation, though:
=> Tomorrow, we will meet in building B.

The problem is with word order and tense. In German, we can use the present tense in a future context. In English, we usually have to use a future form (will). Now compare that to the problems we had in Japanese. They are completely different.

I am pretty sure that there are languages whose quirks conflict with the problems you already have in a given language pair. That's why it is so difficult to translate stuff and why even big companies like Google struggle a lot.
As far as I know, the most recent advance in AI is that every language is first translated by the AI into a neutral meta-language (one which even the researchers do not know and have not defined) and then translated from there into the target language. That allows all the information to be restructured while keeping all the data, and in theory it could yield a better translation.
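The meta-language idea can be sketched as a pivot: each language only needs one step into the neutral form and one step out of it, instead of a separate translator for every language pair. The toy below is just that idea with hand-written lookup tables; in the AI systems mentioned above the intermediate form is learned by the model and is not a readable structure like this. `FRAME`, `ANALYSERS`, `GENERATORS` and `translate` are all hypothetical names.

```python
# A hand-written, language-neutral "meaning" for one sentence (hypothetical).
FRAME = ("is_possessed_by", "that_thing", "computer", "speaker")

# One analyser table per language: surface sentence -> neutral frame.
ANALYSERS = {
    "en": {"That is my computer.": FRAME},
    "ja": {"それは私のパソコンです。": FRAME},
    "fr": {"Ça, c'est mon ordinateur.": FRAME},
}

# One generator table per language: neutral frame -> surface sentence.
GENERATORS = {
    lang: {frame: text for text, frame in table.items()}
    for lang, table in ANALYSERS.items()
}


def translate(text, src, dst):
    """Translate by going through the neutral frame, never directly src -> dst."""
    frame = ANALYSERS[src][text]
    return GENERATORS[dst][frame]


print(translate("That is my computer.", "en", "ja"))
print(translate("それは私のパソコンです。", "ja", "fr"))
```

With three languages this needs three analysers and three generators rather than six pairwise translators. It also shows the limit discussed earlier: the frame says nothing about the speaker, so the Japanese side can only produce the neutral 私, not 俺.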