Here too, the Polish language is do so in cleanly-separated language communities. Will it always be like this? It is estimated level domain. However, there is a yawning technological gap between English and Polish, and it is currently getting wider. Otherwise stands their language. Forerunners of such develop- all you get is an honorary mention in Wikipedia.
However, other researchers believe that English is inherently better suited to computer processing. And languages such as Spanish and French are also a lot easier to process than Polish using current methods.

[...] easy-to-use voice commands. Language-enabled technology will be able to translate automatically or assist interpreters; summarise conversations and documents; and support users in learning scenarios.
Only then can we readiness with respect to language solutions. However, the whole situa- tion could change dramatically when a new generation Only brush the teeth you want to keep! If there is adequate language technol- tural diversity. You can study every language overview. Up-to-date information such as the cur- under the sun all you want, but if you really intend to rent version of the META-NET vision paper [3] or the keep them alive, you also need to develop technologies Strategic Research Agenda SRA can be found on the to support them. What can this analogy tell In the past twenty years, information technology has us about the future of the European information soci- helped to automate and facilitate many processes: In subsequent centuries, cultural techniques have been developed to better handle language processing audio and video encoding formats make it easy to ex- and knowledge exchange: In the global economic and information space, many European languages.
According to one estimate, the European mar- dia and multilingual user experience in the near future. Human beings had to do priate technology, just as we use technology to solve our the hard work of looking up, assessing, translating, and transport and energy needs among others.
We had to wait until Edison Language technology targeting all forms of written text to record spoken language — and again his technology and spoken discourse can help people to collaborate, simply made analogue copies. For example, machine translation is al- translate web pages via an online service. Technological progress needs to be accelerated.
It can help to address funding commitments. However, citi- new methods to accelerate development right across the zens need to communicate across the language borders map.
Looking even further ahead, innovative European mul- 2. Future intelligent robots with cross-lingual lan- actions between their parents, siblings and other family guage capabilities have the potential to save lives. Some of the leading rule- mersed in a language community of native speakers. At based machine translation systems have been under con- school, foreign languages are usually acquired by learn- stant development for more than 20 years.
Second, Polish is relatively morphologically rich, which means that for roughly thousand base forms of 3. National research budgets in the Baltic countries are very limited. Eine Publikation des Internet und Gesellschaft-Co: The user can also train a system on his own proprietary data uploaded to the system. An be enhanced with search for simple paraphrases of the alternative approach, for which some research has been text. Among young people, the proportion of users is any issues on the use of the Polish language to the Coun- even higher. Springer Berlin Heidelberg,
However, due to the high cost of this work, underlying language rules. However, these approaches have so phrases and complete sentences are translated. Although this technol- and Google Translate, all rely on statistical approaches. In the next section, we de- even though quality can vary randomly. German in the west areas of Poland 22 communes us- Recently, it was debated if Silesians are to be considered ing it as auxiliary language , and Belarusian in the east 3 a national minority. In during the census the Sile- communes , Kashubian 2 communes and Lithuanian sian nationality was declared by , people [10].
First, word order is minorities are the Ruthenians 50, , the Roma relatively free in Polish sentences, and it is used to stress 20, , the Tatars 2, and the Karaites Historically, it was one of the biggest An apple was given to the man by the woman. Even now, there are at least three popular code though some of them are less likely to be used: Second, Polish is relatively morphologically rich, which means that for roughly thousand base forms of 3.
Even native contemporary Polish. Even a grocery shop could bear an English sign- phenomenon in Polish than in English. Today, such a name would be con- sidered ridiculous by a much larger group of speakers. It must be stressed, singular would have been considered rude, it is quite however, that these claims are not based on corpus- popular these days. Even some typographical characters come almost extinct in everyday speech.
Action may be taken against individuals or businesses As regards commercial activities, according to Article that do not respect these requirements. Fines are charge- 7, in commercial dealings involving the participation of able for infractions. Unless parties decide other- higher education, schools and classes with a for- wise, the basis for the interpretation of such documents eign language of instruction or bilingual instruc- is their Polish-language version.
Every second year, it presents a re- of all types, in higher state and non-state schools, in ed- port on the protection of the Polish language to the Par- ucational establishments and other educational institu- liament of the Republic of Poland. Among young people, the proportion of users is any issues on the use of the Polish language to the Coun- even higher. In addition, some multi-lingual resources like the tion and Sports dated 15 October allow foreign online dictionary mash-up ling.
In some the Internet is important for two reasons. On the one countries, the Polish language is prized as giving access hand, the large amount of digitally available language to Polish universities and the Polish job market. It involves sophisti- ter Finland , the eight best place [19].
Considering the high costs as- well as an evaluation of the current situation of LT sup- sociated with manually translating these contents, com- port for Polish. Human text summarisation; language comes in spoken and written forms. While question answering; speech is the oldest and in terms of human evolution the most natural form of language communication, com- speech recognition; plex information and most human knowledge is stored speech synthesis.
Links to tools and resources for Pol- independently of the media speech or text in which it ish, which will be mentioned below, are available on the is expressed. Figure 1 illustrates the LT landscape. Digital texts link to pictures and sounds. While such applications tend to be areas of language technology, i.
Comparable corpora have several obvious advantages over parallel corpora: they can draw on much richer, more available and more diverse sources which are produced every day. Although the majority of these texts are not direct translations, they share many common paragraphs, sentences, phrases, terms and named entities in different languages. The expansion of Web content with daily multilingual news feeds and large knowledge bases like Wikipedia makes comparable corpora more widely available than parallel corpora.
This white paper is part of a series that promotes knowledge about language technology and its potential. It addresses educators, journalists, politicians.
It contains tools for collecting comparable corpora, measuring comparability, aligning data at different levels, and extracting data useful for training statistical machine translation (SMT) systems. Besides the task-specific tools, the toolkit also contains two general-purpose workflow chaining tools for particular usage scenarios. Tilde experimented with SMT domain adaptation for Baltic languages using bilingual terms and bilingual comparable corpora collected from the Web.
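To make "measuring comparability" concrete, the following is a minimal sketch of one simple way to score a document pair: source tokens are translated through a small bilingual dictionary and compared with the target tokens using Jaccard overlap. This is a simplified stand-in and not the metric implemented in the toolkit; the function names, the toy Latvian-English dictionary and the sample texts are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase and split on non-word characters (Unicode-aware in Python 3)."""
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]

def comparability(src_text, tgt_text, dictionary):
    """Jaccard overlap between dictionary-translated source tokens and target tokens.

    `dictionary` maps a source word to a set of possible target words.
    Returns a value in [0, 1]; 0.0 if either side ends up empty.
    """
    translated = set()
    for tok in tokenize(src_text):
        translated |= dictionary.get(tok, set())
    target = set(tokenize(tgt_text))
    if not translated or not target:
        return 0.0
    return len(translated & target) / len(translated | target)

# Toy Latvian -> English dictionary (hypothetical entries for illustration).
toy_dict = {
    "valoda": {"language"},
    "tehnoloģija": {"technology"},
    "tulkošana": {"translation"},
}

score = comparability(
    "valoda tehnoloģija tulkošana",
    "language technology for translation",
    toy_dict,
)
print(score)  # 0.75: three shared words out of four distinct words
```

A real system would likely weight terms and named entities and normalise for document length; the sketch only shows the basic idea of scoring non-parallel text pairs.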
The results of these experiments showed that integration of terminology within SMT systems, even with simple techniques (adding translated term pairs to the parallel data corpus or adding an in-domain language model), can achieve an SMT system quality improvement of up to [...]. Transformation of translation model phrase tables into term-aware phrase tables can boost the quality up to [...]. Data collected for Baltic languages supplements parallel and monolingual data stored in the repository of LetsMT!
Currently this repository includes [...]. One of the main areas where we target development of statistical MT for Baltic languages is its application in translation and localization. Global vendors want to adapt their products for the small Baltic markets as inexpensively as possible. Volumes of text to be translated are growing faster than the capacity of human translation, and translation results are expected in real time. Translation memories (TM) have been used in localization for more than 10 years to increase productivity.
Translation memories can significantly improve the efficiency of localization if the new text is similar to the previously translated material. However, if the text is in a different domain than the TM or in the same domain from a different customer using different terminology, support from the TM is minimal. This can be achieved by combining traditional TMs with machine translation solutions adapted for the particular domain or customer requirements.
Customization of SMT for a particular translation domain can be achieved by using previously translated data to train an adapted SMT system.
We elaborated and evaluated this approach for translation in the IT domain. Additional tuning was done by manually adding a factored model over the disambiguated morphological tags. If the source-language segment is not found in the translation memory, Trados translates it using the designated MT system. We clearly mark MT suggestions to distinguish them from TM suggestions, because MT output may be inaccurate or ungrammatical, or may use the wrong terminology. We evaluated this MT-assisted process against typical translation work in which only translation memories are used.
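The lookup order described above (translation memory first, machine translation as a fallback, with the origin of each suggestion clearly marked) can be sketched as follows. This is an illustrative sketch only, not the Trados integration itself: the function `suggest`, the fuzzy-match threshold, the use of `difflib` for similarity, and the stub MT function are all assumptions.

```python
from difflib import SequenceMatcher

def suggest(segment, tm, mt_translate, fuzzy_threshold=0.75):
    """Return (translation, origin), where origin is 'TM', 'TM-fuzzy' or 'MT'.

    `tm` maps source segments to stored translations.
    `mt_translate` is any callable that translates a segment (hypothetical here).
    """
    # 1. Exact translation-memory hit.
    if segment in tm:
        return tm[segment], "TM"
    # 2. Best fuzzy hit above the threshold (an assumed, simplified similarity).
    best_ratio, best_target = 0.0, None
    for source, target in tm.items():
        ratio = SequenceMatcher(None, segment, source).ratio()
        if ratio > best_ratio:
            best_ratio, best_target = ratio, target
    if best_target is not None and best_ratio >= fuzzy_threshold:
        return best_target, "TM-fuzzy"
    # 3. Fall back to MT and label the suggestion so the translator can tell it apart.
    return mt_translate(segment), "MT"

# Toy data for illustration.
toy_tm = {"Open the file": "Atvērt failu"}

def stub_mt(text):
    """Stand-in for a real MT engine."""
    return "[MT] " + text

exact = suggest("Open the file", toy_tm, stub_mt)
fuzzy = suggest("Open the files", toy_tm, stub_mt)
fallback = suggest("Save changes", toy_tm, stub_mt)
```

Returning the origin alongside the text keeps the labelling decision in one place, so a user interface can render MT suggestions differently from TM matches.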
The results showed clear benefits from MT integration. Assistance from machine translation increased translation productivity by an average of [...]. We have to note that there were significant performance differences between the various translation tasks and between individual translators. In addition, a quality assessment of the texts was performed according to the standard internal quality assessment procedure.
Although the error score increased for all translators from [...]. This degradation is not critical and the result is acceptable for production purposes. In both human and machine translation, a critical requirement for translation quality is the appropriateness and consistency of domain-specific and project-specific terminology. To facilitate the development and accessibility of Latvian and Lithuanian terminology, Tilde actively participates in terminology creation and standardization work, and in the development of online terminology databases and services. In partnership with the Terminology Commission of the Academy of Science of Latvia, Tilde has developed the Latvian online terminology database termnet.
Numerous terminology glossaries have been integrated, many of which had to be digitized from the paper form. Experience in consolidating Latvian terminology served as the background for expanding terminology consolidation work on a pan-European level. It enables searching almost 2 million terms in over 25 languages. Under the term bank federation principle, it provides a single access point to the central database along with interlinked national and international term banks, consolidating terms from such major collections as IATE, WebTerm, Microsoft Terminology Collection, Terminology database of the Latvian Terminology Commission, and others.
Most online terminology databases offer little more than the typical database features of storing and querying terminology entries.
The evolution of the Internet and cloud- [...]. The TaaS platform will provide a variety of online terminology services to serve the needs for automated acquisition, processing, and application of terminological data by human users. These services include: automatic extraction of monolingual term candidates from documents uploaded by users, using state-of-the-art terminology extraction techniques; automatic lookup of translation-equivalent term candidates in user-defined target language(s) from different terminology databases (for automatically extracted monolingual term candidates); facilities for cleaning up automatically acquired raw terminological data; and facilities for exporting terminological data in different formats.

Terminology services can also be used by machine users. Thus, terminology services have the potential to significantly enhance the quality of language tools, and machine translation in particular.
The easiest method for terminology integration in SMT training is to add the bilingual term collection to the parallel corpus that is used to build the SMT system. Although the term collection is usually small in comparison to the whole parallel corpus, its presence in the training data helps the SMT training engine to build better word and phrase alignments, and it also fills gaps in the vocabulary by allowing translation of previously unknown terms.
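The simple approach described above amounts to appending each term pair to the training data as if it were a short sentence pair. A minimal sketch, assuming the parallel corpus is held as two aligned lists of strings; the function name and the sample data are hypothetical, and a real pipeline would typically work on files and apply the same tokenization as the rest of the corpus.

```python
def add_terms_to_corpus(src_sentences, tgt_sentences, term_pairs):
    """Return new (source, target) lists with each term pair appended as a sentence pair.

    Empty terms and pairs already present in the corpus are skipped.
    """
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("parallel corpus sides must have equal length")
    existing = set(zip(src_sentences, tgt_sentences))
    src_out, tgt_out = list(src_sentences), list(tgt_sentences)
    for src_term, tgt_term in term_pairs:
        pair = (src_term.strip(), tgt_term.strip())
        if all(pair) and pair not in existing:
            existing.add(pair)
            src_out.append(pair[0])
            tgt_out.append(pair[1])
    return src_out, tgt_out

# Toy English -> Latvian data for illustration.
corpus_src = ["Click the button ."]
corpus_tgt = ["Noklikšķiniet uz pogas ."]
terms = [("button", "poga"), ("button", "poga"), ("", "tukšs")]

new_src, new_tgt = add_terms_to_corpus(corpus_src, corpus_tgt, terms)
```

The two output lists stay aligned line by line, which is what alignment-based SMT training tools generally expect from a parallel corpus.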
In addition to this simple approach, we also propose to use online terminology services to tag terms in both the parallel and the monolingual corpora used in SMT training (Figure 2). For this we have developed a phrase-level term tagging method. Another process where terminology identification is helpful is the translation phase. Preprocessing the source text and marking terms and their possible translations assists the SMT system in making lexical choices among translation candidates.
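The preprocessing step described above, marking terms in the source text together with their candidate translations, can be illustrated with a greedy longest-match tagger. This is a sketch under stated assumptions and not the authors' phrase-level tagging method: the function name, the `<term translation="...">` markup, the `||` separator and the toy term dictionary are all hypothetical.

```python
def tag_terms(tokens, term_dict, max_len=4):
    """Greedy longest-match tagging of known terms in a token list.

    `term_dict` maps a tuple of lowercase tokens to a list of target-language
    candidates. Returns a list of strings: plain tokens or tagged term spans.
    """
    out, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest possible span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in term_dict:
                match = (n, term_dict[key])
                break
        if match:
            n, candidates = match
            span = " ".join(tokens[i:i + n])
            out.append('<term translation="{}">{}</term>'.format("||".join(candidates), span))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy English -> Latvian term dictionary for illustration.
toy_terms = {
    ("translation", "memory"): ["tulkošanas atmiņa"],
    ("memory",): ["atmiņa"],
}

tagged = tag_terms("Open the translation memory".split(), toy_terms)
```

Preferring the longest match means the multi-word term "translation memory" is tagged as a unit instead of tagging "memory" alone, which is the usual design choice when single-word and multi-word terms overlap.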
Figure 2: The conceptual design of the terminology service integration into statistical MT.

To facilitate the availability and usage of the numerous Baltic language resources and tools developed by Tilde and other institutions, it is important to ensure that they are easy to find, that they follow commonly accepted standards and are interoperable, that they are free to use or have clear licensing conditions, and that sufficient description and documentation is provided. For this purpose we take an active part in the development of the European linguistic infrastructure and the establishment of the META-NET cooperation network of research and industry institutions.
META-NET is a network of excellence dedicated to fostering the technological foundations of a multilingual European information society by facilitating cooperation across different research fields, developing a common vision and strategic research agenda, and establishing an open language resource distribution platform. This freely accessible distributed online infrastructure provides facilities for describing, storing, and preserving language resources, and for making them publicly available.
Among various language resources that can be considered useful for different purposes, META-SHARE places a strong focus on language data that are important in development of language technology applications that are useful to EU citizens in their everyday communication and information search needs. META-SHARE is intended for providers and users of language resources and technologies such as language technology developers, researchers, students, translators, technical writers and others.
In the managing node, information about the catalogued language resources is collected and synchronised with other managing nodes across Europe, thus providing access to the full catalogue of the pan-European infrastructure. As a part of the activities related to populating META-SHARE with language resources, we wanted to extend the open linguistic infrastructure with multilingual terminology resources. We described some major activities that foster the development of language technologies and resources for the Baltic languages Latvian and Lithuanian.
Although the assessment of the META-NET experts includes Baltic languages in the cluster of the less supported languages of Europe in all key language resource categories, significant progress has been achieved in several areas. Development of the cloud-based platform LetsMT! University of the Basque Country. There are application tools for speech synthesis, speech recognition, spelling correction, and grammar checking. There are also some applications for automatic translation, mainly between Spanish and Basque.
One of the major conclusions is that Basque is one of the EU languages that still needs further research before truly effective language technology solutions are ready for everyday use.
At the same time, there are good prospects for achieving an outstanding position in this important technology area. The development of high-quality language technology for Basque is urgent and of utmost importance for the preservation of a minority language such as Basque.
March 25th