Tasks
Description of work
Task 2.1 Charting the national scene - producing "Language whitepaper" (RILHAS, TMIT, FFZG, IPIPAN, ULodz, UBG, IPUP, IBL, LSIL; M01-M04):
The function of a language report is to summarise and document the language community landscape as it stands when the report is prepared (identifying relevant researchers and projects, politicians, industry representatives, language communities and additional stakeholders), and thus to serve as a source of information for all interested parties. The target audience is broad, matching the extent to which LRT can affect the European digital market: CESAR and related projects aim to inform the language-related community as a whole (language, research, industry). The details of the survey will be further analyzed with partner projects and META-NET. The CESAR project seeks to chart the following:
- Language community: number of speakers worldwide, number of web pages in that language, other relevant quantitative elements including e.g. estimated volume of translations as source or target language, main trading partners (within and outside the EU), etc.
- Role of the language in question in the respective country/language community: legal framework regarding the use of national language(s); institutional communication and local administration; place and function in the media: TV, cinema, press; place and function in the software and digital media (e.g. games) industry e.g. degree of localisation; policies and public programmes in support of language, e.g. language learning, book translation etc.
- Research community: estimated size of the research community in the areas of NLP and ST, including specialist groups (e.g. machine translation, information retrieval/extraction); main universities and research centres in NLP and ST; teaching curricula and number of graduates in recent years; national programmes/agencies in support of language technology; main gaps e.g. underdeveloped human or technical resources; activities at national level, their relevance for addressing the identified gaps.
- Language service industry: qualitative and quantitative analysis of the local translation, localization and interpretation industries; description and actual/estimated number of businesses and professionals with an indication of leading companies; degree of sophistication and ICT use of the service industry.
- Language technology industry: qualitative and quantitative analysis of the local industrial landscape; estimated number of vendors and developers (companies as well as individuals); the report should contain a description of the main existing LT products and services, and of their actual or potential users (public at large, business/professional users).
- Policy makers: politicians, administration, media, funding agencies, affecting the language-related community and digital market.
- Demand side: role of language-technology products and services within the Internet, digital media and telecommunications sectors, by way of examples; where applicable: "success stories" i.e. examples of use of language technology by businesses and administrations.
- Legal provisions: national intellectual-property and digital-copyright regulations related to language resources i.e. databases and software.
- Various types of users: analysis of the needs of different types of users, from individual users to large multinational organisations (practically all stakeholders in the modern digital market: everyday end-users, professional end-users (business, administration, media, education, libraries, etc.) as well as expertise holders (researchers, industrialists, policy makers, etc.)), from the perspective of the current status as well as of near-future prospects.
- Contact information (i.e., name, phone number, affiliation, email and postal address), at the international and especially the national level, for representatives of the following stakeholder types: research, politics, administration, funding agencies, LT user industries, LT provider industries, journalists, language communities.
To collect the relevant information, previous surveys will be taken into account. However, these are typically partial, might be outdated, and may not involve data from industry partners and administration. In addition, three types of institutions at the national/language level will be approached for responses: public authorities, private resource owners, and NLP groups at universities.
The comprehensive list of contacts already held by the project partners will be further updated and enlarged. The analysis will result in an extensive report charting the language community landscape for each language covered in the project, as well as in a contacts database comprising representatives of the different stakeholder groups.
The report will become a source for the Language whitepapers, professionally written reports on the state of a language (general, social, strategic, technological aspects; statistics about the availability of LRs/LTs).
Task 2.2 Identification of resources actually or potentially available to the consortium (RILHAS, TMIT, FFZG, IPIPAN, ULodz, UBG, IPUP, IBL, LSIL; M01-M18):
Within this task, language resources and tools that are already developed or under development will be identified; a model for the description of language resources and tools will be adopted; an on-line questionnaire will be prepared; and a unified catalogue of the identified language resources and tools will be developed. The main outcomes of this task will be:
A detailed model for description of language resources: Language resources can be categorised according to their type (written, spoken, multimodal), the number of languages they support (monolingual and bi/multi-lingual), their vocabulary coverage (general and domain-specific), etc. For the purposes of the CESAR project, a model for the description of language resources and tools in relation to their type, ownership, scope, modality, purpose, format, availability, etc. will be adopted; the classification itself will support the identification and further selection of the resources and tools appropriate for the proposed project. The BLARK (Basic Language Resources Kit) concept, defined in a joint initiative between ELSNET (European Network of Excellence in Language and Speech) and ELRA (European Language Resources Association), will be used as a basic description set of language resources and tools. BLARK can be extended in two directions: specifying resources and tools further with respect to their purpose, and defining new criteria that will help at the selection stage by specifying in detail the availability, quality, standardization, and quantity aspects. The extensive classification set that is expected to be established will be synchronized with the templates, guidelines, specifications and available models for resource and tool description provided by META-NET. As a result, an on-line classification questionnaire will be set up.
Language resources identification: The partners will contact research institutions and private companies in their countries that are developers and copyright owners of language resources and tools. The questionnaire, designed according to the specifics of our survey and seeking the important information concerning available language resources and tools, will be addressed to the target institutions in each country. National players will be encouraged to fill in the questionnaire through different approaches: meetings, public lectures, demonstrations, etc. As a result, the partners will identify the resources which are or can be made available to them and establish a catalogue of written and spoken language resources and tools that can potentially be contributed to the project. The catalogue will be open to further extension, both in the number of resources and in the number of attributes by which they are described. The description of the resources, in a well-structured and documented format with different Internet access rights according to the copyright licenses, will be provided for a long-term period, thus making language resources and tools visible. For the preliminary specification of the expected outcomes and of the national contributors mobilized so far, refer to Section B.2. It will be specified which of the resources and tools can be distributed as copyright-free (provided the source is acknowledged). For the rest of the identified resources, clear copyright conditions for obtaining, distributing, and exploiting them will be specified. The legal situation regarding intellectual property rights and copyright legislation particular to each country will be observed. For each resource, if necessary, an agreement will be concluded between the owner, the consortium and META-NET, specifying the intended modalities, timing of upload and related access/reuse policies.
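As an illustration only, one catalogue entry following the description model outlined under this task could be sketched as a small Python record. The class and field names below (ResourceDescription, Modality, Coverage, linguality) are hypothetical and chosen for this sketch; the actual model is to be aligned with BLARK and the META-NET templates and may differ considerably.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Modality(Enum):
    """Resource type as listed in the task description."""
    WRITTEN = "written"
    SPOKEN = "spoken"
    MULTIMODAL = "multimodal"


class Coverage(Enum):
    """Vocabulary coverage as listed in the task description."""
    GENERAL = "general"
    DOMAIN_SPECIFIC = "domain-specific"


@dataclass
class ResourceDescription:
    """Hypothetical catalogue entry; field names are assumptions."""
    name: str
    modality: Modality
    languages: List[str]        # language codes, e.g. ["hu", "en"]
    coverage: Coverage
    owner: str = ""             # ownership
    purpose: str = ""
    data_format: str = ""
    availability: str = ""      # e.g. licence or access conditions

    @property
    def linguality(self) -> str:
        """Derive mono/bi/multilingual from the number of distinct languages."""
        n = len(set(self.languages))
        if n <= 1:
            return "monolingual"
        if n == 2:
            return "bilingual"
        return "multilingual"


# Made-up example entry, not a real resource.
example = ResourceDescription(
    name="Example parallel corpus",
    modality=Modality.WRITTEN,
    languages=["hu", "en"],
    coverage=Coverage.GENERAL,
)
```

Deriving linguality from the language list, rather than storing it separately, keeps the two attributes from drifting apart as the catalogue is extended.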
As the range and number of resources available to the consortium are expected to grow throughout the project lifetime as a result of further networking and alliances, D2.2 will be updated at M12 and M18. D2.2 will be the main source for the Language whitepapers.
During the work on Task 2.2, a joint stakeholder contact database covering the consortium countries will be compiled and made available to all consortium partners as well as to other META-NET partners. This database will cover individuals (experts), institutions (research, national-funding agencies, government) and companies (producers and important users) dealing with LRT. This database will be used primarily for dissemination purposes, but it will remain available for other purposes as well. The first version will be issued at M04 and it will be updated at M18.
Task 2.3 Selection of resources of further interest (RILHAS, TMIT, FFZG, IPIPAN, ULodz, UBG, IPUP, IBL, LSIL; M04-M18):
Not all resources identified at the previous stage will be in the particular focus of the CESAR project. In cooperation with the partner projects and META-NET, the consortium will define the methodology and criteria to be used for a precise selection of resources and tools. Top-level criteria will include (but will not be limited to) availability, quality, standardization, quantity, usability, fitness, extensibility, and perceived potential for reuse, recombination and repurposing. The estimated needs of different groups of end-users will have a prominent place in the criteria list. Our preliminary approach to the selection of language resources and tools is described in Section 3.2. It is based on the achievements of the EU NEMLAR project, extended by defining exact measures for the quality and quantity aspects.
Based upon the agreed criteria and methodology, the consortium will select the best possible mix of resources likely to be of further interest to different groups of end-users. Together with partner projects and META-NET, the consortium will ensure a balanced coverage of resources for different end-users and tasks, families of products and services, etc.
The outline of the resources and tools of wide importance will reveal possible gaps at the national and international levels and will help focus the further efforts of the community.
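To make the idea of criteria-based selection in Task 2.3 concrete, a minimal sketch is given here under strong assumptions: each candidate resource is rated between 0 and 1 on a few of the top-level criteria, and a weighted average decides whether it passes a threshold. The weights, the threshold, the ratings and the function names are all hypothetical; the actual methodology is to be agreed with the partner projects and META-NET.

```python
from typing import Dict, List, Tuple

# Hypothetical weights for four of the top-level criteria named in Task 2.3.
WEIGHTS: Dict[str, float] = {
    "availability": 0.3,
    "quality": 0.3,
    "standardization": 0.2,
    "quantity": 0.2,
}


def score(ratings: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-criterion ratings; a missing rating counts as 0."""
    total = sum(weights.values())
    return sum(w * ratings.get(c, 0.0) for c, w in weights.items()) / total


def select(candidates: Dict[str, Dict[str, float]],
           threshold: float) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, score(r)) for name, r in candidates.items()]
    kept = [(n, s) for n, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up candidates and ratings, for illustration only.
candidates = {
    "corpus_a": {"availability": 1.0, "quality": 0.8,
                 "standardization": 0.5, "quantity": 0.5},
    "lexicon_b": {"availability": 0.2, "quality": 0.4,
                  "standardization": 0.5, "quantity": 0.5},
}
selected = select(candidates, threshold=0.6)
```

A single weighted score is only one possible design; it does not by itself guarantee the balanced coverage across end-users and tasks that the task calls for, which would need an additional step.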