
The project has received funding from the Competitiveness and Innovation Framework Programme under Grant Agreement n° 271022.

Tasks

 

Description of work

 

Task 3.1 Upgrading resources to agreed standards (RILHAS, TMIT, FFZG, IPIPAN, ULodz, UBG, IPUP, IBL, LSIL; M01-M16):

 

The upgrade task will mostly focus on reaching META-SHARE compliance, but in some cases additional actions will have to be carried out, depending on the tool/resource. We foresee the following activities:

  • upgrade for interoperability (changing annotation format, type, tagset),
  • technology-related upgrade (wrapping, refactoring, etc.),
  • application of techniques for finding inconsistencies and errors in (automatically and/or manually built) linguistic resources, including corpora and lexica,
  • metadata-related work (creation, enhancement, conversion, standardization),
  • harmonization of documentation (conversion to open formats, reformatting, linking),
  • preparation for maintenance and deployment (debugging, cleaning, building test environments, preparing code repositories),
  • programming tasks (bug-fixing and standardizing API calls).
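
As an illustration of the inconsistency-finding activity listed above, the following is a minimal sketch and not the method adopted by the consortium. It assumes a corpus represented as lists of (word, tag) pairs and flags every word that receives different tags in an identical one-word left and right context. The function name `find_tag_variation`, the tag labels and the sample data are hypothetical.

```python
from collections import defaultdict


def find_tag_variation(sentences, context=1):
    """Return {(left, word, right): tags} for contexts tagged in more than one way."""
    seen = defaultdict(set)
    for sent in sentences:
        words = [w for w, _ in sent]
        # Pad the sentence so that boundary words also have a full context.
        padded = ["<s>"] * context + words + ["</s>"] * context
        for i, (word, tag) in enumerate(sent):
            left = tuple(padded[i:i + context])
            right = tuple(padded[i + context + 1:i + 2 * context + 1])
            seen[(left, word, right)].add(tag)
    # Keep only the contexts in which the same word carries several tags.
    return {key: tags for key, tags in seen.items() if len(tags) > 1}


# Hypothetical toy corpus; "run" is tagged inconsistently after "the".
corpus = [
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
    [("the", "DET"), ("run", "VERB"), ("ended", "VERB")],
    [("they", "PRON"), ("run", "VERB"), ("daily", "ADV")],
]
suspects = find_tag_variation(corpus)
```

Each flagged context is only a candidate for manual review: genuine ambiguity can also produce different tags in the same context, so a wider context window or a human check is needed before any correction is made.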

 

The tools used to enhance the quality of resources are typically language-specific. However, CESAR will aim to promote and coordinate the exchange of existing language-independent tools among consortium partners where available and/or applicable, e.g. to enhance resources by adding new levels of annotation.

 

The selection of resources to be upgraded will be carried out with the following principles in mind:

  • the resources are state-of-the-art representatives of their type for a certain language,
  • if more than one valuable representative of a certain tool type is available for a language (e.g. two morphosyntactic analysers with equally popular tagsets, or formal grammars used for different purposes), all of them are included in the selection,
  • in their current state, the resources are of superior quality at least at the regional level, without the need for excessive further development,
  • licensing conditions allow the resources and resource-related materials to be freely processed and made available, or the consortium succeeds in reaching an agreement with the respective copyright holders.

 

Task 3.2 Extending and linking resources (RILHAS, TMIT, FFZG, IPIPAN, ULodz, UBG, IPUP, IBL, LSIL; M01-M24):

 

Existing resources may need to be extended or linked across different sources to improve their coverage and increase their suitability for both research and development work. This task takes into account the specific goals of the project, identified gaps in the respective language community, and most relevant application domains.

The selection of resources to be extended/linked will be based on those made available within task 3.1, in order to further enhance a smaller but well-defined set of resources. The following rationale will be applied:

 

  • the extension of resources provides considerable value to the community, at least at the regional level,
  • the emphasis is on providing building blocks for existing tools (e.g. extended grammars for existing shallow parsers) rather than on major restructuring,
  • additional resources are integrated with existing ones only if they significantly improve the quality of the resulting resources,
  • if more than one representative of a certain tool type for a language has been selected in task 3.1, they are very likely to be interlinked to benefit from the strong points of both solutions (unless their usage patterns do not encourage such action),
  • if less-developed but still very popular tools (at least within one language community) can benefit from enhancement based on a well-developed equivalent (provided that no extensive work is necessary and that the latter tool cannot be used as a building block in further applications of the former), their enhancement is also considered,
  • the experience of other consortium members (or, where applicable, other consortia) is extensively used in the process of further extending national resources, to provide a strong foundation for cross-linguality,
  • tools offering language-neutrality or cross-linguality are preferred.

 

Task 3.3 Aligning resources across languages (RILHAS, TMIT, FFZG, IPIPAN, ULodz, UBG, IPUP, IBL, LSIL; M01-M24):

 

Cross-lingual alignment of resources, as the most demanding task, will be applied only to a small number of resources. We foresee the following activities:

  • application of techniques for mapping between tagsets and, more generally, between the outputs and inputs of linguistic tools for a particular language,
  • synchronization of resources available for consortium languages,
  • extension of language models to embrace cross-linguality and/or promote language-independence.
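
The tagset-mapping activity listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of the mappings the project will produce: the source tags are assumed to be colon-separated positional tags whose first field is the part of speech, and the table `TAG_MAP`, the function `map_tags` and the sample tokens are hypothetical.

```python
# Hypothetical mapping from a language-specific tagset to a shared coarse tagset.
TAG_MAP = {
    "subst": "NOUN",
    "adj": "ADJ",
    "fin": "VERB",
    "prep": "ADP",
}


def map_tags(tagged_tokens, tag_map, unknown="X"):
    """Convert (word, tag) pairs to the shared tagset and collect unmapped tags."""
    mapped, unmapped = [], set()
    for word, tag in tagged_tokens:
        # Assumption: the part of speech is the first colon-separated field.
        base = tag.split(":", 1)[0]
        if base in tag_map:
            mapped.append((word, tag_map[base]))
        else:
            mapped.append((word, unknown))
            unmapped.add(tag)
    return mapped, unmapped


# Illustrative input only; the tags are not taken from any project resource.
tokens = [("dom", "subst:sg:nom:m3"), ("stoi", "fin:sg:ter:imperf"), ("tu", "adv")]
mapped, unmapped = map_tags(tokens, TAG_MAP)
```

Returning the unmapped tags separately, instead of silently discarding them, makes gaps in the mapping table visible so that they can be reviewed and filled in.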

 

The following rationale will be applied:

  • no more than one tool of a certain type for each language tuple is used in the process,
  • whenever applicable, the largest set of languages is selected (preferably with English as a hub language; languages beyond the consortium's natural scope of interest are not excluded),
  • language-independence is targeted to a great extent,
  • the quality of the result, assessed according to standard evaluation measures used for LRs, is of primary concern, rather than the number of integrated tools.

 

Special mention must be made of work to upgrade, extend and cross-link the finite state tool NooJ (www.nooj4nlp.net). The planned work is justified on the following grounds:

 

  • Five of the six languages involved in the CESAR consortium have extensive resources developed in the NooJ linguistic development environment.
  • NooJ is a truly multilingual platform: resources exist in no fewer than 14 languages, including Arabic, Chinese, Catalan and Hebrew, as well as major languages such as French, English, Spanish and Portuguese. In other words, members of partner consortia also have a vested interest in enhancing and upgrading NooJ.
  • Widespread use of NooJ is seriously limited by the fact that, although freely available for research purposes, it is not open source and suffers from limited interoperability and cross-platform availability.

We intend to carry out the following work:

  • Make NooJ open source
  • Make NooJ platform-independent by porting the current C# code to Java
  • Make NooJ maximally interoperable by ensuring that it works seamlessly with major tools and resources.

 

As a result, the LT community of numerous languages will benefit from the availability of a high-performance, open source, platform-independent tool already reputed to be extremely fast and efficient. Kimmo Koskenniemi has also welcomed these plans, as they open up avenues for integrating NooJ with the HFST toolset.

 

The CESAR project intends to use the expertise of Max Silberztein (the developer of NooJ) as a part-time external consultant hired by the project (estimated cost approx. 36 000 EUR), and to hire a competent programmer for 14 months to rewrite the code and produce a platform-independent Java version.

 
