ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. The service can be used to acquire data for fine-tuning models, for training systems in new languages and domains, and for building benchmarking and evaluation datasets for NLP tasks.
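To make two of the stages named above more concrete, the following is a minimal illustrative sketch of text normalization followed by filtering of segment pairs. It is not ILSP-FC's actual code or API: the function names `normalize_text`, `keep_pair`, and `filter_pairs`, the `max_ratio` threshold, and the filtering rules themselves (drop empty, identical, or strongly length-mismatched pairs) are all assumptions chosen for illustration.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_pair(source: str, target: str, max_ratio: float = 2.0) -> bool:
    """Hypothetical filter rule (an assumption, not ILSP-FC's own logic)."""
    if not source or not target:
        return False  # one side is empty
    if source == target:
        return False  # identical text is unlikely to be a translation
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    # Reject pairs whose character lengths differ by more than max_ratio.
    return longer / shorter <= max_ratio


def filter_pairs(pairs, max_ratio: float = 2.0):
    """Normalize both sides of each pair, then keep only plausible pairs."""
    cleaned = ((normalize_text(s), normalize_text(t)) for s, t in pairs)
    return [(s, t) for s, t in cleaned if keep_pair(s, t, max_ratio)]


if __name__ == "__main__":
    candidates = [
        ("Hello   world", "Bonjour  le monde"),
        ("", "x"),
        ("Same", "Same"),
        ("Hi", "This is a far longer sentence"),
    ]
    for pair in filter_pairs(candidates):
        print(pair)
```

A length-ratio check is only one of many possible heuristics; a production pipeline would typically combine it with language identification and alignment scores.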
| Field | Value |
| --- | --- |
| Partner | Athena Research Center (ATHENA) |
| Technological Fields | ML/AI |
| Service subtype | Other/Undefined |
| Service type | Test Before Invest |
| Target organization | SME |