Language data acquisition and cleaning

ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. The service can be used to acquire data for fine-tuning models, for training systems in new languages and domains, and for creating benchmarking and evaluation data for NLP tasks.
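
To make two of the stages listed above more concrete, the following is a minimal Python sketch of text normalization followed by filtering of segment pairs. It is not ILSP-FC code and does not use its API: the names `SegmentPair`, `normalize`, `keep_pair`, and `clean_pairs`, as well as the length-ratio threshold of 2.0, are hypothetical and chosen only for illustration.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class SegmentPair:
    """A candidate translation pair (hypothetical structure, not ILSP-FC's)."""
    source: str
    target: str


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def keep_pair(pair: SegmentPair, max_ratio: float = 2.0) -> bool:
    """Toy filter: drop empty, identical, or length-mismatched pairs.

    The max_ratio threshold is an assumption made for this sketch.
    """
    src, tgt = pair.source, pair.target
    if not src or not tgt or src == tgt:
        return False
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter <= max_ratio


def clean_pairs(raw_pairs):
    """Normalize both sides of each pair, then keep pairs that pass the filter."""
    normalized = (SegmentPair(normalize(s), normalize(t)) for s, t in raw_pairs)
    return [p for p in normalized if keep_pair(p)]
```

Calling `clean_pairs` on a list of `(source, target)` string tuples returns only the pairs that survive the filter, with whitespace normalized on both sides. A real pipeline would typically combine several such heuristics with language identification and alignment scores; this sketch shows only the simplest length-based check.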

Field	Value
Partner	Athena Research Center (ATHENA)
Technological Fields	ML/AI
Service subtype	Other/Undefined
Service type	Test Before Invest
Target organization	SME