Language data acquisition and cleaning

ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. The service can be used to acquire data for fine-tuning models, for training systems in new languages and domains, and for creating benchmarking and evaluation data for NLP tasks.
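
To make the last stages of such a pipeline concrete, the following is a minimal Python sketch of text normalization followed by segment-pair filtering. It is not ILSP-FC code and does not reflect its API: the function names (`normalize`, `keep_pair`, `filter_pairs`), the length-ratio heuristic, and the `max_ratio` threshold of 2.0 are all assumptions chosen for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def keep_pair(source: str, target: str, max_ratio: float = 2.0) -> bool:
    """Return False for empty, untranslated (identical), or badly unbalanced pairs.

    The character-length ratio test is a hypothetical heuristic, not ILSP-FC's rule.
    """
    if not source or not target:
        return False
    if source == target:
        return False
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter <= max_ratio


def filter_pairs(pairs, max_ratio: float = 2.0):
    """Normalize each pair, then drop duplicates and pairs rejected by keep_pair."""
    seen = set()
    kept = []
    for source, target in pairs:
        pair = (normalize(source), normalize(target))
        if pair in seen or not keep_pair(*pair, max_ratio=max_ratio):
            continue
        seen.add(pair)
        kept.append(pair)
    return kept
```

Calling `filter_pairs` on a list of `(source, target)` tuples returns the surviving pairs in their original order. A production filter would typically add language identification and alignment-score checks, which are omitted here.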

Field                  Value
Partner                Athena Research Center (ATHENA)
Technological Fields   ML/AI
Service subtype        Other/Undefined
Service type           Test Before Invest
Target organization    SME