Embedding-based data matching for disparate data sources

Dealing with heterogeneous sources is an important challenge in the field of knowledge discovery and management. Schema matching methods address this problem using one of three approaches: schema-based, instance-based, or a combination of the two. This paper focuses on mapping between a data source for which only the schema is available (schema-only) and a data source containing both schema and instances. Given the lack of suitable methods for aligning these two types of sources, we propose an approach that uses embedding models to represent the sources as vectors and to calculate similarities between data. Our solution consists of combining domain-specific embedding models and cross-domain embedding models to make data matching possible and efficient between the above-mentioned data sources. We have conducted several experiments using the Valentine datasets to evaluate our data matching method on several disparate tabular datasets. The results indicate effectiveness in terms of stability and ablation handling.
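As a rough illustration of the idea in the abstract, the sketch below matches column names from a schema-only source against columns of a source that has both schema and instances, by combining two similarity scores. Everything in it is an assumption made for illustration and not the authors' implementation: the two toy "embedding" functions (character trigrams and word tokens) merely stand in for real domain-specific and cross-domain embedding models, and the weighted average controlled by the hypothetical parameter `alpha` is only one possible way of combining them.

```python
import math
from collections import Counter


def char_ngram_embed(text, n=3):
    """Toy stand-in for one embedding model: bag of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def token_embed(text):
    """Toy stand-in for a second embedding model: bag of word tokens."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def match_columns(schema_only, schema_with_instances, alpha=0.5):
    """Return, for each schema-only column, the best target column and score.

    schema_with_instances maps a column name to a list of sample values.
    alpha (hypothetical) weights the name-only view against the view that
    also uses instance values.
    """
    matches = {}
    for src in schema_only:
        best, best_score = None, -1.0
        for tgt, values in schema_with_instances.items():
            # Target description: column name plus its sample instances.
            description = " ".join([tgt, *map(str, values)])
            name_sim = cosine(char_ngram_embed(src), char_ngram_embed(tgt))
            desc_sim = cosine(token_embed(src), token_embed(description))
            score = alpha * name_sim + (1 - alpha) * desc_sim
            if score > best_score:
                best, best_score = tgt, score
        matches[src] = (best, best_score)
    return matches
```

Calling `match_columns(["customer_name"], {"name_of_customer": ["Alice Martin"], "date_of_birth": ["1990-01-01"]})` pairs `customer_name` with `name_of_customer`. In practice the toy functions would be replaced by trained embedding models, and the combination strategy would follow the paper itself.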

Keywords

Schema Matching; Disparate Data Source; Embeddings

Domains

Computer Science [cs]

Main file

Embedding_based_data_matching_for_disparate_data_sources.pdf (457.4 KB)

Origin: Files produced by the author(s)

Contributor: Nour Elhouda KIRED

https://hal.science/hal-04612345

Submitted on: Friday, June 14, 2024, 14:27:47

Last modified on: Wednesday, September 25, 2024, 03:19:45

Dates and versions

hal-04612345, version 1 (14-06-2024)

Identifiers

HAL Id: hal-04612345, version 1
DOI: 10.1007/978-3-031-68323-7_5

Cite

Nour Elhouda Kired, Franck Ravat, Jiefu Song, Olivier Teste. Embedding-based data matching for disparate data sources. The 26th International Conference on Big Data Analytics and Knowledge Discovery (DAWAK 2024), Aug 2024, Naples, Italy. pp.66-71, ⟨10.1007/978-3-031-68323-7_5⟩. ⟨hal-04612345⟩


Collections

UNIV-TLSE2 CNRS UT1-CAPITOLE IRIT IRIT-SIG IUT-BLAGNAC IRIT-GD TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP
