Oct. 11, 2005

DATE:	Tuesday, Oct. 24, 2006
TIME:	2:30 pm
PLACE:	Council Room (SITE 5-084)
TITLE:	Applications of Corpus-based Semantic Similarity and Word Segmentation to Database Schema Matching
PRESENTER:	Aminul Islam, University of Ottawa
ABSTRACT: We present a method for database schema matching, the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration and warehousing, and in Semantic Web applications. We first present two corpus-based methods: one determines the semantic similarity of two target words, and the other performs automatic word segmentation. We then present a name-based, element-level database schema matching method that exploits both the semantic similarity and the word segmentation methods. Our word similarity method uses Pointwise Mutual Information (PMI) to sort lists of important neighbor words of the two target words, identifies the words that are common to both lists, and aggregates their PMI values (from the opposite list) to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. If any non-matching portions of the text remain, it also applies a modified forward-backward matching technique based on maximum length frequency and entropy rate. For the database schema matching method, we also use normalized and modified versions of the Longest Common Subsequence (LCS) string matching algorithm, with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
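
As a rough illustration of the string-matching component mentioned in the abstract, the sketch below computes a normalized LCS score for two schema element names and blends it with a semantic similarity score using weight factors. It is not the presenter's implementation: the normalization (squared LCS length divided by the product of the name lengths), the function names, and the equal default weights are all assumptions made for this example, and the semantic score is simply passed in as a number rather than computed from a corpus.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """LCS score scaled to [0, 1]. The squared-length normalization is an assumption."""
    if not a or not b:
        return 0.0
    n = lcs_length(a.lower(), b.lower())
    return (n * n) / (len(a) * len(b))


def combined_score(a: str, b: str, semantic_sim: float,
                   w_string: float = 0.5, w_semantic: float = 0.5) -> float:
    """Weighted blend of string and semantic similarity (hypothetical weights)."""
    return w_string * normalized_lcs(a, b) + w_semantic * semantic_sim
```

For example, `normalized_lcs("custname", "customername")` rewards the fact that every character of the abbreviated name appears, in order, in the longer one, while `semantic_sim` would come from a corpus-based measure such as the PMI-based method described above.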