Proximity rank join based on cosine similarity

Proximity rank join is the problem of finding the top-K combinations with the highest aggregate score in which the best combinations of objects coming from different services are sought, and each object is equipped with both a score and a real-valued feature vector. The proximity of the objects i.e. the geometry of the feature space plays a distinctive role in the computation of the overall score of a combination. Proximity rank join problem can capture many interesting scenarios e.g. finding similar documents in different data collections given a set of keywords, requesting similar images from different archives given a sample image, to name a few, if the notion of proximity is replaced with the notion of similarity. For this reason, the main goal of this thesis work is the incorporation of the cosine similarity measure into the proximity rank join algorithm. But this incorporation results in an optimization problem which is computationally demanding and complex in nature. Therefore our incorporation is achieved via solving an approximated version of the original optimization problem by making an assumption which we have found to be totally in line with the concept of top-K combinations. Moreover, the solution of our approximated optimization problem has excellent properties in that with the increase of dimension of the feature space (from 2D to 3D) the normalized (in terms of the maximum possible aggregate score) deviation between the true aggregate score and the approximated aggregate score tends to drop and like the true solution it also avoids the triviality,which is unrealistic, that occurs in the solutions of other approximated variants of the original optimization problem.

Proximity rank join based on cosine similarity

AHMMED, ASHEK

2009/2010

Abstract

Proximity rank join is the problem of finding the top-K combinations with the highest aggregate score in which the best combinations of objects coming from different services are sought, and each object is equipped with both a score and a real-valued feature vector. The proximity of the objects i.e. the geometry of the feature space plays a distinctive role in the computation of the overall score of a combination. Proximity rank join problem can capture many interesting scenarios e.g. finding similar documents in different data collections given a set of keywords, requesting similar images from different archives given a sample image, to name a few, if the notion of proximity is replaced with the notion of similarity. For this reason, the main goal of this thesis work is the incorporation of the cosine similarity measure into the proximity rank join algorithm. But this incorporation results in an optimization problem which is computationally demanding and complex in nature. Therefore our incorporation is achieved via solving an approximated version of the original optimization problem by making an assumption which we have found to be totally in line with the concept of top-K combinations. Moreover, the solution of our approximated optimization problem has excellent properties in that with the increase of dimension of the feature space (from 2D to 3D) the normalized (in terms of the maximum possible aggregate score) deviation between the true aggregate score and the approximated aggregate score tends to drop and like the true solution it also avoids the triviality,which is unrealistic, that occurs in the solutions of other approximated variants of the original optimization problem.

Scheda breve

Scheda completa

	Relatore
	
				TAGLIASACCHI, MARCO
			
	Scuola / Dip.
	
				ING V - Facolta' di Ingegneria dell'Informazione
			
	Data
	
				22-ott-2010
			
	Anno accademico
	
				2009/2010
			
	Tipo di documento
	
				Tesi di laurea Magistrale
			
	Appare nelle tipologie:
	
				Tesi di laurea Magistrale

File allegati

File	Dimensione	Formato
thesis_ashek_737737.pdf accessibile in internet per tutti Descrizione: Thesis text Dimensione 5.67 MB Formato Adobe PDF Visualizza/Apri	5.67 MB	Adobe PDF	Visualizza/Apri

I documenti in POLITesi sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10589/3786