A survey on data integration using crowdsourcing techniques

A central problem in data management and analytics is that no automated process provides a fully reliable mechanism for comparing data items. Many algorithms apply complex computation to this problem, yet their results are still not good enough. Humans, on the other hand, perform such comparisons routinely, and often faster and with less effort than a machine. Entity resolution, sentiment analysis, and image recognition are examples of tasks that an ordinary person can perform using normal human cognitive abilities. Human computation therefore brings in human labor to complement the effort of machines in solving these tasks. Data integration is an area that can benefit from this cooperation between humans and machines, since one of its main tasks is the comparison of entities. Crowdsourcing offers a way to validate the matches found by a data integration approach and to reduce their uncertainty, greatly improving the precision and recall of the solutions and leveraging the computational effort. However, a crowdsourcing approach is expensive in both time and money, which increases the complexity of the problem by adding two more factors to take into account. We propose a classification of the approaches that have used crowdsourcing to improve data integration systems. It has four main divisions, which compare how the approaches handle the problems of crowdsourcing: (1) Cost control: efforts to reduce the number of tasks sent to the crowd in order to lower the final cost. (2) Task assignment: the different types of tasks that researchers use to reduce human work and extract the most information from each task. (3) Quality control: answers collected over the internet cannot be fully trusted, and crowdsourcing is no exception; most approaches acknowledge this and try to filter the noise out of the answers. (4) User motivation: most crowdsourcing campaigns reward users after they complete a task, which helps keep users interested in answering and reduces the time needed to collect a given number of responses. In this thesis, we survey and synthesize a wide spectrum of existing studies on crowdsourced data integration. Based on this analysis, we then outline key factors that need to be considered to improve data integration using crowdsourcing techniques.
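
As a rough illustration of how cost control and quality control can interact, the following minimal sketch lets a machine similarity score decide the clear cases, sends only the uncertain pairs to the crowd, and aggregates the crowd's answers by majority vote. It is not taken from any surveyed system: the Jaccard similarity, the thresholds `low` and `high`, and the function names `triage` and `majority_vote` are all assumptions chosen for the example.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def triage(pairs, low=0.3, high=0.8):
    """Cost control: split candidate pairs into machine-decided matches,
    machine-decided non-matches, and uncertain pairs for the crowd.
    The thresholds are hypothetical and would need tuning in practice."""
    matches, non_matches, uncertain = [], [], []
    for a, b in pairs:
        score = jaccard(a, b)
        if score >= high:
            matches.append((a, b))
        elif score <= low:
            non_matches.append((a, b))
        else:
            uncertain.append((a, b))
    return matches, non_matches, uncertain


def majority_vote(answers):
    """Quality control: accept a match only if strictly more workers
    answered True than False (ties count as non-match)."""
    counts = Counter(answers)
    return counts[True] > counts[False]


pairs = [
    ("Apple iPhone 6 16GB", "apple iphone 6 16gb"),
    ("Apple iPhone 6", "Samsung Galaxy S5"),
    ("Apple iPhone 6 16GB", "iPhone 6 by Apple"),
]
matches, non_matches, uncertain = triage(pairs)
# Only the pairs in `uncertain` would be posted as crowd tasks.
```

In this sketch only one of the three pairs falls between the thresholds, so only that pair would cost money and time; its final label would come from something like `majority_vote([True, True, False])` over the workers' answers.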
