Humans in the loop: optimization of active and passive crowdsourcing

CICERI, ELEONORA

Abstract

In recent years, social media have attracted millions of users and have become part of people’s daily practices. They enable users to create and share content and to participate in social networking. User-generated content, i.e., the various forms of publicly available media assets created by end users, is published every day on the Web, mostly on social media and at a massive scale, either as textual documents (e.g., blog articles, posts on social networks, comments and discussions) or as multimedia items (e.g., images and videos). Most user-generated content concerns users’ personal lives. However, users often publish more structured and complex information as well.
Crowdsourcing has gained increasing importance in recent years. The term generally refers to the outsourcing of a non-automatable task to people. As the time people spend online has grown, so has the interest in crowdsourcing. Several approaches have been developed, either making users actively participate in the resolution of tasks or exploiting the data they generate and publish on the Web. We refer to these approaches as, respectively, active crowdsourcing (i.e., active participation of motivated users in task execution) and passive crowdsourcing (i.e., exploitation of user-generated content to extract useful information).
Active crowdsourcing is the process of outsourcing tasks to a large group of people, called workers. In this scenario, human workers are asked to perform very specific tasks (called crowd tasks), which are usually easy for humans but hard for machines; only tasks that are difficult for a machine to perform are submitted as crowd tasks. Such tasks are often based on uncertain data, since the unstructured nature of these data makes them hard for computers to process. Unfortunately, an appropriate model of the impact of a crowd task answer on uncertain data is yet to be defined. Moreover, like machine resources, human computational resources come at a cost: they are not freely available in any amount, and they may provide erroneous answers. Consequently, an approach is needed for selecting the best candidate set of tasks to submit to the crowd under fixed constraints (e.g., cost and time), together with quality assurance procedures that guarantee an appropriate level of result quality.
Passive crowdsourcing, on the other hand, is an alternative approach that leverages users’ online activity for task resolution: it amounts to analyzing a huge amount of publicly available content to extract information about the behaviors, interests and activities of the social media population. Researchers from different fields (e.g., social science, economics and marketing) analyze a variety of user-generated datasets to understand human behavior, find new trends in society and possibly formulate adequate policies in response. However, due to the uncontrolled nature of users’ participation on the Web, the huge mass of available data contains replicated information as well as low-quality or irrelevant content. Moreover, content is often replicated maliciously: users copy content created by others (often subject to copyright), rename it and claim to be its original authors.
In this thesis, we propose methods to overcome these problems in both active and passive crowdsourcing, with the objective of maximizing the quality of results under budget and time constraints.

	Advisor
	
				FRATERNALI, PIERO
			
	Coordinator
	
				FIORINI, CARLO ETTORE
			
	Tutor
	
				PERNICI, BARBARA
			
	Date
	
				20 March 2015
			
	Abstract (translated from the Italian)
	
				Over the last few years, social media have attracted millions of users and have progressively become part of their daily routine. They allow users to create and share content, or to connect with other people. User-generated content, i.e., the various forms of content created by users and available on these media, is published on the Web every day in enormous quantities, both in textual form (e.g., blog articles, posts on social networks, comments and discussions) and as multimedia files (e.g., images and videos). Most of this content concerns personal facts that users want to share with other people in their social circle. However, other users often publish more structured and complex information.
Crowdsourcing has become one of the most discussed topics of recent years. The term crowdsourcing generally refers to the execution, by a group of people, of a non-automatable task. The growth in the time people spend online has steadily increased their interest in crowdsourcing. As a result, many research works have studied aspects of crowdsourcing in depth, either by creating policies for the active participation of users in task resolution or by exploiting the data that users publish on the Web every day. We refer to these approaches as, respectively, active crowdsourcing (i.e., active participation of motivated users in task execution) and passive crowdsourcing (i.e., exploitation of user-generated content to extract otherwise unknown information).
Active crowdsourcing is the process of having tasks executed by a large group of people, called workers. In this scenario, a human worker is asked to perform a specific task (called a crowd task), which is usually simple for the worker but difficult for a machine. In the context of active crowdsourcing, only tasks that are hard to automate are assigned to human workers. Consequently, crowd tasks often deal with uncertain, unstructured data that automatic components cannot interpret. Unfortunately, the impact that resolving a crowd task on uncertain data could have on the degree of uncertainty of the data has not yet been appropriately quantified. Moreover, as with automatic resources, employing human workers has a cost; in addition, workers are not always available and may provide wrong answers. It is therefore necessary to design an approach for selecting the most promising set of crowd tasks, given some constraints (e.g., on time and cost), so that their execution can guarantee a high-quality result.
Passive crowdsourcing instead denotes an alternative approach for exploiting users’ online activity: it requires analyzing large amounts of data (made public by the users themselves) to extract information about the behaviors, interests and activities of the social media population. Researchers from different fields (e.g., the social sciences, economics and marketing) analyze a wide variety of user-created data to understand users’ behavior, find new trends in society and formulate adequate policies in response. However, given the uncontrolled nature of users’ participation on the Web, a large amount of the data contains replicated, low-quality and irrelevant information. Moreover, content is often replicated without permission: users copy content created by others (even when it is subject to copyright), rename it and pretend to be its authors.
In this thesis, we propose methods to overcome these problems, in both active and passive crowdsourcing, with the objective of maximizing result quality even in the presence of budget and time constraints.
			
	Document type
	
				Doctoral thesis
			
	Appears in categories:
	
				Doctoral thesis

Attached files

File	Size	Format
2015_03_PhD_Ciceri.pdf (thesis text; open access from 27/02/2016)	11.25 MB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/102903