Power and Energy Consumption Analysis of Regular Expression Matching Libraries

Regular expressions are a fundamental part of many applications that work in text-based environment or in any software that works by searching patterns: for example the gene-editing software CRISPR-Cas9 and the malware detection softwares like ClamAV and YARA make use of such expressions to find points of interest. Most regular expression libraries used with such software have been evaluated with execution time and memory as their focus and thus literature lacks an accurate power consumption and energy evaluation. Due to the ever-increasing demand for efficient text processing and energy-conscious software in data centers where large amounts of text data are frequently processed, like Google's, such an evaluation is now necessary. In this thesis work we thus performed a comparative analysis on gathered power data of two common regular expression matching libraries, RE2 by Google and Hyperscan by Intel, by using Running Average Power Limit, a software interface that accurately measures the energy and power consumption of CPUs. These libraries were tested on a variety of benchmarks coming from different scopes for automata processing engines to provide insight on their power consumption in different use-cases. Two modes of operation were considered: ending a scan on a first match and ending the scan when all possible matches have been found. Our results show that both libraries outperform the other in specific cases depending on execution type and benchmark, with RE2 being better on average when we're looking for a single match, and Hyperscan being better on multiple matches when patterns are complex.

Le espressioni regolari sono diventate una parte fondamentale in molte applicazioni che lavorano tramite testo o in qualsiasi software che opera tramite la ricerca di pattern: per esempio il software di editing genetico CRISPR-Cas9 e i software di individuazione malware come ClamAV e YARA fanno uso di tali espressioni per trovare punti di interesse. La maggior parte delle librerie di espressioni regolari, sono state valutate considerando il tempo di esecuzione e la memoria come fattori più importanti, perciò la letteratura è carente di una valutazione del consumo di potenza e di energia. Per via della crescente richiesta di analisi del testo e di software a basso consumo energetico nei data center dove grandi quantità di testo vengono processate, come in quelli di Google, tale valutazione è ora necessaria. In questo lavoro di tesi abbiamo quindi eseguito un'analisi statistica e comparativa di due comuni librerie di matching di espressioni regolari, RE2 di Google e Hyperscan di Intel, tramite Running Average Power Limit, un'interfaccia software che misura accuratamente l'energia e il consumo di potenza delle CPU. Queste librerie sono state testate su una serie di benchmark di diverso dominio di provenienza per processori di automi a stati finiti, in modo da comprendere il loro consumo di potenza in diversi casi. Sono state considerate due modalità di matching: terminare la scansione alla prima corrispondenza e terminare la scansione quando sono state trovate tutte le possibili corrispondenze. I nostri risultati mostrano che RE2 consumi meno potenza quando abbiamo bisogno di una singola, breve corrispondenza e che Hyperscan consumi di meno quando dobbiamo trovare un pattern lungo e complesso più volte.