Internet Traffic Classification with Per-Source Attributes

Internet traffic classification techniques aim to associate a flow, i.e. a sequence of packets exchanged between two hosts on two corresponding transport ports, with the application that generated it. Classification is an important tool for traffic and Quality of Service management, network monitoring, and the implementation of security policies. The most widely used techniques rely on packet payload inspection or assume the use of well-known transport ports. However, many protocols and applications now use random port numbers and encrypt packet payloads, which makes these techniques ineffective. The recent literature therefore proposes using observable attributes of traffic patterns, such as packet sizes and interarrival times. Because formulating models based on such attributes is extremely complex, it is usually done with Machine Learning algorithms. Classification can be performed online or offline. In the online case, classification delay is of primary importance: a result is typically required after observing only the first few packets of a flow, so as to minimize the time elapsed between the beginning of the flow and its identification. On the other hand, classification performance improves as more information becomes available, which results in a trade-off between delay and accuracy. This work proposes a classification technique for the traffic outgoing from a server host on a specific port. After evaluating the characteristics of different classifiers (e.g. Support Vector Machines and Random Forests) and statistical attributes, we focused on a single classification feature, the Index of Variability, which has been proposed in the literature for characterizing the autocorrelation of a traffic process. This parameter was evaluated at different time scales, and the resulting measurements were used to train a Parzen classifier, which proved to provide good performance. The classification error was computed analytically and compared with results obtained on synthetic data.

Internet traffic classification techniques associate a sequence of packets exchanged between two hosts and two respective transport ports (a flow) with the application presumed to have generated it; identifying that application is highly important in many areas, from Quality of Service management to network monitoring. The techniques most commonly used to recognize applications rely on packet payload inspection or on the use of well-known ports. However, many protocols and applications today use random port numbers and encrypt the payload, making these techniques inapplicable. The most recent literature therefore proposes using statistical attributes, such as packet length or interarrival times. Because formulating models based on these quantities has proved extremely complex, it is carried out with Machine Learning algorithms. Classification can be performed online or offline: in the former case timing is crucial, since a result is typically required after considering only the first packets of the flow, thereby minimizing the delay between the start of the flow and its identification. On the other hand, classifier performance improves as the available information grows, which makes a trade-off between delay and accuracy necessary. This work proposes a classification technique based on observing the traffic outgoing from the server-side host-port pair. After evaluating the characteristics of several classifiers and the impact of various classification attributes, attention was focused on the Index of Variability, a parameter proposed in the literature to characterize the autocorrelation of a traffic process. It was evaluated at different time scales, and the resulting measurements were used to train a Parzen classifier, which proved to provide good performance. The classification error was evaluated analytically and compared with the results obtained with synthetic data.
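
As a purely illustrative sketch of the pipeline summarized above (a variability feature measured at several time scales, fed to a Parzen classifier), the following Python code may help. The abstract does not give the formula for the Index of Variability; the code assumes one definition found in the literature, H_v(tau) = 0.5 * (d log IDC(tau) / d log tau + 1), where IDC is the index of dispersion for counts. The Gaussian kernel, the bandwidth value, and all function and class names are likewise assumptions made for illustration, not the method used in this work.

```python
import numpy as np

def idc(arrival_times, tau, horizon):
    """Index of dispersion for counts at time scale tau: Var(N) / E[N]."""
    n_bins = int(horizon // tau)
    counts, _ = np.histogram(arrival_times, bins=n_bins,
                             range=(0.0, n_bins * tau))
    mean = counts.mean()
    # Floor the value so that its logarithm stays finite.
    return max(counts.var() / mean, 1e-12) if mean > 0 else 1e-12

def index_of_variability(arrival_times, taus, horizon):
    """Assumed definition: H_v(tau) = 0.5 * (d log IDC / d log tau + 1)."""
    log_tau = np.log(taus)
    log_idc = np.log([idc(arrival_times, t, horizon) for t in taus])
    slope = np.gradient(log_idc, log_tau)  # numerical derivative
    return 0.5 * (slope + 1.0)

class ParzenClassifier:
    """Parzen-window classifier with an isotropic Gaussian kernel (assumed)."""

    def __init__(self, bandwidth=0.1):
        self.h = bandwidth

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.samples_ = [X[y == c] for c in self.classes_]
        self.priors_ = np.array([len(s) / len(X) for s in self.samples_])
        return self

    def _log_density(self, X, S):
        # Squared distances between every query point and every training sample.
        d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
        dim = X.shape[1]
        log_k = (-0.5 * d2 / self.h ** 2
                 - 0.5 * dim * np.log(2 * np.pi * self.h ** 2))
        # Numerically stable log of the mean kernel value.
        m = log_k.max(axis=1, keepdims=True)
        return (m + np.log(np.exp(log_k - m).mean(axis=1, keepdims=True))).ravel()

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = np.column_stack([
            np.log(p) + self._log_density(X, S)
            for p, S in zip(self.priors_, self.samples_)
        ])
        # Pick the class with the largest prior-weighted density estimate.
        return self.classes_[scores.argmax(axis=1)]
```

In such a setup, each observed server host-port pair would be turned into one feature vector by calling `index_of_variability` on its packet timestamps with a fixed list of time scales, and the labeled vectors would be passed to `ParzenClassifier.fit`. The bandwidth controls the smoothness of the estimated class densities and would need tuning on real data.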