Network Machine Learning Research Group                    S. Jiang, Ed.
Internet-Draft                                                    B. Liu
Intended status: Informational              Huawei Technologies Co., Ltd
Expires: December 5, 2016                                 P. Demestichas
                                                   University of Piraeus
                                                             J. Francois
                                                                   Inria
                                                             G. M. Moura
                                                               SIDN Labs
                                                               P. Barlet
                                                       Network Polygraph
                                                            June 3, 2016

  Use Cases of Applying Machine Learning Mechanism with Network Traffic
             draft-jiang-nmlrg-traffic-machine-learning-00

Abstract

This document introduces a set of use cases in which machine learning technologies are applied to network traffic relevant activities, including machine learning based traffic classification, traffic management, etc.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 5, 2016.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Jiang, et al.           Expires December 5, 2016                [Page 1]
Internet-Draft          Network Machine Learning               June 2016
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Terminology
3. Methodology of Learning from Traffic
  3.1. Data of the Network Traffic
  3.2. Data Source and Storage
  3.3. Architecture Considerations
  3.4. Closed Control Loop
4. Use Cases Study of Applying Machine Learning in Network
  4.1. HTTPS Traffic Classification
  4.2. Malicious Domains: Automatic Detection with DNS Traffic Analysis
  4.3. Machine-learning based Policy Derivation and Evaluation in Broadband Networks
  4.4. Traffic Anomaly Detection in the Router
  4.5. Applications of Machine Learning to Flow Monitoring
5. Security Considerations
6. IANA Considerations
7. Acknowledgements
8. Change log [RFC Editor: Please remove]
9. Informative References
Authors' Addresses

1. Introduction

Machine learning technology has been successful in solving complicated issues. It helps to make predictions or decisions based on large datasets. It can also dynamically adapt to varying situations and respond to real-time issues. Therefore, a growing body of research applies machine learning in the network area.
Among the many aspects of networks, network traffic is one of the most complicated objects to manage. Its volume is growing rapidly along with the growth of the Internet. It is always changing dynamically. Most network traffic flows last only a few minutes, or even less. And the user content within the traffic is becoming more diverse due to the development of various network services and the increasing use of encryption. Consequently, it is more and more challenging for administrators to stay aware of the network's running status and to manage the network traffic flows efficiently. Although more and more data regarding network traffic is generated, traditional mechanisms based on pre-designed network traffic patterns become less and less efficient.

It is natural to utilize powerful machine learning technology to analyze the large amount of data regarding network traffic, in order to understand the network's status, such as performance, failures, security, etc.

It is a big advantage that machines can measure and analyze the network traffic, then report the results and predictions to humans for further decisions. Machines can handle, in close to real time, vast amounts of data that are almost impossible for humans to deal with. Moreover, if the speed and accuracy of the prediction are high enough, the subsequent action based on the prediction result could form a closed control loop to achieve autonomic management. However, the maturity of the latter might be far in the future. Today, traditional control programs still look more reliable than machine learning based control mechanisms.

This document first analyzes the data of the network traffic from various perspectives, and also discusses several important practical considerations, including the training data source, data storage and the learning system architecture.
It then introduces a set of use cases, which have been shown to work well although there is large scope for improvement, including ML-based traffic classification, traffic management, interface failure prediction, etc.

Editor's note: this document is at a preliminary stage. It collects the use cases presented in the proposed Network Machine Learning Research Group (NMLRG) session at the IETF95 meeting.

2. Terminology

The following terminology is used in this document.

Machine Learning  A computational mechanism that analyzes and learns from data input, either historic data or real-time feedback data, following a set of designed features and algorithms. It can be used to make analyses, predictions or decisions, rather than following strictly static program instructions.

Network Traffic  The amount of data moving across a network at a given point of time. It is mostly encapsulated in network packets.

Traffic Flow  A sequence of packets from a source computer to a destination [RFC6437]. It is the unit of network traffic.

Feature (machine learning)  In machine learning and pattern recognition, a feature is an individual measurable property of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and regression.

Algorithm (machine learning)  Machine learning algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions. An incomplete list of machine learning algorithms includes supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, deep learning, etc.

3. Methodology of Learning from Traffic

3.1. Data of the Network Traffic

There is plenty of valuable data related to the network traffic.
These data are raw features in the learning process. The following is a simple classification of network traffic data.

Measurable properties  There are many measurable properties of network traffic, such as latency, number of packets, duration, etc. These properties are also essential features, especially for use cases relevant to performance, QoS (Quality of Service), etc.

Data within communication protocols  The user content is encapsulated in layered communication protocols. Much information is contained within the protocol headers, for example the source and destination IP addresses in the IP header, the port numbers in the TCP/UDP header, etc. Application layer protocols are often tied to the type of application, such as FTP (File Transfer Protocol) for file transfer, HTTP (Hyper Text Transfer Protocol) for the web, etc.; and much application-relevant data is embedded within these protocols. These could also be essential data for classification or application-oriented analysis. However, some traffic will not provide transport or application information, due to unknown protocols or encryption.

User content  User content is the payload of packets, which might be obtained by DPI (Deep Packet Inspection) within the transit network if the packets are unencrypted, or could be analyzed by the source or destination nodes.

Data in network signaling protocols  Traffic flows are managed or indirectly influenced by various network signaling protocols. For example, the routing protocols determine the next hop of a specific network traffic flow, or even the traffic path (through more sophisticated mechanisms such as MPLS-TE (Multi-Protocol Label Switching - Traffic Engineering), segment routing, etc.); a P2P (Peer to Peer) protocol can even decide the destination of a specific content traffic. These data are relevant and are potential features for traffic analysis.
Furthermore, the traffic of these signaling protocols themselves may also be a learning objective.

3.2. Data Source and Storage

Within networks, forwarding devices such as routers, switches, firewalls, etc., are the entities that directly handle the network traffic. Thus, they could collect network traffic data, such as measurable properties, protocol information, etc. Source nodes or destination nodes, particularly servers, could also be sources of network traffic data.

They could either report the collected data to a central repository for storage and learning, or collect and store the data themselves for local learning. This depends on the learning architecture, which is discussed in the following section.

3.3. Architecture Considerations

Global learning vs. local learning

* Global learning refers to tasks that are mostly network-level, so that they need to be done from a global viewpoint. In this case, the learning entity is normally centralized and is different from the data source entities.

* Local learning is more applicable to tasks that are only relevant to one device or a limited group of devices, and that could be done directly within that one node or that limited group of nodes. In the case of grouped nodes, the data may also need to be transferred from the data source entity to the learning entity.

Offline & online learning

* Co-located mode: training (offline, based on historic data) and prediction (online, based on real-time data) are both done within the same entity. The entity could be a central repository or a specific node.

* De-coupled mode: training is done in the central repository, and prediction is made by the routers/switches/firewalls or other devices that directly process the network traffic.

Central learning & distributed learning

Central learning means the learning process is done at a single entity, which is either a central repository or a node. Distributed learning refers to ensemble learning, in which multiple entities learn simultaneously and combine their results into a final result. Since network devices are naturally distributed, it can be foreseen that ensemble learning is a good approach for certain use cases.

3.4. Closed Control Loop

The prediction made by a machine learning mechanism could be directly used for manipulating the network traffic, or for other relevant actions, such as changing the device configuration. However, as noted in the Introduction, this kind of use might be suitable for only a small set of use cases, due to the limited accuracy of machine learning technologies. Besides, some critical usages simply cannot tolerate any false decision.

4. Use Cases Study of Applying Machine Learning in Network

Editor's note: This section is a collection of the work presented in the proposed NMLRG session at the IETF95 meeting. More contributions on use cases are welcome.

4.1. HTTPS Traffic Classification

Managing network traffic requires a good understanding of the content of traffic flows for various purposes. Indeed, enhancing the QoS by prioritizing or scheduling flows, or enforcing security policies by filtering some of them, cannot rely solely on protocol headers such as the IP, TCP or UDP headers. Analyzing the user content with DPI is therefore necessary. However, this poses serious concerns regarding user privacy. In addition, OTT (Over-the-Top) actors would prefer to fully control their network traffic rather than being subject to any intermediaries' policies. As a result, encrypting the traffic has been widely adopted in recent years.
In that context, traffic management is facing severe difficulties since DPI is no longer effective. Using an intermediary service or proxy is the only way to analyze the content of encrypted traffic, but it requires a high level of trust in the intermediaries, which is not always guaranteed, for example with the end-users of an operator's network.

Therefore, new techniques with the ability to extract knowledge and insight from encrypted flows are necessary. In particular, HTTPS [RFC2818] is now a major protocol used over the Internet because it provides secure Web communication, while the Web now embraces various services that were provided separately in the past: email, video streaming, chat, VoIP, file sharing, etc. It relies on TLS (Transport Layer Security) [RFC5246], [RFC6066] to encapsulate HTTP requests.

Being able to identify the service and the provider of an HTTPS connection would help in applying different strategies for managing the corresponding flow. For instance, VoIP (Voice over IP) and email do not require the same QoS, and the use of some services, such as file sharing, might be prohibited to avoid data leakage in a company. As a concrete example, Google, Facebook and Amazon are service providers, while maps, drive and gmail are services of Google. To identify them when they are accessed by a user, identification based on IP addresses and DNS (Domain Name System) names is not reliable, as users can rely on intermediaries to serve as proxies or to resolve DNS requests, respectively.

The SNI (Server Name Indication) [RFC6066] is a TLS extension that the client sends when initiating the TLS handshake (Client Hello). The SNI contains the hostname to which the request is addressed. Such a hostname is indicative of the service and of the service provider name.
However, SNI is an optional field and can be easily forged to circumvent HTTPS filtering without impacting service use [bypasssni]. More advanced mechanisms are hence necessary to improve the robustness of identification even in the case of non-collaborative users, because the objective is to automatically label an HTTPS connection with its associated service and service provider.

The TLS handshake is not encrypted, but the data exchanged during this phase (random number, selected ciphers, ...) is not distinctive of the accessed service. However, the nature of the accessed service directly impacts the user content transmitted through the secure channel, especially the type and size of the data and the way it is transmitted. Such metadata are still measurable properties.

    HTTPS Connection
           +
           |(1)
   +-------v------+
   |TLS Connection|
   |Reconstruction|
   +-------+------+
           |(2)
   +-------v------+       (3')                      (4')
   |   Features   +-------------+----------------------------+
   |  Extraction  |             |                            |
   +-------+------+     +-------v---------+             +----v----+
           |            |Service Provider +------------->Services |
           |(3)         |L1 model         |    Load     |L2 model |
           |            +-------^---------+  services   +----^----+
   +-------v------+             |            model X         |
   |SNI Labelling |             +----------------------------+
   +-------+------+             |(5)
           |            +-----------------------------------------+
           +------------>  Training and                           |
                (4)     |  Models building                        |
                        +-----------------------------------------+

               Two-levels HTTPS traffic classification

In the figure above, step (1) consists of reconstructing the HTTPS connection and retrieving its packets, on which the following metrics are observed (2):

o Inter Arrival Time

o Packet size

o Encrypted data size: this feature has the advantage of being strongly related to the service accessed, unlike the packet size, which is biased by lower-layer headers

Based on these values, aggregated features are computed: average, minimum, maximum, 25th percentile, median, 75th percentile.
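The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the feature name prefixes, and the choice of the "inclusive" quantile method are assumptions made for the example.

```python
from statistics import mean, quantiles


def aggregate(values):
    """Summarize one per-connection metric with the six statistics named above."""
    if not values:
        raise ValueError("at least one observation is required")
    if len(values) == 1:
        q1 = med = q3 = values[0]
    else:
        # Assumption: quartiles use the "inclusive" interpolation method.
        q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "p25": q1,
        "median": med,
        "p75": q3,
    }


def connection_features(inter_arrival_times, packet_sizes, encrypted_sizes):
    """Build one flat feature vector (6 statistics x 3 metrics) for a connection."""
    features = {}
    series_by_name = (
        ("iat", inter_arrival_times),   # hypothetical prefix names
        ("pkt", packet_sizes),
        ("enc", encrypted_sizes),
    )
    for name, series in series_by_name:
        for stat, value in aggregate(series).items():
            features[f"{name}_{stat}"] = value
    return features
```

Each reconstructed connection then yields 18 numeric features that can be handed to a classifier.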
Because different providers may offer a similar service, a single classifier could fail to distinguish them. A multi-level machine learning approach has therefore been proposed. For learning, a dataset without forged SNI is used (3) to build the classifiers (4). The result is (5):

o a first-level model (L1 model) whose goal is to identify the service provider,

o a set of second-level models (L2 models), one for each service provider, to identify the specific service of that service provider

Once all classifiers are trained, a new unknown HTTPS connection is first matched against the L1 model (3'). The output is the predicted service provider, which also determines the corresponding L2 model to load (4') in order to determine the specific service of this service provider.

This framework is independent of the ML technique being used. Each model could also be built with a different technique, but our study has shown that the best results are obtained with Random Forest.

The HTTPS classification framework has been tested over 288,901 connections from lab users. Standard evaluation procedures have been applied. Less representative features have been automatically discarded. Using ten-fold cross-validation, each tested connection has been marked as a perfect identification (both the service provider and the service name are correctly identified), a partial identification (only the service provider is identified) or invalid (neither is identified). 93.1% fall in the first category, 2.9% in the second and the remaining 4.0% in the third. Full results are available in [httpsframework].

Although the results are promising, the current method can only be applied once the HTTPS connection has ended, i.e., after it has been fully reconstructed. This prevents applying any kind of policy to the corresponding traffic flow while it is active. A future challenge is thus to classify the connection before it ends, so that policies can be applied in time.

4.2. Malicious Domains: Automatic Detection with DNS Traffic Analysis

Since their inception, domain names have been used to provide a simple identification label for hosts, services, applications, and networks on the Internet [RFC1034]. In the same way, domains and the DNS infrastructure have also been misused in various types of abuse, such as phishing, spam, and malware distribution, among others.

Newly registered malicious domain names are well known to exhibit an initial DNS lookup pattern that is very distinct from that of legitimate ones: typically, they show an abnormally high number of lookups [Hao2011]. One of the reasons is that malicious domains tend to rely upon spam campaigns within the first hours after their registration in order to maximize the number of victims before the domain is detected and taken down.

In order to protect users from such domains, nDEWS (New Domains Early Warning System) [Moura2016], a tool that classifies newly registered domains based on their initial lookup pattern, has been proposed. To do so, it requires access to (i) a domain registration database and (ii) authoritative DNS server traffic data, which is typically the case for Top-Level Domain (TLD) registries. The domains are clustered into two clusters with k-means, using four features extracted from the analyzed DNS traffic: the number of DNS queries, the number of IP addresses, the number of Autonomous Systems (ASes), and the number of countries. These features were chosen empirically.

As a result, a large variety of suspicious domains can be detected in an automated fashion, including phishing and malware domains, but also other types, such as fake pharmaceutical shops and shops selling counterfeit sneakers. In this pilot study, the responsible registrars are notified about these websites. Ultimately, this allows these websites to be taken down, minimizing the potential number of victims.
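The clustering step of Section 4.2 can be illustrated with a small, self-contained k-means (Lloyd's algorithm) over the four features. This is only a sketch under stated assumptions: the domain feature values below are invented for the example, the deterministic initialisation is a simplification, and no feature scaling is applied. It does not reproduce the nDEWS implementation.

```python
from math import dist


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def kmeans(points, k=2, max_iters=100):
    """Plain Lloyd's k-means for k >= 2 with a deterministic initialisation."""
    ordered = sorted(points)
    # Assumption: spread the initial centroids over the sorted points.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(max_iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the previous centroid if a cluster happens to be empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centroids[c]
            )
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


# Hypothetical per-domain features for the first hours after registration:
# (DNS queries, distinct IP addresses, distinct ASes, distinct countries)
domains = [
    (12, 5, 3, 2),
    (20, 8, 4, 2),
    (8, 3, 2, 1),
    (15, 6, 3, 2),
    (950, 400, 120, 40),
    (1200, 510, 150, 55),
]

labels, centroids = kmeans(domains, k=2)
print(labels)
```

With raw counts, the query volume dominates the distance; a real deployment would likely normalise or log-scale the features before clustering.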
4.3. Machine-learning based Policy Derivation and Evaluation in Broadband Networks

Service provisioning is becoming more complex. For instance, services have diverse quality requirements, the requirements vary in time and space, and there is a need for utmost resource efficiency. Moreover, full agility in time and space (in order to accomplish resource-efficient service provisioning) requires solving computationally intensive tasks. In this respect, policies can play a role: they specify the network behaviour in time periods and service area regions. In this direction, machine learning can have a fundamental role, e.g., for learning the situations encountered and "good" ways (policies) of handling them.

The contribution addresses the role that machine learning can play in policy derivation and evaluation. In more detail, it addresses the requirements on the role of machine learning, including potential inputs and outputs.

Knowledge and machine learning can be an important aspect of wireless networks. Knowledge is created both regarding the contexts and their occurrence, and regarding the association of each context with specific actions and their scoring. The latter encompasses the development of knowledge on how to handle acquired contexts; this knowledge will include the contexts encountered, the corresponding handlings done (decisions applied), the potential alternative handlings, and the respective efficiency of each handling (actually applied or alternative). Reinforcing "good" solutions for each encountered context (e.g., through reinforcement learning) can be a vital and unique element of a knowledge-based management system.
Machine learning can be realized through clustering to discover underlying structures in data, regression to identify patterns and predict values in cell and network usage, classification to classify first-seen unknown users, and density estimation to model complex user behavior and network usage. Several deep architectures and techniques (such as pre-training) can be utilized in order to generalize better on complex data with underlying information and to make accurate predictions, even on unseen data. As a result, the proper machine learning approach can be selected depending on the goal to be achieved.

Through machine learning it will be possible to provide faster and more targeted solutions to specific network problems. Moreover, it is possible to cluster various usage profiles and prioritize the traffic according to its criticality level. For instance, mission-critical services need special attention with respect to latency and prioritization, compared to plain services which may tolerate a bit of delay without jeopardizing the overall quality. In addition, machine learning can lead to improved results in KPIs (Key Performance Indicators) such as end-user throughput, latency, energy consumption and overall cost effectiveness. Moreover, reliability can be increased since certain problematic situations may be predicted before they happen, hence it will be possible to act proactively and alleviate the negative impact of a problem in the network.

It is evident that machine learning can have significant importance in policy derivation and evaluation in broadband networks, especially in 5G infrastructures, which will be complex and heterogeneous and will need to accommodate multiple services ranging from mobile broadband to massive machine-type, mission-critical and vehicular communications.

4.4. Traffic Anomaly Detection in the Router

Modern routers can usually raise alarms when the bandwidth usage rate of a specific interface is high.
When network traffic exceeds a certain threshold, the router will consider it an anomaly event and report it to the NMS (Network Management System). For instance, in some routers/switches, there exists a configuration such as "trap-threshold { input-rate | output-rate }" to trigger traffic alarms, which is statically configured by experienced administrators. However, network traffic is usually not static and can even change significantly due to changes in the carried services, the residential situation, etc. Thus, a static configuration cannot effectively identify traffic anomaly events.

To address the above issue, machine learning technologies are applied so that routers/switches learn the local traffic pattern and detect traffic anomaly events based on the learning results.

Wavelets are employed to analyze time-series network traffic for anomaly detection. At a certain interval, the routers measure, record, and analyze the input and output traffic rates separately, or in the form of rate sums. (The former is recommended for a finer-granularity analysis.) After running for some time, the router obtains a set of "time-rate" data, collected as time-series waves for further wavelet analysis.

Besides wavelets, this use case proposes other machine learning techniques such as outlier detection. In this approach, features are extracted from the wavelets for supervised or unsupervised learning.

After data collection, the router sorts the data and derives the alarm threshold statistically from the data distribution, in order to discriminate between normal and outlier traffic rates. When the interface traffic exceeds the threshold, the router raises alarms to the NMS. The router can dynamically adjust the alarm threshold as new data arrives, by periodic anomaly analysis.
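The threshold step described above can be sketched as follows. The text does not specify which statistic is used, so this example assumes a simple rule (mean plus k standard deviations over a sliding window of recent rates); the class name, window size and k are hypothetical, and the wavelet analysis itself is not shown.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Per-interface alarm threshold that follows the recent rate distribution."""

    def __init__(self, window=288, k=3.0):
        # Assumption: keep the last `window` rate samples (e.g. one day of
        # 5-minute measurements) and flag rates above mean + k * stddev.
        self.samples = deque(maxlen=window)
        self.k = k

    def threshold(self):
        """Current alarm threshold; infinite until enough data is collected."""
        if len(self.samples) < 2:
            return float("inf")
        return mean(self.samples) + self.k * pstdev(self.samples)

    def observe(self, rate):
        """Record a new rate sample and report whether it exceeds the threshold."""
        is_anomaly = rate > self.threshold()
        # Every sample is kept, so the threshold adapts to new traffic levels.
        self.samples.append(rate)
        return is_anomaly


detector = AdaptiveThreshold(window=288, k=3.0)
normal_rates = [100, 102, 98, 101, 99, 103, 97, 100]   # hypothetical Mbit/s values
alarms = [detector.observe(r) for r in normal_rates]
learned = detector.threshold()
spike_alarm = detector.observe(500)
```

One design choice worth noting: outliers are also added to the window here, which lets the threshold follow a lasting shift in traffic but also lets a burst of anomalies raise it; a deployment might exclude flagged samples instead.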
This approach helps devices detect traffic anomalies more efficiently and effectively, compared to the traditional way of learning at a central repository that collects traffic information from various devices.

This use case could be extended from a single interface to multiple ones, that is, to the device scope of multiple traffic waves, and even to the wider scope of multiple devices in a certain domain. This would make the analysis more comprehensive. Besides wavelet analysis, there might be more techniques to explore, such as correlation analysis of traffic anomaly events among multiple devices.

4.5. Applications of Machine Learning to Flow Monitoring

A commercial cloud-based flow monitoring service from Network Polygraph [polygraph] has used machine learning analysis as a cost-effective alternative to DPI for traffic classification, which identifies the application responsible for each network traffic flow.

Nowadays, DPI is considered the standard technology for traffic classification. However, DPI is generally expensive as it requires the analysis of the payload of every single packet. This usually involves the use of powerful, specialized hardware appliances, which need to be deployed on every link to obtain full coverage of the network. In the case of Network Polygraph, the use of DPI is impractical, because the volume of data to be exported to the cloud would be overwhelming (i.e., all traffic would have to be replicated).

A more viable alternative is the use of flow-based monitoring technologies, such as NetFlow [RFC3954] or IPFIX [RFC7011], where the volume of exported data is significantly lower. Flow-based monitoring technologies provide summarized information (e.g., duration, traffic volume) for every connection (or "traffic flow") handled by a router. The information available in flow records is more limited compared to DPI (e.g., packet payloads are not available).
As a result, most flow-based monitoring tools base their classification on port numbers or simple heuristics, which are known to be highly unreliable.

To address this problem, Network Polygraph uses a traffic classification approach based on ML. Several studies have shown that supervised learning can achieve classification accuracy similar to that of DPI at a fraction of its cost. However, supervised methods suffer from some practical limitations that make them very difficult to deploy and maintain in production environments. For example, they require a costly training phase prior to deployment and need to be retrained every time there is a change in the network or in the network applications.

This section describes the ML approach used by Network Polygraph for online classification of NetFlow/IPFIX traffic. To solve the practical limitations of supervised learning, Network Polygraph incorporates an automatic retraining system. Figure 1 shows the components and data flow of the classification engine, which is divided into two parts:

o The classification path (Figure 1, top) is in charge of classifying the traffic online using ML. The input of the classification path is the NetFlow/IPFIX flows exported by the routers, while the output is the classified flows. Several traffic features are extracted from each flow, including the information directly available in the flow records (e.g., addresses, ports, packet and byte counts) together with some constructed features (e.g., average packet size, rate and interarrival time). The traffic features are the input of the traffic classification algorithm, whose function is to identify the application that generated the flow. Among the different supervised algorithms, a C5.0 decision tree was selected, because it has been shown to present the best accuracy/cost ratio for traffic classification.
Other supervised methods, e.g., Support Vector Machine (SVM) and Artificial Neural Network (ANN), obtain similar accuracy, but classification and training times are faster with decision trees. In Network Polygraph, training times are critical as the training path is continuously updating the classification model in the background.

o The training path (Figure 1, bottom) implements the automatic retraining system, which is responsible for automatically updating the classification model when it becomes obsolete. To that end, a random packet-level sample of the network traffic is continuously collected using flow-based sampling. Sampled flows are then labeled using DPI. It is possible to use DPI in the training path because training can be performed with only a small data sample (e.g., 1/1000 flows). This significantly reduces the computational overhead and the volume of data to be exported. The labeled sample is used to verify the accuracy of the classification model. The system accuracy is estimated by comparing the output of DPI (training path) and C5.0 (classification path) for those flows sampled in the training path. If the estimated accuracy falls below a configurable threshold, the labeled sample is used to generate an updated model using only those features available in NetFlow/IPFIX (IP Flow Information Export) records. This training process can also be performed at a few vantage points, and the resulting model used for other networks where only NetFlow/IPFIX monitoring data is available.
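The retraining decision in the training path can be sketched as follows. This is an illustrative outline only: the function names, the `classify` and `train` callables, and the toy port-lookup "model" are assumptions for the example and stand in for the C5.0 classifier and the DPI labeler, neither of which is implemented here.

```python
def estimate_accuracy(dpi_labels, predicted_labels):
    """Fraction of sampled flows where the classifier agrees with DPI."""
    if not dpi_labels:
        raise ValueError("the labeled sample is empty")
    agree = sum(d == p for d, p in zip(dpi_labels, predicted_labels))
    return agree / len(dpi_labels)


def maybe_retrain(model, labeled_sample, classify, train, threshold=0.96):
    """Check the model against DPI labels and retrain if accuracy is too low.

    labeled_sample: list of (flow_features, dpi_label) pairs from the
    sampled traffic. Returns (model, estimated_accuracy, retrained).
    """
    predicted = [classify(model, features) for features, _ in labeled_sample]
    truth = [label for _, label in labeled_sample]
    accuracy = estimate_accuracy(truth, predicted)
    if accuracy < threshold:
        return train(labeled_sample), accuracy, True
    return model, accuracy, False


# Toy stand-ins (hypothetical): the "model" is a destination port lookup.
def classify(model, features):
    return model.get(features["dst_port"], "unknown")


def train(labeled_sample):
    return {features["dst_port"]: label for features, label in labeled_sample}


sample = [
    ({"dst_port": 443}, "web"),
    ({"dst_port": 53}, "dns"),
    ({"dst_port": 5060}, "voip"),
    ({"dst_port": 25}, "mail"),
]
stale_model = {443: "web", 53: "dns"}
new_model, accuracy, retrained = maybe_retrain(stale_model, sample, classify, train)
```

In this toy run the stale model agrees with the DPI labels on two of four sampled flows, so the estimated accuracy falls below the threshold and a new model is built from the labeled sample.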
                          CLASSIFICATION PATH

NetFlow/  +----------+                         +----------+ Classified
 IPFIX    | Feature  |                         |   C5.0   |   flows
+-------->|Extraction+------------------------>|Classifier+----------->
          |          |                         |          |
          +----------+                         +----------+
                                                     ^
                                                     |
TRAINING PATH        +----------+   +----------+     |
                     | NetFlow/ |   | Feature  |     | Retraining
                 +-->|  IPFIX   +-->|Extraction+--+  |
Packet stream    |   |Generation|   |          |  |  |
(flow sampling)  |   +----------+   +----------+  |  |
+--------------->|                                +--+ DPI-labeled
                 |           +----------+         |    NetFlow/
                 |           |   DPI    |         |    IPFIX
                 +---------->|   App.   +---------+
                             | Labeling |
                             +----------+

           Network Polygraph classification engine data flow

                                Figure 1

   To validate the performance of the described ML approach, the
   accuracy of Network Polygraph was measured using a complete 14-day
   trace from the 10-Gigabit link that connects the Catalan Research and
   Education Network (Anella Cientifica) to its Spanish counterpart
   (RedIRIS).  The trace contained about 70 million flows with a flow
   sampling rate of 1/400.  The experimental results showed that, with a
   96% retraining threshold, the system sustained an average
   classification accuracy of 97.5% and needed only 15 retrainings
   during the 14 days, all of which were performed automatically
   without human intervention.  When the retraining threshold was
   lowered to 94%, the accuracy decreased slightly to 96.76% with only
   5 retrainings.

   The objective is to progressively reduce the dependence on DPI
   technologies, which are expensive, difficult to deploy, not scalable,
   and not robust against encryption, in favor of flow-based machine
   learning approaches that are more cost-effective and can easily be
   offered as a cloud service.  In this direction, research challenges
   include the classification of web services and CDN traffic from
   flow-based measurements, and the combination of multiple ground
   truths obtained from vantage points in different networks.

5.  Security Considerations

   This document focuses on applying machine learning to networks,
   including network security, at the level of higher-layer concepts.
   Therefore, it does not itself create any new security issues.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Acknowledgements

   The authors would like to acknowledge Josep Sanjuas, Andreas
   Georgakopoulos, Kostas Tsagkaris, Valentin Carela, Wazen M. Shbair,
   Thibault Cholez, and Isabelle Chrisment for their contributions.

   The authors would also like to acknowledge the valuable comments made
   by participants in the IRTF Network Machine Learning Research Group,
   with particular thanks to Lars Eggert, Brian Carpenter, Albert
   Cabellos, Shufan Ji, Susan Hares, Rudra Saha, and Dacheng Zhang.

   Jerome Francois was partly funded by Flamingo, a Network of
   Excellence project (ICT-318488) supported by the European Commission
   under its 7th Framework Programme.

   This document was produced using the xml2rfc tool [RFC7749].

8.  Change log [RFC Editor: Please remove]

   draft-jiang-nmlrg-traffic-machine-learning-00: original version,
   2016-06-03.

9.  Informative References

   [bypasssni]
              Shbair, W., Cholez, T., Goichot, A., and I. Chrisment,
              "Efficiently Bypassing SNI-based HTTPS Filtering", IFIP/
              IEEE International Symposium on Integrated Network
              Management (IM2015), 2015.

   [Hao2011]  Hao, S., Feamster, N., and R. Pandrangi, "Monitoring the
              Initial DNS Behavior of Malicious Domains", Proceedings of
              the 2011 ACM SIGCOMM Conference on Internet Measurement
              Conference (IMC 2011), November 2011.

   [httpsframework]
              Shbair, W., Cholez, T., Francois, J., and I. Chrisment, "A
              Multi-Level Framework to Identify HTTPS Services", IEEE/
              IFIP Network Operations and Management Symposium, 2016.

   [Moura2016]
              Moura, G., Mueller, M., Wullink, M., and C.
              Hesselman, "nDEWS: a New Domains Early Warning System for
              TLDs", IEEE/IFIP International Workshop on Analytics for
              Network and Service Management (AnNet 2016), co-located
              with IEEE/IFIP Network Operations and Management Symposium
              (NOMS 2016), April 2016.

   [polygraph]
              "Network Polygraph".

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987.

   [RFC2818]  Rescorla, E., "HTTP Over TLS", RFC 2818,
              DOI 10.17487/RFC2818, May 2000.

   [RFC3954]  Claise, B., Ed., "Cisco Systems NetFlow Services Export
              Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004.

   [RFC5246]  Dierks, T. and E. Rescorla, "The Transport Layer Security
              (TLS) Protocol Version 1.2", RFC 5246,
              DOI 10.17487/RFC5246, August 2008.

   [RFC6066]  Eastlake 3rd, D., "Transport Layer Security (TLS)
              Extensions: Extension Definitions", RFC 6066,
              DOI 10.17487/RFC6066, January 2011.

   [RFC6437]  Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme,
              "IPv6 Flow Label Specification", RFC 6437,
              DOI 10.17487/RFC6437, November 2011.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013.

   [RFC7749]  Reschke, J., "The "xml2rfc" Version 2 Vocabulary",
              RFC 7749, DOI 10.17487/RFC7749, February 2016.

Authors' Addresses

   Sheng Jiang (editor)
   Huawei Technologies Co., Ltd
   Q 22, Huawei Campus, No.156 Beiqing Road
   Hai-Dian District, Beijing, 100095
   P.R. China

   Email: jiangsheng@huawei.com


   Bing Liu
   Huawei Technologies Co., Ltd
   Q 22, Huawei Campus, No.156 Beiqing Road
   Hai-Dian District, Beijing, 100095
   P.R. China

   Email: leo.liubing@huawei.com


   Panagiotis Demestichas
   University of Piraeus
   Piraeus
   Greece

   Email: pdemestichas@gmail.com


   Jerome Francois
   Inria
   615 rue du jardin botanique
   54600 Villers-les-Nancy
   France

   Email: jerome.francois@inria.fr


   Giovane C. M. Moura
   SIDN Labs
   Meander 501
   Arnhem, 6825 MD
   The Netherlands

   Email: giovane.moura@sidn.nl


   Pere Barlet
   Network Polygraph
   Edifici K2M - Parc UPC
   Jordi Girona, 1-3, Barcelona 08034
   Spain

   Email: pbarlet@polygraph.io