Network Working Group                                        C. Jennings
Internet-Draft                                                     Cisco
Intended status: Informational                              July 8, 2016
Expires: January 9, 2017

             ICE and STUN Timing Experiments for WebRTC
                  draft-jennings-ice-rtcweb-timing-01

Abstract

   This draft summarizes the results of some experiments looking at the
   impact of proposed changes to ICE based on the latest consumer NATs.
   It looks at the amount of non congestion controlled bandwidth a
   browser can use and the impacts of that on ICE timing.  This draft
   is not meant to become an RFC.  It is purely information to help
   guide development of other specifications.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 9, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Background
     2.1.  A multi PC use case
   3.  NAT Connection Rate Results
   4.  ICE Bandwidth Usage
     4.1.  History of RFC 5389
     4.2.  Bandwidth Usage
     4.3.  What should the global rate limit be
     4.4.  Rate Limits
     4.5.  ICE Synchronization
   5.  Recommendations
   6.  Future work
   7.  Conclusions
   8.  Acknowledgments
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Bandwidth testing
   Author's Address

1.  Introduction

   The ICE WG at IETF has been considering speeding up the rate of
   starting new STUN and TURN connections when doing ICE.  The two
   primary questions that have been raised about this are: 1) can the
   NATs create new connections fast enough to not cause problems, and
   2) what will the impact on bandwidth usage be?

2.  Background

   A web page using WebRTC can form multiple PeerConnections.
   Each one of these starts an ICE process that initiates STUN
   transactions towards the various IPs and ports that are specified by
   the JavaScript of the web page.  Browsers do not limit the number of
   PeerConnections but do limit the total amount of STUN traffic that
   is sent with no congestion control.  This draft assumes that
   browsers will limit this traffic to 250 kbps, though right now
   implementations seem to exceed that when measured over a 100 ms
   window.

   Each PeerConnection starts a new STUN transaction periodically until
   all the ICE testing is done.  [RFC5245] limits the pacing of these
   transactions to 20 ms or more, while [I-D.ietf-ice-rfc5245bis]
   proposes moving the minimum time to 5 ms.  Retransmissions of
   previous STUN transactions can happen in parallel with this.  The
   STUN specification [RFC5389] specifies 7 retransmissions, each one
   doubling the timeout, starting with a 500 ms retransmission time
   unless certain conditions are met.  This was put in the RFC to meet
   the requirements of the IESG from long ago and is largely ignored by
   existing implementations.  Instead, systems do several
   retransmissions (6 for Firefox, 8 for Chrome) with a retransmission
   time starting at 100 ms and doubling on every retransmission.  Often
   there is a limit on the maximum retransmission time; Chrome, for
   example, only doubles the retransmission time up to a limit of
   1600 ms, after which it stops increasing the time between
   retransmissions.

   The size of STUN packets can vary based on a variety of options
   selected, but the packets being used by browsers today for IPv4 are
   about 70 bytes for the STUN requests.
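The retransmission schedules described above can be written out explicitly.  The sketch below is purely illustrative: the timer values (500 ms doubling for [RFC5389], 100 ms doubling with a 1600 ms cap and 8 retransmits for the Chrome-like behavior) are taken from the text above, not from any browser source, and the function name is invented for this example.

```javascript
// Compute the times (ms after the first send) at which a single STUN
// transaction is retransmitted, given an initial retransmission timeout,
// a number of retransmissions, and an optional cap on the doubling.
function retransmitTimes(initialRtoMs, retransmits, capMs = Infinity) {
  const times = [];
  let t = 0;
  let delay = initialRtoMs;
  for (let i = 0; i < retransmits; i++) {
    t += delay;
    times.push(t);
    delay = Math.min(delay * 2, capMs);
  }
  return times;
}

// RFC 5389 style schedule as described above: 500 ms initial, doubling.
console.log(retransmitTimes(500, 7));
// → [500, 1500, 3500, 7500, 15500, 31500, 63500]

// Chrome-like schedule per the text: 100 ms initial, doubling capped at
// 1600 ms, 8 retransmissions.
console.log(retransmitTimes(100, 8, 1600));
// → [100, 300, 700, 1500, 3100, 4700, 6300, 7900]
```

Note how the cap matters: the uncapped schedule backs off for over a minute, while the capped one settles into a steady 1600 ms interval and finishes in under 8 seconds.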
   (Note: some other drafts have significantly higher numbers for this
   size, so some investigation is likely needed to determine what the
   correct number is.)

   As the pacing is sped up to 5 ms, it increases the number of new
   mappings the NAT needs to create as well as increasing the non
   congestion controlled bandwidth used by the browser.  The rest of
   this draft looks at what sort of issues may or may not come out of
   this.  Additional information about ICE in WebRTC can be found in
   [I-D.thomson-mmusic-ice-webrtc].

2.1.  A multi PC use case

   A common design for small conferences is to have a full mesh of
   media formed between all participants, where each participant sends
   their audio and video to all other participants, who mix and render
   the results.  If there are 9 people on a conference call and a 10th
   one joins, one design might be for the new person to form 9 new
   PeerConnections in parallel - one to each existing participant.
   This might result in 9 ICE agents each starting a new STUN
   transaction every 5 ms.  Assuming no retransmissions, that is a new
   NAT mapping every 5 ms / 9 agents = 0.56 ms, and 9 agents * 70 bytes
   per packet * 8 bits per byte / 5 ms comes to about 1 mbps.  As many
   of the ICE candidates are expected not to work, they will result in
   the full series of retransmissions, which will drive the bandwidth
   usage up significantly.  The browser would rate limit this traffic
   by dropping some of it.

   An alternative design would be to form these connections to the 9
   people in the conference sequentially.  Given the bandwidth
   limitations and other issues, later parts of this draft propose
   that, if we move the pacing to 5 ms, the WebRTC drafts probably need
   to caution developers that parallel implementations with this many
   peers are likely to have failures.
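The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic.  The 70-byte packet size and 5 ms pacing come from earlier sections; the function name is invented for this sketch.

```javascript
// Aggregate NAT mapping rate and bandwidth for N ICE agents all pacing
// new STUN transactions at the same interval, ignoring retransmissions.
const PACKET_BYTES = 70; // approximate IPv4 STUN request size (see above)

function meshIceLoad(agents, pacingMs) {
  const mappingsPerSecond = agents * (1000 / pacingMs);
  return {
    mappingsPerSecond,
    msPerNewMapping: 1000 / mappingsPerSecond,
    bitsPerSecond: mappingsPerSecond * PACKET_BYTES * 8,
  };
}

const load = meshIceLoad(9, 5); // the 9-party mesh described above
console.log(load.msPerNewMapping.toFixed(2)); // → "0.56" ms per new mapping
console.log(load.bitsPerSecond); // → 1008000, i.e. about 1 mbps
```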
   With the current timings, doing this in parallel often works, and
   there are applications that do it in parallel that will likely need
   to change if the timing changes.

3.  NAT Connection Rate Results

   The first set of tests is concerned with how many new mappings the
   NAT can create.  The 20 ms limit in [RFC5245] was based on the fact
   that going faster than that exceeded the rate at which NATs widely
   deployed at the time could create new mappings.

   The tests for this draft were run on the very latest model NATs from
   Asus, DLink, Netgear, and Linksys.  These four vendors were selected
   due to the large market share they represent.  This is not at all
   representative of what is actually deployed in the field today, but
   it represents what we will see widely deployed in the next 3 to 7
   years as this generation of NATs moves into the marketplace and down
   into the lower end NATs in the product lines.  It is also clear that
   in some geographies, a national broadband provider may use some
   globally less common NAT, causing that vendor's NAT to be prevalent
   in a given country even if it is not common worldwide.

   Tests were run using only wired interfaces and consisted of
   connecting both sides of the NAT to two different interfaces on the
   same computer and using a single program to send packets in various
   directions as well as measure the exact arrival times of packets.
   Key results were verified using Wireshark to look at wire captures
   made on a separate computer.

   The first test was a normal set of tests to classify the type of the
   NAT for the cases where 1, 2, and 3 internal clients all have the
   same source port.  The second test created many new mappings to
   measure the maximum rate at which mappings could reliably be made.

   The conclusion of the first test was that all of the NATs tested
   were compliant (for UDP) with [RFC4787] with regards to mapping and
   filtering allocations.  This is great news as well as a strong
   endorsement of the success of the BEHAVE WG.
   The fact that we see a non trivial percentage of non-BEHAVE-
   compliant NATs deployed in the field does highlight that this sample
   set of NATs is not a representative sample of what is deployed.  It
   does suggest that we should see a reduced use of TURN servers over
   time.

   On the second test, all the NATs tested could reliably create new
   mappings in under 1 ms - often more like several hundred
   microseconds.  The NATs do drop packets if the rate of new mappings
   gets too high, but for all the NATs tested, this rate was faster
   than 1000 mappings per second.  Looking at the code of one NAT, this
   largely seems to be due to the large increase in clock speed of the
   CPUs in the NATs tested here vs the NATs tested in 2005 in
   [I-D.jennings-behave-test-results].

   This implies that as long as there are 5 or 10 or fewer
   PeerConnections doing ICE in parallel in a given browser, we do not
   anticipate problems on the tested NATs from moving the ICE pacing to
   5 ms.

4.  ICE Bandwidth Usage

4.1.  History of RFC 5389

   At the time [RFC5389] was done, the argument made was that it was OK
   for STUN to use as much non congestion controlled bandwidth as RTP
   audio was likely to, as the STUN was merely setting up a connection
   for an RTP phone call.  The premise was that the networks IP phones
   were used on were designed to have enough bandwidth to work
   reasonably with the audio codecs being used, and that the RTP audio
   was not elastic and not congestion controlled in most
   implementations.  There was a form of "user congestion control" in
   that if your phone call sounded like crap because it was having 10%
   packet loss, the user ended the call, tried again, and if it was
   still bad gave up and stopped causing congestion.

   Since that time the number of candidates used in ICE has
   significantly increased, the range of networks ICE is used over has
   expanded, and uses have increased.
   We have also seen much more widespread use of FEC, which allows high
   packet loss rates with no impact on the end user perception of media
   quality.  In WebRTC there are applications such as file sharing and
   background P2P backup that form data channel connections using ICE,
   with no human interaction to stop them if the packet loss rate is
   high.  ICE in practical usage has expanded beyond a tool for IP
   phones to become the preferred tool on the internet for setting up
   end to end connections.

4.2.  Bandwidth Usage

   To prevent things like DDoS attacks on DNS servers, WebRTC browsers
   limit the non congestion controlled bandwidth of STUN transactions
   to an unspecified number, but it seems that browsers currently plan
   to set this to 250 kbps.  An advertisement running on a popular
   webpage can create as many PeerConnections as it wants and specify
   the IP and port to send all the STUN transactions to.  Each
   PeerConnection object sends UDP traffic to an IP and port specified
   in the JavaScript, which the browser limits by dropping packets that
   exceed the global limit for the browser.  It seems that the current
   plans for major browsers would allow the browser to send 250 kbps of
   UDP traffic when there is 100% packet loss.  Currently they send
   more than this.  As far as I can tell, there is no specification
   defining what this limit should be in this case.

4.3.  What should the global rate limit be

   It is clear that sending 250 kbps on an 80 kbps edge cellular
   connection severely impacts other applications on that connection
   and is not even remotely close to TCP friendly.  In the age of
   cellular wifi hot spots and highly variable backhaul, the browser
   has very little idea of what the available bandwidth is.  This draft
   is not in any way suggesting what the bandwidth limit should be, but
   it is looking at the implications for ICE timing based on that
   number.
   The limit has security implications, in that a browser loading
   JavaScript in paid advertisements on popular web sites could use
   this traffic to DDoS a server.  The limit has transport
   implications, in how it interacts with other traffic on networks
   whose capacity is close to or less than this limit.  More
   information on this topic can be found in
   [I-D.ietf-tsvwg-rfc5405bis] and
   [I-D.ietf-avtcore-rtp-circuit-breakers].

4.4.  Rate Limits

   Having a global bandwidth limit for the browser, which if exceeded
   will drop packets, means that applications need to stay under this
   rate limit, or the loss of STUN packets will cause ICE to start
   mistakenly thinking there is no connectivity on flows that do work.

   Consider the case with two NICs (cellular and wifi), each with a v4
   and a v6 address, and a reachable TURN server on each.  This gives
   12 candidates, and if the other side is the same, there are six v6
   addresses matching on the other side, so 36 pairs for v6 and the
   same for v4, resulting in 72 pairs for ICE to check (assuming full
   bundle, RTCP mux, etc.).  The number of pairs we will see in
   practice in the future is a somewhat controversial topic, and the 72
   here was a number pulled out of a hat, not based on any real tests.
   There is probably a better number to use.

   A simple simulation of this, where none of the connections work,
   suggests that the peak bandwidth over 100 ms windows is about
   112 kbps if the pacing is 20 ms, while it goes to about 290 kbps if
   the pacing is 5 ms.  This is the bandwidth used by a single ICE
   agent, and there could easily be multiple ICE agents running at the
   same time in the same tab or across different tabs in the browser.
   The point is that the global rate limit would need to be much higher
   than 250 kbps to move to a 5 ms pacing and have it reliably work
   with multiple things happening at the same time in the browser.
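The simple simulation referred to above is not published with this draft; the sketch below is a rough re-creation of it under stated assumptions: 72 pairs, a Chrome-like retransmission schedule (100 ms initial, doubling, capped at 1600 ms, 8 retransmits), 70-byte packets, and every check failing.  The function names are invented for this example.

```javascript
const PACKET_BYTES = 70; // approximate STUN request size from Section 2

// All send times (ms) for one check: the original plus retransmissions
// at 100 ms doubling to a 1600 ms cap, per the Chrome-like behavior
// described in Section 2.
function checkSendTimes(startMs, retransmits = 8, rto = 100, cap = 1600) {
  const times = [startMs];
  let t = startMs;
  let delay = rto;
  for (let i = 0; i < retransmits; i++) {
    t += delay;
    times.push(t);
    delay = Math.min(delay * 2, cap);
  }
  return times;
}

// Peak bandwidth (bps) over any 100 ms window for `pairs` checks paced
// `pacingMs` apart, assuming every check times out and retransmits fully.
function peakBandwidth(pairs, pacingMs, windowMs = 100) {
  const sends = [];
  for (let p = 0; p < pairs; p++) {
    sends.push(...checkSendTimes(p * pacingMs));
  }
  sends.sort((a, b) => a - b);
  let peak = 0;
  for (let i = 0; i < sends.length; i++) {
    // Count sends falling in the window starting at this send time.
    let j = i;
    while (j < sends.length && sends[j] < sends[i] + windowMs) j++;
    peak = Math.max(peak, j - i);
  }
  return (peak * PACKET_BYTES * 8) / (windowMs / 1000);
}

console.log(peakBandwidth(72, 20)); // → 112000 bps at 20 ms pacing
console.log(peakBandwidth(72, 5)); // → 291200 bps at 5 ms pacing
```

Under these assumptions the sketch lands close to the 112 kbps and 290 kbps figures quoted above; different retransmission parameters would shift the numbers but not the conclusion that 5 ms pacing alone can exceed a 250 kbps global limit.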
4.5.  ICE Synchronization

   NATs and firewalls create very short windows of time in which a
   response to an outbound request is allowed in and allowed to create
   a new flow.  (We do not have current measurements of this but
   suspect the time is on the order of 500 ms to 5 seconds.)  Though
   this draft did not test these timings on major firewalls, some
   information indicates these windows are being reduced as time goes
   on, possibly to provide a shorter window for certain types of
   attacks.

   ICE takes advantage of this by having one side send a suicide packet
   that will be lost but will create a short window of time in which,
   if the other side sends a packet, it will get in through the window
   created by the suicide packet and allow a full connection to form.
   To make this work, the timing of the packets from either side needs
   to be closely coordinated.  Most of the complexity of the ICE
   algorithm comes from trying to coordinate both sides such that they
   send the related packets at similar times.

   A key implication of this is that if several ICE agents are running
   in a single browser, what is happening in the other ICE agents can't
   change the timing of what a given ICE agent is sending.  Or at
   least, the amount of skew introduced can't cause the packets to fall
   outside the timing windows of the NATs and firewalls.  So any
   solution that slowed down the transmission in one PeerConnection
   when there were lots of other simultaneous PeerConnections may have
   issues unless the far side also knows to slow down.

   Figuring out how much the timing can be skewed between the two sides
   requires measuring how long the window is open on the NATs and
   firewalls.  Currently we do not have good measurements of this
   timing, and it is not possible to evaluate how much of an issue this
   is without that information.

5.  Recommendations

   The ICE and RTCWeb Transport documents should specify a clear upper
   bound on the amount of non congestion controlled traffic a browser
   or application should be limited to.
   The transport and perhaps security areas should provide advice on
   what that number should be.  WebRTC applications basically work
   better the larger that number is, at the expense of other
   applications running on the same congested links.

   There is no way for a JavaScript application to know how many other
   web pages or tabs in the browser are also doing STUN, yet all of
   these count against the global rate limit in the browser.  If the
   browser discards STUN packets because the global rate limit has been
   exceeded, it results in application failures that look like network
   problems but are in fact just an artifact of other applications
   running in the browser at the same time.  This is critical
   information for understanding why applications are failing.  The
   recommendation here is that the WebRTC API be extended to provide a
   way for the browser to inform the application using a given
   PeerConnection object if STUN packets that PeerConnection is sending
   are being discarded by the browser.

6.  Future work

   It would be nice to collect measurements on how long NATs and
   firewalls keep mappings with no response open.

   It would be nice to simulate how much skew a global pacer would
   introduce into the timing of ICE packets and whether that would
   reduce non relay connectivity success rates.

7.  Conclusions

   The combination of a low ICE pace timing, lots of PeerConnections,
   and many candidates will cause problems.  The optimal way to balance
   this depends on factors such as how much non congestion controlled
   bandwidth we should assume is available.

   The speed of NAT mapping creation going forward is likely adequate
   to move the pacing to 5 ms.  However, applications that create
   parallel PeerConnections, or situations where more than a handful of
   PeerConnections are forming in parallel in the same browser
   (possibly in different tabs or web pages), need to be avoided.
   From a bandwidth limit point of view, if the bandwidth is limited to
   250 kbps, a 5 ms timing will work for a single PeerConnection but
   not much more than that.  The specification should make developers
   aware of this limitation.  If the non congestion controlled
   bandwidth limit is less than 250 kbps, a 5 ms timing is likely too
   small to work reliably, particularly with multiple ICE agents
   running in the browser.

8.  Acknowledgments

   Many thanks to Eric Rescorla for review and the simple simulator,
   and to Ari Keraenen and Harald Alvestrand.

9.  References

9.1.  Normative References

   [RFC5245]  Rosenberg, J., "Interactive Connectivity Establishment
              (ICE): A Protocol for Network Address Translator (NAT)
              Traversal for Offer/Answer Protocols", RFC 5245,
              DOI 10.17487/RFC5245, April 2010.

   [RFC5389]  Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
              "Session Traversal Utilities for NAT (STUN)", RFC 5389,
              DOI 10.17487/RFC5389, October 2008.

9.2.  Informative References

   [I-D.ietf-avtcore-rtp-circuit-breakers]
              Perkins, C. and V. Singh, "Multimedia Congestion Control:
              Circuit Breakers for Unicast RTP Sessions", draft-ietf-
              avtcore-rtp-circuit-breakers-16 (work in progress), June
              2016.

   [I-D.ietf-ice-rfc5245bis]
              Keraenen, A., Holmberg, C., and J. Rosenberg,
              "Interactive Connectivity Establishment (ICE): A Protocol
              for Network Address Translator (NAT) Traversal",
              draft-ietf-ice-rfc5245bis-04 (work in progress), June
              2016.

   [I-D.ietf-tsvwg-rfc5405bis]
              Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
              Guidelines", draft-ietf-tsvwg-rfc5405bis-15 (work in
              progress), June 2016.

   [I-D.jennings-behave-test-results]
              Jennings, C., "NAT Classification Test Results", draft-
              jennings-behave-test-results-04 (work in progress), July
              2007.
   [I-D.thomson-mmusic-ice-webrtc]
              Thomson, M., "Using Interactive Connectivity
              Establishment (ICE) in Web Real-Time Communications
              (WebRTC)", draft-thomson-mmusic-ice-webrtc-01 (work in
              progress), October 2013.

   [RFC4787]  Audet, F., Ed. and C. Jennings, "Network Address
              Translation (NAT) Behavioral Requirements for Unicast
              UDP", BCP 127, RFC 4787, DOI 10.17487/RFC4787, January
              2007.

Appendix A.  Bandwidth testing

   The following example web page was used to measure how much
   bandwidth a browser will send to an arbitrary IP and port when
   getting 100% packet loss to that destination.  It creates 100
   PeerConnections that all send STUN traffic to port 10053 at
   10.1.2.3.  It then creates a single data channel for each one and
   starts the ICE machinery by creating an offer and setting that to be
   the local SDP.  The page's only visible content is the text "This is
   STUN traffic demo", and the resulting traffic can be captured with:

      sudo tcpdump -i en1 -s 0 -w /tmp/dump.pcap 'port 10053'
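The original page listing does not survive in this copy; the following is a sketch of its script logic reconstructed only from the description above.  The black-hole address 10.1.2.3:10053 comes from the text, while the function name, the channel labels, and the injectable constructor parameter (so the logic can be exercised outside a browser) are illustrative assumptions.

```javascript
// Start `count` ICE agents that all direct their STUN checks at a
// black-hole STUN server, per the description above. The PeerConnection
// constructor is passed in so the sketch can run against a stub as well
// as a real browser's RTCPeerConnection.
function startAgents(count, PeerConnection) {
  const agents = [];
  for (let i = 0; i < count; i++) {
    const pc = new PeerConnection({
      iceServers: [{ urls: "stun:10.1.2.3:10053" }], // hypothetical sink
    });
    pc.createDataChannel("probe" + i); // gives ICE something to negotiate
    // Creating an offer and setting it as the local description starts
    // ICE gathering and the STUN traffic.
    pc.createOffer().then((offer) => pc.setLocalDescription(offer));
    agents.push(pc);
  }
  return agents;
}

// In a browser: startAgents(100, RTCPeerConnection);
```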
Author's Address

   Cullen Jennings
   Cisco

   Email: fluffy@iii.ca