At Aeris Secure, we really enjoy Amazon's infrastructure, AWS. Like most organizations, we went from taking pride in our nice physical servers to eventually resenting the trips to the colocation facility that cloud providers make unnecessary. We enjoy all of the services that can be enabled with the flip of a switch, most of which require little configuration and monitoring overhead to keep running.
However, every now and again an issue crops up that forces you to learn something new. This happened to me while configuring some of our Nessus vulnerability scanners in AWS EC2. I noticed that while running scans over the public internet I would sometimes lose connectivity to the web management interface of the scanner.
My first thought was that the instance must be undersized. Even though it met the vendor's minimum specs, the t2.small instance we were using was definitely on the low end for both CPU and RAM. The interesting thing was that even during decent-sized scans, CPU usage wouldn't go over 10% and memory would barely budge (Nessus in particular has made major improvements on this front over the last few major releases). The other interesting thing was that established SSH sessions remained unaffected.
I have a good friend who also runs vulnerability scanners from AWS, and he didn't seem to be running into the same issue. The only differences between his configuration and mine were the operating system (Debian for mine, Amazon Linux for his), the network interface configuration (a single NIC for mine, dual for his), and a slight difference in instance sizing.
My first major step of troubleshooting was to change the size of the instance. I experimented with a number of different options, all the way up to an m3.xlarge, which has four vCPUs and 15 GB of RAM. While the larger instances did fare a little better, they did not resolve the issue I was experiencing with the unresponsive web interface.
My next bit of troubleshooting involved installing another web service on the same system and attempting to connect to it while I was experiencing the issue with the scanner's web interface. The results of this confirmed that the problem wasn't just limited to Nessus's web interface.
This led me to my theory that this wasn't a bandwidth or sizing issue, but a connection table issue. If you have ever run some intense scans with nmap or a vulnerability scanner from behind a firewall, especially a SOHO device, you may have noticed that "the internet went down". This is the issue I'm describing. Firewalls and NAT devices have to maintain a table tracking all initiated connections so that they can allow traffic back through (firewall) or associate the connection with the original sender (NAT).
Firewalls and routers using NAT have a couple of different performance ratings related to this table:
- Max concurrent sessions
- New sessions per second
The first describes how many connections the device can track before the table is full and no new connections through the device are permitted. The second covers the rate at which new connections can be established. New sessions per second is much more critical to scanning than max concurrent sessions, because establishing new sessions takes more resources and port scanning can easily overwhelm devices that can't keep up.
When new sessions are being requested faster than the network device can create them, things start to feel weird if you don't know what's happening. Connections that are already established, such as SSH sessions or other services that hold a connection open, will continue to function as normal. Services that rely on frequent new connections, such as web browsing, will most likely not work at all.
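To make that failure mode concrete, here is a toy model of a stateful device's session table with a cap on new sessions per tick. This is entirely illustrative (no real firewall works this simply; the names and the rate-limit mechanism are mine):

```python
class SessionTable:
    """Toy model of a stateful firewall/NAT device: traffic on an
    established session always passes, but only `rate_limit` NEW
    sessions may be created per tick (think: per second)."""

    def __init__(self, rate_limit):
        self.rate_limit = rate_limit
        self.sessions = set()       # established flows
        self.new_this_tick = 0      # new sessions created this tick

    def tick(self):
        # A new second begins; the new-session budget resets.
        self.new_this_tick = 0

    def send(self, flow):
        """Returns True if the packet gets through."""
        if flow in self.sessions:
            return True             # established (e.g. SSH): unaffected
        if self.new_this_tick >= self.rate_limit:
            return False            # device can't keep up: dropped
        self.sessions.add(flow)
        self.new_this_tick += 1
        return True
```

An existing SSH session keeps working no matter how hard you hammer the new-session budget, while fresh web connections beyond the limit are silently dropped, exactly the symptom described above.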
This is exactly what was happening in AWS. AWS has two features that could be the bottleneck here:
- Security groups
- Elastic IP addresses
Security groups are basically stateful firewalls configurable per instance. Elastic IP addresses are essentially NATed IP addresses whose instance association you can easily change. My understanding is that both of these have a negative impact on port-scanning throughput. This limitation will almost never come into play with "normal" usage; port scanning is almost an anomaly in how many new connections it generates (one for each port on each host you'd like to check).
This issue was confirmed by using a simple script that counted the number of packets per second the instance was sending. This isn't a perfect proxy for the number of connections that need to be set up, but on a scanner, the vast majority of packets sent during the discovery or "ping" phase are going to be new connections to new ports.
The script just loops and prints the difference between successive values of /sys/class/net/eth0/statistics/tx_packets, which is where the Linux kernel maintains a count of the packets sent by the interface.
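A minimal sketch of that script (my own reconstruction; the interface name and the one-second interval are assumptions):

```python
import time

def read_tx_packets(iface="eth0"):
    """Packets transmitted by the interface, as counted by the kernel."""
    with open(f"/sys/class/net/{iface}/statistics/tx_packets") as f:
        return int(f.read())

def pps(prev, curr, elapsed=1.0):
    """Packets per second between two counter samples."""
    return int((curr - prev) / elapsed)

def watch(iface="eth0", interval=1.0):
    """Loop forever, printing the send rate once per interval."""
    prev = read_tx_packets(iface)
    while True:
        time.sleep(interval)
        curr = read_tx_packets(iface)
        print(f"{pps(prev, curr, interval)} pkts/s")
        prev = curr
```

Calling watch() on the instance while a scan runs gives a rough, live view of the send rate.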
Later I came up with a more accurate measuring script that tracks TCP connection attempts directly, by counting SYN packets. It combines tcpdump with a little Python (note tcpdump's -l flag, which line-buffers the output so the Python side sees packets as they arrive):

sudo tcpdump -l -i eth0 -n "tcp[tcpflags] & tcp-syn != 0" | python3 -c "
import sys, time
t = time.time(); seen = 0; last = 0
for line in sys.stdin:
    seen += 1
    now = time.time()
    if now - t > 1:
        print('SYN/s:', int((seen - last) / (now - t)), '\tsince', int(now - t), 'secs ago')
        last = seen
        t = now
"
The only issue with this script is that it doesn't always print its update exactly every second, because reading stdin is a blocking call. Since this annoyed me, I eventually wrote a final version in Go (golang) to overcome this.
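The Go version isn't reproduced here, but the fix it applies, decoupling the counting from the reporting, can be sketched in Python with a worker thread. The class and function names below are mine, purely for illustration:

```python
import sys
import threading
import time

class SynCounter:
    """Tallies lines (one per SYN packet from tcpdump) on a worker
    thread, so the once-per-second report is never delayed by the
    blocking read on stdin."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def consume(self, stream):
        for _ in stream:
            with self._lock:
                self._count += 1

    def total(self):
        with self._lock:
            return self._count

def report(counter, interval=1.0):
    """Print SYNs/second at a fixed cadence, independent of input."""
    last = 0
    while True:
        time.sleep(interval)
        now = counter.total()
        print(f"SYN/s: {int((now - last) / interval)}")
        last = now

def main():
    counter = SynCounter()
    threading.Thread(target=counter.consume,
                     args=(sys.stdin,), daemon=True).start()
    report(counter)
```

Saved as a script (with main() invoked at the bottom), it slots into the same pipeline in place of the one-liner above.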
Once I had this handy tool, I started scanning with different instance sizes in an attempt to narrow down the limit on connections per second. (No, AWS does not officially specify this anywhere, and the limit most likely fluctuates with the load on their cloud at any given time.) The major discovery at this stage was that the difference between a t2.small and an m3.xlarge was negligible, even though the network rating of the t2.small is "Low to Moderate" and that of the m3.xlarge is "High". This suggests the network rating has more to do with bandwidth and much less to do with connections-per-second throughput.
I never did home in on exact numbers, but to give you an idea, the t2.small was able to sustain about 2000 packets per second, and the m3.xlarge maybe double that. Compared to the number of connections a default Nessus scan generates, these numbers are quite small: you will easily see rates in the 20k-30k range on a Nessus scan with 20 concurrent hosts and no other tuning. As a reference, a higher-end SOHO device like a Fortigate 60D can handle about 4000 new connections per second, and you're looking at pretty expensive devices to get past 20k or 30k.
To get the connection rate down to an acceptable level, I settled on the following policy settings in Nessus which should allow for a couple of concurrent scans:
- Max simultaneous hosts per scan: 15
- Max number of concurrent TCP sessions per host: 3
The concurrent TCP sessions per host setting also controls the number of connections per second during the discovery phase of the scan. These settings kept my packets-per-second measurement to around 500. Interestingly, increasing the concurrent TCP sessions per host beyond 3 always seemed to have a disproportionate effect on the packets per second, regardless of the number of hosts, and would usually result in an unresponsive web interface.
There are a few conclusions to draw. First, AWS, while great for a great many things, is probably not ideal for port scanning. The security and convenience come at the cost of connections per second. For cloud scanning, you'd be much better off with a Virtual Private Server (VPS) provider that gives the instance a public IP directly and therefore doesn't hinder connections per second.
Second, with proper tuning, vulnerability scanning in AWS is quite possible. The scan throttle may actually be appreciated by the target, but it will definitely impact your ability to do concurrent scanning.
Third, watch your connections per second. This is equally important when scanning internal networks that include multiple subnets separated by a firewall.
Fourth, there is a bigger problem with initiating connections faster than the network can handle than an unresponsive web interface: if the infrastructure cannot support the connections, responses for open ports never come back and the results will be off.
Finally, why was my buddy's AWS instance outperforming our own? It turned out the second network interface was being used to access the scanner's web interface while the primary interface was used for actual scanning. He was experiencing the same issues; he just didn't get any feedback about them in the form of an unresponsive web interface.