On Saturday, February 5th, we experienced an unprecedented level of ticket demand for the wildly popular Comic-Con International event in San Diego. Our system was over capacity for nearly the entire time the event was on sale, from 9am PST to 4pm PST. Tickets were selling during this window, albeit slowly and to much frustration from Comic-Con ticket buyers. Trust me, it feels awful sitting on the other side, watching this all go down and being limited in what we can do to alleviate the issue. Those of you who chose to make light of a tough situation really cheered us up, and we appreciated the feeling that, hey, we’re all human after all. Here’s a deep technical analysis of what happened.
As you may know, we ran a test event in December, offering 1,000 four-day passes to Comic-Con. We made it through that test, but there were some bumps. Since then, we have made several optimizations to our platform to increase throughput for massively popular events like Comic-Con International. Specifically, we did the following:
- Introduced an administrative “High Volume Mode” for events that allows TicketLeap to lock event data into cache. This includes data such as the event title and description and each ticket type’s title and description, so none of it needs to be fetched from the database on every request.
- Optimized all queries throughout the checkout process reducing joins, reducing subqueries, and adding indexes where appropriate.
- Increased the `ulimit` on our webservers to handle more connections into our platform
- Increased the number of fastcgi processes spawned for our application
- Introduced a queueing system to throttle orders per second to a manageable rate.
- Ran extensive load testing on pure request load, pure checkout load, and a blend of request/checkout load. We used a tool called BrowserMob. It’s awesome and we highly recommend checking it out.
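The queueing optimization above can be sketched as a simple token-bucket throttle. This is purely illustrative, not our production code; the `OrderThrottle` name and the rates are ours for illustration:

```python
import threading
import time

class OrderThrottle:
    """Admit at most `rate` orders per second; excess callers wait."""

    def __init__(self, rate):
        self.rate = rate
        self.tokens = float(rate)   # start with a full bucket
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        # Refill tokens based on elapsed time, then spend one;
        # if the bucket is empty, sleep until a token accrues.
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(1.0 / self.rate)

throttle = OrderThrottle(rate=5)  # admit at most ~5 checkouts per second
```

Each checkout worker calls `throttle.acquire()` before starting an order, so the database only ever sees a bounded order rate no matter how many requests pile up in front.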
Unfortunately, after all these optimizations we ran into another bottleneck on Saturday. Under extremely high load, nearly all of the connections in our MySQL database became tied up doing DNS resolution. So, why would MySQL perform DNS lookups?
Well, the reason these DNS lookups occur is that MySQL offers the ability to restrict logins to specific hostnames, and it therefore uses DNS to validate that an authenticating user is connecting from an allowed host. You can read more about how MySQL uses DNS here: http://dev.mysql.com/doc/refman/5.5/en/dns.html.
Why is this a problem? The DNS lookup is a blocking operation, and DNS may not be as fast as you need it to be under this kind of load. You can read more about this specific problem in detail here: http://hackmysql.com/dns. The problem appears to have existed in MySQL since version 4.
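To make the failure mode concrete, here is a toy model. This is pure illustration: `reverse_dns` stands in for the real `socket.gethostbyaddr()` call MySQL makes, and the 50 ms per-lookup cost is an assumed figure, not a measurement:

```python
import time

LOOKUP_COST = 0.05  # assumption: each uncached reverse lookup blocks for 50 ms

def reverse_dns(ip):
    """Stand-in for socket.gethostbyaddr(): a blocking network call."""
    time.sleep(LOOKUP_COST)
    return "host-" + ip.replace(".", "-") + ".example.com"

def handle_connection(ip):
    # MySQL resolves the client's hostname *before* checking credentials,
    # so every new connection pays the lookup cost up front.
    hostname = reverse_dns(ip)
    # ... credential check against the grant tables would happen here ...
    return hostname

start = time.monotonic()
for i in range(20):
    handle_connection("10.0.0.%d" % i)
elapsed = time.monotonic() - start
# 20 connections spend roughly 20 x 50 ms = 1 s on name resolution alone,
# before a single query has run.
```

Under a flood of new connections, that per-connection stall is exactly what showed up for us: database connections tied up in resolution rather than doing work.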
Since we leverage Amazon Auto Scaling, the hostnames of our web servers change frequently and we do not restrict access by hostname. We rely on Amazon’s RDS Security Groups to restrict database access to our EC2 Security Groups. Therefore, we do not need our database to perform these lookups. It is frequently mentioned across the net in posts covering this topic (including the links above) that setting the `--skip-name-resolve` flag will alleviate this issue.
For security reasons, the `--skip-name-resolve` flag is unavailable for modification in Amazon’s RDS product offering. Restricting access by hostname is, after all, a valid use case for a database, so this is by no means a bug in RDS. Rather, the constant DNS lookups have been an issue in versions of MySQL at least up to version 5.1. We would never have been able to sell all the Comic-Con tickets without the flexibility of the Amazon cloud. We really love working with AWS.
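For reference, on a self-managed MySQL server (not RDS) the workaround is a one-line addition to `my.cnf`:

```ini
[mysqld]
# Disable reverse DNS for client connections. Grant-table host entries
# must then be IP addresses (or %), not hostnames.
skip-name-resolve
```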
It is important to note that the `--skip-name-resolve` flag is a workaround, not a solution. The real question is: why does MySQL perform a DNS lookup for an authenticating user when there is simply no need to? If our grant table shows that the authenticating user is allowed access from host ‘%’, then MySQL should intelligently bypass the host lookup, correct? Well, it seems this is not the case in MySQL 5.1.
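The check we would expect is simple. Here is a rough model of the decision we think the server ought to make; this is our simplification for illustration, not MySQL’s actual logic or function names:

```python
def grant_host_needs_dns(host_pattern):
    """Return True only if matching this grant-table host entry
    actually requires knowing the client's hostname.

    '%' matches anything, and pure IP patterns can be checked
    against the client address directly -- neither needs a lookup.
    """
    if host_pattern == "%":
        return False
    # Treat entries made only of digits, dots, and wildcards as
    # IP patterns (e.g. '10.0.%'), which match on the address itself.
    if all(c.isdigit() or c in ".%_" for c in host_pattern):
        return False
    return True  # a real hostname like 'db.example.com' needs DNS

def connection_needs_dns(grant_hosts):
    # Only resolve if at least one applicable grant entry demands it.
    return any(grant_host_needs_dns(h) for h in grant_hosts)
```

With every one of our grants defined against host ‘%’, this check would skip resolution entirely, which is the behavior we were hoping for.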
After some research, it seems the `hostname_requires_resolving` function, which determines whether to bother with the DNS lookup, was rewritten for MySQL 5.5. You can see the commit here: http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/revision/2946#sql/sql_acl.cc. While we will not know for certain until we run some testing against a 5.5 MySQL build, it’s certainly possible that this rewritten method resolves the issue mentioned above. If anyone from the MySQL community familiar with this section of code would like to chime in, we’d love to hear your thoughts.
Another option to alleviate this issue, mentioned in the MySQL documentation, is to increase `HOST_CACHE_SIZE` and recompile `mysqld`. Clearly, this wasn’t an option for us on Saturday. Regardless, MySQL should skip DNS resolution unless it is necessary.
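The host cache’s effect is easy to model with a bounded memoization layer. As a sketch, Python’s `functools.lru_cache` stands in here for MySQL’s fixed-size host cache, and the lookup function is a stub rather than a real DNS call:

```python
from functools import lru_cache

LOOKUPS = {"count": 0}  # how many times we had to do a "real" lookup

@lru_cache(maxsize=128)  # analogous to compiling with a larger HOST_CACHE_SIZE
def cached_reverse_dns(ip):
    LOOKUPS["count"] += 1           # a real implementation would block here
    return "host-" + ip.replace(".", "-") + ".example.com"

# With few distinct client IPs (e.g. a handful of web servers), almost
# every lookup after the first is served straight from the cache.
for _ in range(1000):
    for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"):
        cached_reverse_dns(ip)
```

The flip side is that a cache sized smaller than the set of distinct client addresses thrashes, and every connection pays the blocking lookup again.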
There were several conversations on Twitter suggesting that we did not have enough servers. As it turns out, the issue was exacerbated by the number of servers. At 9:13 AM PST we decided to drop the number of web servers to four, and orders began to flow. This worked because the number of DNS lookups MySQL had to perform was reduced, and we were able to process ~200 tickets a minute under extremely heavy load. This was certainly not our ideal throughput, but we were thrilled to start selling tickets to Comic-Con.
Amazon launched RDS support for MySQL 5.5 late last week and we plan to do this major revision upgrade in the very near future – we just didn’t have the time to test and release the new version before Comic-Con. We will be upgrading shortly and are really excited about the new features.
Moving forward on this issue, AWS has reiterated that they will continue to actively work with TicketLeap to address this known MySQL issue. Additionally, we’ll be working directly with the AWS team to confirm this bug has been fixed in the 5.5 build.
This blog will be updated with our findings. In the meantime, feel free to reach out to me directly on Twitter.