Every now and then I get back into the Azure ecosystem to see what is there. I recently came across their Front Door product, which looked like a competitor to Cloudflare or a rehash of their Application Gateway. To be honest, there are so many products that do so many things that it all kind of blends together.
This is not a post about Front Door. It only plays a supporting role in this little story of how DDoS events are often self-imposed.
Anyway, I thought the Front Door product looked interesting, so I began to set up a test against my little screenshot side project.
It takes a few minutes to deploy a Front Door, but once it was up I tried it out.
504 Gateway errors...
I head into the web logs on Traefik and see that I'm getting hammered from a number of IPs in the same Class B subnet (147.xx.00.00/16). It occurs to me that the Front Door edge servers are attempting to pull content from the site, but this little VM and application cannot keep up.
Reason 1 - No caching (Application architecture)
There is no caching in the application. Traefik does not bother with caching, so the issue is in my little Django app. The app has no caching configured, and it neither calculates ETags nor offers any kind of 304 response. That means every page request hammers Postgres, fetches the data, and builds the page from scratch. This should not be a big problem, except that I'm storing images as bytes in the database (because reasons). That means pushing more data than expected from Postgres to Django, assembling the page, and sending it back. Oops.
- Add in-memory page caching for the home page.
- Add 304 calculation based on the most recently added screenshot. This is called conditional view processing.
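To make the 304 idea concrete, here is a minimal sketch in plain Python rather than the actual Django code from the project. The function name `conditional_response` and its parameters are made up for illustration. In a real Django app, the `last_modified` decorator from `django.views.decorators.http` covers this same ground, so treat the following as an explanation of the mechanism, not a replacement for it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, latest_screenshot_at, render_page):
    """Return (status, headers, body), skipping the render when nothing changed.

    if_modified_since: raw If-Modified-Since header value, or None.
    latest_screenshot_at: timezone-aware datetime of the newest screenshot.
    render_page: zero-argument callable that builds the expensive page body.
    """
    # HTTP dates are expressed in GMT with one-second resolution.
    last_modified = latest_screenshot_at.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    client_copy_at = None
    if if_modified_since:
        try:
            client_copy_at = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            client_copy_at = None  # unparseable header: send the full page
    if client_copy_at is not None and client_copy_at.tzinfo is None:
        # A "-0000" offset parses as naive; treat it as UTC.
        client_copy_at = client_copy_at.replace(tzinfo=timezone.utc)

    if client_copy_at is not None and client_copy_at >= last_modified:
        return 304, headers, b""  # page is never rendered, images are never read
    return 200, headers, render_page()
```

The point of the design is that the only work needed to answer a repeat request is finding the timestamp of the newest screenshot, which is far cheaper than pulling image bytes out of Postgres and rebuilding the page.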
Reason 2 - No rate-limiting on the Reverse Proxy (Traefik)
A quick calculation showed that the Front Door was making about 25 requests per second in an attempt to acquire the content. I could configure Traefik to rate limit based on IP, subnet, and a few other factors as well. Handing back errors to the CDN is not ideal, but rate-limiting in general is not a bad idea.
- Add rate-limiting for reasonable human usage of the site (e.g. about 15 requests/second, including all loading of assets).
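The actual limit would live in Traefik's configuration, not in application code, but the idea behind that kind of limiter is easy to sketch. Below is a rough token-bucket illustration in Python, keyed per client. The class names, the burst size of 15, and the injectable `clock` parameter are all assumptions made for the example; only the 15 requests/second figure comes from the bullet above.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start full so a first page load is not throttled
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last request, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a proxy would typically answer 429 Too Many Requests here


class PerClientLimiter:
    """One bucket per client key (an IP address, or a subnet prefix)."""

    def __init__(self, rate=15, burst=15, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets = {}

    def allow(self, client_key):
        if client_key not in self.buckets:
            self.buckets[client_key] = TokenBucket(self.rate, self.burst, self.clock)
        return self.buckets[client_key].allow()
```

Keying on a subnet prefix instead of a single IP would group all the edge servers from one range into a single bucket. Note that this sketch never evicts idle buckets, so a long-running version would need some cleanup.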
Reason 3 - The CDN was set to aggressively health check an improper endpoint
As you can imagine, getting the CDN configured is also tricky. I am the first to admit that I don't know what I'm doing all the time. Upon looking into it, I could see that the CDN was simply doing HEAD requests on the home page, which caused quite a load on my little server. The smarter thing to do here would have been to offer up an endpoint that simply returns 200 OK (or something that did not load my systems too heavily). The load became too much with 40+ edge servers requesting the same data every 30 seconds.
- Add a /health-check endpoint that only returns 200 OK on HEAD and GET requests. Configure Front Door to hit that instead.
- Decrease the frequency of the health check from every 30 seconds to every 300 seconds.
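For illustration, a health-check endpoint of this kind can be very small. The sketch below uses a bare WSGI callable so that it stands alone. In the Django app it would more likely be a tiny view that returns an empty response without touching the database. The name `health_check_app` is hypothetical, and the 404/405 handling is an assumption about what "only" should mean here.

```python
def health_check_app(environ, start_response):
    """Tiny WSGI app: 200 OK for GET/HEAD on /health-check, no database access."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "")

    if path != "/health-check":
        start_response("404 Not Found", [("Content-Length", "0")])
        return []
    if method not in ("GET", "HEAD"):
        start_response("405 Method Not Allowed",
                       [("Allow", "GET, HEAD"), ("Content-Length", "0")])
        return []

    body = b"OK"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    # A HEAD response carries the same headers as GET but no body.
    return [] if method == "HEAD" else [body]
```

To poke at it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, health_check_app).serve_forever()` is enough. Combined with the longer interval, each probe becomes nearly free and there are ten times fewer of them.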
A CDN/App gateway by itself can do more harm than good. It is not a turn-key operation, and you should be prepared before kicking off a project like that. Consider making a risk/gap analysis and subsequent checklist before undertaking such a project. My checklist follows:
- Make sure the app offers a light-weight health check endpoint
- Make sure the CDN hits that health-check endpoint
- Set up sane rate-limiting on your reverse proxy and provide helpful gateway errors for when people hit the limit. (Reddit does a pretty good job here)
- Measure the performance of your site before / after the implementation. Was it worth all the trouble? Could you have just bought a bigger server for the same benefit and fewer moving parts?