This application has a critical third-party API integration that involves ingesting webhook events and acting on them. The issue was that the application was slow to respond ‘OK’ to these webhook requests. When the site came under load, the application would choke and fail to respond fast enough to the third-party service. The third-party API, seeing too many failures from the application, would then shut down the webhook integration, and manual intervention was needed to turn it back on.
To understand why the response times were so slow, the webhook intake response flow was analyzed to see exactly what it was doing and when. The majority of the logic for handling each event was already done in background jobs. But there were a few things the application’s webhook intake responder checked before creating a record of the event and sending it to an asynchronous worker.
One of these checks verified that the webhook event was not a duplicate. The code ran a simple database query to see whether any existing event had the same identifier from the third-party API. This identifier was unique, so the application was trying to enforce that uniqueness on its side.
And that simple query was what was causing the sluggish overall response time. This query showed consistently as an expensive one on Heroku’s Diagnose tab.
But why was this query so slow and expensive?
There was no database index on this column, so the database had to do far more work than necessary on every lookup. The solution then became clear: add an index to this unique field. Specifically, add a unique index.
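To make the cost of a missing index concrete, here is a toy model in plain Ruby. It is only an illustration: an Array stands in for the unindexed table, a Hash stands in for the index, and the identifiers are made up. Real database indexes are usually B-trees rather than hashes, but the effect on a single-value lookup is similar. In a Rails migration, an index like this is typically declared with `add_index` and its `unique: true` option.

```ruby
# Toy model only: an Array stands in for an unindexed table and a Hash
# stands in for an index. The identifiers are hypothetical.
EVENT_IDS = (1..100_000).map { |n| "evt_#{n}" }.freeze

# Unindexed lookup: walk the rows until a match is found, counting comparisons.
def scan_for(ids, target)
  comparisons = 0
  found = ids.any? do |id|
    comparisons += 1
    id == target
  end
  [found, comparisons]
end

# "Indexed" lookup: the structure is built once, then each check is one probe.
EVENT_INDEX = EVENT_IDS.each_with_index.to_h.freeze

def indexed_lookup(index, target)
  index.key?(target)
end

target = "evt_100000"
found, comparisons = scan_for(EVENT_IDS, target)
puts "scan:  found=#{found} after #{comparisons} comparisons"
puts "index: found=#{indexed_lookup(EVENT_INDEX, target)} with a single probe"
```

The scan has to touch every row before it reaches the last identifier, and the number of rows it touches grows with the table. The indexed lookup does not.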
With this unique index, we not only gained the speed benefits of an index, but also enforced uniqueness for this identifier field at the database level.
We did not stop there, though. The application was doing its own manual uniqueness check, but it could have been using Rails validations. We refactored the code so that it used Rails validations and did not do any manual checks. This simplified the code flow.
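The original code is not shown here, so the following is only a sketch of the shape of the simplified flow, written in plain Ruby so it runs on its own. The `EventStore` class, the `receive_webhook` method, and the identifiers are hypothetical stand-ins. In the real application the store would be an ActiveRecord model with a uniqueness validation (for example `validates :external_id, uniqueness: true`, where the column name is an assumption), and the queue would be a background job system. Acknowledging a duplicate with ‘OK’ is also an assumption.

```ruby
require "set"

# Hypothetical stand-in for a model whose identifier must be unique.
class EventStore
  def initialize
    @external_ids = Set.new
  end

  # Returns true when the event was recorded and false when it is a duplicate,
  # mirroring how a model's save returns false when a validation fails.
  # Set#add? returns nil if the value is already present.
  def save(external_id)
    !@external_ids.add?(external_id).nil?
  end
end

# Hypothetical intake step: record the event, hand it to a worker, acknowledge.
# There is no separate manual duplicate check; the save itself reports duplicates.
def receive_webhook(store, queue, external_id)
  queue << external_id if store.save(external_id) # stand-in for enqueueing a job
  :ok
end

store = EventStore.new
queue = []
receive_webhook(store, queue, "evt_123")
receive_webhook(store, queue, "evt_123") # redelivery of the same event
puts queue.inspect
```

Only the first delivery reaches the queue. The redelivery is still acknowledged, but no second job is created.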
And so, with all these changes, there were fewer steps for the webhook intake responder to run through, and the steps that remained were optimized at the database level.
After we deployed the index and the code changes, the change in webhook response time was staggering. Where it previously could take a few seconds to respond to an event, it now averages below 100 milliseconds.
And all it took was a database index and some minor code tweaks to really utilize that new index. It was a simple set of changes with a massive impact on the response time. Since then, there have been no issues with these webhooks or sluggish response times: the integration has been speedy and stable.
Quote from the application’s owner after a period of high webhook volume with the index in place:
Before the indexing, this would have brought our app almost to its knees. We would have had webhook response failures and long queue times upwards of 20 seconds. Now, the queue time rarely exceeded 300ms and there were zero failures. It’s performing great!