503: Server Too Busy – Or: How I Learned to Stop Worrying and Love the Scheduled Restart
The call came in the way they always do: reports failing, dashboards blank, managers breathing heavily down someone else’s neck. “The reporting server is down. Again.”
This time, it was a large enterprise customer. They’d recently completed an upgrade and were now blessed with a new flavour of chaos: the SSRS service crashing with the charmingly vague message, “The request failed with HTTP status 503: Server Too Busy.” It wasn’t the first time we’d seen it. Other customers had hit it before: same story, different logo. But this time, it was our problem. And the clock was ticking.
A Mystery Wrapped in a 503
The symptoms were annoyingly intermittent. The application would connect fine one moment, and then fail spectacularly the next. There were no code changes. No new firewall rules. No low disk warnings. Just 503s and a handful of cryptic .NET exceptions.
We pulled logs. We pulled traces. We pulled our hair out.
Microsoft had already been involved for months. They’d pored over HTTP traces, debug logs, performance counters. They suggested a handful of potential causes (thread pool starvation, third-party DLL conflicts, memory pressure) but nothing stuck. No smoking gun.
Hardware Doesn’t Fix Software
A new server was built from scratch to take on some of the load. Fresh OS, freshly patched, deployed in January. And for a short while, it looked like that was the answer. Reports were flowing. No 503s.
But the celebration was short-lived. While the new box behaved, the rest of the estate continued to suffer. We had four SSRS report servers and six application servers in total. Most of the errors now clustered on two older report servers, and curiously, many occurred between 1:00 PM and 1:10 PM GMT, an eerily consistent timeframe.
We examined scheduled jobs. We checked for overlapping executions. We even entertained the idea that some shared cloud resource was having its daily existential crisis right around 1 PM, taking us along for the ride. Nothing obvious emerged.
The Accidental Cure
While troubleshooting, we noticed that restarting the SSRS service manually cleared up the issues. For a few hours, everything worked perfectly. Reports ran. Dashboards loaded. Logs were clean.
Then, inevitably, the errors came creeping back.
We tried adjusting memory settings. We recompiled custom DLLs. We removed extensions. Nothing held. Eventually, the solution became obvious: if restarting SSRS helps, maybe we should just… restart it. Regularly.
The Fix That Actually Worked
We set up a scheduled task on each report server using Windows Task Scheduler. It ran as the SYSTEM user with highest privileges and executed a simple script:
net stop "SQLServerReportingServices"
net start "SQLServerReportingServices"
These tasks were configured to run at staggered times across the day: 12:39 AM, 2:12 AM, 4:47 AM, and so on. We chose non-standard times deliberately, to reduce the chance of clashing with heavy report activity.
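For reference, registering one of these tasks can be done straight from the command line with schtasks, assuming the two net commands above are saved to a small batch file; the task name, script path, and start time below are placeholders rather than our exact values:
schtasks /Create /TN "SSRS Scheduled Restart" /TR "C:\Scripts\restart-ssrs.cmd" /SC DAILY /ST 00:39 /RU SYSTEM /RL HIGHEST
Running as SYSTEM avoids storing credentials on the box, and /RL HIGHEST mirrors the “run with highest privileges” option in the Task Scheduler UI.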
In tandem, we set the RecycleTime value in the SSRS configuration to 180 minutes so that SSRS’s own built-in recycle wouldn’t fight with our scheduled restarts. Every service restart resets the internal recycle timer anyway, so in practice our schedule superseded SSRS’s native recycling, effectively disabling it.
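For context, RecycleTime lives in rsreportserver.config on each report server; the relevant fragment looks roughly like this, with 180 replacing the default of 720 minutes:
<Service>
    <!-- other service settings omitted -->
    <RecycleTime>180</RecycleTime>
</Service>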
And Then, Silence
The impact was immediate. The infamous 503s disappeared overnight. So did the “connection not available” errors that had plagued the application servers. Reports that previously timed out began running consistently.
We monitored logs obsessively for a week. Then two. Still clean. The change had worked. And not just a little.
Of Course, Something Else Broke
With the restarts in place, two new errors surfaced:
- Unable to connect to the remote server
- The underlying connection was closed: An unexpected error occurred on a receive
These weren’t constant, but they suspiciously lined up with the restart windows. Clearly, some scheduled reports were colliding with the restart schedule. So we tweaked the times again, moving them further away from peak activity windows.
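Nudging a restart window, once the tasks exist, is a one-liner; the task name and time here are again placeholders:
schtasks /Change /TN "SSRS Scheduled Restart" /ST 03:17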
At the same time, we noticed an uptick in a different class of errors:
- The operation has timed out
- An error has occurred during report processing
- The report execution XXXX has expired or cannot be found
These traced back to our SSRS cleanup job, a background process that purges the report server temp database and old execution sessions. If a long-running report is unlucky enough to be executing while cleanup kicks off, its intermediate state can get wiped out. Hence, missing execution sessions and weird timeouts.
To fix this, we’re considering delaying or filtering the cleanup task so that it only purges sessions older than, say, 15 minutes, just enough to let longer reports finish naturally.
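A minimal sketch of that filter, assuming the cleanup job boils down to a DELETE against a session table in the report server temp database; the table and column names below (dbo.SessionData, CreatedTime) are placeholders for whatever the job actually touches, not guaranteed SSRS schema:
-- Hypothetical cleanup filter: only purge sessions older than 15 minutes,
-- so reports still executing keep their intermediate state.
DELETE FROM dbo.SessionData  -- placeholder table name
WHERE CreatedTime < DATEADD(MINUTE, -15, GETUTCDATE());  -- placeholder column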
Reflection
This wasn’t the most elegant solution I’ve ever built. It wasn’t the most satisfying. But it worked. And sometimes, that’s all that matters.
We didn’t fix SSRS. We didn’t find the root cause of the 503s. Microsoft didn’t either. But we mitigated it. Reliably. Repeatably. Predictably.
There’s a lesson in that. Sometimes the right answer is a better architecture, better design, or a deeper root cause analysis.
And sometimes, the right answer is:
net stop "SQLServerReportingServices"
net start "SQLServerReportingServices"
Rinse. Repeat. Sleep better.