August 9, 2015
- remote control sessions not starting consistently
- Bundle assignments failing or taking 5-10 minutes to apply
- searching for Devices not completing
Oddly, ZCC seemed to be running a bit better under my own login, so I started looking around for known issues and stumbled across this:
Although the article suggests 11.3.1 (the version we're running) should still have the issue, we tried running a few of the tech accounts as Super Administrators, which seemed to help the initial login but didn't solve the other issues above. I've since seen another article elsewhere that suggests 11.3.2 is required to fix the non-Super Administrator issue. However, I'm waiting for ZCM 11 SP4 to make the server and agent upgrade work worthwhile, so I'm holding off on 11.3.2 for now.
Having made only minimal improvement with the Super Administrator fix above, I turned my gaze to the servers themselves in case we were hitting a resource issue somewhere. Running "top" on the Linux primary servers didn't show any signs of heavy load and there was plenty of free RAM, and given they're running on an auto-tiering, SSD-enabled SAN, disk performance isn't a concern either… on to the database server.
Our ZCM database runs on a dedicated Microsoft SQL Server VM, which gives a few potential pain points to watch out for. We’d already experienced issues in the past with ZPM causing massive growth of log files so it wouldn’t be a surprise if a database problem was the root cause here too.
Our database is 30GB+, so we tried upping the memory to run the whole lot in RAM, but that had minimal effect (apart from creating a huge pagefile on the C: drive!), so it was scaled back to 16GB. Multiple vCPUs were already configured so nothing to change there. Disk space and datastore latencies were all looking good as well, so no problems on the storage side either.
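As an aside, if SQL Server's memory isn't capped it will happily grab everything the VM has, which is the usual cause of a ballooning pagefile like the one we saw. A sketch of the cap (the 14GB figure is an illustrative value for a 16GB VM, not a Novell recommendation):

```sql
-- Cap SQL Server's memory so Windows keeps some headroom
-- and isn't pushed into the pagefile. Values are illustrative.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 14336; -- ~14GB of a 16GB VM
RECONFIGURE;
```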
A closer look at SQL
At this point it was time to drill a bit deeper into the SQL Server itself to see if there was something within the database that could be causing our issues.
Initially I tried running the manual index defragmentation process (on top of our standard SQL maintenance plan) that's referenced on the "Advanced SQL Concepts" support page.
Various indexes were showing as fragmented, but the end result was a marginal speed increase, which could well have been a placebo effect, so no magic bullet here (although it's good practice to run as per Novell's recommendations).
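For reference, checking fragmentation boils down to a query against SQL Server's standard index DMV. This is a generic sketch (the thresholds are the commonly quoted rules of thumb, not Novell's exact script):

```sql
-- List fragmented indexes in the current (ZENworks) database
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 10
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Then REORGANIZE (roughly 5-30% fragmentation) or REBUILD (over 30%):
-- ALTER INDEX ALL ON <table_name> REBUILD;
```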
By chance I stumbled back across a tool I'd used in the past called SQL Heartbeat, so I decided to pop it on my machine and watch the ZENworks database for a while to see what appeared.
The results were almost instant. What I like about the Heartbeat tool is its graphical representation of SQL database process IDs, which makes spotting the problematic one(s) very quick. If you really don't like third-party tools, SQL Profiler will probably provide similar results.
A screenshot of what we found is below. Watching the activity on the server, it seemed that a query from one of the primaries was going round in circles every 1-2 minutes, causing a huge spike in CPU and disk activity. The server CPU never dropped below 50% and often stayed up in the 80-100% range; no wonder ZCC was running slow!
SPID 152 – I choose you…
Looking at SQL Activity Monitor I could see a matching "Expensive Query" that stood out like a sore thumb in terms of the volume of reads, writes and CPU time.
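If you'd rather not rely on Activity Monitor or a third-party tool, the same information can be pulled from SQL Server's DMVs. A generic sketch that lists currently running requests by CPU cost, along with the query text behind each SPID:

```sql
-- Show active requests ordered by cost, with the offending SQL text,
-- so the SPID driving the load can be identified
SELECT r.session_id,
       r.cpu_time,
       r.reads,
       r.writes,
       r.total_elapsed_time,
       t.text AS query_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.session_id <> @@SPID   -- exclude this query itself
ORDER BY r.cpu_time DESC;
```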
I first tried stopping and restarting the novell-zenserver and novell-zenloader services on the primary server identified by the SPID. That did make the process disappear, but it reappeared a few minutes later. Restarting the affected primary server itself also had no effect.
We raised an SR with Novell but before we got past the initial “check ZCC status” type troubleshooting steps we had a large power cut that forced some of our VM hosts offline, including the primary above that had the large query associated with it. When everything came back up the database server was back at normal resource usage and the stuck query had disappeared. Definitely goes down as an unconventional fix!
Now ZCC is lightning fast and all the issues we were experiencing disappeared with the stuck query 🙂
Tools for next time
After doing a bit of post-event research, there are a few more useful free tools out there that I'll try next time something like this arises:
Idera SQL Check