Save yourself from insanity – MDT, ZCM Agent and Windows 10…
September 23, 2018
Following on from previous posts, this summer marked the end of an era, with Windows 10 finally usurping its predecessor on our network. With 2500+ machines to deploy over the summer break the process had to be slick, and for 90% of the summer it was… however something changed right at the end and it nearly sent me insane. Not quite though, so here's a post to save anyone else the same fun and games (!)
ZCM Agent deployment via MDT
During all our test runs we had the ZCM Agent deployed via a PowerShell script in the MDT Task Sequence. It went somewhere in the middle of the TS, as per the Microsoft default template, along with all the usual suspects (Chrome, Office and so on), and we didn't think anything more of it.
All our pilot machines deployed beautifully, as did the first 2000 devices over the summer, registering neatly into ZCM as you'd expect them to.
The script we use can be found here:
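For context, the wrapper MDT runs is nothing exotic. Here's a minimal sketch of that kind of script; the installer filename and the "-q" silent switch are placeholders rather than the real ZCM values, so check the linked script or your own ZCM media for the actual ones:

# Minimal sketch of an MDT application install wrapper for the ZCM Agent.
# NOTE: the installer name and "-q" switch below are placeholder assumptions.
$installer = Join-Path $PSScriptRoot "PreAgentPkg_AgentComplete.exe"
$proc = Start-Process -FilePath $installer -ArgumentList "-q" -Wait -PassThru
exit $proc.ExitCode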
Looking back at the webpage now I can actually see another user in the comments describing the situation we encountered, around the same time too, so onward to the next part of the post…
Breaking the law (of logon)
However, around the end of August our technicians started reporting images failing part way through. At first it was one or two machines out of 15-20 in a classroom, but over the course of a week or so it became worse, to the point where most deployments were failing and successes were the exception rather than the rule.
The failure occurred within the autologon process that takes place as MDT reboots between deployment stages using the local administrator credentials defined in the Deployment Share.
The logon screen appeared to be corrupted: it was a hybrid of the Microsoft Account login screen (with an email address input box) and the local login screen (username and password fields below).
Once the failure occurred there was no way to rescue the image; in fact the only way to restart deployment was to boot into WinPE, bring up a command prompt (F8), then clean the drive using diskpart:

diskpart
select disk 0
clean
A reboot is then required to start a fresh deployment attempt.
Troubleshooting
Given that a Microsoft Account screen was appearing, the first thing I tried was a registry hack to disable that feature entirely.
Ref: https://www.top-password.com/blog/block-or-disable-microsoft-account-in-windows-10-8/
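For reference, this is the sort of one-liner that achieves it, assuming the NoConnectedUser value described in the linked article (where a value of 3 blocks Microsoft accounts entirely):

# Block Microsoft account sign-in, per the linked article (NoConnectedUser = 3).
New-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" `
    -Name "NoConnectedUser" -PropertyType DWord -Value 3 -Force | Out-Null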
Putting this in the TS and running again disabled the Microsoft Account login but gave us a different failure: this time a completely blank login screen, with just the lock screen wallpaper showing, as per the image below:
If you stood and watched for long enough there was a subtle flash on the screen every couple of minutes, as MDT presumably tried (in vain) to enter the autologon credentials.
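For anyone diagnosing something similar, a quick sanity check on an affected machine (however you manage to get a console on it) is whether the classic autologon values are still present under the standard Winlogon key. This is a generic check rather than anything ZCM-specific:

# Show the autologon values Winlogon relies on; if these have been stripped,
# the automatic logon between TS stages has nothing to work with.
Get-ItemProperty "HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" |
    Select-Object AutoAdminLogon, DefaultUserName, DefaultDomainName, DefaultPassword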
Reviewing deployment logs (BDD.log for the in-progress machines) didn’t show any failures, just a nice series of applications deploying, a clean reboot and then… nothing. Looking at where we’d got to in the TS it became clear the failure occurred between Windows Updates (Pre Application Install) and finishing the Applications stage.
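If you want to watch a machine in real time, tailing the in-progress log works nicely; the path below is the usual MDT location while a deployment is still running:

# Follow the MDT deployment log live on the machine being imaged.
Get-Content "C:\MININT\SMSOSD\OSDLOGS\BDD.log" -Tail 50 -Wait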
At this point I started thinking about what might have changed, given we hadn't modified the TS since the start of summer. The only two things that sprang to mind were:
- a new version of the Impero client that we use for classroom management (adds a helper application on the login screen for users to request remote assistance)
- a new cycle of Windows Cumulative Updates (2018-08 vs 2018-07)
It was simple enough to absolve the Impero client of blame – we disabled it in the TS with no change to the failures.
The second point was a bit harder to isolate. Initially I declined the 2018-08 Cumulative Update for Windows 10, but that had no effect either.
In desperation I also completely rebuilt the Boot Images and upgraded MDT from 8443 to 8450, as there were some notes about additional compatibility for Windows 10 1709 (which we’d decided to stick with as our standard build some time back).
At this point the imaging failures were threatening to put us behind schedule so one Sunday in the peace and quiet at home I fired up the VPN, deployed 3 blank Hyper-V VMs on my work desktop and decided to keep testing, imaging and re-imaging until I’d cracked it.
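Standing up the blank VMs is a one-liner if you're doing the same; the names, switch and sizes below are illustrative rather than the exact ones I used:

# Create three empty Gen 2 VMs to image and re-image (illustrative values).
1..3 | ForEach-Object {
    New-VM -Name "MDT-Test0$_" -Generation 2 -MemoryStartupBytes 4GB `
        -NewVHDPath "D:\VMs\MDT-Test0$_.vhdx" -NewVHDSizeBytes 80GB `
        -SwitchName "External"
}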
The fix (TL;DR people scroll to here 😉 )
Once I’d got the test VMs set up I started to work through some theories of what could be happening to break autologon all of a sudden. Removing some of the Reboot steps in the TS delayed the failure but didn’t prevent it from happening.
Looking through the list of Applications, most are pretty benign in terms of how they affect Windows, but the ZCM Agent stood out to me, knowing it hooks into the Windows logon (via its credential provider, the modern replacement for the old GINA) to obtain Passive Mode login credentials. It was a bit of a hunch, but a logical one given most other options had been exhausted by this point.
I decided to move it right to the end of the TS, with only the AV install and Join Domain steps after it. Voila, success!
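Roughly speaking, the tail of the TS now looks like this:

- Windows Update and application install steps as before (ZCM Agent removed from here)
- Install ZCM Agent
- Install AV
- Join Domain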
At this point I then had a search around for anything pertaining to this behaviour to try and explain the root cause and found something very, very similar…
Zenworks Passive Mode login stops working after upgrade to Windows v1709 or v1803
Ref: https://www.novell.com/support/kb/doc.php?id=7022478
It would appear Windows Updates can trigger the same behaviour (and issues) with the ZCM Agent, which is perhaps not entirely surprising given the final line of the article:
“The exact conditions on when it may or may not occur are not fully known since it is the Windows Update, not ZENworks Code, that is removing these keys.”
There is also this related issue, involving a setting we configure via GPO once the machine has joined the domain, which also exhibits Passive Mode login problems. However, it never caused us any trouble prior to the August updates:
Ref: https://www.novell.com/support/kb/doc.php?id=7022379
However, I also wondered why this didn't happen consistently on every machine, given they should all have been installing the same updates. Looking at our WSUS server, it was being worked hard, despite having had a memory upgrade and an IIS pool increase a while back.
Pushing the WSUS server up to 4 vCPU / 16GB RAM, with the IIS application pool limit increased to 8GB (who said WSUS doesn't need much resource?!), seems to have made the update installation process quicker. That makes me wonder whether some machines weren't picking up all the updates, which would explain why the failure didn't occur every time.
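If you'd rather script the app pool change than click through IIS Manager, something like this does it, assuming the default WsusPool name (the value is in KB, so 8GB = 8388608):

# Raise the WsusPool private memory recycle limit to 8GB (value is in KB).
Import-Module WebAdministration
Set-ItemProperty "IIS:\AppPools\WsusPool" `
    -Name recycling.periodicRestart.privateMemory -Value 8388608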
The story does come with a happy ending: after finding the fix (plus an early start the following Monday) the remaining machines were imaged in time for our new classrooms to be brought online at the start of term.
The moral of the story
Keep your ZCM Agent install step as close to the end of your TS as possible, and definitely after all Windows Update steps have already run!
(also VMs on a fast Samsung Evo SSD are a lifesaver for troubleshooting!)