Server Maintenance

D'Agger · 28-10-2015 10:01am #1

Hi All,

Just wondering what kind of schedules you have around server maintenance. I'm looking to get the below set up and am curious how other techs/sysadmins here have their servers running:

- Script to clear down space, running monthly and targeting certain folders
- Report running from SNMP on disk usage, with alerts for breached thresholds
- Weekly physical check of server racks for lights
- Documented backup schedule for all servers
- Tightening of Group Policy around Windows updates for servers (reboots etc.)
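
As a rough illustration of the first item (not anyone's actual script from this thread), a monthly clear-down could look something like the Python sketch below. The folder, the 30-day age limit and the function names are all assumptions made for the example, and it defaults to a dry run so nothing is deleted until the list has been reviewed.

```python
import time
from pathlib import Path


def find_old_files(folder, max_age_days, now=None):
    """Return files under `folder` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def clear_down(folder, max_age_days, dry_run=True, now=None):
    """List old files and, unless dry_run is True, delete them."""
    removed = []
    for path in find_old_files(folder, max_age_days, now):
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed
```

Run from a monthly scheduled task, something like `clear_down(some_temp_folder, 30)` would report the candidates first, and `dry_run=False` would actually remove them.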

Fysh · 28-10-2015 2:37pm

D'Agger wrote: »

Hi All,

Just wondering what kind of schedules you have around server maintenance. I'm looking to get the below set up and am curious how other techs/sysadmins here have their servers running:

- Script to clear down space, running monthly and targeting certain folders
- Report running from SNMP on disk usage, with alerts for breached thresholds
- Weekly physical check of server racks for lights
- Documented backup schedule for all servers
- Tightening of Group Policy around Windows updates for servers (reboots etc.)

I make no assertion that the below is the right way to do it, but here's what I've done/seen done in places I've worked:

- Nightly free space checks with email notifications, linked to weekly/monthly cleanups if possible and appropriate
- Physical inspection not really necessary if you've got and are correctly using ILOM cards on your servers, but it couldn't hurt
- Nightly, weekly and monthly backup routines (diffs/incrementals nightly, full either weekly or monthly as appropriate, varying based on service and resources), with email notifications where possible
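
A nightly free space check of the kind described above can be quite small. The following Python sketch is only an illustration: the 90% threshold and the function names are assumptions, and the returned alert lines would still need to be emailed by whatever mechanism is already in use.

```python
import shutil


def percent_used(usage):
    """Percentage of the volume in use, from a (total, used, free) tuple."""
    return 100.0 * usage.used / usage.total


def check_volumes(paths, threshold_pct=90.0, usage_fn=shutil.disk_usage):
    """Return one alert line per path at or above the usage threshold."""
    alerts = []
    for path in paths:
        pct = percent_used(usage_fn(path))
        if pct >= threshold_pct:
            alerts.append(
                f"{path}: {pct:.1f}% used (threshold {threshold_pct:.0f}%)"
            )
    return alerts
```

The `usage_fn` parameter exists only so the check can be exercised without a real full disk; in normal use the default `shutil.disk_usage` does the work.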

Group Policy's a tricky one, because it's easy to break stuff if you're not careful. Effective permissions checking then becomes important, which would be fine if AD effective permissions weren't painful to check

(Though this freebie from Netwrix can be useful in certain circumstances)

D'Agger · 28-10-2015 4:36pm

Do you have the backups or server setups documented, Fysh?

We have SNMP software in which we can display the backup schedule & link the assets each server is backed up to (i.e. a DR server nightly, tape nightly/weekly/monthly), but do you have documentation on server setups, or a ticket that you implement for server creation requests?

Fysh · 28-10-2015 10:48pm

TBH I have certain issues/reservations as to how some things are handled in my current place of employ, so we don't always do what I think we should be doing.

We have reasonably good documentation for the backups (it's all scheduled NetBackup jobs, and as we've two main sites each site gets backed up to the other site); equally importantly, the NetBackup configuration is sensibly laid out so that it's pretty easy to figure out what's going on if you have access to the master server. We don't regularly test restoring from them, though, which is a bugbear of mine on the basis that a backup's just a load of used storage until you've verified it's usable...
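
One cheap way to verify a test restore, sketched here in Python purely as an illustration, is to restore to a scratch location and compare checksums against the source. This assumes the source files have not changed since the backup was taken (so it suits static data or a snapshot), and it has nothing to do with NetBackup's own tooling; the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return (relative path, problem) pairs for missing or differing files."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif sha256_of(src) != sha256_of(dst):
            problems.append((rel.as_posix(), "differs"))
    return problems
```

An empty list means every source file came back byte-for-byte identical; anything else is a reason to look at the backup job before it is needed in anger.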

Server creation requests tend to come through as work packages that then generate a ticket; most of the setup work is automated based on templates in vSphere.

Patching is managed through WSUS, though I'm unhappy about how we do this - we don't have a test environment, any server profiles against which to test patches, or any kind of UAT process for patching things like production SQL Server boxes. We do have a failed patch backout process, because I wrote one the last time I was on patch duty. But it hasn't been tested, because we don't have a test environment...

Server configuration documentation is another matter entirely - our equipment database is an Access DB (!!!) despite us having teams of MSSQL devs and DBAs within the organisation, and it's woefully incomplete in a lot of cases and horrendously out of date in others. I've lobbied for this to be changed but there seems to be no appetite for the change. Similarly, some services are documented fully while others may as well not exist for all the documentation we've got.

Where we do particularly badly is in cross-referencing information between different services/systems - for example, we use Infoblox for DNS & DHCP management, AD for authentication/directory services, VMware as a virtualisation platform, and McAfee VirusScan for AV (with ePO as the management suite). But we've got neither procedures nor tools for identifying disparities between these services, so inaccurate information is all over the place. I've started working on tackling this through a series of PowerShell scripts that can pull together the relevant information and identify inconsistencies, but the real solution will have to come in terms of procedural change.
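
The scripts mentioned above are PowerShell and aren't shown in the thread. Purely to illustrate the idea of spotting inconsistencies, here is a Python sketch that assumes each system's host list has already been exported somehow; the system names and the function name are placeholders.

```python
def find_discrepancies(inventories):
    """Map each hostname to the systems it is missing from.

    `inventories` maps a system name (e.g. "AD") to an iterable of
    hostnames. Hosts present in every system are left out of the result.
    """
    normalised = {
        name: {host.strip().lower() for host in hosts}
        for name, hosts in inventories.items()
    }
    all_hosts = set().union(*normalised.values())
    report = {}
    for host in sorted(all_hosts):
        missing = sorted(
            name for name, hosts in normalised.items() if host not in hosts
        )
        if missing:
            report[host] = missing
    return report
```

Hostnames are lower-cased and trimmed before comparison, since case differences between tools are a common source of false mismatches.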

D'Agger · 29-10-2015 2:08pm

Fysh wrote: »

TBH I have certain issues/reservations as to how some things are handled in my current place of employ, so we don't always do what I think we should be doing.

This is essentially my problem with my current employment also. It's family owned and, after experiencing massive growth, the structures & procedures of the company don't line up with what's required for a company of its size now - in fact there's quite a bit that's simply non-existent.

Fysh wrote: »

We have reasonably good documentation for the backups (it's all scheduled NetBackup jobs, and as we've two main sites each site gets backed up to the other site); equally importantly, the NetBackup configuration is sensibly laid out so that it's pretty easy to figure out what's going on if you have access to the master server. We don't regularly test restoring from them, though, which is a bugbear of mine on the basis that a backup's just a load of used storage until you've verified it's usable...

We have a DR server for VMs that we robocopy partitions over to, into folders with the copied server's name on them. The same folder contains a build document that was created when the server was put on the network. The idea being that, were there a massive outage, we could recreate the servers based on the build doc & simply copy over the partitions, so we'd effectively have the server recreated at the point it was at the night before.

That's not ideal but it's not the worst solution either.

DR testing is not something that's been done much here either but I do hope to get a better overview of our server setup this month & look to test for DR of top priority servers bi annually.

Fysh wrote: »

Server creation requests tend to come through as work packages that then generate a ticket; most of the setup work is automated based on templates in vSphere.

Patching is managed through WSUS, though I'm unhappy about how we do this - we don't have a test environment, any server profiles against which to test patches, or any kind of UAT process for patching things like production SQL Server boxes. We do have a failed patch backout process, because I wrote one the last time I was on patch duty. But it hasn't been tested, because we don't have a test environment...

I'm hoping to get a few basic forms created for our tickets, e.g. hardware request tickets with sub-types for laptop, mobile, server etc. The idea is that we weed out the time spent requesting additional information by having required fields such as DNS, C: size, additional partitions & sizes, software required etc.

This would mean there's a ticket relating to a new server & changes made to that server can be tracked via our ticketing system - change management effectively.

WSUS is something we've configured but we've no test environment, nor will we have one for a long time if ever. As it stands group policy points towards the WSUS server, but if patches haven't been applied via that WSUS server, or if it's not available for some reason, then it grabs important & security updates directly from the MS site. WSUS is something I need to give proper time to though.

Fysh wrote: »

Server configuration documentation is another matter entirely - our equipment database is an Access DB (!!!) despite us having teams of MSSQL devs and DBAs within the organisation, and it's woefully incomplete in a lot of cases and horrendously out of date in others. I've lobbied for this to be changed but there seems to be no appetite for the change. Similarly, some services are documented fully while others may as well not exist for all the documentation we've got.

How does it work on an Access DB - saving of general configs, or documents pulled into an Access DB? Either way, that's the first I've heard of Access being used for doc control or config management!

Fysh wrote: »

Where we do particularly badly is in cross-referencing information between different services/systems - for example, we use Infoblox for DNS & DHCP management, AD for authentication/directory services, VMware as a virtualisation platform, and McAfee VirusScan for AV (with ePO as the management suite). But we've got neither procedures nor tools for identifying disparities between these services, so inaccurate information is all over the place. I've started working on tackling this through a series of PowerShell scripts that can pull together the relevant information and identify inconsistencies, but the real solution will have to come in terms of procedural change.

We also use ePO - it's handy, and again something I'd like to spend more time on. One thing that might interest you here is the ELK stack - something that allows you to consolidate logs & visualise them. It could allow you to create dashboards for multiple systems but house them under one solution.

I created a server for it previously with a view to going back to it, but when I went back 3 months later to configure it and start pulling in logs, the virtual server I'd created it on had been deleted to free up space.

Gonna take a look at Infoblox though, because we currently use domain controllers for DNS & DHCP on a per-site basis.

Fysh · 30-10-2015 4:00pm

D'Agger wrote: »

We have a DR server for VMs that we robocopy partitions over to, into folders with the copied server's name on them. The same folder contains a build document that was created when the server was put on the network. The idea being that, were there a massive outage, we could recreate the servers based on the build doc & simply copy over the partitions, so we'd effectively have the server recreated at the point it was at the night before.

That's not ideal but it's not the worst solution either.

That's not a bad start, though ideally you'd have the VMs backed up as well. One of these days I'll make time to experiment with Hyperv Backup, because it seems a promising and inexpensive way of adding an extra layer of resilience to DR plans. It's not relevant in my current place though as we've got SRM configured in vSphere.

D'Agger wrote: »

DR testing is not something that's been done much here either but I do hope to get a better overview of our server setup this month & look to test for DR of top priority servers bi annually.

I've found it quite frustrating to see how often DR and BCP stuff is treated as an optional exercise. I've actually had the senior tech in my current team describe it as a box-ticking exercise.

Unfortunately it's not until after you needed it that management will usually appreciate why you kept asking for it (and related budget) in the first place...

D'Agger wrote: »

I'm hoping to get a few basic forms created for our tickets, e.g. hardware request tickets with sub-types for laptop, mobile, server etc. The idea is that we weed out the time spent requesting additional information by having required fields such as DNS, C: size, additional partitions & sizes, software required etc.

This would mean there's a ticket relating to a new server & changes made to that server can be tracked via our ticketing system - change management effectively.

Definitely something to prioritise. In my last job I basically harassed my boss into letting me install a ticketing system (only for team-internal use, so templates were less of a pressing issue) and I had the assumption that just having a ticketing system was enough. My current place has a ticketing system that could be pretty good, but we have no templates and poor communication between the support desk and other technical teams, so the amount of time that gets wasted on gathering required data is ridiculous.

Plus a good ticketing system with well defined categories lets you generate reports to identify time sinks and areas requiring further work/investment.

D'Agger wrote: »

WSUS is something we've configured but we've no test environment, nor will we have one for a long time if ever. As it stands group policy points towards the WSUS server, but if patches haven't been applied via that WSUS server, or if it's not available for some reason, then it grabs important & security updates directly from the MS site. WSUS is something I need to give proper time to though.

I'm in the same boat - it needs more time and attention but the view is that "it works" so we just autopublish all important updates with no testing. Given the recent announcement that the win10 upgrade will be released as a recommended update next year, I look forward to finding a load of client machines upgraded unexpectedly...

D'Agger wrote: »

How does it work on an Access DB - saving of general configs, or documents pulled into an Access DB? Either way, that's the first I've heard of Access being used for doc control or config management!

It's hideous, and boards won't let me type the exact words I uttered when I first found out. I mean, in terms of making Access do stuff it's astonishing, there's some serious macro-fu going on. But it's also a horrible example of using the wrong tool for the job. Not only is it the wrong tool but at least half the data I find in there is out of date, because the lack of procedural documentation means people don't realise they should update these records...

D'Agger wrote: »

We also use ePO - it's handy, and again something I'd like to spend more time on. One thing that might interest you here is the ELK stack - something that allows you to consolidate logs & visualise them. It could allow you to create dashboards for multiple systems but house them under one solution.

I created a server for it previously with a view to going back to it, but when I went back 3 months later to configure it and start pulling in logs, the virtual server I'd created it on had been deleted to free up space.

Cheers for the tip, I'll take a look at that! I've been pretty happy that most of our tools have PowerShell modules or consoles available so I can just pull stuff into scripts, but generating dashboards is always painful in PS, so anything that can do it for me sounds promising.

D'Agger wrote: »

Gonna take a look at Infoblox though, because we currently use domain controllers for DNS & DHCP on a per-site basis.

I'm not sure how much I'd recommend it - I suspect our deployment of it is a bit broken, but it has these "host" records that it uses by default which I don't like, because they behave sort of like A records and sort of like CNAME records but not reliably like either. Having said that, it does add some resilience to DNS and DHCP, but personally I'd probably be as happy using BIND for DNS and either Windows DHCP or dhcpd on a lightweight Linux stack for DHCP...

D'Agger · 29-11-2015 11:19pm

Fysh wrote: »

That's not a bad start, though ideally you'd have the VMs backed up as well. One of these days I'll make time to experiment with Hyperv Backup, because it seems a promising and inexpensive way of adding an extra layer of resilience to DR plans. It's not relevant in my current place though as we've got SRM configured in vSphere.

Taken me ages to get around to this, on annual leave and up the walls before and after which is to be expected really!

Working on a presentation in PowerPoint to highlight what needs to be done from an infrastructure POV, in order to help get an extra resource to free me up to do what's required.

Hyper-V looks good at a glance, must check out how others use it on r/sysadmin etc.

Fysh wrote: »

I've found it quite frustrating to see how often DR and BCP stuff is treated as an optional exercise. I've actually had the senior tech in my current team describe it as a box-ticking exercise. Unfortunately it's not until after you needed it that management will usually appreciate why you kept asking for it (and related budget) in the first place...

That's frustrating. I'm looking to do some DR testing next weekend where possible - the holdup is that I'm looking to be able to charge time for it as it'll be a few hours of my Saturday - radio silence on whether I'm to schedule it despite a few follow ups!

Fysh wrote: »

Definitely something to prioritise. In my last job I basically harassed my boss into letting me install a ticketing system (only for team-internal use, so templates were less of a pressing issue) and I had the assumption that just having a ticketing system was enough. My current place has a ticketing system that could be pretty good, but we have no templates and poor communication between the support desk and other technical teams, so the amount of time that gets wasted on gathering required data is ridiculous.

Plus a good ticketing system with well defined categories lets you generate reports to identify time sinks and areas requiring further work/investment.

Yeah, I've pulled down an Excel sheet of this year's data from our ticketing system to work with, because we have to duplicate our time into a worksheet. It's a terrible system and is costing me time that could be better used elsewhere.

Fysh wrote: »

I'm in the same boat - it needs more time and attention but the view is that "it works" so we just autopublish all important updates with no testing. Given the recent announcement that the win10 upgrade will be released as a recommended update next year, I look forward to finding a load of client machines upgraded unexpectedly...

I still haven't gone near this since! Set reminders in your calendar to find the KB serial & look to block it is my advice :pac:

Fysh wrote: »

It's hideous, and boards won't let me type the exact words I uttered when I first found out. I mean, in terms of making Access do stuff it's astonishing, there's some serious macro-fu going on. But it's also a horrible example of using the wrong tool for the job. Not only is it the wrong tool but at least half the data I find in there is out of date, because the lack of procedural documentation means people don't realise they should update these records...

Cheers for the tip, I'll take a look at that! I've been pretty happy that most of our tools have PowerShell modules or consoles available so I can just pull stuff into scripts, but generating dashboards is always painful in PS, so anything that can do it for me sounds promising.

Recently I had two products landed on me to install, with the aid of a tech, over the course of two days. I wasn't comfortable with the rushed nature of it, and it had to be pushed out overnight with zero testing. It was a disaster, and it still isn't implemented anywhere near properly.

Have you checked out pulling in PS? Must look into PS more for server based info reports - do you run scripts that mail info to you or what? I'm interested to hear what you have setup!

Fysh wrote: »

I'm not sure how much I'd recommend it - I suspect our deployment of it is a bit broken, but it has these "host" records that it uses by default which I don't like, because they behave sort of like A records and sort of like CNAME records but not reliably like either. Having said that, it does add some resilience to DNS and DHCP, but personally I'd probably be as happy using BIND for DNS and either Windows DHCP or dhcpd on a lightweight Linux stack for DHCP...

Like the Hyper-V solution, I'll have to see what other techs say about it - I thought I saw a thread on r/sysadmin recently saying that not using Windows for either is best practice.

Looked it up, this is it

Fysh · 02-12-2015 10:23pm

D'Agger wrote: »

Taken me ages to get around to this, on annual leave and up the walls before and after which is to be expected really!

Working on a presentation in PowerPoint to highlight what needs to be done from an infrastructure POV, in order to help get an extra resource to free me up to do what's required.

Hyper-V looks good at a glance, must check out how others use it on r/sysadmin etc.

On a small scale I've found it alright, but in production terms I've no idea how it compares to ESXi, etc. I can see it being useful if you've got a bunch of Microsoft specialists in-house and don't want to have to train them up in a separate technology. OTOH, there are some implementation issues that can crop up with any hypervisor setup when you're looking at large scales, so it may be better to get a consultant in to do the initial design and setup, then get training for the ongoing maintenance. You can get a basic ESXi licence for free if you want to play around with it, and it's worth doing - you won't get access to any of the shinier features like SRM but it gives you an idea of what to expect.

One thing I will say about Hyper-V is that if you want a simple test environment and you don't want/need it to be on a cloud platform Hyper-V works nicely and doesn't require getting hardware that's on the ESXi compatibility list, since you can run it on anything that'll run Windows Server. I have my home test machine set up with two boot disks, one for ESXi and one for Hyper-V, but I prefer running the Hyper-V one as I can script much more of the setup work in PS.

D'Agger wrote: »

That's frustrating. I'm looking to do some DR testing next weekend where possible - the holdup is that I'm looking to be able to charge time for it as it'll be a few hours of my Saturday - radio silence on whether I'm to schedule it despite a few follow ups!

TBH I'd do it and charge for it regardless, because either they trust you to do what's required or they'll never agree to it until an incident happens, at which point it's too late.

D'Agger wrote: »

Yeah, I've pulled down an Excel sheet of this year's data from our ticketing system to work with, because we have to duplicate our time into a worksheet. It's a terrible system and is costing me time that could be better used elsewhere.

Ah yes, I've been in that boat before and it's rubbish. Still, at least you can get the data from the ticketing system - if you're lucky you may be able to put together a system whereby an export from the ticketing system generates timesheet data. I've never understood requiring timesheets of people working on a ticketing system capable of recording time; like you say, it's wasted duplication of work.
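
As a sketch of that idea, assuming the ticketing system can export a CSV with one row per time entry, the timesheet totals could be derived rather than re-keyed. The column names `technician`, `date` and `minutes` are hypothetical and would need to match whatever the real export produces.

```python
import csv
import io
from collections import defaultdict


def timesheet_from_export(csv_text):
    """Sum logged minutes per (technician, date) from a ticket export."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[(row["technician"], row["date"])] += int(row["minutes"])
    return dict(totals)
```

The resulting totals could then be written out in whatever layout the timesheet worksheet expects.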

D'Agger wrote: »

I still haven't gone near this since! Set reminders in your calendar to find the KB serial & look to block it is my advice :pac:

I've been tinkering with a script that can remove and prevent reinstallation of the GWX client, so I'll be keeping an eye out for that KB and blocking it as soon as the details are released! (I have 10 installed on a desktop and don't mind it, but would prefer it to be my choice when it gets installed on any of my systems...)

D'Agger wrote: »

Recently I had two products landed on me to install, with the aid of a tech, over the course of two days. I wasn't comfortable with the rushed nature of it, and it had to be pushed out overnight with zero testing. It was a disaster, and it still isn't implemented anywhere near properly.

Have you checked out pulling in PS? Must look into PS more for server based info reports - do you run scripts that mail info to you or what? I'm interested to hear what you have setup!

It's possible to query Access databases from PS, but I've not bothered to put the time in - to be honest none of the systems in question should be allowed to remain in Access because they're clearly way beyond it in terms of scope, so building tools that work with Access would just make it less likely that the move to SQL Server would happen. There's also a SQL Server PS module that's supposed to be pretty good, but I haven't used it.

In terms of reporting - what I usually do is collate the information I want (generally from AD, occasionally from Exchange 2010 or Office 365) into either text (for simple summary stuff) or CSV files (if it's something like a stale user account report), then send it via email. Mostly I have scripts set up to run as scheduled tasks on a dedicated VM - some run daily, but a few are monthly or quarterly (stale or misconfigured user accounts, stuff like that).

Due to the issue with our logs overwriting within 12 hours, I'm setting up another set of tasks to export the logs every few hours to a network share, and have a script on the way that will parse the exported logs for events of interest like failed logons, account lockouts, etc. (We'll probably end up buying a commercial product for some of this stuff, but equally that'll be at least 4-6 months before it's in place, and in the meantime I'd like to have something in place...)
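
To give a feel for the parsing step, here is a small Python sketch that counts events of interest in an exported log. It assumes the export is a CSV with an `EventID` column, which is an assumption about the export format rather than anything stated in this thread; 4625 and 4740 are the Windows Security log event IDs for a failed logon and an account lockout.

```python
import csv
import io
from collections import Counter

# Windows Security log event IDs worth flagging.
EVENTS_OF_INTEREST = {"4625": "failed logon", "4740": "account lockout"}


def summarise_events(csv_text):
    """Count rows whose EventID is one of the events of interest."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = EVENTS_OF_INTEREST.get(row["EventID"].strip())
        if label:
            counts[label] += 1
    return dict(counts)
```

Extending the dictionary is enough to track further event IDs, and the counts are easy to drop into a summary email.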

If you know what you're trying to report on it's usually pretty easy to find out how to get it with PowerShell, and once you've got that you can use the Send-MailMessage cmdlet to email stuff to yourself easily.
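
For anyone not on PowerShell, the same collate-to-CSV-and-email pattern can be sketched in Python using only the standard library. The addresses, the attachment name `report.csv` and the SMTP host are placeholders, and this is an illustration rather than an equivalent of Send-MailMessage.

```python
import csv
import io
import smtplib
from email.message import EmailMessage


def build_report_email(sender, recipient, subject, rows, fieldnames):
    """Build an email with the report rows attached as a CSV file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(f"{len(rows)} row(s) in the attached report.")
    msg.add_attachment(buf.getvalue(), subtype="csv", filename="report.csv")
    return msg


def send_report(msg, smtp_host):
    """Send the message via an SMTP relay (host name is site-specific)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Keeping the message building separate from the sending makes it easy to check the report contents before pointing the script at a real mail relay.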

D'Agger wrote: »

Like the Hyper-V solution, I'll have to see what other techs say about it - I thought I saw a thread on r/sysadmin recently saying that not using Windows for either is best practice.

Looked it up, this is it

Yeah, tbh I don't see any particular reason to have Windows running your DNS (either domain controllers or otherwise). I've always found BIND nice and low overhead. I don't have any experience of Windows as a production DHCP server so can't comment on it, but OTOH if 3 completely different environments have all opted not to use Windows Server for DHCP that does suggest it's not a great idea...
