Working with remote hands

Practices for the actual physical management of servers and components vary by organization. In smaller settings especially, Ceph admins may sit near the datacenters in which production systems are housed, and be responsible for hardware installation and management. In other and larger organizations, especially those with multiple datacenters within a country or around the world, there may be a dedicated DC Ops team or a contracted third party.

We often refer to anyone who performs physical server installation and maintenance tasks as remote hands or simply hands, referring to the fact that they are able to lay literal hands on the systems that a distant admin cannot.

Whichever arrangement floats your organization's boat, there are a number of practices that can optimize hardware management logistics. Some of us may work in busy, loud, shared office spaces, but datacenters are a whole new level of loud and cold. Careful co-ordination can save time and avoid mistakes when performing maintenance tasks, especially when replacing failed components. This topic may seem trivial and out of scope, but Ceph clusters comprise large numbers of twisty servers, all alike, and this author has seen this aspect of server management fail horribly time and again.

Here's a litany of best practices for working with datacenter hands.

  • Lose your voice
    DCs house hundreds or thousands of spinning fans and cacophonous air handlers. They are LOUD and your hands folks should be wearing hearing protection. Fancy noise-canceling headsets, if available, vary in their effectiveness. But even in a silent room, voice is a flawed way to communicate with DC hands. Like luggage at an airport, many names and digits sound alike, and transcription errors are all too common. Was that serial number 83423NCC1701 or A3423NEE1701? Should I yank drive twenty or twenty-one? The only way to win is not to play. Calls are fine for discussing logistics, but not for exchanging precise data.
  • Real-time digital communication
    It's vital for admins and hands to be able to communicate digitally. Ideal is a laptop or tablet with instant messaging or collaboration tools like Jabber, Slack, or Cisco Spark. In a pinch, you can use a smartphone or even the chat pane of a Zoom, Cisco WebEx, or Google Hangout conference. These media allow one to cut/paste information, avoiding transcription errors. Email is not the best medium here given propagation delays, but it can be useful when nothing else is available.
  • Network access
    It's best for the DC hands to have network access, to the Internet at large or at least to your internal network. This allows them to retrieve firmware files and diagrams, and to send you photos and screen captures.
  • Just the right tool
    Software and hardware: This author once worked with an excellent node engineer who was employed by a different and rather parsimonious business unit. They had him using hand tools for hundreds of fasteners every day. The gift of a $30 electric screwdriver was invaluable for efficiency and made me his new best friend.
    Simple hand tools also have a way of walking away even from a locked DC hall, and among the worst frustrations is discovering only after you take a system down that your system vendor uses T10 Torx screws but the tech has only a stripped Phillips driver.
  • Verify server identity
    Adhesive stickers can be handy for visually locating a server, but they can just as easily be deadly. Don't rely on them. They can fade, fall off, be reattached to the wrong server, or be written incorrectly in the first place. Repurposed systems may not get a new and unique label. A different department's server in the next rack may have a name that varies from yours only by a hyphen. This author has been bitten by all of these and has learned not to trust them. Modern systems often don't even have a space to attach a label, given the desire to pack the front panel with as many connectors, indicator lights, and drive bays as possible.

    There are several better methods for ensuring that hands unplug and work on the proper server. Use them together when possible.

    • Indicator lights: tools like ipmitool, storcli, and hpssacli allow one to flash LEDs for visual identification of the desired servers.
    • Serial numbers: Always refer to servers and drives by their serial numbers and never transcribe them by hand. Cut/paste or extract them digitally. The facter utility is great for this on a variety of hardware. Run facter hostname serialnumber productname and paste the results to your tech for multiple levels of identification. Utilities like hpssacli, dmidecode, facter, and storcli also help with component serials.
    • Power: modern servers often have sophisticated BMCs or other ACPI power management systems. The last step of preparing a system for maintenance can be to run shutdown -P now; this will power down most server components, lights, and fans. The subject server is thus visually and audibly distinct.
  • Lockstep check-in
    Ensure that remote hands verify with you that your server is ready for them to tear into, that they have located the proper serial number, and that you're ready for them to reconnect power and network connections and boot it on up. Most importantly of all, make sure that they don't leave for the day until you both agree that the task is complete and the server is functional.
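
The identification steps above lend themselves to a little automation. What follows is only a minimal sketch, not a definitive tool: it assumes Python 3 on the server, that facter prints each requested fact as a name => value line, and that ipmitool is installed with a BMC that honors chassis identify. The function names are hypothetical and chosen purely for illustration.

```python
import subprocess

# Facts named in the text: facter hostname serialnumber productname
FACTS = ("hostname", "serialnumber", "productname")


def parse_facter(output):
    """Parse 'name => value' lines (assumed facter output format) into a dict."""
    facts = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=>")
        if sep:  # skip lines that do not contain the separator
            facts[name.strip()] = value.strip()
    return facts


def identity_message(facts):
    """Build a paste-ready block so nobody has to read a serial number aloud."""
    missing = [name for name in FACTS if not facts.get(name)]
    if missing:
        raise ValueError("missing facts: " + ", ".join(missing))
    return "\n".join("{}: {}".format(name, facts[name]) for name in FACTS)


def collect_facts():
    """Run facter on the local server; assumes facter is on the PATH."""
    result = subprocess.run(
        ["facter"] + list(FACTS),
        capture_output=True, text=True, check=True,
    )
    return parse_facter(result.stdout)


def flash_chassis_led(seconds=300):
    """Ask the BMC to blink the chassis identify LED for the given interval.

    Assumes ipmitool is installed and the caller has sufficient privileges.
    """
    subprocess.run(["ipmitool", "chassis", "identify", str(seconds)], check=True)
```

On the server being prepared for maintenance, one might call print(identity_message(collect_facts())) and paste the result into the chat with the tech, then call flash_chassis_led() so that the matching chassis is also easy to spot in the rack. Refusing to build the message when a fact is missing is a deliberate choice here: a partial identification is exactly the kind of ambiguity this section warns against.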