Senior Reliability Engineer Job Description
Department: Technology Operations
Reports to: Director, System Operations
Reliability Engineering focuses on ensuring dependable delivery of SPS Commerce services. Working within a fast-paced, collaborative environment, the Senior Reliability Engineer uses technical solutions to prevent or reduce the likelihood of failures. A Senior Reliability Engineer delivers monitoring and automation patterns that help provide high-performing services to our customers. Collaboration with other teams within the organization to help implement these patterns in their services is critical. Additionally, the role uses automation and other technologies to intelligently cope with challenging failures.
MEASUREMENTS OF SUCCESS
- Effectively manage multiple projects and tasks in a fast-paced environment
- Engineer reusable technical patterns to prevent or reduce the frequency of failures
- Quickly identify and resolve the cause of failures as they occur
- Engineer effective ways to cope with failures that may occur
- Collaborate with various technology teams to ensure the designs of new systems will have a high rate of reliability and dependability
- Collaborate with various technology teams to apply reliability patterns leveraging monitoring services and automation
- Engineer and maintain shared services that will improve the overall performance of SPS’s service delivery
- Produce automation and monitoring patterns for technology designs of existing and new services
- Develop robust solutions and tools to consistently improve the team’s ability to identify, react to, respond to, and recover from failures
- Collaborate with Technology Engineering, Development, and Product Management to help develop, scale, and improve production systems and services; approachability is critical
- Monitor and administer systems as assigned to provide appropriate documentation, cross-training, capacity management, and recommendations for future state
- Consistently demonstrate superior problem solving and collaboration skills
- Provide consistent thought leadership and innovation within reliability practices and roadmaps
QUALIFICATIONS
- College degree or equivalent years of experience
- 3 or more years of experience in the Information Technology field
- Experience administering Linux – Oracle Enterprise Linux, Red Hat, CentOS
- Experience working within an Agile development methodology, including task execution
- Proficient with Python
- Experience with Amazon Web Services including Lambda, SNS, API Gateway, EC2, RDS, DynamoDB, Route 53, Elastic Load Balancers, AMIs, IAM Roles, OpsWorks, and CloudFormation/SAM
- Experience with automation, configuration management, continuous integration, and related tools including Ansible, GitHub, Test Kitchen, and Jenkins development
- Experience with advanced monitoring solutions. Experience with Sumo Logic, LogicMonitor, CloudWatch, and StatsD is a plus!
- Demonstrated understanding of networking systems including TCP/IP, UDP, DHCP, BIND DNS, and IP subnets (preferred)
- Demonstrated understanding of various identity and authorization systems including OpenLDAP, Microsoft Active Directory, and SAML implementations (preferred)
- Demonstrated knowledge of storage systems including iSCSI SAN, Direct Attached Storage (JBOD), and NFS (preferred)