SPS Commerce is hiring a Site Reliability Engineer (SRE) who will serve as a critical member of the team responsible for delivering highly available platform services and deployment automation that empower our product engineering teams with services that are secure, reliable, and cost-effective. The SRE approaches operations as a software problem and applies software engineering approaches to solve it. This role is a key influence on technical design and execution, with high visibility and broad impact across our entire engineering platform and products.
This is a fully remote role; you can be based anywhere in the continental US.
We solve retail supply chain problems by cutting through inefficiency with innovation and automation. At SPS we empower retailers, suppliers, distributors, grocers, and logistics partners to work better together with our people, our process, and our tech products. We have the world’s largest retail network, and we don’t just lead the industry, we are the industry.
At SPS, we believe every employee makes a difference. We ensure employees have the tools, resources and training to explore new ideas and execute them. Our success comes from playing as a team and always playing to win. Careers don’t just grow here, they’re made here.
Does this sound like you?
- You have a passion for automation and you approach technology operations problems like they're software problems - you apply software engineering approaches to resolution.
- You work collaboratively - you know that success is seldom the result of one person or one team. It takes a village to craft a highly reliable, secure and fast platform.
- You enjoy the pace and responsibility of having a large impact on a team. This includes being confident in making recommendations, issuing guidelines, and helping to drive decisions.
What is the day to day like?
You will help connect application architects and support engineers with those dedicated to IT infrastructure to ensure application and system resilience. Through collaborative post-incident reviews, you will contribute to and help follow up on action items for process improvements through to completion, and help drive useful metrics and dashboards.
- Maintain a highly available, secure, and cost-effective cloud platform running on Microsoft Azure
- Support the core components of site reliability, particularly as they relate to performance and incident response, to help facilitate service resilience and infrastructure uptime.
- Create automation for improved collaborative response in real time, including updating documentation, runbook tools, and modules to ready teams for incidents.
- Engineer Continuous Integration & Continuous Delivery (CI/CD) solutions that simplify and improve software deployments to enable high velocity for our Engineering and Operations partners
- Develop robust monitoring and observability services and patterns to consistently improve the team’s ability to identify, react to, respond to, and recover from complex failures.
- Collaborate with Engineering, Development, Operations and Product Management to help develop, scale, and improve production systems and services
- Partner to provide appropriate documentation, cross-training, architecture planning, capacity management, and recommendations for future state
What is required?
- 2+ years of Microsoft Azure experience, including Virtual Machines, Azure SQL Database, Azure Monitor, and Azure Virtual Networks, with a bachelor’s degree, or five years of experience without a degree
- Experience in Python, PowerShell, .NET, or a comparable language, with a software engineering mindset
- Experience with immutable and scalable infrastructure (infrastructure as code concepts)
- Demonstrated understanding of networking systems and of various identity and authorization systems
- Problem solving and collaboration skills
What is preferred?
- An advanced understanding of cloud technologies
- Experience building or operating CI/CD pipelines or other deployment automation solutions
- Experience with Microsoft Azure including Azure Active Directory, Azure Networking, Azure Load Balancer, and ARM Templates
- Experience with advanced monitoring solutions such as metrics platforms, logging, distributed tracing, and the like.
*EOE including disability/veteran*