Posted November 28, 2020

Staff Site Reliability Engineer

Episerver

USA Remote Full Time

This job is being posted to Silicon Florist because it is potentially open to remote candidates. Feel free to contact me if you’d like to learn about Episerver before applying. I am a PDX-based remote Episerver employee. [email protected]

We start the new chapter of Episerver as we proudly join forces with Optimizely to create a new wave of digital leaders through transforming digital experience creation and optimization. Episerver is consistently ranked as a market leader in digital experience creation, supporting the digital journeys of 9,000+ global brands, while Optimizely is the world’s leader in experience optimization. Combined, these two powerhouses create the most advanced digital experience platform in the industry. The combination of creation and optimization will enable companies across all segments and industries to take advantage of what content, commerce, personalization and experimentation can bring to their business and to their customers.

The scale of our product has created tremendous potential for growth with Episerver + Optimizely – growth of teams, growth of influence, and growth of personal careers. If you are looking to work on the next generation of digital technologies in a fast-paced, hyper-growth environment, apply! We’re just getting started...

Site Reliability Engineers at Episerver are focused on making Episerver the most reliable, performant, and trustworthy Digital Experience Optimization platform ever! Our engineering teams have built data pipelines that process 10 billion events daily and applications that support powerful experimentation and collaboration workflows at scale. Our platforms are built on AWS and GCP. We use technologies such as Cloudflare Workers, Akamai, PostgreSQL, and Honeycomb. We build and manage our systems using tools such as Travis CI, Jenkins, Docker, and Terraform. This is a unique opportunity to lead the engineering organization in areas of state-of-the-art observability, service-oriented architectural excellence, and forward-looking planning and execution of large technical projects.

As a Staff Site Reliability Engineer you will:

Assist with defining a roadmap for all engineering teams to utilize fully automated, self-service, highly scalable, cost-efficient, observable, auditable and reliable infrastructure services as standard practice
Work on the execution of this roadmap across the engineering organization, collaborating with SREs and senior engineers across engineering while also performing hands-on work on the most critical challenges
Provide expert technical guidance and ongoing engineering design review to teams planning and implementing large migrations, service-oriented architecture, broad architectural shifts, and capacity growth
Build a metrics-driven operational culture standardizing our practices for SLO definition and review as well as for logging, monitoring, alerting, and on-call practices
Make iterative improvements to blameless incident management processes, root cause analyses, outage prevention, and service recovery strategies across the engineering organization
Partner closely with security, quality, and product teams to achieve high priority security, privacy, compliance, reliability and business-continuity objectives on our overall roadmap
Propose and drive large improvements to production systems to achieve significant impact to our business and engineering teams
Mentor and coach engineers to be curious and effective at discovering and solving technical challenges
Participate in SRE 24/7/365 on-call rotation

You’ll be successful if:

You have proven experience (7-10 years) demonstrating hands-on technical leadership and business impact in combining software engineering skills with systems engineering skills to solve complex automation and reliability challenges
You have deep technical experience with various cloud providers, containerization technologies, automated deployment frameworks, orchestration frameworks, monitoring, logging, alerting, system internals, networking, databases, distributed systems, and service-oriented architecture
You have the skills to implement load, stress, performance and reliability testing standards at scale to improve service, platform and infrastructure resiliency
You promote openness, diversity of opinions and inclusive discussions at all times to evaluate a wide variety of ideas and perspectives in solving challenging problems
You demonstrate clear decision making and good trade-offs in complex situations comprising multiple opinions, needs, teams, technologies, cloud providers, and architectural settings
Multiple Cloud experience (AWS, GCP and Azure)
Monitoring expertise with DataDog, New Relic, Nagios
CDN experience is very desirable
AWS IAM, networking, security, architecture and general expertise a must
You communicate effectively with stakeholders ranging from executives to junior engineers across the breadth and depth of the engineering organization
You exemplify high accountability, integrity, and resilience to maintain focus on both big-picture goals and the milestones to get there
You enable the engineering organization to innovate and deliver with greater speed and safety
Proven experience demonstrating hands-on business impact in combining software engineering skills with systems engineering skills to solve complex automation and reliability challenges
Proficiency in more than one programming language or infrastructure automation tool including any of: Python, Java, Bash, Terraform, Chef, or similar
Monitoring expertise (Any of DataDog, New Relic, Nagios, Honeycomb, or similar)
ELK stack for centralized logging
AWS IAM, networking, security general expertise a must
Ability to proactively look at all systems, tools, processes and architectures with an open mind and make recommendations on scale, reliability, availability and automation is key

This listing expired on Jan 12. Applications are no longer accepted.