Senior Site Reliability Engineer (SRE)

Salary not provided

AWS, Ruby, Rails, Python, Go, Shell
English: Fluent

Minimum years of experience: 2

KOMOJU

Site Reliability Engineer (SRE) Job Description

About the Position

As a Site Reliability Engineer (SRE), you will play a critical role at the intersection of software engineering and infrastructure operations. This position is ideal for engineers who are passionate about automation, systems design, and building scalable, reliable platforms.

You will not be limited to managing cloud infrastructure; you will take ownership of the platform's overall health, performance, and developer experience. Your work will span:

  • Cloud Infrastructure Management: Architect, implement, and maintain robust and secure infrastructure in a cloud-native environment using Terraform. Ensure high availability, scalability, and resilience of systems.
  • CI/CD and Deployment Automation: Design and improve continuous integration and continuous delivery pipelines that empower development teams to ship software reliably and rapidly.
  • Observability & Monitoring: Implement end-to-end observability tooling (metrics, logging, distributed tracing, and alerting) to provide real-time insight into platform performance and help reduce mean time to detection and resolution.
  • Platform Quality & Reliability: Champion best practices for reliability, scalability, and performance across engineering teams.

You’ll collaborate closely with developers, security engineers, and product stakeholders to ensure systems meet both technical and business goals.


Responsibilities

  • Actively participate in improving and maintaining AWS infrastructure
  • Continuously improve system performance, reliability, and security
  • Design, implement, and maintain the observability stack (metrics, logging, tracing, dashboards)
  • Collaborate with engineering teams to instrument applications for better observability
  • Improve developer productivity with tooling
  • Secure systems and ensure adherence to compliance requirements
  • Participate in on-call rotation

Requirements

  • 2+ years in SRE roles working with the AWS platform
  • 2+ years of experience in a software development role
  • Hands-on experience with observability tools, preferably Datadog
  • Proficiency in Terraform
  • Proficiency in at least one scripting or programming language (e.g., Ruby/Rails, Python, Go, Shell Script)
  • Experience working with CI/CD tools (e.g., GitHub Actions, Jenkins, CircleCI)

Nice to Have

  • Strong communication skills for collaboration with internal and external stakeholders
  • Knowledge of TCP/IP and other networking protocols
  • Experience with AWS Direct Connect

Benefits

  • Flexible remote work environment, with office space available for in-person collaboration
  • 10 regular vacation days, plus 5 additional summer and 5 winter vacation days
  • Paid birthday holiday
  • Self-learning allowance to support skill development
  • Access to the O’Reilly Learning Platform
  • Language training (Japanese/English)
  • Office lunch twice a week