Join Razer on a global mission to revolutionize the way the world games. As a Site Reliability Engineer, you will be part of Razer Gold's growing infrastructure and platform engineering team. We are seeking a skilled and driven individual with hands-on experience in Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

Job Responsibilities

Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation.
Implement and operate reliable, scalable cloud infrastructure, primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).
Lead and participate in architecture reviews focusing on reliability, scalability, security, and performance.
Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK) to detect and resolve issues proactively.
Perform incident management, postmortems, and root cause analysis, and implement continuous improvement strategies.
Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, and release management.
Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby).
Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.
Ensure systems are compliant with security standards, including patching, hardening, and secure access policies.
Provide on-call support and participate in incident rotations.
Monitor and maintain service-level objectives (SLOs), SLAs, and error budgets to ensure reliability targets are met.
Provide support and resolution for assigned incidents and tickets.

Prerequisites

Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related field.
Minimum 2 years of experience in SRE, DevOps, Cloud Infrastructure, or Systems Administration roles.
Solid hands-on experience with AWS Cloud services including (but not limited to):
- Compute: EC2, Lambda, ECS, Auto Scaling
- Networking: VPC, Load Balancers, Route 53
- Messaging & Storage: SQS, S3, RDS, ElastiCache, SES
- Monitoring: CloudWatch, X-Ray
Proficient in Infrastructure as Code using Terraform and/or CloudFormation.
Experience with CI/CD tools (e.g., GitLab CI, Jenkins, CodePipeline, ArgoCD).
Strong understanding of Linux and Windows system administration and troubleshooting.
Comfortable with one or more scripting/programming languages such as Python, Node.js, Bash, or Ruby, and with JSON/YAML configuration formats, for automation.
Strong grasp of network fundamentals, including DNS, HTTP(S), TLS/SSL, firewalls, and TCP/IP.
Experience with containerization and orchestration (Docker, ECS, or Kubernetes) is a plus.
Familiar with observability tools and incident management best practices.
Vietnamese citizen based in Ho Chi Minh City.
