DevOps & SRE

Key Questions

What is the difference between DevOps, SRE (Site Reliability Engineering), and Platform Engineering?

Why do we need CI/CD pipelines instead of dragging files to a server?

What is Infrastructure as Code (IaC), and why are Terraform and OpenTofu the standard tools for it?

What happens when a server crashes at 3 AM? (Incident Response)

What are SLIs, SLOs, and SLAs?

How do we know if the system is healthy? (Monitoring vs Observability)

Why is manual testing not enough for modern software?

What is the difference between Unit, Integration, and End-to-End (E2E) testing?

What is TDD (Test-Driven Development)?

What is a 'Blameless Post-Mortem' and why do we need them?

Why is 'It works on my machine' not a valid excuse?

Hard Truths

Reality Check

Developers often build things that are impossible to operate or monitor.

Manual, unrepeatable deployments are a common cause of outages.

Uptime is a feature, not luck.

Alert fatigue is real: if everything is urgent, nothing is urgent.

The most permanent solution is a temporary workaround.

Resources

Courses

DevOps Engineering on AWS

Google Cloud SRE: Measuring and Managing Reliability

Articles & Readings

The Site Reliability Engineering Books (Free to read)

The Twelve-Factor App

DevOps Roadmap

What is Platform Engineering?

The Testing Trophy and Testing Classifications

The Phoenix Project
