Running the Project

Definitions

Transaction completion rate (TCR)
Calculated by dividing the total number of successful (approved) transactions by the total number of attempted transactions over a given time period.
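
For example (hypothetical numbers, for illustration only): 9,400 approved transactions out of 10,000 attempts in a week gives a TCR of 9,400 / 10,000 = 94%.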

Goals

Systems must be reliable, observable, and maintainable
Track business metrics, not just technical metrics
Listen to users and act on feedback quickly

Questions to Ask

How could this product fail (i.e., a pre-mortem)?

Work backwards from where failure could occur, so that you can plug those gaps.

How does this scale? What do you have in place to scale vertically (make your existing infrastructure more powerful) and/or horizontally (add more infrastructure quantity/units)?

Each option is better in some situations, and sometimes you need both. No matter what, there must be a plan, already being executed, for handling a later surge.
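
As a rough sketch of what an executable scaling plan can look like (assuming AWS and boto3; the group name, instance ID, and instance type below are made up for illustration):

    import boto3

    # Horizontal scaling: add more units behind an existing Auto Scaling group.
    autoscaling = boto3.client("autoscaling")
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="checkout-asg",   # hypothetical group
        DesiredCapacity=6,                     # up from, say, 3 instances
        HonorCooldown=False,
    )

    # Vertical scaling: make an existing instance more powerful.
    # The instance has to be stopped before its type can be changed.
    ec2 = boto3.client("ec2")
    instance = "i-0123456789abcdef0"           # hypothetical instance
    ec2.stop_instances(InstanceIds=[instance])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance])
    ec2.modify_instance_attribute(InstanceId=instance, InstanceType={"Value": "m5.2xlarge"})
    ec2.start_instances(InstanceIds=[instance])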

Are our images and other static assets optimized? Are they WebP/AVIF? Are there 2 MB JPEG hero images?

These are extremely simple but can absolutely show dedication (or lack thereof) to optimizing performance. Better performance will help improve transaction completion rates and user satisfaction.
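
A quick way to audit this is a minimal sketch like the one below, assuming Python with Pillow (WebP support is built in; AVIF needs a separate plugin) and an illustrative static/img directory:

    from pathlib import Path

    from PIL import Image

    # Re-encode JPEG hero images as WebP and report the size difference.
    for jpg in Path("static/img").glob("*.jpg"):
        out = jpg.with_suffix(".webp")
        Image.open(jpg).save(out, "WEBP", quality=80)
        print(f"{jpg.name}: {jpg.stat().st_size // 1024} KB -> {out.stat().st_size // 1024} KB")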

How does our UI look across device screen sizes?

There is no excuse for a UI that does poorly on mobile responsiveness and accessibility.

What model are we using and why can't we use <insert model>?

There are obvious differences across models; some do better than others at particular tasks. For example, Claude 3.5 Sonnet is really good with code, while Gemini 2.0 or GPT-4.5 give me really good creative writing pieces. The business objective is to keep up with the best in class and keep the freedom to switch when there are advancements. The fear is always stagnating because of a lack of incentives to change.

Alarm Bells

We're using xxx, yyy, zzz (or X%, Y%, Z%) of our cloud/on-prem resources.

From a business perspective, you need to understand how resource usage serves the objectives. Can users access the offering? Can they transact? Are they satisfied? How can costs be reduced?

We haven't checked AWS optimization tools or asked support for help yet.

Not making use of AWS tools you already have access to, like Trusted Advisor, or of the support plan you're already paying for, indicates a lack of cost discipline and workload optimization.
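
For illustration, Trusted Advisor results can be pulled programmatically through the AWS Support API (this requires a Business or Enterprise support plan and the us-east-1 endpoint); a rough boto3 sketch:

    import boto3

    # Walk the cost-optimization checks and print each check's status.
    support = boto3.client("support", region_name="us-east-1")
    for check in support.describe_trusted_advisor_checks(language="en")["checks"]:
        if check["category"] == "cost_optimizing":
            result = support.describe_trusted_advisor_check_result(checkId=check["id"])
            print(check["name"], "->", result["result"]["status"])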

Transaction completion rate is really good! (90-99%).

A 10% failure rate is common; suspiciously high completion rates can point to measurement errors. Review the start and end points of the measurement to ensure it accurately reflects customer transaction success.

We run health checks and calculate metrics manually. Someone will notice if there's an outage.

Manual monitoring is inefficient and unresponsive. Real-time, automated monitoring is essential for timely service issue detection and resolution. Stats should be used for ongoing health checks, not just periodic presentations.
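
A minimal sketch of what automated checking can look like (Python; the health endpoint and alert webhook URLs are placeholders, and in practice this would run on a schedule or inside a monitoring service):

    import requests

    HEALTH_URL = "https://example.com/api/health"        # placeholder
    ALERT_WEBHOOK = "https://hooks.example.com/alerts"   # placeholder chat/incident webhook

    def check_health() -> None:
        """Probe the service and raise an alert if it is down or slow."""
        try:
            resp = requests.get(HEALTH_URL, timeout=5)
            healthy = resp.status_code == 200 and resp.elapsed.total_seconds() < 2
        except requests.RequestException:
            healthy = False
        if not healthy:
            requests.post(ALERT_WEBHOOK, json={"text": f"Health check failed: {HEALTH_URL}"}, timeout=5)

    if __name__ == "__main__":
        check_health()  # schedule with cron, Lambda, etc., not by hand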

We have backups, and this is not a critical system anyway.

What is the backup frequency? Are these active or passive copies? Where are they stored? How many copies? How long does it take to restore them? Are these complete snapshots or diffs? You'd always want some variant of the 3-2-1 backup rule: at least three copies of the data, on two different types of storage, with one copy off-site.

How do we manage vulnerabilities (CVEs) and patches?

It's part and parcel of running systems. Patching is essential.

Work always costs a lot of man days and money, even for things that sound very simple.

Fixed payments let vendors pocket the 'savings' when work is easier than estimated. Buffers get added, vendors profit.

Before we fix anything, we need to confirm: was there a delay in reporting? Were the proper procedures followed?

Covering asses becomes more important than tackling problems.

Our analytics dashboard shows 120k JavaScript errors... but don't worry, it's all under the hood.

Frontend errors are not invisible; they represent broken UI states, failed clicks, or frustrated users. Ignoring high error rates is ignoring user friction.

We need the precise date and time when you encountered the issue to help us locate the corresponding logs.

This indicates a lack of proper observability tooling (like correlation IDs or request tracing). You shouldn't need a user to timestamp their bad experience for you to find it.
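
A minimal sketch of the correlation-ID idea, assuming a Flask service (the X-Correlation-ID header name is a common convention rather than a standard, and the route is made up):

    import logging
    import uuid

    from flask import Flask, g, request

    app = Flask(__name__)
    log = logging.getLogger("app")

    @app.before_request
    def attach_correlation_id():
        # Reuse the caller's ID if one was sent, otherwise mint a new one.
        g.correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))

    @app.after_request
    def echo_correlation_id(response):
        # Return the ID so a user (or support ticket) can quote it instead of a timestamp.
        response.headers["X-Correlation-ID"] = g.correlation_id
        return response

    @app.route("/checkout")
    def checkout():
        # Every log line carries the ID, so one bad request is easy to find later.
        log.info("checkout started", extra={"correlation_id": g.correlation_id})
        return {"status": "ok"}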

The root cause is human error; the person needs to be more careful.

Human error is a symptom, not a cause. If a system allows a human to accidentally bring it down, the system design is at fault. Good post-mortems focus on guardrails, not blame.

We needn't validate the external API responses... if we add validation, it might break our responses.

This is 'Hope-Driven Development'. Failing to validate incoming data means you'd rather let your app crash randomly in production than catch the issue gracefully at the door.
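
A minimal sketch of catching it at the door, in plain Python (the quote fields are hypothetical; a schema library would also work):

    def parse_quote(payload: dict) -> dict:
        """Validate an external API response before it reaches business logic."""
        required = {"currency": str, "rate": (int, float), "expires_at": str}
        for field, expected_type in required.items():
            if field not in payload:
                raise ValueError(f"upstream response missing '{field}': {payload!r}")
            if not isinstance(payload[field], expected_type):
                raise ValueError(f"upstream field '{field}' has unexpected type: {payload[field]!r}")
        if payload["rate"] <= 0:
            raise ValueError(f"upstream rate must be positive, got {payload['rate']!r}")
        return payload

Failing loudly here, with a message that names the field, beats a mysterious crash three layers deeper in the stack.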

We always review logs manually, update PowerPoints, and track outages in spreadsheets.

Reactive, not proactive. Automated monitoring with business-tied triggers is essential.

We don't have KPIs set up yet, but we'll measure things after launch.

No measurement plan means no accountability for business impact or ROI.

Stakeholder feedback is limited to "Make it look nicer" or "Let's improve the UI" without user research or defined UX goals.

UX is not just visual polish. It's about understanding user needs and solving them effectively.

You need a 5-level approval process for changing words on your app/website.

Bureaucracy kills agility. If you need 5 approvals for a typo, your governance is broken.

The system was up 100% of the time, but TCR was 50%.

The server was running, but the 'Submit' button was broken. Uptime is a vanity metric; TCR is a value metric.

We are secure because we are compliant with policy X.

Compliance is a checkbox; security is a mindset. You can be fully compliant and still be wide open to hackers. Compliance protects the auditor; security protects the user.

Dealbreakers

Red Flag
Users are telling us the app sucks, the app is slow, the app is useless. But we're sure the noise will go away eventually.
Why
We only listen to feedback if it's good, or if our bosses are affected by it.

Red Flag
Team proudly presents top-notch figures but hasn't validated the measurement methodology or investigated outliers.
Why
Lack of data integrity and actionable insights. At this point we might as well not have measured at all; that would be better than drawing wrong conclusions from incorrect data.

Apptitude / Curated by Zixian Chen

© 2024–2026. All Rights Reserved.