Running the Project
Definitions
Goals
Questions to Ask
Work backwards from where failure occurs, so that you can then plug those gaps.
Each option is better in some situations. Either way, there must be a plan that's actively being executed to handle a surge later on.
These are extremely simple but can absolutely show dedication (or lack thereof) to optimizing performance. Better performance will help improve transaction completion rates and user satisfaction.
UIs have no excuse for doing poorly on mobile responsiveness and accessibility.
There are obvious differences across models; some do better than others. For example, Claude 3.5 Sonnet is really good with code, while Gemini 2.0 or GPT-4.5 give me really good creative writing pieces. The business objective is to keep up with best in class and to have the option to switch when there are advancements. The fear is always stagnating because of a lack of incentives to change.
Alarm Bells
From a business perspective, we need to understand how resource usage serves objectives. Can users access the offering? Can they transact? Are they satisfied? How can costs be reduced?
Not using free AWS tools like Trusted Advisor, or an enterprise support plan that's already being paid for, indicates a lack of cost discipline and workload optimization.
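For example, a short boto3 sketch (assuming a Business or Enterprise support plan, which the Trusted Advisor API requires) that pulls the cost-optimization checks worth acting on:

```python
import boto3

# The AWS Support API (which backs Trusted Advisor) lives in us-east-1 and
# requires a Business or Enterprise support plan.
support = boto3.client("support", region_name="us-east-1")

# List all Trusted Advisor checks and keep the cost-optimization ones.
checks = support.describe_trusted_advisor_checks(language="en")["checks"]
cost_checks = [c for c in checks if c["category"] == "cost_optimizing"]

for check in cost_checks:
    result = support.describe_trusted_advisor_check_result(
        checkId=check["id"], language="en"
    )["result"]
    # 'warning' or 'error' statuses are candidates for follow-up.
    print(check["name"], result["status"], len(result.get("flaggedResources", [])))
```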
A failure rate around 10% is common; excessively high rates suggest potential measurement errors. Review the start and end points of measurements to ensure they accurately reflect customer transaction success.
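As a rough sketch of what "start and end points" means in practice, with hypothetical event names: count only transactions that reached the defined start event, and count success only at the defined end event.

```python
from collections import defaultdict

# Hypothetical event names; the point is that a transaction's start and end
# are defined explicitly, not inferred from server behavior.
START_EVENT = "checkout_started"
END_EVENT = "payment_confirmed"

def completion_rate(events):
    """events: iterable of (transaction_id, event_name) tuples."""
    seen = defaultdict(set)
    for tx_id, name in events:
        seen[tx_id].add(name)

    started = [tx for tx, names in seen.items() if START_EVENT in names]
    completed = [tx for tx in started if END_EVENT in seen[tx]]
    return len(completed) / len(started) if started else 0.0

sample = [
    ("t1", "checkout_started"), ("t1", "payment_confirmed"),
    ("t2", "checkout_started"),        # abandoned or failed
    ("t3", "payment_confirmed"),       # end without start: a measurement gap worth reviewing
]
print(completion_rate(sample))  # 0.5 — t3 is excluded because its start was never recorded
```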
Manual monitoring is inefficient and unresponsive. Real-time, automated monitoring is essential for timely service issue detection and resolution. Stats should be used for ongoing health checks, not just periodic presentations.
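One possible shape of that automation, sketched with a CloudWatch alarm (the metric, namespace, and SNS topic are hypothetical): the trigger is a business signal, and it pages someone instead of waiting for the next status meeting.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical custom metric and SNS topic; the idea is that the alert is tied
# to a business signal (failed transactions), not just CPU or uptime.
cloudwatch.put_metric_alarm(
    AlarmName="transaction-failure-rate-high",
    Namespace="Business/Checkout",
    MetricName="FailedTransactionRate",
    Statistic="Average",
    Period=300,                 # evaluate every 5 minutes
    EvaluationPeriods=2,        # two consecutive breaches before alerting
    Threshold=10.0,             # percent
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```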
What is the backup frequency? Are these active or passive copies? Where are they stored? How many copies? How long does it take to restore them? Are these complete snapshots or diffs? You'd always want some variant of the 3-2-1 backup rule.
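The 3-2-1 part is easy to turn into an automated check. A toy sketch, with made-up copy records: at least three copies, on two different media, one of them offsite.

```python
def satisfies_321(copies):
    """copies: list of dicts like {"medium": "s3", "offsite": True}.
    The 3-2-1 rule: >= 3 copies, on >= 2 distinct media, >= 1 offsite."""
    media = {c["medium"] for c in copies}
    offsite = any(c["offsite"] for c in copies)
    return len(copies) >= 3 and len(media) >= 2 and offsite

backups = [
    {"medium": "local-disk", "offsite": False},
    {"medium": "s3", "offsite": True},
    {"medium": "s3-glacier", "offsite": True},
]
print(satisfies_321(backups))  # True
```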
Patching is part and parcel of running systems. It's essential.
Fixed payments let vendors pocket the 'savings' when work turns out easier than estimated. Buffers get added, and vendors profit.
Covering asses becomes more important than tackling problems.
Frontend errors are not invisible; they represent broken UI states, failed clicks, or frustrated users. Ignoring high error rates is ignoring user friction.
This indicates a lack of proper observability tooling (like correlation IDs or request tracing). You shouldn't need a user to timestamp their bad experience for you to find it.
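A minimal sketch of the idea using only the standard library (field and function names are illustrative): attach a correlation ID to every log line a request produces, so a user report can be matched to the exact trace.

```python
import logging
import uuid
from contextvars import ContextVar

# Each request gets a correlation ID that is stamped onto every log record it
# produces, so a complaint can be traced without asking the user for a timestamp.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(format="%(asctime)s %(correlation_id)s %(message)s")
logger = logging.getLogger("app")
logger.addFilter(CorrelationFilter())

def handle_request(payload):
    # In a real service this would come from an incoming X-Request-ID header,
    # or be generated here and echoed back to the client.
    correlation_id.set(str(uuid.uuid4()))
    logger.warning("charge failed: %s", payload)

handle_request({"amount": 42})
```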
Human error is a symptom, not a cause. If a system allows a human to accidentally bring it down, the system design is at fault. Good post-mortems focus on guardrails, not blame.
This is 'Hope-Driven Development'. Failing to validate incoming data means you prefer your app to crash randomly in production rather than catching the issue gracefully at the door.
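A minimal sketch of catching it at the door, validating a hypothetical order payload before anything downstream touches it:

```python
def parse_order(payload: dict) -> dict:
    """Validate incoming data at the boundary instead of letting bad values
    surface as a crash deep inside the system. Field names are illustrative."""
    errors = []
    customer_id = payload.get("customer_id")
    if not isinstance(customer_id, str) or not customer_id:
        errors.append("customer_id must be a non-empty string")
    amount = payload.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount must be a positive number")
    if errors:
        # Reject clearly, at the door, with a message the caller can act on.
        raise ValueError("; ".join(errors))
    return {"customer_id": customer_id, "amount": float(amount)}

print(parse_order({"customer_id": "c-123", "amount": 19.99}))
# parse_order({"amount": -5}) raises ValueError here, instead of crashing later.
```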
Reactive, not proactive. Automated monitoring with business-tied triggers is essential.
No measurement plan means no accountability for business impact or ROI.
UX is not just visual polish. It's about understanding user needs and solving them effectively.
Bureaucracy kills agility. If you need 5 approvals for a typo, your governance is broken.
The server was running, but the 'Submit' button was broken. Uptime is a vanity metric; TCR is a value metric.
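A toy illustration, with made-up numbers, of how the two metrics diverge:

```python
# The health check only proves the process is alive; it says nothing about
# whether customers can actually complete a transaction.
health_checks = ["ok"] * 288                        # 24h of 5-minute checks, all passing
transactions = {"started": 1200, "completed": 0}    # broken Submit button

uptime = health_checks.count("ok") / len(health_checks)
tcr = transactions["completed"] / transactions["started"]

print(f"uptime: {uptime:.0%}, TCR: {tcr:.0%}")  # uptime: 100%, TCR: 0%
```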
Compliance is a checkbox; security is a mindset. You can be fully compliant and still be wide open to hackers. Compliance protects the auditor; security protects the user.
Dealbreakers
We only listen to feedback if it's good, or if our bosses are affected by it.
Lack of data integrity and actionable insights. At this point we might as well not have measured at all; that would be better than drawing wrong conclusions from incorrect data.