AWS Well-Architected Framework

Content

  • Pillars

  • Design Principles

  • Best Practices

  • Questions

General Design Principles

  • Stop guessing capacity

  • Test systems at production scale

  • Automate to make architectural experimentation easier

  • Allow for evolutionary architectures

  • Drive architectures using data

  • Improve through game days

Lenses

  • Data Analytics

  • Hybrid Networking

  • Machine Learning

  • Serverless Application

  • Streaming Media

  • HPC

  • SaaS

  • IoT

  • Games Industry

  • Financial Services Industry

  • SAP

  • Custom Lenses

Pillar: Operational Excellence

The ability to support development and run workloads effectively, gain insight into operations and continuously improve supporting processes and procedures to deliver business value.

Operational Excellence Design Principles

  • Perform operations as code: this limits human error and drives consistent responses to events.

  • Make frequent, small, reversible changes, so that you can reverse them if they fail.

  • Refine operations procedures frequently.

  • Anticipate failure: perform pre-mortem exercises to identify potential sources of failure so that you can remove or mitigate them:

    • Test your failure scenarios and validate your understanding of their impact.

    • Test your response procedures to ensure they are effective and that teams are familiar with how to launch them.

    • Set up regular game days to test workload and team responses to simulated events.

  • Learn from all operational failures: share what you learn across teams and throughout the entire organization.
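
As an illustration of the first two principles, the sketch below expresses a small operation as code and makes it reversible: each step carries its own undo action, and a failure reverts the completed steps in reverse order. The `run_reversible` helper, the step names, and the in-memory `state` dictionary are hypothetical; a real procedure would call your own tooling instead.

```python
from typing import Callable

# Hypothetical step shape: (name, apply action, undo action).
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_reversible(steps: list[Step]) -> list[str]:
    """Apply steps in order; on failure, undo the completed ones in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, apply, _ = step
        try:
            apply()
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            for prev_name, _, undo in reversed(done):
                undo()
                log.append(f"reverted {prev_name}")
            return log
        done.append(step)
        log.append(f"applied {name}")
    return log


# Stand-in for real infrastructure state (an assumption for this example).
state = {"replicas": 2, "version": "v1"}


def bad_deploy() -> None:
    raise RuntimeError("health check failed")


log = run_reversible([
    ("scale up", lambda: state.update(replicas=4), lambda: state.update(replicas=2)),
    ("deploy v2", bad_deploy, lambda: None),
])
print("\n".join(log))
```

Because the second step fails, the first step is reverted and `state` ends up unchanged, which is the property that small, reversible changes are meant to give you.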

Operational Excellence Best Practices Areas

  • Organization: understand your organization’s priorities, your organizational structure, and how your organization supports your team members, so that they can support your business outcomes.

    • Priorities: your teams need a shared understanding of the entire workload, their role in it, and the shared business goals in order to set priorities. Review your priorities regularly and update them when necessary.

      1. Evaluate external and internal customer needs involving key stakeholders, including business, development, and operations teams, to determine where to focus efforts. This will ensure that you have a thorough understanding of the operations support required to achieve your desired business outcomes. Use your established priorities to focus your improvement efforts where they will have the greatest impact. This might mean, for example, developing team skills, improving workload performance, reducing costs, automating runbooks, or enhancing monitoring.

      2. Evaluate governance requirements. Governance is the set of policies, rules, or frameworks that a company uses to achieve its business goals. Conformance is the ability to demonstrate that you have implemented governance requirements.

      3. Evaluate compliance requirements. Apply due diligence if no external compliance frameworks are identified. Generate audits or reports that validate compliance. If you advertise that your product meets specific compliance standards, you must have an internal process for ensuring continuous compliance.

      4. Evaluate the threat landscape, including competition, business risk and liabilities, operational risks, and information security threats. Maintain current information in a risk registry.

      5. Evaluate the impact of trade-offs between competing interests or alternative approaches.

      6. Manage benefits and risks. For example, it may be beneficial to deploy a workload with unresolved issues to make significant new features available to customers. It may be possible to mitigate associated risks, or it may become unacceptable for a risk to remain, in which case you will take action to address the risk.

    • Organizational Culture: Provide support for your team members so that they can be more effective in taking action and supporting your business outcomes.

      1. Executive Sponsorship.

      2. Empower team members to take action.

      3. Escalation.

      4. Timely, clear, and actionable communications.

      5. Encourage experimentation.

      6. Promote learning to support new technologies. Cross train to promote knowledge transfer and reduce the risk of significant impact when you lose skilled and experienced team members with institutional knowledge.

      7. Encourage diverse opinions.

  • Prepare: Understand your workloads and their expected behaviors. You can then design them to provide insight into their status and build procedures to support them.

    • Design Telemetry: Design your workload so that it provides the information necessary for you to understand its internal state.

      1. Implement application telemetry as the foundation for observability. Telemetry should provide internal metrics as well as reflect the state of the solution as compared to the business goals (like number of logins, transactions and so on).

      2. Implement dependency telemetry: configure the workload to emit information about the resources it depends on (such as databases, DNS, and network connectivity).

      3. Implement transaction traceability to be able to trace the flow of data within the workload.
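
The three telemetry practices above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommendation of a specific telemetry stack: the `emit`, `check_dependency`, and `handle_login` names, the JSON-lines format, and the field names are all assumptions made for the example.

```python
import json
import time
import uuid


def emit(event: str, trace_id: str, **fields) -> str:
    """Serialize one telemetry record as a JSON line, print it, and return it."""
    record = {"ts": time.time(), "event": event, "trace_id": trace_id, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


def check_dependency(name: str, probe) -> dict:
    """Dependency telemetry: time a probe and report whether it succeeded."""
    start = time.perf_counter()
    try:
        probe()
        ok = True
    except Exception:
        ok = False
    latency_ms = (time.perf_counter() - start) * 1000
    return {"dependency": name, "ok": ok, "latency_ms": latency_ms}


def handle_login(user: str, db_probe) -> list[str]:
    """Handle a hypothetical login request and emit telemetry along the way."""
    # Transaction traceability: one ID ties together every record of this request.
    trace_id = uuid.uuid4().hex
    lines = [emit("login.started", trace_id, user=user)]
    dep = check_dependency("database", db_probe)
    lines.append(emit("dependency.checked", trace_id, **dep))
    # Application telemetry: a business-level outcome (a login) is recorded.
    outcome = "login.succeeded" if dep["ok"] else "login.failed"
    lines.append(emit(outcome, trace_id, user=user))
    return lines


lines = handle_login("alice", db_probe=lambda: None)
```

Filtering the output by `trace_id` reconstructs the path of a single request, while counting `login.succeeded` records gives a business-level metric.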

    • Design for operations: adopt approaches that improve the flow of changes into production and that make refactoring, fast feedback on quality, and bug fixing possible.

      1. Use version control.

      2. Test and validate changes before production. In addition to unit and integration tests, testing should cover infrastructure, configuration, security, and operations procedures.

      3. Use configuration management systems.

      4. Use build and deployment management systems.

      5. Perform automated patch management.

      6. Document and share design standards and best practices across teams.

      7. Implement practices to improve code quality and minimize defects, such as TDD, code reviews, and standards adoption.

      8. Use multiple environments, such as dev, staging, production.

      9. Make frequent, small, and reversible changes to reduce the scope and impact of each change.

      10. Fully automate integration and deployment.

    • Mitigate deployment risks

      1. Plan for unsuccessful changes: revert to a known good state, or remediate in the production environment if a change does not have the desired outcome. This preparation reduces recovery time through faster responses.

      2. Test and validate changes.

      3. Use deployment management systems.

      4. Test using limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. E.g.: canary or one-box deployments. See Amazon Builder’s Library - Automating safe, hands-off deployments.

      5. Deploy using parallel environments: implement changes onto parallel environments, and then transition over to the new environment. Maintain the prior environment until there is confirmation of successful deployment.

      6. Deploy frequent, small, reversible changes to reduce the scope of a change.

      7. Fully automate integration and deployment.

      8. Automate testing and rollback.
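
A limited (canary) deployment with automated rollback comes down to comparing the canary against the existing fleet and acting on the result. The sketch below shows one possible decision rule; the `Stats` type, the `canary_decision` function, and both thresholds are hypothetical values chosen for illustration, not figures from the framework.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Stats, canary: Stats,
                    max_extra_error_rate: float = 0.01,
                    min_requests: int = 100) -> str:
    """Return 'wait', 'rollback', or 'promote' for a canary deployment."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge the change
    if canary.error_rate - baseline.error_rate > max_extra_error_rate:
        return "rollback"  # canary is clearly worse than the existing fleet
    return "promote"


print(canary_decision(Stats(10_000, 50), Stats(500, 3)))   # promote
print(canary_decision(Stats(10_000, 50), Stats(500, 25)))  # rollback
print(canary_decision(Stats(10_000, 50), Stats(20, 0)))    # wait
```

A real pipeline would typically compare more signals (latency, saturation, business KPIs) and would keep the prior environment available until the decision is `promote`.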

    • Operational readiness and change management: Evaluate the operational readiness of your workload, processes, procedures, and personnel to understand the operational risks related to your workload. Manage the flow of change into your environments. You should use a consistent process, including manual or automated checklists, to know when you are ready to go live with your workload or a change.

      1. Ensure personnel capability: confirm that you have an adequate number of personnel trained on the platform and services that support your workload.

      2. Consistently review operational readiness: use Operational Readiness Reviews (ORRs) to validate that you can operate your workload.

      3. Use runbooks to perform procedures.

      4. Use playbooks to investigate issues.

      5. Make informed decisions to deploy systems and changes, and perform pre-mortem exercises.

      6. Facilitate support plans for production workloads: make sure to support any software and services that your production workload relies on. Select an appropriate support level to meet your production service-level needs. Support plans for these dependencies are necessary in case of service disruptions or software issues. Document support plans and how to request support for all service and software vendors. Implement mechanisms that verify that support points of contact are kept up to date.
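
An automated go-live checklist, as mentioned above, can be as simple as a list of named checks where any failed blocking check stops the release. The following sketch assumes a hypothetical `Check` type and example check names; the actual items would come from your own Operational Readiness Review.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    passed: bool
    blocking: bool = True  # non-blocking checks are advisory only


def readiness(checks: list[Check]) -> tuple[bool, list[str]]:
    """Return (ready, names of failed blocking checks)."""
    blockers = [c.name for c in checks if c.blocking and not c.passed]
    return (not blockers, blockers)


# Example checklist; the items and their results are made up for illustration.
checks = [
    Check("on-call staff trained on the platform", passed=True),
    Check("runbooks exist for routine procedures", passed=True),
    Check("playbooks exist for known failure modes", passed=False),
    Check("support plan contacts verified", passed=True),
    Check("dashboards reviewed", passed=False, blocking=False),
]

ready, blockers = readiness(checks)
print("go" if ready else f"no-go: {blockers}")
```

Keeping the checklist in code means the same criteria are applied to every change, and the reasons for a no-go decision are recorded explicitly.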

  • Operate

    • Understanding workload health: Define, capture, and analyze workload metrics to gain visibility into workload events so that you can take appropriate action. Your team should be able to understand the health of your workload easily. Use metrics based on workload outcomes to gain useful insights, and use them to build dashboards with business and technical viewpoints that help team members make informed decisions.
      Workload health is measured by the achievement of business outcomes or KPIs and the state of workload components and applications.

      1. Identify Key Performance Indicators (KPIs) based on desired business outcomes and customer outcomes, to determine workload success. E.g.: abandoned shopping cart, orders placed, cost, price and allocated workload expense. Adjust workload metrics over time as business needs change.

      2. Define, collect, and analyze workload metrics. Perform regular, proactive reviews of these metrics to identify trends, determine whether a response is necessary, and validate the achievement of business outcomes.

      3. Establish workload metrics baselines so that you can identify under-performing and over-performing applications and components. This adds to your ability to mitigate issues before they become incidents.

      4. Learn expected patterns of activity for the workload so that you can respond appropriately if required.

      5. Raise an alert when workload outcomes are at risk or anomalies are present.

      6. Validate the achievement of outcomes and the effectiveness of KPIs and metrics.
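
Establishing a baseline and alerting on anomalies can be illustrated with a simple statistical rule. The sketch below flags a value that lies more than `k` standard deviations from the baseline mean; this three-sigma rule, the function names, and the sample data are assumptions made for the example, and production monitoring tools usually apply more robust methods.

```python
from statistics import mean, stdev


def baseline(samples: list[float]) -> tuple[float, float]:
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomaly(value: float, base_mean: float, base_std: float, k: float = 3.0) -> bool:
    """Flag a value more than k standard deviations away from the baseline."""
    if base_std == 0:
        return value != base_mean
    return abs(value - base_mean) > k * base_std


# Hypothetical KPI: orders placed per minute during normal operation.
orders_per_minute = [100, 104, 98, 101, 97, 103, 99, 102]
mu, sigma = baseline(orders_per_minute)

for observed in (102, 60):
    status = "ALERT" if is_anomaly(observed, mu, sigma) else "ok"
    print(f"{observed} orders/min: {status}")
```

With this data the baseline mean is 100.5, so an observation of 102 is treated as normal while a drop to 60 raises an alert.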

    • Understanding operational health: Define, capture, and analyze operations metrics to gain visibility into operations events so that you can take appropriate action.

      1. Identify KPIs.

      2. Define operations metrics such as Mean Time To Detect (MTTD) and Mean Time To Recovery (MTTR).

      3. Collect and analyze operations metrics.

      4. Establish baselines.

      5. Learn expected patterns of activity for operations.

      6. Alert when operations outcomes are at risk.

      7. Alert when operations anomalies are detected.

      8. Validate the achievement of outcomes and the effectiveness of KPIs and metrics.
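
The operations metrics named above can be computed directly from incident timestamps. In this sketch MTTD is the mean time from the start of a failure to its detection, and MTTR is the mean time from the start of a failure to recovery; definitions of MTTR vary (some teams measure from detection), so treat that choice, the `Incident` type, and the sample data as assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    return mean_delta([i.resolved - i.started for i in incidents])


# Two made-up incidents for illustration.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=65)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Tracking these values over time gives the baseline against which later alerts and improvements can be judged.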

    • Responding to events: anticipate operational events, both planned ones (such as sales and promotions) and unplanned ones (such as failures).
      Events: things that occur in your workload but may not need intervention.
      Incidents: events that require intervention.
      Problems: recurring events that require intervention or cannot be resolved.

      1. Define processes for events, incidents, and problems to mitigate their impact on your business and to make sure that you respond appropriately.

      2. Prioritize operational events based on business impact.

      3. Define escalation paths in your runbooks and playbooks, including what sets off escalation, and procedures for escalation.

      4. Define and test a communication plan for system outages for both clients and stakeholders.

      5. Communicate status through dashboards.

      6. Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses.
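
The distinction between events, incidents, and problems, together with prioritization by business impact, can be sketched as a small triage routine. The `OperationalEvent` fields, the 1 to 5 impact scale, and the simplified rule that a problem is a recurring item needing intervention are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class OperationalEvent:
    name: str
    needs_intervention: bool
    recurring: bool
    business_impact: int  # hypothetical scale: 1 (low) to 5 (critical)


def classify(e: OperationalEvent) -> str:
    """Simplified classification into 'event', 'incident', or 'problem'."""
    if e.needs_intervention and e.recurring:
        return "problem"
    if e.needs_intervention:
        return "incident"
    return "event"


def triage(events: list[OperationalEvent]) -> list[OperationalEvent]:
    """Keep items needing intervention, ordered by business impact (highest first)."""
    actionable = [e for e in events if classify(e) != "event"]
    return sorted(actionable, key=lambda e: e.business_impact, reverse=True)


# Made-up queue for illustration.
queue = [
    OperationalEvent("nightly batch finished", False, False, 1),
    OperationalEvent("checkout latency spike", True, False, 5),
    OperationalEvent("weekly disk-full alarm", True, True, 3),
]

for item in triage(queue):
    print(classify(item), "-", item.name)
```

An automated response would hook into the same classification, for example paging the on-call engineer for incidents and opening a tracked work item for problems.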

  • Evolve: Evolution is the continuous cycle of improvement over time. Implement frequent small incremental changes based on the lessons learned from your operations activities and evaluate their success at bringing about improvement.

    • Learn, share, improve

      1. Have a process for continuous improvement: Evaluate your workload against internal and external architecture best practices. Conduct workload reviews at least once per year. Prioritize improvement opportunities into your software development cadence.

      2. Perform post-incident analysis.

      3. Implement feedback loops.

      4. Perform knowledge management: this helps team members find the information they need to perform their jobs.

      5. Identify drivers for improvement.

      6. Validate insights. Review your analysis results and responses with cross-functional teams and business owners.

      7. Regularly perform retrospective analysis of operations metrics with cross-team participants from different areas of the business.

      8. Document and share lessons learned from the operations activities.

      9. Allocate time and resources to make continuous and incremental improvements possible.

Pillar: Security

Pillar: Reliability

Pillar: Performance Efficiency

Pillar: Cost Optimization

Pillar: Sustainability