An architect's job is not easy

I took a course called Reliable Google Cloud Infrastructure: Design and Process. Although it is taught on Google Cloud Platform, most of its content is not GCP-specific and applies to other cloud platforms as well. The course is about how to do an architect's job well, covering architecture, design, and process.

 

Introduction

Requirements Gathering Process

Rough requirements --> structured design --> measurable objectives

Microservices

is an architectural style for developing applications. It decomposes a large application into independent constituent parts, where each part has its own area of responsibility. To serve a single user or API request, such an application may call many internal microservices to compose its response. The key advantage is agility: faster development, deployment, and monitoring, plus the ability for multiple teams to work independently and deliver to production at their own cadence.

Measurement

To identify the most important requirements, architects measure their impact with Key Performance Indicators (KPIs), defined using the SMART criteria (Specific, Measurable, Achievable, Relevant, and Time-bound), and from these derive suitable Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs). Example:
  • SLI: The latency of successful HTTP responses (HTTP-200).
  • SLO: The latency of 99% of the responses must be ≤ 200 ms.
  • SLA: The user is compensated if 99th percentile latency exceeds 300 ms.
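
To make the SLI/SLO pair concrete, here is a minimal sketch that computes the 99th-percentile latency of successful responses from a list of samples and checks it against the 200 ms objective. The sample values and the nearest-rank percentile method are illustrative assumptions, not from the course.

import math

def percentile(values, pct):
    # Nearest-rank percentile of a list of numbers.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# SLI: latencies (ms) of successful HTTP-200 responses (hypothetical data).
samples_ms = [120, 95, 180, 210, 150, 175, 130, 160, 140, 110]

p99 = percentile(samples_ms, 99)
slo_ms = 200  # SLO: 99% of responses within 200 ms
print(f"p99 latency = {p99} ms, SLO met: {p99 <= slo_ms}")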

    User Story

    Qualitative requirements define systems from the user's point of view, by asking Who, What, Why, When, and How (how does it work? how many users? how much data?). This leads to user stories. A user story is formatted as: As a ____, I want to ____, so that I can ____. A user story should be evaluated against the Independent, Negotiable, Valuable, Estimable, Small, and Testable (INVEST) criteria.

    Microservices

    Monolithic

    applications implement all features in a single code base, with one database for all the data.

    Microservice

    applications have multiple code bases, and each of them manages its own data.

    PROs

  • Easier to develop and maintain
  • Reduced risk when deploying new versions
  • Services scale independently to optimize use of infrastructure
  • Faster to innovate and add new features
  • Can use different languages and frameworks for different services
  • Choose the runtime appropriate to each service
  • Define strong contracts between the various microservices
  • Allow for independent deployment cycles, including rollback
  • Facilitate concurrent, A/B release testing on subsystems
  • Minimize test automation and quality assurance overhead
  • Improve clarity of logging and monitoring
  • Provide fine-grained cost accounting
  • Increase overall application scalability and reliability through scaling smaller units

    CONs

  • Increased complexity when communicating between services
  • Increased latency across service boundaries
  • Concerns about securing inter-service traffic
  • Multiple deployments
  • Need to ensure that you don’t break clients as versions change
  • Must maintain backward compatibility with clients as the microservice evolves
  • It can be difficult to define clear boundaries between services to support independent development and deployment
  • Increased complexity of infrastructure, with distributed services having more points of failure
  • The increased latency introduced by network services and the need to build in resilience to handle possible failures and delays
  • Due to the networking involved, there is a need to provide security for service-to-service communication, which increases complexity of infrastructure
  • Strong requirement to manage and version service interfaces; with independently deployable services, the need to maintain backward compatibility increases

    The key to architecting microservice applications is to recognize service boundaries.

    Stateful vs Stateless

    Stateful services manage stored data over time. They are harder to scale, harder to upgrade, and need to be backed up. Stateless services get their data from the environment or from other (stateful) services. They are easy to scale by adding instances, easy to migrate to new versions, and easy to administer.

    A general solution for large-scale cloud-based systems

    The Twelve Factor App

    The twelve-factor app is a set of best practices for building software-as-a-service (SaaS) applications. It helps you maximize portability, deploy to the cloud, enable continuous deployment, and scale easily.

    Codebase:

    One codebase tracked in revision control, many deploys
  • Use a version control system like Git.
  • Each app has one code repo and vice versa.

    Dependencies:

    Explicitly declare and isolate dependencies
  • Use a package manager like Maven, pip, or npm to install dependencies.
  • Declare dependencies in your code base.

    Config:

    Store config in the environment
  • Don't put secrets, connection strings, endpoints, etc., in source code.
  • Store those as environment variables.
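
    A minimal sketch of this factor in Python; the variable names (DATABASE_URL, API_KEY, DEBUG) are hypothetical placeholders rather than anything mandated by the course.

import os

# Configuration comes from the environment, not from source code.
DATABASE_URL = os.environ["DATABASE_URL"]            # required: fail fast if missing
API_KEY = os.environ.get("API_KEY", "")              # optional, with a default
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"

print(f"Connecting to {DATABASE_URL} (debug={DEBUG})")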

    Backing services:

    Treat backing services as attached resources
  • Databases, caches, queues, and other services are accessed via URLs.
  • Should be easy to swap one implementation for another

    Build, release, run:

    Strictly separate build and run stages
  • Build creates a deployment package from the source code.
  • Release combines the deployment package with the configuration of the runtime environment.
  • Run executes the application.

    Processes:

    Execute the app as one or more stateless processes
  • Apps run in one or more processes.
  • Each instance of the app gets its data from a separate database service.

    Port binding:

    Export services via port binding
  • Apps are self-contained and expose a port and protocol internally.
  • Apps are not injected into a separate server like Apache.
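
    A minimal sketch of port binding using only the Python standard library; the PORT variable and the 8080 default are illustrative assumptions.

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The app serves HTTP itself; it is not injected into an external server.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello\n")

port = int(os.environ.get("PORT", "8080"))
HTTPServer(("", port), Handler).serve_forever()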

    Concurrency:

    Scale out via the process model
  • Because apps are self-contained and run in separate processes, they scale easily by adding instances.

    Disposability:

    Maximize robustness with fast startup and graceful shutdown
  • App instances should scale quickly when needed.
  • If an instance is not needed, you should be able to turn it off with no side effects.
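
    A minimal sketch of graceful shutdown in Python: the process starts quickly and exits cleanly when the platform sends SIGTERM. The cleanup steps and the work loop are illustrative placeholders.

import signal
import sys
import time

def shutdown(signum, frame):
    # Finish in-flight work and close connections here, then exit cleanly.
    print("SIGTERM received, shutting down gracefully")
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)

while True:
    # Placeholder for the real request-handling loop.
    time.sleep(1)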

    Dev/prod parity:

    Keep development, staging, and production as similar as possible
  • Container systems like Docker make this easier.
  • Leverage infrastructure as code to make environments easy to create.

    Logs:

    Treat logs as event streams
  • Write log messages to standard output and aggregate all logs to a single source.
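
    A minimal sketch of treating logs as an event stream: messages go to standard output and the platform is left to collect and aggregate them. The logger name is an illustrative placeholder.

import logging
import sys

# Log to stdout; no log files are opened or rotated by the app itself.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

log = logging.getLogger("orders")
log.info("order %s created", "1234")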

    Admin processes:

    Run admin/management tasks as one-off processes
  • Admin tasks should be repeatable processes, not one-off manual tasks.
  • Admin tasks shouldn't be a part of the application.

    REST

    REST stands for REpresentational State Transfer. Service endpoints supporting REST are called RESTful.

    A good microservice design is loosely coupled.

  • Clients should not need to know too many details of services they use
  • Services communicate via HTTPS using text-based payloads
  • Services should add functionality without breaking existing clients

    A resource is an abstract notion of information; a representation is a copy of the resource's information. Representations can be single items or collections of items.

    Request Methods

  • GET is used to retrieve data
  • POST is used to create data; the server generates the entity ID and returns it to the client
  • PUT is used to create data or alter existing data; the entity ID must be known
  • PUT should be idempotent, which means that whether the request is made once or multiple times, the effects on the data are exactly the same
  • DELETE is used to remove data
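
    To make these method semantics concrete, here is a minimal sketch of a RESTful resource. Flask is an assumption (any HTTP framework would do), and the /v1/books resource, its fields, and the in-memory store are hypothetical.

from flask import Flask, jsonify, request

app = Flask(__name__)
books = {}      # in-memory store keyed by a server-generated ID
next_id = 1

@app.get("/v1/books")                   # GET retrieves a collection
def list_books():
    return jsonify(list(books.values()))

@app.post("/v1/books")                  # POST creates data; the server generates the ID
def create_book():
    global next_id
    book = {"id": next_id, **(request.get_json() or {})}
    books[next_id] = book
    next_id += 1
    return jsonify(book), 201

@app.put("/v1/books/<int:book_id>")     # PUT creates or replaces at a known ID; idempotent
def put_book(book_id):
    books[book_id] = {"id": book_id, **(request.get_json() or {})}
    return jsonify(books[book_id])

@app.delete("/v1/books/<int:book_id>")  # DELETE removes data
def delete_book(book_id):
    books.pop(book_id, None)
    return "", 204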

    Response Status

  • 2xx codes for success
  • 4xx codes for client errors
  • 5xx codes for server errors

    URI

  • Plural nouns for sets (collections)
  • Singular nouns for individual resources
  • Strive for consistent naming
  • URI paths are case-sensitive, so use lowercase consistently
  • Don’t use verbs to identify a resource
  • Include version information

    OpenAPI

  • A standard interface description format for REST APIs; language-agnostic and open source (based on Swagger)
  • Allows tools and humans to understand how to use a service without needing its source code
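
    For illustration, a minimal sketch of what an OpenAPI 3 description contains, written as a Python dictionary and dumped to JSON (real specs are usually authored in YAML or JSON). The /v1/books path reuses the hypothetical resource from the REST sketch above.

import json

spec = {
    "openapi": "3.0.3",
    "info": {"title": "Books API (hypothetical)", "version": "1.0.0"},
    "paths": {
        "/v1/books": {
            "get": {
                "summary": "List books",
                "responses": {"200": {"description": "A list of books"}},
            }
        }
    },
}

print(json.dumps(spec, indent=2))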

    DevOps

    Continuous integration pipelines automate building applications: developers check in code, unit tests run, a deployment package is built, and the package is deployed.

    Cloud Mindset

  • Rent machines rather than buy.
  • Turn machines off as soon as possible rather than keep machines running for years.
  • Prefer lots of small machines rather than fewer big machines.
  • Machines are monthly operating expenses rather than capital expenditures.

    All infrastructure in the cloud needs to be disposable

  • Don’t fix broken machines.
  • Don’t install patches.
  • Don’t upgrade machines.
  • If you need to fix a machine, delete it and re-create a new one.
  • To make infrastructure disposable, automate everything with code:
  • Can automate using scripts.
  • Can use declarative tools to define infrastructure.

    Infrastructure as code (IaC)

    allows for quick provisioning and removal of infrastructure.
  • Build an infrastructure when needed.
  • Destroy the infrastructure when not in use.
  • Create identical infrastructures for dev, test, and prod.
  • Can be part of a CI/CD pipeline.
  • Templates are the building blocks for disaster recovery procedures.
  • Manage resource dependencies and complexity

    Performance Metrics

  • Availability: the percentage of time a system is running and able to process requests. It is achieved with fault tolerance: create backup systems, use health checks, and use white-box metrics to count real traffic success and failure.
  • Durability: the odds of losing data because of a hardware or system failure. It is achieved by replicating data in multiple zones, doing regular backups, and practicing restoring from backups.
  • Scalability: the ability of a system to continue to work as user load and data grow. Monitor usage, and use capacity autoscaling to add and remove servers in response to changes in load.
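
    As a rough worked example of an availability target: 99.9% availability allows about 43 minutes of downtime in a 30-day month (0.001 × 30 × 24 × 60 ≈ 43), while 99.99% allows only about 4 minutes.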

    Avoid single points of failure

  • Define your unit of deployment
  • N+2: plan to have one unit out for upgrade or testing and still survive another unit failing
  • Make sure that each unit can handle the extra load
  • Don’t make any single unit too large
  • Try to make units interchangeable stateless clones

    Beware of correlated failures:

    Correlated failures occur when related items fail at the same time.
  • If a single machine fails, all requests served by that machine fail.
  • If a top-of-rack switch fails, the entire rack fails.
  • If a zone or region is lost, all the resources in it fail.
  • Servers running the same software suffer from the same issues.
  • If a global configuration system fails, and multiple systems depend on it, they potentially fail too.
  • The group of related items that could fail together is a failure domain.
  • To avoid correlated failures: Decouple servers and use microservices distributed among multiple failure domains.
  • Divide business logic into services based on failure domains.
  • Deploy to multiple zones and/or regions.
  • Split responsibility into components and spread over multiple processes.
  • Design independent, loosely coupled but collaborating services.

    Beware of cascading failures:

    Cascading failures occur when one system fails, causing others to be overloaded, such as a message queue becoming overloaded because of a failing backend.
  • To avoid cascading failures:
  • Use health checks in Compute Engine, or readiness and liveness probes in Kubernetes, to detect and then repair unhealthy instances.
  • Ensure that new server instances start fast and ideally don't rely on other backends/systems to start up.
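
    A minimal sketch of separate liveness and readiness endpoints that a health checker or probe could call. Flask, the /healthz and /readyz paths, and the dependency check are illustrative assumptions.

from flask import Flask

app = Flask(__name__)

def backend_reachable():
    # Placeholder for a real dependency check (database ping, queue connection, etc.).
    return True

@app.get("/healthz")    # liveness: is the process itself alive?
def healthz():
    return "ok", 200

@app.get("/readyz")     # readiness: can this instance serve traffic right now?
def readyz():
    return ("ready", 200) if backend_reachable() else ("not ready", 503)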

    Query of death overload

  • Problem: Business logic error shows up as overconsumption of resources, and the service overloads.
  • Solution: Monitor query performance. Ensure that notification of these issues gets back to the developers.

    Positive feedback cycle overload failure

  • Problem: You try to make the system more reliable by adding retries, and instead you create the potential for an overload.
  • Solution: Prevent overload by carefully considering overload conditions whenever you are trying to improve reliability with feedback mechanisms to invoke retries.
  • Use the truncated exponential backoff pattern to avoid positive feedback overload at the client (see the sketch after this list)
  • Use the circuit breaker pattern to protect the service from too many retries
  • Use lazy deletion to reliably recover when users delete data by mistake
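
    A minimal sketch of truncated exponential backoff with jitter; the attempt limit, base delay, cap, and the call being retried are illustrative assumptions (the circuit breaker pattern is not shown here).

import random
import time

def call_with_backoff(operation, max_attempts=5, base=0.5, cap=32.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # The delay grows exponentially but is truncated at `cap`;
            # random jitter keeps clients from retrying in lockstep.
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Usage (hypothetical flaky call):
# call_with_backoff(lambda: fetch_order("1234"))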

    When disaster planning, brainstorm scenarios that might cause data loss and/or service failure:
  • What could happen that would cause a failure?
  • What is the Recovery Point Objective (the amount of data it is acceptable to lose)?
  • What is the Recovery Time Objective (the amount of time it can take to be back up and running)?

    Based on your disaster scenarios, formulate a plan to recover:
  • Devise a backup strategy based on risk and recovery point and time objectives.
  • Communicate the procedure for recovering from failures.
  • Test and validate the procedure for recovering from failures regularly.
  • Ideally, recovery becomes a streamlined process, part of daily operations.

    Security

  • Principle of least privilege
  • Separation of duties
  • No one person is in charge of designing, implementing, and reporting on sensitive systems
  • Prevention of conflict of interest
  • The detection of control failures, for example security breaches or information theft

    In a microservice architecture, be careful not to break clients when services are updated:
  • Rolling updates allow you to deploy new versions with no downtime
  • Use a blue/green deployment when you don’t want multiple versions of a service running simultaneously
  • Canary releases can be used prior to a rolling update to reduce the risk
