An architect's job is not easy
I took a course called Reliable Google Cloud Infrastructure: Design and Process. Although it is a Google course, most of what it teaches is not specific to Google Cloud Platform and applies to other cloud platforms as well. The course is about how to do an architect's job well, covering architecture, design, and process.
SLI: The latency of successful HTTP responses (HTTP-200).
SLO: The latency of 99% of the responses must be ≤ 200 ms.
SLA: The user is compensated if 99th percentile latency exceeds 300 ms.
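To make the three definitions concrete, here is a minimal sketch (illustrative numbers and names, not from the course) that checks the example SLO against a window of observed latencies:

```python
# Minimal sketch: check the example SLO (99% of responses <= 200 ms)
# against a window of observed latencies for successful requests.

def slo_met(latencies_ms, percentile=99, threshold_ms=200):
    """Return True if the given percentile of latencies is within the threshold."""
    ordered = sorted(latencies_ms)
    # Nearest-rank index for the requested percentile.
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    return ordered[idx] <= threshold_ms

window = [120, 130, 140, 145, 150, 160, 175, 180, 190, 210]
print(slo_met(window))  # False: with 10 samples the 99th percentile is the worst one, 210 ms
```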
Easier to develop and maintain
Reduced risk when deploying new versions
Services scale independently to optimize use of infrastructure
Faster to innovate and add new features
Can use different languages and frameworks for different services
Choose the runtime appropriate to each service
Define strong contracts between the various microservices
Allow for independent deployment cycles, including rollback
Facilitate concurrent, A/B release testing on subsystems
Minimize test automation and quality assurance overhead
Improve clarity of logging and monitoring
Provide fine-grained cost accounting
Increase overall application scalability and reliability through scaling smaller units
Increased complexity when communicating between services
Increased latency across service boundaries
Concerns about securing inter-service traffic
Multiple deployments
Need to ensure that you don’t break clients as versions change
Must maintain backward compatibility with existing clients as the microservice evolves
It can be difficult to define clear boundaries between services to support independent development and deployment
Increased complexity of infrastructure, with distributed services having more points of failure
The increased latency introduced by network services and the need to build in resilience to handle possible failures and delays
Due to the networking involved, there is a need to provide security for service-to-service communication, which increases complexity of infrastructure
Strong requirement to manage and version service interfaces. With independently deployable services, the need to maintain backward compatibility increases.
The key to architecting microservice applications is to recognize service boundaries
Use a version control system like Git.
Each app has one code repo and vice versa.
Use a package manager like Maven, pip, or npm to install dependencies.
Declare dependencies in your code base.
Don't put secrets, connection strings, endpoints, etc., in source code.
Store those as environment variables.
Databases, caches, queues, and other services are accessed via URLs.
It should be easy to swap one implementation for another.
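A minimal sketch of these two factors together (the variable names are illustrative): configuration, including backing-service URLs, comes from the environment, so swapping one database or cache for another is a config change rather than a code change.

```python
import os

# Connection details come from the environment, never from source code.
# DATABASE_URL and CACHE_URL are illustrative names, e.g.
#   postgres://user:pass@db.internal:5432/orders
# Swapping the database (say, a local container in dev for a managed
# service in prod) means changing the environment, not the code.
DATABASE_URL = os.environ["DATABASE_URL"]
CACHE_URL = os.environ.get("CACHE_URL", "redis://localhost:6379")
```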
Build creates a deployment package from the source code.
Release combines the deployment with configuration in the runtime environment.
Run executes the application.
Apps run in one or more processes.
Each instance of the app gets its data from a separate database service.
Apps are self-contained and expose a port and protocol internally.
Apps are not injected into a separate server like Apache.
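For example, a minimal self-contained server using only the standard library (the PORT variable is an assumption for illustration): the app brings its own HTTP server and binds its own port instead of being deployed into an external one.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expose a simple HTTP interface on the bound port.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

# The app binds its own port (taken from the environment) rather than
# being injected into a separate server like Apache.
port = int(os.environ.get("PORT", "8080"))
HTTPServer(("", port), Handler).serve_forever()
```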
Because apps are self-contained and run in separate processes, they scale easily by adding instances.
App instances should scale quickly when needed.
If an instance is not needed, you should be able to turn it off with no side effects.
Container systems like Docker make this easier.
Leverage infrastructure as code to make environments easy to create.
Write log messages to standard output and aggregate all logs to a single source.
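As a minimal sketch, the app only writes to stdout and leaves routing and aggregation to the environment (Cloud Logging, a sidecar, etc.):

```python
import logging
import sys

# Log to stdout; the runtime environment aggregates the stream to a single source.
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logging.info("request handled")
```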
Admin tasks should be repeatable processes, not one-off manual tasks.
Admin tasks shouldn't be a part of the application.
Clients should not need to know too many details of services they use
Services communicate via HTTPS using text-based payloads
Services should add functionality without breaking existing clients
A resource is an abstract notion of information; a representation is a copy of the resource's information. Representations can be single items or collections of items.
GET is used to retrieve data
POST is used to create data
Generates entity ID and returns it to the client
PUT is used to create data or alter existing data
Entity ID must be known
PUT should be idempotent, which means that whether the request is made once or multiple times, the effects on the data are exactly the same
DELETE is used to remove data
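A minimal sketch of these verb semantics over an in-memory store (all names here are illustrative): repeating the PUT leaves the data unchanged, while each POST creates a new entity.

```python
import uuid

store = {}

def post(data):
    """POST: create a new entity; the server generates and returns the ID."""
    entity_id = str(uuid.uuid4())
    store[entity_id] = data
    return entity_id

def put(entity_id, data):
    """PUT: create the entity or replace it; idempotent, the ID must be known."""
    store[entity_id] = data

def get(entity_id):
    """GET: retrieve a representation of the entity."""
    return store.get(entity_id)

def delete(entity_id):
    """DELETE: remove the entity."""
    store.pop(entity_id, None)

cid = post({"name": "Ada"})    # a new ID is generated on every POST
put(cid, {"name": "Ada L."})   # repeating this PUT has no further effect
put(cid, {"name": "Ada L."})
assert get(cid) == {"name": "Ada L."}
```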
200 codes for success
400 codes for client errors
500 codes for server errors
Plural nouns for sets (collections)
Singular nouns for individual resources
Strive for consistent naming
Prefer lowercase: URI schemes and hosts are case-insensitive, but paths are case-sensitive
Don’t use verbs to identify a resource
Include version information
OpenAPI: a standard interface description format for REST APIs. Language-agnostic and open source (based on Swagger)
Allows tools and humans to understand how to use a service without needing its source code
Rent machines rather than buy.
Turn machines off as soon as possible rather than keep machines running for years.
Prefer lots of small machines rather than fewer big machines.
Treat machines as monthly operating expenses rather than capital expenditures.
Don’t fix broken machines.
Don’t install patches.
Don’t upgrade machines.
If you need to fix a machine, delete it and create a new one.
To make infrastructure disposable, automate everything with code:
Can automate using scripts.
Can use declarative tools to define infrastructure.
Build an infrastructure when needed.
Destroy the infrastructure when not in use.
Create identical infrastructures for dev, test, and prod.
Can be part of a CI/CD pipeline.
Templates are the building blocks for disaster recovery procedures.
Manage resource dependencies and complexity
Availability: The percent of time a system is running and able to process requests
Achieved with fault tolerance.
Create backup systems.
Use health checks.
Use white-box metrics to count real traffic success and failure.
Durability: The odds of losing data because of a hardware or system failure
Achieved by replicating data in multiple zones.
Do regular backups.
Practice restoring from backups.
Scalability: The ability of a system to continue to work as user load and data grow
Monitor usage.
Use capacity autoscaling to add and remove servers in response to changes in load
Define your unit of deployment
N+2: Plan to have one unit out for upgrade or testing and survive another failing (see the worked example after this list)
Make sure that each unit can handle the extra load
Don’t make any single unit too large
Try to make units interchangeable stateless clones
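A worked example of the N+2 rule with illustrative numbers (the load and capacity figures are assumptions, not from the course):

```python
import math

# Illustrative numbers: peak load and per-unit capacity are assumptions.
peak_rps = 4000           # peak requests per second
unit_capacity_rps = 1000  # what one deployment unit can handle

n = math.ceil(peak_rps / unit_capacity_rps)  # units needed at peak: 4
deployed = n + 2  # one out for upgrade/testing, one spare for failure: 6

# With one unit upgrading and one failed, n units remain -- still enough.
assert (deployed - 2) * unit_capacity_rps >= peak_rps
```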
If a single machine fails, all requests served by that machine fail.
If a top-of-rack switch fails, the entire rack fails.
If a zone or region is lost, all the resources in it fail.
Servers running the same software suffer from the same issues.
If a global configuration system fails, and multiple systems depend on it, they potentially fail too.
The group of related items that could fail together is a failure domain.
To avoid correlated failures: Decouple servers and use microservices distributed among multiple failure domains.
Divide business logic into services based on failure domains.
Deploy to multiple zones and/or regions.
Split responsibility into components and spread over multiple processes.
Design independent, loosely coupled but collaborating services.
To avoid cascading failures:
Use health checks in Compute Engine or readiness and liveness probes in Kubernetes to detect and then repair unhealthy instances.
Ensure that new server instances start fast and ideally don't rely on other backends/systems to start up.
Problem: Business logic error shows up as overconsumption of resources, and the service overloads.
Solution: Monitor query performance. Ensure that notification of these issues gets back to the developers.
Problem: You try to make the system more reliable by adding retries, and instead you create the potential for an overload.
Solution: Prevent overload by carefully considering overload conditions whenever you are trying to improve reliability with feedback mechanisms to invoke retries.
Use the truncated exponential backoff pattern to avoid positive feedback overload at the client (see the sketch after this list)
Use the circuit breaker pattern to protect the service from too many retries
Use lazy deletion to reliably recover when users delete data by mistake
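A minimal sketch of the first two patterns (the parameter names and limits are illustrative, not from any particular library): retries wait roughly min(cap, base × 2^attempt) with random jitter, and a circuit breaker fails fast after repeated failures instead of piling on more retries.

```python
import random
import time

def call_with_backoff(func, max_attempts=5, base_s=0.5, cap_s=32.0):
    """Truncated exponential backoff: wait up to min(cap, base * 2**attempt)
    between retries, with jitter so clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast for `cooldown_s`
    seconds instead of hammering the struggling service with more calls."""
    def __init__(self, threshold=3, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown over; allow a trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success closes the circuit
        return result
```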
What could happen that would cause a failure?
What is the Recovery Point Objective (the amount of data it would be acceptable to lose)?
What is the Recovery Time Objective (the amount of time it can take to be back up and running)?
Based on your disaster scenarios, formulate a plan to recover
Devise a backup strategy based on risk and recovery point and time objectives.
Communicate the procedure for recovering from failures.
Test and validate the procedure for recovering from failures regularly.
Ideally, recovery becomes a streamlined process, part of daily operations.
Principle of least privilege
Separation of duties
No one person is in charge of designing, implementing, and reporting on sensitive systems
Prevention of conflict of interest
Detection of control failures, for example security breaches or information theft
In a microservice architecture, be careful not to break clients when services are updated
Rolling updates allow you to deploy new versions with no downtime
Use a blue/green deployment when you don’t want multiple versions of a service running simultaneously
Canary releases can be used prior to a rolling update to reduce the risk