An architect's job is not easy
I took a course called Reliable Google Cloud Infrastructure: Design and Process. Although it is a Google course, most of what it teaches is not specific to Google Cloud Platform and applies to other cloud platforms as well. The course is about how to do an architect's job well, covering architecture, design, and process.
SLI: The latency of successful HTTP responses (HTTP-200).
SLO: The latency of 99% of the responses must be ≤ 200 ms.
SLA: The user is compensated if 99th percentile latency exceeds 300 ms.
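To make the three definitions concrete, here is a minimal sketch (illustrative numbers and names, not from the course) that checks the example SLO against a window of observed latencies:

```python
# Minimal sketch: check the example SLO (99% of responses <= 200 ms)
# against a window of observed latencies for successful requests.

def slo_met(latencies_ms, percentile=99, threshold_ms=200):
    """Return True if the given percentile of latencies is within the threshold."""
    ordered = sorted(latencies_ms)
    # Nearest-rank index for the requested percentile.
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    return ordered[idx] <= threshold_ms

window = [120, 130, 140, 145, 150, 160, 175, 180, 190, 210]
print(slo_met(window))  # False: with 10 samples the 99th percentile is the worst one, 210 ms
```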
Easier to develop and maintain
Reduced risk when deploying new versions
Services scale independently to optimize use of infrastructure
Faster to innovate and add new features
Can use different languages and frameworks for different services
Choose the runtime appropriate to each service
Define strong contracts between the various microservices
Allow for independent deployment cycles, including rollback
Facilitate concurrent, A/B release testing on subsystems
Minimize test automation and quality assurance overhead
Improve clarity of logging and monitoring
Provide fine-grained cost accounting
Increase overall application scalability and reliability through scaling smaller units
Increased complexity when communicating between services
Increased latency across service boundaries
Concerns about securing inter-service traffic
Multiple deployments
Need to ensure that you don’t break clients as versions change
Must maintain backward compatibility with existing clients as the microservice evolves
It can be difficult to define clear boundaries between services to support independent development and deployment
Increased complexity of infrastructure, with distributed services having more points of failure
The increased latency introduced by network services and the need to build in resilience to handle possible failures and delays
Due to the networking involved, there is a need to provide security for service-to-service communication, which increases complexity of infrastructure
Strong requirement to manage and version service interfaces. With independently deployable services, the need to maintain backward compatibility increases.
The key to architecting microservice applications is to recognize service boundaries
Use a version control system like Git.
Each app has one code repo and vice versa.
Use a package manager like Maven, pip, or npm to install dependencies.
Declare dependencies in your code base.
Don't put secrets, connection strings, endpoints, etc., in source code.
Store those as environment variables.
Databases, caches, queues, and other services are accessed via URLs.
It should be easy to swap one implementation for another.
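A minimal sketch of these two factors together (the variable names are illustrative): configuration, including backing-service URLs, comes from the environment, so swapping one database or cache for another is a config change rather than a code change.

```python
import os

# Connection details come from the environment, never from source code.
# DATABASE_URL and CACHE_URL are illustrative names, e.g.
#   postgres://user:pass@db.internal:5432/orders
# Swapping the database (say, a local container in dev for a managed
# service in prod) means changing the environment, not the code.
DATABASE_URL = os.environ["DATABASE_URL"]
CACHE_URL = os.environ.get("CACHE_URL", "redis://localhost:6379")
```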
Build creates a deployment package from the source code.
Release combines the deployment with configuration in the runtime environment.
Run executes the application.
Apps run in one or more processes.
Each instance of the app gets its data from a separate database service.
Apps are self-contained and expose a port and protocol internally.
Apps are not injected into a separate server like Apache.
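For example, a minimal self-contained server using only the standard library (the PORT variable is an assumption for illustration): the app brings its own HTTP server and binds its own port instead of being deployed into an external one.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expose a simple HTTP interface on the bound port.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

# The app binds its own port (taken from the environment) rather than
# being injected into a separate server like Apache.
port = int(os.environ.get("PORT", "8080"))
HTTPServer(("", port), Handler).serve_forever()
```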
Because apps are self-contained and run in separate processes, they scale easily by adding instances.
App instances should scale quickly when needed.
If an instance is not needed, you should be able to turn it off with no side effects.
Container systems like Docker make this easier.
Leverage infrastructure as code to make environments easy to create.
Write log messages to standard output and aggregate all logs to a single source.
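As a minimal sketch, the app only writes to stdout and leaves routing and aggregation to the environment (Cloud Logging, a sidecar, etc.):

```python
import logging
import sys

# Log to stdout; the runtime environment aggregates the stream to a single source.
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logging.info("request handled")
```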
Admin tasks should be repeatable processes, not one-off manual tasks.
Admin tasks shouldn't be a part of the application.
Clients should not need to know too many details of services they use
Services communicate via HTTPS using text-based payloads
Services should add functionality without breaking existing clients
A resource is an abstract notion of information; a representation is a copy of the resource's information. Representations can be single items or collections of items.
GET is used to retrieve data
POST is used to create data
Generates entity ID and returns it to the client
PUT is used to create data or alter existing data
Entity ID must be known
PUT should be idempotent, which means that whether the request is made once or multiple times, the effects on the data are exactly the same
DELETE is used to remove data
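A minimal sketch of these verb semantics over an in-memory store (all names here are illustrative): repeating the PUT leaves the data unchanged, while each POST creates a new entity.

```python
import uuid

store = {}

def post(data):
    """POST: create a new entity; the server generates and returns the ID."""
    entity_id = str(uuid.uuid4())
    store[entity_id] = data
    return entity_id

def put(entity_id, data):
    """PUT: create the entity or replace it; idempotent, the ID must be known."""
    store[entity_id] = data

def get(entity_id):
    """GET: retrieve a representation of the entity."""
    return store.get(entity_id)

def delete(entity_id):
    """DELETE: remove the entity."""
    store.pop(entity_id, None)

cid = post({"name": "Ada"})    # a new ID is generated on every POST
put(cid, {"name": "Ada L."})   # repeating this PUT has no further effect
put(cid, {"name": "Ada L."})
assert get(cid) == {"name": "Ada L."}
```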
200 codes for success
400 codes for client errors
500 codes for server errors
Plural nouns for sets (collections)
Singular nouns for individual resources
Strive for consistent naming
Prefer lowercase: URI schemes and hosts are case-insensitive, but paths are case-sensitive
Don’t use verbs to identify a resource
Include version information
OpenAPI: a standard interface description format for REST APIs. Language-agnostic and open source (based on Swagger)
Allows tools and humans to understand how to use a service without needing its source code
Rent machines rather than buy.
Turn machines off as soon as possible rather than keep machines running for years.
Prefer lots of small machines rather than fewer big machines.
Treat machines as monthly operating expenses rather than capital expenditures.
Don’t fix broken machines.
Don’t install patches.
Don’t upgrade machines.
If you need to fix a machine, delete it and create a new one.
To make infrastructure disposable, automate everything with code:
Can automate using scripts.
Can use declarative tools to define infrastructure.
Build an infrastructure when needed.
Destroy the infrastructure when not in use.
Create identical infrastructures for dev, test, and prod.
Can be part of a CI/CD pipeline.
Templates are the building blocks for disaster recovery procedures.
Manage resource dependencies and complexity
Availability: The percent of time a system is running and able to process requests
Achieved with fault tolerance.
Create backup systems.
Use health checks.
Use white-box metrics to count real traffic success and failure.
Durability: The odds of losing data because of a hardware or system failure
Achieved by replicating data in multiple zones.
Do regular backups.
Practice restoring from backups.
Scalability: The ability of a system to continue to work as user load and data grow
Monitor usage.
Use capacity autoscaling to add and remove servers in response to changes in load
Define your unit of deployment
N+2: Plan to have one unit out for upgrade or testing and survive another failing (see the worked example after this list)
Make sure that each unit can handle the extra load
Don’t make any single unit too large
Try to make units interchangeable stateless clones
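A worked example of the N+2 rule with illustrative numbers (the load and capacity figures are assumptions, not from the course):

```python
import math

# Illustrative numbers: peak load and per-unit capacity are assumptions.
peak_rps = 4000           # peak requests per second
unit_capacity_rps = 1000  # what one deployment unit can handle

n = math.ceil(peak_rps / unit_capacity_rps)  # units needed at peak: 4
deployed = n + 2  # one out for upgrade/testing, one spare for failure: 6

# With one unit upgrading and one failed, n units remain -- still enough.
assert (deployed - 2) * unit_capacity_rps >= peak_rps
```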
If a single machine fails, all requests served by that machine fail.
If a top-of-rack switch fails, the entire rack fails.
If a zone or region is lost, all the resources in it fail.
Servers running the same software suffer from the same issues.
If a global configuration system fails, and multiple systems depend on it, they potentially fail too.
The group of related items that could fail together is a failure domain.
To avoid correlated failures: Decouple servers and use microservices distributed among multiple failure domains.
Divide business logic into services based on failure domains.
Deploy to multiple zones and/or regions.
Split responsibility into components and spread over multiple processes.
Design independent, loosely coupled but collaborating services.
To avoid cascading failures:
Use health checks in Compute Engine or readiness and liveness probes in Kubernetes to detect and then repair unhealthy instances.
Ensure that new server instances start fast and ideally don't rely on other backends/systems to start up.
Problem: Business logic error shows up as overconsumption of resources, and the service overloads.
Solution: Monitor query performance. Ensure that notification of these issues gets back to the developers.
Problem: You try to make the system more reliable by adding retries, and instead you create the potential for an overload.
Solution: Prevent overload by carefully considering overload conditions whenever you are trying to improve reliability with feedback mechanisms to invoke retries.
Use the truncated exponential backoff pattern to avoid positive feedback overload at the client (see the sketch after this list)
Use the circuit breaker pattern to protect the service from too many retries
Use lazy deletion to reliably recover when users delete data by mistake
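A minimal sketch of the first two patterns (the parameter names and limits are illustrative, not from any particular library): retries wait roughly min(cap, base × 2^attempt) with random jitter, and a circuit breaker fails fast after repeated failures instead of piling on more retries.

```python
import random
import time

def call_with_backoff(func, max_attempts=5, base_s=0.5, cap_s=32.0):
    """Truncated exponential backoff: wait up to min(cap, base * 2**attempt)
    between retries, with jitter so clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast for `cooldown_s`
    seconds instead of hammering the struggling service with more calls."""
    def __init__(self, threshold=3, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown over; allow a trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success closes the circuit
        return result
```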
What could happen that would cause a failure?
What is the Recovery Point Objective (the amount of data it would be acceptable to lose)?
What is the Recovery Time Objective (the amount of time it can take to be back up and running)?
Based on your disaster scenarios, formulate a plan to recover
Devise a backup strategy based on risk and recovery point and time objectives.
Communicate the procedure for recovering from failures.
Test and validate the procedure for recovering from failures regularly.
Ideally, recovery becomes a streamlined process, part of daily operations.
Principle of least privilege
Separation of duties
No one person is in charge of designing, implementing, and reporting on sensitive systems
Prevention of conflict of interest
Detection of control failures, for example security breaches or information theft
In a microservice architecture, be careful not to break clients when services are updated
Rolling updates allow you to deploy new versions with no downtime
Use a blue/green deployment when you don’t want multiple versions of a service running simultaneously
Canary releases can be used prior to a rolling update to reduce the risk