Compiled on: 2025-06-30 — printable version
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
We define a distributed system as one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.
A distributed system is a set of computers that are linked together by a network.
These definitions cover the essential characteristics of distributed systems:
Any (or several) of the following reasons:
Use case collection: negotiate expectations with customer(s) or stakeholders
Requirements analysis: produce a list of requirements the final product should satisfy
Design: produce a blueprint of the software
Implementation: write the code that reifies the design into software
Verification: verify that the software meets the requirements
Release: make one particular version of the software available to the customers
Deployment: install and activate the software for the customers
Documentation: produce manuals and guides for the software
Maintenance: fix bugs, improve the software, adapt it to new requirements
(only additional steps are listed)
The set of hardware, software, and networking facilities that allow the many pieces of a distributed system to communicate and inter-operate over a network
$\approx$ the set of infrastructural components composing the backbone of the distributed system
Consider your fancy social network of choice (e.g., Instagram, TikTok, Twitter, etc.)
An infrastructural component consists of a software unit playing a precise role in the distributed system. The role depends on the purpose of the component in the system, and/or on how it interacts with the other components.
software unit $\approx$ process (in the OS sense)
clients, servers, brokers, load balancers, caches, databases, queues, masters, workers, proxies, etc.
Node $\equiv$ an infrastructural component for which the role is not relevant
Peer $\equiv$ an infrastructural component for which the role is not specified
Server $\equiv$ a component with a well-known name/address responding to requests coming from clients
Client $\equiv$ a component that sends requests to servers, waiting for responses
Proxy $\equiv$ a server acting as a gateway towards another server
A proxy performing caching is called a cache server (or just cache)
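The caching behaviour of such a proxy boils down to "answer locally on a hit, forward on a miss". A minimal sketch (the upstream server is a plain function here; all names are invented for the example):

```python
class CacheProxy:
    """A proxy acting as a gateway towards another server, with caching (sketch)."""
    def __init__(self, upstream):
        self.upstream = upstream   # the server this proxy is a gateway to
        self.cache = {}

    def request(self, key):
        if key not in self.cache:          # cache miss: forward to the server
            self.cache[key] = self.upstream(key)
        return self.cache[key]             # cache hit: answer locally

hits = []
def upstream_server(key):
    hits.append(key)                       # track how often the real server is contacted
    return f"resource:{key}"

proxy = CacheProxy(upstream_server)
proxy.request("/index")
proxy.request("/index")                    # second call is served from the cache
print(hits)  # ['/index']  (the upstream server was contacted only once)
```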
Broker $\equiv$ a server mediating the communication between producers and consumers of data (a.k.a. messages)
Producer $\equiv$ the component sending messages to consumer(s) (via the broker)
Consumer $\equiv$ the component receiving messages from producer(s) (via the broker)
The same component can be simultaneously a producer and a consumer
MOM $\approx$ a sort of broker having multiple channels for messages
Topic $\equiv$ a label for messages, allowing consumers to receive only the messages they are interested in
Yes, most MOM technologies use queues to implement channels
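As a rough illustration of channels-as-queues, here is a minimal in-process sketch using Python's standard `queue` module, with one FIFO queue per topic (the topic names are hypothetical):

```python
import queue

# One FIFO channel (queue) per topic, mirroring how most MOMs implement channels.
channels: dict[str, queue.Queue] = {}

def publish(topic: str, message: str) -> None:
    """Producer side: append the message to the topic's channel."""
    channels.setdefault(topic, queue.Queue()).put(message)

def consume(topic: str) -> str:
    """Consumer side: take the oldest message from the topic's channel."""
    return channels[topic].get()   # blocks until a message is available

publish("sensor/temp", "21.5")
publish("sensor/temp", "21.7")
print(consume("sensor/temp"))  # 21.5  (FIFO: first published, first consumed)
```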
Master $\equiv$ a server coordinating the work of multiple workers
Worker $\equiv$ a server executing the work assigned by the master
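The master-worker split can be sketched with Python's `concurrent.futures`: the "master" is the main thread splitting the job and assigning chunks, the "workers" are pool threads (data and chunk size are arbitrary choices for the example):

```python
from concurrent.futures import ThreadPoolExecutor

def worker_task(chunk: list[int]) -> int:
    """The work assigned by the master: sum one chunk of the data."""
    return sum(chunk)

# The master splits the job into chunks and assigns each one to a worker.
data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:         # 4 workers
    partial_sums = list(pool.map(worker_task, chunks))  # master gathers results

total = sum(partial_sums)  # the master combines the workers' partial results
print(total)  # 4950
```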
Common use cases:
An interaction pattern describes how different components (nodes, processes, etc.) communicate and coordinate their actions to achieve a common goal. These patterns define the flow of messages, responsibilities of participants, and the timing and sequencing of communications.
e.g. request-response, publish-subscribe, auction, etc.
The most common and basic pattern for communication between two components
hide footbox
participant Client
participant Server

Client -> Server: Request
Server -> Client: Response
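The request-response exchange above can be sketched in a few lines of Python (an illustrative echo server on the loopback interface; all names are made up for the example):

```python
import socket
import threading

def serve_once(listener: socket.socket) -> None:
    """Server: wait for one request, send back one response."""
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024).decode()         # receive the client's request
        conn.sendall(f"echo: {request}".encode())  # reply with the response

# The server has a well-known address; the client connects to it.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen()
thread = threading.Thread(target=serve_once, args=(listener,))
thread.start()

client = socket.socket()
client.connect(listener.getsockname())
client.sendall(b"hello")                  # Client -> Server: Request
response = client.recv(1024).decode()     # Server -> Client: Response
client.close()
thread.join()
print(response)  # echo: hello
```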
A simple pattern to spread information among multiple recipients
hide footbox
actor User
participant Publisher
participant Subscriber1
participant Subscriber2

== Subscription Phase ==
Subscriber1 -> Publisher: subscribe
activate Publisher
Publisher --> Subscriber1: confirmation
deactivate Publisher

Subscriber2 -> Publisher: subscribe
activate Publisher
Publisher --> Subscriber2: confirmation
deactivate Publisher

== Notification Phase ==
User --> Publisher: Event
activate Publisher
Publisher -> Subscriber1: notify Message
Publisher -> Subscriber2: notify Message
deactivate Publisher
Notice that the publisher here is acting as a broker
hide footbox
actor User
participant Publisher
participant Broker
participant Subscriber1
participant Subscriber2
participant Subscriber3

== Subscription Phase ==
Subscriber1 -> Broker: subscribe TopicA
activate Broker
Broker --> Subscriber1: confirmation
deactivate Broker

Subscriber2 -> Broker: subscribe TopicA
activate Broker
Broker --> Subscriber2: confirmation
deactivate Broker

Subscriber3 -> Broker: subscribe TopicB
activate Broker
Broker --> Subscriber3: confirmation
deactivate Broker

== Notification Phase ==
User --> Publisher: Event
activate Publisher
Publisher -> Broker: publish Message\non TopicA
deactivate Publisher
activate Broker
Broker -> Subscriber1: notify Message
Broker -> Subscriber2: notify Message
deactivate Broker
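Stripped of networking, the broker's routing logic amounts to a topic → subscribers map: a published message is forwarded only to the subscribers of its topic. A minimal in-process sketch (class, topic, and subscriber names are illustrative):

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Minimal in-process broker: routes messages to subscribers by topic."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> None:
        for notify in self._subscribers[topic]:   # only this topic's subscribers
            notify(message)

broker = Broker()
received: list[str] = []
broker.subscribe("TopicA", lambda m: received.append(f"sub1 got {m}"))
broker.subscribe("TopicA", lambda m: received.append(f"sub2 got {m}"))
broker.subscribe("TopicB", lambda m: received.append(f"sub3 got {m}"))

broker.publish("TopicA", "Message")   # notifies sub1 and sub2, not sub3
print(received)  # ['sub1 got Message', 'sub2 got Message']
```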
A simple protocol for auctions and negotiations
hide footbox
participant Initiator
participant Contractor1
participant Contractor2

== Call for Proposals ==
Initiator -> Contractor1: Call for Proposal (CFP)
activate Initiator
Initiator -> Contractor2: Call for Proposal (CFP)

== Proposal Submission ==
Contractor1 -> Initiator: Submit Proposal
Contractor2 -> Initiator: Submit Proposal

== Proposal Evaluation ==
Initiator -> Initiator: Evaluate Proposals, choosing the best one

== Award Contract ==
Initiator -> Contractor1: Award Contract
activate Contractor1
Contractor1 -> Initiator: Accept Contract

== Contract Execution ==
Contractor1 -> Initiator: Return Result
deactivate Contractor1
deactivate Initiator
2 roles: initiator and contractor
5 sorts of messages: CFP, proposal, award, accept, result
4 phases of interaction:
not shown in the picture:
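The phases of the protocol can be condensed into a toy single-round simulation (bids are fixed constants for reproducibility, and "best proposal" here simply means cheapest; all names are invented):

```python
def call_for_proposals(task, contractors):
    """Initiator side of one contract-net round (toy version)."""
    # CFP + proposal submission: each contractor answers the call with a bid
    proposals = {name: bid(task) for name, bid in contractors.items()}
    # Proposal evaluation: choose the best (here: cheapest) proposal
    winner = min(proposals, key=proposals.get)
    # Award: the contract goes to the winner (the others are not awarded)
    return winner

contractors = {
    "Contractor1": lambda task: 10,   # fixed bids, for reproducibility
    "Contractor2": lambda task: 7,
}
print(call_for_proposals("render frame", contractors))  # Contractor2
```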
A software architecture is an abstraction of the run-time elements of a software system during some phase of its operation. […] It is defined by a configuration of architectural elements constrained in their relationships in order to achieve a desired set of architectural properties.
Architectural style $\approx$ patterns of architectures which are known to work in practice
Which infrastructural components?
Which interaction patterns?
Constraints:
Example: (Web Services)
cf. https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)
Separation of Concerns: each layer handles a specific responsibility, making the system easier to understand and maintain.
Modularity: layers are independent, allowing easier updates, testing, and replacement without affecting the entire system.
Reusability: common functionality in layers can be reused across different systems or projects.
Scalability: the system can scale by modifying or optimizing individual layers (e.g., scaling the database or network layer independently).
Maintainability: bugs or issues can be isolated to specific layers, making troubleshooting simpler.
Abstraction: layers provide clear abstractions, allowing higher layers to interact with the system without needing to know the internal details of lower layers.
Interoperability: a well-defined interface between layers promotes compatibility and enables the use of different technologies in each layer.
Performance Overhead: multiple layers can introduce latency, especially if there are excessive data transformations or processing between layers.
Complexity in Design: designing and managing multiple layers can increase overall system complexity, particularly in very large systems.
Rigid Structure: strict layer separation can limit flexibility, making it harder to implement cross-cutting concerns (e.g., logging, security) efficiently.
Duplication of Functionality: if layers are not carefully defined, functionality can be duplicated across layers, leading to inefficiencies.
Potential for Over-Engineering: in smaller or simpler systems, using a layered architecture might introduce unnecessary complexity when simpler architectures would suffice.
Difficulty in Layer Communication: strict adherence to layer boundaries might make certain interactions cumbersome, requiring unnecessary intermediate steps.
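As a toy illustration of the layering constraint (each layer calls only the layer directly below it, through a clear interface), consider this three-layer sketch; the layers, functions, and data are invented for the example:

```python
# --- Data layer: storage details are hidden from the layers above
_db: dict[str, int] = {"alice": 3}

def load_score(user: str) -> int:
    return _db.get(user, 0)

# --- Logic layer: business rules, independent of storage and presentation
def increment_score(user: str) -> int:
    return load_score(user) + 1

# --- Presentation layer: formatting only; knows nothing about the database
def show_score(user: str) -> str:
    return f"{user}: {increment_score(user)} points"

# Each layer depends only on the one below: swapping _db for a real
# database would not touch the logic or presentation layers.
print(show_score("alice"))  # alice: 4 points
```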
Which infrastructural components?
Which interaction patterns?
Constraints:
Examples:
Encapsulation: objects encapsulate data and behaviour, making the system more modular and easier to maintain.
Reusability: objects can be reused across different parts of the system or in different applications, reducing duplication of code.
Modularity: components are independent and self-contained, allowing easier upgrades, testing, and maintenance.
Clear Interface Definition: objects interact through well-defined interfaces, making it easier to manage dependencies and interactions between components.
Flexibility: objects can be distributed across different nodes, enabling flexibility in deployment and scalability in a distributed system.
Language Agnosticism (e.g., CORBA): some object-based systems (like CORBA) support multi-language environments, allowing developers to use the best language for each component while maintaining system integration.
Dynamic Behaviour: objects can be created, modified, or destroyed dynamically, allowing more flexible and adaptive systems.
Performance Overhead: the communication between distributed objects, especially over a network, can introduce significant latency and performance overhead.
Complexity in Management: managing the lifecycle and communication of distributed objects can add complexity, particularly in handling object references, synchronization, and failure recovery.
Difficulty in Debugging: debugging distributed objects is more complex due to the separation between client and server objects, especially when communication spans across different networks or systems.
Scalability Limitations: while object-based systems are modular, they may not scale efficiently in very large systems due to the overhead of managing object communication and state.
Security Concerns: distributed objects expose interfaces that can be exploited if not secured properly, making security management more challenging.
Tight Coupling through Interfaces: objects often rely on specific interfaces, which can lead to tight coupling between components, reducing flexibility in modifying or replacing objects.
State Management: maintaining the state of distributed objects can be complex, especially in cases of network failures or when objects need to be synchronized across multiple nodes.
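The core idea of distributed objects — method calls transparently turned into messages — can be sketched without any real network. The stub/servant terminology loosely follows RMI/CORBA; the "transport" here is just a plain function, and all names are invented:

```python
import json

class RemoteStub:
    """Client-side stub: turns method calls into messages (RMI-style sketch)."""
    def __init__(self, send):
        self._send = send                 # transport: here, a plain function

    def __getattr__(self, method):
        def invoke(*args):
            request = json.dumps({"method": method, "args": list(args)})
            return json.loads(self._send(request))["result"]
        return invoke

class CounterServant:
    """Server-side object actually holding the state."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        self.value += n
        return self.value

servant = CounterServant()

def dispatch(raw: str) -> str:
    """Server-side skeleton: decode the message, invoke the servant."""
    msg = json.loads(raw)
    result = getattr(servant, msg["method"])(*msg["args"])
    return json.dumps({"result": result})

counter = RemoteStub(dispatch)
print(counter.add(5))  # 5 — looks like a local call, but went through messages
```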
Which infrastructural components?
Which interaction patterns?
Constraints:
Remarks:
Scalability: event-driven systems can scale more easily as components are decoupled and can handle many events concurrently.
Real-time Processing: events are processed as they happen, enabling real-time responses and immediate actions.
Loose Coupling: components are loosely coupled, improving flexibility and allowing independent service updates without affecting the whole system.
Resilience: fault tolerance is improved, as event producers and consumers are decoupled, reducing the risk of single points of failure.
Asynchronous Communication: systems can handle tasks asynchronously, improving responsiveness and system performance.
Complexity: event-driven systems can be more complex to design, implement, and maintain, particularly with large-scale event flows.
Debugging Challenges: tracking the flow of events and identifying issues can be difficult, especially in distributed systems with asynchronous operations.
Event Ordering: ensuring the correct order of event processing can be challenging, especially in distributed systems with multiple consumers.
Latency: while events are often processed in real-time, network latency or communication issues can delay event delivery or processing.
Data Consistency: ensuring consistency across different services or systems can be harder in event-driven systems due to asynchronous processing.
Single point of failure: if the event hub is not properly managed (e.g., no fault-tolerant mechanisms), the whole system may become faulty or unavailable.
Which infrastructural components?
Which interaction patterns?
Constraints:
Examples:
Decoupled Communication: clients do not need to interact directly, allowing for more flexible and asynchronous communication.
Simplified Coordination: by using a shared space, different processes can coordinate without knowing about each other, simplifying synchronization in distributed systems.
Scalability: components can be added or removed from the system without affecting others, making the architecture inherently scalable.
Fault Tolerance: since communication happens through a shared space, if one component fails, others can still operate, improving resilience.
Loose Coupling: different parts of the system are loosely coupled through the shared data space, enabling more flexibility in how the system components interact.
Asynchronous Processing: processes can post data to the space and move on without waiting for others to consume it, improving system responsiveness and throughput.
Performance Overhead: accessing a shared space, particularly in a distributed environment, can introduce latency and performance bottlenecks.
Data Consistency Challenges: ensuring consistency between the data posted in the shared space and its consumers can be difficult, especially with concurrent access.
Concurrency Issues: managing multiple processes accessing or modifying the same shared data can lead to race conditions and other concurrency issues.
Limited Visibility: because of the decoupled nature of the system, it can be harder to track or debug the flow of data and interactions between components.
Complexity in Data Management: house-keeping the shared data space (e.g., cleaning up outdated data, managing space, etc.) adds additional complexity to the system.
Single point of failure: if the shared space is not properly managed (e.g., no fault-tolerant mechanisms), data may be lost if the space fails or system may become completely unavailable.
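A tuple space in the Linda tradition is the classic example of such a shared data space. A minimal single-process sketch (only a non-blocking `out` and a blocking `take` are shown; matching uses `None` as a wildcard, and all names are illustrative):

```python
import threading

class TupleSpace:
    """Minimal Linda-like shared data space (sketch)."""
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, t):
        """Post a tuple and move on, without waiting for any consumer."""
        with self._cond:
            self._tuples.append(t)
            self._cond.notify_all()

    def take(self, pattern):
        """Remove and return a tuple matching the pattern (None = wildcard)."""
        with self._cond:
            while True:
                for t in self._tuples:
                    if len(t) == len(pattern) and all(
                        p is None or p == v for p, v in zip(pattern, t)
                    ):
                        self._tuples.remove(t)
                        return t
                self._cond.wait()   # block until a matching tuple appears

space = TupleSpace()
space.out(("temperature", 21))          # producer posts and moves on
t = space.take(("temperature", None))   # consumer retrieves by pattern
print(t)  # ('temperature', 21)
```

Producer and consumer never interact directly: they coordinate only through the shared space.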
These features impact the infrastructure, interaction patterns, or architecture of DS:
Redundancy: data, services, or hardware are replicated across multiple nodes or servers to ensure availability in case of failure.
Failover: when a primary node or service fails, a backup (secondary node) takes over.
Checkpoints and Rollback Recovery: periodically saving the state of a process so that in case of failure, the system can roll back to the last known good state.
Consensus: achieving agreement between distributed nodes, ensuring consistent decision-making, even in case of failures.
Heartbeats: periodic signals sent between nodes to ensure they are alive and responsive.
Timeouts and Retries: setting timeouts for operations and retrying them in case of failure or unresponsiveness.
Authorization and Authentication: ensuring that only authorized users or systems can access data or services.
Data Partitioning: splitting data across multiple nodes to improve performance and scalability, or respect administrative constraints.
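Timeouts and retries are often combined with exponential backoff between attempts. A minimal sketch (the flaky operation simulates a node that is unresponsive twice before answering; all names are invented):

```python
import time

def with_retries(operation, attempts: int = 3, backoff: float = 0.1):
    """Retry `operation` up to `attempts` times, backing off between tries."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise                          # give up after the last attempt
            time.sleep(backoff * 2 ** attempt) # exponential backoff

calls = 0
def flaky():
    """Fails twice (as if the remote node were unresponsive), then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise TimeoutError("node unresponsive")
    return "ok"

print(with_retries(flaky))  # ok (after 2 retries)
```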
Active-passive failover: the backup node becomes active when the primary fails
Active-active failover: all replicas are active, and traffic is rerouted in case of failure
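Active-passive failover can be sketched as a simple try-the-primary-first policy (failure is simulated with a `ConnectionError`; all names are illustrative):

```python
def call_with_failover(primary, backup, request):
    """Active-passive failover: use the backup only if the primary fails."""
    try:
        return primary(request)
    except ConnectionError:
        return backup(request)   # the backup node takes over

def primary(req):
    raise ConnectionError("primary node is down")   # simulated failure

def backup(req):
    return f"handled by backup: {req}"

print(call_with_failover(primary, backup, "read x"))  # handled by backup: read x
```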
Authentication: letting the nodes of a DS recognise and distinguish themselves from one another
Authorization: a server granting (or forbidding) access to resources depending on the identity / role of the authenticated client