Overview on Distributed Systems Engineering


Software Engineering

(for Intelligent Distributed Systems)

A.Y. 2024/2025

Giovanni Ciatto


Compiled on: 2025-06-30 — printable version

back

What is a Distributed System?

Leslie Lamport (1987)

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable

Interpretation

  • hidden interdependencies between machines
  • how a failure in one part of the system can have unforeseen consequences elsewhere

“Distributed Systems”, Tannenbaum and Van Steen (the book here)

A distributed system is a collection of independent computers that appears to its users as a single coherent system

Interpretation

  • illusion of a unified system…
  • … despite the fact that it is made up of multiple independent machines
  • the key idea is that users may NOT be aware that the system is distributed

“Distributed Systems — Concepts and Design”, Coulouris et. al.

We define a distributed system as one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.

Interpretation

  • reliance on message passing across a network for communication

“Concepts, Techniques, and Models of Computer Programming”, Van Roy and Haridi

A distributed system is a set of computers that are linked together by a network

Interpretation

  • very abstract definition focusing on how distribution should be interpreted

What is a Distributed System?

Wrap up

These definitions cover the essential characteristics of distributed systems:

  • independent components
  • communication via messages
  • challenge of presenting a unified system to the user…
  • … despite failures or complexities within the system

Why would you make your system distributed?

Any or some of the following reasons:

  • Scalability: Handling large-scale systems efficiently
  • Fault Tolerance and Availability: Ensuring reliability despite failures
  • Low Latency and Geographical Distribution: Providing better user experience across the globe
  • Resource Sharing: Efficiently using computing power and storage
  • Handling Big Data: Processing data locally rather than moving it
  • Parallelism: Speeding up tasks through concurrent execution
  • Cost Efficiency: Reducing infrastructure costs through resource pooling
  • Collaboration: Enabling real-time updates and interactions across distances
  • The functionality to be provided by the system simply implies distribution
  • Scalability: Google handles billions of searches every day
  • Fault Tolerance and Availability: A single server cannot handle such a massive amount of data
  • Low Latency and Geographical Distribution: By distributing the search process across thousands of servers worldwide, Google can deliver search results quickly and reliably
  • Parallelism: Processing multiple search queries concurrently

Example: Social Networks (e.g., Instagram) or Messaging Apps (e.g., WhatsApp)

  • Scalability: Millions of users are active at the same time
  • Fault Tolerance and Availability: A single server cannot handle such a massive amount of data
  • Low Latency : By distributing the service across multiple servers worldwide, the app can deliver messages quickly and reliably
  • Geographical Distribution: Users can communicate across the globe
  • Handling Big Data: Store users’ data within that user’s region (as per GDPR)
  • Parallelism: Processing multiple messages concurrently
  • Collaboration: Enabling real-time updates and interactions across distances
  • The functionality is inherently distributed

Example: Online Shopping (e.g., Amazon)

  • Scalability: Millions of users are active at the same time
  • Fault Tolerance and Availability: A single server cannot handle such a massive amount of data
  • Low Latency: By distributing the service across multiple servers worldwide, the app can deliver messages quickly and reliably
  • Geographical Distribution: Comply to local regulations and regional preferences, e.g., language, currency, etc.
  • Parallelism: Processing multiple orders concurrently
  • The functionality to be provided by the system simply implies distribution

Example: Online Gaming (e.g., LoL, Fortnite)

  • Scalability: Millions of users are active at the same time
  • Fault Tolerance and Availability: A single server cannot handle such a massive amount of data
  • Low Latency: By distributing the service across multiple servers worldwide, the app can deliver messages quickly and reliably
  • Geographical Distribution: Users can play with others across the globe
  • Parallelism: Processing multiple games concurrently
  • Collaboration: Enabling real-time updates and interactions across distances
  • The functionality to be provided by the system simply implies distribution

Example: Google Docs (et. similia)

  • Scalability: allows for different business models, e.g., subscription-based
  • Fault Tolerance and Availability: Allows for automatic, implicit updates / file saves / backups
  • Geographical Distribution: Simplified sharing and collaboration across the globe
  • Parallelism and Collaboration: Multiple users can edit the same document concurrently

Example: Federated Learning (e.g. Google’s Gboard)

  • Scalability: Improve learning for all users (while keeping their data private)
  • Geo-Distribution: Improve the functionality based on the data from different regions
  • Handling Big Data: Process data locally rather than moving it
  • Parallelism: Learning occurs on the user’s device, anywhere, at any time
  • Cost Efficiency: A single device would not be sufficient to process all the data
  • Resource Sharing: Efficiently using computing power and storage, while preserving privacy

How does the SE workflow change for distributed systems?

Recap: Software Engineering Workflow

  1. Use case collection: negotiate expectations with customer(s) or stakeholders

  2. Requirements analysis: produce a list of requirements the final product should satisfy

    • with acceptance criteria for each requirement
  3. Design: produce a blueprint of the software

    • modelling: what entities from the real world are represented in the software?
      • how do they behave? how do they interact?
  4. Implementation: write the code that reifies the design into software

  5. Verification: verify that the software meets the requirements

    • automated testing: write more code to check whether the aforementioned implementation code works
    • acceptance testing: test the system with real data and real users
  6. Release: make one particular version of the software available to the customers

  7. Deployment: install and activate the software for the software

  8. Documentation: produce manuals and guides for the software

  9. Maintenance: fix bugs, improve the software, adapt it to new requirements

Software Engineering Workflow for Distributed Systems

(only additional steps are listed)

  1. Use case collection:
    • where are the users?
    • when and how frequently do they interact with the system?
    • how do they interact with the system? which devices are they using?
    • does the system need to store user’s data? which? where?
    • most likely, there will be multiple roles

Software Engineering Workflow for Distributed Systems

  1. Requirements analysis:
    • most likely, answers to the questions above imply technical constrains on the system
      • e.g. where to store data?
        • in case of multiple data centers, how to keep them consistent?
    • will the system need to scale? how to handle faults? how to recover from failures?
    • acceptance criteria will for all such additional requirements/constraints

Software Engineering Workflow for Distributed Systems

  1. Design:
    • are there infrastructural components that need to be introduced? how many?
      • e.g. clients, servers, load balancers, caches, databases, message brokers, queues, workers, proxies, firewalls, CDNs, etc.
    • how do components distribute over the network? where?
      • e.g. what are the IP addresses of the servers / brokers / databases / etc.?
    • how do domain entities map to infrastructural components?
      • e.g. state of a video game on central server, while inputs/representations on clients
      • e.g. where to store messages in an IM app? for how long?
    • how do components communicate? what? which interaction patterns do they enact?
    • do components need to store data? what data? where? when? how many copies?
      • in case of network partition, should the system be available or consistent?
    • how do components find each other?
      • how to name components?
      • e.g. service discovery, load balancing, etc.
    • how do components recognize each other?
      • e.g. authentication, authorization, etc.
    • what should happen when a component fails? is it really a failure?
      • e.g. retries, back-off, graceful degradation, etc.

Software Engineering Workflow for Distributed Systems

  1. Implementation:
    • which network protocols to use?
      • e.g. UDP, TCP, HTTP, WebSockets, gRPC, XMPP, AMQP, MQTT, etc.
    • how should in-transit data be represented?
      • e.g. JSON, XML, YAML, Protocol Buffers, etc.
    • how should persistent data be stored?
      • e.g. relations, documents, key-value, graph, etc.
    • how should databases be queried?
      • e.g. SQL, NoSQL, etc.
    • how should components be authenticated?
      • e.g. OAuth, JWT, etc.
    • how should components be authorized?
      • e.g. RBAC, ABAC, etc.

Software Engineering Workflow for Distributed Systems

  1. Verification:
    • how to unit-test distributed components?
    • testing the integration among components is paramount
    • how to end-to-end-test the system?
      • e.g. production vs. test environment
    • deployment automation is commonly used to test the system in production-like environment

Software Engineering Workflow for Distributed Systems

  1. Release:
    • components may (and should) have their own release cycles and versions
    • components should be resilient to the coexistence of multiple versions
    • rolling updates are preferred over big bang updates

Software Engineering Workflow for Distributed Systems

  1. Deployment:
    • where to deploy components?
      • e.g. cloud, on-premises, hybrid?
    • how to deploy components?
      • e.g. containers, VMs, bare metal?
    • how to scale components?
      • e.g. horizontally, vertically, auto-scaling?
    • how to monitor components?
      • e.g. logs, metrics, traces?
    • how to secure components?
      • e.g. firewalls, encryption, certificates, etc.
    • all such aspects should be automated
      • tools and companies exist just for this!

Software Engineering Workflow for Distributed Systems

  1. Documentation:
    • protocols and data formats should be well-documented
      • allowing for third-parties to create compatible components
    • e.g. Web API specification is public for most Web services

Software Engineering Workflow for Distributed Systems

  1. Maintenance:
    • continuous monitoring of performance and availability
      • tools and companies exist just for this!
    • issues tracking is non-trivial (may require ad-hoc sub-systems)
    • sunsetting old versions is crucial $\rightarrow$ coexistence of multiple versions
      • End-of-Life should be scheduled, not abrupt

Nomenclature

Infrastructure

The set of hardware, software, and networking facilities that allow the many pieces of a distributed system to communicate and inter-operate over a network

$\approx$ the set of infrastructural components composing the backbone of the distributed system

  • Distributed systems may rely on similar infrastructures, regardless of the different functionalities they provide
  • The infrastructure is transparent to the end-users, yet essential for the system to work


Example

Consider your fancy social network of choice (e.g., Instagram, TikTok, Twitter, etc.)

  • it looks like a dashboard on you phone / computer, but…
    • … where are the data stored?
    • … where are the messages after they have been sent but before they reach your phone?
    • … where does information processing happen?
  • different sorts of computations are performed on different infrastructural components, depending on their role in the system

Infrastructural Components

An infrastructural component consists of a software unit playing a precise role in the distributed system. The role depends on the purpose of the component in the system, and/or how it interacts.

software unit $\approx$ process (in the OS sense)

Examples

clients, servers, brokers, load balancers, caches, databases, queues, masters, workers, proxies, etc.

Synonyms

  • Node $\equiv$ an infrastructural component for which the role is not relevant

  • Peer $\equiv$ an infrastructural component for which the role is not specified

    • because all components play the same role(s)

Clients and servers

Clients and server concept

  • Server $\equiv$ a component with a well-known name/address responding to requests coming from clients

    • it’s the components that listens (waits) for remote requests
    • commonly it exposes an interface ($\approx$ the requests it can answer to) to clients
  • Client $\equiv$ a component that sends requests to servers, waiting for responses

    • it’s the components that initiates the interaction
    • it may expose an interface to users on how to interact with the server

Proxy


  • Proxy $\equiv$ a server acting as a gateway towards another server

    • it intercepts requests from clients and forwards them to the actual server
    • it intercepts responses from the actual server and forwards them to the client
    • it may cache responses to reduce the load on the server
  • A proxy performing caching is called a cache server (or just cache)

Load Balancer

Load Balancer concept
  • Load Balancer $\equiv$ a proxy distributing incoming requests among multiple servers
    • according to some distribution policy:
      • round-robin,
      • least connections,
      • least response time,
      • etc.

Broker

Broker concept

  • Broker $\equiv$ a server mediating the communication between producers and consumers of data (a.k.a. messages)

    • it receives messages from producers and forwards them to (1 or more) consumers
    • common assumption: consumers declare their interest in receiving messages
  • Producer $\equiv$ the component sending messages to producer(s) (via the broker)

  • Consumer $\equiv$ the component receiving messages from consumer(s) (via the broker)

  • The same component can be simultaneously a producer and a consumer

[Broker with] Queues

Queue concept

  • Queue $\approx$ a data structure where messages are stored in a FIFO manner
    • FIFO $\rightarrow$ messages are consumed in the order they were produced
    • storage $\rightarrow$ messages are not lost if consumers are unavailable

Message oriented middleware (MOM)

MOM concept

  • MOM $\approx$ a sort of broker having multiple channels for messages

    • messages involving the same topic are sent to the same channel
    • consumers subscribe to channels
  • Topic $\equiv$ a label for messages, allowing

    • producers to control which consumers receive the message
    • consumers to filter messages they are interested in
    • the broker to route messages to the correct consumers
  • Yes, most MOM technologies use queues to implement channels

Database

Database concept

  • Database $\equiv$ a server specialized in storing and retrieving data
    • in three-tier architectures, the database is the third tier
      • acting as a server for the server (which acts as a client w.r.t the database)

Master—Worker (a.k.a. Master—Slave, or Leader—Follower)

Master—Worker concept

  • Master $\equiv$ a server coordinating the work of multiple workers

    • it distributes the work among workers
    • it collects the results from workers
  • Worker $\equiv$ a server executing the work assigned by the master

  • Common use cases:

    • [master—worker] parallel computation
    • [master—slave] replication of data

Interaction Patterns

An interaction pattern describes how different components (nodes, processes, etc.) communicate and coordinate their actions to achieve a common goal. These patterns define the flow of messages, responsibilities of participants, and the timing and sequencing of communications.

e.g. request-response, publish-subscribe, auction, _etc.

Key points

  • Participants: we assume that there are $N \geq 2$ participants
  • Roles: participants play well-defined roles, most commonly:
    • initiator: the participant initiating the interaction
    • responder: the participant waiting for some other participant to initiate the interaction
  • Messages: information exchanged among participants, most commonly containing at least:
    • payload: the actual content of the message
    • metadata: information about the message, e.g. source, destination, timestamp, conversation id, etc.

How to represent Interaction Patterns?

  1. Sequence diagrams: visual representation of the flow of messages between participants
    • time is vertical, participants are horizontal
    • arrows represent messages sent from one participant to another
    • lifelines represent the lifetime of a participant

Example of a Sequence Diagram

Hints

How to represent Interaction Patterns?

  1. Message flow graphs: visual representation of the flow of messages
    • each node represents a type of message (e.g. request, response, notification, etc.)
    • each directed edge represents an admissible reply to a message
    • the graph may contain cycles, if the pattern allows for repeated interactions, or resets
    • nodes may be represented in different colours/shapes depending on which role sends/receives the message

Hints

  • in OO programming languages, each message type can be represented as a class
  • you may use class diagrams (e.g. in PlantUML) to represent them
  • you may also depict the graph otherwise, via graph-drawing tools (e.g. Graphviz or yEd)

How to represent Interaction Patterns?

  1. State diagrams (a.k.a. state machines): visual representation of the internal state transitions of a participant
    • each state represents a condition of the participant (e.g. before/after a message is sent/received)
    • each transition represents an event that changes the state of the participant
      • most commonly, a message reception or sending
    • initial and final states are special states representing the start and end of the interaction
    • most commonly, there’s one state diagram per participant

Hints

  • use PlantUML to automatically generate state diagrams from textual descriptions

Which representation to use?

  • The three representations are complementary
  • So you should use them together to fully describe an interaction pattern

Further representations may be welcome

  1. FIPA’s AUML (Agent UML) for agent-based systems
  2. BPMN (Business Process Model and Notation) for business processes
  3. UML’s Activity Diagrams

Common Interaction Protocols (pt. 1)

Request—Response

The most common and basic pattern for communication between two components

hide footbox participant Client participant Server

Client -> Server: Request Server -> Client: Response

  • 2 roles: client and server
  • 2 sorts of messages: request and response
  • each request is followed by one response
  • client is the initiator, server is the responder
  • client sends the request, server sends the response
  • client waits for the response, server waits for the request
  • often used to realise:
    • remote procedure calls (RPC)
    • remote method invocations (RMI)
    • (web) services

Common Interaction Protocols (pt. 2)

Publish—Subscribe

A simple pattern to spread information among multiple recipients

hide footbox actor User participant Publisher participant Subscriber1 participant Subscriber2

== Subscription Phase == Subscriber1 -> Publisher: subscribe activate Publisher Publisher –> Subscriber1: confirmation deactivate Publisher

Subscriber2 -> Publisher: subscribe activate Publisher Publisher –> Subscriber2: confirmation deactivate Publisher

== Notification Phase == User –> Publisher: Event activate Publisher Publisher -> Subscriber1: notify Message Publisher -> Subscriber2: notify Message deactivate Publisher

  • 2 roles: publisher and subscriber
  • 2 sorts of messages: subscribe and notify
  • 2 phases of interaction: subscription and notification
    1. Subscription phase: subscribers declare their interest in receiving message
      • subscribers are initiators here
    2. Notification phase: publishers send messages to subscribers
      • messages are broadcasted or multicasted depending on the implementation
  • messages of type notify carry messages, commonly representing events
  • messages of type subscribe may carry topics, commonly representing the topic of interest (for the subscriber)
  • notice that subscription is, essentially, a request-response pattern

Common Interaction Protocols (pt. 2)

Publish—Subscribe with Broker

Notice that the publisher here is acting as a broker

  • one may re-design the pattern with explicit broker
  • this is commonly useful to decouple the publisher from the subscribers
  • most commonly, brokers store messages until they are consumed

hide footbox actor User participant Publisher participant Broker participant Subscriber1 participant Subscriber2 participant Subscriber3

== Subscription Phase == Subscriber1 -> Broker: subscribe TopicA activate Broker Broker –> Subscriber1: confirmation deactivate Broker

Subscriber2 -> Broker: subscribe TopicA activate Broker Broker –> Subscriber2: confirmation deactivate Broker

Subscriber3 -> Broker: subscribe TopicB activate Broker Broker –> Subscriber3: confirmation deactivate Broker

== Notification Phase == User –> Publisher: Event activate Publisher Publisher -> Broker: publish Message\non TopicA deactivate Publisher activate Broker Broker -> Subscriber1: notify Message Broker -> Subscriber2: notify Message deactivate Broker

Common Interaction Protocols (pt. 2)

Unicast vs. Broadcast vs. Multicast

Depiction of the three types of communication
  • Unicast: one-to-one communication
  • Broadcast: one-to-all communication
  • Multicast: one-to-many communication
    • implies a selection criterion for the many

Common Interaction Protocols (pt. 3)

ContractNet Protocol

A simple protocol for auctions and negotiations

hide footbox participant Initiator participant Contractor1 participant Contractor2

== Call for Proposals == Initiator -> Contractor1: Call for Proposal (CFP) activate Initiator Initiator -> Contractor2: Call for Proposal (CFP)

== Proposal Submission == Contractor1 -> Initiator: Submit Proposal Contractor2 -> Initiator: Submit Proposal

== Proposal Evaluation == Initiator -> Initiator: Evaluate Proposals, choosing the best one

== Award Contract == Initiator -> Contractor1: Award Contract activate Contractor1 Contractor1 -> Initiator: Accept Contract

== Contract Execution == Contractor1 -> Initiator: Return Result deactivate Contractor1 deactivate Initiator

  • 2 roles: initiator and contractor

  • 5 sorts of messages: CFP, proposal, award, accept, result

  • 4 phases of interaction:

    1. Call for Proposals: initiator broad- or multi-casts a request a CFP
      • commonly defining a deadline + task request
    2. Proposal Submission: contractors submit proposals to initiator
      • commonly containing estimated cost
    3. Proposal Evaluation: initiator evaluates proposals and chooses the best one
    4. Award Contract: initiator awards the contract to the chosen contractor
      • the chosen contractor accepts the contract
    5. Contract Execution: contractor executes the contract and returns the result
  • not shown in the picture:

    1. lack of proposals (i.e. no contractor)
    2. no proposal chosen
    3. contractor may reject the contract

Common Interaction Protocols (pt. 4)

Foundation for Intelligent Physical Agents (FIPA)

  • FIPA is a standardization body for agent-based systems
  • Very rough analogy: agent $\approx$ distributed component in a distributed system
  • FIPA has standardized many interaction protocols for agents
    • e.g. contract net, request-response, subscription, auction, _etc.
  • See http://www.fipa.org/repository/ips.php3

Architecture and Architectural Styles

Recap: Software Architecture vs. Architectural Style

  • Roy Fielding (2000):

A software architecture is an abstraction of the run-time elements of a software system during some phase of its operation. […] It is defined by a configuration of architectural elements constrained in their relationships in order to achieve a desired set of architectural properties.

  • Very roughly:

Architectural style $\approx$ patterns of architectures which are known to work in practice

Main architectural styles for distributed systems

  • layered architectures
  • object-based architectures
  • event-based architectures
  • shared data-space architectures

Architectural Styles in Practice (pt. 1)

Layered Architecture (concept)

Layered Architecture
  • Which infrastructural components?

    • Each layer is a server (or a proxy) for the layer(s) above it
    • Each layer is a client for the layer(s) below it
  • Which interaction patterns?

    • Request—Response: (a.k.a. remote procedure call, RPC)
      • upper layers issue requests to lower layers
      • lower layers issue responses to upper layers
    • sometimes, Publish—Subscribe:
      • upper layers subscribe to lower layers
      • lower layers notify upper layers
  • Constraints:

    • no cycles among layers
      • i.e. lower layers_ should not contact upper layers
  • Example: (Web Services)

    • Skyscanner (looks for flights / hotels)
      • relying on the Web services of airlines / hotels
      • relying on the Booking portal
        • all of which have their own layered architecture

Architectural Styles in Practice (pt. 1)

Layered Architecture (particular cases)

Three-Tier Architecture

Classic three-tier application architecture
  • Presentation Tier: responsible for presenting information to the user, and accepting user input
  • Application Tier: responsible for processing user requests, and executing business logic, updating system state
  • Data Tier: responsible for storing and retrieving data, and managing data access, persisting system state

Architectural Styles in Practice (pt. 1)

Layered Architecture (particular cases)

Hexagoal Architecture

Hexagonal Architecture

cf. https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)

Architectural Styles in Practice (pt. 1)

Layered Architecture (analysis)

Pros

  1. Separation of Concerns: each layer handles a specific responsibility, making the system easier to understand and maintain.

  2. Modularity: layers are independent, allowing easier updates, testing, and replacement without affecting the entire system.

  3. Reusability: common functionality in layers can be reused across different systems or projects.

  4. Scalability: the system can scale by modifying or optimizing individual layers (e.g., scaling the database or network layer independently).

  5. Maintainability: bugs or issues can be isolated to specific layers, making troubleshooting simpler.

  6. Abstraction: layers provide clear abstractions, allowing higher layers to interact with the system without needing to know the internal details of lower layers.

  7. Interoperability: a well-defined interface between layers promotes compatibility and enables the use of different technologies in each layer.

Architectural Styles in Practice (pt. 1)

Layered Architecture (analysis)

Cons

  1. Performance Overhead: multiple layers can introduce latency, especially if there are excessive data transformations or processing between layers.

  2. Complexity in Design: designing and managing multiple layers can increase overall system complexity, particularly in very large systems.

  3. Rigid Structure: strict layer separation can limit flexibility, making it harder to implement cross-cutting concerns (e.g., logging, security) efficiently.

  4. Duplication of Functionality: if layers are not carefully defined, functionality can be duplicated across layers, leading to inefficiencies.

  5. Potential for Over-Engineering: in smaller or simpler systems, using a layered architecture might introduce unnecessary complexity when simpler architectures would suffice.

  6. Difficulty in Layer Communication: strict adherence to layer boundaries might make certain interactions cumbersome, requiring unnecessary intermediate steps.

Architectural Styles in Practice (pt. 1)

Layered Architecture (analysis)

Personal Opinion of the Teacher

  • Layered architectures are simple and easy to understand
  • It works well in most cases, you can consider it the default choice
  • Prefer:
    1. two-tier (3-tier with no real DBMS) architectures for quick and dirty systems
    2. three-tier architectures for if flexibility is not a primary concern
    3. hexagonal architectures for systems that may need to scale in complexity

Architectural Styles in Practice (pt. 2)

Object-Based Architecture (concept)

Object-Based Architecture

Architectural Styles in Practice (pt. 2)

Object-Based Architecture (analysis)

Pros

  1. Encapsulation: objects encapsulate data and behaviour, making the system more modular and easier to maintain.

  2. Reusability: objects can be reused across different parts of the system or in different applications, reducing duplication of code.

  3. Modularity: components are independent and self-contained, allowing easier upgrades, testing, and maintenance.

  4. Clear Interface Definition: objects interact through well-defined interfaces, making it easier to manage dependencies and interactions between components.

  5. Flexibility: objects can be distributed across different nodes, enabling flexibility in deployment and scalability in a distributed system.

  6. Language Agnosticism (e.g., CORBA): some object-based systems (like CORBA) support multi-language environments, allowing developers to use the best language for each component while maintaining system integration.

  7. Dynamic Behaviour: objects can be created, modified, or destroyed dynamically, allowing more flexible and adaptive systems.

Architectural Styles in Practice (pt. 2)

Object-Based Architecture (analysis)

Cons

  1. Performance Overhead: the communication between distributed objects, especially over a network, can introduce significant latency and performance overhead.

  2. Complexity in Management: managing the lifecycle and communication of distributed objects can add complexity, particularly in handling object references, synchronization, and failure recovery.

  3. Difficulty in Debugging: debugging distributed objects is more complex due to the separation between client and server objects, especially when communication spans across different networks or systems.

  4. Scalability Limitations: while object-based systems are modular, they may not scale efficiently in very large systems due to the overhead of managing object communication and state.

  5. Security Concerns: distributed objects expose interfaces that can be exploited if not secured properly, making security management more challenging.

  6. Tight Coupling through Interfaces: objects often rely on specific interfaces, which can lead to tight coupling between components, reducing flexibility in modifying or replacing objects.

  7. State Management: maintaining the state of distributed objects can be complex, especially in cases of network failures or when objects need to be synchronized across multiple nodes.

Architectural Styles in Practice (pt. 2)

Object-Based Architecture (analysis)

Personal Opinion of the Teacher

  • Object-based architectures were a thing in the 90s and early 2000s
  • They never really took off, and are now mostly legacy
  • OOP is good for developing individual components, where objects share the same address space
  • Inter-object interactions are so fine-grained and intertwined in OOP that adding network in between makes it unmanageable
  • Just don’t use this when designing a new system

Architectural Styles in Practice (pt. 3)

Event-Based Architecture (concept)

Event-Based Architecture
  • Which infrastructural components?

    • event bus $\approx$ a broker for events OR a set of brokers with a routing mechanism
    • servers are producers and consumers of events
    • clients commonly just interact with servers, as in layered architectures
  • Which interaction patterns?

    • Request—Response: client $\leftrightarrow$ server
    • Publish-Subscribe: server $\leftrightarrow$ event bus
  • Constraints:

    • servers do not really know each other…
    • … they just know which event to react to or to produce
  • Remarks:

    • much used in the industry to create modular systems
      • where modules are realised in different moments/technologies

Architectural Styles in Practice (pt. 3)

Event Based Architecture (analysis)

Pros

  1. Scalability: event-driven systems can scale more easily as components are decoupled and can handle many events concurrently.

  2. Real-time Processing: events are processed as they happen, enabling real-time responses and immediate actions.

  3. Loose Coupling: components are loosely coupled, improving flexibility and allowing independent service updates without affecting the whole system.

  4. Resilience: fault tolerance is improved, as event producers and consumers are decoupled, reducing the risk of single points of failure.

  5. Asynchronous Communication: systems can handle tasks asynchronously, improving responsiveness and system performance.

Architectural Styles in Practice (pt. 3)

Event Based Architecture (analysis)

Cons

  1. Complexity: event-driven systems can be more complex to design, implement, and maintain, particularly with large-scale event flows.

  2. Debugging Challenges: tracking the flow of events and identifying issues can be difficult, especially in distributed systems with asynchronous operations.

  3. Event Ordering: ensuring the correct order of event processing can be challenging, especially in distributed systems with multiple consumers.

  4. Latency: while events are often processed in real-time, network latency or communication issues can delay event delivery or processing.

  5. Data Consistency: ensuring consistency across different services or systems can be harder in event-driven systems due to asynchronous processing.

  6. Single point of failure: if the event hub is not properly managed (e.g., no fault-tolerant mechanisms), system may become faulty or unavailable.

Architectural Styles in Practice (pt. 3)

Event Based Architecture (analysis)

Personal Opinion of the Teacher

  • Event-based architectures are very popular in the industry
  • They are also pretty complex to design and implement: not at all entry level
  • Future courses in this master programme will cover this in more detail
  • Understanding of how layered architectures work is a good starting point for event-based architectures
    • also because the two styles are often combined

Architectural Styles in Practice (pt. 4)

Shared Dataspace Architecture (concept)

Shared Dataspace Architecture
  • Which infrastructural components?

    • clients and databases
  • Which interaction patterns?

    • Request—Response: client reads/writes data from/to database
    • Publish-Subscribe:
      • support for asynchronous notifications of data changes
      • support for streaming big query results
  • Constraints:

    • clients only perform CRUD operations on the database
  • Examples:

Architectural Styles in Practice (pt. 4)

Shared Dataspace Architecture (analysis)

Pros

  1. Decoupled Communication: clients do not need to interact directly, allowing for more flexible and asynchronous communication.

  2. Simplified Coordination: by using a shared space, different processes can coordinate without knowing about each other, simplifying synchronization in distributed systems.

  3. Scalability: components can be added or removed from the system without affecting others, making the architecture inherently scalable.

  4. Fault Tolerance: since communication happens through a shared space, if one component fails, others can still operate, improving resilience.

  5. Loose Coupling: different parts of the system are loosely coupled through the shared data space, enabling more flexibility in how the system components interact.

  6. Asynchronous Processing: processes can post data to the space and move on without waiting for others to consume it, improving system responsiveness and throughput.0

Architectural Styles in Practice (pt. 4)

Shared Dataspace Architecture (analysis)

Cons

  1. Performance Overhead: accessing a shared space, particularly in a distributed environment, can introduce latency and performance bottlenecks.

  2. Data Consistency Challenges: ensuring consistency between the data posted in the shared space and its consumers can be difficult, especially with concurrent access.

  3. Concurrency Issues: managing multiple processes accessing or modifying the same shared data can lead to race conditions and other concurrency issues.

  4. Limited Visibility: because of the decoupled nature of the system, it can be harder to track or debug the flow of data and interactions between components.

  5. Complexity in Data Management: house-keeping the shared data space (e.g., cleaning up outdated data, managing space, etc.) adds additional complexity to the system.

  6. Single point of failure: if the shared space is not properly managed (e.g., no fault-tolerant mechanisms), data may be lost if the space fails or system may become completely unavailable.

Architectural Styles in Practice (pt. 4)

Event Based Architecture (analysis)

Personal Opinion of the Teacher

  • Shared dataspace architectures only make sense in few niches
  • e.g. in distributed data processing, where computational burdens are shared among many nodes
  • e.g. when the distributed systems’ state is shared among most/all nodes, and concurrent updates are necessary
    • such as in distributed databases or distributed file systems, or in non-real-time multiplayer video games

Features Impacting DS Design

These features impact the infrastructure, interaction patterns, or architecture of DS:

  1. Redundancy: data, services, or hardware are replicated across multiple nodes or servers to ensure availability in case of failure.

  2. Failover: when a primary node or service fails, a backup (secondary node) takes over.

  3. Checkpoints and Rollback Recovery: periodically saving the state of a process so that in case of failure, the system can roll back to the last known good state.

  4. Consensus: achieving agreement between distributed nodes, ensuring consistent decision-making, even in case of failures.

  5. Hearth-beats: periodic signals sent between nodes to ensure they are alive and responsive.

  6. Timeouts and Retries: setting timeouts for operations and retrying them in case of failure or unresponsiveness.

  7. Authorization and Authentication: ensuring that only authorized users or systems can access data or services.

  8. Data Partitioning: splitting data across multiple nodes to improve performance and scalability, or respect administrative constraints.

About Redundancy

What

  • Data: making copies of the data across multiple nodes
  • Services: deploying multiple instances of the same service across different nodes
  • Hardware: using multiple computers, storage devices, or network components

Why

  • Fault Tolerance: ensures that the system remains operational even if some components fail
    • also avoiding data loss
  • Availability: distributing the load across multiple nodes, reducing the load on each node
  • Scalability: easier to scale the system up/down when the load increases/decreases

How

  • Replication: copying data or services across multiple nodes
  • [Data] Sharding : splitting data into smaller parts and distributing them across multiple nodes
    • when all nodes have all data, it’s called replication
    • … when each node has only a part of the data, it’s called sharding

Implications

  • Data replication $\Rightarrow$ consistency issues $\Rightarrow$ consensus algorithms OR master—slave replication
  • Service replication $\Rightarrow$ load balancing issues $\Rightarrow$ more complex infrastructure
  • Service replication $\Rightarrow$ state management issues $\Rightarrow$ state-less server + stateful database

About Failover

What

  • Active-passive failover: the backup node become active when the primary fails

  • Active-active failover: all replicas are active, and traffic is rerouted in case of failure

Why

  • Fault Tolerance and Availability: ensures that the system remains operational even the component fail

How

  • Load-balancer or proxy to reroute traffic to the backup node(s)

Implications

  • Same as in service replication
  • Need for a mechanism to detect failures, e.g. hear-beats (see next slides)

About Checkpoints and Rollback Recovery

What

  • Checkpoints: saving the state of a process at regular intervals
    • this is similar to backups, but it’s most commonly automatic
  • Rollback Recovery: in case of failure, the system can roll back to the last known good state
    • again, stress on the automatic part

Why

  • Fault Tolerance: it may be hard to prevent some bad situation from happening, so this is a way to recover from it when it does

How

  1. by designing your system to be able snapshot its state & restore snapshots upon failures automatically
  2. by designing your system to track variations (rather than states) and compute the states from the variations

Implications

  • Requires ad-hoc modelling, architectural, and infrastructural decisions

About Consensus

What

  • A protocol to ensure that some nodes (most commonly servers or databases) in a DS agree on a common decision
    • decision could be “which operation to perform on data” or “who to elect as a leader”, etc
    • even in case of failures
      • e.g. crashes: one node stops responding
      • e.g. byzantine failures: one node sends wrong information (either by mistake or deliberately)

Why

  • Consistency: ensures that nodes in the system have the same view of the data
  • Fault Tolerance: ensures that the system can continue to operate even if some nodes fail
  • Data Redundancy: see previous slides

How

  • Byzantine Fault Tolerant (BFT) protocols, e.g. PBFT
  • Crash Fault Tolerant protocols, e.g. Paxos, Raft

Implications

  • Complexity: design is more complex, e.g. clients may need to know which replica(s) to contact
  • Latency: in the eyes of clients, due to consensus going on between request and response

About Heart-beats, Timeout, Retries

What

  • Heart-beats are periodic signals sent between nodes to ensure they are alive and responsive
    • if a node stops sending heart-beats within the period, it’s considered dead
    • signal $\approx$ (almost) empty message: only its reception matter
  • Timeout: amount of time necessary to locally mark a remote operation as failed (e.g. node as unreachable)
  • Retry: commonly a single failure not enough (e.g. short timeout, bad luck) so better to retry a few times
    • max retries: maximum amount of retries before considering an actual failure
    • delay: amount of time to wait before retrying again (may be variable, cf. exponential back-off)

Why

  • Fault Tolerance: basic mechanism to detect failures, to be able to react to them ASAP

How

  1. Nodes keep a connection open between them, and send data periodically
    • alternatively, they send each other messages with no connection
    • some application may have this feature built-in, others may need ad-hoc design
  2. Timeout + retrial threshold + retrial delay to mark nodes as unreachable
  3. Decide what to do when depending on which and how many nodes are unreachable
    • e.g. prioritize consistency? prioritize availability?

Implications

  • More complex (and robust) design and implementation
  • Note: this is useful also when reliable network protocols (e.g. TCP) are in place

About Authorization and Authentication

What

  1. Authentication: letting the nodes of a DS recognise and distinguish themselves from one another

    • this includes users’ clients (a.k.a. identifying the legitimate users of the system)
  2. Authorization: a server granting (or forbidding) access to resources depending on the identity / role of the authenticated client

Why

  • Access control: to control who can (or cannot) do what
  • Monitoring: to record who is doing what
  • Prerequisite for many cyber-security-related aspects

How

  1. Allow the system to register legitimate users and/or roles, endowing them with credentials
  2. One or more server is in charge of generating session tokens upon client request
  3. Authenticated nodes include session tokens in any subsequent interaction
  4. Nodes know how to verify session tokens are valid and genuine
  5. Nodes enforce access control depending on the content of the token

Implications

  • Requires one authentication server, possibly backed by a database
  • Requires cryptography to handle session tokens
  • Requires designing and implementing some access control mechanism, e.g. ACLs, RBAC, etc.

About Data Partitioning

What

  • Clones of the same functionality are deployed, each one covering a partition of the geographic region where the system operates
    • e.g. amazon**.uk** vs. amazon**.it**, google**.fr** vs. google*.com**, etc.
  • Clones are not consistent among each other, because data is partitioned on a geographical basis
    • mechanisms are in place to direct users towards closest partition

Why

  • Fault-tolerance: one fault in one region does not propagate to others region
  • Availability and Load-balancing: divide-and-conquer approach to serve users
  • It is sometimes a mandatory accomplishment due to regulations (cf. GDPR, art. 44)

How

  • The same infrastructure is deployed in different places
  • Mechanisms may be present to connect the clones of the system (hence hiding partitioning to users)

Implications

  • Deployment procedures should be reproducible and parametric, possibly automated
  • Clones should be designed to be aware of the existence of other clones

Check your understanding (pt. 1)

  • What is a distributed system? What are its essential characteristics?
  • In the context of distributed systems, what is scalability? Is that desirable? Why?
  • In the context of distributed systems, what is fault tolerance? Is that desirable? Why?
  • In the context of distributed systems, what is availability? Is that desirable? Why?
  • In the context of distributed systems, what is consistency? Is that desirable? Why?
  • In the context of distributed systems, what is latency? Is that desirable? Why?
  • In the context of distributed systems, what is geographical distribution? Is that desirable? Why?
  • In the context of distributed systems, what is resource sharing? Is that desirable? Why?
  • How do distributed systems help to handle big data?
  • In the context of distributed systems, what is parallelism? Is that desirable? Why?
  • In what way do distributed systems contribute to cost efficiency?
  • How do distributed systems foster collaboration among users?
  • Enumerate and describe the reasons why one may want to design and implement a software system as distributed

Check your understanding (pt. 2)

  • Describe some real-world distributed system of your choice, focussing on
    1. whether and how it deals with relevant features such as scalability, fault tolerance, etc., and
    2. why it needs distribution in the first place
  • In the context of software engineering,
    • how to understand whether a new system to be realised needs distribution or not?
    • what are relevant aspects to be taken into account in the “design” phase of a distributed system?
    • what are relevant aspects to be taken into account in the “implementation” phase of a distributed system?
    • what are relevant aspects to be taken into account in the “verification” phase of a distributed system?
    • what are relevant aspects to be taken into account in the “release” phase of a distributed system?
    • what are relevant aspects to be taken into account in the “deployment” phase of a distributed system?
    • what are relevant aspects to be taken into account in the “maintenance” phase of a distributed system?
  • In the context of distributed systems,
    • what is the infrastructure?
    • what are infrastructural components?
    • what does it mean for a node to act as a client? What are clients necessary in a system?
    • what does it mean for a node to act as a server? Why are servers often necessary in a system?
    • what does it mean for a node to act as a proxy? Why would that be beneficial in a system?
    • what does it mean for a node to act as a cache? Why would that be beneficial in a system?
    • what does it mean for a node to act as a load balancer? What balancing strategies may it employ? Why would that be beneficial in a system?
    • what does it mean for a node to act as a broker? What could a broker be useful for in a system?
    • how could a broker exploit a queue?
    • what is a message oriented middleware?
    • what does it mean for a node to act as database?
    • what does it mean for a node to act as master?
    • what does it mean for a node to act as worker?

Check your understanding (pt. 3)

  • In the context of distributed systems,
    • what is an interaction pattern (a.k.a. protocol)? How to represent it?
    • describe the request–response interaction pattern (a.k.a. protocol)
    • describe the publish–subscribe interaction pattern (a.k.a. protocol)
    • what is the difference among unicast, multicast, and broadcast?
    • describe the contract-net protocol
    • what is a software architecture? And what is an architectural style?
    • describe the “layered architecture” architectural style
    • describe the “three-tier architecture” architectural style
    • describe the “hexagonal architecture” architectural style
    • describe the “event-based architecture” architectural style
    • describe the “shared dataspace architecture” architectural style
    • what is redundancy? Why is it useful? How to achieve it and what implications does each solution come with?
    • what is a failover mechanism? Why is it useful? How to achieve it and what implications does each solution come with?
    • what is check-pointing? How is it related to rollbacking? Why are they useful? How to achieve them and what implications does each - solution come with?
    • what is consensus? Why is it useful? How to achieve it and what implications does each solution come with?
    • what is a heart-beat mechanism? Why is it useful? How to achieve it and what implications does each solution come with?
    • what is authentication? How is it related to authorization? Why are they useful? How to achieve them and what implications does each - solution come with?
    • what is data partitioning? Why is it useful? How to achieve it and what implications does each solution come with?

Lecture is Over


Compiled on: 2025-06-30 — printable version

back to ToC