NTT Data – Platea Banking

Use Case

Platea Banking is a Digital Banking Solution developed by NTT DATA to be used by our clients as an accelerator in their modernization journeys, especially when creating a new banking platform.

Platea is positioned as a modernization vehicle that provides a ready-to-go digital infrastructure and architecture over both internal and external components as the core. It also includes pre-built functional modules and integrations with a third-party ecosystem, with Mambu[1] at its heart as the core engine.

Platea accelerates the modernization journey by dramatically reducing time to market (T2M) while providing solid, future-proof technology, avoiding vendor lock-in, and putting clients in the driver's seat with autonomy and full control of their future.

It is well known why mainframes were a success back in the day:

  • Reliability
  • Robustness
  • Computing power
  • Transactionality

Nowadays, however, in the modern digitalized world, clients face the challenge of defining and implementing production-grade digital solutions in record time in order to remain competitive.

Therefore, apart from the mainframe capabilities, businesses require additional features these days:

  • Elasticity and pay-as-you-go pricing.
  • Real time interaction.
  • Value delivery.
  • Information management.

Also, as the system becomes more complex due to the increasing number of components it comprises, additional requirements arise from the reliability demands and from system operation itself:

  • Business segmentation.
  • System observability both as a whole and per information flow.
  • System resilience.

From this new digital paradigm arose the idea of creating a digital solutions platform capable of supporting exponential businesses.

Requirements

  • Mambu functions synchronization support.
  • Online transactions support (account opening, withdrawals, deposits, balance views, etc.), including error handling support.
  • Batch transactions snapshotting, so that the transaction database can be rebuilt.
  • Optimal mass transactions performance (TPS, Response Time, low latency, etc.).
  • Multiple layers structure, with support for at least three layers: Global (CORE), Regional (per country), and Customers’ Group Specific. Application logic must be based on modules structured by the layers.
  • Business transactions are implemented using a Microservices Architecture (MSA).
  • Support for ACID-grade transactions, even when calling third parties, on some functionalities.
  • Transaction management structure and rules are defined to support services called by other service groups. For example, a money transfer among multiple accounts in a single transaction: error handling must consider all the accounts involved.
  • Information exchange is based on Event-Driven Architecture (EDA) and eventual consistency when possible (i.e., when no other requirement dictates otherwise). Data consistency and sequencing are assured.
  • Data needs to be accessed in synchronous and asynchronous modes. ACID-Grade consistent and eventually consistent data access must be provided, where each component may be set to work with consistent or eventually consistent models. The eventual data consistency must be ensured within 60 seconds globally.
  • The data integrity must be assured; the transactions are immutable and can be used to rebuild the transactions database.
  • The solution infrastructure must be 99.99% available, excluding availability of 3rd party integrated solutions.
  • Online responses must take at most 3 seconds for 90% of the functionality.
  • Performance target is 200 transactions per second (TPS) on a 3 million accounts basis.
  • Online time must be 24 hours a day, 365 days a year, with a maximum of 4 hours offline per month (99.5% online).
  • Releases must be carried out without stopping the service.
  • Backup and disaster recovery methods must be in place.
  • Monitoring must include trace information, resource monitoring, audit logging, and performance monitoring. It must be able to produce live alerting. It will also be used as a postmortem evaluation tool.
  • Banking security regulations must be translated to cloud security measures.
  • Two factor authentication must be enforced.
  • All deployments must be automated.

Main Approach

Due to the need to create a robust and scalable platform that solves global banking business cases, we aimed for a Reactive Architecture.

Reactive Architecture in Platea

A Reactive Architecture is defined by four main features which are perfectly defined in the Reactive Manifesto[2]:

  • Responsiveness
  • Resilience
  • Elasticity
  • Message Driven

This kind of architecture seems to fit perfectly with Platea's requirements because it allows high availability and dynamic scaling based on the workload, while guaranteeing a response from each part of the system in a decoupled way and optimizing the dedicated resources. In a business such as banking, in which clients place their trust, this is a key factor.

However, the approach cannot be a pure Reactive Architecture: it must also support synchronous interactions. One of the requirements is to support external cores, and many SaaS products expose standard HTTP APIs (typically REST APIs), which are not message driven; in addition, the exposed interface will mostly need to be REST APIs to enable interaction with the many other systems that may interact with Platea.

Responsiveness

The system responds in a timely manner whenever possible. Responsiveness is the cornerstone of usability and utility, and it means that problems can be detected quickly and dealt with effectively. Responsive systems focus on providing quick and consistent response times, establishing reliable upper bounds so they deliver a consistent quality of service. This consistent behaviour in turn simplifies error handling, builds end-user confidence, and encourages further interaction.

Resilience

The system stays responsive in the face of failure. Resilience is achieved by replication, containment, isolation and delegation. Failures are contained within each component, isolating components from each other and thereby ensuring that parts of the system can fail and recover without compromising the system as a whole. Recovery of each component is delegated to another (external) component and high availability is ensured by replication where necessary. The client of a component is not burdened with handling its failures.

Elasticity

The system stays responsive under varying workloads. Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to process these inputs. This implies designs that have no contention points or bottlenecks, resulting in the ability to shard or replicate components and distribute inputs among them. Reactive Systems support predictive, as well as reactive, scaling algorithms by providing relevant live performance measures. They achieve elasticity in a cost-effective way on commodity hardware and software platforms.

Note on elasticity in contrast to scalability: the purpose of elasticity is to match the resources allocated to the actual amount needed at any given point in time, while scalability handles the changing needs of an application within the confines of the infrastructure, by statically adding or removing resources to meet application demands as needed.

Message Driven

Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation and location transparency. This boundary also provides the means to delegate failures as messages. Employing explicit message-passing enables load management, elasticity, and flow control by shaping and monitoring the message queues in the system and applying back-pressure when necessary. Location transparent messaging as a means of communication makes it possible for the management of failure to work with the same constructs and semantics across a cluster or within a single host. Non-blocking communication allows recipients to only consume resources while active, leading to less system overhead.

Asynchronous message passing vs synchronous interactions

Since it is required to support external cores, and most systems on the internet expose synchronous API services, the architecture must support synchronous interactions. This model behaves differently from the message-driven one, so it does not offer the same isolation. To fit the requirements, the asynchronous model is preferred. Components supporting synchronous interactions will be required to generate messages in some cases even when called synchronously, so that they behave like asynchronous components.

Event Driven vs Message Driven

Given the nature of the Platea platform, the responsiveness and latency of the system are crucial. The need to support synchronous interactions and to standardize both models for system management added an extra constraint. Due to these facts, we shifted from a Message-Driven system to an Event-Driven one for most of the system.

Please note that in contrast to a Message-Driven system, in which a message is an item of data that is sent to a specific destination, an event is a signal emitted by a component upon reaching a given state. In a message-driven system addressable recipients await the arrival of messages and react to them, otherwise lying dormant. In an event-driven system notification listeners are attached to the sources of events such that they are invoked when the event is emitted. This means that an event-driven system focuses on addressable event sources while a message-driven system concentrates on addressable recipients (a message can contain an encoded event as its payload).

The Message Driven model was reserved for actions that must be accomplished atomically (distributed transactions). For other scenarios (including synchronous interactions), the Event Driven model is followed.

Domain Driven Design

DDD, or Domain-Driven Design, is set as the basis for the decomposition of functionalities in the system and, thus, for the definition of services. More than a design pattern, it is an approach to software development for complex needs that connects the implementation to an evolving model.

Event Driven

Event-driven architecture is a software architecture paradigm promoting the production, detection, consumption of, and reaction to events. Basically, each service publishes an event whenever it updates its data, while other services subscribe to those events. When an event is received, a service updates its data accordingly.

When a service performs an operation inside a distributed transaction, the events are generated once the transaction is committed (the update is committed when the transaction is committed, even if the data itself is updated before that). When the transaction is cancelled, no event is generated, even though there are at least two changes to the data (the change and the rollback).
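
As an illustration of this rule, the sketch below registers the event publication as an after-commit callback, so that a rollback produces no event. It assumes a Spring transaction manager and a Kafka producer; neither framework, the topic name, nor the repository shown is prescribed by this document.

```java
// Minimal sketch of "publish only after commit". Spring + Kafka are assumed
// here for illustration; Platea's actual frameworks are not specified.
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.transaction.support.TransactionSynchronization;
import org.springframework.transaction.support.TransactionSynchronizationManager;

interface AccountRepository {                 // hypothetical data access component
    void addBalance(String accountId, long amountCents);
}

@Service
public class AccountService {

    private final KafkaTemplate<String, String> kafka;
    private final AccountRepository repository;

    public AccountService(KafkaTemplate<String, String> kafka, AccountRepository repository) {
        this.kafka = kafka;
        this.repository = repository;
    }

    @Transactional
    public void deposit(String accountId, long amountCents) {
        // The data itself is updated here, before the commit.
        repository.addBalance(accountId, amountCents);

        // The event is emitted only after a successful commit; if the
        // transaction is cancelled and rolled back, no event is generated.
        TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
            @Override
            public void afterCommit() {
                kafka.send("account-events", accountId, "DEPOSITED:" + amountCents);
            }
        });
    }
}
```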

Among the benefits is the possibility of developing loosely coupled microservices, which improve elasticity by being able to scale independently.

Finally, the use of this pattern fits with the Event Sourcing pattern described below.

Event Sourcing

The fundamental idea in Event Sourcing is that every change in the state of an application entity must be captured in an event object, and that these event objects are themselves stored in the sequence they were created for the lifetime of the application state itself.

Saving an event is a single, atomic operation. Every command executed in the system will produce one or more events if the underlying actions have executed successfully. This means that the Event Store is the source of truth for the whole architecture of the system, as the current state of an entity can be reconstructed by replaying the events. This is a key factor in covering audit needs.

With the objective of improving the observability and auditability of the system, read events as well as any errors produced will also be recorded, so that the informational system can ingest them and hold a record of everything that happened in the system at each precise moment. This provides a 100% reliable audit log of the changes made to a business entity, as well as of all the events that happened in the system.

Some entities can have a large number of events. In order to optimize loading, an application can periodically save a snapshot of an entity’s current state. To reconstruct the current state, the application finds the most recent snapshot and the events that have occurred since that snapshot. As a result, there are fewer events to replay.
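
A minimal sketch of this snapshot-plus-replay load path follows; the entity, event, and store types are illustrative, not Platea's actual model, and an initial snapshot at version 0 is assumed to exist.

```java
import java.util.List;

// Illustrative event-sourced account: state is rebuilt from the most recent
// snapshot plus the events recorded after it.
record AccountState(long balanceCents) {
    AccountState apply(AccountEvent e) {
        return new AccountState(balanceCents + e.balanceDeltaCents());
    }
}
record AccountEvent(long version, long balanceDeltaCents) {}
record Snapshot(long version, AccountState state) {}

interface EventStore {                        // hypothetical store interface
    Snapshot latestSnapshot(String entityId); // assumed non-null (version 0 baseline)
    List<AccountEvent> eventsSince(String entityId, long version);
}

class AccountLoader {

    private final EventStore store;

    AccountLoader(EventStore store) {
        this.store = store;
    }

    AccountState load(String entityId) {
        Snapshot snapshot = store.latestSnapshot(entityId);
        AccountState state = snapshot.state();
        // Replay only events newer than the snapshot, keeping the load cheap
        // even for entities with long histories.
        for (AccountEvent e : store.eventsSince(entityId, snapshot.version())) {
            state = state.apply(e);
        }
        return state;
    }
}
```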

Event sourcing has several benefits:

  • It solves one of the key problems in implementing an event-driven architecture and makes it possible to reliably publish events whenever state changes.
  • It mostly avoids the object-relational impedance mismatch problem, because it persists events rather than domain objects.
  • It provides a 100% reliable audit log of the changes made to a business entity.
  • It makes it possible to implement temporal queries that determine the state of an entity at any point in time.
  • Event sourcing-based business logic consists of loosely coupled business entities that exchange events. This makes it easier to develop and maintain these entities.

It also has a major drawback: since the business logic consists of loosely coupled business entities that work independently of one another, and since the generator of an event is not aware of its receivers and each consumer of an event is not aware of the generator or of other consumers, it is very hard for any of the entities involved in the business logic to be aware of the global state of that logic. So, to keep the benefits while avoiding the drawbacks, event sourcing is forbidden as a distributed-transaction communication mechanism; however, events must still be generated once the transaction has finished successfully (committed).

The Event Sourcing pattern was implemented by storing the generated events in the different domain layers, using the database defined in the technology stack.

Figure – Storage of events during the treatment of a request


On the other hand, the event store is difficult to query, since typical queries would require reconstructing the state of the business entities, which is likely to be complex and inefficient. As a result, the application must use Command Query Responsibility Segregation (CQRS) to implement queries, which is proposed as the global solution.

The generated events will also serve to communicate the Command side and Query side microservices developed with the CQRS pattern described below.

CQRS

CQRS stands for Command and Query Responsibility Segregation. The core concept is that different models can be used for updating and for reading information, referred to from now on as the Command part and the Query part respectively.

Figure – CQRS Segregation for a domain

Typically the implementation is separated into two sides, one for Commands and one for Queries, allowing better scalability of the solution as they can be dimensioned separately. The command side is oriented to transactional operations while the query side is oriented to pure read-only services.

When this pattern is applied, general responsibilities are also separated between the two parts that compose it.

Also, the model of the command and the query may be different (and will be in many cases): the model of the command must be oriented to the transactional needs, while the model of the query is oriented only to the queries it may receive.
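
As a small illustration of this split (both types are hypothetical, not Platea's actual domain model), the command model stays normalized and versioned for transactional needs while the query model is denormalized around what its consumers ask for:

```java
// Hypothetical command-side model: normalized, versioned, transaction-oriented.
record Account(String id, String customerId, long balanceCents, long version) {}

// Hypothetical query-side model: denormalized and shaped purely by the
// queries it serves (e.g., a customer-facing account summary).
record AccountSummaryView(String accountId, String customerName,
                          String formattedBalance, String lastMovementDate) {}
```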

In our case, we also use a separate microservice for the query-side event-sourcing projection, distinct from the query microservice itself, achieving even better scaling:

Figure – Typical Platea microservices in CQRS cases

Command side responsibilities

All the commands in the model must cover:

  • The validation of the actions that are triggered by the users, including business restrictions
  • Data versioning maintenance, handling transactions and consistency
  • The atomicity of the exposed operations (from the point of view of the caller)
  • The division of the requests into tasks, and their coordination
  • The storage in the Event Journal of all the results of the execution of the commands

And, of course, it must implement and handle all the business logic of its domain.
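
The sketch below shows how these responsibilities can line up in a single command handler: validation first, then a versioned, atomic append to the Event Journal, then event publication. The journal and publisher interfaces are illustrative stand-ins, not Platea's actual APIs.

```java
// Hypothetical command handler for the responsibilities listed above.
record WithdrawCommand(String accountId, long amountCents) {}
record MoneyWithdrawn(String accountId, long amountCents, long version) {}

interface EventJournal {                       // stand-in for the Event Journal
    long latestVersion(String accountId);
    long currentBalance(String accountId);
    void append(MoneyWithdrawn event);         // atomic, append-only write
}

interface EventPublisher {                     // stand-in for the event broker
    void publish(MoneyWithdrawn event);
}

class WithdrawHandler {

    private final EventJournal journal;
    private final EventPublisher publisher;

    WithdrawHandler(EventJournal journal, EventPublisher publisher) {
        this.journal = journal;
        this.publisher = publisher;
    }

    void handle(WithdrawCommand cmd) {
        // Validate the user-triggered action, including business restrictions.
        if (cmd.amountCents() <= 0
                || journal.currentBalance(cmd.accountId()) < cmd.amountCents()) {
            throw new IllegalArgumentException("invalid amount or insufficient funds");
        }
        // Maintain data versioning and store the result atomically in the journal.
        long nextVersion = journal.latestVersion(cmd.accountId()) + 1;
        MoneyWithdrawn event =
                new MoneyWithdrawn(cmd.accountId(), cmd.amountCents(), nextVersion);
        journal.append(event);
        // Publish so the query side and other listeners can react.
        publisher.publish(event);
    }
}
```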

Query side responsibilities

The Query side creates a projection of all views necessary for the different clients of the system, like the traditional materialized views. This side will be responsible for:

  • Keeping only the latest version of the data, according to how it is consumed.
  • Updating the data whenever there is an update on the command side.
  • Offering fast and reliable access to the data (high-availability reads).
  • Offering consistent access to the data (high-consistency reads).

Apart from that, the Query Side must perform the necessary aggregations of the data so that it gets enriched before serving it to the clients, increasing its value.
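
A minimal projection sketch follows: it consumes committed domain events (the MoneyWithdrawn record has the same shape as in the command-side sketch) and updates the denormalized view. Kafka is assumed as the broker, as named in the infrastructure section, and the Spring Kafka listener wiring is illustrative.

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

record MoneyWithdrawn(String accountId, long amountCents, long version) {}

interface AccountSummaryRepository {           // hypothetical read-model store
    void decreaseBalance(String accountId, long amountCents);
}

@Component
class AccountSummaryProjector {

    private final AccountSummaryRepository views;

    AccountSummaryProjector(AccountSummaryRepository views) {
        this.views = views;
    }

    // Listens to committed domain events and keeps only the latest version
    // of the data, shaped for how it is consumed.
    @KafkaListener(topics = "account-events", groupId = "account-summary-projection")
    public void on(MoneyWithdrawn event) {
        views.decreaseBalance(event.accountId(), event.amountCents());
    }
}
```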

Communication between Command/Query

Communication between commands and queries is carried out by an event broker, which allows decoupling from each other and improves scalability.

The use of this type of communication system also preserves the responsiveness of the whole solution even if part of the system goes down, because the messages persist in the broker's queue until they are consumed.

Despite adding a bit of latency, asynchronous communication provides responsiveness and resilience to the system.

Transactionality

The Transaction Manager is the architecture component in charge of managing distributed transactions among the Platea banking Microservices architecture.
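
The document does not describe the Transaction Manager's internal mechanism. One common way to provide distributed transactions with automatic rollbacks across microservices is a saga-style coordinator that pairs every action with a compensating action; the sketch below is offered purely under that assumption, not as Platea's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Assumed saga-style coordinator: each step carries an action and the
// compensation that undoes it if a later step fails.
class SagaCoordinator {

    record Step(Runnable action, Runnable compensation) {}

    private final Deque<Step> completed = new ArrayDeque<>();

    void execute(Iterable<Step> steps) {
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);   // remember for a potential rollback
            } catch (RuntimeException failure) {
                rollback();             // undo everything done so far
                throw failure;
            }
        }
    }

    private void rollback() {
        // Compensate in reverse order, mirroring a distributed rollback
        // across the services (or third parties) already called.
        while (!completed.isEmpty()) {
            completed.pop().compensation().run();
        }
    }
}
```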

Benefits

CQRS

  • Scale independently
  • Ability to optimize queries independently from the rest of the platform
  • 100% reliable source of truth
  • Point in time recovery
  • Decouple business logic

Event Sourcing

  • Enables a 100% reliable audit log on entity modifications
  • Allows entity state searches at any specific date back in time
  • Business logic is based on loosely coupled entities interchanging events

Transaction Manager

  • Codeless for developers
  • Ensures ACID (Atomicity, Consistency, Isolation, Durability)
  • Automatic rollbacks, including third-party products, through configuration
  • Automated event sourcing

Platea Banking Architecture

Whole view

Figure – High level whole view of Platea Architecture


Detailed View

Infrastructure

The infrastructure is based on an AWS EKS cluster as the main component, along with some satellite systems used to complement its functionality. These components include load balancers, NAT Gateways, Route 53 DNS resolvers, and the typical VPC networking bits and pieces, like ACLs, Security Groups, etc.

The whole VPC networking follows a simple public/private subnet architecture. Only the load balancers and the NAT Gateways are placed in the public subnets, while all the other components of the infrastructure live in the private subnet space.

For relational data storage, there is a multi-node private fully managed Aurora PostgreSQL cluster.

An Elasticsearch cluster has been provisioned for time-series and object data storage. This is a managed, standalone, multi-node cluster living in the private subnets of the VPC.

For secrets and sensitive parameters, a highly available HashiCorp Vault cluster is provisioned also in the private subnets.

Most of the functionality lives in the EKS cluster as Kubernetes objects, such as Deployments, Pods, Services, StatefulSets, etc. Orchestration, security, traceability, and observability of the cluster and its objects are handled using the Istio service mesh.

An AWS-managed Kafka cluster lives in the private subnets and is used as the message broker and transformation service, along with the Logstash and Filebeat services running on EKS. There are also some AWS Lambda functions involved in forwarding and manipulating events, messages, and logs.

Kong is also part of the infrastructure, deployed from the marketplace. It is used as the API manager and sits behind the load balancers, forwarding traffic to specific services in EKS.

Figure – Platea infrastructure

Component communication

The allowed communication models between components differ from case to case.

Exposed services

Services exposed to the outside are published through REST APIs. Components exposing services to the outside must support synchronous JSON-based REST calls.

Exposed Query services

Components exposing query services should not only expose them through a JSON-based REST interface; they should also call only their own data repository, and should not call any other components unless requirements make this impossible to avoid.
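
A minimal sketch of such a query endpoint follows, assuming Spring Boot (the document mandates JSON-based REST but not a specific framework); note that the controller calls only its own read-model repository:

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

record AccountSummaryView(String accountId, String formattedBalance) {}

interface AccountSummaryReader {               // hypothetical read-model repository
    AccountSummaryView findById(String accountId);
}

@RestController
class AccountQueryController {

    private final AccountSummaryReader reader;

    AccountQueryController(AccountSummaryReader reader) {
        this.reader = reader;
    }

    // Serves the projection directly as JSON; no calls to other components.
    @GetMapping("/accounts/{id}/summary")
    public AccountSummaryView summary(@PathVariable("id") String id) {
        return reader.findById(id);
    }
}
```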

Calls to external components

Calls to external components will depend on the technologies those external components support. They are expected to be made through HTTP calls (mainly REST calls).

Calls inside transactions in command side

There are two supported models for communication on the command side:

  • HTTP REST calls, for synchronous request-response communication.
  • Kafka messaging, for both synchronous request-response communication over an asynchronous channel and pure fire-and-forget messaging on a Message-Driven model.

Each command service may expose its synchronous services through either of these models, or through both. Asynchronous services should be exposed through the Kafka interface, but exposing them through an HTTP REST call is allowed, although in that case the execution would not be parallel and decoupled.

Any transaction may mix any type of calls between the services it needs to be completed (any microservice may call another one through any communication model no matter how it was called).

Due to the nature of the transactions (specifically atomicity), a transaction that includes fire-and-forget calls cannot be committed until all of that processing has ended; the intermediate services, however, are not aware of this.
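
The synchronous request-response-over-Kafka model listed above can be sketched with Spring Kafka's ReplyingKafkaTemplate, which correlates each reply with its request; the topic name and wiring are illustrative, not Platea's configuration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.kafka.requestreply.ReplyingKafkaTemplate;
import org.springframework.kafka.requestreply.RequestReplyFuture;

class CommandClient {

    private final ReplyingKafkaTemplate<String, String, String> template;

    CommandClient(ReplyingKafkaTemplate<String, String, String> template) {
        this.template = template;
    }

    String callSynchronously(String command) throws Exception {
        // The template adds a correlation id and reply-topic header, then
        // matches the reply record back to this request.
        ProducerRecord<String, String> record =
                new ProducerRecord<>("command-requests", command);
        RequestReplyFuture<String, String, String> future = template.sendAndReceive(record);
        ConsumerRecord<String, String> reply = future.get(); // synchronous wait
        return reply.value();
    }
}
```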

General model

Communication between components should be Event-Driven. That includes not only business events (such as changes due to transactions) but also other event types, such as observability events and alerts.

Any component may generate events, and any component may listen to any events it can see. Events may be domain-scoped if required, so that no component outside the domain can see them.

DevOps

A project with this scope requires a high level of deployment automation. With so many microservices configured to work together it is vital to use the appropriate cloud orchestration tools and methodologies.

At a high level, the following steps are performed in order to deploy a microservice from its source code:

  • Run unit tests included in the source code.
  • Scan the source code quality using SonarQube.
  • Package the code sources.
  • Deploy the packages to the S3-based artifact repository.
  • Build automated Docker Images using the packages.
  • Push Docker Images to the AWS Elastic Container Registry.
  • Deploy the images on Kubernetes using Helm.

The whole CI/CD system is based on AWS CodeCommit, CodePipeline, CodeBuild, and FluxCD. All are AWS-native components except FluxCD, which is simply a resource installed on Kubernetes (EKS). This allows for controlled and agile resource deployments on EKS. The resources are orchestrated by Helm, which only needs to be updated on architectural changes. The application lifecycle depends solely on Git actions that the developers themselves perform.

The only action taken by the DevOps team is the promotion between environments, which is also an automated action based on Helm templates and the CI/CD tools.

Figure – Development and deployment flow

The whole system is deployed in several non-production environments and one production environment, allowing the platform to be fully tested before pushing changes to production. Also, due to the nature of Docker and Kubernetes, rollbacks are easy to perform if needed. The relational data is regularly backed up with point-in-time backups and can also be rebuilt from any point in time thanks to the immutable events stored.

This fully automated system is resistant to human error and easy to manage and further develop. All the complexities of the microservices architecture are eased by the CICD and orchestration systems used. Such systems are also monitored as part of the infrastructure so there is no need to worry about them.

Observability

All components will be monitored using the following open-source stack, which is considered one of the most mature and standardized solutions in cloud environments.

Below is a brief description of each application:

  • Jaeger: A distributed tracing system used for monitoring and troubleshooting microservices-based distributed systems.
  • Prometheus: A systems and services monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is met.
  • Grafana: An open-source visualization and analytics software. It allows querying, visualizing and exploring metrics no matter where they are stored. It also generates alerts on certain metric values.
  • ELK: The acronym for three open-source projects: Elasticsearch, Logstash, and Kibana.
    • Elasticsearch: A search and analytics engine.
    • Logstash: A server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch.
    • Kibana: Lets users visualize charts and graphs which are fed with Elasticsearch data.

Prometheus scrapes metrics from both applications and infrastructure, although the latter requires specific Prometheus exporters to obtain the information from CloudWatch. Each EKS cluster will have its own Prometheus server for scraping, as well as a Grafana instance for visualizing the metrics. In addition, these metrics will also be stored in an external system, preferably a managed time-series database, using the Prometheus remote write feature. This external system will have a longer retention policy, so that it can be used for functional monitoring.

Jaeger provides distributed tracing capabilities. Thus, each microservice must include the Jaeger library to trace requests and send the information to a Jaeger collector placed in a specific Kubernetes observability cluster. This collector stores data in Kafka, so that it can also be exploited by the informational system, and finally Jaeger stores it in Elasticsearch. To obtain end-to-end traceability, information is visualized through a centralised Jaeger UI.

Filebeat is the agent in charge of collecting the application and audit logs from EKS, while infrastructure logs are collected from CloudWatch using an AWS Lambda function that sends all the logs to Kafka. This allows storing raw audit data in a secure system such as S3 using the informational system. Logstash then formats and enriches that information and stores it in Elasticsearch. Elasticsearch stores both the application logs and the obfuscated audit logs; the latter are stored to comply with the ISO 27001 requirements present in many banking security regulations. The application logs will have a TTL of 7-14 days, but the obfuscated audit logs can be stored for years, so a hot-cold storage lifecycle is recommended. To obtain end-to-end traceability, application logs are visualized through a centralised Kibana. Obfuscated audit logs must be consumed using the functional monitoring system.

Figure – Observability flow

This way we can observe the whole application as well as the infrastructure.

Applications
  • Logging: A Filebeat agent is used for log collection and Logstash is used for aggregation and processing. These logs are first stored in Kafka to prevent overloading Logstash and so that they can be consumed by the informational system. Finally, data is stored in Elasticsearch and visualised by users through Kibana.
  • Monitoring: Prometheus is used for metrics collection and storage. Microservices must expose metrics with the Micrometer library (see the sketch after this list); JVM metrics are essential to identify malfunctions or bottlenecks. Grafana is used for visualization. Data is also stored in an external data storage system, so that it can be used for functional monitoring.
  • Tracing: Jaeger is used for tracing collection and processing. Traces are also stored in a Kafka topic for the same reasons that logs are. Finally, data is stored on Elasticsearch and visualisation is done via Jaeger UI.
Infrastructure
  • Logging: An AWS Lambda function collects the logs from CloudWatch and sends them to Logstash for aggregation and processing. These logs are also stored in Kafka so that they can be consumed by the informational system. Eventually the data is stored in Elasticsearch and visualised by users through Kibana.
  • Monitoring: Prometheus is used for metrics collection and storage. Exporters for Elasticsearch, Kafka, RDS, and CloudWatch are required to expose database and infrastructure metrics. Visualisation is done using Grafana. Here as well, data is also stored in an external data storage system for further use in functional monitoring.
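
As referenced in the monitoring item above, a minimal sketch of exposing application metrics with Micrometer follows; the meter names are hypothetical, and the registry is whatever backend (e.g., Prometheus) the service wires in:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical business metrics for a transfer service; the registry
// implementation (Prometheus, etc.) publishes whatever is registered here.
class TransferMetrics {

    private final Counter transfers;
    private final Timer latency;

    TransferMetrics(MeterRegistry registry) {
        this.transfers = Counter.builder("platea.transfers.total")
                .description("Completed money transfers")
                .register(registry);
        this.latency = Timer.builder("platea.transfers.latency")
                .description("End-to-end transfer latency")
                .register(registry);
    }

    void recordTransfer(Runnable transfer) {
        latency.record(transfer);  // times the operation
        transfers.increment();     // counts completed transfers
    }
}
```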

Lessons Learned

Creating a baseline for a banking application is a massive task that should not (and need not) be done from scratch. Taking advantage of a mature CQRS banking platform, well tested by existing customers and able to pass regulators' audits, is a major advantage.

Deploying the solution in AWS using a fully automated orchestration system is a recipe for success. AWS is constantly maintaining and improving its services, regions, and datacentres. The growing needs of a banking application are no challenge for the AWS infrastructure. Resilient, secure, and certified, the cloud is the home of the Platea banking solution.

Last but not least, the NTT DATA team is constantly improving the solution and is able and available to provide support, installation, management, and improvements for the Platea Open Banking solution.

[1] Mambu is a Cloud banking SaaS: https://www.mambu.com/

[2] https://www.reactivemanifesto.org/
