Ensuring Resilience in the Fintech Eco-system by Introducing Chaos Engineering
Ravi Bhushan Konduru, Anil Bahirat
Over the past few decades, consumers have increasingly adopted digital tools that help them save time, manage funds securely, and track and control their finances efficiently, which has driven the growth of Fintech companies. According to some reports, the Fintech industry is projected to reach $1.5 trillion by 2030.
North America (NA) holds the largest share of the global Fintech market. The US Fintech-as-a-service market is expected to grow at a compound annual growth rate (CAGR) of 17.5% from 2023 to 2030. The primary drivers are the increasing demand for seamless, user-friendly digital financial services among consumers and businesses, and the rise of open banking and APIs.
The Fintech industry’s rapid growth is fueled by changing consumer preferences, technological advancements (APIs, mobile tech, cloud computing, data analytics, AI), a thriving start-up culture, supportive regulations, and a strong push for globalization.
This growth comes with inherent challenges.
These challenges demand careful attention. System reliability is paramount, as downtime or failure can result in significant financial losses and erode customer trust. Insider threats pose a substantial risk, accounting for 60% of security breaches. Additionally, the complexity and interconnected nature of fintech systems can be a hurdle, particularly when integrating modern high-tech apps with the legacy systems of established financial institutions.
Moreover, cybersecurity threats are prevalent, including ransomware attacks, phishing attempts, and data breaches, with over 1,000 security breaches noted in a single year. Failover mechanisms and robust security features such as two-factor authentication and encryption are crucial safeguards. Regulatory compliance is another challenge, as the industry must adhere to strict data security, privacy, and reliability standards.
Maintaining customer trust is vital, as frequent outages or performance issues can drive customers away to competitors. Providing a seamless and secure user experience is imperative for retaining a loyal customer base. Companies in the fintech sector should stay vigilant and consult with cybersecurity specialists to ensure they meet both regulatory standards and customer expectations.
Resilience and Reliability of the system against Application failure, Infrastructure & Network
Cybersecurity is a major priority for global businesses, and organizations are investing heavily in infrastructure and dedicated teams. Most hackers seek financial gain and steal data from enterprise or government systems. The 2017 Ponemon Cost of a Data Breach Study breaks down the root causes of data breaches into three areas: malicious attacks, system glitches, and human error. Organizations should encourage their developers to deliberately break the system through Chaos Engineering practices in order to identify its weaknesses. Chaos Engineering can help anticipate cyberattacks and keep hackers out of the system.
Chaos Engineering is a testing practice that helps organizations proactively identify and mitigate potential system issues by intentionally introducing controlled chaos. In fintech, where the robustness, reliability, and security of financial systems are paramount, it can play a crucial role in exposing weaknesses before they cause an outage.
Chaos engineering aims to build software that can withstand turbulence and unexpected conditions across application behavior, infrastructure, or networks.
Digital ecosystems are becoming increasingly complex. A service outage is very costly, and its impact is manifold. Traditional testing methods are not enough to guarantee service availability in next-generation systems. Hence, there is a need for an innovative approach that verifies and validates availability in an automated manner. Chaos Engineering addresses the resiliency of the key components of any organization: people, culture, processes, applications, platforms, and infrastructure.
Solution Delivery Approach
Fintech companies must adopt Chaos Engineering principles and practices, along with some cultural changes, to address these inherent challenges.
Phase I: Enhancing Organizational Knowledge
In this initial phase, the objective is to elevate organizational knowledge by optimizing processes and making the most of the collective wisdom and information available within the organization. This phase encompasses a range of strategies, practices, and technologies geared toward making knowledge more accessible, valuable, and actionable for all members of the organization. Hypotheses about system behavior are then built on this existing knowledge. Below are some key considerations:
Standardizing Reliability Metrics and Prioritization
- Identify core capabilities by performing postmortems of past incidents.
- Establish uniform definitions and conventions across the entire organization.
- Standardize metrics to ensure consistency.
- Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the organization and individual teams.
- Foster a common language and understanding of SLOs and SLIs throughout the organization.
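To make the SLI and SLO vocabulary concrete, here is a minimal sketch of one common convention: an availability SLI computed as the ratio of good events to total events, compared against an SLO target to obtain the remaining error budget. The names `Slo`, `availability_sli`, and `error_budget_remaining`, the "payments-api" service, and the 99.9% target are illustrative assumptions, not values taken from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective over a success-ratio SLI (hypothetical type)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; an idle window counts as fully available."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failures = (1.0 - slo.target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no room for any failure.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Assumed example numbers: 1,000,000 requests in the window, 500 of them failed.
payments_slo = Slo(name="payments-api availability", target=0.999)
sli = availability_sli(good_events=999_500, total_events=1_000_000)
budget = error_budget_remaining(payments_slo, 999_500, 1_000_000)
print(f"SLI={sli:.4%}, error budget remaining={budget:.1%}")
```

Expressing every team's objectives through the same two functions is one way to give the organization the shared language described above; the real indicators and targets would come from each team's own agreements.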
Implementing a Robust Incident Handling Mechanism
- Introduce an incident command control structure for efficient incident management.
- Enhance coordination, collaboration, and communication between different teams.
- Promote cross-training among teams, emphasizing continuous learning.
- Adopt a Developer on Call (DoC) approach when rolling out new features into production.
Establishing Services/Component Ownership
- Clearly define ownership of services and components through a Developer Portal.
- Strive to address 80% of issues through the Developer Portal.
- Set up war rooms and establish effective communication channels for incident resolution.
- Define clear accountability for both global and team-based SLIs and SLOs.
By implementing these measures, teams will be well-prepared and knowledgeable about whom to contact in case of an outage or system failure. This concludes Phase I, which is dedicated to enhancing organizational knowledge. However, it is essential to recognize that achieving system resilience hinges on the subsequent Phase II.
Phase II: Enhancing Overall System Reliability
This phase focuses on improving the overall system reliability by centralizing knowledge on known limitations and capabilities, leveraging insights from past postmortems, and utilizing the right chaos engineering tools to accomplish the desired outcome. The following steps outline our approach:
The Delivery Approach
- Gather and consolidate information regarding known limitations and system capabilities and build hypotheses about the system’s behaviors.
Learn from Previous Experience by Performing Postmortems
- Extract valuable insights from previous postmortem reports, analyze production defects, and read the root cause analysis reports for each production issue to better understand system weaknesses and strengths.
Categorize Postmortem Outcomes
Segregate the postmortem outcomes and actionable items into three distinct buckets based on their characteristics:
- Scenarios with Self-Recovery Potential:
  - Identify scenarios where systems can autonomously recover. This represents the ideal state of system resilience.
  - Continuously Verify and Validate: Regularly confirm that these scenarios indeed allow for self-recovery.
- Scenarios Requiring Verification:
  - Pinpoint scenarios where self-recovery is uncertain or unconfirmed.
  - Conduct Chaos Experiments:
    - Plan and execute Chaos Experiments to rigorously test these scenarios.
    - Document Baseline and Steady-State Metrics: Establish baseline performance metrics for comparison.
    - Execute Chaos Experiments: Follow Chaos Principles and Practices to simulate real-world failures.
    - Publish Advisory Bulletins: Share experiment outcomes and findings with teams.
    - Enable System Adjustments: Allow teams to make necessary fixes and optimizations.
    - Iterative Validation: Repeat these steps until the scenario transitions into the “Scenarios with Self-Recovery Potential” bucket.
- Scenarios Requiring Manual Intervention:
  - Identify scenarios where manual intervention is necessary for recovery.
  - Immediate Improvement: Prioritize making immediate enhancements.
  - Conduct Chaos Experiments:
    - Apply Chaos Principles and Practices to these scenarios as well.
    - Publish Advisory Bulletins: Communicate experiment results and insights.
    - Enable System Adjustments: Allow teams to adjust the systems.
    - Iterative Validation: Continue the process until these scenarios shift into the “Scenarios with Self-Recovery Potential” bucket.
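The three buckets above can be sketched as a small triage helper. This is only an illustration under stated assumptions: the `Scenario` record, its `recovered_without_humans` field, and the example scenario names are hypothetical, and a real postmortem would carry far more context than a single flag.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Bucket(Enum):
    SELF_RECOVERY = "Scenarios with Self-Recovery Potential"
    NEEDS_VERIFICATION = "Scenarios Requiring Verification"
    MANUAL_INTERVENTION = "Scenarios Requiring Manual Intervention"


@dataclass
class Scenario:
    """One postmortem outcome (hypothetical shape)."""
    name: str
    # None means nobody has confirmed yet how the system recovers.
    recovered_without_humans: Optional[bool] = None


def categorize(scenario: Scenario) -> Bucket:
    """Place a postmortem outcome into one of the three buckets."""
    if scenario.recovered_without_humans is None:
        return Bucket.NEEDS_VERIFICATION
    if scenario.recovered_without_humans:
        return Bucket.SELF_RECOVERY
    return Bucket.MANUAL_INTERVENTION


# Steps shared by both buckets that still need chaos experiments.
EXPERIMENT_LOOP = [
    "Execute chaos experiments",
    "Publish advisory bulletins",
    "Enable system adjustments",
    "Repeat until the scenario self-recovers",
]


def next_actions(bucket: Bucket) -> List[str]:
    """Follow-up steps for a bucket, mirroring the list above."""
    if bucket is Bucket.SELF_RECOVERY:
        return ["Continuously verify and validate self-recovery"]
    if bucket is Bucket.NEEDS_VERIFICATION:
        return ["Document baseline and steady-state metrics"] + EXPERIMENT_LOOP
    return ["Prioritize immediate improvement"] + EXPERIMENT_LOOP


# Assumed example scenarios, for illustration only.
backlog = [
    Scenario("cache node restart", recovered_without_humans=True),
    Scenario("third-party gateway timeout"),
    Scenario("primary database failover", recovered_without_humans=False),
]
for item in backlog:
    bucket = categorize(item)
    print(f"{item.name}: {bucket.value} -> {next_actions(bucket)[0]}")
```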
By diligently categorizing and addressing scenarios based on their recovery potential, we aim to continuously enhance our systems’ reliability and resilience.
To implement chaos engineering in a fintech ecosystem, start with controlled experiments in non-production environments to minimize risks. Ensure compliance with fintech-specific regulations and security considerations.
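A controlled experiment of this kind can be sketched in a few lines: measure a steady-state metric, inject a fault, measure again, always roll the fault back, and refuse to run in production. Everything here is a toy under stated assumptions: `LedgerClient`, `BalanceService`, the "staging" environment label, and the 99% threshold are hypothetical stand-ins, and the sketch does not use the API of Chaos Monkey, Gremlin, or Chaos Toolkit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class DependencyDown(Exception):
    """Raised by the stub dependency while a fault is injected."""


class LedgerClient:
    """Hypothetical downstream dependency with a switchable fault."""

    def __init__(self) -> None:
        self.fault_injected = False

    def balance(self, account_id: str) -> int:
        if self.fault_injected:
            raise DependencyDown("ledger unavailable")
        return 100


class BalanceService:
    """Hypothetical service under test; can fall back to the last cached value."""

    def __init__(self, ledger: LedgerClient, fallback_enabled: bool = True) -> None:
        self.ledger = ledger
        self.fallback_enabled = fallback_enabled
        self.cache: Dict[str, int] = {}

    def get_balance(self, account_id: str) -> int:
        try:
            value = self.ledger.balance(account_id)
        except DependencyDown:
            if self.fallback_enabled and account_id in self.cache:
                return self.cache[account_id]
            raise
        self.cache[account_id] = value
        return value


def success_rate(call: Callable[[str], int], account_ids: List[str]) -> float:
    """Steady-state metric: fraction of calls that return without raising."""
    succeeded = 0
    for account_id in account_ids:
        try:
            call(account_id)
            succeeded += 1
        except Exception:
            pass  # a failed call simply does not count as a success
    return succeeded / len(account_ids)


@dataclass(frozen=True)
class ExperimentResult:
    baseline: float
    during_fault: float
    hypothesis_held: bool


def run_experiment(environment: str, service: BalanceService, ledger: LedgerClient,
                   account_ids: List[str], min_success_rate: float) -> ExperimentResult:
    """Compare the steady-state metric before and during an injected fault."""
    if environment == "production":
        raise PermissionError("this sketch only runs in non-production environments")
    baseline = success_rate(service.get_balance, account_ids)
    ledger.fault_injected = True
    try:
        during_fault = success_rate(service.get_balance, account_ids)
    finally:
        ledger.fault_injected = False  # always roll the fault back
    return ExperimentResult(baseline, during_fault, during_fault >= min_success_rate)


accounts = ["acct-1", "acct-2", "acct-3"]
ledger = LedgerClient()
resilient = run_experiment("staging", BalanceService(ledger), ledger, accounts, 0.99)
fragile = run_experiment("staging", BalanceService(ledger, fallback_enabled=False),
                         ledger, accounts, 0.99)
print(resilient)
print(fragile)
```

The hypothesis here is that the success rate stays at or above the threshold while the dependency is down. The service with the cache fallback keeps serving, while the one without it fails every call, which is the kind of finding that would be shared with the owning team and retested after a fix.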
Cigniti Technologies, a global digital assurance and engineering leader, offers various services, including Chaos Engineering. They have 150+ experienced engineers skilled in designing and executing chaos experiments using various tools like Chaos Monkey, Gremlin, and Chaos Toolkit.
Over five years, Cigniti has demonstrated expertise in delivering engagements for Banking and Financial Services clients, identifying weak points, and ensuring system recoverability. These experiments validate the system’s ability to handle adverse conditions and ensure service continuity with third-party systems.