Stress Testing Supply Chain Networks at Scale on Databricks


Introduction

In the current trade war, governments have weaponized trade through cycles of retaliatory tariffs, quotas, and export bans. The shockwaves have rippled across supply chain networks and forced companies to reroute sourcing, reshore manufacturing, and stockpile critical inputs, measures that extend lead times and erode once-lean, just-in-time operations. Every detour carries a cost: rising input prices, higher logistics expenses, and additional inventory tying up working capital. As a result, profit margins shrink, cash-flow volatility increases, and balance-sheet risks intensify.

Was the trade war a singular event that caught global supply chains off guard? Perhaps in its specifics, but the magnitude of disruption was hardly unprecedented. Over the span of just a few years, the COVID-19 pandemic, the 2021 Suez Canal blockage, and the ongoing Russo-Ukrainian war each delivered major shocks, occurring roughly a year apart. These events, difficult to foresee, caused substantial disruption to global supply chains.

What can be done to prepare for such disruptive events? Instead of reacting in panic to last-minute changes, can companies make informed decisions and take proactive steps before a crisis unfolds? A well-cited paper by MIT professor David Simchi-Levi offers a compelling, data-driven approach to this problem. At the core of his methodology is the creation of a digital twin: a graph-based model in which nodes represent sites and facilities in the supply chain, and edges represent the flow of materials between them. A range of disruption scenarios is then applied to the network, and its responses are measured. Through this process, companies can assess potential impacts, uncover hidden vulnerabilities, and identify redundant investments.

This process, known as stress testing, has been widely adopted across industries. Ford Motor Company, for example, applied this approach across its operations and supply network, which includes over 4,400 direct supplier sites, hundreds of thousands of lower-tier suppliers, more than 50 Ford-owned facilities, 130,000 unique parts, and over $80 billion in annual external procurement. Their analysis revealed that roughly 61% of supplier sites, if disrupted, would have no impact on revenue, while about 2% would have a significant impact. These insights fundamentally reshaped their approach to supply chain risk management.

The remainder of this blog post provides a high-level overview of how to implement such a solution and perform a comprehensive analysis on Databricks. The supporting notebooks are open-sourced and available here.

Stress Testing Supply Chain Networks on Databricks

Imagine a scenario where we work for a global retailer or a consumer goods company and are tasked with improving supply chain resiliency. Specifically, this means ensuring that our supply chain network can meet customer demand during future disruptive events to the fullest extent possible. To achieve this, we must identify vulnerable sites and facilities within the network that could cause disproportionate damage if they fail, and reassess our investments to mitigate the associated risks. Identifying high-risk areas also helps us recognize low-risk ones. If we discover areas where we are overinvesting, we can either reallocate those resources to balance risk exposure or reduce unnecessary costs.

The first step toward achieving our goal is to construct a digital twin of our supply chain network. In this model, supplier sites, manufacturing facilities, warehouses, and distribution centers are represented as nodes in a graph, while the edges between them capture the flow of materials throughout the network. Creating this model requires operational data such as inventory levels, production capacities, bills of materials, and product demand. By using these data as inputs to a linear optimization program, designed to optimize a key metric such as profit or cost, we can determine the optimal configuration of the network for that objective. This enables us to determine how much material should be sourced from each sub-supplier, where it should be transported, and how it should move through to manufacturing sites to optimize the chosen metric, a supply chain optimization approach widely adopted by many organizations. Stress testing goes a step further, introducing the concepts of time-to-recover (TTR) and time-to-survive (TTS).
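To make the optimization step concrete, here is a deliberately tiny sketch of such a linear program using `scipy.optimize.linprog`. The supplier names, capacities, costs, and demand below are made-up toy numbers, not data from the notebooks; the real formulation spans multiple tiers, products, and bill-of-materials constraints.

```python
# Toy network-flow LP: choose how much to source from each of two hypothetical
# suppliers (S1, S2) to meet demand for one finished good at maximum profit.
from scipy.optimize import linprog

price = 5.0          # selling price per unit of the finished good
demand = 150.0       # customer demand (units)
cost = [2.0, 3.0]    # unit sourcing cost at S1 and S2
cap = [100.0, 80.0]  # production capacity at S1 and S2

# linprog minimizes, so we minimize total sourcing cost subject to meeting
# demand exactly and respecting each supplier's capacity.
res = linprog(
    c=cost,
    A_eq=[[1.0, 1.0]], b_eq=[demand],
    bounds=[(0.0, cap[0]), (0.0, cap[1])],
)

profit = price * demand - res.fun
print(res.x, profit)  # sources 100 units from S1, 50 from S2; profit = 400.0
```

The same pattern, with many more variables and constraints, yields the optimal network configuration that serves as the baseline for the stress tests below.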

Visualization of the digital twin of a multi-tier supply chain network.

Time-to-recover (TTR)

TTR is one of the key inputs to the network. It indicates how long a node, or a group of nodes, takes to recover to its normal state after a disruption. For example, if one of your supplier's manufacturing sites experiences a fire and becomes non-operational, TTR represents the time required for that site to resume supplying at its previous capacity. TTR is typically obtained directly from suppliers or through internal assessments.

With TTR in hand, we begin simulating disruptive scenarios. Under the hood, this involves removing or limiting the capacity of a node, or a set of nodes, affected by the disruption and allowing the network to re-optimize its configuration to maximize profit or minimize cost across all products under the given constraints. We then assess the financial loss of operating under this new configuration and calculate the cumulative impact over the duration of the TTR. This gives us the estimated impact of the specific disruption. We repeat this process for thousands of scenarios in parallel using Databricks' distributed computing capabilities.
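The scenario loop can be sketched as follows. All names and numbers here are hypothetical toy data; and because this toy case has a single product and independent capacities, "re-optimizing" reduces to greedily filling demand from the cheapest surviving suppliers, whereas the real solution re-runs the full LP for every scenario.

```python
# Simplified TTR impact simulation: fail one supplier node per scenario,
# re-optimize the surviving network, and accumulate the daily profit loss
# over that node's time-to-recover.
PRICE, DEMAND = 5.0, 150.0
suppliers = {"S1": (100.0, 2.0), "S2": (80.0, 3.0), "S3": (60.0, 4.0)}  # (capacity, unit cost)
ttr_days = {"S1": 30, "S2": 10, "S3": 20}  # time-to-recover per supplier

def daily_profit(active):
    """Profit per day when only the `active` suppliers are operational.
    Greedy cheapest-first allocation; unmet demand is simply lost revenue."""
    remaining, profit = DEMAND, 0.0
    for name, (cap, cost) in sorted(active.items(), key=lambda kv: kv[1][1]):
        qty = min(cap, remaining)
        profit += qty * (PRICE - cost)
        remaining -= qty
    return profit

baseline = daily_profit(suppliers)
impact = {}
for node in suppliers:  # one disruption scenario per failed supplier node
    surviving = {k: v for k, v in suppliers.items() if k != node}
    impact[node] = (baseline - daily_profit(surviving)) * ttr_days[node]

print(impact)  # {'S1': 5400.0, 'S2': 500.0, 'S3': 0.0}
```

Note how S3's failure costs nothing here because the remaining capacity still covers demand, the same pattern Ford observed for the majority of its supplier sites.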

Below is an example of an analysis performed on a multi-tier network producing 200 finished goods, with materials sourced from 500 tier-one suppliers and 1,000 tier-two suppliers. Operational data were randomly generated within reasonable constraints. For the disruptive scenarios, each supplier node was removed individually from the graph and assigned a random TTR. The scatter plot below displays total spend on supplier sites for risk mitigation on the vertical axis and lost profit on the horizontal axis. This visualization allows us to quickly identify areas where risk mitigation investment is undersized relative to the potential damage of a node failure (red box), as well as areas where investment is outsized compared to the risk (green box). Both areas present opportunities to revisit and optimize our investment strategy, either to enhance network resiliency or to reduce unnecessary costs.

Analysis of risk mitigation spend vs. potential profit loss, indicating areas of over- & under-investment 

Time-to-survive (TTS)

TTS offers another perspective on the risk associated with node failure. Unlike TTR, TTS is not an input but an output, a decision variable. When a disruption occurs and affects a node or a group of nodes, TTS indicates how long the reconfigured network can continue fulfilling customer demand without any loss. The risk becomes more pronounced when TTR is significantly longer than TTS.

Below is another analysis conducted on the same network. The histogram shows the distribution of differences between TTR and TTS for each node. Nodes with a negative TTR − TTS are not a concern, assuming the provided TTR values are accurate. However, nodes with a positive TTR − TTS may incur financial loss, especially those with a large gap. To enhance network resiliency, we can reassess and potentially reduce TTR by renegotiating terms with suppliers, increase TTS by building inventory buffers, or diversify the sourcing strategy.
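A toy sketch of how a TTS value and the TTR − TTS gap might be derived for the finished-goods case: here TTS is simply how long a safety-stock buffer can cover the capacity shortfall left by a failed node before demand goes unmet. The numbers are hypothetical, and the real model derives TTS from the full multi-tier optimization rather than this single-buffer shortcut.

```python
# Derive TTS per failure scenario and compare it against TTR.
# A positive TTR - TTS gap flags a node whose outage outlasts our buffers.
DEMAND = 150.0                      # units of customer demand per day
capacity = {"S1": 100.0, "S2": 80.0, "S3": 60.0}  # daily capacity per supplier
ttr_days = {"S1": 30.0, "S2": 10.0, "S3": 20.0}   # time-to-recover per supplier
inventory_buffer = 120.0            # units of finished-goods safety stock

gaps = {}
for node in capacity:
    surviving = sum(c for n, c in capacity.items() if n != node)
    shortfall = DEMAND - surviving  # daily unmet demand after the failure
    tts = float("inf") if shortfall <= 0 else inventory_buffer / shortfall
    gaps[node] = ttr_days[node] - tts  # positive => expect financial loss

print(gaps)  # {'S1': 18.0, 'S2': -inf, 'S3': -inf}
```

Only S1 shows a positive gap: its 30-day recovery far exceeds the 12 days the buffer can absorb, so it is exactly the kind of node the histogram analysis would flag.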

Analysis of nodes focused on time to recover (TTR) relative to time until disruption incurs downstream losses (TTS) 

By combining TTR and TTS analysis, we can gain a deeper understanding of supply chain network resiliency. This exercise can be conducted strategically on a yearly or quarterly basis to inform sourcing decisions, or more tactically on a weekly or daily basis to monitor fluctuating risk levels across the network, helping to ensure smooth and responsive supply chain operations.

On a lightweight four-node cluster, the TTR and TTS analyses completed in 5 and 40 minutes respectively on the network described above (1,700 nodes), all for under $10 in cloud spend. This highlights the solution's impressive speed and cost-effectiveness. However, as supply chain complexity and business requirements grow, with increased variability, interdependencies, and edge cases, the solution may require greater computational power and more simulations to maintain confidence in the results.

Why Databricks

Every data-driven solution relies on the quality and completeness of the input dataset, and stress testing is no exception. Companies need high-quality operational data from their suppliers and sub-suppliers, including information on bills of materials, inventory, production capacities, demand, TTR, and more. Gathering and curating this data is not trivial. Moreover, building a transparent and flexible stress-testing framework that reflects the unique aspects of your business requires access to a wide range of open-source and third-party tools, and the ability to select the right combination. In particular, this includes LP solvers and modeling frameworks. Finally, the effectiveness of stress testing hinges on the breadth of the disruption scenarios considered. Running such a comprehensive set of simulations demands access to highly scalable computing resources.

Databricks is an ideal platform for building such a solution. While there are many reasons, the most important include:

  1. Delta Sharing: Access to up-to-date operational data is critical for developing a resilient supply chain solution. Delta Sharing is a powerful capability that enables seamless data exchange between companies and their suppliers, even when one party is not using the Databricks platform. Once the data is available in Databricks, business analysts, data engineers, data scientists, statisticians, and managers can all collaborate on the solution within a unified Data Intelligence Platform.
  2. Open Standards: Databricks integrates seamlessly with a broad range of open-source and third-party technologies, enabling teams to leverage familiar tools and libraries with minimal friction. Users have the flexibility to define and model their own business problems, tailoring solutions to specific operational needs. Open-source tools provide full transparency into their internals, which is crucial for auditability, validation, and continuous improvement, while proprietary tools may offer performance advantages. On Databricks, you have the freedom to choose the tools that best suit your needs.
  3. Scalability: Solving optimization problems on networks with thousands of nodes is computationally intensive. Stress testing requires running simulations across tens of thousands of disruption scenarios, whether for strategic (yearly/quarterly) or tactical (weekly/daily) planning, which demands a highly scalable platform. Databricks excels in this area, offering horizontal scaling to efficiently handle complex workloads, powered by strong integration with distributed computing frameworks such as Ray and Spark.
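The scalability point rests on a simple property: each disruption scenario is an independent solve, so scenarios map cleanly onto parallel workers. The sketch below illustrates that fan-out pattern with the standard library only; `evaluate_scenario` is a hypothetical placeholder for one LP re-solve, and on Databricks the same map would be expressed as Spark tasks or Ray remote functions instead.

```python
# Local stand-in for the scenario fan-out: evaluate many independent
# disruption scenarios in parallel and collect the simulated losses.
from concurrent.futures import ThreadPoolExecutor

def evaluate_scenario(scenario_id: int) -> tuple[int, float]:
    """Placeholder for one LP re-solve; returns (scenario id, simulated loss)."""
    loss = float(scenario_id % 7) * 100.0  # stand-in for the real computation
    return scenario_id, loss

scenarios = range(10_000)  # e.g. one scenario per disrupted node or node group
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(evaluate_scenario, scenarios))

print(max(results.values()))  # worst simulated loss across all scenarios
```

Because the workers share nothing, swapping the executor for a Spark RDD map or Ray tasks changes only the fan-out mechanism, not the scenario logic, which is what lets the analysis scale horizontally.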

Abstract

Global supply chains often lack visibility into network vulnerabilities and struggle to predict which supplier sites or facilities would cause the most damage during disruptions, leading to reactive crisis management. In this article, we presented an approach to build a digital twin of the supply chain network by leveraging operational data and running stress testing simulations that evaluate Time-to-Recover (TTR) and Time-to-Survive (TTS) metrics across thousands of disruption scenarios on Databricks' scalable platform. This methodology enables companies to optimize risk mitigation investments by identifying high-impact, vulnerable nodes, similar to Ford's discovery that only a small fraction of supplier sites significantly affect revenue, while avoiding overinvestment in low-risk areas. The result is preserved profit margins and reduced supply chain costs.

Databricks is ideally suited for this approach, thanks to its scalable architecture, Delta Sharing for real-time data exchange, and seamless integration with open-source and third-party tools for transparent, flexible, efficient, and cost-effective supply chain modeling. Download the notebooks to explore how stress testing of supply chain networks at scale can be performed on Databricks.
