Liquid Cooling Migration Playbook for ML Clusters

A step-by-step playbook for migrating on-prem GPU clusters to direct-to-chip or RDHx cooling without throttling, leaks, or outages.

Liquid cooling is no longer a niche optimization for hyperscalers; it is becoming a practical requirement for on-prem ML infrastructure planning as GPU densities climb and air cooling hits its physical ceiling. For teams operating training and inference fleets, the question is not whether cooling will matter, but how to migrate without causing accelerator throttling, service interruptions, or a costly rework of rack layouts, monitoring, and procedures. This guide is written as a step-by-step migration playbook for engineering, platform, and infrastructure operations teams that need to move from air-cooled racks to direct-to-chip or rear-door heat exchanger architectures in a controlled, testable way. If you are balancing capacity growth with safety, you will also want to pair this with cost modeling discipline and budget visibility so the cooling project is tied to measurable performance outcomes.

Modern GPU fleets are constrained by more than compute. As densities rise, cooling, power, firmware behavior, and observability become one operational system, and the migration must be treated as a hardware program rather than a facilities side task. That mindset shift matters because teams often underestimate the amount of orchestration required across procurement, vendor qualification, change windows, and validation gates. Done well, the result is more headroom, steadier clocks, lower fan noise, and fewer emergency interventions when training runs saturate the cluster. Done poorly, the first sign of trouble is usually a spike in temperature, a silent firmware mismatch, or an outage that was entirely preventable.

1) Why Liquid Cooling Becomes a Migration, Not a Swap

Air cooling breaks down first at rack density, then at operational tolerance

Traditional data-center air cooling is workable for mixed CPU workloads, but high-density GPU nodes push a different profile of heat rejection. The challenge is not just raw thermal load; it is that GPUs can sustain sharp, localized heat spikes that make inlet air and room-level cooling insufficient even if average temperatures look acceptable. In practice, that means a rack can be “fine” on paper while individual boards are already flirting with thermal throttling. This is why teams moving to liquid cooling must think in terms of thermal envelopes, not just rack watts.

The underlying trend is visible across the market: AI infrastructure is now being designed around immediate power availability and dense thermal systems rather than incremental room-based improvements. That shift is echoed in industry reporting on next-wave AI infrastructure, which highlights the need for liquid cooling and ready-now power to unlock modern accelerators. For ML teams, that translates into a hard operational truth: if the cooling system cannot absorb and remove heat at the pace the hardware generates it, your model throughput becomes a facilities problem. Teams that ignore this relationship often discover it later through unstable training jobs, unexpected GPU downclocking, or rising error rates under sustained load.

Direct-to-chip and RDHx solve different bottlenecks

Direct-to-chip cooling moves heat away from the hottest components, typically the CPUs and GPUs, using cold plates and a circulating coolant loop. It is effective when the server design and vendor support are aligned, and it offers the cleanest path for very high-density GPU systems. Rear-door heat exchanger solutions, by contrast, sit at the back of the rack and remove heat from exhaust air before it recirculates into the room. RDHx can be easier to adopt in some legacy environments because it preserves more of the existing server topology, but it still requires careful airflow and plumbing coordination. Most teams should evaluate both based on rack density, serviceability, and whether the migration needs to preserve existing chassis investments.

Plan the migration as a capacity and reliability program

Cooling upgrades are often justified as an efficiency project, but the bigger win is operational resilience. Better cooling can preserve GPU boost behavior, reduce fan wear, and create room for more aggressive capacity planning. That is especially important in on-prem ML environments where sustained high utilization already pushes hardware close to its limits. If your fleet supports training jobs that run for days, even a modest thermal improvement can materially change run times and cost per experiment. In other words, the migration is a way to buy performance stability, not just thermal comfort.

2) Build the Business Case and Scope the Target Architecture

Start with workload characterization, not vendor brochures

Before comparing direct-to-chip or RDHx products, profile your workloads. You need to know the sustained versus burst power draw of each server class, the proportion of training versus inference jobs, and the acceptable thermal headroom during peak periods. This is also where you decide whether the upgrade is driven by current throttling, near-term expansion, or a hardware refresh cycle. A strong baseline starts with observed data from your cluster scheduler, BMC telemetry, GPU metrics, and facility power logs.

Teams that skip this step usually buy for the average case and fail under the worst case. The right question is not “What is the rated TDP?” but “What happens when every GPU is active, the room is warm, and jobs run for six hours at 95% utilization?” That is the difference between an architecture that looks acceptable and one that survives production load. If you need a model for translating infrastructure inputs into operational decisions, borrow the discipline used in serverless cost modeling for data workloads: forecast the shape of the workload before selecting the platform.
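As a minimal sketch of separating sustained from burst draw, the snippet below takes per-node power samples and reports the highest rolling-window average alongside the single highest sample. The function name, the window size, and the sample readings are illustrative assumptions, not values from any particular cluster or telemetry tool.

```python
from statistics import mean

def power_profile(samples_w, window=5):
    """Return (sustained_w, burst_w) for a list of power samples in watts.

    sustained_w: highest rolling-window average, i.e. the load the
                 cooling system must absorb continuously.
    burst_w:     highest single sample, i.e. the short spike.
    """
    if len(samples_w) < window:
        raise ValueError("need at least one full window of samples")
    rolling = [
        mean(samples_w[i:i + window])
        for i in range(len(samples_w) - window + 1)
    ]
    return max(rolling), max(samples_w)

# Hypothetical per-node readings in watts, one sample per minute.
readings = [5200, 5400, 6100, 6900, 6800, 6700, 6850, 7400, 6900, 6800]
sustained, burst = power_profile(readings, window=5)
print(f"sustained={sustained:.0f} W, burst={burst} W")
```

Sizing against the sustained figure while checking that the burst figure stays inside the validated envelope is one way to avoid buying for the average case.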

Define success metrics up front

Your migration should have measurable goals. Common targets include reducing GPU thermal throttling events to near zero, keeping inlet temperatures within a strict band, increasing sustained boost clocks, or enabling higher rack density without increasing incident rates. You should also define acceptance criteria for leak detection, pump redundancy, maintenance access, and failover behavior. Without explicit targets, a liquid cooling rollout becomes difficult to defend and even harder to debug.

A practical example: a 20-rack ML cluster may target a 10-15% increase in sustained compute throughput at the same energy budget, while also cutting emergency fan interventions and eliminating “warm aisle” hotspots. That kind of objective turns the project into something finance and operations can support together. It also makes vendor comparison easier because you are no longer purchasing “cooling”; you are purchasing verified thermal capacity.

Choose the migration pattern that fits your operating model

There are three common patterns. The first is a full greenfield aisle or room built for liquid cooling from day one, which is ideal but rarely available to teams with existing hardware commitments. The second is a phased retrofit, where a subset of racks is upgraded and instrumented before broader rollout. The third is a mixed fleet approach, where only the highest-density or most throttle-prone nodes move first. For most on-prem ML teams, phased retrofit is the safest default because it preserves learning opportunities and reduces blast radius.

Cooling approach	Best fit	Operational complexity	Retrofit friendliness	Typical risk profile
Direct-to-chip	Very high-density GPU nodes	High	Medium	Leaks, plumbing, maintenance coordination
Rear-door heat exchanger	Legacy rack environments, moderate density growth	Medium	High	Airflow mismatch, rack access constraints
Air cooling with containment	Short-term bridge strategy	Low to medium	Very high	Limited headroom, throttling at scale
Hybrid liquid + air	Transition period	Medium to high	High	Complex operations split across systems
Direct-to-chip + RDHx	Incremental migration with redundancy	High	Medium	Integration and controls complexity

3) Procurement Checklist: Buy for Operations, Not Just Specs

Validate mechanical compatibility and serviceability

Procurement should start with server compatibility matrices. Confirm which GPU server SKUs support cold plates, manifold connections, and coolant flow requirements, and verify whether your rack and row layouts can physically accommodate hoses, rear-door hardware, and service clearances. Ask vendors for drawings, not just marketing specs, and require installation diagrams that show bend radius, port placement, rack depth, and maintenance access. These details matter because a cooling system that cannot be serviced safely is not production-ready.

Also verify the maintenance model. Some systems are easier to isolate at the rack level, while others require broader loop shutdowns that complicate repairs. You should know how quickly a single node can be removed, what happens if a door exchanger fails, and whether replacement parts are stocked locally. If you are evaluating multiple vendors, document assumptions in a standardized scorecard alongside your broader vendor comparison process, because procurement mistakes in infrastructure are expensive and slow to unwind.

Insist on evidence for thermal performance

Do not accept “supports up to X kW per rack” without an accompanying test method. Ask for thermal validation data that includes inlet temperatures, coolant delta-T, ambient conditions, load duration, and test topology. You want to know how the system behaves under sustained load, not just under a short certification burst. This is where the discipline of designing under accelerator constraints becomes useful: the best systems are built around the limit case, not the average case.
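A quick way to sanity-check a vendor's per-rack claim is the standard heat balance Q = m_dot x c_p x delta-T, which relates coolant flow and delta-T to the heat actually carried away. The sketch below assumes plain water (density about 1 kg/L, specific heat about 4186 J/(kg·K)); glycol mixes have a lower specific heat, so substitute the vendor's fluid properties. The function name and the example rack figures are hypothetical.

```python
def heat_removed_kw(flow_lpm, delta_t_c,
                    density_kg_per_l=1.0, cp_j_per_kg_k=4186.0):
    """Heat carried away by a coolant loop, in kW.

    Q = m_dot * c_p * delta_T, with mass flow derived from the
    volumetric flow in litres per minute. Defaults approximate water.
    """
    mass_flow_kg_s = flow_lpm * density_kg_per_l / 60.0
    return mass_flow_kg_s * cp_j_per_kg_k * delta_t_c / 1000.0

# Hypothetical rack: 60 L/min with a 10 degree C supply/return difference.
print(round(heat_removed_kw(60, 10), 2))  # 41.86
```

If the claimed rack capacity is well above what the stated flow and delta-T can carry, ask the vendor which test conditions produced the number.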

Also ask for instrumentation support. Your vendor should expose flow rate, supply and return temperature, pressure, leak state, and pump status through APIs or standardized interfaces. If data only lives in a proprietary console, your team will struggle to integrate it with observability tooling. Cooling monitoring should be first-class telemetry, not an afterthought.

Make service-level and warranty terms explicit

Liquid cooling adds new failure domains, so the contract must reflect them. Confirm who owns failure response for pumps, valves, manifolds, sensors, and water quality management. Make sure warranty language covers leak-related hardware exposure and whether downtime from cooling faults affects the support clock. This matters because the wrong support model can turn a small incident into a multi-day outage.

Pro tip: Treat cooling procurement like a reliability purchase. If a vendor cannot explain how their system degrades safely when sensors fail, they are selling a component, not an operational platform.

4) Site Readiness: Power, Plumbing, and Environmental Controls

Upgrade the room before the rack arrives

One of the biggest migration mistakes is ordering hardware before the site is ready. Direct-to-chip and RDHx systems often require new plumbing paths, floor load reviews, drip trays, make-up water planning, and revised emergency procedures. In some facilities, you may also need to rework aisle containment, pressure relationships, or hot/cold air mixing paths even though the primary heat transfer is now liquid-based. The room should be ready to support the new operating model before the first rack is energized.

For teams with strict power envelopes, capacity planning should include not only the IT load but also pumps, controls, monitoring gear, and any supporting mechanical equipment. That is where planning frameworks similar to datacenter capacity forecasts help tie near-term deployment to long-term expansion. A good site plan leaves room for growth without forcing emergency re-cabling or unplanned mechanical work later.

Define coolant quality and environmental controls

Coolant chemistry is an operational control, not a footnote. Your team should define acceptable water quality, corrosion inhibitors, filtration requirements, and maintenance cadence for sample testing. If the vendor provides a prescribed coolant formulation, document how it will be procured, stored, and replenished. The facilities team, the hardware team, and the ops team need a shared checklist because contamination or incompatible fluids can damage expensive equipment.

Temperature and humidity controls still matter. Even with liquid at the chips, ambient conditions affect connectors, cabling, and non-liquid components. If the room experiences wide swings, you increase the risk of condensation, service errors, and unstable peripheral behavior. The safest deployments keep ambient variation tight and monitored.

Prepare electrical and fire-safety procedures

Cooling changes do not eliminate electrical risk. They can actually raise the importance of safety planning because denser racks concentrate more value into smaller footprints. Review shutdown procedures, emergency power interaction, sensor alarms, and how the system behaves during utility transitions. Make sure your safety plans include coolant leak response, spill kits, and escalation paths that are drilled before rollout.

For teams already managing high-consequence systems, this is similar to building controls for regulated data or other sensitive environments. The principle is the same: make the system observable, make failure modes explicit, and ensure operator actions are rehearsed. Liquid cooling is safest when it is boringly predictable.

5) Leak Detection, Containment, and Failure Response

Design for detection before protection

Leak detection should be layered. Use onboard sensors, rack-level detectors, drip trays, and zone-level alerts so a single failure does not become a blind spot. A reliable deployment does not assume leaks never happen; it assumes the system will detect them early enough to isolate impact. That means your detection path must be validated during staging, not just during vendor acceptance.

Define thresholds carefully. A sensor that is too sensitive may cause nuisance alarms, while one that is too coarse may miss slow leaks. Tune alerts to the operating profile of the rack, including maintenance windows, coolant refill activity, and expected temperature variation. The objective is to distinguish routine service from true incidents.
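One common way to balance sensitivity against nuisance alarms is to require several consecutive readings above a threshold before alerting, and to route the alert differently during planned maintenance. The sketch below illustrates that idea only; the threshold, the sample count, and the "page"/"notify" labels are assumptions, and real sensor units are vendor-specific.

```python
def leak_alert(readings, threshold, min_consecutive=3, in_maintenance=False):
    """Classify a chronological series of leak-sensor readings.

    Returns "ok" unless the sensor stays at or above the threshold for
    min_consecutive samples in a row. During a maintenance window the
    alert is downgraded to "notify" rather than suppressed, so a real
    leak during refill work is still visible.
    """
    streak = 0
    for value in readings:
        streak = streak + 1 if value >= threshold else 0
        if streak >= min_consecutive:
            return "notify" if in_maintenance else "page"
    return "ok"

print(leak_alert([0.1, 0.9, 0.1, 0.9], threshold=0.5))  # isolated spikes
print(leak_alert([0.1, 0.6, 0.7, 0.8], threshold=0.5))  # sustained signal
```

Both parameters should be tuned against staging data rather than guessed, since the right values depend on the sensor type and the rack's service pattern.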

Establish isolation and shutdown playbooks

When a leak alarm triggers, operators need an exact response sequence. Which valves close first, which racks remain online, and which systems must be drained or powered down? Your playbook should identify the owner for each step and the maximum allowable response time. The point is not to make a perfect procedure on paper; it is to make a procedure that an on-call engineer can execute under pressure without improvising.
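A playbook with owners and response-time limits can be kept as data so that drill results are checked against it mechanically. The following is a sketch under stated assumptions: the step order, owner names, and time limits are placeholders, and the correct sequence for a given site must come from the vendor and facilities team.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    action: str
    owner: str
    max_seconds: int

# Hypothetical sequence; replace with the site-approved procedure.
LEAK_PLAYBOOK = [
    Step("Close rack isolation valves", "facilities on-call", 120),
    Step("Cordon affected nodes in the scheduler", "platform on-call", 300),
    Step("Power down affected nodes", "hardware on-call", 600),
]

def overdue_steps(playbook, elapsed_by_action):
    """Return actions whose drill time exceeded the limit.

    A step with no recorded time counts as overdue, because an
    unexecuted step is a gap in the rehearsal.
    """
    return [
        step.action for step in playbook
        if elapsed_by_action.get(step.action, float("inf")) > step.max_seconds
    ]

drill = {
    "Close rack isolation valves": 90,
    "Cordon affected nodes in the scheduler": 420,
}
print(overdue_steps(LEAK_PLAYBOOK, drill))
```

In this example the drill flags the scheduler step as too slow and the power-down step as never rehearsed.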

It is also wise to simulate false positives and partial failures during staging. Practicing shutdowns reveals how the system really behaves when valves close, pumps stop, or controls fail over. Those tests often uncover hidden assumptions, such as manual steps that take too long or sensor dashboards that do not clearly show which rack is affected. For a related lens on operational risk, see how teams manage operations planning under changing labor conditions: the system needs process clarity as much as technical redundancy.

Build incident response around blast-radius reduction

The best response architecture assumes the smallest possible impact zone. Rack-level isolation beats room-level shutdown, and quick detection beats post-incident cleanup. You should know whether a failure in one loop can propagate pressure or temperature issues into neighboring racks, and if so, what mechanical barriers exist to stop it. The ideal result is a contained maintenance event, not a cluster-wide outage.

Pro tip: Test leak alarms with the same seriousness as a fire drill. If no one knows who can authorize shutdown, your recovery time is already too long.

6) Firmware, BIOS, and Thermal Validation

Standardize hardware state before the migration window

Cooling projects often fail because teams focus on plumbing and ignore firmware drift. Before moving a node into the liquid-cooled environment, standardize BIOS, BMC, GPU driver versions, PSU firmware, and any board-level thermal policies. Small differences in fan curves, power caps, or telemetry behavior can make validation noisy and complicate root-cause analysis. A clean firmware baseline makes your thermal results trustworthy.
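A firmware baseline is easiest to enforce when drift is reported as a diff per node. The sketch below compares a node's reported versions against a baseline dictionary; the component keys and version strings are placeholders, and collecting the actual values (for example from the BMC) is assumed to happen elsewhere.

```python
# Placeholder baseline for one server class; versions are illustrative.
BASELINE = {
    "bios": "A.1",
    "bmc": "B.2",
    "gpu_driver": "C.3",
    "psu_fw": "D.4",
}

def firmware_drift(baseline, node_state):
    """Return {component: (expected, actual)} for every mismatch.

    A component missing from node_state is reported with actual=None.
    """
    return {
        component: (expected, node_state.get(component))
        for component, expected in baseline.items()
        if node_state.get(component) != expected
    }

node = {"bios": "A.1", "bmc": "B.1", "gpu_driver": "C.3"}
print(firmware_drift(BASELINE, node))
```

A node is ready for the migration window only when this check returns an empty dictionary.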

Document the exact build state for each server class and require a rollback plan. If a firmware update improves telemetry but destabilizes driver behavior, you need to be able to revert quickly. That discipline resembles patch-cycle management in software: controlled changes, clear gates, and rapid rollback. In infrastructure, the cost of skipping this step is much higher because the failures are physical as well as logical.

Run thermal validation under realistic workload shape

Thermal validation should include both synthetic and real workloads. Synthetic stress tests help establish maximum load response, but real ML jobs reveal how the cluster behaves over long-duration training, checkpointing, job preemption, and varying batch sizes. Measure temperature stability, fan behavior, clocks, package power, error rates, and coolant delta-T. You should also test warm-start conditions after downtime, because systems often behave differently when they begin from a higher ambient baseline.

When possible, validate one rack, then one row segment, then the full deployment. This staged approach helps isolate whether a problem comes from the cooling system, the server firmware, or the workload mix. Thermal validation is not just about passing a test; it is about building confidence that the hardware can operate at designed performance without hidden degradation. That confidence is what lets you scale safely.

Watch for throttle signatures, not just alarms

Thermal throttling is often detectable before it becomes an outage. Look for rising fan duty cycles, clock oscillation, increased inference latency, or training throughput flattening under unchanged load. Those are signs the cluster is losing thermal headroom even if no hard fault has occurred. The most valuable part of cooling monitoring is spotting this drift early enough to fix it before the production user notices.
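Two of these signatures, clock oscillation and rising fan duty, can be approximated with simple statistics over a sampling window. This is a rough sketch, not a detection standard: the coefficient-of-variation limit and the fan-rise limit are hypothetical and should be calibrated against a known-good baseline for each server class.

```python
from statistics import mean, pstdev

def throttle_signals(clocks_mhz, fan_duty_pct,
                     clock_cv_limit=0.03, fan_rise_limit=10.0):
    """Flag early signs of lost thermal headroom in one sampling window.

    clock_oscillation: clock spread relative to its mean exceeds the limit.
    fan_duty_rising:   average fan duty in the second half of the window
                       exceeds the first half by more than the limit.
    Both series need at least two samples.
    """
    clock_cv = pstdev(clocks_mhz) / mean(clocks_mhz)
    half = len(fan_duty_pct) // 2
    fan_rise = mean(fan_duty_pct[half:]) - mean(fan_duty_pct[:half])
    return {
        "clock_oscillation": clock_cv > clock_cv_limit,
        "fan_duty_rising": fan_rise > fan_rise_limit,
    }

print(throttle_signals([1980, 1980, 1975, 1980], [40, 41, 40, 41]))
print(throttle_signals([1980, 1700, 1980, 1650], [40, 45, 60, 70]))
```

The first window is steady and raises no flags; the second shows both signatures even though no hard fault has occurred.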

For teams running advanced model pipelines, performance volatility may come from subtle interactions between cooling, power limits, and application behavior. The lesson mirrors what engineers see in other constrained environments: design choices look different once resource ceilings become real. If you need a useful conceptual model, review simulation discipline and apply the same habit to thermal systems—test before you trust.

7) Monitoring Instrumentation: Build a Cooling Control Plane

Instrument the full heat path

Your monitoring stack should cover the entire thermal journey, from chip to coolant loop to room environment. At minimum, capture GPU temperature, CPU temperature, inlet and outlet coolant temperature, flow rate, pressure, leak state, pump status, rack exhaust conditions, and room humidity. Add power telemetry so you can correlate temperature with load, because isolated temperature graphs do not explain why cooling behavior changed. The goal is a control plane that shows cause and effect, not a pile of unrelated charts.

This is where many teams underinvest. They monitor compute because it is familiar and ignore mechanical telemetry because it feels foreign. Yet once the cluster moves to liquid cooling, the mechanical layer is as important as your scheduler. If you cannot see it, you cannot operate it.

Set alert thresholds around actionable states

A useful alert tells an operator what to do next. For example: warning when coolant flow drops below a minimum threshold, critical when rack inlet exceeds the validated range for more than a defined interval, and paging when leak detection triggers in an active rack. Avoid generic temperature alerts that simply scream “hot”; instead, tie alerts to operational outcomes like workload migration, rack isolation, or maintenance intervention. Precision reduces noise and improves trust.
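The three alert tiers described above can be expressed as a small classifier that returns both a severity and the next action. The limits shown are hypothetical placeholders; real values should come from the validated range established during thermal testing.

```python
def classify(flow_lpm, inlet_c, inlet_over_seconds, leak, *,
             min_flow_lpm, max_inlet_c, inlet_grace_s):
    """Map rack telemetry to (severity, next_action).

    Checks run from most to least severe so the operator always sees
    the highest-priority state.
    """
    if leak:
        return ("page", "isolate rack and start leak playbook")
    if inlet_c > max_inlet_c and inlet_over_seconds > inlet_grace_s:
        return ("critical", "migrate workloads off rack")
    if flow_lpm < min_flow_lpm:
        return ("warning", "schedule loop inspection")
    return ("ok", "no action")

# Hypothetical limits for one rack class.
LIMITS = {"min_flow_lpm": 50, "max_inlet_c": 32, "inlet_grace_s": 120}

print(classify(55, 30, 0, False, **LIMITS))
print(classify(45, 30, 0, False, **LIMITS))
print(classify(55, 34, 300, False, **LIMITS))
print(classify(55, 30, 0, True, **LIMITS))
```

Pairing every severity with an action keeps the alert from simply reporting "hot" without telling the operator what to do.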

Also build dashboards for postmortems. When something goes wrong, you need a timeline that includes workload spikes, coolant changes, firmware updates, and room conditions. The faster you can correlate those layers, the faster you can avoid repeated incidents. Strong observability turns the cooling system into a managed asset rather than a mystery box.

Use trend data for capacity planning

Once the system is live, telemetry becomes your best forecasting tool. Track average and peak thermal load by rack, by workload type, and by time of day. Use those trends to decide when to add capacity, when to rebalance workloads, and when to schedule maintenance. This is especially useful in on-prem ML environments where GPU demand can spike with research cycles, product launches, or model retraining windows.

Think of telemetry as an early warning system for both performance and cost. If coolant delta-T begins widening, or if a subset of racks is trending hotter than expected, that may indicate load imbalance, underperforming components, or maintenance drift. Cooling monitoring should drive operational decisions, not just post-incident analysis.
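Drift such as a widening delta-T is easier to act on when it is reduced to a trend number. The sketch below fits a least-squares slope to evenly spaced daily averages using only the standard library; the sample values and the idea of comparing the slope to a review threshold are assumptions for illustration.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples (units per sample).

    Requires at least two values.
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator

# Hypothetical daily average coolant delta-T (degrees C) for one rack.
daily_delta_t = [10.0, 10.5, 11.0, 11.5]
print(slope_per_sample(daily_delta_t))  # 0.5
```

A steadily positive slope at unchanged load is a prompt to check flow, filters, and load balance before it shows up as throttling.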

8) Rollout Staging: Migrate Without Triggering a Cluster-Wide Incident

Use a pilot rack and define exit criteria

Start with a pilot rack or a small cluster segment that can fail without taking the business down. The pilot should include the full stack: hardware, coolant, firmware, monitoring, and operator procedures. Define exit criteria before the migration begins, such as sustained operation under representative workloads, clean leak detection tests, stable temperatures across a defined time window, and no unplanned throttling. If the pilot fails, you should know exactly which layer to inspect before expanding.
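Exit criteria are easiest to enforce when they are written as checks before the pilot starts. The following sketch encodes the criteria listed above as named predicates; the 72-hour duration and the field names in the results dictionary are hypothetical.

```python
# Each criterion is a name plus a predicate over the pilot results.
EXIT_CRITERIA = {
    "ran 72h under representative load": lambda r: r["hours_under_load"] >= 72,
    "leak detection tests passed": lambda r: r["leak_tests_passed"],
    "inlet stayed in validated band": lambda r: r["max_inlet_c"] <= r["inlet_limit_c"],
    "no unplanned throttling": lambda r: r["throttle_events"] == 0,
}

def pilot_exit_report(results, criteria=EXIT_CRITERIA):
    """Return (passed, failed_criteria_names) for a pilot run."""
    failures = [name for name, check in criteria.items() if not check(results)]
    return (not failures, failures)

pilot = {
    "hours_under_load": 96,
    "leak_tests_passed": True,
    "max_inlet_c": 31.5,
    "inlet_limit_c": 32.0,
    "throttle_events": 2,
}
passed, failures = pilot_exit_report(pilot)
print(passed, failures)
```

The failed criterion names point directly at the layer to inspect before the rollout expands.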

The pilot phase should also test your operational muscle. Can the on-call engineer interpret the alarms? Can facilities respond inside the required window? Can your platform team shift workloads if one rack must be serviced? These are not academic questions; they are the real indicators of rollout readiness. You are not just testing the hardware; you are testing the team.

Ramp gradually and keep a rollback path

Do not convert the whole fleet at once. Expand from pilot to a handful of racks, then to a full row, and only then to larger operational zones. Between stages, review telemetry, maintenance logs, and operator feedback. If any sign of instability appears, pause and correct it before continuing. The ability to stop a rollout is not a failure; it is a sign of mature infrastructure ops.

A rollback path should exist at each stage. That may mean keeping a subset of air-cooled capacity available, maintaining spare parts, or reserving schedule flexibility so workloads can be shifted while a rack is investigated. This is similar to how teams think about workflow automation: automation helps, but only if you can recover cleanly when the system surprises you.

Coordinate workloads to protect user-facing SLAs

Training jobs are easier to pause or reschedule than production inference services, so the rollout sequence should reflect business risk. Begin with lower-priority compute, then move critical workloads after the system has proven stable under real traffic. If you support both internal researchers and external customers, create workload classes that can be shifted between racks based on thermal headroom. This gives you a practical lever for avoiding throttling during busy periods.

It also helps to define “cooling safe mode” behaviors for your scheduler. If a rack begins trending hot, can jobs be migrated automatically? Can capacity be reserved for emergency rerouting? These controls make the cooling system part of your overall reliability strategy rather than a separate mechanical project.
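One possible shape for a "cooling safe mode" policy is a function that picks the destination rack with the most thermal headroom and free capacity. This is a sketch only: the rack fields, the 5 degree C minimum headroom, and the function name are assumptions, and wiring it into a real scheduler is out of scope here.

```python
def pick_target_rack(racks, source, min_headroom_c=5.0):
    """Choose where to move jobs from a rack that is trending hot.

    racks: {name: {"inlet_c": float, "limit_c": float, "free_gpus": int}}
    Returns the rack with the largest headroom (limit minus inlet) that
    has free GPUs and at least min_headroom_c of margin, or None.
    """
    candidates = [
        (rack["limit_c"] - rack["inlet_c"], name)
        for name, rack in racks.items()
        if name != source and rack["free_gpus"] > 0
    ]
    candidates = [c for c in candidates if c[0] >= min_headroom_c]
    return max(candidates)[1] if candidates else None

racks = {
    "r1": {"inlet_c": 30.0, "limit_c": 32.0, "free_gpus": 4},
    "r2": {"inlet_c": 24.0, "limit_c": 32.0, "free_gpus": 2},
    "r3": {"inlet_c": 22.0, "limit_c": 32.0, "free_gpus": 0},
}
print(pick_target_rack(racks, source="r1"))  # r2
```

Returning None when no rack qualifies makes the lack of emergency capacity explicit, which is itself a signal for capacity planning.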

9) Operating the New Fleet: Daily, Weekly, and Quarterly Cadence

Daily checks should focus on exceptions, not everything

After rollout, operators need a short routine that focuses on deviations: temperature anomalies, flow changes, active leaks, alarms, and hardware not returning to baseline after maintenance. The daily check should be fast enough to run consistently, and it should surface only actionable signals. Too much noise causes alert fatigue, which is dangerous in a cooling environment because the cost of ignoring a real problem is high.

Make the dashboard readable for mixed audiences. Infra engineers need detailed telemetry; managers need risk summaries; technicians need maintenance context. The system should answer the same question at multiple depths: what is normal, what is changing, and what needs intervention now?

Weekly reviews should tie thermal behavior to workload patterns

Every week, compare thermal telemetry with scheduler data. Are certain jobs, frameworks, or node types consistently hotter? Are there windows where ambient conditions drift and cause reduced performance? This review helps you tune the cluster for both thermal stability and throughput. It also reveals whether some teams are overusing “safe” nodes while others remain underutilized.
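The weekly comparison can start as a simple aggregation of peak GPU temperature by job type. The sketch below assumes the scheduler and thermal telemetry have already been joined into (job_type, peak_gpu_c) records; the job type names and temperatures are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def hottest_job_types(records, top=3):
    """Rank job types by average peak GPU temperature.

    records: iterable of (job_type, peak_gpu_c) tuples.
    Returns up to `top` (job_type, average) pairs, hottest first.
    """
    by_type = defaultdict(list)
    for job_type, peak_gpu_c in records:
        by_type[job_type].append(peak_gpu_c)
    ranked = sorted(
        ((mean(temps), job_type) for job_type, temps in by_type.items()),
        reverse=True,
    )
    return [(job_type, round(avg, 1)) for avg, job_type in ranked[:top]]

week = [
    ("llm-train", 82), ("llm-train", 84),
    ("inference", 65), ("vision-train", 78),
]
print(hottest_job_types(week))
```

The same grouping works for node type or framework, which helps separate a hot workload from a hot rack.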

Use weekly reviews to refine placement policies and to catch maintenance drift early. A system that was stable during launch can degrade quietly as filters age, coolant properties change, or firmware updates alter fan behavior. Weekly analysis keeps the migration from becoming a one-time success followed by slow regression.

Quarterly reviews should revisit capacity and refresh planning

Cooling capacity is not static. As GPUs become denser and workloads become more ambitious, the thermal assumptions that were correct at launch may no longer hold. Quarterly reviews should ask whether the rack mix still fits the cooling envelope, whether you need more instrumentation, and whether the next hardware refresh should shift more aggressively toward liquid-ready platforms. This is the point where infrastructure ops and roadmap planning meet.

For some teams, the answer will be to expand direct-to-chip coverage. For others, RDHx remains the best bridge while they phase out older systems. Either way, the review should turn operational telemetry into strategic planning. That closes the loop between hardware migration and future capacity planning.

10) Common Failure Modes and How to Avoid Them

Overlooking firmware and telemetry differences

One of the most common failures is assuming all nodes will behave the same once liquid cooled. In reality, firmware versions, sensor calibration, and board-level power policies can produce very different results even within the same server family. Standardization is your best defense. So is requiring a clear inventory of every component that can influence thermal behavior.

Underestimating maintenance complexity

Liquid cooling adds new service steps, and those steps often take longer than expected the first few times. If your procedures are not rehearsed, a routine maintenance event can become a high-stress incident. Build runbooks, practice them, and time them. Then revise them until they are boring.

Skipping the operational handoff

A project is not complete when the rack powers on. It is complete when the on-call team can support it, the monitoring is trustworthy, and the workload owners know how to use the new capacity safely. That final handoff is where many programs fail because the build team and the run team are not equally prepared. The best migration programs treat training as part of the deliverable.

Pro tip: If your first production incident reveals an unknown valve, an unlabeled sensor, or a missing rollback step, the migration was not finished — it was only installed.

11) Practical Checklist for Your First 90 Days

Days 0-30: assess and design

Inventory hardware, document rack layouts, gather thermal baselines, and confirm vendor compatibility. Complete site readiness work, finalize coolant requirements, and define success metrics. Build the monitoring spec before buying equipment, because the telemetry design determines whether you can prove the migration worked.

Days 31-60: stage and validate

Install the pilot rack, run leak tests, validate shutdown behavior, and test firmware baselines. Execute thermal validation with real workloads and document every anomaly. Use this phase to train operators and capture corrective actions before rollout expands.

Days 61-90: expand and stabilize

Ramp to additional racks only after the pilot meets exit criteria. Keep rollback capacity available, review telemetry daily, and make small adjustments to thresholds and runbooks. By the end of the first 90 days, your team should be able to explain not just that liquid cooling works, but exactly why it is safe, measurable, and sustainable in your environment.

Conclusion: Treat Liquid Cooling as a Reliability Upgrade

For on-prem ML teams, liquid cooling is not simply a way to fit more GPUs into a rack. It is a shift in how infrastructure is designed, procured, validated, and operated. The teams that succeed will treat the migration as a multi-disciplinary program spanning hardware migration, thermal validation, monitoring instrumentation, and rollout staging. They will also recognize that the path to stable performance begins long before coolant flows: it starts with a precise baseline, strong vendor scrutiny, and a willingness to stage slowly rather than rush into a cluster-wide change.

If you are deciding between direct-to-chip and rear-door heat exchanger approaches, the right answer is the one that matches your density, your service model, and your operational maturity. Both can work, but neither will save you from poor planning. For broader context on how AI infrastructure is evolving, revisit the next wave of AI infrastructure and compare that outlook with your own capacity roadmap. In the end, the best liquid cooling migration is the one your team can support confidently on a bad day, not just celebrate on launch day.

Datacenter Capacity Forecasts and What They Mean for Your CDN and Page Speed Strategy - Useful for translating capacity signals into rollout timing.
Designing Agentic AI Under Accelerator Constraints: Tradeoffs for Architectures and Ops - A strong lens for operating under hard hardware limits.
Serverless Cost Modeling for Data Workloads: When to Use BigQuery vs Managed VMs - Helpful for structuring workload economics and tradeoffs.
Preparing for Rapid iOS Patch Cycles: CI/CD and Beta Strategies for 26.x Era - A useful model for staged validation and rollback discipline.
Build a Deal Scanner for Dev Tools: Ranking Integrations by GitHub Velocity - A practical framework for vendor comparison and scoring.

FAQ

What is the main difference between direct-to-chip and RDHx?

Direct-to-chip removes heat at the component level, using cold plates and a circulating coolant loop on the CPUs and GPUs, while RDHx removes heat from exhaust air at the back of the rack. Direct-to-chip generally supports higher densities, while RDHx is often easier to retrofit into existing environments.

How do I know if my ML cluster needs liquid cooling?

If your GPU servers are hitting thermal limits, fan noise is extreme, ambient room cooling is insufficient, or you plan to increase rack density significantly, liquid cooling is worth evaluating. Persistent throttling under sustained training load is a strong signal that air cooling has reached its practical limit.

What should be tested before putting liquid-cooled racks into production?

Validate firmware versions, leak detection, temperature stability, flow rates, shutdown behavior, and real workload performance. The most important test is whether the system can run your actual ML jobs for long periods without throttling or alarms.

Do I need new monitoring tools for liquid cooling?

Usually yes, or at least a new instrumentation layer in your existing observability stack. You need visibility into coolant flow, supply/return temperatures, pressure, leak status, and their relationship to GPU and CPU telemetry.

What is the safest rollout strategy?

Start with a pilot rack, define exit criteria, test rollback procedures, and expand in small stages. Keep extra capacity available so workloads can move if a rack needs maintenance or if thermal behavior is unstable.

How do I avoid downtime during migration?

Stage the rollout, validate each rack before expanding, and maintain a fallback path. Use workload scheduling to move lower-priority jobs first, and only transition critical services after the system has proven stable under real production load.