ilgyu-yi

HVAC System Optimization

2025-06-24

Optimizing Data Center HVAC Systems with Reinforcement Learning

cover-hvac

Background

electric-usage-hvac

As data centers continue to scale, their cooling systems consume a significant amount of energy. In many facilities, the HVAC (Heating, Ventilation, and Air Conditioning) system accounts for a large portion of total electricity usage. Operators must keep server temperatures within a narrow band around a setpoint (e.g., 22.5°C) while also minimizing Power Usage Effectiveness (PUE).
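For reference, PUE is the ratio of total facility energy to the energy consumed by IT equipment alone, so a value of 1.0 would mean no overhead for cooling or power distribution. A minimal sketch, using a hypothetical helper name and made-up numbers:

```python
def power_usage_effectiveness(total_facility_kwh, it_equipment_kwh):
    """Return PUE = total facility energy / IT equipment energy.

    The function name and the kWh inputs are illustrative assumptions,
    not part of the project described here.
    """
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Example: 1500 kWh drawn by the whole facility, 1000 kWh of it by servers.
example_pue = power_usage_effectiveness(1500.0, 1000.0)
```

Lowering cooling energy at a fixed IT load lowers the numerator, which is why HVAC control has a direct effect on PUE.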

This project explores the use of Deep Reinforcement Learning (DRL) to automate and optimize HVAC system control, targeting both energy efficiency and thermal stability.


Problem Setting

obs-act-hvac


Proof of Concept (PoC)

poc-hvac

To validate the feasibility of RL control in real-world HVAC systems, we used a simulator-based approach.


Core Techniques

1. Probabilistic Dynamics Modeling (MDN)

mdn-hvac

To better capture the stochastic nature of the environment, especially under partial observability, I trained a Mixture Density Network (Bishop, 1994) as the dynamics model. Because an MDN outputs the parameters of a mixture distribution rather than a single point estimate, the agent's model could represent multi-modal transition distributions that a unimodal predictor would average away.

This idea was inspired by the use of MDNs in World Models (Ha and Schmidhuber, 2018), where an MDN-RNN models the distribution over the next latent state in order to generate imagined trajectories.
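As a rough illustration of what an MDN output head computes, the sketch below evaluates the negative log-likelihood of a scalar target under a Gaussian mixture and draws a sample from it. This is a minimal NumPy sketch, not the project's actual model: the function names `mdn_nll` and `mdn_sample`, the one-dimensional target, and the (logits, means, log_stds) parameterization are all assumptions made for illustration.

```python
import numpy as np


def mdn_nll(logits, means, log_stds, target):
    """Negative log-likelihood of a scalar `target` under a 1-D Gaussian mixture.

    logits, means, log_stds: arrays of shape (K,), as a network head might emit.
    """
    # Log-softmax over the logits gives the log mixture weights.
    log_weights = logits - np.logaddexp.reduce(logits)
    stds = np.exp(log_stds)
    # Log density of each Gaussian component evaluated at the target.
    log_probs = (
        -0.5 * ((target - means) / stds) ** 2
        - log_stds
        - 0.5 * np.log(2.0 * np.pi)
    )
    # Log-sum-exp over components keeps the computation numerically stable.
    return -np.logaddexp.reduce(log_weights + log_probs)


def mdn_sample(logits, means, log_stds, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    weights = np.exp(logits - np.logaddexp.reduce(logits))
    k = rng.choice(len(weights), p=weights)
    return rng.normal(means[k], np.exp(log_stds[k]))


# Two equally weighted, well-separated modes at -1 and +1 (illustrative values).
logits = np.array([0.0, 0.0])
means = np.array([-1.0, 1.0])
log_stds = np.log(np.array([0.1, 0.1]))
nll_on_mode = mdn_nll(logits, means, log_stds, 1.0)
nll_between_modes = mdn_nll(logits, means, log_stds, 0.0)
```

With two well-separated components, a target near either mean gets a low loss while a target between the modes gets a high one, which a single Gaussian fitted to the same data could not express.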


2. Combining Model-Based and Model-Free RL with Dyna-Style Learning

dyna-hvac

actual-dyna-hvac

Because the HVAC system responds slowly to control actions, real interaction data is expensive to collect, so data efficiency was critical. We adopted a Dyna-like hybrid approach, combining:

This approach resembles the SimPLe (Simulated Policy Learning) algorithm in spirit, where rollouts generated by a learned model are used to improve sample efficiency.
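To make the Dyna idea concrete, here is a minimal sketch of tabular Dyna-Q on a toy chain environment. It is deliberately not the deep RL setup used in this project: the chain environment, the lookup-table model, and every name and hyperparameter below are assumptions chosen to show the three interleaved steps (learn from real data, update the model, plan with simulated data).

```python
import random


def dyna_q(n_states=6, episodes=30, planning_steps=10, alpha=0.5,
           gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a chain: start at state 0, reward 1 at the last state."""
    rng = random.Random(seed)
    actions = (0, 1)  # 0 = move left, 1 = move right
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}  # learned model: (state, action) -> (reward, next_state, done)

    def step(s, a):
        # Toy "real" environment with deterministic transitions.
        s2 = max(0, s - 1) if a == 0 else s + 1
        done = s2 == n_states - 1
        return (1.0 if done else 0.0), s2, done

    def update(s, a, r, s2, done):
        # One-step Q-learning update, shared by real and simulated samples.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if q[(s, b)] == best])
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)       # model-free update from real data
            model[(s, a)] = (r, s2, done)   # model learning
            for _ in range(planning_steps):  # planning with simulated data
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q


q_table = dyna_q()
```

Each real transition is reused many times through the planning loop, which is the source of the sample-efficiency gain; in a deep RL setting the lookup table would be replaced by a learned dynamics model.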


3. Safety-Aware Training

Because temperature violations can damage servers, we enforced hard constraints during training:


4. Overcoming the Cliff-Walking Problem

cliff-walking-hvac

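In on-policy, SARSA-style algorithms, value estimates reflect the actions the exploring policy actually takes, so penalties from constraint violations propagate backward to earlier states and destabilize learning. (Q-learning-style algorithms bootstrap from the greedy action instead and are less exposed to this effect.) To counter this: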

The Python code below implements a custom Generalized Advantage Estimation (GAE) mechanism that selectively blocks advantage propagation from failure penalties:

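import numpy as np

def selective_penalty_gae(rewards, values, dones, failed, gamma=0.99, lam=0.95):
  """
  Custom GAE with:
  - Normal reward propagation for success
  - No propagation of failure penalty (only applied at failure step)

  `values` may hold T entries, or T + 1 if the last entry is a bootstrap
  value for a rollout that was cut off before the episode ended.
  """
  T = len(rewards)
  advantages = np.zeros(T)
  lastgaelam = 0.0

  for t in reversed(range(T)):
    if dones[t]:
      # Terminal step: no bootstrap from the next state
      delta = rewards[t] - values[t]
      # Failure: zero out GAE propagation so earlier steps do not inherit the penalty
      lastgaelam = 0.0 if failed[t] else delta
      advantages[t] = delta
    else:
      # Normal step: full GAE propagation
      # Assumption: if no value exists past the end of the rollout, bootstrap with 0
      next_value = values[t + 1] if t + 1 < len(values) else 0.0
      delta = rewards[t] + gamma * next_value - values[t]
      lastgaelam = delta + gamma * lam * lastgaelam
      advantages[t] = lastgaelam

  return advantages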

This stabilized training considerably by confining each failure signal to the action that directly triggered it.
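Written as a recursion, the code above can be read as follows. The notation is my own shorthand rather than something taken from a reference: d_t and f_t stand for the `dones` and `failed` flags, and g_t is the value carried backward in `lastgaelam`.

```latex
\begin{aligned}
\delta_t &= r_t + \gamma\,(1 - d_t)\,V(s_{t+1}) - V(s_t), \\
\hat{A}_t &= \delta_t + \gamma\lambda\,(1 - d_t)\,g_{t+1}, \\
g_t &=
\begin{cases}
0 & \text{if } d_t = 1 \text{ and } f_t = 1, \\
\hat{A}_t & \text{otherwise,}
\end{cases}
\qquad g_T = 0 .
\end{aligned}
```

Standard GAE corresponds to g_t = \hat{A}_t at every step. The only change is that a failure step passes g_t = 0 backward, so the step just before a failure receives \hat{A}_{t-1} = \delta_{t-1} with no \gamma\lambda-weighted share of the failure's TD error.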


Results

result-hvac


Retrospective

While the agent showed promising results in simulation:

Although modifying the advantage estimate to suppress failure penalty propagation was effective in this project, more principled alternatives could also have been considered:

These approaches are often more complex to implement or tune, but they offer better theoretical guarantees, clearer credit assignment, and better generalization, especially in safety-critical domains like data center control.

In future iterations, replacing heuristic value cutoffs with explicit constraint-aware optimization is expected to yield more robust and interpretable outcomes.
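One common form of constraint-aware optimization is Lagrangian relaxation, in which a multiplier on the constraint cost is adapted during training instead of being hand-tuned. The sketch below shows only the multiplier update and the penalized reward. Whether this matches the alternatives meant above is an assumption, and the function names, learning rate, and cost definition are hypothetical.

```python
def update_lagrange_multiplier(multiplier, episode_cost, cost_limit, lr=0.05):
    """Dual ascent step: raise the multiplier when cost exceeds the limit.

    The multiplier is clipped at zero so a satisfied constraint never
    turns into a bonus for incurring cost.
    """
    return max(0.0, multiplier + lr * (episode_cost - cost_limit))


def penalized_reward(reward, cost, multiplier):
    """Reward seen by the policy: task reward minus the weighted constraint cost."""
    return reward - multiplier * cost


# Illustrative numbers: an episode with cost 2.0 against a limit of 1.0.
multiplier = update_lagrange_multiplier(0.0, episode_cost=2.0, cost_limit=1.0, lr=0.5)
shaped = penalized_reward(reward=1.0, cost=2.0, multiplier=multiplier)
```

Here the cost could be, for example, the number of steps in which a temperature limit was exceeded. The multiplier then grows only while the policy keeps violating the limit and decays back toward zero once it stops.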


Related Resources


Closing Thoughts

This project demonstrated that reinforcement learning, when combined with model-based simulation, probabilistic modeling, and failure-isolated training techniques, can offer a practical pathway toward energy-efficient HVAC operation in data centers.
However, safe deployment in real-world environments will require broader integration of constraint-aware optimization and robust policy evaluation frameworks.
