(Re)-Building A Better Metric – Part II

In Part I, we talked about the criteria we wanted to satisfy to ensure that a metric was good, and briefly assessed the results of our beta test of the new version of TMI. The conclusion I came to after that testing was that, in short, it needed more work.

I don’t know that it’s entirely true to say that I went “back to the drawing board,” so much as I went back to my slew of equations and mulled over what I could tweak in them to fix the problems. To recap, the formula I was using was:

$$\large {\rm Beta\_TMI} = c_1 \ln \left [ 1 + \frac{c_2}{N} \sum_{i=1}^N e^{F(MA_i-1)} \right ],$$

with $F=10$, $c_1=500$ and $c_2=e^{10}$.
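For concreteness, here's what that beta formula looks like in code. This is a minimal standalone Python sketch, not the actual SimC implementation; it assumes the moving-average array $MA$ has already been computed as damage-per-window divided by max health:

```python
import math

def beta_tmi(ma, F=10, c1=500):
    """Beta_TMI over a list of moving-average values MA_i, where
    MA_i is the damage taken in window i as a fraction of max health.
    Uses c2 = e^F, the constant chosen for the beta test."""
    c2 = math.exp(F)
    N = len(ma)
    total = sum(math.exp(F * (x - 1)) for x in ma)
    return c1 * math.log(1 + (c2 / N) * total)
```

A single window at 100% of health (`beta_tmi([1.0])`) scores just over 5,000, which illustrates the scale these constants produced.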

One of the problems I was running into was one of conflicting constraints. If you look back at the last blog post, you’ll see that constraint #6 was that the numbers had to stay reasonable. Mentally, I had converted this constraint to be “should have a fixed range of a few thousand,” possibly up to 10 or 20 thousand at a maximum. So I was rigidly trying to keep the score down around a few thousand.

But the obvious solution to the stat weight problem was to increase $c_1$, which increases the slope of the graph. That makes a small change in spike size a more significant change in TMI, and gives you larger stat weights. Multiply $c_1$ by ten, and your stat weights all get multiplied by 10. Seems simple enough.

Except that in the beta test, I got data with TMIs ranging from a few hundred to over 12 thousand. So if I multiply by ten, I’m looking at TMIs ranging from around a thousand to over 120 thousand, which is a much larger range. And a factor of ten still wouldn’t have fixed everything thanks to the “knee” in the graph, because if your TMI was on the really low end you could still get garbage stat weights.

It felt like the two constraints were at odds with one another, and both were at odds with a third, somewhat self-imposed constraint: I wanted to keep the zero-bounding effect that the “1+” in the brackets produced. Because without that, the score could go negative, which is odd. After all, what does it mean when your arbitrary FICO-like metric goes negative? Which just led back to more fussing over the fact that I was still pretty light on “meaning” in this metric to begin with.

It was a conversation with a colleague that led me to the solution. While discussing the stat weight issues, and how I could tweak the equation to fix them, he mentioned that he would rather have a metric with large numbers that had an obvious meaning than a nicely-constrained metric that didn’t. We were talking in terms of percentages of health, and it was only at that point that the answer hit me. Within a day of that conversation, I made all of the changes I needed to give TMI a meaning.

Asking The Right Question

As is often the case, the answer had been staring me in the face the entire time. I’ve been looking at this graph (in various different incarnations, with various different constants) for the last few months:

Simulated TMI data using the Beta_TMI formula. Red is the uniform damage case, blue is the single-spike case, and green is pseudo-random combat data.

What that conversation led me to realize was that I was asking the wrong question. I was trying to figure out what combination of constants I needed to keep the numbers “reasonable.” But my definition of “reasonable” was vague and arbitrary. So it’s no surprise that what I was getting out was also… vague and arbitrary.

What I should have been doing was trying to come up with a score that does a better job of communicating to the user how big those spikes were. Because that, by definition, would be “reasonable” no matter what size the numbers were.

In other words, the question I should have been asking was “how can I tweak this equation so that the number it spits out has a simple and intuitive relationship to the spike size, expressed in a scale that the user can not only easily understand, but easily remember?”

And the answer, which was clear after that conversation, was to use percent health.

To illustrate, let’s flip that graph around its diagonal, so that instead of plotting TMI vs. $MA_{\rm max}$, we’re plotting $MA_{\rm max}$ vs. TMI.

The same data, just plotted in reverse.

At a given TMI value, the $MA_{\rm max}$ values we get from the random combat simulation always fall below the blue single-spike line. In other words, at a TMI of X, you can confidently say that the maximum spike you will take is of size Y. It could be smaller, of course – you could take a few spikes that are a little smaller than Y and get the same score. But you can be absolutely sure it isn’t above Y.

So we just need to find a way to make the relationship between X and Y obvious, such that someone can look at a TMI of e.g. 20k and immediately know how large of a damage spike that is, as a percentage of their health.

We could use a one-to-one relationship, such that a TMI of 100 meant you were taking spikes that were 100% of your health. That would correspond to a slope of 100, or a $c_1$ of 10. But that would give us even smaller stat weights, which is a problem. We could literally end up with a plot in Simulationcraft where every single one of your stat weights was 0.00.

It would be nice to keep using factors of ten. Bumping it up to a slope of 1000 doesn’t work. That’s a $c_1$ of 100, which is still smaller than what we used in Beta_TMI. A slope of 10000, or a $c_1$ of 1000, is only a factor of two improvement over Beta_TMI, so our stat weights will still be sloppy.

But a slope of 100k… that might just work. A TMI of 100k would mean that your maximum spikes were around 100% of your health. If your TMI went up to 120k, you’d immediately know that the spikes are now about 120% of your health. Easy. Intuitive. Now we’re getting somewhere. The stat weights would also be 20x as large as they were for Beta_TMI, ensuring that we would get good unnormalized weights even with two decimal places of precision.

So, assuming we’re happy with that, it locks down our $c_1$ at $10^4$, so that every percentage of health corresponds to 1k TMI. Now we just have to look at the formula and figure out what else, if anything, needs to be changed.

Narrowing the Field

The very first thing I did after coming to this realization is toss out the “1+” in the formula. While I liked zero-bounding when we were treating this metric like a FICO score, it suddenly has no relevance if the metric has a distinct and clear meaning. Removing it allows for negative TMI values, but those negative values actually mean something now! If you end up with a TMI of -10k, it means that you were out-healing your damage intake by so much that the largest “spike” you ever took was smaller than your incoming healing in that time window. It also tells you exactly how much smaller: 10% of your health. While it’s not a situation we’ll run into that often, I suspect, it actually has meaning. There’s no sense obscuring that information with zero-bounding.

Which just leaves the question of what to do with $c_2$. Let’s look at the equation after removing the “1+”:

$$\large {\rm TMI} = c_1 \ln \left [ \frac{c_2}{N} \sum_{i=1}^N e^{F(MA_i-1)} \right ] $$

If we make the single-spike approximation, i.e. that we can replace the sum with a single $e^{F(MA_{\rm max}-1)}$, we get:

$$\large \begin{align} {\rm TMI_{SS}} &= c_1 (\ln c_2 - \ln N) + c_1 F (MA_{\rm max} - 1) \\ &= c_1 F MA_{\rm max} + c_1 ( \ln c_2 - \ln N - F ) \end{align}$$

just as before. Now that we’ve removed the “1+” from the formula, the single-spike approximation isn’t limited to large spikes anymore, so this is valid for any value of $\large MA_{\rm max}.$

Remember that in our single-spike approximation, $c_2$ controlled the y-intercept of the plot. And now that this y-intercept isn’t being artificially modified by zero-bounding, it actually has some meaning. It’s the value of $MA_{\rm max}$ at which our TMI is zero.

And given our convention that X*1000 TMI is a spike that’s X% of our health, a TMI of zero should mean that we take spikes that are 0% of our health. In other words, this should happen at $MA_{\rm max}=0$. So we want our y-intercept to be zero, or

$$\large c_1 ( \ln c_2 - \ln N - F ) = 0 .$$

Since $c_1$ can’t be zero, there’s only one way to accomplish this: $c_2 = N e^F.$ I was already using $e^F$ for $c_2$ in Beta_TMI, so this wasn’t totally unexpected. In fact, I figured out quite a while ago that the choice of $e^F$ for $c_2$ was equivalent to simplifying the term inside the sum:

$$\large \frac{e^F}{N}\sum_{i=1}^N e^{F(MA_i-1)} = \frac{1}{N}\sum_{i=1}^N e^{F\cdot MA_i}.$$
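That choice is easy to sanity-check numerically. With $c_2 = Ne^F$, the single-spike expression collapses to exactly $c_1 F \, MA_{\rm max}$ with zero intercept, for any fight length. A quick Python check (the function name is just for illustration):

```python
import math

def tmi_single_spike(ma_max, N, F=10, c1=10_000):
    """Single-spike TMI with c2 = N * e^F: the sum is replaced by
    its single dominant term, e^{F*(MA_max - 1)}."""
    c2 = N * math.exp(F)
    return c1 * math.log((c2 / N) * math.exp(F * (ma_max - 1)))

# The intercept is zero: a "spike" of 0% of health scores 0, and a
# spike of X% of health scores exactly c1 * F * (X/100).
```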

Defining $c_2=Ne^F$ would also eliminate the $1/N$ factor in front of the sum. However, there’s a problem here: I don’t want to eliminate it. That $1/N$ is serving an important purpose: normalizing the metric for fight length. For example, let’s consider two simulations, one being three minutes long and the other five minutes long. We’ll assume the boss is identical in both cases, so the magnitude and frequency of spikes are identical. In theory, the metric should give you nearly identical results for both, because the amount of danger is identical. A fight that’s twice as long should have roughly twice as many large spikes, but they’re spread over twice as much time.

But a longer fight will have more terms in the sum for a particular bin size, and a shorter fight will have fewer terms. So the sum will be approximately twice as large for the longer fight. The $1/N$ cancels that effect because $N$ would also be twice as large. If we get rid of that $1/N$, then the longer fight will seem significantly more dangerous than the shorter one. In other words, it would cause the metric to vary significantly with fight length, which isn’t good.

So I decided to define $c_2$ slightly differently. Rather than $Ne^F$, I chose to use $N_0e^F$, where $N_0$ is a default fight length. This means that we’re normalizing the fight length to $N_0$ rather than eliminating the dependence entirely, which should mean much smaller fluctuations in the metric across a large range of fight lengths. Since the default fight length in SimC is 450 seconds, that seemed like an obvious choice for $N_0$.
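The effect of the $N_0/N$ factor can be reproduced with a toy damage pattern before looking at real sim data. This Python sketch uses synthetic numbers and assumes 1-second bins:

```python
import math

def tmi_norm(ma, F=10, c1=10_000, N0=450):
    """TMI with the fight-length normalization c2 = N0 * e^F."""
    return c1 * math.log((N0 / len(ma)) * sum(math.exp(F * x) for x in ma))

def tmi_unnorm(ma, F=10, c1=10_000):
    """Same formula with the 1/N factor removed entirely (c2 = N * e^F)."""
    return c1 * math.log(sum(math.exp(F * x) for x in ma))

# Same boss, same spike pattern, different fight lengths:
# an 80%-health spike every 150 seconds amid steady 10%-health ticks.
pattern = [0.8 if i % 150 == 0 else 0.1 for i in range(150)]
short_fight = pattern        # 150-second fight
long_fight = pattern * 3     # 450-second fight, identical danger

# The normalized scores match; the unnormalized long fight scores
# exactly c1 * ln(3) (about 11k) higher just for lasting longer.
```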

To illustrate that graphically, I fired up Visual Studio and coded the new metric into Simulationcraft, with and without the normalization. I then ran a character through for fight lengths ranging from 100s to 600s. Here are the results:

Comparison of normalized ($N_0/N$) and unnormalized versions of the TMI metric. Vertical axis is in thousands.

The difference is pretty clear. The version where $c_2=Ne^F$ varies from a little under 65k TMI to around 86k TMI. The normalized version where $c_2 = N_0e^F=450e^F$ varies much less, from about 80k to a little over 83k, with most of that variation happening for fights that are shorter than four minutes long (i.e. not that common). This version is stable enough that it should work well for combat log analysis sites, where we’d expect a wide variety of encounter lengths.

There was one final change I felt I should make, and it’s not to the formula per se, it’s to the definition of $MA$. If you recall from the last post, we defined it as follows:

$$\large MA_i = \frac{T_0}{T}\sum_{j=1}^{T / dt} D_{i+j-1} / H.$$

This definition normalizes for two things: player health (by dividing by $H$), and window size (by multiplying by $T_0$). The latter is the part I wanted to change.

The reason we originally multiplied by $T_0/T$ was to allow the user to specify a shorter time window $T$ over which to calculate spikes, for example in cases where you were getting a large heal every 5 seconds, but were fighting a boss who could kill you in 3 or 4 seconds in-between those heals. This normalization meant that it calculated the moving average over $T$-second intervals, but always scaled the total damage up to what it would be if that damage intake rate were sustained for $T_0$ seconds. Doing this kept the metric from varying significantly with window size, as we discussed last year.

But that particular normalization doesn’t make sense anymore now that the metric is representing a real quantity. If my TMI is a direct reflection of spike size, then I’d expect it to go up or down fairly significantly as I change the window size. If I take X damage in a 6-second time window, but only X/2 damage in a 3-second time window, then I want my TMI to drop by a factor of 2 when I drop the window size from 6 seconds to 3 seconds as well.

In other words, I want TMI to accurately reflect what percentage of my health I lose in the window I’m considering. If I want to analyze a 3-second window, then I want to know what percentage of my health the boss can take off in that 3 seconds, not how much he would take off if he had 6 seconds.

So we’re entirely eliminating the time-window normalization in the definition of $MA_i$. That seems to match people’s intuition for how the time-window control should work anyway (this topic has come up before, including in the comments of the Crowdsourcing TMI post), so it’s a win on multiple fronts.
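A quick sketch of the intended behavior (Python, with uniform damage assumed for simplicity): with the $T_0/T$ factor gone, halving the window halves the measured spike for steady damage intake.

```python
def window_damage_fraction(D, H, T, dt=1.0):
    """Largest T-second moving sum of damage, as a fraction of
    health H, with no rescaling to a reference window length."""
    w = int(round(T / dt))
    return max(sum(D[i:i + w]) for i in range(len(D) - w + 1)) / H

uniform = [100] * 60   # steady 100 damage per 1-second bin
# With 1000 health: a 6s window sees 60% of health, a 3s window 30%.
```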

Bringing it all Together

Now, we have all the pieces we need to construct a formal definition for TMI v2.0. I’ll update the TMI Standard Reference Document with the rigorous details, but since we’ve already discussed many of them, I’m only going to summarize it here. Assume we start with an array $D$ containing the damage we take in every time bin of size $dt$, and the player has health $H$.

The moving average array is now defined as

$$\large MA_i = \frac{1}{H}\sum_{j=1}^{T / dt} D_{i+j-1}.$$

In other words, it’s the array in which each element is the $T$-second moving sum of damage taken, normalized to player health $H$.

We then take this array and use it to calculate TMI as follows:

$$\large {\rm TMI} = 10^4 \ln \left [ \frac{N_0}{N}\sum_{i=1}^N e^{10 MA_i} \right ] ,$$

where $N$ is the length of the $MA$ array, or equivalently the fight length divided by $dt$, and $N_0=450/dt$ is the “default” array size corresponding to a fight length of 450 seconds.
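Putting the two pieces together, the whole definition fits in a few lines of Python. This is a reference sketch of the formulas above, not the actual SimC code; `D` is the per-bin damage array and all names are illustrative:

```python
import math

def moving_average(D, H, T=6.0, dt=1.0):
    """MA_i: the T-second moving sum of damage taken, as a
    fraction of player health H."""
    w = int(round(T / dt))   # number of bins per window
    return [sum(D[i:i + w]) / H for i in range(len(D) - w + 1)]

def tmi_v2(D, H, T=6.0, dt=1.0, default_length=450.0):
    """TMI 2.0 = 10^4 * ln[ (N0/N) * sum_i e^(10 * MA_i) ]."""
    ma = moving_average(D, H, T, dt)
    N = len(ma)
    N0 = default_length / dt
    return 1e4 * math.log((N0 / N) * sum(math.exp(10 * x) for x in ma))
```

By construction, a lone spike equal to X% of health pushes the score into the neighborhood of X·1000 (slightly above it, since the same spike appears in several overlapping windows).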

But Does It Work?

To illustrate how this works, let’s look at some examples using Simulationcraft. I coded the new formula into my local copy and ran some tests. Here are two reports, both against the T16H25 boss, using my own character and the T16H Protection Warrior profile:

T16H Protection Warrior

The very first thing I looked at was the stat weights:

Stat weights generated with Theck using TMI 2.0

Much, much better. This was with 25k iterations, but even 10k iterations gave us reasonable (if noisy) stat weights. The error bars here are all pretty reasonable, and it wouldn’t be hard to increase the precision by bumping it up to 50k iterations if we wanted to. The warrior profile’s stat weights are similarly high-precision.

We could also look at the TMI distribution:

TMI distribution for Theck using TMI 2.0

Again, much nicer looking than before. We’re still getting a bit of skew here, but that mostly has to do with being slightly overgeared for the boss definition. The warrior profile exhibits even stronger skew, but tests run with characters of lower gear levels (and thus higher average TMI values) show very little skew.

I also wanted to see exactly how well the TMI value reflected maximum spike size, and what (if any) difference there was. So you may have noticed that I’ve enhanced the tanking section of the SimC report a little bit by adding some new columns:

Updated tanking section of the SimC report, including information about spike size.

In short, SimC now also records the “Maximum Spike Damage,” or MSD, for each iteration and calculates the maximum, minimum, and mean MSD value. It reports this information in units of “percentage of player health” right alongside the DTPS and TMI information that you’re used to getting. Lest the multiple “max” modifiers be confusing: the MSD for one iteration is the biggest spike you take that iteration, and the “MSD Max” is the largest spike you take out of all iterations.

You may be wondering, at this point, if this isn’t all superfluous. If I can code SimC to report the biggest spike, why wouldn’t we want to use that directly? What does TMI add that we can’t get from MSD?

The answer is continuity. MSD uses a max() function to isolate the absolute biggest spike in each iteration. Which is fine, but often misleading. For example, let’s consider two different tanks, one of which takes a single spike that’s 90% of their health, and another that takes one 90% spike and three or four 89% spikes. Assume nothing else in the encounter is remotely threatening them. Their MSD values will be identical, because it ignores all but the largest spike. But it’s clear that the second tank is in more danger, because he’s taking a large spike more frequently, and the TMI value will accurately reflect that.
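The two-tank example is easy to verify numerically. In this Python sketch, spike sizes are fractions of health and the 20% background damage is an assumption:

```python
import math

def tmi_from_ma(ma, F=10, c1=10_000, N0=450):
    """TMI computed directly from a moving-average array."""
    return c1 * math.log((N0 / len(ma)) * sum(math.exp(F * x) for x in ma))

def msd(ma):
    """Maximum Spike Damage: just the biggest window, via max()."""
    return max(ma)

# 450 one-second windows of mild 20%-health background damage.
tank_a = [0.2] * 450
tank_a[0] = 0.90                      # one 90% spike
tank_b = [0.2] * 450
for i, size in zip((0, 100, 200, 300), (0.90, 0.89, 0.89, 0.89)):
    tank_b[i] = size                  # one 90% spike plus three at 89%

# msd() can't tell the two apart, but TMI ranks tank_b as more
# dangerous because it retains the frequency information.
```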

That continuity also translates into generating better and more reliable stat weights. A stat that reduces the frequency of 90% spikes without eliminating them would be given a garbage stat weight if we tried to scale over MSD, because MSD doesn’t retain any information about frequency. However, we know that stats like hit and expertise are strong partly because they reduce spike frequency. TMI reflects that accurately while MSD simply can’t.

MSD is still useful though, in that having both TMI and MSD gives us additional information about our spike patterns. It also gives us a convenient way to compare the two to see how TMI works.

First, take a look at the TMI Max and MSD Max values. You’ll notice they mimic each other pretty well: MSD Max is 150.3%, TMI Max is 151.7k. This makes sense for the extreme case because that’s when all the planets align to create your worst-case scenario, which is rare. It won’t happen multiple times per fight, so it’s a situation where you have one giant spike that dominates the score, much like our single-spike approximation. And in that approximation, TMI is roughly equal to the largest spike size, just like it should be.

Comparing the mean TMI value (just “TMI” on the table) to the MSD mean shows a little bit of a gap: MSD Mean is 69.5%, TMI mean is 82.8k. The TMI is about 13k above where you’d expect it to be based on the single-spike model. That’s because of spike frequency. You wouldn’t normally expect to take one giant spike in an encounter and nothing else; the more common case is to take several spikes of similar magnitude over that 450 seconds. If we’re taking 3-4 of those spikes, then that’s going to raise the TMI value a little bit compared to the situation where we only take one. That’s exactly what’s happening here.

Mathematically, if we take $n$ spikes of similar size, we expect the TMI to exceed the single-spike value by $c_1\ln(n) = 10^4\ln(n)$. In this simulation the gap is about 13.3k, giving $\ln(n)\approx 1.33$, or $n\approx 3.8$. In other words, on average we’re taking nearly four spikes every 450 seconds, each of which is about 69.5% of our health. That’s pretty useful information – in fact, I may add it to the table in the future if people would like SimC to calculate it for them.
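Under that additive model, where $n$ comparable spikes raise the TMI by $c_1\ln(n)$ over the single-spike value, we can back the implied spike count out of the report's mean values. This is an illustrative estimate with a hypothetical helper, not something SimC reports:

```python
import math

def implied_spike_count(tmi, msd_frac, c1=10_000, F=10):
    """Invert TMI ~= c1*F*msd + c1*ln(n) for n, the number of spikes
    of size msd_frac (fraction of health) per fight."""
    return math.exp((tmi - c1 * F * msd_frac) / c1)

# Mean values from the report above: TMI of 82.8k, MSD of 69.5%.
n_mean = implied_spike_count(82_800, 0.695)   # about 3.8 spikes per fight
```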

You can see that the gap grows considerably for the minimum TMI and MSD values. The MSD Min is only about 31% while the minimum TMI is ~66k. Again, this comes down to frequency. Large spikes tend to be infrequent due to statistics, as they require a failure to avoid any one of multiple attacks. But as we eliminate those (either by gearing, or in this case, by lucky RNG on one iteration) we’re left with smaller, more frequent spikes. In the extreme limit, you could imagine a scenario where you alternated between taking a full hit and avoiding every second attack, in which case you’d have loads of really tiny spikes. So what we’re seeing at this end of the distribution is a large number of small spikes: the ~35k gap corresponds to $\ln(n)\approx 3.5$, or very roughly $n\approx 33$ small spikes in the low-TMI iterations.

This behavior also has a more subtle, but rather important meaning. TMI is really good at prioritizing large spikes and giving you stat weights that preferentially eliminate them. Once you eliminate those spikes, it automatically shifts to prioritizing the next-biggest spikes, and so on. If you smooth your damage intake sufficiently that you’re taking a lot of moderately-sized spikes, it naturally tries to reduce the frequency of those spikes. In other words, if you’ve successfully eliminated the danger of isolated spikes, it automatically starts optimizing you for DTPS. So it seamlessly fuses spike mitigation and DTPS into a metric that shifts the goalposts based on your biggest concern, as determined by the combat data.

A lot of those ideas can be seen graphically, as well. Here’s a plot showing data generated with my own character pitted against the T16H25 boss. We’re plotting MSD (which I was originally calling “Max Moving Average”) against the reported TMI score. To generate this plot, I used a variety of window sizes. At each window size, I recorded the minimum, mean, and maximum TMI and MSD values. The dotted line is the expected relationship, i.e. 100k TMI = 100% max health.

MSD vs. TMI for Theck against the T16H25 boss.

Generally speaking, as we increase or decrease the window size, the MSD and TMI should similarly increase or decrease. That’s certainly happening for the maximum MSD and TMI values, which should be expected. And in that limit, we see that TMI and MSD mostly agree and lie close to the dotted line.

However, the mean values show a much smaller spread, and the minimum values show almost no spread. It turns out that this is the fault of EF’s crazy scaling. A paladin in this level of gear is basically self-sufficient against the T16H25 boss, so changing the window size doesn’t have a large effect unless we consider the most extreme cases. If we’re out-healing the boss, then a longer window won’t cause a noticeable increase in damage intake or spike size. At the very low end, where the minimum TMI & MSD values show up, we’re basically plotting window-edge effects.

The results look a lot cleaner if we consider a player that’s undergeared for the boss (and of a class that doesn’t have a strong self-healing mechanic, like a warrior):

MSD vs. TMI for a sample warrior against the T16H25 boss.

This is one of the warriors who submitted multiple data sets for the beta test. He’s got an average ilvl of 517, which is well below what would be needed to comfortably survive the 25H boss. As a result, his TMI values are fairly high, with even the smallest values being over 200k. As you can see, though, all of the values cluster nicely around the equivalence line, meaning that the TMI value is a very good representation of his expected spike size. Also note that the colors are more evenly distributed on this plot. That’s because the window size adjustment is working properly here. The lowest values are from simulations with a window size of 2 seconds, while the largest ones are using a window size of 10 seconds. And the data is pretty linear: double the window size, and you double the MSD and TMI.

Report Card

So this final version of the metric seems to be hitting all the right notes. Let’s get our checklist out and grade it on each of the criteria we set out to satisfy.

  1. Accurately representing danger: Pass. There’s really no difference between this version and the beta version in this category. If anything, this may be a bit better since it no longer has the “knee” obfuscating danger for smaller spikes.

  2. Work seamlessly: Pass. Apart from coding the metric into SimC, it took no additional tweaks to get it to work properly with the default plotting and analysis tools.

  3. Generate useful stat weights: Pass. The stat weights are being generated properly and to sufficient precision to identify differences between the stats, without having to normalize. It will generate useful stat weights even in low-damage regimes thanks to the removal of the “knee,” and it automatically adapts to generate DTPS-like results when you’ve done all you can for smoothing. Massive improvement in this category.

  4. Useful statistics: Pass. Again, not much difference between this version and Beta_TMI, at least in this category.

  5. Easily interpreted: Pass. This is the most important improvement. If I get a TMI score of 80k, I immediately know that I’m in danger of taking spikes that are up to 80% of my health. I don’t need to do any mental math to figure it out, just replace a “k” with a “%” and I’m there. No need to look back to a blog post or remember a funny conversion factor. As long as I know what TMI is, I know what it means.

  6. Numbers should be reasonable: Pass. While the numbers aren’t technically small, I think it’s fair to say that they’re reasonable. After Mists, everyone is comfortable working in thousands (“I do 400k DPS and have 500k health”), so I don’t think the nomenclature will be confusing. The biggest issue with the original TMI was that it varied wildly by orders of magnitude due to small changes, which can’t happen in this new form. Going from 75k to 125k has a clear and obvious meaning, and won’t throw anyone for a loop, unlike going from 75k to 18.3M (an equivalent change in Old_TMI).

I’ll admit that I may be a little biased when it comes to grading my own metric, but I don’t think you can argue that I’m being unfairly kind in any of these categories. I set up clear expectations for what I wanted in each category, and made sure the metric met them. If it hadn’t, you probably wouldn’t be reading about it, because I’d have tossed it like Beta_TMI and continued working on it until I found a version that did.

But keep in mind that this doesn’t mean the metric is flawless. It just means that we haven’t discovered what (if any) its flaws are yet. As the logging sites get on-board with the new metric and implement it, we’ll be able to look for differences between real-world performance and Simulationcraft results and identify the causes. And if we do find problems, we’ll adjust it as necessary to fix them.

Looking Forward

It shouldn’t be much of a surprise that I’m very happy with TMI 2.0. It finally has a solid meaning, and will be far simpler to explain to players discovering it for the first time. It’s a vast improvement over the original version of the metric in so many ways that it’s hard to even compare the two.

And by giving the metric a clear meaning, we’ve opened up a number of new possible applications. For example, let’s say you sim your character and get a TMI of 85k. You and your healers now know they need to be prepared for you to take a spike that’s around 85% of your health at any given moment. Which leads directly into the question, “how much healing do I need to ensure survival?”

If your healer is a druid, you might consider how many Rejuvenation ticks you can rely on in a 6-second window and how much healing that will be. If it’s 20% of your health, then you (and your healer!) immediately have an estimate of how much on-demand healer throughput you’ll need to keep you safe. Or if you have multiple HoTs, and they sum up to about 50% of your health in that time window, your healers know that as long as they keep you HoT-ted up, they can spend their GCDs elsewhere and just spot-heal you when you hit 50% health.

In other words, TMI may be a tanking metric, but it’s got the potential to have a meaning for (and be useful to) your healers as well.

Extend this idea even further: TMI was originally defined as only including self-healing effects, not external heals. The new definition can be much looser, because it still has a meaning if you include external heals. Adding a healer to your simulation may reduce your TMI, but the end result is still meaningful because it tells you how large a spike you took with a healer focusing on you.

Likewise, a combat logging site might report your regular TMI and an “ETMI” or Effective TMI, which includes outside healing. And that ETMI would tell you something slightly different – what was the biggest spike you took and survived (or not!) on that pull. If your ETMI is less than 50k you’re never really in much danger. If your ETMI is pushing 90k or 100k (and you didn’t die), it means you’re getting awfully close to dying at least a few times in that encounter, which may warrant some investigation. You could then analyze your own logs and your healers’ logs to figure out why that’s happening and determine ways to improve it.

I’m really excited to see where this goes over the next few months. For now, though, I’m going to focus on getting the foundations in place. I’ve already coded the new metric into Simulationcraft, so as of the next release (547-3) all TMI calculations will use the new formula.

I also plan on working with both WarcraftLogs and AskMrRobot, both of whom have expressed an interest in implementing TMI, to get it up and running on their logging sites. And I’ll be updating the standard reference document shortly with a rigorous definition of the standard to facilitate that.


(Re)-Building A Better Metric – Part I

A few weeks ago, I posted a request for data to test out a new implementation of TMI. This follow-up post took longer than expected, for a number of reasons. A busy semester, wedding planning, and the Diablo 3 expansion were all contributing factors.

However, the most important factor is that the testing uncovered a few weaknesses that I felt were significant enough to warrant fixing. So I went back to the math and worked on revising it, in the hopes of hitting on something that was better. And I’m happy to say that I think that I’ve succeeded in that endeavor, to the point that I feel TMI 2.0 will become an incredibly useful tool for tanks to evaluate their performance.

But before I get to the new (and likely final) implementation, I think it’s worth talking about the data. After all, many of you were generous enough to take the time to run simulations for me and submit the results, so I think I owe you a better explanation of what that data accomplished than “Theck changed stuff.”

To do that with sufficient rigor, though, I need to start from the beginning. If you recall, about nine months ago I laid out a series of posts entitled “The Making of a Metric,” which explained the thought process involved in designing TMI. Without re-hashing all of those posts, we were trying to quantify our qualitative analysis of damage histograms in table form. And most of the analysis and discussion in those posts centered around the numerical aspects of the metric. For a few examples, we discussed:

  • How we thought that a spike that’s 10% larger should be worth $e$ times as much in the metric (the “cost function” for those that are familiar with control theory or similar fields)
  • The problem of edge effects that were caused by attempting to apply a finite cut-off or minimum spike size
  • What normalization conditions should be applied to keep the metric stable across a variety of situations

and so on.

However, none of that discussion addressed what would eventually be a crucial (and in the case of our beta test results, deciding) factor: what makes a good metric? I was intently focused on the mathematics of the problem at the time, and more or less assumed that if the math worked well then the metric would be a good one.

Suffice it to say, this assumption was pretty wrong.

What Does Make a Good Metric?

So when I sat down late last year to start thinking about how I would revise the metric, I approached from a very different direction. I made a list of constraints that I felt a good metric would satisfy, which I could then apply to anything I came up with to see if it “passed.” This is that list:

  1. First and foremost, the metric should accurately represent the threat of damage spikes. That actually encompasses several mini-constraints, most of which are numerical ones.
    • For example, it should take into account spike magnitude and spike frequency, because it’s more dangerous to take three or four spikes of size X than it is to take one spike of size X.
    • It should filter the data somehow, such that the biggest spikes are worth considerably more than smaller ones are.
    • However, it also can’t filter so strongly that it ignores ten spikes that were 120% of your health just because you took one spike of 121%.
    • The combination of those three points means that it has to filter continuously (i.e. smoothly), so we can’t use max() or min() functions.

    In short, these are basically the numerical constraints that I applied to build the original version of TMI. Ideally, I would like it to continue generating the same quality of results, but tweak the numbers to change the presentation.

  2. It should work seamlessly in programs like Simulationcraft and on sites like World of Logs, Warcraft Logs, and AMR’s new combat log analysis tool. Working in Simcraft is obvious. That was one major reason I joined the SimC dev team. But wanting it to be useful on logging sites is a broader constraint – it means that it needs to work in a very wide range of situations, including every boss fight that Blizzard throws at us. If it’s only useful on Patchwerk under simulated conditions, it’s probably not general enough to mean anything.

    This also means that it should work with SimC’s default settings. I want to have to do as little messing around with SimC’s internals as possible.  This will come up again, so I want to mention it explicitly here.

  3. It should generate useful stat weights when used in Simulationcraft. One of the primary goals of the original metric was to be able to quantify how useful different stats were. If the metric produces garbage stat weights, it’s a garbage metric.

  4. Similarly, it should produce useful statistics. Another major drawback of the old version was that the TMI distributions were highly skewed thanks to the exponential nature of the metric. That meant that the distribution in no way represented a normal distribution, which made certain statistical measures mostly useless. A new version should (hopefully) fix that.

  5. It should be easily interpreted. Ideally, someone should be able to look at the number it produces and immediately be able to infer a meaning. Good, bad, or otherwise, you shouldn’t need to go to a blog post to look up what it means to have a TMI of 50k.

    I was never very happy with this part of the original metric. The meaning wasn’t entirely clear, because it was an arbitrary number. You’d have to read (and remember) the blog post to know that a factor of 3 corresponded to taking spikes that were 10% of your health larger (i.e. 80% of your health to 90% of your health should triple your TMI).

  6. Ideally, the numbers should be reasonable. This was arguably the biggest failing of the original version of TMI, and something that Wrathblood and I have argued about a lot. While it’s nice mathematically that a bigger spike creates an exponentially worse value, the majority of players do not think in orders of magnitude.

    I may have no problem understanding a TMI going up from 50 thousand to over 1 million as a moderate change, because I’ve been trained to work with quantities that vary like that in a scientific context. But the average user hasn’t been trained that way, and thus saw that as an enormous difference, much larger than going from 2.5k to 50k, even though both represent an equivalent change in spike size.

    The size of the change was part of the original goal, of course – to emphasize the fact that it was significantly worse to take a larger spike. But that’s not how the average user interpreted it. Instead, their initial reaction was to assume that the metric was broken. Because surely they hadn’t suddenly gotten 20 times worse just by bumping the boss up from normal to heroic. Right? Well, that’s exactly what the metric was saying, and should have been saying, when their spike size went up by ~28% of their health. But the message wasn’t getting across.

    In retrospect, I think I know why, and it was tied to item #5. The meaning of the metric wasn’t entirely clear. At least to someone who hadn’t gotten down and dirty with the math behind the metric. So instead, they assumed the metric was in error, or faulty, or something else.

Those were the six major constraints I set out to abide by in my revisions. Pretty much anything else I could come up with was covered by one or more of those, either explicitly or implicitly.

Now, with this rubric, we can take a look at the results of the beta test and see how the beta revision of the metric performed. But first, I want to talk briefly about the formula I chose to use, for those that are interested. Fair warning: the next section is fairly mathy. If you don’t care about those details, you may want to skip to the “Beta Test Results” section.

Beta Test Formula

Let’s first assume we have an array $D$ containing the damage taken in each time bin of width $dt$. I’m going to leave $dt$ general, but if it helps you visualize it just pretend that $dt=1$, so this array is just the damage you take in every one-second period of an encounter. We construct a $T$-second moving average array of that data just as we did in the original definition of the metric:

$$\large MA_i = \frac{T_0}{T}\sum_{j=1}^{T / dt} D_{i+j-1} / H$$

The new array $MA$ created by that definition is essentially just the moving average of the total damage you take in each $T$-second window, normalized to your current health $H$. The $T_0/T$ prefactor rescales everything to the standard window size $T_0=6$, so with the default $T=T_0=6$ it has no effect. Again, nothing about this part changed, it’s still the same array of damage taken in each six-second period for the entire encounter.
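As a concrete sketch of that construction (Python, with a hypothetical damage series for illustration; this is not SimC's actual code):

```python
import numpy as np

def moving_average_array(D, H, T=6.0, T0=6.0, dt=1.0):
    """Build the MA array from the definition in the text.

    D  -- damage taken in each time bin of width dt seconds
    H  -- player health
    T  -- moving-average window length; T0 = 6 is the standard window
    Each element is the total damage in one T-second window, normalized
    to health and rescaled by T0/T so different window sizes compare.
    """
    w = int(round(T / dt))                              # bins per window
    window_sums = np.convolve(D, np.ones(w), mode="valid")
    return (T0 / T) * window_sums / H

# Hypothetical example: 100k damage per 1-second bin against 500k health.
D = np.full(20, 100_000.0)
MA = moving_average_array(D, H=500_000.0)
# Each 6-second window contains 600k damage, i.e. 1.2 (120% of health).
```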

If you recall, the old formula took this array and performed the following operation:

$$\large {\rm Old\_TMI} = \frac{C}{N} \sum_{i=1}^N e^{10\ln 3 \, ( MA_i - 1 ) } = \frac{C}{N}\sum_{i=1}^N 3^{10(MA_i-1)}$$

where $C$ was some mess of normalization and scaling constants, and $N$ was the length of the $MA$ array.

This formed the basis of the metric – the bigger the spike was, the larger $MA$ would be, and the larger $3^{10(MA_i-1)}$ would be. Due to the exponential nature of this function, large spikes would be worth a lot more than small ones, and one really large spike would be worth considerably more than lots of very little ones.
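To make that exponential filtering concrete, here is a minimal sketch of the old sum (Python; the normalization constant $C$ is omitted since it only rescales the result):

```python
import numpy as np

def old_tmi_core(MA):
    """Unnormalized core of Old_TMI: the mean of 3**(10*(MA_i - 1))."""
    MA = np.asarray(MA, dtype=float)
    return np.mean(3.0 ** (10.0 * (MA - 1.0)))

# A spike that's 10% of health larger is worth exactly 3x as much:
ratio = old_tmi_core([1.1]) / old_tmi_core([1.0])
```

So going from a 100%-of-health spike to a 110% spike triples the contribution, which is exactly the factor-of-3 filtering described above.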

The formula that I programmed into Simulationcraft for the beta test was this:

$$ \large {\rm Beta\_TMI} = c_1 \ln \left [ 1 + \frac{c_2}{N} \sum_{i=1}^N e^{F(MA_i-1)} \right ] $$

where the constants ended up being $F=10$, $c_1=500$ and $c_2=e^{10}$. Let’s discuss exactly how this differs from ${\rm Old\_TMI}$.

It should be clear that what we have is roughly

$$ \large {\rm Beta\_TMI} \approx c_1 \ln \left [ 1 + \chi {\rm Old\_TMI} \right ]$$

where $\chi$ is some scaling constant. That statement is only approximate, however, because ${\rm Old\_TMI}$ used a slightly different exponential factor in the sum. In the old version, we summed a bunch of terms that looked like this:

$$\large e^{10\ln 3 (MA_i-1)} = 3^{10(MA_i-1)},$$

while in the new one we’re raising $e$ to the $F(MA_i - 1)$ power:

$$\large e^{F(MA_i-1)}.$$

In other words, the constant $F$ is our “filtering power,” just as $10\ln 3$ was our filtering power in ${\rm Old\_TMI}$. The filtering power is a little bit arbitrary, and after playing with the numbers I felt that there wasn’t enough of a difference to warrant complicating the formula. By choosing $F=10$, a change of 0.1 (10% of your health) in $MA_i$ increases the value of the exponential by a factor of $e\approx 2.718.$ For comparison, in ${\rm Old\_TMI}$ increasing the spike size by 10% increased the value of the exponential by a factor of 3. So we’re not filtering out weaker attacks quite as strongly as before, but again, the difference isn’t that significant. The main advantage to doing this is simplifying the formula; that’s about it.
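A direct sketch of the beta formula itself (Python, using the constants stated above; the all-overhealing and single-spike inputs are hypothetical examples):

```python
import numpy as np

F, C1, C2 = 10.0, 500.0, np.exp(10.0)     # constants from the text

def beta_tmi(MA, F=F, c1=C1, c2=C2):
    """Beta_TMI = c1 * ln[ 1 + (c2/N) * sum_i exp(F*(MA_i - 1)) ]."""
    MA = np.asarray(MA, dtype=float)
    N = len(MA)
    return c1 * np.log(1.0 + (c2 / N) * np.sum(np.exp(F * (MA - 1.0))))

# With heavy overhealing (every window far below zero damage) the sum
# vanishes and the score is pinned near zero by the "1 +" term;
# bigger spikes always produce a bigger score.
calm  = beta_tmi(np.full(450, -5.0))
small = beta_tmi([0.8] + [-5.0] * 449)
big   = beta_tmi([1.2] + [-5.0] * 449)
```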

So with that caveat, what we’re doing with the new formula is taking a natural logarithm of something close to ${\rm Old\_TMI}$. For those that aren’t aware, a logarithm is an operation that extracts the exponent from a number in a specific way. Taking the log of “base $b$” of the number $b^a$ gives you $a$, or

$$\large \log_b \left ( b^a \right ) = a$$

There are a few logarithms that show up frequently in math. For example, when working in powers of ten, you might use the logarithm “base-10,” or $\log_{10}$, also known as the “common logarithm.” If what you’re doing uses powers of $e$, then the “natural logarithm” or “base-$e$” log ($\log_{e}$) might be more appropriate. Binary logarithms (“base-2” or $\log_2$) are also common, showing up in many areas of computer science and numerical analysis.

In this case, we’re using the natural logarithm $\log_e$, which can be written $\log$ or $\ln$ depending on which textbook or website you’re reading. I’m using $\ln$ because it’s unambiguous; some books will use $\log$ to represent the common log and others will use it to represent the natural log, but nobody uses $\ln$ to represent anything but the natural log.
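These definitions, along with the product and quotient rules we’ll lean on shortly, are easy to check with Python’s `math` module, which provides all three bases:

```python
import math

# log base b of b**a recovers a, whatever the base:
assert math.isclose(math.log(math.e ** 2.5), 2.5)    # natural log (ln)
assert math.isclose(math.log10(10.0 ** 2.5), 2.5)    # common log
assert math.isclose(math.log2(2.0 ** 2.5), 2.5)      # binary log

# the product and quotient rules used in the derivations below:
a, b = 7.0, 3.0
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))
assert math.isclose(math.log(a / b), math.log(a) - math.log(b))
```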

To figure out how this new formula behaves, let’s consider a few special cases. First, let’s consider the limit where the sum in the equation comes out to be zero, or at least very small compared to $N/c_2$. This might happen if you were generating so much healing that your maximum spike never got close to threatening your life. In other words, if your ${\rm Old\_TMI}$ was really really small. In that situation, the second term is essentially zero, and we have

$$\large {\rm Beta\_TMI} \approx c_1 \ln \left [ 1 + 0 \right ] = 0,$$

because $\ln 1 = 0$. In other words, adding one to the result of the sum before taking the log zero-bounds the metric, so that we’ll never get a negative value. This was a feature of the old formula just due to its definition, and something I sort of liked, so I wanted to keep it. It has a side effect of introducing a “knee” in the formula, the meaning of which will be clearer in a few minutes when we look at a graph.

But before we do so, I want to consider two other cases. First, let’s assume we have an encounter where we take only a single huge spike, and no damage the rest of the time. We’ll approximate this by saying that all but one element of the $MA$ array is a large negative number (indicating a large excess of healing), and that there’s one big positive element representing our huge spike. In that case, we can approximate our sum of exponentials as follows:

$$\large \sum_{i=1}^N e^{F(MA_i-1)} \approx e^{F(MA_{\rm max}-1)}.$$

Let’s also make one more assumption, which is that this spike is large enough that $c_2 e^{F(MA_{\rm max}-1)}/N \gg 1$, so that we can neglect the first term in the argument of the logarithm. If we use these assumptions in the equation for ${\rm Beta\_TMI}$ and call this the “Single-Spike” scenario, we have the following result:

$$\large {\rm Beta\_TMI_{SS}} \approx c_1\ln\left [ \frac{c_2}{N} e^{F(MA_{\rm max}-1)} \right ] = c_1\left ( \ln c_2 - \ln N \right ) + c_1 F \left ( MA_{\rm max} - 1 \right ), $$

where I’ve made use of two properties of logarithms, namely that $\log(ab)=\log(a)+\log(b)$ and that $\log(a/b) = \log(a)-\log(b)$. We can put this in a slightly more convenient form by grouping terms:

$$\large {\rm Beta\_TMI_{SS}} \approx c_1 F MA_{\rm max} + c_1 \left ( \ln c_2 - \ln N - F \right ) $$

This form vaguely resembles $y=mx+b,$ a formula you may be familiar with. And putting it in that form makes the effects of the constants $c_1$ and $c_2$ a little clearer.

We’re generally interested in how the metric scales with $MA_{\rm max}$, which is a direct measurement of maximum spike size. It’s clear from this form that ${\rm Beta\_TMI_{SS}}$ is linear in $MA_{\rm max}$, with a slope equal to $c_1 F$. So for a given filtering strength $F$, the constant $c_1$ determines how many “points” of ${\rm Beta\_TMI}$ you gain by taking a larger spike. Since $F=10$, $c_1$ is the number of points that corresponds to a spike that’s 10% of your health larger.

So if your biggest spike goes up from 130% of your health to 140% of your health, your ${\rm Beta\_TMI}$ goes up by $c_1$. Note that this isn’t a factor of $c_1$, it’s an additive amount. If you go from 130% to 150%, you’d go up by $2c_1$ rather than $c_1^2$.

This was the point of taking the logarithm of the old version of TMI. It takes a metric that scales exponentially and turns it into one that’s linear in the variable of interest, $MA_{\rm max}$. If done right, this should keep the numbers “reasonable,” insofar as you shouldn’t get a TMI that suddenly jumps by 2 or 3 orders of magnitude by tweaking one thing. The downside is that it masks the actual danger – your score doesn’t go up by a factor of X to indicate that something is X times as dangerous.

Once you have $F$ and $c_1$, the remaining constant $c_2$ controls your y-intercept, and is essentially a way to add a constant amount to the entire curve. It doesn’t affect the slope of the result, it just raises or lowers all TMI values by $\approx c_1 \ln c_2$.
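These claims are easy to verify numerically. Here is a sketch (Python; constants from the text, with an illustrative $N=450$, e.g. a 450-second fight with $dt=1$) comparing the exact formula to the single-spike approximation, and confirming that a 10%-of-health increase adds $c_1$ points:

```python
import numpy as np

F, c1, c2 = 10.0, 500.0, np.exp(10.0)
N = 450                                   # illustrative fight length

def beta_tmi(MA):
    MA = np.asarray(MA, dtype=float)
    return c1 * np.log(1.0 + (c2 / N) * np.sum(np.exp(F * (MA - 1.0))))

def single_spike(ma_max):
    # one big spike, overwhelming healing everywhere else
    return beta_tmi([ma_max] + [-5.0] * (N - 1))

exact  = single_spike(1.3)
approx = c1 * F * 1.3 + c1 * (np.log(c2) - np.log(N) - F)
slope  = single_spike(1.4) - single_spike(1.3)   # should be close to c1
```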

The other case I want to consider before going forward is one in which you’re taking uniform damage. In other words, every element of $MA$ is the same, and equal to $MA_{\rm max}$. In that case, the sum becomes

$$\large \sum_{i=1}^N e^{F(MA_i-1)} = \sum_{i=1}^N e^{F(MA_{\rm max}-1)} = Ne^{F(MA_{\rm max}-1)}.$$

In this case, the $N$’s cancel and we have

$$\large {\rm Beta\_TMI_{UF}} = c_1 \ln \left [ 1 + c_2 e^{F(MA_{\rm max}-1)} \right ]$$

If we make the same assumption that the second term in brackets is much larger than one, this is approximately

$$\large {\rm Beta\_TMI_{UF}}\approx c_1\ln c_2 + c_1 \left [ F (MA_{\rm max}-1)\right ],$$

or in $y=mx+b$ form:

$$\large {\rm Beta\_TMI_{UF}} \approx c_1 F MA_{\rm max} + c_1 (\ln c_2 - F ).$$

The difference between the uniform case and the single-spike case is just a constant offset of $c_1 \ln N$. So we get all the same behavior as the single-spike case, just with a slightly higher number. The uniform and single-spike cases are the extremes, so we expect real combat data to fall somewhere in-between them.
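We can confirm that constant offset numerically (Python; same constants from the text, with an illustrative $N=450$):

```python
import numpy as np

F, c1, c2 = 10.0, 500.0, np.exp(10.0)
N = 450                                   # illustrative fight length

def beta_tmi(MA):
    MA = np.asarray(MA, dtype=float)
    return c1 * np.log(1.0 + (c2 / N) * np.sum(np.exp(F * (MA - 1.0))))

ma_max = 1.3
uniform = beta_tmi(np.full(N, ma_max))              # every window identical
spike   = beta_tmi([ma_max] + [-5.0] * (N - 1))     # one spike, calm otherwise

offset = uniform - spike                  # expect roughly c1 * ln(N)
```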

On a graph, this would look something like the following:

Simulated TMI data using the Beta_TMI formula. Red is the uniform damage case, blue is the single-spike case, and green is pseudo-random combat data.

This is a plot of ${\rm Beta\_TMI}$ against $MA_{\rm max}$ for some simulated data that shows how the new metric behaves as you crank up the maximum spike the player takes. The red curve is what we get in the uniform case, where every element of $MA$ is identical. The blue curve is the single-spike case, where we only have one large element in $MA$. The green dots are fake combat data, in which each attack can be randomly avoided or blocked to introduce variance.

The first thing to note is that when $MA_{\rm max}$ is very large, the blue and red curves are both linear, as advertised. Likewise, the green dots always fall between those two curves, though they tend to cluster near the single-spike line. In real combat, you’re going to avoid or block a fair number of attacks, and the randomness of those processes eliminates the majority of cases where you take four full hits in a 6-second window.

You can also see the “knee” in the graph I was talking about earlier. At an $MA_{\rm max}$ of around 0.6, the blue curve starts, well, curving. It’s no longer linear, because we’ve transitioned into a regime where the “1+” is no longer negligible, and we can’t ignore it. The red curve has a similar knee, but it occurs closer to zero (as intended, based on the choice of $c_2$). As you get closer to the knee, the metric shallows out, meaning that changes in spike size have less of an effect on the result. This makes some intuitive sense, in that it’s not as useful to reduce spikes that are already below the danger threshold.

The constants $c_1$ and $c_2$ were chosen mostly by tweaking this graph. I wanted the values to be “reasonable,” so I was aiming for values between around 1000 and 10000. The basic idea was that if you were taking 100% of your health in damage, your TMI value would fall between about 2000 and 2500, and then scale up (or down) from there in increments of 500 for every 10% of health increase in maximum spike size.
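That calibration checks out numerically for the single-spike case (Python; constants from the text, with an illustrative $N=450$): a lone 100%-of-health spike lands near 2000, and each additional 10% of health adds roughly $c_1=500$ points, slightly less this close to the knee:

```python
import numpy as np

F, c1, c2 = 10.0, 500.0, np.exp(10.0)
N = 450                                   # illustrative fight length

def beta_tmi(MA):
    MA = np.asarray(MA, dtype=float)
    return c1 * np.log(1.0 + (c2 / N) * np.sum(np.exp(F * (MA - 1.0))))

def single_spike(ma_max):
    return beta_tmi([ma_max] + [-5.0] * (N - 1))

tmi_100 = single_spike(1.0)               # biggest spike = 100% of health
step    = single_spike(1.1) - tmi_100     # +10% of health
```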

So that’s the beta version of the metric. Now let’s look at the results of the beta test, and see why I decided to go back to the drawing board instead of rubber-stamping ${\rm Beta\_TMI}$.

Beta Test Results

The spreadsheet containing the data is now public, and you can access it at this link, though I’ve embedded it below:

The data we have doesn’t include the moving average arrays used to generate the TMI values, so we can’t make a plot like the one I have above. We can generate a lot of other graphs, though, and trust me, I did. I plotted more or less everything that I thought could give me relevant information about the metric’s performance. I could show you histograms and scatter plots that break down the submissions by average ilvl, TMI, boss, class, and stat weight. But while I had to sift through all of those graphs, I’m not sure it’s a productive use of time to dissect each of them here.

Instead, let’s look at a few of the more significant ones. First, let’s look at Beta_TMI vs ilvl for all classes against the T16N10 boss:

Beta_TMI vs. ilvl for all classes, T16N10 boss.

The T16N10 boss had the highest response rate from all classes. The general trend here is obvious: as ilvl goes up, Beta_TMI goes down, indicating that you’re more survivable against this boss. Working as intended. The range of values isn’t all that surprising given that not all of the Simcraft modules are as well-refined as the paladin and warrior ones. But at least on this plot, the metric appears to be working fine.

If we want to see how a single class looks against different bosses, we can. For example, for warriors:

Beta_TMI vs. ilvl for warriors, all bosses.

Again, the trends are pretty clear. Improving gear reduces TMI, as it should. Some of these data points come from players that tested against several different bosses in the same gear set, and those also give the expected result – crank up the boss, and the TMI goes up.

Another neat advantage is that the statistics finally work well. In other words, if you ran a simulation for 25k iterations before, you’d get an ${\rm Old\_TMI}$ distribution plot that looked like this:

TMI distribution using the old definition of TMI.

And this was an ideal case. It was far more common to have a really huge maximum spike, such that the entire distribution was basically one bin at the extreme end of the plot. It also meant that the metrics Simulationcraft reported (like “TMI Error”) were basically meaningless. However, with the Beta_TMI definition, that same plot looks like this:

TMI distribution generated using the Beta_TMI definition.

This looks a whole lot more like a normal distribution, and as a result works much more seamlessly with the standard error metrics we’re used to using and reporting.

So on the surface, this all appears to be working well. Unfortunately, when we look at stat weights, we run into some trouble. Because a lot of them looked something like this:

Example stat weights generated using Beta_TMI.

The problem here should be obvious, in that this doesn’t tell us a whole lot about how these stats are performing. Rounding to two decimal places means we lose a lot of granularity.

Now, to be fair, they aren’t all this bad. This tended to happen more frequently with players that were overgeared for the boss they were simming. In other words, on players that were nearing the “knee” in the graph. But enough of the stat weights turned out like this for me to consider it a legitimate problem.

Note that Simcraft only rounds the stat weights for the plot and tables. Internally, it keeps much higher precision. As a result, the normalized stat weights looked fairly good. But by default, it plots the un-normalized ones.

I could fix this by forcing SimC to plot normalized stat weights if it’s scaling over TMI, but this comes into conflict with goal #2. Ideally, I’d like it to work well with the defaults, so that I don’t have to add little bits of code all over SimC just to get it to work at all.

And more to the point, this is a more pervasive issue. If health is really doubling in Warlords, and the healing model really is changing, we may start caring about smaller spikes than before. It isn’t good for the metric to be muting the stat weights in those regions.

In fact, now seems like as good a time as any to go through our checklist and grade this version of the metric. So let’s do that.

  1. Accurately representing danger: Pass. At least insofar as more dangerous spikes give a higher TMI, it’s doing its job. We could debate whether the linear scaling is truly representative (and Meloree and I have), but the fact of the matter is that we tried that with version 1, and it led to confusion rather than clarity. So linear it is.

  2. Work seamlessly: Eh…. There’s nothing in SimC that prevents it from working, and it’s vastly improved in this category compared to the first version of the metric because the default statistical analysis tools work on it. But the stat weights really need to be fixed one way or another, which either means tweaking SimC to treat my metric as a special snowflake, or changing the metric. Not super happy about that, so it’s on the border between passing and failing. If I were assigning letter grades, it would be a C-. The original metric would flat-out fail this category.

  3. Generate useful stat weights: Eh…. Again, it’s generating numeric stat weights that work, but only after you normalize them. I’m not sure if the fault really lies in this category, but at the same time if the metric generated larger numbers to begin with, we wouldn’t have this problem.

  4. Useful statistics: Pass. This is one category where the new version is universally better.

  5. Easily interpreted: Fail. If someone looks at a TMI score of 4500, can they quickly figure out that it means they’re taking spikes that are around 135% to 150% of their health? Not unless they go back and look up the blog post, or have memorized that 100% of health is around 2000 TMI, and each 10% of health is about 500 TMI.

    In fact, I’d go as far as to say that this is very little improvement over the original in terms of ease of understanding. The linearity is nice, and the numbers are “reasonable,” but the link between the value and the meaning is still pretty arbitrary and vague.

  6. Numbers should be reasonable: Pass. At the very least, taking the logarithm makes the numbers easier to follow.

All in all, that scorecard isn’t very inspiring. This may be an improvement over the original in several areas, but it’s still not what I’d call a great metric. Generating nice stat weights is important, and it’s not doing a great job of that, but that could be fixed with a few workarounds. But failing at #5 is the real kicker. We rationalized that away in version 1 by treating this like a FICO score, an arbitrary number that reflects your survivability. But the more time I spent trying to refine the metric, the more certain I became that this was a fairly significant flaw.

To make TMI more useful, it needs to be more understandable. Period. And it was only after a discussion with a friend about the stat weight problem that the solution to the “understandability” problem became clear.

In Part II, I’ll discuss that solution and lay out the formal definition of the new metric, as well as some analysis of how it works and why.


Crowdsourcing TMI

As I’ve mentioned a few times on Twitter already, I’ve been working on refining the formula used to calculate the Theck-Meloree Index. The current version certainly works, or at least, gives me the numerical effects that I initially wanted. But in the 6+ months since we defined the metric, we’ve learned a lot more about the quirks involved with having a raw exponential metric. Several of which are more rooted in psychology than mathematics!

(If you’re keeping score, that’s Wrathblood – 1, Theck – 0)

In any event, after a bit of playing with the possibilities I’ve finally decided how I want to modify the formula. It’s all but finished, in fact. The only thing left to do is fine-tune some constants, which I think I’ve done sufficiently well already. But the only good way to test that is to generate a lot of data and see if it’s working the way I want it to.

Which normally would be fine, but there are a few issues with doing that myself.

  • It takes a long time to generate the amount of data I’m looking for. Think several hundred simulations, each with 25k iterations, and calculating 10 scale factors.
  • I want to test it on a variety of gear sets. Again, it takes a lot of time to put together gear sets, and I don’t really want to troll the armory looking for random players to import.
  • I want to test it on all five tanking classes (or at least, the ones that SimC supports). Again, short of trolling the armory, it would be a formidable task to find an appropriate number of players to get a proper sample. And would take a long time.
  • I want to test it against multiple different TMI bosses… so multiply all of those time investments by a factor of four or five.

I’m busy enough as it is with all sorts of other projects (*cough* and Diablo III), not to mention my job, that it’s not feasible for me to generate all of this data myself. Unless you want to wait for the new TMI definition until December. Of 2017.

I could just release the new metric into the wild, of course. I’m pretty sure it’s functioning properly, after all. But I’d much rather be able to do some rigorous testing of it in case there are weird problems that I didn’t anticipate.

This is where you come in. Instead of running several hundred simulations myself, I’m asking each of you to run a simulation or two for me. Basically, you could consider this the public beta test of TMI 2.0.

How To Contribute Data

I’ve coded the new TMI definition into Simulationcraft, and it’s available as an option in version 547-2. By default, it will calculate stat weights using the old formula. However, you can enable the new formula with the argument new_tmi=1.

You can do this by adding that line to the Simulate tab as shown in the screenshot below:

On the Simulate tab, add "new_tmi=1" after your character definition to enable the new formula.

The results page will then report TMI as calculated using the new formula.

If every reader of the blog runs their own character through the sim, I will have a veritable sea of data to swim through (as in, many thousands of simulations). I’m not that optimistic about a 100% reader-to-data-submission conversion rate, so if you can run your character several times with different options (i.e. against different TMI bosses), that’s even better.

Here are the basic guidelines that I’m looking for in submissions:

  • 25000 iterations
  • Standard Patchwerk fight (these should all be SimC defaults)
    • Length: 450
    • Vary Length: 20%
    • Style: Patchwerk
    • Level: Raid Boss
    • Target Race: humanoid
    • Num Enemies: 1
    • Challenge Mode: Disabled
  • Standard Player settings (again, defaults)
    • World Lag: Low
    • Player Skill: Elite
  • Scale Factors (make sure you choose to scale over “tmi”)
    • Strength or Agility (depending on your class)
    • Stamina
    • Expertise
    • Hit
    • Crit
    • Haste
    • Mastery
    • Armor
    • Dodge
    • Parry

All of these options can be found on either the Options->Globals tab or the Options->Scaling tab. First, a quick look at the Globals tab:

SimC’s Options -> Global tab.

You can see that I have all of the settings at default here. The only two I want you to play with are the TMI Standard Boss and the TMI Window.

For the TMI Boss, pick one (or more) ilvl-appropriate bosses. For example, if you’re in heroic T16 gear, then you shouldn’t bother simming against the T15 bosses at all, and probably not against T16N10. Stick to T16H bosses or the T17Q boss. Please do not use “custom” – that will pit you against Fluffy Pillow, who is not so fluffy anymore now that he learned how to perform melee and spell nukes.

For TMI Window, the standard is six seconds. Feel free to leave it at that if you’re getting reasonable results. If you get really weird-looking stat weights or your TMI is below, say, 1000, consider dropping this a little, maybe to four seconds. Please submit the wonky stat weight data anyway, because that’s also useful to me, but then submit the (hopefully) normal-looking data you get using the lower TMI window.

On to the Scaling tab:

SimC’s Options -> Scaling tab.

As you can see, I’ve checked all of the stats I’m interested in. If you’re a druid or monk tank, please check the Agility box too (in that case you can skip Strength if you want to). Above all though, make sure you’ve chosen to scale over “tmi.” I can’t stress this enough, because if you scale over DPS you’ll get scale factors that are useless to me, and it will just mean I have to spend time filtering the data to eliminate those useless data points.

Once you’ve completed the simulation, you can enter the data in the form below. Please also attach the html results (which you can get by using the “Save” button at the bottom right of the results pane in SimC) using the “Upload” button at the bottom of the form. I’m requesting the html so that I can sanity-check the data and figure out what’s happening with outliers, so you can’t submit data without first attaching that file.

There’s no limit to how many times you can submit data, so you can run several different characters through the simulation if you want to. In fact, that’s encouraged, because data from undergeared alts is just as valuable to me (if not more) as data from overgeared mains. And of course, you can run a character against several different TMI bosses and submit each result separately.

Just don’t keep re-submitting a single simulation result multiple times, because each submission after the first would be useless for obvious reasons.

If the embedded form below isn’t working for you for some reason, you can also access it directly via this link: http://goo.gl/SY36xu. Note that you’ll have to reload the page (or open the link in another new tab) in order to use the submission form again.


Thanks in advance for your help! Depending on how quickly the data comes rolling in, I may be able to have this all wrapped up as early as next week.

As soon as that’s done, I’ll be making a much longer post detailing what changes I’ve made, why I’ve made them, and how the new formula works, including simulated data that I used to develop the metric and actual data from this exercise.

Vengeance, With A Vengeance

The developers have been hinting that a major info dump is coming soon™, and that probably includes some more detail about how Vengeance will work in Warlords of Draenor. If you’re a long-time reader of the blog, you probably know that we’ve been pretty hard on Vengeance several times in the past. But with a new expansion, there’s new hope for an implementation that actually works well.

The Good

One of the things we do know about the new version of Vengeance is that it won’t affect our DPS output. They’re finally severing the connection between damage intake and damage output. After years of complaining about the pitfalls and frustrations of that mechanic, I’m considering this a moral victory.

More importantly, it means that for the first time in a long time I’m really enthusiastic about Vengeance. If you recall, most of the objections that Meloree and I have made about Vengeance over the years have centered around the damage output component. We’ve pointed out the backwards logic of encouraging tanks to take more damage to increase their DPS, the huge discrepancy between damage output in solo play and raids, the feeling of uselessness while off-tanking, the frustration of having little to no control over your DPS output (and thus no way to properly evaluate it), and the way it encourages cheesy tricks like one-tanking and /sit-tanking to game the mechanic. All of that is going away, hopefully for good.

This change is probably the one thing I’m looking forward to most in Warlords. Partly out of a feeling of vindication, but mostly just because of functionality. I can’t wait to put out respectable damage in solo and small group content without having to switch to Retribution.

Lessons To Be Learned

Blizzard frequently talks about how they iterate on mechanics, applying the lessons they learn from previous incarnations to improve new versions. I think that severing the DPS connection is obviously one of those cases. But I don’t think that’s all that the devs can stand to learn from the 5.x implementation of Vengeance.

To illustrate that thought, I want to show you an excerpt from one of my recent heroic Thok logs. In particular, I want to consider a period in the last phase of the encounter immediately after I taunt Thok.

Here’s the attack power graph for that section of the fight:

Attack Power plot for a portion of the 25H Thok encounter.

I start at 300k after taunting and rise to over 600k at the peak. This is pretty normal for the last few bosses of the tier, though of course earlier bosses don’t hit as hard. But remember, I have around 40k-50k attack power out of combat. That means that as much as 90% of my DPS is coming from Vengeance, rather than my gear, and thus not directly under my control.

But the part that’s really eye-opening is the healing graphs. Let’s switch to the healing view and filter the log for Eternal Flame. As a point of nomenclature, I use “Word of Glory” to refer to the base heal and “Eternal Flame” to refer to the heal-over-time (HoT) portion to keep them straight. The log, of course, uses the same name for both. But nonetheless, let’s look at the plot of healing done per second:

HPS output of Eternal Flame during a portion of the 25H Thok encounter.

This plot, which includes overhealing, suggests that I’m producing about 150k-250k HPS just with Eternal Flame.  And those two spikes are the Word of Glory heals, which are obviously really huge. Let’s see exactly how huge:

Event view for Eternal Flame for this portion of the 25H Thok encounter.

The base Word of Glory heals are 1M (at ~400k Veng) and 1.5M (over 600k Veng) when I refresh Eternal Flame. The Eternal Flame ticks generated by those casts are ~260k and ~400k, respectively, occurring every ~1.8 seconds.

Note that I have a little over 1 million hit points. That means the base WoG heal is basically a Lay on Hands, limited only by Bastion of Glory ramp-up time. It also means that the HoT, which is providing 140k-220k HPS all by itself, is capable of healing me to full every four to six seconds at high Vengeance.

And that’s just Eternal Flame. If you filter the log for Seal of Insight you’ll see that it is also healing for 100k-140k per tick, producing another 100k-200k HPS. Combined, these two effects heal for ~250k to 420k every second. That means I’m essentially healing to full every 2-4 seconds just from these two passive sources.

You might note that I’m including overhealing here, but if you’ve read any of my survivability posts over the last year or so you should already realize that it’s a mistake to immediately discount overhealing. Because that overhealing isn’t overhealing when you’re in a dangerous situation, like during a damage spike. The fact that it overheals when you’re safe is irrelevant if it saves your ass when your ass actually needs saving.

It also has implications beyond just spikes. With enough avoidance and mitigation, I’m producing enough healing to keep myself alive without healers against this boss. This is an effect we’ve seen in Simulationcraft and discussed before. But it’s also happening in-game, on fights like Thok and Siegecrafter Blackfuse. On more than one attempt, I’ve been able to tank each of these bosses for well over a minute after everyone else had died, just with my own self-healing. Other paladins I’ve talked to have been able to do the same (one even boasts a 3-minute solo on Thok until he hit enrage).

Playing The Blame Game

Now, as far as I can tell, other classes aren’t capable of this degree of self-sufficiency. So it’s not clear that this problem is all Vengeance’s fault. But it’s definitely one of several contributing factors. And it underlies one of the lessons we can learn from 5.x Vengeance: I think it is far too generous.

See, Vengeance increases with the boss’s raw damage throughout an expansion, and even within a tier. So on early bosses you might only have 200k Vengeance, while later bosses will give you upwards of 500k. And of course, those later bosses do more damage than the earlier bosses do – which is why they give more Vengeance in the first place.

But as the expansion goes on, your mitigation and avoidance keep increasing. So while the 500k Vengeance boss gives you twice as much Vengeance as a 250k boss, your gear upgrades between the time you first encounter those bosses mean that you have more mitigation and avoidance. So you don’t actually take twice as much damage from that later boss, because you’re avoiding and mitigating more of it.

In addition, our self-healing grows rather generously with attack power thanks to Eternal Flame, Bastion of Glory, and Seal of Insight. So those later bosses are giving us (say) twice as much attack power, and thus roughly twice as much healing throughput, without dishing out twice as much damage taken.

To give a more quantitative bent to that thought, consider the ratio of self-healing done to damage taken:

$$ R = \frac{{\rm SH}}{{\rm DT}} \propto \frac{{\rm AP}}{(1-A)(1-M)\,{\rm RD}} \propto \frac{k\,{\rm RD}}{(1-A)(1-M)\,{\rm RD}} = \frac{k}{(1-A)(1-M)}$$

Self-healing ${\rm SH}$ is proportional to attack power, which is proportional to some constant $k$ times the boss’s raw damage $\text{RD}$. Our damage taken is also proportional to the boss’s raw damage, but with additional factors $(1-A)$ and $(1-M)$ to account for our avoidance $A$ and average mitigation $M$ (I’m lumping armor, Shield of the Righteous, and blocking all together here).

Note that if this ratio is below one, then we take more damage than we can heal up. But if it goes above one, we’re healing for more damage than we take. In other words, a ratio of $R=1$ is the self-sufficiency limit, above which we can take care of ourselves (at least up until the boss is capable of one-shotting us).

It should be pretty clear what happens over the course of an expansion. As the expansion goes on, $A$ and $M$ increase,  $(1-A)$ and $(1-M)$ decrease, and the ratio gets larger. At the beginning of an expansion, we may be able to heal for 30% to 50% of our damage taken at best. But by the end of the expansion, when we’re pushing ~75% Shield of the Righteous uptime, ~35% avoidance, ~35% block, and 60% mitigation from armor, we’re able to push this ratio significantly above one. Which is why we can solo-tank Thok or Siegecrafter until their stacking debuff effects let them one-shot us.
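
A quick numerical sketch makes that trend concrete. This is Python rather than the MATLAB used later in the post, and the values of $k$, $A$, and $M$ are made up for illustration, not taken from the game:

```python
# Self-healing-to-damage-taken ratio R = k / ((1 - A)(1 - M)),
# where A is avoidance and M is average mitigation.
def self_sufficiency_ratio(k, avoidance, mitigation):
    return k / ((1.0 - avoidance) * (1.0 - mitigation))

# Illustrative (made-up) values: early-expansion vs. end-of-expansion gear.
early = self_sufficiency_ratio(k=0.20, avoidance=0.15, mitigation=0.45)
late  = self_sufficiency_ratio(k=0.20, avoidance=0.35, mitigation=0.75)

print(f"early expansion: R = {early:.2f}")   # ~0.43, well below 1: healers required
print(f"late expansion:  R = {late:.2f}")    # ~1.23, above 1: self-sufficient
```

The same fixed conversion constant $k$ gives a harmless ratio early on and pushes past the self-sufficiency limit once avoidance and mitigation climb.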

Again, to illustrate that thought, let’s look at my damage taken for this same period:

Damage taken for the same period of the 25H Thok encounter.

If you total that up, you get about 12.6 million damage during this 54-second period, or about 233k damage taken per second. Now look at the healing table:

Self-healing from all sources for the same portion of the 25H Thok encounter.

Even if we only consider Seal of Insight and the HoT portion of Eternal Flame, that’s 17.1 million healing. So from passive sources alone, our ratio is $R=1.36$. In other words, I’m passively healing for 36% more damage than we’re actually taking. And that’s ignoring the set bonus and my two Word of Glory casts, which would bring the total up to 21.1 million healing and a ratio of $R=1.67$.
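
Those ratios are just the log totals divided out; as a quick sanity check:

```python
# Totals read off the Thok log excerpt (54-second window).
damage_taken   = 12.6e6   # total damage taken
passive_heals  = 17.1e6   # Seal of Insight + Eternal Flame HoT only
all_self_heals = 21.1e6   # adding the set bonus and the two WoG casts

print(f"passive-only ratio: R = {passive_heals / damage_taken:.2f}")   # 1.36
print(f"all-sources ratio:  R = {all_self_heals / damage_taken:.2f}")  # 1.67
print(f"damage taken per second: {damage_taken / 54:.0f}")             # ~233k
```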

The point in all of this is that our self-healing scales far too well with attack power, and thus with Vengeance.  As we get more “tanky” with more gear, we actually get more Vengeance than we need to compensate for our damage intake. As a tank, I think this is a problem because I don’t believe that tanks should ever be self-sufficient. The bulk of our healing should come from external sources to keep the tank–healer leg of the tank-healer-DPS interaction trinity alive. It’s one thing to have a lot of control over your survivability (which we do, thanks to active mitigation). It’s another thing entirely to be able to be your own healer when other classes can’t.

I don’t think the developers are ignorant of this fact, either. To compensate, they have reduced the conversion percentage $k$ several times over the course of the expansion, but it simply hasn’t been effective enough. Or at least, not for us. I think they’ve probably kept all of the other tanks in line with these reductions, but somehow we slipped through the cracks (more on that in a bit).

There are some ways to ensure this doesn’t happen, or at least to prevent the need to change $k$ several times per expansion. The issue here is that the ratio of $k/(1-A)(1-M)$ grows as $A$ and $M$ grow. So the logical solution is to let $k$ vary the same way. The simplest way to do that is to use actual damage taken to determine Vengeance rather than raw damage. That introduces a factor of $(1-A)(1-M)$ in the numerator, which automatically corrects for variations.
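
To see why that self-corrects, note that the extra $(1-A)(1-M)$ in the attack power cancels the matching factor in damage taken, so the ratio stops depending on gear entirely. A sketch with illustrative numbers (the value of $k$ and the gear levels are mine, for demonstration only):

```python
# Compare R = SH/DT when Vengeance AP is based on raw damage (k*RD)
# versus actual damage taken (k*(1-A)(1-M)*RD). Numbers are illustrative.
def ratio_raw(k, A, M):
    return k / ((1 - A) * (1 - M))          # grows as A and M grow

def ratio_actual(k, A, M):
    return k * (1 - A) * (1 - M) / ((1 - A) * (1 - M))  # always just k

# Three "gear levels" over an expansion: avoidance and mitigation climb.
for A, M in [(0.15, 0.45), (0.25, 0.60), (0.35, 0.75)]:
    print(A, M, round(ratio_raw(0.4, A, M), 2), round(ratio_actual(0.4, A, M), 2))
```

With the raw-damage rule the ratio roughly triples across those gear levels; with the actual-damage rule it sits at $k$ forever.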

Note that I argued strongly against doing this in the past, which may seem inconsistent. But the earlier versions of Vengeance gave us extra damage output for taking more damage. We’ve definitely seen antics this expansion that legitimized that concern. But if that’s no longer possible then we don’t have to worry about tanks feeling encouraged to stand in the fire to produce more DPS. As long as the damage-taken-to-Vengeance conversion is sane (i.e. even remotely balanced), we’ll get less self-healing back than the extra damage we take, so there wouldn’t be an advantage to taking more damage.

But while simple, this solution has its problems. For one thing, it would be awful for avoidance tanks, because it would make Vengeance really spiky. It would penalize you for avoiding attacks, which is bad if avoided attacks are something we should ostensibly be happy about. And while it may not matter as much once dodge and parry ratings don’t show up on gear, it’s still an odd quirk we’d like to avoid. Worse yet, it punishes you for using your active mitigation, which we definitely want to avoid.

An alternative that causes fewer issues is to keep the current Vengeance implementation, but use “estimated post-mitigation damage” rather than raw damage. In other words, every attack you receive grants Vengeance whether you avoid it or not, just like it does now, but the amount is based on $(1-A)(1-M){\rm RD}$ instead of ${\rm RD}$, artificially reducing it according to your character-sheet avoidance and mitigation.

This is tricky, insofar as it still has a negative interaction with avoidance, but the effect is weaker and more smoothed out. To make it work, they would probably also have to exclude active mitigation sources from $M$, which means it would be primarily armor, spec-based mitigation, and possibly blocking. Excluding active mitigation means there would still be some creep in the ratio over the course of the expansion, but a judicious choice of $k$ would keep it at sensible levels.

Maybe the simplest version is to just keep Vengeance as it is (minus the damage component, obviously), but slash $k$ significantly enough to keep the ratio low even in the highest-Vengeance cases. This also weakens Vengeance a lot, but that may not be a bad thing. Before Vengeance existed, there was a real sense of fear tanking a harder-hitting boss because your defenses didn’t immediately scale up to meet it. A weaker version of Vengeance would bring some of that feeling back. The downside, of course, is that WoG might lose ground compared to SotR in that system, making one or the other the better choice on a boss-to-boss basis.

But rather than dwell too long on ways to “fix” Vengeance, especially in the absence of information about how it will be calculated in WoD, I want to take this discussion in a different direction. For a moment, let’s look at the big picture. What if we’re the only class having these odd scaling issues? In fact, this isn’t much of a “what if,” because I think this is actually the case. So if the problem is us, then maybe the solution isn’t to tweak Vengeance, but to tweak us. But how?

Back to Basics

From the logs we’ve analyzed above, it’s clear that a large portion of the problem is the massive effect that Vengeance has on Seal of Insight and Eternal Flame. Sure, I think it’s overpowered to be able to fire off a 1-million-point WoG every 20 seconds – Lay on Hands has a 5+ minute cooldown for a reason, after all – but the bulk of our self-sufficiency comes from these two passive healing sources. So the question becomes “Which abilities should 6.0 Vengeance affect, and how?” To answer that, let’s first consider the purpose of Vengeance for a moment. Celestalon put it fairly succinctly during the twittergeddon following Friday’s blog post:

So that things like Shield Block and Shield Barrier can stay competitive with each other.

In other words, it exists to keep point-based active mitigation (Shield Barrier, Word of Glory) competitive with percent-based mitigation (Shield Block, Shield of the Righteous).

To illustrate why that’s an important goal, imagine that Vengeance didn’t exist in Mists of Pandaria. Let’s ignore Bastion of Glory for a moment and say WoG heals for about 30% of your health at a certain gear level. If you’re raiding in a 25-man, you’d almost never cast it, because Shield of the Righteous will mitigate that much damage or more from a single swing, let alone two. But in a 10-man, where the bosses don’t hit as hard, you could almost ignore Shield of the Righteous and chain-WoG yourself.

That disparity in gameplay isn’t ideal. It would be better if your class worked the same way regardless of setting. It’s more immersive if the question you ask yourself when choosing a finisher is “do I need a heal right now” rather than “are there more than X players in my raid.” And this concern isn’t going away in Warlords – in fact, it’s getting more ubiquitous since normal and heroic modes will be flexible.

Another way to phrase the purpose of Vengeance is that it’s there to make sure that active mitigation abilities have resource parity. If Shield of the Righteous and Word of Glory both cost 3 Holy Power, then they have to perform similarly. Not identically, of course – for good, solid, interesting gameplay there should be situations where you’d choose one or the other. But we can’t have one of them be so dominant that you can take the other one off of your bars either.

Bastion of Glory accomplishes this to some degree, because it introduces an interaction between the two, and subsequently a time factor. You can chain-cast Word of Glory, but it will be weak. It gets a lot stronger (and thus more efficient per Holy Power) if you cast a few SotRs first. This interaction inherently makes that choice interesting, and limits the usefulness of strong WoGs to one every 20 seconds or so without artificially adding a cooldown to the spell. It’s a really great design, all told.

But it doesn’t solve everything, because it doesn’t let Word of Glory scale with boss damage, and to compete with Shield of the Righteous for resources, it has to.

Hope Springs Eternal

However, Seal of Insight doesn’t compete for our Holy Power. It’s automatic – we don’t even cast it. There is never a situation where we choose between another Shield of the Righteous and Seal of Insight.

Eternal Flame is a slightly different beast. In concept, it doesn’t compete for Holy Power either, because we get the same Word of Glory heal with or without the talent; the heal over time is just an added bonus. That’s really only true if you have the T16 4-piece bonus, though, which disconnects Eternal Flame maintenance from the Holy Power opportunity cost.

In practice, Eternal Flame’s HoT is so strong that without the set bonus, we’re really choosing between spending Holy Power on Shield of the Righteous and spending it on the Eternal Flame heal over time. That turns the decision into a choice between a Shield of the Righteous that shaves ~300k off of each boss attack for 3 seconds and an Eternal Flame that heals us for 300k every two seconds for 30 seconds. The latter is just far more efficient, and the ability to overlap them is so powerful that it isn’t even much of a choice. In some sense, Eternal Flame becomes our version of Inquisition, with the caveat that we’d rather refresh it at high Vengeance. The gigantic Word of Glory heal becomes a bit of an afterthought, and I think that’s a bit of a problem.

And the weird self-sufficiency effects in 5.4 are all “collateral damage” from Eternal Flame and Seal of Insight due to the direct Vengeance-to-AP conversion. Eternal Flame in particular gets a huge boost from Vengeance thanks to its collection of multiplicative modifiers, which makes it tough to keep other talents (Sacred Shield) competitive with EF over a large range of AP values.

The Last Bastion

Bastion of Glory is part of the problem too. Eternal Flame benefits from Bastion thanks to buffs back in 5.2 when Eternal Flame was far behind Sacred Shield in survivability. And while I advocated for those buffs at the time, in retrospect it was the wrong call. It definitely made Eternal Flame stronger for Protection (though at the time, still not strong enough to be competitive), but did it in the worst possible way. The difference between a 5-BoG EF and a 0-BoG EF is huge, roughly a factor of 3 or 4 depending on mastery levels. And since that factor ends up applying to our huge Vengeance accumulations, the multiplicative nature makes Eternal Flame ludicrously powerful if we can refresh it with 5 stacks and at high Vengeance.
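
To get a feel for how big that multiplicative swing is, here’s a toy model with assumed numbers. The 40%-per-stack multiplier and the attack power values are mine, chosen only to roughly match the “factor of 3 or 4” above; they are not the in-game coefficients:

```python
# Toy model of the Eternal Flame refresh decision. The 0.4-per-stack
# multiplier and the AP values are assumptions for illustration only.
def ef_strength(attack_power, bog_stacks, per_stack=0.4):
    return attack_power * (1 + per_stack * bog_stacks)

# A 5-BoG EF cast at 300k Vengeance AP...
old = ef_strength(300_000, 5)   # 300k * 3.0 = 900k
# ...versus clipping it with a 3-BoG EF after a taunt doubles your AP.
new = ef_strength(600_000, 3)   # 600k * 2.2 = 1,320k

print(new > old)   # True: the "weaker" 3-stack refresh wins
```

Under those assumptions, clipping a 5-stack Eternal Flame with a 3-stack one right after a taunt is a significant gain despite the lower stack count.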

It also introduces a number of annoying gameplay intricacies. For example, is it worth replacing a 5-BoG EF with a 3-BoG one? Your gut would say no, but in many cases (like after a taunt) it is. If you gained a lot of Vengeance, the 3-BoG EF would be significantly stronger. Likewise, sometimes it’s not worth replacing a 3-BoG EF with a 5-BoG EF if you’ve lost Vengeance or if the 3-BoG EF was cast under Bloodlust and/or Avenging Wrath. It’s complicated enough that nobody can do that math in their head on the fly, especially given the lack of any sort of Vengeance display in the base UI. So to take advantage of those nuances, a player needs equally-complex WeakAuras to simplify the problem down to a go/no-go decision they can use to make split-second decisions.

That last bit is what really pushes me into “this is a bad mechanic” territory. It’s not transparent. The UI doesn’t provide clear information about it. It’s not easy for an advanced player to understand, let alone a beginner. It adds a type of depth and complexity to tanking, for sure, but it does so by making the timing of the EF refresh very sensitive to three or four different factors that the player doesn’t have an easy way to monitor outside of add-ons.

And in the process, it removes the depth and complexity of timing the Word of Glory heal based on your health or expected damage. So I’m not sure it’s really adding that much depth overall; it’s just shifting it from being aware of the boss, your health, and combat to being aware of three arbitrary indicator dials (Vengeance, Haste, and BoG stacks).

If anything, I’d actually call that a loss. Because it actively dissuades players from using Word of Glory the way it was meant to be used – to react to damage spikes. Now, if you have to use your emergency heal, but you don’t have five Bastion stacks, you’re sacrificing even more long-term survivability to use your emergency heal. You’re actually penalized for using your emergency heal as an emergency heal!

In retrospect, I’m sorry I suggested that fix (though of course it’s not clear my suggestion had anything to do with it being implemented). Because I think it would have been more aptly solved with a simpler one: the “100% more healing when self-cast” solution, or equivalently, just increasing the size of the AP coefficient for protection. While bland, it produces all of the desired effects. It can be tuned such that the spell remains competitive with Sacred Shield, but without the huge swings in power with Bastion of Glory stacks and without subverting the design of Word of Glory.

A Limited Time Only Flame

So how do the developers “fix” all of these problems?

First of all, I think that Vengeance should only affect Word of Glory. It can be balanced such that spending Holy Power on Word of Glory heals for more than SotR mitigates from a single attack. It should probably be tweaked such that a 5-BoG WoG heals for a little more than SotR would mitigate off of two attacks, while a 0-BoG WoG stays close to the single-attack SotR value, so that it’s useful no matter how many BoG stacks you have. That’s just a matter of fitting numbers, and keeps WoG an interesting choice over a large variety of content levels. And for clarity, that choice is “Do I need a heal right now to survive the next boss attack, or would I rather put up SotR to increase smoothness over the next two attacks?”

With none of our other healing abilities (Eternal Flame, Seal of Insight, and Sacred Shield) receiving a benefit from Vengeance, the balance of those abilities could be tuned much more finely. The drawback is that they wouldn’t adapt to boss damage, but if the attack power coefficients are chosen appropriately they should remain useful over several tiers of content. Since the spells won’t vary with Vengeance those AP coefficients can be made large enough to keep the skills significant without risking them being overpowered against certain bosses.

And finally, I think the Bastion of Glory interaction with Eternal Flame should be dropped. It has all sorts of unfortunate side effects, and it will be easier to balance Eternal Flame and Sacred Shield when one of them isn’t capable of fluctuating in strength so significantly. Paring both skills down to a single AP coefficient each means they can control the two effects well enough to make them truly competitive, because it will be a simple question of “do you want to absorb X every 6 seconds or heal for Y every 3 seconds,” where X and Y depend only on spellpower and can be independently tuned.

That’s really what I want to see this week, to be honest. While I’m excited to find out about the mechanics of the new version of Vengeance, the details are less important to me than these bigger issues with Eternal Flame, Sacred Shield, and Seal of Insight. I’d really like to see these passive effects toned down to be reasonable, and the only way that will happen is if they aren’t astronomically different in magnitude when they’re buffed by Vengeance, or in Eternal Flame’s case, Bastion of Glory.

A Comedy of Error – Part II

As I said in Part I, I observed some strange error behavior in the 5.4.2 Rotation Analysis post. Now that we’ve had a thorough (and lengthy) review of the statistics of error analysis, it’s time we looked more carefully at the problem that started this whole mess.

Mo’ Iterations, Mo’ Problems

Once again, here was my comment about error from the Rotation Analysis blog post:

The “DPS Error” that Simulationcraft reports is really the half-width of the 95% confidence interval (CI). In other words, it is 1.96 times the standard error of the mean. To put that another way, we feel that there’s a 95% chance that the actual mean DPS of a particular simulation is within +/- DPS_Error of the mean reported by that simulation. There are some caveats to this statement, insofar as it makes some reasonably good but not air-tight assumptions about the data, but it’s pretty good.

I’m actually doing a little statistical analysis on SimC results right now to investigate some deviations from this prediction, but that’s enough material for another blog post, so I won’t go into more detail yet. What it means for us, though, is that in practice I’ve found that when you run the sim for a large number of iterations (i.e. 50k or more) the reported confidence interval tends to be a little narrower than the observed confidence interval you get by calculating it from the data.

So for example, at 250k iterations we regularly get a DPS Error of approximately 40. In theory that means we feel pretty confident that the DPS we found is within +/-40 of the true value. In practice, it might be closer to +/- 100 or so.

So let’s talk about these “deviations.” What caught my attention at first was that, even though the DPS Error reported by SimC was $\pm$ 40 DPS, I could sim the same rotation several times and get values that differed by much more than that, often in the hundreds of DPS. After looking into it more carefully, I’d say that the “$\pm$ 100 or so” I quoted in the last blog post was probably a bit of an under-estimate; $\pm$ 200 to 300 DPS might be a closer estimate to the actual variations I was seeing.

And while this is less than a 0.1% relative error given that we’re talking about DPS means near 400k, it’s still a little disconcerting. First, on a theoretical level, I believe in statistics, so it’s unsettling when they appear not to be behaving properly.  Second, it struck me as very odd that going from 50k iterations to 250k iterations didn’t seem to have a meaningful impact on the error fluctuations. As an experimentalist, I’m familiar with the process of determining how much error I can accept and how much integration time (in this case, iterations) it will take to achieve that level of confidence. So when these sims failed to meet the spec that I set, I took notice.

But a handful of assorted simulations that violate spec isn’t enough information to base a hypothesis on. I knew it wasn’t demonstrating the desired behavior. But to figure out what was wrong, I needed to first figure out exactly what behavior the system was exhibiting. And to do that, I needed more data.

Confidence Boost

In the quoted passage above, I said that what Simulationcraft reports as “DPS Error” is really $1.96 {\rm SE}_{\mu}$, which is the half-width of the 95% confidence interval (CI). The full 95% CI is $\mu_{\rm sample} \pm 1.96 {\rm SE}_{\mu}$, so it’s appropriate to say that when you look at a SimC report, the “DPS” value it reports is accurate to about $\pm$ “DPS Error.” This is a pretty natural way of reporting error, as we’ve seen in Part I.

Thinking back to our dice experiment in Part I, we said that if we repeated the experiment 100 times, we’d expect that about 95 of them would fall within the range $\mu_{\rm sample}\pm 2{\rm SE}_{\mu}$ (I’m rounding 1.96 to 2 here for simplicity). That was the meaning we ascribed to the 95% confidence interval. So one way to test the system is to do exactly that: run the simulation 100 times and take a look at the distribution of sample means.
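
Before doing that with real SimC output, it’s worth checking the logic on data we control. Here’s a Python sketch (all numbers are arbitrary, and the “DPS” samples are synthetic normal draws) that runs 100 fake “simulations” and counts how many sample means land within $\mu \pm 1.96\,{\rm SE}_{\mu}$:

```python
import random

random.seed(42)
TRUE_MEAN, SIGMA, ITERATIONS, RUNS = 400_000.0, 20_000.0, 1_000, 100

# Half-width of the 95% CI that a well-behaved simulator would report per run.
half_width = 1.96 * SIGMA / ITERATIONS ** 0.5

covered = 0
for _ in range(RUNS):
    # One "simulation": the mean of ITERATIONS iid iterations.
    sample_mean = sum(random.gauss(TRUE_MEAN, SIGMA)
                      for _ in range(ITERATIONS)) / ITERATIONS
    if abs(sample_mean - TRUE_MEAN) <= half_width:
        covered += 1

print(f"{covered} of {RUNS} runs inside the reported 95% CI")
```

If the CLT assumptions hold, that count should come out near 95.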

And just to be abundantly clear about what that means, let’s assume we’re interested in the simulation error when we run it for 100k iterations. We can do that once to get a sample mean and 95% CI. We can then do it 99 more times, running the sim for 100k iterations each time, which gives us 100 sample means from the 100 independent simulations.

Our best guess at the population mean $\mu$ is the mean of those 100 sample means $\mu_{\rm sample}$ (I feel like I need an Xzibit image here…). And we could then empirically determine a value $\delta$ such that 95 of those means fit in the range $(\mu-\delta,\mu+\delta)$. If we did that, then $2\delta$ is our empirical estimate of the 95% CI. We could compare that to twice the value SimC reports as “DPS Error” to check for consistency.

There are a number of ways to make that empirical estimate, but two of them are relatively easy in MATLAB. The first is to use the prctile() function, which we can use to find the DPS values at the 2.5th and 97.5th percentiles of the data set. The difference of those two values is the empirical estimate of the 95% CI.
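
The percentile idea translates directly outside of MATLAB, too. Here’s a Python sketch: the data are synthetic stand-ins for 100 SimC sample means, the sigma of 630 is an arbitrary illustration, and statistics.quantiles with n=40 yields cut points at 2.5% steps, so the first and last bracket the middle 95% of the data:

```python
import random
import statistics

random.seed(1)
# 100 "sample means", one per independent simulation; synthetic stand-in
# for the 100 SimC runs (normal draws with an illustrative sigma of 630).
sample_means = [random.gauss(400_000, 630) for _ in range(100)]

# Cut points at 2.5%, 5%, ..., 97.5%; the difference between the first
# and last is the empirical estimate of the full 95% CI width.
cuts = statistics.quantiles(sample_means, n=40)
empirical_ci = cuts[-1] - cuts[0]

print(f"empirical 95% CI width: {empirical_ci:.0f}")
# A well-behaved simulator would report a width near 2 * 1.96 * 630 ~ 2470.
```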

The second method is more involved, and uses Principal Component Analysis, or PCA. It also goes by a number of other names: eigenvalue decomposition, empirical component analysis, singular value decomposition, and several more. It’s related to finding principal axes in mechanics, if you’re familiar with mechanical engineering concepts. It attempts to find the confidence region (or “confidence ellipsoid”) of the data set, which is the generalization of a confidence interval into higher dimensions. When you apply it to a one-dimensional data set, though, you get the usual confidence interval.

In any event, it’s a powerful linear algebra technique that would require another whole blog post to explain, so if you’re really interested in the guts of it I suggest you read the Wikipedia article. For those that care, I’m using a function from this thread of MATLAB Central, which uses the finv() and princomp() methods from the statistics toolbox. (fun coincidence: I worked in the same building as the author of this code as a postdoc, though in a different department). The only change I’ve made is a minor correction; I’m fairly certain that the line

ab = diag(sqrt(k*lat));

should be

ab = diag(k*sqrt(lat));

so I’ve made that correction. Without it, the 95% CIs the code produces are approximately half the size they should be (because $k\approx 4$), as tested against a normally-distributed data set I generated for that purpose. With the correction, the PCA estimate agrees very well with the percentile-based estimate (as it should!).

So, armed with two techniques to empirically estimate the 95% confidence interval, I set to the task of doing that for various simulation lengths. In other words, run 100 simulations with 50 iterations each, then do it again for 100 iterations, and again for 250 iterations, and then for 500, 1000, 2500, 5000, 10000, 25000, 50000, 100000. I did all of this with the T16H protection paladin profile and “default” settings in SimC.

That takes a while – the whole set of runs takes 5-8 hours depending on how many threads I use. But at the end, we get a graph that looks something like this:


Error analysis of Simulationcraft results. The blue line is the confidence interval reported by Simulationcraft. Green and red lines are the estimated confidence intervals obtained through PCA and percentile methods, respectively.

It’s a little harder to tell what’s going on in the top plot because it’s a semilog, but the bottom loglog plot shows the problem very clearly. At 1000 iterations ($10^3$) the three error estimates agree very well. However, around 5000 iterations we see the observed error exceeding the reported error, and as we increase the number of iterations further the gap just gets larger. By 100000 iterations ($10^5$), we’re reporting a confidence interval of almost 100 DPS, but observing a confidence interval of nearly 500 DPS.

This is a problem – it means that we’ve effectively hit an “error floor” in SimC, because no matter how many iterations we throw at the problem, the error doesn’t seem to improve. And that’s pretty weird. But why?

Results Hazy, Ask Again Later

The “why” took a little more thinking. I’ve had several discussions over the past month with other SimC devs and a few academics about what might cause this sort of thing. As it turns out, everyone I spoke to had the same first guess that I did. If you remember back to Part I, we said that our error estimates were based on the Central Limit Theorem. Maybe we were violating the CLT somehow, and as a result our actual errors were larger than we expected?

If you recall, the constraints on the CLT were that each iteration needed to be independent and identically distributed. In other words, none of the iterations should depend on any of the previous iterations, and the probability distribution we’re sampling shouldn’t change from iteration to iteration. Of the two, dependence seemed like the more likely culprit.
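To make the i.i.d. requirement concrete, here's a toy Python experiment (my own illustration, unrelated to SimC's code): when samples really are independent and identically distributed, the spread of the sample means shrinks like $\sigma/\sqrt{N}$, exactly as the CLT promises.

```python
import random
import statistics

random.seed(42)

# Population: the sum of two fair dice, sigma = sqrt(35/6) ~ 2.415
sigma = statistics.pstdev(a + b for a in range(1, 7) for b in range(1, 7))

def sample_mean(n):
    """One 'simulation' of n independent iterations."""
    return statistics.fmean(random.randint(1, 6) + random.randint(1, 6)
                            for _ in range(n))

for n in (100, 400, 1600):
    means = [sample_mean(n) for _ in range(2000)]
    observed = statistics.stdev(means)   # empirical spread of the means
    predicted = sigma / n ** 0.5         # CLT prediction
    print(n, round(observed, 4), round(predicted, 4))
```

If the iterations were dependent, or drawn from a distribution that shifted from run to run, the observed column would stop tracking the predicted one as $n$ grows.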

I should note that while this was the first thought I had, the second thought I had was “but how?” Most people I talked with were similarly stumped at first. The thing that stuck out to us as the most likely culprit also seemed… somewhat unlikely. And that was the “vary_combat_length” option in SimC.

See, the default setting in SimC is to vary the combat length from iteration to iteration to smooth out the impact of cooldowns and other fixed-time-interval effects. To illustrate that concept, let’s say you had a spell with a 1-minute cooldown that gave you a 30-second buff that significantly increased your DPS (say, Avenging Wrath on steroids). If you ran the sim for exactly 1 minute and 30 seconds, you’d get two casts of that spell (once at the pull, once at the 1-minute mark) and you’d have 66.67% uptime on that buff. But if you ran the sim for exactly 2 minutes, you’d have the same two casts but only 50% uptime. Your DPS would look really good in the first sim, and significantly lower in the second sim.
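That cooldown-clipping arithmetic is easy to verify (a Python sketch using the hypothetical 1-minute cooldown and 30-second buff from the example, cast on cooldown starting at the pull):

```python
def buff_uptime(fight_length, cooldown=60.0, duration=30.0):
    """Fraction of the fight covered by a buff cast on cooldown from t=0."""
    covered = 0.0
    t = 0.0
    while t < fight_length:
        covered += min(duration, fight_length - t)  # buff may be clipped by fight end
        t += cooldown
    return covered / fight_length

# Two casts either way (t=0 and t=60), but very different uptimes:
# 90-second fight -> ~66.7% uptime; 120-second fight -> 50% uptime
print(round(buff_uptime(90), 4), buff_uptime(120))
```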

So to try and reduce that problem and give a more holistic view of your DPS that accounts for fluctuations in fight length, SimC varies the fight length by up to 20% from the default of 450 seconds. That way you get a spread of cooldown uptimes that more accurately represents an average encounter.

The reason that we thought this was an unlikely candidate was that it wasn’t clear how this violated either of the CLT constraints. See, SimC doesn’t just run arbitrarily for 450 seconds by default. It does that for the first iteration, during which it tallies up the amount of damage you do, and then for subsequent iterations it gives the boss that much health and lets you go to town on it, varying the health accordingly to get longer or shorter runs.

So varying the combat time doesn’t change the relative amount of time you spend in execute range, for example. That’s important, because if you spent e.g. half of the fight in execute range, and you do more DPS in execute range, then you’ve changed the probability distribution being sampled, so we’d be violating the “identically distributed” constraint.

However, the variation in combat length isn’t random either – it follows a predetermined sequence, where it alternates between extremes. As a rough example, it might start with a run that’s 20% shorter than the average, which we’ll call “-20%.” It would follow that with a run that’s 20% longer than average, or “+20%.” And then one that’s -19%, followed by another at +19%, followed by -18%, and so on. Note that these aren’t relative to the previous iteration – they’re all relative to the target length of 450 seconds. So in theory, these shouldn’t be violating the independence clause on that account. But they are somewhat deterministic because of the patterning.
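That patterning can be sketched as a generator (an illustration of the idea described above, not SimC's actual implementation):

```python
def length_multipliers(vary=0.20, steps=20):
    """Alternate below/above the target length: -20%, +20%, -19%, +19%, ...
    Each value is relative to the 450-second target, not the previous run."""
    for i in range(steps + 1):
        frac = vary * (1 - i / steps)
        yield 1.0 - frac   # shorter-than-target run
        yield 1.0 + frac   # longer-than-target run

seq = list(length_multipliers())
# Deterministic, but centered on the target: the multipliers average out to ~1.0
print(len(seq), sum(seq) / len(seq))
```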

So, it felt unlikely that this was the problem. But we really weren’t sure. So we tested it by repeating the experiment with the command-line argument “vary_combat_length=0” to disable the combat length variation code. And five to eight hours later, the result was this:


Error analysis of Simulationcraft results with “vary_combat_length” disabled. The blue line is the confidence interval reported by Simulationcraft. Green and red lines are the estimated confidence intervals obtained through PCA and percentile methods, respectively.

Well, that didn’t help. So at the very least, the combat length variation code isn’t the only problem. We can’t rule it out completely based on this data, because it’s possible (if unlikely) that it is one of two or more contributing factors. But it certainly looks like the culprit lies elsewhere.

Death and Decay

The next candidate we came up with was a quirk of how the boss health calculation works. I glossed over this above by saying that we determine the boss’s health based on the damage done in the first iteration. But that’s not really the whole story.

There’s no guarantee that the first iteration is a representative sample of your DPS. Maybe in that first iteration you had an unusually low number of crits or Grand Crusader procs, so your DPS was below average. In that case, the health we assign the boss for iteration #2 will be a little low, and you might blow through it in 425 seconds rather than 450 seconds. If we kept using that boss health value, we may find that after a large number of iterations the mean combat length is only 430 seconds rather than our target of 450 seconds.

So Simulationcraft incorporates that information by performing a moving average on boss health as we go. If iteration #2 was significantly shorter, it will add a little health to the boss for the next one. It basically makes an educated guess at how much more health it would take to bring the average back up to 450 seconds. It does that for each iteration, though with some amount of decay built-in to keep things from oscillating out of control. The technique is very good at homing in on an average of 450 seconds of combat after many iterations. This is called the “enemy health estimation model,” and it’s what SimC uses by default.
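Here's a toy version of that idea in Python (my own sketch of a decaying correction loop, not SimC's actual estimator; all of the constants are made up):

```python
import random

random.seed(1)

def estimate_boss_health(target_time=450.0, true_dps=400_000.0, iters=200,
                         gain=0.2):
    """Each iteration, nudge boss health toward the value that would have
    made the observed DPS kill it in exactly target_time seconds.
    The gain < 1 is the 'decay' that keeps it from overcorrecting."""
    health = true_dps * 400.0                    # deliberately poor first guess
    for _ in range(iters):
        dps = random.gauss(true_dps, 20_000.0)   # this iteration's luck
        health += gain * (target_time * dps - health)
    return health

health = estimate_boss_health()
# Implied average kill time should settle near the 450-second target
print(round(health / 400_000.0, 1))
```

With the correction term decaying each pass, the implied kill time settles onto the target after a few dozen iterations.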

Unfortunately, it also means that each iteration is slightly dependent on the previous ones. If iterations one through 50 were a little short, then iteration 51 gets a little longer. Again, it’s not clear that this is a strong enough effect to matter, but we just weren’t sure, and it’s a pretty obvious place to check if you’re worried that dependence between iterations is a problem.

There are two ways we can reduce the impact of health recalculation in SimC. The first is to use a time-based model with the command-line option “fixed_time=1″, which tells the sim to run for exactly 450 seconds, period. It will still perform the boss health recalculation from iteration to iteration, but since we’re stopping the sim based on time, that won’t cause excessively long or short runs. This option also respects the user’s choice of the vary_combat_length option, and adjusts the time accordingly unless it’s disabled.

The second way is to use the Fixed Enemy Health model by setting “override.target_health=X”. This forces the boss to have exactly X health every iteration, and the sim ends when the boss runs out of health. That automatically disables both combat length variation and the health recalculation effect. This is the pinnacle of independent trials, because it removes any possible dependence on previous runs.

So I ran three more configurations: One with fixed_time=1 and vary_combat_length left at the default of 0.2, one with fixed_time=1 and vary_combat_length=0, and one with target_health=171000000 (roughly appropriate for a 450-second run at ~400k sustained DPS).

Did I mention that each of these takes 5-8 hours?

Days later, here’s what I got out of the experiments:

Error analysis of Simulationcraft results with “fixed_time=1″. The blue line is the confidence interval reported by Simulationcraft. Green and red lines are the estimated confidence intervals obtained through PCA and percentile methods, respectively.


Error analysis of Simulationcraft results with “fixed_time=1″ and “vary_combat_length” disabled. The blue line is the confidence interval reported by Simulationcraft. Green and red lines are the estimated confidence intervals obtained through PCA and percentile methods, respectively.


Error analysis of Simulationcraft results with “override.target_health=171000000″. The blue line is the confidence interval reported by Simulationcraft. Green and red lines are the estimated confidence intervals obtained through PCA and percentile methods, respectively.

Now we’re getting somewhere. It seems from this data that the fixed_time setting didn’t change anything, but fixing the target health did. The fixed health simulation gives us results in excellent agreement with the theoretical results. So we really are looking for a violation of the Central Limit Theorem, at least somewhere.

But where? Was it in the health recalculation? Or the combat length variation? Or something else entirely that I overlooked?

Class Warfare

Around this time, one of the other SimC devs asked me if I had tested this with other specs or classes. The thought being that maybe it was an issue specific to paladins. And of course, I hadn’t yet, because each experiment takes five to eight hours to run, and I was in the middle of the last of the three runs above. But it was definitely on my to-do list to run a few other specs as a control group.

So I queued up a few more of these experiments for other specs. For example, using the  T16H retribution paladin profile:


Error analysis of Simulationcraft results. The blue line is the confidence interval reported by Simulationcraft. Green and red lines are the estimated confidence intervals obtained through PCA and percentile methods, respectively.

Note that this is with default settings – the same settings that cause the error anomaly with protection. I ran a few more experiments to test enhancement shamans and protection warriors, with similar results. All of the other classes seemed to be obeying the CLT, even with combat length variation and health recalculation active. And even retribution seemed to be working properly under those conditions. It’s as if the problem was specific to protection paladins!

Which really meant that I thought the problem was something I did in the paladin module – i.e. it was my fault. So of course, I immediately went digging through the paladin module looking for anything that would link one iteration to the next. Maybe I wasn’t re-initializing everything properly, and the state at the end of one iteration was somehow influencing the next? But after a few hours of combing through the code, I came up empty-handed. Nothing seemed to be persisting between iterations.

So I started debugging, literally running a few simulations and checking the state of the paladin at different break points in the process of the simulation. Combing through all of the relevant properties of the paladin object in Visual Studio, searching in vain for something – anything – that wasn’t being reset properly. And while I didn’t find anything, it did cause me to stumble over the answer in the dark almost by accident.

Fix Me Up, Before You Go Go

What I stumbled across was the fixed_time flag. I was running the T16H protection paladin profile through the simulation with completely default settings, and at one of my breakpoints I happened to notice that the fixed_time flag was active. Needless to say, this was… odd. It shouldn’t be on in a default simulation. Unable to figure out why it was on, I consulted the other devs, and was pointed to an old piece of code that had been hiding in the shadows:

if ( p -> primary_role() != ROLE_HEAL && p -> primary_role() != ROLE_TANK && ! p -> is_pet() ) zero_dds = false;

If you’re not fluent in C++, that’s checking to see that the actor’s role is not “healer” or “tank”, and also that the actor is not a pet. And if the actor is none of those things, it sets a flag to false. Later on, that flag is used to forcibly enable “fixed_time=1″ if the flag is true. So in other words, the sim automatically shifts into fixed-time mode if you’re simming a healer or a tank!

Now, at the time it was written, this code made sense. Keep in mind that Simulationcraft started out primarily as a DPS spec simulator. While it has the guts to support healers and tanks, it wasn’t until fairly recently that either of those roles was really well supported. Arguably, healers still aren’t, for a variety of reasons, and a lot of the reason it’s been improving for tanks is that I got involved and started implementing stuff that we wanted to see.

That’s not meant as a shot at the existing SimC devs either, by the way. These folks work incredibly hard to improve and maintain the project, but it’s a hobby for all of us, and there’s more than enough work to be done keeping it running properly for DPS specs. Getting solid support for, say, tanking pretty much requires a dev who has the interest and time to spend implementing tanking stuff, not to mention other devs who are willing to maintain the tanking part of each class module. And that didn’t really happen until I got involved and gave tanks a reason to care about the results (being able to calculate TMI, and correcting a bunch of minor errors in combat, mitigation, and Vengeance calculations).

It’s also why I suspect healers won’t be well-supported until a serious healing theorycrafter decides to say, “here’s what we need the sim to do in order to be useful to us,” and then wade in and make those changes.

But back to the point, if you’re simming a healer, you’re not putting out any DPS. It makes very little sense to base the simulation time on boss health in that scenario, so you’d clearly want to default to a fixed-time model for a healer. That line of code was basically just a catch-all to say, “only use the boss health estimation model for DPS classes/specs.” The fact that it enabled it for tanks was mostly an afterthought, because nobody was using Simulationcraft to simulate tanking at that point.

In any event, this was a giant clue that the problem had to do with the fixed_time option, so we dug into that in more detail. What I learned, mostly from discussion with the other devs, is that fixed-time mode did a bunch of things it really shouldn’t. The root of the problem was that it was still basing the boss’s health percentage on the health recalculation algorithm in this mode. That poses two major problems:

  • The boss still “died” when it reached 0% health, which meant that you could end the simulation earlier than your target time if you happened to be lucky on that iteration (i.e. had above-average DPS).
  • If you had an iteration of below-average DPS, the simulation would hard-stop at the target time. So if you were supposed to run for 450 seconds, and the boss wasn’t dead yet, tough – the simulation just ends.

That seems perfectly logical, but it causes some major CLT violations. Hard-stopping the simulation at 450 seconds is essentially throwing a Heaviside function (or “step function”) into the mix. It’s saying, “we don’t care what happens after this point, and we’re going to ignore it.” But natural variations in DPS output should cause some iterations to be shorter than 450 seconds and other iterations to be longer. The hard-stop only applies to the longer runs, which means we’re artificially affecting some of our iterations but not others.

To see why this is a problem, consider the following two scenarios:

  • An iteration where you had exactly average DPS, and the boss dies exactly at 450 seconds. You enter execute range ~370 seconds into the fight, so you spend about 80 seconds in execute range. Note that this is a little less than 20% of the time, because I’m assuming your DPS goes up in execute range.
  • An iteration where you had bad luck and produced below-average DPS. You don’t enter execute range until ~400 seconds into the fight as a result, so you only get 50 seconds of execute range. The simulation forcibly ends combat at 450 seconds with the boss still having 5%-10% health remaining.

The second scenario should produce even lower DPS than expected, because not only did you have bad luck during the initial part of the iteration, but you were robbed of 30 seconds (or more) of higher-DPS execute time. Statistically, that means we’re changing the underlying probability distribution, because the relative time spent in execute range is changing significantly from iteration to iteration.
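Those two scenarios can be put into numbers (a hypothetical Python sketch: 20% execute phase, a 20% DPS bonus during it, hard stop at 450 seconds; none of these constants come from SimC):

```python
def execute_fraction(dps, boss_health, cap=450.0, exec_bonus=1.2):
    """Fraction of actual combat time spent in execute range (boss below 20%),
    with the iteration hard-stopped at `cap` seconds."""
    t_normal = 0.8 * boss_health / dps            # burning the first 80% of health
    if t_normal >= cap:
        return 0.0                                # hard stop before execute range
    t_total = min(t_normal + 0.2 * boss_health / (dps * exec_bonus), cap)
    return (t_total - t_normal) / t_total

avg_dps = 400_000.0
# Health chosen so an exactly-average run kills the boss at 450 seconds
health = avg_dps * 450.0 / (0.8 + 0.2 / 1.2)

avg_run = execute_fraction(avg_dps, health)         # ~17% of the fight
unlucky = execute_fraction(0.93 * avg_dps, health)  # ~11%: execute time clipped
print(round(avg_run, 3), round(unlucky, 3))
```

Same boss, same gear, but the fraction of the fight spent in execute range swings from roughly 17% to roughly 11% purely on luck.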

And that violates one of our CLT conditions – each iteration needs to be identically distributed if we want to be able to use the CLT. If we spend 10% of our time in execute range on one iteration, but 20% on another iteration, and 15% on a third iteration, that condition isn’t being adhered to anymore, and we can’t expect our error to conform to the predictions of the CLT.

Ti-i-i-ime Is On My Side

The correction, which was made in this commit, was to fix the way we calculate health percentage. Instead of using boss health in fixed_time mode, we now ignore boss health entirely and use time to estimate boss health. For example, if you’re running a 450-second simulation and you’re 270 seconds into it, the health_percentage() function just returns the percentage of time left in the simulation: $100\%\times(1-270/450)=40\%$. This fixes both of the problems above: we’re no longer chopping off low-DPS runs and skewing our distribution, and the boss can’t die early on high-DPS runs because the sim calls health_percentage() to determine if the boss is dead yet. And if we rebuild the simulation after that commit and run the T16H protection paladin profile through it, we get this:


Error analysis of Simulationcraft results with default (forced fixed_time=1) settings after fixing the behavior of health_percentage(). The blue line is the confidence interval reported by Simulationcraft. Green and red lines are the estimated confidence intervals obtained through PCA and percentile methods, respectively.

Excellent. We’re now getting proper agreement with the CLT estimate even out as far as $10^5$ iterations. And we can expect that trend to continue as iteration numbers increase because we’re not violating any of the CLT conditions anymore.
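The corrected behavior is simple to express in sketch form (a minimal Python rendering of the idea; the actual fix is C++ in the commit linked above):

```python
def health_percentage(current_time, max_time=450.0):
    """Fixed-time mode: report boss 'health' as the fraction of sim time left,
    ignoring actual damage done entirely."""
    return 100.0 * (1.0 - current_time / max_time)

print(health_percentage(270.0))  # 40.0 -- matching the worked example above
```

Since this "health" can only reach zero at the target time, the boss can neither die early on lucky runs nor survive past the stop on unlucky ones.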

In a later commit, the line of code quoted above was changed to remove the tank role check as well. In other words, we’re no longer running in fixed-time mode all the time, which is fine because we produce stable enough DPS that the health recalculation algorithm should work properly. While that doesn’t have a significant impact on the results, it’s nice to know that we use the same defaults as most other specs (excepting healers, of course).

If you were attentive, you may have noticed that I tested protection warriors and found that they weren’t exhibiting the same error behavior. Now that you know what the problem was, you may ask, “why not?” After all, they’re tanks, so they were also being forced into a fixed-time mode when being simmed. So what gives?

If you guessed “they don’t have an execute range,” pat yourself on the back. Oh sure, warriors have Execute – the entire “execute range” term is named for it, after all. But if you take a quick look at the T16H protection warrior profile, you’ll notice that it isn’t being used. Which makes sense, because a tank that’s actually in danger would rather use that rage on Shield Barrier for more survivability. Since the T16H protection warrior profile doesn’t change the player’s behavior during execute range, it’s irrelevant how much time they spend there, because their DPS doesn’t change when the boss drops below 20%. So the types of variations that caused error bloat for the protection paladin profile simply don’t exist in the protection warrior profile.


TL;DR

If you’re thinking to yourself, “Man, I really don’t want to read 4600 words, could you get to the point already,” this section is for you. In short, here’s what happened:

  • The simulation was forcing tanks into a “fixed-time” mode, where the sim runs for X seconds and stops if it reaches that time regardless of boss health.
  • As a result, the relative amount of time spent in execute range could change significantly from iteration to iteration based on your DPS, changing the underlying probability distribution.
  • Changing the underlying probability distribution violates the Central Limit Theorem, and makes Simulationcraft’s reported error estimate inaccurate, far lower than the actual error.
  • We fixed it by (a) changing the way we calculate the boss’ health percentage in fixed-time mode, and (b) not forcing tanks into fixed-time mode in the first place.

For anyone who didn’t skip to the end, I hope this was an enjoyable read, and less technically grueling than Part I was. It was fun (if time-consuming) to write, if only because I get to mix in concepts that I use frequently in a professional (experimental physics) context, like error analysis and experimental design, with theorycrafting and simulation.

I think many people don’t realize how intertwined the two are in practice. I’m sure a lot of theorycrafters, especially newer ones or ones without a strong science background, ignore error entirely. It’s a lot easier to just look at things like mean DPS or HPS, damage per resource spent, or similar metrics. But especially when it comes to simulation, it’s important to know how good your estimates are and whether you can trust them.

Part of my goal in this pair of posts was to provide a good example of how one goes about doing that, and why. Together, they’re a good introduction to how to properly perform error analysis on results and what to look for when you find results that don’t meet your expectations. Hopefully, at least a few theorycrafters come out of reading these posts feeling like they’ve added a new tool to their skill set.

And more generally, that non-theorycrafters leave with a sense of what it means to talk about statistical (i.e. random) error. I’ll consider it a success if a few people walk away from this set of posts saying, “You know, I never understood how this works before, but now I get it.”


A Comedy of Error – Part I

In the 5.4.2 Rotation Analysis post, I mentioned that I was looking into some odd behavior in the SimC error statistics:

I’m actually doing a little statistical analysis on SimC results right now to investigate some deviations from this prediction, but that’s enough material for another blog post, so I won’t go into more detail yet. What it means for us, though, is that in practice I’ve found that when you run the sim for a large number of iterations (i.e. 50k or more) the reported confidence interval tends to be a little narrower than the observed confidence interval you get by calculating it from the data. So for example, at 250k iterations we regularly get a DPS Error of approximately 40. In theory that means we feel pretty confident that the DPS we found is within +/- 40 of the true value. In practice, it might be closer to +/- 100 or so.

Over the past two weeks, I’ve been running a bunch of experiments to try to track down and correct the source of this effect. The good news is that with the help of two other SimC devs, we’ve fixed it, and future rotation analysis posts will be much more accurate as a result.

But before we discuss the solution, we have to identify the problem. And to do that, we need a little bit of statistics. I find that most people’s understanding of statistical error is, humorously enough, rather erroneous. So in the interest of improving the level of discourse, let’s take a few minutes and talk about exactly what it means to measure or report “error.”

Disclaimer: While I’m 99.9% sure everything in this post is accurate, keep in mind that I am not a statistician. I just play one on the internet to do math about video games (and in real life to analyze experimental results). If I’ve made an error or misspoken, please point it out in the comments!

Lies, Damn Lies, and Statistics

Let’s start out with a thought experiment. If we’re given a pair of standard 6-sided dice, what’s the probability of rolling a seven?

There’s a number of ways to solve this problem, but the simplest is probably to do some basic math. Each die has 6 sides, so there are 6 x 6 = 36 possible combinations. Out of those combinations, how many give us a sum of seven? Well, there are three ways to do that with the numbers one through six: 1+6, 2+5, and 3+4. However, we have two dice, so either one could contribute the “1” in 1+6. If we decide on a convention of reporting the rolls in the format (die #1)+(die #2), then we could also have 4+3, 5+2, and 6+1. So that’s six total ways to roll a seven with a pair of dice, out of thirty-six possible combinations; our probability of rolling a seven is 6/36 = 1/6 ≈ 0.1667, or 16.67%.
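The counting argument is easy to verify by brute force (Python):

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 ordered outcomes of rolling two six-sided dice
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))

print(counts[7], Fraction(counts[7], 36))  # 6 ways -> probability 1/6
```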

We could ask this same question for any other possible outcome, like 2, 5, 9, or 11. If we did that for every possible outcome (anything from 2 to 12), and then plotted the results, it would look like this:

The probability distribution that describes the results of rolling two six-sided dice.

This gives a visual interpretation of the numbers. It’s clear from the plot that an 8 is less likely than a 7 (as it turns out, there are only five ways to roll an 8) and that rolling a 9 is even less likely (four ways) and that rolling a 2 or 12 is the least likely (one way each). What we have here is the probability distribution of the experiment. It tells us that on any given roll of the dice there’s a ~2.78% chance of rolling a 2 or 12, a 5.56% chance of rolling a 3 or 11, and so on.

Now let’s talk about two terms you’ve probably heard before: mean and standard deviation. These terms show up a lot in the discussion of error, so making sure we have a clear definition of them is a good foundation on which to build the discussion. The mean and the standard deviation describe a probability distribution, but provide slightly different information about that distribution.

The mean tells us about the center of the distribution. You’re probably more familiar with it by another name: the average.  Though both of those names are a bit ambiguous. “Average” can refer to several different metrics, though it’s most commonly used to refer to the arithmetic mean. “Mean” is used slightly differently in different areas of math, but when we’re talking about statistics it’s used synonymously with the term “expected value.” The Greek letter $\mu$ is commonly used to represent the mean. If you want the mathy details, it’s calculated this way:

$$ \mu = \sum_k x_k P(x_k)$$

where $x_k$ is the outcome (i.e. “5”) and $P(x_k)$ is the probability of that outcome (i.e. “11.11%” or 0.1111). For our purposes, though, it’s enough to know that the mean tries to measure the middle of a distribution. If the data is perfectly symmetric (like ours is), it tells you what value is in the center. In the case of our dice, the mean is seven, which is what we’d expect the average to be if we made many rolls.

The standard deviation (usually represented by $\sigma$), on the other hand, describes the spread or width of the distribution. Its definition is a little more complicated than the mean:

$$ \sigma = \sqrt{\sum_k P(x_k) (x_k-\mu)^2} $$

But again, for our purposes it’s enough to know that it’s a measurement of how wide the distribution is, or how much it deviates from the mean. A distribution with a larger $\sigma$ is wider than a distribution with a smaller $\sigma$, which means that any given roll could be farther away from the mean. For our distribution, the standard deviation works out to $\sqrt{35/6} \approx 2.42$.
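Both formulas are easy to check against the two-dice distribution (Python):

```python
# P(x) for the sum of two dice, as (ways to roll x) / 36
ways = {s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s)
        for s in range(2, 13)}

mu = sum(x * w for x, w in ways.items()) / 36                     # expected value
sigma = (sum(w * (x - mu) ** 2 for x, w in ways.items()) / 36) ** 0.5

print(mu, round(sigma, 3))  # 7.0 2.415
```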

The thing I want you to note is that neither of these terms tell us anything about error. We aren’t surprised if we roll the dice and get a 10 or 12 instead of a 7. We don’t return them to the manufacturer as defective. The mean and standard deviation tell us a little bit about the range of results we can get when we roll two dice. To talk about error, we need to start looking at actual results of dice rolls, not just the theoretical probability distribution for two dice.

Things Start Getting Dicey

Okay, so let’s pretend we have two dice, and we roll them 100 times. We keep track of the result each time, and plot them on a histogram like so:

The outcome of 100 rolls of two six-sided dice.

Now, this doesn’t look quite the same as our expected distribution. For one thing, it’s definitely not symmetric – there were more high rolls than low rolls. We could express that by calculating the sample mean $\mu_{\rm sample}$, which is the mean of a particular set of data (a “sample”). By calling this the sample mean, we can keep straight whether we’re talking about the mean of the sample or about the mean of entire probability distribution (often called “population mean”). The sample mean of this data set is 7.40, as shown in the upper right hand corner of the plot, which is higher than our expected value of 7.00 by a fair amount.

We can also calculate a sample standard deviation $\sigma_{\rm sample}$ for the data, which again is just the standard deviation of our data set. The sample standard deviation for this run is 2.52, which is a bit higher than the expected 2.42 because the distribution is “broader.” Note that the maximum extent isn’t any wider – we don’t have any rolls above 12 or below 2 – but because the distribution is a little “flatter” than usual, with more results than expected in some of the extremes and fewer in the middle, the sample standard deviation goes up a little.
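You can replicate this experiment in a few lines (Python, seeded for repeatability; the particular values won't match the figure, since that was a different random sample):

```python
import random
import statistics

random.seed(7)
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100)]

sample_mean = statistics.fmean(rolls)  # mean of this particular 100-roll sample
sample_std = statistics.stdev(rolls)   # n-1 (sample) standard deviation

print(round(sample_mean, 2), round(sample_std, 2))
```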

But note that, by themselves, neither $\mu_{\rm sample}$ nor $\sigma_{\rm sample}$ tell us about the error! They’re still just describing the probability distribution that the data in the sample represents. At best, we might be able to compare our results to the theoretical $\mu$ and $\sigma$ we found for the ideal case to identify how our results differ. But it’s not at all clear that this tells us anything about error. Why?

Because maybe these dice aren’t ideal. Maybe they differ in some way from our model. For example, maybe you’ve heard the term “weighted dice” before? What if one of them is heavier on one side? That might cause it to roll e.g. 6 more often than 1, and give us a slightly different distribution. You could call that an “error” in the manufacturing of the dice, perhaps, but that’s not what we generally mean when we talk about statistical error.

So perhaps it’s time we seriously considered what “error” means. After all, it’s hard to identify an “error” if we haven’t clearly defined what “error” is. Let’s say that we perform an experiment – we make our 100 die rolls and keep track of the results, and generate a figure like the one above. And in addition, let’s say we’re primarily interested in the mean of this distribution; we want to know what the average result of rolling these particular two dice will be. We know that if they were ideal dice, it should be seven. But when we ran our experiment, we got a mean of 7.40.

What we really want to know is the answer to the question, “how accurate is that result of 7.40?” Do we trust it so much that we’re sure these dice are non-standard in some way? Or was it just a fluke? Remember, there’s absolutely no reason we couldn’t roll 100 twelves in a row, because each dice roll is independent of the last, and it’s a random process. It’s just really unlikely. So how do we know this value we came up with isn’t just bad luck?

So let’s say the “error” in the sample mean is a measure of accuracy. In other words, we want to be able to say that we’re pretty confident that the “true” value of the population mean $\mu$ happens to fall within the interval $\mu_{\rm sample}-E < \mu < \mu_{\rm sample} + E$, where $E$ is our measure of error. We could call that range our confidence interval, because we feel pretty confident that the actual mean $\mu$ of the distribution for our dice happens to be in that interval. We’ll talk about exactly how confident we are a little bit later.

It should be clear now why comparing our distribution to the “ideal” distribution doesn’t tell us anything about how reliable our results are. We might know that the sample mean differs from the ideal, but we don’t know why. It could be that our dice are defective, but it could also just be a random fluctuation. But since nothing we’ve discussed so far tells us how accurate our measured sample mean is, we don’t know for sure. To get that, we need to figure out how to represent $E$, the number that sets the bounds on our confidence interval.

It’s a common misconception that $E$ should just be the sample standard deviation $\sigma_{\rm sample}$. You may have seen results presented like $\mu \pm \sigma$, or $7.40 \pm 2.52$, to suggest an interval of confidence. That is, generally speaking, not correct. Or at least, very misleading. Because that’s not what the standard deviation means.

What we really want here is something called the standard error, though it’s also commonly called the standard error of the mean. It’s also sometimes (mistakenly or carelessly) called the “standard deviation of the mean,” but we’ll clarify the difference in a second. I like the term “standard error of the mean,” because it makes it clear that this is a measurement of the accuracy of the sample mean. As you might guess, it’s closely related to the sample standard deviation, but not quite the same. It’s calculated by dividing the sample standard deviation by the square root of the number of individual “trials,” or dice rolls, $N$:

$${\rm SE_{\mu}} = \frac{\sigma_{\rm sample}}{\sqrt{N}}.$$

This, at long last, is a good measurement of error. It’s worth noting that the standard deviation of the mean is defined similarly, but uses the true standard deviation of the distribution:

$${\rm SD_{\mu}} = \frac{\sigma}{\sqrt{N}}.$$

The reason the two are often used interchangeably is that we generally don’t know what the actual distribution looks like, nor do we know the expected values of $\mu$ and $\sigma$. Sometimes we do, of course; if we have a theory describing the process we’re measuring, then we can often calculate the theoretical values of $\mu$ and $\sigma$. But we don’t always know if our experiment matches the theory as well as we’d like – for example, if one of the dice is weighted and rolls more sixes than ones.

And sometimes, we don’t have a well-described theory at all, we just have a pile of data. This is the case for most Simulationcraft data runs, because we don’t have an easy analytical function that accurately describes your DPS due to any number of factors: procs, avoidance, movement, and so on. In that sort of situation, we can never truly know $\sigma$, so the lines between ${\rm SE}_{\mu}$ and ${\rm SD}_{\mu}$ blur a little bit, and we tend to get sloppy with terminology.
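Since rolling dice is cheap in code, we can check empirically that the sample means of repeated experiments really do fluctuate by about $\sigma/\sqrt{N}$, and that the standard error computed from any single sample is a decent estimate of that fluctuation. A sketch (the variable names and repetition counts are my own choices):

```python
import random
import statistics

random.seed(0)

N = 100             # rolls per experiment
EXPERIMENTS = 2000  # how many times we repeat the whole experiment

def one_experiment(n):
    """Roll two dice n times; return (sample mean, standard error of that mean)."""
    rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(n)]
    return statistics.mean(rolls), statistics.stdev(rolls) / n ** 0.5

results = [one_experiment(N) for _ in range(EXPERIMENTS)]
means = [m for m, _ in results]

# How much the sample means actually fluctuate from experiment to experiment
empirical_sd_of_mean = statistics.stdev(means)

# The theoretical standard deviation of the mean: sigma / sqrt(N),
# with sigma = sqrt(35/6) ~ 2.42 for the sum of two d6
theoretical_sd_of_mean = (35 / 6) ** 0.5 / N ** 0.5

# The standard error reported by a typical single experiment
typical_se = statistics.mean(se for _, se in results)

print(f"empirical spread of sample means: {empirical_sd_of_mean:.3f}")
print(f"theoretical SD of the mean:       {theoretical_sd_of_mean:.3f}")
print(f"typical single-experiment SE:     {typical_se:.3f}")
```

All three numbers land around 0.24, which is exactly why we get away with using ${\rm SE}_{\mu}$ as a stand-in for ${\rm SD}_{\mu}$ when we don’t know the true $\sigma$.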

Double Standards

Now, we’ve thrown around a lot of terms that have “standard deviation” in them. It’s no wonder the layperson is easily confused by statistics. So it’s worth spending a moment to make the differences between these terms abundantly clear. Let’s reiterate quickly why we use standard error to describe the accuracy of the sample mean rather than just using $\sigma$ or $\sigma_{\rm sample}$.

We have a theoretical probability distribution describing the result of rolling two 6-sided dice. Here’s what each of the terms we’ve discussed so far tells us:

  • The mean (or “population mean”) $\mu$ tells us the average value of a single roll.
  • The standard deviation $\sigma$ tells us about the fluctuations of any single dice roll. In other words, if we make a single roll, $\sigma$ tells us how much variation we can expect from the mean. When we make a single roll, we’re not surprised if the result is $\sigma$ or $2\sigma$ away from the mean (ex: a roll of 9 or 11). The more $\sigma$s a roll is away from the mean, the less likely it is, and the more surprised we are. Our distribution here is finite, in that we can never roll less than two or more than 12, but in the general case a probability distribution could have non-zero probabilities farther out in the wings, such that talking about $4\sigma$ or $5\sigma$ is relevant.
  • The sample mean $\mu_{\rm sample}$ tells us the average value of a particular sample of rolls. In other words, we roll the dice 100 times and calculate the sample mean. This is an estimate of the population mean.
  • The sample standard deviation $\sigma_{\rm sample}$ tells us about the fluctuations of our particular sample of rolls. If we roll the dice 100 times, we can calculate the sample standard deviation by looking at the spread of the results. Again, this is an estimate of the population’s standard deviation, and it tells us how much variation we should expect from a single dice roll.
  • The standard deviation of the mean $SD_{\mu}$ tells us about the fluctuations of the mean of an arbitrary sample. In other words, if we proposed an experiment where we rolled the dice 100 times, we would go into that experiment expecting to get a sample mean that’s pretty close to (but not exactly) $\mu$. $SD_{\mu}$ tells us how close we’d expect to be. For example, under normal conditions we’d expect to get a result for $\mu_{\rm sample}$ that is between $\mu-2{\rm SD}_{\mu}$ and $\mu+2{\rm SD}_{\mu}$ about 95% of the time, and between $\mu-2.5{\rm SD}_{\mu}$ and $\mu+2.5{\rm SD}_{\mu}$ about 99% of the time.
  • The standard error of the mean $SE_{\mu}$ tells us about the fluctuations of the mean of our particular sample of rolls. Once we actually make those 100 rolls, and calculate the sample mean and sample standard deviation, we can state that we’re 95% confident that the “true” population mean $\mu$ is between $\mu_{\rm sample}-2{\rm SE}_{\mu}$ and $\mu_{\rm sample}+2{\rm SE}_{\mu}$, and 99% confident that it’s between $\mu_{\rm sample}-2.5{\rm SE}_{\mu}$ and $\mu_{\rm sample}+2.5{\rm SE}_{\mu}$.
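We can put that 95% claim to the test directly: simulate a large number of 100-roll experiments and count how often the interval $\mu_{\rm sample}\pm 2{\rm SE}_{\mu}$ actually captures the true mean of 7. A quick sketch (seeded so it’s reproducible; the 2 here is the rounded version of 1.96):

```python
import random
import statistics

random.seed(1)

TRUE_MEAN = 7.0  # population mean for two ideal d6
N = 100          # rolls per experiment
TRIALS = 1000    # number of repeated experiments

covered = 0
for _ in range(TRIALS):
    rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N)]
    m = statistics.mean(rolls)
    se = statistics.stdev(rolls) / N ** 0.5
    # Does the 95% confidence interval around this sample mean contain 7?
    if m - 2 * se <= TRUE_MEAN <= m + 2 * se:
        covered += 1

print(f"{covered / TRIALS:.1%} of the intervals contained the true mean")
```

The coverage comes out right around 95%, which is the whole point of the confidence interval.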

You can see why this gets confusing. But the key is that the standard deviation and sample standard deviation are telling you about single rolls. If you roll the dice once, you expect to get a value between $\mu+2\sigma$ and $\mu-2\sigma$ about 95% of the time.

Whereas the standard deviation of the mean and standard error tell us about groups of rolls. If we make 100 rolls the sample mean should be a much better estimate of the population mean than if we made only a handful of rolls. And if we make 1000 rolls, we should get a better estimate than if we only made 100 rolls.

So we use the standard deviation of the mean to answer the question, “if we made 100 rolls, how close do we expect $\mu_{\rm sample}$ (our sample mean) to be to $\mu$ (the population mean)?” And we use the standard error to answer the related (but different!) question, “now that I’ve made 100 rolls, how accurately do I think my calculated $\mu_{\rm sample}$ (sample mean) approximates $\mu$ (the population mean)?”

You might wonder what voodoo tricks I played to get these “95%” and “99%” values. These come from analysis of the normal distribution, which is a probability distribution that comes up frequently in statistics. If your probability distribution is normal, then about 68% of the data will fall within one standard deviation in either direction. Put another way, the region from $\mu-\sigma$ to $\mu+\sigma$ contains 68% of the data. Likewise, the region from $\mu-2\sigma$ to $\mu+2\sigma$ contains about 95% of the data, and over 99.7% of the data will fall between $\mu-3\sigma$ to $\mu+3\sigma$.
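You don’t have to take those percentages on faith. For a normal distribution, the fraction of the data within $\pm k\sigma$ of the mean is $\mathrm{erf}(k/\sqrt{2})$, which Python’s standard library computes directly:

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} sigma: {normal_coverage(k):.2%}")
```

This reproduces the familiar 68/95/99.7 rule (and if you feed it 1.96, you get 95.0% on the nose, which is where the “1.96” in confidence-interval formulas comes from).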

Our probability distribution isn’t a normal distribution. First of all, it’s truncated on either side, while the normal distribution goes on infinitely in either direction (we’ll never be able to roll a one or 13 or 152 with our two dice). Second, it’s a little too discrete to be a good normal distribution – there isn’t quite enough granularity between 2 and 12 to flesh the distribution out sufficiently. It’s really more of a triangle than a nice Gaussian, though it’s not an awful approximation given the constraints. Luckily, none of that matters! As it turns out, the reason our distribution looks vaguely normal is closely related to the reason that we use the normal distribution to determine confidence intervals.

Limit Break

The Central Limit Theorem is the piece that completes our little puzzle. Quoth the Wikipedia,

the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed.

That’s a bit technical, so let’s break that down and make it a bit clearer with an example. We start with a dice roll (a “random variable”) that has some probability distribution that doesn’t change from roll to roll (“a well-defined expected value and well-defined variance”) and each roll doesn’t depend on any of the previous ones (“independent”). Now we roll those dice 10 times and calculate the sample mean. And then roll another 10 times and calculate the sample mean. And then do it again. And again, and again, and… you get the idea (“a sufficiently large number of iterates”). If we do that, and plot the probability distribution of those sample means, we’ll get a normal distribution centered on the population mean $\mu$.

The beautiful part of this is that it doesn’t matter what the probability distribution you started with looks like. It could be our triangular dice roll distribution or a “top-hat” (uniform) distribution or some other weird shape. Because we’re not interested in that; we’re interested in the sample means of a bunch of different samples of that distribution. And those are normally distributed about the mean, as long as the CLT applies. Which means that when we find a sample mean, we can use the normal distribution to estimate the error, regardless of what probability distribution that the individual rolls obey.
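Here’s a sketch of the CLT at work. The “weighted die” below is entirely made up for illustration; the point is that even though its distribution is heavily skewed toward 1, the sample means of groups of rolls still pile up symmetrically around the population mean:

```python
import random
import statistics

random.seed(2)

# A made-up "weighted die": 1 comes up far more often than 6
faces = [1, 2, 3, 4, 5, 6]
weights = [10, 4, 2, 1, 1, 1]
pop_mean = sum(f * w for f, w in zip(faces, weights)) / sum(weights)  # ~2.05

# Take the sample mean of 30 rolls, 5000 times over
means = [statistics.mean(random.choices(faces, weights, k=30)) for _ in range(5000)]

# Despite the skewed source distribution, the sample means land
# symmetrically around the population mean, as the CLT promises
print(f"population mean: {pop_mean:.3f}")
print(f"mean of sample means: {statistics.mean(means):.3f}")
print(f"median of sample means: {statistics.median(means):.3f}")
```

The mean and median of the sample means agree closely, which is a quick symmetry check: the distribution of means has forgotten how lopsided the underlying die was.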

Now, there are two major caveats here that cause the CLT to break down if they aren’t obeyed:

  • The random variables (rolls) need to be independent. In other words, the CLT will not necessarily be true if the result of the next roll depends on any of the previous rolls. Usually this is the case (and it is in our example), but not always. There are two wow-related examples I can think of off the top of my head.

    Quest items that drop from mobs aren’t truly random, at least post-BC (and possibly post-Vanilla). Most quest mobs have a progressively increasing chance to drop quest items, such that the more of them you kill, the higher the chance of an item dropping. This prevents the dreaded “OMG I’ve killed 8000 motherf@$#ing boars and they haven’t dropped a single tusk” effect (yes, that’s the technical term for it).

    Similarly, bonus rolls have a system where every failed bonus roll will cause a slight increase in the chance of success with your next bonus roll against that boss. So this would be another example where the CLT won’t apply, because the rolls aren’t truly independent.

  • The random variables need to be identically distributed. In other words, the probability distribution can’t be changing in-between rolls. If we swapped one of our 6-sided dice out for an 8-sided or 10-sided die, all of the sudden our probability distribution would change and there would be no guarantee that the CLT would apply.

    You might ask if you could cite either of the two examples of dependence here as examples of non-identical distributions. After all, in each case the probability distribution is changing between rolls. However, that change is due to dependence on previous effects – in a sense, the definition of dependence is “changing the probability distribution between rolls based on prior outcomes.” So dependence is a more specific subset of this category.

If either of those things occur, then we can’t be sure that the CLT is valid for our situation. Luckily, none of that applies to our dice-rolling example, so we can properly apply the CLT to estimate the error in our set of 100 rolls.

Keep Rollin’ Rollin’ Rollin’ Rollin’

So now that we’ve talked a lot about deep probability theory, let’s actually put it to use and calculate the error in our experiment. The standard error of our 100-roll sample is:

$$ {\rm SE}_{\mu} = \sigma_{\rm sample}/\sqrt{N} = 2.52/\sqrt{100} = 0.252 $$

To get our 95% confidence interval (CI), we’d want to look at values between $\mu_{\rm sample}-2{\rm SE}_{\mu}$ and $\mu_{\rm sample}+2{\rm SE}_{\mu}$, or $7.40 \pm 0.504$. And sure enough, the actual value of the population mean (7.00) falls within that confidence interval. Though note that it didn’t have to – there was still a 5% chance it wouldn’t!

We could improve the estimate by increasing the number of dice rolls. For example, what if we rolled the dice 1000 times instead? That might look something like this:

The result of 1000 rolls of two six-sided dice.

The outcome of 1000 rolls of two six-sided dice.

We see that our new sample mean is $\mu_{\rm sample}=6.95$ and our sample standard deviation is $\sigma_{\rm sample}=2.41$. But now $N=1000$, so our standard error is much smaller:

$$ {\rm SE}_{\mu} = \sigma_{\rm sample}/\sqrt{N} = 2.41/\sqrt{1000} = 0.0762$$

As before, we’re 95% confident that our sample mean is within $\pm 2{\rm SE} = 0.1524$ of the population mean in one direction or the other, and sure enough it is.

Of course, we could keep going. Here’s what 10000 rolls looks like:

The outcome of 10000 rolls of two six-sided dice.

The outcome of 10000 rolls of two six-sided dice.

And if we calculate our standard error for this distribution, we get:

$$ {\rm SE}_{\mu} = \sigma_{\rm sample}/\sqrt{N} = 2.43/\sqrt{10000} = 0.0243$$

So now we’re pretty sure that the value of 7.01 is correct to within $\pm 0.0486$, again with 95% confidence. Like before, there’s no guarantee that it will be – there’s still that 5% chance it falls outside that range. But we can solve that by increasing our confidence interval (say, looking at $\pm 3{\rm SE}_{\mu}$) or by repeating the experiment a few times and thinking about the results. If we repeat it 100 times, we’d expect about 95 of them to cluster within $\pm 2{\rm SE}_{\mu}$ of 7.00.

You may have noticed that while the confidence interval is shrinking, it’s not doing so as fast as it did going from 100 to 1000. That’s because we’re dividing by the square root of $N$, which means that to improve the standard error by a factor of $a$, we need to make $a^2$ times as many rolls. So if we want to increase our accuracy by a whole decimal place (a factor of 10), we need to make 100 times as many rolls. This is important stuff to know if you’re designing an experiment, because you don’t want your graduate thesis to rely on making five trillion dice rolls. Trust me.
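The $\sqrt{N}$ scaling is worth seeing in numbers. A small sketch (using the exact $\sigma=\sqrt{35/6}\approx 2.42$ for two d6; the helper names are mine):

```python
import math

SIGMA = math.sqrt(35 / 6)  # exact standard deviation for the sum of two d6, ~2.42

def standard_error(n):
    """Standard error of the mean after n rolls."""
    return SIGMA / math.sqrt(n)

def rolls_needed(current_n, improvement):
    """Rolls required to shrink the standard error by a factor of `improvement`."""
    return current_n * improvement ** 2

for n in (100, 1000, 10000):
    print(f"N = {n:>6}: SE = {standard_error(n):.4f}")

print(rolls_needed(100, 10))  # 10000 rolls for one extra decimal place
```

Going from 100 to 10000 rolls only buys one extra decimal place of accuracy, which is why simulation iteration counts balloon so quickly.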

You probably also noticed that the more rolls we make, the more the sample probability distribution resembles the ideal “triangular” case we arrived at theoretically. That’s to be expected – the more rolls we make, the better the sample approximates the real distribution. This is related to another law (the amusingly-named law of large numbers) that’s important for the CLT, but I don’t have time to go into that here. But it was worth mentioning just because “law of large numbers” is probably the best name for a mathematical law ever.

Finally, I mentioned that our “triangular” distribution for two dice looks vaguely normal, and that this relates to the CLT somehow. Here’s how. Each die is essentially its own random variable with a “flat” or “uniform” probability distribution (you have an equal chance to roll any number on the die). So when we take two of them and calculate the sum, we’re really performing two experiments and finding two sample means (with a sample size of 1 roll each). The sum of those two sample means, which is just twice the average of the sample means, is our result. This is exactly how we phrased our description of the CLT!

The reason we get a triangle rather than a nice Gaussian is that two dice is not “a sufficiently large number of iterates.” There is, unfortunately, no clean closed-form expression for this probability distribution for arbitrary numbers of $s$-sided dice (something called the binomial distribution works when $s$=2, i.e. for coin flips). But if we rolled 5 dice or 10 dice instead of two, and added all of those up, we’d start to get a distribution that looked very much like a normal distribution. And in fact, if you read either of the articles linked in this paragraph, you’ll see that they both become well-approximated by a normal distribution as you increase the number of experiments (die rolls).

World of Stat-craft?

Now that you’ve read through 4000 words on probability theory, you may ask where the damn World of Warcraft content is. The short answer: next blog post. But as a teaser, let’s consider a graph that shows up in your Simulationcraft output:

A DPS distribution generated by Simulationcraft.

When you simulate a character in SimC, you run some number of iterations. Each iteration gives you an average DPS result, which is essentially one result of a random variable. In other words, each iteration is comparable to a single roll of the dice in our example experiment. If we run a simulation for 1000 iterations, that gives us 1000 different data points, from which we can calculate a sample mean (367.7k in this case), a sample standard deviation, and a standard error value.

And all of the same statistics apply here. This plot gives us the “DPS distribution function,” which is equivalent to the triangular distribution in our experiment. The DPS distribution looks Gaussian/normal, but be aware that there’s no reason it has to be. It generally will look close to normal just because each iteration is the result of a large number of “RNG rolls,” many of which are independent. But some of those RNG rolls are not independent (for example, they may be contingent on the previous die roll succeeding and granting you a specific proc, like Grand Crusader). With certain character setups you can definitely generate DPS distributions that deviate significantly from a normal distribution (skewed heavily to one side, for example).

But again, because of the Central Limit Theorem, we don’t care that much what this DPS distribution function looks like. As long as each iteration is independent, we can use the normal distribution to estimate the accuracy of the sample mean. So we can calculate the standard error and report that as a way of telling the user how confident they should be in the average DPS value of 367.7k DPS.
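As a sketch of the bookkeeping involved (this is not SimC’s actual code, and the DPS numbers are fabricated stand-ins), computing the reported mean and its 95% interval from a pile of per-iteration results looks like:

```python
import math
import random
import statistics

random.seed(3)

# Fabricated per-iteration DPS results (stand-ins, not real SimC output).
# A real run's distribution need not be normal; the error math below doesn't care.
iterations = 1000
dps = [random.gauss(367_700, 15_000) for _ in range(iterations)]

mean_dps = statistics.mean(dps)
se = statistics.stdev(dps) / math.sqrt(iterations)

# Report the mean with a 95% confidence half-width
print(f"DPS: {mean_dps:,.0f} +/- {1.96 * se:,.0f} (95% CI)")
```

Note that the half-width shrinks as $1/\sqrt{\rm iterations}$, so quadrupling the iteration count halves the reported error.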

At the very beginning of this post, I said I was looking into a strange deviation from the expected error. What I was finding was that my observed errors were larger than what Simulationcraft was reporting. Next time, we’ll look a little more closely into how Simulationcraft reports error, and discuss the specifics of that effect – why it was happening, and how we fixed it.


5.4.2 Rotation Analysis

In December, I talked about the code I’ve written to automate the testing of Simcraft profiles. In that post, I tackled the two easiest simulations to write: glyphs and talents. In both of those cases, we’re just editing a single line of the .simc file, so it was a fairly simple job of tweaking that line and repeating. Of course, there was the entire superstructure of code surrounding that idea, which is what took far longer than the (relatively) simple logic required to swap out talents and glyphs.

Today I present the results of the other end of the spectrum – one of the most difficult sims to write. Because today we’re going to look at rotations.

If you haven’t read the previous post, I recommend you go back and do so now.  Or at least re-read the “Automating Simcraft” portion of it. I’ll refresh your memory about certain points, but I’m going to assume that you’re familiar with the basics of how this code operates. In short, if you don’t remember that we piece together a .simc file from discrete components (i.e. a player, a gear set, a rotation, a set of glyphs, a set of talents, etc.), then you should probably go re-read that section.

Note that I’ve taken to calling each of these components “blocks” in the rest of this post. That’s what I tend to call them in my head, and it’s faster than typing “component” over and over. Plus, I think it gives a nice visual – sort of like building the .simc file out of a bunch of different distinct Lego pieces.

Rotations Schmotations

You might ask what makes the rotation sim significantly harder than, say, a glyph sim. The short (and woefully incomplete) answer is that it involves changing more than one line of the .simc file we feed to the executable.

I say “woefully incomplete” because that statement encompasses a lot more than just swapping out a single component.  For example, in the glyph simulation, we kept the same player block, gear block, rotation block, and so on, and just swapped out the glyph block. We did that by pre-generating a glyph block for all of the different glyph combinations we were interested in and cycling through them.

On its face, it seems like that same logic could apply to the rotation simulation. We could just generate 100 different rotation blocks that describe the different rotations we’re interested in, and then swap them in and out one by one to get the results. Right?

Wrong. Oh, so wrong…

That might work fine for a really simple rotation simulation where we only consider combinations of basic abilities. For example, we limit ourselves to Crusader Strike, Judgment, Avenger’s Shield, Holy Wrath, Hammer of Wrath, and Consecration. That would be enough to figure out the basic gist of the rotation, for sure.

But it should be obvious that this list is missing a few important abilities. What if we want to include Sacred Shield, or one of our level 90 talents? All of those have to go into the rotation somewhere. And the sim won’t use them unless we’ve talented them. So, first of all, that means we need to swap the talent block out at the same time as the rotation block. And not just that, but we need some way to know which talent block to use when – it’s no good if we use a talent block with Light’s Hammer when we’re testing Execution Sentence rotations. That seems like an obvious and trivial problem to solve, but it’s still an extra moving part we need to consider in a sim that’s already going to be pretty complicated.

Because it’s not just talents we need to worry about, either. Let’s say we want to look at execute-range rotations in particular. We might want to know if Holy Wrath changes priority when Final Wrath is glyphed. But to do that, we need to enable that glyph, or else use it by default. But there may be cases where we don’t want it on, either. So we need to be able to swap glyphs too.

Further, we need to be able to specify conditionals in the action priority list (APL). So that, for example, we can compare
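As a hypothetical illustration (the specific ability and conditional here are just examples written in standard SimC APL syntax, not the lines from the actual sim), we might want to compare an unconditional action line against one gated behind a conditional:

```
# an unconditional cast:
actions+=/holy_wrath

# versus the same ability gated behind a conditional:
actions+=/holy_wrath,if=glyph.final_wrath.enabled&target.health.pct<=20
```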




Now, of course, that’s not really a problem in theory, because we could just write each block by hand and take care of all of that. But we might have hundreds of rotations, and the risk of making a small, unnoticed but relevant error in one of them is pretty high when you’re talking about writing that many by hand. Also, if you really expected me to write hundreds of rotation files by hand, you’re kidding yourself.

We’ll still need a good shorthand for identifying rotations in tables anyway, and if you’re going to write a robust shorthand, then you may as well automatically generate the rotation blocks from that shorthand. That gives us the consistency we want (because there will never be an error in “HW” in one file that doesn’t exist everywhere else) and makes tables easy to read. But it adds another complication: now we need to write a translator that goes between the shorthand and a full SimC file, complete with all of the options and conditionals we might want to use.

You can already see why this snowballed into one of the more complicated sims to write. And it’s not even necessarily the hardest – the AoE one may be more annoying still depending on what exactly we want to calculate!

The Nitty Gritty Details

So, in short, this is how the simulation works. I’ve divided the rotations we care about into groups (which, in a sad turn of events, I’ve called “blocks” in the code…. oops? I’ll be consistent about calling them “groups” here, though). Each group has a defined set of talents and glyphs, because for the most part those vary on a group level. So there’s a “Basic” group, an “Execute” group that focuses on Hammer of Wrath and Final Wrath, a “Defensive” group that’s primarily for testing Sacred Shield, and a “Level 90” group that tests all the level 90 talents.

In addition, I have the ability to enable custom talents per rotation. So for example, within the Level 90 group, it will automatically check each rotation to see which level 90 talent it uses and tweak the talent block to enable that talent. It also does this for the Sacred Shield rotations in the Defensive group. I signify this by adding “+custom” to the end of the talent block, which is the flag the code looks for to decide whether it needs to perform this check.

In theory I could do the same thing with glyphs, I suppose, but I found that I didn’t really need to. It wouldn’t be difficult to modify the code to do that in the future if we decide it’s necessary.

The rest of the difficulty was coming up with the abbreviation scheme for abilities and their conditionals. Thinking ahead, I wanted this to be extendable to other classes, so I set it up such that each class can have its own definitions. For a paladin, CS will always mean Crusader Strike, but if we’re simming another class it could translate to something different.

The abilities were fairly easy, since I’ve been using a standard notation for them in the old MATLAB code for years. They are:

Ability Shorthands
Shorthand Ability
CS Crusader Strike
CSw Crusader Strike followed by a /wait (see below)
HotR Hammer of the Righteous
J Judgment
AS Avenger's Shield
HW Holy Wrath
HoW Hammer of Wrath
Cons Consecration
SS Sacred Shield
ES Execution Sentence
LH Light's Hammer
HPr Holy Prism
SotR Shield of the Righteous
EF Eternal Flame
WoG Word of Glory

In the earlier code, we used a bracketing technique for options, which was very powerful but led to really long rotation names. This time around, I’m trying to keep the names fairly compact for display purposes, so I went with a slightly different method. Each option has a shorthand and gets appended to the ability shorthand with a plus sign (‘+’). The options I have enabled at this point are:

Conditional Shorthands
Shorthand Conditional
W# add a /wait after the ability if the cooldown is less than or equal to # seconds
GC# buff.grand_crusader.up or buff.grand_crusader.remains<#
DP buff.divine_purpose.react
DPHP# (buff.divine_purpose.react|holy_power>=#)
ex target.health.pct<=20
FW glyph.final_wrath.enabled&target.health.pct<=20
HP# holy_power>=#
nt !ticking
nF target.debuff.flying.down
SW talent.sanctified_wrath.enabled&buff.avenging_wrath.react
T# active_enemies>=#
R# buff.(ability_string).remains<#

So for example, AS+GC would translate into
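a SimC action line along these lines (this is my reading of the GC entry in the table above, so the exact line the code generates may differ):

```
actions+=/avengers_shield,if=buff.grand_crusader.up
```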


Not all of these are in use in the data I’ll present today, but they’re all coded and potentially usable. I expect that we’ll add a bunch of action priority lists to the simulation after we’ve analyzed the results in this post. For example, it might be interesting to see if “Cons+nt” has any effect, but it wasn’t high on my list of priorities when I was putting this together so I didn’t include it.
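To give a concrete feel for how such a translator can work, here’s a heavily stripped-down Python sketch. The dictionary contents come from the tables above, but the function and its structure are mine, not the actual matlabadin code:

```python
# Hypothetical translator sketch; not the actual matlabadin implementation.
ABILITIES = {
    "CS": "crusader_strike",
    "HotR": "hammer_of_the_righteous",
    "J": "judgment",
    "AS": "avengers_shield",
    "HW": "holy_wrath",
    "HoW": "hammer_of_wrath",
    "Cons": "consecration",
}

CONDITIONALS = {
    "GC": "buff.grand_crusader.up",
    "DP": "buff.divine_purpose.react",
    "ex": "target.health.pct<=20",
    "FW": "glyph.final_wrath.enabled&target.health.pct<=20",
    "nt": "!ticking",
}

def translate(shorthand):
    """Turn e.g. 'AS+GC' into a SimC action line."""
    ability, *options = shorthand.split("+")
    line = "actions+=/" + ABILITIES[ability]
    if options:
        line += ",if=" + "&".join(CONDITIONALS[o] for o in options)
    return line

print(translate("AS+GC"))  # actions+=/avengers_shield,if=buff.grand_crusader.up
print(translate("HW+FW"))
print(translate("CS"))
```

The real translator also has to handle per-class ability tables and parameterized options like W# and HP#, but the shape of the problem is the same: split on ‘+’, look everything up, and join the conditionals with ‘&’.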

There’s one special case I want to mention. The “wait” conditional works something like this: CS+W0.35 translates to:
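something along these lines (my reconstruction from the W# description above; the exact lines the code generates may differ):

```
actions+=/crusader_strike
actions+=/wait,sec=cooldown.crusader_strike.remains,if=cooldown.crusader_strike.remains>0&cooldown.crusader_strike.remains<=0.35
```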


As you might expect from the default APL for protection, this almost always nets an increase in holy power generation because it prevents us from doing silly things like CS-X-X-X-CS. That can otherwise happen in situations where one or more of the X’s were spells, so the GCD ends a little before CS becomes available. As a result, we’ll almost always want to follow CS with a wait. Since that comes up a lot, and I didn’t want to type CS+W0.35 all the time in the interest of keeping the rotation abbreviations short and readable, I’ve defined the shorthand “CSw” to implicitly mean “CS+W0.35”.

As a final note, I want to mention that this simulation is limited to GCD-based abilities. In other words, I’m using the same precombat actions and the same finishers in each rotation. I’m basically bolting the rotations below together with the precombat actions and the following default finisher definitions:


This ensures that the changes we see are purely due to any change in holy power generation or dead time in the rotations themselves. And in any event, since our active mitigation is decoupled from the GCD, it’s not really part of our “rotation” in a strict sense. It’s stuff we use when necessary and available based on the resources, not based on whether they’re more or less important than e.g. CS.  We’ll analyze the finisher options specifically in a later sim in much the same way we do here for the rotation. Luckily, that sim will be a lot easier to write!

As usual, all of the code can be found in the matlabadin repository. This sim uses a lot of files, but the master one that controls it all is:

All of the results can be found in the /io/ directory, along with the results of the glyph and talent simulations. The sims are labeled appropriately with “>” replaced by “_”.
(Ex: rotation_paladin_protection_CSw_J_AS_HW_HoW_Cons.html)


We’ll go through each of the rotation groups one at a time, briefly discussing what makes them unique and why we’ve made the choices we have. They all use the default T16N profile gear set (which includes 4T16) and are pitted against the T16N25 TMI calibration boss. The default talents include Unbreakable Spirit, Eternal Flame, and Divine Purpose unless otherwise specified. Everything else should be provided in the details below.

I’ll note that for all of these simulations, I’ve set the number of iterations to 250k. Yes, that’s a lot, but it’s necessary to get the degree of accuracy we want.

The “DPS Error” that Simulationcraft reports is really the half-width of the 95% confidence interval (CI). In other words, it is 1.96 times the standard error of the mean: we feel that there’s a 95% chance that the actual mean DPS of a particular simulation is within +/- DPS_Error of the mean reported by that simulation. There are some caveats to this statement, insofar as it makes some reasonably good but not air-tight assumptions about the data, but it’s a solid estimate.
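To make that concrete, here’s the calculation in Python (the function name and implementation are mine, not SimC’s actual code):

```python
import math

def dps_error(dps_samples):
    """Half-width of the 95% confidence interval: 1.96 * standard error of the mean."""
    n = len(dps_samples)
    mean = sum(dps_samples) / n
    # sample variance (Bessel-corrected)
    var = sum((x - mean) ** 2 for x in dps_samples) / (n - 1)
    sem = math.sqrt(var / n)  # standard error of the mean
    return 1.96 * sem
```

Since the standard error shrinks like 1/sqrt(N), quadrupling the iteration count only halves the reported error, which is why it takes something as extreme as 250k iterations to tighten it meaningfully.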

I’m actually doing a little statistical analysis on SimC results right now to investigate some deviations from this prediction, but that’s enough material for another blog post, so I won’t go into more detail yet. What it means for us, though, is that in practice I’ve found that when you run the sim for a large number of iterations (i.e. 50k or more) the reported confidence interval tends to be a little narrower than the observed confidence interval you get by calculating it from the data.

So for example, at 250k iterations we regularly get a DPS Error of approximately 40. In theory that means we feel pretty confident that the DPS we found is within +/-40 of the true value. In practice, it might be closer to +/- 100 or so.

Why does that matter for us? Well, we want to know if one rotation is better than another in a statistically significant sense. Based on the theoretical estimate, this means that as long as they’re farther apart than 80 DPS, we can trust that the higher-DPS rotation is better. In practice, I think we should expand that bound a bit, at least to 100 DPS, and probably to 200 DPS if we’re going to be generous and assume that there could be other sources of systematic error that we don’t know about. I’ve seen the same rotation sim up to 300 DPS differently from two separate runs, so I’m inclined to be a little more generous in my error estimate than SimC is.

And keep in mind that we’re looking at a mean value of almost 400k DPS in these sims. 400 DPS is a change of 0.1%, which is minuscule, and not likely to swing an encounter one way or another. Even if our sims are accurate to that level, that’s right around the point where you prioritize mental bandwidth over DPS gain and choose the rotation that’s simpler to execute. So I’d probably be hesitant to ascribe any real significance to differences that are smaller than 1000 DPS, which is still less than a 1% change.

Basic Rotation Group

This group of rotations is focused on determining the order of operations for our basic abilities, excluding talents and execute range. From this, we determine our “ideal” base rotation, which we then go about tweaking in the other groups.

In this set, we use just two glyphs: Focused Shield and Word of Glory. We could have included Divine Protection, but we want to be able to compare the survivability results to those obtained in later groups which use all three glyph slots on DPS glyphs. Plus, there’s really not a lot to learn from glyphing Divine Protection here. It’s our only feasible survivability glyph and it’s so highly situational that there’s no guarantee we’re using it for a given boss.

In addition to the table, the sim spits out the maximum DPS Error measurement of the group (each rotation is fairly similar in that regard, so it didn’t make sense to include it on the table) and the talents and glyphs used:

Max DPS Error: 41
Talents: 312232
Glyphs: focused_shield/word_of_glory

Basic Rotations
Rotation DPS HPS DTPS TMI Var SotR Wait
CS>J>AS>Cons>HW 373603 160013 160353 6212 2062 71.0% 14.5%
CS>J>AS>HW>Cons 379608 159521 159854 4287 971 71.4% 13.2%
CS+W0.3>J>AS>HW>Cons 373814 157738 158054 533 117 73.2% 12.7%
CSw>J>AS>Cons>HW 368204 157862 158182 460 47 73.1% 13.8%
CSw>J>AS>HW>Cons 373591 157666 157983 427 55 73.2% 12.7%
CSw>J>HW>AS>Cons 372798 157616 157932 410 77 73.3% 12.5%
CSw>HW>J>AS>Cons 359765 161552 161890 626 61 69.6% 15.6%
HW>CSw>J>AS>Cons 363466 161565 161905 806 122 69.6% 14.9%
CSw>AS>J>HW>Cons 373952 158483 158804 451 38 72.5% 12.8%
J>CSw>AS>HW>Cons 368396 162575 162942 90576 83031 68.5% 17.4%
J>AS>CSw>HW>Cons 372886 163157 163529 61525 35811 67.9% 17.0%
AS>J>CSw>HW>Cons 372490 163965 164342 174459 57861 67.2% 17.4%
AS>CSw>J>HW>Cons 378485 159759 160092 1971 365 71.2% 13.0%
HotR+W0.35>J>AS>HW>Cons 371877 159558 159894 6714 5015 71.4% 13.2%
AS+GC>CSw>J>AS>HW>Cons 374633 157958 158289 727 145 72.9% 12.7%
CSw>AS+GC>J>AS>HW>Cons 373734 157767 158086 409 52 73.1% 12.7%
CSw>AS+GC>J>HW>AS>Cons 373243 157700 158021 391 43 73.2% 12.5%
CSw>AS+GC>J>HW>Cons>AS 372838 158084 158405 429 79 72.8% 12.3%

Note that you can sort the table by a particular column by simply clicking on that column’s header. The “Var” column reports the measurement of “TMI Error,” which is really more of an uncertainty or variance measure due to the nature of the TMI distribution. Basically, treat that column as the +/- on the measured TMI value. The “Wait” column tells us how much time the sim spends waiting while the GCD is available, either because there’s nothing to cast or because we’re hitting the /wait action.

Before sorting, it’s clear that waiting for CS’s cooldown to come up is a significant survivability gain. The more subtle thing to notice is that it’s actually a slight DPS loss, mostly because CS hits like a limp noodle. There are a number of reasons for that, but the primary one is that CS’s damage increases far more slowly with attack power than the rest of our abilities do. So the higher Vengeance gets, the worse CS is compared to just about everything else we could cast.

A lot of the features here are expected. Dropping CSw below anything else in priority gives you a large survivability loss. It’s worth noting that the “CSw>AS+GC>J>*” rotations near the bottom produce some very low TMI results, but I’m still a bit skeptical of these. The SotR uptime isn’t any higher than the default (CSw>J>AS>HW>Cons), nor are the TMI values lower in a statistically significant sense.

If we sort by DPS, we see that the top rotation is actually the one where we don’t wait for CS’s cooldown, again because CS is such a weak ability at this point. But after that one, we have a bunch of rotations that emphasize AS in various ways. This can be summarized with a pretty simple rule of thumb: “if you don’t care about survivability and need max DPS right now, prioritize AS.”

There are a bunch of rotations where I push Holy Wrath up ahead of CS/J/AS. These aren’t interesting from a survivability point of view, because they uniformly increase our TMI. They also seem to uniformly reduce DPS compared to the standard CSw>J>AS>HW>Cons. We’ll have to revisit these in the execute range group where we have Final Wrath glyphed, which is where we might expect a high HW prioritization to bear fruit.

The HotR rotation I threw in has the same wait as CSw, so it’s directly comparable to a CSw rotation. This is really only relevant in cases where you want to know how much single-target damage you’re sacrificing to cleave to adds now that Weakened Blows is applied by both abilities. Nonetheless, we see it’s about a 1700 DPS loss to use HotR instead of CS. Not really a big deal in the grand scheme of things; we’re talking about less than a 1% difference. CS and HotR both hit so weakly it’s almost irrelevant which you use.

I also want to call attention to the TMI and Var columns again quickly. If you sort by either of these, you’ll see that as TMI goes up, so does the variance. This is one significant drawback of the current TMI formula – because it’s an exponential metric, the variance tends to be rather large when TMI is large. Increasing the number of iterations doesn’t end up helping it much, because it’s just not anything resembling a Gaussian distribution.

The two take-home messages I want to get across here are:

  • Unless two TMI values differ by more than the sum of their Var columns, it’s not 100% clear that they’re different in a statistically significant sense. So TMIs of 400 and 500 are roughly identical if their Vars are 100 or more, but you could safely say that a TMI of 400 is better than e.g. a TMI of 1000. We’re looking for order-of-magnitude effects in TMI, because that’s how the metric was constructed.
  • This will be fixed in TMI v2.0, which I’m working on currently. More on that soon, maybe next week if I have time to write.
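The first rule of thumb above can be written down directly (a trivial helper of my own devising, not anything SimC provides):

```python
def tmi_distinct(tmi_a, var_a, tmi_b, var_b):
    """Two TMI scores are only clearly different if they differ by more
    than the sum of their Var (+/-) columns."""
    return abs(tmi_a - tmi_b) > (var_a + var_b)
```

So, using the example from the text: `tmi_distinct(400, 100, 500, 100)` is False (not distinguishable), while `tmi_distinct(400, 100, 1000, 100)` is True.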

Next, let’s look at the execute rotations.

Execute Rotation Group

In this case, we want to find out how we vary the basic CSw>J>AS>HW>Cons rotation in execute range. That means we need to know where to slot in Hammer of Wrath and what (if anything) to do about Holy Wrath when Final Wrath is glyphed.

Since we can already look at the table above to figure out what happens when Final Wrath isn’t glyphed, this group includes it by default along with Focused Shield and Word of Glory.

Max DPS Error: 41
Talents: 312232
Glyphs: focused_shield/word_of_glory/final_wrath

Execute Rotations
Rotation DPS HPS DTPS TMI Var SotR Wait
CSw>J>AS>HW>Cons>HoW 383714 157727 158045 379 30 73.2% 11.2%
CSw>J>AS>HW>HoW>Cons 384536 157678 157999 436 111 73.2% 11.0%
CSw>J>AS>HoW>HW>Cons 383834 157566 157879 431 121 73.3% 10.9%
CSw>J>HoW>AS>HW>Cons 383380 157868 158196 529 135 73.0% 10.9%
CSw>HoW>J>AS>HW>Cons 383612 157968 158297 519 123 72.9% 10.9%
HoW>CSw>J>AS>HW>Cons 383963 158370 158738 2348 761 72.5% 11.0%
CSw>J>HW+FW>AS>HW>HoW>Cons 384673 157751 158072 397 41 73.2% 11.0%
CSw>J>AS+GC>HW+FW>AS>HW>HoW>Cons 384381 157701 158020 416 59 73.2% 11.0%
CSw>HW+FW>J>AS>HW>HoW>Cons 384846 158089 158426 458 65 72.8% 11.0%
HW+FW>CSw>J>AS>HW>HoW>Cons 385184 158096 158435 632 229 72.8% 11.0%

We can clearly see that Hammer of Wrath should slot in ahead of Consecration but behind Holy Wrath. TMI values vary somewhat depending on how far ahead of other abilities you put it, but note that HoW>J and J>HoW don’t differ much because both are 6-second cooldowns, so they don’t generally clash all that often.

However, if we push HoW ahead of CSw we get a significant TMI increase without realizing any sort of DPS gain compared to slotting it behind Holy Wrath. This is a little different from the results we got with the old MATLAB sims in 5.2, which suggested HoW was a DPS increase at the top of the priority queue. My guess is that the change is due to two factors: switching our L45 talent from SS to EF and losing Grand Crusader procs.

In 5.2, we had fewer empty GCDs because we’d be refreshing Sacred Shield every 30 seconds and using up more Grand Crusader procs, which ended up leaving less room for Hammer of Wrath and other fillers. Now, we have a larger number of empty GCDs to work with, so using Hammer of Wrath doesn’t necessarily push another filler back multiple cycles. And since we have those extra GCDs more regularly, it’s not worth pushing it ahead of the basic CS-J cycle; it’s just more efficient to slot it back in wherever it fits without delaying heavy-hitters like AS and Final-Wrath-glyphed Holy Wrath (can we just call it “Final Wrath” in execute range?).

Speaking of Final Wrath, it looks like that does hit hard enough to be a DPS increase at the front of the queue, for relatively little cost in TMI. The CSw>J>AS+GC>HW+FW>AS>HW>HoW>Cons rotation is particularly interesting in that it gives you a small (~300) DPS boost without sacrificing any holy power generation. But at the same time that difference is right at (or below) our error threshold, so it’s not clear that’s a realizable gain. By the time we’re looking at 0.1% DPS increases, we’re splitting more hairs than we probably should.

So the conclusion here seems to be that the filler order ought to be HW>HoW>Cons, and during execute range you can prioritize “Final Wrath” as high as you want for a DPS gain, realizing that you’re sacrificing a little survivability if you use it instead of a holy power generator.

Next up: Defensive rotations.

Defensive Rotation Group

While I called this the “Defensive” category, it should really just be called the “Sacred Shield” category, since that’s the only defensive spell in here. And with EF being so strong in Siege of Orgrimmar, Sacred Shield is mostly irrelevant anyway. But I’m including it for completeness, and to highlight how strong EF really works out to be.

One oversight here is that this group doesn’t take advantage of the T16 4-piece. The default finisher block has lines for SotR usage and Eternal Flame maintenance, but there’s nothing in there for Word of Glory. As a result, we expect to see a drop in SotR uptime corresponding to losing the 4-piece bonus, as well as an increase in TMI. In the future, I’ll be adding a line like this:
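Something along these lines, where the buff names are my best guess at the SimC identifiers for the set bonus:

```
actions+=/word_of_glory,if=buff.bastion_of_power.react&buff.bastion_of_glory.react>=3
```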


which should appropriately use WoG whenever we have 3 stacks of the 4T16 buff to fish for extra Divine Purpose procs. For now, just keep in mind that the results in this group aren’t strictly comparable to the ones with EF for survivability purposes. However, they should still be accurate for comparing the SS rotations against one another if you’re hell-bent on running Sacred Shield.

Max DPS Error: 41
Talents: 313232
Glyphs: focused_shield/word_of_glory/final_wrath

Sacred Shield Rotations
Rotation DPS HPS DTPS TMI Var SotR Wait
CSw>J>AS>HW>HoW>Cons>SS 367667 111879 120831 230953 15250 68.8% 3.3%
CSw>J>AS>HW>HoW>SS+R1>Cons 370778 113784 122153 151427 5737 69.9% 8.9%
CSw>J>AS>HW>HoW>SS+R1>Cons>SS 367419 111758 118767 116705 3947 68.8% 3.3%
CSw>J>AS>HW>SS+R1>HoW>Cons>SS 367205 111699 118253 99323 3247 68.8% 3.3%
CSw>J>AS>SS+R1>HW>HoW>Cons 368861 113136 119949 100877 3655 70.0% 9.1%
CSw>J>AS>SS+R1>HW>HoW>Cons>SS 366914 111684 118101 98289 3051 68.8% 3.3%
CSw>J>AS+GC>SS+R1>AS>HW>HoW>Cons>SS 366800 111667 118048 98425 4289 68.8% 3.3%
CSw>J>AS>SS+R2>HW>HoW>Cons>SS 367064 111661 118041 95695 2790 68.8% 3.3%
CSw>J>AS>SS+R3>HW>HoW>Cons>SS 366863 111692 118097 98717 2810 68.8% 3.3%
CSw>J>AS>SS+R4>HW>HoW>Cons>SS 366868 111676 118065 96511 2942 68.8% 3.3%
CSw>J>AS>SS+R5>HW>HoW>Cons>SS 366886 111665 118051 96597 3131 68.8% 3.3%
CSw>J>AS+GC>SS+R1>AS>HW>HoW>Cons 368770 112856 119307 87275 2789 70.0% 9.1%
CSw>J>SS+R1>AS>HW>HoW>Cons 368547 112834 119190 87898 2947 70.0% 9.1%
CSw>SS+R1>J>AS>HW>HoW>Cons 368234 112800 119149 96260 5332 69.7% 9.2%
SS+R1>CSw>J>AS>HW>HoW>Cons 368466 112689 119040 95784 7286 69.7% 9.1%

First, notice that all of the TMI values on this table are in the 100k range, compared to ~400 when we use Eternal Flame. Some of that is the utter dominance of EF over SS at high AP/Vengeance, some of it is because the Shield of the Righteous uptime is lower by a few percent because we’re no longer leveraging the 4-piece bonus. Note that our SotR uptime is a little higher here than the ~64% range we saw in the 4T16 post; we’re averaging around 69% instead.

You might wonder why that is – after all, in that earlier post we said the 4T16 benefit is about 10% SotR uptime, and we’re not taking advantage of the 4-piece in this group of sims. However, when we talent Sacred Shield we also don’t have to maintain Eternal Flame, which means we can spend that holy power on SotR instead, making up about half of the difference. If we were fishing for extra DP procs with Word of Glory, SotR uptime should actually catch up to what we get with Eternal Flame.

In any event, there’s not a lot to say here. TMI obviously improves as we increase the priority of refreshing SS (“SS+R1” means “refresh SS if it’s got less than 1 second left”), but there’s no advantage to putting it ahead of CS or J. I added the CSw>J>AS+GC>SS+R1>AS>HW>HoW>Cons option at the last minute on a hunch, as I suspected that would truly be the low-TMI option after looking at the rest of the results, and it paid off. I’m not entirely sure why this performs better than the identical rotation with an extra “>SS” tacked onto the end, though. It’s clear that it’s causing some kind of holy power generation loss based on the SotR uptimes, but I don’t really see how. Something to investigate for later, I guess.
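In APL terms, the “+R1” suffix on those rotations translates to a conditional refresh of the form (again, illustrative syntax):

```
actions+=/sacred_shield,if=buff.sacred_shield.remains<1
```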

I also want to draw attention to the fact that refreshing it 2 seconds early seems to be the sweet spot. One second puts it off long enough that sometimes you get short gaps due to the GCD. Three seconds or longer tends to be no more effective than two seconds. I don’t know offhand why the SS+R3 version scored so poorly, but again, it could just be RNG given that the Var column is nearly 3000.

That’s enough about Sacred Shield, let’s move on to the level 90 talents.

Talent Rotations Group

This is the fun group, where we make use of our “+custom” talent flag. Basically, we’re just swapping the L90 talent appropriately so that we have the ability the rotation calls for.

There are two things we’re checking for in this sim. First, what’s the “default” place to slot each talent into the rotation, ignoring what section of the encounter we’re in. Then, we want to try and fine-tune that by specifying execute rotations to see if there’s an advantage to increasing the priority during execute. We might care about that because once Hammer of Wrath becomes available, we don’t have that many empty GCDs to work with, so we could inadvertently ignore a L90 talent (or at least delay it for a long time) if we slot it behind Hammer of Wrath.

I’ve decided to split this group up into three tables for ease of filtering/sorting.

Max DPS Error: 41
Talents: 312232+custom
Glyphs: focused_shield/word_of_glory/final_wrath

Execution Sentence Rotations
Rotation DPS HPS DTPS TMI Var SotR Wait
CSw>J>AS>HW>HoW>Cons>ES 404748 157849 158167 591 332 73.1% 9.6%
CSw>J>AS>HW>HoW>ES>Cons 406988 157828 158147 454 110 73.1% 9.7%
CSw>J>AS>HW>ES>HoW>Cons 407136 157867 158187 406 46 73.1% 9.7%
CSw>J>AS>ES>HW>HoW>Cons 406419 157757 158077 517 235 73.2% 9.9%
CSw>J>ES>AS>HW>HoW>Cons 406392 157809 158126 433 48 73.1% 9.9%
CSw>ES>J>AS>HW>HoW>Cons 405022 157857 158175 539 117 73.0% 10.0%
ES>CSw>J>AS>HW>HoW>Cons 405441 158113 158437 665 134 72.7% 10.0%
CSw>J>AS>ES+ex>HW>ES>HoW>Cons 407045 157824 158144 432 91 73.1% 9.7%
CSw>J>AS+GC>HW>AS>ES>HoW>Cons 405890 157756 158075 452 125 73.1% 9.7%
CSw>J>AS+GC>HW+FW>AS>HW>ES>HoW>Cons 406777 157838 158157 385 35 73.1% 9.8%

Light’s Hammer Rotations
Rotation DPS HPS DTPS TMI Var SotR Wait
CSw>J>AS>HW>HoW>Cons>LH 393512 157962 158280 308 22 73.1% 9.6%
CSw>J>AS>HW>HoW>LH>Cons 393944 158002 158310 329 51 73.1% 9.8%
CSw>J>AS>HW>LH>HoW>Cons 394156 158013 158320 327 80 73.1% 9.8%
CSw>J>AS>LH>HW>HoW>Cons 393962 157918 158221 316 38 73.2% 10.0%
CSw>J>LH>AS>HW>HoW>Cons 394201 158002 158305 324 59 73.1% 10.0%
CSw>LH>J>AS>HW>HoW>Cons 394363 158040 158344 326 50 73.0% 10.0%
LH>CSw>J>AS>HW>HoW>Cons 394678 158286 158595 421 89 72.8% 10.0%
CSw>J>AS>LH+ex>HW>LH>HoW>Cons 393949 158018 158323 279 19 73.1% 9.8%
CSw>J>AS+GC>HW>AS>LH>HoW>Cons 392669 157944 158250 314 34 73.1% 9.7%
CSw>J>AS+GC>HW+FW>AS>HW>LH>HoW>Cons 393873 158029 158335 299 33 73.0% 9.8%

Holy Prism Rotations
Rotation DPS HPS DTPS TMI Var SotR Wait
CSw>J>AS>HW>HoW>Cons>HPr 396882 158186 158505 339 27 72.9% 7.6%
CSw>J>AS>HW>HoW>HPr>Cons 396093 158345 158655 336 37 72.8% 7.8%
CSw>J>AS>HW>HPr>HoW>Cons 395777 158348 158657 363 63 72.7% 7.9%
CSw>J>AS>HPr>HW>HoW>Cons 394603 158170 158476 300 23 72.9% 8.0%
CSw>J>HPr>AS>HW>HoW>Cons 394422 158173 158479 383 65 72.9% 8.0%
CSw>HPr>J>AS>HW>HoW>Cons 393947 158426 158732 343 35 72.7% 8.1%
HPr>CSw>J>AS>HW>HoW>Cons 395775 159554 159877 521 74 71.6% 8.2%
CSw>J>AS>HPr+ex>HW>HPr>HoW>Cons 395674 158362 158669 369 73 72.8% 7.9%
CSw>J>AS+GC>HW>AS>HPr>HoW>Cons 395192 158255 158564 463 115 72.9% 7.7%
CSw>J>AS+GC>HW+FW>AS>HW>HPr>HoW>Cons 395805 158399 158707 388 57 72.7% 7.9%

First, it’s clear that Execution Sentence is our damage option, with Holy Prism trailing it slightly and Light’s Hammer coming in at a close third place.

Execution Sentence seems to be a toss-up with Hammer of Wrath, with the two orderings neck and neck at around 407k DPS. The ES>HoW version is far enough ahead that I’m willing to believe it’s a little better, though again, we’re talking about differences that are right on the boundary of our error level. Still, level 90 talents are more fun than Hammer of Wrath, and when two rotations come this close in DPS that’s as good a criterion as any. This basically boils down to “ES>HoW” during execute range, since outside of execute the two rotations are identical. In other words, the ES rotation should be:

CSw>J>AS>HW>ES>HoW>Cons

None of the tweaked versions that prioritize things differently in execute range give us a significant improvement over that rotation, so we can rule them out.

Curiously, with LH we would be led to believe that prioritizing it above AS, J, or even CS is a DPS increase. That doesn’t make a lot of sense, though: ES hits harder, yet those rotations didn’t exhibit this same behavior. At this point, I’m inclined to believe that something fishy is going on here. I’d be tempted to call it an outlier, even though it’s several hundred DPS ahead of some of the other options, but it’s not just one rotation: all three of the rotations with LH near the top show the same effect. I’m not sure why that’s happening yet.

That said, if we ignore those three, perhaps on the grounds that it’s a HPG loss, then the same rotation that maximizes Execution Sentence is the best choice here as well. The CSw>J>AS>HW>LH>HoW>Cons rotation is the strongest performer when we’re not putting LH above holy power generators. Though again, the difference between that rotation and LH>HW>HoW or HW>HoW>LH is so small that any of them would be fine.

Holy Prism is an odd duck. It seems to enjoy – no, relish even – hanging out in the last spot. Moving it anywhere higher in the queue is a loss of 800 DPS or more, a large enough gap that we can feel pretty certain it’s statistically significant. Even playing some execute-range tricks with it doesn’t help.

This is actually pretty easy to explain. Consider the following three charts of damage per execute time (DPET) for the rotations CSw>J>AS>HW>HoW>Cons>L90:


DPET for CSw>J>AS>HW>HoW>Cons>ES


DPET for CSw>J>AS>HW>HoW>Cons>LH


DPET for CSw>J>AS>HW>HoW>Cons>HPr


Note that the DPET on Execution Sentence is far higher than any of our other spells, which is why it’s worth prioritizing ahead of HoW and Cons. The only reason it isn’t worth pushing higher in the queue is that we have enough gaps in our rotation that it’s better to use high-damage, low-cooldown spells like Holy Wrath first to minimize empty GCDs.

The DPET on Light’s Hammer is lower than ES, but still above everything else, so most of the same logic applies. Again, with the weird unexplained exceptions that we talked about earlier, which I’m likely chalking up to error (either in the LH results or in the ES results – I’m not really sure which!).

But the DPET on Holy Prism is only on par with Hammer of Wrath and Consecration. This is mostly because it doesn’t scale as well with attack power as the other Level 90 options. Or to state that more precisely, the spell’s attack power coefficient is similar to that of Consecration and Hammer of Wrath (all around 0.7ish, I believe), so in the high-Vengeance regime they all do about the same damage. Light’s Hammer and Execution Sentence have significantly larger attack power coefficients, and thus do a lot more damage in that regime.

Now, it’s worth noting that this doesn’t mean Holy Prism is badly balanced. The cooldown of Holy Prism is only 20 seconds, compared to 60 seconds for ES and LH. In theory, you could get 3 casts of Holy Prism off in the same time that you cast one of either of the other level 90 talents. And those three Holy Prisms would total more damage than a single LH cast, though less than a single ES.

But those three Holy Prisms also cost three GCDs, against the single GCD used by LH or ES. And that hurts Holy Prism in the “rotation priority” department, because it means we’re far more likely to be pushing something else back, effectively extending the cooldown of another spell and cutting into the DPS gain.

And if you have three spells that do very similar amounts of damage, one with a sub-6-second cooldown (HoW), one with a sub-9-second cooldown (Cons), and one with a 20-second cooldown (HPr), which one do you use first? Generally speaking, the one with the shortest cooldown, because you usually lose less DPS by pushing the long-cooldown spell back than you do by pushing the short-cooldown spells back. See Wrath-era Retribution theorycrafting for another example of this, where Crusader Strike was prioritized over harder-hitting spells simply because its cooldown was much shorter.
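A toy model makes this concrete: if casting a filler pushes another spell back by one GCD each cycle, that spell’s cast rate drops from 1/cooldown to 1/(cooldown + delay). The numbers below (100k damage per cast, 1.5s GCD, and the unhasted 6/9/20-second cooldowns) are illustrative assumptions, not sim output:

```python
def delay_cost(damage_per_cast, cooldown, delay):
    """Approximate DPS lost by pushing a cooldown-limited spell back by
    `delay` seconds every cycle: cast rate falls from 1/cd to 1/(cd+delay)."""
    return damage_per_cast / cooldown - damage_per_cast / (cooldown + delay)

# Three spells that hit for roughly the same amount, each delayed by one GCD:
how  = delay_cost(100e3, 6.0, 1.5)   # Hammer of Wrath
cons = delay_cost(100e3, 9.0, 1.5)   # Consecration
hpr  = delay_cost(100e3, 20.0, 1.5)  # Holy Prism
# how > cons > hpr: delaying the short-cooldown spells is the most expensive,
# so they go first and Holy Prism brings up the rear.
```

With these numbers the loss is roughly 3333 DPS for HoW versus only about 349 for Holy Prism, which is exactly why HPr relishes the last spot.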

Another reason for the discrepancy in DPET (and DPS) in our L90 talents is that LH and Holy Prism have some utility that they’re being balanced around. Both spells do a good bit of healing. Light’s Hammer works a lot like a raid cooldown, while Holy Prism does less of it but does it up-front. Holy Prism also has availability going for it, in that you can use it more frequently – something that anyone who’s tried to pick up a group of loose adds will recognize as a life saver.

In any event, that was a slight tangent; the take-home message of this last table is that Holy Prism gets to bring up the rear in our priority list.


This was an incredibly long post, and I didn’t even begin to go over the results in the sort of detail that I could given more time. But I’m pretty sure I hit most of the more important things. Still, it’s worth summarizing what we learned, or at least reinforced.

From this data, we’d ideally want to follow this rotation:

CSw>J>AS>HW>ES>LH>HoW>Cons>HPr

with the caveat that I’m obviously assuming you’re taking Eternal Flame instead of Sacred Shield, that you’re not doing any fancy Holy Wrath prioritization during execute range, that you’re glyphing Focused Shield and Word of Glory, and that you’re ignoring whichever two talents you don’t currently have chosen.

We know that skipping the wait (of up to about a third of a second) for CS to come off of cooldown is a notable DPS gain, as is prioritizing Avenger’s Shield. In both of those cases, however, we suffer a noticeable decrease in survivability.

We also know that it’s a small DPS gain to push Holy Wrath higher in the queue during execute range if we’re using the Final Wrath glyph, but that this comes with a small survivability loss as well.

We know that if we’re taking Sacred Shield, we want to slot it in somewhere in the filler section to refresh it when the duration is almost up. It should probably be a gain to tack it on to the end of the queue to fill an empty GCD as well, but the data is inconclusive here, so the jury’s still out on that one.

And of course the Level 90 talent results are already incorporated into the rotation given above.

It’s also worth noting what we didn’t check here and to be clear about the limitations of this data set. We haven’t attempted to try any additional L75 talent options, so all we have is Divine Purpose data. Holy Avenger shouldn’t vary things too much, insofar as most of HA’s effect is simply more off-GCD SotR spammage. But it could cause rotations that try to increase DPS by prioritizing something over a holy power generator to fail miserably because each holy power generator is basically adding 2/3 of a SotR in damage during HA. On the other hand, they’re also less effective outside of HA than they would be with Divine Purpose, so who knows! Note also that we’re tanking the boss full-time here, so the effective uptime of Holy Avenger isn’t being considered.

Likewise, we didn’t test how Sanctified Wrath affects things. We’re already fairly sure that it pushes Judgment up ahead of CSw in priority, but we don’t know if it changes filler priority at all. Those are all on the list of things to add for next time.

We’re also simming the most bland encounter possible: solo-tanking Patchwerk forever. There’s no movement, no sudden or predictable damage bursts from a boss special, no significant variation in damage patterns (i.e. it’s a steady stream of melee+DoT damage, not an oscillating pattern of heavy melee followed by heavy magic followed by heavy melee and so on…). Basically none of the things that make real encounters interesting.

So keep that in mind when interpreting the results. I may say “this data suggests X is better than Y,” but I’m always doing that within the context of this particular set of constraints. It’s reasonable to assume that it generalizes fairly well to other situations, but it won’t always, and it’s almost certainly not going to be iron-clad enough to be correct for every encounter.

As usual, a smart tank should be looking for those inconsistencies and adapting their play to the encounter rather than blindly relying on “but Theck said so!”


5.4.2 WeakAuras Strings

It’s been a little while since 5.4 was released, and I’ve still been tweaking my WeakAuras here and there as I go. I’ve finally made enough tweaks that I thought it was worth sharing with the class.

Again, the updated Paladin auras can all be grabbed at http://www.sacredduty.net/weakauras-strings/ along with auras for all of the other class/spec combinations I use regularly.

Weakened Blows

The first change is actually the removal of an aura that doesn’t have much meaning anymore. Now that Crusader Strike applies Weakened Blows, there’s no reason to be tracking its uptime. So I’ve removed that aura entirely and shifted the Eternal Flame & Sacred Shield icons over to fill the empty space.

Priority Row Shuffle

I’ve tweaked the order of the spells on the priority row a bit. I’ve been using Holy Prism more frequently lately – or to be more specific, I’ve been swapping that talent more frequently and using all three choices. Nowadays, my sims are telling me that all three of these choices fall above Consecration in priority. And more importantly, I found that I tended to forget about those spells since I had their icons so far off to the right.

So I’ve re-ordered the last few icons on the priority row. I’ve moved Execution Sentence, Light’s Hammer, and Holy Prism to the left and moved Consecration and Sacred Shield to the right.  Now the order looks like:

CS – J – AS – HW – HoW – (ES/LH/HPr) – Cons – SS

Swapping Consecration and Sacred Shield is a last-minute change that I made after recording the videos (and taking the screenshots) shown later in this post, so in those, Consecrate appears to be way off to the right. It looks a little cleaner now with that last-minute change, though (which is why I decided to make it!).

Tier 16 4-piece Indicator

The first new indicator is one that tells you the status of your Tier 16 4-piece bonus. When you reach 3 stacks of Bastion of Glory, you get a buff called Bastion of Power that makes your next Word of Glory or Eternal Flame free. It’s a very simple matter to track this buff in WeakAuras so you know when it’s available.

The indicator pulses if you have a 5-stack of Bastion of Glory to remind you that it’s at full strength. As per this comment on last week’s blog post, refreshing the buff immediately at 5 stacks of BoG tends to be an ideal strategy in the steady-state (i.e. at constant Vengeance). In practice you’d want to consider your current Vengeance level, of course.

The two new indicators I’ve added in 5.4.2, also showing the adjusted layout now that the Weakened Blows indicator is gone.

Eternal Flame Stoplight

To that end, I’ve added a new indicator. In this comment, Zil asked me if I could write an aura that would tell you whether refreshing Eternal Flame would give you a larger or smaller HoT. So I wrote up this “stoplight” indicator to give us that information.

Every time you cast EF, it calculates the strength of that EF and stores that value (much like the text indicators store the effective HP used, BoG level, haste, and AP you had at the time of the cast). It then calculates the value of a new EF given the current conditions and compares that to the stored value.

If the new value is at least 10% larger than the existing one, the indicator turns green. If it’s not, it stays red. Yes, this is the reverse of how I have the text indicators working, but if that bothers you, the colors are easily tweaked on the display tab of each aura. I tend to think of green as “good”: in this case the stoplight means “it would be good to cast EF,” while the text auras are telling me about the status of the existing EF (“your current EF is good, don’t mess with it”). I suspect most people will only use one or the other anyway.

There’s also a text indicator that shows exactly how much better the new EF will be. It’s literally just showing you (new EF value)/(old EF value) as a percentage, so if it reads 115% (as it does in the image above) it means recasting EF at this point will be a 15% increase in healing throughput. Note that this percentage can get very large in cases where you have a one-HP Eternal Flame active and you’re sitting on 5 stacks of Bastion of Glory.

Note that this indicator takes everything into account, as far as I know. It should accurately reflect changes in mastery, haste, Bastion of Glory stacks, spellpower, holy power, crit, and even Avenging Wrath. The only things I’ve omitted are constant factors like the 50% increase from self-casting (which you should always have) and the 5% Seal of Insight healing bonus (which you should probably also always have, since I don’t think many players are switching to Seal of Truth/Righteousness).
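For the curious, the comparison the stoplight performs boils down to something like the sketch below. This is not the actual Lua running inside WeakAuras, and the scaling coefficients in `ef_hot_value` are placeholders rather than Blizzard’s real spell data; it just illustrates the “stored value vs. fresh value” logic.

```python
import math

# Hypothetical model of an Eternal Flame HoT's strength at cast time.
# Coefficients are illustrative placeholders, not actual spell data.
def ef_hot_value(ap, holy_power, bog_stacks, haste, crit):
    base_tick = 0.1 * ap                        # placeholder per-tick AP scaling
    tick_count = math.floor(10 * (1 + haste))   # haste adds ticks over the duration
    bog_mult = 1 + 0.10 * bog_stacks            # +10% healing per Bastion of Glory stack
    crit_mult = 1 + crit                        # average contribution of crits
    return base_tick * holy_power * tick_count * bog_mult * crit_mult

# The stoplight itself: green if recasting beats the stored HoT by at least 10%.
# The second return value is the percentage shown by the text indicator.
def stoplight(stored_value, new_value, threshold=1.10):
    ratio = new_value / stored_value
    color = "green" if ratio >= threshold else "red"
    return color, round(ratio * 100)
```

For example, `stoplight(100.0, 115.0)` reports `("green", 115)`, matching the 115% reading described above: recasting now would be a 15% throughput gain.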

The video below shows the indicator during development at 4x the final size to make it easier to see how it works:


It didn’t occur to me to write a Sacred Shield stoplight until writing this post, but I’ll probably put one together in the next few weeks. I’ll toss it on pastebin, update the WeakAuras page with a link to it, and probably tweet about it, but I probably won’t give it its own blog post.
Jan 15 2014 edit: SS stoplight added, bundled with the EF stoplight. Also added crit scaling to the EF stoplight.

Aura Group Re-organization

Finally, I’ve had to re-organize the aura groups a little bit. Adding the code to support the EF Stoplight aura made one of the other groups too large for AceSerializer, which in turn broke importing. So I had to split them up, and they’re now organized a little differently. The three big groups haven’t changed:

Theck – Prot – Priority Row
Theck – Prot – Cooldowns Row
Vengeance Bars

Those aura groups are all independent and work perfectly well all by themselves. You can import any combination of those and they should work seamlessly.

All of the auras that give you specific information about Vengeance, Sacred Shield, and Eternal Flame now have a dependency: the “Vengeance/SS/EF Helper Auras” group. This group contains the code that saves a snapshot of your stats when you cast EF or SS, which is why it’s required for the other aura sets to work. They all perform calculations on that information to determine what to display, so without it, they don’t work.

Vengeance/SS/EF Helper Auras (required for the three sets below)

And finally, the auras that display EF and SS information; again, none of these work without the aura group linked above:

SS/EF Vengeance Bar Overlays
Vengeance/SS/EF text indicators
EF/SS Stoplight Auras

(Technically speaking, the simple Vengeance text indicator doesn’t require the helper auras, so if you’re a non-paladin tank class and want that, you can just grab the “Vengeance/SS/EF text indicators” group, delete everything with EF/SS in the title, and the Vengeance indicator should still work.)

Final Product

And after all of that, here’s how it looks in practice on a target dummy:

If you want to see the Vengeance Bar indicators at work, check out the old 5.4 video on the WeakAuras Strings page. And as always, that page contains the auras for all of the other classes I play or have played. If you have a question about what addon I’m using to create a certain UI element, check out the UI Construction and Key Bindings post, which should still be mostly accurate. If that doesn’t answer your question, feel free to ask in the comments.

Other Stuff

A handful of quick comments regarding Simulationcraft stuff before I go:

  • There’s a bug with Execution Sentence in Simcraft at the moment (in 542-1 at least, and several earlier builds). For some reason it’s not ramping up the damage of each tick appropriately for protection, even though it works perfectly for retribution. This is a non-trivial issue caused by some piece of code deep in the core, and at this point I couldn’t even give you a completely satisfactory answer for why it happens. But top men (meaning: people more competent than I) are on it. Top men, I say. Hopefully it will be fixed in the next build.
  • There’s also a small bug in the last couple revisions that causes Eternal Flame to be slightly undervalued. I put some code in there to handle the hotfix applied in September (when EF’s self-healing bonus was nerfed from 100% to 50%) and forgot to take it out once the spell database was updated to include that information. So EF was only getting a 25% bonus from being self-cast rather than a 50% bonus for a while, thus being undervalued by about 17%. Oops. Bad Theck. This will be fixed in version 542-2.
  • I’m almost done with the MATLAB automation code that runs the rotation simulation. This turned out to be a much larger and more annoying project than I expected (and I already expected it to be large and annoying). It’s arguably the most complicated of all of the sims because I had to allow the possibility of using different glyphs and talents for each rotation. Luckily, I should be able to finish it this week and have it ready for a blog post next week. Also luckily, the rest of the automation sims should be far easier to code than this one was.
  • Once those are done, I have the fun job of deciding how to make them translatable to other classes (if at all). I have to see if any of this code runs in Octave or FreeMat (unlikely, I use a lot of fancy structure and cell stuff), and if not, decide whether to translate all of this to another language so that other theorycrafters/players can contribute and use the code. I could also entertain the possibility of integrating some of these features into SimC itself in the long run (ex: a talent simulation would be pretty simple, I think), but that’s something I’ve yet to discuss with the other SimC devs.

That’s it for today.


Posted in Tanking, Theck's Pounding Headaches | 25 Comments

Itemization Value of 4T16

Last week, Fouton from Icy Veins asked me whether I had tried to determine an “ilvl value” for the tier 16 4-piece set bonus. Stated another way, the problem he was trying to solve was twofold:

  1. Is it worth using lower-ilvl tier pieces instead of non-set pieces just for the 4-set bonus?
  2. If so, how much lower? Is it worth using LFR tier instead of heroic warforged non-set?

Unfortunately, I didn’t have an answer for him. I knew the 4-piece was powerful, of course. There was no question that using tier pieces over warforged loot from the same difficulty level was a survivability gain. But I had never really looked into whether it would make sense to use much lower-ilvl tier instead of warforged gear.

I was unconsciously assuming that if you had access to warforged loot from e.g. heroic, then you also had access to off-set gear from that same difficulty mode, so the most you would care about is a 6-ilvl difference. But especially for guilds progressing through normal, or guilds at the mercy of the personal loot system of LFR/Flex modes, that’s not necessarily a good assumption. Surely there are cases where a player has an LFR tier chest or helm and warforged normal-mode off-set from a different boss, and wants to know what to wear?

So I threw together a few quick profiles in Simulationcraft to test this.


As a control group, we’ll just use the T16 normal-mode protection paladin profile. This uses four pieces of normal-mode T16 (head, shoulders, chest, and gloves) with non-warforged Legplates of Unthinking Strife as the one off-set piece. Note that none of the gear in this profile has valor upgrades applied. The stat breakdown is given below:

T16N Stats
Stat Amount
Strength 19540
Stamina 47990
Expertise Rating 5107
Hit Rating 2607
Crit Rating 1112
Haste Rating 15677
Mastery Rating 7602
Armor 60112
Dodge Rating 180
Parry Rating 1526

The rest of the setup is pretty much what you’d expect. Talents are Eternal Flame, Unbreakable Spirit, Divine Purpose, and Light’s Hammer, glyphs are Focused Shield, Alabaster Shield, and Divine Protection. 

I then worked up four different variant gear sets to compare. The first is a set where we downgrade two of our tier pieces to LFR level. We choose the chest and the shoulders for this, since the tier helm and gloves both have haste on them.  Since both chest and shoulders are expertise/mastery pieces with expertise reforged into haste, we lose a chunk of those secondary stats as well as some strength, stamina, and armor.

Since we don’t really want to deal with the hassle of reforging each gear set to cap expertise, we cheat a little bit by adding a shirt to the gear set that will put us over the cap. While this adds a little ambiguity to our results, it should be a larger boon to the non-set arrangements than the tier sets.

After doing all of that, our second gear set looks like this:

T16N-LFR Stats
Stat Amount Diff
Strength 18786 -754
Stamina 46506 -1484
Expertise Rating 6345 N/A
Hit Rating 2607 0
Crit Rating 1112 0
Haste Rating 15503 -174
Mastery Rating 7065 -537
Armor 59329 -783
Dodge Rating 180 0
Parry Rating 1526 0

For the next set, we replace the chest and shoulders with normal-mode off-set pieces. In each case we’ve gone for maximizing haste, so we’ve chosen Chestplate of Congealed Corrosion and Darkfallen Shoulderplates. In both cases we’ve used the warforged (ilvl 559) version and applied two valor upgrades for a net ilvl of 567. Since we’re using a hacked shirt with 2500 expertise on it, we’ve chosen not to reforge the shoulders and have used a crit->mastery reforge on the chest. This gives us the maximum bang for our buck since none of that extra itemization has to go into expertise.

The stats for that gear set look like this (note that “Diff” is still in reference to T16N):

T16N-WF Stats
Stat Amount Diff
Strength 20045 505
Stamina 48630 640
Expertise Rating 5896 N/A
Hit Rating 2607 0
Crit Rating 1827 715
Haste Rating 18053 2376
Mastery Rating 6821 -781
Armor 60551 439
Dodge Rating 180 0
Parry Rating 1526 0

The next set takes the previous one to the extreme and uses the heroic warforged versions of both chest and shoulders.

T16N-HWF Stats
Stat Amount Diff
Strength 20576 1036
Stamina 49676 1686
Expertise Rating 5896 N/A
Hit Rating 2607 0
Crit Rating 1928 816
Haste Rating 18420 2743
Mastery Rating 7044 -558
Armor 60958 846
Dodge Rating 180 0
Parry Rating 1526 0

In our final two gear sets, we go to the other extreme: what if we force the player to use four or all five LFR tier pieces, including the severely sub-optimal dodge/mastery legs? We’ll be kind and reforge the dodge on those legs to haste, and continue to compensate for expertise and hit caps by using a fake shirt.

T16N-4LFR Stats
Stat Amount Diff
Strength 18032 -1508
Stamina 45021 -2969
Expertise Rating 6182 N/A
Hit Rating 3992 N/A
Crit Rating 1112 0
Haste Rating 14971 -706
Mastery Rating 7065 -537
Armor 58686 -1426
Dodge Rating 180 0
Parry Rating 1353 -173
T16N-5LFR Stats
Stat Amount Diff
Strength 17680 -1860
Stamina 44406 -3584
Expertise Rating 6182 N/A
Hit Rating 3500 N/A
Crit Rating 1112 0
Haste Rating 13789 -1888
Mastery Rating 7262 -340
Armor 58294 -1818
Dodge Rating 821 641
Parry Rating 1353 -173

We take all six of these gear sets and run them through a 50k-iteration simulation against the T16N25 TMI boss. Anything not explicitly mentioned is identical to the defaults in the T16N profile.


Here’s what we get out the other side:


And summarizing the important bits in table format:

Gear Set TMI SotR Uptime DPS HPS HPSe
T16N 230.5 73.25% 380k 149834 149540
T16N-LFR 967.0 72.82% 377k 153704 153381
T16N-WF 4125.1 64.32% 386k 160163 159818
T16N-HWF 1705.0 64.90% 390k 157680 157362
T16N-4LFR 1627.3 71.93% 371k 156588 156240
T16N-5LFR 3457.7 70.25% 363k 157307 156937

It should be immediately apparent from the table that the T16N gear set performs the best for survivability. It has the lowest TMI by a large margin and the highest SotR uptime.

Using normal warforged off-set pieces (T16N-WF) may be a gain of 2376 haste, but you actually lose about 9% SotR uptime, which means losing the 4-piece is costing you over 10% SotR uptime all by itself. And of course, smoothness (as measured by TMI) suffers greatly; the TMI is about 20 times higher, which means the spikes are roughly 27% larger on average.
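If you want to do this TMI-to-spike-size conversion yourself, a rough approach is to use the logarithmic Beta_TMI form from the top of this post, which maps a ratio of two TMI scores to a fractional difference in spike size of ln(ratio)/F. This is only a ballpark sketch: it assumes the beta formula’s F = 10 carries over, while the percentages quoted in the text were presumably computed with the metric’s actual constants, so expect small discrepancies.

```python
import math

# Approximate conversion from a TMI ratio to a fractional spike-size
# difference, assuming the logarithmic Beta_TMI form with F = 10.
def spike_size_increase(tmi_ratio, F=10.0):
    return math.log(tmi_ratio) / F

spike_size_increase(4125.1 / 230.5)  # T16N-WF vs T16N: roughly 0.29, i.e. ~29% larger spikes
spike_size_increase(967.0 / 230.5)   # T16N-LFR vs T16N: roughly 0.14, i.e. ~14% larger spikes
```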

Upgrading those off-set pieces to heroic warforged (T16N-HWF) pieces cuts your losses somewhat, but still gives significantly worse results than the control set. It’s not a large increase in haste or SotR uptime over the normal warforged configuration, but the extra stamina drops the TMI to around 1700, still about 18% larger spikes than T16N.

The T16N-LFR gear set, on the other hand, outperforms both of the off-set configurations. The TMI is only about 4 times worse than T16N, corresponding to a 13% increase in spike size, but the SotR uptime isn’t that much lower. So there’s no question that using 2 pieces of LFR tier (chest and shoulders) to get the 4-piece bonus gives superior survivability to using two well-itemized heroic warforged items in those slots to get extra haste.

If you instead force the use of four or five LFR tier pieces, the situation gets worse. That’s a significant loss of haste and stamina, so the TMI is predictably much higher. 4LFR is roughly equivalent to the T16N-HWF set in TMI, making up for the significant stamina reduction with the higher SotR uptime of the 4-piece bonus. It’s solidly ahead of the T16-WF gear set in both categories.

5LFR is still better than the WF set that uses normal-mode warforged off-set, both in terms of TMI and SotR uptime. 5LFR gives higher SotR uptime than the HWF set, but it trails in TMI thanks to the extra stamina and secondary stats (~5k haste) of the heroic warforged gear. That said, I don’t think this situation will be very common – players that have access to heroic warforged off-set should rarely need to resort to LFR pieces to complete their tier set.

There are two other things I want to point out about this data. Note that the higher-ilvl sets also convey slightly higher DPS, which is something to consider. The difference isn’t large (less than 3%), but on a serious DPS check that might be worthwhile.

Also note that all of these results assume you’re using Eternal Flame and Divine Purpose. If you’re talenting Sacred Shield, then you can still game this effect with free Word of Glory casts to fish for more Divine Purpose procs, but the benefit will be reduced somewhat. And of course, if you’re not using Divine Purpose then the 4-piece bonus won’t help your SotR uptime at all, though it will still make you more survivable by virtue of removing the opportunity cost of having to heal yourself with Word of Glory.


I’m hesitant to assign an equivalent ilvl value to the 4-piece bonus for a few reasons. The first of which is something most people don’t think about: not all ilvls are created equal. The head, chest, and leg slots give you more stats per ilvl than the shoulder and glove slots do, so the exact ilvl value will depend on the particular slots in which you’re making the sacrifice. In addition, it will depend a bit on which off-set gear you have; we’ve only looked at two specific choices (shoulders/chest), so we’d get a different answer if we considered the head, glove, or leg slots.

However, it’s clear that under the right conditions the tier bonus is stronger than trading up 52 ilvls in two slots (the difference between T16N-LFR and T16N-HWF). We also know that it’s roughly equivalent to trading up 52 ilvls in shoulder/chest and gaining 25 ilvls in head/gloves (though in this case, with equivalent tier rather than off-set).

Beyond that we’d have to guess a little, or run more sims where we compare the tier sets to other sets that use only off-set pieces of much higher quality. That introduces the benefit of the 2-piece bonus as well, though that’s probably a relatively small effect. It’s clear that four heroic warforged off-set pieces would beat out four or more LFR tier pieces based on the data we already have. It seems unlikely that a set with four heroic warforged off-set would be able to compete with the T16N set though.

The take-home message here is that the 4-piece can be really, really strong if used properly, and it’s worth resisting the temptation of even significantly-higher-ilvl gear to keep it. In all but the most extreme cases, such as trading multiple LFR tier pieces for multiple heroic warforged off-set pieces, keeping the 4-piece is going to be the better call.

Again, that comes with some caveats: it assumes you’re talenting Eternal Flame and Divine Purpose. If you swap from Divine Purpose to Holy Avenger for an encounter, then the benefit is reduced (though not eliminated – it still makes EF easier to maintain); if you don’t use either DP or EF, then the benefit is smaller still, and depends on how often and effectively you use WoG as an emergency heal.

Posted in Simcraft, Tanking, Theck's Pounding Headaches, Theorycrafting, Uncategorized | 18 Comments

A Letter to Celesty Claus

Every Winter Veil, children of both factions write letters to Greatfather Winter and ask for toys and games. In the meantime, their parents are writing letters and saying prayers to a completely different deity: Celesty Claus, the great celestial dragon that maintains the cosmic (class) balance. Legends say that he flies through the sky on Winter Veil Eve showering the world with nerfs and buffs, and the occasional meteor by accident (one of the inherent downsides of automated shooting star delivery systems).

A rare picture of the elusive Celesty Claus.

Classes that were good that year are happy to wake up on Winter Veil morning to find buffs in their stockings. However, classes that were bad check their stockings with trepidation, because they know they’re only likely to receive nerfs.

The rest of the year is usually spent bickering about who got the best loot from Celesty Claus and why everyone else needs to be nerfed because they’re clearly overpowered in PvP. And asking for ponies.

Of course, he never brings ponies, because he’s heartless. I don’t mean that in a derogatory way, but in a literal, anatomical way. Dude’s made of stars; he’s powered by fusion reactions; he has no need for meat and sinew. How many ponies do you know that could survive the heat – not to mention the radiation exposure – of a body made of stars? So wishing for a pony is pretty stupid unless you want char-broiled, irradiated pony.

This is my letter to Celesty Claus for this year, specifically for protection paladins.

Dear Celesty Claus,

I know you’re a busy man… dragon… spectral titan construct… thing. So I’ll dispense with the milk and cookies and get right to the point. Which is asking you for stuff.

1. Please bring me a version of Holy Wrath that doesn’t have the damage-splitting effect. I get the original goal – a long time ago in a continent far away it was a neat way to give Retribution AoE damage that wasn’t “free” without adding another spell to their arsenal. But this is 2014, Retribution doesn’t even have Holy Wrath anymore. It’s ours now, and really should be designed around our needs.

And right now, we need snap aggro. We’re already strong on up to three targets thanks to Avenger’s Shield.  And our sustained aggro on large groups of mobs is also fine thanks to Consecrate. But the difficulty is picking up aggro on groups of 5+ mobs so that our sustained aggro can do its thing. On large groups, Holy Wrath hits weakly enough that it can’t compete with things like Dizzying Haze and Thunder Clap.

I realize that removing the damage splitting effect is a buff to Holy Wrath, and a buff to our sustained AoE DPS/aggro as well. I’m happy to accept a nerf to Consecration to balance out sustained DPS to make Holy Wrath a more useful spell.

2. Please bring us equitable talent choices on our level 45 tier. Eternal Flame is extremely strong even without our tier 16 four-piece set bonus. It really needs to be nerfed a little more in order for Sacred Shield to be a competitive option.

Likewise, Selfless Healer is in a pitiable state for protection. An instant Flash of Light, while nice, still costs a GCD, doesn’t heal for as much as a full-strength Word of Glory with 5 stacks of Bastion of Glory, and doesn’t come with the fringe benefits of Eternal Flame. Please give it some love so that somewhere, some protection paladin will feel like it’s worth taking.

If Selfless Healer could allow Flash of Light to be cast off of the GCD for protection, that would help a lot. But it also needs to heal for a lot more to make up for the fact that it doesn’t give you the long-term smoothness of Eternal Flame or Sacred Shield. Those two talents prevent spikes before they start by giving you predictable healing or absorption at regular intervals. For Selfless Healer to be able to compete with those two proactive talents, it has to be a very effective reactive choice.

It should really gain the full increase from Bastion of Glory so that the talent remains competitive as we stack mastery. Ideally, a Flash of Light cast with 3 stacks of Selfless Healer and 5 stacks of Bastion of Glory should heal for quite a bit more than a Word of Glory with 5 stacks of Bastion so that it’s your first go-to reactive tool. It’s trying to compete with two strong “over-time” effects, so it should condense the raw healing or absorption of those effects into a single huge shot. If it doesn’t heal for 80% of your health, it’s not really going to be competitive with an Eternal Flame that heals for 60%-70% and gives you several times your health over 30 seconds.

3. Please bring me a version of Consecrate that benefits from haste. As I’m sure you know, we paladins love haste, almost to the point of irrationality. And while Sanctity of Battle helpfully reduces Consecrate’s cooldown as we stack haste, it doesn’t change its tick interval. It ticks at fixed one-second intervals no matter what your haste level is.

The problem that arises here is that when we’re at high levels of haste, we can be in a position where we re-cast Consecrate before the previous one is done ticking. Since we can’t have two Consecrates on the ground, we end up clipping the earlier cast and losing ticks, reducing Consecrate’s damage per cast.
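To put rough numbers on the clipping, here’s a back-of-the-envelope sketch using MoP’s values (9-second duration and base cooldown), assuming Sanctity of Battle divides the cooldown by (1 + haste) while the 1-second tick interval stays fixed:

```python
import math

# How many Consecrate ticks you actually get if you recast it the moment it
# comes off cooldown. Assumes a 9s duration, a 9s base cooldown scaled by
# Sanctity of Battle, and a fixed 1s tick interval.
def ticks_per_cast(haste, duration=9.0, base_cd=9.0, tick_interval=1.0):
    cooldown = base_cd / (1.0 + haste)  # Sanctity of Battle shortens the cooldown...
    window = min(duration, cooldown)    # ...but recasting clips the previous cast
    return math.floor(window / tick_interval)

ticks_per_cast(0.00)  # 9 ticks: no haste, no clipping
ticks_per_cast(0.30)  # 6 ticks: a third of the cast's damage lost to clipping
```

So at 30% haste, recasting on cooldown throws away a third of each Consecrate’s ticks, which is exactly the diminishing-returns feel described below.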

In a single-target situation, it’s fine for Consecrate to be our lower-DPS filler that remains a low priority. However, in AoE situations it is a much higher priority. That reduces the effect of haste on our many-target sustained threat. It’s almost like the spell suffers from diminishing returns with respect to haste.

More importantly, it makes it trickier to use properly for novice paladins. You lose DPS if you recast it early, but you lose more DPS by bumping it lower in priority during AoE situations. The default unit frame even shows a little timer for it, which could be misleading for a novice. I’d just like to see it work more seamlessly with Sanctity of Battle so that it feels less awkward.

4. Please make seals interesting again. I remember losing auras. It was sad to lose something iconic, but at the same time they had devolved into a “set it and forget it” mechanic that didn’t add a lot of fun game play. If it’s something I won’t change for hours and has a minimal effect on my experience, it’s probably not worth keeping.

The problem is that seals feel very much the same way as protection. Seal of Truth has been neutered to the point that the DPS increase is negligible. Seal of Righteousness is similarly weak. And most importantly, Seal of Insight is such a strong survivability component that it is almost never worth giving up for either of the other two. You could remove Seal of Truth and Seal of Righteousness from protection and most tankadins wouldn’t even notice.

But the idea behind the Warlords of Draenor talent “Seal of Faith” is interesting. We would trade a bunch of damage output for healing output. Of course, it doesn’t make a whole lot of sense right now, because we don’t have the supporting tools to make that useful. But if we had a more extensive toolkit of healing spells, I could imagine using that talent to help my raid survive heavy raid damage phases.

I don’t think I’ll ever take that talent, because having Holy Shield back is just too cool, but it’s the thought that counts.

And in this case, I’d love to see all seals work on this basic principle of having a more significant effect on your play. Seal of Insight could be the default “tanking” seal that gives you a big chunk of survivability by increasing armor and healing throughput. Seals of Truth and Righteousness could sacrifice a lot of that self-healing to grant other benefits that are primarily useful while not tanking, much like Seal of Faith sacrifices damage for potentially more raid-healing or off-healing capability.

The one fear is that being able to swap between highly disparate modes could cause tank imbalances. We’ve seen this before, when one tank could switch from high damage output to high survivability by toggling stances, and it caused plenty of problems. It’s really something that all tanks need to be able to do in similar capacities for things to stay balanced.

But the alternative is to just redesign or eliminate seals. Seal twisting just isn’t very fun for the same reason most retribution paladins dislike Inquisition.  Spending resources now for a zero-damage GCD feels bad, even if the math says it’s an overall DPS increase. And for protection, the damage increase is rarely, if ever, worth the large survivability sacrifice of dropping Seal of Insight. If seals aren’t getting redesigned, I’d rather just see each spec get one seal: Seal of Insight for protection and holy, Seal of Truth for retribution.

If you really want to go a little radical with the redesign option, give us one “passive” seal and make the others active abilities that operate like cooldowns. Seal of Righteousness could replace the active seal, granting its usual effect for 15-20 seconds on a one-minute cooldown, and then automatically swap back to the “default” seal after the effect has ended. That would give us the ability to actually use Seal of Righteousness for a temporary AoE damage boost without costing us two GCDs.

5. Please bring us an end to the raid cooldown arms race. While it’s nice to be able to contribute something to the raid group, the sheer number of raid cooldowns being tossed around is getting absurd. Many encounters are being designed around rotating raid cooldowns to survive. While there’s certainly some level of coordination involved in that, I think it makes the game less fun for healers. It also leads to class stacking on encounters where those cooldowns are not equitable.

I feel that raid cooldowns should be limited to one role, and that role should probably be healers. In 20-man mythic, the number of healers should be more stable than it is in current 25-man heroics. While the number of tanks will be stable as well, the temptation to sacrifice a little DPS for another raid cooldown would be strong. Sacrificing an entire player’s worth of DPS for a raid cooldown is much more punishing and also more strategic, since you would do that on a fight where you presumably want more healing to begin with.

Raid cooldowns should, in my mind, be a finite resource that you have to use intelligently and carefully: precision tools for dealing with only the most difficult situations. Rotating cooldowns to trivialize an entire 30-60 second period of an encounter just feels cheesy to me, as does having enough Devotion Auras to throw at every single instance of a boss’s raid-wide damage ability.

Yes, I will be sad to give up Devotion Aura. But I will be happier with raiding as a whole, so it’s a sacrifice I’m willing to make.


P.S. Sorry about killing your cousin Elegon every week for the past six months or so. He was… um… corrupted by the Mogu or something, so it was justified. Each week. Really. On the bright side, he dropped this great mount that looks just like you!


Happy Holidays!

Happy Holidays from everyone here at Sacred Duty. See you next year!


Posted in Humor, Tanking, Theck's Pounding Headaches, Uncategorized | 22 Comments