HOW TO CREATE A NORMAL DISTRIBUTION CURVE WITHIN TABLEAU
--
This past week during the course of my normal day-to-day activities at work, I came across a requirement for creating a normal distribution curve within Tableau. I immediately thought of Jim Dehner’s blog post on How-To Create a Normal Distribution Chart and decided to replicate it. This blog post is about my understanding of his original work and the steps that I followed to replicate it. I do not take credit for the original idea and forever thankful for his incredible contribution to the tableau community
Tableau Software provides users with different types of analysis so they can visualize more easily their recorded values. One of the most common analyses used to observe the deviations that occur is the Histogram with normal distribution. This type of analysis is represented by the presence of a curve, also called the bell curve. The normal distribution is also known as the Gaussian distribution and represents the distribution of symmetric probabilities compared to the mean. Thus, data that are closest to the higher point of the curve have a higher frequency of occurrence than those that are further away.
In a nutshell, the bell curve is the most common type of distribution for a variable, and due to this fact, it is known as a normal distribution. The term “bell curve” comes from the fact that the graph used to depict a normal distribution consists of a bell-shaped line. The highest point on the curve, or the top of the bell, represents the most probable event in a series of data, while all other possible occurrences are equally distributed around the most probable event, creating a downward-sloping line on each side of the peak.
Histogram with normal distribution allows users to analyze the values in a data set and observe whether they had a normal distribution. The histogram displays the sum of the values recorded in the analyzed data set. In order to have a normal distribution of our values, they must be concentrated at the highest point of the curve. The advantage that the curve offers is given by the fact that highlights how the values in the dataset were distributed and whether the distribution has a normal behavior. For better understanding, check this amazing blog post on Ever Wondered Why Normal Distribution Is So Important by Farhad Malik
STEP 1: CALCULATE THE NORMAL DISTRIBUTION
Now for the interesting part! As you just learned over at Wikipedia how to calculate the Normal Distribution — more precisely its density function — we simply need to translate this into the Tableau world. This is the formula for calculating the Z score :
(1 / ([St Dev] * SQRT(2*PI()))) * EXP(-((ATTR([Profit Bin]) – [Mean])^2 2/ (2*[St Dev]^2)))
Let’s first understand the core concepts of Standard Deviation. In laymen’s terms, the standard deviation can be thought of as roughly the average distance of scores from the mean. Precisely, the standard deviation is defined to be the square root of the average squared deviation of scores from the mean. If there are extreme scores in the distribution, the standard deviation is inflated. Thus, the shape of the distribution should be considered when interpreting the standard deviation.
A low standard deviation means that most of the numbers are close to the average, while a high standard deviation means that the numbers are more spread out.
For better understanding, check this link — Confusing Stats Terms Explained: Standard Deviation — Statistics Blog — Stats Make Me Cry
A lot of things in our world follow a normal distribution. What is the Normal Distribution? It is a concept whereby within 1 standard deviation (Level of dispersion of data) from the mean will contain 68% of all the values. Within 2 Standard deviations, 95%. Within 3, 99.7%.
The Calculation for the Standard Deviation (SD) is:
Where x is a given value, x̄ represents the mean across the group. The Sigma sign then |x- x̄|² means the Summation of all absolute (Always positive) values of the difference between value and the group mean. This is all divided by the number (n) of values in the dataset. This all then square rooted to get the SD.
So to recap
1 SD = 68% of Data
2 SD = 95% of Data
3 SD = 99.7% of Data
Hope you get some idea about this, so let’s calculate the mean and the standard deviation for our distribution. We could do this using a Table Calculation, but I prefer Level of Detail Expressions, so here we go:
STEP 2: CALCULATE THE STANDARD DEVIATION
The data used here is the “Superstore” dataset that comes with the 2021.1 version of the Tableau desktop. I tried my best to minimize the calculation and Created this chart. The concept can be adapted to any level in the Superstore dataset or your own
Tableau does not draw curved lines — so we simulate the familiar smooth bell curve of the normal distribution with a series of straight lines on an X-Y cartesian coordinate grid. The densification process is used to create intervals on the X-Axis (Sales in these examples).
Prepare the data for densification
Duplicate the data set by creating a union of the data with itself. Tableau will create a Dimension “Table Name” that can be used to distinguish the original dataset from the duplicate — in the example, the original data set is “Orders” and the duplicate in “Orders1”
Path
IF [Table Name]="Orders1" then 1 else [X spacing] end
Where [X spacing] is the parameter with a value 100 to vary the x-axis spacing
Use the Path formula to create bins — and make the bin size = 1
Create the X-values using the bins to pad the data — NOTE I used a +- 4 sigma range
First, we need the µ and σ values — the average and the standard deviation values so that we can show Sales by year and by category. The LOD’s below create the needed combinations of year and category and the widow function forces the calculation to be executed in each bin
µ (MU) — Mean
window_min( avg({ FIXED year([Order Date]),[Category] :avg([Sales])}))
σ (Sigma) — Standard Deviation
window_min(min({ FIXED year([Order Date]),[Category] :STDEV([Sales])}))Where (index()-1) is the bin number (interval) on the x-axis
Index-1
(index()-1)
X-value
(-4[sigma])+([index-1](8*[sigma]))/window_max(max([Path]))
Solve the Normal Distribution formula for Y using the table calculations needed to force Y to be determined at each bin (x-value)
For clarity, I broke the calculation into 2 parts
The coefficient = (1/[σ*sqrt(2π])
Coeff
1/(sqrt(2pi()([sigma]^2)))
Exponent
(-((([X value (Sales)])-[mu])^2)/((2*([sigma]^2))) )
And then combined them in a single formula for Y (note: Tableau does not have an e function but the EXP() as shown returns the proper value)
Y
([coeff]*EXP([exponent]) )
x normalized (Sales)
([mu]-[X value (Sales)])/[sigma]
Std Dev Filter
Create a Parameter for selecting the SD value. Place it on the filter shelf — it is a table calculation and should be computed using “Path(Bin)“
[x normalized (Sales)] >= -[enter number of standard deviations] and
[x normalized (Sales)]<=[enter number of standard deviations]
STEP 4 — CREATE THE VIZ — FILLED SHAPE
Now we have all our calculation ready
- Place Path(bin) on detail
- Place X-Value on columns and Y value on rows
- Add Sigma and Mu to Detail
- Open all the table calculations Δ and set to Calculate using “Path(bin)”
- Set the chart type to Polygon
- Place Padded on Path
- Place Category on Color
- Place Year(order date) on the filter and set the filter to single value button
With X Normalised Sales on Column Shelf
With X Value Sales on Column Shelf
Reference link: Fitting a Normal Curve to a Histogram
Knowing this rule makes it very easy to calibrate your senses. Since all we need to describe any normal distribution is the mean and standard deviation, this rule holds for every normal distribution in the world!
The challenging part, indeed, is figuring out whether the distribution is normal or not.
Also, here are a few hand-picked articles for you to read next:
- How to extract Tableau field information using Python API
- Extended bar chart in Tableau
- Taking your design to next level in Tableau
- Create custom maps in Tableau
- Drop Lines using Parameter Actions
- Toggle Button using Parameter Actions
- Overview of Alteryx Architecture
- #PreppinData Challenge S01E03
- How to use Buffer Spatial Function In Tableau