Visualization
Updated: 03 September 2023
Configuration
Before starting the course, make sure you install the dslabs library, which includes the relevant datasets.
Once the dslabs library is loaded, you can use its data sets as needed.
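A minimal setup sketch, assuming the package is installed from CRAN. The `murders` dataset is used purely as an example here, since the notes do not say which datasets are needed:

```r
# Install once (uncomment on first use); dslabs bundles the course datasets
# install.packages("dslabs")

library(dslabs)

# Load one of the bundled datasets into the workspace
# (`murders` is an example choice, not prescribed by the notes)
data(murders)
head(murders)
```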
Distributions
Numerical data is often summarised by two values, the average and the standard deviation. Together, these two numbers can tell us a lot about the data.
Data Types
- Categorical
  - Ordinal
  - Non-Ordinal
- Numeric
  - Discrete
  - Continuous
If a numeric data set is discrete and takes only a few possible values (e.g. 1, 2, 3), we can treat it as ordinal.
Describing Data
Frequency Distribution
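The simplest way to describe a data set is its distribution. For categorical data this is a frequency distribution, the proportion of observations in each category. The simplest example is a two-category distribution. This works well for categorical data, including ordinal data.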
CDF
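For numeric data that takes many distinct values, listing the frequency of each value is not a useful summary, so we make use of a cumulative distribution function (CDF), which gives the proportion of values less than or equal to a given value. The problem with a CDF plot is that it does not easily convey features such as the centre, spread, or shape of the data.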
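As a rough sketch, base R's `ecdf()` builds an empirical CDF from a vector. The values below are simulated and purely illustrative:

```r
set.seed(1)
x <- rnorm(500, mean = 69, sd = 3)  # simulated, hypothetical measurements

cdf <- ecdf(x)       # returns a function: cdf(a) = proportion of x <= a

cdf(70)              # proportion of values at or below 70
cdf(72) - cdf(66)    # proportion of values in the interval (66, 72]

plot(cdf, main = "Empirical CDF", xlab = "x", ylab = "F(x)")
```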
Histograms
Histograms sacrifice some information in exchange for a more usable display: they make the range, shape, and symmetry of the distribution easy to see. However, because the data is divided into bins, we lose the specific values within each bin.
Smooth Density Plots
A smooth density plot is a smoothed version of a histogram. It relies on the assumption that the observed data is a representative subset of a much larger list of values, which allows us to approximate the whole from what was measured.
We create a histogram with bin sizes appropriate for our data, change the counts on the y-axis to densities, and then draw a smooth curve through the tops of the histogram bars. The total area under a smooth density curve is 1, so the area under the curve between two values gives the fraction of our data that falls within that range.
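The steps above can be sketched with ggplot2 as follows. The data is simulated, the bin width of 1 is an arbitrary choice, and `after_stat()` assumes ggplot2 3.3.0 or later:

```r
library(ggplot2)

set.seed(1)
d <- data.frame(height = rnorm(500, mean = 69, sd = 3))  # simulated data

# Histogram with densities (not counts) on the y-axis, plus a smooth density
p_density <- ggplot(d, aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1,
                 fill = "grey80", colour = "black") +
  geom_density(colour = "blue")
print(p_density)

# Numerical check that the area under the estimated density is close to 1,
# using the trapezoid rule on base R's density() estimate
dens <- density(d$height)
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
area
```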
Normal Distribution
Also known as the bell curve or Gaussian distribution. Many datasets are approximately normally distributed, and the normal distribution is completely defined by just two numbers: the mean and the standard deviation. R provides functions for these, such as mean() and sd(), along with pnorm() for the normal CDF.
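A small sketch of the normal approximation, using simulated values and an arbitrary cut-off of 72, compares a proportion predicted from the mean and standard deviation with the proportion seen in the data:

```r
set.seed(1)
x <- rnorm(1000, mean = 69, sd = 3)  # simulated, hypothetical measurements

m <- mean(x)
s <- sd(x)

# Proportion of values above 72 predicted by the normal approximation
approx_prop <- 1 - pnorm(72, mean = m, sd = s)

# Proportion of values above 72 actually observed in the data
actual_prop <- mean(x > 72)

c(approximation = approx_prop, observed = actual_prop)
```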
Standard Units
For data that is approximately normal, we can express a data point in standard units, z = (x - mean) / sd, which describes how many standard deviations the value is from the mean.
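As a sketch with simulated values, standard units can be computed by hand or with base R's `scale()`:

```r
set.seed(1)
x <- rnorm(1000, mean = 69, sd = 3)  # simulated, hypothetical measurements

# Standard units by hand
z <- (x - mean(x)) / sd(x)

# scale() performs the same calculation but returns a one-column matrix
z_scaled <- as.numeric(scale(x))

# Proportion of values within 2 standard deviations of the mean;
# for approximately normal data this is roughly 0.95
within_2_sd <- mean(abs(z) < 2)
within_2_sd
```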
Quantile-Quantile Plots
To check whether the normal approximation is good, we use QQ plots.
If the quantiles of the data match the quantiles of the normal distribution, then the data is approximately normally distributed.
To see if they match, we plot them against each other and add an identity line. If the points fall close to the x = y line on our graph, then the normal distribution is a good approximation.
If we use standard units, we can compare against the quantiles of the standard normal distribution directly, without supplying the mean and standard deviation.
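Both approaches can be sketched as below. The data is simulated and the grid of probabilities `p` is an arbitrary choice:

```r
library(ggplot2)

set.seed(1)
d <- data.frame(height = rnorm(500, mean = 69, sd = 3))  # simulated data

# By hand: observed quantiles against normal quantiles with matching mean/sd
p <- seq(0.05, 0.95, 0.05)
observed    <- quantile(d$height, p)
theoretical <- qnorm(p, mean = mean(d$height), sd = sd(d$height))
plot(theoretical, observed)
abline(0, 1)  # identity line

# With ggplot2 and standard units: geom_qq() compares against the standard
# normal by default, and geom_abline() defaults to intercept 0 and slope 1
p_qq <- ggplot(d, aes(sample = as.numeric(scale(height)))) +
  geom_qq() +
  geom_abline()
print(p_qq)
```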
Percentiles
Percentiles are special cases of quantiles that are commonly used, in which p is defined in increments of 0.01. Further special cases are the median at 0.5 and the quartiles at 0.25, 0.50, and 0.75.
Boxplots
Boxplots summarize the data distribution in 5 numbers: the range (minimum and maximum) together with the quartiles at 0.25, 0.5, and 0.75. They are useful for viewing the data distribution and for easily comparing groups.
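A short sketch of these summaries in base R, using simulated values:

```r
set.seed(1)
x <- rnorm(500, mean = 69, sd = 3)  # simulated, hypothetical measurements

# Percentiles: quantiles at p = 0.01, 0.02, ..., 0.99
percentiles <- quantile(x, seq(0.01, 0.99, 0.01))

# The five numbers behind a boxplot: minimum, quartiles, maximum
five_numbers <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
five_numbers

# Note that boxplot() draws points beyond the whiskers separately as outliers
boxplot(x, horizontal = TRUE)
```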
GGPlot
We will be using the ggplot2 library for plotting because it is easy for beginners to use and has good default behaviour that suits most general use cases.
Graph Components
- Data
- Geometry - Plot type, geom_...()
- Aesthetic Mapping - Points, colouring, etc., aes()
- Scale - Log, linear, etc.
The first step in defining a ggplot graph is to assign the data to it with ggplot(). We can view the plot by simply printing it.
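A minimal sketch, assuming the `murders` dataset from dslabs as example data:

```r
library(ggplot2)
library(dslabs)
data(murders)

# Associate the data with a plot object; no geometry has been added yet
p <- ggplot(data = murders)

class(p)
print(p)  # renders a blank canvas, since there are no layers
```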
Layers
Graphs are made of layers based on components. To add layers, we use the + operator.
The aes function can recognise variables that are defined in the data frame by name. This is not the case with the other functions.
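As a sketch of adding layers, again assuming the `murders` dataset from dslabs (the columns `population`, `total`, and `abb` belong to that dataset):

```r
library(ggplot2)
library(dslabs)
data(murders)

p <- ggplot(data = murders)

# Each `+` adds a layer. Inside aes(), column names such as `population`
# and `abb` are looked up in the data frame. `size` and `nudge_x` sit
# outside aes(), so they are fixed values rather than mappings.
p_layers <- p +
  geom_point(aes(x = population / 10^6, y = total), size = 3) +
  geom_text(aes(x = population / 10^6, y = total, label = abb), nudge_x = 1)

print(p_layers)
```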
The problem is that we keep having to redefine the mappings for x and y. We can instead define a global aesthetic mapping in the call to ggplot().
We can also write local mappings, which will override the global mappings for that layer.
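A sketch of global and local mappings, assuming the `murders` dataset from dslabs. The label text and its position are arbitrary:

```r
library(ggplot2)
library(dslabs)
data(murders)

# Global mapping: defined once in ggplot() and inherited by every layer
p_global <- ggplot(murders, aes(x = population / 10^6, y = total, label = abb)) +
  geom_point(size = 3) +
  geom_text(nudge_x = 1.5)

# Local mapping: the aes() inside geom_text() overrides the global x, y
# and label for this layer only (the label is drawn once per row, overplotted)
p_local <- ggplot(murders, aes(x = population / 10^6, y = total, label = abb)) +
  geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))

print(p_global)
print(p_local)
```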
Scales, Colours, and Labels
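We want the scales to be in log base 10. This can be done by adding scale_x_continuous(trans = "log10") and scale_y_continuous(trans = "log10") layers (newer ggplot2 releases name this argument transform).
Since the log transformation is so common, we can add it more simply with scale_x_log10() and scale_y_log10().
We can add colour mappings based on a categorical variable by simply using an aesthetic mapping, via the colour argument of aes().
Add a simple line with a given gradient and intercept with geom_abline().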
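These pieces can be combined as in the sketch below, which assumes the `murders` dataset from dslabs. The axis labels, title, and the choice to draw the overall murder rate as the reference line are illustrative:

```r
library(ggplot2)
library(dslabs)
data(murders)

# Overall rate per million people, used as the intercept of a reference line
r <- sum(murders$total) / sum(murders$population) * 10^6

p_scaled <- ggplot(murders, aes(x = population / 10^6, y = total)) +
  # On log-log axes, y = r * x is a line with slope 1 and intercept log10(r)
  geom_abline(intercept = log10(r), slope = 1, lty = 2, colour = "darkgrey") +
  # Colour mapped to a categorical variable through aes()
  geom_point(aes(colour = region), size = 3) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US gun murders")

print(p_scaled)
```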
Add-On Packages
- ggthemes - Themes
- ggrepel - Ensures labels do not overlap each other
To use a theme, we add it to the plot as a layer, for example theme_economist() from ggthemes.
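A sketch using both add-on packages, assuming ggthemes and ggrepel are installed and using the `murders` dataset from dslabs as example data:

```r
library(ggplot2)
library(ggthemes)
library(ggrepel)
library(dslabs)
data(murders)

p_themed <- ggplot(murders, aes(x = population / 10^6, y = total, label = abb)) +
  geom_point(size = 3) +
  geom_text_repel() +    # ggrepel: labels are moved so they do not overlap
  scale_x_log10() +
  scale_y_log10() +
  theme_economist()      # ggthemes: a theme is added like any other layer

print(p_themed)
```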
Other Graph Types
geom_histogram
geom_density
geom_qq
To arrange multiple plot objects in a single figure, we use the gridExtra package.
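A sketch of the three geometries arranged side by side, assuming the `heights` dataset from dslabs. The bin width and colours are arbitrary:

```r
library(ggplot2)
library(gridExtra)
library(dslabs)
data(heights)

p_hist <- ggplot(heights, aes(x = height)) +
  geom_histogram(binwidth = 1, fill = "blue", colour = "black")

p_dens <- ggplot(heights, aes(x = height)) +
  geom_density(fill = "blue", alpha = 0.3)

# geom_qq() uses the `sample` aesthetic rather than x
p_qq <- ggplot(heights, aes(sample = height)) +
  geom_qq()

# Place the three plot objects next to each other in one figure
grid.arrange(p_hist, p_dens, p_qq, ncol = 3)
```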
DPLYR Functions
Summarize
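As a minimal sketch, summarize() collapses a data frame into summary statistics and returns a new data frame. The example assumes the `murders` dataset from dslabs, and the summary column names are made up for illustration:

```r
library(dplyr)
library(dslabs)
data(murders)

# One row of summary statistics computed from all rows of the table
murder_summary <- murders %>%
  summarize(total_murders = sum(total),
            rate = sum(total) / sum(population) * 10^5)

murder_summary
class(murder_summary)  # still a data frame, not a plain number
```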
Dot Placeholder
The dot placeholder allows us to access the data being piped into a series of functions
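For example, the dot can be used to extract a single column from the piped data as a plain vector. The sketch assumes the `murders` dataset from dslabs, with dplyr's `pull()` shown as an alternative:

```r
library(dplyr)
library(dslabs)
data(murders)

us_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^5)

# `.` stands for the data frame coming through the pipe
rate_value <- us_rate %>% .$rate

# pull() extracts the same column without the dot
rate_pulled <- us_rate %>% pull(rate)

rate_value
```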
Group By
This allows us to break data into groups and returns a grouped data frame. This is like a table with the same columns but separate sets of rows, one per group. Subsequent functions such as summarize are then applied to each group separately.
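A sketch of grouping followed by summarizing, assuming the `murders` dataset from dslabs with its `region` column:

```r
library(dplyr)
library(dslabs)
data(murders)

# Grouping alone does not change the rows, only how later verbs treat them
grouped <- murders %>% group_by(region)
class(grouped)

# summarize() now returns one row per region instead of a single row
region_rates <- grouped %>%
  summarize(rate = sum(total) / sum(population) * 10^5)

region_rates
```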
Sorting Tables
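In dplyr we have the arrange function, which allows us to order data by a single column or by multiple columns.
We can view the rows with the largest values of a column with the top_n function. The rows returned by top_n are not sorted, so we combine it with arrange if the order matters.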
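A sketch of sorting, assuming the `murders` dataset from dslabs. The `rate` column is computed here for illustration. Recent dplyr versions mark `top_n()` as superseded by `slice_max()`, though it still works:

```r
library(dplyr)
library(dslabs)
data(murders)

murders_rate <- murders %>%
  mutate(rate = total / population * 10^5)

# Order by a single column, ascending by default
by_rate <- murders_rate %>% arrange(rate)

# Order by several columns: region first, then descending rate within region
by_region_rate <- murders_rate %>% arrange(region, desc(rate))

# The rows with the 5 largest rates, sorted afterwards with arrange()
top_5 <- murders_rate %>%
  top_n(5, rate) %>%
  arrange(desc(rate))

top_5
```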
Data Faceting
To separate the data in our data sets based on some sort of categorization without the need to make separate plots, we can use the facet_grid function, which takes the row variable and the column variable separated by a ~.
If we would like to use just one variable to differentiate, we can use the dot placeholder in place of the other variable.
If we have many plots and want them to wrap across multiple rows and columns, we use facet_wrap.
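The three variants can be sketched as follows, assuming the `gapminder` dataset from dslabs. The years and variables chosen are arbitrary:

```r
library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

p_base <- gapminder %>%
  filter(year %in% c(1962, 2012),
         !is.na(fertility), !is.na(life_expectancy)) %>%
  ggplot(aes(x = fertility, y = life_expectancy, colour = continent)) +
  geom_point()

# Rows by continent, columns by year
p_grid <- p_base + facet_grid(continent ~ year)

# A single faceting variable: the dot means "no variable" for the rows
p_cols <- p_base + facet_grid(. ~ year)

# Wrap panels across rows and columns when there are many of them
p_wrap <- p_base + facet_wrap(~ continent)

print(p_grid)
print(p_cols)
print(p_wrap)
```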
Time Series Plots
If we want to plot multiple lines on a plot, we need to tell ggplot which observations belong to which line, either with the group aesthetic or by mapping a variable to colour.
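A sketch with two lines on one plot, assuming the `gapminder` dataset from dslabs. The two countries are an arbitrary choice:

```r
library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

countries <- c("South Korea", "Germany")  # example selection

fertility_data <- gapminder %>%
  filter(country %in% countries, !is.na(fertility))

# group = country draws one line per country, both in the same colour
p_group <- ggplot(fertility_data, aes(x = year, y = fertility, group = country)) +
  geom_line()

# colour = country also groups the data and adds a legend
p_colour <- ggplot(fertility_data, aes(x = year, y = fertility, colour = country)) +
  geom_line()

print(p_group)
print(p_colour)
```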
Transformations
We can transform our data if there is some sort of logarithmic relationship, which makes it easier to explore different datasets. We have two options: log-transform the values before plotting, or plot the original values on a log scale.
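The two options can be compared in a short sketch, assuming the `murders` dataset from dslabs. The number of bins is arbitrary:

```r
library(ggplot2)
library(dslabs)
data(murders)

# Option 1: transform the values; the axis is labelled in log10 units
p_log_values <- ggplot(murders, aes(x = log10(population))) +
  geom_histogram(bins = 15)

# Option 2: transform the scale; the axis keeps the original units
p_log_scale <- ggplot(murders, aes(x = population)) +
  geom_histogram(bins = 15) +
  scale_x_log10()

print(p_log_values)
print(p_log_scale)
```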
Reorder
We can reorder the levels of a factor for a plot using something other than the default alphabetical order with the reorder function.
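A sketch of reordering before plotting, assuming the `murders` dataset from dslabs. The `rate` column is computed here for illustration:

```r
library(dplyr)
library(ggplot2)
library(dslabs)
data(murders)

murders_ordered <- murders %>%
  mutate(rate = total / population * 10^5,
         # Levels of `state` are now ordered by rate instead of alphabetically
         state = reorder(state, rate))

p_ordered <- ggplot(murders_ordered, aes(x = state, y = rate)) +
  geom_col() +
  coord_flip()  # horizontal bars so the state names are readable

print(p_ordered)
```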
Encoding Data Using Visual Cues
Humans are not good at judging angles or areas precisely, so pie and donut charts are ineffective ways to display information. By using length and position we can communicate information more effectively.
Brightness and colour are difficult to quantify but can sometimes be useful for displaying more complex data.
When to Include Zero
With bar charts, zero should be included, as it can be deceptive to exclude the origin. If it is the magnitude of the data that is important, we should include the origin.
However if what we are trying to show is variability within a group, then a more localised scale is more effective
Do Not Distort Quantities
We should not use deceptive presentation when displaying data, i.e. we should prefer an encoding like length over one that uses area, as lengths are easier for us to compare.
Order Meaningfully
We should order by a meaningful value. Alphabetical order is rarely the best choice. Ordering by magnitude, median, minimum, or maximum can contribute to more meaningful graphs.
Show the Data
It’s useful to make use of tools like jitter and alpha blending to allow us to give an idea of our data distribution while showing the actual data points in a meaningful way
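A sketch of both tools, assuming the `heights` dataset from dslabs. The jitter width and alpha value are arbitrary:

```r
library(ggplot2)
library(dslabs)
data(heights)

p_points <- ggplot(heights, aes(x = sex, y = height)) +
  # width adds a little horizontal noise so points do not sit on one line;
  # alpha makes points semi-transparent so dense regions appear darker
  geom_jitter(width = 0.1, alpha = 0.2)

print(p_points)
```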
Common Axes
When comparing data across plots we must be sure to keep the axes the same
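When comparing multiple sets of data, plots should be laid out perpendicular to the direction of the comparison: align plots vertically to compare along the horizontal axis, and horizontally to compare along the vertical axis.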
Slope Charts
Slope charts allow us to display change between a small number of points. ggplot has no dedicated geometry for this, but it can be done indirectly by using lines.
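A minimal sketch of a slope chart built from lines. The data frame below is made up purely for illustration:

```r
library(ggplot2)

# Hypothetical values for three units measured at two time points
slope_data <- data.frame(
  unit  = rep(c("A", "B", "C"), times = 2),
  year  = rep(c(2010, 2015), each = 3),
  value = c(70, 75, 80, 74, 76, 79)
)

# Treating year as a factor gives two evenly spaced positions on the x-axis;
# group = unit connects each unit's two points with one line
p_slope <- ggplot(slope_data, aes(x = factor(year), y = value,
                                  group = unit, colour = unit)) +
  geom_line() +
  geom_point() +
  xlab("year")

print(p_slope)
```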
Bland-Altman Plot
Depicts the difference versus the average
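As a sketch with made-up paired measurements, the plot places the average of each pair on the x-axis and the difference on the y-axis:

```r
library(ggplot2)

# Hypothetical paired measurements of the same units at two time points
before <- c(70, 75, 80, 65)
after  <- c(74, 76, 79, 70)

ba_data <- data.frame(
  average    = (before + after) / 2,
  difference = after - before
)

p_ba <- ggplot(ba_data, aes(x = average, y = difference)) +
  geom_point() +
  geom_hline(yintercept = 0, lty = 2)  # reference line for "no change"

print(p_ba)
```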
Encoding Additional Variables
We can use shape, colour, intensity, and size to encode additional variables in our graph.
Avoid 3D plots
Plots should always be made in 2D as far as possible
Avoid too Many Significant Digits
The default behaviour in R is to show data to 7 significant digits, which is not necessarily useful.
We can specify our significant figures with the signif function, or the number of decimal places with the round function.
Alternatively, the global digit count can be set with options(digits = n).
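A short sketch of the three approaches, with arbitrary example values:

```r
x <- 3.14159265

sig_3   <- signif(x, 3)  # 3 significant figures -> 3.14
round_3 <- round(x, 3)   # 3 decimal places      -> 3.142

# The difference is clearer for larger numbers
signif(1234.5678, 2)     # 1200
round(1234.5678, 2)      # 1234.57

# Change the global default, print, then restore the previous setting
old <- options(digits = 3)
print(x)                 # prints 3.14
options(old)
```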