Sublime Text has a handy feature called “Minimap”, which shows a small, condensed version of your text file along the right-hand margin. This gives you a high-level view of the file you’re working in: what it would look like if you zoomed out far enough to see all the text on one screen. This can help in navigating around very large files. I took some inspiration from this feature, thinking that if it can be useful for text files, it may also be useful for datasets! I’m writing an R package for exploratory data analysis in the browser with React, and the Minimap is the first feature that I’d like to showcase to demonstrate what’s possible when a front-end web application powers data analysis. Please read A Front-end for EDA for a more detailed introduction to this project.
A common problem when working with data is trying to understand the big picture from the very small sliver that you can view on your screen at any given time. Imagine your dataset as a text file that you can zoom out to a high level, just like Sublime’s Minimap. By placing each variable’s distribution side by side, you can learn a lot about the data that you would otherwise only see by plotting or summarizing each variable separately.
Let’s see what we can learn with this kind of visualization. Please note that this is a very early prototype and there are a lot of improvements and tweaks necessary before these minimaps will be good enough to release. First, let’s look at a dataset that should be very familiar to R programmers, mtcars:
I am using the same method for selecting colors as ggplot2 to visually distinguish the values of each variable independently. This allows you to quickly see the distribution of each variable. One feature not shown here is a tooltip that renders when hovering over the minimap, containing the variable name, value, count, and percent of total observations. For example, hovering over the first value of cyl would tell you that the value is 4 and the count is 11 observations, representing 34% of the total.
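As a rough illustration of what the tooltip has to compute, here is a minimal sketch in plain JavaScript. The function name tooltipText and the wording of the message are my own assumptions for this example, not the code used in the prototype; the numbers are the cyl example from above (mtcars has 32 rows).

```javascript
// Hypothetical helper: builds the tooltip text for one cell of the minimap.
// varname, value, count and total correspond to the variable name, value,
// count, and total number of observations described above.
function tooltipText(varname, value, count, total) {
  const pct = Math.round((count / total) * 100);
  return `${varname} = ${value}: ${count} observations (${pct}% of total)`;
}

console.log(tooltipText("cyl", 4, 11, 32));
// cyl = 4: 11 observations (34% of total)
```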
Next, here is the diamonds dataset from ggplot2. Unlike mtcars, this dataset has several high cardinality variables: carat, price, x, y, and z. Note the roughly normal distribution of depth. Compare this to ggplot(diamonds, aes(x=depth)) + geom_density(). In the minimap, you basically get a “top-down” view of the density function.
Also notice that there are several interesting values of price where you can see green cells peeking through the black borders. From this we can see that there are a few values at which multiple observations are concentrated in what is a high cardinality variable without many repeated values. To discover this in R, you could run ggplot(diamonds, aes(x=price)) + stat_ecdf() and look for the tiny bumps in the cumulative distribution function where the y-values flatline. But typically it is not easy to detect anomalies in high cardinality variables. The Minimap visualization is still very crude for such variables. Still, it is surprisingly easy to find strange values and in testing I was able to locate a strange clump of values in a dataset that I later tracked down to a formula error in the source csv file!
Let’s look at one more dataset: flights from the nycflights13 package. The column headers still need a lot of work!
Now I ask you, how long would it take to get this level of visibility by a) printing the dataframe, b) running summarize or count on each variable, and c) finding a decent plot that works for each variable? I’m not going to argue that the Minimap is a magic bullet, just as Sublime’s Minimap won’t magically give you all the insight you need to refactor your code. But it’s a useful tool in the toolkit, combining in one visualization what would otherwise require running through several different pieces of output.
How these work
The minimaps are made from SVG rect elements rendered by a React application served by the Fiery package in R. When the React application is launched in the browser, Fiery listens for web requests from the front-end, which lets R run some functions and return data to the browser as JSON. One of the functions I’ve written tabulates all the variables in a dataframe, returning the values and counts of each variable as an array called vartabs, as in the following example for mtcars:
{
  "vartabs": [
    {
      "name": "cyl",
      "value": [
        { "cyl": 4, "n": 11 },
        { "cyl": 6, "n": 7 },
        { "cyl": 8, "n": 14 }
      ]
    }
  ]
}
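To make the shape of this structure concrete, here is a minimal sketch of the same tabulation written in JavaScript instead of R. It is only an illustration: the function name tabulate, the rows input (an array of plain objects, one per observation), and the top-level n field are assumptions for this example, and the real work happens on the R side.

```javascript
// Hypothetical JavaScript version of the tabulation step.
// rows: array of plain objects, one per observation, all with the same keys.
function tabulate(rows) {
  if (rows.length === 0) return { vartabs: [], n: 0 };
  const vartabs = Object.keys(rows[0]).map((name) => {
    // Count how often each distinct value of this variable occurs
    const counts = new Map();
    for (const row of rows) {
      counts.set(row[name], (counts.get(row[name]) || 0) + 1);
    }
    // Sort by value so the cells stack in a stable order
    const value = [...counts.entries()]
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([v, n]) => ({ [name]: v, n }));
    return { name, value };
  });
  return { vartabs, n: rows.length };
}

const sample = [{ cyl: 6 }, { cyl: 4 }, { cyl: 6 }, { cyl: 8 }];
console.log(JSON.stringify(tabulate(sample)));
```

Each element of vartabs carries the variable name plus one small object per distinct value, which is the same layout as the cyl example above.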
My Minimap is a React component that I call with the following props:
<Minimap vartabs={props.vartabs} varcolors={props.colors} n={props.n} />
The varcolors prop is an array of the same dimensions as vartabs, containing the color chosen from a color palette for each value. The Minimap component loops over the vartabs array to generate the rect elements that serve as column headers for each variable name, and then calls a VariableRect component to handle each variable individually. Leaving aside the code for managing the tooltip that runs on mouse hover events, the Minimap component looks like this:
// Fragment is also used by the VariableRect component
import React, { Fragment } from "react";

function Minimap(props) {
  const { vartabs, varcolors, n } = props;
  if (!vartabs || !n) {
    return "Loading Minimap...";
  }
  // TODO: dynamically size the map based on number of variables!
  const mapWidth = 800;
  const mapHeight = 600;
  // create rect elements for column headers
  const colHeaders = vartabs.map((v, i) => {
    return (
      <Fragment key={v.name}>
        <rect
          key={v.name}
          x={i * mapWidth / vartabs.length}
          y="0"
          width={mapWidth / vartabs.length}
          height="60"
          stroke="green"
          fill="white"
          fillOpacity="0.2"
        >
        </rect>
        <text x={4 + i * mapWidth / vartabs.length} y="30">{v.name}</text>
      </Fragment>
    );
  });
  // create rect elements for the values of each variable
  const cells = vartabs.map((v, i) => {
    const x = i * mapWidth / vartabs.length;
    const fillColors = varcolors[i].value;
    return (
      <VariableRect
        key={v.name}
        vartab={v}
        x={x}
        varWidth={mapWidth / vartabs.length}
        n={n}
        varHeight={mapHeight}
        fillColors={fillColors}
      />
    );
  });
  return (
    <div className="minimap">
      <h3>I am a minimap!</h3>
      <svg width={mapWidth} height={mapHeight}>
        <g>
          {colHeaders}
          {cells}
        </g>
      </svg>
    </div>
  );
}
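For the varcolors prop, a rough sketch of per-variable color assignment might look like the following. It borrows the evenly spaced hue idea from ggplot2’s default discrete palette (hues starting at 15 degrees), but it uses CSS hsl() strings as an approximation rather than the HCL color space that ggplot2 works in. The names hueColors and buildVarcolors, the name field, and the saturation and lightness values are assumptions for this example; only the value array is something the Minimap component above actually reads.

```javascript
// Evenly spaced hues around the color wheel, starting at 15 degrees.
// Approximation only: ggplot2 uses HCL, this uses CSS hsl().
function hueColors(k) {
  return Array.from({ length: k }, (_, i) => {
    const hue = Math.round(15 + (360 * i) / k) % 360;
    return `hsl(${hue}, 65%, 60%)`;
  });
}

// Build one palette per variable, so colors are assigned independently
// for each variable instead of across the whole dataset.
function buildVarcolors(vartabs) {
  return vartabs.map((v) => ({
    name: v.name,
    value: hueColors(v.value.length),
  }));
}

const exampleVartabs = [
  { name: "cyl", value: [{ cyl: 4, n: 11 }, { cyl: 6, n: 7 }, { cyl: 8, n: 14 }] },
];
console.log(buildVarcolors(exampleVartabs)[0].value);
// [ 'hsl(15, 65%, 60%)', 'hsl(135, 65%, 60%)', 'hsl(255, 65%, 60%)' ]
```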
The VariableRect component’s job is to loop through each value of the variable, determining the appropriate height based on the count of values.
function VariableRect(props) {
  let prevHeight = 60; // fixed height for column headers
  let y = 0;
  return (
    <Fragment key={props.vartab.name}>
      {props.vartab.value.map((cell, rownum) => {
        // each cell starts where the previous one (or the header) ended
        y += prevHeight;
        // cell height is proportional to its share of all observations
        const h = cell.n / props.n * (props.varHeight - 60);
        prevHeight = h;
        return (
          <CellRect
            key={`cell_${rownum}`}
            cell={cell}
            n={props.n}
            varname={props.vartab.name}
            rownum={rownum}
            x={props.x}
            y={y}
            width={props.varWidth}
            height={h}
            fillColors={props.fillColors}
          />
        );
      })}
    </Fragment>
  );
}
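The stacking arithmetic in VariableRect can also be written as a small pure function, which makes it easier to check by hand. This is a sketch under the same assumptions as the component (a fixed 60-pixel header and heights proportional to each value’s share of the n observations); the names layoutCells and HEADER_HEIGHT are hypothetical.

```javascript
const HEADER_HEIGHT = 60; // same fixed header height as in the component

// Returns the y offset and height of each cell for one variable.
function layoutCells(values, n, varHeight) {
  let y = HEADER_HEIGHT; // first cell starts right below the header
  return values.map((cell) => {
    const height = (cell.n / n) * (varHeight - HEADER_HEIGHT);
    const rect = { y, height };
    y += height; // next cell starts where this one ends
    return rect;
  });
}

const cyl = [{ cyl: 4, n: 11 }, { cyl: 6, n: 7 }, { cyl: 8, n: 14 }];
console.log(layoutCells(cyl, 32, 600));
// [ { y: 60, height: 185.625 },
//   { y: 245.625, height: 118.125 },
//   { y: 363.75, height: 236.25 } ]
```

The last cell ends at 363.75 + 236.25 = 600, so the cells fill the column exactly when the counts sum to n.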
The basic implementation of the CellRect component is quite simple, since all the dimensions have already been calculated. I made it its own component because I’m also using it with event handlers to enable the tooltip (removed for brevity!).
function CellRect(props) {
  // fall back to green if no color was supplied for this row
  const fillColor = props.fillColors[props.rownum] || 'green';
  return (
    <rect
      key={`cell_${props.rownum}`}
      x={props.x}
      y={props.y}
      width={props.width}
      height={props.height}
      stroke="black"
      fill={fillColor}
      fillOpacity="1.0"
    >
    </rect>
  );
}
Reproduction in ggplot2
Is it possible to reproduce these minimaps in R? With ggplot, anything is possible! One approach is to map each variable to the count function and aggregate all the values back into a single dataset:
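library(tidyverse)  # provides %>%, map, count, enframe, unnest, ggplot and the diamonds data

diamonds %>%
  # tabulate every column, coercing values to character so they share one type
  map(~count(tibble(x = as.character(.x)), x)) %>%
  enframe() %>%
  unnest(cols = c(value)) %>%
  ggplot(aes(x = name, y = n, fill = x)) +
  geom_bar(stat = "identity", colour = "black") +
  theme(legend.position = "none", axis.text.y = element_blank()) +
  labs(x = "", y = "")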
As you can see from the results, there are a few issues. First, we’ve had to reduce all variables to the lowest common denominator of variable type, in this case using as.character, because all of the values across all of the variables in the dataset need to be represented as the same type. A second issue is that with all of the values combined, the color palette assigns colors based on the full set of values, rather than re-assigning colors independently for each variable. This reduces the amount of visual discrimination in the display of the fill colors.
In addition, we cannot use this method to display the variables in dataset order. Instead they will be displayed in alphabetical order based on name. The only way to bypass this would be to hard-code the desired order of the variables in the levels of a factor: aes(x = factor(name, levels = c("carat", "cut", "color", ...
Another solution is to use faceting, but again the same issues arise—the values of all the variables will need to be aggregated together into one variable in order to use that variable in a facet. The only way I’ve found to avoid this is to simply plot each variable independently, and then find a way to pop them all onto the screen at the same time.
The Wrap-up
The current prototype still needs a lot of work. In addition to fixing the printing of column headers and making the map dimensions flexible based on the number of variables, there’s also a performance issue with large datasets (when isn’t that the case?). The flights dataset, with 336k rows and a lot of high cardinality variables, means I’m drawing several million rect elements in the SVG. That means several million event handlers for the mouse hover tooltip, which causes a nearly unusable lag. Fixing this will involve separating out a HighCardinality component that limits the number of rects drawn to something more manageable. But I still want to be able to display the useful information, notably the presence of any “clumps”. The approach I’m taking is to draw the top N values and fill in the gaps with gradients.
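One way the “top N plus gaps” idea could work is sketched below. This is speculative and not the prototype’s code: the function name collapseToSegments and the segment fields are assumptions for this example. It keeps the N most frequent values as individual cells and merges each run of remaining values between them into a single gap segment, which could then be drawn as one gradient-filled rect.

```javascript
// values: tabulated cells for one variable, in value order, each with a count n.
// Returns segments in the same order: kept values and merged gaps.
function collapseToSegments(values, topN) {
  // Indices of the topN largest counts (ties broken by original position)
  const keep = new Set(
    values
      .map((cell, i) => ({ i, n: cell.n }))
      .sort((a, b) => b.n - a.n || a.i - b.i)
      .slice(0, topN)
      .map((d) => d.i)
  );
  const segments = [];
  values.forEach((cell, i) => {
    if (keep.has(i)) {
      segments.push({ type: "value", cell, n: cell.n });
      return;
    }
    const last = segments[segments.length - 1];
    if (last && last.type === "gap") {
      // extend the current gap with this value's observations
      last.n += cell.n;
      last.distinct += 1;
    } else {
      segments.push({ type: "gap", n: cell.n, distinct: 1 });
    }
  });
  return segments;
}

const price = [
  { price: 1, n: 1 }, { price: 2, n: 1 }, { price: 3, n: 5 },
  { price: 4, n: 1 }, { price: 5, n: 3 }, { price: 6, n: 1 },
];
console.log(collapseToSegments(price, 2).map((s) => `${s.type}:${s.n}`).join(" "));
// gap:2 value:5 gap:1 value:3 gap:1
```

Because every observation still ends up in exactly one segment, the segment counts sum to the total, so the same height arithmetic used for the ordinary cells would still apply.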
I would also like to add a zoom feature, which SVG makes possible–it is ‘scalable’ after all! You could even imagine gradually zooming in to the point where you can see the values of individual rows, as if it were just a table. This might not be necessary, however, because I’m also displaying a data table component, so the Minimap doesn’t really need to serve this purpose. Even so, zooming would be useful to increase the resolution for those tricky high cardinality variables.