I am writing an R package for exploratory data analysis in the browser with ReactJS. In my last post I outlined my lofty ambitions for writing a graphical data analysis tool to make exploratory data analysis easier, describing the motivations I had after struggling with the context switch between analysis and programming. In this post I will go on a whirlwind tour of my thoughts and inspirations that starts with the front end, goes deep into the backend, and then journeys back to the front end again. By the end I hope to have made the case that I’m not crazy and this thing might actually work.
Motivation and the solution space
In Data analysis at the speed of thought I describe the motivations that led me to pursue this project of designing a front-end for exploratory data analysis. Writing code can be a huge distraction when you’re trying to explore data. The thought process of analyzing data is very different from the thought process of programming. The mental model that you try to build up during EDA can be quickly broken by the need to translate what you want to do next into the syntax required to get to that next step. After reviewing my typical workflows, I believe there are several patterns that are common when approaching new datasets in the exploratory phase. I also believe it is possible to support a lot of these patterns in a graphical environment rather than the traditional syntax-driven one.
Ideally these and similar questions could be answered without having to write any code at all:
- Does my primary key have duplicates?
- Does my key metric have missing values?
- What is the granularity of the dataset, i.e. what does each observation represent?
- Does this variable have low or high cardinality?
- Is there an implicit sort order to the dataset?
- Are variable types inferred correctly? E.g., is there a datetime represented as a character vector?
- Do continuous variables have unusual clumps that would indicate a magic number, a data quality issue, or an “ordinal disguised as a continuous variable”?
- If I left join dataset a with dataset b, will I get the same number of observations as dataset a? What about inner join?
- Count of counts: how many of x have more than one value of y?
- Does a distribution change if I ignore NAs?
Some of these can be answered with basic summary statistics and frequency tables, which can of course be easily scripted. I’m sure you already know and can easily recall the code you need to write to accomplish these. You may even have your own set of convenience functions that you would typically throw against a new dataset to get a basic understanding of each variable. However, I think we can do better than just `summary(dataset)`.
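To make the friction concrete, here is roughly the kind of throwaway dplyr code these questions turn into (a quick sketch against a hypothetical data frame `df` with an `id` key, a `revenue` metric, and a `category` variable; your names will differ):

```r
library(dplyr)

# Does my primary key have duplicates?
df %>% count(id) %>% filter(n > 1)

# Does my key metric have missing values?
df %>% summarise(n_missing = sum(is.na(revenue)))

# Does this variable have low or high cardinality?
df %>% summarise(n_levels = n_distinct(category))

# Are variable types inferred correctly?
sapply(df, class)
```

None of it is hard, but every snippet is a small context switch away from the question I was actually asking.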
Some of these need to be shown in plots, and if you’re like me, you’ll spend 5 minutes searching through previous projects for a similar case where you wrote a 12-line ggplot2 command to get the output you want. The grammar of graphics approach to plotting that ggplot2 enables is incredibly flexible, and naturally so: there is no simple recipe for building visualizations without taking into account the nature of the insight that you intend to visually highlight. Yet we can take advantage of some heuristics based on variable types and their distributions to generate default views of a variable, making it easier to identify patterns in the exploratory phase.
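As an illustration of the kind of heuristic I have in mind (not the actual implementation, just a sketch), a default view could be chosen from a variable’s type and cardinality:

```r
library(ggplot2)

# Choose a sensible default plot for a single variable based on simple heuristics
default_plot <- function(data, var) {
  x <- data[[var]]
  if (is.numeric(x) && dplyr::n_distinct(x) > 20) {
    # Continuous: a histogram surfaces clumps, magic numbers, and skew
    ggplot(data, aes(x = .data[[var]])) + geom_histogram(bins = 50)
  } else {
    # Categorical or low-cardinality: a frequency bar chart
    ggplot(data, aes(x = factor(.data[[var]]))) + geom_bar()
  }
}
```

The point is not that such defaults are always right, but that they are right often enough to keep you looking at the data instead of at the plotting syntax.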
And then some of these questions are about surfacing relevant metadata or making it easier to discover patterns that would come from paging through the dataset. Is `head(dataset)` or `print(dataset, width = Inf, n = 20)` the best way to see your data? What if you’re interested in seeing some random data from around the middle of the dataset, but sorted by time? To answer that question you need to write some code, and probably some fairly tricky code that will distract you from the initial motivation you had for looking at the dataset that way. This is exactly the kind of distraction that my project intends to address.
While I am focusing on exploratory data analysis as the primary use case, the solution I have in mind is certainly not limited to EDA. From data cleaning to validation and even model building and report generation, I believe it is possible to surface a lot of insights while hiding the code necessary to generate them. But what exactly do I have in mind for the solution space? Allow me to lead you through my own journey as a “casual” web developer, someone who has always had an interest in webdev but usually as a secondary function in my career.
Webdev and Me
In the “early days” of the web, the mid to late 90s, I built web pages1 for myself, friends, grad school colleagues and professors, and eventually for many small companies around Seattle and New York. Things were pretty wild for a while, with a lot of trial and error, and separation of concerns was not really an option since CSS was barely up to the task of handling common layouts. So we used table hacks and spacer GIFs. I remember my surprise and disappointment when showing off to a friend a painstakingly crafted design that looked great on my 1280x1024 monitor but looked like a broken mess on her 800x600 screen.
By the time I read Zeldman’s Designing with Web Standards in 2003, I was a frequent reader of A List Apart, I used my favorite clearfixes and Fahrner image replacement, I made rounded corners and drop shadows with GIFs in Photoshop, and I absolutely hated Internet Explorer. But my real focus even then was on what would become known as the back end, which at the time meant Perl scripts in cgi-bin. Around that time I connected Perl to SAS and was building dynamic websites that ran SAS procedures and returned tables formatted in Perl. I set up many a LAMP stack like this to power websites built around data. I didn’t do much with JavaScript, not least because it was such a hacky mess, but mainly because I felt that all the fun stuff was happening on the server, where the data and all the computational power lived. The web browser was just a “thin client” providing a window into the magical backend land.
Soon after Zeldman converted the web developer community to use standards, adopt CSS, and shame IE users, a little feature developed at Microsoft called XMLHttpRequest ripped the web wide open, and fearless developers started doing amazing things in the browser. The power and promise of AJAX was showcased by innovative sites like Writely and Google Maps. The Web 2.0 revolution was underway. Coincident with this, on the hardware side, desktop computers were advancing so quickly that suddenly there was an amazing amount of computing power sitting on or under everyone’s desks. In only a few years’ time, broadband replaced modems for home connections, a Windows version was released that didn’t require you to reboot after changing your IP address, and network speeds and latency became more and more stable. Suddenly the promise of all this spare computational power had an appealing outlet: using JavaScript to build websites that were good enough to replace desktop apps. A little thing called the iPhone came out which radically changed the web landscape again, but I’m afraid if I get into that it will take me even further afield than I already am.
Every few years I get the damn foolish idea in my head that I should do something cool with JavaScript. But every time I take a look at what it means to be a front-end developer, I run screaming back to the safe, comfortable confines of the back end. Over the years, whenever I needed to use JavaScript I would reach for a library like jQuery because it was convenient, well supported, easy to use, and had all the features I needed for simple things like “make this button select all checkboxes on the page”. I never cared to dig into it more deeply because for me those front-end tasks were secondary concerns compared to making the back end work. But at the same time I’ve always hated my ignorance of JavaScript, because even when trying to do something dead simple I had to look up countless topics on MDN and questions on Stack Overflow.
Twenty years ago I built this little toy animated matrix of table cells with `setTimeout` and `Math.random`. Back then I only designed it to handle 4 rows because any more than that would bring my 1999-era desktop to a crawl. The fact that it still works is not so much a sign that I can write good code as it is a testimony to the dedication to backwards compatibility of the ECMAScript standards committee and browser developers. It looks nothing like modern-era JavaScript because the language has undergone so many transformations over the last two decades. What began in 1995 as a hastily implemented scripting language to enable dynamic web pages in Netscape Navigator became, by 2013, the most popular programming language on Stack Overflow. Douglas Crockford’s 2008 book JavaScript: The Good Parts was a turning point in the maturation of JavaScript by focusing developer attention on the best aspects of the language that make it suitable for writing high quality code.
2019 seemed to finally be the year that front-end development shed its adolescent angst and grew up into a real environment that grown-ups can use. And by grown-ups I mean a guy who just a month ago put his first sticker on his MBP. ES6 works natively without transpilation, and it took me about 17 seconds to get `create-react-app` to Hello World. When I tried this in 2016, I followed the original JavaScript Stack from Scratch and basically gave up because it was a total cluster. But now that ES6 works without layers of indirection, the guts of Webpack are safely hidden away, imports and exports work nicely, classes exist, and nobody cares if I use npm instead of gulp, grunt, or yarn, I’m finding writing JavaScript to be rather fun.
A better way
Now I’ve found another reason to revisit the front end. Last year I started working a little bit with Julia, and while it certainly didn’t inspire me to immediately abandon R and Python, I was impressed by some of the tooling and can definitely see the promise. It’s a little funky and gives me Scala flashbacks, but because of its thoughtful design, Julia might in ten years’ time be a powerful ecosystem for data science. One of the interesting discoveries I made during this exercise was the DataVoyager interface to the Voyager data exploration tool, which spins up a nice browser interface with auto-generated plots to very quickly assist in the first stages of exploratory data analysis. Voyager is an impressive project, but it doesn’t solve the problem from the analyst’s perspective. The defaults are all wrong, the interface is clumsy, the styling feels way too much like a Microsoft product, and adapting it to work the way my brain wanted it to work was so distracting that I constantly lost my train of thought and couldn’t focus on the data. In sum, I was fighting with it so much that it failed to do its job of showing me what I needed to see in my datasets. But the idea is so tantalizing, and it’s a great start, except that it doesn’t go any further than plotting. So I decided to write my own front-end dataset exploration tool. With blackjack and… actually, forget the dataset exploration tool. Before we do that, we need a dplyr (or pandas if you prefer) for JavaScript. And before that, we need some basic statistical distribution functions.
I started writing a dataframe class, with basic tabulation and join methods. Mean, median, mode, no sweat. Need some random numbers like `rnorm`? Easy. OK, what about `qnorm`? Ouch, now I need to code Wichura’s algorithm. Looking ahead, I can see this will start to get painful really fast. Do I really want to write a full set of robust statistical functions in JavaScript simply to enable my visual exploration tool? I soon realized that instead of trying to recreate R and dplyr, it would be much easier to just use R and dplyr. I pivoted back to R. It was a breath of fresh air. What if I could let R do all the backend work and just focus on building a front end to enable all the ideas I’ve been having for better EDA?
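For a sense of what pulled me back, nearly everything I was about to laboriously reimplement is a one-liner in R, with dplyr covering the tabulation and joins (`x`, `df`, `a`, and `b` below are placeholders):

```r
# Statistical primitives, free with base R
rnorm(5)                         # random normal draws
qnorm(0.975)                     # normal quantile (Wichura's algorithm under the hood)
quantile(x, c(0.25, 0.5, 0.75))  # empirical quantiles
table(df$category)               # frequency tabulation

# Relational verbs, free with dplyr
dplyr::left_join(a, b, by = "id")
```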
For this idea to work, I’ll need the web browser talking to a running R process. And here’s where it starts to get interesting.
Web applications and R
Python has Django and Flask, Ruby has Rails, JavaScript has Express, PHP has, well, PHP, Julia has Genie, and even C has a web application framework called Kore. So what about R? Why doesn’t R have a popular web application framework?
There are several reasons why R as a platform for web application development hasn’t really taken off. I’ve often heard the criticism that R is slow and that’s why it hasn’t gained ground as a general purpose scripting language let alone a platform for web development. In my experience, this criticism misses the mark. I don’t find R to be inherently slower than any of the other high-level languages mentioned above. There are, however, some real differences with those languages that make R more challenging for serving web applications.
One glaring issue is that the R process is single-threaded. Efforts over the years to bring multithreading to R have met possibly insurmountable challenges due to the architecture of the R internals. Some of the best R core developers investigated many years ago what it would take to support threads in R, starting with Duncan Temple Lang in his PhD thesis in 1997. Luke Tierney discussed some of the issues in 2001 in Threading and GUI Issues for R. The topic continues to surface from time to time; more recently, Lukasz Bartnik described his research into parallelizing R in (A Very) Experimental Threading in R. But I think the final verdict was well summarized by Dirk Eddelbuettel in a 2012 answer to this Stack Overflow question: “The R (and before it, S) internals are single-threaded, and will almost surely remain single-threaded.”
Some confusion in this space arises from the fact that multithreading can be accomplished in R packages. Since R can link to code written in other languages that do support threading, it’s possible to write a library in C++ that manages its own threads, and R will then appear to be taking advantage of multiple threads. For example, Microsoft R Open (formerly Revolution R Open) links against multi-threaded BLAS/LAPACK libraries. Similarly, recent versions of httpuv use a background thread for I/O that communicates with the main R process. However, this approach doesn’t change the fact that the main R process itself is limited to a single thread of execution.2
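The distinction is easy to see from an R prompt: a “parallel” computation spins up separate forked processes, each with its own process ID, rather than threads inside the main interpreter (a small illustration using the parallel package; forking with mc.cores > 1 only works on Unix-alikes):

```r
library(parallel)

# The main R process has one PID...
Sys.getpid()

# ...and each worker reports a different PID: these are forked processes,
# not threads sharing the main R interpreter.
unlist(mclapply(1:4, function(i) Sys.getpid(), mc.cores = 4))
```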
You may be thinking, “why would anyone want multithreading in R anyway?” When it comes to serving web applications, multithreading is actually very important. Handling requests with lightweight threads instead of heavyweight processes will greatly improve the throughput of your application servers. Consider this common architecture for a high-traffic Django site: a load-balanced fleet of nginx servers handling static assets and proxying application requests to a set of Apache servers running mod_wsgi. If you configure Apache to use either the worker or event MPM, then each Apache process will manage a number of threads, each of which can handle its own request. If you use Apache’s prefork module, you will only be able to handle one request per process, and this will sharply limit the request throughput your application server can handle. On the Python side, configure mod_wsgi in daemon mode so that the Python interpreter has its own set of processes rather than being embedded in the Apache processes, and also enable multithreading so that each interpreter process can handle multiple Django requests concurrently. I’ve written more about how to accomplish this setup in Svelte Apache.
None of this multithreading would work with R. Even if you were to write a mod_wsgi interface between Apache and R, you would have to use prefork and you’d be limited to one request per R process. This to me is the fundamental limitation that prevents R from being a decent solution for serving web applications. Even more damning is the fact that R is an unrepentantly profligate consumer of memory. I don’t have hard data on this because it’s not even close: a vanilla R process without any data or third-party libraries loaded is already going to consume 10x the memory of a svelte Python interpreter process with all of Django.
A few years ago…
Back in 2016 it was my job to write an experiment reporting system at work. I named the system “Gosset”. Those who were there will know what this means! I had already done this twice before in my career at separate companies, the first time in SAS and Excel, and the second time in C# and SQL Server. This time, apart from the fact that we were using Redshift as our analytical database, I had carte blanche to develop the system using whatever tools I thought were most appropriate.
The first iteration was little more than a SQL query and an RMarkdown report. As the experiments started ramping up, I moved the reporting into Shiny Server. The prototype back-end was a cron job which called an R script to query Redshift. The query would aggregate the latest experiment sessions for each variant for each hour, including all the key events necessary to track conversion rates as well as the count, sum, and sum of squares for continuous variables. These would be merged into R data frames stored on the server. The UI in Shiny was a simple table of current and past experiments with links to request an RMarkdown report. This report contained all the tables and plots for t-tests, confidence intervals, cumulative conversion rates, and cumulative p-values over time. I spent a lot of time working with the Product and Engineering leads to design the report to make it easy for them to quickly read and understand everything they needed to know to evaluate an experiment’s performance. I also came up with some clever visualizations, like a confidence interval box plot, that I was very happy with and were all designed to communicate the most important insights about the experiment at a glance. The core of the report was about 800 lines of R.
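Incidentally, the reason count, sum, and sum of squares are enough is that a t-test only needs means and variances, and both can be reconstructed from those aggregates without touching row-level data. A simplified sketch of the idea (hypothetical column names, not the production query):

```r
library(dplyr)

# Hourly aggregates per experiment variant: all that's needed for a t-test later
hourly <- sessions %>%
  group_by(experiment, variant, hour) %>%
  summarise(
    n      = n(),
    sum_x  = sum(metric),
    sum_x2 = sum(metric^2),
    .groups = "drop"
  )

# After collapsing the hours:
#   mean = sum_x / n
#   var  = (sum_x2 - sum_x^2 / n) / (n - 1)
```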
This was a nice solution for a while, but as we scaled up even further, the limitations of Shiny became evident, and with them the need for a true web application framework became more clear. The t-test reports were particularly intensive: each report would run 5 or 6 t-tests depending on the chosen metrics suite, with 3 or 4 plots per test. A complete RMarkdown report would take between 10 and 20 seconds to generate. With Shiny, one backend R process handles all requests serially. If two users make a request simultaneously, one of them will need to wait (roughly 10 to 20 seconds) for the other report to finish before their request can begin. Often, the Shiny UI would time out before the backend R process completed. There is an `app_init_timeout` configuration directive, but it is only relevant at application startup; there is no way to adjust the amount of time Shiny will wait for R to return results before it abandons the connection. Worse, when this happens the backend R process doesn’t actually stop: it continues working (and consuming more memory) until it either completes or runs out of memory, and either way it has no way to return the results to the user.
Another limitation was the lack of caching. The raw data were aggregated every hour, so once a given report had been generated, the same report could have been served to other users in between the hourly incremental load jobs. However, Shiny has no built-in way to serve a previously generated report as a static file, nor to cache reports that have already been run. This means every time a report is requested it must be generated from scratch, even though the reports only change after the hourly incremental update process adds data to the data frames. Multiple users requesting the same report at the same time will all cause the system to run independent copies of the same report, which is highly wasteful. I considered pre-computing all the possible combinations of the reports, but because there were so many, even selecting a subset of the most requested reports wasn’t going to scale well enough to solve the problem.
As we ran more experiments, we also began to accumulate a lot of useful feature requests: standard and custom dimensions, date filters, authn and authz, user preferences, report filters, alternative report options (such as user-selectable timezones, plot selection, relative vs. absolute differences), an admin interface, SSL support, scheduled email delivery, redirects, logging, secondary metrics not included in the standard reports, and custom metrics based on a standardized metadata format for experiment events. Most of these would simply not be possible with Shiny, but would be possible with a more flexible web application framework.
Enter Django
I was able to solve all of the issues and add all of the features by moving the front end from Shiny Server to Django and using RDS instead of R data frames for the hourly aggregates from Redshift. However, we had one sticky issue to deal with. All of the code for the t-tests, power calculations, and my beautiful, well-polished plots was written in R, and I did not want to rewrite all of it in Python. Luckily, there is rpy2, which provides an embedded R environment inside the Python interpreter! With a thin wrapper, it was easy to use Django’s ORM to query RDS, convert to R data frames, call R functions, and return HTML tables and plots back to Django. I was in heaven: I had all the goodness of a proper web application framework and I didn’t have to throw away any of the R code at the core of the application.
There was just one tiny wrinkle: deployment. With R embedded in Python, any attempt to use mod_wsgi or Apache with multiple threads would cause the application to segfault. The only way to deploy was to use Apache in prefork mode and turn off multithreading in the mod_wsgi daemon processes. I would have to accept being stuck with one server process per request, just like it was 1998 again. With the big, bloated R runtime embedded in my formerly svelte Python processes, each process was very memory intensive. Fortunately, the site was strictly internal to our company, and the most users I would expect at any given time was somewhere around 20. In the end, using `MinSpareServers 5` and `MaxSpareServers 10` was good enough to handle the load. But I wouldn’t want to try that on a public website with any kind of significant traffic!
Here lies the dilemma facing web app developers who want to use R for the numerical work it is well suited for. You could use Python alternatives like pandas, NumPy, and scikit-learn, and even if matplotlib doesn’t look as good as ggplot2, it will probably look good enough. The alternative is to build out a very expensive fleet of application servers with very fat R processes handling one request at a time.
Web frameworks for R
The good news is that the R web application I have in mind will not need to be hosted on a public website, nor will it need to scale to thousands of concurrent requests. In fact, it will only ever need to serve exactly one user, connecting to an R process on localhost. This greatly simplifies the solution space by removing a lot of the really hard things about web development, particularly security. When I first had the epiphany of using R as the backend for my project, I reviewed the CRAN Task View: Web Technologies and Services, specifically the section on Web and Server Frameworks. If you’re interested in this space, I highly recommend reading up on each of them, if for no other reason than the historical context of what has been tried in the past.
In my mind, there were 3 leading alternatives I could pursue: Shiny, plumber, and Fiery. I’ve used Shiny a lot and despite the numerous frustrations I have with it as outlined above, if I thought I could get this job done with Shiny I’d choose it in a heartbeat. Shiny is very powerful for what it enables, it’s well supported, actively maintained, has a pro upgrade path, and is already pushing the boundaries of R’s integration with JavaScript in very interesting ways. However, I don’t think my project is a good fit for Shiny because I would often need to break out of its confines to get the kind of control necessary to build the visualizations that I have planned.
Plumber would be a suitable choice because it provides a simple mechanism for turning R functions into a REST API, and this really is the heart of what I need R to do. Taking this idea further, I’ve already thought about the possibility of using other backends instead of R. It should be possible, for example, to have a Python and pandas backend that provides the same interface to the data. I will admit, however, that the first version will be tightly coupled to R. I think it’s necessary at this stage to provide firm ground on which to begin, but ultimately a backend-agnostic solution will be in the cards. Given this end goal, plumber would be a good choice, but I’m not quite ready to go full REST just yet. To get started, I want as much of a traditional web application framework as I can possibly get: a request-response cycle, routing, and view functions.
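To give a flavor of why plumber is appealing: an R function becomes an HTTP endpoint with a couple of comment annotations. A minimal sketch, using a made-up `/summary` endpoint against mtcars:

```r
# api.R
#* Return summary statistics for one column of a dataset
#* @param column the column name to summarise
#* @get /summary
function(column = "mpg") {
  x <- mtcars[[column]]
  list(mean = mean(x), median = median(x), n_missing = sum(is.na(x)))
}
```

Running `plumber::plumb("api.R")$run(port = 8000)` then serves `GET /summary?column=mpg` as JSON.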
Fiery is built on top of httpuv, as is Shiny, but it takes a much less opinionated approach to application development and is very much geared toward the traditional MVC architecture. httpuv is, as advertised, a low-level framework providing “protocol support for handling HTTP and WebSocket requests directly from within R.” But that in itself is an incredibly powerful set of functionality that basically makes all R web development possible. At its core is the Rook specification developed by Jeffrey Horner and originally intended for the rApache module `mod_R`. A Rook application must implement a `call` function returning a list. Here is a Hello World web server running in R:
```r
library(httpuv)

runServer("0.0.0.0", 5000,
  list(
    call = function(req) {
      list(
        status = 200L,
        headers = list(
          'Content-Type' = 'text/html'
        ),
        body = "Hello world!"
      )
    }
  )
)
```
From that simple code, you could eventually reproduce Django for R but it would be quite a long slog. The author of Fiery, Thomas Lin Pedersen, has also created two packages that work as plug-ins in Fiery: reqres and routr. Together with Fiery, these packages provide the foundation of just about everything you need to make R web applications in a way that will be very familiar to those with experience in other frameworks like Django, Rails, or Express.
Using Fiery in my prototype has made development of the R backend move along very quickly.
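To give a sense of the shape of a Fiery app with routr and reqres, here is a rough sketch along the lines of my prototype; treat the details as approximate rather than canonical:

```r
library(fiery)
library(routr)

app <- Fire$new(host = "127.0.0.1", port = 8080)

# A route: the reqres response object is modified in place, and the handler
# returns TRUE to let processing continue
route <- Route$new()
route$add_handler("get", "/variables", function(request, response, keys, ...) {
  response$status <- 200L
  response$type   <- "text/plain"
  response$body   <- paste(names(mtcars), collapse = ", ")
  TRUE
})

router <- RouteStack$new()
router$add_route(route, "main")

app$attach(router)  # a RouteStack acts as a fiery plugin
app$ignite()        # start serving on localhost
```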
Back to the Front-end
As I’ve alluded to a few times, my choice for the front end is ReactJS. There is a very simple reason for using React in this project: it’s just so damn good. After working with React for a couple of months, I concluded that any other method of manipulating the DOM felt like amateur hour. Today, React is almost synonymous with front-end JavaScript development, and it is used throughout industry on projects large and small. Having also spent some time with the other popular frameworks (Angular, Vue, Backbone, and Ember), I was continually drawn back to React, despite the learning curve, because of the structured way it makes you think about the design of your application. Combined with the new features of the language brought about by ES6, the patterns for solving problems in React are crisp and elegant. Thanks to its explosive popularity, there are many high-quality component libraries like react-tabs and react-select, which make it easy to add well-engineered components to your own project and also provide a great source for learning best practices.
As soon as I started getting comfortable with my React components, I discovered and started reading about React Hooks. The author of The Road to Learn React has written an excellent introduction to hooks. Hooks are a new way of writing React components as plain functions that can still use state and other React features, without classes. Even in the prototypical stage of my application, I found that moving to hooks made the code much more concise and less bug-prone. It does take quite a bit of meditation to wrap your head around how hooks work, and I’m far from being a Zen master in that regard, but for new React projects it is highly recommended to start writing them with hooks.
At this point, I’m still uncomfortably far away from being able to showcase a public demo, but in the next post I’ll begin sharing some code and interesting solutions I’ve come up with along the way.
The nomenclature of “website” didn’t surface until about the mid-00s. And today, of course, these are all known as web applications.↩︎
The parallel package, now part of base R, provides functions like `mclapply` (and underpins packages such as doParallel) that appear to add multithreading to R. What is really happening, though, is that the main R process is forked, and each independent child process runs its share of the computation in parallel, communicating results back to the main process. Each of these processes is still limited to a single thread.↩︎