How not to tank a perfectly good power analysis

Power analysis is one of the most fundamental data science procedures and a prerequisite for anyone doing A/B testing. When reviewing the power analyses of junior data scientists, I usually look first at how they estimate the variance of their target metric. This is one of the trickier aspects of the process to get right, and it's usually where they spend most of their time. But in the haste to get the variance right, it's easy to overlook an even more critical piece of the estimate: the target population size, specifically, how many unique experiment assignments you can expect to accrue over time.
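To make the point concrete, here is a minimal sketch of the standard two-proportion sample-size calculation, followed by the step that is easy to overlook: translating the required sample size into a duration using the rate of unique assignments. The baseline rate, lift, and weekly assignment volume below are invented for illustration, not figures from any real experiment.

```python
import math
from statistics import NormalDist

def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-arm sample size to detect p1 vs p2 with a two-sided z-test.

    Standard normal-approximation formula, assuming equal allocation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Hypothetical experiment: 10% baseline conversion, hoping to detect 12%.
n_per_arm = two_proportion_sample_size(0.10, 0.12)
print(n_per_arm)  # 3839

# The variance work above is wasted if you ignore assignment volume:
# at a hypothetical 1,000 new unique users per arm per week, the
# required sample implies roughly a four-week experiment.
weeks_needed = n_per_arm / 1000
print(round(weeks_needed, 1))  # 3.8
```

Getting `n_per_arm` right is only half the estimate; the runway implied by `weeks_needed` is what actually determines whether the experiment is feasible.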

A Minimap for your data

Sublime Text has a handy feature called “Minimap” which shows a small, condensed rendering of your text file along the right-hand margin. It gives you a high-level view of the file you're working in, as if you had zoomed out far enough to see all the text on one screen, which helps when navigating very large files. I took some inspiration from this feature, thinking that if it's useful for text files, it may also be useful for datasets! I'm writing an R package for exploratory data analysis in the browser with React, and the Minimap is the first feature I'd like to showcase to demonstrate what's possible when a front-end web application powers data analysis. Please read A Front-end for EDA for a more detailed introduction to this project.

A Front-end for EDA

I am writing an R package for exploratory data analysis in the browser with ReactJS. In my last post I outlined my lofty ambitions for a graphical data analysis tool to make exploratory data analysis easier, describing the motivations that grew out of my struggles with the context switch between analysis and programming. In this post I will take a whirlwind tour of my thoughts and inspirations, starting with the front end, going deep into the back end, and then journeying back to the front end again. By the end I hope to have made the case that I'm not crazy and this thing might actually work.

Using window functions and cross joins to count events above a threshold

I haven’t written a SQL post since Generating post-hoc session ids in SQL. I don’t ordinarily think of SQL as a good candidate for blog posts because, to me, SQL is just boring. I do use it every day though, and I’ve certainly internalized a lot of handy tricks. Today I’d like to share one of those rare moments when I sat back and thought to myself, “wow, this query is beautiful!” The solution involved not one but two cross joins, plus a window function, to count the number of events occurring at or above each level of a score.
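Since the full query lives in the post itself, here is only a sketch of the window-function half of the pattern, run against SQLite (3.25+ is needed for window functions) with invented toy data: count events per score, then accumulate from the highest score downward so each row reports the number of events at or above that score.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (score INTEGER NOT NULL)")
conn.executemany("INSERT INTO events (score) VALUES (?)",
                 [(1,), (2,), (2,), (3,), (3,), (3,)])

# Aggregate to per-score counts, then let a window function running-sum
# those counts in descending score order: each score's total includes
# every higher score's events, i.e. events at or above that level.
query = """
WITH per_score AS (
    SELECT score, COUNT(*) AS n
    FROM events
    GROUP BY score
)
SELECT score,
       SUM(n) OVER (ORDER BY score DESC) AS n_at_or_above
FROM per_score
ORDER BY score
"""
counts = dict(conn.execute(query).fetchall())
print(counts)  # {1: 6, 2: 5, 3: 3}
```

The cross joins in the original query served a different purpose (filling in score levels not present in the data); the cumulative window trick above is the part that makes the "at or above" counting a single pass rather than one correlated subquery per threshold.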

Migrating to Gatsby

I have just completed a migration of this blog from Jekyll to Gatsby. Along the way I’ve learned a little more about React and Node while taking my blog out of the static-file-and-Bootstrap era and into the over-engineered modern JavaScript era. I still have much to learn, and I still need to re-style the site, but at least now I have a workflow that actually works. In this post I’ll discuss how I decided on Gatsby and a few of the specifics I encountered during the migration.

Data analysis at the speed of thought

I have a provocative question to ask of experienced and beginner data scientists alike, whether you are fully fluent with the syntax you use to analyze data or not quite comfortable with the command line. Do you think a graphical tool for exploratory data analysis could make you more productive? Would you consider using such a tool? What would you envision such a tool doing for your workflow? I’m developing an R package to help you analyze data at the speed of thought.

Linux From Scratch on EC2

I recently built an EC2 AMI based on Linux From Scratch and documented the process in a hint for the LFS project. Early last year, in a state of growing frustration with the evolution of mainstream Linux distributions, I wrote about Linux From Scratch and the benefits of having the freedom to carefully craft your operating system based on personal preferences. I am now more convinced than ever that we need the Linux From Scratch project to inspire momentum against the prevailing forces of uniformity, which threaten to remove the creative element from computing altogether. I’m no longer participating in the boring, cookie-cutter, committee-designed mainstream distro upgrade circus. From now on, it’s my distro, my rules. My goal is to run LFS systems, in production, everywhere. This post will explain my motivations in a bit more detail. You can read my EC2 hint here.

Full text search with Postgres and Django

Search is one of the most fundamental problems in computer science, occupying the first half of Volume 3 of Donald Knuth’s classic work The Art of Computer Programming, so it would seem fair to assume that by 2017 search would be mainly a done deal. But search is still hard. General-purpose algorithms may perform decently in the average case, but decent average-case performance may not scale well enough to meet user demands. Unless you’re building a search engine, you are probably not designing your data structures specifically to take advantage of search algorithms. When web application developers add a search feature, they are not going to evaluate algorithms with Big-O notation, and they’re not going to ask whether quicksort or merge sort would perform best on their specific dataset. Because search is hard, most sites either defer to whatever vanilla features are available in their chosen database and application framework, or outsource to a third-party library such as the Lucene-based document retrieval system Elasticsearch, or to a search-as-a-service provider like Algolia. While the cost in both time and money of integrating third-party libraries or services can be prohibitive for a small development team, the negative impact of providing a poor search interface cannot be overstated. In this post I will walk through the process of building decent search functionality for a small to medium sized website using Django and Postgres.

Legion of lobotomized unices

Over the past two decades, changes have been underway with profound consequences for both social organization and system design. Virtual machines, cloud computing, and containers are reducing the need for general-purpose multi-user systems and the stewardship that maintaining such systems requires. At the same time, although we live in an age in which we are more connected than ever, we are increasingly cut off from one another because the systems we use are isolated clients. The loss of centralized loci of computing has changed the way we work and communicate online, in many ways making collaboration more difficult and deepening our isolation by removing the shared spaces that brought us together in the past. Unix systems, which used to be the durable social centers of computing, have been replaced by a disposable legion of lobotomized unices.