Happy New Year 2025
And thus begins a new year. As we pass the quarter-century mark since Y2K, it is remarkable how much the world has changed in so few years. In the time-honored tradition of making grand pronouncements in reaction to the arbitrary turning of a calendar page, this post marks my attempt to return to this blog with semi-regularity. It covers the things I was thinking about as last year wound down, the bright spots that motivate me for the upcoming year, and a brief, disturbing, yet necessary statement about the perils we are presently staring down.
Obfuscating your Primary Keys
One of the most common follow-up feature requests I get from clients after building a web application is to make URLs containing model ids more appealing. Nothing screams “we are amateurs!” like inviting a new user to your website and redirecting them to /users/4/, sending a customer a billing invoice at /invoices/12/, or linking to a page on your new photo-sharing site at /photos/31/. This is a common problem for database-backed web applications that need to surface permalinks to content accessed by a primary key embedded in the URL. This can be anything with a “Detail Page” view: users, products, orders, comments, photos, profiles, posts, etc. In this post I would like to share my own research into this question and discuss the options you might consider depending on your use case.
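To make the problem concrete, here is a minimal sketch of one possible approach, which is not necessarily among the options the post covers: scrambling an integer key with a reversible multiplication modulo 2**32. The constant and the function names are assumptions chosen for illustration, and `pow` with a negative exponent needs Python 3.8 or later.

```python
MODULUS = 2 ** 32
# Any odd multiplier is invertible modulo a power of two; this value is illustrative.
MULTIPLIER = 2654435761
# Modular inverse, so that (MULTIPLIER * INVERSE) % MODULUS == 1.
INVERSE = pow(MULTIPLIER, -1, MODULUS)


def obfuscate(pk: int) -> int:
    """Map a primary key to a scrambled integer suitable for a URL."""
    if not 0 <= pk < MODULUS:
        raise ValueError("primary key out of range")
    return (pk * MULTIPLIER) % MODULUS


def reveal(token: int) -> int:
    """Recover the original primary key from a scrambled integer."""
    return (token * INVERSE) % MODULUS


if __name__ == "__main__":
    token = obfuscate(4)
    print(f"/users/{token}/ maps back to id {reveal(token)}")
```

This only hides the sequence from casual view. It is not encryption, so it should not stand in for access control.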
How to Load IPUMS Datasets into a Relational Database
Ten years ago I wrote a hacky Python script to read the metadata from IPUMS extracts in order to load the datasets into a relational database system. I’ve personally reached for this script at least once a year since I wrote it: every time new IPUMS datasets are released, and often throughout the year when I need additional data for some supplemental analysis. It’s a little clunky, requires Python 2, and needs a lot of extra space to uncompress the raw data files, but it still gets the job done. Today there is a better way, thanks mainly to the IPUMS project and its development of the ipumsr package. An RDBMS such as PostgreSQL is an ideal solution for storing Census microdata, and in this post I will share an updated method that simplifies moving the data into the database.
Exploratory Analysis with the 2020 American National Election Studies
I have been working with the 2020 ANES dataset and would like to share my report on Exploratory Analysis with the 2020 ANES Pre-election Survey. This report may be useful for three different audiences: 1) researchers seeking to use the ANES dataset in their work but feeling intimidated by its massive structure and the dearth of resources available on the web for using R with survey data, 2) R programmers interested in general purpose tabulation and data visualization, and 3) anyone with an interest in the differences between Biden voters and Trump voters in the months ahead of the 2020 election. This report is largely a tab deck, but also includes some of my introductory notes and a smattering of thoughts about how to approach some of the variables. All the code to generate the report is available in the HTAD repository on GitHub.
Stasis - a simple static site generator
After a short tenure as the generator for this website, I’ve retired Gatsby and replaced it with Stasis, a small and simple static site generator written by me in Python. At the risk of spending more time discussing how I generate my blog rather than actually contributing real content, in this post I will describe the frustrations I encountered with Gatsby, the motivations I had for writing my own generator, and some of the things I learned along the way. I will also try to make the case that you too should write your own static site generator. It’s not a hard problem, but it will help exercise your engineering skills in a few ways.
How not to tank a perfectly good power analysis
Power analysis is one of the most fundamental data science procedures and a prerequisite for anyone doing A/B testing. When reviewing the power analyses of junior data scientists I usually look first at how they estimate the variance of their target metric. This is one of the trickier aspects of the process to get right, and it’s usually where they spend most of their time. In our haste to get the variance right, it’s easy to overlook another even more critical piece of the estimate: your target population size, specifically, how many unique experiment assignments you will expect over time.
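As a rough sketch of the population-size check described above, assume a two-arm test on a mean with the usual normal approximation. The function names and the default alpha and power are illustrative assumptions, not taken from the post.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(sigma: float, mde: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per arm for a two-sided, two-sample test of means.

    sigma is the standard deviation of the metric and mde is the minimum
    detectable effect, both in the metric's own units.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)


def has_enough_assignments(n_per_arm: int,
                           expected_unique_assignments: int,
                           arms: int = 2) -> bool:
    """Check whether the expected number of unique assignments covers all arms."""
    return expected_unique_assignments >= n_per_arm * arms


if __name__ == "__main__":
    n = required_n_per_arm(sigma=1.0, mde=0.1)
    print(n, has_enough_assignments(n, expected_unique_assignments=3000))
```

The second function is where the population estimate matters: the count that goes in should be unique assignments over the experiment window, not total visits.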
A Minimap for your data
Sublime Text has a handy feature called “Minimap” which shows a small condensed version of your text file along the right-hand margin. This gives you a high-level view of the file you’re working in and what the file looks like if zoomed out so far that you could see all the text on one screen. This can help in navigating around very large files. I took some inspiration from this feature, thinking that if it can be useful for text files, it may also be useful for datasets! I’m writing an R package for exploratory data analysis in the browser with React, and the Minimap is the first feature that I’d like to showcase to demonstrate what’s possible by leveraging a front-end web application to power data analysis. Please read A Front-end for EDA for a more detailed introduction to this project.
A Front-end for EDA
I am writing an R package for exploratory data analysis in the browser with ReactJS. In my last post I outlined my lofty ambitions for writing a graphical data analysis tool to make exploratory data analysis easier, describing the motivations I had after struggling with the context-switch between analysis and programming. In this post I will go on a whirlwind tour of my thoughts and inspirations that will start with the front end, go deep into the back end, and then journey back to the front end again. By the end I hope to have made the case that I’m not crazy and this thing might actually work.
Using window functions and cross joins to count events above a threshold
I haven’t written a SQL post since Generating post-hoc session ids in SQL. I don’t ordinarily think of SQL as a good candidate for blog posts because to me SQL is just boring. I do use it every day though, and I’ve certainly internalized a lot of handy tricks. Today I’d like to share one of those rare moments when I sat back and thought to myself, “wow this query is beautiful!” The solution involved using not just one, but two cross joins, and a window function to count the number of events occurring at or above each level of a score.
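For a flavor of the technique, here is a minimal sketch run through Python's sqlite3 module. It is not the query from the post: the events table, the 0 to 99 score range, and the use of a single cross join to generate the levels are all assumptions made for illustration. Window functions require SQLite 3.25 or later.

```python
import sqlite3

QUERY = """
WITH digits(d) AS (
    VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)
),
levels AS (
    -- Cross join the digits with themselves to generate every level 0..99.
    SELECT tens.d * 10 + ones.d AS level
    FROM digits AS tens CROSS JOIN digits AS ones
),
per_level AS (
    -- Count events at exactly each level, keeping levels with no events.
    SELECT l.level AS level, COUNT(e.score) AS n
    FROM levels AS l LEFT JOIN events AS e ON e.score = l.level
    GROUP BY l.level
)
-- Running total from the highest level down gives "at or above".
SELECT level, SUM(n) OVER (ORDER BY level DESC) AS at_or_above
FROM per_level
ORDER BY level
"""


def counts_at_or_above(scores):
    """Return {level: number of events with score >= level} for levels 0..99."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (score INTEGER)")
        conn.executemany("INSERT INTO events VALUES (?)", [(s,) for s in scores])
        return dict(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    result = counts_at_or_above([5, 5, 50, 99])
    print(result[0], result[6], result[51])
```

Generating the full list of levels first means a level with no events still gets a row, which a plain GROUP BY on the events table would drop.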
Migrating to Gatsby
I have just completed a migration of this blog from Jekyll to Gatsby. Along the way I’ve learned a little bit more about React and Node while taking my blog out of the flat static and Bootstrap era and into the over-engineered modern JavaScript era. I still have much to learn, and I still need to re-style the site, but at least now I have a workflow that actually works. In this post I’ll discuss how I decided on Gatsby and a few of the specifics I encountered during the migration.