Obfuscating your Primary Keys

One of the most common follow-up feature requests I get from clients after building a web application is to make URLs containing model IDs more appealing. Nothing screams “we are amateurs!” like inviting a new user to your website and redirecting them to /users/4/, sending a customer a billing invoice at /invoices/12/, or linking to a page on your new photo-sharing site at /photos/31/. This problem affects any database-backed web application that surfaces permalinks to content addressed by a primary key embedded in the URL. It can be anything with a “Detail Page” view: users, products, orders, comments, photos, profiles, posts, and so on. In this post I share my research into this question, with a discussion of the options you might consider depending on your use case.

How to Load IPUMS Datasets into a Relational Database

Ten years ago I wrote a hacky Python script to read the metadata from IPUMS extracts in order to load the datasets into a relational database system. I’ve personally reached for this script at least once a year since I wrote it: every time new IPUMS datasets are released, and often throughout the year when I need additional data for some supplemental analysis. It’s a little clunky, requires Python 2, and needs a lot of extra space to uncompress the raw data files, but it still gets the job done. Today, however, there is a better way, thanks mainly to the IPUMS team’s development of the ipumsr package. An RDBMS such as PostgreSQL is an ideal solution for storing Census microdata, and in this post I will share an updated method that simplifies moving the data into the database.
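
As a taste of the updated method, here is a minimal sketch of the pattern, assuming hypothetical extract file names and a local PostgreSQL database: ipumsr reads the DDI codebook and the compressed data file, and DBI hands the result to the database.

```r
library(ipumsr)
library(DBI)

# The DDI codebook carries the metadata my old Python script parsed by hand.
# File names here are hypothetical placeholders for a real extract.
ddi <- read_ipums_ddi("usa_00001.xml")
dat <- read_ipums_micro(ddi, data_file = "usa_00001.dat.gz")

# Write the microdata into PostgreSQL; zap_labels() drops the IPUMS value
# labels so the driver sees plain atomic vectors.
con <- dbConnect(RPostgres::Postgres(), dbname = "ipums")
dbWriteTable(con, "usa_extract", haven::zap_labels(dat))
dbDisconnect(con)
```

Note that ipumsr reads the gzipped data file directly, so there is no need to uncompress the raw extract first.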

Exploratory Analysis with the 2020 American National Election Studies

I have been working with the 2020 ANES dataset and would like to share my report on Exploratory Analysis with the 2020 ANES Pre-election Survey. This report may be useful to three different audiences: 1) researchers seeking to use the ANES dataset in their work but feeling intimidated by its massive structure and the dearth of resources on the web for using R with survey data, 2) R programmers interested in general-purpose tabulation and data visualization, and 3) anyone with an interest in the differences between Biden voters and Trump voters in the months ahead of the 2020 election. The report is largely a tab deck, but it also includes some of my introductory notes and a smattering of thoughts about how to approach some of the variables. All the code to generate the report is available in the HTAD repository on GitHub.

Stasis - a simple static site generator

After a short tenure as the generator for this website, I’ve retired Gatsby and replaced it with Stasis, a small and simple static site generator I wrote in Python. At the risk of spending more time discussing how I generate my blog than actually contributing real content, in this post I will describe the frustrations I encountered with Gatsby, the motivations for writing my own generator, and some of the things I learned along the way. I will also try to make the case that you too should write your own static site generator. It’s not a hard problem, but it will exercise your engineering skills in a few ways.

How not to tank a perfectly good power analysis

Power analysis is one of the most fundamental data science procedures and a prerequisite for anyone doing A/B testing. When reviewing the power analyses of junior data scientists, I usually look first at how they estimate the variance of their target metric. This is one of the trickier aspects of the process to get right, and it’s usually where they spend most of their time. But in the haste to get the variance right, it’s easy to overlook an even more critical piece of the estimate: the target population size; specifically, how many unique experiment assignments you can expect over time.
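
To make that concrete, here is a small sketch using base R’s power.t.test; the effect size and the weekly traffic figure are invented for illustration.

```r
# Per-group sample size required to detect the target effect.
pw <- power.t.test(delta = 0.5, sd = 5, sig.level = 0.05, power = 0.8)
n_per_group <- ceiling(pw$n)

# The easily overlooked piece: unique assignments accrue over time, and a
# returning user does not count twice. Suppose (hypothetically) the
# experiment surface sees 1,200 new unique users per week, split evenly
# across two groups.
weekly_unique <- 1200
weeks_needed <- ceiling(2 * n_per_group / weekly_unique)
weeks_needed
```

If the real stream of unique assignments turns out smaller than assumed, the experiment runs far longer than planned, no matter how carefully the variance was estimated.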

A Minimap for your data

Sublime Text has a handy feature called “Minimap” which shows a small, condensed version of your text file along the right-hand margin. This gives you a high-level view of the file you’re working in, as if you had zoomed out so far that all the text fit on one screen, which helps when navigating very large files. I took some inspiration from this feature, thinking that if it can be useful for text files, it may also be useful for datasets! I’m writing an R package for exploratory data analysis in the browser with React, and the Minimap is the first feature I’d like to showcase to demonstrate what’s possible by leveraging a front-end web application to power data analysis. Please read A Front-end for EDA for a more detailed introduction to this project.
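
As a rough illustration of the concept (not the package’s actual rendering, which happens in React), you can fake a minimap of a numeric data frame in a few lines of base R by drawing one pixel per cell:

```r
# Convert each column to ranks so columns on different scales share one
# color range, then draw the matrix as an image: one pixel per cell,
# with each data frame column rendered as a vertical strip.
df <- mtcars  # any all-numeric data frame
m <- apply(df, 2, rank)
image(t(m[nrow(m):1, ]), axes = FALSE, main = "data minimap")
```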

A Front-end for EDA

I am writing an R package for exploratory data analysis in the browser with ReactJS. In my last post I outlined my lofty ambitions for a graphical data analysis tool to make exploratory data analysis easier, describing the motivations I had after struggling with the context switch between analysis and programming. In this post I will take a whirlwind tour of my thoughts and inspirations, starting with the front end, going deep into the back end, and then journeying back to the front end again. By the end I hope to have made the case that I’m not crazy and this thing might actually work.

Using window functions and cross joins to count events above a threshold

I haven’t written a SQL post since Generating post-hoc session ids in SQL. I don’t ordinarily think of SQL as a good candidate for blog posts because, to me, SQL is just boring. I do use it every day, though, and I’ve certainly internalized a lot of handy tricks. Today I’d like to share one of those rare moments when I sat back and thought to myself, “wow, this query is beautiful!” The solution involved not one but two cross joins, plus a window function, to count the number of events occurring at or above each level of a score.
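
The post itself works in SQL, but the shape of the trick translates; as a rough R analogue, merge() with by = NULL plays the role of a cross join, pairing every score level with every event before counting:

```r
# Hypothetical events, each with a score.
events <- data.frame(id = 1:6, score = c(10, 30, 30, 50, 70, 90))

# The distinct score levels act as one side of the cross join.
levels <- data.frame(level = sort(unique(events$score)))

# merge() with by = NULL is a cross join: every level paired with every event.
pairs <- merge(levels, events, by = NULL)

# Count the events occurring at or above each level.
aggregate(id ~ level, data = pairs[pairs$score >= pairs$level, ], FUN = length)
```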

Migrating to Gatsby

I have just completed a migration of this blog from Jekyll to Gatsby. Along the way I’ve learned a little more about React and Node while taking my blog out of the flat-static-and-Bootstrap era and into the over-engineered modern JavaScript era. I still have much to learn, and I still need to re-style the site, but at least now I have a workflow that actually works. In this post I’ll discuss how I decided on Gatsby and a few of the specifics I encountered during the migration.

Data analysis at the speed of thought

I have a provocative question to ask of experienced and beginner data scientists alike, whether you are fully fluent in the syntax you use to analyze data or not quite comfortable with the command line. Do you think a graphical tool for exploratory data analysis could make you more productive? Would you consider using such a tool? What would you envision such a tool doing for your workflow?

I’m developing an R package to help you analyze data at the speed of thought.
