And thus begins a new year. As we pass the quarter-century mark since Y2K, it is remarkable how much the world has changed in so few short years. In the time-honored tradition of making grand pronouncements at the arbitrary turning of a new calendar page, this post marks my attempt to return to this blog with semi-regularity: a discussion of things I was thinking about as last year wound down, the bright spots that motivate me for the year ahead, and a brief but necessary statement about the perils we are presently staring down.

There are many reasons for a math nerd to be excited about 2025. It is a perfect square, the last such year being 1936 and the next one being extremely unlikely to be witnessed by any readers of this blog. It is also the sum of the cubes of the first nine natural numbers: \(\sum_{i=1}^{9} i^3 = 2025\). What do these curiosities mean? Nothing. I am more focused on next year, which will be another prime number year for me.
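Both curiosities are in fact the same curiosity: by Nicomachus’ identity, the sum of the first n cubes equals the square of the nth triangular number. A quick sketch (in JavaScript, one of my working languages this year) checks it for n = 9:

```javascript
// Nicomachus' identity: 1^3 + 2^3 + ... + n^3 = (1 + 2 + ... + n)^2
const n = 9;
const naturals = Array.from({ length: n }, (_, i) => i + 1); // [1, 2, ..., 9]
const sumOfCubes = naturals.reduce((acc, k) => acc + k ** 3, 0);
const squareOfSum = (n * (n + 1) / 2) ** 2; // 45^2

console.log(sumOfCubes, squareOfSum); // 2025 2025
```

So 2025 being a perfect square and being the sum of the first nine cubes come as a package deal.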

To begin, since it has been such a long time since anything new appeared here, let’s take a whirlwind tour of the last few years.

Recap of the Post-Covid Era

In rose-colored retrospect, the early days of the Pandemic were a peaceful respite from the prior few years entrapped within the bonds of a soul-crushing company with little redemptive value. Suddenly there was no commute, no three-meals-at-the-office, no recriminations for not having responded to a Slack at 10pm, no boring social engagements, just family and a terabit internet connection. Apart from the fear of dying a horrific death amid the complete disintegration of society, it wasn’t so bad. I had been working for the absolute most poorly managed data science organization I have ever seen, and when the Great Reset happened, I chose severance and was relieved to walk away from the flaming chaos. Many of my opinions on the matter have already been written on Blind, where it really counts, but someday I will post here my full reflections, naming and shaming those who deserve the world to know how pathetically they tried to cling to their tiny fiefdoms.

Having vowed to never again work on terms and principles other than my own, the brave new world was rocky at times but turned out to be pivotal in many ways. Though resoundingly mid-career, I branched out into directions I had never thought were possible to pursue. I approached reading, learning, and studying with a renewed vigor that brought to mind the best parts of my memories of graduate school. I had the immense privilege of working with the NIH to build quality control pipelines for SARS-CoV-2 clinical trials. I got to work with an awesome group of ML consultants in exciting and challenging ML research for large brands that you’ve heard of. I spent afternoons walking along the Great Highway with my family. I slept in a lot more. I became a Minecraft Youtuber. I launched a stealth-mode startup. The latter two are intricately connected in ways my counsel advises I not yet make public, but I promise to share more once both of them are revealed to be either life-changing successes or colossal failures. And, I found a home in Ed Tech at an amazing company working with amazing people.

Zooming into the microcosm, the last waning weeks of 2024 were spent monitoring an ML model that was months in development and performing far better than I had expected for a v1. With the experiment performing well, I finished the documentation and outlined the roadmap for the next 6 months. Punctuated with some solid quality family time, it was a great way to end the year.

The emotion of being ‘done’, the fleeting slippery sense of accomplishment that cools too soon like a cup of coffee that you try to savor before it becomes tepid, has quickly given way to the frustratingly wide-open field that presents itself in reply to the inevitable question, “Well, what now?” That gaping maw of uncertainty is the primary motivation for this post and leads us thus into our next section.

Things I want to do this year

These are neither goals nor resolutions. I set for myself no lofty standards of achievement to avoid the inevitable disappointment that would result from my failure to execute them. Still, I want to record these various thoughts and ideas and begin a habit of holding myself somewhat accountable for translating the visions in my head into something real, practical and concrete.

Project VizDev

This is the current codename for my idea of unifying and simplifying a guided, visual approach to data analysis that I introduced in Data Analysis at the Speed of Thought and further expanded in A Front-end for EDA. The years since I first began working on this have not dulled my motivation or inspiration as I firmly believe this is a good idea and, if implemented well, could become a highly useful way of approaching data analysis for both the novice and the professional. For now, however, it lives in a private repo where it shall remain until such time as I gather enough courage to unleash it for public scrutiny.

It suffers from a profound lack of time, or rather of commitment on my part, to devote the vanishingly small portions of spare time I have to furthering personal projects. But this is why it is at the top of my list. It is high time I started turning the crank to make this a reality. The R development ecosystem is experiencing a renaissance. New and fascinating work is happening every day, pushing the boundaries of web integration and opening up possibilities to rethink or expand on traditional workflows. It is time to build the future.

And yet I must acknowledge my limitations. I am not a professional developer. I am a data scientist, and it shows in my code quality and engineering choices. The code I write is not enterprise-grade. And for that I make no apologies. When I write code, I want it to do the thing, and I want it to do the thing in the most straightforward way possible. At least that’s my excuse for writing the ugliest spaghetti you’ve ever seen!

But in all seriousness, I probably take too much time searching for the optimal solution when something simple and off-the-shelf would be perfectly serviceable. There are types of programming projects I can do in my sleep. I know what to reach for, I know the trade-offs, I can evaluate the landscape quickly and make a decision I can stick with. Then there are areas where I have very little intuition. For example, routing in React—or just about everything in React, tbh. That’s when I get really bogged down, consuming volumes of docs and blogs, trying out 14 different methods before finding what I think is The One True Path, which I then throw away in disgust three months later when it’s no longer the New Shiny.

My JavaScript is improving but I have by no means reached the point of fluency wherein ideas for what to do become readily expressible in code. Still, what drags me down the most is the framework problem. The more comfortable I become with vanilla JS, the less inclined I am to sign away all that I know, burying my instincts and subsuming my will to the imposing hand of the Framework. I know, I know, I’ve tried to do without them, but for anything more than a toy or a one-off single-purpose task, it always eventually moves in the direction of either using a framework or writing one. Now I could take the Monaco way of doing everything with native APIs, but this would have me re-inventing half of Monaco, or most of React, and I’m not the guy for it—let alone the team of how many hundreds?

React is, for better or worse, the framework (or library, whatever you want to call it) that best fits the UI patterns I am encountering in this project. My project is, in many ways, just an ordinary web app with lots of ordinary web app things that need to happen. Cue the fun I just had trying to upgrade to React 19. I have so much more learning to do before I can truly be productive; more on that in a section below.

How to Analyze Data

Here is another long dormant project that bridges the gap between learning and teaching. I think I am pretty good at what I do: Steady, reliable… punctual, and unlike Picard, I rather enjoy my life of statistical analysis. I sometimes get to guest lecture for Intro Stats and Methods classes and I always find the experience immensely satisfying. It was but for a tiny twist of fate—namely, not having been born rich—that I left academia and I am often wistful for the alternate reality that may have been. This project is about recapturing some of that magic, and sharing my own perspective on data analysis with the “next generation”.

The end product of what I want to accomplish here is not fully settled. On one level, it may simply be a series of blog posts and a repository of accompanying code. A first foray is my Exploratory Analysis of the 2020 ANES dataset with code and R Markdown files saved in the HTAD repository. I also have in mind a tutorial series on many interesting research questions that can be answered with the American Community Survey, which I’ve been using since its inception and is made highly accessible by the IPUMS Project.

I find survey data to be an extremely rich source of fascinating and important research problems (perhaps my bias as a sociologist) and clean datasets like those available through IPUMS and ANES make them very approachable as teaching tools since data preparation is minimal. However, the mechanics of complex survey designs require a fairly rigorous treatment of variance in order to do inference properly, and it’s difficult to include that in introductory material. I’d like to incorporate additional datasets that are high quality, freely available, relatively clean, and of course, interesting. Ideally I’d like to cover a range of subject matter, both to broaden the appeal and to expose students to the different types of data they may encounter in their life or work, so that the skills they learn will transfer to those data. If you have any ideas for datasets that would make good teaching tools, I’d love to hear from you in this thread.

My main intention with this project is not to make the datasets the focus of the analysis, but rather to use the data illustratively, with the main goal being to teach an approach to data analysis. So the data and the research questions I want to explore are really just a means to an end. Of course I want the data to be interesting, and for the research questions to be engaging enough that students are genuinely interested in solving them and motivated to improve their analytical skills.

Without question, I will primarily be using R. While the goal will not be specifically to teach R, the style of analysis enabled by the Tidyverse maps so easily to the analytical workflow that the syntax will not distract students from the process.

Ideally, to supplement the traditional text-based presentation of blog posts + R Markdown, I would like to develop this as a series of video tutorials. The medium of video makes showing things a whole lot easier and brings with it a lot more potential for visual explanation compared to static text, or even fancy interactive JavaScript. This is very exciting for me. I’ve made a lot of Minecraft videos, and now I’d like to do the same in an area in which I’m slightly more qualified.

Tools: past, present, and future

When I first started this blog I was using UltraEdit as my text editor and was convinced I would never need another one. That was just a few years after I had switched from Vim, which I was equally convinced would be the last editor I would ever use. It was also just a few years before I found Sublime Text, and for the next ten years I was again certain that it would be my forever editor. But last year I moved almost everything to VSCode as my daily driver. And I am absolutely positively 100% convinced that this is the last time.

What does this illustrate and why does it matter? Thinking about tools and preferences for tools leads to some interesting lines of thought. Twenty years ago I was certain that logistic regression was the best all-purpose classifier. Ten years ago it was random forest. Today it is XGBoost. It would be silly to think that what we are using today as tools will always be the best choice to use in the future. It is also silly to think that when new tools appear on the horizon this immediately condemns into obsolescence the tools of yesterday. I still use vim and I still use logistic regression. They still have their roles and rival their more modern counterparts with their own unique strengths. Vim is clutch when inspecting files on remote servers. Logistic regression is still the best way of quantifying the relationship between predictors and outcomes, and of comparing the relative impact of features in a model.

Developing an affinity for one’s tools is, I suspect, a natural consequence of a devoted effort to develop one’s expertise. There is nothing wrong in that unless it begins to stray into the territory of fetishism. I loooooove logistic regression. I can invert the Hessian in my head all day long. But just because I know it and am comfortable with it, I don’t want to stubbornly cling to it and refuse to consider alternatives. It’s important to continually study, learn, evaluate, and experiment with tools. Doing this can be frustrating, causing you to doubt all that you think you know, but it is also immensely rewarding. It will help you grow, it will improve the quality of your work, and it will give you a deeper appreciation for and intuition about what makes a good tool. You will also see the deeper connections between tools, how they are adapted and evolve to meet the needs of the day and to address the limitations of the previous generation. This is just one beautiful slice of the history of science. But yes, logistic regression will always be my favorite little hammer.

What new tools are you experimenting with lately? Personally I’ve been a bit slow on the uptake here but I have recently been trying out Copilot in a few different contexts. My impressions seem to validate much of what I have heard from others: most of the time it’s wrong, but helpful nonetheless. When writing the introduction to this post, I asked Copilot in an R session to “print the sum of the cubes of the first 9 natural numbers”, and it almost got it right:

# print the sum of the cubes of the first 9 natural numbers
sum(1:9^3)

Almost, because in R the ^ operator binds more tightly than :, so sum(1:9^3) sums the integers from 1 through 729; the correct call is sum((1:9)^3). It was still helpful because, silly me, I was at first thinking I’d need to use lapply somehow. So yes, the rough edges of these generative tools should not be heralded as evidence of their inadequacy. The suggested completions can be noisy, but they are particularly useful in providing hints where I would otherwise have very little intuition, notably with React. It’s also a handy shortcut for times when I know what to do, like setInterval() or addEventListener() and just want the boilerplate to quickly pop up without having to copy it from MDN.
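For the curious, the kind of boilerplate I mean looks roughly like this. A minimal sketch, using EventTarget as a stand-in for a DOM element so it runs outside a browser too (the event name and handler are mine, purely illustrative):

```javascript
// addEventListener boilerplate: wire a handler to an event target
const button = new EventTarget(); // stand-in for a DOM element
let clicks = 0;
button.addEventListener("click", () => { clicks += 1; });
button.dispatchEvent(new Event("click")); // listeners fire synchronously

// setInterval boilerplate: start a timer, remember to clean it up
const timer = setInterval(() => console.log("tick"), 1000);
clearInterval(timer); // cleared before the first tick ever fires
```

Nothing deep here, which is exactly the point: this is the stuff I’m happy to let an assistant stub out.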

More learning, more teaching

If the past few years post-Covid have shown me anything, it’s that learning and teaching are the two most fundamentally important things we can do as humans. I’m getting old, goddammit, and I’m quite sure I have a few useful things to share. At the same time, I refuse to stop challenging myself, or to think that my way is the best way, that I’ve got it all figured out. It’s a trap that one can easily fall into at any age, and the solution at any age is the same: never stop learning.

One of the hardest things to learn how to do well is how to teach. I thought I had this figured out in grad school, but I’m sure I was very sloppy and my best students learned in spite of my teaching, not because of it. One of the most fun aspects of being a dad is trying to teach your kid something you think you know. This is where ELI5 really matters! As my son has grown, the topics have become more complex, and so too has the challenge of how to teach.

Now he’s at the age where he wants to learn coding, and we’ve had some great fun working through simple JavaScript games like Snake and Minesweeper. When Advent of Code began in December, I thought this would be a fantastic opportunity to teach him how to use R to solve interesting puzzles. That was not my best choice. When the solutions look like this, it’s just not the right venue for teaching basic code concepts!

In my own code—remember I’m a data scientist and that’s my excuse—I am really trying to unlearn an approach to programming that is becoming too rigid. I tend to be imperative, procedural, synchronous when I need to be declarative, functional, asynchronous. This is a big paradigm shift for me. As simple as it is to write those words, to understand them and take them to heart is another matter entirely. This is one reason why React has been so hard for me, and also why I keep at it.
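A trivial sketch of the shift I mean, in JavaScript; the task here (squaring the even numbers) is a made-up example, not from any real project:

```javascript
const xs = [1, 2, 3, 4, 5, 6];

// My imperative habit: loop, branch, and mutate an accumulator
const imperative = [];
for (let i = 0; i < xs.length; i++) {
  if (xs[i] % 2 === 0) {
    imperative.push(xs[i] ** 2);
  }
}

// The declarative version describes the result rather than the steps
const declarative = xs.filter((x) => x % 2 === 0).map((x) => x ** 2);

console.log(declarative); // [ 4, 16, 36 ]
```

Trivial in isolation, but scale the habit up across async data fetching and UI state, and the difference between describing steps and describing results is most of what makes React feel foreign to me.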

There are other things I want to teach, mainly motivated by a desire to understand them better, which brings us back to the topic of tools. I have given presentations on logistic regression and XGBoost that were very well received. I have also taught SQL a number of times and still receive the occasional comment on the couple of SQL posts I have here on my blog. I want to do more with both of those: revisit the presentations, and work on a more formal SQL course.

I think there’s no better way to start than to reinforce a habit of writing. It’s not something I’ve been holding myself to outside of work—hence the dearth of material in this blog—and I need to change that.

Since we are advancing headlong into a distressing time, I must leave you with some words on my position on these events and my sincere hope that the arc of history will be restored to its path toward justice.

Caution and Hope

Despite the cute arithmetical trifles of the number 2025, this new year brings with it a heavy mess of unnerving historical baggage that will seek to break our strength. A vast dome of willful ignorance has descended upon half the population. They blindly worship a deceitful charlatan who has stumbled onto the recipe for a dangerous neurotoxin, unleashing its poison in whimpering tantrums of lies and blame, seducing its victims through soothing promises of vindication and cheap bacon. Then there is the insecure tech titan beset by drug addictions and a crippling lack of self-awareness, who has wrested control of a popular communications platform and twisted it into his own personal armpit of grievance dissemination. The other titans, fearful of losing their gilded spoils, are falling in line, feeding at the trough, suckling at the teat, and obsequiously ushering in a new era that makes the worst 1980s dystopian science fiction look like the more attractive option. Every day brings more examples of how the sick and depraved reveal themselves without shame or remorse to have gleefully traded their humanity for one small shot at soothing their deeply held sense of worthlessness and inferiority.

To defeat them we must simply not allow them to infect us with their rage. What they desire above all else is for us to capitulate to the same base instincts that have enslaved their defeated will, wrapped them in a choke-hold of self-loathing they cannot escape, to bring us writhing down to their own level of hell. I am not suggesting that we ignore them, the danger is too great. To survive we must channel the hatred they spew into positive acts of kindness and compassion for our families and communities. Tune in to the awareness that you are part of something infinite and eternal. Resist the impulse to lash out, because that is what they want. Reverse the polarity of their hate and fear. Transform the energy of destruction into frameworks of creation. Doing this will deprive them of the only satisfaction their sad, scared, wretched shadows have left to feel. We will prevail, and they will wither away into the abyss, their last gasps of regret evaporating amid the vast winds of timeless change.