Bayesian birders

After spending too much time in front of screens for most of 2020, I’ve tried to make space for more screenless activities that are pandemic-compatible. Evelyn and I got into birding late last year, and have been doing it pretty actively for most of 2021. We’ve been keeping track of our birding adventures on Instagram @OurBigYear. Besides the physical and mental health benefits of birding, it’s also an activity that suits data scientists.

Projections for the 2020 census

More of my work on Census 2020 got mentioned in the Washington Post! The article is about how the current wildfires and hurricanes affecting some states will affect the ability of enumerators to carry out non-response follow-up (NRFU). This compounds the threat posed by the Trump administration cutting short the window for NRFU. The window is supposed to end September 30 (but there is a lawsuit going back and forth in the courts).

Coronavirus and the 2020 census

It’s been a while since I posted anything, because I’ve been stretched thin at my job (on top of trying to survive a pandemic). In my work at Civis Analytics, I’ve been monitoring the progress of the census, which determines how federal funding and congressional seats are apportioned to the different states. In the best of times, the census is a challenging logistical operation to execute well.

How to interpret statistical models

In my work as a data scientist at Civis Analytics, I often have to explain the nuances of statistical models to clients without technical backgrounds. With the recent Covid-19 pandemic, there has been a proliferation of epidemiological data and models being discussed online. Since public sector agencies need to make sense of these models, I wrote a short explainer (published by CivStart) for public sector officials on how to interpret Covid-19 models they might read about.

Revisiting The Office: A text analysis

I’m casually rewatching The Office as my “I need something uninvolved to half-occupy my brain for 20 minutes” show. It’s been almost seven years since the last episode aired, but I still quote it with my closest friends and family – such was its personal and cultural significance. It’s aged like high school: I look back with fondness at the familiar jokes, but there’s also a sense that it was a time we have outgrown.

W(hat's the)MATA: DC Metro issues

I moved to the DC Metro area in the middle of the months-long Metro shutdown, giving me an immediate taste of the region’s special relationship with the Washington Metropolitan Area Transit Authority (WMATA). Having lived in cities with varying degrees of public transit options, I’d love to tackle a long-term project comparing public transit across cities. A while back, I found a public dataset containing reported Metro issues from April 2012 to November 2016.

Fantasy vs. reality: NFL player fantasy value in the 2019 season

Since writing my post analyzing real-world NFL player value, I’ve been waiting for a couple of Fantasy Football leagues I’m in to complete drafting before posting about fantasy value. (Wouldn’t want my opponents to think I’m putting more thought into my team than I actually am.) In that post, I visualized player value – roughly, the ratio of each player’s Defense-adjusted Yards Above Replacement (DYAR) to their cap hit – for the 2018 season.

Real world: NFL player value in the 2018 season

I learned to love American football through Fantasy Football, which I imagine is a pretty common gateway (especially for those with a quantitative bent). However, watching football through the lens of fantasy sports can distort one’s view of the game. One potential distortion relates to how we understand each player’s contribution to a team’s performance. In fantasy football, a top-tier running back (RB) is way more valuable than a top-tier quarterback (QB).

Dem Debate 1: Topic modeling

The second Democratic debate is almost here, so I wanted to follow up the sentiment analysis I performed on the first debate by looking at the topics that were talked about across the two nights. In natural language processing, a topic model is a statistical model that can be used to infer the underlying meanings (i.e. topics) of utterances based on the words being used. In this post, I use Latent Dirichlet Allocation (LDA) from the topicmodels package to infer what each speech by the candidates is about.

Dem Debate 1: Sentiment analysis

The 2020 Democratic race finally got underway after last week’s first debates (even though it feels like the race has been going on forever). I’ve been absorbing plenty of commentary on how the two nights transpired (including a few more data-centric and analytical pieces), not forgetting the gif-able moments. I downloaded the full debate transcripts from NBC News (posted here and here) and cleaned them up a little to play around with.

#TidyTuesday: Meteorite landings

#TidyTuesday is a weekly data project for R users to practice data wrangling and visualization skills. Users work on a new dataset released each Tuesday, and then share their work. This week’s edition (Tidy Tuesday 24) is a dataset of meteorite landings from NASA. It’s a pretty hectic week for me, so I decided to make a single visualization. The dataset consists of location data for each meteorite found up until 2013.

#TidyTuesday: Ramen ratings

#TidyTuesday is a weekly data project for R users to practice data wrangling and visualization skills. Users work on a new dataset released each Tuesday, and then share their work. This week’s edition (Tidy Tuesday 23) contains ratings of ramen from The Ramen Rater. For health reasons, I’m currently unable to eat most ingredients found in this delicacy, so this analysis is the closest I’ll come to ramen for now.

Passing the (climbing) grade

This past weekend, I had the chance to climb at the New River Gorge, one of the best climbing destinations on the East Coast (and probably the closest world-class crag to where I live). Unfortunately, I’m rehabbing a wrist injury, so the hardest climb I got on was graded 5.9, several notches below the level I’m trying to climb at consistently. Fortunately, it was the classic, epic 80-foot Flight of the Gumby.

Choosing board games

Board games have been a significant part of my life. Many of my closest friendships were forged through repeated gaming sessions. With my friends from high school, I progressed from simple gateway games like Munchkin to more complex ones like Power Grid. I’m certainly not alone as a passenger on the board game train. What makes a good game, and relatedly, what should I play next? BoardGameGeek (BGG) is an online community that helps answer that question with reviews and discussions of almost anything that qualifies as a board/card game.

Avengers popularity contest

I’m catching Avengers: Endgame later today, early enough that I don’t have to hide from spoilers. The closing of this phase of the Marvel Cinematic Universe (MCU) completes the current storylines, but really the movies will continue unabated as long as Disney makes money off of them. [Image caption: The Avengers, seen here powered by the Money Stone.] As someone who grew up following the Avengers in comic books, and then seeing them come to life on the screen, the ending of this phase is also emotionally significant.

Hello (again) world

For the last few months, I’ve been working almost exclusively on my dissertation and other academic research. Great news: I graduated! In July, I’ll be starting a position as a Data Scientist at Civis Analytics. I figured it was time to get back into the habit of writing about side projects, which would also give me the opportunity to learn and practice new computational methods. I intend to use this blog as a journal to keep track of cool things I’m learning (both methods and insights from the data I analyze).

About

[Image caption: Me attempting Like the Dickens (V7) @ Coll’s Cove in PA, Spring 2019.] Hello, I’m Kevin. I’m a Data Scientist working with public sector organizations, where I use computational methods to help them become more data-driven in their decision-making. I completed a PhD in cognitive psychology at the University of Pittsburgh. I researched how people make inferences about cause-effect relationships, commonly referred to as causal reasoning (even more commonly mistaken for “casual” reasoning).

Academic

As a graduate student, I worked with Benjamin Rottman at the University of Pittsburgh. I used a combination of behavioral experiments and computational modeling to investigate human learning, reasoning, and decision-making. In particular, I focused on how people made causal inferences from data.

Publications:
Soo, K. W. & Rottman, B. M. (in preparation). The role of granularity in causal learning.
Soo, K. W. & Rottman, B. M. (2020). Distinguishing causation from correlation: Causal learning from time-series graphs with trends.

National Park trails

I love exploring the National Parks. Different parks contain different collections of hikes. I typically choose which park to visit based on a combination of sights I want to see and logistical constraints (e.g., travel time, cost, availability of campsites). Once I’ve chosen a park to visit, I choose a set of trails to hike based on how much time/fitness I have (choosing hikes with distance and elevation gain within a certain range), with the goal of fitting in as many hikes as possible during my stay.

GE14: A statistical snapshot

Malaysia held its 14th General Election (GE14) on May 9, 2018. The opposition Pakatan Harapan (PH, translation: Coalition of Hope) won control of the federal government over Barisan Nasional (BN, translation: National Front). There are many reasons why this was a remarkable outcome. I don’t intend to expand on these, but here are a couple worth mentioning: The BN has ruled Malaysia since its independence in 1957. As a result, Malaysia’s democracy is flawed (e.

The world (is going up in flames)

This song by the late, great Charles Bradley had me wondering: Is the world getting more peaceful, or is it increasingly conflict-ridden? Are there changes in the types of conflicts that flare up today compared to the past? These questions led me to explore the Social, Political and Economic Event Database (SPEED), which contains instances of unrest between 1946 and 2005 (post-WW2). The dataset records 62,141 instances of conflict, with information about time, initiators, targets, sources, and many other interesting variables related to the conflicts.

Fantasy Football season in review (2017-2018)

This season, a friend invited me to join his family’s Fantasy Football league, which gave me a chance to dabble in analytics. Prior to this season, I had been more interested in American Football the band than the sport. I know people who make fantasy football decisions for non-rational reasons (e.g. choosing players out of loyalty). On the other hand, I also know those who use their intimate knowledge of the sport to take into account relevant information a novice like me might overlook (e.

National Park visits

Last year, the National Park Service celebrated its 100th anniversary. I’ve had the privilege to camp at several national parks in the last few years (you can see photos I’ve taken on some of my travels here), and I’ll be going to Rocky Mountain National Park in Colorado this summer.

Growing popularity

I think the National Parks are the greatest thing America has to offer. It seems many people agree – visitation has never been higher.

Presidential clusters

C-SPAN recently released the results of their Presidential Historians Survey for 2017. Historians (n = 91) gave scores to each US president on multiple attributes deemed important to the presidency, which allowed the presidents to be ranked. The C-SPAN report allows you to sort through the data by attribute (spoiler alert: Lincoln was rated highest overall, and Buchanan lowest). There are potential difficulties with these rankings: perhaps the attributes measured are not the only/most relevant ones for a good president to display, and it’s also possible that different historians interpret each attribute differently.

National battleground 2016

Election maps showing each state as a monolithic red or blue piece of the country aren’t very informative. Visualizing results by county reveals the voting patterns in some states to be pretty diverse – e.g., California’s coast went for Clinton while inland counties went for Trump; Texas starts to lean towards Clinton the further south you go; and most red states have at least one island of swing or blue-leaning counties.

How wrong were the polls?

Going into the 2016 Presidential Election, most pollsters were confident of a Clinton win. The aftermath of Trump’s win resulted in many questions being asked of pollsters and speculation as to how so many got it wrong. I won’t get into the reasons why; here are some articles with coverage on that. Instead, I want to focus on quantifying and visualizing the amount of error in the polls – where were they wrong, and how were they wrong?

Predictably comforting: The writings of Soo Ewe Jin

My father passed away on November 17, 2016. His life has been memorialized in countless tributes published in the Malaysian press and on social media. An executive editor at the most widely read English daily in Malaysia, he had a popular weekly column: Sunday Starters. As a way of navigating the mourning process, I set out to collect all his writings published in The Star. I’m sure the editors would have given me all his writings if I had asked, but instead I wrote some code to scrape all 366 articles he authored from the website.