A Web Journal about Machine Learning, Music, and other Mischief

The Three Faces of Bayes


Last summer, I was at a conference having lunch with Hal Daume III when we got to talking about how “Bayesian” can be a funny and ambiguous term. It seems like the definition should be straightforward: “following the work of English mathematician Rev. Thomas Bayes,” perhaps, or even “uses Bayes’ theorem.” But many methods bearing the reverend’s name or using his theorem aren’t even considered “Bayesian” by his most religious followers. Why is it that Bayesian networks, for example, aren’t considered… y’know… Bayesian?

As I’ve read more outside the fields of machine learning and natural language processing — from psychometrics and environmental biology to hackers who dabble in data science — I’ve noticed three broad uses of the term “Bayesian.” I mentioned to Hal that I wanted to blog about these different uses, and he said it probably would have been more useful about six years ago when being “Bayesian” was all the rage (being “deep” is where it’s at these days). I still think these are useful distinctions, though, so here it is anyway.

I’ll present the three main uses of “Bayesian” as I understand them, all through the lens of a naïve Bayes classifier. I hope you find it useful and interesting!

A Theorem By Any Other Name…

First off, Bayes’ theorem (in some form) is involved in all three takes on “Bayesian.” This 250-year-old staple of statistics gives us a way to estimate the probability of some outcome of interest A given some evidence B:

p(A|B) = p(A,B) / p(B) = p(B|A) p(A) / p(B)

If we care about inferring A from B, Bayes’ theorem says this can be done by estimating the joint probability p(A,B) and then dividing by the marginal probability p(B) to get the conditional probability p(A|B). The second equivalence follows from the chain rule: p(A,B) = p(B|A) p(A). Here p(A) is called the prior distribution over A, and p(A|B) is called the posterior distribution over A after having observed B.
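To make this concrete, here is a tiny numeric sketch (the spam-filter numbers are invented purely for illustration):

```python
# A toy Bayes' theorem calculation with made-up numbers:
# A = "email is spam", B = "email contains the word 'winner'"

p_A = 0.2               # prior: p(spam)
p_B_given_A = 0.25      # likelihood: p('winner' | spam)
p_B_given_not_A = 0.01  # p('winner' | not spam)

# marginal p(B), via the law of total probability
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# posterior: p(A|B) = p(B|A) p(A) / p(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # ~0.862: seeing 'winner' raises p(spam) from 0.2 to 0.86
```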

1. Bayesians Against Discrimination

Now briefly consider this photo from the Pittsburgh G20 protests in 2009. That’s me carrying a sign that says “Bayesians Against Discrimination” behind the one and only John Oliver. I don’t think he realized he was in satirical company at the time (photo by Arthur Gretton, more ML protest photos here).


To get the joke, you need to grasp the first interpretation of “Bayesian”: a model that uses Bayes’ theorem to make predictions given some data…

Read the rest of this entry »

Machine Learning Roller Derby Names


I have several friends who do roller derby, and one of my favorite things about the culture is the derby name. This is a fun way for skaters to express their personalities through witty, satirical, and mock-violent alter-egos. Since most of my derby friends also have PhDs in computer science, machine learning, or statistics, they can be wonderfully nerdy, too.

A while back I was at a bout here in Pittsburgh to cheer on Ada Bloodlace and Angela Momentum (the latter now skates in Austin TX, which is like roller derby Mecca). Afterward, we got to talking about derby names and our colleague Logistic Aggression. My brain just started spinning with punny possibilities for computer and data science themed derby names. I recently unearthed a list of these, to which I added a few new ones:

  • Donna Matrix
  • Crash Function
  • Touring Machine
  • Finite Skate Machine
  • Smack-Propagation (which later morphed into the title of this blog)
  • Slamming Distance
  • Polly Nomial
  • Tara Bites
  • Horn Claws
  • Trample Bias
  • Hittin’ Variable
  • Gradient Dissent
  • Iterative Impaling
  • Smackdown Automaton
  • Conditional Random Wheels
  • Bloodsport Vector Machine
  • Minimum Slamming Tree
  • Smack Overflow
  • NP-Hard Hitter
  • Em Cryption
  • Parity Hit
  • Dee Bugger
  • Suzie Queue

Someone suggested “Halting Problem,” but according to Angela Momentum, that one’s already taken by a skater in Austin. One could also use the hexadecimal encoding of her jersey number as a derby name (e.g., Abbe/43966, Becca/781514, Deb/3563, Fae/4014), but that might be too oblique.
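(If you want to check that arithmetic, the hex-to-decimal conversion is a one-liner; this little Python snippet is just for verification:)

```python
# Derby names that happen to be valid hexadecimal, with the decimal
# jersey numbers they encode
for name in ["Abbe", "Becca", "Deb", "Fae"]:
    print(name, int(name, 16))
# Abbe 43966, Becca 781514, Deb 3563, Fae 4014
```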

Anyway, if you do derby — or are considering it — feel free to use any of these names on the track! Just give me a high five if we ever meet in person.

Speaking of derby girls and machine learning, the WiML Workshop is looking for organizers for this year’s event in Montreal. The deadline is Feb 29, 2016!

Transistor Clustering for DIY Guitar Effects

For the past couple of years, I’ve been teaching myself analog guitar electronics. (I previously wrote about fixing an old tube amplifier.) It’s nice to have a hobby that is “material” in nature — a departure from my usual world of data and software — but that also complements my musical interests. In fact, two of the seven pedals on the live gig rig I use with my band, delicious pastries, are now DIY effects.

I had the final week of 2015 off from work, so I spent part of it building a few clones of some classic effect pedals. The only one I had time to complete was an MXR Phase 45 clone, using a circuit board layout from madbeanpedals that was small enough to cram into this cute little 1590A enclosure!


What’s more, the project presented a fun data analysis opportunity, since the effect uses two transistors that need to be matched. I’ll talk here a little about the phaser circuit, the matching process, and how I came to use a hierarchical clustering approach to identify the best components for my final build of the pedal! A rough sketch of the clustering step follows, and you can hear the effect in action in the video below.
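For the curious, here is a minimal sketch of what that clustering step can look like. The parameter choices and measurements are made up for illustration (they are not my actual bench data), but the overall recipe of measuring each transistor, standardizing, and cutting a dendrogram into tight groups is the same:

```python
# A minimal sketch of grouping transistors by measured parameters so
# that well-matched pairs fall into the same cluster. The numbers are
# hypothetical; in practice you'd record something like each JFET's
# pinch-off voltage and saturation current on the bench.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows = transistors, columns = (Vgs_off in volts, Idss in mA)
measurements = np.array([
    [-1.95, 4.1],
    [-2.10, 4.4],
    [-1.98, 4.0],
    [-2.45, 5.2],
    [-2.50, 5.3],
    [-2.05, 4.2],
])

# standardize so both parameters contribute equally to the distances
z = (measurements - measurements.mean(axis=0)) / measurements.std(axis=0)

# agglomerative (hierarchical) clustering; cut the tree into tight groups
clusters = fcluster(linkage(z, method="ward"), t=1.0, criterion="distance")
print(clusters)  # transistors sharing a label are candidate matched pairs
```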

Read the rest of this entry »

Rebuilding A Vintage Tube Amplifier

As someone who works mainly with digital data and software, it’s nice to have a “material” hobby that can get my hands dirty. That’s why I’ve been building my own guitar effects recently (but more about that in a later post…). For now, consider that the only guitar amplifier I’ve had for ages is my Peavey Classic 30 (ca. 2000, back when they were still made in the USA). It sounds great on stage, but packs too much of a punch for testing out circuits on a workbench. So I decided to build myself a small, low-power DIY practice amp. I was hitting up some thrift/antique stores one weekend in search of a lunchbox or small suitcase to build the amp into… when I came across this instead:


It’s a Harmony H303A, which I’d never seen before, though it seems to get good reviews and goes for $150-200 on eBay when listed (which is rare). The vintage appears to be 1950s or so, and online sources claim it outputs somewhere in the 2-5 watt range… too quiet for gigging but good for practicing and recording. A dealer in an antique mall had it in his collection and was really hot to move it for some reason. The price tag originally said “$50 for that sweet tube sound!” and when he went down to $25 I figured, “even if it doesn’t work, it’ll make a nice cabinet for the little solid-state practice amp I was going to build in the first place.” As it turns out, the Harmony does work… although it took a little effort to whip it into shape. Here’s a video showing the fruits of my labor:

Read the rest of this entry »

Most Livable Cities: A Meta-Analysis

Every few weeks, my Facebook newsfeed throws me an article like “Most Livable Cities” or “Best Cities for Quality of Life” or “Happiest and Unhappiest U.S. Cities” or somesuch. These rankings are generally quite different (though with a few common themes), and often include — in the top ten or so — the home city of whoever shared the link with their fellow facefriends.

The rankings vary widely in source, methodology, and credibility (for that matter, even in use of supporting data). So I was curious to do a sort of meta-analysis, combining these lists in a reasonable way to see (1) what cities are most livable by consensus, and (2) what social/demographic indicators seem to make them that way. Here are the main things I learned:

  • The livability of a city isn’t related to the happiness of its people.
  • Livability rankings come in two types, which I call Chill Rankings and Jetsetter Rankings. The statistical models of livability that they produce are totally different, and there is no overlap in their top ten cities.
  • A few cities crack the top 25 for both types, though, suggesting a more balanced lifestyle: Washington DC, Boston, San Francisco, Pittsburgh, Minneapolis, Seattle, Buffalo, Honolulu, Portland, and Houston.
  • Surprisingly, cost of living and pollution have little relationship with livability in either type of ranking.

The Experiment

In full disclosure, this was originally just an excuse to play around with structural equation modeling (more below). But I also wanted to take an inductive approach to livability — to blenderize all these contradictory lists and try to learn something from them. The typical approach seems wantonly deductive to me — e.g., rank cities by average rent, household income, commute time, and violent crimes per capita using census data, and then sum those rankings into an overall score. Never mind that rent and income are strongly correlated (and effectively double-counted), or that crime should maybe count for more than commute time. Some of these rankings come from studies by scientists who know how to deal with these complexities, but many are compiled by journalists (or their interns) based on intuition alone.

Structural Equation Models (SEMs)

I wanted to design the meta-analysis around SEMs, which I discovered from papers on social influence in online communities. SEMs were first articulated by Sewall Wright, a geneticist at my alma mater UW-Madison… and apparently a frequent dinner guest at my band’s singer’s mother’s childhood home. (Small world!) But SEMs were never part of my formal training as a computer scientist, so I’ve been eager to learn more and find an interesting application.

SEMs are graphical models with the sweet ability to create hypothetical latent variables and uncover statistical relationships between them. For example, there isn’t really a way to measure livability, precisely, but we can tap into so-called “manifest variables” — like these Facebook-flung livability lists — to create a “latent construct” that summarizes them all. The idea is that this construct represents the “true but hidden” livability measure, and all these rankings are “symptomatic” manifestations of the underlying scale. The same can be done for cost of living (a construct of the average rent, price of gas, cost of a slice of pizza, etc.) or education (a construct of the percentage of residents with various degrees). We can also essentially perform a regression analysis to see how constructs like cost of living and education might influence livability.
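Full SEM packages estimate the latent constructs and the regressions among them jointly. Just to illustrate the “latent construct from manifest variables” half of the idea, here is a toy factor-analysis sketch; the data are random stand-ins, not my actual rankings:

```python
# A toy sketch of extracting one latent "livability" factor from several
# manifest rankings. The data are random stand-ins; a real analysis
# would use actual city rankings (and an SEM package, which can also
# regress latent constructs on one another).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_cities = 50
hidden = rng.normal(size=(n_cities, 1))  # the "true but hidden" scale

# five noisy published rankings, each a "symptom" of the hidden scale
manifest = (hidden @ rng.uniform(0.5, 1.5, size=(1, 5))
            + 0.5 * rng.normal(size=(n_cities, 5)))

fa = FactorAnalysis(n_components=1)
latent = fa.fit_transform(manifest)  # one latent score per city

# the recovered factor should track the hidden scale (up to sign, which
# is arbitrary in factor models)
print(np.corrcoef(latent.ravel(), hidden.ravel())[0, 1])
```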

Model and Data

The figure below illustrates my basic model, which incorporates a lot of the general assumptions about what influences livability:
Read the rest of this entry »

Encoding Human Thought Processes into a Computer

One of my favorite characters in William Gibson’s Neuromancer was a so-called “psychological construct” named The Dixie Flatline. Dixie wasn’t a person, really, but an emulation of a famous computer hacker named McCoy Pauley (based on a brain scan that was made before he died). As he — or, it — said in a conversation with the novel’s protagonist Henry Case:

“Me, I’m not human … but I respond like one, see? … But I’m really just a bunch of ROM. It’s one of them, ah, philosophical questions, I guess….” The ugly laughter sensation rattled down Case’s spine. “But I ain’t likely to write you no poem, if you follow me.”

The Flatline was neither a human nor an artificial intelligence, but a machine that partially emulated how a human thought. It did a pretty good job, too, playing the central role of “smart guy” in the novel’s main cyberpunk-heist plotline. Yet it wasn’t a perfect human emulation: its laugh was “wrong,” and it was self-aware enough to note its own lack of creativity. Turning its ROM disk off and back on again totally reset Dixie’s memory, and later in the story the villain tried to take out Case first (still alive and human) precisely because the Flatline was a machine, and therefore much more predictable.

Cognitive Models and Their Uses

Regardless, it’s pretty cool to think about what we can accomplish with computational cognitive models derived using real data from real people. In Neuromancer, the data was McCoy Pauley’s brain scan, which was modeled and encoded into a computer program called The Dixie Flatline. The model wasn’t quite right, but was still useful. All that is science fiction of course, but we are making progress in the real world, too. There are both practical and theoretical uses for these kinds of models, such as:

  • “Encoding” a human thought process into a computer. It’s hard to “teach” computers directly. Most machine learning algorithms learn by example (i.e., observational data) but there aren’t great ways for people to inject their instincts about a problem into the machine. If we have a good cognitive model that captures properties of our thinking, though, we can perhaps encode that more directly into a learning algorithm.
  • Understanding how people think. If a computational model predicts real human behavior pretty well, then there’s a chance that it captures something real about how we think. And if its parameters are easily interpretable, we can gain insight into how our brains work, too.

With these in mind, let me summarize a recent collaboration with fellow computer/cognitive scientists at my alma mater UW-Madison. Here, the data consist of word lists that people think up, which we model computationally for both the practical and theoretical uses mentioned above. In fact, the paper is being presented at the ICML 2013 conference this week in Atlanta. We made a short video overview of the research, too:

That’s mostly me talking in the video, but Kwang-Sung will present it at the conference. The paper itself is here:

K.S. Jun, X. Zhu, B. Settles, and T.T. Rogers. Learning from Human-Generated Lists. Proceedings of the International Conference on Machine Learning (ICML), pages 181-189. 2013.

Read the rest of this entry »

On “Geek” Versus “Nerd”

To many people, “geek” and “nerd” are synonyms, but in fact they are a little different. Consider the phrase “sports geek” — an occasional substitute for “jock” and perhaps the arch-rival of a “nerd” in high-school folklore. If “geek” and “nerd” are synonyms, then “sports geek” might be an oxymoron. (Furthermore, “sports nerd” either doesn’t compute or means something else.)

In my mind, “geek” and “nerd” are related, but capture different dimensions of an intense dedication to a subject:

  • geek – An enthusiast of a particular topic or field. Geeks are “collection” oriented, gathering facts and mementos related to their subject of interest. They are obsessed with the newest, coolest, trendiest things that their subject has to offer.
  • nerd – A studious intellectual, although again of a particular topic or field. Nerds are “achievement” oriented, and focus their efforts on acquiring knowledge and skill over trivia and memorabilia.

Or, to put it pictorially à la The Simpsons:

Both are dedicated to their subjects, and sometimes socially awkward. The distinction is that geeks are fans of their subjects, and nerds are practitioners of them. A computer geek might read Wired and tap the Silicon Valley rumor-mill for leads on the next hot-new-thing, while a computer nerd might read CLRS and keep an eye out for clever new ways of applying Dijkstra’s algorithm. Note that, while not synonyms, they are not necessarily distinct either: many geeks are also nerds (and vice versa).

An Experiment

Do I have any evidence for this contrast? (By the way, this viewpoint dates back to a grad-school conversation with fellow geek/nerd Bryan Barnes, now a physicist at NIST.) The Wiktionary entries for “geek” and “nerd” lend some credence to my position, but I’d like something a bit more empirical…

“You shall know a word by the company it keeps” ~ J.R. Firth (1957)

To characterize the similarities and differences between “geek” and “nerd,” maybe we can find the other words that tend to keep them company, and see if these linguistic companions support my point of view?

Read the rest of this entry »
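(A postscript for the data-curious: one standard way to quantify “the company a word keeps” is pointwise mutual information, or PMI. This toy sketch, on a made-up three-sentence corpus rather than the actual data behind this post, shows the basic computation:)

```python
# A bare-bones sketch of scoring context words by pointwise mutual
# information: pmi(t, w) = log p(t, w) / (p(t) p(w)). The corpus here
# is a made-up stand-in for a real document collection.
import math
from collections import Counter

corpus = [
    "the computer geek read wired for gadget news".split(),
    "the computer nerd read textbooks and proved theorems".split(),
    "the sports geek collected jerseys and trivia".split(),
]

word_counts = Counter(w for doc in corpus for w in doc)
total = sum(word_counts.values())

# count co-occurrences of each target word with its document-mates
pair_counts = Counter()
for doc in corpus:
    for target in ("geek", "nerd"):
        if target in doc:
            for w in doc:
                if w != target:
                    pair_counts[(target, w)] += 1

def pmi(target, w):
    joint = pair_counts[(target, w)] / total
    return math.log(joint / ((word_counts[target] / total) * (word_counts[w] / total)))

# words with high pmi("geek", w) keep company with "geek" (ditto "nerd")
print(pmi("geek", "gadget"), pmi("nerd", "theorems"))
```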