A Web Journal about Machine Learning, Music, and other Mischief

## Category: Machine Learning

### The Three Faces of Bayes

Last summer, I was at a conference having lunch with Hal Daumé III when we got to talking about how “Bayesian” can be a funny and ambiguous term. It seems like the definition should be straightforward: “following the work of English mathematician Rev. Thomas Bayes,” perhaps, or even “uses Bayes’ theorem.” But many methods bearing the reverend’s name or using his theorem aren’t even considered “Bayesian” by his most religious followers. Why is it that Bayesian networks, for example, aren’t considered… y’know… Bayesian?

As I’ve read more outside the fields of machine learning and natural language processing — from psychometrics and environmental biology to hackers who dabble in data science — I’ve noticed three broad uses of the term “Bayesian.” I mentioned to Hal that I wanted to blog about these different uses, and he said it probably would have been more useful about six years ago when being “Bayesian” was all the rage (being “deep” is where it’s at these days). I still think these are useful distinctions, though, so here it is anyway.

I’ll present the three main uses of “Bayesian” as I understand them, all through the lens of a naïve Bayes classifier. I hope you find it useful and interesting!

## A Theorem By Any Other Name…

First off, Bayes’ theorem (in some form) is involved in all three takes on “Bayesian.” This 250-year-old staple of statistics gives us a way to estimate the probability of some outcome of interest $A$ given some evidence $B$:

$p(A|B)=\frac{p(A,B)}{p(B)}=\frac{p(B|A)p(A)}{p(B)}$

If we care about inferring $A$ from $B$, Bayes’ theorem says this can be done by estimating the joint probability $p(A,B)$ and then dividing by the marginal probability $p(B)$ to get the conditional probability $p(A|B)$. The second equality follows from the chain rule: $p(A,B)=p(B|A)p(A)$. Here, $p(A)$ is called the prior distribution over $A$, and $p(A|B)$ is called the posterior distribution over $A$ after having observed $B$.
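To make the two routes to the posterior concrete, here is a minimal Python sketch. The joint probabilities are made up purely for illustration, and the helper names (`marginal_a`, `marginal_b`, `posterior`) are hypothetical, not from any library.

```python
# Hypothetical joint distribution p(A, B) over two binary variables.
# These numbers are invented for illustration; they only need to sum to 1.
joint = {
    (True, True): 0.12,
    (True, False): 0.08,
    (False, True): 0.18,
    (False, False): 0.62,
}

def marginal_a(joint, a):
    """p(A = a): sum the joint over every value of B."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)

def marginal_b(joint, b):
    """p(B = b): sum the joint over every value of A."""
    return sum(p for (_, b_val), p in joint.items() if b_val == b)

def posterior(joint, a, b):
    """p(A = a | B = b) = p(A = a, B = b) / p(B = b)."""
    return joint[(a, b)] / marginal_b(joint, b)

# Route 1: joint divided by marginal.
direct = posterior(joint, True, True)

# Route 2: likelihood times prior, divided by marginal (the chain-rule form).
prior = marginal_a(joint, True)
likelihood = joint[(True, True)] / prior  # p(B = True | A = True)
via_chain_rule = likelihood * prior / marginal_b(joint, True)

print(direct, via_chain_rule)
```

Both routes give the same number, which is the whole point of the second equality in the theorem.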

## 1. Bayesians Against Discrimination

Now briefly consider this photo from the Pittsburgh G20 protests in 2009. That’s me carrying a sign that says “Bayesians Against Discrimination” behind the one and only John Oliver. I don’t think he realized he was in satirical company at the time (photo by Arthur Gretton, more ML protest photos here).

To get the joke, you need to grasp the first interpretation of “Bayesian”: a model that uses Bayes’ theorem to make predictions given some data…

### Machine Learning Roller Derby Names

I have several friends who do roller derby, and one of my favorite things about the culture is the derby name. This is a fun way for skaters to express their personalities through witty, satirical, and mock-violent alter-egos. Since most of my derby friends also have PhDs in computer science, machine learning, or statistics, they can be wonderfully nerdy, too.

A while back I was at a bout here in Pittsburgh to cheer on Ada Bloodlace and Angela Momentum (the latter now skates in Austin TX, which is like roller derby Mecca). Afterward, we got to talking about derby names and our colleague Logistic Aggression. My brain just started spinning with punny possibilities for computer and data science themed derby names. I recently unearthed a list of these, to which I added a few new ones:

• Donna Matrix
• Crash Function
• Touring Machine
• Finite Skate Machine
• Smack-Propagation (which later morphed into the title of this blog)
• Slamming Distance
• Polly Nomial
• Tara Bites
• Horn Claws
• Trample Bias
• Hittin’ Variable
• Iterative Impaling
• Smackdown Automaton
• Conditional Random Wheels
• Bloodsport Vector Machine
• Minimum Slamming Tree
• Smack Overflow
• NP-Hard Hitter
• Em Cryption
• Parity Hit
• Dee Bugger
• Suzie Queue

Someone suggested “Halting Problem,” but according to Angela Momentum, that one’s already taken by a skater in Austin. One could also use the hexadecimal encoding of her jersey number as a derby name (e.g., Abbe/43966, Becca/781514, Deb/3563, Fae/4014), but that might be too oblique.
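The hexadecimal trick is easy to check in Python. This is a small sketch; the function names `jersey_number` and `derby_name` are made up for illustration.

```python
def jersey_number(name: str) -> int:
    """Read a name made only of hex digits (0-9, a-f) as a base-16 number."""
    return int(name, 16)  # int() accepts upper- or lowercase hex digits

def derby_name(number: int) -> str:
    """Go the other way: write a jersey number in hex and capitalize it."""
    return format(number, "x").capitalize()

for name in ["Abbe", "Becca", "Deb", "Fae"]:
    print(f"{name}/{jersey_number(name)}")
```

Names containing any letter past "f" raise a `ValueError`, which is one reason the pool of candidates is small.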

Anyway, if you do derby — or are considering it — feel free to use any of these names on the track! Just give me a high five if we ever meet in person.

Speaking of derby girls and machine learning, the WiML Workshop is looking for organizers for this year’s event in Montreal. The deadline is Feb 29, 2016!

### Encoding Human Thought Processes into a Computer

One of my favorite characters in William Gibson’s Neuromancer was a so-called “psychological construct” named The Dixie Flatline. Dixie wasn’t a person, really, but an emulation of a famous computer hacker named McCoy Pauley (based on a brain scan that was made before he died). As he — or, it — said in a conversation with the novel’s protagonist Henry Case:

“Me, I’m not human … but I respond like one, see? … But I’m really just a bunch of ROM. It’s one of them, ah, philosophical questions, I guess….” The ugly laughter sensation rattled down Case’s spine. “But I ain’t likely to write you no poem, if you follow me.”

The Flatline was neither a human nor an artificial intelligence, but a machine that partially emulated how a human thought. It did a pretty good job, too, playing the central role of “smart guy” in the novel’s main cyberpunk-heist plotline. Yet it wasn’t a perfect human emulation: its laugh was “wrong,” and it was self-aware enough to note its own lack of creativity. Turning its ROM disk off and back on again totally reset Dixie’s memory, and later in the story the villain tried to take out Case first (still alive and human) precisely because the Flatline was a machine, and therefore much more predictable.

## Cognitive Models and Their Uses

Regardless, it’s pretty cool to think about what we can accomplish with computational cognitive models derived using real data from real people. In Neuromancer, the data was McCoy Pauley’s brain scan, which was modeled and encoded into a computer program called The Dixie Flatline. The model wasn’t quite right, but was still useful. All that is science fiction of course, but we are making progress in the real world, too. There are both practical and theoretical uses for these kinds of models, such as:

• “Encoding” a human thought process into a computer. It’s hard to “teach” computers directly. Most machine learning algorithms learn by example (i.e., observational data) but there aren’t great ways for people to inject their instincts about a problem into the machine. If we have a good cognitive model that captures properties of our thinking, though, we can perhaps encode that more directly into a learning algorithm.
• Understanding how people think. If a computational model predicts real human behavior pretty well, then there’s a chance that it captures something real about how we think. And if its parameters are easily interpretable, we can gain insight into how our brains work, too.

With these in mind, let me summarize a recent collaboration with fellow computer/cognitive scientists at my alma mater UW-Madison. Here, the data consist of word lists that people think up, which we model computationally for both the practical and theoretical uses mentioned above. In fact, the paper is being presented at the ICML 2013 conference this week in Atlanta. We made a short video overview of the research, too:

That’s mostly me talking in the video, but Kwang-Sung will present it at the conference. The paper itself is here:

K.S. Jun, X. Zhu, B. Settles, and T.T. Rogers. Learning from Human-Generated Lists. Proceedings of the International Conference on Machine Learning (ICML), pages 181-189. 2013.


### Machine Learning and Social Science: Taking the Best of Both Worlds (A Case Study)

Machine learning and social science are converging, since both are hot to answer questions and challenges raised by vast modern social data sets. The more I talk to and work with social scientists, the more I realize that we use the same basic statistical tools in our research (e.g., linear or logistic regression), but in very different ways. Here are the fundamental differences in how the two camps approach things, I think (broadly speaking):

• Social scientists (e.g., psychologists, sociologists, economists) tend to start with a hypothesis, and then design experiments — or find observational data sets — to test that hypothesis. I think of this as a deductive, top-down, or theory-driven approach.
• Computer scientists (i.e., the machine learning and data mining communities) tend to “let the data speak for itself,” by throwing algorithms at the problem and seeing what sticks. I think of this as an inductive, bottom-up, or data-driven approach.

Both approaches have their uses (and their pitfalls). Theory-driven research is probably better for advancing scientific knowledge: the models may not predict the future very well, but they can shed light on causes and effects, or confirm/deny hypotheses. Data-driven research is often more practical: we have great spam filters and recommender systems today as a result, but the best methods are usually “black boxes” that perform well without providing much insight. Ideally, we would like sophisticated methods that can make accurate predictions and tell us something about the world.

In this post, I’ll argue for (1) a hybrid inductive + deductive research approach and (2) a specific algorithm called path-based regression, both of which help push us toward this unified vision, I think. These perspectives grew out of a recent “machine learning meets social science” project of mine to try to explain and predict how creative collaborations form in an online music community.

(A note to self-identified statisticians: I’m not blatantly ignoring you, I just don’t quite know which camp you fall into. Perhaps it depends on whether you’re more motivated by inference or prediction. I suspect, though, that good statisticians are the unicorns who already know everything I have to say here…)

## Understanding and Predicting Online Creative Collaborations

Mere days ago, I launched the tenth iteration of February Album Writing Month (FAWM). FAWM is a music project I started during grad school with a few friends, the goal being to write an album in a month: “14 songs in 28 days.” Recently, it has become a bit of a research project, too, since I have collected a rich data set over the years about individuals’ online interactions and musical productivity. Last fall, I teamed up with Steven Dow from Carnegie Mellon’s Social Computing Group to look into how collaborative songwriting projects form and succeed in FAWM. We’ll present it at the CHI 2013 conference in a few months… and since CHI required us to make a promo video (sheesh), here is a 30-second overview:

The paper itself is available here:

B. Settles and S. Dow. Let’s Get Together: The Formation and Success of Online Creative Collaborations. Proceedings of the Conference on Human Factors in Computing Systems (CHI). ACM, 2013.


### Machine Learning and Personality Type

Here are some thoughts on statistical approaches for pinpointing personality types. Text analysis and crowdsourcing FTW!

## Myers-Briggs

I recently discovered Typealyzer, a service that analyzes a web page and tries to determine the author’s personality type, in terms of Myers-Briggs Type Indicators. I’m not sure what kind of classifier it uses, but it’s apparently built on uClassify’s API and trained using psychographic text data gathered by Mattias Östmar. It determines each of the four dimensions independently, presumably using a “bag of words” document model.

In both formal and informal tests, I have always scored INTP (introverted, intuitive, thinking, perceiver) since high school. So I was curious what Typealyzer would make of my writing. I have a web presence in multiple public places, so I decided to try a few of them. Here are the results:

So one might conclude that I’m an INTJ (“Mastermind”) with 2/5 = 40% probability. Alternatively, one might marginalize over the four dimensions independently and say that I am in fact an INTP (“Architect”) with 17.3% probability (INTJ comes in second at 11.5%). A Bayesian might put stronger priors on my personal and academic pages, or priors based on population distribution, or use model confidences (which Typealyzer didn’t provide). At any rate, I’m either an INTJ or INTP, and probably the latter.

(Update: I checked Typealyzer immediately after posting this, and it revised its prediction for this blog to INTP.)
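The two readings of the results (whole-type voting versus per-dimension marginals) can be sketched in a few lines of Python. The list of five predictions below is hypothetical: the results table isn't reproduced here, so I picked values only so that the arithmetic lines up with the percentages quoted above. The function names are likewise made up for illustration.

```python
from collections import Counter

# HYPOTHETICAL predictions for five pages, chosen only to match the
# percentages quoted in the text; they are not the actual results.
predictions = ["INTJ", "INTJ", "INTP", "ISFP", "ESFP"]

def whole_type_vote(preds):
    """Fraction of pages predicted as each complete four-letter type."""
    return {t: c / len(preds) for t, c in Counter(preds).items()}

def dimension_marginals(preds):
    """For each of the four letter positions, the fraction of pages per letter."""
    n = len(preds)
    return [
        {letter: c / n for letter, c in Counter(p[i] for p in preds).items()}
        for i in range(4)
    ]

def type_probability(marginals, mbti_type):
    """Multiply the per-dimension marginals, treating dimensions as independent."""
    prob = 1.0
    for dim, letter in zip(marginals, mbti_type):
        prob *= dim.get(letter, 0.0)
    return prob

votes = whole_type_vote(predictions)
marginals = dimension_marginals(predictions)
print(votes["INTJ"])                        # whole-type vote share for INTJ
print(type_probability(marginals, "INTP"))  # independent-dimension estimate
print(type_probability(marginals, "INTJ"))
```

With these made-up inputs, INTJ wins the whole-type vote but INTP wins once each dimension is counted separately, which is the same reversal described above.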

## The Enneagram

All that reminded me of a little “breakfast experiment” I did a while ago to help me determine my Enneagram type. The Enneagram is not nearly as popular as Myers-Briggs, but I find it more useful for being self-aware about bad habits or unhealthy tendencies. Without going into too much detail, there are nine basic types:

Each type also has two adjacent wings, and people have one of three instinctual variants, which allows for a total of 9×2×3 = 54 personality types! But I’m only concerned about two of the nine basic types shown above: Five (“Investigator”) and Nine (“Peacemaker”).

I’ve always tested as a Five, but with Nine in second place, which is kind of weird since they share very little in common according to the theory. I always assumed I really was a Five since I feel more like an investigator than a peacemaker, plus type Five is correlated with both INTP and INTJ in Myers-Briggs-land (cf. this study). But a little over a year ago I was going through a period of major personal stress, and my friend Charles (who introduced me to the Enneagram) suggested that I might be a Nine instead of a Five based on how I was responding to the situation(s). He said that he had recently revised his own type, and pointed me to an article by the Enneagram Institute arguing that educated male Nines tend to think they are Fives:

Despite their similarities, the main point of confusion for Nines arises around the notion of “thinking.” Nines think they are Fives because they think they have profound ideas: therefore, they must be Fives.

Part of the problem stems from the fact that individuals of both types can be highly intelligent…. Although intelligence can be manifested in different ways, being intelligent does not make Nines intellectuals, just as thinking does not make them thinkers.

They also claim that the Nine-to-Five (teehee!) misclassification is the most common… although it rarely happens the other way around. So I read up on both types, but they both felt like they described me in different ways. I re-took some tests, and Five still came out on top with Nine close behind.

### A Crowdsourcing Experiment

So (of course) I decided to build a classifier. First, I collected 40 first-person statements that supposedly characterize either Fives or Nines, lightly edited them for stylistic consistency, shuffled the order (to reduce presentation bias), and emailed the list to 28 close friends and family. I asked them to reply with all the statements they thought describe me, and delete the ones that do not. In a sense, “crowdsourcing” my personality description.

Then I built a multinomial naïve Bayes classifier that computes the probability $p(t|\mathbf{s})$ of an Enneagram type $t$ given a set of these statements $\mathbf{s} = \{s_i\}^{40}_{i=1}$:

$p(t|\mathbf{s}) \propto p(t)\prod_i p(s_i|t)^{\mathrm{freq}(s_i)}$

Here, $\mathrm{freq}(s_i)$ is the number of people who responded saying that statement $s_i$ describes me. Estimating the probabilities for this model was tricky with no actual data, but it is called naïve Bayes, so I took it to the extreme and only used priors. For type priors $p(t)$, I used probabilities of 44.1% for Five and 55.9% for Nine (which came from this study of the general population). For statement probabilities, I adapted an approach I have used before — for classifying text using labeled words in addition to or instead of labeled documents — and used a simple informative Dirichlet prior of 2 “pseudocounts” if statement $s_i$ describes type $t$ (e.g., “I stand back and try to view life objectively” describes type Five), and 1 otherwise. These get normalized to form the conditional multinomials $p(s_i|t)$.
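Here is a minimal Python sketch of a priors-only multinomial naïve Bayes model along these lines. It is not the code used for the experiment. The type priors and the 2-versus-1 pseudocounts come from the description above; the toy input (four statements rather than 40, with invented response counts) and all function names are assumptions for illustration.

```python
import math

TYPES = ("Five", "Nine")
TYPE_PRIOR = {"Five": 0.441, "Nine": 0.559}  # type priors quoted in the post

def statement_probs(statement_types, match=2.0, other=1.0):
    """Build p(s_i | t) from pseudocounts alone.

    statement_types[i] names the type that statement i is said to describe.
    A statement gets `match` pseudocounts under its own type and `other`
    under the remaining type; counts are normalized per type.
    """
    probs = {}
    for t in TYPES:
        counts = [match if st == t else other for st in statement_types]
        total = sum(counts)
        probs[t] = [c / total for c in counts]
    return probs

def posterior(freq, statement_types):
    """p(t | s), proportional to p(t) * prod_i p(s_i | t) ** freq(s_i)."""
    probs = statement_probs(statement_types)
    log_scores = {
        t: math.log(TYPE_PRIOR[t])
        + sum(f * math.log(p) for f, p in zip(freq, probs[t]))
        for t in TYPES
    }
    # Normalize in log space to avoid underflow with many statements.
    top = max(log_scores.values())
    z = sum(math.exp(v - top) for v in log_scores.values())
    return {t: math.exp(v - top) / z for t, v in log_scores.items()}

# HYPOTHETICAL toy input: which type each statement describes, and how many
# respondents endorsed each statement.
statement_types = ["Five", "Five", "Nine", "Nine"]
freq = [3, 2, 1, 1]
result = posterior(freq, statement_types)
print(result)
```

If nobody endorses anything (all frequencies zero), the posterior falls back to the type prior, which is a handy sanity check on the implementation.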

### Results

Over about two weeks, 12 people replied (42.9% response rate), which was a pretty good cross-section of family and friends from high school, college, grad school, and my more recent Pittsburgh days. I generally agreed with the responses, although I was surprised how many people thought I would say “Tell me when you like how I look” or “Hug me and show physical affection.” I don’t think I give off those vibes. Do I? Anyway, here are the four statements all 12 respondents unanimously agreed on:

• “I need time alone to process my feelings and thoughts.”
• “I like to have a thorough understanding; perceiving causes and effects.”
• “My sense of integrity: doing what I think is right and not being influenced by social pressure.”
• “I know that most people enjoy my company, I’m easy to be around!”

The first three describe a Five, and the last one describes a Nine (although, let’s face it… that’s just flattery). The model predicts with 95.9% confidence (about 23-to-1 odds) that I am indeed a Five, a prediction that isn’t very sensitive to fiddling with the Dirichlet priors at all (although it is naïve Bayes, and the statements are probably not conditionally independent). Furthermore, I suppose that the very act of conducting an experiment like this, however silly, is a very Five kind of thing to do. So… uhmm… case closed?

### More Thoughts

After living with the idea for a year or so now, I think I actually disagree with that analysis. Insofar as we have discrete personality types (which is a little dubious to begin with), I think I am in fact a Nine… just a very curious and analytical Nine. Here is why, according to the theory (which relates to the arrows in the diagram above):

• The Five’s investigative nature supposedly stems from a fear of not being able to understand “Truths” about the world. Stressed-out Fives can be hyperactive and paranoid, spread across a lot of projects (like some Sevens). Healthy Fives, however, become confident leaders and decisive “benevolent dictators” (like some Eights).
• In contrast, Nines want peace of mind. In the face of stress, they can become anxious worrywarts (like some Sixes), but healthy Nines pick up energy and become focused on self-improvement (like some Threes).

The latter feels more like me. I am an investigator not because of some deep need to get to the bottom of things (Five), but because it’s a hell of a lot of fun, and it forces me to learn things and develop new skills in the process (healthy-Nine/Three). And while it’s true that I try to “do what I think is right and not be influenced by social pressure” (Five), I still worry an awful lot about what other people think of my decisions (unhealthy-Nine/Six). I suspect that academia is overrun with Fives, and thus I have either taken on some Five-like traits or they are projected onto me by the friends & family who replied to my survey. Nuances this fine might be too subtle to pull out of a questionnaire-based personality test.

## Conclusion

Anyway, fun stuff… although the Nine-to-Five misclassification got me thinking about an “active” personality test that, instead of asking a rote set of questions, could adapt and try to tease these subtleties out (like a good game of 20 Questions). Personal writing samples (which the Typealyzer folks are starting to get at) could be a good source of data for such a test. It would be cool to gather Enneagram types for a bunch of bloggers, and use NLP techniques to try to understand how language is used by the different types. A test could tailor follow-up questions based on preliminary guesses from the text. As always, though, training data is a big bottleneck…

### Community Demographics and the 2012 Presidential Election

Earlier this month I was fighting a pretty bad cold. While I probably should have been doing “real” work during the sinusy waking hours at home, I decided instead to finish season two of The Walking Dead and scratch an itch regarding the 2012 presidential election. I feel like I’ve seen some sloppy journalism and “data porn” floating around in the wake of the election: simple mash-ups of data and ill-drawn conclusions about how race, poverty, religion, and historical baggage (e.g., slavery) influenced the vote.

So I threw together some data, too, to try and understand these things for myself (and you, if you care). In short, I built a statistical computer model that predicts the vote distribution between Obama and Romney at the U.S. county level, using some demographic info about each county as evidence. Here is a quick summary of what the data seems to say:

• Obama-leaning communities tend to: live in more multi-unit housing, have more businesses owned by women and minorities, be more educated, and have an economy driven by manufacturing. (Some weak associations include: population density, retail or food & service economies, and percentage speaking a language other than English at home).
• Romney-leaning communities tend to: be more White or Hispanic, have more children, and own their own homes. (Some weak associations include: a wholesale-trade economy, percent of residents who were foreign-born, and whether the county was part of a former slave state or territory).
• Apparently irrelevant features include: income level or poverty rate, household size, and presence of a voter ID law.

Juicy details below. Note that these findings are correlations, and correlation does not imply causality. Also, these are results for counties, not individuals. For example, the percent of the population under 18 years of age is a predictive feature of pro-Romney counties. Clearly, these children were not voting for Romney (they are too young!), so it must have been the adults in counties with more children. We should be similarly cautious about drawing hasty conclusions about people described by other demographic features. (There is an interesting result regarding Hispanic populations that I discuss below.) Sometimes such conclusions are reasonable, but you have to be careful, and what sets my little study apart from most of the mash-ups I have seen is that these different demographic features are all taken into account. In other words, we’re considering the role of, say, voter ID laws in the context of everything else.

Before we dive in, a caveat: I am not a political scientist. I am a computer scientist with a background in machine learning and natural language processing. I am learning more social science, but my methods and interpretations do not benefit from a ton of social theory or training (yet).