Machine Learning and Social Science: Taking the Best of Both Worlds (A Case Study)
Machine learning and social science are converging, since both are eager to tackle the questions and challenges raised by vast modern social data sets. The more I talk to and work with social scientists, the more I realize that we use the same basic statistical tools in our research (e.g., linear or logistic regression), but in very different ways. Broadly speaking, here is how I think the two camps differ:
- Social scientists (e.g., psychologists, sociologists, economists) tend to start with a hypothesis, and then design experiments — or find observational data sets — to test that hypothesis. I think of this as a deductive, top-down, or theory-driven approach.
- Computer scientists (i.e., the machine learning and data mining communities) tend to “let the data speak for itself,” by throwing algorithms at the problem and seeing what sticks. I think of this as an inductive, bottom-up, or data-driven approach.
Both approaches have their uses (and their pitfalls). Theory-driven research is probably better for advancing scientific knowledge: the models may not predict the future very well, but they can shed light on causes and effects, or confirm/deny hypotheses. Data-driven research is often more practical: we have great spam filters and recommender systems today as a result, but the best methods are usually “black boxes” that perform well without providing much insight. Ideally, we would like sophisticated methods that can make accurate predictions and tell us something about the world.
In this post, I’ll argue for (1) a hybrid inductive + deductive research approach and (2) a specific algorithm called path-based regression, both of which help push us toward this unified vision, I think. These perspectives grew out of a recent “machine learning meets social science” project of mine to try to explain and predict how creative collaborations form in an online music community.
(A note to self-identified statisticians: I’m not blatantly ignoring you, I just don’t quite know which camp you fall into. Perhaps it depends on whether you’re more motivated by inference or prediction. I suspect, though, that good statisticians are the unicorns who already know everything I have to say here…)
Understanding and Predicting Online Creative Collaborations
Mere days ago, I launched the tenth iteration of February Album Writing Month (FAWM). FAWM is a music project I started during grad school with a few friends, the goal being to write an album in a month: “14 songs in 28 days.” Recently, it has become a bit of a research project, too, since I have collected a rich data set over the years about individuals’ online interactions and musical productivity. Last fall, I teamed up with Steven Dow from Carnegie Mellon’s Social Computing Group to look into how collaborative songwriting projects form and succeed in FAWM. We’ll present it at the CHI 2013 conference in a few months… and since CHI required us to make a promo video (sheesh), here is a 30-second overview:
The paper itself is available here:
B. Settles and S. Dow. Let’s Get Together: The Formation and Success of Online Creative Collaborations. Proceedings of the Conference on Human Factors in Computing Systems (CHI). ACM, 2013.
A Hybrid Inductive + Deductive Approach
One thing that excites me about this project is that we married bottom-up/data-driven experiments with top-down/theory-driven analysis, using both quantitative methods (a novel path-based regression) and qualitative methods (user surveys). That is, instead of hand-crafting a few specific variables to test specific hypotheses about collab formation and success, we ran four years of data through a big statistical model and saw what fell out. But we didn’t just dive straight into story time. Instead, we (1) used existing theory to guide what we looked for in the model, and (2) limited ourselves to interpretations that were corroborated by qualitative survey responses.
“If the result confirms the hypothesis, then you’ve made a measurement. If the result is contrary to the hypothesis, then you’ve made a discovery” ~ Enrico Fermi
By “letting the data speak” — but letting theory tell us what to listen for — we made some “discoveries” that neither confirmed nor contradicted the existing theory, exactly… but made it more nuanced. I won’t repeat the whole paper here (you should read it!), but I’ll give you a few examples:
- Theory says: You tend to collab with people who share your interests. Our finding: Sort of. If you write heavy metal, you are less likely to work with another metalhead than with a jazz pianist who owns a few Mastodon records. People tend to work with others who have shared interests but complementary skills and backgrounds.
- Theory says: You tend to collab with people of the same social status. Our finding: You probably won’t team up with someone waaaaay higher (or lower) on the social ladder. However, you are more likely to work with someone of slightly different status than with someone of the exact same status. (There are several reasons for this, ranging from newcomer socialization to hero-worship.)
- Theory says: You tend to enjoy a collab less if your partner is a slacker. Our finding: True, but you also enjoy it less if you are the slacker. A 50/50 work balance is the golden ratio.
We found evidence for all three of these in both the user surveys and the regression analyses, but we didn’t really find any precedent in the literature for these nuanced findings. Here’s the important thing: I don’t think we would have thought to explicitly operationalize variables for the regression to tease these out (cf. a purely theory-driven approach). Because the model was not limited to variables we picked in advance, we were able to stumble across these almost-but-not-quite extensions of the existing theory. They make sense in hindsight, but the data had to tell us.
In order for the data to speak, though, they first need a vocabulary. The traditional approach is to hand-craft input variables to test some theory or intuition. In our study, we adopted a method developed by Ni Lao for his Ph.D. thesis (on probabilistic reasoning over knowledge base graphs, which has been useful for us in the Read the Web project). It turns out to be useful for reasoning over social network graphs, too. In fact, I bet it is useful for virtually any kind of inductive complex network modeling…
Here’s the basic idea. To predict the formation of an interesting network edge between users A and B — in our case, whether or not they will collaborate — look at all the other kinds of paths that can connect them (such as follows, direct messages, paths through comments on each other’s songs, or through tags shared by songs both have written). These paths become input variables that describe the pair 〈A, B〉 in a logistic regression model, which in turn predicts whether or not a collaboration edge should exist between them. The gory details — how to automatically gather path types and their statistics through random walks, etc. — can be found in our paper, or Ni’s thesis (in the context of knowledge base inference).
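To make the idea concrete, here is a minimal sketch of path features for a single pair. Everything in it is an assumption for illustration: the toy network, the three path types, the helper names (`build_index`, `path_feature`, `collab_probability`), and the weights are all made up. It also assumes each feature value is the probability that a random walk from A, constrained to follow one path type, lands on B, which is my shorthand for the random-walk statistics mentioned above; the actual procedure in the paper and in Ni’s thesis is more involved.

```python
import math
from collections import defaultdict

# Hypothetical toy network as (source, edge_type, target) triples.
EDGES = [
    ("alice", "follows", "bob"),
    ("alice", "follows", "carol"),
    ("carol", "follows", "bob"),
    ("carol", "follows", "dave"),
    ("alice", "wrote", "song1"),
    ("bob", "wrote", "song2"),
    ("song1", "tagged", "metal"),
    ("song2", "tagged", "metal"),
    ("song2", "tagged", "jazz"),
]

# Each path type is a sequence of edge types; "^-1" walks an edge backwards.
PATH_TYPES = [
    ("follows",),                                  # A follows B
    ("follows", "follows"),                        # A follows someone who follows B
    ("wrote", "tagged", "tagged^-1", "wrote^-1"),  # A and B wrote songs sharing a tag
]

def build_index(edges):
    """Map (node, edge_type) -> neighbor list, adding an inverse for every edge."""
    index = defaultdict(list)
    for src, etype, dst in edges:
        index[(src, etype)].append(dst)
        index[(dst, etype + "^-1")].append(src)
    return index

def path_feature(index, start, end, path_type):
    """Probability that a walk from `start`, following the edge types in
    `path_type` and picking uniformly among neighbors at each step, ends
    at `end`. Walkers with no matching edge simply drop out."""
    dist = {start: 1.0}
    for etype in path_type:
        nxt = defaultdict(float)
        for node, prob in dist.items():
            neighbors = index.get((node, etype), [])
            for neighbor in neighbors:
                nxt[neighbor] += prob / len(neighbors)
        dist = nxt
    return dist.get(end, 0.0)

def collab_probability(features, weights, bias):
    """Logistic regression score for one pair, given its path features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

index = build_index(EDGES)
features = [path_feature(index, "alice", "bob", pt) for pt in PATH_TYPES]
# Made-up weights; a real model would learn them from observed collaborations.
probability = collab_probability(features, weights=[1.0, 0.5, 2.0], bias=-2.0)
```

Note that each path type keeps its own feature (and so its own weight), which is what lets a model treat a “follows” path differently from a shared-tag path.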
Experimental deets and more sexy results are in the paper. My intuition is that the path-based regression maintains semantics about the different kinds of paths (e.g., “follow” edges are slightly different from “message” edges, even though both connect user nodes). All the baseline methods we compared against, on the other hand, require homogeneous edge types; we had to collapse the network into a vanilla adjacency matrix in order to apply them. I think those extra semantics give the path-based regression an advantage. What is über-awesome, though, is that those extra semantics mean something to us, too: we can inspect the model weights associated with different paths to actually understand their effects… which is in fact what led to all the nuanced findings above.
The takeaway here is that we combined inductive and deductive investigation, and got lucky with a best-of-both-worlds result: we built a state-of-the-art model that also provides new insights into social theory. To do that, we let the natural structure of the data define our model vocabulary (in our case, path types through the social network) instead of hand-crafting variables. Then we let LASSO (L1 regularization, which pushes the weights of unhelpful variables to exactly zero), our “theory goggles,” and the qualitative data help us sort out what was meaningful.
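For readers who haven’t met LASSO, here is a rough sketch of why it helps with the “sorting out” step. It is not the code from our study: the data set is invented, the two feature names are hypothetical, and the fitting routine is a plain proximal gradient loop (logistic loss plus an L1 penalty, no intercept) that I am assuming as a stand-in for whatever solver you prefer. One feature perfectly tracks the outcome; the other agrees with it only 5 times out of 8.

```python
import math

FEATURE_NAMES = ["informative_path", "weak_path"]  # hypothetical path types

# Invented toy data: one row per user pair, features coded as -1/+1.
X = [
    [+1, +1], [+1, +1], [+1, +1], [+1, -1],   # pairs that collaborated
    [-1, -1], [-1, -1], [-1, +1], [-1, +1],   # pairs that did not
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping to exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, lam, lr=0.5, steps=500):
    """Proximal gradient descent for mean logistic loss + lam * sum(|w_j|)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Residual (predicted probability minus label) for every row.
        residuals = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of the mean logistic loss with respect to each weight.
        grad = [
            sum(r * row[j] for r, row in zip(residuals, X)) / n
            for j in range(d)
        ]
        # Gradient step, then the L1 shrinkage step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

weights = fit_l1_logistic(X, y, lam=0.15)
for name, weight in zip(FEATURE_NAMES, weights):
    print(f"{name}: {weight:.3f}")
```

With this penalty strength, the weakly related feature ends up with a weight of exactly zero while the informative one keeps a positive weight, so the surviving weights are the short list you then read through your theory goggles.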
I haven’t come across a lot of work that takes this hybrid approach, although I suspect it is out there (and growing). I feel that interdisciplinary machine learning and social science research like this has a lot of potential…