He and she: What's the real difference?
According to a team of computer scientists, we give away our gender in our writing style
By Clive Thompson, 7/6/2003
IMAGINE, FOR A SECOND, that no byline is attached to this article. Judging by the words alone, can you figure out if I am a man or a woman?
Moshe Koppel can. This summer, a group of computer scientists (including Koppel, a professor at Israel's Bar-Ilan University) is publishing two papers that describe the successful results of a gender-detection experiment. The scholars have developed a computer algorithm that can examine an anonymous text and determine, with accuracy rates of better than 80 percent, whether the author is male or female. For centuries, linguists and cultural pundits have argued heatedly about whether men and women communicate differently. But Koppel's group is the first to create an actual prediction machine.
A rather controversial one, too. When the group submitted its first paper to the prestigious journal Proceedings of the National Academy of Sciences, the referees rejected it ''on ideological grounds,'' Koppel maintains. ''They said, `Hey, what do you mean? You're trying to make some claim about men and women being different, and we don't know if that's true. That's just the kind of thing that people are saying in order to oppress women!' And I said `Hey, I'm just reporting the numbers.'''
When they submitted their papers to other journals, the group made a significant tweak. One of the coauthors, Anat Shimoni, added her middle name ''Rachel'' to her byline, to make sure reviewers knew one member of the group was female. (The third scientist is a man, Shlomo Argamon.) The papers were accepted by the journals Literary and Linguistic Computing and Text, and are appearing over the next few months. Koppel says they haven't faced any further accusations of antifeminism.
The odd thing is that the language differences the researchers discovered would seem, at first blush, to be rather benign. They pertain not to complex, ''important'' words, but to the seemingly quotidian parts of speech: the ifs, ands, and buts.
For example, Koppel's group found that the single biggest difference is that women are far more likely than men to use personal pronouns: ''I,'' ''you,'' ''she,'' ''myself,'' ''yourself,'' and the like. Men, in contrast, are more likely to use determiners (''a,'' ''the,'' ''that,'' and ''these'') as well as cardinal numbers and quantifiers like ''more'' or ''some.'' As one of the papers published by Koppel's group notes, men are also more likely to use ''post-head noun modification with an of phrase,'' that is, phrases like ''garden of roses.''
It seems surreal, even spooky, that such seemingly throwaway words would be so revealing of our identity. But text-analysis experts have long relied on these little parts of speech. When you or I write a text, we pay close attention to how we use the main topic-specific words (in this article, words such as ''computer,'' ''program,'' and ''gender''). But we don't pay much attention to how we employ basic parts of speech, which means we're far more likely to use them in unconscious but revealing patterns. Years ago, Donald Foster, a professor of English at Vassar College, unmasked Joe Klein as the author of the anonymous book ''Primary Colors,'' partly by paying attention to words like ''the'' and ''and,'' and to quirks in the use of punctuation. ''They're like fingerprints,'' says Foster.
To divine these subtle patterns, Koppel's team crunched 604 texts taken from the British National Corpus, a collection of 4,124 documents assembled by academics to help study modern language use. Half of the chosen texts were written by men and half by women; they ranged from novels such as Julian Barnes's ''Talking It Over'' to works of nonfiction (including even some pop ephemera, such as an instant-biography of the singer Kylie Minogue). The scientists removed all the topic-specific words, leaving the non-topic-specific ones behind.
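The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the article does not say how topic-specific words were identified, so the sketch assumes a small hand-picked set called FUNCTION_WORDS and a helper named keep_function_words, both hypothetical.

```python
import re

# Hypothetical, deliberately tiny set of topic-neutral words. The article
# does not give the researchers' actual list or their filtering method.
FUNCTION_WORDS = {
    "i", "you", "she", "myself", "yourself",  # personal pronouns
    "a", "the", "that", "these",              # determiners
    "more", "some", "of", "and", "if", "but",
}


def keep_function_words(text):
    """Return the topic-neutral tokens of `text`, lowercased and in order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t in FUNCTION_WORDS]


sample = "The singer said that she and I wrote some of these songs."
print(keep_function_words(sample))
```

Words such as ''singer'' and ''songs'' drop out, and only the small connective words survive to be counted.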
Then they fed the remaining text into an artificial-intelligence sorting algorithm and programmed it to look for elements that were markedly more common in the women's set or in the men's set. ''The more frequently a word got used in one set, the more weight it got. If the word `you' got used in the female set very often and not in the male set, you give it a stronger female weighting,'' Koppel explains.
When the dust settled, the researchers wound up zeroing in on barely 50 features that had the most ''weight,'' either male or female. Not a big group, but one with ferocious predictive power: When the scientists ran their test on new documents culled from the British National Corpus, they could predict the gender of the author with over 80-percent accuracy.
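To make the weighting idea concrete, here is a rough Python sketch. It is not the algorithm from the papers, which the article does not specify; it substitutes a simple log-ratio of smoothed word frequencies. The function names, the add-one smoothing, and the four toy documents are all invented for illustration.

```python
import math
from collections import Counter


def feature_weights(female_docs, male_docs, top_n=50):
    """Weight each word by the log-ratio of its smoothed relative frequency.

    Positive weights lean female, negative weights lean male. Only the
    `top_n` words with the largest absolute weight are kept.
    """
    f_counts = Counter(w for doc in female_docs for w in doc)
    m_counts = Counter(w for doc in male_docs for w in doc)
    vocab = set(f_counts) | set(m_counts)
    # Add-one smoothing so a word missing from one set has nonzero frequency.
    f_total = sum(f_counts.values()) + len(vocab)
    m_total = sum(m_counts.values()) + len(vocab)
    weights = {
        w: math.log((f_counts[w] + 1) / f_total)
        - math.log((m_counts[w] + 1) / m_total)
        for w in vocab
    }
    strongest = sorted(vocab, key=lambda w: abs(weights[w]), reverse=True)
    return {w: weights[w] for w in strongest[:top_n]}


def predict(doc, weights):
    """Sum the weights of the words in `doc`; a tie at zero falls to 'male'."""
    score = sum(weights.get(w, 0.0) for w in doc)
    return "female" if score > 0 else "male"


# Made-up toy documents, already reduced to topic-neutral words.
female_docs = [["i", "you", "she", "the"], ["you", "myself", "i", "a"]]
male_docs = [["the", "a", "some", "of"], ["the", "these", "more", "i"]]

weights = feature_weights(female_docs, male_docs)
print(predict(["you", "i", "you"], weights))
print(predict(["the", "some", "of", "the"], weights))
```

With real data, the weights would be learned from one portion of the corpus and the accuracy measured on documents held back from training, as the researchers did with their new test documents.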
It may be unnerving to think that your gender is so obvious, and so dominates your behavior, that others can discover it by doing a simple word-count. But Koppel says the results actually make a sort of intuitive sense. As he points out, if women use personal pronouns more than men, it may be because of the old sociological saw: Women talk about people, men talk about things. Many scholars of gender and language have argued this for years.
''It's not too surprising,'' agrees Deborah Tannen, a linguist and author of best-sellers such as ''You Just Don't Understand: Women and Men in Conversation.'' ''Because what are [personal] pronouns? They're talking about people. And we know that women write more about people.'' Also, she notes, women typically write in an ''involved'' style, trying to forge a more intimate connection with the reader, which leads to even heavier pronoun use. Meanwhile, if men are writing more frequently about things, that would explain why they're prone to using quantity words like ''some'' or ''many.'' These differences are significant enough that even when Koppel's team analyzed scientific papers, which would seem to be as content-neutral as you can get, they could still spot male and female authors. ''It blew my mind,'' he says.
But this gender-spotting eventually runs into a $64,000 conceptual question: What the heck is gender, anyway? At a basic level, Koppel's group assumes that there are only two different states-you're either male or female. (''Computer scientists love a binary problem,'' as Koppel jokes.) But some theorists of gender, such as Berkeley's Judith Butler, have argued that this is a false duality. Gender isn't simply innate or biological, the argument goes; it's as much about how you act as what you are.
Tannen once had a group of students analyze articles from men's and women's magazines, trying to see if they could guess which articles had appeared in which class of publication. It wasn't hard. In men's magazines, the sentences were always shorter, and the sentences in women's magazines had more ''feeling verbs,'' which would seem to bolster Koppel's findings. But here's the catch: The actual identity of the author didn't matter. When women wrote for men's magazines, they wrote in the ''male'' style. ''It clearly was performance,'' Tannen notes. ''It didn't matter whether the author was male or female. What mattered was whether the intended audience was male or female.''
Critics charge that experiments in gender-prediction don't discover inalienable male/female differences; rather, they help to create and exaggerate such differences. ''You find what you're looking for. And that leads to this sneaking suspicion that it's all hardwired, instead of cultural,'' argues Janet Bing, a linguist at Old Dominion University in Norfolk, Va. She adds: ''This whole rush to categorization usually works against women.'' Bing further notes that gays, lesbians, or transgendered people don't fit neatly into simple social definitions of male or female gender. Would Koppel's algorithm work as well if it analyzed a collection of books written mainly by them?
Koppel enthusiastically agrees it's an interesting question, but ''we haven't run that experiment, so we don't know.'' In the end, he's hoping his group's data will keep critics at bay. ''I'm just reporting the numbers,'' he adds, ''but you can't be careful enough.''
Clive Thompson writes for Wired, The New York Times Magazine, and The Washington Post.