July 18, 2006

Google’s Norvig Questions Berners-Lee on the Semantic Web

by Nick Montfort · 12:24 pm

This morning at AAAI ’06 in Boston I heard the king of the Web speak about what he sees as the next step in this system’s evolution: the Semantic Web. The non-semantic Web has plenty of good introductory material on the topic, so I won’t try to paraphrase Professor Sir Tim Berners-Lee — but I will mention what I remember of the brief, interesting question-and-answer period.

The first question (or first three questions, as Berners-Lee called them) came from Google’s director of research and search quality, Peter Norvig (also a palindromist). Specifically, he identified three problems that had been difficult to overcome on the Web (and which Google had spent a lot of effort working on) and which he saw as causing problems for the more general data-sharing system that Berners-Lee was working on.

First, incompetence: When large numbers of users get involved, many people have difficulty writing well-formed Web pages; how would they get RDF right? Second, commercial dominance in the case where businesses have data: Why would a proprietary market leader want to open its data? Third, deception, of the phishing and SEO sort.

Berners-Lee started off his reply by mentioning what he saw as one of Google’s misconceptions about the Semantic Web: People aren’t supposed to hand-craft each RDF page, but map over data that is already in databases, systematically, using simple scripts. Regarding the second point, commercial dominance, he described how a few companies opening up information (booksellers who first say what titles they carry, then show their prices, then even show their stock levels because competitors do it and this makes things more convenient for customers) can provide a new equilibrium where information is accessible. I don’t recall the brief answer he gave to the third point, but it’s clear to me (either from his talk or elsewise) that establishing trust and discerning bad text or data involves various social as well as technological systems – it may not be easy, but if the problem isn’t insurmountable for the Web it shouldn’t be for the Semantic Web.
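To make the "simple scripts" point concrete, here is a rough sketch of what mapping existing database rows to RDF might look like. It is only an illustration, not anything shown in the talk: the `books` table, its columns, and the `http://example.org/` namespace are all invented for the example, and the output is plain N-Triples built by hand rather than with an RDF library.

```python
import sqlite3

# Hypothetical namespace; a real bookseller would use its own URIs.
BASE = "http://example.org/"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(conn):
    """Turn each row of the (assumed) books table into two N-Triples lines."""
    lines = []
    query = "SELECT id, title, price FROM books ORDER BY id"
    for book_id, title, price in conn.execute(query):
        subject = f"<{BASE}book/{book_id}>"
        lines.append(
            f'{subject} <{BASE}vocab/title> "{escape_literal(title)}" .'
        )
        lines.append(
            f'{subject} <{BASE}vocab/price> "{price}"^^<{XSD_DECIMAL}> .'
        )
    return lines


# Stand-in for data that "is already in databases".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, price TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, "Moby-Dick", "12.50"), (2, 'The "Web" Book', "30.00")],
)

triples = rows_to_ntriples(conn)
print("\n".join(triples))
```

Nobody hand-writes a page here: the script walks the table and emits one subject per row, which is the systematic mapping Berners-Lee described.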

Andrew McCallum of the University of Massachusetts asked the final question. McCallum has made a lot of contributions in automatically exposing the meaning of documents that aren’t marked up semantically. His work uses probabilistic methods, as a great deal of AI work these days does. McCallum asked why there wasn’t support for a probabilistic semantic web, where, for instance, uncertainty or belief about a proposition could be represented.

Berners-Lee responded by using the word “fuzzy,” which was only about 24% relevant to the question. (Fuzzy logic allows for the representation of degrees of truth – “you are sort of tall, so the proposition ‘you are tall’ is 60% true” – but is not probabilistic – “I believe there is a 4% chance that I will get hit by a car if I try to cross the street.”) He expressed the belief that there would be a lot of difficulty in agreeing on what probabilities to assign to different values in a way that would allow them to be shared and reused effectively. I wasn’t clear on why this was the case with probabilities any more than many other sorts of non-probabilistic data. Why can a bookseller label a used book as being in “fine” condition or a movie reviewer give four stars to a film, but there’s no way for, say, a stylistics researcher to express that, according to some method of analysis, a particular text is believed with 80% certainty to have been written by a particular author?
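For what it's worth, the stylistics example can at least be written down with vocabulary RDF already has. The sketch below is one possible encoding, not something proposed by Berners-Lee or McCallum: it uses RDF's standard reification terms (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) to talk about the authorship claim, and attaches the 80% figure with a `certainty` property that is purely hypothetical, as are the `example.org` URIs.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/vocab/"  # hypothetical vocabulary, invented here


def claim_with_certainty(node, subject, predicate, obj, certainty):
    """Describe a triple via RDF reification and attach a degree of belief.

    `node` is a blank-node label for the reified statement; `certainty`
    is a probability in [0, 1] stored under the made-up ex:certainty property.
    """
    if not 0.0 <= certainty <= 1.0:
        raise ValueError("certainty must be between 0 and 1")
    stmt = f"_:{node}"
    return [
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> <{subject}> .",
        f"{stmt} <{RDF}predicate> <{predicate}> .",
        f"{stmt} <{RDF}object> <{obj}> .",
        f'{stmt} <{EX}certainty> "{certainty}"^^<{XSD}double> .',
    ]


# "This text is believed with 80% certainty to be by this author."
claim = claim_with_certainty(
    "claim1",
    "http://example.org/text/42",
    EX + "writtenBy",
    "http://example.org/author/7",
    0.8,
)
print("\n".join(claim))
```

Writing the number down is the easy part. Whether two publishers would mean the same thing by it, which seems to have been Berners-Lee's worry, is a question no encoding answers.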

To return to some (light) Semantic Web bashing, though: I sense a rather different barrier to widespread use of the Semantic Web than Peter Norvig does. It’s just not fun to post data on the Web when you compare it to writing human-readable texts, which express the contributor’s style along authorial, personal, design, and other dimensions. People may offer up their data, but I doubt the enthusiasm level will ever match that of the Language Web. While this isn’t a reason to give up on the Semantic Web, I think it will blunt the cultural impact of this new system when compared to Berners-Lee’s hugely successful and broadly encompassing WWW. One feeble data point to support this: Although Berners-Lee closed by encouraging everyone to contribute data that they had in RDF to the Semantic Web, here I am typing up a blog post instead.