Archive: May 11, 2003

Google and Blogs

Sunday,  05/11/03  11:26 AM

There's been considerable discussion in the blogosphere about Google "dropping blogs" from search results.  Dave Winer linked Andrew Orlowski's article about Eric Schmidt's comments; more recently Dave links Evan Williams' reply that Orlowski is full of crap.  So what's the truth?  Unlike Evan, I have no inside knowledge (Evan is the founder of Pyra, makers of Blogger, which was recently purchased by Google), but here's some educated guesswork...

First, Google is all about delivering accurate search results.  If they thought dropping blogs would help, that's why they would do it.  (Not because they dislike blogs or have some philosophical axe to grind.)  So we need to think about whether blogs improve search results or not.  Second, Google has a history of separating search domains in their GUI (images, groups, directory, news).  Each of these domains has different characteristics, and when a user searches they generally know which domain they want to search within.  It is reasonable to assume that rather than dropping blogs altogether, Google would establish a new domain for them.  So we need to think about why they would do this and how it might work.  Finally, Google works great for most sites, but the way they index blogs could be improved.  So we can think about how blogs could best be indexed.

Dave asked "how will it [Google] tell the difference [between blogs and everything else]"?  I'm not sure how they could tell; there are gray lines between news sites, personal home pages, company sites, e-commerce stores, blogs, etc., but there are technical ways to distinguish (blogs ping, they have RSS feeds, etc.).  More on this below, but for now let's think about the differences a search engine would care about:

  1. Blogs' content changes frequently.
  2. Blogs are link-rich and content-poor.
  3. Blogs contain personal opinion.

If you think about it, these things all make blogs less useful to search engines.  Let's consider them in turn:

Blogs' content changes frequently.  Blogs are chronological diaries; many bloggers post at least once a day and some post multiple times a day.  Each post usually has a "permalink" (a URL which always links to the post), but the blog itself has a constant URL, and the content of that URL is always changing.  Consider my little blog; I post about once per day, and Google's spider visits me about once per day.  It takes Google some time before their spider's data are indexed and absorbed, so most of the time what Google "thinks" is on my blog's home page is only accurate for a few hours.  This is shown vividly by looking at my referer logs; Google often directs people to my home page based on content which is no longer there!

Blogs are link-rich and content-poor.  Many posts on a blog simply link to other posts on other blogs, perhaps adding some commentary and/or associating multiple posts with similar content together.  Not all blogs are that way - this is the "thinkers" vs. "linkers" distinction I've mentioned before - but overall if Google directs a searcher to a blog, they're more likely to find links than the information itself.  There is value in having the links aggregated by the blogger, but that's what Google does anyway.  So most blog posts are not very good targets for a search, even if many other bloggers have linked to them.

Blogs contain personal opinion.  By their very nature, blogs are one or a small number of people's thoughts about their world.  Blogs which blandly report news are uncommon; most blogs are full of philosophy, politics, sociology, and general spin.  This is what makes them interesting and fun to read, but it isn't clear this is helpful for someone searching for information.  If you are searching for "George Bush landing on the U.S.S. Lincoln", that's what you want to find, not 1,000 bloggers' personal opinions about George Bush's landing.

So I can see why Google might want to exclude blogs from search results.  By the same token, blogs have information that can't be found anywhere else; they are an incredible source of information.  The information takes several forms:

  • Firsthand accounts of news events.  Frequently bloggers "are there", and contribute detail and insight (and photos) unavailable anywhere else.
  • Links connecting information together in virtual threads.  The interconnections between blog posts are amazingly informative.  Consider the brief thread I described above: Dave Winer -> Evan Williams -> Andrew Orlowski -> Eric Schmidt.  Each added information to the overall picture, but I never would have found these connections by simply searching Google.
  • Personal opinions.  I noted above that if you are searching for information about George Bush's landing, blogs would not be helpful.  { Except for a firsthand account, of course; what if a Navy seaman blogged about the event? }  But if you wanted to know what people thought about the landing, checking blogs is absolutely the thing to do (as opposed to, say, taking CNN's or Fox's word for it).
  • Discussions.  In addition to one person's opinion, you have the give-and-take between many people.  Frequently blogs have comment threads which host the discussion.  Or bloggers may link back and forth on their own blogs, perhaps connected by trackbacks.  The discussion is often more illuminating than the original information.

So I can see where Google would definitely want to continue presenting blogs' information, but segregated into a different search domain.  They would do this for another reason, too - to improve the presentation of results.  Google News results are different from Google Web results, and they are presented differently too, as a reflection of the underlying differences in the content.

There is no doubt Google's approach to indexing web sites made a qualitative improvement in web searching.  But there are ways blogs can be indexed which would be a big step forward:

  • Use pings for currency.  Most blogs "ping" whenever their content changes.  Google could use this to determine when a blog's content has changed and schedule their spiders accordingly.  By the same token any site which pings should be considered a blog.  If Google did this, everyone with a blog would want to ping.

That's the answer to Dave's question "how will it tell the difference?" - it will ask the bloggers!
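Here's a rough sketch of what ping-driven crawling might look like.  This is only my guess at the idea, not how Google or any real ping service works; the class and method names (CrawlScheduler, ping, next_to_crawl) are made up for illustration, and the timestamps are just numbers.

```python
import heapq


class CrawlScheduler:
    """Toy scheduler (hypothetical): a ping marks a site as a blog
    and queues it for a recrawl, oldest ping first."""

    def __init__(self):
        self.blogs = set()    # every site that has pinged at least once
        self._queue = []      # min-heap of (ping_time, url)
        self._queued = set()  # urls currently waiting in the heap

    def ping(self, url, when):
        """Record that `url` says its content changed at time `when`."""
        self.blogs.add(url)
        # A site already waiting for a crawl does not need a second entry.
        if url not in self._queued:
            heapq.heappush(self._queue, (when, url))
            self._queued.add(url)

    def next_to_crawl(self):
        """Return the url with the oldest outstanding ping, or None."""
        if not self._queue:
            return None
        _, url = heapq.heappop(self._queue)
        self._queued.discard(url)
        return url

    def is_blog(self, url):
        """Assumption from the post: any site which pings counts as a blog."""
        return url in self.blogs
```

The point of the sketch is that the spider never has to guess: it only visits sites that said they changed, and the same ping doubles as the "I am a blog" signal.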

  • Use RSS feeds for content.  Most blogs have RSS feeds which abstract their content.  Google can use blogs' RSS feeds to determine what posts are at which URLs without laboriously spidering each blog every time it is updated.  If Google did this everyone with a blog would want to have an RSS feed.
  • Model the interconnections between posts.  The multithreaded world of links between blogs contains a mine of information - as shown by Technorati, Dave Sifry's terrific search engine.  If Google could provide a way to find and display these threads, it would be really cool.  Currently we have comments, trackbacks, links between sites, etc. - all valuable and all different - and it is tough to get the big picture without a lot of clicking around.
  • Aggregate opinions.  The magic of Google is that they use links to index pages, instead of the contents themselves.  ("You have what people say you have.")  This technique applied to blog posts could be very valuable: use links to categorize an expressed opinion, instead of the opinion itself.  ("You think what people say you think.")
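To make the RSS idea concrete, here is a minimal sketch of pulling post permalinks out of a feed with Python's standard library.  The feed below is a made-up sample (the blog.example URLs and the posts_from_feed name are my own inventions), and it assumes a plain RSS 2.0 feed where each item has a title, link, and description.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://blog.example/</link>
    <item>
      <title>Google and Blogs</title>
      <link>http://blog.example/2003/05/11/google-and-blogs</link>
      <description>Educated guesswork about Google dropping blogs.</description>
    </item>
    <item>
      <title>It's all happening</title>
      <link>http://blog.example/2003/05/11/its-all-happening</link>
      <description>Social software, metadata, and Prey.</description>
    </item>
  </channel>
</rss>"""


def posts_from_feed(feed_xml):
    """Return a list of (permalink, title, summary) tuples, one per <item>."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        # findtext returns the default when an element is missing.
        link = item.findtext("link", default="")
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        posts.append((link.strip(), title.strip(), summary.strip()))
    return posts


for permalink, title, _ in posts_from_feed(SAMPLE_FEED):
    print(permalink, "-", title)
```

With something like this an indexer could file each post under its permalink instead of under the ever-changing home page, which is exactly the stale-referer problem I described above.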

No doubt there are other ways, too.  By segregating blogs and treating them differently, Google could improve the blog searching experience.  Which in turn would make the information on blogs more valuable.
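The link-based ideas (modeling interconnections and aggregating by what links in) can be sketched with a tiny link graph.  This is a toy under stated assumptions, not Technorati's or Google's method: the post ids are just labels I made up for the thread mentioned earlier, and build_graph and thread_from are hypothetical helpers.

```python
from collections import defaultdict, deque

# (source, target) pairs: "source links to target".  Labels are illustrative.
LINKS = [
    ("winer", "williams"),
    ("williams", "orlowski"),
    ("orlowski", "schmidt"),
    ("winer", "orlowski"),
]


def build_graph(links):
    """Return (outbound, inbound) adjacency maps for a list of link pairs."""
    outbound = defaultdict(list)
    inbound = defaultdict(list)
    for source, target in links:
        outbound[source].append(target)
        inbound[target].append(source)
    return outbound, inbound


def thread_from(start, outbound):
    """Breadth-first walk of outbound links; returns posts in discovery order."""
    seen = [start]
    queue = deque([start])
    while queue:
        post = queue.popleft()
        for target in outbound.get(post, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


outbound, inbound = build_graph(LINKS)
print(thread_from("winer", outbound))  # the whole thread, starting from one post
print(inbound["orlowski"])             # who links to the article
```

The inbound map is the "you have what people say you have" half: to categorize a post, you would look at the posts pointing at it rather than at the post itself.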

Wrapping up, here are my conclusions:

  • Google might want to exclude blogs from search results.
  • Google would definitely want to continue presenting blogs' information, segregated into a different search domain.
  • Google could improve the blog searching experience by leveraging attributes of the blogs themselves, such as RSS feeds, comments, and trackbacks, and by applying their technique of using links to categorize content.

Those are my thoughts, I'm sure you'll have others.  I'll search for them :)

P.S. Click here for a Technorati search for blogs which link to Orlowski's article.  There are 195 listed, each of which has other inbound links, comment threads, trackbacks, etc.  Amazing!



Sunday,  05/11/03  11:06 PM

It's all happening... (seems like a good name for a blog :)

I've ignored the whole Jayson Blair / NYTimes thing; but of course I have a strong opinion.  So does Glenn Reynolds!  Anyone who thinks this isn't affirmative action-related is not paying attention.  The worst part of this whole affair is the way it taints the work of any other minority reporter; people may question their veracity based solely on race.  (In the same way that an Ivy-league degree means less if you're a minority.)  So we see that affirmative action actually hurts minorities in their struggle for credibility.  I hope the folks at the University of Michigan who are fighting to preserve their admission policies think about this.

Bush and Blair have been nominated for a Nobel Peace Prize.  Makes as much sense as anyone else...  This prize just doesn't have the luster or credibility of, say, the prize for Chemistry.  I mean, Yasser Arafat has won it.

There's a lot of discussion in the blogosphere about "social software".  The Guardian wonders "is it the next big thing or just hype".  I think - just hype.  Tools for communication are what people really need.  Don Park has misgivings about it...  Don't worry, Don.  It is what it is.  If there's a "there there", it will emerge from human-to-human communication, not because it is forced.

I think about "social software" as I think about "the semantic web".  I don't really get it, because there isn't really anything to get.  Dave Winer agrees.  And Scoble goes further and claims "the whole metadata movement is over-hyped".  I agree completely.  Raise your hand if you enter keywords for your Word documents so you can categorize them later.

People will do the minimum *now* if there isn't a payoff.  This is why the best GUIs are as simple as possible.  Consider Google - one input field.  We discovered this big-time at PayPal: make the sign-up process as simple as possible, and more people will sign up.  (Asking even one more question measurably reduced the completion rate.)

Metadata is best thought of as an emergent property, not an explicit one.  Tools which manage emergent metadata are very useful - Google is a perfect example.  Tools which require explicit entry of metadata are not...  RSS feeds work precisely because nobody has to do much to create them ("Really Simple Syndication").  If I had to go and tag every post in my blog with metadata, I would never do it.

Wired looks for A Tivo Player for the Radio.  Me, too!  Bottom line - the killer product is not out yet.  When it is, you'll know it.

Another new online music service: Magnatune.  [ via Cory Doctorow ]  "We are an Internet record label which sells and licenses music by encouraging MP3 file trading and Internet Radio."  Interesting, but of course only "unknown" artists are represented...

I finished Michael Crichton's Prey.  Not that good.  Yeah, the nanotechnology ideas were there, and the genetic programming algorithms, and of course as usual he creates interesting characters and a sense of tension.  But unlike some of his other novels it all seemed too farfetched; the science was far away from what's actually possible.

To see what is actually possible, check out the Avida project at Caltech.  This software is designed to model systems which feature self-reproduction, genetic algorithms, mutation, etc.  Really cool.

Also related - here's a great overview of grid computing from IBM.

Between the advances of nanotechnology, genetic algorithms, and grid computing, something like Crichton envisions will exist, but dust clouds of nanoparticles spontaneously emulating people?  No.

With the Matrix Reloaded on tap (three more days!) consider The top 10 things I hate about Star Trek.  I'm going to reverse the polarity of this website right *now*; watch out!


