Sunday 31 January 2010

The Single Source of Search

There's more information on the Internet than one could absorb in a hundred lifetimes, and it keeps growing - and (most of the time) it is kept up to date. Because different organisations, places and networks hold that information, it is hard to bring it all together. So how do we make that information homogeneous and uniformly accessible?

Can it be done? Should it be done?

Over the last decade we went from data to knowledge. The World Wide Web linked companies and consumers; that inspired the more shy organisations to build intranets, where only company people would find each other.
That was about linking data: the same things could be done in a new way.

Then web shops came. Forums. Peer-to-peer networks for sharing every legal and not-so-legal delight. All that came and became mainstream so fast that the inter-intra move didn't even have time to "happen".
That was about connecting, and about information: more or less new things were done in new ways.

In the last few years, social networks conquered the digital earth: LinkedIn, Facebook, Twitter. That was such a different kind of behaviour, absolutely new, built upon the infrastructure already laid out (computers, networks and the people using them).
That was about sharing information, acquiring knowledge: entirely new means to an entirely new end.

Meanwhile, Wikipedia was born: an unprecedented source of information with more than 14,000,000 articles in more than 260 languages.
Monthly unique visitors for all of these: LinkedIn 15 million, Facebook 130 million, Twitter 55 million, Wikipedia 60 million.
That's a lot of data, information and knowledge. And it's all out there. Wait, where?

Yes, it's all out there, pretty much. Google helps us find it, almost in real time. We've seen some struggles with Facebook, which treated its data as a walled garden, but it is slowly opening up too. With Google starting to index books, video and other content, all the knowledge in the world is becoming available online and in real time.
But it is scattered all over the place, in different forms, behind different doors: not uniform or homogeneous. It's very diverse.

The Integration theme: overcoming diversity

Integrating applications, departments and companies has shown this same theme over the last decades: diversity in form, location and accessibility has to be overcome.
The European Parliament shows how that can be done: introduce an intermediate language (or two, in that case), support different communication channels, and facilitate-by-translation.
That works very well for all: the focus and attention remain on the "stars" themselves, the highly specialised participants. Just as business shouldn't be bothered with IT, they aren't bothered by linguistic barriers and can just move in and out.

There's a big precondition to all that though, which is that the semantics are agreed upon beforehand. In the European Parliament, changing semantics are somehow magically picked up by all parties involved. Now how does all this work on the World Wide Web?

The first WWW problem: different format

Structured versus unstructured versus semi-structured. HTML, text, .doc, .PDF, Facebook updates, tweets: it's all different. However, search engines make all of that transparent; after all, there are only so many syntaxes around. Of course, visuals like video and images are an entirely different topic, but even those are magically informated by Google.

The second WWW problem: different location

Is it on the web, or behind a company firewall? Does it need authorisation? Only what is openly available can be searched. And it doesn't matter whether it is located north, east, south or west, or orbiting the earth: search engines make all of that transparent too.

The third WWW problem: different languages, dialects and typos

It still takes too many rules to translate one language perfectly into another. English is widely present though, and there are as many typos and spelling errors made by native speakers as by foreigners. All of that has to be taken into account as well. Luckily most search engines do: they suggest the correct spelling if you misspell a search term, and they'll even include misspelt results.
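
As a rough illustration of that spell-correcting behaviour (not how any particular engine actually does it), here's a minimal sketch in Python that suggests a correction by string similarity against a small, made-up vocabulary:

    # Minimal sketch: suggest a spelling correction by similarity to known terms.
    # The vocabulary and the misspelt query are made up for illustration.
    from difflib import get_close_matches

    vocabulary = ["semantics", "search", "knowledge", "wikipedia", "integration"]

    def suggest(query):
        """Return the closest known term, or the query itself if nothing is close."""
        matches = get_close_matches(query.lower(), vocabulary, n=1, cutoff=0.75)
        return matches[0] if matches else query

    print(suggest("semantcis"))  # -> 'semantics', i.e. "Did you mean: semantics?"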

The real WWW problem: different semantics

The biggest problem is (changing) semantics. Wikipedia spends pages and pages on disambiguation, explaining the differences between one word or acronym and another. The word web, for instance, can have entirely different meanings in different contexts. Even if, across all different forms, locations and languages, you are looking for the word web, what is the context you want to place it in? Heck, you might not even know that yourself...
The best example of how lively a semantic discussion can be is the initial discussion around E2.0 and Social Business Design.
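
To make the "web" example concrete, here's a minimal sketch of picking a meaning based on the surrounding words; the senses and their context words are made up for illustration:

    # Minimal sketch: disambiguating the word "web" by its surrounding context.
    # The senses and the context words are made up for illustration.

    senses_of_web = {
        "World Wide Web": {"internet", "browser", "html", "online"},
        "spider web":     {"spider", "silk", "insect"},
    }

    def disambiguate(sentence):
        """Pick the sense whose context words overlap most with the sentence."""
        words = set(sentence.lower().split())
        return max(senses_of_web, key=lambda sense: len(senses_of_web[sense] & words))

    print(disambiguate("The spider spun its web of silk"))            # -> 'spider web'
    print(disambiguate("I searched the web with my browser online"))  # -> 'World Wide Web'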

The possible solution: autonomous tagging

Tagging is a way of labelling a piece of information with a single word or phrase. Tags are decided upon individually by humans, in relative isolation; there is no central, global tagging system to pick one's tags from. Still, tags are themselves a form of language, or at least of communication. If information were tagged, those tags could be translated or related, and form connections across all the diversities above.
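
To illustrate that last idea, here's a minimal sketch (all tags, relations and tagged items are made up) of how related or translated tags could connect information that was tagged in isolation:

    # Minimal sketch: relating tags across sources and languages.
    # All tags, relations and tagged items below are made up for illustration.

    # Each tag maps to the tags it is considered equivalent or related to.
    tag_relations = {
        "web":   {"www", "world wide web"},
        "cloud": {"cloud computing", "timesharing"},
    }

    # Pieces of information, each tagged in isolation by different people.
    tagged_items = [
        {"source": "blog post",      "tags": {"www"}},
        {"source": "PDF whitepaper", "tags": {"cloud computing"}},
        {"source": "old archive",    "tags": {"timesharing"}},
    ]

    def expand(tag):
        """A tag plus everything it is related or translated to."""
        return {tag} | tag_relations.get(tag, set())

    def find(tag):
        """Find items whose tags overlap with the expanded set of tags."""
        wanted = expand(tag)
        return [item["source"] for item in tagged_items if item["tags"] & wanted]

    print(find("web"))    # -> ['blog post']
    print(find("cloud"))  # -> ['PDF whitepaper', 'old archive']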

What if there were a tag knowledgebase much like today's Wikipedia, where tags are maintained, explained, etcetera? That would be the ultimate source of metadata, making the Single Source of Search possible. Its interface could be defined and plugged into, and it would be the single source of truth for the Semantic Web.
Bots could crawl the entire Web, tagging information whether it's HTML, PDF, video, images or whatever.
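
A minimal sketch of what such a tagging bot might look like, assuming an imaginary tag knowledgebase and a couple of made-up pages (a real bot would fetch pages over the network and need specialised extractors for PDF, video and images):

    # Minimal sketch: a bot that tags crawled content against a central tag knowledgebase.
    # The knowledgebase entries and the documents are imaginary.

    tag_knowledgebase = {
        "semantics":   "The meaning of words and symbols in context",
        "integration": "Connecting applications, departments and companies",
        "wikipedia":   "A free, collaboratively edited encyclopedia",
    }

    documents = {
        "http://example.org/page1": "An article on integration and changing semantics",
        "http://example.org/page2": "Notes about Wikipedia and disambiguation",
    }

    def tag_document(text):
        """Attach every knowledgebase tag whose term appears in the text."""
        words = text.lower()
        return [tag for tag in tag_knowledgebase if tag in words]

    index = {url: tag_document(text) for url, text in documents.items()}
    for url, tags in index.items():
        print(url, "->", tags)
    # http://example.org/page1 -> ['semantics', 'integration']
    # http://example.org/page2 -> ['wikipedia']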

In my last post I wrote about the position and quality of humans versus machines. This very complex and dynamic terrain is definitely something that needs to be explored and maintained by humans first. When that's successful, we might be able to automate it, and skip to the next level: wisdom.

2 comments:

Henk van Zuilekom said...

In the early 80s there was a system, TRS (Text Retrieval System), that would allow you to link words, creating a 'tag-tree' (for want of a better word). Search on one word and all related words could be included in the search.
Moreover, it could also search on the 'distance' between words within a sentence, paragraph or document. The problem with human tagging is that it relies on the mood of the person at the moment of writing. Plus you would want to add tags when new buzzwords become hot, and nobody goes back to add a tag. Consequently, such a tag-base should be used during search, not write.

Martijn Linssen said...

Thank you Henk, points well taken!

I think of Asterisq's Mention map (http://dlvr.it/16Yr) as a tool for linked/distance search like you suggest, does that come close?

Yes, perceptions do change, as do definitions. I agree that going back and retagging wouldn't help. To use a very blunt example: tags added to Timesharing (http://en.wikipedia.org/wiki/Time-sharing) in the '60s and '70s could now be related to Cloud ;-)
Context should translate the current tags to "old tags" during search.
