This is a new sort of service, and it is important because, despite the phenomenal growth of the internet, 80% of the new information created each year is still unstructured – in other words, it is the text, graphics and video contained in things like documents, emails, Facebook shares, web pages and tweets. This unstructured information is poorly integrated with the World Wide Web. Even when published to a web site, these items remain isolated islands of information – until now.
Within the last five years, the size of the average web page has more than tripled, and the number of external objects has more than doubled. The average web page is now 1 MB and growing at over 20% each year. Today the average web page has about 10 links and 600 words. So it is reasonable to say that less than 2% of the content is hyperlinked to something else.
The documents you deal with each day often contain only a single link (or none at all) and many more words than a web page.
If you are lucky, there is a single bridge (hyperlink) from the web to your document. Think of your documents as information that is connected to the web superhighway only via an off-ramp. You can find published documents easily enough because published documents are indexed by the likes of Google, Baidu and Bing. But what you can’t do is start from your document and get easily back onto the web in an intelligent way.
That’s why we see the current web-document world as a one-way off-ramp.
Bridger works for many of your documents – whether they are published on the web or not. This is how we can release the untapped value of the 80% of unstructured text. How do we do this? The good news is that you really don’t have to know about this if you don’t want to.
For the rest of us, all you need to understand is that Bridger takes your isolated documents, emails, etc. and automatically links them to over 50 billion trusted additional sources of information on the world wide web.
Drop a document on to the Bridger mobile app and it will instantly tell you what the document is about and give you web profiles of all the relevant people, organizations, places, technologies and ideas.
Your document is transformed into a fully integrated web body of knowledge.
Degree 1 Determine internal document structure
The machine reads the full text of the document and discards everything but the main text. For example, an article from The Times of May 02, 2013 entitled:
"President Barack Obama says it's time to slam the door on Guantanamo Bay"
Degree 2 Detect subjects
Determine the subjects (topics, people, organizations, places, technologies, etc.) talked about in the text. In our example, these might include:
David Taylor, Alexandra Frean, Washington, Barack Obama, President, Guantanamo Bay, Hunger strike
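The idea behind this step can be sketched in a few lines of code. This is not Bridger's actual subject detector (the document later notes it relies on services such as AlchemyAPI and OpenCalais); it is a deliberately simplified stand-in that scans the text against a small hand-written gazetteer of known subject phrases:

```python
# Toy sketch of "Degree 2" subject detection via gazetteer lookup.
# The gazetteer below is hypothetical; a real system derives candidate
# subjects statistically rather than from a fixed list.

TEXT = ("President Barack Obama says it's time to slam the door on "
        "Guantanamo Bay, reports David Taylor in Washington as the "
        "hunger strike continues.")

GAZETTEER = {
    "Barack Obama": "Person",
    "David Taylor": "Person",
    "Washington": "Place",
    "Guantanamo Bay": "Place",
    "President": "Role",
    "hunger strike": "Topic",
}

def detect_subjects(text, gazetteer):
    """Return (phrase, type) pairs found in the text, longest phrase first."""
    found = []
    lowered = text.lower()
    for phrase, kind in sorted(gazetteer.items(), key=lambda p: -len(p[0])):
        if phrase.lower() in lowered:
            found.append((phrase, kind))
    return found

subjects = detect_subjects(TEXT, GAZETTEER)
for phrase, kind in subjects:
    print(f"{phrase}: {kind}")
```

Matching longer phrases first means "Barack Obama" is recognized as one subject rather than colliding with shorter entries.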
Degree 3 Analyze semantic relationships
Get the computer to identify relationships between subjects; in our example these include:
David Taylor works for The Times of London newspaper
Barack Obama is President of the United States
US Congress is constitutionally related to the office of the President of the United States
Hunger strike is happening at Guantanamo Bay
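The relationships above have a natural machine representation: subject–predicate–object triples, the same shape RDF uses. The snippet below is only an illustration of that representation – the predicate labels are informal English, not a real vocabulary:

```python
# Toy sketch of "Degree 3": the inferred relationships expressed as
# subject-predicate-object triples (the shape RDF also uses).
# Predicates here are illustrative labels, not a standard vocabulary.

triples = [
    ("David Taylor", "works for", "The Times of London"),
    ("Barack Obama", "is President of", "United States"),
    ("US Congress", "is constitutionally related to", "Office of the President"),
    ("Hunger strike", "is happening at", "Guantanamo Bay"),
]

def relations_about(subject, facts):
    """All triples mentioning the subject on either side."""
    return [t for t in facts if subject in (t[0], t[2])]

for s, p, o in relations_about("Barack Obama", triples):
    print(s, p, o)
```

Storing relationships this way makes the later steps – ranking subjects and linking them out to the web – simple queries over a list of facts.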
Degree 4 Rank subjects by in-context importance
Work out how important each subject is in the context of the overall document. In our example:
Subject | Rank
Barack Obama | 0.803
David Taylor | 0.515
Hunger strike | 0.511
The Times of London newspaper | 0.287
US Congress | 0.201
Guantanamo Bay | 0.174
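Bridger's actual ranking method is not described here, so the sketch below is only a stand-in for the general idea: combine how often a subject is mentioned with how early it first appears, and normalize the result to a 0-to-1 score. The mention counts are invented for the example article:

```python
# Toy sketch of "Degree 4": scoring each subject's in-context importance.
# This stand-in blends mention frequency with a position bonus; the real
# ranking algorithm is Bridger's own and may differ entirely.

def rank_subjects(mentions, first_position):
    """mentions: subject -> mention count.
    first_position: subject -> index of first mention (earlier = higher)."""
    max_count = max(mentions.values())
    scores = {}
    for subject, count in mentions.items():
        frequency = count / max_count          # 0..1, relative to top subject
        earliness = 1.0 / (1 + first_position[subject])
        scores[subject] = round(0.7 * frequency + 0.3 * earliness, 3)
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Hypothetical counts and positions for the example article.
mentions = {"Barack Obama": 5, "Hunger strike": 3, "Guantanamo Bay": 2,
            "David Taylor": 1, "US Congress": 1}
first_position = {"Barack Obama": 0, "Hunger strike": 4, "Guantanamo Bay": 1,
                  "David Taylor": 9, "US Congress": 7}

scores = rank_subjects(mentions, first_position)
for subject, score in scores.items():
    print(f"{subject}: {score}")
```

The 0.7/0.3 weighting is arbitrary; the point is only that several signals are folded into a single comparable score per subject.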
Degree 5 Create subject profiles
Use linked open data to dynamically compose profiles of topics, people, organizations, places and technologies, such as the example profile shown on the right.
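"Composing" a profile means merging what several linked-open-data sources say about the same subject into one record. The source records below are hard-coded stand-ins for what DBpedia or Freebase might return; a real profile would be fetched live:

```python
# Toy sketch of "Degree 5": merging fields from multiple LOD sources
# into one subject profile. The records are invented stand-ins, not
# real API responses.

dbpedia_record = {
    "name": "Barack Obama",
    "abstract": "Barack Obama is the 44th President of the United States.",
    "birthPlace": "Honolulu, Hawaii",
}
freebase_record = {
    "name": "Barack Obama",
    "profession": "Politician",
    "birthPlace": "Honolulu, Hawaii, United States",  # overlapping detail
}

def compose_profile(*records):
    """Merge records field by field; the first source to supply a field wins."""
    profile = {}
    for record in records:
        for field, value in record.items():
            profile.setdefault(field, value)
    return profile

profile = compose_profile(dbpedia_record, freebase_record)
print(profile["abstract"])
```

A "first source wins" rule is the simplest possible conflict policy; a production system would weigh source trust and freshness instead.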
Degree 6 Build and embed links
All subjects are then given embedded links in the original text of your document and the text summary. Profiles are also enhanced with a relevant link to further information available on the web.
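Mechanically, embedding a link means wrapping each detected subject mention in an HTML anchor pointing at its web resource. The sketch below uses DBpedia-style URLs purely as illustrative targets – they are not necessarily the links Bridger itself would choose:

```python
# Toy sketch of "Degree 6": wrapping each subject's first mention in an
# HTML anchor. The target URLs follow DBpedia's naming style but are
# illustrative choices only.

import re

LINKS = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "Guantanamo Bay": "http://dbpedia.org/resource/Guantanamo_Bay",
}

def embed_links(text, links):
    """Wrap the first mention of each subject in an HTML anchor."""
    for phrase, url in links.items():
        text = re.sub(re.escape(phrase),
                      f'<a href="{url}">{phrase}</a>',
                      text, count=1)
    return text

html = embed_links("Barack Obama wants to close Guantanamo Bay.", LINKS)
print(html)
```

Linking only the first mention (`count=1`) keeps a heavily-mentioned subject from turning the whole text blue.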
Google now claims to index over 25 billion pages. The average amount of text in a web page is around 10,000 words. This means that, for most searches, there will be a huge number of pages matching the words in your search phrase. That’s why you get 104,000,000 Google hits when searching for ‘Ethiopian age statistics’. That’s a massive ‘haystack’ somewhere in which your ‘needle’ can be found.
The genius of Google is the way it ranks the importance to you of the pages that fit the search criteria.
We have turned the haystack into the needle by starting at a different place: your document. Bridger gives you the technology to use the web from the point-of-view of your document. This is done by combining some very smart semantic analysis software with linked open data, the semantic web and data visualization1. These concepts are explained in the following sections.
Semantic analysis relates syntactic structures (such as phrases, clauses, sentences and paragraphs) to their meanings. Six Degrees Of Data (6DD) uses a semantic analysis ranking solution that is built into Bridger. We then use this structural knowledge to infer relationships between subjects and, from those, their relative importance.
The Semantic Web is a collaborative movement led by the international standards body, the World Wide Web Consortium (W3C). Its standards promote common data formats on the World Wide Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims to convert the current web, dominated by unstructured and semi-structured documents, into a "web of data". The Semantic Web stack builds on the W3C's Resource Description Framework (RDF).
“A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”
Tim Berners-Lee, director of the World Wide Web Consortium, coined the term ‘Linked Data’ in a design note discussing issues around the Semantic Web project.
Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the Open Data movement are similar to those of other "open" movements such as open source, open hardware, open content, and open access.
Linked Open Data is best thought of as a powerful way to ‘break out’ of the data silos created by the millions of databases that currently make up the World Wide Web.
Linked data describes a method of publishing structured data so that it can be interlinked and so become more useful. It builds on standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages for human readers, it uses them to share information in a way that computers can read automatically. This enables data from different sources to be connected and queried.
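The "connected and queried" part can be made concrete with a tiny example. Facts are stored as triples whose subjects and objects are URIs, so statements published by different sources can be joined on those shared identifiers. The URIs below follow DBpedia's real naming scheme, but the triples are hand-written illustrations, and the wildcard query is a toy version of what SPARQL does:

```python
# A minimal sketch of the linked-data idea: RDF-style triples with URI
# identifiers, queried by pattern matching (None = wildcard, loosely
# analogous to a SPARQL ?variable). The facts are hand-written examples.

DB = "http://dbpedia.org/resource/"

triples = [
    (DB + "Barack_Obama", "birthPlace", DB + "Honolulu"),
    (DB + "Honolulu", "country", DB + "United_States"),
]

def query(pattern, facts):
    """Return triples matching an (s, p, o) pattern; None matches anything."""
    s, p, o = pattern
    return [t for t in facts
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Join two facts: where was Obama born, and in which country is that place?
(_, _, city), = query((DB + "Barack_Obama", "birthPlace", None), triples)
(_, _, country), = query((city, "country", None), triples)
print(city, "->", country)
```

Because `city` is a URI rather than a plain string, the second query can join against facts that any other publisher might have contributed about the same resource.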
Before all this can happen the unstructured data needs to be structured. The good news is that there are a number of high-profile areas of the web that have already done this. It is one of the best-kept secrets of today’s web that linked open data is available from:
Wikipedia – contains about 3,500,000 concepts described by 1 billion relationships, including abstracts in 11 different languages
GeoNames – provides descriptions of more than 7,500,000 geographical features worldwide.
Freebase – with 40,000,000 topics about people, places and things and 1.3 billion facts.
Publishers – such as the New York Times, the BBC, and others
Governments – from around the world are embracing open data.
In total there are estimated to be over 50 billion linked open data ‘things’ distributed across hundreds of data sets on the web. Subject areas span many different domains, including geography, media, biology, chemistry, economics and energy.
6DD have created Bridger as a new category of product where documents realize Tim Berners-Lee’s vision of the next generation web and the Linked Open Data movement.
LOD makes it possible for computers to understand context in the same way that you and I know that ‘totally sick’ is our teenage child’s way of saying that they like something and not that they are feeling unwell. Easy for us, but up until LOD a real challenge for computers.
This matters because LOD makes it possible for computers to interpret our unstructured text and give meaningful and relevant responses. Bridger’s Profiles are an example of this.
A ‘cloud’ of 50 billion LOD things sounds – and is – vast, but it still only represents a fraction of the web today. So not everything is in the LOD cloud, and not everything in the LOD cloud is accurate.
The good news is that the LOD cloud is growing daily. Bridger is positioned to bring you each advance in LOD, but it will never be perfect and will regularly get things wrong. Bridger has been extensively tested and on average is correct 8 times out of 10 – and we are working hard to improve these results by developing new and smarter software components to build into Bridger. This is easy to do because Bridger is 100% browser based for our users and all of the ‘smart stuff’ lives in the cloud.
1. Bridger relies on a wide range of technologies and partners to function. These include Adobe Acrobat SDK, Alchemy API, Apache License 2.0, Apache Solr, Crunchbase, DBPedia, Dublin Core Metadata Initiative, EDGAR, Facebook, FOAF, Freebase, GeoNames, GNU Lesser General Public License, Google Custom Search API, Leximancer, LinkedMDB, New York Times, MusicBrainz, OpenCorporates, OpenCalais, OpenCyc, The OpenCyc Knowledge Base, The OpenCyc Java API and Other Non-CycL Open Source Code, The OpenCyc Knowledge Server, The OpenCyc OWL Ontologies, Smmly.com, Twitter, Wikipedia, WordNet, YAGO.
2. ’Semanticness’ of the web diagram courtesy of blog ‘About the social semantic web’ at ablvienna.wordpress.com.