Last week, Jeff Ullman from Stanford University gave a talk to the Informatics Entrepreneurship class which was opened up to the wider world. The core of it was the ideas that made Google really what it is – PageRank and TrustRank – and the technical concepts behind it. Here are my notes from the talk just in case you missed it and wanted to know what it was about …
The early search engines crawled the web (following links from page to page, finding and copying as many pages as they could), indexed the pages by the words they contained, and then, given a search query, found the pages that contain those words.
But pages differ in importance, so the engine has to estimate the importance of each page – or think of it as relevance to the query. The first search engines considered the role the query terms played in the page – e.g. appearing in the title rather than in a paragraph – and then how many times each word appeared in the page.
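The crawl-index-query pipeline above can be sketched in a few lines. This is a toy illustration, not how any real engine works – the pages, words, and scoring are all made up – but it shows the inverted index and the term-frequency ranking that term spammers later gamed.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for crawled pages.
pages = {
    "p1": "google search engine ranks pages",
    "p2": "spam pages repeat words words words",
    "p3": "search engines index pages by words",
}

# Build the inverted index: word -> {page: term frequency}.
index = defaultdict(dict)
for page, text in pages.items():
    for word in text.split():
        index[word][page] = index[word].get(page, 0) + 1

def search(query):
    """Return pages containing every query word, ranked by total term frequency."""
    words = query.split()
    # Only pages containing all the query words qualify.
    candidates = set(pages)
    for w in words:
        candidates &= set(index.get(w, {}))
    # Rank by summed term frequency -- exactly the signal term spam abused.
    return sorted(candidates,
                  key=lambda p: sum(index[w][p] for w in words),
                  reverse=True)

print(search("words"))
```

Repeating a word many times (as in `p2`) pushes a page up this ranking, which is why term-frequency scoring alone was so easy to spam.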
There was a brief period where this worked well. But once people began to use search engines to find things, spammers jumped in, using a variety of tactics to fool the engines – this type of spamming became known as "term spam".
Google came up with 2 innovations which work together to kill term spam. 1. Believe what people say about you, rather than what you say about yourself – look at the anchor text of the links pointing at a page. 2. PageRank (named after Larry Page, rather than because it ranks pages).
Real-world problems – there are dead ends, since not every page links to other pages. Plus, there are spider traps (groups of pages that link only among themselves). You can fix the dead ends, but that alone doesn't fix the spider-trap problem.
In Google, the problem was re-formulated so that there is an initial "teleport set". And you can have a topic-sensitive set of "relevant" pages (i.e. a teleport set) which is used to get out of spider traps. By being part of the teleport set, those pages gain importance.
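Here is a minimal sketch of PageRank as power iteration with teleporting. The link graph, node names, and damping factor 0.85 are illustrative assumptions, not from the talk; the point is that the teleport step is what rescues the computation from dead ends and spider traps.

```python
# Hypothetical tiny link graph: page -> list of pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # a dead end: no out-links at all
}

def pagerank(links, teleport_set=None, beta=0.85, iters=100):
    pages = list(links)
    teleport_set = teleport_set or pages      # default: teleport anywhere
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p, outs in links.items():
            if outs:
                # Pass a beta fraction of rank along out-links.
                for q in outs:
                    new[q] += beta * rank[p] / len(outs)
            else:
                # Dead end: treat it as teleporting from here too.
                for q in teleport_set:
                    new[q] += beta * rank[p] / len(teleport_set)
        # The remaining (1 - beta) mass goes only to the teleport set,
        # which is what lets the surfer escape spider traps.
        for q in teleport_set:
            new[q] += (1 - beta) / len(teleport_set)
        rank = new
    return rank

ranks = pagerank(links)
```

Restricting `teleport_set` to a topic-specific list of pages gives the topic-sensitive variant described above: only those pages (and the pages they lead to) accumulate rank.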
Then as soon as Google became popular, the spammers developed "link spam" – an attempt to raise the ranking of a page through links, by creating spam farms. Google combated spam farms by detecting and blacklisting sites that look like spam farms (i.e. web pages with a million out-links). Another way of combating spam farms is TrustRank – a topic-specific PageRank with a teleport set of "trusted" pages. The amount by which a page's TrustRank falls short of its PageRank gives its spam mass.
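The spam-mass idea can be sketched directly. TrustRank is just the PageRank computation with the teleport set restricted to trusted seed pages; the rank numbers below are made up to stand in for the output of those two computations, so only the spam-mass formula itself matters here.

```python
# Made-up ranks: ordinary PageRank (teleport to all pages) vs.
# TrustRank (teleport only to a trusted seed set).
page_rank  = {"good_site": 0.30, "shady_site": 0.25}
trust_rank = {"good_site": 0.28, "shady_site": 0.02}

def spam_mass(pr, tr):
    # Fraction of a page's PageRank it cannot justify via trusted pages.
    return {p: (pr[p] - tr[p]) / pr[p] for p in pr}

print(spam_mass(page_rank, trust_rank))
```

A page whose rank comes mostly from a spam farm gets almost no rank when teleports go only to trusted pages, so its spam mass is close to 1; a legitimate page's spam mass stays near 0.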
What has happened since search has become reliable?
1. Advertising has moved online
2. Textbooks have been destroyed
3. Newspapers have been destroyed
Advertising – there was a question at the start about how to combine search results with advertising, because of speed issues: you don't want to take 10 seconds to deliver an advert. Google's answer was to restrict what advertisers can do so that the search query isn't slowed down – which turns out to work.
Then people said that no one would buy products over the internet. But Ullman argues that as people began to trust search engines, they came to trust them enough to find suitable vendors.
Ullman contends that all advertising will move online. First, the pay-per-click model is measurable (this can be done with A/B testing through newspaper advertising, but it's expensive). Second is the ability to target adverts to people – a topic that raises privacy issues. For instance, lipstick advertising won't appeal to men, but it's still there in the newspaper. The more a service like Google knows about me, the more targeted it can be. Ullman's position is that as long as it's done by a machine, it's OK.
The textbook industry. It used to be that a textbook could sell a lot – a 1st edition, then a 2nd edition. But now it is much easier to re-sell books online, so it's more difficult to make money from them. That leads to lower sales, which leads to annoying tricks by publishers – such as writing new exercises for each edition.
Another thing that's killing textbooks is that trips to the library are being replaced by search queries. Academics put material online, and PageRank elevates the best of it to the top of the list. "You link to the stuff that you think is good".
In terms of textbooks, it turns out that royalties are a relative novelty. You used to write for glory and not income, and the internet might bring us back to that time. Example – jokes can be remembered and transmitted without payment.
Who killed newspapers? A lot of people blame Google (over 10% of all advertising is now online). It turns out that doesn't matter that much anyway, because newspapers did not make a lot of money from display ads; their big revenue generator was classified advertising, where they had a monopoly. Sites like Craigslist killed that business.
Benefits of online news – it is a much better way to deliver news. Ullman talks about how he likes being able to go to Google News and see multiple slants on a story. "A big win for the consumer". It is interesting to see the viewpoints of different newspapers.
The dark side of online news: "News reporting serves a vital function in a democracy". If you don't have newspapers, no one is going to pay for that reporting.
Summary:
- Search requires fighting against spammers
- Newspapers are on the endangered species list (this is not necessarily a good thing)
- So are textbooks (yay) – they don’t serve any useful purpose any more
One closing thought on advertising: if this plays out, it's likely that ad agencies will be crushed as advertisers and Google move to a pure cost-per-action (CPA) model.