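I read an article on the Google blog about the history of Google search.
It can't be viewed directly from inside China, since all of blogspot has been "harmonized" (blocked), so I've pasted it below.
What I found interesting is its account of the early days of search: an organization called TREC (Text Retrieval Conference), wanting to help search develop, released a large batch of raw documents, a set of search queries (keywords?), and human-selected results marking which of those documents were relevant to each query.
Every search algorithm could then test its accuracy against this set of documents and results. That led to a competition held year after year, with the algorithms improving and maturing each year; then the web appeared, and they began adapting to it.
Great approach!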
Why data matters
from Official Google Blog by Karen
Posted by Hal Varian, Chief Economist
We often use this space to discuss how we treat user data and protect
privacy. With the post below, we're beginning an occasional series that
discusses how we harness the data we collect to improve our products
and services for our users. We think it's appropriate to start with a
post describing how data has been critical to the advancement of search
technology. - Ed.
Better data makes for better science. The history of information retrieval illustrates this principle well.
Work in this area began in the early days of computing, with simple
document retrieval based on matching queries with words and phrases in
text files. Driven by the availability of new data sources, algorithms
evolved and became more sophisticated. The arrival of the web presented
new challenges for search, and now it is common to use information from
web links and many other indicators as signals of relevance.
Today's web search algorithms are trained to a large degree by the
"wisdom of the crowds" drawn from the logs of billions of previous
search queries. This brief overview of the history of search
illustrates why using data is integral to making Google web search
valuable to our users.
A brief history of search
Nowadays search is a hot topic, especially with the widespread use of
the web, but the history of document search dates back to the 1950s.
Search engines existed in those ancient times, but their primary use
was to search a static collection of documents. In the early 60s, the
research community gathered new data by digitizing abstracts of
articles, enabling rapid progress in the field in the 60s and 70s. But
by the late 80s, progress in this area had slowed down considerably.
In order to stimulate research in information retrieval, the National
Institute of Standards and Technology (NIST) launched the Text
Retrieval Conference (TREC) in 1992. TREC introduced new data in the
form of full-text documents and used human judges to classify whether
or not particular documents were relevant to a set of queries. They
released a sample of this data to researchers, who used it to train and
improve their systems to find the documents relevant to a new set of
queries and compare their results to TREC's human judgments and other
researchers' algorithms.
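
The comparison described above, a system's results checked against human relevance judgments, can be sketched in a few lines. This is only an illustration: precision and recall are standard retrieval measures, but the article does not say which measures TREC used, and the query ID, document IDs, and function name below are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs returned by the system under test
    relevant:  document IDs that human judges marked as relevant
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: what fraction of the returned documents were relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: what fraction of the relevant documents were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical judgments and system output for one query.
judgments = {"q1": {"d1", "d3", "d7"}}
system_output = {"q1": ["d1", "d2", "d3", "d4"]}

for query_id, relevant in judgments.items():
    p, r = precision_recall(system_output[query_id], relevant)
    print(query_id, "precision:", p, "recall:", r)
```

Because every team scores against the same judgments, numbers like these are directly comparable across algorithms and across years.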
The TREC data revitalized research on information retrieval. Having a
standard, widely available, and carefully constructed set of data laid
the groundwork for further innovation in this field. The yearly TREC
conference fostered collaboration, innovation, and a measured dose of
competition (and bragging rights) that led to better information
retrieval.
New ideas spread rapidly, and the algorithms improved. But with each
new improvement, it became harder and harder to improve on last year's
techniques, and progress eventually slowed down again.
And then came the web. In its beginning stages, researchers used
industry-standard algorithms based on the TREC research to find
documents on the web. But the need for better search was apparent -- now
not just for researchers, but also for everyday users -- and the web
gave us lots of new data in the form of links that offered the
possibility of new advances.
There were developments on two fronts. On the commercial side, a few
companies started offering web search engines, but no one was quite
sure what business models would work.
On the academic side, the National Science Foundation started a
"Digital Library Project" which made grants to several universities.
Two Stanford grad students in computer science named Larry Page and
Sergey Brin worked on this project. Their insight was to recognize that
existing search algorithms could be dramatically improved by using the
special linking structure of web documents. Thus PageRank was born.
How Google uses data
PageRank offered a significant improvement on existing algorithms by
ranking the relevance of a web page not by keywords alone but also by
the quality and quantity of the sites that linked to it. If I have six
links pointing to me from sites such as the Wall Street Journal, New
York Times, and the House of Representatives, that carries more weight
than 20 links from my old college buddies who happen to have web pages.
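
The idea that a link from a well-linked site counts for more can be illustrated with a small power-iteration sketch. This is a textbook-style simplification, not Google's production algorithm: the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the toy graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by link structure.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "me" is linked from a hub that several sites link to;
# "friend_page" is linked from a page that nobody links to.
links = {
    "news_a": ["hub"],
    "news_b": ["hub"],
    "news_c": ["hub"],
    "hub": ["me"],
    "buddy": ["friend_page"],
}
ranks = pagerank(links)
print(ranks["me"] > ranks["friend_page"])
```

Both "me" and "friend_page" have exactly one incoming link, yet "me" ends up with the higher score because its link comes from a page that is itself well linked.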
Larry and Sergey initially tried to license their algorithm to some of
the newly formed web search engines, but none were interested. Since
they couldn't sell their algorithm, they decided to start a search
engine themselves. The rest of the story is well-known.
Over the years, Google has continued to invest in making search better.
Our information retrieval experts have added more than 200 additional
signals to the algorithms that determine the relevance of websites to a
user's query.
So where did those other 200 signals come from? What's the next stage
of search, and what do we need to do to find even more relevant
information online?
We're constantly experimenting with our algorithm, tuning and tweaking
on a weekly basis to come up with more relevant and useful results for
our users.
But in order to come up with new ranking techniques and evaluate if
users find them useful, we have to store and analyze search logs.
(Watch our videos to see exactly what data we store in our logs.) What
results do people click on? How does their behavior change when we
change aspects of our algorithm? Using data in the logs, we can compare
how well we're doing now at finding useful information for you to how
we did a year ago. If we don't keep a history, we have no good way to
evaluate our progress and make improvements.
To choose a simple example: the Google spell checker is based on our
analysis of user searches compiled from our logs -- not a dictionary.
Similarly, we've had a lot of success in using query data to improve
our information about geographic locations, enabling us to provide
better local search.
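
To make the spell-checker example concrete, here is a toy sketch of correcting a word using only term frequencies from past queries, with no dictionary. It is not Google's method, which the article does not describe: the sample log, the similarity cutoff, and the function names are invented, and `difflib` from the Python standard library stands in for whatever similarity measure a real system would use.

```python
from collections import Counter
import difflib


def build_counts(logged_queries):
    """Count how often each word appears across logged queries."""
    counts = Counter()
    for query in logged_queries:
        counts.update(query.lower().split())
    return counts


def suggest(word, counts, cutoff=0.8):
    """Return the most frequent logged word that looks like `word`.

    The input is kept when nothing similar was logged, or when it is
    already at least as common as every similar candidate.
    """
    word = word.lower()
    candidates = difflib.get_close_matches(word, list(counts), n=5, cutoff=cutoff)
    if not candidates:
        return word
    best = max(candidates, key=lambda candidate: counts[candidate])
    return best if counts[best] > counts.get(word, 0) else word


# Hypothetical search log: the common spelling outnumbers the rare one.
log = [
    "britney spears",
    "britney spears tour",
    "britney spears",
    "brittney spears",
    "weather",
]
counts = build_counts(log)
print(suggest("brittney", counts))
print(suggest("weather", counts))
```

The design choice worth noticing is that "correct" here simply means "what most people typed", which is why such an approach can handle names and new words that no dictionary lists.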
Storing and analyzing logs of user searches is how Google's algorithm
learns to give you more useful results. Just as data availability has
driven progress of search in the past, the data in our search logs will
certainly be a critical component of future breakthroughs.