Linklist
Here is some background information on how exactly a search engine works. The links shed light on what is hard to crack when you try to build your own web-crawler search engine from scratch:
Gigablast
This page is a bit outdated (2004), but here the developer, Matt Wells, personally describes all the steps the Gigablast search engine went through during development:
http://www.gigablast.com/rants.html
Next, an interview with Matt Wells (Gigablast) that answers the question: “When it comes to competing in the search engine arena, is bigger always better?” It also covers other interesting details about the pitfalls you have to overcome when running your own search engine:
http://queue.acm.org/detail.cfm?id=988401
Most importantly, the spider/crawler search engine Gigablast (C/C++) has become an open source project hosted on GitHub:
https://github.com/gigablast/open-source-search-engine
Blog Posts
In addition, Yioop is the search engine from SeekQuarry, the open source search engine software project. On the blog you can read more about this PHP search engine:
http://www.yioop.com/blog
Technical Theory
The links from the High Scalability blog are fairly interesting. The fourth one is the most technical; if you only have time to read one, go with that one:
1. http://highscalability.com/blog/2008/10/13/challenges-from-large-scale-computing-at-google.html
2. http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.html
3. http://highscalability.com/blog/2011/8/29/the-three-ages-of-google-batch-warehouse-instant.html
4. http://highscalability.com/blog/2012/4/25/the-anatomy-of-search-technology-blekkos-nosql-database.html
5. http://highscalability.com/blog/2013/1/28/duckduckgo-architecture-1-million-deep-searches-a-day-and-gr.html
An article from Twitter about indexing the full history of tweets. Of note is the information about sharding: because the data is linear in nature (it grows over time), they need a way to scale across time. Worth a look:
https://blog.twitter.com/2014/building-a-complete-tweet-index
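To make the idea of scaling across time a little more concrete, here is a minimal sketch of routing documents and queries to time-based shards. It is not taken from the Twitter article: the tier boundaries and the function names `tier_for` and `tiers_for_range` are assumptions made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical tier boundaries (assumption, not from the article):
# tier i holds documents with TIER_STARTS[i] <= timestamp < TIER_STARTS[i + 1];
# the last tier is open-ended and receives all new documents.
TIER_STARTS = [datetime(2006, 1, 1), datetime(2010, 1, 1), datetime(2013, 1, 1)]


def tier_for(ts):
    """Return the index of the time tier that stores a document from `ts`."""
    idx = bisect_right(TIER_STARTS, ts) - 1
    if idx < 0:
        raise ValueError("timestamp predates the oldest tier")
    return idx


def tiers_for_range(start, end):
    """Return the tiers a query over [start, end] has to visit, newest first."""
    first = max(bisect_right(TIER_STARTS, start) - 1, 0)
    last = tier_for(end)
    return list(range(last, first - 1, -1))
```

With a layout like this, the index grows by appending a new tier rather than by repartitioning old data, and a query restricted to a time range only touches the tiers that overlap it.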
A talk about the internals of Lucene. It covers some design decisions and shows Lucene’s internal architecture:
http://lucene.sourceforge.net/talks/pisa/
Not as technical as the above, but a good primer that covers quite a lot of history. Worth a read:
http://alexmiller.com/the-students-guide-to-search-engines/
“Write an Internet search engine with 200 lines of Ruby code”. All about writing a small-scale internet search engine in Ruby. The code covers crawling as well as indexing with MySQL:
http://blog.saush.com/2009/03/17/write-an-internet-search-engine-with-200-lines-of-ruby-code/
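For readers who want a feel for the crawling half before opening the article, here is a rough sketch of a breadth-first crawler in Python. It is not the Ruby code from the post. The `fetch` callable is a hypothetical stand-in for whatever downloads a page, and the sketch ignores robots.txt, politeness delays and error handling, all of which a real crawler needs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`.

    `fetch(url)` is a hypothetical callable that returns the page HTML as a
    string, or None if the page cannot be loaded.
    """
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the network code, so the crawler can be tried out against a plain dictionary of fake pages.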
The Difficulties
This blog has dedicated a separate page to it: perhaps the most famous post, with the exception of the original Google paper. It was written by Anna Patterson, who worked on the search engines of Cuil and Archive.org. It highlights the difficulties of developing and running a search engine, from crawling to indexing to delivering the ranked search engine results page:
https://www.suchmaschine.biz/writing-your-own-search-engine/
Next, a humorously written article that still contains many truths. You think developing a search engine is easy? Then you should definitely read this article:
http://www.ideaeng.com/write-search-engine-0402
Ranking Algorithm
Algolia is a search-as-a-service solution provider. Nicolas Dessaigne (co-founder and CEO of Algolia) published a blog post in 2014 about the ranking algorithm they use:
http://blog.algolia.com/search-ranking-algorithm-unveiled/
The oldie of search engine technology papers: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. From today’s point of view it is more than outdated, but it describes how the first version of Google was designed and written:
http://infolab.stanford.edu/~backrub/google.html
Some archive.org Links
ProCog was a blog by Matt Wells (Gigablast) that never had much content. Unfortunately, the blog has been shut down and the content is gone. Matt really knows his stuff and promotes an open ranking algorithm, which is why I put an archive link to the old content here:
https://web.archive.org/web/20130606014958/http://blog.procog.com:80/
On the page “The Banana Tree” you can find a few articles about the complete design of a search engine from scratch. The site is very old, but some of the content is worth reading. For this reason I have also added an archive link here:
https://web.archive.org/web/20150214015110/http://www.thebananatree.org:80/
The Blekko technology and team have joined IBM Watson, but the old Blekko engineering blog had some interesting material applicable to search engine development:
https://web.archive.org/web/20150315043329/http://blog.blekko.com:80/
Ben Boyter
Ben Boyter wrote this great series of blog posts on how to write a search engine in PHP that works well with 1 million pages:
http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/
The complete list above was kindly provided to me by Ben Boyter, the developer and founder of SearchCodeServer.com.
My Additions to This Linklist
This four-part blog series shows how to implement an actual search engine with working code in Python. It provides detailed articles about creating the index, querying the index and ranking the results:
http://www.ardendertat.com/2012/01/11/implementing-search-engines/
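The three steps named above (creating the index, querying it, ranking the results) can be sketched in a few lines. This is a minimal illustration and not the code from the linked series: the tokenizer, the sample documents in `DOCS` and the simple TF-IDF style weighting are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sample documents, keyed by document id.
DOCS = {
    1: "web crawler fetches web pages",
    2: "inverted index maps terms to pages",
    3: "ranking orders pages by score",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Create an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Return ids of documents matching any query term, best score first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms get a higher weight than terms found in most documents.
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))


index = build_index(DOCS)
results = search(index, len(DOCS), "web pages")
```

Real engines add much more on top of this, such as positional postings, stemming, compression and link-based signals, but the index, query and rank split stays the same.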
A blog post by Deangela Neves on medium.com about developing a search engine for TED talks. The TED finder is an open source search engine written in Python:
https://medium.com/@deangelaneves/how-to-build-a-search-engine-from-scratch-in-python-part-1-96eb240f9ecb
A very detailed Quora answer by David Quaid (PrimaryPosition.com) to the question: “How do you build a search engine from scratch? What’s the best technology stack for this?” A must-read:
https://www.quora.com/How-do-you-build-a-search-engine-from-scratch-What’s-the-best-technology-stack-for-this/answers/13752046
Above all, many interesting blog posts about crawling the web come from Jim Mischel (programmer and author), so I decided to link to the web-crawling category of his blog, where you can browse all of them:
http://blog.mischel.com/category/web-crawling/
A nice short post on Werner Ziegelwanger’s Developer-Blog, “Build your own search engine”:
https://developer-blog.net/en/build-your-own-search-engine/
Linklist Feedback
Do you know of another link that I missed in my Linklist?
Please drop me a line.
I would love to hear from you.
Please add a link in the comments below.