42klines: A Search Engine For UNIX Programmers
Are you a UNIX programmer? Then this may be very useful to you.
Google lets you create a customized search engine (CSE) that searches a list of sites you supply. I decided to take it for a test drive and ended up with a surprisingly useful search engine tailored to UNIX programmers. It currently searches more than 400 websites that are useful to UNIX programmers. You can find the search box at the top of this blog.
Table of Contents
- Why is it useful?
- What websites does 42klines search?
- Can I put this search engine on my own website?
- Can I add my own bookmarks to 42klines?
- Custom search engine flavors
- Getting your hands dirty
Why is it useful?
When you do a Google web search, the search engine cannot immediately identify the context of your query. A keyword such as “signals” can mean different things (traffic signals, hand signals, UNIX signals?). To be useful to everyone, Google mixes results from all of these contexts. This means a Google web search can end up wasting your time (you have to filter the results manually) and the results are less relevant to your context. A CSE, however, returns results only from the sites you have chosen, so they stay within your context.
I realized how useful this is while discussing with a colleague the semantics of signal handling in multi-threaded processes on Linux. The problem we faced was related to the way gdb handled signals received by a multi-threaded process we were tracing. We were not sure about the current Linux semantics, so we decided to search. Coincidentally, a few days earlier I had added around 200 websites related to UNIX programming to this custom search engine, so I decided to give it a test drive. I searched for “signals thread”. The top 5 results from the CSE told me more than I needed to know about Linux signal handling in multi-threaded processes. I compared the results with a Google web search.
The results on the web weren’t that bad, but they were not the best either. Google has done a good job with the custom search engine offering. Take a look at the results from the first page of web search below. Then try the 42klines search. Do you see the difference?
What websites does 42klines search?
To start with, I have seeded the engine with more than 400 websites that can be useful to UNIX programmers. They fall loosely into the following categories:
- Research organizations (IEEE, ACM, Citeseer etc.)
- UNIX/Programming Magazines (DDJ, Linux Journal, LWN, KernelTrap etc.)
- Forums (Interesting Google Groups etc.)
- OS development resources (NonDot, Sandpile, x86 etc.)
- Bookmarking sites (Reddit, Del.icio.us)
- Free web hosted books (Linux Device Drivers, OpenBookProject etc.)
- Document hosting sites (Scribd, Wikipedia, Linux HOWTOs etc.)
- Blogs and personal websites hosting useful programming information (Robert Love, Ulrich Drepper etc.)
- University courses available online and useful for UNIX programmers (MIT Open Courseware etc.)
- Application hosting/indexing websites (Sourceforge, FSF etc.)
- Conferences (USENIX, Linux Conferences etc.)
- Miscellaneous pages
Can I put this search engine on my own website?
Yes, you can easily do that. The search engine hosted on this website is a linked CSE. Another flavor, called the stored CSE, is hosted in Google’s databases. The differences between the two flavors are detailed later in the post. You can easily add the stored CSE flavor to your iGoogle page as a gadget.
You can download this code to put the 42klines search engine on your blog or website, and customize the look in whatever way you want. The search results are hosted on a page on this website, because that page requires another snippet of code from Google. If you want to host the results on your own website, let me know and I’ll provide the code necessary to do so.
You can skip the rest of the post if you are not interested in knowing how the search engine works. If you want to add your own bookmarks useful for UNIX programmers to the 42klines search engine, read on. A few useful resources are listed at the end of this post.
Can I add my own bookmarks to 42klines?
Whenever I find good links or websites that are useful to me as a UNIX programmer, I plan to add them to this search engine for everyone’s benefit. The list of indexed websites is given to Google in an XML annotation file. The annotation file for the 42klines search engine is hosted in a Subversion repository on Assembla: http://svn2.assembla.com/svn/42klines_search. Assembla hosts Subversion repositories for projects. If you are interested in adding more links to 42klines, send a mail to me at sudhanshu.goswami at 42klines dot com and I’ll send you an invite from Assembla. Check out the 42klines search engine’s list of websites by running this command:
svn checkout http://svn2.assembla.com/svn/42klines_search
If you prefer GUIs, you can also use RapidSVN on Linux to do the same. The 42klines search engine on this website is a linked CSE. It has a stored CSE flavor as well. The differences between the two flavors are detailed in the next section. The list of websites to be searched is maintained differently for each flavor. Going forward, I plan to update the linked CSE first and periodically bring the stored CSE in sync with it. I maintain two flavors because it is easy to add the stored CSE to iGoogle as a gadget.
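To give a feel for what such an annotation file contains, here is a minimal Python sketch that writes one from a list of site patterns. The element and attribute names (`Annotations`, `Annotation`, `about`, `Label`, `name`) follow the annotation format as I understand it; the site patterns and the label `_cse_example` are made up for illustration and are not taken from the 42klines repository.

```python
import xml.etree.ElementTree as ET


def build_annotations(sites, label):
    """Return annotation XML (as a string) with one entry per site pattern."""
    root = ET.Element("Annotations")
    for pattern in sites:
        # Each site pattern becomes one <Annotation about="..."> element.
        entry = ET.SubElement(root, "Annotation", about=pattern)
        # The label ties the site to a particular search engine.
        ET.SubElement(entry, "Label", name=label)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical patterns and label, for illustration only.
    sites = ["docs.example.org/*", "blog.example.com/unix/*"]
    print(build_annotations(sites, "_cse_example"))
```

Keeping the list of sites in a plain Python list (or a text file) and generating the XML from it makes it easier to review contributions than editing the XML by hand.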
Custom search engine flavors
The table below describes the differences between a linked CSE and a stored CSE.
| Stored Custom Search Engine | Linked Custom Search Engine |
|---|---|
| Can be built using wizards hosted here. | Metafiles can only be created manually. |
| Websites searched are stored in Google's database. | Websites searched are stored in an annotation file hosted on your server. |
| Websites added to search engine database get immediately reflected in the search results. | Websites added to annotation files will get reflected in the search results on the next refresh by Google. To immediately refresh or test annotation file, you can use this tool. |
| Maximum number of sites = 5000. | Multiple annotation files allowed. Each file's max size = 3MB. Total file sizes <= 10 MB. |
| Get their own Google hosted web pages like this. | No home page for a linked CSE created on Google. You can create your own home page for it. |
| People can volunteer to contribute from a stored CSE's home page. | This option is not available for a linked CSE. |
| Restricted in number of things possible. | Be creative. You can customize your annotation files on the fly. How? You can switch from a stored CSE to a linked CSE like this. |
| Google provides links to add this kind of an engine easily to your blog or iGoogle home page. E.g. use this to add it to your iGoogle page. | Linked CSE has to be manually added to a website. E.g. Linked CSE flavor of 42klines search engine can be added by downloading and adding this piece of code to your website. |
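The limits in the table are easy to check before uploading anything. The sketch below is only an illustration: it assumes 1 MB means 1024 * 1024 bytes (the table does not say), and the function names are my own.

```python
import os

MAX_STORED_SITES = 5000               # stored CSE: maximum number of sites
MAX_FILE_BYTES = 3 * 1024 * 1024      # linked CSE: 3 MB per annotation file
MAX_TOTAL_BYTES = 10 * 1024 * 1024    # linked CSE: 10 MB across all files


def fits_stored_cse(site_count):
    """True if the number of sites fits in a stored CSE."""
    return site_count <= MAX_STORED_SITES


def check_annotation_sizes(sizes):
    """sizes maps file name -> size in bytes; returns a list of problems."""
    problems = []
    for name, size in sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the per-file limit")
    total = sum(sizes.values())
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total of {total} bytes exceeds the combined limit")
    return problems


def sizes_on_disk(paths):
    """Collect the sizes of real files so they can be checked."""
    return {path: os.path.getsize(path) for path in paths}
```

For files in a checked-out working copy, `check_annotation_sizes(sizes_on_disk(paths))` returns an empty list when everything is within the limits.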
Getting your hands dirty
This section is a collection of things worth knowing when working with Google’s custom search engine. I’ll list them point by point.
- Opera’s latest version does not seem to be supported. Some features, like saving options for the search engine, worked, but the “Save” button got permanently disabled after saving. These kinds of problems may occur if you are using uncommon browsers. YMMV.
- I tried to replace the context file of the stored search engine with that of the linked search engine using the Advanced tab of the search engine’s wizard interface, but it did not work. So no home page for the linked CSE could be created on Google.
- If you are not trying to customize a search engine in non-traditional ways and just want a search box for your blog or home page, you are better off sticking to a stored custom search engine. However, if you have special needs or have more than 5000 websites to search, you’ll have to use a linked search engine.
- Google’s custom search engines can be customized to a great extent to give highly targeted results. This is achieved by assigning topics to websites and labeling them. Labels can be used to tweak the search results in favor of websites stamped with a particular label, or to provide search results only from websites stamped with that label. Further, a boost factor can be associated with websites to boost search results from them. You can refer to this CSE glossary if you are having trouble following these terms.
- Google’s management interface for stored CSEs does not provide the ability to assign labels, set boost strengths for some websites, add filters, create nested search engines etc. You can do all of these with stored CSEs, but you will first have to download the annotation file for the stored websites and the context file for your stored search engine. Then you will have to edit them manually and upload them. This can be done from the Advanced tab of the management interface.
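The manual editing step described in the last point can be scripted. Below is a rough Python sketch that adds a label, and optionally a boost, to every entry of a downloaded annotation file whose site pattern contains a given substring. Treat the details as assumptions: I am assuming the boost is written as a `score` attribute on the `Annotation` element, and the sample patterns and label names are invented. Check the CSE documentation before uploading anything produced this way.

```python
import xml.etree.ElementTree as ET


def tag_sites(xml_text, substring, label, score=None):
    """Add `label` (and optionally a boost) to entries matching `substring`."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("Annotation"):
        if substring not in entry.get("about", ""):
            continue
        existing = {lab.get("name") for lab in entry.findall("Label")}
        if label not in existing:
            # Avoid adding the same label twice on repeated runs.
            ET.SubElement(entry, "Label", name=label)
        if score is not None:
            # Assumed attribute name for the boost factor.
            entry.set("score", str(score))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical annotation file contents, for illustration only.
    sample = (
        '<Annotations>'
        '<Annotation about="docs.example.org/*"><Label name="_cse_x"/></Annotation>'
        '<Annotation about="other.example.com/*"><Label name="_cse_x"/></Annotation>'
        '</Annotations>'
    )
    print(tag_sites(sample, "docs.example.org", "kernel", score=0.8))
```

The function leaves non-matching entries untouched, so it can be run several times with different substrings and labels before the file is uploaded from the Advanced tab.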
42klines CSE: Download code to put on your website here
42klines iGoogle gadget: Add this search engine to your iGoogle page
42klines subversion repository
Coopdir: Directory of custom search engines.
GooglePicks: Picked custom search engines by Google.
RubyCorner: A custom search engine for Ruby programmers.
Python CSE: A custom search engine for Python programmers.
Linux: A custom search engine for Linux users created by a sysadmin.
Update: Some cleanups done to the post. Added a table of contents, but unfortunately the anchor links did not work as expected. Still trying to figure out how to fix this. [Mar 2: Fixed. At the cost of breaking previous permalinks. Please update any bookmarks to permanent links. This site is going through some initial growing pains.]