42klines: A Search Engine For UNIX Programmers

Are you a UNIX programmer? Then this may be very useful to you.

Google lets you create a custom search engine (CSE) that searches only a list of sites you supply. I decided to take it for a test drive and ended up with a surprisingly useful search engine tailored to UNIX programmers. It currently searches more than 400 websites that are useful to UNIX programmers. You will find its search box, which looks like this, at the top of this blog.

Unix Programmer's Search Engine

Why is it useful?

If you do a Google web search, the search engine cannot immediately tell the context in which you are searching. A keyword such as “signals” can mean different things (traffic signals, hand signals, UNIX signals?). To be useful to everyone, Google mixes results from all applicable contexts. This means a Google web search can waste your time (you’ll have to filter the results manually) while reducing the relevance of results in your context. A CSE, however, only returns results from the sites you picked, so they stay within the context you care about.

I realized how useful this is while discussing with a colleague how Linux handles signals in multi-threaded processes. The problem we were facing was related to the way gdb handled signals received by a multi-threaded process we were tracing. We were not sure about the current Linux semantics, so we decided to search. Coincidentally, I had added around 200 UNIX programming websites to this custom search engine a few days earlier, so I decided to give it a test drive. I searched for “signals thread”. The top 5 results from the CSE told me more than I needed to know about Linux signal handling in multi-threaded processes. I compared the results with Google web search and found that a very good article on this topic was not present at all in the first few pages of the web search results! Moreover, almost all of the CSE’s first-page results were directly relevant to what I wanted to know, while the quality of the web search results wasn’t that high.

The results on the web weren’t that bad, but they were not the best either. Google has done a good job with the custom search engine offering. Take a look at the results from the first page of web search below. Then try the 42klines search. Do you see the difference?

Results of Web Search

No! I am not interested in girls who give me mixed signals.

What websites does 42klines search?

To start with, I have seeded the engine with more than 400 websites that can be useful to UNIX programmers. They loosely fall into the following categories:

  1. Research organizations (IEEE, ACM, Citeseer etc.)
  2. UNIX/Programming Magazines (DDJ, Linux Journal, LWN, KernelTrap etc.)
  3. Forums (interesting Google Groups etc.)
  4. OS development resources (NonDot, Sandpile, x86 etc.)
  5. Bookmarking sites (Reddit, Del.icio.us)
  6. Free web hosted books (Linux Device Drivers, OpenBookProject etc.)
  7. Document hosting sites (Scribd, Wikipedia, Linux HOWTOs etc.)
  8. Blogs and personal websites hosting useful programming information (Robert Love, Ulrich Drepper etc.)
  9. University courses available online and useful for UNIX programmers (MIT Open Courseware etc.)
  10. Application hosting/indexing websites (Sourceforge, FSF etc.)
  11. Conferences (USENIX, Linux Conferences etc.)
  12. Miscellaneous pages

Can I put this search engine on my own website?

Yes, you can easily do that. The search engine hosted on this website is a linked CSE. Another flavor, called the stored CSE, is hosted in Google’s databases. The differences between the two flavors are detailed later in this post. You can easily add the stored CSE flavor to your iGoogle page as a gadget. You can download this code to put the 42klines search engine on your blog or website, and customize the look in whatever way you want. The search results are shown on a page on this website, because that page requires another snippet of code from Google. If you want to host the results on your own website, let me know and I’ll provide the necessary code.

You can skip the rest of the post if you are not interested in how the search engine works. If you want to add your own bookmarks useful to UNIX programmers to the 42klines search engine, read on. A few useful resources are listed at the end of this post.

Can I add my own bookmarks to 42klines?

Whenever I find good links or websites useful to me as a UNIX programmer, I plan to add them to this search engine for everyone’s benefit. The list of websites to be searched is given to Google in an annotation file in XML format. The annotation file for the 42klines search engine is hosted in a Subversion repository, http://svn2.assembla.com/svn/42klines_search, on Assembla. Assembla hosts Subversion repositories for projects. If you are interested in adding more links to 42klines, send me a mail at sudhanshu.goswami at 42klines dot com and I’ll send you an invite from Assembla. Check out the 42klines search engine’s list of websites by running this command:

svn checkout http://svn2.assembla.com/svn/42klines_search

If you prefer GUIs, you can use RapidSVN on Linux to do the same. The 42klines search engine on this website is a linked CSE; it has a stored CSE flavor as well. The differences between the two flavors are detailed in the next section. The list of websites to be searched is maintained differently for each flavor. Going forward, I plan to update the linked CSE first and periodically bring the stored CSE in sync with it. I maintain both flavors because it is easy to add the stored CSE to iGoogle as a gadget.
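To give a feel for what such an annotation file contains, here is a minimal Python sketch that generates one from a list of site patterns. It assumes the usual layout of Google's annotation files (an Annotations root holding one Annotation per URL pattern, each tagged with a Label); the label name and the two site patterns are made up for illustration, so check the file in the repository for the real values.

```python
import xml.etree.ElementTree as ET


def build_annotations(site_patterns, label_name):
    """Return annotation XML with one <Annotation> entry per site pattern."""
    root = ET.Element("Annotations")
    for pattern in site_patterns:
        # 'about' holds the URL pattern that should be searched.
        annotation = ET.SubElement(root, "Annotation", {"about": pattern})
        # The label ties the site to a particular search engine.
        ET.SubElement(annotation, "Label", {"name": label_name})
    return ET.tostring(root, encoding="unicode")


# Hypothetical inputs: neither the patterns nor the label come from 42klines.
sites = ["lwn.net/*", "www.usenix.org/*"]
xml_text = build_annotations(sites, "_cse_example_label")
print(xml_text)
```

Generating the file from a plain list like this keeps the bookmark list easy to review in version control, while the XML itself stays a build product.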

Custom Search Engine Flavors

The table below describes the differences between a linked CSE and a stored CSE.

| Stored Custom Search Engine | Linked Custom Search Engine |
|---|---|
| Can be built using wizards hosted here. | Metafiles can only be created manually. |
| Websites searched are stored in Google's database. | Websites searched are stored in an annotation file hosted on your server. |
| Websites added to search engine database get immediately reflected in the search results. | Websites added to annotation files will get reflected in the search results on the next refresh by Google. To immediately refresh or test annotation file, you can use this tool. |
| Maximum number of sites = 5000. | Multiple annotation files allowed. Each file's max size = 3 MB. Total file sizes <= 10 MB. |
| Get their own Google hosted web pages like this. | No home page for a linked CSE created on Google. You can create your own home page for it. |
| People can volunteer to contribute from a stored CSE's home page. | This option is not available for a linked CSE. |
| Restricted in number of things possible. | Be creative. You can customize your annotation files on the fly. How? You can switch from a stored CSE to a linked CSE like this. |
| Google provides links to add this kind of an engine easily to your blog or iGoogle home page. E.g. use this to add it to your iGoogle page. | Linked CSE has to be manually added to a website. E.g. Linked CSE flavor of 42klines search engine can be added by downloading and adding this piece of code to your website. |
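If you maintain several annotation files for a linked CSE, it may be worth checking them against the size limits listed above before publishing. The sketch below is one way to do that; the file names are hypothetical, and it assumes "MB" means 1024 × 1024 bytes, which Google may define differently.

```python
MB = 1024 * 1024            # assumption: binary megabytes
MAX_FILE_BYTES = 3 * MB     # per-file limit from the table
MAX_TOTAL_BYTES = 10 * MB   # combined limit from the table


def check_annotation_sizes(sizes):
    """Return a list of problems for a {file name: size in bytes} mapping."""
    problems = []
    for name, size in sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 3 MB per-file limit")
    total = sum(sizes.values())
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total: {total} bytes exceeds the 10 MB overall limit")
    return problems


# Hypothetical file names and sizes, for illustration only.
example = {"magazines.xml": 2 * MB, "blogs.xml": 4 * MB}
for line in check_annotation_sizes(example):
    print(line)
```

For real files, the sizes could be collected with os.path.getsize before calling the function.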

Getting your hands dirty

This section is just a collection of things worth knowing when working with Google’s custom search engine. I’ll list them point by point.

  1. The latest version of Opera does not seem to be supported. Some features, such as saving options for the search engine, worked, but the “Save” button stayed permanently disabled after saving. Problems like this may occur if you use a less common browser. YMMV.
  2. I tried to replace the context file of the stored search engine with that of the linked search engine using the Advanced tab of the search engine’s wizard interface, but it did not work. So I could not create a home page for the linked CSE on Google.
  3. If you are not trying to customize a search engine in non-traditional ways and just want a search box for your blog or homepage, you are better off sticking to a stored custom search engine. However, if you have special needs or more than 5000 websites to search, you’ll have to use a linked search engine.
  4. Google’s custom search engines can be customized to a great extent to give highly targeted results. This is done by assigning topics to websites and labeling them. Labels can be used to tilt the search results in favor of websites stamped with a particular label, or to return results only from websites stamped with that label. Further, a boost factor can be associated with websites to boost search results from them. You can refer to this CSE glossary if you have trouble following these terms.
  5. Google’s management interface for stored CSEs does not let you assign labels, set boost strengths for particular websites, add filters, create nested search engines, etc. You can still do all of these with stored CSEs, but you first have to download the annotation file for the stored websites and the context file for your stored search engine, edit them manually, and upload them again. This can be done from the Advanced tab of the management interface.
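To make the difference between filtering by a label and boosting by a label more concrete, here is a toy Python model. It is only an illustration of the idea and not how Google actually ranks results; the Result class, the scoring formula, the weights, and the URLs are all invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    url: str
    relevance: float                           # base relevance from the engine
    labels: set = field(default_factory=set)   # labels stamped on the website
    boost: float = 0.0                         # per-website boost factor


def filter_by_label(results, label):
    """Keep only results from websites stamped with the label."""
    return [r for r in results if label in r.labels]


def boost_by_label(results, label, weight=0.5):
    """Reorder results so that labeled websites are favored, not required."""
    def score(r):
        bonus = weight if label in r.labels else 0.0
        return r.relevance + bonus + r.boost
    return sorted(results, key=score, reverse=True)


# Made-up results for the query "signals thread".
results = [
    Result("example.org/traffic-signals", 0.9),
    Result("example.org/unix-signals", 0.7, {"unix"}),
    Result("example.org/pthreads", 0.6, {"unix"}, boost=0.3),
]
print([r.url for r in filter_by_label(results, "unix")])
print([r.url for r in boost_by_label(results, "unix")])
```

In this model, filtering drops the unlabeled page entirely, while boosting keeps it but pushes it below the labeled ones.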

Resources

42klines CSE: Download code to put on your website here
42klines iGoogle gadget: Add this search engine to your iGoogle page
42klines subversion repository
Coopdir: Directory of custom search engines.
GooglePicks: Picked custom search engines by Google.
RubyCorner: A custom search engine for Ruby programmers.
Python CSE: A custom search engine for Python programmers.
Linux: A custom search engine for Linux users, created by a sysadmin.

Update: Some cleanups done to the post. Added a table of contents, but unfortunately the anchor links did not work as expected. Still trying to figure out how to fix this. [Mar 2: Fixed. At the cost of breaking previous permalinks. Please update any bookmarks to permanent links. This site is going through some initial growing pains.] 

Comments (4) left to “42klines: A Search Engine For UNIX Programmers”

  1. Imanpreet wrote:

    I think, I would have preferred if you would have given the link to your search engine just like you have given for “Linux”.

    In the current scenario, the search would be done on your page itself. While I might want to book mark the search engine on google itself.

  2. Sudhanshu wrote:

    @Imanpreet: I haven’t given that link in the resources section, because that link is for the stored custom search engine version hosted on Google. I mentioned the limitations of stored custom search engines in the comparison table above.

    In the future some time, I’d probably just maintain a linked CSE, which is hosted on this website. A linked CSE has no page on Google like the one on which the “Linux” search engine is hosted. I am supposed to host the linked CSE on my own website, which I have done here. The search box which you see on top of this blog is for the linked version of the 42klines CSE.

    I intend to keep the linked CSE version most current (add more bookmarks to it), so it might be helpful for you to bookmark this website, if you want better coverage. You can find the page of the stored version of 42klines CSE here, but I won’t recommend bookmarking that.

  3. Ravindranath wrote:

    IMO, we can achieve the same context in google by appending “Unix” to the search string instead of maintaining a list of unix related sites.
    For example, instead of searching for ” signals thread” we can search for “unix signals thread”.

  4. 42klines wrote:

    @Ravindranath: Did you try to do that search on Google. If you didn’t try, please try and then again compare web results with the results from the 42klines CSE. :)
