<?xml version='1.0' encoding='utf-8' ?>
<!--  If you are running a bot please visit this policy page outlining rules you must respect. http://www.livejournal.com/bots/  -->
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
<channel>
  <title>SearchTools Blog</title>
  <link>http://searchtools.livejournal.com/</link>
  <description>SearchTools Blog - LiveJournal.com</description>
  <lastBuildDate>Wed, 17 Sep 2008 22:17:31 GMT</lastBuildDate>
  <generator>LiveJournal / LiveJournal.com</generator>
  <lj:journal>searchtools</lj:journal>
  <lj:journaltype>personal</lj:journaltype>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/80845.html</guid>
  <pubDate>Wed, 17 Sep 2008 22:17:31 GMT</pubDate>
  <title>ESS West, next week in San Jose</title>
  <link>http://searchtools.livejournal.com/80845.html</link>
  <description>&lt;blockquote&gt;
				&lt;h4&gt;&lt;a href=&quot;http://enterprisesearchsummit.com/west2008/&quot;&gt;Enterprise Search Summit, 22-24 September, 2008&lt;/a&gt;&lt;/h4&gt;
				&lt;p&gt;Focused on real-world issues of implementing and enhancing search for intranets, portals and large web sites. It&apos;s been a wonderful conference every time, because it is just about search and related issues. It&apos;s a great mix of case studies, specialist presentations, and even the vendor talks are good.&lt;/p&gt;
				&lt;p&gt;Avi Rappoport will be presenting a pre-conference workshop, &lt;a href=&quot;http://enterprisesearchsummit.com/west2008/preconference.shtml&quot;&gt;Enterprise Search 101&lt;/a&gt;, and a talk, &lt;a href=&quot;http://enterprisesearchsummit.com/west2008/daytwo.shtml#session_1575&quot;&gt;Inside the Black Box of the Search Index&lt;/a&gt;, which will start with the basic inverted index and move on to some of the more interesting aspects, including tokenizing and document caching.&lt;/p&gt;
				&lt;p&gt;&lt;a href=&quot;https://secure.infotoday.com/forms/default.aspx?form=essw08&quot;&gt;Online registration available&lt;/a&gt;. There&apos;s also a &lt;a href=&quot;http://www.kmworld.com/kmw08/Exhibitors.aspx&quot;&gt;vendor exhibit hall&lt;/a&gt;, shared with the KMWorld and Intranets conferences. To see just the exhibits, fill in the registration form and scroll down to the &amp;quot;Exhibits only&amp;quot; button; that option is free. But you&apos;ll be missing a lot of fascinating talks.&lt;/p&gt;
			&lt;/blockquote&gt;

</description>
  <comments>http://searchtools.livejournal.com/80845.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/80429.html</guid>
  <pubDate>Wed, 17 Sep 2008 22:14:41 GMT</pubDate>
  <title>Search Solutions meeting in London, 23 September, 2008</title>
  <link>http://searchtools.livejournal.com/80429.html</link>
  <description>&lt;a href=&quot;http://irsg.bcs.org/SearchSolutions/2008/sse2008.php&quot;&gt;Search Solutions meeting in London, 23 September, 2008&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Sponsored by the British Computer Society, Information Retrieval Specialist Group, this is an interactive and collegial meeting, focusing on innovations in information search and retrieval.</description>
  <comments>http://searchtools.livejournal.com/80429.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/80276.html</guid>
  <pubDate>Tue, 26 Aug 2008 19:37:13 GMT</pubDate>
  <title>Great overview article on Inverted Indexing for Search</title>
  <link>http://searchtools.livejournal.com/80276.html</link>
  <description>I am doing a talk about going inside the black box of the search index for the &lt;a href=&quot;http://enterprisesearchsummit.com&quot;&gt;Enterprise Search Summit&lt;/a&gt; in September in San Jose (more on that later).&lt;br /&gt;&lt;br /&gt;While I have a lot to say about indexes, I used the opportunity to check around and look for current research on the topic, and pretty much struck gold.  Although this paper is from 2006, it is exhaustive and detailed, with both practical and theoretical information, including the finding that inverted indexes are both significantly faster to search and easier to maintain than relational database management systems, signature files and suffix arrays.  It also has a thorough annotated bibliography.  Best of all, Zobel and Moffat agree with me on lowercasing all words in the index and including stopwords, which they say &quot;have an important role in phrase queries&quot;. &lt;br /&gt;&lt;br /&gt;&lt;em&gt;Inverted files for text search engines&lt;/em&gt;&lt;br /&gt;by Justin Zobel and Alistair Moffat&lt;br /&gt;ACM Computing Surveys. 2006;38(2) (56 pages).&lt;br /&gt;Available from: &lt;a href=&quot;http://doi.acm.org/10.1145/1132956.1132959&quot;&gt; http://doi.acm.org/10.1145/1132956.1132959&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Unfortunately, this article is firmly behind the ACM paywall, so if you or your institution don&apos;t have a subscription, you have to jump through a few hoops to get it.  When you click the PDF link, you will be denied access and will have to go through their free registration form.  After that, there&apos;s a little form and you can buy the article for $10 by credit card.  I think it&apos;s worth it.&lt;br /&gt;&lt;br /&gt;ETA: improved my title</description>
  <comments>http://searchtools.livejournal.com/80276.html</comments>
  <lj:mood> </lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/79998.html</guid>
  <pubDate>Fri, 11 Jul 2008 18:36:25 GMT</pubDate>
  <title>Sphinx (open source free search engine): New SearchTools Report</title>
  <link>http://searchtools.livejournal.com/79998.html</link>
  <description>Sphinx is an open source search engine, written in C++, using both SQL and custom index files to provide a very fast text search.  The architecture scales to over a billion records by distributing the index and querying among multiple virtual and real processors.&lt;br /&gt;&lt;br /&gt;While it does a full text search, Sphinx is designed to work with structured content (music lyrics, products), and semi-structured content (RSS feeds, blog posts, magazine articles).  Sphinx is much faster and more flexible than the internal SQL clauses such as WHERE, ORDER BY, and GROUP BY.  This structure allows it to display results with faceted metadata, &lt;a href=&quot;http://www.widepress.com/index.php?keyword=hobbit&amp;amp;language1=1&amp;amp;language2=2&amp;amp;language3=3&amp;amp;language4=4&amp;amp;language5=5&quot;&gt;for example in the widepress.com results&lt;/a&gt;, showing graphical facets including country, source, theme and date.&lt;br /&gt;&lt;br /&gt;Sphinx does not have a robot crawler, although it can accept input in XML, which can be generated by a crawler.  It connects directly to MySQL and PostgreSQL, and has web scripts for external sources.  APIs are available in PHP, Python, Java, Perl and Ruby.&lt;br /&gt;&lt;br /&gt;Read &lt;a href=&quot;http://www.searchtools.com/tools/sphinx.html&quot;&gt;more&lt;/a&gt; and tell me about your Sphinx experience here.</description>
  <comments>http://searchtools.livejournal.com/79998.html</comments>
  <lj:mood>accomplished</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/79717.html</guid>
  <pubDate>Thu, 10 Jul 2008 20:00:51 GMT</pubDate>
  <title>x-robots-tag</title>
  <link>http://searchtools.livejournal.com/79717.html</link>
  <description>&lt;hr&gt;
    		&lt;blockquote&gt;
    			&lt;p&gt;In the &lt;a href=&quot;http://www.searchtools.com/robots/robots-exclusion-protocol.html#rep-june-08&quot;&gt;Robots Exclusion Protocol June 08 Agreement&lt;/a&gt;, the leading webwide search engines announced that they would recognize a new element in the HTTP header, the X-Robots-Tag. Google was the first to use it, then Yahoo, and now Microsoft Live Search supports it as well.&lt;/p&gt;
    			&lt;p&gt;When a browser or robot sends a request to the web server for a URL, part of the response is the invisible &lt;a href=&quot;http://en.wikipedia.org/wiki/List_of_HTTP_headers&quot;&gt;HTTP header&lt;/a&gt;, including information about the file type, encoding, and date modified. This information is generated by the web server.&lt;/p&gt;
    			&lt;p&gt;The new X-Robots-Tag, within the HTTP response header, can contain the same values as the Robots META tags: NOINDEX, NOFOLLOW, NOARCHIVE, NOODP, NOSNIPPET.&lt;/p&gt;
    			&lt;p&gt;There are several cases where the X-Robots-Tag values will be very valuable:&lt;/p&gt;
				&lt;ul&gt;
					&lt;li&gt;For non-HTML documents, including plain text, XML, PDF, office documents, audio, and video files. While many of these documents are able to carry Properties information or metadata such as XMP, they rarely do, and even then, it&apos;s often incorrect or duplicated. &lt;/li&gt;
					&lt;li&gt;For situations where the web site publisher cannot change the content of the HTML files, but wishes to control some of the site interaction with search indexing robots.&lt;/li&gt;
					&lt;li&gt;Sites with large amounts of changing content, where updating individual files is too hard or expensive. &lt;/li&gt;
				&lt;/ul&gt;
   				&lt;p&gt;This is not something anyone can type in by hand, but it&apos;s easily added programmatically by server-side tools such as Perl, Ruby, or PHP. For simple cases, the Apache .htaccess file is easy enough to configure, as in this example where the crawler is told not to index the robots.txt file itself:&lt;/p&gt;
   				&lt;blockquote&gt;
    						&lt;pre&gt;&amp;lt;FilesMatch &amp;quot;robots\.txt&amp;quot;&amp;gt;&lt;br&gt;	Header set X-Robots-Tag &amp;quot;NOINDEX, FOLLOW&amp;quot;&lt;br&gt;&amp;lt;/FilesMatch&amp;gt;

&lt;/pre&gt;
						&lt;/blockquote&gt;
   					&lt;p&gt;or to avoid following links in &amp;quot;.doc&amp;quot; files:&lt;/p&gt;
   					&lt;blockquote&gt;
   						&lt;pre&gt;&amp;lt;FilesMatch &amp;quot;\.doc$&amp;quot;&amp;gt;&lt;br&gt;	Header set X-Robots-Tag &amp;quot;NOFOLLOW&amp;quot;&lt;br&gt;&amp;lt;/FilesMatch&amp;gt;&lt;/pre&gt;
						&lt;/blockquote&gt;
   				&lt;p&gt;I think this is a very clever way to add the known functionality of Robots META tags to non-HTML file formats, collated from an external metadata repository. It&apos;s likely to be particularly useful to intranet search engines, and portals which may not have  access to the documents themselves.&lt;/p&gt;
    		&lt;/blockquote&gt;
    		&lt;hr&gt;
		&lt;blockquote&gt;
   				&lt;p&gt;I have added an  &lt;a href=&quot;http://www.searchtools.com/test/robots/index.html#x-robots-tag&quot;&gt;X-Robots-Tag test suite&lt;/a&gt; to the &lt;a href=&quot;http://www.searchtools.com/test/index.html&quot;&gt;SearchTools testing section&lt;/a&gt; and will report if I find anything interesting.&lt;/p&gt;
&lt;p&gt;H/T to: &lt;a href=&quot;http://yoast.com/x-robots-tag-play/&quot;&gt;Playing with the X-Robots-Tag&lt;/a&gt;; &lt;a href=&quot;http://hamletbatista.com/2007/08/01/controlling-your-robots-using-the-x-robots-tag-http-header-with-googlebot/&quot;&gt;Controlling Your Robots&lt;/a&gt;; &lt;a href=&quot;http://sebastians-pamphlets.com/handling-googles-neat-x-robots-tag-sending-rep-header-tags-with-php/&quot;&gt;Handling Google&apos;s neat X-Robots-Tag&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</description>
  <comments>http://searchtools.livejournal.com/79717.html</comments>
  <lj:mood>tangential</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/79455.html</guid>
  <pubDate>Thu, 03 Jul 2008 00:04:15 GMT</pubDate>
  <title>Webwide search robots now indexing Flash (and filling in forms)</title>
  <link>http://searchtools.livejournal.com/79455.html</link>
  <description>&lt;p&gt;The SWF (Flash) file format has been open for a while, and a lot of search engines have used the format to get at some of the static text in the Flash files.  However, Flash is now an interactive web site application builder, and there is a lot of text that just does not exist until someone comes along and clicks.  This has meant that people who wanted their sites properly indexed by webwide search engines could not use Flash, or would have to go to extra lengths to provide static text for search engine robots to find.&lt;/p&gt;

&lt;p&gt;What &lt;a href=&quot;http://www.adobe.com/aboutadobe/pressroom/pressreleases/200806/070108AdobeRichMediaSearch.html&quot;&gt;Adobe&lt;/a&gt; and &lt;a href=&quot;http://googleblog.blogspot.com/2008/06/google-learns-to-crawl-flash.html&quot;&gt;Google&lt;/a&gt; have just announced is that Adobe is making a special version of the Flash code that can approximate a human interacting with the Flash application in the SWF file, triggering as many application states as it can.  As far as I can tell, the Flash client within the indexing robot will be clicking every possible button and entering text in text fields. While indexing the labels on buttons seems odd at first, it makes sense to think of them as anchor text pointing at other pages (or at least URLs).  &lt;/p&gt;

&lt;p&gt;This is similar to what the &lt;a href=&quot;http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html&quot;&gt;googlebot is doing on some site forms&lt;/a&gt;: automatically clicking every combination of buttons, menus, and checkboxes, and submitting words from the site in text boxes.  This has ended up creating phantom shopping carts and &lt;a href=&quot;http://www.lunchpauze.com/2008/04/googlebot-wtf-are-you-doing.html&quot;&gt;search queries&lt;/a&gt;. They only do this on GET actions, not on POST, and presumably will not do so if the page has meta NOINDEX and NOFOLLOW tags. &lt;/p&gt;

&lt;p&gt;The chief concerns I&apos;ve seen from web site publishers include: the lack of clarity about exactly which JavaScript Flash loading links will be acceptable (especially SWFObject); how external XML files loaded by Flash will be indexed; and how deep linking into Flash files will work. Adobe has some &lt;a href=&quot;http://www.adobe.com/devnet/flashplayer/articles/swf_searchability.html&quot;&gt;explanations on their FAQ&lt;/a&gt;.  At the moment, it&apos;s SWF only, all versions from the oldest to the current, whether generated by Flash or Flex, which they call &quot;RIAs&quot; (rich Internet applications).  However, they are not providing access to FLV files, which are used on YouTube etc. to contain video for playback, and rarely have textual metadata.&lt;/p&gt;

&lt;p&gt;Adobe &lt;a href=&quot;http://www.adobe.com/devnet/flashplayer/articles/swf_searchability.html&quot;&gt;says&lt;/a&gt; Yahoo is working on this as well, and that they are &quot;exploring ways to make the technology more broadly available&quot; to other search vendors.  No word on whether that includes enterprise and site search developers.&lt;/p&gt;

There&apos;s an excellent &lt;a href=&quot;http://searchengineland.com/080701-000002.php&quot;&gt;writeup&lt;/a&gt; from the SEO point of view at Searchengineland, and searchmarketinggurus has &lt;a href=&quot;http://www.searchmarketinggurus.com/search_marketing_gurus/2008/07/google-can-now.html&quot;&gt;a skeptical response.&lt;/a&gt;</description>
  <comments>http://searchtools.livejournal.com/79455.html</comments>
  <lj:mood> </lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/79215.html</guid>
  <pubDate>Fri, 27 Jun 2008 17:55:35 GMT</pubDate>
  <title>Search usability research findings</title>
  <link>http://searchtools.livejournal.com/79215.html</link>
  <description>Whitney Quesenbery and her colleagues convey the findings of a long study about how search is used at the UK&apos;s Open University.  She gave a talk at the Enterprise Search Summit, and presented more formally at the &lt;a href=&quot;http://www.usabilityprofessionals.org/conference//&quot;&gt;Usability Professionals’ Association conference&lt;/a&gt;, in June 2008.&lt;br /&gt;&lt;br /&gt;The study included search log analysis, heuristic reviews, and remote and local usability testing on the search user experience, over the course of several years, and the reports are linked from Whitney&apos;s valuable &lt;a href=&quot;http://www.wqusability.com/articles/search-usability.html&quot;&gt;Search Usability&lt;/a&gt; page.  &lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;a href=&quot;http://www.wqusability.com/articles/designing-for-search-stc2008.pdf&quot;&gt;Designing for Search: Making Information Easy&lt;/a&gt;&lt;font color=&quot;red&quot; size=&quot;smaller&quot;&gt;(PDF)&lt;/font&gt; covers both search and content.  It recommends focusing improvements first on the most frequent terms, the &lt;a href=&quot;http://www.searchtools.com/analysis/long-tail.html&quot;&gt;short head&lt;/a&gt; of search popularity.  The results of tests with eyetracking &quot;heat map&quot; visualizations show that both students and those outside the system will scan the whole search results page, and confirm the user tests stressing the strong value of meaningful titles. &lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;a href=&quot;http://www.wqusability.com/articles/search-is-normal-upa2008.pdf&quot;&gt;Search is now normal behavior. What do we do about that?&lt;/a&gt;&lt;font color=&quot;red&quot; size=&quot;smaller&quot;&gt;(PDF)&lt;/font&gt; has a different perspective. 
In addition to the classic &lt;a href=&quot;http://www.searchtools.com/analysis/long-tail.html&quot;&gt;long tail&lt;/a&gt; frequency of search terms, it showed that the most popular search terms (the &lt;a href=&quot;http://www.searchtools.com/analysis/long-tail.html&quot;&gt;short head&lt;/a&gt;) remained much the same over the course of three years, though there was also some seasonal variation.  Gratifyingly, because I&apos;ve been saying this for a while, the need for search as a supplement to navigation went down when the site navigation changed.  The study finds that adding topical metadata and improving titles make search results significantly more useful.  A comparison of four search engines found significant differences in results, in particular in the variety of top results for common terms. Only one of the four search engines hid duplicate entries, consistently had all links on the first page be appropriate, and displayed links to several different locations, rather than a single subsite or directory.&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;It&apos;s great to see more research done over time and with a large amount of data.  I&apos;m keeping a listing of what I&apos;ve found at CiteULike with the tag &lt;a href=&quot;http://www.citeulike.org/user/avirr/tag/search-interface&quot;&gt;search-interface&lt;/a&gt;, and planning to update my &lt;a href=&quot;http://www.searchtools.com/info/user-interface.html&quot;&gt;Search Usability page&lt;/a&gt;.</description>
  <comments>http://searchtools.livejournal.com/79215.html</comments>
  <lj:mood>working</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/79012.html</guid>
  <pubDate>Thu, 26 Jun 2008 23:07:58 GMT</pubDate>
  <title>The Short Head and Long Tail of Search</title>
  <link>http://searchtools.livejournal.com/79012.html</link>
  <description>I&apos;ve just posted an article on the &lt;a href=&quot;http://www.searchtools.com/analysis/long-tail.html&quot;&gt;Long Tail, Short Head and Search&lt;/a&gt;.  Every site, intranet and enterprise search log I&apos;ve analyzed fits the model of the Long Tail, with a very few very popular search terms, then tailing off very quickly to unique queries (the Long Tail), creating a &lt;a href=&quot;http://www.searchtools.com/analysis/long-tail.html#zipf&quot;&gt;Zipf curve&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The Short Head -- the few most frequently used search terms -- is the best place to start in analyzing search engine usage.  My article also gives some &lt;a href=&quot;http://www.searchtools.com/analysis/long-tail.html#using-the-head&quot;&gt;suggestions&lt;/a&gt; for taking the information and using it to improve a search engine.</description>
  <comments>http://searchtools.livejournal.com/79012.html</comments>
  <lj:mood>awake</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/78605.html</guid>
  <pubDate>Tue, 24 Jun 2008 00:03:12 GMT</pubDate>
  <title>HCI/IR workshop</title>
  <link>http://searchtools.livejournal.com/78605.html</link>
  <description>&lt;a href=&quot;http://research.microsoft.com/~ryenw/hcir2008/&quot;&gt;HCIR 2008: Workshop on Human-Computer Interaction and Information Retrieval&lt;/a&gt; &lt;br /&gt;&lt;br /&gt;Making the connection between interface and search, this workshop is focused this year on complex search tasks. The 2007 Workshop presentations ranged from visual text analysis to online consumer choice. This year&apos;s workshop will be 23 October, 2008, in Redmond, Washington, USA.</description>
  <comments>http://searchtools.livejournal.com/78605.html</comments>
  <lj:mood>okay</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/78401.html</guid>
  <pubDate>Sun, 15 Jun 2008 22:22:55 GMT</pubDate>
  <title>article on the new Robots Exclusion Protocol</title>
  <link>http://searchtools.livejournal.com/78401.html</link>
  <description>My article is up on InfoToday: &lt;a href=&quot;http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=49511&quot;&gt;New Robots Exclusion Protocol Agreement Among Yahoo!, Google, and Microsoft Live Search&lt;/a&gt;.  Nothing earthshaking, just a summary from a library point of view, and a quote from Danny Sullivan saying that it&apos;s an important first step.</description>
  <comments>http://searchtools.livejournal.com/78401.html</comments>
  <lj:mood>working</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/78085.html</guid>
  <pubDate>Fri, 13 Jun 2008 21:18:53 GMT</pubDate>
  <title>More Information on the new Robots Exclusion Protocol</title>
  <link>http://searchtools.livejournal.com/78085.html</link>
  <description>&lt;h4&gt;More Information on the new Robots Exclusion Protocol&lt;/h4&gt;
				&lt;p&gt;Search indexing robot writers and web publishers should definitely look at the &lt;a href=&quot;http://www.searchtools.com/robots/robots-exclusion-protocol.html#rep-june-08&quot;&gt;new extensions to the REP&lt;/a&gt;, as there are useful additions to both &lt;a href=&quot;http://www.searchtools.com/robots/robots-txt-elements.html&quot;&gt;robots.txt directives &lt;/a&gt; and &lt;a href=&quot;http://www.searchtools.com/robots/robots-meta.html&quot;&gt;Robots META tags&lt;/a&gt;. Most of these features have been supported by the big three search engines (Google, Yahoo, MSN Live), but it&apos;s nice to have that formalized, and other search robots can take advantage of the new functionality. &lt;/p&gt;
				&lt;p&gt;The new &lt;a href=&quot;http://www.searchtools.com/robots/robots-exclusion-protocol.html#x-robots-tag&quot;&gt;X-Robots-Tag&lt;/a&gt; (added to the HTTP header for non-HTML files) is a good way to send the meta information, but requires automated extensions to the servers. For example, if content is available in both HTML and PDF formats, it&apos;s easy to send NOINDEX values for all PDF files, directing search engines away from the printable format and towards the browser-readable format. &lt;/p&gt;
				&lt;p&gt;It turns out that  NOODP comes in handy when a page is linked from the &lt;a href=&quot;http://www.dmoz.org&quot;&gt;ODP&lt;/a&gt; (Open Directory Project), and the title or text in that entry is not accurate, which  happens sometimes. Using the NOODP robots meta tag value tells the search engines not to use the ODP entry, but rather the title  and text from the page. NOYDIR does the same for the &lt;a href=&quot;http://dir.yahoo.com&quot;&gt;Yahoo Directory&lt;/a&gt;, but is only officially supported by Yahoo and its Slurp robot. &lt;/p&gt;
				&lt;p&gt;For pages with frequent changes,  NOARCHIVE makes some sense: the old content may be in the searchable index, but at least the search engines will not display the old version of the page itself. &lt;/p&gt;
				&lt;p&gt;However, I have yet to figure out when someone would use NOSNIPPET (which also disables archive display). Limiting a listing in the search results page to the title and URL seems like such a bad idea.  Why would anyone do this?&lt;/p&gt;</description>
  <comments>http://searchtools.livejournal.com/78085.html</comments>
  <lj:mood>accomplished</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/77886.html</guid>
  <pubDate>Wed, 04 Jun 2008 00:47:29 GMT</pubDate>
  <title>New Robot Exclusion Protocol!</title>
  <link>http://searchtools.livejournal.com/77886.html</link>
  <description>Supported by webwide search engines Yahoo, Google and Microsoft, this adds directives to robots.txt:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&quot;Allow&quot; directives&lt;br /&gt;&lt;li&gt;wildcards in URLs &lt;br /&gt;&lt;li&gt;Sitemap Location&lt;/ul&gt;There are also HTML meta tags and document properties directives for &lt;br /&gt;&lt;ul&gt;&lt;li&gt;NOSNIPPET&lt;br /&gt;&lt;li&gt;NOARCHIVE&lt;br /&gt;&lt;li&gt;NOODP (don&apos;t use ODP information for this page).&lt;/ul&gt;Yahoo has a nice &lt;a href=&quot;http://www.ysearchblog.com/archives/000587.html&quot;&gt;long blog entry&lt;/a&gt; on this, &lt;a href=&quot;http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html&quot;&gt;as does Google&lt;/a&gt; and &lt;a href=&quot;http://blogs.msdn.com/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx&quot;&gt;MS Live Search&lt;/a&gt;.  Great news for web developers, who&apos;ve been waiting for this for a very long time.&lt;br /&gt;&lt;br /&gt;But there&apos;s nothing from the robots mailing list or &lt;a href=&quot;http://robotstxt.org&quot;&gt;RobotsTxt.org&lt;/a&gt;, which is a shame.&lt;br /&gt;&lt;br /&gt;This is also a test for all site and intranet search crawlers -- any abandoned software will not recognize these new directives.  &lt;br /&gt;&lt;br /&gt;I&apos;ll dig further into this in the next week and provide more analysis and details.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://searchtools.livejournal.com/77886.html?mode=reply&quot;&gt;Comments?&lt;/a&gt;</description>
  <comments>http://searchtools.livejournal.com/77886.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/77748.html</guid>
  <pubDate>Tue, 03 Jun 2008 22:00:51 GMT</pubDate>
  <title>Yahoo vs. Google interactive geosearch</title>
  <link>http://searchtools.livejournal.com/77748.html</link>
  <description>I wanted to find the trendy-but-good shop I stopped by yesterday: it&apos;s not really a cafe or a coffeehouse or a restaurant.  They sell sandwiches and savory chicken pie and strawberry shortcake, with cartons of strawberries stacked high in the front.  I knew where it was, but not the name, so I compared Yahoo Maps and Google Maps, to learn a little about geosearch.&lt;br /&gt;&lt;br /&gt;Yahoo Maps knew where I live, so it started there, and then I just scrolled and zoomed until I found the corner of 51st and Telegraph.  (It would have been easier if I&apos;d used the hybrid map &amp; satellite photo, much better for me than plain maps).  Then I used the &quot;Find a Business&quot; and typed &quot;pie&quot;.  Instantly, &lt;a href=&quot;http://bakesalebetty.com&quot;&gt;Bakesale Betty&apos;s&lt;/a&gt; came up, with the word &quot;pie&quot; harvested from a Yahoo user review (5 stars!) in which a customer praised their &quot;pot-pies&quot;.  That was about 5 scrolls and 4 clicks and one short search, and success.&lt;br /&gt;&lt;br /&gt;Google doesn&apos;t know where I live, so I typed in &quot;51st and telegraph, oakland, ca&quot; -- and there I was. A couple of clicks to get to the closest street level.  Then I clicked on the &quot;Find businesses&quot; link and typed in &quot;pie&quot; and got a display of... 20 results out of 5,645 hits and a much higher view, so they could show me all the other pie matches. Oops.  I tried &quot;pies&quot; and that didn&apos;t work, and I couldn&apos;t think of anything else to search on as there are a million sandwich and cookie places.  2 searches (one long), 1 scroll and 4 clicks, but that got me stuck.  &lt;br /&gt;&lt;br /&gt;Then I typed &quot;betty&quot; and of course it was right there, with links to reviews.  &lt;br /&gt;&lt;br /&gt;Just to check, I clicked the Explore this area link, no luck, all about garbage pickup.  
I tried using the &quot;Street View&quot; and looking at that corner, but in one view the storefront is too dark to see, and in the other there&apos;s a big truck in front of the shop.&lt;br /&gt;&lt;br /&gt;Why did Google not find me the store when I searched for pie?  You may ask, and I have an answer.  Because their index and/or their query parsing did not match &quot;pie&quot; with &quot;pot-pies&quot; in one of the customer reviews that Google automatically connected to the business.  If they&apos;d used even lightweight pluralization, it would have worked.  And zipping me up to the whole Bay Area view was really annoying.&lt;br /&gt;&lt;br /&gt;Yahoo did a stemmed match and showed it right there, without changing my map level.  It wins this comparison.</description>
  <comments>http://searchtools.livejournal.com/77748.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/77432.html</guid>
  <pubDate>Thu, 15 May 2008 22:16:36 GMT</pubDate>
  <title>off to ESS</title>
  <link>http://searchtools.livejournal.com/77432.html</link>
  <description>I will be leaving for the &lt;a href=&quot;http://enterprisesearchsummit.com&quot;&gt;Enterprise Search Summit&lt;/a&gt; tomorrow (taking my family to New York for a little adventure).  I&apos;ll be teaching a workshop, Enterprise Search 101, on Monday the 19th, and starting the Search Analytics track on Wednesday the 21st.  If you see me, please introduce yourself, I&apos;d love to meet people who read this blog.</description>
  <comments>http://searchtools.livejournal.com/77432.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/77291.html</guid>
  <pubDate>Tue, 06 May 2008 17:57:06 GMT</pubDate>
  <title>A First Taxonomy for &quot;Search Log Junk&quot;</title>
  <link>http://searchtools.livejournal.com/77291.html</link>
  <description>&lt;p&gt;Search logs contain a lot of weird things, and some of them can have a significant effect on search log analysis.  Having looked at tens of thousands of lines of search log entries, I offer this first attempt at defining some of the weirdest and least useful kinds of log entry, which I call &quot;Search Log Junk&quot;.   Here are the types of junk that I&apos;ve seen most frequently:&lt;/p&gt;

&lt;dl&gt;
	&lt;dt&gt;&lt;b&gt;Empty Queries&lt;/b&gt;&lt;/dt&gt;
	&lt;dd&gt;Queries without any query text or usable parameters.  These can appear when people think the &amp;quot;Search&amp;quot; button is important in and of itself.  Or perhaps search is in the first page form, and the cursor gets into that field and users press Return.  These are often sent from the home page, according to the referer fields I&apos;ve seen. &lt;br /&gt;
		&lt;br /&gt;
		The first thing is to make sure that the search engine is doing something reasonable in this case.   This could be just bringing up a helpful search page, adding  a script to bring up an error dialog, or a script to ignore the empty query. I&apos;m leaning towards the last option.&lt;br /&gt;
		&lt;br /&gt;
		I&apos;ve found only a couple of ways to use this information. Empty queries are still useful for traffic and response time metrics, and I think it&apos;s useful to check the top referring pages occasionally. A lot of empty queries from a page deep within a site may indicate some navigation problems.&lt;br /&gt;
		&lt;br /&gt;
	&lt;/dd&gt;
	&lt;dt&gt;&lt;b&gt;Repeat Queries&lt;/b&gt;&lt;/dt&gt;
	&lt;dd&gt;Multiple identical queries to the search engine from the same IP or user ID.  My best guess is that the client is calling for a refresh automatically  -- my favorite was thousands of queries over months for two dots: &amp;quot;..&amp;quot;.&lt;br /&gt;
		&lt;br /&gt;
		Again, this is useful for traffic metrics and possibly for identifying really weird incoming links. For most situations, it won&apos;t affect the statistics in any important way. But if there are hundreds of repeat queries by the same client, removing them from the database  allows you to concentrate on the real data.  You may also want to ban that IP address.&lt;br /&gt;
		&lt;br /&gt;
	&lt;/dd&gt;
	&lt;dt&gt;&lt;b&gt;Robot crawlers&lt;/b&gt;&lt;/dt&gt;
	&lt;dd&gt;Having search engines and intelligent agents crawl search results may be a good thing. Incoming links are always good, and it may be that the search results page on your site for emerald green widgets is number one in webwide search results and drives good traffic. However, there may be other robots wasting your search engine cycles: for those, a combination of robots.txt and banning their IP addresses will help.&lt;br /&gt;
		&lt;br /&gt;
	&lt;/dd&gt;
	&lt;dt&gt;&lt;b&gt;Server hacks&lt;/b&gt;&lt;/dt&gt;
	&lt;dd&gt;Search engines are attacked with the standard web server hacking parameters, such as &amp;quot;phpmyadmin&amp;quot;. They may also be subject to buffer overflow and other attacks, so they should be included in standard website security audits and checklists.&lt;br /&gt;
		&lt;br /&gt;
	&lt;/dd&gt;
	&lt;dt&gt;&lt;b&gt;Guestbook spam&lt;/b&gt;&lt;/dt&gt;
	&lt;dd&gt;There are automated advertising services that insert fake comments with URLs into form fields, guestbooks, blogs and wikis (and there&apos;s a &lt;a href=&quot;http://en.wikipedia.org/wiki/Spam_in_blogs&quot;&gt;Wikipedia page about them&lt;/a&gt;).  Many of them do the same with search fields, which explains why logs contain bizarre queries with spaces, HTML formatting and URLs in them.&lt;/dd&gt;
	&lt;dd&gt;&lt;br /&gt;
	For sites with light search traffic, these meaningless entries can cause problems with both traffic metrics and top query listings. Even for sites with thousands of queries per day, they can distort  statistics about the average length of query, so removing them from your analysis database is a good idea.&lt;br /&gt;
	&lt;br /&gt;
	It&apos;s fairly easy to identify these queries with simple regular expressions looking for href, http and .com. I haven&apos;t heard of any search engines that filter these queries, though some may be doing it without bothering their customers about it. &lt;br /&gt;
&lt;br /&gt;
	&lt;/dd&gt;
	&lt;dt&gt;&lt;b&gt;Internal testing queries&lt;/b&gt;&lt;/dt&gt;
	&lt;dd&gt;For light traffic sites, any kind of automated testing, or even heavy manual testing, can change the search log significantly -- especially given how quickly the Long Tail shows up.  Remove queries from testers by user ID or IP address to look at real user data.&lt;/dd&gt;
&lt;/dl&gt;</description>
  <comments>http://searchtools.livejournal.com/77291.html</comments>
  <lj:mood>accomplished</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/77043.html</guid>
  <pubDate>Tue, 04 Mar 2008 23:22:11 GMT</pubDate>
  <title>partly offline due to injury</title>
  <link>http://searchtools.livejournal.com/77043.html</link>
  <description>I slipped on a stepladder and broke my left leg (tibial plateau fracture) and then chipped my right heel while on crutches.  My office is not really wheelchair accessible, nor can I go down my house&apos;s steps without great effort, so I&apos;m working remotely, part-time. &lt;br /&gt;&lt;br /&gt; I am trying to read email every day and respond in a timely way, so if you&apos;ve left a voice message or sent email that I have not answered, please try again (by email if possible).  Apologies for your inconvenience.&lt;br /&gt;&lt;br /&gt;Avi</description>
  <comments>http://searchtools.livejournal.com/77043.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/76770.html</guid>
  <pubDate>Wed, 23 Jan 2008 21:55:04 GMT</pubDate>
  <title>where has Entopia gone?</title>
  <link>http://searchtools.livejournal.com/76770.html</link>
  <description>One of my clients is interested in Entopia, so I was taking a look.&lt;br /&gt;&lt;br /&gt;I tried to go to the web site and it was replaced by one of those placeholder spam sites which pops up several spammy windows. It seems like the kind of thing that might have viruses, worms or trojans, so I&apos;d advise against opening the site in IE, or really, at all on a Windows machine.&lt;br /&gt;&lt;br /&gt;No one answered at one phone number, and the other two I found were disconnected.&lt;br /&gt;&lt;br /&gt;Casualty of the recession?  Acquired by someone?  It&apos;s a mystery, and I&apos;m curious.&lt;br /&gt;&lt;br /&gt;ETA: The Wayback Machine (archive.org) has an actual home page as of &lt;a href=&quot;http://web.archive.org/web/20060613051353/http://entopia.com/&quot;&gt;June 13, 2006&lt;/a&gt; and an empty page as of &lt;a href=&quot;http://web.archive.org/web/20060701042058/http://www.entopia.com/&quot;&gt;July 1&lt;/a&gt; of that year.  I always thought they were promising more than they could deliver, so this is perhaps confirmation.</description>
  <comments>http://searchtools.livejournal.com/76770.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/76543.html</guid>
  <pubDate>Wed, 19 Dec 2007 01:28:15 GMT</pubDate>
  <title>Small updates to Search Tools reports</title>
  <link>http://searchtools.livejournal.com/76543.html</link>
  <description>We&apos;ve updated the following reports on search engines large and small in the last few weeks:
&lt;ul&gt;

	&lt;li&gt;i411 has changed its name to &lt;a href=&quot;http://www.searchtools.com/tools/intelligenx.html&quot;&gt;Intelligenx&lt;/a&gt; and added autocategorization and multiple language support. &lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/engenium.html&quot;&gt;Engenium&lt;/a&gt; now has an OEM library and an automatic clustering module.&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/freefind.html&quot;&gt;FreeFind&lt;/a&gt; now has wildcards for excluding URL paths from indexing, indexes common office document file formats, relevance weight adjustments for URL paths (with wildcards), and some really nice indexing reports -- URLs extracted, server response, status, and which URLs are actually in the searchable index.&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/homepagesearchengine.html&quot;&gt;HomePageSearchEngine&lt;/a&gt; now indexes more file types.&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/doclinx.html&quot;&gt;Doclinx&lt;/a&gt; now has a web monitoring agent, with support for speech recognition, for research and competitive intelligence, and a language analyzer.&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/booleansearch.html&quot;&gt;Boolean Search&lt;/a&gt; now runs natively on both PPC and Intel Mac OS X systems, includes web-based admin, spellchecking and match term highlighting in search results, template and AppleScript integration for search results formatting, standalone search server, and regular expressions in queries.&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/crawl-it.html&quot;&gt;Crawl-it remote service&lt;/a&gt;  is still being supported.&lt;/li&gt;

	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/datagold.html&quot;&gt;Datagold&lt;/a&gt; is no longer a separate search, it&apos;s part of an online archiving suite.&lt;/li&gt;

	&lt;li&gt;&lt;a href=&quot;http://searchtools.com/tools/educesoft.html&quot;&gt;Educasoft&lt;/a&gt; has no indication of continuing development &lt;/li&gt;

&lt;/ul&gt;</description>
  <comments>http://searchtools.livejournal.com/76543.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/76135.html</guid>
  <pubDate>Wed, 19 Sep 2007 23:58:28 GMT</pubDate>
  <title>Search Conferences Listing updated</title>
  <link>http://searchtools.livejournal.com/76135.html</link>
  <description>This &lt;a href=&quot;http://www.searchtools.com/info/conferences.html&quot;&gt;list&lt;/a&gt; covers all the search and related conferences I know about. &lt;br /&gt;&lt;br /&gt;At the &lt;a href=&quot;http://www.enterprisesearchsummit.com/west/&quot;&gt;Enterprise Search Summit West&lt;/a&gt; I will be doing a pre-conference workshop on Critical Success Factors (how search engines work and how to make them better), giving a presentation on Tuning Search using Analytics, and moderating a panel on Good Practices for Search User Interfaces. At the &lt;a href=&quot;http://www.ftponline.com/conferences/webbuilder/2007/agenda.aspx&quot;&gt;Web Builder 2.0&lt;/a&gt; conference, I&apos;ll be presenting on Web Site Search and the User Experience. If you are a reader of this web site, please come and say hi, and if you&apos;d like an online presentation to your organization or company, I do those as well. &lt;br /&gt;&lt;br /&gt;To suggest a conference for the listing, please &lt;a href=&quot;http://www.searchtools.com/site/contact.html&quot;&gt;leave a comment&lt;/a&gt; and I&apos;ll add it.&lt;img src=&quot;http://stools.icons.ljtoys.org.uk/mi/dot.gif&quot; border=&quot;0&quot; alt=&quot;&quot;&gt;</description>
  <comments>http://searchtools.livejournal.com/76135.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/75856.html</guid>
  <pubDate>Wed, 29 Aug 2007 22:35:21 GMT</pubDate>
  <title>Critique of the Google Custom Search Traffic Report</title>
  <link>http://searchtools.livejournal.com/75856.html</link>
  <description>&lt;p&gt;Edward Tufte would be disappointed in Google.  The traffic reports in the Google Custom Search Business Edition are not only insufficient, but somewhat misleading.&lt;/p&gt;
	&lt;p&gt;Below is a picture from a CSBE search for a B2B site that I helped install in August 2007. The fact that it&apos;s a line chart, with no data points given, filled underneath, makes it look active.  It seems as though something&apos;s happening, the traffic is making progress, or worse, losing ground.  The deep dips look scary, as though the site has done something wrong.&lt;/p&gt;
&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;	&lt;p&gt;&lt;img src=&quot;http://www.searchtools.com/images/gcsbe-monthly-traffic.gif&quot; width=&quot;700&quot; height=&quot;356&quot; /&gt;&lt;/p&gt;
	&lt;p&gt;The problems come about  because it&apos;s the &lt;i&gt;wrong graph format&lt;/i&gt; for the content.  This is very simple data: one point per day.  Look at it as a simple bar graph and it suddenly seems more reasonable.  The traffic resolves itself into a rhythm: the dips are on weekends -- all the customers are home. I don&apos;t know why they got it so wrong, but it&apos;s worth getting right. &lt;/p&gt;
	&lt;p&gt;&lt;img src=&quot;http://www.searchtools.com/images/searchtools-traffic-example.gif&quot; width=&quot;700&quot; height=&quot;383&quot; /&gt;&lt;/p&gt;

	&lt;p&gt;&lt;a href=&quot;http://www.edwardtufte.com/&quot;&gt;Edward Tufte&lt;/a&gt; wrote some enlightening books on these topics, including  &lt;i&gt;The Visual Display of Quantitative Information,&lt;/i&gt; which taught those of us paying attention that how data is presented deeply affects how it is received. I highly recommend getting some of Tufte&apos;s books, &lt;a href=&quot;http://www.amazon.com/gp/search?ie=UTF8&amp;amp;keywords=edward%20tufte&amp;amp;tag=searchtoolscom&amp;amp;index=books&amp;amp;linkCode=ur2&amp;amp;camp=1789&amp;amp;creative=9325&quot;&gt;from Amazon&lt;/a&gt;&lt;img src=&quot;http://www.assoc-amazon.com/e/ir?t=searchtoolscom&amp;amp;l=ur2&amp;amp;o=1&quot; width=&quot;1&quot; height=&quot;1&quot; border=&quot;0&quot; alt=&quot;&quot; style=&quot;border:none !important; margin:0px !important;&quot; /&gt;, &lt;a href=&quot;http://www.powells.com/partner/24574/s?kw=Tufte+Edward&quot;&gt;from Powell&apos;s&lt;/a&gt; or &lt;a href=&quot;http://worldcat.org/search?q=tufte,+edward&amp;amp;fq=dt%3Abks&amp;amp;qt=facet_dt%3A&quot;&gt;from your library (using WorldCat)&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;Please comment whether you agree or disagree. I haven&apos;t seen quite this problem in other search engine traffic reports, but I&apos;m wondering what other interfaces might look like, and what you think is best. Tell me your opinions, please!&lt;/p&gt;</description>
  <comments>http://searchtools.livejournal.com/75856.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/75633.html</guid>
  <pubDate>Mon, 20 Aug 2007 20:55:48 GMT</pubDate>
  <title>Google Search Appliance and Mini - SearchTools Report Updated</title>
  <link>http://searchtools.livejournal.com/75633.html</link>
  <description>&lt;p&gt;I have updated my  report on the &lt;a href=&quot;http://searchtools.com/tools/google-app.html&quot;&gt;GSA and Mini search appliances&lt;/a&gt;, with detail based in part on my recent experiences customizing a Google Mini. The report includes information on the pricing as far as I could find it, the terms of licensing, new features, links to informative documents, and features that are not included with the Mini appliance.&lt;/p&gt;
&lt;p&gt;Once I update my &lt;a href=&quot;http://www.searchtools.com/analysis/google-appliance-v3.html&quot;&gt;full product review&lt;/a&gt;, I will have a chance to pay attention to other search engines, and that will be lovely. &lt;img src=&quot;http://stools.icons.ljtoys.org.uk/mi/dot.gif&quot; border=&quot;0&quot; alt=&quot;&quot;&gt;&lt;/p&gt;</description>
  <comments>http://searchtools.livejournal.com/75633.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/75274.html</guid>
  <pubDate>Thu, 16 Aug 2007 22:03:07 GMT</pubDate>
  <title>Google CSE - different results when searching more than three sites</title>
  <link>http://searchtools.livejournal.com/75274.html</link>
  <description>A &lt;a href=&quot;http://www.google.com/support/customsearch/bin/answer.py?answer=70392&amp;amp;topic=11502&quot;&gt;support document&lt;/a&gt; for the Google CSE (Custom Search Engine) and CSBE (Custom Search Business Edition) notes that some results may be different than those found in the same search on Google.com.  It attributes this to including more than three sites in the CSE, and says that the CSE is using a subset of the Google.com index.  &lt;br /&gt;&lt;br /&gt;They recommend limiting the CSE to three sites, changing the behavior to &apos;Search the entire web but emphasize included sites&apos;, or adding refinements that have the same effect.&lt;br /&gt;&lt;br /&gt;As of August 16, 2007, the support note says &quot;We&apos;re working to bring more complete results to all Custom Search Engines.&quot;</description>
  <comments>http://searchtools.livejournal.com/75274.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/75028.html</guid>
  <pubDate>Fri, 03 Aug 2007 17:43:11 GMT</pubDate>
  <title>Google Launches Site Search Service for Business</title>
  <link>http://searchtools.livejournal.com/75028.html</link>
  <description>Google&apos;s Custom Search Business Edition uses the Google web search index limited by site or sites. It provides most of the Google web search features and is very cheap, only $100 per year for up to 5,000 pages, $500 for up to 50,000 pages. &lt;a href=&quot;http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=37075&quot;&gt;More here at my InfoToday article&lt;/a&gt; / more at the &lt;a href=&quot;http://searchtools.com/tools/google-service.html&quot;&gt;SearchTools Google Service report page&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;What do you think of it?</description>
  <comments>http://searchtools.livejournal.com/75028.html</comments>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/74946.html</guid>
  <pubDate>Thu, 19 Jul 2007 23:48:35 GMT</pubDate>
  <title>New Google hosted search with no advertising</title>
  <link>http://searchtools.livejournal.com/74946.html</link>
  <description>&lt;p&gt;Called the &lt;a href=&quot;http://www.google.com/enterprise/csbe/index.html&quot;&gt;Google Custom
	Search Business Edition&lt;/a&gt;, this is a hosted site search, designed for small businesses with web
	site content, who don&apos;t want the advertising displayed on the older &lt;a href=&quot;http://google.com/coop/cse/&quot;&gt;Custom
	Search Engine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This version uses Google&apos;s existing index of the Internet, searching all the pages they know about
	on the specified sites including non-HTML file types, using their query language, retrieval and relevance
	algorithms, and searching in multiple languages and character sets. Like the web search engine, there
	is no way to index pages protected by access control such as passwords or ACLs. &lt;/p&gt;
&lt;p&gt;The default interface customization
	is limited to a logo and colors of the results page border, title, background, text and links, but the
	XML results format is fairly configurable using the Google AJAX Search API. While there is no structure
	in place to display site advertising on search results, presumably one could do that very easily with
	XML results. Reports are limited to top queries and queries per day/week/month/all, but can be connected
	to the Google Activity Monitor site traffic analysis tool. &lt;/p&gt;
&lt;p&gt;Note that Google will not guarantee that they&apos;ll crawl all of the pages of a particular site, update
	on-demand, or even update frequently. Using this service will not improve a site&apos;s position in the Google.com
	search results. &lt;/p&gt;
&lt;p&gt;Pricing is $100 per year for up to 5,000 pages; $500 per year for up to 50,000 pages (both payable by
	credit card via Google Checkout). According to &lt;a href=&quot;http://www.ecommerce-guide.com/news/article.php/3689231&quot;&gt;ecommerce-guide.com&lt;/a&gt;, it seems to go to a $15,000 per year
	fee for up to 1 million pages, but potential customers should contact the company. (Non-profits, universities
	and government agencies can use the standard &lt;a href=&quot;http://google.com/coop/cse/&quot;&gt;Custom
	Search&lt;/a&gt; and
	opt out of advertising.)&lt;/p&gt;</description>
  <comments>http://searchtools.livejournal.com/74946.html</comments>
  <lj:mood>interested</lj:mood>
  <lj:security>public</lj:security>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/74698.html</guid>
  <pubDate>Fri, 04 May 2007 00:00:24 GMT</pubDate>
  <title>Swish-e - SearchTools Report Updated</title>
  <link>http://searchtools.livejournal.com/74698.html</link>
  <description>&lt;a href=&quot;http://www.searchtools.com/tools/swish-e.html&quot;&gt;Swish-e&lt;/a&gt;, a free open-source Unix search engine, is fast at indexing and searching, and quite flexible. It can handle simple authentication and indexes HTML, text, XML, and (via converters) PDF, MS Word, Excel and MP3 ID3 tags, with an emphasis on storing fields/tags for specifying during search. Results can be sorted by relevance, date, size, and other fields. It runs as a CGI to a web server (Apache recommended), and has a fairly active user and developer base. New features include adjustments to the relevance algorithm, a &quot;near&quot; operator and a &quot;?&quot; single character wildcard operator (in addition to &quot;*&quot;).</description>
  <comments>http://searchtools.livejournal.com/74698.html</comments>
  <lj:mood>working</lj:mood>
  <lj:security>public</lj:security>
</item>
</channel>
</rss>
