This is the blog for SearchTools.com, a free and unbiased site about web site, enterprise and intranet search engines. These are smaller than public web search engines, such as Google or Yahoo -- they do the work behind the search fields on commerce sites, newspaper archives, government sites, etc. For more information, please see the main SearchTools Site. Search Tools Consulting offers advice to companies and individuals who need help choosing and configuring search engines. For more information, please leave a comment for any post, see the Consulting page or fill in the contact form.
Focused on real-world issues of implementing and enhancing search for intranets, portals and large web sites. It's been a wonderful conference every time, because it is just about search and related issues. It's a great mix of case studies and specialist presentations, and even the vendor talks are good.
Avi Rappoport will be presenting a pre-conference workshop, Enterprise Search 101, and a talk, Inside the Black Box of the Search Index, which will start with the basic inverted index and move on to some of the more interesting aspects, including tokenizing and document caching.
Online registration is available. There's also a vendor exhibit hall, shared with the KMWorld and Intranets conferences. To just see the exhibits, fill in the registration and scroll down to the "Exhibits only" button; that's free. But you'll be missing a lot of fascinating talks.
I am doing a talk about going inside the black box of the search index for the Enterprise Search Summit in September in San Jose (more on that later). While I have a lot to say about indexes, I used the opportunity to check around and look for current research on the topic, and pretty much struck gold.
Although this paper is from 2006, it is exhaustive and detailed, with both practical and theoretical information, including the finding that inverted indexes are both significantly faster to search and easier to maintain than relational database management systems, signature files and suffix arrays. It also has a thorough annotated bibliography. Best of all, Zobel and Moffat agree with me on lowercasing all words in the index and including stopwords, which they say "have an important role in phrase queries".
Inverted files for text search engines, by Justin Zobel and Alistair Moffat. ACM Computing Surveys. 2006;38(2) (56 pages). Available from: http://doi.acm.org/10.1145/1132956.1132959
Unfortunately, this article is firmly behind the ACM paywall, so if you or your institution don't have a subscription, you have to jump through a few hoops to get it. When you click the PDF link, you will be denied access and have to go through their free registration form. After that, there's a little form and you can buy the article for $10 by credit card. I think it's worth it.
ETA: improved my title.
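To make the point about lowercasing and stopwords concrete, here is a minimal sketch of a positional inverted index with phrase search. It is my own illustration, not code from the Zobel and Moffat paper; the function names (`tokenize`, `build_index`, `phrase_search`) and the sample documents are all made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens. Stopwords are kept."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return the ids of documents containing the exact phrase."""
    terms = tokenize(phrase)
    if not terms or any(term not in index for term in terms):
        return set()
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = set()
    for doc_id in candidates:
        # The phrase matches if the terms occur at consecutive positions.
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample documents.
docs = {
    1: "To be or not to be",
    2: "The Who are a band",
    3: "who is in the band",
}
index = build_index(docs)
```

With this index, `phrase_search(index, "the who")` can only find document 2 because "the" was indexed rather than dropped as a stopword, and a query like "to be" would be unanswerable if every stopword were removed.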

Sphinx is an open source search engine, written in C++, using both SQL and custom index files to provide a very fast text search. The architecture scales to over a billion records by distributing the index and querying among multiple virtual and real processors. While it does a full text search, Sphinx is designed to work with structured content (music lyrics, products), and semi-structured content (RSS feeds, blog posts, magazine articles). Sphinx is much faster and more flexible than the internal SQL functions such as WHERE, ORDER BY, and GROUP BY. This structure allows it to display results with faceted metadata, for example in the widepress.com results, showing graphical facets including country, source, theme and date. Sphinx does not have a robot crawler, although it can accept input in XML, which can be generated by a crawler. It connects directly to MySQL and PostgreSQL, and has web scripts for external sources. APIs are available in PHP, Python, Java, Perl and Ruby. Read more and tell me about your Sphinx experience here. - Mood: accomplished

In the Robots Exclusion Protocol June 08 Agreement, the leading webwide search engines announced that they would recognize a new element in the HTTP header, the X-Robots-Tag. Google was the first to use it, then Yahoo, and now Microsoft Live Search supports it too.
When a browser or robot sends a request to the web server for a URL, part of the response is the invisible HTTP header, including information about the file type, encoding, and date modified. This information is generated by the web server.
The new X-Robots-Tag, within the HTTP response header, can contain the same values as the Robots META tags: NOINDEX, NOFOLLOW, NOARCHIVE, NOODP, NOSNIPPET.
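For anyone building a site or intranet crawler, here is a rough sketch of how a robot might read the header value. This is my own illustration, not how any of the webwide engines implement it; the function names and the choice to ignore unrecognized values are assumptions.

```python
# The directive names listed above; anything else is ignored in this sketch.
KNOWN_DIRECTIVES = {"NOINDEX", "NOFOLLOW", "NOARCHIVE", "NOODP", "NOSNIPPET"}


def parse_x_robots_tag(header_value):
    """Split a comma-separated X-Robots-Tag value into known directives."""
    parts = {part.strip().upper() for part in header_value.split(",")}
    return parts & KNOWN_DIRECTIVES


def may_index(header_value):
    """True unless the header asks the robot not to index the document."""
    return "NOINDEX" not in parse_x_robots_tag(header_value)


def may_follow(header_value):
    """True unless the header asks the robot not to follow links."""
    return "NOFOLLOW" not in parse_x_robots_tag(header_value)
```

Matching is case-insensitive here, on the assumption that robots treat these values the way they treat the META tag values.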
There are several cases where the X-Robots-Tag values will be very valuable:
- For non-HTML documents, including plain text, XML, PDF, office documents, audio, and video files. While many of these documents are able to carry Properties information or metadata such as XMP, they rarely do, and even then, it's often incorrect or duplicated.
- For situations where the web site publisher cannot change the content of the HTML files, but wishes to control some of the site interaction with search indexing robots.
- Sites with large amounts of changing content, where updating individual files is too hard or expensive.
This is not something anyone can type in by hand, but it's easily added programmatically by server-side tools such as Perl, Ruby, or PHP. For simple cases, the Apache .htaccess file is easy enough to configure, as in this example where the crawler is told not to index the content of robots.txt:
<FilesMatch "robots\.txt">
  Header set X-Robots-Tag "NOINDEX, FOLLOW"
</FilesMatch>
or to avoid following links in ".doc" files:
<FilesMatch "\.doc$">
  Header set X-Robots-Tag "NOFOLLOW"
</FilesMatch>
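For the programmatic route, here is a minimal sketch in Python rather than Perl, Ruby, or PHP. It wraps any WSGI application and adds the header using the same two rules as the Apache examples. The names `RULES`, `x_robots_tag_for` and `with_x_robots_tag` are hypothetical, and a real site would probably load its rules from wherever its metadata lives.

```python
import re

# Hypothetical rules mirroring the Apache examples: (path pattern, header value).
RULES = [
    (re.compile(r"robots\.txt$"), "NOINDEX, FOLLOW"),
    (re.compile(r"\.doc$"), "NOFOLLOW"),
]


def x_robots_tag_for(path):
    """Return the X-Robots-Tag value for a URL path, or None if no rule matches."""
    for pattern, value in RULES:
        if pattern.search(path):
            return value
    return None


def with_x_robots_tag(app):
    """WSGI middleware that appends X-Robots-Tag to matching responses."""
    def wrapped(environ, start_response):
        value = x_robots_tag_for(environ.get("PATH_INFO", ""))

        def patched_start(status, headers, exc_info=None):
            if value is not None:
                # Copy the list so the wrapped app's own headers are not mutated.
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)
    return wrapped
```

The wrapped application does not need to know anything about robots, which is the point: the documents themselves stay untouched and the rules live in one place.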
I think this is a very clever way to add the known functionality of Robots META tags to non-HTML file formats, collated from an external metadata repository. It's likely to be particularly useful to intranet search engines, and portals which may not have access to the documents themselves.
I have added an X-Robots-Tag test suite to the SearchTools testing section and will report if I find anything interesting.
H/T to: Playing with the X-Robots-Tag; Controlling Your Robots; Handling Google's neat X-Robots-Tag - Mood:tangential

The SWF (Flash) file format has been open for a while, and a lot of search engines have used the format to get at some of the static text in the Flash files. However, Flash is now an interactive web site application builder, and there is a lot of text that just does not exist until someone comes along and clicks. This has meant that people who wanted their sites properly indexed by webwide search engines could not use Flash, or would have to go to extra lengths to provide static text for search engine robots to find.
What Adobe and Google have just announced is that Adobe is making a special version of the Flash code that can approximate a human interacting with the Flash application in the SWF file, triggering as many application states as it can. As far as I can tell, the Flash client within the indexing robot will be clicking every possible button and entering text in text fields. While indexing the labels on buttons seems odd at first, it makes sense to think of them as anchor text pointing at other pages (or at least URLs).
This is similar to what the googlebot is doing on some site forms: automatically clicking every combination of buttons, menus, and checkboxes, and submitting words from the site in text boxes. This has ended up creating phantom shopping carts and search queries. They only do this on GET actions, not on POST, and presumably will not do so if the page has meta NOINDEX and NOFOLLOW tags.
The chief concerns I've seen from web site publishers include the lack of clarity about exactly which JavaScript Flash loading links will be acceptable (especially SWFObject), how external XML files loaded by Flash will be indexed, and how the deep linking into Flash files will work. Adobe has some explanations in their FAQ. At the moment, it's SWF only, all versions from the oldest to the current, whether generated by Flash or Flex, which they call "RIAs" (rich Internet applications). However, they are not providing access to FLV files, which are used on YouTube etc. to contain video for playback, and rarely have textual metadata.
Adobe says Yahoo is working on this as well, and that they are "exploring ways to make the technology more broadly available" to other search vendors. No word on whether that includes enterprise and site search developers.
There's an excellent writeup from the SEO point of view at Searchengineland, and searchmarketinggurus has a skeptical response.

Whitney Quesenbery and her colleagues convey the findings of a long study about how search is used at the UK's Open University. She gave a talk at the Enterprise Search Summit, and presented more formally at the Usability Professionals' Association conference in June 2008. The study included search log analysis, heuristic reviews, and remote and local usability testing on the search user experience, over the course of several years. The presentations are linked from Whitney's valuable Search Usability page.
Designing for Search: Making Information Easy (PDF) covers both search and content. It recommends focusing improvements first on the most frequent terms, the short head of search popularity. The results of tests with eyetracking "heat map" visualizations show that both students and those outside the system will scan the whole search results page, and confirm the user tests stressing the strong value of meaningful titles.
Search is now normal behavior. What do we do about that? (PDF) has a different perspective. In addition to the classic long tail frequency of search terms, it showed that the most popular search terms (the short head) remained much the same over the course of three years, though there is also some seasonal variation. Gratifyingly, because I've been saying this for a while, the need for search as a supplement to navigation went down when the site navigation changed. The study finds that topical metadata and improved titles make search results significantly more useful. A comparison of four search engines found significant differences in results, in particular in the variety of top results for common terms. Only one of the four search engines hid duplicate entries, consistently had all links on the first page be appropriate, and displayed links to several different locations, rather than a single subsite or directory.
It's great to see more research done over time and with a large amount of data. I'm keeping a listing of what I've found at CiteULike with the tag search-interface, and planning to update my Search Usability page. - Mood: working

I've just posted an article on the Long Tail, Short Head and Search. Every site, intranet and enterprise search log I've analyzed fits the model of the Long Tail, with a very few very popular search terms, then tailing off very quickly to unique queries (the Long Tail), creating a Zipf curve. The Short Head -- the few most frequently used search terms -- is the best place to start in analyzing search engine usage. My article also gives some suggestions for taking the information and using it to improve a search engine. - Mood:awake
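As a starting point for that kind of analysis, here is a small sketch that pulls the Short Head out of a list of logged queries and reports how much of the total traffic it covers. The function name `short_head` and the sample queries are made up for illustration; real logs need more cleanup than trimming and lowercasing.

```python
from collections import Counter


def short_head(queries, top_n=10):
    """Return the top_n most frequent queries and their share of all searches."""
    # Normalize lightly so "Jobs" and "jobs " count as the same query.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    head = counts.most_common(top_n)
    coverage = sum(count for _, count in head) / total if total else 0.0
    return head, coverage


# Hypothetical log extract.
sample_log = ["jobs", "Jobs", "library", "jobs ", "map", "library", "parking permit"]
head, coverage = short_head(sample_log, top_n=2)
```

On a real log, the coverage figure shows how much searching a handful of improved results pages would touch, which is why the Short Head is the best place to start.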
