Searching the Web
Or “If there’s so much out there, why can’t I find it?”
Presented by: Allen Brown, IS/SE
Date: 2003-05-12

Outline - Searching the Web
1. Information Cartography
2. Visible and Invisible Web Information
3. Information Finding Strategies
4. Reference Tools, Pathfinders, Specialized Information Repositories, Subject Directories, and Search Engines
5. Information Search Strategies
6. Information Evaluation Strategies
7. Information Finding Summary
8. Search Engines and their Characteristics

Information Cartography
Imagine a physical map of an ocean basin
• identifiable areas of the sea floor
• large abyssal plain
• many undulating hills above the plain
• occasional higher elevations or plateaus
• sparse atolls and seamounts
Imagine the Web
• some information content identifiable by subject
• vast amounts of very low value information
• some good stuff distributed across many sites
• occasional high quality site with quality and quantity
• sparse stunningly useful sites (to die for)

Information Cartography - 2
Information issues: quality, completeness + location!
In searching for information we need to adjust the:
• breadth of search to find all that is relevant in an “ocean” of information
• quality level to find only “atolls” of information quality
to find everything that is important and useful

Visible and Invisible Information
Information space
Visible = indexed by search engine
Invisible = not indexed but accessible
[Diagram: an information space containing sites, databases (db), and the areas indexed by search engines 1 to 4]

Search Engines Won’t Do It All!
According to a recent study reported in Nature (1), no search engine indexes more than 16% of the Web.
Even though search engine databases are enormous, they cover very little of what's actually available on the Web.
1) Steve Lawrence and C. Lee Giles. (July 8, 1999). Accessibility of Information on the Web. Nature, 400, 107 - 109

Information Finding Strategies
Identify Starting Points based on your question:
• What type of information do you need?
  – Facts, statistics, government document, scholarly articles, popular opinion, music, picture, multimedia, news, …
• What form do you want the information in?
  – Dictionary definition, encyclopedia entry, journal article, elementary school project, video file, audio file, …
• What type of site would offer this information?
  – Academic, commercial, government, non-government organization
• How much information do you need?
  – Introduction, in-depth, references, …

Information Finding
• Reference Materials (Often invisible)
  – dictionaries, thesauri, encyclopedia, newspapers
• Information Pathfinders (Sometimes invisible) / Portals / Vortals
  – subject specific, highly relevant, sometimes bizarre
  – usually high quality
  – managed by dedicated enthusiasts, possibly amateur
  – e.g., Web design, Perl, micro cars, Curta calculators, …
• Specialized Information Repositories (Often invisible) / Portals
  – institution-based, sometimes obscure
  – usually high quality
  – managed by information professionals
  – e.g., government documents, archives, …

Information Finding - 2
• Subject Indices (Often invisible – but this is changing)
  – subject-based
  – e.g., Yahoo
• Search Engines and Search Brokers (Visible web)
  – e.g., Google, Alta Vista, Hot Bot, Lycos, Vivisimo, dogpile

Reference Tools - Dictionaries
http://www.yourdictionary.com/

Reference Tools - Thesauri
http://www.visualthesaurus.com/index.jsp

Reference Tools - Encyclopedia
http://www.britannica.com/

Pathfinders
A pathfinder site provides an information map of what is available within a fairly narrow area of interest; usually compiled by domain experts.
These sites are often called “vortals” (vertical portals).

Specialized Information Repositories
Example: National Library of Canada
A specialized information repository often collects and catalogues relatively specific information; usually compiled by information experts.
Some are considered to be vortals.

Subject Directories
Example: www.yahoo.com
Subject directories are lists compiled by people. They are organized in a hierarchy where each subject includes a list of sub-topics.
These sites are often called “portals”: a one-site starting location for general information seeking.

Subject Directories - 2
Subject lists are usually evaluated, but sites are not presented in order of relevancy. In other words, the best sites on a topic are not necessarily listed first.
Sites are compiled through submission of URLs by site creators, followed by human evaluation and selection.
One advantage of subject directories is their browsability, although this feature is only suitable for fairly general topics. A disadvantage is their relatively small size.
Other examples of subject directories:
• Infomine: http://infomine.ucr.edu
• Scout Report Signpost: http://www.signpost.org/signpost

Invisible Web Directories
Look at http://www.invisibleweb.net/

Search Engines
Search engines use computer programs, called "spiders" or "robots", that automatically collect web sites. The sites are indexed and stored in an index database.
To query a search engine, type topic keywords and Boolean connectors into a search "box." The search engine scans its index and returns links to websites containing the specified keyword relationships.
Size matters: an advantage of using search engines is their coverage (though size is relative), but this can also be a disadvantage if relevance ranking is poor.
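
As a rough illustration of the index-and-query step described under Search Engines, the sketch below builds a tiny word-to-page index and answers a Boolean keyword query against it. The page URLs, page texts, and the function names `build_index` and `search` are all made up for this example; real engines are far more elaborate.

```python
from collections import defaultdict

# Hypothetical mini "web": page URL -> page text (illustrative data only).
PAGES = {
    "site1/dogs": "large male dog breeds",
    "site2/cats": "large cat and small dog",
    "site3/pets": "male cat care",
}

def build_index(pages):
    """Map each lower-cased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, all_of=(), any_of=(), none_of=()):
    """Boolean lookup: AND over all_of, OR over any_of, NOT over none_of."""
    hits = set().union(*index.values())  # start from every indexed page
    for word in all_of:
        hits = hits & index.get(word, set())
    if any_of:
        either = set()
        for word in any_of:
            either |= index.get(word, set())
        hits = hits & either
    for word in none_of:
        hits = hits - index.get(word, set())
    return sorted(hits)

INDEX = build_index(PAGES)
print(search(INDEX, all_of=["large", "dog"]))
print(search(INDEX, any_of=["male", "dog"], none_of=["cat"]))
```

In this toy data the AND query keeps only pages containing both words, while the OR query widens the net and the NOT term then cuts it back to a single page, which mirrors the "wide net, appropriate sieve" idea.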

Search Engines: Operational Concepts
[Diagram: the search engine crawls the World Wide Web, extracts and indexes page contents into an index data base, and handles each user query through query parsing, index lookup, and results ranking and management before returning results to the user]

Search Engines - Does Size Matter?
[Slide graphic not preserved]

Size
If you are looking for unusual or hard-to-find information, you should try one or more of the search engines with a large index to check more web content. This improves the likelihood of finding what you seek.
However, for general searches or when looking for information about popular topics, a large index does not necessarily equal better results. Also, large indexes may have longer re-visit intervals.

Search Engines: Search Scoping and Ranking / Results Management
It is essential to learn and apply each engine's specialized search formats to narrow results, filter them, and push the most relevant pages to the top of the results list. Use Boolean operators, proximity connectors, stems, wild cards, sounds-like, media-type and metadata filters.
Result relevancy ranking also depends on the size of the search index and how the search engine interprets and uses your query. Each engine determines result relevancy ranking in unique ways. Consult the help file of each engine to learn about these.
Some engines offer search refinement and conceptual clustering for better focus (tighter “hit cluster”) or greater accuracy / validity (centred on the “right stuff”).

Search Engines - Search Scoping
[+] expands the scope, [-] reduces the scope
• Exact phrase [-]: quotes, e.g., “We hold these truths to be self-evident”
• Boolean operators: and [-] (default), or [+] (caution!), not [-] (extreme caution!), e.g., large male dog, large or male or dog, not cat
• Proximity connectors: near [-] (depends on engine), e.g., spring near flower
• Stemming and wildcards [+]: e.g., swim* → swim, swimming, swimmer, swimmers, swimmingly, …
• Sounds-like [+]: e.g., table → cable, able, fable, …
• Media type [-]: e.g., image, audio file, …
• Concept-based [+/-]: e.g., synonym → thesaurus, antonym, homonym, …
• Metadata-based [-]: in some systems

Search Engines - Ranking
Result relevancy ranking (= “usefulness”) can be done according to two techniques (or some combination):
• Conventional - using intra-page information
• Relative - using extra-page information

Search Engines - Conventional Ranking
Conventional (intra-page):
• frequency of words (number and density)
• phrases (exact word sequences)
• hierarchy (e.g., closer to the top of the document)
• adjacency (proximity of words)
• metadata (keywords provided by content owners)
• font size and style (relative intra-page)
[Slide example: excerpt from a web page about Curta calculator repair]
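
A minimal sketch of how some of the intra-page signals listed under Conventional Ranking might be combined into one score. It covers only word frequency (density), exact phrase match, and position near the top of the page; adjacency, metadata, and font cues are left out, punctuation is not handled, and the weights, sample pages, and function names are arbitrary assumptions rather than any real engine's formula.

```python
def score_page(text, query):
    """Toy intra-page relevance score; the weights are arbitrary assumptions."""
    words = text.lower().split()
    terms = query.lower().split()
    if not words or not terms:
        return 0.0
    # Positions of each query term within the page.
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    # Frequency: share of the page's words that are query terms.
    density = sum(len(p) for p in positions.values()) / len(words)
    # Phrase: bonus when the query occurs as an exact word sequence.
    n = len(terms)
    phrase = any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    # Hierarchy: an earlier first occurrence scores higher (1.0 at the very top).
    firsts = [p[0] for p in positions.values() if p]
    top = 1.0 - min(firsts) / len(words) if firsts else 0.0
    return density + (1.0 if phrase else 0.0) + 0.5 * top

def rank(pages, query):
    """Return page names ordered from highest to lowest score."""
    return sorted(pages, key=lambda name: score_page(pages[name], query), reverse=True)

# Hypothetical page texts, for illustration only.
DOCS = {
    "a": "curta calculator repair and cleaning prices",
    "b": "prices for cleaning a calculator such as the curta",
    "c": "garden flowers in spring",
}
print(rank(DOCS, "curta calculator"))
```

Page "a" ranks first here because it contains the exact phrase at the very top, "b" contains both words but scattered and later, and "c" contains neither.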

Search Engines - Relative Ranking
Relative (extra-page):
• popularity (page visits - from the search engine)
• citation (links pointing to the item)
• relevance of the pages containing the links pointing to the item (!)
[Diagram labels: Yahoo, Web Pages]

Search Engines: Keys to Success
[Diagram label: World Wide Web]
• Size → Large index and / or several engines
• Scoped query → “wide net” but appropriate “sieve” carefully constructed for your needs
• Ranked and manageable results → query construction and search engine features

Meta Search Engines
“Meta” search tools are able to search the index databases of multiple engines “simultaneously”, via a single interface.
“Meta” search tools don’t really search metadata. They are simply brokers that reformulate a query and hand it off to a set of search engines, then combine the results.
“Meta” engines are very fast but they do not offer the same level of control over the relationship between keywords as do individual search engines. Also, meta search engines may produce poor ranking of combined results.

Search Engines
Examples of popular search engines include:
• Google: http://www.google.com
• Alta Vista: http://www.altavista.com
• All the Web: http://www.alltheweb.com
• Northern Light: http://www.northernlight.com
Also see the KartOO clustering visual engine: http://www.kartoo.com/
For meta engines, try Vivisimo at http://vivisimo.com/

Information Search Strategies
• Think hard about what you are looking for!
• Use a Reference Tool, if appropriate
• Use a Pathfinder, if you know one
• Use a Specialized Information Repository, if appropriate
• Use Subject Indexes, if it is a common topic
• Use several Search Engines, if needed, especially for the obscure or academic topic, but learn how they work
• Use keywords - be narrow, and specific (and technical)
• Use phrases - try synonyms or related concepts
• Use Boolean connectors - but find out if / how the engine uses them
• Use stemming and wildcards - but find out if / how the engine uses them
• Use media-type filters or metadata, if appropriate

Information Search Tools - Use
[Diagram: the tools below placed in an information space with axes labelled depth and breadth; one further label reads “hard to use well”]
• Pathfinder: focused content; pre-selected by domain experts
• Search Engines and Meta-engines: easy to use; obscure or academic; caveat emptor!
• Subject Indexes: popular or common; pre-selected by interested people
• Specialized Information Repository: related or themed; pre-selected by professionals; contains “invisible” content
• Reference Tool: generic; simple lookup; created by professionals; contains “invisible” content

Information Evaluation Strategies: CARS
CARS checklist: http://library.queensu.ca./inforef/guides/evalchart.htm
• Credibility
  – author credentials stated with email contact
  – evidence of quality control (site location)
• Accuracy
  – timeliness
  – comprehensiveness
  – audience & purpose
• Reasonableness
  – fairness
  – objectivity
  – consistency
  – world view
• Support
  – source documentation or bibliography

Summary
• There is much information on the Web, but it’s not:
  – all there
  – all good (or all bad)
  – always easy to locate
• Use an information search strategy that:
  – matches the information sought
  – uses the appropriate tools
  – uses them in the correct ways
• Use an information evaluation strategy, e.g., CARS methodology.
• Choose and use search engines wisely, knowing their strengths, features, and their limitations.

How Do Search Engines Work?
Three Activities Occur:
1. Crawling
  – fetch pages
  – compile URL list (a db)
  – re-visit pages
2. Page harvesting
  – parse page
  – add to index db and establish ranking
3. Responding to search requests
  – parse query
  – apply to index
  – present and rank results

Search Engines: Operation
[Diagram: inside the Search Engine, a Crawler Robot fetches and re-visits URLs on the World Wide Web and maintains a URL data base; a Harvester Robot fetches page contents for the Index data base; a Query Processor receives the user's query and returns results. Annotations: “Really clever stuff in here” and “Fairly clever stuff in here”]

Search Engine - Hardware (not really …)
[Slide graphic not preserved]

How Do Search Engines Work?
• See “The Anatomy of a Large-Scale Hypertextual Web Search Engine” at http://wwwdb.stanford.edu/~backrub/google.html

References
• Information Search Strategies: <http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html>
• Information Evaluation Strategies: <http://www.vuw.ac.nz/~agsmith/evaln/evaln.htm>
• Search Engines:
  – <http://www.library.arizona.edu/search.htm>
  – <http://www.brightplanet.com/deepcontent/tutorials/search/index.asp>
  – <http://www.searchenginewatch.com/>
• Susan Maze, David Moxley, Donna Smith: Authoritative Guide to Web Search Engines, Neal Schuman Pub, 1997, ISBN 1555703054
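
Appendix sketch: the three activities listed under "How Do Search Engines Work?" (crawling, page harvesting, responding to search requests) can be imitated on a toy scale as follows. The in-memory `WEB` dictionary, its URLs, and the function names are invented for illustration; nothing is fetched over a network, and re-visiting and ranking are omitted.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("search engines crawl the web", ["b.example/", "c.example/"]),
    "b.example/": ("spiders fetch pages", ["a.example/"]),
    "c.example/": ("an index maps words to pages", ["d.example/"]),
    "d.example/": ("deep pages need more hops", []),
    "e.example/": ("nobody links here", []),
}

def crawl(seed, web):
    """Activity 1: follow links breadth-first, compiling the URL list."""
    seen, queue = [seed], deque([seed])
    while queue:
        url = queue.popleft()
        _text, links = web[url]
        for link in links:
            if link in web and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen

def harvest(urls, web):
    """Activity 2: parse each crawled page and add its words to the index."""
    index = {}
    for url in urls:
        for word in web[url][0].lower().split():
            index.setdefault(word, set()).add(url)
    return index

def respond(index, query):
    """Activity 3: parse the query and return pages containing every word."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return sorted(set.intersection(*sets)) if sets else []

URLS = crawl("a.example/", WEB)
INDEX = harvest(URLS, WEB)
print(URLS)
print(respond(INDEX, "pages"))
print(respond(INDEX, "nobody"))
```

The page at "e.example/" exists but is never reached because no crawled page links to it, so a query for its words returns nothing: a small-scale picture of accessible but "invisible" content.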