BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Publicat de Universitatea Tehnică „Gheorghe Asachi” din Iaşi
Tomul LV (LIX), Fasc. 3, 2009
Secţia AUTOMATICĂ şi CALCULATOARE

USING A SPATIAL DATABASE IN A LOCATION-BASED SEARCH APPLICATION

BY
ANDREI TABARCEA, *PASI FRÄNTI and VASILE MANTA

Abstract. This paper describes a solution for a georeferencing problem in a location-based search engine. Georeferencing is the process of assigning a geographic location to a web page or part of it. Our solution is to use a spatially indexed database which acts as a gazetteer and contains geographical coordinates attached to address strings. We perform a series of tests to choose the best indexing solution for the database.

Key words: spatial database, LBS, search engine, gazetteer, georeferencing.

2000 Mathematics Subject Classification: 68P20.

1. Introduction

During the last few years, the location of a user connected to the internet has gradually become easier to determine. Starting with rough estimations based on the IP address and continuing with the development of positioning technologies such as GPS (Global Positioning System), locating the user has stopped being a major obstacle in the development of location-based services. The increasing availability of geographical positioning in low-cost consumer devices such as PDAs or mobile phones has made possible the development of various services that also take the user’s location into account. A location-based search engine, which is basically a web search engine that uses the user’s location as an additional relevance criterion, is one such service.

The goal of a location-based search engine is to help users find points of interest described by one or several keywords in the proximity of their location. The main problem that arises in the development of a location-based search application is georeferencing (the process of assigning geographic coordinates to a resource, in our case a web page).
Only very few web pages give direct positioning (geotags or other forms of coordinates) for the information they contain, so the use of geographical coordinates in a web search has little applicability. However, it is common to find street or postal addresses on web pages. Therefore, one answer to the georeferencing problem is to use a predefined data structure or database that connects any given postal address to its exact location (coordinates).

We propose a solution [14] that searches for addresses in web pages, converts them to geographical coordinates and uses the information as an additional relevance criterion. Georeferencing is aided by a gazetteer, which is defined in [7] as a geospatial dictionary of geographic names whose minimum components are a geographic name, a geographic location represented by coordinates and a type designation. Our implementation of the gazetteer is a spatial database that contains postal addresses as geographic names, their corresponding coordinates as geographic locations and a single type: postal addresses.

We underline the importance of a spatial database in location-based search and we perform a series of tests to determine the fastest and most efficient structure for the database, considering the use of specialized spatial data types and functions and the indexing of the most commonly used fields.

1.1. Related Work

Location-based search has been implemented in various projects, ranging from commercial services such as Google Maps, Yahoo! Local, Bing Maps and Yellow Pages to research projects such as [8], [11] and [2].

The first methods of detecting the location of a web resource are found in [5] and [9]. In [5], “whois” records are analyzed and the phone numbers of network administrators are used with a zip code and area database to assign coordinates to Class A and B domains and to determine the globality of a website.
In [9], the sources of geospatial context are classified as relating either to the host of a web page (usually found in “whois” databases and in the way traffic is routed on the Internet) or to its content (postal addresses and codes, telephone numbers, geographic feature names). Additional geographical information is obtained from hyperlinks and meta tags. The postal address detection uses a postal code database with latitude/longitude information.

The gazetteer has been defined in [7] as a geospatial dictionary of geographic names and has been used in most of the projects that involve postal address detection, such as [3], [4], [6] or [1]. On the other hand, Named Entity Recognition without gazetteers is discussed in [10]; it turns out to work well for people and organizations, but poorly for locations.

The system in [3] employs the gazetteer approach to identify geographic locations in web pages. In [4], an ontology-based approach that extracts geographic knowledge is presented. The address is divided into three parts (basic address, complement and location identifiers such as phone number, postal code or municipality name) and address recognition is a process of geoparsing and geocoding, which uses a gazetteer described in [12].

In [6], location-based data is retrieved by recognizing postal addresses. The method is ontology-based conceptual information retrieval combined with graph matching. The concepts (knowledge/address elements) in a document are identified and linked together in a graph by semantic relations, and the concept set used is actually a gazetteer.

In [1], a geoparser is described that can identify address-level location information using a database rather than relying on metadata or other structured annotation. Its database contains postal codes, city names, street names, and also every city-postal code combination for each street of the target area, and is also used for validation.
Address detection relies on the assumption that address blocks have a certain structure and that there are certain dependencies between address elements.

2. Postal Address Spatial Database

2.1. Use of a Spatial Database

A location-based search engine finds websites which contain information about services or targets in the proximity of a given location, usually the user’s location. The MOPSI location-based search engine [14] conducts a real-time search with a prominent search engine such as Google and extracts potential postal address information from the resulting web pages. Its georeferencing process therefore consists of finding the coordinates that correspond to a given postal address. This requires a data structure or database that connects any given address to its exact location (coordinates). Our solution is to use such a database; these are commonly available (although not necessarily free) and can be purchased for specified regions. Such a spatial database can be used in geocoding, which is the process of finding associated geographic coordinates (latitude and longitude) from other geographic data, such as street addresses or zip codes (postal codes).

The main purposes of the postal address database are: converting a postal address into geographical coordinates, finding the postal address of given geographical coordinates and finding all the location points in a square or rectangular bounding box. Therefore, the database or data structure used for georeferencing and geocoding must be optimized so that every location point and every address in a bounding box defined by minimum and maximum latitudes and longitudes can be retrieved easily and quickly. A complete server-side database containing all addresses and coordinates of the target region was constructed beforehand.
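The three purposes listed above (address to coordinates, coordinates to address, and retrieval of all points in a bounding box) can be illustrated with a minimal sketch. It assumes an in-memory SQLite table with ordinary indexes in place of the MySQL database used in the paper; the table name, column names, function names and the three sample rows are hypothetical placeholders chosen for illustration, not data or code from the paper.

```python
import sqlite3

# Placeholder rows (street, number, municipality, lat, lon); values are
# illustrative only and are not taken from the paper's database.
SAMPLE_ROWS = [
    ("Kauppakatu", 1, "Joensuu", 62.6000, 29.7600),
    ("Kauppakatu", 2, "Joensuu", 62.6010, 29.7620),
    ("Kauppakatu", 1, "Helsinki", 60.1700, 24.9400),
]


def build_gazetteer(rows):
    """Create an in-memory gazetteer table mapping addresses to coordinates."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE address ("
        " street TEXT, number INTEGER, municipality TEXT,"
        " lat REAL, lon REAL)"
    )
    # Ordinary B-tree indexes on the fields used by the lookups below.
    db.execute("CREATE INDEX idx_addr ON address (municipality, street, number)")
    db.execute("CREATE INDEX idx_pos ON address (lat, lon)")
    db.executemany("INSERT INTO address VALUES (?, ?, ?, ?, ?)", rows)
    return db


def geocode(db, street, number, municipality):
    """Postal address -> (lat, lon), or None if the address is unknown."""
    return db.execute(
        "SELECT lat, lon FROM address"
        " WHERE municipality = ? AND street = ? AND number = ?",
        (municipality, street, number),
    ).fetchone()


def points_in_box(db, min_lat, max_lat, min_lon, max_lon):
    """All stored addresses whose coordinates fall inside the bounding box."""
    return db.execute(
        "SELECT street, number, municipality, lat, lon FROM address"
        " WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?"
        " ORDER BY municipality, street, number",
        (min_lat, max_lat, min_lon, max_lon),
    ).fetchall()


def reverse_geocode(db, lat, lon, half_side=0.01):
    """(lat, lon) -> nearest stored address within a small box, or None.

    Distance is compared in raw degrees, which is a simplification.
    """
    candidates = points_in_box(
        db, lat - half_side, lat + half_side, lon - half_side, lon + half_side
    )
    if not candidates:
        return None
    nearest = min(candidates, key=lambda r: (r[3] - lat) ** 2 + (r[4] - lon) ** 2)
    return nearest[:3]


if __name__ == "__main__":
    db = build_gazetteer(SAMPLE_ROWS)
    print(geocode(db, "Kauppakatu", 1, "Joensuu"))
    print(reverse_geocode(db, 62.6009, 29.7619))
    print(points_in_box(db, 62.5, 62.7, 29.7, 29.8))
```

The sample rows deliberately repeat one street name in two municipalities, which is why the geocoding lookup takes the municipality as part of the key.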
Query speed and accuracy can be improved by using a database management system that implements the Open Geographical Information System (OpenGIS) specification or other spatial and location-data standards. The database management systems that can be used include MySQL with spatial extensions, PostgreSQL with PostGIS or Oracle Spatial. Alternatively, the database structures can be altered to increase the performance of the entire application. There are several ways to enhance performance, for example, using database indexing or using data types and functions which implement GIS standards.

2.2. Common Operations

In all usage scenarios, the search engine performs the following operations, which translate into database queries:

a) Finding all the municipalities in a square (or rectangular) bounding box area

This query is performed at the beginning of the search, when the user’s area of interest is determined. The area is usually defined as a square bounding box with a fixed side length; it may intersect only one municipality but, in many cases, it intersects two or more. The search engine needs the names of the municipalities for location identification, because the same street name can be found in many cities.

Fig. 1 – The bounding box intersects one municipality (Joensuu).

In the first case (Fig. 1) the bounding box intersects only one municipality, mainly because the provided location point is near the center. The search engine will need addresses only from Joensuu, so running a query to find all the cities would seem pointless.

Fig. 2 – The bounding box intersects two municipalities (orange – Vantaa, blue – Helsinki).

In the second case (Fig. 2), the user is near the border between the two municipalities and the search engine needs addresses and location points from both municipalities.
The query for finding all the municipalities in a bounding box is made only once per search, so its running time is not critical.

b) Finding all the street names in all the municipalities of the bounding box

The second step in the location-based search is finding all the street names in the selected area. This query is made once for every municipality found in the first step, so its running time is not highly critical. The street names are used by the search engine for finding services or other targets in the area.

c) Converting the addresses found into location points (latitude and longitude)

The third step is converting all the addresses which correspond to the search results into location points (geocoding). This operation is usually performed for every address found by the search engine, and the resulting location points are used for calculating distances to the user’s location. This is one of the most time-critical operations, because it can be performed anywhere from zero to tens or hundreds of times per search, depending on the number of addresses found in the search results.

d) Converting the location points (latitude and longitude) into addresses

This operation is mainly used for determining the user’s address, if the user provides only their location point. There can also be other situations where the conversion from location points to addresses is needed, especially if the search engine finds location points and the user needs addresses.

2.3. Implementations and Results

A postal address database which stores all the addresses in Finland was implemented on a MySQL 5 database management system. For the design of the database the following factors were considered: using MySQL spatial extensions or common data types, and using database indexing. For testing purposes, a postal address database of the North Karelia region was created. Table 1 shows the database sizes for the considered solutions.
Table 1
Database Sizes

Database implementation                           Data size,   Index size,   Database size,
                                                  MB           MB            MB
Without spatial extensions and without indexing   22.71        0             22.71
Without spatial extensions and with indexing      26.77        25.67         52.44
With spatial extensions and without indexing      29.31        0             29.31
With spatial extensions and with indexing         33.37        48.02         81.39

Results show that indexing dramatically increases the database size, by more than 90%, whilst using spatial extensions also increases the storage size, by more than 12%.

For testing the query execution times, the benchmark program randomly chose 500 location points and tested the database. For each chosen location point, the queries for the four most common operations were executed and their execution times were logged. The total query execution time for the proposed testing scenario was then calculated with the following formula:

$$T_{total} = T_{query1} + n_1 \cdot T_{query2} + n_1 \cdot n_2 \cdot T_{query3} + T_{query4},$$

where $n_1$ represents the average number of municipalities returned by query1 (its value is 1.7) and $n_2$ represents the average number of search results (its value is 64). Here query1 to query4 denote the queries for operations a) to d) of Section 2.2.

Table 2
Average Query Execution Times

                 Without spatial    Without spatial    With spatial       With spatial
                 extensions and     extensions and     extensions and     extensions and
                 without indexing   with indexing      without indexing   with indexing
query1, ms       886.10             898.67             3956.10            176.83
query2, ms       715.70             953.27             1091.46            173.88
query3, ms       670.00             14.85              674.99             13.33
query4, ms       887.78             909.27             3866.22            215.44
Total time, s    75.88              5.04               83.11              2.13

Results show that the non-indexed solutions are at least 15 times slower than the indexed ones; therefore using a non-indexed database is not justified. Using spatial extensions makes queries run significantly faster on the indexed solutions (at least 2 times faster) and slower on the non-indexed solutions (1.09 times slower).

3. Conclusions

In this paper we solve the georeferencing problem in a location-based search engine by using a spatial database as a gazetteer. The most efficient solutions for the database are also the most storage-costly solutions. However, the execution time is more important than storage space, which is not a big issue (the size of a database which stores all the postal addresses in Finland would be approximately 2 GB). The solution we propose uses spatial extensions and indexing and is at least 2 times faster than the other tested solutions.

Received: August 6, 2009

“Gheorghe Asachi” Technical University of Iaşi,
Department of Computer Engineering
e-mail: vmanta@cs.tuiasi.ro

*University of Joensuu, Finland
Department of Computer Science and Statistics
e-mail: franti@cs.joensuu.fi

REFERENCES

1. Ahlers D., Boll S., Retrieving Address-based Locations from the Web. 2nd International Workshop on Geographic Information Retrieval, Napa Valley-USA, October 26-30, 2008, 27–34.
2. Ahlers D., Boll S., Urban Web Crawling. First International Workshop on Location and the Web, Beijing-China, April 22, 2008, 25–32.
3. Amitay E., Har’El N., Sivan R., Soffer A., Web-a-where: Geotagging Web Content. 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield-United Kingdom, July 25-29, 2004, 273–280.
4. Borges K., Laender A., Medeiros C., Davis Jr. C., Discovering Geographic Locations in Web Pages Using Urban Addresses. 4th ACM Workshop on Geographic Information Retrieval, Lisbon-Portugal, November 9, 2007, 31–36.
5. Buyukkokten O., Cho J., Garcia-Molina H., Gravano L., Shivakumar N., Exploiting Geographical Location Information of Web Pages. 2nd International Workshop on the Web and Databases WebDB (Informal Proceedings), Philadelphia-USA, June 3-4, 1999, 91–96.
6. Cai W., Wang S., Jiang Q., Address Extraction: Extraction of Location-Based Information from the Web.
7th Asia-Pacific Web Conference, Shanghai-China, March 29 – April 1, 2005, 925–937.
7. Hill L., Frew J., Zheng Q., Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library. D-Lib Magazine, January 1999, Vol. 5, Issue 1 (1999).
8. Jones C.B., Abdelmoty A.I., Finch D., Fu G., Vaid S., The SPIRIT Spatial Search Engine: Architecture, Ontologies and Spatial Indexing. 3rd International Conference on Geographic Information Science GIScience 2004, Maryland-USA, October 20-23, 2004, 125–139.
9. McCurley K.S., Geospatial Mapping and Navigation of the Web. 10th International Conference on World Wide Web, Hong Kong-China, May 1-5, 2001, 221–229.
10. Mikheev A., Moens M., Grover C., Named Entity Recognition without Gazetteers. 9th Conference on European Chapter of the Association for Computational Linguistics, Bergen-Norway, June 8-12, 1999, 1–8.
11. Morimoto Y., Aono M., Houle M.E., McCurley K.S., Extracting Spatial Knowledge from the Web. 2003 Symposium on Applications and the Internet, Orlando-USA, January 27-31, 2003, 326–333.
12. Souza L.A., Davis Jr. C.A., Borges K.A.V., Delboni T.M., Laender A.H.F., The Role of Gazetteers in Geographic Knowledge Discovery on the Web. 3rd Latin American Web Congress LA-WEB 2005, Buenos Aires-Argentina, October 31 – November 2, 2005, 9.
13. Wang C., Xie X., Wang L., Lu Y., Ma W.Y., Detecting Geographic Locations from Web Resources. 2005 Workshop on Geographic Information Retrieval, Bremen-Germany, October 31 – November 5, 2005, 17–24.
14. *** The MOPSI Location-based Search Engine. http://cs.joensuu.fi/mopsi/, 2009.

USING A SPATIAL DATABASE IN A LOCATION-BASED SEARCH APPLICATION

(Summary)

This paper describes a solution to the georeferencing problem, which can arise in a location-based search engine. Georeferencing is the process of assigning geographic coordinates to a web page or to a part of it. The proposed solution uses an indexed spatial database which works as a gazetteer and contains coordinates matched to postal addresses. A series of tests was carried out to determine the database structure that makes searching most efficient.
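
As a worked check of the total-time formula in Section 2.3, the following Python sketch recomputes the "Total time" row of Table 2 from the per-query averages, using the values 1.7 and 64 given in the text for the two multipliers. The dictionary keys and the function name are illustrative choices; the numbers are transcribed from Table 2. The recomputed totals agree with the reported ones to within 0.01 s, which would be consistent with the reported values being cut to two decimals, although the paper does not state how they were rounded.

```python
# Average query times in milliseconds (query1..query4), transcribed from Table 2.
TABLE2_MS = {
    "no spatial, no index": (886.10, 715.70, 670.00, 887.78),
    "no spatial, index": (898.67, 953.27, 14.85, 909.27),
    "spatial, no index": (3956.10, 1091.46, 674.99, 3866.22),
    "spatial, index": (176.83, 173.88, 13.33, 215.44),
}

# Total times in seconds as reported in the last row of Table 2.
REPORTED_S = {
    "no spatial, no index": 75.88,
    "no spatial, index": 5.04,
    "spatial, no index": 83.11,
    "spatial, index": 2.13,
}

N1 = 1.7  # average number of municipalities returned by query1
N2 = 64   # average number of search results


def total_seconds(q1, q2, q3, q4, n1=N1, n2=N2):
    """T_total = T_query1 + n1*T_query2 + n1*n2*T_query3 + T_query4, in seconds."""
    return (q1 + n1 * q2 + n1 * n2 * q3 + q4) / 1000.0


if __name__ == "__main__":
    for name, times in TABLE2_MS.items():
        recomputed = total_seconds(*times)
        print(f"{name}: recomputed {recomputed:.3f} s, reported {REPORTED_S[name]} s")
```

Because query3 is multiplied by both averages (about 109 executions per search), it dominates the total, which is why indexing the fields used by the geocoding query has the largest effect on the overall time.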