WEB LAB - Subset Extraction: Requirements and Specification
Swati Singhal
Megha Siddavanahalli
The database schema required for subset extraction has to be designed to scale to data on the order of terabytes.
Information retrieval and storage is critical and one of our major concerns. Our solution to data storage is to use a Database Management System (DBMS).
The factors that influence subset extraction and the storage of the retrieved data are:

- The database should be scalable with respect to user queries; the user may query on general and broad topics, which will produce a large set of documents.
- Users should have the option to store the retrieved data subset on their own system, so that the data retrieved by a particular query can be used for further research without running the same query again.
- A problem faced by subset extraction is storing the data on the client machine while still being able to retrieve all of the data in the ARC files using the schema.
- An efficient mechanism is required for loading data records into the database, and for retrieving them from it, when utilizing the data for research purposes.
- Another challenge in subset extraction is storing data that spans different crawls. The pages in the database are stored per crawl, and the same URL may be stored in several crawls, but each copy of the URL will have a different time associated with it.
The underlying database that stores the DAT files is shown in the following section. A separate database is maintained for every crawl, and each crawl is given a unique crawl ID.
In the database that we use for query results, we simply store the crawl ID, crawl time, and page ID.
The archive database
The archive database is used to store the Internet Archive data in the form of a database for easy querying.
Subset extraction is the block named Extract features. There are two main ways in which
the user can obtain information stored in the Internet Archive.
1. The user API will allow the user to select data from the ARC and DAT files based on
certain criteria that the user can set/select from the browser.
2. The API can also allow experienced users to write formal queries on the data in the query language understood by the Archive. If the query language used is different, a wrapper class can be used over the database to get the required information.
The database has a two-tier schema.
Each of the crawls has its own set of tables, so diagrammatically every crawl (Crawl 1 through Crawl n) contains the same five tables:
Page – Page ID, URL ID, Content ID, IP Address, ARC Time, MIME Type
URL – URL ID, Domain ID, Path
Domain – Domain ID, Domain
Link – Source Page ID, Destination Page ID, Pre-Context ID, Anchor ID, Post-Context ID, Link Position
Context – Context ID, Context
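As an illustration only, the per-crawl tables might be declared as follows. This is a minimal sketch: the spec fixes the table and column names above but not an engine or column types, so SQLite and the types shown here are assumptions.

    import sqlite3

    # A minimal sketch, assuming SQLite and illustrative column types.
    # Every crawl has its own database, so each crawl gets its own copy
    # of these tables (here, its own database file).
    def create_crawl_tables(conn: sqlite3.Connection) -> None:
        conn.executescript("""
            CREATE TABLE Domain (
                DomainID  INTEGER PRIMARY KEY,
                Domain    TEXT
            );
            CREATE TABLE URL (
                URLID     INTEGER PRIMARY KEY,
                DomainID  INTEGER REFERENCES Domain(DomainID),
                Path      TEXT
            );
            CREATE TABLE Page (
                PageID    INTEGER PRIMARY KEY,
                URLID     INTEGER REFERENCES URL(URLID),
                ContentID INTEGER,
                IPAddress TEXT,
                ARCTime   TEXT,      -- archiving time from the ARC file
                MIMEType  TEXT
            );
            CREATE TABLE Context (
                ContextID INTEGER PRIMARY KEY,
                Context   TEXT
            );
            CREATE TABLE Link (
                SourcePageID      INTEGER REFERENCES Page(PageID),
                DestinationPageID INTEGER REFERENCES Page(PageID),
                PreContextID      INTEGER REFERENCES Context(ContextID),
                AnchorID          INTEGER REFERENCES Context(ContextID),
                PostContextID     INTEGER REFERENCES Context(ContextID),
                LinkPosition      INTEGER
            );
        """)

    conn = sqlite3.connect("crawl_1.db")  # one database per crawl
    create_crawl_tables(conn)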
Each crawl is identified by its CrawlID and CrawlDate.
For identifying each page, we will need a combination of CrawlID, CrawlDate, and PageID.
Choice of PageID: a single subset may contain multiple pages with the same URL, so storing just the URL will not give us the exact page. The PageID can be used to uniquely identify the pages that are required by the user.
Query the archive – Subset extraction
When the user runs queries on the archive, the result will be a series of pages that were stored in the archive.
Properties of these pages:
- They can span more than one crawl.
- Within a single crawl, multiple pages with the same URL can be present.
- Each page has the following properties: Page ID, URL ID, IP Address, ARC Time, MIME Type.
Given just the PageID, it should be possible to extract its URL, fetch its contents, and find all its incoming and outgoing links.
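Under the illustrative SQLite schema sketched above (the joins and column names follow that sketch, so they carry the same assumptions), these lookups might read:

    import sqlite3

    def page_url(conn: sqlite3.Connection, page_id: int) -> str:
        # Join Page -> URL -> Domain to reconstruct the page's full URL.
        domain, path = conn.execute("""
            SELECT d.Domain, u.Path
            FROM Page p
            JOIN URL u    ON u.URLID = p.URLID
            JOIN Domain d ON d.DomainID = u.DomainID
            WHERE p.PageID = ?
        """, (page_id,)).fetchone()
        return domain + path

    def outgoing_links(conn: sqlite3.Connection, page_id: int) -> list[int]:
        # Pages that this page links to.
        rows = conn.execute(
            "SELECT DestinationPageID FROM Link WHERE SourcePageID = ?",
            (page_id,))
        return [page for (page,) in rows]

    def incoming_links(conn: sqlite3.Connection, page_id: int) -> list[int]:
        # Pages that link to this page.
        rows = conn.execute(
            "SELECT SourcePageID FROM Link WHERE DestinationPageID = ?",
            (page_id,))
        return [page for (page,) in rows]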
Subset Storage Schemes
The subset information can be stored in a database in the generic form:
CrawlID | CrawlTime | PageID
This tuple uniquely identifies a page in the resulting hits.
The three options for storing this information are:
1. Store the actual data itself: this can be fast and efficient if the amount of data is small.
2. Store references to the data records: this saves space by storing references instead of the actual CrawlID, CrawlTime, and PageID values.
3. Store the query generator itself: this can be run at any time to produce the resulting subset when the user wants to access the result.
Schema design
Storing the data: the schema would look like
CrawlID | Crawl Date/Time | PageID
Primary key:
If the PageID can be used to uniquely identify a page across crawls, then the PageID itself can act as the primary key of the schema.
If the Crawl Date/Time is the same as the Page table's archiving time, then this field can be removed from the stored schema and presented to the user on request, since it is a property of each of the pages.
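A minimal sketch of this "store the data" option, again assuming SQLite; the composite primary key reflects the caveat above, and PageID alone would suffice if it is unique across crawls:

    import sqlite3

    conn = sqlite3.connect("subset_store.db")  # hypothetical subset database
    conn.execute("""
        CREATE TABLE Subset (
            CrawlID       INTEGER,
            CrawlDateTime TEXT,
            PageID        INTEGER,
            -- PageID alone could be the key if it is unique across crawls.
            PRIMARY KEY (CrawlID, PageID)
        )
    """)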
Storing references:
This option stores only a reference to the Page record that corresponds to each result. Producing the final CrawlID and Crawl Date/Time might require accessing the other tables. If only the references are stored, the space taken up by the subset table is reduced.
Diagrammatically this would look like:
Reference to Page
Each of the references points to some Page in some Crawl.
For this to work, the PageID will have to be unique across the Crawls.
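The same sketch, reduced to references only (the table name and the SQLite engine remain assumptions):

    import sqlite3

    conn = sqlite3.connect("subset_store.db")
    conn.execute("""
        CREATE TABLE SubsetRefs (
            -- Each row is a reference to some Page in some Crawl;
            -- this works only if PageID is unique across the crawls.
            PageID INTEGER PRIMARY KEY
        )
    """)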
Storing the subset generator itself:
The subset generator is different from the user query. The user might input the query in the following ways:
1. By specifying key terms, like searching on Google.
2. By fields: if the user's UI has an option of searching on fields, then the user just has to select these.
3. By specifying the exact query in a query language.
Once the user gives the input, the query will have to be converted into a standard query language that is understood by the server. This converted query is the generator that can be stored and used in the future.
Each of the subset generators can be uniquely identified by a generator ID. The generators can be stored against these IDs as plain-text queries.
When the user wants the results, the generator can be executed to produce the result in the form of a table of contents/references.
A generator ID will have to include the user's identification so that it can be mapped back to the user who made the request.
Generator ID | Generator
Possible ways to generate the generator ID:
- Use a combination of username and creation time to uniquely identify it.
- Use a combination of username and a user-specified generator name to identify it.
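Both schemes embed the username, so the ID can be mapped back to the requesting user. A minimal sketch, where the separator, ID format, and example query are assumptions:

    import time

    def generator_id_from_time(username: str) -> str:
        # Option 1: username plus creation time.
        return f"{username}:{int(time.time())}"

    def generator_id_from_name(username: str, name: str) -> str:
        # Option 2: username plus a user-specified generator name.
        return f"{username}:{name}"

    # Generators stored against their IDs as plain-text queries.
    generators: dict[str, str] = {}
    gid = generator_id_from_name("swati", "html_pages")
    generators[gid] = "SELECT PageID FROM Page WHERE MIMEType = 'text/html'"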
Persistence:
The server can store the generators for users upon request, until the user specifies that a generator is no longer needed. Since the user-specified query is different from the generator produced by the server, the formal generator can also be sent to the user, who can use it in the future instead of making the server recompute it.
Storing the schema
When the server storing the Internet Archive gets the query, it can retrieve the required subset.
The subset can be stored as a table on the server itself, in any one of the formats described above.
If the user wishes to have the subset on their local system, the entire data will have to be transferred, either in the form of a schema or in the form of a list on the user's interface.
If the entire table is very big, subsets of it can be fetched from the server by the local system, or the subset table itself can be broken up into smaller, more manageable sub-subsets that are transferred by the server upon request.
This would be something like showing only the top ten ranked pages at first and letting the user decide if they want more information from the server.
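A sketch of this incremental transfer over the hypothetical Subset table from earlier, using LIMIT/OFFSET paging (ranking is not specified here, so the ORDER BY column is an assumption):

    import sqlite3

    def fetch_chunk(conn: sqlite3.Connection, limit: int = 10, offset: int = 0):
        # Transfer the subset in small, manageable pieces; the client
        # requests the next chunk only if the user asks for more.
        return conn.execute(
            "SELECT CrawlID, CrawlDateTime, PageID FROM Subset "
            "ORDER BY PageID LIMIT ? OFFSET ?",  # a rank column could replace PageID
            (limit, offset)).fetchall()

    conn = sqlite3.connect("subset_store.db")
    top_ten = fetch_chunk(conn, limit=10, offset=0)    # show these first
    next_ten = fetch_chunk(conn, limit=10, offset=10)  # fetched only on request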