WEB LAB - Subset Extraction: Requirements and Specification
Swati Singhal
Megha Siddavanahalli
The database schema required for subset extraction has to be designed to scale to data on the order of terabytes.
Information retrieval and storage is critical and one of our major concerns. Our solution to data storage is to use a Database Management System (DBMS).
The factors that influence subset extraction and the storage of the retrieved data are:

- The database should be scalable with respect to user queries; the user may query on general and broad topics, which will produce a large set of documents.
- Users should have the option to store the retrieved data subset on their own system, so that the data retrieved by a particular query can be used for further research without running the same query again.
- A problem faced by subset extraction is storing the data on the client machine while still being able to retrieve all of the data in the ARC files using the schema.
- An efficient mechanism is required for loading data records into the database, and for retrieving them from it, when utilizing the data for research purposes.
- Another challenge in subset extraction is storing data that spans different crawls. The pages in the database are stored per crawl, and the same URL may be stored in several crawls, but each copy of the URL will have a different time associated with it.
The underlying database that stores the DAT files is shown in the following section. A separate database is maintained for every crawl, and each crawl is given a unique crawl ID.
In the database that we use for query results, we simply store the crawl ID, crawl time, and page ID.
The archive database
The archive database is used to store the Internet Archive data in the form of a database for easy querying.
Subset extraction is the block named Extract features. There are two main ways in which
the user can obtain information stored in the Internet Archive.
1. The user API will allow the user to select data from the ARC and DAT files based on
certain criteria that the user can set/select from the browser.
2. The API can also allow experienced users to write formal queries on the data in the query language understood by the Archive. If the query language used is different, a wrapper class can be used over the database to get the required information.
The database has a two-tier schema.
Each of the crawls has its own set of tables, so diagrammatically every crawl (Crawl 1 through Crawl n) contains the same five tables:
Page – Page ID, URL ID, Content ID, IP Address, ARC Time, MIME Type
URL – URL ID, Domain ID, Path
Domain – Domain ID, Domain
Link – Source Page ID, Destination Page ID, Pre-Context ID, Anchor ID, Post-Context ID, Link Position
Context – Context ID, Context
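As an illustration only, the per-crawl tables might be declared as follows. This is a minimal sketch: the spec fixes the table and column names above but not an engine or column types, so SQLite and the types shown here are assumptions.

    import sqlite3

    # A minimal sketch, assuming SQLite and illustrative column types.
    # Every crawl has its own database, so each crawl gets its own copy
    # of these tables (here, its own database file).
    def create_crawl_tables(conn: sqlite3.Connection) -> None:
        conn.executescript("""
            CREATE TABLE Domain (
                DomainID  INTEGER PRIMARY KEY,
                Domain    TEXT
            );
            CREATE TABLE URL (
                URLID     INTEGER PRIMARY KEY,
                DomainID  INTEGER REFERENCES Domain(DomainID),
                Path      TEXT
            );
            CREATE TABLE Page (
                PageID    INTEGER PRIMARY KEY,
                URLID     INTEGER REFERENCES URL(URLID),
                ContentID INTEGER,
                IPAddress TEXT,
                ARCTime   TEXT,      -- archiving time from the ARC file
                MIMEType  TEXT
            );
            CREATE TABLE Context (
                ContextID INTEGER PRIMARY KEY,
                Context   TEXT
            );
            CREATE TABLE Link (
                SourcePageID      INTEGER REFERENCES Page(PageID),
                DestinationPageID INTEGER REFERENCES Page(PageID),
                PreContextID      INTEGER REFERENCES Context(ContextID),
                AnchorID          INTEGER REFERENCES Context(ContextID),
                PostContextID     INTEGER REFERENCES Context(ContextID),
                LinkPosition      INTEGER
            );
        """)

    conn = sqlite3.connect("crawl_1.db")  # one database per crawl
    create_crawl_tables(conn)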
Each crawl is identified by its CrawlID and CrawlDate.
For identifying each page, we will need a combination of CrawlID, CrawlDate, and PageID.
Choice of PageID: a single subset may contain multiple pages with the same URL, so storing just the URL will not give us the exact page. The PageID can be used to uniquely identify the pages that are required by the user.
Query the archive – Subset extraction
When the user runs queries on the archive, the result will be a series of pages that were stored in the archive.
Properties of these pages:
- They can span more than one crawl.
- Within a single crawl, multiple pages with the same URL can be present.
- Each page has the following properties: Page ID, URL ID, IP Address, ARC Time, MIME Type.
Given just the PageID, it should be possible to extract its URL, fetch its contents, and find all its incoming and outgoing links.
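Under the illustrative SQLite schema sketched above (the joins and column names follow that sketch, so they carry the same assumptions), these lookups might read:

    import sqlite3

    def page_url(conn: sqlite3.Connection, page_id: int) -> str:
        # Join Page -> URL -> Domain to reconstruct the page's full URL.
        domain, path = conn.execute("""
            SELECT d.Domain, u.Path
            FROM Page p
            JOIN URL u    ON u.URLID = p.URLID
            JOIN Domain d ON d.DomainID = u.DomainID
            WHERE p.PageID = ?
        """, (page_id,)).fetchone()
        return domain + path

    def outgoing_links(conn: sqlite3.Connection, page_id: int) -> list[int]:
        # Pages that this page links to.
        rows = conn.execute(
            "SELECT DestinationPageID FROM Link WHERE SourcePageID = ?",
            (page_id,))
        return [page for (page,) in rows]

    def incoming_links(conn: sqlite3.Connection, page_id: int) -> list[int]:
        # Pages that link to this page.
        rows = conn.execute(
            "SELECT SourcePageID FROM Link WHERE DestinationPageID = ?",
            (page_id,))
        return [page for (page,) in rows]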
Subset Storage Schemes
The subset information can be stored in a database in the generic form:
CrawlID | CrawlTime | PageID
This tuple uniquely identifies a page in the resulting hits.
The three options for storing this information are:
1. Store the actual data itself: this can be fast and efficient if the amount of data is small.
2. Store references to the data records: this saves space by storing references instead of the actual CrawlID, CrawlTime, and PageID values.
3. Store the query generator itself: this can be run at any time to produce the resulting subset when the user wants to access the result.
Schema design
Storing the data: the schema would look like
CrawlID | Crawl Date/Time | PageID
Primary key:
If the PageID can be used to uniquely identify a page across crawls, then the PageID itself can act as the primary key of the schema.
If the Crawl Date/Time is the same as the Page table's archiving time, then this field can be removed from the stored schema and presented to the user on request, since it is a property of each of the pages.
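A minimal sketch of this "store the data" option, again assuming SQLite; the composite primary key reflects the caveat above, and PageID alone would suffice if it is unique across crawls:

    import sqlite3

    conn = sqlite3.connect("subset_store.db")  # hypothetical subset database
    conn.execute("""
        CREATE TABLE Subset (
            CrawlID       INTEGER,
            CrawlDateTime TEXT,
            PageID        INTEGER,
            -- PageID alone could be the key if it is unique across crawls.
            PRIMARY KEY (CrawlID, PageID)
        )
    """)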
Storing references:
This option stores only a reference to the Page record that corresponds to each result. Producing the final CrawlID and Crawl Date/Time might require accessing the other tables. If only the references are stored, the space taken up by the subset table is reduced.
Diagrammatically this would look like:
Reference to Page
Each of the references points to some Page in some Crawl.
For this to work, the PageID will have to be unique across the Crawls.
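The same sketch, reduced to references only (the table name and the SQLite engine remain assumptions):

    import sqlite3

    conn = sqlite3.connect("subset_store.db")
    conn.execute("""
        CREATE TABLE SubsetRefs (
            -- Each row is a reference to some Page in some Crawl;
            -- this works only if PageID is unique across the crawls.
            PageID INTEGER PRIMARY KEY
        )
    """)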
Storing the subset generator itself:
The subset generator is different from the user query. The user might input the query in the following ways:
1. By specifying key terms, like searching on Google.
2. By fields: if the user's UI has an option of searching on fields, then the user just has to select these.
3. By specifying the exact query in a query language.
Once the user gives the input, the query will have to be converted into a standard query language that is understood by the server. This converted query is the generator that can be stored and used in the future.
Each of the subset generators can be uniquely identified by a generator ID. The generators can be stored against these IDs as plain-text queries.
When the user wants the results, the generator can be executed to produce the result in the form of a table of contents/references.
A generator ID will have to include the user's identification so that it can be mapped back to the user who made the request.
Generator ID | Generator
Possible ways to generate the generator ID:
- Use a combination of username and creation time to uniquely identify it.
- Use a combination of username and a user-specified generator name to identify it.
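Both schemes embed the username, so the ID can be mapped back to the requesting user. A minimal sketch, where the separator, ID format, and example query are assumptions:

    import time

    def generator_id_from_time(username: str) -> str:
        # Option 1: username plus creation time.
        return f"{username}:{int(time.time())}"

    def generator_id_from_name(username: str, name: str) -> str:
        # Option 2: username plus a user-specified generator name.
        return f"{username}:{name}"

    # Generators stored against their IDs as plain-text queries.
    generators: dict[str, str] = {}
    gid = generator_id_from_name("swati", "html_pages")
    generators[gid] = "SELECT PageID FROM Page WHERE MIMEType = 'text/html'"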
Persistence:
The server can store the generators for users upon request, until the user specifies that a generator is no longer needed. Since the user-specified query is different from the generator produced by the server, the formal generator can also be sent to the user, who can use it in the future instead of making the server recompute it.
Storing the schema
When the server storing the Internet Archive gets the query, it can retrieve the required subset.
The subset can be stored as a table on the server itself, in any one of the formats described above.
If the user wishes to have the subset on their local system, the entire data will have to be transferred, either in the form of a schema or in the form of a list on the user's interface.
If the entire table is very big, subsets of it can be fetched from the server by the local system, or the subset table itself can be broken up into smaller, more manageable sub-subsets that are transferred by the server upon request.
This would be something like showing only the top ten ranked pages at first and letting the user decide if they want more information from the server.
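A sketch of this incremental transfer over the hypothetical Subset table from earlier, using LIMIT/OFFSET paging (ranking is not specified here, so the ORDER BY column is an assumption):

    import sqlite3

    def fetch_chunk(conn: sqlite3.Connection, limit: int = 10, offset: int = 0):
        # Transfer the subset in small, manageable pieces; the client
        # requests the next chunk only if the user asks for more.
        return conn.execute(
            "SELECT CrawlID, CrawlDateTime, PageID FROM Subset "
            "ORDER BY PageID LIMIT ? OFFSET ?",  # a rank column could replace PageID
            (limit, offset)).fetchall()

    conn = sqlite3.connect("subset_store.db")
    top_ten = fetch_chunk(conn, limit=10, offset=0)    # show these first
    next_ten = fetch_chunk(conn, limit=10, offset=10)  # fetched only on request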