Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
ISQS 6339, Data Management & Business Intelligence Extraction, Transformation, and Loading (II) Zhangxi Lin Texas Tech University 1 ISQS 6339, Data Management & Business Intelligence Agenda I. Using SSIS for ETL Integration Services Learn by doing Package items Problem-oriented package development II. The Principle of ETL Extraction Transformation Loading 2 ISQS 6339, Data Management & Business Intelligence II. The Principle of ETL 3 ISQS 6339, Data Management & Business Intelligence Structure and Components of Business Intelligence SSMS SSIS SSAS SSRS SAS EG 4 ISQS 6339, Data Management & Business Intelligence SAS EM Automating your routine information processing tasks Your routine information processing tasks Read online news at 8:00a and collect a few most important pieces Retrieve data from database to draft a short daily report at 10a View and reply emails and take some notes that are saved in a database View 10 companies’ webpage to see the updates. Input the summaries into a database Browse three popular magazines twice a week. Input the summaries into a database Generate a few one-way frequency and two-way frequency tables and put them on the web Merge datasets collected by other people into a main database. Prepare a weekly report using the database and at 4p every Monday, and publish it to the internal portal site. Prepare a monthly report at 11a on the first day of a month, which must be converted into a pdf file and uploaded to the website. Seems there are many things are on going. How to handle them properly in the right time? Organizer – yes How about regular data processing tasks? 5 ISQS 6339, Data Management & Business Intelligence Information Processing and Information Flow Transaction processing Interactions between a user and a computer application system with immediate responses from the application Operational processing Make use of computer to control a process Batch processing Consisting of a series of executions, each of which is applied to a set of data and turns the result to the next one. Analytical processing The interaction between analysts and collections of aggregated data that may have been reformulated into alternative representational forms for improved analytical performance. 6 ISQS 6339, Data Management & Business Intelligence Extraction, Transformation, Loading (ETL) Processes Extract source data Transform/clean data Operational systems Index and summarize Load data into warehouse Detect changes Refresh data Programs Gateways Tools Data Warehouse 7 ISQS 6339, Data Management & Business Intelligence ETL ETL: Tasks, Importance, and Cost Operational systems Extract Clean up Consolidate Restructure Load Maintain Refresh Data Warehouse ETL Relevant Useful Quality Accurate Accessible 8 ISQS 6339, Data Management & Business Intelligence Extracting Data Source systems Data from various data sources in various formats Extraction Routines Developed to select data fields from sources Consist of business rules, audit trails, error correction facilities Data mapping Transform Operational databases 9 Data staging area ISQS 6339, Data Management & Business Intelligence Warehouse database Production Data Operating system platforms File systems Database systems and vertical applications IMS DB2 Oracle Sybase Informix VSAM 10 ISQS 6339, Data Management & Business Intelligence SAP Shared Medical Systems Dun and Bradstreet Financials Hogan Financials Oracle Financials Archive Data Historical data Useful for analysis over long periods of time Useful for first-time load May require unique transformations Operation databases 11 ISQS 6339, Data Management & Business Intelligence Warehouse database Internal Data Planning, sales, and marketing organization data Maintained in the form of: Spreadsheets (structured) Documents (unstructured) Treated like any other source data Planning Marketing Accounting 12 ISQS 6339, Data Management & Business Intelligence Warehouse database External Data Information from outside the organization Issues of frequency, format, and predictability Described and tracked using metadata Purchased databases A.C. Nielsen, IRI, IMS, Walsh America Dun and Bradstreet Barron's 13 Warehousing databases ISQS 6339, Data Management & Business Intelligence Competitive information Economic forecasts Wall Street Journal Possible ETL Failures A missing source file A system failure Inadequate metadata Poor mapping information Inadequate storage planning A source structural change No contingency plan Inadequate data validation 14 ISQS 6339, Data Management & Business Intelligence Maintaining ETL Quality ETL must be: Tested Documented Monitored and reviewed Disparate metadata must be coordinated. 15 ISQS 6339, Data Management & Business Intelligence Transformation Transformation eliminates anomalies from operational data: Cleans and standardizes Presents subject-oriented data Extract Transform: Clean up Consolidate Restructure Operational systems Load Data Staging Area 16 ISQS 6339, Data Management & Business Intelligence Warehouse Remote Staging Model Data staging area within the warehouse environment Transform Extract Operational system Load Staging area Warehouse Data staging area in its own environment Transform Operational system 17 Extract Load Staging area ISQS 6339, Data Management & Business Intelligence Warehouse On-site Staging Model Data staging area within the operational environment, possibly affecting the operational system Transform Extract Operational system 18 Load Staging area ISQS 6339, Data Management & Business Intelligence Warehouse Data Anomalies No unique key Data naming and coding anomalies Data meaning anomalies between groups Spelling and text inconsistencies 19 CUSNUM NAME ADDRESS 90233479 Oracle Limited 100 N.E. 1st St. 90233489 Oracle Computing 15 Main Road, Ft. Lauderdale 90234889 Oracle Corp. UK 15 Main Road, Ft. Lauderdale, FLA 90345672 Oracle Corp UK Ltd 181 North Street, Key West, FLA ISQS 6339, Data Management & Business Intelligence Transformation Routines Cleaning data Eliminating inconsistencies Adding elements Merging data Integrating data Transforming data before load 20 ISQS 6339, Data Management & Business Intelligence Transforming Data: Problems and Solutions Multipart keys Multiple local standards Multiple files Missing values Duplicate values Element names Element meanings Input formats Referential Integrity constraints Name and address 2 1 ISQS 6339, Data Management & Business Intelligence Multipart Keys Problem Multipart keys Product code = 12 M 654313 45 Salesperson code Country Sales code territory 22 Product number ISQS 6339, Data Management & Business Intelligence Multiple Local Standards Problem Multiple local standards Tools or filters to preprocess 23 cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600 ISQS 6339, Data Management & Business Intelligence Multiple Files Problem Added complexity of multiple source files Start simple Multiple source files 24 Logic to detect correct source ISQS 6339, Data Management & Business Intelligence Transformed data Missing Values Problem Solution: Ignore Wait Mark rows Extract when time-stamped If NULL then field = ‘A’ 25 ISQS 6339, Data Management & Business Intelligence A Duplicate Values Problem Solution: SQL self-join techniques RDMBS constraint utilities ACME Inc ACME Inc ACME Inc SQL> 2 3 4 5 6 7 26 SELECT ... FROM table_a, table_b WHERE table_a.key (+)= table_b.key UNION SELECT ... FROM table_a, table_b WHERE table_a.key = table_b.key (+); ISQS 6339, Data Management & Business Intelligence Element Names Problem Solution: Customer Common naming conventions Client Contact Name 27 ISQS 6339, Data Management & Business Intelligence Customer Element Meaning Problem Avoid misinterpretation Complex solution Document meaning in metadata Customer’s name All customer details Customer_detail 28 ISQS 6339, Data Management & Business Intelligence All details except name Input Format Problem EBCDIC “123-73” ASCII 12373 ACME Co. áøåëéí äáàéí 29 ISQS 6339, Data Management & Business Intelligence Beer (Pack of 8) Referential Integrity Problem Solution: SQL anti-join Server constraints Dedicated tools Departme nt Emp Name Department 10 1099 Smith 10 1289 Jones 20 1234 Doe 50 6786 Harris 60 20 30 40 30 ISQS 6339, Data Management & Business Intelligence Name and Address Problem Single-field format Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565 Multiple-field format 31 Database 1 Name Mr. J. Smith NAME LOCATION Street 100 Main St. DIANNE ZIEFELD N100 Town Bigtown HARRY H. ENFIELD M300 Country County Luth Code 23565 Database 2 NAME LOCATION ZIEFELD, DIANNE 100 ENFIELD, HARRY H 300 ISQS 6339, Data Management & Business Intelligence Quality Data: Importance and Benefits Quality data: ◦ Key to a successful warehouse implementation Quality data helps you in: ◦ Targeting right customers ◦ Determining buying patterns ◦ Identifying householders: private and commercial ◦ Matching customers ◦ Identify historical data 32 ISQS 6339, Data Management & Business Intelligence Data Quality Guidelines Operational data: Should not be used directly in the warehouse Must be cleaned for each increment Is not simply fixed by modifying applications 33 ISQS 6339, Data Management & Business Intelligence Transformation Techniques Merging data Adding a Date Stamp Adding Keys to Data 3 4 ISQS 6339, Data Management & Business Intelligence Merging Data Operational transactions do not usually map one-to-one with warehouse data. Data for the warehouse is merged to provide information for analysis. Pizza sales/returns by day, hour, seconds 35 Sale 1/2/02 12:00:01 Ham Pizza $10.00 Sale 1/2/02 12:00:02 Cheese Pizza $15.00 Sale 1/2/02 12:00:02 Anchovy Pizza $12.00 Return 1/2/02 12:00:03 Anchovy Pizza - $12.00 Sale Sausage Pizza $11.00 1/2/02 12:00:04 ISQS 6339, Data Management & Business Intelligence Merging Data Pizza sales/returns by day, hour, seconds Sale 1/2/02 12:00:01 Ham Pizza $10.00 Sale 1/2/02 12:00:02 Cheese Pizza $15.00 Sale 1/2/02 12:00:02 Anchovy Pizza $12.00 Return 1/2/02 12:00:03 Anchovy Pizza - $12.00 Sale Sausage Pizza $11.00 1/2/02 12:00:04 Pizza sales 36 Sale 1/2/02 12:00:01 Ham Pizza $10.00 Sale 1/2/02 12:00:02 Cheese Pizza $15.00 Sale 1/2/02 12:00:04 Sausage Pizza ISQS 6339, Data Management & Business Intelligence $11.00 Adding a Date Stamp Time element can be represented as a: ◦ Single point in time ◦ Time span Add time element to: ◦ Fact tables ◦ Dimension data 37 ISQS 6339, Data Management & Business Intelligence Adding a Date Stamp: Fact Tables and Dimensions Product Table Product_id Time_key Product_desc Time Table Week_id Period_id Year_id Time_key 38 Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units ISQS 6339, Data Management & Business Intelligence Item Table Item_id Dept_id Time_key Adding Keys to Data #1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 #4 Return 1/2/98 12:00:03 Anchovy Pizza - $12.00 #5 Sale 12:00:04 Sausage Pizza $11.00 1/2/98 Data values 39 or artificial keys #dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 ISQS 6339, Data Management & Business Intelligence Summarizing Data 1. 2. During extraction on staging area After loading to the warehouse server Operational databases 40 Staging area ISQS 6339, Data Management & Business Intelligence Warehouse database Maintaining Transformation Metadata Transformation metadata contains: ◦ Transformation rules ◦ Algorithms and routines Sources Stage Rules Extract Transform 41 ISQS 6339, Data Management & Business Intelligence Publish Load Query Maintaining Transformation Metadata Restructure keys Identify and resolve coding differences Validate data from multiple sources Handle exception rules Identify and resolve format differences Fix referential integrity inconsistencies Identify summary data 42 ISQS 6339, Data Management & Business Intelligence Transformation Timing and Location ◦ Transformation is performed: Before load In parallel ◦ Can be initiated at different points: On the operational platform In a separate staging area 43 ISQS 6339, Data Management & Business Intelligence Monitoring and Tracking Transformations should: Be self-documenting Provide summary statistics Handle process exceptions 44 ISQS 6339, Data Management & Business Intelligence Loading Data into the Warehouse Loading moves the data into the warehouse Loading can be time-consuming: Consider the load window Schedule and automate the loading Initial load moves large volumes of data Subsequent refresh moves smaller volumes of data Transform Extract 45 Transport, Load Operational Staging area Warehouse database ISQSdatabases 6339, Data Management & Business Intelligence Initial Load and Refresh Initial Load: Single event that populates the database with historical data Involves large volumes of data Employs distinct ETL tasks Involves large amounts of processing after load Refresh: Performed according to a business cycle Less data to load than first-time load Less-complex ETL tasks Smaller amounts of post-load processing 46 ISQS 6339, Data Management & Business Intelligence Data Refresh Models: Extract Processing Environment After each time interval, build a new snapshot of the database. Purge old snap shots. Operational databases T1 47 ISQS 6339, Data Management & Business Intelligence T2 T3 Data Refresh Models: Warehouse Processing Environment Build a new database. After each time interval, add changes to database. Archive or purge oldest data. Operational databases T1 48 ISQS 6339, Data Management & Business Intelligence T2 T3 Building the Loading Process Techniques and tools File transfer methods The load window Time window for other tasks First-time and refresh volumes Frequency of the refresh cycle Connectivity bandwidth 49 ISQS 6339, Data Management & Business Intelligence Building the Loading Process Test the proposed technique Document proposed load Monitor, review, and revise 50 ISQS 6339, Data Management & Business Intelligence Data Granularity Important design and operational issue Low-level grain: Expensive, high level of processing, more disk space, more details High-level grain: Cheaper, less processing, less disk space, little details 51 ISQS 6339, Data Management & Business Intelligence Loading Techniques Tools Utilities and 3GL Gateways Customized copy programs Replication FTP Manual 52 ISQS 6339, Data Management & Business Intelligence Loading Technique Considerations Tools are comprehensive, but costly. Data-movement utilities are fast and powerful. Gateways are suitable for specific instances: Access other databases Supply dependent data marts Support a distributed environment Provide real-time access if needed Use customized programs as a last resort. Replication is limited by data-transfer rates. 53 ISQS 6339, Data Management & Business Intelligence Post-Processing of Loaded Data Transform Extract Load Staging area Create indexes Generate keys Post-processing of loaded data Summarize 54 ISQS 6339, Data Management & Business Intelligence Filter Warehouse Creating Derived Keys The use of derived or generalized keys is recommended to maintain the uniqueness of a row. Methods: Concatenate operational key with a number Assign a number sequentially from a list 55 109908 109908 01 109908 100 ISQS 6339, Data Management & Business Intelligence Summary Management Summary tables Materialized views Summary data 56 ISQS 6339, Data Management & Business Intelligence Filtering Data From warehouse to data marts Summary data Warehouse Data marts 57 ISQS 6339, Data Management & Business Intelligence Verifying Data Integrity Load data into intermediate file. Compare target flash totals with totals before load. Counts & Amounts = Flash Totals Load 58 Intermediate file Target Counts & Amounts Flash Totals = ISQS 6339, Data Management & Business Intelligence Preserve, inspect, fix, then load Steps for Verifying Data Integrity Source files Source files Source files Control Target 3 4 1 Extract SQL*Loader 5 2 6 7 59 ISQS 6339, Data Management & Business Intelligence .log .bad Standard Quality Assurance Checks Load status Completion of the process Completeness of the data Data reconciliation Referential integrity violations Reprocessing Comparison of counts and amounts 1+1=3 60 ISQS 6339, Data Management & Business Intelligence