Big data techniques and Applications – A General Review
Dr. Jun Li, jun.li@wlv.ac.uk
School of Mathematics and Computer Science, University of Wolverhampton

OUTLINE
- Big Data Concept
- Different Schools: Hadoop, HPCC, Splunk
- Databases and NoSQL
- Parallel/Distributed Computing & Databases
- Research Scenarios

BIG DATA CONCEPT
Big Data is characterized by three Vs:
- Volume: terabytes (10^12), yottabytes (10^24), brontobytes (10^27) and geopbytes (10^30)
- Velocity: the speed at which the data is generated and processed
- Variety: from unstructured (raw files and log files) to structured (relational databases), with different types such as messages, social media conversations, photos, sensor data, video and voice recordings
Everything we do leaves a digital trace, which can be used and analysed. Because of their size and complexity, such data sets cannot be processed and analysed through traditional methods such as an RDBMS.

BIG DATA EXAMPLES
- A supermarket could use its loyalty card data and monitor social media sites to get an overall view of customer behaviour and preferences.
- Hospitals analyse medical data and patient records to predict whether a certain type of treatment is efficacious, e.g. fractal analysis of large numbers of medical images.
- Calculate information entropy by language, person ID and characters, i.e. Personal Information Entropy (PIE), through data from social media and web pages etc.

Fractal Analysis
- An image is called "fractal" if it displays self-similarity: e.g. the tree shown can be split into parts, each of which is (at least approximately) a reduced-size copy of the whole.
- A possible characterisation of a fractal set is provided by the "box-counting" method.
- The number of boxes is counted at several different box sizes for one image, as shown in the blue curve; the calculation is time-consuming.
- What if there are hundreds of thousands of images (in large storage)?
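The box-counting method described above can be sketched in a few lines of plain Python. This is a minimal illustration, not production code: the function name and the test shape are made up for the example, and a real image would first be thresholded into a set of pixel coordinates.

```python
import math

def box_counting_dimension(points, sizes=(1, 2, 4, 8, 16)):
    """Estimate the fractal (box-counting) dimension of a set of 2-D
    points: count the occupied boxes at each box size, then fit the
    slope of log(count) against log(1/size) by least squares."""
    logs = []
    for s in sizes:
        # Each point falls into exactly one box of side length s.
        boxes = {(x // s, y // s) for x, y in points}
        logs.append((math.log(1.0 / s), math.log(len(boxes))))
    n = len(logs)
    mx = sum(x for x, _ in logs) / n
    my = sum(y for _, y in logs) / n
    num = sum((x - mx) * (y - my) for x, y in logs)
    den = sum((x - mx) ** 2 for x, _ in logs)
    return num / den  # the slope is the estimated dimension

# A completely filled square is not fractal: its dimension is 2.
square = [(x, y) for x in range(64) for y in range(64)]
dim = box_counting_dimension(square)  # close to 2.0
```

For a genuinely fractal point set (e.g. a Sierpinski triangle) the same slope comes out non-integer, around 1.585; the slides' "sliding box" variant would move the box one pixel at a time instead of tiling, which is why it is so much more expensive.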
- What if we count with a sliding box – one pixel per move, horizontally and vertically?

What is Hadoop
- An Apache open-source framework for distributed computing and data storage.
- Developed for large-scale computation and data processing on a network of commodity (i.e. affordable) hardware.
- Moves computation (i.e. applications) to the data, rather than moving data around.
(Diagram slides: Hadoop Architecture, Hadoop Logical Deployment, Hadoop Physical Deployment, Hadoop Data Import/Export.)

Hadoop Architecture
- HDFS – Hadoop Distributed File System
- MapReduce – a YARN-based system for parallel processing of large data sets
- YARN – a framework for job scheduling and cluster resource management, a step towards a distributed operating system
- HBase – a non-relational, distributed database
- Hive – a data warehouse infrastructure for data summarization, query and analysis
- Pig – a high-level platform for creating MapReduce programs using the language Pig Latin

HDFS – Hadoop Distributed File System
- HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines, in blocks (64 MB or 128 MB)
- Data-aware (metadata held in memory)
- Runs on top of native filesystems

HDFS Daemons
- Namenode: manages the file system's namespace/metadata of file blocks
- Datanodes: store and retrieve data blocks and report to the namenode
- Secondary Namenode: keeps snapshots of the primary namenode's directory information
(Diagram slides: uploading a file; file distribution by locations and blocks.)

HBase
- A non-relational, distributed database running on top of HDFS
- Column-oriented key-value store (NoSQL)
- Supports random real-time CRUD operations (unlike HDFS)
- Integrated with the MapReduce framework
- Not an ACID-compliant database

What is NoSQL
- NoSQL: Not only SQL; schema-free
- Provides a mechanism for storage and retrieval of data that is modelled in data structures such as key-value, graph or document, rather than an RDBMS.
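The MapReduce model listed in the Hadoop architecture above can be simulated in plain Python. This is a pure in-memory sketch of the three phases (map, shuffle, reduce) applied to the classic word-count problem – it uses no actual Hadoop API, and the function names are illustrative only:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "splits", as if stored in separate HDFS blocks.
splits = ["big data big ideas", "data moves to computation"]
pairs = [p for doc in splits for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 2 and counts["data"] == 2
```

In real Hadoop, each split's map tasks and each key's reduce task run on different cluster nodes; the shuffle is the network-heavy step in between.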
NoSQL Applied in Big Data
- NoSQL databases use Map/Reduce to query and index the database
- Map/Reduce tasks are distributed among multiple nodes for parallel processing

What is a Key-Value Pair Database
KVP examples:
- Color: Blue
- Libation: Beer
- Hero: Soldier
and, with user-scoped keys:
- FacebookUser12345_Color: Red
- TwitterUser67890_Color: Brownish
- FoursquareUser45678_Libation: "White wine"
- Google+User24356_Libation: "Dry martini with a twist"
- LinkedInUser87654_Hero: "Top sales performer"

What is the Column-Oriented Data Model
- Stores data in columns, by block
- The primary key is the data
- Assumes whole-row operations are rare

HBase Data Model
- Key: row _ column family _ column
- E.g. a personal information table with column families

HBase
- Cells are stored by column family as a file (HFile) on HDFS
- Cells that are not set are not stored (no NULLs)
- A table is made of column families
(Slides: create table; insert data; retrieve data.)

NoSQL DATABASES
Types of NoSQL databases:
- Column
- Document
- Key-value pair
- Graph
- Multi-model

DATABASES
Database models:
- Hierarchical databases
- Network databases
- Relational databases
- Object-oriented databases
- Object-relational databases
- Entity-Attribute-Value (EAV) data model
- Semi-structured model
- Associative model
- Context model

HIERARCHICAL DATABASES
- The data is organized into a tree-like structure.
- An entity type corresponds to a table in the relational database model, and a record corresponds to a row.
- The hierarchical model was the basis of IBM's first database, IMS (Information Management System), released in the late 1960s.
- A hierarchical schema consists of record types and PCR types:
  - A record/segment is a collection of field values; records of the same type are grouped into record types.
  - A PCR type (parent-child relationship type) is a 1:N relationship between two record types.
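The 1:N parent-child (PCR) structure described above can be sketched as a small Python class. The record types and field names below are hypothetical, echoing the slides' department/employee/project example; the traversal implements the "top-down, left-right" sequential order used for physical storage of an occurrence tree:

```python
class Record:
    """A record in a hierarchical (occurrence-tree) database:
    one parent, any number of children (the 1:N PCR type)."""
    def __init__(self, rtype, fields):
        self.rtype = rtype        # record type, e.g. "department"
        self.fields = fields      # the record's field values
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child

def preorder(record):
    """'Top-down, left-right' order: the root first, then each
    subtree in turn - the array layout of an occurrence tree."""
    yield record
    for child in record.children:
        yield from preorder(child)

dept = Record("department", {"dname": "Math"})
dept.add_child(Record("employee", {"name": "Jones"}))
dept.add_child(Record("project", {"pname": "MI125"}))
order = [r.fields for r in preorder(dept)]
# order: department first, then employee, then project
```

The linked-list variants in the next slides replace this implicit array order with explicit pointers (first child, next sibling), which makes insertions cheap at the cost of extra pointer storage.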
HIERARCHICAL DATABASES – PDBR (Physical Data Base Record)
- Example PCR type: department (dname, dnumber, mgrname, mgrstartdate) with child record types employee (name, ssn, bdate, address) and project (pname, pnumber, plocation)

HIERARCHICAL DATABASES – LOGICAL ORGANIZATION
- Logically organized as a PDB (Physical Data Base) – a collection of occurrence trees.
- An occurrence tree: the root is a single record with multiple child records, e.g. the Math Department record with employee children (Jones, Tom, Mary) and project children (MI125, ...).

HIERARCHICAL DATABASES – PHYSICAL ORGANIZATION IN STORAGE
- Sequential order using an array – "top-down, left-right"
- Sequential method using a linked list instead of an array
- Doubly linked list: one pointer to the first child, another to the neighbouring sibling

NETWORK DATABASES
- Very similar to the hierarchical model; the hierarchical model is a subset of the network model.
- But child tables are allowed to have more than one parent.

Network database concepts:
- Record – represents an object (e.g. customer, branch)
- Set – represents a one-to-many relationship (e.g.
depositor, consisting of customer and account)

NETWORK DATABASES – DATA STORE STRUCTURE
- Data is organized by set (the example slide shows three set values per set type)

NETWORK DATABASES – DML COMMANDS
- find – locates a record or set in the database
- get – gets a copy of the record from the database
- store – inserts a record into the database
- modify – modifies the current record
- erase – deletes the current record
- connect – inserts a new record into a set: connect <record> to <set>
- disconnect – removes a record from a set: disconnect <record> from <set>

NETWORK DATABASES – ADVANTAGES AND DISADVANTAGES
Advantages of the network database model:
- Because it supports many-to-many relationships, any table record can be reached easily
- For complex data it is easier to use, because of the multiple relationships among the data
Disadvantages of the network database model:
- Difficult for first-time users
- Alterations of the database are difficult, because newly entered information can alter the entire database

MapReduce
- Now MapReduce 2.0 on YARN – Yet Another Resource Negotiator
- YARN replaced the resource management and job scheduler
(Diagram slides: YARN daemons deployment; word-count MapReduce example.)

Parallel Computing: Data Decomposition, Task Dependency and Interaction
- Sparse matrix-vector multiplication: given an n x n sparse matrix A and a vector b, y = A × b
- In parallel, each element is computed as y[i] = Σ_{j=1..n} A[i, j] × b[j]
- Each process owns y[i], A[i, *] and b[i]

Parallel Computing: Exploratory Decomposition
- 15-puzzle problem: a number can be moved into the blank position
- Determine a path/sequence, or the shortest path/sequence, to the final configuration (here, the sequence 1 to 15)

Parallel Computing Design
- Decomposition techniques
- Characteristics of tasks, as shown
in the examples above:
- Task generation (static or dynamic)
- Task sizes (i.e. time required to complete, or data sizes)
- Knowledge of task sizes
- Inter-task relations (i.e. dependency, acyclic) and interactions
- Mapping tasks to processes for load balancing

Parallel Algorithm Models
- The data-parallel model
- The task graph model
- The work pool model
- The master-slave model
- The pipeline (producer-consumer) model

MapReduce Workflows using Oozie
- Describes workflows in a set of XML and configuration files
- Has a coordinator engine that schedules workflows based on time and incoming data
- Provides the ability to re-run failed portions of a workflow
- No directed cycles

Hadoop Support for Relational Databases
- Hive provides an SQL-like query language named HiveQL, but NOT low-latency or real-time queries; supports table partitioning (partitioning and bucketing)
- Pig Latin uses bag (table), tuple and field
- Both run on HDFS and MapReduce (data stored in file format); Hive compiles down to MapReduce and HDFS

Concurrency Control
- Aims to prevent transactions from conflicting with each other. Problems normally occur if more than one transaction tries to access the same record, or set of records, at the same time.
- Solutions: timestamping algorithms, optimistic algorithms, pessimistic algorithms

Distributed systems/databases – Essential Requirements
- Location transparency: a data naming scheme, or a dictionary in the case of databases
- Data fragmentation & replication: an intelligent optimizer for fragmentation and queries, to minimize the cost (I/O cost + CPU cost + communication cost); the update issue of replication
- Transaction scheduling: ACID; two-phase commit (coordinated by an agent); concurrency issues (where is the lock manager?)
- The above three require a distributed operating system – is that why YARN was developed?
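The pessimistic (locking) approach to concurrency control listed above can be sketched with threads in Python. This is a minimal illustration with made-up names – a real DBMS lock manager also handles lock modes, deadlock detection and two-phase locking – but it shows why the read-modify-write must happen under the lock:

```python
import threading

class Table:
    """A toy table where each record is guarded by its own lock,
    so concurrent updates to the same record never conflict."""
    def __init__(self):
        self.rows = {"balance": 0}
        self.locks = {"balance": threading.Lock()}

    def update(self, key, delta):
        # Pessimistic: acquire the record lock before reading,
        # release only after the write completes.
        with self.locks[key]:
            current = self.rows[key]
            self.rows[key] = current + delta

table = Table()
threads = [threading.Thread(target=table.update, args=("balance", 1))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 100 increments survive: table.rows["balance"] == 100
```

An optimistic scheme would skip the lock, re-check the record version at commit time, and retry on conflict; a timestamping scheme would order the transactions by their start timestamps instead.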
Distributed systems/databases – Time Synchronization and Global State
- We cannot synchronize clocks perfectly across a distributed system, and therefore cannot use physical time to find the order of an arbitrary pair of events occurring within it.
- Lamport logical time.
- To examine whether a particular property holds – e.g. to determine whether there is a deadlock, or for global debugging – we need a consistent global state (cause and effect).

Real-Time Stream Processing – Spark
- User interfaces, e.g. SQL (provided by Hive) and real-time streaming.
- Transparent interfaces connect the lower-level components, e.g. YARN and HDFS.
- At the client, a program is launched through a 'standalone manager':
  bin/spark-submit --master spark://host:7077 --executor-memory 10g myProgram.py
- Spark converts a user program into tasks, i.e. a directed acyclic graph (DAG), then launches workers (executors) and schedules them.

Spark Streaming
- Data is split by time interval: receivers turn the input data streams into batches T0, T1, T2, ... Tn, and results are pushed to external systems.
- Input, processing and output are distributed over different worker nodes, scheduled by the driver. The driver program (StreamingContext/SparkContext) submits Spark jobs to process the received data; executors on the worker nodes run the long-lived receiver task and the processing tasks; received data is replicated; results are output in batches.

Splunk
- Reads almost any type of data (even in real time) into Splunk's internal repository, adds indexes and creates events – the data unit in Splunk.
- Users can then set up metrics and dashboards (using Splunk) that support basic business intelligence, analytics and reporting on key performance indicators (KPIs).
- A NoSQL query approach is used, reportedly based on the Unix pipeline concept, which does not involve or impose any predefined schema: the Search Processing Language (SPL).
(Diagram slide: Splunk architecture – load, indexing, functions & interfaces.)
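The Lamport logical time mentioned earlier can be sketched in a few lines. This is an illustrative toy (class and method names are made up): each process keeps a counter, ticks it on every local event and send, and on receipt jumps to max(local, message timestamp) + 1, so every receive is ordered after its matching send:

```python
class Process:
    """One process with a Lamport logical clock."""
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock            # timestamp carried by the message

    def receive(self, msg_time):
        # Jump past both our own history and the sender's timestamp.
        self.clock = max(self.clock, msg_time) + 1
        return self.clock

p, q = Process(), Process()
p.local_event()      # p's clock: 1
t = p.send()         # p's clock: 2; message carries timestamp 2
q.receive(t)         # q's clock: max(0, 2) + 1 = 3
```

Lamport clocks give a consistent ordering of causally related events without synchronized physical clocks, which is exactly what the slide says physical time cannot provide; they do not, however, detect concurrency (vector clocks are needed for that).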
Splunk – Conventional Use Cases
- Investigational searching: a Splunk app (or application) can be a simple search collecting events, a group of alerts categorized for efficiency (or for many other reasons), or an entire program developed using Splunk's REST API.
- Monitoring and alerting: monitor any infrastructure (e.g. Windows event logs) in real time.
- Decision support analysis.

Splunk Deployment
- Dedicated search head: an instance that handles search management functions, directing search requests to a set of search peers and then merging the results back to users.
- Forwarder: gathers data from a variety of inputs and forwards it to a Splunk Enterprise server for indexing and searching.

HPCC – High-Performance Computing Cluster
- The Thor cluster is for extract, transform, load (ETL) processing of raw data, as well as large-scale complex analytics and the creation of keyed data and indexes for the Roxie cluster. Thor is similar in function, execution environment, filesystem and capabilities to Hadoop MapReduce.
- The Roxie cluster is designed as an online high-performance structured query and analysis platform, or data warehouse, delivering the parallel data-access processing requirements of online applications. Roxie is similar in function and capabilities to Hadoop with HBase and Hive, and provides near-real-time, predictable queries.

Big Data Complexity and Lambda
- Operational complexity, e.g. index compaction at times for all nodes.
- Eventual-consistency complexity, e.g. two replicas with a count of 10, where one increases it by 2 and the other by 1: what should the merged value be?
- Lack of human-fault tolerance: programming mistakes.
- CAP theorem – you can have at most two of Consistency, Availability and Partition tolerance. In our context: 'in a distributed system, it can be consistent or available, but not both'.

Lambda Architecture
- Lambda builds Big-Data systems as three layers.
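The replica-merge question above has a well-known answer if each replica tracks its own increments separately, in the style of a grow-only counter (a CRDT); the sketch below is illustrative, with hypothetical replica names:

```python
def merge_counters(a, b):
    """Merge two replicas of a grow-only counter: take the
    element-wise maximum of the per-replica increment counts."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(counter):
    """The counter's value is the sum over all replicas."""
    return sum(counter.values())

# Both replicas start from a shared count of 10, recorded under "r0".
a = {"r0": 10, "rA": 0, "rB": 0}
b = {"r0": 10, "rA": 0, "rB": 0}
a["rA"] += 2    # replica A increments by 2
b["rB"] += 1    # replica B increments by 1
merged = merge_counters(a, b)
# value(merged) == 13: neither concurrent increment is lost
```

Had the replicas each stored only a single total (12 and 11), no merge function could tell concurrent increments from overwrites, which is exactly the eventual-consistency complexity the slide describes.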
- The batch layer runs parallel tasks on distributed datasets to produce batch views for the serving layer.
- The speed layer accepts changes to produce real-time views, intended to address CAP; given the Lambda principle that data is immutable, the solution becomes trivial.
- Queries are answered by combining batch and real-time views.
- Lambda principle: 'data is immutable'.

An Example of the Strength of Lambda
- The batch and serving layers together solve the normalization vs de-normalization issue.

Lambda Data Model
- Graph schema: (fact and properties) vs (table and fields)
- Physically stored by fact
- Hadoop as an enterprise data hub

What Else – NoSQL Databases
Types of NoSQL databases:
- Column
- Document-Oriented Database (DOD)
- Key-value pair
- Graph
- Multi-model

Document-Oriented Databases
- A DOD is a subclass of the key-value database, consisting of a collection of documents.
- CouchDB – a JSON document-oriented database:
  - JSON documents: everything stored in CouchDB boils down to a JSON document.
  - RESTful interface: from creation to replication to data insertion, every management and data task in CouchDB can be done via HTTP.
- CouchDB DML commands (on a JSON document, e.g. person.json):
  - POST – creates a new record
  - GET – reads records
  - PUT – updates a record
  - DELETE – deletes a record

MapReduce Operation on a DOD
- Map function – retrieve orders from person.json
- Reduce function – calculate sales of products

NoSQL Doubts
- Concurrency control: without ACID properties, transactions are not reliably supported (is Neo4j an exception?)
- Data integrity: inability to define parent-child relationships (the graph could be complex), so data can be inconsistent
- Absence of support for JOINs and cross-entity queries

Suggestions – 1
- RDBMS for transactional applications
- NoSQL/RDBMS for computational applications (e.g. sales record management)
- NoSQL for web-scale applications (e.g.
web analytics)

Suggestions – 2: Polyglot Persistence
- Polyglot persistence: using different data storage technologies for varying data storage needs.

Information Entropy
- Claude Shannon's information entropy is defined by
  H(X) = - Σ_i P(x_i) log2 P(x_i),   (1)
  where P(x_i) is the probability of occurrence of x_i. H is an expected value, serving as a measure of uncertainty.
- For example, to calculate the entropy of the 26 English letters in a big corpus,
  H(L) = - Σ_{i=1..26} P(l_i) log2 P(l_i),   (2)
  where the P(l_i) are the relative occurrences of the letters in the corpus, and log2 P(l_i) is the number of bits that can represent the probability. H(L) is then the expected value.
- Shannon estimated the entropy of written English to be between 1.0 and 1.5 bits per character, based on clean English. In reality, the spoken and typed English on the Internet is full of noise, so the value should be higher.
- How about English words? How about other languages? How about each person?
- I expect each person to be associated with a unique number – a Personal Information Entropy (PIE) in both the real and the virtual world, with featured computation beyond the language model (see more in the thesis 'Noisy Language Modelling Framework Using Neural Network Techniques').
- Skynet is coming true; AlphaGo has beaten human beings; we need to hide our 'pie'.

Questions – what is Big Data?

References
- Hierarchical Model: http://codex.cs.yale.edu/avi/db-book/db6/appendices-dir/e.pdf
- Hierarchical Database: www.uwinnipeg.ca/~ychen2/databaseNotes/hierarchicalDB.ppt
- Network Model: http://codex.cs.yale.edu/avi/db-book/db5/slide-dir/appA.ppt and http://codex.cs.yale.edu/avi/db-book/db6/appendices-dir/d.pdf
- CouchDB – Get Started: http://guide.couchdb.org/draft/tour.html
- Jiawei Han and Micheline Kamber (2006), Data Mining – Concepts and Techniques, 2nd edition
- Ananth Grama et al.
(2003), Introduction to Parallel Computing, 2nd edition
- Jun Li (2009), 'Noisy Language Modelling Framework Using Neural Network Techniques'
- Hadoop tutorial: http://www.coreservlets.com/hadoop-tutorial
- Holden Karau et al. (2015), O'Reilly – Learning Spark
- George Coulouris, Distributed Systems: Concepts and Design, 5th edition
- Nathan Marz et al. (2015), Big Data – Principles and Best Practices of Scalable Realtime Data Systems (Lambda Architecture)
- Michael Manoochehri (2014), Data Just Right – Introduction to Large-Scale Data & Analytics
- See related references in the notes of each slide