Winter 2017
Big Data Processing & Analytics
Vassilis Christophides
vassilis.christophides@inria.fr
https://who.rocq.inria.fr/Vassilis.Christophides/Big/
Ecole CentraleSupélec Winter 2017
The Data Avalanche: From
Science to Business
Shifting Paradigm in Sciences




Thousand years ago: science was empirical, describing natural phenomena
Last few hundred years: a theoretical branch, using models and generalizations
$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
Last few decades: a computational branch, simulating complex phenomena
The fourth paradigm today (eScience): data exploration unifies theory, experiment, and simulation
Data captured by instruments or generated by simulators
Processed by software
Information/knowledge stored in computers
Scientists analyze data using data management services and statistics
© Jim Gray
Large Synoptic Survey Telescope (LSST)
100-200 Petabyte image archive
20-40 Petabyte database catalog
LSST will take more than 800 panoramic images each night, recording the entire visible sky twice each week
Ten-year time series (~2020-2030) imaging of the night sky: mapping the Universe!
8.4-meter diameter primary mirror with a 10 square degree field of view
3.2 billion-pixel camera
http://www.lsst.org
Large Hadron Collider (LHC)


Protons collide some 1 billion times per second, where each collision produces about a megabyte of data
Even after filtering out about 99% of it, scientists are left with around 30 petabytes each year to analyze for a wide range of physics experiments, including studies on the Higgs boson
Reconstructing particle trajectories, the particle types and their speeds
27-kilometre ring of superconducting magnets, 9 km in diameter, ≈100 m below ground
http://home.web.cern.ch/topics/large-hadron-collider
Human Brain Project (HBP)


Generate and interpret strategically selected data needed to build multilevel atlases and unifying models of the brain
Use anatomical frameworks to organize and convey spatially and temporally distributed functional information about the brain at all organizational levels, from genes to cognition, and at all the relevant spatial and temporal scales
http://blogs.scientificamerican.com/sa-visual/2014/04/02/how-do-you-visualize-the-brain/
Data-driven Discovery
Innovation is no longer hindered by the ability to collect data but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
© JOHN R. JOHNSON
Disruption Time!

Until recently, you were a folder
Now, you are Your Data!
Blurring the Boundaries of Real & Virtual Worlds

Ubiquitous sensing & reasoning in physical and cyber worlds
Digital Disruption Already Happening !








World’s largest telco company owns no telco infrastructure (Skype)
World’s largest movie house owns no cinemas (Netflix)
Largest software companies don’t write the apps (Apple, Google)
World’s most valuable retailer has no inventory (Alibaba)
Most popular media owner creates no content (Facebook)
World’s largest taxi company owns no vehicles (Uber)
Largest accommodation provider owns no real estate (Airbnb)
Fastest-growing banks actually have no money (SocietyOne)
http://www.independent.co.uk/news/business/comment/hamishmcrae/facebook-airbnb-uber-and-the-unstoppable-rise-of-thecontent-non-generators-10227207.html
The Data Tsunami
Mobile devices
(tracking all objects all the time)
Scientific instruments
(collecting all sorts of data)
Social media and networks
(all of us are generating data)

Sensor networks
(measuring all kinds of data)
Three main reasons:
Processes are increasingly automated
Systems are increasingly interconnected
People are social and increasingly generate data exhaust by interacting online
The New Moore’s Law

The Economist: digital information grows 10 times every 5 years!
Data volume is increasing exponentially
44x increase from 2009 to 2020
From 0.8 ZB to 35 ZB
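As a rough check on these growth figures, the implied yearly growth rates can be worked out directly. This is only a back-of-the-envelope sketch: it assumes steady compound growth, which the slide does not state, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the growth figures quoted on the slide.
# Assumption (not in the slide): growth is a steady compound rate.

def annual_growth_factor(total_factor: float, years: float) -> float:
    """Yearly multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)

# 0.8 ZB in 2009 to 35 ZB in 2020
overall = 35 / 0.8
zb_rate = annual_growth_factor(overall, 2020 - 2009)

# The Economist: 10 times every 5 years
economist_rate = annual_growth_factor(10, 5)

print(f"overall factor 2009-2020: {overall:.2f}x")
print(f"implied yearly growth (0.8 ZB -> 35 ZB): {(zb_rate - 1) * 100:.0f}%")
print(f"implied yearly growth (10x / 5 years): {(economist_rate - 1) * 100:.0f}%")
```

Under that assumption the two sources correspond to roughly 41% and 58% growth per year, so they describe the same order of magnitude but not the same rate.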
Big Data = Transactions+Interactions+Observations
IoT sensors are reporting even more personal data than humans are!
[Figure: data volume growing from megabytes through gigabytes and terabytes to petabytes, against increasing data variety, velocity, and veracity]
hortonworks.com/blog/7-key-drivers-for-the-big-data-market
What Makes Data, “Big” Data?
Definitions

No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… (McKinsey Global Inst.)
“Big Data” is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making (Gartner)
The Four V’s of Big Data
Characteristics of Big Data: 1-Scale (Volume)
Web data
Mobile data
ERP, CRM data

Too big: petabyte-scale collections or lots of (not necessarily big) data sets
Characteristics of Big Data: 2-Speed (Velocity)
Financial data
IoT data
Social data

Too fast: needs to be processed quickly and react promptly
Characteristics of Big Data: 3-Complexity (Variety)
Medical Imaging data
Measurement data
Video data
Textual data

Too diverse: does not fit neatly in an existing tool
Characteristics of Big Data: 4-Quality (Veracity)

Too crappy: data quality needs to be assessed
Summary of 4 Big Data Characteristics
Characteristic | Description | Properties | Drivers
Volume | The amount of data generated (data intensity) that must be ingested, processed & analyzed to make data-driven decisions | Batch processing | High number of data sources; high-resolution sensors
Velocity | How fast data is being produced and ingested, and the speed at which data is transformed into insight | Streaming, online processing; (near) real-time analytics | Real-time, high-rate data acquisition; low cost of hardware
Variety | The degree of diversity (and structuring) of data from sources both inside and outside an organization | Multi-modality; complex interrelations; sequences; implicit semantics | Social media; scientific data; video; M2M / IoT
Veracity | The quality and traceability of data | Consistency; completeness; integrity; ambiguity | Crowd data production; human sensing
We’ve Moved into a New Era of Data Analytics
Volume: 12+ terabytes of Tweets created daily
Velocity: 5+ million trade events per second
Variety: 100’s of different types of data
Veracity: only 1 in 3 decision makers trust their information
Declining % of Data an Organization Can Analyse
http://www.youtube.com/watch?v=B27SpLOOhWw
Big Data Processing
Big Data: Old Wine in a New Bottle?



No, it is a different type of data wave:
one needs to put together many sources of information, coming through many different channels, throwing away what is not important, working under resource constraints (time), and serving real users’ needs
Yes, most of these problems have been in the focus of data management research for years
Big Data movement: exponential growth of data enthusiasts!
Proliferation of data producers and consumers, e.g., on the Web and in scientific, social, government, urban, home and personal spaces
The main issue is to put all this together to satisfy concrete data analysis needs via innovative technology
Beyond Big Data Size!
Volume = Length × Width × Depth
Length: Collect & Compare
Width: Curate & Integrate
Depth: Analyze & Understand
Massive Data Analysis, J. Freire & J. Simeon, New York University, Course 2013
The WRONG Picture!
Big Data vs Deep Insights
[Figure: data vs. knowledge]
Data exploration is hard regardless of whether data are big or small!
The TRUE Picture!
The time for developing an analysis (with small data)
The time for developing an analysis (with big data)
Big Data Infrastructures: Exploiting the Power of Big Data, T. Sellis, School of CS & IT, 2015 Athens
Exploring Big Data: What is Hard?





Scalability for computations? NOT REALLY!
Lots of work on distributed systems, parallel databases, …
Cloud elasticity: add more nodes!
But there is no one-size-fits-all solution: often, you have to build your own…
Rapidly evolving technology
Many different tools
Different computation model: need new algorithms!
Advanced Analytics Requires a Robust, Comprehensive Information Platform
The bottleneck is the human (data scientist)!
©2011 IBM Corporation
Big Data Research Agenda
Acquisition, Storage, and Management of “Big Data”
Data representation, storage, and retrieval
New parallel data architectures, including clouds
Data management policies, including privacy and secure access
Communication and storage devices with extreme capacities
Sustainable economic models for access and preservation
Data Analytics
Computational, mathematical, statistical, and algorithmic techniques for modelling high dimensional data
Learning, inference, prediction, and knowledge discovery for large volumes of heterogeneous data sets
Data mining to enable automated hypothesis generation, event correlation, and anomaly detection from data streams
Information infusion of multiple data sources
Data Sharing and Collaboration
Tools for distant data sharing, real time visualization, and software reuse of complex data sets
Cross disciplinary model, information and knowledge sharing
Remote operation and real time access to distant data sources and instruments
Source: Big Data R&D Initiative, Howard Wactlar, NIST Big Data Meeting, June 2012
Big Data Mining
What to Do with Big Data?

Data contains value and knowledge

But to extract the knowledge, data needs to be
Stored
Managed
And ANALYZED

Data analysis includes:
Mining/summarizing large datasets
Extracting knowledge from past data
Predicting trends in future data
Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science
A Bit of Terminology


Data mining is the old big data: an overused term including anything such as
collecting, storing, curating and visualizing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the focus is on new knowledge, not on learning of existing knowledge)
"Business intelligence" and "business analytics" are marketing terms stressing that more data leads to better business decisions (periodic reporting as well as ad hoc queries, importance of tools and dashboards)
Most "Big Data" today isn't ML: it's Extract, Transform, Load (ETL), so it is replacing data warehousing (except computational advertising)
Business Intelligence aims at descriptive statistics on data with high information density to measure things, detect trends, etc.
Big Data targets inductive statistics on data with low information density, whose huge volume allows one to infer laws (regressions…), thus giving Big Data (within the limits of inferential reasoning) some predictive capabilities (called Deep Analytics)
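The distinction between descriptive and inductive statistics can be made concrete on a toy dataset. The sketch below is illustrative only: the numbers and variable names are made up, and ordinary least squares stands in for the "regressions" mentioned above.

```python
# Descriptive vs. inductive statistics on a toy dataset.
# The data points below are made up for illustration only.
from statistics import mean, stdev

ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical input variable
revenue = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical observed outcome

# Descriptive (BI-style): summarize what was observed.
summary = {"mean": mean(revenue), "stdev": stdev(revenue)}

# Inductive (Big-Data-style): infer a law y = slope * x + intercept
# by least squares, then use it to predict an unseen value.
def fit_line(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(ad_spend, revenue)
predicted = slope * 6.0 + intercept        # extrapolation; valid only if the law holds

print(summary)
print(f"revenue ~ {slope:.2f} * ad_spend + {intercept:.2f}; prediction at 6.0: {predicted:.2f}")
```

The summary only describes the five observations, whereas the fitted line makes a claim about a value that was never observed, which is where the "limits of inferential reasoning" apply.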
Data Analysis: ERP & CRM Examples
Who are our lowest/highest margin customers?
What is the most effective distribution channel?
What product promotions have the biggest impact on revenue?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
Agrawal et al., VLDB 2010 Tutorial
The Data Analysis Spectrum
[Figure: value vs. difficulty of analytics. Source: Gartner]
Monitoring (Dashboards, Scorecards): What is happening?
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What might happen?
Prescriptive Analytics: How can we make it happen?
Classic ML Algorithms used for Decades
K-means
Logistic Regression
KNN (K-Nearest Neighbours)
Naïve Bayes
Decision Trees
SVD (Singular Value Decomposition)
Growing Need for Big ML Tasks
make sense of images, audio
find significant genes
make sense of documents
find similar users
Big ML Software for Modern ML Algorithms, Q. Ho, E. P. Xing
ML Computation vs. Classical
Computing Programs
ML Program: optimization-centric and iterative convergent
Traditional Program: operation-centric and deterministic
Parallelization Strategies and Systems for Distributed Machine Learning, E. Xing
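The contrast between the two kinds of program can be sketched in a few lines. This is a toy illustration, not taken from the cited talk: the objective function, step size, and tolerance are assumptions chosen to keep the example small.

```python
# Contrast sketch: an operation-centric program vs. an iterative-convergent one.
# The objective function, step size, and tolerance are illustrative assumptions.

def traditional_program(values):
    """Operation-centric and deterministic: a fixed sequence of steps, exact answer."""
    return sorted(values)

def ml_program(step=0.1, tol=1e-8, max_iters=10_000):
    """Optimization-centric and iterative-convergent: repeat updates until
    the parameter for f(theta) = (theta - 3)^2 stops changing."""
    theta = 0.0
    for t in range(1, max_iters + 1):
        grad = 2.0 * (theta - 3.0)        # derivative of the objective
        new_theta = theta - step * grad   # gradient-descent update
        if abs(new_theta - theta) < tol:  # convergence test
            return new_theta, t
        theta = new_theta
    return theta, max_iters

print(traditional_program([3, 1, 2]))
theta, iters = ml_program()
print(f"theta = {theta:.6f} after {iters} iterations")
```

The sort either runs all of its steps or fails, while the optimization loop is judged by whether it converges, and the number of iterations it needs depends on the step size and tolerance.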
Data Mining: Different Cultures



Data mining overlaps with:
Databases (DB): large-scale data, simple queries
Machine Learning (ML): small data, complex models
Computer Science Theory: (randomized) algorithms
Different cultures:
To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
Result is the query answer
To an ML person, data mining is the inference of models: ML algorithms = “engine” to solve ML models
Result is the parameters of the model
Big Data calls for a cross-culture curriculum stressing
Scalability
Algorithms
Computing architectures
Automation for handling large data
[Figure: Data Mining at the intersection of Statistics, Machine Learning/AI, Pattern Recognition, and Database systems]
Hadoop is not Suited to Iterative ML


Typically we want to analyse a dataset by accessing data several times
Many trial-and-error steps, easy to get lost…
Most existing data mining/ML methods were designed without considering data access and communication of intermediate results
They iteratively use data by assuming they are readily available
Hadoop is not efficient at iterative programs
They need many map-reduce phases
HDFS disk I/O becomes the bottleneck!
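A small sketch may help show why iterative methods need many map-reduce phases. It runs one-dimensional k-means entirely in memory, structured as a map step and a reduce step per iteration; the data, the starting centroids, and the function names are illustrative assumptions, and no Hadoop API is used.

```python
# Sketch of why iterative ML maps poorly onto MapReduce: every iteration
# is a full map + reduce pass over the dataset. Here the "job" runs in
# memory; on Hadoop each pass would re-read its input from HDFS.
# Data, starting centroids, and the stopping rule are illustrative assumptions.
from collections import defaultdict

def map_phase(points, centroids):
    """Emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        yield nearest, p

def reduce_phase(pairs, centroids):
    """Average the points assigned to each centroid; keep empty ones unchanged."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    return [sum(groups[i]) / len(groups[i]) if groups[i] else c
            for i, c in enumerate(centroids)]

def kmeans_1d(points, centroids, max_iters=100):
    passes = 0
    for _ in range(max_iters):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        passes += 1                      # one full scan of the data per iteration
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return centroids, passes

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids, passes = kmeans_1d(points, [0.0, 5.0])
print(centroids, "after", passes, "passes over the data")
```

Even this tiny example scans the whole dataset once per iteration; with disk-backed phases, that repeated scan is the I/O cost the slide refers to.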
Why Do We Need New Big ML Systems?
ML practitioner’s view
Focus on correctness, fewer iterations to converge
… but assume an ideal system
for (t = 1 to T) {
  doThings()
  parallelUpdate(x,θ)
  doOtherThings()
}
Oversimplify systems issues
e.g. need machines to perform consistently
e.g. need to sync parameters any time
Systems view
Focus on high iteration throughput (more iterations per sec), strong fault-tolerant atomic ops
… but assume ML algo is a black box
Execution models: slow-but-correct Bulk Synchronous Parallel vs. fast-but-unstable Asynchronous Parallel
Oversimplify ML issues
e.g. ML algos “still work” without proof under different exec. models
e.g. “easy to rewrite” in chosen abstraction (MapR, vertex, etc.)
Big ML Software for Modern ML Algorithms, Q. Ho, E. P. Xing
An Alg/Sys INTERFACE for Big ML
Alone, neither side has the full picture ...
New opportunities exist in the middle!
This is an on-going research topic:
methods are not ready
tools are not convenient
platforms rapidly change, …
Big ML Software for Modern ML Algorithms, Q. Ho, E. P. Xing
The Big ML Research


Roughly, there are two types of approaches
Parallelize existing (single-machine) algorithms (data, model, hybrid)
Design new algorithms particularly for massively parallel settings
Of course, there are things in between
To have technical breakthroughs in big-data analytics, we should know both algorithms and systems well, and consider them together
The Big ML “Stack” - More than Just Software!
Big ML Software for Modern ML Algorithms, Q. Ho, E. P. Xing
MapReducible?
Big Data Analytics Platforms
[Figure: Big Data analytics platforms. Source: VMware]
Online Machine Learning (SAMOA, Rapid Miner, OIIDM)
Big Machine Learning (Mahout, MillWheel, R/Hadoop)
IoT Data Analysis (Parstream, Vitria, Splunk, virdata)
Big Time Series Analytics (InfluxDB, AT&T M2X, IBM Informix TS, OpenTSDB)
Also shown: Striim, Storm, Spark, Google RT, Apache S4, MS Azure, AWS Kinesis
What Is This Course About?
The Need: Making Sense of
the Big Data Universe
Frameworks:
New computing paradigms: Cloud, Hadoop – Map/Reduce
New storage solutions: NoSQL, column stores, Big Table
New languages: JAQL, Pig Latin
We will survey these and how they relate to previous technologies
Analysis:
New frameworks demand new approaches to explore data
We will study algorithms to process and mine data in Big-Data environments
Tackling Big Data, M. Cooper & P. Mell, NIST Information Technology Laboratory, Computer Security Division
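For readers new to the Map/Reduce paradigm named above, the following in-memory sketch shows its three steps on the usual word-count example. It imitates the programming model only; the function names are made up for illustration and this is not the Hadoop API.

```python
# In-memory sketch of the Map/Reduce model (word count).
# This imitates the programming model only; it is not the Hadoop API.
from collections import defaultdict

def map_fn(document):
    """Map: emit (word, 1) for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    """Reduce: combine all values for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_fn(doc))
    return dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real deployment the map and reduce functions run on different machines and the shuffle moves data across the network; the user still writes only the two functions.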
What You Will Learn




Understand different models of computation:
MapReduce
Streams and online algorithms
Mine different types of data:
Data is high dimensional
Data is infinite/never-ending
Use different mathematical ‘tools’:
Hashing (LSH, Bloom filters)
Dynamic programming (frequent itemsets)
Solve real-world problems:
Duplicate document detection
Market Basket Analysis
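As a taste of the hashing tools listed above, here is a minimal Bloom filter sketch. The bit-array size, the number of hash functions, and the salted SHA-256 hashing scheme are illustrative choices, not the course's reference implementation.

```python
# Minimal Bloom filter sketch: a set-membership test that may return
# false positives but never false negatives.
# Sizes and the salted-SHA-256 hashing scheme are illustrative choices.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, item):
        """Derive num_hashes bit positions by salting the item with a seed."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for url in ["a.example/1", "a.example/2"]:
    seen.add(url)

print(seen.might_contain("a.example/1"))   # always True for added items
print(seen.might_contain("b.example/9"))   # usually False; True would be a false positive
```

The filter uses a fixed amount of memory no matter how many items are added, at the price of a false-positive rate that grows as it fills up.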
Tentative Course Schedule






Lecture 1 (13/01): Course Overview
Lecture 2 (20/01): The Map-Reduce Software Ecosystem
Lecture 3 (27/01): Finding Similar Items
Lecture 4 (17/02): Massive Data Warehousing
Lecture 5-6 (24/02-03/03): Mining Association Rules
Lecture 7-8 (10-17/04): Analysing Data Streams
© NY Times
Course Text Books

Jure Leskovec, Anand Rajaraman, Jeff Ullman. “Mining of Massive Datasets”, Cambridge University Press, 2014
http://www.cambridge.org/gr/academic/subjects/computer-science/knowledge-management-databases-and-data-mining/mining-massive-datasets-2nd-edition
Free download: http://www.mmds.org

Donald Miner, Adam Shook. “MapReduce Design Patterns”, O'Reilly Media, 2013
http://shop.oreilly.com/product/0636920025122.do
Free download of chapter samples: http://cdn.oreillystatic.com/oreilly/booksamplers/9781449327170_sampler.pdf
Course Organization


Three Programming Exercises (40%):
Map/Reduce & Hadoop
Final Examination (60%)
Words of Caution






We can only cover a small part of the big data universe
Do not expect all possible architectures, programming models, theoretical results, or vendors to be covered
This really is an algorithms course, not a basic programming course
But you will need to do a lot of non-trivial programming
There are few certain answers, as people in research and leading tech companies are trying to understand how to deal with big data
We are working with cutting-edge technology
Bugs, lack of documentation, new Hadoop API
In short: you have to be able to deal with inevitable frustrations and plan your work accordingly…
…but if you can do that and are willing to invest the time, it will be a rewarding experience
References







CS246: Mining Massive Datasets, Jure Leskovec, Stanford University, 2014
CS9223 – Massive Data Analysis, J. Freire & J. Simeon, New York University, Course 2013
CS 6240: Parallel Data Processing in MapReduce, Mirek Riedewald, Northeastern University, 2014
Big Data Infrastructures: Exploiting the Power of Big Data, T. Sellis, School of CS & IT, 2015 Athens
CS525: Special Topics in DBs, Large-Scale Data Management, Advanced Analytics on Hadoop, Mohamed Eltabakh, Spring 2013
Big-data Analytics: Challenges and Opportunities, Chih-Jen Lin, Department of Computer Science, National Taiwan University, August 30, 2014
Knowledge Discovery and Data Mining, Evgueni Smirnov, Maastricht School on Data Mining, Department of Knowledge Engineering, Maastricht University, Maastricht, The Netherlands, August 27 - August 30, 2013
The Big Data Analysis Pipeline
[Figure: major processing steps and major challenges]
http://cacm.acm.org/magazines/2014/7/176204-big-data-and-its-technical-challenges/abstract
Big Data Processing & Analysis Framework
http://www.pinterest.com/pin/272327108689099496/