On the Role of Interactivity and Data Placement in Big Data Analytics
Srini Parthasarathy
OSU
The Data Deluge: Data Data Everywhere
Data Storage is Cheap
$600 to buy a disk drive that can store all of the world’s music
[McKinsey Global Institute Special Report, June ’11]
Data does not exist in isolation.
Data almost always exists in connection with other data: an integral part of the value proposition.
Social networks
VLSI networks
Protein Interactions
Internet
Neighborhood graphs
Data dependencies
Big Data Problem: All this data is only useful if we can scalably extract useful knowledge from such complex data
THIS TALK
• THE ROLE OF DATA PLACEMENT IN BIG DATA SYSTEMS
• THE ROLE OF VISUALIZATION AND INTERACTION IN BIG DATA ANALYSIS
GLOBAL GRAPHS
• What?
– System for deploying applications processing complex data
• Why?
– Seeks balance between high productivity and high performance
• How?
– Built on top of PNL’s GlobalArrays
– Trees (GlobalTrees, GlobalForests)
– Relational Arrays (ArrayDB-GA)
– Graphs (GlobalGraphs)
• Data Placement is key to high performance
Importance of Data Placement
• Locality
– Placing related items close to each other so they may be processed together
• Mitigating Impact of Data Skew
– Reducing load imbalance in a parallel setting
– Reducing variance in partition samples
• Generating Stratified Samples
– Improving interactive performance
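The tension between these goals can be illustrated with a small Python sketch. It is not the GlobalGraphs implementation: the function names, the toy `groups` input (a mapping from a group label to its items), and the imbalance measure (largest partition size divided by mean partition size) are all assumptions made for illustration.

```python
def place_for_locality(groups, num_partitions):
    """Keep each group whole: largest groups first, each to the lightest partition."""
    partitions = [[] for _ in range(num_partitions)]
    for label in sorted(groups, key=lambda g: len(groups[g]), reverse=True):
        lightest = min(partitions, key=len)
        lightest.extend(groups[label])
    return partitions


def place_for_representative_samples(groups, num_partitions):
    """Deal items round-robin, group by group, so partitions resemble each other."""
    partitions = [[] for _ in range(num_partitions)]
    cursor = 0
    for label in sorted(groups):
        for item in groups[label]:
            partitions[cursor % num_partitions].append(item)
            cursor += 1
    return partitions


def load_imbalance(partitions):
    """Largest partition size over mean partition size (1.0 is perfectly balanced)."""
    sizes = [len(p) for p in partitions]
    return max(sizes) / (sum(sizes) / len(sizes))


# Toy input (hypothetical): three groups of related items with skewed sizes.
groups = {
    "a": [("a", i) for i in range(6)],
    "b": [("b", i) for i in range(3)],
    "c": [("c", i) for i in range(3)],
}
local = place_for_locality(groups, 2)
spread = place_for_representative_samples(groups, 2)
```

In the first placement each group lives in exactly one partition, which favors locality; in the second, every partition holds items from every group, so each partition behaves like a stratified sample of the whole.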
Key Ideas
• Pivotization
– Convert data with complex structure into sets
– Each element of set captures features of local topology
• Hashing into Strata: Hash related sets into similar bins
– Can employ a sketch-clustering algorithm
• Partitioning: Place Strata into partitions for
• Locality
• Mitigating Data Skew
• Samples
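The first two ideas above can be sketched in Python under stated assumptions. The pivotization rule here (one set element per edge, the sorted pair of endpoint labels) is a hypothetical stand-in for the talk's transformation, and the stratum key (a prefix of the minwise-hash sketch) is a simple banding scheme swapped in for SketchSort or SketchCluster. The function names and the `records` input are illustrative only.

```python
import hashlib
from collections import defaultdict


def pivot_set(edges):
    """Hypothetical pivotization: one element per edge (sorted endpoint labels)."""
    return {tuple(sorted(edge)) for edge in edges}


def _hash(seed, element):
    """Deterministic 64-bit hash of an element under a given seed."""
    digest = hashlib.sha1(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(items, num_hashes=4):
    """Minwise-hash sketch: the minimum hash value per seed (items must be non-empty)."""
    return tuple(min(_hash(seed, x) for x in items) for seed in range(num_hashes))


def stratify(records, num_hashes=4, band=2):
    """Bin records whose sketches agree on the first `band` components."""
    strata = defaultdict(list)
    for name, edges in records.items():
        sketch = minhash_sketch(pivot_set(edges), num_hashes)
        strata[sketch[:band]].append(name)
    return dict(strata)


# Toy records (hypothetical): small labeled graphs given as edge lists.
records = {
    "d1": [("A", "B"), ("B", "C")],
    "d2": [("C", "B"), ("B", "A")],  # same structure as d1, edges listed differently
    "d3": [("X", "Y"), ("Y", "Z")],
}
strata = stratify(records)
```

Records with identical pivot sets always receive identical sketches and therefore share a stratum; records with overlapping pivot sets share a stratum with a probability that grows with their Jaccard similarity.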
[Figure: placement pipeline. Data (Δ) is converted by pivot transformations into pivot sets (PS); minwise hashing on the pivot sets yields sketches (SK); SketchSort or SketchCluster groups the (Δ, SK) pairs into strata (S-1 … S-128); partitioning and replication assigns the strata to partitions (P-1 … P-8).]
Frequent Tree Mining
• Our proposed approaches show 100X gains
WebGraph Compression
• Linear Scaleup with no loss in compression ratio
PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data
Visualization and Interactivity are key to discovery
PRISM-HD
• What?
– A novel mechanism for exploring complex data
• Why?
– User is often overwhelmed by the characteristics of the data
– Befuddled about where to start
• How?
– Given a similarity measure of interest
– Compute similarity graph at threshold (t)
• Key: Graphs are dimensionless
– Provide user with graph visualization cues
• User determines next threshold and repeats
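The loop above can be sketched in Python. This is a brute-force illustration, not PRISM-HD itself: cosine similarity is assumed as the measure of interest, all pairs are compared directly, and the "cue" is just the number of connected components; the function names and the toy `points` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(points, threshold, similarity=cosine):
    """Edges (i, j) for every pair whose similarity is at least `threshold`."""
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(points), 2)
        if similarity(u, v) >= threshold
    ]


def count_components(num_nodes, edges):
    """Number of connected components, via union-find with path halving."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(num_nodes)})


# Toy data (hypothetical): two tight pairs of nearly parallel vectors.
points = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]
for t in (0.9, 0.15):  # the user picks the next threshold after seeing the cues
    edges = similarity_graph(points, t)
    print(t, len(edges), count_components(len(points), edges))
```

Whatever the dimensionality of the input, each iteration hands the user a graph, so the same cues (edge count, components, density) apply at every threshold.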
[Figure: similarity graphs at high, moderate, and low thresholds]
Benefits of Knowledge Caching
Benefits of Incremental Processing on Twitter
Incremental estimates on Twitter t1 = 0.95
PRISM-HD and Global Graphs in Context:
Leveraging Social Media in Emergency Response
Concluding Remarks
• Data is everywhere
• Data is fraught with complexities
– Dimensionality, dynamics, structure, massive…
• Both data placement and data interactivity have an important role to play in big data analytics
– PRISM-HD and GlobalGraphs can help!
Thanks for your attention
Contact: srini@cse.ohio-state.edu
Mining Simulation Data
Medical Image Analysis
Protein Interaction Network (yeast)
Acknowledgements: Various NSF, NIH, DOE and industry grants