CS 5614: (Big) Data Management Systems
B. Aditya Prakash
Lecture #1: Introduction
Prakash 2017
CS 5614: (Big) Data Management Systems
Data contains value and knowledge
Data and Business
§ Recommended links: +79% clicks (vs. randomly selected)
§ Personalized News Interests: +250% clicks (vs. editorial one-size-fits-all)
§ Top Searches: +43% clicks (vs. editor selected)
Source: A. Machhanavajjhala
Data and Science
§ Detecting influenza epidemics using search engine query data
– Red: official numbers from the Centers for Disease Control and Prevention; weekly
– Black: based on Google search logs; daily (potentially instantaneously)
http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
Data and Government
http://www.washingtonpost.com/opinions/obama-the-big-data-president/2013/06/14/1d71fe2e-d391-11e2-b05f-3ea3f0e7bb5a_story.html
http://www.washingtonpost.com/business/economy/democrats-push-to-redeploy-obamas-voter-database/2012/11/20/d14793a4-2e83-11e2-89d4-040c9330702a_story.html
http://www.whitehouse.gov/blog/Democratizing-Data
http://www.theguardian.com/world/2013/jun/23/edward-snowden-nsa-files-timeline
Source: A. Machhanavajjhala
Data and Culture
§ Word frequencies in English-language books in Google’s database
http://blogs.plos.org/everyone/2013/03/20/what-are-you-in-the-mood-for-emotional-trends-in-20th-century-books/
Source: A. Machhanavajjhala
Data and ____ ☜ your favorite subject
§ Sports
§ Journalism
Good news: Demand for Data Mining
How to extract value from data?
§ Manipulate Data
– CS, Domain expertise
§ Analyze Data
– Math, CS, Stat…
§ Communicate your results
– CS, Domain Expertise
Communication is important!
§ “The British government spends £13 billion a year on universities.”
– So?
– Try instead http://wheredoesmymoneygo.org/bubbletree-map.html#/~/total/education/university
§ “On average, 1 in every 15 Europeans is totally illiterate.”
– True
– But about 1 in every 14 is under 7 years old!
http://datajournalismhandbook.org/1.0/en/understanding_data_0.html
What is Data Mining?
§ Given lots of data
§ Discover patterns and models that are:
– Valid: hold on new data with some certainty
– Useful: should be possible to act on the item
– Unexpected: non-obvious to the system
– Understandable: humans should be able to interpret the pattern
Data Mining Tasks
§ Descriptive methods
– Find human-interpretable patterns that describe the data
• Example: Clustering
§ Predictive methods
– Use some variables to predict unknown or future values of other variables
• Example: Recommender systems
[Figure: Big data shown together with related fields: Theory & Algo., Comp. Systems, ML & Stats., Biology, Physics, Social Science, Econ.]
COURSE LOGISTICS
Course Information
§ Instructor
B. Aditya Prakash, Torgersen Hall 3160F, badityap@cs.vt.edu
– Office Hours: TBD
– Include string CS5614 in subject
§ Teaching Assistant
TBD
– Office Hours: TBD
§ Class Meeting Time
Tuesdays, Thursdays, 9:30-10:45am, McBryde Hall 226
§ Syllabus: Relational Database Systems, Big data Technologies (MR and new software stack), Streams, Recommendation Systems, Large Scale Machine Learning, and Graph Mining
Course Information
§ Keeping in Touch
Course website
http://www.cs.vt.edu/~badityap/classes/cs5614-Spr17/
updated regularly through the semester
– Piazza link on the website
Textbook
§ Required
Jure Leskovec, Anand Rajaraman and Jeffrey Ullman: Mining of Massive Datasets (2nd) Cambridge University Press. 2010
Webpage for the book (with FREE PDF!)
www.mmds.org
Textbook
§ Recommended (for database internals)
Raghu Ramakrishnan and Johannes Gehrke
Database Management Systems (3rd Ed.). McGraw Hill.
Pre-reqs
(A) Should enjoy the course :)
(B) Background in
1. Algorithms
2. Probability and Stats
3. Undergraduate level Databases
4. Linear Algebra (helps)
5. Graph theory (helps)
(C) Graduate-level Programming Skills (i.e. ability to use unfamiliar software, picking up new languages, comfortable with at least one of Python/C/C++/Ruby/Java etc. (Matlab/R a plus))
Force-add
§ Talk to me once after class
AND
§ Fill in this survey by 6pm EST today
https://goo.gl/forms/APfoI5CymKqKg0Pk1
Course Grading
§ Details coming soon (next lecture)
§ Broadly
– Some hws
– No midterm
– Take-home Final
– Project
– Class participation
Course Project
§ 2, or 3 (max) persons per project.
§ Major work for this class.
§ Pick your own topic
– You have to justify why the topic is interesting, and relevant to the course, and of suitable difficulty
§ Harder way:
– Joint projects with other courses are also negotiable. In that case, you will need the approval of the instructor, and you also need to clarify exactly what steps will be done for this course, as well as for the other course.
– Projects related to your dissertation/master-project are also possible, as long as there is no 'double-dipping', i.e., you clearly specify what the project will do, in addition to what you were planning to do for your thesis anyway.
§ Ask me if you need help and ideas (I may release a list of suitable topics later)
Course Project
§ Proposal
§ Milestone
§ Final Report
§ Poster Presentation (or in-class presentation TBD)
WARM-UP AND BASICS
Relational Databases: What we will cover (next 1 month)
§ Implementation
– What is under-the-hood of a DB like Oracle/MySQL?
§ Design
– How do you model your data and structure your information in a database?
§ Programming
– How do you use the capabilities of a DBMS?
§ Achieves a balance between
– a firm theoretical foundation to designing moderate-sized databases
– creating, querying, and implementing realistic databases and connecting them to applications
CS 4604: Course Outline
§ Weeks 1–4: Query/Manipulation Languages and Data Modeling
– Relational Algebra
– Data definition
– Programming with SQL
– Entity-Relationship (E/R) approach
– Specifying Constraints
– Good E/R design
§ Weeks 5–8: Indexes, Processing and Optimization
– Storing
– Hashing/Sorting
– Query Optimization
– NoSQL and Hadoop
§ Weeks 9–10: Relational Design
– Functional Dependencies
– Normalization to avoid redundancy
§ Weeks 11–12: Concurrency Control
– Transactions
– Logging and Recovery
§ Weeks 13–14: Students’ choice
– Practice Problems
– XML
– Data mining and warehousing
We will go over all of this quickly! :)
What is the goal of a DBMS?
§ Electronic record-keeping
Fast and convenient access to information
§ DBMS == database management system
– ‘Relational’ in this class
– data + set of instructions to access/manipulate data
What is a DBMS?
§ Features of a DBMS
– Support massive amounts of data
– Persistent storage
– Efficient and convenient access
– Secure, concurrent, and atomic access
§ Examples?
– Search engines, banking systems, airline reservations, corporate records, payrolls, sales inventories.
– New applications: Wikis, social/biological/multimedia/scientific/geographic data, heterogeneous data.
Features of a DBMS
• Support massive amounts of data
– Giga/tera/petabytes
– Far too big for main memory
• Persistent storage
– Programs update, query, manipulate data.
– Data continues to live long after program finishes.
• Efficient and convenient access
– Efficient: do not search entire database to answer a query.
– Convenient: allow users to query the data as easily as possible.
• Secure, concurrent, and atomic access
– Allow multiple users to access database simultaneously.
– Allow a user access only to authorized data.
– Provide some guarantee of reliability against system failures.
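
To make "efficient: do not search entire database" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table takes(ssn, cid, grade), the index name idx_takes_ssn, and the generated rows are illustrative assumptions, not part of the lecture. SQLite's EXPLAIN QUERY PLAN reports how a query would be executed; the exact wording of its output differs between SQLite versions, so the sketch only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, used only for this illustration.
conn.execute("CREATE TABLE takes (ssn TEXT, cid TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO takes VALUES (?, ?, ?)",
    [(str(i), "Potions", "A") for i in range(1000)],
)

def plan(sql):
    """Return the plan steps SQLite reports for a query, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description

query = "SELECT * FROM takes WHERE ssn = '345'"

plan_without_index = plan(query)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_takes_ssn ON takes (ssn)")
plan_with_index = plan(query)      # the planner can now look up ssn via the index

print(plan_without_index)
print(plan_with_index)
```

The query text is identical in both cases; only the access path chosen by the DBMS changes.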
Example Scenario
§ Students, taking classes, obtaining grades
– Find my GPA
– <and other ad-hoc queries>
Obvious solution 1: Folders
§ Advantages?
– Cheap; Easy-to-use
§ Disadvantages?
– No ad-hoc queries
– No sharing
– Large physical foot-print
Obvious Solution++
§ Flat files and C (C++, Java…) programs
– E.g. one (or more) UNIX/DOS files, with student records and their courses
Obvious Solution++
§ Layout for student records?
– CSV (‘comma-separated-values’)
Hermione Grainger,123,Potions,A
Draco Malfoy,111,Potions,B
Harry Potter,234,Potions,A
Ron Weasley,345,Potions,C
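
A minimal sketch of what the flat-file approach looks like in practice, written in Python rather than C for brevity. The function name students_with_grade and the in-memory string standing in for the file are assumptions made for this illustration. Note how the column order of the file is hard-coded inside the program.

```python
import csv
import io

# The four records from the slide, standing in for a flat file on disk.
FLAT_FILE = """\
Hermione Grainger,123,Potions,A
Draco Malfoy,111,Potions,B
Harry Potter,234,Potions,A
Ron Weasley,345,Potions,C
"""

def students_with_grade(text, course, grade):
    """Scan every record and return the names that match course and grade."""
    matches = []
    for row in csv.reader(io.StringIO(text)):
        # The file layout (name, id, course, grade) lives here, in the code.
        name, student_id, row_course, row_grade = row
        if row_course == course and row_grade == grade:
            matches.append(name)
    return matches

print(students_with_grade(FLAT_FILE, "Potions", "A"))
# prints ['Hermione Grainger', 'Harry Potter']
```

If the file gains a column or the columns are reordered, every program that unpacks rows this way has to be found and changed.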
Obvious Solution++
§ Layout for student records?
– Other possibilities like
Hermione Grainger,123
Draco Malfoy,111
Harry Potter,234
Ron Weasley,345

123,Potions,A
111,Potions,B
234,Potions,A
345,Potions,C
Problems?
§ inconvenient access to data (need ‘C++’ expertise, plus knowledge of file-layout)
– data isolation
§ data redundancy (and inconsistencies)
§ integrity problems
§ atomicity problems
§ concurrent-access problems
§ security problems
§ …….
Problems - Why?
§ Two main reasons:
– file-layout description is buried within the C programs and
– there is no support for transactions (concurrency and recovery)
DBMSs handle exactly these two problems
Example Scenario
§ RDBMS = “Relational” DBMS
§ The relational model uses relations or tables to structure data
§ ClassList relation:
Student            Course   Grade
Hermione Grainger  Potions  A
Draco Malfoy       Potions  B
Harry Potter       Potions  A
Ron Weasley        Potions  C
§ Relation separates the logical view (externals) from the physical view (internals)
§ Simple query languages (SQL) for accessing/modifying data
– Find all students whose grades are better than B.
– SELECT Student FROM ClassList WHERE Grade < 'B'
(letter grades compare as strings, so a grade better than B sorts before 'B')
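
The ClassList relation and the "better than B" query can be tried out with Python's built-in sqlite3 module. This is a small sketch, assuming all three columns are stored as text. Because letter grades are compared as strings, 'A' sorts before 'B', so "better than B" is written as Grade < 'B'. The ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are an assumption; the slide only names the columns.
conn.execute("CREATE TABLE ClassList (Student TEXT, Course TEXT, Grade TEXT)")
conn.executemany(
    "INSERT INTO ClassList VALUES (?, ?, ?)",
    [
        ("Hermione Grainger", "Potions", "A"),
        ("Draco Malfoy", "Potions", "B"),
        ("Harry Potter", "Potions", "A"),
        ("Ron Weasley", "Potions", "C"),
    ],
)

# Say what is wanted, not how to find it: no file layout, no loops.
better_than_b = [
    row[0]
    for row in conn.execute(
        "SELECT Student FROM ClassList WHERE Grade < 'B' ORDER BY Student"
    )
]
print(better_than_b)  # prints ['Harry Potter', 'Hermione Grainger']
```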
DBMS Architecture
Transaction Processing
§ One or more database operations are grouped into a “transaction”
§ Transactions should meet the “ACID test”
– Atomicity: All-or-nothing execution of transactions.
– Consistency: Databases have consistency rules (e.g. what data is valid). A transaction should NOT violate the database’s consistency. If it does, it needs to be rolled back.
– Isolation: Each transaction must appear to be executed as if no other transaction is executing at the same time.
– Durability: Any change a transaction makes to the database should persist and not be lost.
Disadvantages over (flat) files?
Disadvantages over (flat) files
§ Price
§ additional expertise (SQL/DBA)
(hence: over-kill for small, single-user data sets. But: mobile phones (e.g., Android) use SQLite)
A Brief History of DBMS
§ The earliest databases (1960s) evolved from file systems
– File systems
• Allow storage of large amounts of data over a long period of time
• File systems do not support:
  – Efficient access of data items whose location in a particular file is not known
  – Logical structure of data is limited to creation of directory structures
  – Concurrent access: Multiple users modifying a single file generate non-uniform results
• Navigational and hierarchical
• User programmed the queries by walking from node to node in the DBMS.
§ Relational DBMS (1970s to now)
– View database in terms of relations or tables
– High-level query and definition languages such as SQL
– Allow user to specify what (s)he wants, not how to get what (s)he wants
§ Object-oriented DBMS (1980s)
– Inspired by object-oriented languages
– Object-relational DBMS
The DBMS Industry
§ A DBMS is a software system.
§ Major DBMS vendors: Oracle, Microsoft, IBM, Sybase
§ Free/Open-source DBMS: MySQL, PostgreSQL, Firebird.
– Used by companies such as Google, Yahoo, Lycos, BASF….
§ All are “relational” (or “object-relational”) DBMS.
§ A multi-billion dollar industry
Fundamental concepts
§ 3-level architecture
§ logical data independence
§ physical data independence
3-level architecture
§ view level
§ logical level
§ physical level
[Figure: three views labeled v1, v2, v3]
3-level architecture
§ view level
§ logical level: e.g., tables
– STUDENT(ssn, name)
– TAKES(ssn, cid, grade)
§ physical level:
– how are these tables stored, how many bytes/attribute etc.
3-level architecture
§ view level, e.g.:
– v1: select ssn from student
– v2: select ssn, cid from takes
§ logical level
§ physical level
3-level architecture
§ -> hence, physical and logical data independence:
§ logical D.I.:
– ???
§ physical D.I.:
– ???
3-level architecture
§ -> hence, physical and logical data independence:
§ logical D.I.:
– can add (drop) column; add/drop table
§ physical D.I.:
– can add index; change record order
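
A short sketch of both kinds of data independence, using Python's sqlite3 module. The view v1 follows the earlier slide (select ssn from student); the email column and the index name idx_student_ssn are assumptions added for illustration. The program that reads v1 is written once and keeps working, unchanged, after a column is added (a logical change) and after an index is added (a physical change).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ssn TEXT, name TEXT)")
conn.execute("INSERT INTO student VALUES ('345', 'smith')")
conn.execute("CREATE VIEW v1 AS SELECT ssn FROM student")

def read_v1():
    """The 'application': it only knows about the view v1."""
    return conn.execute("SELECT ssn FROM v1").fetchall()

before = read_v1()

# Logical change: a new attribute (hypothetical column) on the table.
conn.execute("ALTER TABLE student ADD COLUMN email TEXT")
# Physical change: a new access path (hypothetical index name).
conn.execute("CREATE INDEX idx_student_ssn ON student (ssn)")

after = read_v1()
print(before == after)  # the application sees the same answer
```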
Database users
§ ‘naive’ users
§ casual users
§ application programmers
§ [DBA (Database administrator)]
Casual users
select * from student
[Figure: the query is sent to the DBMS, which accesses the data and the meta-data (= catalog)]
‘Naive’ users
Pictorially:
[Figure: the user works through an app. (e.g., report generator), which talks to the DBMS, which accesses the data and the meta-data (= catalog)]
App. programmers
§ those who write the applications (like the ‘report generator’)
DB Administrator (DBA)
§ Duties?
DB Administrator (DBA)
§ schema definition (‘logical’ level)
§ physical schema (storage structure, access methods)
§ schema modifications
§ granting authorizations
§ integrity constraint specification
Overall system architecture
§ [Users]
§ DBMS
– query processor
– storage manager
– transaction manager
§ [Files]
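[Figure: overall system architecture. Users: naive, app. programmer, casual, DBA. Query processor: emb. DML pre-compiler, DML processor, DDL interpreter, app. pgm (object code), query evaluation engine. Storage manager: transaction manager, buffer manager, file manager. Files: data, meta-data.]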
Overall system architecture
§ query processor
– DML compiler
– embedded DML pre-compiler
– DDL interpreter
– Query evaluation engine
Overall system architecture (cont’d)
§ storage manager
– authorization and integrity manager
– transaction manager
– buffer manager
– file manager
Overall system architecture (cont’d)
§ Files
– data files
– data dictionary = catalog (= meta-data)
– indices
– statistical data
Some examples:
§ DBA doing a DDL (data definition language) operation, e.g.,
create table student ...
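
A DDL statement changes the catalog (meta-data), not the data. The sketch below uses Python's sqlite3 module, where the catalog is exposed as the built-in table sqlite_master; other DBMSs expose their catalogs differently. The column list (ssn, name) follows the STUDENT(ssn, name) example from the earlier slide, while the column types and the PRIMARY KEY are assumptions, since the slide leaves the rest of the statement elided.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def catalog_tables():
    """List the table names currently recorded in SQLite's catalog."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]

tables_before = catalog_tables()
# The DDL operation: types and key are illustrative assumptions.
conn.execute("CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT)")
tables_after = catalog_tables()

print(tables_before)  # prints []
print(tables_after)   # prints ['student']
```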
Some examples:
§ casual user, asking for an update, e.g.:
update student
set name = 'smith'
where ssn = '345'
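
The same update can be issued from a program. This is a minimal sketch with Python's sqlite3 module; the sample rows and the helper name rename_student are made up for the illustration. The values are passed as parameters (the ? placeholders) rather than pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT)")
# Hypothetical starting data.
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("345", "jones"), ("123", "granger")],
)

def rename_student(ssn, new_name):
    """Run the update in its own transaction; return the number of rows changed."""
    with conn:
        cur = conn.execute(
            "UPDATE student SET name = ? WHERE ssn = ?", (new_name, ssn)
        )
    return cur.rowcount

changed = rename_student("345", "smith")
name_now = conn.execute(
    "SELECT name FROM student WHERE ssn = '345'"
).fetchone()[0]
print(changed, name_now)  # prints 1 smith
```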
Some examples:
§ app. programmer, creating a report, e.g.
main() {
  ....
  exec sql "select * from student"
  ...
}
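
For comparison with the embedded-SQL fragment above, here is a sketch of a tiny report generator in Python using the sqlite3 module. The function names make_demo_db and student_report, the sample rows, and the report layout are all assumptions for illustration. The application programmer writes this once; a 'naive' user just runs it.

```python
import sqlite3

def make_demo_db():
    """Build a small in-memory database with hypothetical student rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (ssn TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO student VALUES (?, ?)",
        [("123", "granger"), ("345", "smith")],
    )
    return conn

def student_report(conn):
    """Format every student row as one line of a plain-text report."""
    lines = ["ssn  name"]
    for ssn, name in conn.execute("SELECT * FROM student ORDER BY ssn"):
        lines.append(f"{ssn}  {name}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(student_report(make_demo_db()))
```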
[Figure: overall system architecture, with the program source, pgm (src), added alongside the embedded DML pre-compiler]
Some examples:
§ ‘naive’ user, running the previous app.
Conclusions
§ (relational) DBMSs: electronic record keepers
§ customize them with create table commands
§ ask SQL queries to retrieve info
Conclusions cont’d
main advantages over (flat) files & scripts:
§ logical + physical data independence (i.e., flexibility of adding new attributes, new tables and indices)
§ concurrency control and recovery