Bringing together the data mining, data science and analytics
community.
ACM: Association for Computing Machinery is the world’s largest educational and
scientific computing society with the highest reputation as a professional organization.
SIGKDD: Special Interest Group on Knowledge Discovery and Data Mining.
Previous Courses
• Introduction to Spark
• Advanced Hadoop Based Machine Learning
• Hadoop Based Machine Learning
• MapReduce Design Patterns
Austin SIGKDD Chapter-officers:
Chance Coble
chancecoble@gmail.com
Robert Chong
robert.j.chong@gmail.com
Omar Odibat
omarodibat@gmail.com
Machine Learning with Python
Date    Topic                                                 Instructor
22-Apr  1: GETTING STARTED WITH PYTHON MACHINE LEARNING       Chance Coble
29-Apr  2: LEARNING HOW TO CLASSIFY WITH REAL-WORLD EXAMPLES  Roger Huang
6-May   3: CLUSTERING – FINDING RELATED POSTS                 Misty Nodine
13-May  4: TOPIC MODELING                                     Christine Doig
20-May  5: CLASSIFICATION I                                   Mark Landry
27-May  6: CLASSIFICATION II – SENTIMENT ANALYSIS             Jennifer Davis
3-Jun   7: REGRESSION – RECOMMENDATIONS                       Jessica Williamson
10-Jun  8: REGRESSION – RECOMMENDATIONS IMPROVED              Omar Odibat
17-Jun  9: COMPUTER VISION – PATTERN RECOGNITION              Ed Solis
24-Jun  10: DIMENSIONALITY REDUCTION I                        Jessica Williamson
1-Jul   11: DIMENSIONALITY REDUCTION II                       Robert Chong
8-Jul   12: BIG(GER) DATA                                     Chance Coble
An official ACM certificate will be awarded to those who complete the
course (attending at least 8 sessions).
Introduction to Machine
Learning with Python
Getting Started with Machine Learning and Python
Outline
Introduction to ATX ACM SIGKDD, the Course, and
Instructors
Introduction to Python (Interactive)
Getting Started with Machine Learning (Lecture)
Q&A
Chapter Walk-through
Part I: Getting Started with
Python
Python Introduction
Python is a high-level, general-purpose programming
language
Compiled to byte-code, then interpreted
Other implementations include Jython and IronPython
Strongly and dynamically typed
Object-oriented
Functional features
And here it goes:
    print("Hello World")
Python Continued
Comments:
    # This is commented
    ''' More than
    one line '''
Variables
    a = 0          # Interpreter determines this is an integer
    b = "Hello"    # Interpreter determines this is a string
Casting:
    int(x)
    float(x)
    str(x)
    type(x)        # handily returns type of variable x
Python: Operators
    print(3 + 4)
    print(3 - 4)
    print(3 * 4)
    print(3 / 4)
    print(3 % 2)
    print(3 ** 4)   # 3 to the 4th
    print(3 // 4)   # Floor division
Python: Guards
    a = 20
    if a >= 22:
        print("if")     # Note the spaces before 'print': important!
    elif a >= 21:
        print("elif")
    else:
        print("else")
Python Functions
    def someFunction():
        print("boo")    # Again with the spaces

    someFunction()

    def someOtherFunction(a, b):
        print(a + b)

    someOtherFunction(12, 451)
Python: Iteration
    for a in range(1, 3):
        print(a)

    a = 1
    while a < 10:
        print(a)
        a += 1
Python: Strings
    text = "a string"   # named 'text' so the built-in str() is not shadowed
    text.count('x')
    text.find('x')
    text.lower()
    text.upper()
    text.replace('a', 'b')
    text.strip()
    print(text[1:3])
    print(text[:-1])
Python: Lists
    sampleList = [1, 2, 3, 4, 5, 6, 7, 8]
    for a in sampleList:
        print(a)
    sampleList.append(9)
    sampleList.count(2)
    sampleList.index(5)
    sampleList.pop()
    sampleList.remove(7)
    sampleList.reverse()
    sampleList.sort()
Python: Tuples and Dictionaries
• Tuples are immutable (unlike lists)
    myTuple = (1, 2, 3)
    a, b, c = myTuple
• Dictionaries store key-value pairs
    dictExample = {'someItem': 20, 'other': 100}
    dictExample['newItem'] = 400
    for a in dictExample:
        print(a)
    print(dictExample['someItem'])
Classes
    class Calculator(object):
        # define class to simulate simple calculator
        def __init__(self):
            # start with zero
            self.current = 0

        def add(self, amount):
            # add number to current
            self.current += amount

        def getCurrent(self):
            return self.current

    myCalc = Calculator()
    myCalc.add(2)
    print(myCalc.getCurrent())
Part II: Getting Started with
Machine Learning
Problem Setup
Programs for well-defined processes that map input to output
Create invoices from an accounting system
Compute basic statistics on a set of data records
Compile programs into machine level code
Programs for scenarios where we have only the input and output
Find the baby in the picture
Drive a car
Rank a set of documents by importance
Transcribe speech
Some of these the human brain does well
Problem Setup
Entity Analytics Records
For each entity for which we want to perform a mapping,
create a column of values associated with that entity
We will call these: features
For each entity you could have many features
We need to construct a program Φ so that an architecture
m(Φ,x) yields the best answer, where
x: a set of features which have an ideal target value
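As a rough illustration of the setup above, the sketch below turns a few entity records into feature vectors x with target values, and calls a placeholder architecture m(Φ, x). The customer records, the feature names, the thresholded weighted sum used for m, and the hand-picked Φ are all assumptions made for this example; nothing here is learned yet.

```python
# Hypothetical entities (customers): one record per entity.
records = [
    {"age": 34, "visits": 12, "is_member": True, "target": 1},
    {"age": 51, "visits": 0, "is_member": False, "target": 0},
    {"age": 27, "visits": 3, "is_member": True, "target": 1},
]

FEATURES = ["age", "visits", "is_member"]  # assumed feature names


def to_features(record):
    """Turn one entity record into a numeric feature vector x."""
    return [float(record[name]) for name in FEATURES]


def m(phi, x):
    """Placeholder architecture: weighted sum of features, thresholded at zero."""
    score = sum(weight * value for weight, value in zip(phi, x))
    return 1 if score > 0 else 0


X = [to_features(r) for r in records]   # one feature vector per entity
y = [r["target"] for r in records]      # the ideal target values

phi = [0.0, 0.1, 1.0]                   # hand-picked parameters, not learned
predictions = [m(phi, x) for x in X]
print(predictions, y)
```

The point of the sketch is only the shape of the problem: learning means finding Φ automatically so that m(Φ, x) matches the targets.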
Learning is Representation,
Optimization & Evaluation
m in this case is our representation
That is the language in which our model will be built
Line: y = m * x + b
Rules (if x then y)
Decision Tree
Optimization allows you to improve your parameters
Evaluation (Ch 2): determines if one model is better than
another
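The three pieces can be sketched together for the line representation. This is a minimal example under assumed choices: the data points, the starting guess, gradient descent as the optimizer, and mean squared error as the evaluation measure are all picked for illustration and are not prescribed by the slides.

```python
def line(phi, x):
    """Representation: a line, with parameters phi = (slope, intercept)."""
    slope, intercept = phi
    return slope * x + intercept


def mean_squared_error(phi, data):
    """Evaluation: average squared gap between prediction and target."""
    return sum((line(phi, x) - y) ** 2 for x, y in data) / len(data)


def optimize(data, steps=2000, learning_rate=0.05):
    """Optimization: gradient descent on the mean squared error."""
    phi = (0.0, 0.0)
    n = len(data)
    for _ in range(steps):
        residuals = [(line(phi, x) - y, x) for x, y in data]
        grad_slope = sum(2 * r * x for r, x in residuals) / n
        grad_intercept = sum(2 * r for r, _ in residuals) / n
        phi = (phi[0] - learning_rate * grad_slope,
               phi[1] - learning_rate * grad_intercept)
    return phi


# Hypothetical data generated from y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]

guess = (1.0, 0.0)          # an arbitrary hand-picked model
learned = optimize(data)    # parameters improved by optimization

# Evaluation tells us which of the two models is better.
print(mean_squared_error(guess, data), mean_squared_error(learned, data))
```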
Achieving Generalization
Easy to get 0 error in training
novice mistake!
Suppose your optimization creates a dictionary of inputs and their
outputs, and then for a new point outputs the result for the
closest input
This is an extreme example of overfitting
Overfitting is constantly your enemy in machine learning
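The dictionary "learner" described above is easy to write down. The sketch below uses made-up one-dimensional data in which the true rule is "label 1 when x >= 5" and two training labels are deliberately flipped as noise; the data and function names are assumptions for illustration.

```python
def fit_memorizer(xs, ys):
    """'Optimization' that just stores every training input with its output."""
    return dict(zip(xs, ys))


def predict(memory, x):
    """Answer with the stored output of the closest training input."""
    closest = min(memory, key=lambda seen: abs(seen - x))
    return memory[closest]


def error_rate(memory, xs, ys):
    """Fraction of points whose prediction differs from the true label."""
    wrong = sum(predict(memory, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


# Hypothetical training data: labels at x=2 and x=7 are noisy (flipped).
train_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Hypothetical new points, labeled by the true rule.
test_x = [0.4, 1.6, 2.2, 3.4, 4.4, 5.4, 6.6, 7.2, 8.4, 9.4]
test_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

memory = fit_memorizer(train_x, train_y)
train_error = error_rate(memory, train_x, train_y)
test_error = error_rate(memory, test_x, test_y)
print(train_error, test_error)  # 0.0 on training data, 0.4 on new points
```

The memorizer reproduces the noise perfectly, which is exactly why its zero training error says nothing about new points.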
Generalization
See low error in training? Be skeptical!
Have records with 100 boolean features?
Let’s say you have 1 million records to learn from
Assuming distinct records, you have covered only 10^6 of the 2^100
possible inputs, leaving 2^100 - 10^6 unseen!
Requires domain knowledge in your representation
The goal of optimization is not to get zero error, it is to get
the right kind of error
Nothing does better than everything else on all data
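The coverage arithmetic for the 100-boolean-feature example can be checked directly, since Python integers have arbitrary precision. The variable names are illustrative only.

```python
from fractions import Fraction

n_features = 100          # boolean features per record
n_records = 10 ** 6       # distinct training records

space = 2 ** n_features           # number of possible inputs
unseen = space - n_records        # inputs the learner has never seen
coverage = Fraction(n_records, space)

print(space)             # 1267650600228229401496703205376
print(unseen)
print(float(coverage))   # roughly 7.9e-25 of the space is covered
```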
Generalization
Watch out for these claims
Asymptopia: When carried out to infinity our
optimization approach is guaranteed to find the
minimum
Any function is representable in our model
Representable is not learnable!
More data trumps more complex learners
Training and Testing
Back to the Entity Analytics Record
You will have a set of data for your entities that should have an
output from m
Training set: Optimize Φ based on the data you provide
Testing set: Evaluate m(Φ,x) for all x's in a representative set
held out from training
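A minimal sketch of such a split, using only the standard library. The function name, the 25% test fraction, the fixed seed, and the toy records are assumptions for illustration; libraries such as scikit-learn offer their own helpers for this.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records, then hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (feature, target) pairs.
records = [(x, 2 * x + 1) for x in range(20)]

train, test = train_test_split(records)
print(len(train), len(test))  # 15 5
```

Φ is optimized on `train` only; `test` is touched once, at evaluation time.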
Dimensionality isn’t always
intuitive
Back to our example
A set of 100 boolean features spans a space of 2^100 points; our
data set of 1,000,000 records leaves 2^100 - 10^6 of them unseen
The size of the space grows much faster than our data can
cover
Even stranger still: most of the volume of a high-dimensional
orange is in the skin, not the pulp
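The orange claim follows from the fact that the volume of a ball of radius r in d dimensions scales as r^d. A small sketch, assuming a unit-radius orange whose skin is the outer 10% of the radius (the thickness is an arbitrary choice for illustration):

```python
def skin_fraction(dimensions, skin_thickness=0.1):
    """Fraction of a unit ball's volume lying within skin_thickness of its surface."""
    pulp_radius = 1.0 - skin_thickness
    return 1.0 - pulp_radius ** dimensions


for d in (1, 2, 3, 10, 100):
    print(d, round(skin_fraction(d), 5))
```

In 3 dimensions about 27% of the volume is skin; in 100 dimensions it is more than 99.99%.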
Feature Design
Often features are “engineered”
Flags of other properties
Calculations of other features
Reductions of other features
As a practitioner, most of your time will be spent engineering
features
High hopes exist to automate this process one day (i.e. it’s
holy grail stuff)
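The three kinds of engineered feature listed above might look like this for a single record. The record layout and feature names are hypothetical, chosen only to show one flag, one calculation, and one reduction.

```python
def engineer_features(record):
    """Derive new features from the raw fields of one hypothetical customer record."""
    purchases = record["purchases"]
    total_spend = sum(purchases)
    return {
        # Flag of another property
        "is_repeat_customer": len(purchases) > 1,
        # Calculation from other features
        "spend_per_day": total_spend / record["days_active"],
        # Reduction of other features (many values down to one)
        "max_purchase": max(purchases),
    }


record = {"purchases": [20.0, 35.0, 5.0], "days_active": 30}
features = engineer_features(record)
print(features)
```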
Ensembles
Many models are better than one
Bagging
Boosting
Stacking
Trend is toward larger ensembles
The Netflix Prize winner and runner-up both used ensemble
approaches
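As one concrete case, bagging trains each model on a bootstrap sample (drawn with replacement) and combines the models by majority vote. The sketch below assumes a very simple base learner, a one-dimensional threshold rule, plus toy data, an ensemble size of 25, and a fixed seed; all of these are illustrative choices.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold t with the fewest errors for 'predict 1 when x >= t'."""
    candidates = sorted({x for x, _ in points})

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in points)

    return min(candidates, key=errors)


def fit_bagged_stumps(points, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap sample of the training points."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]  # draw with replacement
        models.append(fit_stump(sample))
    return models


def predict(models, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x >= t) for t in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: label is 1 when x >= 5.
points = [(x, int(x >= 5)) for x in range(10)]

models = fit_bagged_stumps(points)
print(predict(models, -1), predict(models, 9))
```

Boosting and stacking differ in how the models are trained and combined, but the idea of pooling many models is the same.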
References
Pedro Domingos: A Few Useful Things to Know about Machine
Learning
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Building Machine Learning Applications
Elements of Statistical Learning
Pattern Classification
Handbook of Statistical Analysis and Data Mining
Python Tutorial:
http://www.afterhoursprogramming.com/tutorial/Python/Classes/