Download Steven F. Ashby Center for Applied Scientific Computing Month DD

Document related concepts

Cluster analysis wikipedia , lookup

Transcript
Discovery of Patterns in the
Global Climate System using Data Mining
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~kumar
Collaborators:
G. Karypis, S. Shekhar, M. Steinbach, P.N. Tan (AHPCRC),
C. Potter, (NASA Ames Research Center),
S. Klooster (California State University, Monterey Bay).
This work was partially funded by NASA and Army High Performance Computing Center
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
Research Goals
Ocean and Land Temperature (Jan 1982)
Research Goals:

Find global climate patterns of
interest to Earth Scientists
A key interest is finding
connections between the
ocean and the land.
NPP
.
Pressure
NPP
.
Pressure
.
Precipitation
Precipitation
SST
SST
Latitude
grid cell
© V. Kumar
Longitude
Time

Global snapshots of values for
a number of variables on land
surfaces or water.

Monthly over a range of 10 to
50 years.
zone
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Sources of Earth Science Data

Before 1950, very sparse, unreliable data.

Since 1950, reliable global data.
– Ocean temperature and pressure are based on data from
ships.
– Most land data, (solar, precipitation, temperature and
pressure) comes from weather stations.

Since 1981, data has been available from earth
orbiting satellites.
– FPAR, a measure related to plants and greenness

Since 1999 TERRA, the flagship of the NASA earth
observing system, is providing much more detailed
data.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Importance of Global Climate Patterns

The climate of the Earth’s land surface is
strongly influenced by the behavior of the
Earth’s oceans.
– El Nino is the anomalous warming of the eastern tropical
region of the Pacific.
– Associated with droughts in Australia and Southern Africa
and heavy rainfall along the western coast of South America.
El Nino Events
Sea Surface
Temperature
Anomalies off
Peru
(ANOM 1+2)
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Importance of Global Climate Patterns and NPP

Net Primary Production (NPP) is the net
assimilation of atmospheric carbon dioxide (CO2)
into organic matter by plants.
– NPP is driven by solar radiation and can be constrained by
precipitation and temperature.

Keeping track of NPP is important because it
includes the food source of humans and all other
organisms.
– Sudden changes in the NPP of a region can have a direct
impact on the regional ecology.

NPP is impacted by global climate patterns.
– Precipitation and temperature are directly affected by global
climate patterns such as El Nino.
– Solar radiation is affected indirectly by cloudiness.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Role of Statistics and Data Mining

Previously Earth scientists have relied on
statistical techniques.
– Hypothesize-and-test paradigm is extremely laborintensive.

Data mining provides earth scientist with tools
that allow them to spend more time choosing
and exploring interesting families of
hypotheses.
– By applying the proposed data mining techniques,
some of the steps of hypothesis generation and
evaluation will be automated, facilitated and improved.

However, statistics is needed to provide
methods for determining the “statistical”
significance of results.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Patterns of Interest

Zone Formation
– Find regions of the land or ocean which have similar
behavior.

Teleconnections
– Teleconnections are the simultaneous variation in climate
and related processes over widely separated points on
the Earth.

Associations
– Find relations between climate events and land cover.

River Discharge
– Relationship between water discharged from a river and
precipitation, climate, and man.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Clustering for Zone Formation



Interested in relationships between regions, not
“points.”
For ocean, clustering based on SST (Sea
Surface Temperature) or SLP (Sea Level
Pressure).
For land, clustering based on NPP or other
variables, e.g., precipitation, temperature.
– Typically we work with the points.

When “raw” NPP and SST are used, clustering
can find seasonal patterns.
– Anomalous regions have plant growth patterns which
reversed from those typically observed in the hemisphere
in which they reside, and are easy to spot.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
K-Means Clustering of Raw NPP and Raw SST
(Num clusters = 2)
Clusters for Raw SST and Raw NPP
90
60
Land Cluster 2
latitude
30
Land Cluster 1
0
Ice or No NPP
-30
Sea Cluster 2
-60
Sea Cluster 1
-90
-180
-150
-120
-90
-60
-30
0
30
longitude
© V. Kumar
60
90
120
150
180
Cluster
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
K-Means Clustering of Raw NPP and Raw SST
(Num clusters = 2)
Land Cluster Cohesion:
North = 0.78, South = 0.59
Ocean Cluster Cohesion: North = 0.77, South = 0.80
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
K-Means Clustering of Raw NPP and Raw SST
(Num clusters = 6)
Clusters for Raw SST and Raw NPP
90
Land Cluster 1
60
Land Cluster 2
Land Cluster 3
Land Cluster 4
30
latitude
Land Cluster 5
Land Cluster 6
0
Ice or No NPP
Sea Cluster 6
-30
Sea Cluster 5
Sea Cluster 4
Sea Cluster 3
-60
Sea Cluster 2
Sea Cluster 1
-90
-180 -150 -120 -90 -60 -30 0
30
longitude
© V. Kumar
60
90
120 150 180
Cluster
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Preprocessing

Time series preprocessing issues
– Need to remove seasonality

Earth scientists mostly interest in anomalies
– Need to remove most of the autocorrelation

Statistical test are affected
– Need to remove trends

Normally want to detect patterns and trends separately
– Normally interested in similarity once differences
in means and scale have been considered.

© V. Kumar
Pearson’s correlation coefficient has this property
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Sample NPP Time Series
Minneapolis
Correlations between time series
Minneapolis
© V. Kumar
Atlanta
Sao Paolo
Minneapolis
1.0000
0.7591
-0.7581
Atlanta
0.7591
1.0000
-0.5739
Sao Paolo
-0.7581
-0.5739
1.0000
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Seasonality Accounts for Much Correlation
Minneapolis
Normalized
using monthly
Z Score:
Subtract off
monthly mean
and divide by
monthly
standard
deviation
Correlations between time series
Minneapolis
© V. Kumar
Atlanta
Sao Paolo
Minneapolis
1.0000
0.0492
0.0906
Atlanta
0.0492
1.0000
-0.0154
Sao Paolo
0.0906
-0.0154
1.0000
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Removing Seasonality Removes Most Autocorrelation
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Preprocessing: Removing Trends
Two Random Time Series (corr = 0.0123)
Two Random Time Series With Linear Trend (corr = 0.1669)
1
1.4
0.9
1.2
0.8
0.7
1
0.6
0.8
0.5
0.6
0.4
0.3
0.4
0.2
0.2
0.1
0
0
50
100
150
0
0
50
100
150
A slight linear trend added to two random time series increases their
correlation dramatically, from 0.01 to 0.17.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Ocean Climate Indices:
Connecting the Ocean and the Land

An OCI is a time series of temperature or
pressure
– Based on Sea Surface Temperature (SST) or Sea
Level Pressure (SLP)

OCIs are important because
– They distill climate variability at a regional or
global scale into a single time series.
– They are well-accepted by Earth scientists.
– They are related to well-known climate
phenomena such as El Niño.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
17
Ocean Climate Indices – ANOM 1+2

ANOM 1+2 is associated with El Niño and La Niña.
El Nino Events

Defined as the Sea Surface Temperature (SST) anomalies
in a regions off the coast of Peru

El Nino is associated with
– Droughts in Australia and Southern Africa
– Heavy rainfall along the western coast of South America
– Milder winters in the Midwest
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Connection of ANOM 1+2 to Land Temp
Correlation Between ANOM 1+2 and Land Temp (>0.2)
90
0.8
0.6
60
0.4
30
latitude
0.2
0
0
-0.2
-30
-0.4
-60
-0.6
-0.8
-90
-180 -150 -120 -90
-60
-30
0
30
60
90
120
150
180
longitude
OCIs capture teleconnections, i.e., the simultaneous variation in climate
and related processes over widely separated points on the Earth.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Ocean Climate Indices - NAO

The North Atlantic Oscillation (NAO) is associated with
climate variation in Europe and North America.
North Atlantic Oscillation
3
Iceland
2
Azores
1
0
-1
-2
82 83 84 85 86 87 88 89 90 91 92 93 94

Normalized pressure differences between Ponta Delgada,
Azores and Stykkisholmur, Iceland.

Associated with warm and wet winters in Europe and in
cold and dry winters in northern Canada and Greenland

The eastern US experiences mild and wet winter
conditions.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Connection of NAO to Land Temp
Correlation Between NAO and Land Temperature (>0.3)
90
1
0.8
60
0.6
0.4
latitude
30
0.2
0
0
-0.2
-30
-0.4
-0.6
-60
-0.8
-1
-90
-180 -150
-120
-90
-60
-30
0
30
60
90
120
150
180
Correlation
longitude
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Influence of OCI on Land – Area Weighted Correlation

Correlation of an OCI with a land variable is a
standard way to evaluate its “influence.”
– Correlation does not imply causality.
– Temperature and precipitation are the typical land
variables.


If relatively many land points have a relatively
high correlation, then an OCI is influential.
To evaluate whether clusters (or pairs) are
potential OCIs we compute their area
weighted correlation.
– Weighted average of the correlation with land
points, where weight is based on area.
– May exclude points whose correlation is low and
then calculate area weighted correlation.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Evaluation of Known OCIs via Area Weighted Correlation
Total weighted area correlation
0.2
0.15
0.1
0.05
0
SOI
NAO
AO
PDO
QBO
CTI
WP
Known Indices
ANOM12 ANOM3 ANOM4 ANOM34
Area Weighted Correlation of Known OCIs to Land Temp
Overlapping, threshold = 0
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Evaluation of Known OCIs via Area Weighted Correlation …
Min corr = 0.1
Min corr = 0.2
0.2
0.125
0.1
0.15
0.075
0.1
0.05
0.05
0
0.025
1
2
3
4
5
6
7
8
9 10 11
0
1
2
3
Min corr = 0.25
4
5
6
7
8
9 10 11
Min corr = 0.3
0.1
0.08
0.08
0.06
0.06
0.04
0.04
0.02
0.02
0
1
2
3
4
5
6
7
8
9 10 11
0
1
2
3
4
5
6
7
8
9 10 11
Area weighted correlation declines as we consider only land points
whose temperature correlates with the OCI above a given threshold.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Discovering OCIs via Data Mining

Earth scientists have discovered currently
known OCIs.
– Observation
– Eigenvalue techniques such as Principal
Components Analysis (PCA) and Singular Value
Decomposition (SVD).

Clustering provides an alternative approach.
– Clusters represent ocean regions with relatively
homogeneous behavior.
– The centroids of these clusters are time series that
summarize the behavior of these ocean areas,
and thus, represent potential OCIs.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Finding Influential Ocean Regions


Not all points on the ocean correlate well with
land variables such as temperature and
precipitation.
Best points are those which have a high
“density”
– Dense points are relatively homogenous with respect
to their neighboring points.
Sorted weighted area correlation for Hi and Lo density points
0.2
Total weighted area correlation
0.18
Hi
Lo
0.16
0.14
0.12
0.1
0.08
0.06
© V. Kumar
0
50
100
150
200
250
300
350
400
450
500
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Discovery of Ocean Climate Indices

Use clustering to find areas of the oceans that have
high density, I.e., relatively homogeneous behavior.
– Cluster centroids are potential OCIs.
– For SLP pairs of cluster centroids are potential OCIs.

Evaluate the “influence” of potential OCIs on land
points.

Determine if the potential OCI matches a known OCI.

For potential OCIs that are not well-known, conduct
further evaluation.
– Are there land points that have higher correlation for the
potential OCI than for known indices?
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
SST Clusters
107 SNN Clusters for Detrended Monthly Z SST (1958-1998)
90
60
latitude
30
0
-30
-60
-90
-180
-150
-120
-90
-60
-30
0
30
60
90
120
150
180
longitude
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Evaluating Cluster Centroids as Potential OCIs

Evaluation will be based on area weighted correlation
– Ignore clusters who area weighted correlation is low.

Three cases:
– Clusters are highly similar to known OCIs (corr > 0.4)



May represent a known OCI
Clusters may be “better,” i.e., higher coverage
Clusters may cover different area, i.e., some points for
which the new OCI is a better predictor
– Clusters are moderately similar to known OCIs
( 0.25 < corr < 0.4 )
 Again, new OCIs may be better predictors for some
points.
– Clusters are not similar to known OCIs (corr < 0.25)
 These clusters may represent as yet undiscovered
Earth Science phenomena.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
SST Clusters Highly Correlated to Known Indices
Total weighted area correlation
0.2
0.18
0.16
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
1 8 9 11 17 20 24 25 28 29 31 36 37 57 58 59 67 68 75 78 79 80 81 83 87 92 94 95 96 97 99
Candidate indices
Area Weighted Correlation of Cluster Centroids to Land Temp
Overlapping, threshold = 0
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
SST Clusters Highly Correlated to Known Indices …
Min corr = 0.20
Min corr = 0.10
0.18
0.14
0.16
0.12
0.14
0.1
0.12
0.1
0.08
0.08
0.06
0.06
0.04
0.04
0.02
0.02
0
0
1 8 9 11172024252829313637575859676875787980818387929495969799
1 8 9 11172024252829313637575859676875787980818387929495969799
Min corr = 0.25
Min corr = 0.30
0.1
0.09
0.09
0.08
0.08
0.07
0.07
0.06
0.06
0.05
0.05
0.04
0.04
0.03
0.03
0.02
0.02
0.01
0.01
0
0
1 8 9 11172024252829313637575859676875787980818387929495969799
© V. Kumar
1 8 9 11172024252829313637575859676875787980818387929495969799
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
SST Clusters that Correspond to El Nino Climate Indices
EL Nino Related SST Clusters
90
60
latitude
30
0
75
78 67 94
Niño
Region
Range
Longitude
Range
Latitude
1+2 (94)
90°W-80°W
10°S-0°
3 (67)
150°W-90°W
5°S-5°N
3.4 (78) 170°W-120°W
5°S-5°N
4 (75) 160°E-150°W
5°S-5°N
El Nino Regions Defined
by Earth Scientists
-30
-60
-90
-180 -150 -120 -90
-60
-30
0
30
longitude
60
90
120
150
180
SNN clusters of SST that are highly correlated
with El Nino indices, ~ 0.93 correlation.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
SST Clusters Highly Correlated to Known Indices …
Examples of some SST clusters that are highly correlated to known
OCIs and have high area weighted correlation with land temperature.
These indices have a significant correlation with El Nino indices.
Cluster 29
Cluster 11
90
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-90
-180
-140
-100
-60
-20
20
60
100
140
180
-90
-180
-140
-100
-60
Cluster 31
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
© V. Kumar
-140
-100
-60
-20
20
20
60
100
140
180
60
100
140
180
Cluster 37
90
-90
-180
-20
60
100
140
180
-90
-180
-140
-100
-60
-20
20
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
SST Clusters Highly Correlated to Known Indices
However, there are areas (yellow) where these clusters correlate better.
Cluster 11 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
Cluster 29 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
90
90
0.6
0.5
70
70
0.4
50
0.3
0.2
30
0.4
50
30
0.2
0.1
10
10
0
-10
0
-10
-0.1
-30
-0.2
-0.3
-50
-0.2
-30
-50
-0.4
-0.4
-70
-70
-0.5
-0.6
-90
-180
90
Cluster -100
31 - SOI ANOM12
ANOM3
ANOM4
-140
-60
-20
20 ANOM34
60 (mincorr
100 = 0.20)
140
-90
-180
90
180
Cluster -100
37 - SOI ANOM12
ANOM3
ANOM4
-140
-60
-20
20 ANOM34
60 (mincorr
100 = 0.20)
140
180
0.4
70
0.4
70
50
50
0.2
30
0.2
30
10
10
0
-10
0
-10
-30
-30
-0.2
-0.2
-50
-50
-70
-70
-0.4
-0.4
-90
-180
© V. Kumar
-140
-100
-60
-20
20
60
100
140
180
-90
-180
-140
-100
-60
-20
20
60
Discovery of Patterns in the Global Climate System using Data Mining
100
140
180
‹#›
SST Clusters Highly Correlated to Known Indices
Cluster 29 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
90
0.6
70
0.4
50
30
0.2
10
0
-10
-0.2
-30
-50
-0.4
-70
-0.6
-90
-180
© V. Kumar
-140
-100
-60
-20
20
60
100
140
180
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
SST Cluster Moderately Correlated to Known Indices
Cluster 62
Cluster 62 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
90
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-90
-180
-90
-180
0.6
0.4
0.2
0
-0.2
-0.4
-0.6
-140
© V. Kumar
-100
-60
-20
20
60
100
140
180
-140
-100
-60
-20
20
60
Discovery of Patterns in the Global Climate System using Data Mining
100
140
180
‹#›
Comments from our NASA collaborators
“Ocean cluster results based on SST correlations with land
surface temperature suggest that
– New areas of the ocean may be identified that are
unknown as being highly representative of the El Nino
Southern Oscillation (ENSO) and the Arctic Oscillation
(AO).”
– New predictive indices for land climate over the past 40
years can be identified that will improve upon predictions
using any known ocean climate index to date, including
SOI and AO.”
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Issues in Mining Associations from Earth Science Data
Data is continuous rather than discrete.
 Data has spatial and temporal components.
 Data can be multilevel
– time and spatial granularities.
 Observations are not i.i.d. due to spatial and
temporal autocorrelations.
 Data may contain noise, missing information
and measurement errors
– historical SST data between 1856-1941 is
measured using wooden buckets.
 Data may come from heterogeneous sources
– Calibration issues.

© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Mining Associations in Earth Science Data: Challenges
Transaction
Id
1
2
3
4
5

Items
Bread, Milk
Beer, Diaper, Bread, Eggs
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Bread, Diaper, Milk
How to transform Earth Science data into
transactions?
– What are the “baskets”?
– What are the “items”?
– How to define “support”?
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Mining Associations Patterns in Earth Science Data: Challenges
(Lat,Long,time)
Events
1 FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI
(support count=145, confidence=100%)
(10N,10E,1) {Temp-Hi, Prec-Lo}
(10N,10E,2)
(10N,11E,2)
(10N,11E,5)
(10N,11E,10)

{Temp-Hi,Prec-Lo,NPP-Lo}
{Temp-Hi, NPP-Lo}
{Solar-Hi, NPP-Lo}
{Prec-Hi, PET-LO}
2 FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI
(support count=933, confidence=99.3%)
3 FPAR-HI PET-HI PREC-HI ==> NPP-HI
(support count=1655, confidence=98.8%)
4 FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI
(support count=268, confidence=98.2%)
…
How to efficiently discover
spatio-temporal associations?
– Use existing algorithms.
– Develop new algorithms.

How to identify interesting
patterns?
– Use objective interest measures.
– Use domain knowledge.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Event Definition
Items are events abstracted from time series.
 Events of interest include:
– Temporal events:

Anomalous temporal events such as warmer winters and
droughts.

Changes in the periodic behavior such as longer growing
seasons or earlier month of onset of greenup.

– Spatial events:
Large percentage of land areas in a certain region having
below-average precipitation.

– Spatio-temporal events:

© V. Kumar
Changes in circulation or trajectory of jet-streams.
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Example of Anomalous Event Definition
80
NPP
60
40
20
0
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
2
Z
1
0
-1
-2
1982
1
Events
0.5
0
-0.5
-1
1982
X   month ( X )
Z
 month ( X )
© V. Kumar
If threshold for Z = 1.5, on average, there are ~20 events per
time series.
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Transaction and Support Definitions

Convert the time series into sequence of events
for each spatial location.
time
t1
y
AK
M
t2
A
DF
t3
AB
D
AB
B
AB
D
CM
ABE
G
AB
G
DL
J
BC
E
© V. Kumar
EG
K
BCD
CE
F
EG
M
x
DEF
DK
L
AB
GL
CFM
AB
E
Grid Cell (x,y)
(1,1)
(1,2)
(1,3)
(1,4)
(2,1)
(2,2)
(2,3)
(2,4)
(3,1)
(3,2)
(3,3)
(3,4)
(4,1)
(4,2)
(4,3)
(4,4)
t1

{A, B, D}

{A, K, M}
{B, C, E}


{A, B}

{A, B, G}
{C, M}





Discovery of Patterns in the Global Climate System using Data Mining
t2

{D, L, J}
{A, B, E, G}

{E, G, M}
{C, E, F}

{D, F}





{D, K, L}

{A, B}
t3


{B, C, D}

{C, F, M}
{A, B, G, L}

{A, B, D}

{A, B, E}




{E, G, K}
{D, E, F}
‹#›
Examples of Association Patterns

min support = 0.001%, min confidence=10%
1 FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2 FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%)
3 FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%)
4 FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%)
5 FPAR-HI PET-HI PREC-HI SOLAR-LO TEMP-HI ==> NPP-HI (support count=44, confidence=97.8%)
6 FPAR-LO PET-LO PREC-LO SOLAR-LO ==> NPP-LO (support count=216, confidence=96.9%)
7 FPAR-LO PREC-LO SOLAR-LO TEMP-HI ==> NPP-LO (support count=152, confidence=96.2%)
8 FPAR-LO PET-LO PREC-LO SOLAR-LO TEMP-LO ==> NPP-LO (support count=47, confidence=95.9%)
9 FPAR-LO PREC-LO SOLAR-LO TEMP-LO ==> NPP-LO (support count=49, confidence=94.2%)
10 FPAR-LO PREC-LO SOLAR-LO ==> NPP-LO (support count=595, confidence=93.7%)
…
75 FPAR-HI ==> NPP-HI (support count = 216924, confidence = 55.7%)
NPP = Solar * FPAR *  * Temperature * Moisture
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Example of Interesting Association Patterns
FPAR-Hi ==> NPP-Hi
(sup=5.9%, conf=55.7%)
Shrubland areas
Rule has high support in shrubland areas
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Land Cover Types
Barren
Croplands
90
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-90
-180 -140 -100 -60
-20
20
60
100
140 180
-90
-180 -140 -100 -60
Shrublands/ Grasslands
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-20
20
60
20
60
100
140 180
60
100
140 180
Forests
90
-90
-180 -140 -100 -60
-20
100
140 180
-90
-180 -140 -100 -60
-20
20
Using Land Cover as Additional Features
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI GRASSLAND ==> NPP-HI (support count=145, confidence=100%)
3. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI FOREST ==> NPP-HI (support count=44, confidence=100%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND ==> NPP-HI (support count=44, confidence=100%)
5. FPAR-HI PET-HI PREC-HI SOLAR-HI FOREST ==> NPP-HI (support count=75, confidence=100%)
6. FPAR-HI PET-HI PREC-HI SOLAR-HI CROPLAND ==> NPP-HI (support count=81, confidence=100%)
7. FPAR-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND ==> NPP-HI (support count=58, confidence=100%)
8. FPAR-HI PET-HI PREC-HI TEMP-HI GRASSLAND ==> NPP-HI (support count=376, confidence=99.5%)
9. FPAR-HI PET-HI PREC-HI TEMP-HI CROPLAND ==> NPP-HI (support count=170, confidence=99.4%)
10. FPAR-HI PET-HI PREC-HI CROPLAND ==> NPP-HI (support count=277, confidence=99.3%)
…..
• Produce multiple rules that have the same form:
• {A} ==> {B}, {A,Grassland} ==> B, {A, Cropland} ==> {B}, etc.
• Some of the support counts could be missing if itemsets fall below the
minimum support threshold.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Finding Interesting Earth-Science Patterns

A pattern is interesting if it occurs relatively more
frequently in some homogeneous regions.

If the relative frequency of a pattern is similar in all groups
of land areas, then it is less interesting.
If the pattern occurs mostly in a certain group of land
areas, then it is potentially interesting.

© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Filtering Patterns using Land Cover Types
Grassland
n1
r1
s1
Total grid points
Location support
Transaction support
Barren land
n2
r2
s2
Forests
n3
r3
s3
Croplands
n4
r4
s4
N =  ni
R =  ri
S =  si
Transactio n support 
For each pattern p:
• Actual coverage for land cover type i = si /S
Location support 
• Expected coverage for land cover type i = ni /N
#  x, y , t 
 ( x, y , t  )
#  x, y 
 (  x, y  )
• Ratio of actual to expected coverage for land cover type i,
ei = si N / ni S
k
• Interest Measure
 1   ei log k ei
i 1
• If pattern occurs in arbitrary regions, interest measure will be low.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Interesting Spatial Association Pattern
FPAR-Hi ==> NPP-Hi
• FPAR-Hi  NPP-Hi tends to occur in shrubland
and grassland regions.
• Possible explanation: this type of vegetation can
take advantage of periodically high precipitation
more quickly than forests.
• New hypothesis: FPAR-HI events in these regions
could be related to unusual precipitation
conditions.
Forest (0.1%), Croplands (4.2%),
Grasslands (25.9%), Desert (3.6%)
FPAR-Hi Prec-Hi ==> NPP-Hi
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Interesting Spatial Association Pattern…
Land Cover
• Prec-Hi  NPP-Hi tends to occur in grassland and cropland regions.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Other Interesting Spatial Association Patterns
Support Count
Land Cover
• Temp-Hi  NPP-Hi tends to occur in the forest and cropland regions in the northern
hemisphere
(Forests (33.5%), Grassland(8.7%), Cropland (24.5%), Desert (0.4%) )
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Global River Discharge Data
• Global River Discharge Data
• 30 rivers, 0.5 degree resolution
• Two measurement stations: mouth and source of river system/basin
• Minimum of ten continuous years of monthly station discharge records
• Interesting associations
• e.g., Amazon discharge is highly correlated with ANOM3.4(r = -0.5)
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Relationship Between River Basin PREC and OCI
Amazon
Parana
• Correlation between PREC aggregation on river basins and OCI is shown in left figure
• Interesting Observations:
• Amazon and Parana are nearby, however, the signals to OCI are almost reverse
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Discharge Data: Amazon vs. Parana
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Relationship Between River Basin PREC and OCI ..
Petchora
•Interesting Observations:
• Petchora and Pacific Decadal Oscillation (PDO) are highly correlated.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Correlation between PREC and DISCHARGE
Danube
Amazon
Nile
© V. Kumar
Yenisei
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Correlation between OCI and DIS (r = 0.3)
Amu-Darya
Brahmaputra
Colorado
© V. Kumar
Amazon
Discovery of Patterns in the Global Climate System using Data Mining
Columbia
‹#›
Proposed Framework of River Analysis
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Conclusions

Association rules can uncover interesting
patterns for Earth Scientists to investigate.
– Challenges arise due to spatio-temporal nature of the
data.
– Need to incorporate domain knowledge to prune out
uninteresting patterns.

By using clustering we have made some
progress towards automatically finding climate
patterns that display interesting connections
between the ocean and the land.
– Need to further evaluate candidates for new climate
indices.

Correlation analysis on river discharge data can
be used to evaluate the effects of climate and
man.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Case Studies: Earth Science Data

Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Chris Potter, Steven
Klooster, Alicia Torregrosa, “Clustering Earth Science Data: Goals, Issues and
Results”, Workshop on Mining Scientific Data, KDD 2001, San Francisco, CA,
2001.

Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Steven Klooster, Christopher
Potter, Alicia Torregrosa, “Finding Spatio-Termporal Patterns in Earth Science
Data: Goals, Issues and Results,” Temporal Data Mining Workshop, KDD
2001, San Francisco, CA, 2001.

Vipin Kumar, Michael Steinbach, Pang-Ning Tan, Steven Klooster, Chris
Potter, Alicia Torregrosa, “Mining Scientific Data: Discovery of Patterns in the
Global Climate System”, Joint Statistical Meetings, Atlanta, GA, 2001.

Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Chris Potter, Steven
Klooster, “Data Mining for the Discovery of Ocean Climate Indices”,
Workshop on Mining Scientific Data, SDM 2002.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Statistical Issues

Temporal Autocorrelation
– Makes it difficult to calculate degrees of freedom and determine
significance levels for tests, e.g., non-zero correlation.
– Moving average is nice for smoothing and seeing the overall
behavior, but introduces additional autocorrelation.
– Removal of seasonality removes much of the autocorrelation (as
long as not performed via the moving average).

Measures of time series similarity
– Detecting non-linear connections
– Detecting connections that only exist at certain times.
Sometimes only extreme events have an effect.
– Automatically detecting appropriate time lags.

– Statistical tests for more sophisticated measures.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›
Statistical Issues …

Detecting spurious connections.
– We are performing many correlation
calculations and there is a chance of spurious
correlations.
 Given
that we have ~100,000 locations on the Earth
for which we have time series, how many spuriously
high correlations will we get when we calculate the
correlation between these locations and a climate
index?
– Because of spatial autocorrelation, these
correlations are not independent.
 Again
we have trouble calculating the degrees of
freedom.
© V. Kumar
Discovery of Patterns in the Global Climate System using Data Mining
‹#›