BA 275 Winter 2007 Exploring Data
Hsieh, P-H

Exploring Data: Summary and Outline

Qualitative Data (Categorical Data), e.g. gender, college major, etc.
  Graphical Methods:
    Pie charts
    Bar graphs
  Numerical Methods (Descriptive Statistics):
    Frequency tables

Quantitative Data (Numerical Data), e.g. age, income, SAT scores, etc.
  Graphical Methods:
    Display one variable: histograms, stem-and-leaf displays, dot plots
    Display two variables: scatter plots
    Display one variable over time: time series plots
  Numerical Methods (Descriptive Statistics):
    Measures of Location:
      Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ (the "simple average")
      Median: Arrange the observations in ascending order.
        If n is odd, median = the middle observation.
        If n is even, median = the simple average of the middle two observations.
      Mode: The measurement that occurs most frequently in the data set. It may not be unique, and it may not exist at all.
    Measures of Spread/Variability:
      Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
      Standard deviation: $s = \sqrt{s^2}$
      Range = the largest observation - the smallest observation
      Interquartile range = IQR = Q3 - Q1
    Measures of Relative Standing:
      Percentiles: The pth percentile is a number such that p% of the n observations fall below it and (100-p)% fall above it.
      Quartiles:
        Q1 = QL = the lower quartile = 25th percentile
        Q2 = Median = 50th percentile
        Q3 = QU = the upper quartile = 75th percentile
      Z-scores: $Z = \frac{\text{obs} - \text{mean}}{\text{std}} = \frac{X - \bar{X}}{s}$
        A z-score tells you how many standard deviations an observation lies above or below the mean (the center of the data set).

Boxplot
  [Figure: two box-and-whisker plots, one of Age (axis from 30 to 70) and one of Salary (axis from 2 to 7, X 10000).]

The Empirical Rule: If the observations come from a mound-shaped and symmetric distribution, then:
  1. Approximately 68% of the observations will fall within 1 standard deviation of the mean.
  2. Approximately 95% of the observations will fall within 2 standard deviations of the mean.
  3. Approximately 99.7% of the observations will fall within 3 standard deviations of the mean.
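The Empirical Rule can be checked numerically by converting observations to z-scores and counting how many fall within 1, 2, and 3 standard deviations of the mean. The sketch below is only an illustration, not part of the original handout: the function name `empirical_rule_coverage`, the simulated data (mean 50, standard deviation 10), and the random seed are all assumptions chosen for the example.

```python
import random
import statistics


def empirical_rule_coverage(data):
    """Return the fraction of observations within 1, 2, and 3
    sample standard deviations of the sample mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    z_scores = [(x - mean) / s for x in data]
    return {k: sum(abs(z) <= k for z in z_scores) / len(data) for k in (1, 2, 3)}


# Hypothetical mound-shaped, symmetric data: simulated draws from a
# normal distribution. The seed and parameters are arbitrary.
random.seed(275)
sample = [random.gauss(50, 10) for _ in range(10_000)]

coverage = empirical_rule_coverage(sample)
for k, fraction in coverage.items():
    print(f"within {k} standard deviation(s): {fraction:.1%}")
```

For data that are skewed or have more than one mode, the printed fractions can differ noticeably from 68%, 95%, and 99.7%, which is why the rule is stated only for mound-shaped, symmetric distributions.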
  [Figure: mound-shaped curve with the horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$ (z-scores -3 to 3). Areas from left to right: 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15%. The central regions cover 68%, 95%, and 99.7% of the observations.]

Questions to ask when describing and summarizing data:
  1. Where is the approximate center of the distribution?
  2. Are the observations close to one another, or are they widely dispersed?
  3. Is the distribution unimodal, bimodal, or multimodal? If there is more than one mode, where are the peaks, and where are the valleys?
  4. Is the distribution symmetric? If not, is it skewed? If symmetric, is it bell-shaped?
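The first two questions (center and spread) can be answered numerically with the measures of location and variability from this outline. Here is a minimal sketch using Python's standard `statistics` module. The function name `summarize` and the `ages` data are made up for illustration. Note also that `statistics.quantiles` uses one particular quartile convention, which may give slightly different Q1 and Q3 values than a textbook hand method.

```python
import statistics


def summarize(data):
    """Compute basic measures of location and spread for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "modes": statistics.multimode(data),    # may contain several values
        "variance": statistics.variance(data),  # sample variance, n - 1 divisor
        "std": statistics.stdev(data),          # square root of the variance
        "range": max(data) - min(data),
        "iqr": q3 - q1,
    }


# Hypothetical ages, invented for this example.
ages = [23, 25, 25, 31, 38, 42, 47, 54, 60]

summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median, and the range with the IQR, gives a first impression of whether a few extreme observations are affecting the summary. The questions about modality and shape are usually easier to answer from a histogram or boxplot.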