Integrating Discovery, Development, and Commercial Data into Data Mining
Jennifer Sloan, Data Mining Consultant
GlaxoSmithKline: US Pharma IT
15 September 2004

Data Mining Definition
Data Mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid and accurate predictions.

Data Mining is a tool that allows us to
- Identify problematic areas
- Control process variability
- Make concrete decisions on business needs
- Develop a model which can aid in future business decisions

Commercial Data
- Analyzing Multivariate Data
- Managing Data Usage
- Model Building

Multivariate Data Sets
- Data are multivariate in nature
- Large data sets contain multiple criteria within each observation
- Comparing multiple vectors is nearly impossible without reducing to a single point

[Figure not reproduced] Here we view 5-dimensional information on one observation. Each point represents a prescriber and the color represents a Market Share increase or decrease. Overlapping distributions make this difficult to interpret and further analysis is required. Over 200K observations are represented in this graph.

[Figure not reproduced] The same observations are shown, but now two-way interactions between the variables help us determine which variables are affecting market shifts, which leads to constructing models that predict prescriber behavior.
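To illustrate what a two-way interaction between prescriber variables means, here is a minimal Python sketch. It is not the analysis behind the slides: the field names (`high_volume`, `recent_detail`, `specialist`, `share_up`) and all records are hypothetical, and the "difference of differences" contrast is one simple way to quantify an interaction between two binary variables.

```python
from itertools import combinations, product

# Hypothetical prescriber records (invented for illustration only).
# share_up = 1 means the prescriber's market share increased.
FEATURES = ["high_volume", "recent_detail", "specialist"]
records = [
    {"high_volume": hv, "recent_detail": rd, "specialist": sp,
     "share_up": int(hv == 1 and rd == 1)}  # share rises only when both hold
    for hv, rd, sp in product((0, 1), repeat=3)
]


def increase_rate(rows):
    """Fraction of rows with a market share increase, or None if no rows."""
    if not rows:
        return None
    return sum(r["share_up"] for r in rows) / len(rows)


def interaction_contrast(rows, a, b):
    """Difference of differences for binary variables a and b.

    A value of 0 means the effect of b is the same at both levels of a
    (no two-way interaction); values far from 0 suggest an interaction.
    """
    cell = {
        (i, j): increase_rate([r for r in rows if r[a] == i and r[b] == j])
        for i in (0, 1) for j in (0, 1)
    }
    if any(v is None for v in cell.values()):
        return None  # an empty cell: the contrast is undefined
    return (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0])


# Report the contrast for every pair of variables.
for a, b in combinations(FEATURES, 2):
    print(a, b, interaction_contrast(records, a, b))
```

On real data with 200K observations, a screen like this would only be a first pass before fitting a predictive model.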
Drug Development

Drug Development Issues
- Adverse Event Reporting System (AERS)
  - Over 2 million AE reports, covering approximately 2000 drugs and biologics, submitted to the FDA since 1968
- Creates an extremely complicated matrix of data
- Recently, Data Mining methods have helped address this issue with the development of a method used to examine large databases for associations between drugs and AEs

Data Mining Algorithm
- Multi-Item Gamma Poisson Shrinker (MGPS)
  - Developed by William DuMouchel (AT&T)
  - Through statistical modeling, this Empirical Bayesian method identifies higher-than-expected reporting relationships of drug-event combinations
- Automated, web-based system with rapid drilldown capability
- MGPS runs using all event terms and drugs in the AERS database and produces results for all drug-event combinations

MGPS: Significance
- Handles complex stratification (age, gender, year of report: > 945 categories)
- Performs complex computations in a minimal amount of time: much MORE EFFICIENT

Real World Example
- Membership: PhRMA-FDA Working Group
- Chair: June Almenoff (GSK)
- FDA involvement
- Involved PhRMA companies: Abbott, Allergan, AstraZeneca, Bristol-Myers Squibb, GlaxoSmithKline, Johnson & Johnson, Lilly, Merck, Novartis, Schering-Plough, Pfizer, Roche, Wyeth

Drug Discovery

SCAM: Statistical Classification of Activities of Molecules
- Recursive partitioning customized for chemistry
- Creates a structure-activity relationship (SAR) model
- Handles large numbers of descriptors (> 1 million)

SCAM: Data Structure
[Figure not reproduced: a column of biological activities Y1, Y2, Y3, Y4, ..., Yn (labeled > 100K), a chemical structure drawing, and rows of binary descriptor strings (labeled > 2 million), for example:]
1010111010000000000001
1010011110000000000001
1010111110000100010001
1010011010000010010001
...
1000111101010001000001

SCAM’s Recursive Partitioning
Parent node: n = 1650, ave = 0.34, sd = 0.81
Split on Feature:
- n = 1614, ave = 0.29, sd = 0.73
- n = 36, ave = 2.60, sd = 0.9
$t = \frac{\text{Signal}}{\text{Noise}} = \frac{2.60 - 0.29}{0.734\sqrt{\frac{1}{36} + \frac{1}{1614}}} = 18.68$
rP = 2.03E-70, aP = 1.30E-66

SCAM Tree
[Figure not reproduced]

Advantages of SCAM
- Works for complex situations, mixtures and interactions
- Output is easy to understand and explain
- High statistical power
- Produces a valid answer

SCAM Drawbacks
- Data greedy
- Only one view of the data
- Binary descriptors may be too “crude”
- Disposition of outliers is difficult
- Highly correlated variables may be obscured
- Higher order interactions may be masked

Concluding Remarks
- Data Mining enables us to efficiently handle LARGE amounts of data
- Data Mining allows us to perform analyses IN REAL TIME
- Data Mining covers a wide array of topics in the drug industry, and its benefits are plentiful

References
Almenoff, June S., et al. “Disproportionality Analysis Using Empirical Bayes Data Mining: A Tool for the Evaluation of Drug Interactions in the Post-Marketing Setting.” Pharmacoepidemiology and Drug Safety, 12, 517-521 (2003).
Donahue, Rafe. “An Overview of Data Mining in Drug Development and Marketing.” http://home.earthlink.net/~rafedonahue. May 2003.
Hawkins, D.M. and G.V. Kass. “Automatic Interaction Detection.” Topics in Applied Multivariate Analysis, ed. Hawkins (1982).
Hawkins, D.M., S.S. Young and A. Rusinko. “Analysis of a Large Structure-Activity Data Set Using Recursive Partitioning.” QSAR, 16, 296-302 (1997).
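
Worked check of the split statistic on the “SCAM’s Recursive Partitioning” slide: the minimal Python sketch below recomputes t from the two child-node summaries. It assumes the 0.734 in the slide's formula is the pooled standard deviation of the two child nodes (the pooled formula does reproduce that value from sd = 0.9 and sd = 0.73). The function names are illustrative and are not part of SCAM.

```python
import math


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two groups (classic two-sample t-test)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def split_t(n1, mean1, sd1, n2, mean2, sd2):
    """t = signal / noise for a candidate split into two child nodes."""
    signal = mean1 - mean2
    noise = pooled_sd(n1, sd1, n2, sd2) * math.sqrt(1 / n1 + 1 / n2)
    return signal / noise


# Child-node summaries as given on the slide.
s = pooled_sd(36, 0.9, 1614, 0.73)
t = split_t(36, 2.60, 0.9, 1614, 0.29, 0.73)
print(f"pooled sd = {s:.3f}, t = {t:.2f}")
```

In recursive partitioning, a statistic like this would be computed for every candidate descriptor, with the split chosen from the strongest result. The slide's rP and aP values are not recomputed here.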