2013 IEEE International Conference on Bioinformatics and Biomedicine

An Error Detecting and Tagging Framework for Reducing Data Entry Errors in Electronic Medical Records (EMR) System

Yuan Ling, Yuan An, Mengwen Liu, Xiaohua Hu
College of Computing and Informatics, Drexel University, Philadelphia, USA
{yl638, ya45, ml943, xh29}@drexel.edu

Abstract—We develop an error detecting and tagging framework for reducing data entry errors in Electronic Medical Records (EMR) systems. We propose a taxonomy of data errors with three levels: Incorrect Format and Missing errors, Out of Range errors, and Inconsistent errors. We aim to address the challenging problem of detecting erroneous input values that look statistically normal but are abnormal in a medical sense. Detecting such errors requires taking patient medical history and population data into consideration. In particular, we propose a probabilistic method based on the assumption that the input value for a field depends on the historical records of that field and is affected by other fields through dependency relationships. We evaluate our methods using data collected from an EMR system. The results show that the method is promising for automatic data entry error detection.

Electronic Medical Records (EMR) data errors have been classified as incomplete, inaccurate, or inconsistent [9]. More specifically, based on EMR visit notes, a previous study [10] classifies data entry errors into Inconsistent Information, Miscellaneous Errors, Missing Sections, Incomplete Information, and Incorrect Information. For our case, we conducted a comprehensive literature review and summarized the different levels of data errors in a more detailed manner. The three levels of data entry errors are shown in Table I; example errors are discussed in the following paragraphs.
Keywords—Data Entry Errors; Electronic Medical Records (EMR); Error Detecting; Error Tagging

I. INTRODUCTION

Currently, large-scale data analysis is conducted in almost every area. Advanced analytical processes and data mining techniques have been used to discover valuable information and knowledge from large-scale data. However, most data mining techniques assume that the data has already been cleaned [1]. This is not always true in real situations. The data preparation process and data quality issues are critical to reliable data mining results, because data with errors will compromise the credibility of the results. The data lifecycle of an application typically consists of the following steps: data collection, transformation, storage, auditing, cleaning, and analysis for decision making [2]. Data errors can be reduced in each step of the data lifecycle. Since data collection is the first step of the lifecycle, reducing data entry errors at the data collection stage is the earliest opportunity to ensure data quality.

TABLE I. LEVELS OF DATA ENTRY ERRORS

Level 1: Incorrect Format and Missing (L1-1: Incorrect Format Data; L1-2: Missing Data)
Level 2: Out of Range (L2-1: Out of normal range; L2-2: Out of population trends; L2-3: Out of personal trend)
Level 3: Inconsistent (L3-1: Personal Inconsistent; L3-2: Population Inconsistent)

L1-1 errors can be easily controlled with prior knowledge about the data format: constraints can be set for each field. For example, only numbers are allowed to be entered into the blood pressure field. L2-1 errors can also be easily controlled as long as there is knowledge about the normal range of each field. For example, height values can be constrained to the range of 0 to 3 meters. Detecting L1-2 errors, however, is a difficult research task, but we can learn to detect missing data according to clinical guidelines and caregiving practices. Once missing values are detected, we can set constraints on the affected field as we did for L1-1 and L2-1.
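For the L1-1 and L2-1 cases above, a minimal validation sketch might look as follows. Only the 0-3 m height range comes from the paper; the field names and the systolic-pressure bounds are illustrative assumptions:

```python
# Illustrative L1-1 (format) and L2-1 (normal-range) checks.
NORMAL_RANGES = {
    "height_m": (0.0, 3.0),   # the paper's example range for height
    "bp_s": (60.0, 250.0),    # assumed plausible systolic bounds
}

def check_entry(field, raw):
    """Return None if the entry passes, else an error-level label."""
    try:
        value = float(raw)            # L1-1: numeric fields must parse
    except (TypeError, ValueError):
        return "L1-1"
    lo, hi = NORMAL_RANGES[field]
    if not lo <= value <= hi:         # L2-1: outside the known normal range
        return "L2-1"
    return None
```

Such checks run at entry time, before any probabilistic scoring is needed.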
For L2-2 errors, using statistics to detect erroneous values for a single attribute is well studied, and robust results can be obtained. For example, L2-2 errors can be detected using a univariate outlier detection technique called Hampel X84 [11]. In this paper, we do not focus on solving the L1-1 to L2-2 error problems. Error and outlier detection have been studied in various application domains, such as credit card fraud detection [3], network intrusion detection [4], and data cleaning in data warehouses [5]. In general, errors can be classified into three types: point errors, contextual errors, and collective errors [6]. In the healthcare domain, data errors in electronic medical databases have also been studied recently [7, 8].

The challenge lies in the detection of values that look statistically normal but are abnormal in a medical sense, such as L2-3, L3-1, and L3-2 errors. For example, suppose the weight of a patient increases from 180 lbs to 230 lbs in three months. The value of 230 lbs is within the normal population distribution range, but for this particular patient it is suspicious if he/she has a healthy lifestyle and no major medical problems. This is categorized as an L2-3 error. The following illustrates other examples. A record showing a male patient with the status of pregnancy would be considered an L3-1 error. An obese and diabetic patient with BMI (Body Mass Index) larger than 35 is likely to have cardiovascular disease manifested as hypertension and dyslipidemia; if the patient's blood pressure values are significantly low, yet still in the normal range, the blood pressure would be inconsistent with the BMI value from the population perspective. This signals an L3-2 type of error.

4) We test our method on a dataset from an EMR system. Our experimental results show that the predictive accuracy of our probabilistic model is better than that of a Bayesian network.
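For reference, the Hampel X84 rule mentioned for L2-2 errors declares "Normal" the interval median ± k · 1.4826 · MAD. A minimal sketch (k = 3 is a common choice, assumed here; the sample values are illustrative):

```python
import statistics

def hampel_x84(values, k=3):
    """Return the (low, high) 'Normal' interval: median +/- k * 1.4826 * MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])  # median absolute deviation
    half_width = k * 1.4826 * mad
    return med - half_width, med + half_width

# A value far from the bulk of the data falls outside the interval.
lo, hi = hampel_x84([80, 82, 78, 81, 79, 80, 150])
```

Because median and MAD are robust statistics, a single extreme value (150 above) does not stretch the interval around it.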
Our model also generates better results than an alternative method that assumes the input value for a field is based solely on historical records.

The rest of the paper is organized as follows: Section II describes related work. Section III presents our error detecting and tagging framework. Section IV introduces our probabilistic model and error detecting method. Section V presents our evaluation process and experimental results. Section VI presents our conclusions and future research directions.

II. RELATED WORK

Statistical methods have been used to analyze and detect data errors at the data cleaning stage [12-14]. However, we also need methods that reduce data entry errors at the data collection stage [15].

In this paper, we propose a real-time error detecting and tagging mechanism to address these challenges based on historical data and dependency relationships. The entire framework consists of three main components:

1) Find a similar patient group. Error detection is based on historical datasets. For a single patient, the historical dataset should be composed of patients who have equivalent medical situations, clinical conditions, and basic demographic information. Because it is difficult to find a group of patients with exactly the same medical conditions, we detect errors based on a dataset from patients with similar conditions.

2) Learn a probabilistic model. Based on the dataset from a group of similar patients, we learn the parameters of our proposed model. The probabilistic model is used to calculate the probability of the input value for each field.

3) Return a tag. Error tagging is based on the results of the probabilistic model. If the probability of an input value for a field is relatively high, the value has a higher chance of being correct and a "Normal" tag is returned; if the probability is relatively low, the tag is "Suspicious". We use a threshold to decide which tag should be returned.
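The three components above might be wired together as follows. The grouping key, the stand-in model, and the threshold are placeholder assumptions, not the paper's actual implementation:

```python
def find_similar_group(patient):
    """Component 1: bucket patients by coarse demographics (illustrative)."""
    return (patient["age"] // 10, patient["sex"])

def tag_input(value, patient, model, theta=0.1):
    """Components 2-3: score the entry with a learned model, then threshold."""
    group = find_similar_group(patient)
    prob = model(value, group)          # probability of this input in this group
    return "Normal" if prob >= theta else "Suspicious"

# Stand-in model: a real one would be learned from the group's historical records.
toy_model = lambda value, group: 0.5 if value == 120 else 0.01
```

A production system would replace `toy_model` with the probabilistic model of Section IV and set `theta` dynamically.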
Our experimental results show that our error detecting and tagging method is promising. The predictive accuracy of the probabilistic model is between 70% and 85% on our experimental data (explained in the Evaluation section). We also compared our method with existing methods; the results indicate that our method performs better.

There are different ways to detect errors and control data quality, including classification-based neural networks, Bayesian networks [16], support vector machines, rule-based methods, unsupervised clustering, and instance-based methods. The Usher system detects errors in form entry fields using a Bayesian network and a graphical model with explicit error modeling [17]. The system includes a probabilistic error model based on a Bayesian network for estimating the contextualized error likelihood of each field on a form. The method uses a Bayesian network to learn dependency relationships among fields, but it does not take the historical records of a particular field into consideration. In this paper, we detect errors based on dependency relationships among fields as well as the historical data of a particular field. In addition, we classify patients based on basic information to cluster them into similar groups before we apply error detection. As our method is probabilistic in nature, selecting the right dataset for estimating the probabilities is critical. For example, it is unreasonable to predict the input value for a person aged 50 using a dataset of people aged around 10.

A tagging mechanism has been shown to decrease certain types of errors [18], and data entry interfaces can be designed to help improve data quality. We propose a system that classifies the input value of each field as "Suspicious" or "Normal" and provides this information as feedback to the user. A "Suspicious" tag reminds the user to check the input value.
Consequently, it would help reduce data entry errors. In summary, the contributions of our paper are:

1) We summarize different levels of data entry errors based on a comprehensive literature review. Our method focuses on detecting and tagging errors that cannot be detected by simple statistical methods.

2) We design a probabilistic model based on two assumptions: first, the current patient's input value for a field depends on his/her historical records for that field; and second, the input value for a field depends on the values of other fields. We combine the influences of these two aspects in our model.

3) We develop an error detecting and tagging framework with five stages: selecting a classification model, finding a similar patient group, building a probabilistic model, setting a threshold, and returning tags.

III. ERROR DETECTING AND TAGGING FRAMEWORK

A. Error Detecting and Tagging Framework

Our error detecting and tagging framework is presented in Fig. 1. A data entry interface contains multiple fields. After a user inputs a value for a field, a tag ("Normal" or "Suspicious") is returned to the current user based on the probabilistic model and the threshold setting.

Fig. 1. Error Detecting and Tagging Framework

IV. DATA ENTRY ERROR DETECTING METHOD

A. Data Entry Error Detecting Method

Given a data entry interface $I$, let $F = \{F_v, \ldots, F_j, \ldots, F_k\}$ be the set of input fields on the interface $I$. For example, in Fig. 4, the field for entering the "Weight" value is represented as $F_{Weight}$.

Fig. 2. Historical Records for the Error Detecting Task

Here we use a simple example to illustrate the historical data format and the error detecting task. Fig. 2 shows a set of records for 3 different patients. There are five fields in each record: Date (Observation Date), BP_S (Systolic Blood Pressure), BP_D (Diastolic Blood Pressure), BMI, and Weight.
Our error detecting task is to determine whether the patient with ID = 3 has a normal or suspicious input value for BP_S, BP_D, BMI, and Weight at the observation date 1/13/2010. The feedback is generated based on the former 8 records of the patient with ID = 3, and also on the other 11 records of the patients with ID = 1 and ID = 2, assuming these three patients have similar conditions.

Fig. 4. A Data Entry Form

Fig. 5. An example of relationships among fields

For a single field $F_v$, we use a sequence of values $v = \{v_1, \ldots, v_{i-1}, v_i\}$ to represent all historical records of a patient, where $v_i$ is the current value entered by the user for the field $F_v$, $\{v_1, \ldots, v_{i-1}\}$ is the historical record sequence for the field $F_v$, and $v_{i-1}$ is the value entered for the field $F_v$ at the $(i-1)$-th time. Based on the data, we learn the dependency relationships among BP_S, BP_D, BMI, and Weight. These relationships are captured in a probabilistic graphical model.

We assume that a patient's current input value for a field depends on the historical input records of this field; abnormal changes of input values compared with historical records need attention. In probability theory, such a sequentially dependent process can be described by a Markov model. According to the first-order Markov assumption, the current value of the field $F_v$ at the $i$-th time depends only on the value of the same field at the $(i-1)$-th time. Therefore, for a given input value $v_i$, we can use the value $v_{i-1}$ to predict the probability of the value $v_i$ for the field $F_v$ at the current time.

B. Error Detecting Process

Based on our error detecting and tagging framework, we propose the error detecting process shown in Fig. 3. A patient is modeled as a set of data fields describing the patient's demographic information and clinical conditions. A form is used to collect a set of specific types of information during an encounter, for example, blood pressure, weight, etc.
This probability can be represented as $q_m(v_i \mid v_{i-1})$. Similarly, under the second-order Markov assumption we have $q_m(v_i \mid v_{i-1}, v_{i-2})$. For a new patient, however, there are no historical records for the field $F_v$, so we assume that the current value $v_i$ at the $i$-th time does not depend on previous data from the patient; the predicted probability is then simply $q_m(v_i)$.

For our method, we divide the entire patient information into two categories: independent fields, which are considered fixed during the error detection process, and dependent fields, which are collected and flagged as "Normal" or "Suspicious".

Fig. 3. Error Detecting Process

As shown in Fig. 3, our error detecting process consists of the following five steps. First, patient records are categorized into groups based on the basic independent fields of the current patient and other historical records, using a classification model. Second, we find a similar patient group for the current patient. Third, we build a probabilistic model based on the clinical fields and historical records of the group of similar patients. Fourth, we set a threshold for the probability of the input value. Finally, based on the results from the probabilistic model and the threshold setting, we return a "Normal" or "Suspicious" tag to the current user.

Our goal is to predict the probability of an input value $v_i$ for the field $F_v$, represented as $q_M(v_i)$. Using the first-order Markov assumption, the second-order Markov assumption, and a prior probability, we calculate $q_M(v_i)$ as in Equation (1):

$$q_M(v_i \mid \text{all\_conditions}) = \alpha_3\, q_m(v_i \mid v_{i-1}, v_{i-2}, \text{all\_conditions}) + \alpha_2\, q_m(v_i \mid v_{i-1}, \text{all\_conditions}) + \alpha_1\, q_m(v_i \mid \text{all\_conditions}) \quad (1)$$

Here, "all_conditions" means that the probability is calculated based on a dataset only from patients that have comparable demographic and clinical conditions.
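Equation (1)'s interpolation can be sketched with count-based estimates of the three terms (the equal α weights and the toy value sequence below are illustrative):

```python
from collections import Counter

def q_M(v, prev1, prev2, history, alphas=(1/3, 1/3, 1/3)):
    """Interpolate second-order, first-order, and unconditioned estimates."""
    uni = Counter(history)                          # N(v)
    bi = Counter(zip(history, history[1:]))         # N(v_{i-1}, v_i)
    tri = Counter(zip(history, history[1:], history[2:]))  # N(v_{i-2}, v_{i-1}, v_i)
    a1, a2, a3 = alphas
    p0 = uni[v] / len(history)                                  # q_m(v_i)
    p1 = bi[(prev1, v)] / uni[prev1] if uni[prev1] else 0.0     # q_m(v_i | v_{i-1})
    p2 = (tri[(prev2, prev1, v)] / bi[(prev2, prev1)]
          if bi[(prev2, prev1)] else 0.0)                       # q_m(v_i | v_{i-1}, v_{i-2})
    return a3 * p2 + a2 * p1 + a1 * p0
```

The fallbacks to 0.0 cover the sparse-count case where a context has never been observed in the group's records.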
In the following equations, we omit "all_conditions" from each term, assuming it has been taken into consideration. The three parameters $\alpha_1$, $\alpha_2$, and $\alpha_3$ balance the three probabilities [24], with $\alpha_1 + \alpha_2 + \alpha_3 = 1$. Given a set of records, we can estimate each part of the probability according to Equations (2), (3), and (4), respectively:

$$q_m(v_i \mid v_{i-1}, v_{i-2}) = \frac{N(v_{i-2}, v_{i-1}, v_i)}{N(v_{i-2}, v_{i-1})} \quad (2)$$

$$q_m(v_i \mid v_{i-1}) = \frac{N(v_{i-1}, v_i)}{N(v_{i-1})} \quad (3)$$

$$q_m(v_i) = \frac{N(v_i)}{N(\text{all})} \quad (4)$$

For Equations (2) and (3), $N(v_{i-2}, v_{i-1})$ is the number of times the input value $v_{i-2}$ for field $F_v$ at the $(i-2)$-th time is followed by the input value $v_{i-1}$ at the $(i-1)$-th time for a patient; similarly, $N(v_{i-2}, v_{i-1}, v_i)$ is the number of occurrences of the input value sequence $\{v_{i-2}, v_{i-1}, v_i\}$ at the $(i-2)$-th, $(i-1)$-th, and $i$-th times for a patient.

B. Error Tagging Method

We assume that for each field $F_v$ the set of possible values is finite and discrete; if the values of a field are continuous, we apply appropriate methods to discretize them. Let $V = \{v_i', v_i'', \ldots\}$ be the set of possible values for a field $F_v$; the current input value $v_i$ at the $i$-th time is a member of this set. For each possible input value in $V$, we calculate a probability based on Equation (5), and obtain a tag for the current input value $v_i$ as in Equation (6):

$$\text{tag}(v_i) = \begin{cases} \text{Normal}, & q(v_i) \ge \theta \\ \text{Suspicious}, & q(v_i) < \theta \end{cases} \quad (6)$$

For the threshold value $\theta$ in Equation (6), several methods can be applied, for example the 80/20 rule, a Linear Utility Function [22], or Maximum Likelihood Estimation [23]. In this paper, we use the 80/20 rule to dynamically decide the threshold: if a probability falls within the top 20% of values, the corresponding value is "Normal"; otherwise it is "Suspicious".
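The 80/20 thresholding of Equation (6) might be sketched as follows (the candidate probabilities are illustrative):

```python
def tag_with_8020(v, probs):
    """Tag v 'Normal' iff its probability is in the top 20% of the
    candidate values' probabilities (Equation (6) with a dynamic theta)."""
    ranked = sorted(probs.values(), reverse=True)
    cutoff = max(1, round(0.2 * len(ranked)))   # how many values count as top 20%
    theta = ranked[cutoff - 1]                  # smallest probability still "Normal"
    return "Normal" if probs[v] >= theta else "Suspicious"
```

Because theta is recomputed from the candidate set of each field, the "Normal" region adapts to each patient group rather than being fixed globally.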
For Equation (4), $N(\text{all})$ is the total number of values for field $F_v$ in the historical records of all similar patients. "Similar patients" means a group of patients who are similar to the current patient; the similarity is measured in terms of demographic and clinical conditions such as age, gender, disease conditions, medications, treatments, social activities, self-management, etc. $N(v_i)$ is the number of occurrences of $v_i$ for field $F_v$ in all historical records of all similar patients.

V. EVALUATION

The previous sections described the theoretical derivation of the model and the computational steps. In this section, we present an experiment using a limited dataset to evaluate the performance of the model; we aimed to gain some first-hand practical confidence through a pilot study. The limitation of the data is two-fold. First, we only obtained data for a small number of people, about 1,448. Second, we consider a patient model with a small number of fields, including age, sex, blood pressure, weight, and BMI. The experiment is by no means thorough and complete, because a real model for describing a patient is far more complicated than several fields. The purpose of the experiment is to demonstrate the effectiveness of the probabilistic model for error detection, taking historical data and inter-field dependency relationships into consideration.

Equation (1) only uses the historical records of a single field. However, there are also dependency relationships among the different fields of a data entry interface $I$. These relationships can be obtained from prior expert knowledge or learned from historical data [17]. We learn them from historical records as a Bayesian network using the Max-Min Hill-Climbing (MMHC) method [19], implemented with the R packages bnlearn [20] and gRain [21].
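Once the network structure and its Conditional Probability Table (CPT) are learned, the blended score of Equation (5) reduces to a table lookup plus a weighted sum. A toy sketch with a hand-made CPT (the table entries and the β, γ values are illustrative, not learned from real EMR data):

```python
# Hypothetical CPT: P(BP_D = v | BP_S = w, Weight = u), over discretized values.
CPT = {
    ("normal", "normal"): {"low": 0.10, "normal": 0.80, "high": 0.10},
    ("high", "high"):     {"low": 0.05, "normal": 0.35, "high": 0.60},
}

def q_final(v, w, u, q_M_value, beta=0.75, gamma=0.25):
    """Equation (5): q(v_i) = beta * q_M(v_i) + gamma * q_B(v_i | w_i, u_i)."""
    q_B = CPT[(w, u)].get(v, 0.0)   # CPT lookup for the parents' current values
    return beta * q_M_value + gamma * q_B
```

In the paper's pipeline, `q_M_value` would come from the Markov interpolation of Equation (1) and the CPT from the MMHC-learned network.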
An example of learned relationships among fields is shown in Fig. 5. A Conditional Probability Table (CPT) is also learned for the Bayesian network. For the example shown in Fig. 5, the value $v_i$ of $F_v$ depends on the values of the fields $F_w$ and $F_u$, so we can represent the probability as $q_B(v_i \mid w_i, u_i)$. By adding the probability $q_B(v_i \mid w_i, u_i)$ to Equation (1), we get Equation (5):

$$q(v_i) = \beta\, q_M(v_i) + \gamma\, q_B(v_i \mid w_i, u_i) \quad (5)$$

where the two parameters $\beta$ and $\gamma$ satisfy $\beta + \gamma = 1$. We already have the value of $q_M(v_i)$ from Equation (1), and we get the value of $q_B(v_i \mid w_i, u_i)$ from the CPT of the learned Bayesian network (Fig. 5). By plugging in these values, we compute the final probability $q(v_i)$.

A. Data Sets

We obtained the data from a health clinic affiliated with Drexel University in Philadelphia, PA, USA. All the data are de-identified for HIPAA (Health Insurance Portability and Accountability Act) compliance. We only obtained records with six fields, without any additional clinical information: Age, Sex, Diastolic Blood Pressure (BP_D), Systolic Blood Pressure (BP_S), Weight, and Body Mass Index (BMI). We extracted 1,448 patients with 12,199 records over these six fields from the EMR system. Because our dataset is limited and the patients' information is insufficient, we cannot guarantee that all patients have equal clinical conditions and treatments in this small-scale preliminary study. Only Age and Sex are used as basic fields in the classification process that finds a similar patient group for the current patient. Based on these two basic attributes, patients are divided into 14 categories; the numbers of patients in each category are shown in Table II. The average number of historical records per patient is 8.4. We examine our data entry error detecting method for each group. The average predictive accuracy over all groups is 71%.
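The age/sex bucketing behind Table II can be sketched as decade-wide bins over [20, 90), as the table's ranges suggest:

```python
def group_key(age, sex):
    """Map a patient to an (age-decade, sex) bucket; Table II covers [20, 90)."""
    if not 20 <= age < 90:
        return None              # outside the ranges listed in Table II
    return (age // 10 * 10, sex)
```

For example, a 53-year-old female falls into the [50, 60) female group (group 8 in Table II).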
TABLE II. 14 KINDS OF SIMILAR PATIENTS

Group | Age      | Sex | Patients/Records
1     | [20, 30) | M   | 30/153
2     | [20, 30) | F   | 63/567
3     | [30, 40) | M   | 68/328
4     | [30, 40) | F   | 132/973
5     | [40, 50) | M   | 122/643
6     | [40, 50) | F   | 188/1967
7     | [50, 60) | M   | 191/1513
8     | [50, 60) | F   | 270/2552
9     | [60, 70) | M   | 103/850
10    | [60, 70) | F   | 181/1592
11    | [70, 80) | M   | 20/195
12    | [70, 80) | F   | 55/658
13    | [80, 90) | M   | 4/36
14    | [80, 90) | F   | 21/172

a. "M" stands for "Male", and "F" stands for "Female".

The predictive power is stable across these groups, ranging from 70% to 85%, except for groups 1, 13, and 14: their numbers of records are lower than 200, so their predictive accuracies are relatively lower.

We implemented Usher's error model [17] on our dataset; Usher induces the probability of making an error using an error model. The comparative results of Usher and our method are shown in Fig. 8. Usher's average accuracy is 63.2%, while our method's is 71%. Except for groups 1 and 13, in which the numbers of records are too small, our method performs better than Usher.

Since the number of records and the accuracy of group 8 are relatively high, we test different sets of parameters on the dataset of group 8. The nine parameter sets and the corresponding accuracy results are shown in Table III and Fig. 9.

BP_D, BP_S, Weight, and BMI are transformed into discrete values and used as clinical fields to build a Bayesian network capturing the inter-field dependency relationships among these clinical fields. In our experiment, we built a unified Bayesian network using all 12,199 records. The Max-Min Hill-Climbing (MMHC) method is used to generate the Bayesian network structure, as shown in Fig. 6.

Fig. 6.
Learned Bayesian Network

The variable BP_D depends on the variable BP_S; the joint probability between BP_D and BP_S can be calculated from the CPT (Conditional Probability Table) of the Bayesian network.

B. Experimental Results

Based on our dataset, we used our method to predict the value of the variable BP_D. The predictive power of our method is measured in individual 3-fold cross-validation experiments for each group. The average predictive accuracy and number of records for each group are shown in Fig. 7. The accuracy is calculated as the number of correctly predicted instances divided by the total number of instances. The parameters here are $\alpha_1 = \alpha_2 = \alpha_3 = 1/3$, $\beta = 3/4$, and $\gamma = 1/4$.

Fig. 7. Predictive Accuracy and Number of Records of 14 Groups

Fig. 8. Comparative Performance of Usher with Our Method

TABLE III. ACCURACY RESULTS FOR DIFFERENT PARAMETER SETS

Set | α1   | α2   | α3  | β    | γ    | Accuracy
1   | 0    | 0    | 1   | 1    | 0    | 0.736813
2   | 1/3  | 1/3  | 1/3 | 1    | 0    | 0.748621
3   | 0    | 0    | 0   | 0    | 1    | 0.757969
4   | 1/3  | 1/3  | 1/3 | 0.75 | 0.25 | 0.754706
5   | 1/3  | 1/3  | 1/3 | 0.5  | 0.5  | 0.75221
6   | 0.25 | 0.25 | 0.5 | 0.5  | 0.5  | 0.780027
7   | 0.2  | 0.2  | 0.6 | 0.5  | 0.5  | 0.783213
8   | 0.1  | 0.1  | 0.8 | 0.6  | 0.4  | 0.782445
9   | 0    | 0    | 1   | 0.5  | 0.5  | 0.791291

As shown in Table III, in Set 1 and Set 2, $\beta = 1$ and $\gamma = 0$, which means that prediction depends only on past historical values of the current single field. In Set 3, $\beta = 0$ and $\gamma = 1$, which means that prediction depends only on other fields, not on historical values of the current field; that is, the result of Set 3 is the result of using only the Bayesian network. The accuracy results of Sets 1, 2, and 3 are relatively low. In summary, the accuracies obtained by methods depending purely on past historical values of the current single field, or purely on dependency relationships among multiple fields learned from the Bayesian network, are not as good as the results obtained by combining these two aspects.

In Fig. 9, Sets 6, 7, 8, and 9 have relatively higher accuracy. In these four sets, the parameters β and α3 have higher weight.
That means that prediction depending on both the former two stages of historical values of the current single field and the dependency relationships among multiple fields improves the accuracy. Set 9 has the highest accuracy, but in practice it is not recommended, because for some small datasets $\alpha_3$ might have to be 0. The recommended parameter sets are Sets 6, 7, and 8.

Fig. 9. Accuracy Comparison for Different Parameter Sets

We also conducted an experiment comparing the performance of our model to that of Hampel X84 [11]. With our method, the "Normal" range for a patient's BP_D value can dynamically narrow down to [70, 90], as shown in Fig. 10 a), while Hampel X84 gives "Normal" to the broader range [45.3, 114.7] for the whole population, as shown in Fig. 10 b). The narrower range indicates that our method is more sensitive for error detection.

Fig. 10. Our method (a) compared with Hampel X84 (b)

VI. DISCUSSION AND FUTURE WORK

This paper proposes a five-step error detecting and tagging framework for reducing data entry errors that pose significant challenges to existing statistical methods. Our probabilistic model for error detecting is based on the assumption that input values for a field can be predicted using historical data and inter-field dependency relationships.

We conducted pilot experiments to evaluate our method. The results show that our probabilistic model has better accuracy than a Bayesian network; it is also better than a method based on the Markov assumption that the input value for a field depends only on the historical records of that single field. We also compared our method with Usher's error model, and the results show that our method performs better. Moreover, our method gives more precise results than the Hampel X84 method.

Our future research work will focus on the following aspects:

1) Currently, our tagging method can only return two kinds of tags, "Normal" and "Suspicious". We would like to extend our model to detect the multiple error types shown in Table I.

2) Our error detecting method only handles discrete input values. We plan to improve our method to deal with continuous input values.

3) We plan to conduct a more thorough experimental evaluation by collecting more patient data with more attributes.

ACKNOWLEDGMENT

This work is supported in part by a Drexel Jumpstart grant on Health Informatics and NSF grant IIP 1160960 for the Center for Visual and Decision Informatics (CVDI).

REFERENCES

[1] Dasu, T. and T. Johnson, Exploratory Data Mining and Data Cleaning. Vol. 479. Wiley-Interscience, 2003.
[2] Batini, C. and M. Scannapieca, Data Quality. Springer, 2006.
[3] Chan, P.K., et al., Distributed data mining in credit card fraud detection. IEEE Intelligent Systems and their Applications, 1999. 14(6): p. 67-74.
[4] Lazarevic, A., et al., A comparative study of anomaly detection schemes in network intrusion detection. Proc. SIAM, 2003.
[5] Rahm, E. and H.H. Do, Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 2000. 23(4): p. 3-13.
[6] Chandola, V., A. Banerjee, and V. Kumar, Anomaly Detection: A Survey. ACM Computing Surveys, 2009. 41(3).
[7] Nahm, M.L., C.F. Pieper, and M.M. Cunningham, Quantifying Data Quality for Clinical Trials Using Electronic Data Capture. PLoS One, 2008. 3(8).
[8] Goldberg, S.I., A. Niemierko, and A. Turchin, Analysis of data errors in clinical research databases. AMIA Annu Symp Proc, 2008: p. 242-6.
[9] Arts, D.G., N.F. De Keizer, and G.-J. Scheffer, Defining and improving data quality in medical registries: a literature review, case study, and generic framework. Journal of the American Medical Informatics Association, 2002. 9(6): p. 600-611.
[10] Khare, R., Y. An, S. Wolf, P. Nyirjesy, L. Liu, and E. Chou, Understanding the EMR error control practices among gynecologic physicians. iConference 2013 Proceedings, p. 289-301.
[11] Hellerstein, J.M., Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.
[12] Beretta, L., et al., Improving the quality of data entry in a low-budget head injury database. Acta Neurochirurgica, 2007. 149(9): p. 903-909.
[13] Phillips, W. and Y. Gong, Developing a nomenclature for EMR errors, in Human-Computer Interaction. Interacting in Various Application Domains. Springer, 2009. p. 587-596.
[14] Jacobs, B., Electronic medical record, error detection, and error reduction: a pediatric critical care perspective. Pediatr Crit Care Med, 2007. 8(2 Suppl): p. S17-20.
[15] Goldberg, S.I., et al., A Weighty Problem: Identification, Characteristics and Risk Factors for Errors in EMR Data. AMIA Annu Symp Proc, 2010. 2010: p. 251-5.
[16] Wong, W., et al., Bayesian network anomaly pattern detection for disease outbreaks. In Machine Learning: International Workshop then Conference, 2003.
[17] Chen, K., et al., Usher: Improving data quality with dynamic forms. IEEE Transactions on Knowledge and Data Engineering, 2011. 23(8): p. 1138-1153.
[18] Chen, K., J.M. Hellerstein, and T.S. Parikh, Designing adaptive feedback for improving data entry accuracy. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010. ACM.
[19] Tsamardinos, I., L.E. Brown, and C.F. Aliferis, The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 2006. 65(1): p. 31-78.
[20] Scutari, M., Learning Bayesian networks with the bnlearn R package. arXiv preprint arXiv:0908.3817, 2009.
[21] Højsgaard, S., Graphical Independence Networks with the gRain package for R, 2012.
[22] Arampatzis, A., et al., KUN on the TREC-9 Filtering Track: Incrementality, decay, and threshold optimization for adaptive filtering systems. 2000.
[23] Zhang, Y. and J. Callan, Maximum likelihood estimation for filtering thresholds. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001. ACM.
[24] Collins, M., Language Modelling (lecture notes). http://www.cs.columbia.edu/~mcollins/. 2013.