Sunday, October 31, 2010

About Data Mining Concepts (2)

Data Mining Languages

  • PMML (Predictive Model Markup Language): an XML-based language that provides a standard way to represent data mining models so that they can be shared between different statistical applications (a sketch follows this list). PMML was developed by the Data Mining Group (DMG), an independent group composed of many data mining companies.
  • R: a programming language and software environment for statistical computing and graphics. R has become a de facto standard among statisticians for developing statistical software, and is widely used for data analysis.
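
To give a feel for the format, below is a sketch of what a small PMML document can look like: a single decision-tree model with one split. The field names and values are made up for this illustration, and the fragment is a simplified sketch rather than a complete, validated model.

    <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
      <Header description="illustrative sketch only"/>
      <DataDictionary numberOfFields="2">
        <DataField name="age" optype="continuous" dataType="double"/>
        <DataField name="response" optype="categorical" dataType="string"/>
      </DataDictionary>
      <TreeModel functionName="classification">
        <MiningSchema>
          <MiningField name="age"/>
          <MiningField name="response" usageType="predicted"/>
        </MiningSchema>
        <!-- Default prediction "no"; predict "yes" when age > 30 -->
        <Node score="no">
          <True/>
          <Node score="yes">
            <SimplePredicate field="age" operator="greaterThan" value="30"/>
          </Node>
        </Node>
      </TreeModel>
    </PMML>

Because the model is plain XML, it can be exported from one tool and scored in another.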


Predictive Modeling

Predictive modeling is the process of creating or choosing a model to best predict the probability of an outcome. In many cases the model is chosen on the basis of detection theory, to estimate the probability of an outcome given a set amount of input data.
A model may combine one or more classifiers in trying to determine the probability that a set of data belongs to a particular class. Here are some common modeling techniques (a k-nearest neighbor sketch follows this list):

  • Naive Bayes
  • K-nearest neighbor algorithm
  • Majority classifier
  • Support vector machines
  • Logistic regression
  • Uplift modeling
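
As a concrete illustration of one of these techniques, here is a minimal k-nearest neighbor classifier in plain Python. The toy dataset and the choice of k = 3 are assumptions made up for this sketch:

    # A minimal k-nearest neighbor sketch; dataset and k are illustrative only.
    import math
    from collections import Counter

    def knn_predict(train, query, k=3):
        """Classify `query` by majority vote among the k closest training points.

        `train` is a list of (features, label) pairs; `query` is a feature tuple.
        """
        # Rank training points by Euclidean distance to the query point.
        by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
        # Majority vote over the labels of the k nearest neighbors.
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    # Toy example: classify a new point against two labeled clusters.
    training_data = [
        ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
        ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
    ]
    print(knn_predict(training_data, (1.1, 1.0)))  # -> "A"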





(To be continued)

Saturday, October 23, 2010

About Data Mining Concepts (1)

Note to readers: this article is based on my personal understanding of data processing and data mining. Readers from different backgrounds or business domains may understand things somewhat differently, but the basic ideas should be very close.

Data Analysis

Data analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
Data analysis involves the following activities and technologies:
  • Data mining, focusing on data modeling and knowledge discovery for predictive rather than purely descriptive purposes; in other words, predictive analysis.
  • Business intelligence reporting, focusing on business information extraction, data aggregation, and presentation (e.g., via OLAP or graphs) in a business environment where the business rules are quite clear. Note that when we talk about Business Intelligence in general, it usually refers to all the technologies and approaches used in data analysis.
  • Statistical applications, which may be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, while CDA focuses on confirming or falsifying existing hypotheses.
  • Predictive analytics, focusing on the application of statistical or structural models for predictive forecasting or classification, in an environment where the business rules or patterns are not clear.
  • Data integration (ETL), a pre-processing step that prepares data for analysis (see the sketch after this list). Data visualization and data dissemination, by contrast, are the main delivery vehicles for data analysis results.
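
To make the ETL step concrete, here is a minimal extract-transform-load sketch in Python. The sales.csv file, its field names, and the simple per-region aggregation in the load step are all hypothetical, chosen only to show the shape of the process:

    # A minimal ETL sketch; the file name and schema are illustrative only.
    import csv

    def extract(path):
        """Extract: read raw rows from a CSV source."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Transform: drop incomplete rows and normalize types."""
        clean = []
        for row in rows:
            if not row.get("amount"):  # skip rows with missing values
                continue
            clean.append({"region": row["region"].strip().upper(),
                          "amount": float(row["amount"])})
        return clean

    def load(rows):
        """Load: aggregate per region (a stand-in for writing to a warehouse)."""
        totals = {}
        for row in rows:
            totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
        return totals

    # print(load(transform(extract("sales.csv"))))  # assumes a sales.csv exists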


Data Mining

Data mining is a process of analyzing a large data population to find useful patterns. The information in these patterns is important to the business. Data mining uses the following analysis technologies:
  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-squared Automatic Interaction Detection (CHAID). Both provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits (see the split-search sketch after this list), while CHAID segments using chi-squared tests to create multi-way splits. CART typically requires less data preparation than CHAID.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1).
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
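
To illustrate the 2-way splits that CART-style decision trees are built from, here is a toy Python sketch that finds the single split on one numeric feature minimizing the weighted Gini impurity. The age/purchase data is made up for the example:

    # Toy CART-style split search; the dataset is illustrative only.
    from collections import Counter

    def gini(labels):
        """Gini impurity of a list of class labels."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split(values, labels):
        """Return (threshold, score) of the 2-way split with lowest impurity."""
        best = (None, float("inf"))
        for threshold in sorted(set(values)):
            left = [l for v, l in zip(values, labels) if v <= threshold]
            right = [l for v, l in zip(values, labels) if v > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[1]:
                best = (threshold, score)
        return best

    ages = [22, 25, 30, 35, 40, 52]
    bought = ["no", "no", "no", "yes", "yes", "yes"]
    print(best_split(ages, bought))  # -> (30, 0.0): "age <= 30" splits perfectly

A real CART implementation applies this search recursively over all features to grow the full tree; the sketch shows only the core split decision.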


Data Mining Process
1) Pre-processing
  • A target data set must be assembled. The target dataset must be large enough to contain the patterns to be mined while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse.
  • The target set is then cleaned. Cleaning removes the observations with noise and missing data.
  • The clean data are reduced into feature vectors, one vector per observation. A feature vector is a summarized version of the raw data observation. The feature(s) selected will depend on what the objective(s) is/are; obviously, selecting the "right" feature(s) is fundamental to successful data mining.
  • The feature vectors are divided into two sets, the "training set" and the "test set" (see the sketch below). The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
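
Putting these pre-processing steps together, here is a minimal Python sketch covering cleaning, feature-vector construction, and the training/test split. The record schema and the 2/3 split ratio are illustrative assumptions:

    # Pre-processing sketch; the schema and split ratio are illustrative only.
    import random

    raw = [
        {"age": 25, "income": 30000, "churned": "no"},
        {"age": None, "income": 45000, "churned": "no"},  # noisy/missing record
        {"age": 40, "income": 80000, "churned": "yes"},
        {"age": 52, "income": 60000, "churned": "yes"},
    ]

    # Cleaning: remove observations with missing data.
    clean = [r for r in raw if all(v is not None for v in r.values())]

    # Feature vectors: one (features, label) pair per observation.
    vectors = [((r["age"], r["income"]), r["churned"]) for r in clean]

    # Divide into a "training set" and a "test set" (here roughly 2/3 vs 1/3).
    random.shuffle(vectors)
    cut = (2 * len(vectors)) // 3
    training_set, test_set = vectors[:cut], vectors[cut:]
    print(len(training_set), "training /", len(test_set), "test")
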
2) Data mining
Data mining commonly involves the following classes of tasks:
  • Clustering - the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data (a toy k-means sketch follows this list).
  • Classification - the task of dividing records into predefined groups by assigning a discrete label value to an unlabeled record.
  • Regression - the task of finding a function that models the data with the least error; a supervised modeling task similar to classification, but the label is continuous rather than discrete.
  • Association rule learning - searches for relationships between variables.
  • Visualization
  • Feature selection
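
As an example of the clustering task, here is a toy k-means implementation in plain Python; the sample points and k = 2 are made-up values for illustration:

    # Toy k-means clustering sketch; points and k are illustrative only.
    import math
    import random

    def kmeans(points, k=2, iterations=10):
        """Group points into k clusters by iteratively refining centroids."""
        centroids = random.sample(points, k)
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Update step: move each centroid to the mean of its cluster.
            for i, cluster in enumerate(clusters):
                if cluster:
                    centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
        return clusters

    data = [(1, 1), (1.1, 0.9), (0.8, 1.2), (5, 5), (5.1, 4.8), (4.9, 5.2)]
    print(kmeans(data, k=2))  # two groups of nearby points
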
3) Results validation
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the algorithms are necessarily valid. It is common for the algorithms to find patterns in the training set that are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output.
A number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.
If the learnt patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learnt patterns do meet the desired standards, the final step is to interpret them and turn them into knowledge.
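
The sketch below shows the validation idea at toy scale: a deliberately simple majority classifier is "trained", then scored only on a held-out test set it never saw. The data and the classifier are stand-ins for whatever algorithm was actually used:

    # Validation sketch: score learnt patterns on held-out data.
    # The tiny dataset and majority classifier are illustrative only.
    from collections import Counter

    training_set = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B")]
    test_set = [((0.9, 1.1), "A"), ((5.2, 4.9), "B")]

    # "Train": a majority classifier just memorizes the most common label.
    majority_label = Counter(label for _, label in training_set).most_common(1)[0][0]

    # Validate on data the classifier was NOT trained on. A large gap between
    # training accuracy and test accuracy is the signature of overfitting.
    correct = sum(1 for _, label in test_set if majority_label == label)
    print(f"test accuracy: {correct / len(test_set):.2f}")  # -> 0.50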

(To be continued)