Welcome!

Sunday, October 31, 2010

About Data Mining Concept (2)

Data Mining Languages

  • PMML (Predictive Model Markup Language) provides a standard way to represent data mining models so that they can be shared between different statistical applications. PMML is an XML-based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies.
  • R is a programming language and software environment for statistical computing and graphics. It has become a de facto standard among statisticians for the development of statistical software and is widely used for data analysis.


Predictive Modeling

Predictive modeling is the process by which a model is created or chosen to try to best predict the probability of an outcome. In many cases the model is chosen on the basis of detection theory to try to guess the probability of an outcome given a set amount of input data.
A model can use one or more classifiers in trying to determine the probability that a set of data belongs to another set. Here are some common modeling techniques (a minimal sketch of one of them follows the list):

  • Naive Bayes
  • K-nearest neighbor algorithm
  • Majority classifier
  • Support vector machines
  • Logistic regression
  • Uplift modeling
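As a concrete illustration of how one of these techniques classifies data, below is a minimal k-nearest-neighbor sketch in Java. This is a toy example written for this post (the training data and class names are made up), not production code; a real project would use an established library.

import java.util.Arrays;
import java.util.Comparator;

// Toy k-nearest-neighbor classifier: predict the majority label among
// the k training points closest to the query point.
public class KnnSketch {

    static int predict(final double[][] train, final int[] labels,
                       final double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training rows by Euclidean distance to the query point.
        Arrays.sort(idx, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(dist(train[a], query), dist(train[b], query));
            }
        });
        // Majority vote among the k nearest neighbors (labels are 0 or 1).
        int votes = 0;
        for (int i = 0; i < k; i++) votes += labels[idx[i]];
        return 2 * votes >= k ? 1 : 0;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] train = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        int[] labels = {0, 0, 1, 1};
        // The query point (2, 1) is near the label-0 cluster, so expect 0.
        System.out.println(predict(train, labels, new double[]{2, 1}, 3));
    }
}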





(To be continued)

Saturday, October 23, 2010

About Data Mining Concept (1)

Reader attention: This article is based on my personal understanding of data processing and data mining. Readers from different backgrounds or business domains may have somewhat different understandings, but the basic ideas should be very close.

Data Analysis

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. 
Data Analysis involves the following activities or technologies:
  • Data mining focuses on data modeling and knowledge discovery for predictive rather than purely descriptive purposes; in other words, predictive analysis.
  • Business intelligence reporting focuses on business information extraction, data aggregation, and presentation (for example, by OLAP or graphs) in a business environment where the business rules are quite clear. Note that when we talk about Business Intelligence, it usually refers to all the technologies or approaches used in data analysis in general.
  • Statistical applications may be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses.
  • Predictive analytics focuses on the application of statistical or structural models for predictive forecasting or classification, in an environment where the business rules or patterns are not clear.
  • Data integration (ETL) is a pre-process that prepares data for data analysis, while data visualization and data dissemination are the main deliveries of data analysis results.


Data Mining

Data mining is a process of analyzing a large data population to find useful patterns. The information in these patterns is important to the business. Data mining uses the following analysis technologies:
  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID. (A toy 2-way-split sketch follows this list.)
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1).
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
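To make the 2-way-split idea behind CART concrete, here is a toy Java sketch that picks the single threshold on one numeric feature minimizing misclassifications. (Real CART implementations split on impurity measures such as Gini and build full trees recursively; the data below is made up.)

// Toy CART-style 2-way split: find the threshold on a one-dimensional
// feature that best separates two classes (0/1) by misclassification count.
public class StumpSketch {

    // Returns the threshold minimizing errors for the rule
    // "predict 1 if x >= threshold, else predict 0".
    static double bestSplit(double[] x, int[] y) {
        double bestT = x[0];
        int bestErrors = Integer.MAX_VALUE;
        for (double t : x) {                       // candidate thresholds
            int errors = 0;
            for (int i = 0; i < x.length; i++) {
                int pred = x[i] >= t ? 1 : 0;
                if (pred != y[i]) errors++;
            }
            if (errors < bestErrors) { bestErrors = errors; bestT = t; }
        }
        return bestT;
    }

    public static void main(String[] args) {
        double[] age    = {22, 25, 31, 40, 52, 61};
        int[]    bought = { 0,  0,  0,  1,  1,  1};
        // Expected rule: "IF age >= 40 THEN bought = 1".
        System.out.println("split at age >= " + bestSplit(age, bought));
    }
}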


Data Mining Process
1) Pre-processing
  • A target data set must be assembled. The target dataset must be large enough to contain the patterns to be mined while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse.
  • The target set is then cleaned. Cleaning removes the observations with noise and missing data.
  • The clean data are reduced into feature vectors, one vector per observation. A feature vector is a summarized version of the raw data observation. The feature(s) selected will depend on what the objective(s) is/are; obviously, selecting the "right" feature(s) is fundamental to successful data mining.
  • The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
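Below is a minimal sketch of this holdout split in Java (my own illustration; the 80/20 ratio and the synthetic feature vectors are arbitrary choices):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy holdout split: shuffle the feature vectors, then reserve a
// fraction (here 20%) as the test set.
public class SplitSketch {
    public static void main(String[] args) {
        List<double[]> vectors = new ArrayList<double[]>();
        for (int i = 0; i < 100; i++) {
            vectors.add(new double[]{i, i % 7}); // synthetic feature vectors
        }

        Collections.shuffle(vectors, new Random(42)); // fixed seed: repeatable split
        int cut = (int) (vectors.size() * 0.8);
        List<double[]> trainingSet = vectors.subList(0, cut);
        List<double[]> testSet = vectors.subList(cut, vectors.size());

        // training=80, test=20
        System.out.println("training=" + trainingSet.size() + ", test=" + testSet.size());
    }
}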
2) Data mining
Data mining commonly involves the following classes of tasks:
  • Clustering - the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data (a minimal sketch follows this list).
  • Classification - the task of dividing records into predefined groups, by assigning a discrete label value to an unlabeled record.
  • Regression - attempts to find a function which models the data with the least error; it is a supervised modeling task similar to classification, but the label is not discrete.
  • Association rule learning - searches for relationships between variables.
  • Visualization
  • Feature selection
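Here is a minimal one-dimensional clustering sketch in Java using k-means (a toy illustration with made-up data; k is fixed at 2, the centroid initialization is naive, and a fixed iteration count stands in for a convergence test):

import java.util.Arrays;

// Toy one-dimensional k-means with k = 2: repeatedly assign each point
// to the nearest centroid, then move each centroid to the mean of its points.
public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};
        double[] centroids = {points[0], points[1]}; // naive initialization
        int[] assign = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: nearest centroid wins.
            for (int i = 0; i < points.length; i++) {
                assign[i] = Math.abs(points[i] - centroids[0])
                          <= Math.abs(points[i] - centroids[1]) ? 0 : 1;
            }
            // Update step: each centroid moves to the mean of its cluster.
            for (int c = 0; c < 2; c++) {
                double sum = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assign[i] == c) { sum += points[i]; n++; }
                }
                if (n > 0) centroids[c] = sum / n;
            }
        }
        System.out.println("centroids=" + Arrays.toString(centroids));  // ~1.0 and ~8.07
        System.out.println("assignments=" + Arrays.toString(assign));   // [0,0,0,1,1,1]
    }
}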
3) Results validation
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data which the data mining algorithm was not trained on. The learnt patterns are applied to this test set and the resulting output is compared to the desired output (a toy sketch of this comparison follows below).
A number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.
If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards then the final step is to interpret the learnt patterns and turn them into knowledge.
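As a toy illustration of this comparison (my own sketch, independent of any particular algorithm), the snippet below matches predicted labels on a held-out test set against the desired labels and reports the accuracy together with the confusion-matrix counts:

// Toy results-validation step: compare predicted labels on the held-out
// test set with the desired (actual) labels.
public class ValidationSketch {

    static void report(int[] actual, int[] predicted) {
        int tp = 0, tn = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1 && predicted[i] == 1) tp++;
            else if (actual[i] == 0 && predicted[i] == 0) tn++;
            else if (actual[i] == 0 && predicted[i] == 1) fp++;
            else fn++;
        }
        double accuracy = (tp + tn) / (double) actual.length;
        System.out.println("accuracy=" + accuracy
                + " tp=" + tp + " tn=" + tn + " fp=" + fp + " fn=" + fn);
    }

    public static void main(String[] args) {
        int[] actual    = {1, 0, 1, 1, 0, 0, 1, 0};
        int[] predicted = {1, 0, 0, 1, 0, 1, 1, 0};
        report(actual, predicted); // accuracy=0.75, tp=3 tn=3 fp=1 fn=1
    }
}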

(To be continued)

Wednesday, September 8, 2010

How to Implement ETL Process using Talend + SpagoBI? (2)


3. Config ETL Job from SpagoBI


1) In SpagoBI web console, click "User menu" --> "Documents execution":

We found that the newly deployed ETL job is already there:

This is because of the following settings in the file "engine-config.xml" in the folder "tomcat_home/webapps/SpagoBITalendEngine/WEB-INF/classes/":

<spagobi_autopublish>true</spagobi_autopublish>
<spagobi_functionality_label>ETL</spagobi_functionality_label>
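<!-- spagobi_autopublish publishes a newly deployed Talend job automatically;
     spagobi_functionality_label places it under the "ETL" functionality -->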


2) Create an "Analytical Driver" for the context variable (parameter) required by the ETL job


And edit the ETL job detail to associate the analytical driver with the context variable "v_lastname" (the name needs to be exactly the same as the one defined in the ETL job):

3) Save the changes.


4. Execute the ETL Job within SpagoBI

Now, after the settings are done, we can execute the ETL job from within the SpagoBI server.

Fill in the value for "v_lastname" and execute the job:

ETL process started:

For more detail about the executed ETL job, we can check its events ("User menus" --> "Events"):

Or, by checking the log file at: tomcat_home/logs/SpagoBITalendEngine.log


5. Schedule the ETL Job using SpagoBI Scheduler

1) Click "User menus" --> "Documents execution" and modify ETL document detail

2) Change the document state to "Released"
Save the change.

3) Click menu "Tools" --> "Scheduler", and click the button "New Activity" to add a new activity:

Also set how values are assigned to the parameter, then save the activity.

4) Schedule the activity

And create a new schedule for this ETL activity, save the settings, and exit.
So far, the ETL job is scheduled to run.

5) Check the execution of scheduled job.
Click "User menu" --> "Events" to see events lists, also can check the details of start or end execution of scheduled job.



(complete)

How to Implement ETL Process using Talend + SpagoBI? (1)

Talend Open Studio is an open source data integration design tool. It can be used to solve these challenges in complicated enterprise environments: database synchronization or replication; data exchange; ETL (Extract, Transform, Load) for analytics; data migration; complex data transformation and loading; and data quality.

SpagoBI has an ETL engine based on Talend, called SpagoBITalendEngine. The SpagoBI environment provides a server context to schedule or execute the ETL process application (which is exported from Talend Open Studio).

In this short tutorial, I demonstrate using Talend Open Studio (TOS) to design a simple ETL job that extracts data from two Oracle tables, performs data transformation, and loads the data into one MySQL table. After the ETL job is tested on TOS, the final job script is exported. The exported job script is a standalone application (in Java or Perl), which contains all the transformation logic and metadata information, and can be executed outside of TOS.

In the following demo, the ETL job is exported/deployed into the SpagoBI server, using the deploy tool provided by TOS. On SpagoBI, an ETL document is created and tested; then it can be released to the SpagoBI scheduler.


Environment
============
Java: JDK 1.6.0_20
SpagoBI: version 2.6
Database 1: Oracle 11g R2 + OE (Order Entry) schema
Database 2: MySQL 5.1.48
App Server: Tomcat 6.0.28
Talend: Open Studio 4.0.2


1. Design ETL Process using Talend Open Studio (TOS)

1) Download and install TOS from http://www.talend.com/; for how to use TOS, please see the TOS user manual for details.

2) Define the data source (Oracle) connection and data target (MySQL) connection in the repository of TOS, as shown below:


and "Retrieval Schema" to find out the two source tales ("ORDERS", "CUSTOMERS") and one target table ("ORDR_CUST");

3) Design the ETL flow


and config the tMap component to embed the transformation logic


4) Create a context variable "v_lastname" to restrict the data retrieved from table "CUSTOMERS"




5) Make sure that both the source and target databases are running. Run the ETL job from within the TOS. Check the target database (MySQL) table "ORDR_CUST" to verify that it has been populated or refreshed.

So far so good: the ETL job is designed and tested successfully on TOS.
In the next step, we will deploy the ETL job to the SpagoBI server.


2. Deploy to SpagoBI Server

1) Make sure that SpagoBI server is running.

2) Config the TOS deploy tool
On TOS, select "Window" --> "Preferences"


3) Run the TOS deploy tool



4) The ETL job is successfully deployed to the SpagoBI server.


(to be continued)

Saturday, August 14, 2010

How to Build OLAP Application Using Mondrian + XMLA + SpagoBI? (3)

5. Config XMLAEngine to Access OLAP Server via XMLA

In the previous section, we deployed the OLAP schema file on the Mondrian OLAP server and tested the OLAP cube through the XMLA interface.

In this section, we will config the OLAP schema in the SpagoBI environment to enable the OLAP front-end tool (SpagoBI) to perform OLAP analysis against the Mondrian OLAP server.

1) Install and config the SpagoBI XMLAEngine; for detailed information, please refer to the document about XMLAEngine.

2) Edit the file "engine-config.xml" (under: TOMCAT_HOME/webapps/SpagoBIXmlaEngine/WEB-INF/classes/). Make it look like:


<ENGINE-CONFIGURATION>

<CONNECTIONS-CONFIGURATION defaultConnectionName="mondrianXMLA" >
<CONNECTION name="mondrianXMLA" type="xmla"
xmlaServerUrl="http://localhost:8080/mondrian/xmla" />
</CONNECTIONS-CONFIGURATION>

...

<SCHEMAS>
<SCHEMA catalogUri="/WEB-INF/queries/SalesHistory.xml" name="SalesHistory" />
</SCHEMAS>

</ENGINE-CONFIGURATION>

3) Save the changes and restart the Tomcat server;

4) To test that the "SalesHistory" OLAP schema on the Mondrian server is visible from the SpagoBI environment via the XMLA interface, modify the file "xmla.jsp" under: TOMCAT_HOME/webapps/SpagoBIJPXMLAEngine/WEB-INF/queries;

Replace the MDX query with the following one:


SELECT {[Measures].[Quantity Sold], [Measures].[Amount Sold]} ON columns,
Hierarchize({([Gender].[All Genders], [Location].[All Locations])}) ON rows FROM [Sales]
WHERE ([Time].[All Times])

Save the change, and click the following URL:

http://localhost:8080/SpagoBIJPXMLAEngine/testpage.jsp?query=xmla

We should see the same display as:



6. Build OLAP Document from SpagoBI

We have configured and confirmed that everything is working so far. The last thing to do is to create an OLAP document from SpagoBI dashboard. This OLAP document will be used by users to perform OLAP analysis against the OLAP cube (and data warehouse) we defined before.

1) Insert an OLAP document by clicking the "Insert" button:


2) Config the document detail page as shown below:


3) Save the document, and click the "Template Build" button to config & save a template file:

4) Save the template (or add parameters if needed):

5) Execute the newly built OLAP document (application).


(Complete)

Friday, August 13, 2010

How to Build OLAP Application Using Mondrian + XMLA + SpagoBI? (2)

3. Design Aggregation Tables

When the fact table contains a huge number of rows, queries that calculate at higher dimension levels take longer, because the full dataset needs to be scanned. Another reason lies in how ROLAP works: without aggregation (pre-calculation), ROLAP reads detail data from the database tables and calculates on the fly.

To solve the performance issue with huge fact tables, we can build aggregate tables which contain pre-calculated summary data. Building the aggregate tables should be part of the ETL process that populates/refreshes the data warehouse.

We will use "Aggregation Designer" to help design the aggregate tables.

1) Start "Pentaho Aggregation Designer"; (can be downloaded from http://mondrian.pentaho.com)

2) Config the "Database Properties"; in this sample, we use an Oracle database.

3) Connect to Data Source, using the OLAP schema file that we created before, as shown below:



4) Then click the "Connect" button. Once the schema has been consumed and parsed, the Aggregation Designer performs a series of validations to ensure the database is suitable for aggregate creation, and the validation result is displayed.

5) We can use the "Aggregation Designer" in two ways: manual or automated (Advisor). For demo purposes, I manually created two aggregations.



6) To export the aggregation design, click the "Export" button.
From the "Export and Publish" window, we can preview the scripts for creating the aggregation tables; create the aggregation tables; populate the aggregation tables; or export (save) the Mondrian schema file (with the definitions for the aggregations) for deployment on the Mondrian OLAP server.





4. Deploy Schema File in Mondrian

Suppose that the Mondrian OLAP server is set up. (For how to install & config the Mondrian OLAP server, please see my other article.)

1) Put the schema file "SalesHistory.xml" under: TOMCAT_HOME/webapps/mondrian/WEB-INF/queries;

2) Create file "datasources.xml" under: TOMCAT_HOME/webapps/mondrian/WEB-INF. This file may look like this:


<DataSources>
<DataSource>
<DataSourceName>Provider=Mondrian;DataSource=MondrianFoodMart;</DataSourceName>

<DataSourceDescription>Mondrian FoodMart Data Warehouse</DataSourceDescription>

<URL>http://localhost:8080/mondrian/xmla</URL>

<DataSourceInfo>Provider=mondrian;Jdbc=jdbc:oracle:thin:@localhost:1521:orcl1;
JdbcUser=sh;JdbcPassword=xxx;JdbcDrivers=oracle.jdbc.OracleDriver;
Catalog=/WEB-INF/queries/SalesHistory.xml
</DataSourceInfo>

<ProviderName>Mondrian</ProviderName>

<ProviderType>MDP</ProviderType>

<AuthenticationMode>Unauthenticated</AuthenticationMode>

<Catalogs>

<Catalog name="SalesHistory">
<DataSourceInfo>Provider=mondrian;Jdbc=jdbc:oracle:thin:@localhost:1521:orcl1;
JdbcUser=sh;JdbcPassword=xxx;JdbcDrivers=oracle.jdbc.OracleDriver;
Catalog=/WEB-INF/queries/SalesHistory.xml
</DataSourceInfo>
<Definition>/WEB-INF/queries/SalesHistory.xml</Definition>
</Catalog>

</Catalogs>
</DataSource>
</DataSources>

3) Modify the file "mondrian.properties" under: TOMCAT_HOME/webapps/mondrian/WEB-INF/classes

Add the following two lines to allow the use of aggregates:

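# Use=true: answer queries from aggregate tables where possible
# Read=true: recognize and read the aggregate tables when the schema is loaded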
mondrian.rolap.aggregates.Use=true
mondrian.rolap.aggregates.Read=true


4) Restart the Tomcat server;

5) To test that the deployed OLAP schema is working, modify the file "xmla.jsp" under: TOMCAT_HOME/webapps/mondrian/WEB-INF/queries;

Replace the MDX query with the following one:


SELECT {[Measures].[Quantity Sold], [Measures].[Amount Sold]} ON columns,
Hierarchize({([Gender].[All Genders], [Location].[All Locations])}) ON rows FROM [Sales]
WHERE ([Time].[All Times])


Save the change, and click the following URL:

http://localhost:8080/mondrian/testpage.jsp?query=xmla
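If everything is configured correctly, the test page displays the query result. The same smoke test can also be run programmatically; below is a minimal Java sketch using the olap4j XMLA driver (an assumption on my part: olap4j is not part of this tutorial's setup, so olap4j.jar and its dependencies would need to be added to the classpath):

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;

import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;
import org.olap4j.layout.RectangularCellSetFormatter;

// Runs the same test MDX query against the Mondrian XMLA endpoint from Java.
public class XmlaSmokeTest {
    public static void main(String[] args) throws Exception {
        Class.forName("org.olap4j.driver.xmla.XmlaOlap4jDriver");
        Connection connection = DriverManager.getConnection(
                "jdbc:xmla:Server=http://localhost:8080/mondrian/xmla;Catalog=SalesHistory");
        OlapConnection olap = connection.unwrap(OlapConnection.class);
        OlapStatement statement = olap.createStatement();
        CellSet result = statement.executeOlapQuery(
                "SELECT {[Measures].[Quantity Sold], [Measures].[Amount Sold]} ON COLUMNS, "
              + "Hierarchize({([Gender].[All Genders], [Location].[All Locations])}) ON ROWS "
              + "FROM [Sales] WHERE ([Time].[All Times])");
        // Print the cell set as a simple text grid.
        new RectangularCellSetFormatter(false).format(result, new PrintWriter(System.out, true));
        connection.close();
    }
}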



(to be continued)

Monday, August 9, 2010

How to Build OLAP Application Using Mondrian + XMLA + SpagoBI? (1)

Mondrian is an OLAP (Online Analytical Processing) server that enables business users to analyze large quantities of data in real-time. Users explore business data by drilling into and cross-tabulating information with speed-of-thought response times to complex analytical queries.

XMLA (XML for Analysis) is the industry standard for data access in analytical systems, such as OLAP and Data Mining. XMLA is based on other industry standards such as XML, SOAP and HTTP. In this tutorial, we use Mondrian OLAP server as XMLA provider.

SpagoBI uses SpagoBIXMLAEngine to query an external XMLA-compliant server such as the Mondrian OLAP server.

This tutorial demonstrates the steps to build an OLAP application using the Mondrian OLAP server as the XMLA provider and SpagoBI as the reporting front-end.


Environment
============
SpagoBI: version 2.6
Mondrian: 3.2.0.13661
Schema Workbench: 3.2.0
Aggregation Designer: 1.2.0
Database: Oracle 11g R2 + SH (Sales History) schema
Java: JDK 1.6
App Server: Tomcat 6.0.28


1. Prepare Data Warehouse

We use the Oracle sample database SH (Sales History) as our data warehouse.
The SH tables are organized in a snowflake schema, as shown below:




2. Build OLAP Model

In Mondrian, we use schema file (in XML format) to define the OLAP model. The schema file contains a logical model, consisting of cubes, hierarchies, and members, and a mapping of this model onto a physical model.

There are two ways to build the schema file: 1) by using a text editor; or 2) by using Schema Workbench. In this demonstration, we use Schema Workbench to build the OLAP logical model. To learn more about the Mondrian schema syntax and how to use Mondrian Schema Workbench, please go to http://mondrian.pentaho.com




We can test the above schema by executing an MDX query:
First, start the "MDX Query" window:



Then, test the OLAP schema as shown:
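For example, the following query (the same one used later to test the XMLA interface) returns the two measures broken down by the top-level Gender and Location members:

SELECT {[Measures].[Quantity Sold], [Measures].[Amount Sold]} ON columns,
Hierarchize({([Gender].[All Genders], [Location].[All Locations])}) ON rows FROM [Sales]
WHERE ([Time].[All Times])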



Finally, save the schema file as "SalesHistory.xml" and exit Schema Workbench.

Below is the complete schema file:


<Schema name="SalesHistory">
<Cube name="Sales" cache="true" enabled="true">
<Table name="SALES" schema="SH">
</Table>
<Dimension type="StandardDimension" foreignKey="CUST_ID" name="Gender">
<Hierarchy hasAll="true" allMemberName="All Genders" primaryKey="CUST_ID">
<Table name="CUSTOMERS" schema="SH">
</Table>
<Level name="Gender" column="CUST_GENDER" type="String"
uniqueMembers="true" levelType="Regular" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Dimension type="TimeDimension" foreignKey="TIME_ID" name="Time">
<Hierarchy hasAll="true" primaryKey="TIME_ID">
<Table name="TIMES" schema="SH">
</Table>
<Level name="Year" column="CALENDAR_YEAR" type="Integer" uniqueMembers="true"
levelType="TimeYears" hideMemberIf="Never">
</Level>
<Level name="Quarter" column="CALENDAR_QUARTER_NUMBER" type="Integer"
uniqueMembers="false" levelType="TimeQuarters">
</Level>
<Level name="Month" column="CALENDAR_MONTH_NUMBER" uniqueMembers="false"
type="Integer" levelType="TimeMonths" hideMemberIf="Never">
</Level>
<Level name="Day" column="DAYS_IN_CAL_MONTH" type="Integer"
uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Dimension type="StandardDimension" foreignKey="CUST_ID" name="Location">
<Hierarchy hasAll="true" primaryKey="CUST_ID" primaryKeyTable="CUSTOMERS">
<Join leftKey="COUNTRY_ID" rightKey="COUNTRY_ID">
<Table name="CUSTOMERS" schema="SH">
</Table>
<Table name="COUNTRIES" schema="SH">
</Table>
</Join>
<Level name="Country" table="COUNTRIES" column="COUNTRY_NAME" type="String"
uniqueMembers="true" levelType="Regular" hideMemberIf="Never">
</Level>
<Level name="State" table="CUSTOMERS" column="CUST_STATE_PROVINCE" type="String"
uniqueMembers="true" levelType="Regular" hideMemberIf="Never">
</Level>
<Level name="City" table="CUSTOMERS" column="CUST_CITY" type="String"
uniqueMembers="true" levelType="Regular" hideMemberIf="Never">
</Level>
</Hierarchy>
</Dimension>
<Measure name="Quantity Sold" column="QUANTITY_SOLD" datatype="Numeric"
formatString="#,###" aggregator="sum" visible="true">
</Measure>
<Measure name="Amount Sold" column="AMOUNT_SOLD" datatype="Numeric"
formatString="#,###.##" aggregator="sum" visible="true">
</Measure>
</Cube>
</Schema>


(to be continued)

Thursday, July 8, 2010

How to Install & Config Mondrian OLAP Server? (1)

Mondrian is an OLAP server written in Java. It enables you to interactively analyze very large datasets stored in SQL databases without writing SQL.

SpagoBI uses SpagoBIXMLAEngine to access an XMLA-compliant server such as the Mondrian OLAP server.

This article records the details of installing and configuring the Mondrian OLAP server in the following environment. The solution below solved the reported issues that Mondrian doesn't work well with JDK 1.6. You can find these issues here and there.

Environment
============
SpagoBI: version 2.5
OS Server: Windows server 2008 R2 (x64)
App Server: Tomcat 6.0.26
Database: MySQL 5.1.48
Java: JDK 1.6 or JDK 1.5
Mondrian: 3.2.0.13661
Axis: 1.4


Preparation Work
  • Download Tomcat server 6.0.26 from here, and unzip to local folder tomcat_home/;
  • Download MySQL server 5 and jdbc driver (mysql-connector-java-5.1.12-bin.jar) from here, and setup by following MySQL documentation;
  • Download Mondrian 3.2.0.13661 from here, and unzip to local folder mondrian-3.2.0.13661/;
  • Download Apache Axis 1.4 from here, and unzip to local folder axis-1_4/.

Setup the Test Data in MySQL
Please follow the official Mondrian installation guide to install the test database "foodmart" into MySQL database.

Deploy Mondrian Web Application to Tomcat
  • Explode the war file (mondrian-3.2.0.13661/lib/mondrian.war) to the folder tomcat_home/webapps/mondrian;
  • Delete the following jar files from the folder tomcat_home/webapps/mondrian/WEB-INF/lib:
    • saaj-api.jar
    • saaj-ri.jar
    • xalan.jar
  • Copy the following jar files from the folder axis-1_4/lib to tomcat_home/webapps/mondrian/WEB-INF/lib:
    • axis.jar
    • commons-discovery-0.2.jar
    • saaj.jar
    • jaxrpc.jar
    • wsdl4j-1.5.1.jar
  • Also copy the MySQL jdbc driver jar file to the folder tomcat_home/webapps/mondrian/WEB-INF/lib;
  • Modify the file mondrian.properties in the folder tomcat_home/webapps/mondrian/WEB-INF;
  • Modify the file datasources.xml in the same folder;
  • Modify the file web.xml in the same folder;
  • Modify the fourhier.jsp, mondrian.jsp, colors.jsp, testrole.jsp, and arrows.jsp files in the folder tomcat_home/webapps/mondrian/WEB-INF/queries, as specified in the Mondrian installation guide.

Start MySQL Database

Start Tomcat server

Hit http://localhost:8080/mondrian.