Global Info Intel: Global Information Intelligence

Intelligence and Solutions on Global Information Trends


Global Quantitative Research on Education Information Intelligence Trends

 

 

Key Intelligence and Expert Resources on Critical Global Quantitative Research Methods for Education Information Trends and Solutions

  
 


 

 

 

Summary

 

Quantitative Research Methods: Evidence-Based Research

Quantitative research methods support rigorous scientific evaluation of the effectiveness of reforms in areas such as US education, critical infrastructure, electricity markets, economics, health care costs, ethics and integrity, and regulation.

The Empirical Quantitative Research Approach includes the following:

·        Experimental Quantitative and Qualitative Research Methods

·        Evaluation of existing intervention Programs and Activities 

·        Implementation capacities of Research Programs and Activities

See information on using Quantitative Methods at Harvard: 

http://www.ethics.harvard.edu/

http://www.hsph.harvard.edu/

http://www.edlabs.harvard.edu/

 

Quantitative Research Methods

  • Evidence-Based Research

  • Quantitative Research analysis

  • Comparative Analysis

     

·        Econometrics and Labor Economics

o      Regression Analysis

§        Linear Regression

§        “Natural Experiments” Instead of “Controlled Experiments”

§        Observational Data

§        Omitted-Variable Bias

§        Single Equation Methods Model – Dependent Variable

§        Simultaneous Equation Methods– Instrumental Variables

o      Data Sets

§        Time Series Data – observations of one variable over time

§        Cross Sectional Data – observations of many units at one point in time, e.g. many students’ performance in a given year

§        Panel Data  –  combines Time Series Data and Cross Sectional Data: the same units observed over several periods

§        Multidimensional Panel Data – contains observations across time, cross-sectionally, and across some third dimension
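The relationship between the three data set types can be shown with a small sketch. The student names and scores below are made up for illustration, and the helper names are hypothetical: a panel is stored once, and the time series and cross-sectional views are slices of it.

```python
# Hypothetical test scores keyed by (student, year); all values are invented.
panel = {
    ("ana", 2019): 71, ("ana", 2020): 75, ("ana", 2021): 80,
    ("ben", 2019): 64, ("ben", 2020): 66, ("ben", 2021): 70,
}

def time_series(panel, student):
    """One student's observations ordered by year (a time series)."""
    return [(year, score) for (s, year), score in sorted(panel.items()) if s == student]

def cross_section(panel, year):
    """All students' observations in a single year (a cross section)."""
    return {s: score for (s, y), score in panel.items() if y == year}

print(time_series(panel, "ana"))
print(cross_section(panel, 2020))
```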

Scientific Quantitative Statistical Research Methods

Statistical Methods

·        Statistical Chi Square Methods

·        Statistical Rule Induction

·        Statistical Clustering

·        Variance, Covariance, Distribution, Variables

·        Cluster Means

·        Sum of Squares

·        Support Vector Machines

 

Artificial Intelligence and Machine Learning

·   Statistical Pattern Recognition

·   Classification, Supervised Algorithms

·     Hidden Markov Models and applications

·    Decision processes and reinforcement learning

·    Machine Learning

·       Bayesian and Neural Networks: Representation, Inference and Learning


 

       Quantitative Research Methods

·        Tools: SPSS, Rosetta, XL Miner Software

·        Data Mining and Reality Mining

·        Hybrid Data Mining and Statistical Analysis

 

Quantitative Research Data

·        Data Sets

·        Real Datasets

·        Historical Data Sets

·        Current Data Sets

·        Emerging Data Sets

·        Training and Test Data Sets

·        Feature Dictionary

·        Feature Sets

·        Feature Attributes

·        Categories

·        Subcategories

·        Classes

·        Types

·        Attributes

 

Quantitative Statistical Research: Data Mining Algorithms

·        Bayesian Classification

·        Discriminant Analysis

·        Ward Clustering

·        Chi Square Statistical Analysis

·        Association Rules

·        K-Means Clustering

·        Sum of Squares

·        Rule Induction Algorithms

·        Cubist/C5.0 Rule Induction

·        Holte's 1R Rule Induction

·        Genetic Algorithms

·        Discriminant Component Analysis (DCA)

·        Fisher’s Discriminant Analysis

·        Principal Component Analysis (PCA)

·        Regression test statistics

·        Principal Component Regression

·        Linear combinations of student deviations

·        Statistical methods: Sequence of performance of student profile and past Performance.

·        Probability distribution of educational data

·        Probabilities with maximum variance, correlation and non-correlation

·        Transition matrix of possible groups

·        Bayesian approach to performance transitions

·        Bayes Factor statistic - testing null hypothesis:

- Observed performance transition probabilities

- Profiled performance transition matrix

·        Decision tree classification

·        Rule learner

·        Naive Bayes

·        Maximum Support Rules

·        Conditional Rules

·        Filtering for Maximum Support Attributes

·        Validation of Test Data

·        Statistical theory and neural networks

·        Ripley neural nets- class of statistical models

·        High-dimensional parameters model choice

·        Predictive Bayesian inference in computation

·        Neural Networks

-         Multi-layer perceptron

-         Back propagation network

-         Feed-forward network (FFNN)

-         Performance limits of multilayer perceptrons (NN) on student data

-         Performance analysis in terms of over-fitting in NN complexity

·        Linear Discriminant, nearest neighbor, etc.

·        Reducing over-fitting using logistic regression

·        Performance conditions and characteristics

·        Convergence or initialization conditions

·        Representation and data sizes

 

Quantitative Statistical Techniques

·        Pattern recognition using neural networks

·        Frequent pattern mining techniques

·        Pattern Recognition

·        Patterns Analysis

·        Correlation and Cross-Tabulation

·        Cross-Correlation

·        Measurement of Failures and Effectiveness

·        Monitoring, Detection, Prevention and Response

·        Feature Cost-Sensitive (FCS)

·        Cost Factors - Dynamic Costs

·        Fault, Failure and Success Detection

·        Heuristics

 

Quantitative Research: Field research Data Acquisition and Analysis

·        Records on Performance

·        Field Surveys

·        Case Studies

·        Questionnaires

·        Test Cases

·        Test Procedures

·        Performance Matrices

·        Corroborative Data

·        Machine Learning

·        Artificial Intelligence

·        Quantitative Statistical Analysis

  

Evidence Results and Presentation

·        Quantitative Statistical Analysis Outputs

·        Receiver Operating Characteristic (ROC) curves of performance

·        Confusion Matrices 

·        Statistical Tables

·        Effectiveness, Robustness, Scalability, Transportability, Accuracies, Efficiencies 

 

Quantitative Research: Data Mining Techniques

Clustering Methods

 

1. General Agglomerative Algorithm

 

A. Centroid distance: the distance between two clusters is the distance between the cluster means (centroids).

B. Median distance: the distance between two clusters is the distance between the cluster medians.
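A minimal sketch of the two distances as defined above. It assumes Euclidean distance and takes the "cluster median" to be the coordinate-wise median, which is one possible reading of the text; the function names and sample points are hypothetical.

```python
import math
from statistics import mean, median

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(mean(col) for col in zip(*cluster))

def coordinate_median(cluster):
    """Coordinate-wise median (an assumed reading of 'cluster median')."""
    return tuple(median(col) for col in zip(*cluster))

def centroid_distance(a, b):
    return math.dist(centroid(a), centroid(b))

def median_distance(a, b):
    return math.dist(coordinate_median(a), coordinate_median(b))

# Invented sample clusters; the outlier (10, 0) moves the mean but not the median.
a = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
b = [(4.0, 3.0), (6.0, 3.0), (8.0, 3.0)]
print(centroid_distance(a, b), median_distance(a, b))
```

An agglomerative algorithm would evaluate one of these distances for every pair of clusters and merge the closest pair at each step.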

 

 

2. The Sum-of-Squares Methods

 

Methods that minimize a sum-of-squares error criterion include K-means clustering. The sum-of-squares clustering method identifies a partition of the data that optimizes a predefined clustering criterion based on the within-class and between-class scatter matrices.
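The sketch below illustrates the scalar form of this idea: it computes the within-group and between-group sums of squares (the traces of the corresponding scatter matrices) and checks that they add up to the total sum of squares. The function names and data are hypothetical.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def within_ss(clusters):
    """Squared distances of points to their own cluster mean."""
    return sum(sq_dist(p, mean_vector(c)) for c in clusters for p in c)

def between_ss(clusters):
    """Size-weighted squared distances of cluster means to the grand mean."""
    grand = mean_vector([p for c in clusters for p in c])
    return sum(len(c) * sq_dist(mean_vector(c), grand) for c in clusters)

def total_ss(clusters):
    """Squared distances of all points to the grand mean."""
    all_points = [p for c in clusters for p in c]
    grand = mean_vector(all_points)
    return sum(sq_dist(p, grand) for p in all_points)

# Invented data: two well-separated groups.
clusters = [[(1.0, 1.0), (2.0, 1.0)], [(8.0, 9.0), (9.0, 9.0), (10.0, 9.0)]]
print(within_ss(clusters), between_ss(clusters), total_ss(clusters))
```

Because the total is fixed for a given data set, minimizing the within-group term is equivalent to maximizing the between-group term.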

 

 

3. The Clustering Criteria

 

The task is for clustering methods to partition a set of n data samples into g clusters to optimize the clustering criterion.

 

 

4. Clustering Algorithms of the Sum-of-Squares Methods

 

Partitioning n objects into g groups so as to optimize the selected criterion is a combinatorial optimization problem. Solving it exactly requires the evaluation of all possible partitions.
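To see why exhaustive evaluation is rarely practical, the number of ways to split n labelled objects into g non-empty groups can be counted with the standard recurrence for Stirling numbers of the second kind. The function name below is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_partitions(n, g):
    """Number of ways to split n labelled objects into g non-empty groups."""
    if n == g:
        return 1
    if g == 0 or g > n:
        return 0
    # The n-th object either joins one of the g existing groups or starts a new group.
    return g * count_partitions(n - 1, g) + count_partitions(n - 1, g - 1)

for n in (4, 10, 25):
    print(n, count_partitions(n, min(n, 3)))
```

The count grows roughly like g^n / g!, which is why iterative heuristics such as K-means are used instead of complete enumeration.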

 

 

5. K-means Clustering

 

The K-means algorithm partitions data into k clusters to minimize the within-group sum of squares.
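A minimal sketch of the usual iterative procedure (Lloyd's iteration), assuming Euclidean distance and caller-supplied starting centers. It finds a local minimum of the within-group sum of squares, not necessarily the global one; the sample points are invented.

```python
def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, recompute the means, repeat."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)  # an empty group keeps its old center
        ]
        if new_centers == centers:  # assignments are stable: converged
            break
        centers = new_centers
    return centers, groups

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centers, groups = kmeans(points, centers=[points[0], points[3]])
print(centers)
print(groups)
```

The result depends on the starting centers, so in practice the procedure is usually run from several initializations and the partition with the lowest within-group sum of squares is kept.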

 

 

6. Selecting Number of Clusters

 

The selection of the number of clusters has been a problem in cluster validity tests and analysis, since no single method can validate unstructured clusters.

Formal validation methods include:

·        Likelihood ratio

·        Chi-square

·        Covariance matrices

 

 

7. Analysis of Sum-of-Squares Clustering Methods

 

The problem for clustering methods is to partition a set of n data samples into g clusters so as to optimize the clustering criterion.

 

Rule Induction Techniques

Rule induction techniques and algorithms are used to extract information from data because the representation of that information is intuitive and readily understood.

Rule induction methods return “if/then” rules as outputs of

  • instance-based learning (e.g. k-nearest neighbor)

  • statistical techniques (e.g. naïve Bayes classifier)

  • neural networks

  • support vector machines (SVM)

  • Classification: each training example is represented by a set of predictor attributes and a class (goal) attribute. The algorithm analyzes the relationships between the predictor attributes and the goal attribute to build a model that can later be used to predict the goal attribute of new examples.

 

1. The Divide and Conquer Rule Induction Technique

 

The divide and conquer technique generates decision trees. Decision tree algorithms use the divide and conquer technique to construct decision trees via a top-down, greedy search. The technique evaluates all the predictor attributes to classify the examples in the training set.

 

2. The Separate and Conquer Rule Induction Technique

The separate and conquer technique generates a set of rules. The technique learns a rule from a training dataset, removes the examples covered by that rule, and then recursively learns further rules that cover the remaining examples. This is the most common technique for rule induction algorithms.
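The loop described above can be sketched as follows. This is a simplified illustration, not a specific published algorithm: each rule tests a single attribute value, and the rule chosen at each step is the one with the highest precision, then the largest coverage. The student records and all names are hypothetical.

```python
def learn_one_rule(examples, attributes):
    """Pick the single-attribute test with the best (precision, coverage).
    Returns (attribute, value, predicted_class)."""
    best, best_key = None, None
    for attr in attributes:
        for value in sorted({ex[attr] for ex in examples}):
            classes = [ex["class"] for ex in examples if ex[attr] == value]
            majority = max(sorted(set(classes)), key=classes.count)
            key = (classes.count(majority) / len(classes), len(classes))
            if best_key is None or key > best_key:
                best, best_key = (attr, value, majority), key
    return best

def separate_and_conquer(examples, attributes):
    rules, remaining = [], list(examples)
    while remaining:
        attr, value, label = learn_one_rule(remaining, attributes)  # conquer
        rules.append((attr, value, label))
        remaining = [ex for ex in remaining if ex[attr] != value]   # separate
    return rules

examples = [
    {"attendance": "high", "homework": "done", "class": "pass"},
    {"attendance": "high", "homework": "missing", "class": "pass"},
    {"attendance": "low", "homework": "done", "class": "pass"},
    {"attendance": "low", "homework": "missing", "class": "fail"},
    {"attendance": "low", "homework": "missing", "class": "fail"},
]
rules = separate_and_conquer(examples, ["attendance", "homework"])
print(rules)
```

The output is an ordered rule list: a later rule is only meant to apply to examples that no earlier rule covers.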

 

3. Rule Induction Algorithms and Techniques

 

Rule induction algorithms specify procedures based on the above techniques.

I. Search mechanism – this involves the search strategy and method. The search strategy is implemented using the following procedures.

 

a. InitializeRules – specifies whether the initial rule is a generic rule without an antecedent, a specific rule derived from an example, or a rule between these two extremes.

b. RefineRules – determines whether the current rule is generalized or specialized, so that the chosen operation is consistent with the type of initial rule specified in the InitializeRules procedure. The search methods are based on SelectCandidates and FilterRules.

c. SelectCandidates – selects the subset of rules that will be generalized or specialized. The procedure is specified as a beam search, and a specific search method is obtained by instantiating the beam width parameter. A greedy search method can be obtained by setting the beam width parameter to 1.

d. FilterRules – the procedure can use the same search method as SelectCandidates or a different one. The search method can be specified similarly in both procedures.

            

II. Rule Representation – this is implemented by the RefineRules procedure, which determines the conditions that can be added to the candidate rules.

III. Rule Evaluation – this is defined directly by the EvaluateRule procedure, which determines the rule-quality measure used in the rule evaluation process.

IV. Pruning Methods – these are determined by the Stopping Criterion and Post Processing procedures. The Stopping Criterion implements pre-pruning methods by determining when to stop refining the rules, and Post Processing implements post-pruning methods.

 

 

Analysis of Separate and Conquer Algorithms

Many of the rule induction algorithms based on the separate and conquer approach differ from each other in four ways:

1.      The representation of the candidate rules;

2.      The search mechanism used to explore the space of candidate rules;

3.      How the created rules are evaluated; 

4.      The pruning method.

The rule representation has a significant influence on the learning process, since some concepts can be expressed in one representation but not in others. In particular, rules can be represented in propositional or first-order logic.

 

Propositional rule algorithms

Propositional rules are composed of selectors, which are associations between attribute-value pairs.

·        CN2 and C4.5 rules

·        RIPPER

·        Reduced Error Pruning (REP)

First-order rule algorithms (Prolog representation)

·        Inductive Logic Programming (ILP), e.g. FOIL and PROGOL

ILP uses the same principles as propositional rule induction algorithms.

 

Analysis of Rule Induction Algorithms

 

1. Association Rules Induction Technique

Association rules are rules based on the simultaneous occurrence of a set of event items that satisfy specific conditions. In association-rule discovery, any association algorithm must discover precisely the same rule set, i.e. the set of all rules that have support and confidence greater than user-specified thresholds. In rule induction techniques, association rules may be used to analyze multiple feature attributes of various datasets.

 

1. To format it into a database file where each row is an audit record and each column is a field of features in the audit records.

 

2. To enable continuous merging of the rules from each run and thereby aggregate the rule set of previous runs.
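Support and confidence can be illustrated with a short sketch. It is restricted to rules with one item on each side, treats each audit record as a set of items, and uses invented records; the function names and the use of "at least" for the thresholds are assumptions.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of records that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def association_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Hypothetical audit records, one set of observed events per record.
records = [
    {"login", "read", "write"},
    {"login", "read"},
    {"login", "write"},
    {"read"},
]
rules = association_rules(records, min_support=0.5, min_confidence=0.9)
print(rules)
```

Because the result is fully determined by the data and the two thresholds, any correct algorithm returns the same rule set; algorithms differ only in how efficiently they find it.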

 

 

2. Frequent Episodes

 

A frequent episode is a set of events that occur frequently within a time window of a specified length. In a serial episode, the events occur in sequence within the specified time window, and the episode must reach a specified minimum frequency.

 

1. Examination of the frequent associations of the feature attributes of the attack event.

 

2. Computation of the frequent sequential patterns from the associations.

 

3. The associations of attributes and sequential patterns of records are combined into a single rule.
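A minimal sketch of counting a serial episode with a sliding window. It assumes integer timestamps sorted in ascending order and defines frequency as the fraction of windows, one starting at each time step between the first and last event, that contain the episode in order. This windowing convention, the event labels, and the function names are assumptions for illustration.

```python
def contains_serial(window_events, episode):
    """True if the event types in `episode` occur in this order (gaps allowed)."""
    it = iter(window_events)
    return all(any(ev == wanted for ev in it) for wanted in episode)

def episode_frequency(events, episode, width):
    """Fraction of sliding windows [t, t + width) that contain the serial episode."""
    first, last = events[0][0], events[-1][0]
    starts = range(first, last + 1)
    hits = 0
    for t in starts:
        window = [kind for time, kind in events if t <= time < t + width]
        hits += contains_serial(window, episode)
    return hits / len(starts)

# Invented event log: (timestamp, event type).
events = [(1, "A"), (2, "B"), (3, "A"), (5, "B"), (6, "C")]
freq = episode_frequency(events, ("A", "B"), width=3)
print(freq)
```

An episode would then be reported as frequent when this value reaches the chosen minimum frequency.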

 

2. C4.5 Rules - first generates a decision tree using the divide and conquer technique, and subsequently extracts one rule for each leaf node of the tree.

3. CN2 - uses the separate and conquer technique. This group also includes the RIPPER and AQ algorithms, as well as evolutionary algorithms, such as genetic algorithms and Genetic Programming (GP), which extract rules from datasets.

 

 

4. Rule induction: ID3 algorithm

Rule induction in artificial intelligence (AI) research uses algorithms such as ID3, which uses entropy as the criterion for selecting the data fields on which the tree branches and for grouping field values between branches. The generated rules can then be summarized more succinctly. One method omits each rule condition in turn to see whether doing so causes any data to be misclassified.
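The entropy criterion can be sketched as follows: the attribute chosen for a branch is the one whose split yields the largest reduction in class entropy (the information gain). The records and attribute names are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target="class"):
    """Entropy of the class minus the weighted entropy after splitting on `attribute`."""
    base = entropy([ex[target] for ex in examples])
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

examples = [
    {"attendance": "high", "homework": "done", "class": "pass"},
    {"attendance": "high", "homework": "missing", "class": "pass"},
    {"attendance": "low", "homework": "done", "class": "fail"},
    {"attendance": "low", "homework": "missing", "class": "fail"},
]
best = max(["attendance", "homework"], key=lambda a: information_gain(examples, a))
print(best)
```

In this toy data, attendance separates the classes perfectly while homework carries no information, so the tree would branch on attendance first.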

 

Quinlan’s “Iterative Dichotomiser” (ID3) system

The ID3 rule induction algorithm is applied to a sample of data to generate a set of rules. Other data items that are misclassified by the current rules are then examined. A number of similar data items are added to the initial set, and the rule induction algorithm is re-run to generate a new set of rules.

 

5. Decision Trees

A decision tree corresponds directly to a set of rules, with as many rules as there are leaf nodes in the whole tree. Each rule is a tracing out of the path from the top of the tree to a leaf node. The key question is which of the attributes is the most useful determiner of the conclusion of the rules.
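The correspondence between leaves and rules can be sketched by tracing every root-to-leaf path. The nested-dictionary tree format and the example tree are hypothetical choices made for this illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Trace every root-to-leaf path; each path becomes one (conditions, class) rule."""
    if not isinstance(tree, dict):  # a leaf holds the predicted class
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules

# Hypothetical tree: internal nodes test one attribute, leaves are class labels.
tree = {
    "attribute": "attendance",
    "branches": {
        "high": "pass",
        "low": {
            "attribute": "homework",
            "branches": {"done": "pass", "missing": "fail"},
        },
    },
}
rules = tree_to_rules(tree)
for conditions, label in rules:
    tests = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"if {tests} then {label}")
```

The tree has three leaves, so three rules are produced, one per leaf.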

 

 

6. Rule induction: CN2 Algorithm

The CN2 algorithm induces an ordered list of classification rules from the dataset, using entropy as its search heuristic. CN2 consists of a search procedure and a control procedure.

 

7. Rule induction: C4.5 Rules

C4.5 uses a divide-and-conquer approach to growing decision trees that was pioneered by Hunt et al. The default splitting criterion used by C4.5 is the gain ratio, an information-based measure that takes into account different numbers and different probabilities of test outcomes.
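A minimal sketch of the gain ratio for nominal attributes, assuming the usual definition: information gain divided by the entropy of the split itself. It omits other details of C4.5 such as continuous attributes and missing values, and the records are invented.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, attribute, target="class"):
    """Information gain normalized by the split information of the attribute."""
    n = len(examples)
    values = [ex[attribute] for ex in examples]
    gain = entropy([ex[target] for ex in examples])
    for value, count in Counter(values).items():
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        gain -= count / n * entropy(subset)
    split_info = entropy(values)  # penalizes tests with many small outcomes
    return gain / split_info if split_info > 0 else 0.0

examples = [
    {"student_id": 1, "attendance": "high", "class": "pass"},
    {"student_id": 2, "attendance": "high", "class": "pass"},
    {"student_id": 3, "attendance": "low", "class": "fail"},
    {"student_id": 4, "attendance": "low", "class": "fail"},
]
print(gain_ratio(examples, "student_id"), gain_ratio(examples, "attendance"))
```

Both attributes separate the classes perfectly, but the identifier does so by creating one branch per record. The normalization halves its score, so the two-valued attendance test is preferred.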

 

8. RIPPER

RIPPER uses a heuristic value function and an encoding-length criterion to determine when to stop adding rules to a rule set, followed by a post-pass that optimizes the rule set. Individual rules are grown and then pruned. The encoding-length heuristic works as follows: after each rule is added, the total description length of the rule set and the examples is computed.

 

9. AQ Algorithms

The AQ algorithm is a rule induction technique for producing a complete and consistent description of classes. A class description is a disjunction of decision rules that together describe all the training examples given for that particular class. A decision rule is a conjunction of allowable tests of feature values. User-supplied parameters direct the AQ algorithm in its search for a complete and consistent set of class descriptions.

 

 

 10. Genetic Programming and Genetic Algorithm

Genetic programming (GP) is the main kind of Evolutionary Algorithm (EA) designed to evolve programs: the individuals being evolved are computer programs. Banzhaf defined GP as the direct evolution of programs or algorithms for the purpose of inductive learning. The four major kinds of EAs are genetic algorithms, genetic programming, evolutionary strategies and evolutionary programming.

 

 

11. Genetic Algorithms in Rule Induction Technique

A genetic algorithm (GA) is used to explore the space of all subsets of the given feature set. Each selected feature subset is evaluated (its fitness measured) by invoking a rule induction algorithm such as AQ15 on the correspondingly reduced feature space and training set, and measuring the recognition rate of the rules produced. The best feature subset is used in the actual design of the recognition system.

 

   I. Representation Issues

   The first step in applying GAs to the problem of feature selection is to map the search space into a representation suitable for genetic search.

   II. Fitness Function

   In order to use genetic algorithms as the search procedure, it is necessary to define a fitness function, which properly assesses the decision rules generated by the AQ algorithm.
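The two design decisions above can be sketched with a small genetic search. The representation assumed here is a bit mask with one bit per feature. The fitness function is a hypothetical stand-in: in the approach described in the text it would be the recognition rate of the rules that AQ produces for the selected features, which is not reproduced here. Population size, mutation rate, and the selection scheme are illustrative choices.

```python
import random

def genetic_feature_search(n_features, fitness, pop_size=20, generations=30,
                           mutation_rate=0.05, seed=0):
    """Evolve bit masks (1 = feature selected) and return the fittest one found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]            # truncation selection
        children = [ranked[0][:]]                    # elitism: keep the best mask
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)       # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

USEFUL = {0, 3, 5}  # hypothetical: indices of the informative features

def toy_fitness(mask):
    """Stand-in for rule accuracy: reward useful features, penalize subset size."""
    return sum(1 for i in USEFUL if mask[i]) - 0.1 * sum(mask)

best = genetic_feature_search(8, toy_fitness)
print(best, toy_fitness(best))
```

Replacing `toy_fitness` with a function that trains a rule learner on the masked features and returns its recognition rate would give the wrapper arrangement described above.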

 

 

12. Holte’s 1R Algorithm

As an empirical learning method, 1R takes as input a set of training examples, each described by the values of several attributes and a class. It then generates a rule that predicts the class from the values of a single attribute: the 1R algorithm selects the most informative attribute and bases the rule on it. Holte reported the results of experiments measuring the performance of very simple rules on the datasets commonly used in machine learning research. The specific kind of rules examined, called "1-rules", are rules that classify an object on the basis of a single attribute (i.e. they are 1-level decision trees). Holte described a system, called 1R, whose input is a set of training examples and whose output is a 1-rule.
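A minimal sketch of the core of 1R for nominal attributes: for each attribute, map every value to its most frequent class, count the training errors, and keep the attribute with the fewest errors. Handling of continuous and missing values is omitted, and the records and names are invented.

```python
from collections import Counter, defaultdict

def one_r(examples, attributes, target="class"):
    """Return (attribute, {value: predicted class}, training errors) for the best 1-rule."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)          # value -> class frequencies
        for ex in examples:
            counts[ex[attr]][ex[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

examples = [
    {"attendance": "high", "homework": "done", "class": "pass"},
    {"attendance": "high", "homework": "missing", "class": "pass"},
    {"attendance": "high", "homework": "done", "class": "pass"},
    {"attendance": "low", "homework": "done", "class": "fail"},
    {"attendance": "low", "homework": "missing", "class": "fail"},
]
attribute, rule, errors = one_r(examples, ["attendance", "homework"])
print(attribute, rule, errors)
```

The result is a 1-level decision tree: a single attribute test with one predicted class per value.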

 

 

Application of Holte’s 1R Using Rosetta

The process of using rule induction to isolate each conditional attribute, so that it can be identified in terms of its maximum support, can be illustrated with the Rosetta software. Robert Holte’s 1R Algorithm can be adapted to provide the individual support levels. Holte’s 1R was implemented using Rosetta’s 1R Reducer (Holte’s 1R Reduct), which returns all attribute sets. The set of all 1R rules, i.e., univariate decision rules, is indirectly returned as a child of the returned set of single reducts. The first implication is that 1R can be used to predict the accuracy of the rules produced by more sophisticated machine learning systems.

 

13. Quantitative Scientific Evidence Reports

 

Quantitative Evidence Reports include the following methods: 

 

Hybrid Quantitative Methods

·        Hybrid Quantitative Methods for Performance

·        Effectiveness

·        Optimization

·        Accuracies

·        Evidence Reports

 

Quantitative Statistical Heuristics Methods 

 

 


 

 