Knowledge Discovery and Data Mining Research Group
KDDRG
Miscellaneous Notes on Python
Much of the text and material posted on this page was produced by Ahmedul Kabir (thanks, Kabir!).
- General Information about Python
- Python Tutorials
- Python Books
- Python Environments
- Python Data Mining Packages
- Data Preprocessing
- Model Evaluation
- Decision Trees
- Linear Regression
- Regression with Trees
- Association Rules
- Clustering
General Information about Python:
Python Tutorials:
- For official tutorials, see the Python documentation.
- Google's Python Class
- Python Tutor
Python Books:
Python Environments:
- IDLE
- Enthought Canopy
- IPython Notebook: See Prof. Kong's and Prof. Paffenroth's DS501 webpages for more information about IPython Notebook.
Python Data Mining Packages:
Python has many open source packages available specifically for Data Mining and Knowledge Management. Here is a list of the most widely used ones, along with brief descriptions:
- Scikit-learn: Simple and efficient tools for data mining and data analysis. Has algorithms implemented in the fields of Preprocessing, Classification, Regression, Clustering, Dimensionality Reduction and Model Selection. It is built on the commonly used NumPy and SciPy packages. Scikit-learn is usually the default choice when it comes to Data Mining in Python.
- Pandas: Python Data Analysis Library: A slightly more advanced library than Scikit-learn, with a very good API. Pandas introduces some useful data structures, such as "dataframes". However, Pandas doesn't provide all of the predictive modeling tools. Pandas is used when more control is needed when working directly on raw data.
- Orange: The best thing about Orange is that it has a Graphical User Interface. Has quite a comprehensive collection of algorithms for Classification, Clustering and feature selection. It also has add-ons for Bioinformatics and Text mining.
- MLPy: Machine Learning Python: MLPy is a Machine Learning package similar to Scikit-Learn. It has most of the algorithms necessary for Data mining, but is not as comprehensive as Scikit-learn. MLPy can be used for both Python 2 and 3.
Data Preprocessing:
- Kabir's Data Preprocessing in Python slides.
- References used in Kabir's slides, and other useful links:
- Preprocessing Modules in scikit-learn
- Extensive pre-processing functionality offered in Pandas
Model Evaluation:
- In Scikit-learn:
For evaluation metrics other than accuracy, such as precision, recall, and the ROC curve, use sklearn.metrics.
See http://scikit-learn.org/stable/modules/model_evaluation.html
This module defines a number of methods that can be used for performance evaluation, such as:
- precision_recall_curve()
- roc_curve()
- precision_score()
- recall_score()
- roc_auc_score()
These metrics can also be used as the scoring criterion in cross-validation:
cross_validation.cross_val_score(model, X, y, scoring='precision')
Some possible values of the scoring parameter are 'precision', 'recall', 'f1', 'roc_auc', 'mean_absolute_error', 'mean_squared_error' and 'r2'.
- In Pandas: Python Data Analysis Library:
- In Orange:
- In MLPy: Machine Learning Python:
- In MILK: Machine Learning Toolkit: See nfoldcrossvalidation at https://pythonhosted.org/milk/api.html
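As a minimal sketch of the sklearn.metrics functions listed above (assuming scikit-learn is installed; the labels and probabilities below are invented for illustration):

```python
# Sketch: computing precision, recall and ROC AUC with sklearn.metrics.
# The true labels, predictions and probabilities here are made up.
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                      # actual class labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                      # hard predictions
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]      # predicted P(class=1)

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(roc_auc_score(y_true, y_prob))    # area under the ROC curve
```

Note that roc_auc_score takes predicted probabilities (or scores), not hard class labels, since the ROC curve is traced by sweeping a decision threshold.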
Decision Trees:
- In Scikit-learn:
The DecisionTreeClassifier class is capable of multi-class classification. It is found in the module sklearn.tree.
See http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Some important parameters:
- criterion: Function to measure the quality of a split: 'gini'/'entropy'
- max_depth: Maximum depth of the tree
- min_samples_leaf: Minimum number of samples per leaf (comparable to Weka's MinNumObj parameter)
- tree_: The Tree object
- classes_: An array with the class labels
- feature_importances_: The "Gini importance" of each feature. The higher the value, the more important the feature.
- fit(X, y): Build a decision tree from the training set where X is the matrix of predicting attributes and y is the target attribute.
- predict(X): Predict the class value for X
- score(): Returns the mean accuracy for the model
- In Pandas: Python Data Analysis Library:
- In Orange: See http://orange.biolab.si/docs/latest/reference/rst/Orange.classification.tree.html
- In MLPy: Machine Learning Python: See http://mlpy.sourceforge.net/docs/3.5/nonlin_class.html#classification-tree
- In MILK: Machine Learning Toolkit: See https://pythonhosted.org/milk/api.html#modules
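A minimal sketch of the DecisionTreeClassifier workflow described above, assuming scikit-learn is installed; the tiny XOR-style dataset is invented for illustration:

```python
# Sketch: fitting and querying a DecisionTreeClassifier on toy data.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # matrix of predicting attributes
y = [0, 1, 1, 0]                       # target attribute (XOR pattern)

clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=1,
                             random_state=0)
clf.fit(X, y)                  # build the tree from the training set
print(clf.predict([[0, 1]]))   # predicted class for a new instance
print(clf.score(X, y))         # mean accuracy on the training set
print(clf.feature_importances_)  # relative importance of each feature
```

Since the tree keeps splitting impure nodes, it fits this tiny training set exactly; on real data, max_depth and min_samples_leaf are the usual knobs for controlling overfitting.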
Linear Regression:
- In Scikit-learn:
The LinearRegression class in sklearn.linear_model can perform linear regression on a given dataset.
See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Parameters:
- fit_intercept: Boolean. If False, no intercept will be used in calculations (i.e. the data is assumed to be centered).
- normalize: Boolean. Whether to normalize the data before regression
- coef_: Coefficients of the model
- intercept_: Independent term in the model
Important methods:
- decision_function(X): Decision function of the linear model. Returns the predicted values
- fit(), predict() and score() perform the usual tasks
- In Pandas: Python Data Analysis Library:
- In Orange: See http://orange.biolab.si/docs/latest/reference/rst/Orange.regression.linear.html
- In MLPy: Machine Learning Python:
See:
- In Modular toolkit for Data Processing (MDP): See http://mdp-toolkit.sourceforge.net/api/mdp.nodes.LinearRegressionNode-class.html
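A minimal sketch of the LinearRegression class described above, assuming scikit-learn is installed; the one-feature dataset is invented and follows y = 2x + 1 exactly:

```python
# Sketch: fitting sklearn's LinearRegression on toy 1-D data.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # single predicting attribute
y = [3, 5, 7, 9]           # exactly y = 2x + 1

model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.coef_)         # fitted coefficient(s), here ~[2.0]
print(model.intercept_)    # independent term, here ~1.0
print(model.score(X, y))   # R^2 of the fit, here ~1.0
```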
Regression with Trees:
- In Scikit-learn:
The DecisionTreeRegressor class can perform regression (regression tree). It is also found in sklearn.tree.
See http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Some important parameters:
- criterion: Function to measure the quality of a split: 'mse' is the default
- max_depth: Maximum depth of the tree
- min_samples_leaf: Minimum number of samples per leaf (comparable to Weka's MinNumObj parameter)
- tree_: The tree object
- feature_importances_: The importance of each feature
- fit(X, y): Build a decision tree from the training set where X is the matrix of predicting attributes and y is the target attribute.
- predict(X): Predict the numeric target value for X
- score(): Returns R2, the coefficient of determination, of the prediction
- In Pandas: Python Data Analysis Library:
- In Orange: See http://orange.biolab.si/docs/latest/reference/rst/Orange.classification.tree.html
- In MLPy: Machine Learning Python:
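A minimal sketch of DecisionTreeRegressor on invented data with two flat segments, assuming scikit-learn is installed (the default split criterion is used rather than passing 'mse' explicitly, since the accepted criterion names have varied across scikit-learn versions):

```python
# Sketch: a regression tree on toy data with two obvious groups.
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]      # predicting attribute
y = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]         # numeric target

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5]]))   # mean of the matching leaf, here 1.0
print(reg.score(X, y))        # R^2 (coefficient of determination)
```

A single split between x=3 and x=10 separates the two groups, so each leaf predicts the mean of its segment and R^2 on the training set is 1.0.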
Association Rules:
- In Orange: See http://orange.biolab.si/docs/latest/widgets/rst/associate/associationrules.html
- Other Python implementations of Association Rules:
- PyFIM - Frequent Item Set Mining for Python, by Christian Borgelt.
Contains several Python implementations of frequent item set mining algorithms, including Apriori and FP-Growth, among others.
- Many other online Python implementations of association rule mining exist, but Orange above seems the most suitable for our projects.
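To illustrate the idea behind the frequent item set miners listed above, here is a hand-rolled sketch of the Apriori level-wise search in plain Python (not any library's implementation; the transactions are invented):

```python
# Sketch of Apriori: find all item sets appearing in >= min_support
# transactions, growing candidates one item at a time.
from itertools import combinations

transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk', 'butter'},
]
min_support = 2  # minimum number of transactions containing the set

def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent = {}                                  # itemset -> support count
    k = 1
    candidates = [frozenset([i]) for i in items]   # all 1-item candidates
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Apriori step: k-item candidates are unions of frequent (k-1)-sets;
        # any set with an infrequent subset can never be frequent.
        keys = list(level)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == k})
    return frequent

result = frequent_itemsets(transactions, min_support)
print(result)   # each frequent item set with its support count
```

On this data, all three single items and all three pairs are frequent (support 2 or 3), while {bread, milk, butter} appears only once and is pruned; association rules would then be generated from these frequent sets by thresholding confidence.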
Clustering:
Scikit-learn:
- K-means: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- Hierarchical: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
- DBSCAN: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Orange:
- K-Means: http://orange.biolab.si/docs/latest/reference/rst/Orange.clustering.kmeans.html
- Hierarchical: http://orange.biolab.si/docs/latest/reference/rst/Orange.clustering.hierarchical.html
MLPy:
- K-means and hierarchical: http://mlpy.sourceforge.net/docs/3.3/cluster.html
- An independent implementation of DBSCAN: http://iamtawit.blogspot.in/2012/12/dbscan.html
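A minimal sketch of the scikit-learn KMeans class linked above, assuming scikit-learn is installed; the two well-separated point clouds are invented for illustration:

```python
# Sketch: K-means clustering on two obvious blobs of toy 2-D points.
from sklearn.cluster import KMeans

X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],    # blob near the origin
     [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]]    # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)       # cluster index assigned to each point
print(labels)
print(km.cluster_centers_)       # coordinates of the two centroids
```

Which blob gets label 0 vs. 1 is arbitrary; what matters is that the points within each blob share a label and the two blobs differ.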