Knowledge Discovery and Data Mining Research Group
KDDRG
Miscellaneous Notes on Python
Much of the text and material posted on this page was produced by Ahmedul Kabir (thanks, Kabir!).
- General Information about Python
- Python Tutorials
- Python Books
- Python Environments
- Python Data Mining Packages
- Data Preprocessing
- Model Evaluation
- Decision Trees
- Linear Regression
- Regression with Trees
- Association Rules
- Clustering
General Information about Python:
Python Tutorials:
- For official tutorials, see the Python documentation.
- Google's Python Class
- Python Tutor
Python Books:
Python Environments:
- IDLE
- Enthought Canopy
- IPython Notebook: See Prof. Kong's and Prof. Paffenroth's DS501 webpages for more information about IPython Notebook.
Python Data Mining Packages:
Python has many open source packages available specifically for Data Mining and Knowledge Management. Here is a list of the most widely used ones, along with brief descriptions:
- Scikit-learn: Simple and efficient tools for data mining and data analysis. Has algorithms implemented in the fields of Preprocessing, Classification, Regression, Clustering, Dimensionality Reduction and Model Selection. It is built on the commonly used NumPy and SciPy packages. Scikit-learn is usually the default choice when it comes to Data Mining in Python.
- Pandas: Python Data Analysis Library: A slightly more advanced library than Scikit-learn, with a very good API. Pandas introduces some useful data structures, such as "dataframes". However, Pandas doesn't provide all of the predictive modeling tools. Pandas is used when more control is needed when working directly on raw data.
- Orange: The best thing about Orange is that it has a Graphical User Interface. Has quite a comprehensive collection of algorithms for Classification, Clustering and feature selection. It also has add-ons for Bioinformatics and Text mining.
- MLPy: Machine Learning Python: MLPy is a Machine Learning package similar to Scikit-Learn. It has most of the algorithms necessary for Data mining, but is not as comprehensive as Scikit-learn. MLPy can be used for both Python 2 and 3.
Data Preprocessing:
- Kabir's Data Preprocessing in Python slides.
- References used in Kabir's slides, and other useful links:
- Preprocessing Modules in scikit-learn
- Extensive pre-processing functionality offered in Pandas
Model Evaluation:
- In Scikit-learn:
For evaluation metrics other than accuracy, such as precision, recall, and the ROC curve, use sklearn.metrics.
See http://scikit-learn.org/stable/modules/model_evaluation.html
This module defines a number of methods that can be used for performance evaluation, such as:
- precision_recall_curve()
- roc_curve()
- precision_score()
- recall_score()
- roc_auc_score()
These metrics can also be used as the scoring criterion in cross-validation:
cross_validation.cross_val_score(model, X, y, scoring='precision')
Some possible values of the scoring parameter are 'precision', 'recall', 'f1', 'roc_auc', 'mean_absolute_error', 'mean_squared_error' and 'r2'.
- In Pandas: Python Data Analysis Library:
- In Orange:
- In MLPy: Machine Learning Python:
- In MILK: Machine Learning Toolkit: See nfoldcrossvalidation at https://pythonhosted.org/milk/api.html
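As a minimal sketch of the sklearn.metrics functions listed above (assuming scikit-learn is installed; the labels and probabilities below are invented for illustration):

```python
# Sketch: computing precision, recall and ROC AUC with sklearn.metrics.
# The true labels, predictions and probabilities here are made up.
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                      # actual class labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                      # hard predictions
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]      # predicted P(class=1)

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(roc_auc_score(y_true, y_prob))    # area under the ROC curve
```

Note that roc_auc_score takes predicted probabilities (or scores), not hard class labels, since the ROC curve is traced by sweeping a decision threshold.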
Decision Trees:
- In Scikit-learn:
The DecisionTreeClassifier class is capable of multi-class classification. It is found in the module sklearn.tree.
See http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Some important parameters:
- criterion: Function to measure the quality of a split: 'gini'/'entropy'
- max_depth: Maximum depth of the tree
- min_samples_leaf: Minimum number of samples per leaf (comparable to Weka's MinNumObj parameter)
- tree_: The Tree object
- classes_: An array with the class labels
- feature_importances_: The "Gini importance" of each feature. The higher the value, the more important the feature.
- fit(X, y): Build a decision tree from the training set where X is the matrix of predicting attributes and y is the target attribute.
- predict(X): Predict the class value for X
- score(): Returns the mean accuracy for the model
- In Pandas: Python Data Analysis Library:
- In Orange: See http://orange.biolab.si/docs/latest/reference/rst/Orange.classification.tree.html
- In MLPy: Machine Learning Python: See http://mlpy.sourceforge.net/docs/3.5/nonlin_class.html#classification-tree
- In MILK: Machine Learning Toolkit: See https://pythonhosted.org/milk/api.html#modules
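A minimal sketch of the DecisionTreeClassifier workflow described above, assuming scikit-learn is installed; the tiny XOR-style dataset is invented for illustration:

```python
# Sketch: fitting and querying a DecisionTreeClassifier on toy data.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # matrix of predicting attributes
y = [0, 1, 1, 0]                       # target attribute (XOR pattern)

clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=1,
                             random_state=0)
clf.fit(X, y)                  # build the tree from the training set
print(clf.predict([[0, 1]]))   # predicted class for a new instance
print(clf.score(X, y))         # mean accuracy on the training set
print(clf.feature_importances_)  # relative importance of each feature
```

Since the tree keeps splitting impure nodes, it fits this tiny training set exactly; on real data, max_depth and min_samples_leaf are the usual knobs for controlling overfitting.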
Linear Regression:
- In Scikit-learn:
The LinearRegression class in sklearn.linear_model can perform linear regression on a given dataset.
See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Parameters:
- fit_intercept: Boolean. If False, no intercept will be used in calculations (i.e. the data is assumed to be centered).
- normalize: Boolean. Whether to normalize the data before regression
- coef_: Coefficients of the model
- intercept_: Independent term in the model
Important methods:
- decision_function(X): Decision function of the linear model. Returns the predicted values
- fit(), predict() and score() perform the usual tasks
- In Pandas: Python Data Analysis Library:
- In Orange: See http://orange.biolab.si/docs/latest/reference/rst/Orange.regression.linear.html
- In MLPy: Machine Learning Python:
See:
- In Modular toolkit for Data Processing (MDP): See http://mdp-toolkit.sourceforge.net/api/mdp.nodes.LinearRegressionNode-class.html
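A minimal sketch of the LinearRegression class described above, assuming scikit-learn is installed; the one-feature dataset is invented and follows y = 2x + 1 exactly:

```python
# Sketch: fitting sklearn's LinearRegression on toy 1-D data.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # single predicting attribute
y = [3, 5, 7, 9]           # exactly y = 2x + 1

model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.coef_)         # fitted coefficient(s), here ~[2.0]
print(model.intercept_)    # independent term, here ~1.0
print(model.score(X, y))   # R^2 of the fit, here ~1.0
```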
Regression with Trees:
- In Scikit-learn:
The DecisionTreeRegressor class can perform regression (regression tree). It is also found in sklearn.tree.
See http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Some important parameters:
- criterion: Function to measure the quality of a split: 'mse' is the default
- max_depth: Maximum depth of the tree
- min_samples_leaf: Minimum number of samples per leaf (comparable to Weka's MinNumObj parameter)
- tree_: The tree object
- feature_importances_: The importance of each feature
- fit(X, y): Build a decision tree from the training set where X is the matrix of predicting attributes and y is the target attribute.
- predict(X): Predict the numeric target value for X
- score(): Returns R2, the coefficient of determination, of the prediction
- In Pandas: Python Data Analysis Library:
- In Orange: See http://orange.biolab.si/docs/latest/reference/rst/Orange.classification.tree.html
- In MLPy: Machine Learning Python:
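A minimal sketch of DecisionTreeRegressor on invented data with two flat segments, assuming scikit-learn is installed (the default split criterion is used rather than passing 'mse' explicitly, since the accepted criterion names have varied across scikit-learn versions):

```python
# Sketch: a regression tree on toy data with two obvious groups.
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]      # predicting attribute
y = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]         # numeric target

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5]]))   # mean of the matching leaf, here 1.0
print(reg.score(X, y))        # R^2 (coefficient of determination)
```

A single split between x=3 and x=10 separates the two groups, so each leaf predicts the mean of its segment and R^2 on the training set is 1.0.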
Association Rules:
- In Orange: See http://orange.biolab.si/docs/latest/widgets/rst/associate/associationrules.html
- Other Python implementations of Association Rules:
- PyFIM - Frequent Item Set Mining for Python, by Christian Borgelt.
Contains several Python implementations of frequent item set mining algorithms, including Apriori and FP-Growth, among others.
- Many other online Python implementations of association rule mining exist, but Orange above seems the most suitable for our projects.
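To illustrate the idea behind the frequent item set miners listed above, here is a hand-rolled sketch of the Apriori level-wise search in plain Python (not any library's implementation; the transactions are invented):

```python
# Sketch of Apriori: find all item sets appearing in >= min_support
# transactions, growing candidates one item at a time.
from itertools import combinations

transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk', 'butter'},
]
min_support = 2  # minimum number of transactions containing the set

def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent = {}                                  # itemset -> support count
    k = 1
    candidates = [frozenset([i]) for i in items]   # all 1-item candidates
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Apriori step: k-item candidates are unions of frequent (k-1)-sets;
        # any set with an infrequent subset can never be frequent.
        keys = list(level)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == k})
    return frequent

result = frequent_itemsets(transactions, min_support)
print(result)   # each frequent item set with its support count
```

On this data, all three single items and all three pairs are frequent (support 2 or 3), while {bread, milk, butter} appears only once and is pruned; association rules would then be generated from these frequent sets by thresholding confidence.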
Clustering:
Scikit-learn:
- K-means: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- Hierarchical: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
- DBSCAN: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Orange:
- K-Means: http://orange.biolab.si/docs/latest/reference/rst/Orange.clustering.kmeans.html
- Hierarchical: http://orange.biolab.si/docs/latest/reference/rst/Orange.clustering.hierarchical.html
MLPy:
- K-means and hierarchical: http://mlpy.sourceforge.net/docs/3.3/cluster.html
- An independent implementation of DBSCAN: http://iamtawit.blogspot.in/2012/12/dbscan.html
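A minimal sketch of the scikit-learn KMeans class linked above, assuming scikit-learn is installed; the two well-separated point clouds are invented for illustration:

```python
# Sketch: K-means clustering on two obvious blobs of toy 2-D points.
from sklearn.cluster import KMeans

X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],    # blob near the origin
     [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]]    # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)       # cluster index assigned to each point
print(labels)
print(km.cluster_centers_)       # coordinates of the two centroids
```

Which blob gets label 0 vs. 1 is arbitrary; what matters is that the points within each blob share a label and the two blobs differ.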