iBrokeTheCode committed Β· Commit 88dfbf3 Β· 1 Parent(s): 453dbe9

chore: Add LESSONS file

Files changed (1): LESSONS.md (added, +140 -0)

# Lessons

## Table of Contents

1. [πŸ—οΈ Building a Consistent Workflow with Pipelines and ColumnTransformers](#1-building-a-consistent-workflow-with-pipelines-and-columntransformers)
2. [πŸ€– Efficient Hyperparameter Tuning with RandomizedSearchCV](#2-efficient-hyperparameter-tuning-with-randomizedsearchcv)
3. [πŸš€ High-Performance Modeling with LightGBM](#3-high-performance-modeling-with-lightgbm)
4. [πŸ’Ύ Saving and Deploying a Complete Model Pipeline](#4-saving-and-deploying-a-complete-model-pipeline)

---

## 1. πŸ—οΈ Building a Consistent Workflow with Pipelines and ColumnTransformers

A machine learning model is more than just an algorithm; it's a complete data processing workflow. The `Pipeline` and `ColumnTransformer` classes from `scikit-learn` are essential for creating a robust and reproducible process.

- `ColumnTransformer` allows you to apply different preprocessing steps (like scaling numerical data and encoding categorical data) to different columns in your dataset simultaneously.
- `Pipeline` chains these preprocessing steps with a final model. This ensures that the exact same transformations are applied to your data during training and prediction, preventing data leakage and consistency errors.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define different preprocessing steps for numerical and categorical data
numerical_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a preprocessor that applies these pipelines to the correct columns
# (numerical_cols and categorical_cols are lists of column names from your dataset)
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

# Build a final pipeline with the preprocessor and the model
# (LogisticRegression is a stand-in; any scikit-learn estimator fits here)
final_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

final_pipeline.fit(X_train, y_train)
```
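
If you prefer not to hard-code the column lists, `make_column_selector` can pick columns by dtype at fit time. A minimal sketch reusing the pipelines above (`make_column_selector` is standard scikit-learn API; the dtype choices are an assumption about your data):

```python
from sklearn.compose import ColumnTransformer, make_column_selector

# Select columns by dtype instead of listing them by name
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, make_column_selector(dtype_include='number')),
    ('cat', categorical_pipeline, make_column_selector(dtype_include=object))
])
```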

---

## 2. πŸ€– Efficient Hyperparameter Tuning with RandomizedSearchCV

Hyperparameters are settings that are not learned from data but are set before training. Finding the best combination of these settings is crucial for optimal model performance.

- `RandomizedSearchCV` is a powerful and efficient method for hyperparameter tuning. Instead of exhaustively checking every possible combination like `GridSearchCV`, it samples a fixed number of combinations from a defined parameter space.
- This approach is much faster than an exhaustive search and often finds a very good set of hyperparameters, making it an excellent choice when computational resources are limited.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the model to be tuned
rf = RandomForestClassifier(random_state=42)

# Define the parameter distributions to sample from
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(5, 30)
}

# Use RandomizedSearchCV to find the best hyperparameters
rscv = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=10,  # number of random combinations to try
    scoring='roc_auc',
    cv=5,
    random_state=42
)

rscv.fit(X_train, y_train)
best_params = rscv.best_params_
```
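
With the default `refit=True`, the search also refits the best configuration on all of the training data, so the tuned model can be used immediately. A short usage sketch (`X_test` is an assumed held-out split, not defined above):

```python
# Mean cross-validated ROC AUC of the best parameter combination
print(rscv.best_score_)

# Best model, already refit on all of X_train (refit=True is the default)
best_model = rscv.best_estimator_
test_proba = best_model.predict_proba(X_test)[:, 1]  # X_test: assumed held-out data
```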

---

## 3. πŸš€ High-Performance Modeling with LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is known for its speed and efficiency, making it a popular choice for both simple and complex classification tasks.

- **Speed:** LightGBM builds decision trees "leaf-wise" rather than "level-wise," which often leads to faster training and better accuracy.
- **Performance:** It is highly effective with large datasets and often provides state-of-the-art results with minimal hyperparameter tuning.
- **Integration:** It integrates seamlessly into the `scikit-learn` ecosystem, allowing it to be used within pipelines and cross-validation routines.

```python
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline

# Create a LightGBM classifier with key parameters
lgbm = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1,  # -1 means no depth limit: trees grow to full depth
    random_state=42
)

# You can fit the model directly or within a pipeline
pipeline = Pipeline(steps=[('classifier', lgbm)])
pipeline.fit(X_train, y_train)
```
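
Because boosting keeps adding trees, the model can start to overfit as rounds accumulate; a common remedy is to monitor a validation set and stop early. A minimal sketch using the callback API (this assumes `lightgbm` >= 3.3, where early stopping is configured through callbacks, and carves the validation split out of `X_train`):

```python
from lightgbm import LGBMClassifier, early_stopping
from sklearn.model_selection import train_test_split

# Hold out part of the training data for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

model = LGBMClassifier(n_estimators=2000, learning_rate=0.05, random_state=42)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric='auc',
    # Stop if validation AUC has not improved for 50 consecutive rounds
    callbacks=[early_stopping(stopping_rounds=50)]
)
```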

---

## 4. πŸ’Ύ Saving and Deploying a Complete Model Pipeline

Once a model is trained, it must be saved to a file so it can serve predictions later without being retrained. Saving the entire `Pipeline` object is a critical best practice.

- The `joblib` library is the recommended tool for saving `scikit-learn` objects. It is more efficient than the standard `pickle` module for objects containing large NumPy arrays.
- By saving the entire pipeline, you ensure that the same preprocessing steps used for training are automatically applied to new, raw data during prediction, guaranteeing consistency.

```python
import joblib
import pandas as pd

# Assuming 'final_pipeline' is your fitted pipeline:
# save the entire pipeline to a file
joblib.dump(final_pipeline, 'model_pipeline.joblib')

# Later, in a new script or application, load the model
loaded_pipeline = joblib.load('model_pipeline.joblib')

# Use the loaded pipeline to make a prediction on new, raw data
# (... stands for raw records with the same columns as the training data)
new_data = pd.DataFrame(...)
prediction = loaded_pipeline.predict(new_data)
```
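
One caveat worth noting: joblib artifacts are tied to the library versions that produced them. Recent `scikit-learn` releases raise an `InconsistentVersionWarning` when a pipeline is unpickled under a different version than the one it was saved with, so pin the same `scikit-learn` (and model library) versions in the environment that loads the file.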