In this article, we will learn about one of the state-of-the-art machine learning models: CatBoost. Here "Cat" stands for categorical, which signals that the algorithm is particularly efficient when your data contains many categorical columns.

What is CatBoost?

CatBoost (Categorical Boosting) is a high-performance, open-source gradient-boosting framework developed by Yandex. It is designed for a wide range of machine learning tasks, including classification, regression, and ranking, with a particular emphasis on handling categorical features efficiently. CatBoost stands out for its speed, accuracy, and ease of use on structured data.

How Does CatBoost Work?

Gradient boosting, an ensemble learning technique, forms the basis of CatBoost. The model begins with an initial guess, frequently the mean of the target variable. An ensemble of decision trees is then built gradually, with each tree trying to reduce the errors (residuals) left by the previous ones.

CatBoost stands out because of how well it handles categorical features. It encodes them internally using ordered target statistics instead of requiring manual encoding, and it uses a method termed "ordered boosting" to limit target leakage during training, which often improves model performance. Regularization techniques are also incorporated to avoid overfitting.

When making predictions, CatBoost combines the outputs of all the trees. It also offers feature importance rankings that help with feature selection and with understanding the model's choices. This makes CatBoost a useful tool for a variety of machine learning tasks, such as classification and regression.

Implementation of Regression Using CatBoost

We will use this dataset to perform a regression task using the CatBoost algorithm.
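Before turning to the library, the residual-fitting loop described under "How CatBoost Works" can be sketched in plain NumPy. This is only a minimal illustration of generic gradient boosting with one-split trees (stumps) on synthetic data. It is not CatBoost's actual algorithm, which adds ordered boosting, symmetric trees, and categorical encoding. The helper names `fit_stump` and `predict_stump`, the learning rate, and the data are all assumptions made for the example.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


# Synthetic one-feature regression problem (made-up data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

pred = np.full_like(y, y.mean())  # initial guess: the mean of the target
learning_rate = 0.1               # assumed value for the sketch
errors = []
for _ in range(100):
    residual = y - pred                      # what the ensemble still gets wrong
    stump = fit_stump(x, residual)           # next tree is fitted to the residuals
    pred += learning_rate * predict_stump(stump, x)
    errors.append(float(np.sqrt(np.mean((y - pred) ** 2))))

print(f"RMSE after 1 tree: {errors[0]:.3f}, after 100 trees: {errors[-1]:.3f}")
```

Each round shrinks the training error a little, which is the behaviour the description above refers to when it says every tree tries to reduce the errors of the previous ones.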
Installing Packages

To use the CatBoost model, we first have to install the catboost package using the command below:

!pip install catboost
Importing Libraries and Dataset

Python libraries make it easy to handle the data and to perform both routine and complex tasks with a single line of code.
Python3
Loading Dataset and Retrieving Information

Python3
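A minimal sketch of the loading step, assuming the data arrives as a CSV file. To keep the example runnable without the actual file, a few synthetic rows (made-up values) are read from an in-memory string. With the real dataset you would pass its path to `pd.read_csv` instead. The column names are the ones visible in the outputs of this article.

```python
import io

import pandas as pd

# Synthetic stand-in for the rental CSV; the values are made up for illustration.
csv_text = """BHK,Rent,Size,Area Type,Bathroom
2,10000,1100,Super Area,2
2,20000,800,Super Area,1
1,7500,250,Carpet Area,1
3,40000,1500,Built Area,3
2,17000,1000,Carpet Area,2
1,9000,400,Super Area,1
"""

# With the real file this would be pd.read_csv("<path to the dataset>.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # head() returns the first five rows by default
```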
Output:

Posted On BHK Rent Size Floor Area Type \

Here, we load the dataset and print its first five rows.

Python3
Output:

(4746, 12)

Here, ‘df.shape’ gives the dimensions of the dataframe ‘df’: 4746 rows and 12 columns.

Python3
Output:

<class 'pandas.core.frame.DataFrame'>

Here, ‘df.info()’ displays summary information about the dataframe ‘df’. It provides details such as the number of non-null entries in each column, the data types, and the memory usage.

Python3
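The inspection calls used in this section (‘df.shape’, ‘df.info()’ and ‘df.describe()’) can be tried together on a small frame. The sketch below uses a handful of synthetic rows, so the numbers it prints are not those of the rental dataset.

```python
import pandas as pd

# Synthetic rows for illustration; values are made up.
df = pd.DataFrame({
    "BHK": [2, 1, 3, 2],
    "Rent": [10000, 7500, 40000, 17000],
    "Size": [1100, 250, 1500, 1000],
    "Area Type": ["Super Area", "Carpet Area", "Built Area", "Carpet Area"],
})

print(df.shape)          # (number of rows, number of columns)
df.info()                # dtypes, non-null counts, and memory usage
summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
print(summary)
```

Note that ‘df.describe()’ skips the text column by default, which is why only the numeric columns appear in its output.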
Output:

BHK Rent Size Bathroom

Here, ‘df.describe()’ computes and displays basic summary statistics for the numeric columns in the dataframe.

Exploratory Data Analysis

EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. While performing EDA on this dataset, we will look at how the independent features relate to one another, that is, how one affects the other.

Python3
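One way to collect the categorical column names is sketched below. It assumes that "categorical" simply means "non-numeric" for this data. The article's own selection rule may differ, for example by also capping the number of unique values per column. The rows are synthetic.

```python
import pandas as pd

# Synthetic rows for illustration; values are made up.
df = pd.DataFrame({
    "BHK": [2, 1, 3, 2],
    "Rent": [10000, 7500, 40000, 17000],
    "Size": [1100, 250, 1500, 1000],
    "Area Type": ["Super Area", "Carpet Area", "Built Area", "Carpet Area"],
})

# Assumption: every non-numeric column is treated as categorical.
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
num_cols = df.select_dtypes(include="number").columns.tolist()

print(cat_cols)
print(num_cols)
```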
Output:

['Area Type',

Categorical Count Plots

Now let's look at how the data is distributed across the categories of these categorical columns by using a countplot from seaborn.

Python3
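A minimal sketch of a grid of count plots, assuming seaborn and matplotlib are installed. The frame is synthetic, and the second column name, "Furnishing", is a hypothetical stand-in for any further categorical column in the data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic rows for illustration; "Furnishing" is a hypothetical column.
df = pd.DataFrame({
    "Area Type": ["Super Area", "Carpet Area", "Built Area", "Super Area", "Carpet Area"],
    "Furnishing": ["Furnished", "Unfurnished", "Furnished", "Furnished", "Unfurnished"],
})
cat_cols = ["Area Type", "Furnishing"]

# One subplot per categorical column, laid out side by side.
fig, axes = plt.subplots(1, len(cat_cols), figsize=(10, 4))
for ax, col in zip(axes, cat_cols):
    sns.countplot(data=df, x=col, ax=ax)  # bar height = number of rows per category
    ax.set_title(col)

plt.tight_layout()  # keep titles and tick labels from overlapping
plt.show()          # has no visible effect under the Agg backend
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so that ‘plt.show()’ renders the grid inline.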
Output:

Here, each plot displays the distribution of counts for a specific categorical column. The ‘plt.tight_layout()’ function ensures that the subplots are properly spaced, and ‘plt.show()’ displays the grid of count plots.

Numeric Distribution Plots

Density plots are among the best tools for understanding numerical data and its distribution.

Python3
Output:

Here we can observe that neither the Rent nor the Size column is normally distributed. A heavily skewed target lets a few very large values dominate a squared-error loss, so it is common practice to bring the target, and often the skewed features as well, closer to a normal distribution before fitting a regression model. One well-known way to achieve this is a logarithmic transformation, which we will apply later in this article before feeding the data to the model.

Python3
Output:

Rent

The observations we can draw from this dataset are as follows:
Most of the observations above match what we see in real life.

Data Preprocessing

Data preprocessing is a crucial step in any ML development lifecycle. Real-world datasets are untidy, and before making use of one we have to convert it into a structured form so that some value can be drawn from it. Here, we first apply a logarithmic transformation to the Rent and Size columns, because they are not normally distributed but right-skewed, with a long tail toward large values.

Python3
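A minimal sketch of the log transformation with ‘np.log1p’, again on synthetic long-tailed data rather than the real columns. It also shows ‘np.expm1’, the inverse of ‘np.log1p’, which is what you would use to bring values back to the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed data standing in for the Rent and Size columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Rent": rng.lognormal(mean=10, sigma=1.0, size=1000),
    "Size": rng.lognormal(mean=7, sigma=0.5, size=1000),
})
original = df.copy()

skew_before = df[["Rent", "Size"]].skew()
df[["Rent", "Size"]] = np.log1p(df[["Rent", "Size"]])  # log(1 + x), safe at x = 0
skew_after = df[["Rent", "Size"]].skew()

print("skew before:", skew_before.round(2).to_dict())
print("skew after: ", skew_after.round(2).to_dict())

# expm1 undoes log1p, recovering the original values.
restored = np.expm1(df[["Rent", "Size"]])
```

Because the model is later trained on the transformed values, its predictions and error metrics are on the log scale until they are passed through ‘np.expm1’.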
Output:

To lessen the skewness of the data, this code applies a log transformation to the “Rent” and “Size” columns using np.log1p. The distribution of values in each log-transformed column is then shown on distribution plots. For side-by-side visualization, the subplots are set up in a 1×2 grid, and ‘plt.tight_layout()’ makes sure there is enough space between the subplots. Lastly, the plots are displayed with ‘plt.show()’.

One-Hot Encoding Categorical Columns

Python3
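One way to one-hot encode a categorical column is ‘pd.get_dummies’. The sketch below uses synthetic rows and keeps pandas' default column prefix, so the new columns are named "Area Type_..." rather than by the bare category names.

```python
import pandas as pd

# Synthetic rows for illustration; values are made up.
df = pd.DataFrame({
    "BHK": [2, 1, 3, 2],
    "Area Type": ["Super Area", "Carpet Area", "Built Area", "Super Area"],
})

# Each category becomes its own 0/1 column; the original column is dropped.
encoded = pd.get_dummies(df, columns=["Area Type"], dtype=int)

print(encoded)
```

Every row has exactly one 1 among the new columns, so no category is ranked above another.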
Output:

BHK Rent Size Bathroom Built Area Carpet Area Super Area \

One-hot encoding is a common way to convert categorical columns into numerical ones, because it gives no category preference over another, unlike ordinal encoding, which imposes an order on the categories.

Splitting Data

Now we will split the whole data into training and validation parts using an 85:15 ratio.

Python3
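A sketch of an 85:15 split with scikit-learn's ‘train_test_split’, on synthetic data. The feature names `f1` to `f3` and the `random_state` value are assumptions for the example. The last lines check the arithmetic for a dataset of 4746 rows: scikit-learn rounds the validation share up, which gives 712 validation rows and 4034 training rows.

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and target (made-up values, hypothetical column names).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
target = pd.Series(rng.normal(size=100), name="Rent")

X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.15, random_state=42
)
print(X_train.shape, X_val.shape)  # (85, 3) (15, 3)

# Same rule applied to 4746 rows: the validation size is rounded up.
n_val = math.ceil(0.15 * 4746)
n_train = 4746 - n_val
print(n_train, n_val)  # 4034 712
```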
Output:

((4034, 21), (712, 21))

Model Development

The data is now ready: it has been preprocessed and split into training and validation sets. Next, we import CatBoostRegressor from the catboost module and train it on our dataset.

Python3
Output:

Learning rate set to 0.051037

The training has run for around 1000 iterations (one tree is added per iteration), and we can now use the training and validation data to analyze the performance of the model.

Let's understand this code in detail. ‘CatBoostRegressor’ is a Python class provided by the catboost library for creating regression models. It is designed for regression tasks, where the goal is to predict a continuous numeric target variable from input features. In the code, ‘CatBoostRegressor(loss_function='RMSE')’ initializes a CatBoost regression model with Root Mean Squared Error (RMSE) as the loss function, which the model aims to minimize during training. The ‘model.fit()’ method trains the model on the given dataset. Here, X_train is the feature matrix containing the independent variables used for training.

Together, these tools make it possible to build and train a CatBoost regression model with RMSE as the loss function. The model is trained using the supplied training data (X_train and Y_train), with verbose logging turned on to track the training status. The model can be used to make predictions on new data after training is finished.

Prediction

Python3
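The RMSE used to evaluate the predictions can be computed in a few lines of NumPy. The sketch below uses a hypothetical helper `rmse` and made-up rents and predictions. It reports the error both on the log scale, which is the scale the metric is on if the target was transformed with ‘np.log1p’ as assumed here, and on the original scale.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical true rents and model predictions on the original scale.
rent_true = np.array([10000.0, 20000.0, 7500.0, 40000.0])
rent_pred = np.array([12000.0, 18000.0, 8000.0, 35000.0])

log_rmse = rmse(np.log1p(rent_true), np.log1p(rent_pred))  # error in log1p units
raw_rmse = rmse(rent_true, rent_pred)                      # error in rent units

print(f"RMSE on log scale: {log_rmse:.3f}")
print(f"RMSE on original scale: {raw_rmse:.1f}")
```

A log-scale RMSE is roughly a relative error, which is why it looks so much smaller than the same error expressed in rent units.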
Output:

Training RMSE: 0.308254377636178

Above, we converted the categorical features to one-hot encoded numerical columns before passing the data to the model. With CatBoost, however, we can choose not to perform this step explicitly, because the library can encode categorical columns itself.

Conclusion

In this regression study, we used CatBoostRegressor to predict continuous numerical values from structured data. We prepared the data methodically, trained the model with RMSE as the loss function, and evaluated it on the training and validation sets. CatBoost's built-in handling of categorical features simplifies the preprocessing phase, and together with its regularization techniques and gradient boosting foundation, this makes it an appealing option for regression on a variety of structured datasets.
Referred: https://www.geeksforgeeks.org