Credit Card Fraud Detection in R

Credit card fraud is a major issue that affects both consumers and financial institutions. With the rise of online transactions, detecting fraudulent activities has become increasingly challenging. Here, we will explore the process of detecting credit card fraud using machine learning techniques. We will use a real-world dataset to understand the data, apply various fraud detection algorithms, and discuss the importance of security and prevention measures.


Credit card fraud detection involves identifying unusual patterns in transaction data that deviate from normal behavior. This is typically approached using machine learning algorithms that can classify transactions as fraudulent or legitimate. The main challenges include:

  1. Imbalanced Data: Fraudulent transactions are relatively rare compared to legitimate ones, leading to highly imbalanced datasets.
  2. Feature Engineering: Extracting relevant features that can help in distinguishing between fraudulent and legitimate transactions.
  3. Model Selection: Choosing appropriate algorithms that can handle imbalanced data and provide accurate predictions.
  4. Evaluation Metrics: Using suitable metrics to evaluate model performance, especially considering the imbalance in the dataset.

Below, we provide a step-by-step guide to implementing credit card fraud detection in the R programming language using a popular dataset and a Random Forest classifier.

Step 1: Load Libraries and Data

First, install and load the necessary libraries. We’ll use readr to import the data, dplyr and ggplot2 for exploration, the caret package for data splitting and model evaluation, the randomForest package for the Random Forest algorithm, and MLmetrics for additional evaluation metrics.

Dataset Link: Credit Card Fraud

The dataset contains the following columns:

  • Time: The time elapsed, in seconds, between this transaction and the first transaction in the dataset.
  • V1 to V28: These are the principal components obtained using PCA (Principal Component Analysis). They help in reducing the dimensionality of the data.
  • Amount: The transaction amount.
  • Class: The label of the transaction. ‘1’ indicates a fraudulent transaction, while ‘0’ indicates a legitimate transaction.
R
# install.packages(c("readr", "dplyr", "randomForest", "ggplot2", "caret", "MLmetrics"))  # run once if needed
library(readr)
library(dplyr)
library(randomForest)
library(ggplot2)
library(caret)
library(MLmetrics)

# Set a seed for reproducibility
set.seed(1234)

# Read the dataset (adjust the path to your local copy)
creditData <- read_csv("C:/Users/Tonmoy/Downloads/Dataset/creditcard.csv")
head(creditData)

Output:

  Time         V1          V2        V3         V4          V5          V6          V7
1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778  0.23959855
2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081 -0.07880298
3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938  0.79146096
4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317  0.23760894
5    2 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146  0.59294075
6    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755  0.47620095
           V8         V9         V10        V11         V12        V13        V14
1  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086 -0.9913898 -0.3111694
2  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531  0.4890950 -0.1437723
3  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369  0.7172927 -0.1659459
4  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823  0.5077569 -0.2879237
5 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555  1.3458516 -1.1196698
6  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384 -0.3580907 -0.1371337
         V15        V16         V17         V18         V19         V20          V21
1  1.4681770 -0.4704005  0.20797124  0.02579058  0.40399296  0.25141210 -0.018306778
2  0.6355581  0.4639170 -0.11480466 -0.18336127 -0.14578304 -0.06908314 -0.225775248
3  2.3458649 -2.8900832  1.10996938 -0.12135931 -2.26185710  0.52497973  0.247998153
4 -0.6314181 -1.0596472 -0.68409279  1.96577500 -1.23262197 -0.20803778 -0.108300452
5  0.1751211 -0.4514492 -0.23703324 -0.03819479  0.80348692  0.40854236 -0.009430697
6  0.5176168  0.4017259 -0.05813282  0.06865315 -0.03319379  0.08496767 -0.208253515
           V22         V23         V24        V25        V26          V27         V28
1  0.277837576 -0.11047391  0.06692807  0.1285394 -0.1891148  0.133558377 -0.02105305
2 -0.638671953  0.10128802 -0.33984648  0.1671704  0.1258945 -0.008983099  0.01472417
3  0.771679402  0.90941226 -0.68928096 -0.3276418 -0.1390966 -0.055352794 -0.05975184
4  0.005273597 -0.19032052 -1.17557533  0.6473760 -0.2219288  0.062722849  0.06145763
5  0.798278495 -0.13745808  0.14126698 -0.2060096  0.5022922  0.219422230  0.21515315
6 -0.559824796 -0.02639767 -0.37142658 -0.2327938  0.1059148  0.253844225  0.08108026
  Amount Class
1 149.62     0
2   2.69     0
3 378.66     0
4 123.50     0
5  69.99     0
6   3.67     0
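
Before modeling, it helps to quantify the class imbalance highlighted earlier. Here is a minimal sketch using the already-loaded dplyr and ggplot2 packages (the log-scale axis is just to make the rare fraud class visible):

R
# Count and proportion of each class (0 = legitimate, 1 = fraud)
creditData %>%
  count(Class) %>%
  mutate(prop = n / sum(n))

# Bar chart of the class distribution (log scale so frauds are visible)
ggplot(creditData, aes(x = factor(Class))) +
  geom_bar() +
  scale_y_log10() +
  labs(x = "Class", y = "Count (log scale)")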

Step 2: Inspect the Data

Check the dataset for missing values and examine the summary statistics. The V1 to V28 columns are PCA outputs and already have zero means; only Time and Amount remain on their original scales. The train/test split follows in Step 4.

R
# Checking for missing values
sum(is.na(creditData))

# Summarize the dataset
summary(creditData)

Output:

[1] 0

      Time              V1                  V2                  V3          
 Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
 1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
 Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
 Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
 3rd Qu.:139321   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
 Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
       V4                 V5                   V6                 V7          
 Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
 1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
 Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
 Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
 3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
 Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
       V8                  V9                 V10                 V11          
 Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
 1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
 Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
 Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
 3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
 Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
      V12                V13                V14                V15          
 Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
 1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
 Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
 Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
 3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
 Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
      V16                 V17                 V18                 V19           
 Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746   Min.   :-7.213527  
 1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850   1st Qu.:-0.456299  
 Median :  0.06641   Median : -0.06568   Median :-0.003636   Median : 0.003735  
 Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000   Mean   : 0.000000  
 3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807   3rd Qu.: 0.458949  
 Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069   Max.   : 5.591971  
      V20                 V21                 V22                  V23           
 Min.   :-54.49772   Min.   :-34.83038   Min.   :-10.933144   Min.   :-44.80774  
 1st Qu.: -0.21172   1st Qu.: -0.22839   1st Qu.: -0.542350   1st Qu.: -0.16185  
 Median : -0.06248   Median : -0.02945   Median :  0.006782   Median : -0.01119  
 Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.000000   Mean   :  0.00000  
 3rd Qu.:  0.13304   3rd Qu.:  0.18638   3rd Qu.:  0.528554   3rd Qu.:  0.14764  
 Max.   : 39.42090   Max.   : 27.20284   Max.   : 10.503090   Max.   : 22.52841  
      V24                V25                 V26                V27            
 Min.   :-2.83663   Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
 1st Qu.:-0.35459   1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
 Median : 0.04098   Median :  0.01659   Median :-0.05214   Median :  0.001342  
 Mean   : 0.00000   Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
 3rd Qu.: 0.43953   3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
 Max.   : 4.58455   Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
      V28                Amount             Class         
 Min.   :-15.43008   Min.   :    0.00   Min.   :0.000000  
 1st Qu.: -0.05296   1st Qu.:    5.60   1st Qu.:0.000000  
 Median :  0.01124   Median :   22.00   Median :0.000000  
 Mean   :  0.00000   Mean   :   88.35   Mean   :0.001728  
 3rd Qu.:  0.07828   3rd Qu.:   77.17   3rd Qu.:0.000000  
 Max.   : 33.84781   Max.   :25691.16   Max.   :1.000000 
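
The summary confirms there are no missing values and that V1 to V28 are already centered. Tree-based models such as Random Forest are insensitive to feature scale, so no scaling is strictly required here. If you later switch to a scale-sensitive model (e.g. logistic regression or k-NN), a minimal sketch for standardizing the two raw-scale columns might look like this (scaledData is an illustrative name, not used in later steps):

R
# Optional: standardize Time and Amount for scale-sensitive models
# (not applied in the rest of this tutorial)
scaledData <- creditData %>%
  mutate(
    Time   = as.numeric(scale(Time)),
    Amount = as.numeric(scale(Amount))
  )
summary(scaledData[, c("Time", "Amount")])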

Step 3: Data Preprocessing

Convert the Class column to a factor so that randomForest treats this as a classification problem, and select the first 1,000 rows to work with a smaller dataset. Note that because frauds are rare, this sequential subset contains only a handful of fraud cases.

R
creditData$Class <- factor(creditData$Class)
smallData <- creditData[1:1000, ]
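
A quick sanity check on how many fraud cases remain in this subset:

R
# How many fraud cases survive the subsetting?
table(smallData$Class)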

Step 4: Split Data into Training and Test Sets

Split the subset into training (first 700 rows) and test (last 300 rows) sets, then count the classes in each partition. Because this is a sequential rather than a random split, the fraud cases may all land in one partition; a stratified alternative is sketched after the output below.

R
train <- smallData[1:700, ]
test <- smallData[701:1000, ]

train %>% select(Class) %>% group_by(Class) %>% summarise(count = n()) %>% glimpse
test %>% select(Class) %>% group_by(Class) %>% summarise(count = n()) %>% glimpse

Output:

Rows: 2
Columns: 2
$ Class <fct> 0, 1
$ count <int> 698, 2

Rows: 1
Columns: 2
$ Class <fct> 0
$ count <int> 300
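
As the output shows, the training set contains only 2 fraud cases and the test set contains none at all, so the test set cannot measure fraud detection. A better option is a stratified split with caret's createDataPartition, which preserves the class ratio in both partitions. The sketch below uses the full creditData (the names trainStrat and testStrat are illustrative):

R
# Stratified 70/30 split preserving the 0/1 class ratio
trainIndex <- createDataPartition(creditData$Class, p = 0.7, list = FALSE)
trainStrat <- creditData[trainIndex, ]
testStrat  <- creditData[-trainIndex, ]

table(trainStrat$Class)
table(testStrat$Class)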

Step 5: Build and Evaluate the Model

Train a Random Forest classifier with 100 trees on the training set, then evaluate its performance on the test data using a confusion matrix and other relevant metrics.

R
rfModel <- randomForest(Class ~ ., data = train, ntree = 100)
test$predicted <- predict(rfModel, test)

# caret's confusionMatrix takes predictions first, then the reference labels
confusionMatrix(data = test$predicted, reference = test$Class)

Output:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 300   0
         1   0   0
                                     
               Accuracy : 1          
                 95% CI : (0.9878, 1)
    No Information Rate : 1          
    P-Value [Acc > NIR] : 1          
                                     
                  Kappa : NaN        
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity :  1         
            Specificity : NA         
         Pos Pred Value : NA         
         Neg Pred Value : NA         
             Prevalence :  1         
         Detection Rate :  1         
   Detection Prevalence :  1         
      Balanced Accuracy : NA         
                                     
       'Positive' Class : 0      
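
The perfect accuracy here is an artifact, not a success: the test set contains no fraud cases, so specificity, positive predictive value, and Kappa are all undefined (NA/NaN), and the model was never tested on a single fraud. For a meaningful evaluation, you could re-train on the stratified split sketched in Step 4 and report per-class metrics via the MLmetrics package loaded in Step 1 (training on the full dataset may take a few minutes):

R
# Re-train and evaluate on the stratified split (assumes trainStrat /
# testStrat from the createDataPartition sketch above)
rfStrat <- randomForest(Class ~ ., data = trainStrat, ntree = 100)
predStrat <- predict(rfStrat, testStrat)

confusionMatrix(data = predStrat, reference = testStrat$Class, positive = "1")

# Precision, recall and F1 for the fraud class
Precision(y_true = testStrat$Class, y_pred = predStrat, positive = "1")
Recall(y_true = testStrat$Class, y_pred = predStrat, positive = "1")
F1_Score(y_true = testStrat$Class, y_pred = predStrat, positive = "1")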

Step 6: Predict New Input Values

Prepare new input values with the same columns and the same feature scale as the training data, and make predictions.

R
# Assuming the original dataset 'data' has been scaled already, 
#we will use its means and sds for scaling new data
# Extract means and standard deviations for scaling from the original data
means <- colMeans(creditData[, -ncol(creditData)])
sds <- apply(creditData[, -ncol(creditData)], 2, sd)

# Create a new input dataframe with all required features
sample_input <- data.frame(
  Time = c(0, 0),  # Assuming default values for missing columns
  V1 = c(0, 0),
  V2 = c(0, 0),
  V3 = c(0, 0),
  V4 = c(-1.5, 1.5),
  V5 = c(0, 0),
  V6 = c(0, 0),
  V7 = c(0, 0),
  V8 = c(0, 0),
  V9 = c(0.8, -0.8),
  V10 = c(0.3, -0.3),
  V11 = c(-0.1, 0.1),
  V12 = c(0.5, -0.5),
  V13 = c(0, 0),
  V14 = c(1.2, -1.2),
  V15 = c(0, 0),
  V16 = c(-0.7, 0.7),
  V17 = c(-1.0, 2.0),
  V18 = c(0.2, -0.2),
  V19 = c(0, 0),
  V20 = c(0, 0),
  V21 = c(0, 0),
  V22 = c(0, 0),
  V23 = c(0, 0),
  V24 = c(0, 0),
  V25 = c(0, 0),
  V26 = c(1.0, -1.0),
  V27 = c(0, 0),
  V28 = c(0, 0),
  Amount = c(0, 0)
)

# Scale the new input data
for (i in 1:ncol(sample_input)) {
  sample_input[, i] <- (sample_input[, i] - means[i]) / sds[i]
}

# Predict on the scaled new data
sample_predictions <- predict(rfModel, sample_input)
print(sample_predictions)

Output:

1 2 
0 0 
Levels: 0 1
  • Both sample inputs were classified as class 0 (non-fraudulent).
  • The Levels: 0 1 line lists the two possible classes: 0 for legitimate and 1 for fraud.
  • The consistent prediction of 0 suggests the input values fall within the non-fraudulent range learned by the model, which is expected given how few fraud cases were in the 700-row training set.
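
If you need a fraud score rather than a hard label, predict.randomForest can return the fraction of trees voting for each class (these are vote shares, not calibrated probabilities):

R
# Per-class vote fractions for each input row
predict(rfModel, sample_input, type = "prob")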

Conclusion

Credit card fraud detection is crucial for keeping financial transactions safe. Using machine learning algorithms and strong security measures, we can spot and stop fraudulent activities. In this article, we explored the dataset, applied a Random Forest classifier for detecting fraud, and highlighted the importance of handling class imbalance and of security in financial transactions.



