Parallelizing Python For Loops with Numba

Parallel computing is a powerful technique for speeding up computationally intensive tasks. In Python, Numba is a Just-In-Time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code, and one of its headline features is the ability to parallelize loops across multiple CPU cores. Numba exposes this capability through two main tools: the parallel=True compilation option and the prange function. In this article, we will look at how to parallelize Python for loops effectively with Numba, covering the key concepts, techniques, and best practices.

Understanding Numba’s Parallelization Capabilities

Numba offers two complementary methods for parallelizing code: automatic parallelization and explicit parallelization using prange. Automatic parallelization is enabled by passing parallel=True to the @jit or @njit decorator; Numba then analyzes the array operations in the function and runs suitable ones in parallel, which works well for embarrassingly parallel code. prange, by contrast, marks a specific loop for parallel execution, giving you direct control over what is parallelized. Note that prange only takes effect inside a function compiled with parallel=True.
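
As a quick illustration of the automatic route, here is a minimal sketch (the function and data are our own illustration, not from a specific API): with parallel=True, Numba recognizes the whole-array expression and can run it in parallel without any prange.

Python
import numpy as np
from numba import njit

@njit(parallel=True)
def scale_and_offset(a, b):
    # Whole-array expressions like this are candidates for Numba's
    # automatic parallelization; no explicit prange is required.
    return 2.0 * a + np.sin(b)

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(scale_and_offset(a, b)[:5])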

Why Parallelize Loops?

Parallelizing loops can drastically reduce the execution time of your code by distributing the workload across multiple CPU cores. This is particularly beneficial for tasks that are “embarrassingly parallel,” meaning they can be easily divided into independent subtasks.

Identifying Parallel Loops: Key Considerations

Before diving into the parallelization process, it’s crucial to determine if your for loop is a suitable candidate. The ideal loops for parallelization are:

  1. Embarrassingly Parallel: Each iteration is independent and doesn’t rely on data modified in other iterations (the sketch after this list contrasts a safe loop with an unsafe one).
  2. Computationally Intensive: The time spent within each iteration is significant enough to outweigh the overhead of parallel execution.
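
Here is a minimal sketch of that distinction (both functions are illustrative): the first loop is safe to parallelize because each iteration writes only its own index, while the second carries a dependency from one iteration to the next and must stay serial.

Python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def square_all(a):
    out = np.empty_like(a)
    for i in prange(len(a)):
        out[i] = a[i] ** 2  # each iteration touches only index i: safe to parallelize
    return out

@njit
def running_sum(a):
    out = np.empty_like(a)
    out[0] = a[0]
    for i in range(1, len(a)):
        out[i] = out[i - 1] + a[i]  # depends on the previous iteration: keep it serial
    return out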

Parallelizing Loops with Numba

Numba provides the prange function, which is used to parallelize loops. The prange function is similar to Python’s built-in range function but is designed for parallel execution.

Installation: First, you need to install Numba. You can do this using pip:

pip install numba

Example 1: Parallelizing a Simple Loop

Let’s start with a simple example where we parallelize a loop that computes the sum of squares:

Python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sum_of_squares(n):
    result = 0
    for i in prange(n):
        result += i ** 2
    return result

n = 1000000
print(sum_of_squares(n))

Output:

333332833333500000

In this example, the loop over prange(n) executes in parallel across multiple CPU cores. The accumulation into result is a reduction: Numba recognizes the += pattern inside a prange loop and safely combines the per-thread partial sums.

Example 2: Parallel Sum of Arrays

Let’s parallelize a loop that computes the sum of elements in an array.

Python
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_array(arr):
    total = 0
    for i in prange(len(arr)):
        total += arr[i]
    return total

# Example usage
import numpy as np
arr = np.arange(1000000)
print(parallel_sum_array(arr))

Output:

499999500000

In this example:

  • @njit(parallel=True) tells Numba to compile the function with parallel execution.
  • prange(len(arr)) enables parallel iteration over the array.

Example 3: Estimating Pi Using Monte Carlo Methods

Parallelizing Monte Carlo estimation of pi can also lead to substantial performance improvements. Numba supports functions from Python’s random module inside compiled code, so the loop below can be parallelized directly with prange.

Python
import random
from numba import njit, prange

@njit(parallel=True)
def calc_pi(N):
    M = 0
    for i in prange(N):
        # Draw a random point in the square [-1, 1] x [-1, 1]
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # Count points that fall inside the unit circle
        if x**2 + y**2 <= 1:
            M += 1  # reduction: Numba combines per-thread counts safely
    return 4 * M / N

# Define the number of iterations
N = 1000000

# Calculate and print the approximation of pi
pi_approx = calc_pi(N)
print(f"Approximation of pi after {N} iterations: {pi_approx}")

Output:

Approximation of pi after 1000000 iterations: 3.142464

Example 4: Using prange for Explicit Parallelization

prange is a Numba-specific function that replaces the standard Python range function in parallelized loops. It is essential to use prange when parallelizing loops, as it informs Numba which loops to parallelize. For example, in the following code snippet, prange is used to parallelize the outer loop:

Python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def csrMult_numba(x, Adata, Aindices, Aindptr, Ashape):
    numRowsA = Ashape
    Ax = np.zeros(numRowsA)
    for i in prange(numRowsA):
        Ax_i = 0.0
        for dataIdx in range(Aindptr[i], Aindptr[i + 1]):
            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]
        Ax[i] = Ax_i
    return Ax

# Example usage:

Adata = np.array([1, 2, 3, 4, 5], dtype=np.float32)
Aindices = np.array([0, 2, 2, 0, 1], dtype=np.int32)
Aindptr = np.array([0, 2, 3, 5], dtype=np.int32)
Ashape = 3  # Number of rows

# Define a vector to multiply
x = np.array([1, 2, 3], dtype=np.float32)

# Perform the matrix-vector multiplication
result = csrMult_numba(x, Adata, Aindices, Aindptr, Ashape)
print(result)

Output:

[ 7.  9. 14.]
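
In practice, the four arrays usually come straight from a SciPy CSR matrix. Here is a hedged sketch of that wiring, assuming scipy is installed (the dense matrix below reproduces the same buffers used above):

Python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 3],
                         [4, 5, 0]], dtype=np.float32))
x = np.array([1, 2, 3], dtype=np.float32)

# A CSR matrix exposes exactly the buffers the kernel expects
result = csrMult_numba(x, A.data, A.indices, A.indptr, A.shape[0])
print(result)  # matches the dense product A @ x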

Advanced Example: Parallelizing Matrix Multiplication

To illustrate a more complex use case, let’s parallelize a matrix multiplication operation.

Python
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def parallel_matrix_multiplication(A, B):
    n, m = A.shape
    p = B.shape[1]
    C = np.zeros((n, p))

    # Only the outermost loop uses prange: Numba parallelizes the
    # outermost prange, and keeping the inner loops serial ensures that
    # each C[i, j] is written by exactly one thread.
    for i in prange(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]

    return C

# Example usage
A = np.random.rand(100, 100)
B = np.random.rand(100, 100)
C = parallel_matrix_multiplication(A, B)
print(C)

Output:

[[20.80764878 23.00057672 21.9369858  ... 22.41715703 23.0755662  22.33375024]
 [21.03665146 24.0755907  22.25624691 ... 21.52803639 22.21485889 20.41275549]
 [22.08134646 25.5358516  23.7381806  ... 24.65153569 26.01077343 24.54440725]
 ...
 [20.45125475 24.54111658 22.26924075 ... 22.0734628  23.32851616 21.40838884]
 [23.03796554 24.14278303 24.24539058 ... 24.092034   26.98564742 24.086983  ]
 [24.26815164 26.91033613 25.56298534 ... 26.13709548 27.11784094 26.00035639]]

In this example:

  • parallel_matrix_multiplication multiplies two matrices A and B.
  • The outer loop over rows is parallelized with prange; the inner loops stay serial inside each thread, which keeps the updates to C race-free.
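
A quick way to sanity-check the kernel is to compare it against NumPy’s built-in matrix product:

Python
# Verify the parallel kernel against NumPy's reference implementation
assert np.allclose(C, A @ B)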

Measuring Performance Gains from Parallelization

To measure the performance gains from parallelization, you can use the time module or the timeit module. Bear in mind that Numba compiles a function on its first call, so that first call includes compilation overhead.

Python
import time
import numpy as np
from numba import njit, prange

# Define the array to sum
arr = np.random.rand(1000000)  # Array of 1,000,000 random numbers

# Without parallelization
def sum_array(arr):
    return np.sum(arr)

# With parallelization using Numba
@njit(parallel=True)
def parallel_sum_array(arr):
    total = 0.0
    for i in prange(len(arr)):
        total += arr[i]
    return total

# Measure execution time without parallelization
start_time = time.time()
sum_result = sum_array(arr)
end_time = time.time()
print("Non-parallel execution time:", end_time - start_time)
print("Sum (Non-parallel):", sum_result)

# Measure execution time with parallelization
start_time = time.time()
parallel_sum_result = parallel_sum_array(arr)
end_time = time.time()
print("Parallel execution time:", end_time - start_time)
print("Sum (Parallel):", parallel_sum_result)

Output:

Non-parallel execution time: 0.0016186237335205078
Sum (Non-parallel): 500147.43266961584
Parallel execution time: 1.089543104171753
Sum (Parallel): 500147.43266962166
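
Note that the parallel timing above is dominated by one-time JIT compilation: Numba compiles a function the first time it is called. For a fair comparison, warm the function up once before timing it, as in this sketch:

Python
# Warm-up call: triggers JIT compilation so it is excluded from the timing
parallel_sum_array(arr)

start_time = time.time()
parallel_sum_result = parallel_sum_array(arr)
end_time = time.time()
print("Parallel execution time (after warm-up):", end_time - start_time)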

Best Practices for Parallelization

  • Use prange for Parallel Loops: Always use prange instead of range for loops you want to parallelize.
  • Minimize Dependencies: Ensure that loop iterations are independent of each other to maximize parallel efficiency.
  • Profile Your Code: Use profiling tools to identify bottlenecks and verify that parallelization is improving performance; Numba’s own diagnostics can help, as the sketch below shows.
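
Numba also ships tooling that helps with these points. The sketch below (using APIs available in recent Numba releases) caps the worker-thread count and prints a report of which loops were actually parallelized; parallel_diagnostics only works after the function has been compiled by a first call.

Python
import numba

numba.set_num_threads(4)        # cap the number of worker threads
print(numba.get_num_threads())  # confirm the active thread count

parallel_sum_array(arr)  # compile first so diagnostics are available
parallel_sum_array.parallel_diagnostics(level=1)  # report what was parallelized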

Conclusion

Parallelizing for loops with Numba is a powerful technique to accelerate Python code, especially for numerical computations. By leveraging the @njit(parallel=True) decorator and the prange function, you can easily distribute workloads across multiple CPU cores. This can lead to significant performance improvements, making Numba an invaluable tool for high-performance Python programming.



