
LLMs and Code Optimization

David G. Andersen January 04, 2025

Prologue: Well, in the spirit of “owning your own stuff”, I’ve decided to move my blog back to a self-ish hosted space (the blog is static, and although I’m not hosting it on my own metal, it moves easily). Welcome to the new blog, same as the old blog.

Max Woolf recently asked whether repeatedly prompting a large language model (LLM) to “write better code” makes it, well, write better code. His answer, briefly summarized, was “somewhat, maybe, up to a point” (you should read the original article).

But what jumped out at me in the article was an apparently missed optimization that none of the LLM attempts found, and that got me wondering about where an LLM can and cannot help with optimization. Since his example is new and probably not in the training corpus for LLMs, I used his problem for this post to see how far we could go, with or without an LLM.

Let’s start with the prompt that Max uses:

Write Python code to solve this problem:

Given a list of 1 million random integers between 1 and 100,000, 
find the difference between the smallest and the largest numbers 
whose digits sum up to 30.

I used GPT-4o via Copilot for all of the examples in this post. Copilot generated the same code for me as it did for Max, in Python:

import random

random_integers = [random.randint(1, 100000) for _ in range(1000000)]

def digit_sum(n):
    return sum(int(digit) for digit in str(n))

filtered_numbers = [num for num in random_integers if digit_sum(num) == 30]

if filtered_numbers:
    min_number = min(filtered_numbers)
    max_number = max(filtered_numbers)
    difference = max_number - min_number
    print(f"The difference between the largest and smallest numbers whose digits sum up to 30 is: 
    n.to_string().chars().map(")
else:
    print("No numbers found with digits summing up to 30.")

This code takes 520ms on my M1 MBP. Nice baseline. You’ve already seen Max’s code, so we know it can be made much faster using a few different techniques:

  • Numpy / native code
  • Parallelization / vectorization
  • Avoiding unnecessary work

Since Python introduces some artificial weirdness into this problem, let’s jump to Rust, where there are no large confounding effects from a JIT or from dropping to native code via numpy, and where the effects of our choices on performance show up more clearly. Here is the same LLM-written starter code in Rust:

use rand::Rng;

fn digit_sum(n: i32) -> i32 {
    n.to_string()
        .chars()
        .map(|c| c.to_digit(10).unwrap() as i32)
        .sum()
}

fn main() {
    let mut rng = rand::thread_rng();
    let numbers: Vec<i32> = (0..1_000_000).map(|_| rng.gen_range(1..=100_000)).collect();

    let mut min_number = i32::MAX;
    let mut max_number = i32::MIN;

    for &number in &numbers {
        if digit_sum(number) == 30 {
            if number < min_number {
                min_number = number;
            }
            if number > max_number {
                max_number = number;
            }
        }
    }

    if min_number == i32::MAX || max_number == i32::MIN {
        println!("No numbers with digit sum of 30 found.");
    } else {
        let difference = max_number - min_number;
        println!("Difference between the largest and smallest numbers with digit sum of 30: {}", difference);
    }
}

Interestingly, the Rust code uses the same convert-to-a-string-and-sum-the-digits approach that the Python code does. This code takes 42ms to run. Speedup from moving to Rust: 12x.

But, of course, that number hides something, because the string-based digit_sum function is quite slow. Let’s ask the LLM to make it faster. Selecting the function and typing “make this code faster” gives a new version:

fn digit_sum(n: i32) -> i32 {
    let mut sum = 0;
    let mut num = n;
    while num > 0 {
        sum += num % 10;
        num /= 10;
    }
    sum
}

That’s better. New time: 13.2ms. Speedup from the better digit_sum: 3.2x. Notably, this gain is mostly hidden in Python: switching to the equivalent arithmetic version there doesn’t change things much because, although it does less work, it executes more operations at the Python level.

Selecting the main function and asking the LLM to make it faster failed. Copilot tried to parallelize the loop over the vector of numbers, but, of course, Rust won’t allow multiple threads to modify the same variables at the same time:

   Compiling initial v0.1.0 (/Users/dga/blog/code_examples/llm-digit-diff-examples/rust/initial)
error[E0594]: cannot assign to `min_number`, as it is a captured variable in a `Fn` closure
  --> src/main.rs:29:17
   |
29 |                 min_number = number;
   |                 ^^^^^^^^^^^^^^^^^^^ cannot assign

So here, alas, we have to part ways with our LLM for a while, though we’ll come back to it later.
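Before moving on, a side note: Rust does allow this kind of parallelization if it is expressed without shared mutable state. The sketch below is my own illustration, not anything Copilot produced; it assumes the rayon crate plus the numbers vector and digit_sum defined above, and it is the “naive” parallel flavor we’ll return to at the end, since it still runs digit_sum on every number:

use rayon::prelude::*;

// Sketch: a shared-state-free parallel min/max reduction.
// Assumes `numbers: Vec<i32>` and `digit_sum` as defined above.
fn par_min_max(numbers: &[i32]) -> (i32, i32) {
    numbers
        .par_iter()
        .filter(|&&n| digit_sum(n) == 30)
        .map(|&n| (n, n))
        .reduce(
            || (i32::MAX, i32::MIN),             // identity: "nothing seen yet"
            |a, b| (a.0.min(b.0), a.1.max(b.1)), // merge two partial results
        )
}

Each thread reduces its share of the numbers to a (min, max) pair and the pairs are merged, so nothing ever needs a lock.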

This is where our first human optimization comes in, and it’s so simple that I’m surprised that none of Max’s or my LLM-based optimization attempts found it: check whether the number could even be a new minimum or maximum BEFORE doing the relatively expensive digit-sum test. All of the LLM-generated code goes the opposite way: filter by digit sum first, then compare. This is a manual change, but an easy one:

    for &number in &numbers {
        if number < min_number || number > max_number {
            if digit_sum(number) == 30 {
                if number < min_number {
                    min_number = number;
                }
                if number > max_number {
                    max_number = number;
                }
            }
        }
    }

New time: 10.7ms. Speedup: 1.2x. This is smaller than the 5x speedup the same change produces in the Python code, because the optimized digit_sum function is much cheaper in Rust. But the change actually helps more than it appears to, because we’re now bottlenecked on a different part of the code…

A major limitation of an LLM is that it can’t run a profiler, but we can. (Here’s a fun thing to try in the future: feed the perf output to the LLM and say “now optimize”.) To save a little time, I’ll tell you that the big cost in the Rust code is, surprisingly, generating the vector of random numbers. Rust’s rand crate defaults to a cryptographically secure RNG, which is not very fast. Let’s address this by switching to the fastrand crate. Note, however, that I’m cheating a bit here: I chose fastrand because it is friendly to some parallelization we will introduce later. Our LLM makes the change easily when prompted:

switch to using fastrand instead of rand

The change it makes is correct, and the only modified line is:

    let numbers: Vec<i32> = (0..1_000_000).map(|_| fastrand::i32(1..=100_000)).collect();

New time: 2.8ms. Speedup: About 3.8x.

I tried some generic “make it faster” prompts on the LLM, but they didn’t produce anything that helped, and the results usually didn’t even compile, so I had to be more specific. Random number generation is still the bottleneck, so let’s speed it up by giving more specific instructions, selecting the “let numbers” line and saying:

Parallelize this code using rayon

And it did:

        use rayon::prelude::*;
        let numbers: Vec<i32> = (0..1_000_000).into_par_iter().map(|_| fastrand::i32(1..=100_000)).collect();

New time: 2.35ms. Speedup: About 1.2x.
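As an aside, the crates we’ve accumulated by this point would be declared in Cargo.toml roughly as follows; the version numbers are my assumptions, not something stated in the post:

[dependencies]
rand = "0.8"      # the original LLM code (thread_rng / gen_range)
fastrand = "2"    # the faster, non-cryptographic RNG
rayon = "1"       # data-parallel iterators (into_par_iter)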

At this point, you might think we’re done – after all, we’re almost 2.5x faster than the fastest, numba + numpy JIT-optimized 6ms run from Python (where most of the time is probably spent in C).

Except we’re not. We can do better, but we have to be smart about it. A naive parallelization can’t keep the “check first” optimization, because it would have to update the min/max across threads. So we have to step back to the structure of the problem and notice where the trick lies: after processing a few thousand numbers, we will have found some decent candidates for max_number and min_number. They may not be the best, but they let us skip a lot of work. Once we have them, THEN we can parallelize without worrying about something awkward like updating shared values behind a lock. So we add a sequential pre-pass, and it looks like this:

        use std::cmp::{min, max};

        // Phase 1: sequentially scan a small prefix to establish good
        // candidate values for min_num and max_num.
        const PREGEN: usize = 10_000;
        for _ in 0..PREGEN {
            let num = fastrand::i32(1..=100_000);
            if num < min_num || num > max_num {
                if digit_sum(num) == 30 {
                    min_num = min(min_num, num);
                    max_num = max(max_num, num);
                }
            }
        }

        // Phase 2: generate and filter the rest in parallel; the candidates
        // from phase 1 let most numbers skip the digit_sum call entirely.
        let numbers: Vec<i32> = (0..(1_000_000 - PREGEN))
            .into_par_iter()
            .map(|_| fastrand::i32(1..=100_000))
            .filter(|&x| (x < min_num || x > max_num) && digit_sum(x) == 30)
            .collect();

        // Final sequential pass over the (small) set of survivors.
        for num in numbers {
            if num < min_num || num > max_num {
                min_num = min(min_num, num);
                max_num = max(max_num, num);
            }
        }

New time: 760 microseconds. Speedup: 3.6x. Relative to the initial LLM-generated Rust version, we are about 55x faster, and about 17x faster than the “not-stupid digit_sum” version.

Note that this is a Project Euler-style problem. It admits additional mathematical optimizations that I deliberately ignored, because I like the shape of the problem as a whole: given several ways to eliminate candidates, some cheap and some more expensive, find the right order in which to apply them.

I think this reveals how limited LLMs still are at algorithmically optimizing a problem like this, compared with mechanical improvements such as “add some parallelism” or “use numpy”, where they are pretty decent in Python and a bit less decent in Rust. It surprised me that, unless I prompted it more specifically, GPT-4o never suggested using a faster rand implementation, because that would have been an effective, easy step.

I suspect this is part of why the problem turns out to be unexpectedly interesting from a parallelization perspective: since the digit_sum function is one of the biggest costs (once you fix rand), a naive parallelization does far more work than a version of the code that uses the fast check first. With 1M numbers, the “fast check” code calls digit_sum only about 5,000 times, while a naively parallel version calls it 1M times. The naively parallel version is faster than the non-parallel fast-check code, but at a high cost in total work (and therefore energy and machine cost). That tradeoff makes this problem perhaps a more interesting LLM test case than its author intended, but it’s a real kind of issue: many parallelization approaches end up trading work for latency, and it takes a thoughtful programmer to figure out how to navigate that.
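If you want to check that call-count claim yourself, here is a rough instrumentation sketch (my addition, not from the post’s repository). It assumes the fastrand crate and the digit_sum function above, and simply counts how many numbers survive the fast check and reach digit_sum:

fn main() {
    let mut min_num = i32::MAX;
    let mut max_num = i32::MIN;
    let mut digit_sum_calls: u64 = 0;

    for _ in 0..1_000_000 {
        let n = fastrand::i32(1..=100_000);
        // Only numbers that could improve the current min or max reach digit_sum.
        if n < min_num || n > max_num {
            digit_sum_calls += 1;
            if digit_sum(n) == 30 {
                min_num = min_num.min(n);
                max_num = max_num.max(n);
            }
        }
    }
    println!("digit_sum called {} times out of 1,000,000", digit_sum_calls);
}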

The code written for this post can be found HERE.

Unfortunately, of course, if you’re reading this after 2025, the LLMs have probably been trained on this blog post and will apply these optimizations for you if you try to reproduce it.

