
Classic ML to take on dumb LLM judges

In past posts I used a local LLM to select which of two products is more relevant for a search query (see this GitHub repo). Using human labels from an open e-commerce search dataset (WANDS from Wayfair) as a baseline, I measured whether the LLM’s preference for a product matched the human raters’. If I can do this, I can use my laptop as a search relevance judge. That lets me find quality issues and iterate without an expensive OpenAI bill.

My goal is not so much to replace human labels, but to be reliable at flagging what looks bad. The judge doesn’t necessarily need to recall every problem; it needs to be right when it does speak up.

In this post I’ll talk about how I combined many dumb LLM decisions about individual product attributes into one smarter decision, and compare that to a single prompt over all the product metadata.

As an example of these dumb decisions, my laptop’s search relevance judgments come from prompts like:

Which of these product names (if either) is more relevant to the furniture e-commerce search query:

Query: entrance table

Product LHS name: aleah coffee table
    (remaining product attributes omitted)
Or
Product RHS name:  marta coffee table
    (remaining product attributes omitted)
Or
Neither / Need more product attributes

Only respond 'LHS' or 'RHS' if you are confident in your decision

RESPONSE:
Neither

Collecting 1000 agent preferences, I compare them with the human preferences to see whether the agent is any good at predicting relevance. Once I gain confidence, I can turn it loose on product pairs for a query and trust the results. (Assuming humans are even good at this themselves!)
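To make that concrete, here’s a rough sketch of the collection loop. The call_local_llm wrapper and the labeled-pairs file layout are hypothetical stand-ins, not actual code from the repo:

import pandas as pd

def judge_pair(query, lhs_name, rhs_name):
    """Ask the local LLM which product name (if either) is more relevant.
    Returns 'LHS', 'RHS', or 'Neither'."""
    prompt = (
        "Which of these product names (if either) is more relevant to the "
        f"furniture e-commerce search query:\n\nQuery: {query}\n\n"
        f"Product LHS name: {lhs_name}\nOr\nProduct RHS name: {rhs_name}\n"
        "Or\nNeither / Need more product attributes\n\n"
        "Only respond 'LHS' or 'RHS' if you are confident in your decision"
    )
    return call_local_llm(prompt).strip()  # call_local_llm: hypothetical wrapper around the local model

pairs = pd.read_csv("labeled_pairs.csv")   # hypothetical file: query, lhs_name, rhs_name, human_preference
pairs["agent_preference"] = [
    judge_pair(q, lhs, rhs)
    for q, lhs, rhs in zip(pairs["query"], pairs["lhs_name"], pairs["rhs_name"])
]
print((pairs["agent_preference"] == pairs["human_preference"]).mean())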

In previous posts, I looked at a bunch of different variants of the above prompt:

  • Forcing a decision between LHS / RHS, or allowing “Neither” / “I don’t know” (see the first post)
  • A “double check” pass: ask both orderings, and mark “Neither” if the LHS preference isn’t repeated when the products are swapped (see the second post and the sketch after this list)
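Here’s a minimal sketch of that double-check pass, assuming a hypothetical judge_one_way helper that wraps a single LLM call:

def judge_double_checked(query, product_a, product_b):
    """Ask in both orderings; only keep a preference when the two answers agree.
    judge_one_way is a hypothetical helper returning 'LHS', 'RHS', or 'Neither'."""
    first = judge_one_way(query, lhs=product_a, rhs=product_b)
    second = judge_one_way(query, lhs=product_b, rhs=product_a)

    # Translate the second answer back into the first ordering before comparing
    flipped = {"LHS": "RHS", "RHS": "LHS", "Neither": "Neither"}[second]

    if first == flipped and first != "Neither":
        return first      # consistent preference for one product
    return "Neither"      # inconsistent or undecided, so abstain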

I made prompts for 4 product characteristics:

  • Product name (as above)
  • Product category hierarchy (i.e. which is more relevant: “Outdoor Furniture > Chairs > ... > Adirondack Chairs” vs “Outdoor Furniture > Accessories > ...”)
  • Product classification (i.e. outdoor furniture vs. living room chairs)
  • Product description (see below)

As an example of these variants, here’s the description prompt for the two products above:

Which of these product descriptions (if either) is more relevant to the furniture e-commerce search query:

Query: entrance table

Product LHS description: This coffee table is great for your entrance, use it to put in your doorway...
    (remaining product attributes omitted)
Or
Product RHS description:  You'll love this table from lazy boy. It goes in your living room. And you'll find ...
    (remaining product attributes omitted)
Or
Neither / Need more product attributes

Only respond 'LHS' or 'RHS' if you are confident in your decision

RESPONSE:
LHS

And as an example of the double-check variant, we can confirm we get the same preference when swapping LHS / RHS:

Which of these product descriptions (if either) is more relevant to the furniture e-commerce search query:

Query: entrance table

Product LHS description:  You'll love this table from lazy boy. It goes in your living room. And you'll find ...
    (remaining product attributes omitted)
Or
Product RHS description: This coffee table is great for your entrance, use it to put in your doorway...
    (remaining product attributes omitted)
Or
Neither / Need more product attributes

Only respond 'LHS' or 'RHS' if you are confident in your decision

RESPONSE:
RHS

Now we can say the preference is consistent. Both times the LLM says the product described as a coffee table for your entrance is the better match for “entrance table”.

All of these permutations (field, double check, allow “Neither”) are the experiments I run, with results below. When the agent can opt out (or is inconsistent on the double check), we improve precision, but lose recall.

At the bottom right of the product name table below, double checking and letting the LLM say “Neither” yields a preference for only 11.9% of 1000 product pairs. But within that 11.9%, we correctly predict the human preference 90.76% of the time. That’s a big step up from the top left: getting a decision on every pair, but with accuracy down at 75.08%.
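For reference, each cell below is computed roughly like this (a sketch, assuming a DataFrame prefs with the agent and human preferences):

# prefs: DataFrame with 'agent_preference' ('LHS' / 'RHS' / 'Neither') and 'human_preference'
decided = prefs[prefs["agent_preference"] != "Neither"]
precision = (decided["agent_preference"] == decided["human_preference"]).mean()
recall = len(decided) / len(prefs)    # share of the 1000 pairs where the judge made a call
print(f"{precision:.2%} / {recall:.2%}")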

All permutations of field x allow-neither x double-check are listed below (precision / recall):

Product name

                     Don't double check     Double check
Force a decision     75.08% / 100%          87.99% / 65%
Allow "Neither"      85.38% / 17.10%        90.76% / 11.90%

Product Classification

                     Don't double check     Double check
Force a decision     70.5% / 100%           87.76% / 58.0%
Allow "Neither"      87.01% / 17.70%        84.47% / 10.3%

Category Hierarchy

                     Don't double check     Double check
Force a decision     74.6% / 100%           86.1% / 69.70%
Allow "Neither"      85.71% / 18.20%        89.91% / 10.8%

Product Description

                     Don't double check     Double check
Force a decision     70.31% / 98.70%        76.58% / 72.60%
Allow "Neither"      79.21% / 10.10%        83.02% / 5.3%

Bringing the product together

Finally, with everything we know, we can test an über-prompt with all 4 fields listed:

Which of these product descriptions (if either) is more relevant to the furniture e-commerce search query:

Query: entrance table
Product LHS name: aleah coffee table
Product LHS description:  You'll love this table from lazy boy. It goes in your living room. And you'll find ...
 ...
Or
Product RHS name: marta coffee table
Product RHS description: This coffee table is great for your entrance, use it to put in your doorway...
 ...
Or
Neither / Need more product attributes

Only respond 'LHS' or 'RHS' if you are confident in your decision

RESPONSE:
RHS

If we apply all the variants to the über-prompt, we get decent results in the top right (force a decision / double check):

                     Don't double check     Double check
Force a decision     78.10% / 100.0%        91.72% / 65.2%
Allow "Neither"      91.89% / 11.10%        88.90% / 2.7%

Going further: decision trees

Instead of an über-prompt, another strategy is to combine the individual decisions into one overall decision.

One way: for all the variants above, we gather the set of agent predictions for each strategy:

Query            Product LHS           Product RHS           Name      Desc    Category   Class.    Human Label
entrance table   Aleah Coffee Table    Marta Coffee Table    Neither   LHS     LHS        RHS       LHS

Combining these agents’ individual dumb decisions into one larger decision seems promising…

The first thought is that we should use an ensemble! The more of these not-very-precise agents that vote for a side, the more likely that side wins. If they mostly point to LHS, just select LHS. Easy!
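Something like this naive majority vote (a sketch, not code from the repo):

from collections import Counter

def majority_vote(preferences):
    """preferences: per-field agent decisions, e.g. ['Neither', 'LHS', 'LHS', 'RHS']"""
    votes = Counter(p for p in preferences if p != "Neither")
    if not votes:
        return "Neither"
    winner, count = votes.most_common(1)[0]
    if list(votes.values()).count(count) > 1:   # tie between LHS and RHS
        return "Neither"
    return winner

majority_vote(["Neither", "LHS", "LHS", "RHS"])   # -> 'LHS'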

Or … or … hear me out.

Look at the data above and imagine repeating it over 1000s of query-product pairs, like:

Query            Product LHS           Product RHS           Name      Desc    Category   Class.    Human Label
entrance table   Aleah Coffee Table    Marta Coffee Table    Neither   LHS     LHS        RHS       LHS
new bed          Lazy Boy Recliner     Suede Love Seat       RHS       LHS     RHS        Neither   RHS

A truly astute reader will notice something.

This is a machine learning problem!

We have a bunch of features (the individual predictions for each attribute) and a label we want to predict (the human preference).

The table above becomes a classification problem!

We “learn” the right ensemble, perhaps with a decision tree or another simple classifier.

First step: build a big table like the one above with 1000s of LLM evaluations over the individual fields and variants (double check on or off, allow “Neither” or not). So essentially we have:

Query            Product LHS           Product RHS           Name      Name (double check)   Name (allow neither)   Name (double check / allow neither)   Desc    ...   Human Label
entrance table   Aleah Coffee Table    Marta Coffee Table    Neither   LHS                   LHS                    RHS                                   RHS     ...   LHS

To do this, I collect every type of LLM evaluation in one script:

#!/bin/bash

N="$1"

poetry run python -m local_llm_judge.main --verbose --eval-fn category_allow_neither --N $N
poetry run python -m local_llm_judge.main --verbose --eval-fn category --N $N
poetry run python -m local_llm_judge.main --verbose --eval-fn category_allow_neither --check-both-ways --N $N

...
poetry run python -m local_llm_judge.main --verbose --eval-fn name --N $N
...

I exercised my laptop’s fan and paint-peeling heat capabilities by running ./collect.sh 7000 to get 7000 examples (the first 1000 are the same test set used above; the next 6000 I’ll use for training).

Each run above outputs the query, the LHS and RHS products, and the human and agent preferences. The agent preferences are our features.
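Under the hood, those per-run outputs need to be joined into one wide table of features plus the human label. Roughly like this; the exact .pkl schema here is an assumption, not the repo’s actual format:

import pandas as pd

# Assumption: each .pkl holds a DataFrame keyed by (query, product_id_lhs, product_id_rhs)
# with that run's agent preference.
keys = ["query", "product_id_lhs", "product_id_rhs"]
runs = {
    "both_ways_name": pd.read_pickle("data/both_ways_name.pkl"),
    "both_ways_category": pd.read_pickle("data/both_ways_category.pkl"),
    # ... one entry per eval-fn / variant collected above
}

features = None
for name, df in runs.items():
    df = df[keys + ["agent_preference", "human_preference"]].rename(columns={"agent_preference": name})
    if features is None:
        features = df
    else:
        features = features.merge(df.drop(columns="human_preference"), on=keys)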

Finally, we can point a training script at the outputs produced by each LLM run from the script above:

$ poetry run python -m local_llm_judge.train --feature_names data/both_ways_category.pkl data/both_ways_name.pkl  data/both_ways_desc.pkl data/both_ways_classs.pkl data/both_ways_category_allow_neither.pkl data/both_ways_name_allow_neither.pkl data/both_ways_desc_allow_neither.pkl data/both_ways_class_allow_neither.pkl

The training script pairs our training features (the agent’s preferences from the various prompts) with the human preference we want to predict.
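Classifiers want numbers, not strings, so somewhere along the way the categorical preferences become numeric features. A sketch of one possible encoding; the -1 / 0 / +1 convention is chosen to match the tree dump shown later, and the column names and the features table are illustrative:

ENCODING = {"LHS": -1, "Neither": 0, "RHS": 1}

for col in ["both_ways_name", "both_ways_desc", "both_ways_category", "both_ways_class"]:
    features[col] = features[col].map(ENCODING)
features["human_preference"] = features["human_preference"].map(ENCODING)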

As a basic spike, I’m just using a simple scikit-learn decision tree to try predicting human_preference from the agent preferences:

from sklearn.tree import DecisionTreeClassifier

def train_tree(train, test):
    clf = DecisionTreeClassifier()
    clf.fit(train.drop(columns=['query', 'product_id_lhs', 'product_id_rhs', 'human_preference']),
            train['human_preference'])
    ...

And finally, when using the tree, we make a precision / recall tradeoff. We apply a threshold to the prediction probability before accepting a decision; otherwise we label the pair “Neither”:

def predict_tree(clf, test, threshold=0.9):
    """Only assign LHS or RHS if the probability is above the threshold"""
    probas = clf.predict_proba(test.drop(columns=['query', 'product_id_lhs', 'product_id_rhs', 'human_preference']))
    definitely_lhs = probas[:, 0] > threshold
    definitely_rhs = probas[:, 1] > threshold

The script tries each combination of features, reporting precision followed by the proportion of pairs where a decision is made:

('both_ways_desc_allow_neither', 'both_ways_class_allow_neither') 1.0 0.013
('both_ways_name', 'both_ways_class_allow_neither') 0.9861111111111112 0.072
('both_ways_category', 'both_ways_name', 'both_ways_classs', 'both_ways_name_allow_neither', 'both_ways_class_allow_neither') 0.9673366834170855 0.398
('both_ways_category', 'both_ways_name', 'both_ways_classs', 'both_ways_class_allow_neither') 0.9668508287292817 0.362
('both_ways_desc', 'both_ways_class_allow_neither') 0.9666666666666667 0.06
('both_ways_desc', 'both_ways_desc_allow_neither', 'both_ways_class_allow_neither') 0.9666666666666667 0.06
('both_ways_name', 'both_ways_desc_allow_neither', 'both_ways_class_allow_neither') 0.9666666666666667 0.09
('both_ways_category', 'both_ways_name', 'both_ways_classs') 0.9665738161559888 0.359
('both_ways_category', 'both_ways_name', 'both_ways_desc', 'both_ways_classs', 'both_ways_category_allow_neither') 0.9659367396593674 0.411
('both_ways_category', 'both_ways_name', 'both_ways_classs', 'both_ways_category_allow_neither', 'both_ways_name_allow_neither') 0.9654320987654321 0.405
...

There are many caveats to this data: how stable is it? What happens if we do some cross validation, or evaluate on additional test data? Treat this as an interesting data point in a science lab notebook, not a guarantee that it will work.

The data, though, is quite interesting. One feature combination appears to classify every human preference it attempts correctly, but it only makes a call for 1.3% of the data. Another gets 96.5% accuracy on 40.5% of the data.

Trees also let us see dependencies between features when predicting relevance. We can use the tree as an exploratory tool to guide search solutions. We can inspect it with print(export_text(clf, feature_names=feature_names)) (export_text comes from sklearn.tree) to see how the tree thinks about the data’s dependencies, helping you plan, prioritize, and strategize what matters most.

|--- both_ways_category <= 0.50           # category preference is either LHS (-1) or neither (0)
|   |--- both_ways_category <= -0.50          # category preference is LHS (-1)
|   |   |--- both_ways_name <= 0.50               # Name preference is LHS (-1) or neither (0)
|   |   |   |--- class: -1                          # Then prediction is "LHS"
...

According to this tree, a search solution may want to focus on category before considering the product name, for example.

Finally, this is just the beginning. A dumb tree classifier may not be the best choice. Maybe we want some sort of gradient boosting instead of a single tree? Given how strong the trees already are, there may well be a solution that does even better.
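For example, swapping in scikit-learn’s gradient boosting is a small change on top of the same features (a sketch, reusing the variable names from train_tree above):

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(train.drop(columns=['query', 'product_id_lhs', 'product_id_rhs', 'human_preference']),
        train['human_preference'])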

But the big takeaway is that we can take many dumb, isolated, basic agent decisions and combine their outputs into something better using traditional ML. We can also use this to understand the reasoning behind human labels, helping inform how we build search solutions. Local LLMs can be feature generators for ML stages, probably in more places than people realize! We keep their decisions dumb, simple, and tractable, and bring them together at the end with fast, fun, old-school ML that gets the job done.


https://softwaredoug.com//assets/media/2025/unix-system.jpg

2025-01-22 12:25:00
