Evaluations in ell #285
Comments
Potentially we can add the following as a nicety for dataset construction:
from collections import UserDict
from typing import Any, Dict, List, Union

# Use this as a guardrail for users specifying their own datasets: they have to be explicit about the input.
InputType = Union[Dict[str, Any], List[Any], None]

class Datapoint(UserDict):
    def __init__(self, input: InputType, **rest):
        assert isinstance(input, (dict, list)) or input is None, (
            f"Input must be a dict, list, or None, got {type(input)}"
        )
        # Store the input alongside any extra per-datapoint fields.
        super().__init__(input=input, **rest)

    @property
    def input(self) -> InputType:
        return self.data["input"]

dataset: List[Datapoint] = [
    Datapoint(input={"text": "Hello world"}, random_heuristic="Hello world", hf_score=0.5),
    Datapoint(input=[1, 2, 3, 4, 5], random_heuristic="List of numbers", hf_score=0.5),
    Datapoint(input=None, random_heuristic="No input", hf_score=0.5),
]
# But datasets, on the other hand, will accept arbitrary data types for construction.
# XXX: Need to figure out if we should actually build a basic Dataset class with validation or just leave it as a list of dicts.
import json
import pickle
from typing import Any, Dict, Iterable, List

import pandas as pd

class Dataset:
    def __init__(self, data: Iterable[Dict[str, Any]]):
        # Materialize the iterable so indexing and len() work.
        self.data: List[Dict[str, Any]] = list(data)
        # XXX: Validation.
        # If we do this now we can potentially force the user to serialize their data in the data store, etc.
        self.validate()

    def __iter__(self):
        return iter(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

    def validate(self):
        for datapoint in self.data:
            if not isinstance(datapoint, dict):
                raise ValueError(f"Each datapoint must be a dictionary, got {type(datapoint)}")
            if "input" not in datapoint:
                raise ValueError("Each datapoint must have an 'input' key", datapoint)
            if not isinstance(datapoint["input"], (list, dict)):
                raise ValueError(f"The 'input' value must be a list or dictionary, got {type(datapoint['input'])}", datapoint)

    @classmethod
    def from_pd(cls, dataframe: pd.DataFrame, input_column: str):
        # Wrap each value of the chosen column so every row satisfies the 'input' schema.
        return cls({"input": value} for value in dataframe[input_column].to_list())

    @classmethod
    def from_jsonl(cls, file_path: str):
        with open(file_path, "r") as file:
            # JSONL: one JSON object per line.
            return cls(json.loads(line) for line in file if line.strip())

    @classmethod
    def from_pickle(cls, file_path: str):
        with open(file_path, "rb") as file:
            return cls(pickle.load(file))
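A minimal usage sketch of the two construction paths above. This assumes the classes as sketched here; the "question" column and its values are made up for illustration.

df = pd.DataFrame({
    "question": [{"text": "Hello world"}, {"text": "Goodbye"}],  # hypothetical column
    "hf_score": [0.5, 0.7],
})
ds_from_df = Dataset.from_pd(df, input_column="question")

ds_inline = Dataset([
    {"input": {"text": "Hello world"}, "random_heuristic": "Hello world", "hf_score": 0.5},
])

print(len(ds_from_df), ds_inline[0]["input"])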
We need to solve this antipattern (what if we just want to eval something with a criterion and no dataset?):

import ell

dataset = [
    {
        "input": [],
    }
] * 10

@ell.simple(model="gpt-4o")
def write_a_bad_poem():
    return "Write a really poorly written poem."

@ell.simple(model="gpt-4o")
def write_a_good_poem():
    return "Write a really well written poem."

@ell.simple(model="gpt-4o", temperature=0.1)
def is_good_poem(poem: str):
    """Include either yes or no in your response at most once but not both."""
    return f"Is this a good poem yes/no? {poem}"

def score(datapoint, output):
    # Judge the generated poem with the LMP above rather than inspecting the poem text directly.
    return float("yes" in is_good_poem(output).lower())

eval = ell.evaluation.Evaluation(name="poem_eval", dataset=dataset, criteria={"is_good": score})

print("EVALUATING BAD POEM")
result = eval.run(write_a_bad_poem, n_workers=4)

print("EVALUATING GOOD POEM")
result = eval.run(write_a_good_poem, n_workers=4)
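One possible direction, purely hypothetical and not an existing ell API: let Evaluation take a sample count instead of a dataset when the criteria don't need inputs, so users aren't forced to fabricate a list of empty datapoints.

# Hypothetical sketch only; `n_evals` is not an existing ell parameter.
eval = ell.evaluation.Evaluation(
    name="poem_eval",
    n_evals=10,                      # run the LMP 10 times with no inputs
    criteria={"is_good": score},
)
result = eval.run(write_a_bad_poem, n_workers=4)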
Add a migration using Alembic.
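A rough sketch of what that migration could look like. The table and column names are assumptions taken from the ER diagram further down, not the actual migration:

# alembic revision -m "add evaluation tables"  -- hypothetical migration skeleton
import sqlalchemy as sa
from alembic import op

revision = "xxxx"        # placeholder revision id
down_revision = None     # placeholder

def upgrade():
    op.create_table(
        "evaluation",
        sa.Column("id", sa.String, primary_key=True),
        sa.Column("name", sa.String, nullable=False),
        sa.Column("dataset_hash", sa.String, nullable=False),
    )

def downgrade():
    op.drop_table("evaluation")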
Okay. So we have a really good spec now, at least for invocation labels and invocation labelers, allowing us to define arbitrary rubrics using JSON schemas. The big problem now is dataset serialization and, furthermore, storing invocations with param objects. We have a larger problem where, if I run multiple invocations on the same parameters (for example, on the same datapoints), I will duplicate the dataset in my store once for every metric there is. This seems like it'll end up being really, really slow, so someone (probably me) needs to come through here and redesign the data model so that this is a lot more efficient.

The other question is whether, in general, we should have a dataset, and this matters if we want to re-serialize the evaluation at some later point in time rather than just store the version of it. For example, if I specify an evaluation, I probably actually do want to look at its data. So the picture in ell studio would be a list of rows that are part of the evaluation.

As part of the migration, we could build a new parameter, essentially an input contents for an invocation, and have that hashed. These objects would be stored separately and efficiently. The dataset would then be just a list of these blobs, i.e. input blobs corresponding with invocations. This is actually the true way to serialize it, because we have a bunch of inputs we're going to redundantly use every single time with variable outputs. Alternatively, we could keep the invocation contents as the true parameters.

For the evaluation view, we definitely want to have the dataset and then the different evaluation runs. So there would be, I guess, three tabs: dataset, invocations, and runs/metrics. Clearly, we need first-class support for the dataset, and the dataset schema itself will have an input and then a bunch of other objects on it. These input objects would exist per datapoint, so one per row in the dataset.

Another issue is that there's basically a dataset class and a datapoint class; the dataset class contains a list of datapoints. If we wanted to be thorough about this, we would re-engineer the blob store to separate out these unique objects. And then we could really go down the Weights & Biases route: we would define a dataset object, just like Weave does, and that dataset would automatically be added to the store, which is not something you necessarily want to do. But inherently it'll be added to the store because we are using the dataset in the eval.

So this is tough. I could just say, "Hey, here's the eval, there's a dataset object, it's this size," and if you wanted to actually look into the dataset, you'd need to open it with Python and so on, except for when we actually run the eval for the first time. Because if we actually run the eval, then it'll get committed to the database. And that's the philosophy: if we don't run the eval, it doesn't change. But what if, for example, I want to change a metric? I don't want to rerun all the completions again. So that's kind of a flaw with this design as well.
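A minimal sketch of the hashed input-contents idea. The names and the SQLModel layer are assumptions for illustration, not the current ell schema: inputs are stored once, keyed by a content hash, and a dataset is just an ordered list of those hashes.

import hashlib
import json

from sqlmodel import Field, SQLModel

class InputBlob(SQLModel, table=True):
    # Content-addressed input: the hash of the serialized input is the primary key,
    # so identical inputs are stored exactly once no matter how many runs reuse them.
    input_hash: str = Field(primary_key=True)
    content: str  # JSON-serialized input

class DatasetRow(SQLModel, table=True):
    dataset_hash: str = Field(primary_key=True)
    row_index: int = Field(primary_key=True)
    input_hash: str = Field(foreign_key="inputblob.input_hash")

def hash_input(input_obj) -> str:
    # Canonical JSON so dicts with the same keys hash identically.
    return hashlib.sha256(json.dumps(input_obj, sort_keys=True).encode()).hexdigest()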
Overall thinking here is that we need to redesign our store spec to have better support for redundant entities.
What about multimodal feedback?
erDiagram
    SerializedLMP ||--o{ Invocation : "has"
    SerializedLMP ||--o{ SerializedLMPUses : "uses/used_by"
    SerializedLMP ||--o{ EvaluationRun : "evaluated_in"
    Invocation ||--|| InvocationContents : "has"
    Invocation ||--o{ InvocationTrace : "consumes/consumed_by"
    Invocation ||--o{ EvaluationResultDatapoint : "labeled_in"
    Evaluation ||--|{ EvaluationLabeler : "has"
    Evaluation ||--|{ EvaluationRun : "has"
    EvaluationRun ||--|{ EvaluationResultDatapoint : "has"
    EvaluationRun ||--|{ EvaluationRunLabelerSummary : "has"
    EvaluationLabeler ||--|{ EvaluationLabel : "has"
    EvaluationLabeler ||--|{ EvaluationRunLabelerSummary : "has"
    EvaluationResultDatapoint ||--|{ EvaluationLabel : "has"

    SerializedLMP {
        string lmp_id PK
        string name
        string source
        LMPType lmp_type
    }
    Invocation {
        string id PK
        string lmp_id FK
    }
    InvocationContents {
        string invocation_id PK,FK
    }
    Evaluation {
        string id PK
        string name
        string dataset_hash
    }
    EvaluationLabeler {
        int id PK
        string name
        EvaluationLabelerType type
    }
    EvaluationRun {
        int id PK
        int evaluation_id FK
        string evaluated_lmp_id FK
    }
    EvaluationResultDatapoint {
        int id PK
        string invocation_being_labeled_id FK
        string evaluation_run_id FK
    }
    EvaluationLabel {
        int labeled_datapoint_id PK,FK
        string labeler_id PK,FK
    }
    EvaluationRunLabelerSummary {
        string evaluation_run_id PK,FK
        string evaluation_labeler_id PK,FK
    }
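For reference, a rough SQLModel sketch of two of the tables above. This is an illustration, not the actual ell models; the diagram's id types are a little inconsistent (e.g., Evaluation.id is a string while EvaluationRun.evaluation_id is an int), so the sketch picks one.

from typing import Optional

from sqlmodel import Field, SQLModel, create_engine

class Evaluation(SQLModel, table=True):
    id: str = Field(primary_key=True)
    name: str
    dataset_hash: str

class EvaluationRun(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    evaluation_id: str = Field(foreign_key="evaluation.id")
    # In the full schema this would be a foreign key to serializedlmp.lmp_id;
    # kept as a plain string here so the sketch stands alone.
    evaluated_lmp_id: str

# Create the tables in an in-memory SQLite database just to show the sketch is well-formed.
engine = create_engine("sqlite://")
SQLModel.metadata.create_all(engine)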
For the no
UX is getting there.
This is a major feature release.
Spec: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/evalspec.md
Ramblings: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/thoughtsonevals.md
Example: https://github.com/MadcowD/ell/blob/6afad20bc58a99e9f3fe0a76ff6b7642471d63a7/examples/eval.py
The big ones:
UX/IMPL TODOs
https://testdriven.io/blog/fastapi-sqlmodel/
Next Step TODOs
(Misc todos)
Bugs: