Apply here. Applications are rolling, and there’s no set deadline.

About ARC Evals (now METR)

METR does empirical research to determine whether frontier AI models pose a significant threat to humanity. It’s robustly good for civilization to have a clear understanding of what types of danger AI systems pose, and to know how high the risk is. You can learn more about our goals from Beth’s talk.

Some highlights of our work so far:

Establishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability of a model to independently copy itself to different servers, obtain more GPUs, etc.) should be tested for. For example, labs pledged to evaluate for this capability as part of the White House commitments.
Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and within government.
Inspiring lab evaluation efforts: Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.
Early commitments from labs: Anthropic credited us for their recent Responsible Scaling Policy (RSP), and OpenAI recently committed to releasing a Risk-Informed Development Policy (RDP). These fit under the category of “evals-based governance”, wherein AI labs can commit to things like, “If we hit capability threshold X, we won’t train a larger model until we’ve hit safety threshold Y”.

We’ve been mentioned by the UK government, Time Magazine, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.

About the role

The engineering lead at METR is in charge of our internal platform for evaluating model capabilities (concretely: infrastructure to run a hundred agents in parallel against different tasks inside isolated virtual machines), and manages the engineers who expand this tooling.

This platform is critical to our success: as increasingly powerful models are created, we’ll need to keep pace by building tooling that lets us evaluate them. As models gain new modalities and capabilities, the tooling needed to test them will shift as well.

The work is technically fascinating, and you get to be on the cutting edge of what models can do. If you’re up for it, you may also liaise with our partners (labs, the US and UK governments, etc.) as they embark on their own evaluation efforts. There’s room here to help set the standards for tooling that enables evaluations overall.

Compensation is about $250k–$400k, depending on the candidate.

What we’re looking for

This role is best suited for a generalist who enjoys wearing many hats. Former founders could be a good fit, as could engineering managers who enjoy talking to users, or strong ICs or tech leads with at least a bit of management experience.

Engineering

Requirements

Strong technical design — avoids inessential complexity, accurately anticipates where corners can be cut and where they can’t