Senior Software Engineer — AI Evaluation & Benchmarks (Python)

Remote · USA · Full-time

Before Applying

This role is open to contractors in accepted locations only. Please confirm your country is on the list of accepted countries and locations before applying; we're unable to process applications from unlisted locations.

For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

What You'll Be Doing

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

- Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and production-quality code
- Build and maintain scalable data pipelines for evaluation workflows
- Analyze model-generated code for correctness, reliability, and edge-case failures
- Construct structured evaluation scenarios across large repos and multi-language environments
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do, and that shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.

What You'll Need

- 4+ years of professional software engineering experience (non-negotiable)
- Expert Python: clean, performant, well-tested code
- Hands-on experience working in large, complex codebases
- Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
- Strong command of Git and modern development workflows
- Track record at a high-growth tech company or top-tier software organization
- Strong written English communication

Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

Nice to have

- Senior or Lead-level profile with a history of technical ownership
- Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)
- Proficiency in additional languages: JavaScript, Go, C++, or others
- CI/CD experience and writing robust unit tests (pytest, Mocha, JUnit)
- Background in security engineering or significant open-source contributions
- Familiarity with AI/ML evaluation methodologies or model benchmarking

Logistics

- Location: Fully remote; work from anywhere on the accepted locations list
- Compensation: $80–$100/hr based on location and seniority
- Contract length: 3 months, with potential for extension
- Hours: Full-time availability preferred; hours vary by project and are not guaranteed week to week
- Engagement: 1099 independent contractor
- Payment: Weekly via PayPal or Stripe

⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.
