AI Benchmark Engineer | Native Language Specialist - Japanese - Remote

LILT · Japan

Contract · Junior · Posted 4 months ago

llm computer-graphics generative-ai payments

About this role

ABOUT THE OPPORTUNITY

We are building a rigorous, verifiable evaluation suite of Terminal-Bench tasks designed to test the limits of large language models on multilingual software challenges. Our goal is to measure multilingual robustness across prompt language effects, non-English data processing, and complex locale/encoding edge cases in terminal workflows.

We are seeking experienced native-speaking software engineers to design, build, and validate these benchmarks. You will create high-signal, high-quality tasks that genuinely test a model's ability to handle multilingual environments without relying on English translation crutches.

Note: this is a remote, freelance opportunity.

WHAT YOU’LL DELIVER

- Task Engineering: Evaluate coding agents.
- Asset Creation: Build realistic task environments using datasets and files in your native language. Crucially, these assets must remain in the target language to genuinely measure multilingual handling.
- Prompting & Translation: Find the failure points where AI models break down when working in your native language.
- Implementation & Verification: Support the development of robust solutions (reference implementations) and write highly reliable, deterministic verifier scripts (using rubric-based judging only when strictly necessary).
- Calibration & Execution: Analyze execution logs and calibrate task difficulty (Easy to Very Hard) using standard Terminal-Bench run configurations against various model tiers (Haiku, Sonnet, Opus).
- Quality Assurance: Participate in a rigorous, 4-layer human quality control process (creation, human review, calibration review, and audit) alongside automated LLM-based checks to ensure fairness, grammatical accuracy, and benchmark integrity.

QUALIFICATIONS

- Experience: 1+ years in software or prompt engineering.
- Background: Proven track record at leading technology companies and/or graduation from top-tier engineering universities.
- Language: Native or near-native fluency in the target language, with a deep understanding of its grammar, register, and phrasing rules. High English proficiency.
- Technical Stack: Strong proficiency in Python, standard shell scripting, and data processing.
- Workflow: Extensive experience with Terminal/CLI-based development workflows and a working familiarity with coding agents.
- Domain Expertise: Deep technical understanding of multilingual text processing pitfalls, including:
  - Encoding/decoding robustness and Unicode normalization.
  - Locale-dependent conventions (collation, casing, non-Gregorian dates).
  - Text I/O, toolchain interoperability, and safe string operations.
  - (For specific languages) Bidirectional/RTL handling, font fallbacks, and rendering/typography in UI or artifacts.

WHY COLLABORATE WITH LILT?

- Your schedule, your rules. As an independent contractor, work when you want, as much or as little as you want. No fixed hours, no check-ins, no micromanaging.
- Get paid quickly and fairly. We respect your time and your expertise. Competitive rates, prompt payments, no chasing invoices.
- Work on projects that actually matter. Contribute to cutting-edge AI and language technology that is shaping how humans and machines communicate.
- Be part of something bigger. Join a global community of linguists, subject matter experts, and language professionals who are advancing human knowledge together.
- Grow without limits. As a Lilt contractor you get access to diverse, innovative projects that expand your portfolio and sharpen your skills across industries and domains.
- Have fun doing what you love. Bring your language skills to life on projects that are as interesting as they are impactful.

WHAT TO CONSIDER BEFORE APPLYING

- Not ideal as a full-time job or primary income source. Work availability fluctuates with project demand, making this better suited as a supplemental income stream. As a 1099 contractor, you won't receive benefits such as health insurance, paid time off, or retirement contributions, and hours are not guaranteed.
- Requires reliable availability and commitment. Once you accept a task, we expect quality work and on-time delivery. Most tasks require a minimum of 2 hours per day or 10 hours per week. If your schedule is unpredictable, this may not be the right fit.
- Geographic restrictions may apply. We cannot engage contractors in regions subject to international embargo or sanctions.

As a 1099 contractor, you are solely responsible for your own tax obligations. We recommend consulting a tax professional before engaging.

HOW TO JOIN OUR EXPERT COMMUNITY

1 - Submit your applica