Software Engineer, Inference

Pika · Palo Alto, CA

Full-time · Mid-level · Posted 4 days ago

mlops gpu llm deep-learning distributed-systems inference

About this role

We are seeking Inference Engineers to accelerate the performance of Pika’s AI-driven products. In this highly technical role, you will operate at the intersection of cutting-edge inference acceleration, GPU parallelism, advanced model deployment, and video generation technologies. Your expertise will drive significant improvements to model speed and efficiency, ensuring our creative AI systems deliver industry-leading user experiences at scale.

You will design and optimize inference pipelines, implement state-of-the-art acceleration techniques, and work closely with researchers and engineers across the team to push the boundaries of what’s possible in real-time AI deployment. Your efforts will play a foundational role in powering the next generation of Pika’s video and language models.

WHAT YOU’LL DO

- Accelerate Inference: Lead and implement advanced inference acceleration techniques, including attention optimization and quantization for efficient model serving.
- Maximize GPU Parallelism: Engineer and optimize GPU strategies across tensor, sequence, and pipeline parallelism (TP, SP, PP) for maximal efficiency and scalability.
- Programming for Performance: Develop and optimize high-performance computing kernels and distributed workloads using CUDA and NCCL.
- Advance AI Deployment: Collaborate with research and engineering teams to bring state-of-the-art videogen and large language models into production.
- Improve Training Efficiency: (Bonus) Contribute to improvements in model training speed, stability, and resource utilization as part of our deployment lifecycle.
- Technical Excellence: Drive rigorous code reviews, participate in technical discussions, and mentor fellow engineers on best practices in inference and GPU programming.

WHAT WE’RE LOOKING FOR

- Experience: 3+ years engineering experience, with a strong track record in inference acceleration and model deployment at scale.
- Inference Mastery: Proven expertise in inference optimization, including quantization, attention acceleration, and deep learning compiler stacks.
- GPU & Parallelism: Deep knowledge of GPU programming (CUDA, NCCL) and experience with SP, TP, PP, and other forms of parallelism for distributed inference.
- AI Domain Knowledge: Familiarity with video generation (videogen) models and large language models (LLMs).
- Collaboration: Strong cross-discipline communication skills; able to drive shared goals across research and engineering functions.
- Ownership Mindset: Self-driven, solutions-oriented, and capable of managing ambiguity in a fast-paced startup environment.
- Bonus: Experience in enhancing training efficiency, stability, or resource optimization for large models.

NICE TO HAVE

- Experience with high-throughput video or real-time streaming model deployment
- Familiarity with distributed training and optimization toolkits
- Contributions to open source projects in AI infrastructure or deep learning compilers
- Startup or rapid prototyping experience

WHAT WE OFFER

- Competitive salary in the AI industry
- Equity in a fast-growing startup shaping the future of AI
- Comprehensive health benefits, monthly stipends, company retreats
- A supportive and collaborative office culture: we’re all building and launching together

ABOUT PIKA

At Pika, we’re crafting a future where video creation is seamless, intuitive, and universally accessible. Our mission is to empower creativity by breaking down technical barriers using the transformative power of AI. We’re a tight-knit, energetic team based in Palo Alto, CA, valuing efficiency, curiosity, and the ambition to make a meaningful impact on the world. We work from our Palo Alto office 3–5 days a week and welcome applicants who are eager to contribute onsite.