Job description
Job Responsibilities
- Infrastructure Development: Identify and resolve infrastructure gaps to ensure reliable, efficient, and scalable solutions.
- AI/ML Solutions: Develop advanced AI/ML infrastructure solutions to enhance the efficiency of our ML teams.
- System Design: Design and implement solutions for distributed storage systems, scheduling systems, high availability, and core reliability issues within large-scale GPU clusters.
- Performance Optimization: Monitor and optimize the performance of our AI/ML infrastructure, ensuring high availability, scalability, and efficient resource utilization.
- Automation Tools: Develop and deploy automation tools, monitoring solutions, and operational strategies to streamline infrastructure management and reduce manual tasks.
- Collaboration: Work with various teams, including ML developers, data engineers, and DevOps professionals, to create a cohesive and integrated AI/ML infrastructure ecosystem.
- Parallel Training: Optimize large-scale parallel training for state-of-the-art deep learning algorithms, including large language models, multi-modality models, diffusion, and reinforcement learning.
- Research & Development: Research and develop our machine learning systems, including accelerated computing architecture, management, and monitoring.
- Deployment: Deploy machine learning systems for distributed training and inference.
- Cross-layer Optimization: Manage cross-layer optimization across systems, AI algorithms, and machine learning hardware (GPU, ASIC).
Minimum Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 5-8+ years of experience in software engineering, with a strong background in developing and managing large-scale distributed systems, ideally within the AI/ML infrastructure domain.
- Proficiency in programming languages such as Python, Go, or C++, with knowledge of cloud computing platforms such as AWS or Azure.
- Familiarity with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX. Basic understanding of GPU and/or ASIC functionality.
- Expertise in at least one of the following programming languages in a Linux environment: C/C++, CUDA, Python.
- Familiarity with open-source distributed scheduling, orchestration, and storage frameworks, such as Kubernetes (K8s), YARN (Flink, MapReduce), HDFS, Redis, and S3, with practical experience in machine learning system development.
- Mastery of distributed systems principles and participation in the design, development, and maintenance of large-scale distributed systems.
- Strong communication and collaboration abilities, effective in working with diverse teams and individuals.
Preferred Qualifications
- In-depth understanding of AI/ML workflows, including model training, data processing, and inference pipelines.
- Practical experience with containerization technologies (Docker, Kubernetes), automation tools, and monitoring solutions (Prometheus, Grafana).
- Exceptional problem-solving skills, capable of analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
- A passion for continuous learning and staying abreast of new technologies and best practices in the AI/ML infrastructure space.
- Experience with GPU-based high-performance computing and RDMA high-performance networks (MPI, NCCL, ibverbs).
- Familiarity with distributed training framework optimizations (e.g., DeepSpeed, FSDP, Megatron, GSPMD).
- Knowledge of AI compiler stacks (torch.fx, XLA, MLIR).
- Experience with large-scale data processing and parallel computing.
- In-depth CUDA programming and performance tuning experience (CUTLASS, Triton).
About Together AI
We are a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama.
Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is:
$160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Equal Opportunity
We are an Equal Opportunity Employer and are proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity,