Administers HPC clusters
ALTEN Mexico - México, México
Job Description
ALTEN Mexico is a subsidiary of ALTEN Group. ALTEN Group has been a leader in engineering and information technology for more than 30 years. It operates in 30 countries across Europe, North America, Asia, Africa, and the Middle East and employs more than 46,000 people, 88% of whom are engineers.

We are looking for an HPC Analyst.

What is it?
An HPC CAE Analyst (High-Performance Computing Analyst for Computer-Aided Engineering) is a specialist in high-performance computing systems who manages and optimizes computational infrastructures to run complex engineering simulations. Their work accelerates product development in industries such as automotive, aerospace, and energy, where large-scale calculations and advanced modeling are required.

How do they do it?
- Manages HPC clusters, using job schedulers like SLURM, PBS, or AWS ParallelCluster to efficiently distribute workloads across hundreds or thousands of processing cores.
- Automates processes with Python and Bash scripts, streamlining repetitive tasks such as the installation of CAE software (ANSYS, NASTRAN, OpenFOAM) or system performance monitoring.
- Troubleshoots issues related to hardware, software, and applications, ensuring maximum cluster availability for critical simulations.
- Advises users (engineers, scientists) on how to run their jobs more efficiently, fine-tuning parameters to reduce computation time and operational costs.

General activities:

1. HPC Cluster Management and Optimization
- Configure, maintain, and scale HPC clusters (on-premises or cloud-based).
- Optimize resource allocation (CPU/GPU, memory, storage) for parallel workloads.
- Manage job schedulers (SLURM, PBS, AWS ParallelCluster) to prioritize critical tasks.

2. Technical Support and Troubleshooting
- Diagnose and resolve hardware failures (nodes, networking, storage).
- Fix software errors (OS, drivers, scientific libraries).
- Apply security patches and system updates (Linux, firmware).

3. Automation and Scripting
- Develop Bash/Python scripts to:
  - Automate CAE software deployments (ANSYS, OpenFOAM).
  - Monitor cluster performance (CPU/GPU usage, network latency).
  - Generate HPC metrics reports (utilization, efficiency).

4. CAE User Support
- Advise engineers/scientists on:
  - Optimal simulation setups (MPI/OpenMP parallelization).
  - Efficient resource usage (avoid crashes due to misconfiguration).
- Train users in HPC/CAE tools (workshops, technical documentation).

5. Scientific Software Maintenance
- Install, update, and manage licenses for CAE applications (CFD, FEA, multiphysics).
- Manage runtime environments (Environment Modules/Lmod).
- Integrate scientific libraries (Intel MKL, CUDA, OpenMPI).

6. Security and Compliance
- Apply hardening policies to Linux servers.
- Manage user access (LDAP/Active Directory) and storage quotas.
- Ensure backups of critical data (simulation files, configurations).

7. Collaboration with Cross-Functional Teams
- Work with IT teams to scale infrastructure.
- Coordinate with development engineers to optimize CAE code performance.
- Document procedures (Wiki, Confluence) to standardize operations.

Requirements:
- Bachelor's degree in: Computer Science, Computer Engineering, Mechanical, Aerospace, or Electrical Engineering (with a focus on HPC/CAE), Physics or Applied Mathematics (with experience in technical computing).
- 5 to 8 years of experience
- Advanced English

Required Technical Skills:

Systems and Administration
- Linux (RHEL/CentOS, Ubuntu distributions): Advanced (5+ years)
- HPC cluster administration: Intermediate (2–5 years)
- Job schedulers (SLURM, PBS Pro, AWS ParallelCluster): Intermediate (2–5 years)
- Virtualization/Containers (Docker, Singularity): Basic (0–2 years)

Programming and Scripting Languages
- Bash scripting: Intermediate (2–5 years)
- Python (automation, data analysis): Intermediate (2–5 years)

CAE Software and Tools
- Installation/configuration of CAE software (ANSYS, OpenFOAM, NASTRAN): Intermediate (2–5 years)
- Parallel computing (MPI, OpenMP): Basic (0–2 years)

Networking and Storage
- HPC networks (InfiniBand, high-performance Ethernet): Basic (0–2 years)
- Parallel file systems (Lustre, GPFS): Basic (0–2 years)

Cloud and DevOps
- AWS/GCP/Azure for HPC (Batch, ParallelCluster, EC2): Intermediate (2–5 years)
- Infrastructure as Code (Terraform, CloudFormation): Basic (0–2 years)
- CI/CD for HPC environments (GitLab CI, Jenkins): Basic (0–2 years)

Optimization and Performance
- Tuning of HPC/CAE applications: Intermediate (2–5 years)
- Benchmarking (HPL, IOR): Basic (0–2 years)

Alternative Job Schedulers
- IBM Spectrum LSF: Basic (0–2 years)
- Altair PBS Professional: Basic (0–2 years)

Security and Compliance
- Linux system hardening: Basic (0–2 years)
- Cybersecurity in HPC environments: Basic (0–2 years)

Monitoring and Logging
- Grafana/Prometheus for HPC monitoring: Basic (0–2 years)
- ELK Stack (log analysis): Basic (0–2 years)
Created: Thu, 01 Jan 1970