What you will do:
Observability, Monitoring, Alerting: AIOps & Tier 2 Ops and Support Monitoring
Candidates will have the opportunity to work with cutting-edge technologies and collaborate with top professionals on high-impact, highly visible products. Candidates will be part of the team responsible for implementing and maintaining the systems that generate operational insights and drive corrective actions across Caterpillar's entire digital ecosystem. Candidates will be responsible for ensuring the reliability, availability, and performance of Cat Digital's mission-critical web-based systems and infrastructure. Candidates will collaborate with cross-functional teams to develop and implement strategies to improve system stability, automate repetitive tasks, and enhance service delivery. Successful candidates will use AWS cloud, BigPanda, ThousandEyes, AppDynamics, SQL/NoSQL databases, and similar tools in their projects.
Job Responsibilities and Duties:
Monitor and troubleshoot production systems to identify and resolve performance, scalability, and reliability issues proactively.
Participate in the on-call rotation to provide 24/7 support for mission-critical applications and services.
Create and maintain automated processes and tools to manage deployments and releases.
Collaborate with cross-functional teams (such as development) to define and document operational processes, best practices, and procedures.
Implement and maintain system monitoring tools and dashboards to provide real-time insights into system performance and identify potential issues.
Ensure that systems and infrastructure comply with security, compliance, and regulatory requirements.
Conduct regular capacity planning exercises to anticipate future demand and ensure that we have the resources necessary to support our growing user base.
Continuously evaluate systems and processes to identify areas for improvement and implement changes as needed.