Exploro Solutions
Site Reliability Engineer - Prometheus/Grafana
Job Location
Bangalore, India
Job Description
Job Role: Site Reliability Engineer
YOE: 4 to 7 yrs

Key Responsibilities:

Payment Monitoring and Alert Triage:
- Monitor payment-flow-based alerts across multiple applications in rotational 24x7 shifts and identify issues proactively.
- Triage alerts by analysing trends on the affected dimensions of the payment flow and correlating them with other services' metrics, logs, and traces to find the root cause, and document the triage.
- Ensure timely escalation and closure of reported issues while working with the engineering teams of payment services.

Observability Development:
- Design and implement alerting frameworks using tools like Datadog, Grafana, Kibana, Splunk, and Prometheus.
- Set up custom dashboards and streamline alerting to reduce noise while ensuring critical issues are addressed.
- Drive the adoption of SLO-based alerting, burn rate metrics, and anomaly detection techniques.

Incident Management:
- Lead incident management efforts from identification to resolution.
- Conduct post-incident reviews and implement preventive measures to avoid recurring issues.
- Maintain detailed documentation and performance reports on incident trends and team efficiency.

Automation and Optimization:
- Automate repetitive processes using programming languages like Python or Java.
- Develop and refine scripts to manage and fine-tune alerts.
- Collaborate with engineering teams to implement solutions that reduce manual effort and operational toil.

Required Skills and Qualifications:
- Proven expertise in SRE observability concepts and monitoring architecture design.
- Extensive experience with alerting frameworks like Prometheus, Grafana, Kibana, Splunk, and Datadog.
- Hands-on experience with alert noise reduction and advanced alerting techniques such as anomaly detection and burn rate alerting.
- Strong proficiency in incident management, including analysis, root cause identification, and preventive measures.
- Familiarity with payment monitoring systems and operational requirements.
- Proficiency in automation tools and scripting languages like Python or Java.
- Excellent collaboration and communication skills to interact with cross-functional teams.
- Flexibility to work in rotational 24x7 shifts from the office.

Notice Period: Immediate to 20 days (ref:hirist.tech)
Location: Bangalore, IN
Posted Date: 5/1/2025
Contact Information
Contact: Human Resources Exploro Solutions