
Staff-level platform and observability engineer

Jordan Simonovski

Open to work

Staff Software Engineer

Blue Mountains, Australia

  • Open to Staff/Principal platform and observability roles
  • AEST (UTC+10)
  • Remote-first (APAC-friendly, global collaboration)
  • Replies within 24 hours

Current Role: Staff Software Engineer, Blue Mountains, Australia

Domain Focus: OpenTelemetry traces + logs + metrics (99.98%)

Experience Timeline: 4 roles, updated 2026-03-15

Experience Timeline

2017-01 to 2026-03 (111 months)

Role summary

Owned and ran high-scale observability systems ingesting hundreds of terabytes of telemetry per day. Advocated for and evangelised observability built on wide events, with a focus on MTTR reduction, great DX, high-cardinality use cases, and AI observability.

  • Evangelised the shift to wide events and implemented ClickHouse-backed infrastructure, scaling to 100TB/day ingestion while maintaining sub-second query times and supporting AI observability.
  • Built and single-handedly ran observability tracing infrastructure with Grafana Tempo, scaling to 180TB/day ingestion.
  • Introduced Grafana Scenes internally and implemented a centralised app for debugging core Atlassian experiences across multiple data sources; its focus on UX and explorability reduced MTTR for both novel and well-understood incidents.
  • Scaled Grafana Mimir to match existing internal metric usage across products; ran usage-accurate load tests to validate ~200k alert evaluations.
  • Standardised Kubernetes deployments with Helm charts built using my own OSS tools to better vendor, test, and enforce common requirements and constraints.
  • Implemented a Grafana Scenes application, integrated with our internal AI gateway, to replace existing explore workflows: an AI assistant generates accurate PromQL, TraceQL, SQL (ClickHouse), and SPL (Splunk) queries, explains data panels, and builds on-the-fly dashboards for debugging core Atlassian experiences.
  • Designed and implemented Atlassian's first accurate service map by covered experience, critical for reducing MTTR during high-stress incidents.
  • Improved Tempo query performance by 30%+ with minimal cost impact.
  • Built out a local development observability solution for Atlassian engineers in my free time, with a strong focus on DX and flexibility.
  • Designed and built multi-region (regional failover) infrastructure for Grafana and Grafana Mimir with a strong focus on low RTO/RPO, while keeping costs under control.
Kubernetes · OpenTelemetry · Grafana · Prometheus/Mimir · Tempo · ClickHouse · Golang · TypeScript
Wide events + ClickHouse scale-up (8% -> 42%)

Led the shift to wide events and scaled ClickHouse ingestion to 100TB/day while maintaining fast query performance for operators.
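For readers unfamiliar with the wide-event model referenced throughout: instead of emitting separate logs, metrics, and trace signals, each unit of work produces one rich, high-cardinality record. A minimal Python sketch under illustrative assumptions (the field names and schema are hypothetical, not Atlassian's):

```python
import json

def make_wide_event(**fields):
    """Build one wide event: a single high-cardinality record carrying
    everything known about a unit of work, instead of separate
    log/metric/trace signals (hypothetical schema for illustration)."""
    event = {"schema": "wide_event/v1"}
    event.update(fields)
    return event

# One request -> one rich row; arbitrary dimensions are cheap to add.
event = make_wide_event(
    service="checkout",
    trace_id="4bf92f3577b34da6",
    duration_ms=217,
    region="ap-southeast-2",
    user_tier="enterprise",  # high-cardinality fields are first-class here
    error=None,
)

# Serialised rows like this are what land in a columnar store such as ClickHouse.
row = json.dumps(event)
```

Debugging then becomes slicing one table by any combination of fields, which is what keeps high-cardinality queries fast in a columnar store.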

Tempo tracing platform ownership (45% -> 78%)

Built and ran observability tracing infrastructure on Grafana Tempo, scaling ingestion to 180TB/day with reliability guardrails.

Grafana Scenes debugging UX (78% -> 100%)

Introduced and implemented Grafana Scenes as a centralised debugging surface across data sources, improving explorability and reducing MTTR.


Side Quests

14 project highlights

  • Built an assistant-driven query workflow that generated PromQL, TraceQL, and SQL from incident prompts.
  • Reduced time-to-first-useful-query during incident debugging by improving discovery and query authoring ergonomics.
  • Kicked off an open-source build of this assistant workflow in my free time.
  • HelmCov: a dynamic coverage checker that surfaces coverage gaps in Helm charts, useful for charts with heavily branched templating logic where testing all paths is ideal.
  • Helmver: a tool I wrote because I missed the nice semantic-versioning tooling from the JS ecosystem; it borrows heavily from changesets.
  • Heatmap panel (which I should really rename): started as "could I do something like BubbleUp, but for ClickHouse?" and is turning into much more as I continue experimenting.
  • Built a local-first observability stack for faster debugging in day-to-day engineering loops.
  • Improved developer feedback loops by making traces, logs, and metrics available without production dependencies.
  • DevOpsDays Wollongong organiser, 2025.
  • SREcon22 Asia/Pacific organiser.
  • DevOpsDays Sydney 2019 organiser.
  • Currently building an open-source Grafana Scenes app on a ClickHouse data source to vastly improve developer experience when debugging applications.


Summary

Staff engineer owning global-scale observability infrastructure across Mimir, Tempo, and ClickHouse, operating largely independently across the full stack. A vocal advocate for moving beyond the three-pillar observability model toward wide events, with hands-on experience designing and driving that migration at scale. Builds Grafana Scenes applications that measurably reduce MTTR and unify visibility across heterogeneous technology stacks and data sources. Technical reviewer for Observability Engineering, 2nd Edition and an active contributor to the broader SRE and observability community.

Recruiter Snapshot

  • Availability: Open to Staff/Principal platform and observability roles
  • Timezone: AEST (UTC+10), overlapping US and EU collaboration windows
  • Work mode: Remote-first, open to distributed global teams
  • Response SLA: Typically within 24 hours

Core Skills

  • Kubernetes platform engineering (EKS, Helm, GitOps, ArgoCD, custom operators)
  • AWS (10+ years; full platform scale-up, compute, serverless)
  • Docker platforms (Kubernetes, EKS, GKE)
  • OpenTelemetry ecosystem (Weaver, Collector, SDK)
  • Observability (Prometheus/Mimir, Grafana, Tempo/ClickHouse)
  • Site Reliability Engineering (SLOs, war games, chaos testing/tooling)
  • IaC with Terraform/Crossplane and CI/CD with GitHub Actions/Bitbucket Pipelines
  • Golang, TypeScript, and Rust development
  • Platform engineering, i.e. building, validating, and running internal products
  • Avid speaker and mentor (especially of junior engineers)

Experience

Staff Software Engineer - Atlassian (2020-05 to Present)

  • Drove platform-level observability strategy and execution across traces, metrics, and logs at very high ingestion scale.
  • Improved query performance and incident response outcomes through better UX, faster diagnostics, and stronger telemetry defaults.
Full impact:
  • Evangelised the shift to wide events and implemented ClickHouse observability infra at 100TB/day+ ingestion.
  • Built and operated Grafana Tempo tracing infrastructure at 180TB/day ingestion.
  • Made consistent, continual improvements to CI with reviewdog and Bitbucket Pipes to ease developer workflows and feedback.
  • Made consistent, continual improvements to CD through ArgoCD/Spinnaker work focused on ephemeral build environments.
  • Raised the bar on SLOs across the observability department and other teams by introducing better practices and running sessions on writing good SLOs.
  • Introduced Grafana Scenes internally and shipped a centralised debugging workflow across core Atlassian experiences.
  • Served as SME for Mimir/Tempo: scaled Grafana Mimir and ran usage-accurate load validation for ~200k alert evaluations.
  • Introduced span metrics and TraceQL Metrics (metrics derived from traces) to address gaps in traditional monitoring setups.
  • Standardised Kubernetes deployments with reusable Helm charts built using internal OSS tooling.
  • Integrated an AI-assisted query experience across PromQL, TraceQL, SQL (ClickHouse), and SPL (Splunk).
  • Designed and implemented Atlassian's first covered-experience service map for faster incident triage.
  • Improved Tempo query performance by 30%+ with minimal cost impact.
  • Built a local development observability stack for developers in spare time to improve day-to-day debugging DX.
  • Designed multi-region failover foundations for Grafana and Mimir with low RTO/RPO and controlled cost.
  • Implemented serverless Splunk scaling automation using Lambda and Step Functions.
  • Modernised Splunk infrastructure with a focus on cloud-native platform management; scaled to 600TB/day ingestion.
  • Implemented k6 for load testing infrastructure at scale, hitting 6M metric samples per second.
  • Led the design and implementation of internal PaaS abstractions for alert management.
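The SLO sessions and alert-evaluation work above rest on burn-rate arithmetic. A hedged Python sketch of the common multi-window burn-rate check (the 99.9% target and 14.4x threshold are the textbook example values, not necessarily the ones used internally):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget burns relative to 'exactly on SLO'.

    A burn rate of 1.0 consumes the whole budget precisely at the end of
    the SLO window; 14.4 consumes a 30-day budget in about two days.
    """
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Multi-window check: the long window shows the burn is sustained,
    the short window shows it is still happening right now."""
    return (burn_rate(short_window_errors, slo_target) >= threshold
            and burn_rate(long_window_errors, slo_target) >= threshold)


# 2% errors against a 99.9% SLO is a 20x burn rate across both windows.
page = should_page(short_window_errors=0.02, long_window_errors=0.02)
```

Requiring both windows to exceed the threshold stops a brief blip from paging anyone while still catching burns that are ongoing.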

Cloud Engineering Lead - Lendi (2018-01 to 2020-05)

  • Led cloud-native migration direction and implementation, including container platform evolution on AWS.
  • Introduced SRE practices (SLOs/error budgets) and built internal platform capabilities to improve DX and delivery reliability.
Full impact:
  • Drove architecture, implementation, and strategy for cloud-native migration.
  • Implemented SRE principles (SLOs and error budgets) across teams.
  • Built developer-centric tooling and internal platform capabilities to remove deployment friction.
  • Implemented container infrastructure on ECS and EKS.
  • Implemented Envoy sidecar patterns for Auth0-based authn/authz flows.
  • Led from-scratch AWS setup including multi-account foundations.
  • Introduced production resilience practices including controlled chaos testing.
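The error budgets introduced in this role fall directly out of the SLO target. A small illustrative calculation (the 99.9% target is an example, not Lendi's actual SLO):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an SLO permits over the window.

    This is the budget an error-budget policy spends: once it is gone,
    the policy trades feature work for reliability work.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime budget.
budget = error_budget_minutes(0.999)
```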

Senior Backend Engineer - HealthEngine (2017-01 to 2018-01)

  • Built resilient serverless messaging workflows using SQS, DynamoDB, and Lambda.
  • Implemented monitoring/logging foundations with Prometheus, Grafana, Elasticsearch, and CloudWatch.
Full impact:
  • Built resilient serverless messaging workflows using SQS, DynamoDB, and scaling Lambda functions.
  • Implemented Docker infrastructure in Rancher.
  • Implemented monitoring infrastructure with Prometheus and Grafana.
  • Implemented logging infrastructure with Elasticsearch and CloudWatch.
  • Contributed to Node.js backend application development.

DevOps Engineer - Domain (2017)

  • Maintained and improved ECS/Elasticsearch infrastructure and delivery workflows.
  • Implemented Nomad scheduling and Jenkins/Groovy deployment tooling, including ChatOps automation.
Full impact:
  • Maintained and improved ECS infrastructure.
  • Implemented HashiCorp Nomad as a new Docker scheduler.
  • Maintained and improved Elasticsearch cluster infrastructure.
  • Implemented Jenkins/Groovy deployment tooling.
  • Built a serverless deployment bot to enable ChatOps.
  • Worked with frontend platform teams to scale React rendering workloads on container infrastructure.
  • Partnered closely with developers on deployment tooling and release ergonomics.

Leadership & Scope

  • Drove the shift to new observability practices and tooling: Grafana Scenes as a replacement for traditional dashboards, ClickHouse for wide-event-based observability, and internal AI observability kick-offs.
  • Owned observability architecture and operations for critical Atlassian workflows, including platforms running at 100TB/day+ (ClickHouse) and 180TB/day (Tempo).
  • Designed and shipped operator tooling used across teams, including internal Grafana Scenes workflows, AI-assisted query generation, and Atlassian's first service map by covered experience.
  • Drove reliability practice adoption across organisations by implementing SLO/error-budget patterns, burn-rate alerting, and incident/runbook automation.

Selected Impact

  • Scaled observability pipelines to 100TB/day+ (ClickHouse) and 180TB/day (Tempo) while preserving fast operator workflows.
  • Improved Tempo query performance by 30%+ with minimal cost impact.
  • Standardised Kubernetes delivery patterns with reusable Helm tooling and stronger deployment guardrails.

References

Available upon request.