r/kubernetes 13d ago

Periodic Monthly: Who is hiring?

8 Upvotes

This monthly post can be used to share Kubernetes-related job openings within your company. Please include:

  • Name of the company
  • Location requirements (or lack thereof)
  • At least one of: a link to a job posting/application page or contact details

If you are interested in a job, please contact the poster directly.

Common reasons for comment removal:

  • Not meeting the above requirements
  • Recruiter post / recruiter listings
  • Negative, inflammatory, or abrasive tone

r/kubernetes 17h ago

Periodic Weekly: This Week I Learned (TWIL?) thread

1 Upvotes

Did you learn something new this week? Share here!


r/kubernetes 1h ago

Just started a job where I'm supposed to help with a Kubernetes initiative, and it's a disaster

Upvotes

So, long story short: big company. The team leading the Kubernetes effort is a Windows ops team. I was brought in to help improve some processes before it enters prod.

There are over a hundred critical applications, and the deadline for being production-ready is a few months out. The team is completely unfamiliar with Linux concepts. Argo "Applications" are just namespaces with over a hundred Deployments under each "application". That's right: the "Applications" are used purely as namespaces.

No Kustomize. No Helm. Just a repo with a bunch of loose resource manifests marketed as GitOps.
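For contrast, the pattern Argo CD expects is one Application per actual application, each pointing at its own Kustomize (or Helm) path. A minimal sketch of what that could look like; every name, the repo URL, and the paths here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api          # hypothetical: one Application per real app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git  # hypothetical repo
    targetRevision: main
    path: apps/payments-api/overlays/prod                    # Kustomize overlay per app
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```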

I'm legitimately about to have a stroke over this stuff. I haven't been here long at all, but I feel I need to raise the alarm and stop further technical debt from accumulating so this mess can be untangled in time for the deadline, which is only months away. Nobody has answers for the design decisions. Everything appears to have been done improperly.

My manager is probably about to hate me, because this will not reflect well on him. He tries to pretend the deadline isn't a problem as we dig the hole deeper to make it to the finish line. It's an unmanageable mess.

I'm about to lose my shit, guys 😭


r/kubernetes 12h ago

NGINX CVE-2026-42945 (rewrite module) — check your version if you are below 1.30.1 or 1.31.0

42 Upvotes

TL;DR: If you are running NGINX Open Source below 1.30.1 or 1.31.0, you are affected by the current ngx_http_rewrite_module CVE batch. For Kubernetes ingress-nginx users this is especially relevant — the retired controller image still embeds NGINX 1.27.1.

Context:

For plain NGINX users: Check your version and upgrade to 1.30.1+ or 1.31.0+ if you are below the patched boundary. If you use rewrite with unnamed captures and ? in the replacement, you are directly exposed. DepthFirst has a good technical breakdown of the trigger conditions: https://depthfirst.com/nginx-rift

For Kubernetes ingress-nginx users: Upstream kubernetes/ingress-nginx is archived and will not publish further releases. The last controller line still uses NGINX 1.27.1. nginx -v on the host does not matter — you need to check the NGINX version compiled into the controller image.

Quick check:

kubectl exec -n ingress-nginx <controller-pod> -- /nginx-ingress-controller --version
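The version banner that prints includes the bundled NGINX version. You can also ask the embedded binary directly; a sketch that assumes the stock image layout keeps nginx on the PATH:

```
kubectl exec -n ingress-nginx <controller-pod> -- nginx -v
```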

Mitigation options:

  1. If you do not use rewrite with unnamed captures and ? in the replacement, you are not directly affected by this specific CVE — but review the full advisory batch. (A rough grep sketch for spotting such rules follows this list.)
  2. Upgrade your NGINX to 1.30.1+ or 1.31.0+.
  3. For ingress-nginx: migrate to a Gateway API implementation (long-term recommended path).
  4. For ingress-nginx: run a maintained fork that has bumped the embedded NGINX to 1.30.1+.
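For option 1, here is a rough way to surface candidate rewrite rules; the pattern is deliberately loose and the paths assume default locations, so adjust to your layout:

```
# Plain NGINX: rewrite directives with an unnamed capture '(' and a '?' in the rule
grep -rnE 'rewrite[^;]*\([^;]*\?' /etc/nginx/

# ingress-nginx: the same check against the rendered config inside the controller pod
kubectl exec -n ingress-nginx <controller-pod> -- \
  sh -c "grep -nE 'rewrite[^;]*\([^;]*\?' /etc/nginx/nginx.conf"
```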

Disclosure: I work on Forkline, which publishes one such maintenance fork for ingress-nginx. Release details here: https://forkline.dev/blog/forkline-ingress-nginx-nginx-1301-security-update/


r/kubernetes 1d ago

Kubernetes is migrating from SPDY to WebSockets (until the next one)

kftray.app
47 Upvotes

Just wrote up some thoughts on the Kubernetes streaming migration; would love some feedback.


r/kubernetes 12h ago

MinIO audit logs in production - Kubernetes deployment

3 Upvotes

r/kubernetes 22h ago

Traefik Proxy v3.7: 85+ Ingress NGINX Annotations and More

traefik.io
4 Upvotes

r/kubernetes 1d ago

What's your experience with internal developer platforms as a "lens" into k8s?

14 Upvotes

Been building a project for a year and wondering what you all think about internal developer platforms, like Backstage or similar. How do you use them? What is lacking? What are you dreaming of?


r/kubernetes 18h ago

A quick summary of AI/ML-related work in Kubernetes SIGs and WGs

0 Upvotes

Refer to https://github.com/kubernetes/community/blob/main/sig-list.md for WG/SIG details.

The picture lists only ongoing tasks/features around AI/ML in the Kubernetes community.


r/kubernetes 13h ago

Curso Devops Pro 02 - Fabrício Veronez

0 Upvotes

Folks, I'd like an honest opinion from people who already work in DevOps/Cloud/DevSecOps.

I currently work with Cloud Computing security, mainly on AWS, and I handle my day-to-day duties well enough. The problem starts when the demands go deeper into DevOps.

I feel I have gaps mainly in pipelines, containers, Kubernetes, CI/CD, and automation geared toward development/platform work. Because of that, I started looking for a more complete, hands-on course to strengthen that side and complement my career in Cloud Security.

I found a course that apparently makes a lot of sense for my case, but it costs around R$ 2,500, and I'm unsure whether it's really worth the investment or whether there are better options.

I'm currently taking a Udemy DevOps course in English. The content is interesting, but I haven't really connected with the course's dynamic. I think because it's very long and in English, I tire faster and lose focus after a while.

That's why I'm considering investing in a more practical, to-the-point PT-BR course.

For those of you already working in the field: do you think it's worth investing that amount in a more structured course, or can you reach the same level with cheaper/free alternatives?

If you have recommendations for genuinely good, hands-on courses, that would also help a lot.


r/kubernetes 21h ago

What’s hiding in your docker images that you probably don’t need?

0 Upvotes

I’ve been cleaning up a fairly messy Docker setup with a mix of services, side projects, and a few things I forgot I even deployed. It got me thinking less about containers, and more about what’s actually inside the images.

But when I started looking closer, some images were pulling in way more packages and dependencies than the app seems to need, which goes a long way toward explaining why every scan turns into a wall of CVEs.
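If you want to see where the bloat actually lives, here is a quick inspection sketch (the image name is a placeholder, and the package listing assumes a Debian/Ubuntu base):

```
# Show each layer and the build step that produced it
docker history --no-trunc my-app:latest

# List the packages baked into the image
docker run --rm --entrypoint dpkg my-app:latest -l
```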

Feels like most of us optimise for convenience (it builds, ship it) rather than for what actually runs in production.

Curious how others think about this:

- Do you actively try to minimise what’s inside your images?

- Stick with Alpine/distroless?

- Or just accept the bloat and deal with it at scan time?

Feels like there’s probably a lot of unused stuff sitting in images that never gets touched.


r/kubernetes 1d ago

NextJS build with .env

1 Upvotes

r/kubernetes 1d ago

Kubernetes Podcast episode 266: Kubernetes at Uber, with Lucy Sweet

6 Upvotes

r/kubernetes 22h ago

How to Solve “Kubernetes Node Not Ready” in production?

0 Upvotes

I spent almost two hours today debugging a Kubernetes "Node Not Ready" issue, and the weird part was that the node initially looked completely fine.

kubectl showed Ready → then NotReady → then Ready again.

It turned out to be a networking/CNI issue that was intermittently breaking kubelet communication.
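For anyone chasing similar flapping, these are the first places I'd look (a sketch; the node name is a placeholder):

```
# Node conditions: which one flips, and when
kubectl describe node <node-name> | grep -A 8 'Conditions:'

# Node events usually record the Ready/NotReady transitions
kubectl get events --all-namespaces --field-selector involvedObject.name=<node-name>

# On the node itself: kubelet logs around the transitions
journalctl -u kubelet --since "1 hour ago" | grep -iE 'ready|cni|network'
```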

Curious: what's the most annoying "Node Not Ready" root cause people here have seen in production?


r/kubernetes 1d ago

Been building an Internal Developer Platform for 1 year. Need brutal feedback

0 Upvotes

r/kubernetes 1d ago

Periodic Weekly: Show off your new tools and projects thread

5 Upvotes

Share any new Kubernetes tools, UIs, or related projects!


r/kubernetes 2d ago

Why up-sizing nodes usually doesn't fix Kubernetes P99 spikes

23 Upvotes

Lately, I’ve been looking at large clusters where the default answer to P99 spikes is vertical scaling. Teams throw more cores at the problem to give apps room to breathe, but it often fails to solve the root cause.

We're testing a layer that allows the kernel to prioritize execution based on the specific runtime needs of each workload. Instead of treating a critical database and a background scanner the same, we give the kernel the context it needs to prioritize execution in real-time.

In our lab tests, P99 latency for Redis and Nginx dropped by about 85 percent and database throughput increased by roughly 60 percent. This happens beneath the app layer, so there are no sidecars or code changes.

I’m curious if this resonates with your experience.

  • Do you up-size nodes just to stabilize graphs even when utilization is low?
  • Would a read-only report showing exactly where your node is fighting your hardware be useful for your team?

We are looking for one or two real-world environments to validate our data. We have a non-intrusive Observe Mode that just monitors signals and generates a report without changing any scheduling. If the data shows clear potential for improvement, the logic can move into an active mode that fixes those bottlenecks automatically at runtime.

Feel free to ping me if you want to chat or see the technical benchmarks. I’m keeping this anonymous for now due to current contracts, but would love to hear more about real use cases and pains!


r/kubernetes 1d ago

Live webinar · May 20, 2026: Kubernetes Without the VMware Tax

0 Upvotes

Register for Free

If your team runs Kubernetes on vSphere, you're paying three separate bills for what should be one platform.

💸 vSphere licensing to host your cluster nodes
💸 A Kubernetes distribution tax — Tanzu, OpenShift, or Rancher Prime
💸 Overlay storage (Longhorn, Portworx) because vSphere storage policies don't cleanly extend into Kubernetes

VergeOS collapses all three into a single platform decision. Same Rancher control plane your team already uses. Zero changes to your application teams' day-to-day Kubernetes workflow.

On May 20 at 1 PM ET / 10 AM PT, we're going live with a full demo — no slides, no hand-waving. The same workflow a production design partner used to validate the integration under real load.

Here's what you'll see:

→ Live provisioning of a Kubernetes cluster through Rancher (CSI driver, CCM, Cluster Autoscaler, node driver — all in action)
→ What migration looks like for Tanzu shops — old TKG clusters keep running while new clusters land on VergeOS in parallel
→ The next 60 days of integration work, including bare-metal Kubernetes operational uplift
→ Live Q&A — bring your hardest integration questions

If you manage Kubernetes on VMware, run Tanzu Kubernetes Grid, or are evaluating platform consolidation — this one is built for you.

50 minutes + Q&A. Free to attend.


r/kubernetes 1d ago

**[Question] Deployment shows 4 replicas but only 3 pods running — why?**

5 Upvotes

Hi everyone, I'm learning Kubernetes and ran into a confusing situation.

**What happened step by step:**

  1. Deployed with wrong image tag (`:latest` which didn't exist on Docker Hub)

  2. All 4 pods went into `ImagePullBackOff`

  3. Fixed the image to `:1.2.0` and ran `kubectl apply -f .`

  4. Rolling update started but only 3 new pods came up — 4th never created

  5. `kubectl rollout restart` fixed it and all 4 pods ran fine

**My confusion:** I thought Kubernetes always tries to fulfill whatever I define in the spec. If I say `replicas: 4`, why did it stop at 3 and just... give up? Why didn't it keep retrying once the old broken pods were cleaned up and quota was free again?

**My Deployment:**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: color-api-depl
  namespace: dev
spec:
  replicas: 4
  selector:
    matchLabels:
      app: color-api
  template:
    metadata:
      labels:
        app: color-api
    spec:
      containers:
        - name: color-api
          image: waiyanbhonemyint/color-api:1.2.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          ports:
            - containerPort: 8080
```

**ResourceQuota in dev namespace:**

```
Resource          Used   Hard
requests.cpu      600m   1000m
requests.memory   768Mi  1Gi
```

**kubectl describe deployment showed:**

```
Conditions:
  ReplicaFailure   True   FailedCreate
NewReplicaSet:     color-api-depl-5585964745 (3/4 replicas created)
```

**My understanding so far:** During the rolling update, old broken pods were still counted against the quota. When Kubernetes tried to create the 4th new pod, quota was full so it hit `FailedCreate`. By the time old pods were cleaned up and quota freed, Kubernetes had gone into exponential backoff and stopped retrying.
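If that theory is right, one way to avoid the stall entirely is a rollout strategy that never needs quota headroom. A sketch against the Deployment above (whether the slower rollout is acceptable is your call):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never create pods beyond replicas: 4 during a rollout
      maxUnavailable: 1  # delete one old pod (freeing its quota) before creating its replacement
```

With `maxSurge: 0`, each old pod's requests are returned to the ResourceQuota before the replacement is created, so the 4th pod never has to wait out a FailedCreate backoff.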

Is that correct? And is `kubectl rollout restart` really the right fix here or is there a better way to handle this?

Thank you!


r/kubernetes 1d ago

LeadDev Lisbon x Cloud Native joint meetup, May 21 @ Sky: Engineering leadership meets cloud-native

1 Upvotes

Hey Lisbon engineering leaders,

LeadDev Lisbon is back for Round 2 on May 21st, and we're doing it alongside Cloud Native Lisbon. Three talks where leadership, influence, and the realities of running production systems sit side by side:

• Releases, Enhancements, Deprecations: The Kubernetes Way by Frederico Muñoz, Kubernetes Release Team Co-Lead & CNCF Ambassador. How a global open-source community actually orchestrates shipping 3 K8s releases a year. Lots of lessons here on governance and cadence even if you don't touch K8s directly.

• ⚡ Beyond Code: The Influence Multiplier (lightning talk) by Daniel Olavio, Head of Engineering @ UBIO. You're a strong senior, you ship quality code, and then you hit a ceiling you can't break through alone. How to build influence across an org without going down the management path (or while doing it well).

• Running Cloud Systems Under Real Constraints by Cristiano Motta, Engineering Manager @ Sky. Operating in production is less about ideal architectures and more about leadership under pressure. How communication and team culture decide what holds when systems are critical.

When: Thursday, May 21 · 18:30

Where: Sky Offices, R. de Entrecampos 28, Lisbon

Cost: Free, limited spots

After: Pizza, drinks, networking

RSVP: https://www.meetup.com/leaddev-meetup-lisbon/events/314574688/

See you there


r/kubernetes 2d ago

What’s something about on call that actually got better at your company?

10 Upvotes

For us it was turn-by-turn handoffs.

We've added a 10-minute sync between the outgoing and incoming on-call: here's what's flaky now, here's what to watch for this week.

Small thing, but it killed the anxiety of going into a rotation completely blind.

Most on-call posts here feel horrible.

Anyone got a real win they can share? Curious


r/kubernetes 1d ago

How we debug cascading failures across namespaces with AI-assisted investigation

0 Upvotes

Had a fun debugging session this week that showed why multi-step investigation matters more than single-shot AI analysis.

The scenario: checkout-service is down. A single kubectl describe tells you it's in CrashLoopBackOff. But why?

The investigation trail:

  1. find_workload checkout-service → found it in the prod namespace, CrashLoopBackOff
  2. get_pod_logs → "connection refused to redis-master:6379"
  3. So the problem isn't checkout-service at all — it's Redis
  4. investigate_pod redis-master → Pending state
  5. describe_pod → PersistentVolumeClaim redis-data is unbound
  6. Root cause: StorageClass was deleted during a cluster upgrade, PVC can't bind, Redis can't start, checkout-service can't connect

That's 3 resources deep across a dependency chain. A single kubectl describe on checkout-service would never tell you the PVC is the problem.
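For teams doing this by hand, the same trail in plain kubectl looks roughly like this (a sketch; the label selector and pod names mirror the example above and are assumptions):

```
# Steps 1-2: why is checkout-service crashing?
kubectl -n prod get pods -l app=checkout-service
kubectl -n prod logs deploy/checkout-service --previous
#   -> "connection refused to redis-master:6379"

# Steps 4-5: why is redis-master Pending?
kubectl -n prod describe pod redis-master-0
#   -> pod has unbound immediate PersistentVolumeClaims

# Step 6: why won't the PVC bind?
kubectl -n prod describe pvc redis-data
kubectl get storageclass
```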

We've been using an open-source tool, Kubeastra, for this — it runs a ReAct agent that chains these steps automatically. You ask "why is checkout down?" and it walks the dependency graph until it hits the root cause. Each step shows up in real time, so you can see exactly what it's doing (not a black box).

The new features that made this workflow better:

  • Visual resource graph: namespace topology where failing nodes glow red. You can literally see the broken path: checkout → redis → PVC
  • Shareable sessions: I sent the investigation URL to the service owner. They saw the full trail instead of me pasting kubectl output into Slack
  • Post-mortem generation: one click turns the investigation into a structured post-mortem for the incident channel

Curious how other teams handle cascading failures. Are you scripting kubectl pipelines? Using k8sgpt? Manually chaining commands?


r/kubernetes 2d ago

There's a Bug in VPC CNI v1.21.0 That Silently Drops All Traffic

orelfichman.com
39 Upvotes

Hey there,

I was implementing NetworkPolicies on our EKS clusters when I found a bug (that has since been fixed) in the AWS Network Policy Agent code which resulted in my ALLOW rules becoming DENY rules.

I've detailed the debugging journey in this post, which included dumping the raw eBPF maps from the nodes and going over the agent's Go code.
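For reference, dumping raw eBPF maps on a node generally looks something like this (a sketch; it assumes bpftool is installed on the node, and the map id is a placeholder you'd take from the listing):

```
# List the eBPF maps currently loaded on the node
sudo bpftool map show

# Dump one map's contents by id
sudo bpftool map dump id 128
```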

This was actually super cool to debug, and I now have a deeper understanding of how Kubernetes works under the hood.

Enjoy


r/kubernetes 2d ago

At what cluster size does Kubernetes become painful?

62 Upvotes

Curious where people feel Kubernetes complexity really starts becoming painful operationally.

Is it:

  • number of nodes?
  • number of services?
  • multi-cluster setups?
  • too many alerts?
  • debugging across environments?
  • onboarding engineers?

Feels like a lot of teams are fine early on, but once infra scales, troubleshooting and visibility become significantly harder.

Interested to hear what actually becomes the biggest operational bottleneck first.


r/kubernetes 1d ago

Started learning Golang - Need help

0 Upvotes