Most people using Kubernetes today don’t actually need it. They just… followed the hype ⚙️

They needed to:
• Run 3 or 4 apps
• Expose a few services
• Maybe autoscale, maybe not
• Deploy occasionally, with zero multi-region needs

And instead of going simple, they pulled in the full CNCF zoo 🦁
• Ingress, CRDs, Service Meshes
• ArgoCD, Helm, Istio, Prometheus, Linkerd, Vault…

All to deploy a to-do app and a PostgreSQL database ☕

Kubernetes is powerful. No doubt. But it comes with:
• A huge learning curve 📚
• Complex debugging 🧠
• Maintenance overhead
• Sharp edges and YAML pain

You don’t earn points for making your life harder. You’re not doing “real DevOps” because you manage your own kubelet.

If your team is small, your app is simple, and you just want to ship product, you’re better off with a managed PaaS or even a basic VM setup.

Kubernetes is not a badge of honor. It’s a tool 🛠️ And like any tool, you should pick it when the problem demands it, not your ego.

What do you think? Have you seen teams burn months on Kubernetes setups they didn’t need? Let’s open the comment war 🔥

#Kubernetes #DevOps #CloudNative #PlatformEngineering #SoftwareEngineering #TechLeadership #EngineeringMindset #SRE #Infrastructure #CloudComputing #Microservices #RealTalk #GKE #AWS #EKS #AKS #GoogleCloud #Azure
Cloud Infrastructure Challenges
-
Here are the most expensive Kubernetes mistakes (that nobody talks about).

I’ve spent 12+ years in DevOps and I’ve seen K8s turn into a money pit when engineering teams don’t understand how infra decisions hit the bill. Not because the team is bad. But because Kubernetes makes it way too easy to burn cash silently.

𝐇𝐞𝐫𝐞 𝐚𝐫𝐞 𝐭𝐡𝐞 𝐫𝐞𝐚𝐥 𝐦𝐢𝐬𝐭𝐚𝐤𝐞𝐬 that don’t show up in your monitoring tools:

1. 𝐎𝐯𝐞𝐫𝐩𝐫𝐨𝐯𝐢𝐬𝐢𝐨𝐧𝐞𝐝 𝐧𝐨𝐝𝐞𝐬 "𝐣𝐮𝐬𝐭 𝐢𝐧 𝐜𝐚𝐬𝐞"
Engineers love to play it safe. So they add buffer CPU and memory for traffic spikes that rarely happen.
☠️ What you get: idle nodes running 24/7, racking up your cloud bill.
✓ 𝐅𝐢𝐱: Use vertical pod autoscaling and limit ranges properly. Educate teams on real usage patterns vs. “just in case” setups.

2. 𝐏𝐞𝐫𝐬𝐢𝐬𝐭𝐞𝐧𝐭 𝐯𝐨𝐥𝐮𝐦𝐞𝐬 𝐭𝐡𝐚𝐭 𝐧𝐞𝐯𝐞𝐫 𝐝𝐢𝐞
You delete the app. But the storage stays. Forever. Cloud providers won’t remind you. They’ll just keep billing you.
✓ 𝐅𝐢𝐱: Use “reclaimPolicy: Delete” where safe. And audit your PVs like your AWS bill depends on it. Because it does.

3. 𝐋𝐨𝐠𝐠𝐢𝐧𝐠 𝐞𝐯𝐞𝐫𝐲𝐭𝐡𝐢𝐧𝐠... 𝐚𝐭 𝐞𝐯𝐞𝐫𝐲 𝐥𝐞𝐯𝐞𝐥
Verbose logging might help you debug. But writing 1TB+ of logs daily to expensive storage? That’s just bad economics.
✓ 𝐅𝐢𝐱: Route logs smartly. Don’t store what you won’t read. Consider tiered logging or low-cost storage for historical data.

4. 𝐔𝐬𝐢𝐧𝐠 𝐒𝐒𝐃𝐬 𝐰𝐡𝐞𝐫𝐞 𝐇𝐃𝐃𝐬 𝐰𝐨𝐮𝐥𝐝 𝐝𝐨
Yes, SSDs are fast. But do you really need them for staging environments or batch jobs?
✓ 𝐅𝐢𝐱: Use storage classes wisely. Match performance to actual workload needs, not just default configs.

5. 𝐈𝐠𝐧𝐨𝐫𝐢𝐧𝐠 𝐢𝐧𝐭𝐞𝐫𝐧𝐚𝐥 𝐭𝐫𝐚𝐟𝐟𝐢𝐜 𝐞𝐠𝐫𝐞𝐬𝐬
You’re not just paying for internet egress. Internal service-to-service comms can spike costs, especially in multi-zone clusters.
✓ 𝐅𝐢𝐱: Optimize service placement. Use node affinity and avoid chatty microservices spraying traffic across zones.

6. 𝐍𝐞𝐯𝐞𝐫 𝐫𝐞𝐯𝐢𝐬𝐢𝐭𝐢𝐧𝐠 𝐲𝐨𝐮𝐫 𝐚𝐮𝐭𝐨𝐬𝐜𝐚𝐥𝐞𝐫 𝐜𝐨𝐧𝐟𝐢𝐠𝐬
Initial HPA/VPA configs get set and never touched again. Meanwhile, your workloads have changed completely.
✓ 𝐅𝐢𝐱: Treat autoscaling like code. Revisit, test, and tune configs every sprint.

Truth is, most K8s cost overruns aren’t infra problems. They’re visibility problems. And cultural ones. If your engineering teams aren’t accountable for infra spend, it’s just a matter of time before you’re bleeding cash.

♻️ 𝐏𝐋𝐄𝐀𝐒𝐄 𝐑𝐄𝐏𝐎𝐒𝐓 𝐒𝐎 𝐎𝐓𝐇𝐄𝐑𝐒 𝐂𝐀𝐍 𝐋𝐄𝐀𝐑𝐍.
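A minimal sketch of fixes 2 and 4, assuming an AWS EBS CSI setup (the class names are illustrative): one StorageClass that cleans up after itself when the claim goes away, and a cheaper throughput-optimized HDD class for staging and batch workloads.

```yaml
# Hypothetical StorageClass: PVs provisioned from it are deleted
# automatically when their PVC is removed (fix 2).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: app-gp3            # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete      # use Retain only where data must outlive the claim
volumeBindingMode: WaitForFirstConsumer
---
# Cheaper throughput-optimized HDD class for staging/batch jobs (fix 4).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: batch-st1          # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: st1
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Workloads then opt in via `storageClassName` on the PVC, so the performance tier is an explicit choice instead of whatever the cluster default happens to be.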
-
I've spent over 5 years in Kubernetes. Here’s a summary of 7 brutal Don'ts I’ve learned:

7) Don’t use custom controllers when native Kubernetes resources can do the job.
6) Don’t install service meshes unless your traffic management is substantial.
5) Don’t use StatefulSets for workloads that don’t require persistent identity.
4) Don’t expose internal services with LoadBalancer; use ClusterIP and Ingress instead.
3) Don’t manually edit manifests in production; use GitOps.
2) Don’t use hostPath unless you absolutely need direct node storage.
1) Don’t over-engineer or use unnecessary tools.

Above all, the most important Don't:
0) Don't use Kubernetes unless it is really necessary – not every system is meant for it.

40K+ read my free weekday daily TechOps Examples newsletter: https://lnkd.in/gg3RQsRK

What we cover: DevOps, Cloud, Kubernetes, IaC, GitOps, MLOps

🔁 Consider a repost if this is helpful.
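Don't 4 in practice looks something like this sketch (service and host names are made up): internal services stay on ClusterIP, and a single Ingress is the only thing exposed, instead of paying for one cloud LoadBalancer per service.

```yaml
# Hypothetical internal service kept on ClusterIP...
apiVersion: v1
kind: Service
metadata:
  name: api              # illustrative
spec:
  type: ClusterIP        # not LoadBalancer
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
---
# ...with one shared Ingress handling external traffic.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```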
-
Kubernetes was built by Google for Google. It was designed to run software at massive scale across global data centers. Most companies will never operate at that level, yet they start with the same tooling.

The cost shows up immediately. Before shipping anything, teams must make dozens of decisions about configuration, resources, networking, restarts, and deployments. Each decision can fail in subtle ways. Progress slows before customers see value. What used to be a simple deploy becomes pages of fragile setup.

When something breaks, it is hard to see why. The problem might live in the code, the container, the node, the scheduler, or the network. Engineers spend more time figuring out where the issue is than fixing it. Clear cause and effect disappears.

Team flow suffers too. Someone has to run the platform, so a platform team appears. Developers stop deploying directly. They open tickets and wait. Feedback loops stretch. The system meant to speed delivery quietly adds friction.

Costs climb at the same time. Clusters are built for peak traffic, not normal days. Most capacity sits idle, but you still pay for it. That tradeoff makes sense at extreme scale. It does not for most teams.

A simpler setup works for many products and is easier to recover when things go wrong. Start with one solid server. Run your app with systemd or basic containers. Keep deployment scripts simple and owned by the same people who write the code. When traffic grows, move to a bigger machine before adding more machines.

For reliability, add a second server in a different availability zone or region. Keep it warm or ready to start. Replicate your data using built-in database replication or regular backups tested by real restores. Put a basic load balancer or DNS failover in front. If one server goes down, traffic shifts. Recovery is clear and predictable.

This approach is boring by design. Fewer moving parts. Fewer places for failure to hide. When something breaks, you know where to look and how to bring it back.

Kubernetes is not bad technology. It solves real problems for very large systems. The mistake is starting with that level of complexity when a simpler setup can ship faster, recover more easily, cost less, and keep teams focused on building the product.
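The "one solid server with systemd" setup can be as small as a single unit file. A sketch, assuming an illustrative binary path and user (adjust to your app):

```ini
# Hypothetical /etc/systemd/system/myapp.service
[Unit]
Description=My application
After=network-online.target
Wants=network-online.target

[Service]
# Illustrative path; point at your real binary or start script.
ExecStart=/opt/myapp/bin/server
Restart=on-failure
RestartSec=2
User=myapp
Environment=PORT=8080

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now myapp`, read logs with `journalctl -u myapp`. That gives you automatic restarts, boot persistence, and log aggregation with no orchestrator at all.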
-
Gitpod, a platform with 1.5 million users, has made the decision to move away from Kubernetes after six years of trying to make it work for their cloud development environments (CDEs). Despite exhausting every possible optimization, they ultimately realized Kubernetes wasn’t suited for their unique requirements.

Hosting a real-time desktop experience comes with zero tolerance for lag or interruptions caused by pod rescheduling. Unlike traditional stateless or stateful services, this operational model demands an entirely different level of performance and predictability.

Gitpod’s thorough write-up dives deep into the challenges they faced, such as:
• Complex resource management
• Storage performance bottlenecks
• Networking limitations with isolation and bandwidth sharing
• Security trade-offs required for user flexibility

This shift highlights an important lesson: while Kubernetes is a powerful tool for many applications, it’s not a one-size-fits-all solution. Teams often adopt Kubernetes because it’s seen as the “default” choice, only to discover that it doesn’t align with their specific needs. In some cases, a tailored or alternative approach may be the better path, even if it means moving away from an industry standard.

For anyone considering Kubernetes, this write-up is a must-read to understand its limitations and whether it fits your use case before making a commitment. https://lnkd.in/g49tz9ax
-
Kubernetes can scale your app, but it can’t fix the code running inside it.

Saw an engineer keep scaling a service that refused to start. Infra was fine. Cluster was fine. The real issue was a small Python import error.

This happens a lot. Many DevOps engineers know cloud and Kubernetes well, but get stuck when the failure is inside the application. In modern production, infra skills are only half the job. To keep systems healthy, you need to understand how the app behaves. Not to become a developer. But to debug what actually runs in production.

Key skills that matter:
• Knowing how startup logic and dependencies load.
• Understanding how resource usage links to specific code paths.
• Reading stack traces and logs with confidence.
• Recognizing how concurrency and I/O shape performance.
• Telling infra problems apart from application defects.

Engineers who master both sides stand out fast. They can scale a service, but they can also trace the code and find the real issue. In an AI-driven world, this mixed skill set is essential. Your growth depends on it.
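One habit that makes this kind of failure obvious: import dependencies eagerly at startup and log the full traceback before exiting, so a crash-looping pod shows the real cause in its logs instead of a bare non-zero exit. A minimal sketch (module names and the logger setup are illustrative):

```python
import importlib
import logging
import sys
import traceback

logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger("startup")

# Hypothetical dependency list; stdlib names stand in for real packages.
REQUIRED_MODULES = ["json", "sqlite3"]

def load_dependencies(modules=REQUIRED_MODULES):
    """Import dependencies up front so a missing package fails once,
    loudly, at startup instead of mid-request."""
    for name in modules:
        importlib.import_module(name)

def main():
    try:
        load_dependencies()
    except ImportError:
        # The full traceback is what you want to see in `kubectl logs`,
        # instead of an opaque CrashLoopBackOff.
        log.error("startup import failed:\n%s", traceback.format_exc())
        sys.exit(1)
    log.info("dependencies loaded; starting service")

if __name__ == "__main__":
    main()
```

With this in place, "scale it harder" versus "fix the import" becomes a one-line log read.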
-
𝑾𝒉𝒆𝒏 𝑲𝒖𝒃𝒆𝒓𝒏𝒆𝒕𝒆𝒔 𝒇𝒂𝒊𝒍𝒔: 𝑻𝒉𝒆 𝒂𝒏𝒂𝒕𝒐𝒎𝒚 𝒐𝒇 𝑶𝒑𝒆𝒏𝑨𝑰 𝑮𝒍𝒐𝒃𝒂𝒍 𝑶𝒖𝒕𝒂𝒈𝒆 😱

When OpenAI’s services suffered a five-hour global outage on December 11, the ripple effects highlighted the fragile underpinnings of the tech industry’s most advanced #AI infrastructure. Millions rely on AI products like #ChatGPT for everything from creative work to customer support, yet a small tweak caused those systems to break catastrophically.

The trouble began when OpenAI rolled out a new observability (telemetry) service to its #Kubernetes clusters, which span regions to support the backbone of its AI infrastructure. The #k8s data plane operates largely independently of the control plane, but DNS relies on the control plane. The configuration change introduced with the telemetry service unexpectedly generated massive Kubernetes #API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.

𝐏𝐨𝐬𝐭-𝐦𝐨𝐫𝐭𝐞𝐦 𝐑𝐂𝐀 🔎
📍 The telemetry service consumed more resources (memory/CPU), especially in large clusters; no precautions were taken to assess Kubernetes API server load
📍 The rollout process monitored service health but lacked sufficient cluster health monitoring
📍 DNS caching kept answering requests for a brief period, until the time-to-live (TTL) for the #DNS cache expired

𝐑𝐞𝐦𝐞𝐝𝐢𝐚𝐭𝐢𝐨𝐧 ⚒️
✅ Cluster size was scaled down to reduce the aggregate Kubernetes API load
✅ Network access to Kubernetes admin APIs was blocked, allowing the "rogue" telemetry deployment to be brought down
✅ Kubernetes API servers were scaled up to handle pending requests

𝐖𝐡𝐚𝐭 𝐜𝐨𝐮𝐥𝐝 𝐡𝐚𝐯𝐞 𝐛𝐞𝐞𝐧 𝐝𝐨𝐧𝐞 𝐛𝐞𝐭𝐭𝐞𝐫 🤔
📌 More cautious rollouts, stress testing the cluster in a staging environment to capture any nuances
📌 Chaos testing to build resiliency at the cluster level
📌 De-coupling the dependency between the control plane and data plane
📌 Improved caching and rate limiting at the DNS level

This outage is a good case study for #engineers to think through the high availability, reliability, and resilience of any Kubernetes cluster before onboarding new services!

Check out the full report here -> https://lnkd.in/gC4afrAB

#devops #sre #outages #infrastructure #platform #engineering #cloud
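The DNS caching detail is the subtle part: cached records keep resolving after the control plane is already overwhelmed, so the outage only becomes visible as TTLs expire. A toy illustration of that mechanic (not OpenAI's system; the class, names, and TTL are made up):

```python
import time

class TtlCache:
    """Toy DNS-style cache: entries keep answering until their TTL expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # name -> (value, expiry timestamp)

    def put(self, name, value):
        self.store[name] = (value, self.clock() + self.ttl)

    def resolve(self, name, upstream_alive):
        value, expiry = self.store.get(name, (None, 0))
        if self.clock() < expiry:
            return value           # cache hit: works even if upstream is down
        if upstream_alive:
            self.put(name, value)  # refresh from upstream (stubbed out here)
            return value
        raise LookupError(f"cannot resolve {name}: upstream down, TTL expired")
```

With a 30-second TTL, lookups made 10 seconds after the "control plane" dies still succeed; the same lookup 40 seconds in fails. That lag is exactly why the rollout looked healthy at first.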
-
One of the nice features of AWS is Service Control Policies (SCPs). They allow you to define policies for AWS services, and every time someone tries to create something new, AWS checks whether it complies with the rules you’ve defined.

For example:
• EBS volumes must be encrypted
• It’s forbidden to create an S3 bucket in Singapore
• and so on

The problem is that it’s somewhat hard to keep track of this, especially when introducing a new policy. Even if you’re very careful and apply it only to a few accounts, it can still be a recipe for trouble.

However, when combined with CloudWatch, you can create a metric filter that turns every event where an action was denied due to an SCP into a metric. Then, you can create a CloudWatch alarm with anomaly detection, and get notified whenever there’s an unexpected deviation.

In about an hour of work, you can get excellent visibility into what’s happening across your organization.
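The metric filter piece might look roughly like this CloudFormation sketch. It assumes CloudTrail is already delivered to a CloudWatch Logs group (the log group and namespace names are illustrative, and the exact denial message wording varies by service, so test the pattern against your own events):

```yaml
Resources:
  ScpDenyMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: org-cloudtrail       # illustrative log group name
      # SCP denials typically surface as AccessDenied errors whose
      # message mentions the service control policy.
      FilterPattern: '{ ($.errorCode = "AccessDenied*") && ($.errorMessage = "*service control policy*") }'
      MetricTransformations:
        - MetricNamespace: Org/Governance  # illustrative namespace
          MetricName: ScpDenied
          MetricValue: "1"
          DefaultValue: 0
```

Point a CloudWatch anomaly-detection alarm at the `ScpDenied` metric and you get paged on deviations rather than on every individual denial.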
-
🔒 The ReadOnlyAccess AWS Managed Policy is not as low-risk as some people think 🔒

While this has been shared many times before, it's crucial to reiterate that AWS managed policies like ReadOnlyAccess often include permissions that might not be suitable, especially for third-party access.

Here are some examples of permissions to watch out for:
📊 athena:GetQueryResults, GetQueryExecution: Can retrieve results of sensitive queries from Amazon Athena.
🗄️ dynamodb:Query, GetItem, GetRecords, Scan: Allows access to data stored in DynamoDB tables.
🔑 ec2:GetPasswordData: Can retrieve Windows password data for EC2 instances.
📂 s3:GetObject: Grants access to read objects in S3 buckets, potentially exposing private data.
✉️ sqs:ReceiveMessage: Can read messages from SQS queues, possibly revealing sensitive information.

These permissions, among others, can potentially expose critical data. Each organization is different, but it's essential to carefully evaluate these permissions before assuming this or other policies are risk-free.

For identifying permissions that might expose data, credentials, or allow privilege escalation, I recommend using aws.permissions.cloud

Check out all the permissions for the ReadOnlyAccess policy here: https://lnkd.in/dHzHubsF
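One common mitigation, sketched below: keep ReadOnlyAccess for convenience but attach an explicit Deny for the data-reading actions listed above (an explicit Deny always wins over an Allow in IAM evaluation). This is a starting point, not an exhaustive list:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDataReadsFromReadOnly",
      "Effect": "Deny",
      "Action": [
        "athena:GetQueryResults",
        "athena:GetQueryExecution",
        "dynamodb:GetItem",
        "dynamodb:GetRecords",
        "dynamodb:Query",
        "dynamodb:Scan",
        "ec2:GetPasswordData",
        "s3:GetObject",
        "sqs:ReceiveMessage"
      ],
      "Resource": "*"
    }
  ]
}
```

Attach it alongside ReadOnlyAccess on third-party roles, then audit with aws.permissions.cloud or IAM Access Analyzer to catch anything else your organization considers sensitive.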