Outages Archive - pingdom.com
https://www.pingdom.com/outages/articles/

Reddit Outage (August 2023) Recap
https://www.pingdom.com/outages/reddit-outage-august-2023-recap/ (Mon, 04 Mar 2024)

On August 2, 2023, Reddit—the self-proclaimed “front page of the internet”—experienced a service outage that impacted many of the site’s logged-in users. Unfortunately for the popular web forum, the outage occurred on the heels of API changes that many in the community viewed as controversial. While the outage didn’t grab as many headlines as the API changes, we can still learn something as we examine the incident.

In this post, we’ll use Reddit’s lightweight incident report and other sources to explore what happened. We’ll also provide three essential tips you can take away from the incident to improve your infrastructure monitoring and incident response.

Scope of the outage

The Reddit incident report shows the outage.

The Reddit service status page for August-October 2023. (Source)

According to its incident report, Reddit began investigating the incident at 13:58 PDT on August 2, 2023. It was marked resolved at 15:15 PDT the same day. Per Reddit’s status page for August 2023, the outage affected the desktop web app, mobile web app, and native mobile app. The vote processing, comment processing, spam processing, and Modmail services were unaffected.

Reddit has over 50 million daily active users (DAUs), and many authenticated users were affected by the incident. According to The Verge, Downdetector reports for Reddit peaked near 30,000 during the incident. Based on sources such as The Verge article, a Variety post, and tweets around the time of the incident, issues reported during the incident included:

  • Blank white screens on page loads
  • Many generic “down” reports
  • Encountering “Our CDN was unable to reach our servers” error messages
  • Services generally working if users were not logged in

What was the root cause of the outage?

“Elevated error rates” is the best pointer to root cause we have available. Of course, that doesn’t tell us precisely what was wrong, and without a detailed post-mortem from the Reddit team, we may never know the technical details. Even the “Our CDN was unable to reach our servers” error is generic enough that it doesn’t give us technical specifics (the same error was reported by Reddit users three years before this incident).

However, we can examine the information we know and think through possible root causes for similar issues. Frankly, this approach can be even more effective for teams looking to improve their availability and address potential gaps in their infrastructure and processes.

According to Himalayas, the Reddit tech stack includes a wide range of components such as:

  • Node.js: A cross-platform JavaScript runtime.
  • Amazon EC2: Compute nodes on the AWS platform.
  • Google Compute Engine: Virtual machines on the Google Cloud Platform (GCP).
  • Nginx: A web server, load balancer, and reverse proxy.
  • Kubernetes: A container orchestration platform.
  • Redis: An in-memory datastore often used for caching, as a database, or as a message broker.
  • Hadoop: A distributed computing framework.
  • PostgreSQL: A popular relational database management system.
  • Amazon Route 53: The AWS DNS service.
  • Pingdom: A website monitoring service we think you might like.
  • Fastly: A content delivery network (CDN).

And that’s far from an exhaustive list of Reddit’s tech stack components! Simply looking at this list, we can generate some ideas of what could create “elevated error rates.” For example, perhaps there was a configuration issue between the CDN (Fastly) and the backend compute nodes running Nginx as origin servers. Perhaps some Kubernetes pods were failing and not restarting due to misconfigured liveness or readiness probes.
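To illustrate the probe scenario, here is what an overly aggressive liveness probe might look like in a Kubernetes pod spec. This is a hypothetical fragment (the paths, port, and values are invented for illustration), not Reddit's configuration:

```yaml
# Hypothetical pod spec fragment. An overly aggressive liveness probe can
# make Kubernetes kill pods that are merely slow, turning a partial
# degradation into restart loops and "elevated error rates."
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 2   # likely too short for a slow-starting app
  timeoutSeconds: 1        # a busy pod can miss this and get restarted
  failureThreshold: 1      # a single blip triggers a restart
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5         # readiness failures remove the pod from rotation
```

Raising failureThreshold and timeoutSeconds, and keeping liveness endpoints cheap and free of downstream dependencies, are common ways to avoid this failure mode.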

Alternatively, a Route 53 tweak could lead to another instance of “it’s always DNS.” Viewing the problem differently, since authenticated users seemed to have experienced issues, an underlying problem with the databases and services used to support identity management could have caused issues.
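We can't verify any of those guesses from the outside, but the one observable symptom (anonymous traffic working while logged-in sessions fail) is easy to probe for. Below is a minimal standard-library sketch; the URLs and cookie are placeholders, not real Reddit values:

```python
import urllib.request
from urllib.error import HTTPError, URLError

def fetch_status(url, cookie=None, timeout=10):
    """Return an HTTP status code for url; 0 means no response at all."""
    req = urllib.request.Request(url, headers={"User-Agent": "uptime-probe"})
    if cookie:
        req.add_header("Cookie", cookie)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code          # a 5xx from the origin is still a response
    except URLError:
        return 0                 # DNS, TCP, or TLS failure

def classify(anon_status, auth_status):
    """Compare anonymous vs. authenticated probes of the same URL."""
    if 0 < anon_status < 400 and (auth_status == 0 or auth_status >= 500):
        return "auth-only failure"     # the pattern seen in this outage
    if anon_status == 0 or anon_status >= 500:
        return "site-wide failure"
    return "healthy"

# Hypothetical usage:
#   anon = fetch_status("https://example.com/feed")
#   auth = fetch_status("https://example.com/feed", cookie="session=...")
#   print(classify(anon, auth))
```

A check like this, run on a schedule, would flag the "works only when logged out" condition long before the Downdetector reports pile up.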

What can we learn from the Reddit August 2023 outage?

Like any outage, one of the most important questions is: What can we learn? Many of the best lessons in tech come through learning from failures. The Reddit outage is another great opportunity to identify key takeaways that can be applied to our projects. Below are our three biggest takeaways from this incident:

Takeaway #1: Do the preparation work it takes to act fast

Reddit has had several headline-grabbing incidents in recent years. Fortunately for the massive online forum, this one wasn’t a hot topic for too long. That was in large part because Reddit solved it quickly.

The ability to resolve incidents quickly depends mainly on preparation. If you wait until an incident begins to start preparing, you’re already too late. Being “fast” when it matters depends on:

  • Effective monitoring: You should know about issues before your users do. The right instrumentation and tooling for effective observability can go a long way here.
  • Automation: Ideally, you’ll want to automate recovery from as many scenarios as practical. High availability and failover technologies, such as advanced load balancing with Nginx, can help teams automate recovery when it counts.
  • A fast-acting incident response team: Try as we might, not everything can be automated. If no one is available to handle an incident and the solution isn’t automated, it isn’t going to be resolved. Your incident response team can make or break your incident response times.
  • Well-tuned alerts: Alert fatigue is a real problem for incident responders. With so many different systems to monitor, and each one capable of sending alerts, knowing what really matters at any given moment can be challenging. Fine-tuning alerts to ensure you set suitable notification thresholds can reduce cognitive load and make it easier for incident responders to do their jobs.
  • Plans: Incident response plans, backup and recovery plans, and even playbooks are all examples of plans and processes you should have before an incident. With these in place, you’ll have a guiding light during the chaos. Without them, your incident responders will be left scrambling.
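Of these, well-tuned alerts are the easiest to sketch in code. One common technique is to require several consecutive failed checks before paging anyone, which filters out one-off blips. This is an illustration of the idea, not how any particular monitoring product implements it:

```python
class FailureStreakAlert:
    """Fire an alert only after `threshold` consecutive failed checks,
    reducing the alert fatigue that wears down incident responders."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streak = 0

    def record(self, check_ok):
        """Record one check result; return True if an alert should fire."""
        if check_ok:
            self.streak = 0
            return False
        self.streak += 1
        return self.streak == self.threshold  # fire exactly once per streak

# A single blip doesn't page anyone; three failures in a row do.
alerts = FailureStreakAlert(threshold=3)
results = [alerts.record(ok) for ok in (True, False, True, False, False, False)]
# results -> [False, False, False, False, False, True]
```

The right threshold is a trade-off: higher values mean quieter pagers but slower detection, which is exactly the tuning exercise described above.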

Takeaway #2: Ping isn’t always enough


An HTTPS check in Pingdom.

In this incident, service was down, but the site wasn’t completely unresponsive. This type of failure mode isn’t exclusive to distributed web apps. Individual servers and even IoT devices can fail similarly, and one of the most common symptoms is ping responding while other services fail.

For very small sites and teams, basic ping monitoring is something, but it isn’t going to catch issues like this. That’s where more advanced service checks, such as checking HTTPS services, come into play. Checking the status of a specific service goes a step further than ping and can help clarify health at the application layer.
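As a sketch of the difference, the check below opens a real TLS connection and inspects the HTTP status instead of just confirming the host answers. The hostname would be your own; a production check would also verify response time and page content:

```python
import http.client
import ssl

def https_check(host, path="/", timeout=10.0):
    """Check application-layer health, not just network reachability."""
    result = {"host": host, "ok": False, "status": None, "error": None}
    try:
        conn = http.client.HTTPSConnection(
            host, timeout=timeout, context=ssl.create_default_context())
        conn.request("GET", path, headers={"User-Agent": "https-probe"})
        resp = conn.getresponse()
        result["status"] = resp.status
        result["ok"] = 200 <= resp.status < 400   # a ping check can't see this
        conn.close()
    except (OSError, http.client.HTTPException) as err:
        result["error"] = str(err)   # DNS, TCP, TLS, or HTTP-level failure
    return result

# Hypothetical usage: https_check("example.com")["ok"]
```

A host that answers ICMP but serves 503s fails this check immediately, which is precisely the failure mode ping monitoring misses.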

Takeaway #3: Consider user journeys


Creation of a transaction check in Pingdom.

With the complexity of modern web applications, even simple application-layer checks might not be enough. User journeys often involve multiple steps that interact with various underlying services. Technically, we can treat these different journeys as transactions we can monitor. 

Monitoring transactions helps teams catch those nuanced issues that can easily slip through the cracks with simple service monitoring. For example, in Reddit’s case, monitoring a transaction to log in and view a page would likely have generated an alert for their incident response team during the August 2023 outage.

Transaction monitoring also helps to keep monitoring user-focused, which can be challenging when there are so many moving parts. For example, in Pingdom, transaction monitoring for a user journey that involves filling out a web form may be constructed with these steps:

  1. Go to a specific URL
  2. Authenticate
  3. Click selections based on an element
  4. Click a submit button based on an element

Creating a basic transaction check for a webform in Pingdom.

If that end-to-end transaction works, then all the underlying services are doing what they need to. If it doesn’t, then we can drill down further armed with knowledge of where the user journey is breaking.
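The value of a transaction check is that the first failing step tells you where the journey breaks. Stripped of any specific tooling, the idea looks like this; the step bodies are hypothetical stand-ins for real HTTP calls or browser automation:

```python
def run_journey(steps):
    """Run named steps in order; stop at the first failure so responders
    know exactly where the user journey broke."""
    for name, step in steps:
        try:
            step()
        except Exception as err:
            return {"ok": False, "failed_step": name, "error": str(err)}
    return {"ok": True, "failed_step": None, "error": None}

# Hypothetical journey mirroring the webform check described above.
def load_page():  pass                 # 1. go to a specific URL
def log_in():     pass                 # 2. authenticate
def pick_items(): pass                 # 3. click selections based on an element
def submit():                          # 4. click a submit button
    raise RuntimeError("form endpoint returned 500")

result = run_journey([("load", load_page), ("login", log_in),
                      ("select", pick_items), ("submit", submit)])
# result -> {'ok': False, 'failed_step': 'submit', 'error': 'form endpoint returned 500'}
```

The report pinpoints the broken step, so responders start troubleshooting at the submit endpoint instead of re-verifying the whole stack.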

How Pingdom can help

A dashboard displaying metrics for a transaction check in Pingdom.

Pingdom® is a simple end-user experience monitoring platform that enables teams to respond quickly to incidents. Teams can use Pingdom to monitor services from over 100 locations, communicate with users via a public-facing status page, and implement synthetic and real-user monitoring (RUM).

Teams can also create detailed checks based on transactions that mimic user journeys in a web application. In addition to building the checks based on web elements yourself, Pingdom offers a transaction recorder that reduces the technical knowledge required to create complex checks.

To try Pingdom for yourself, claim your free 30-day trial today!

Chase Outage (July 2023) Recap
https://www.pingdom.com/outages/chase-outage-july-2023-recap/ (Wed, 14 Feb 2024)

On July 25, 2023, Chase Bank customers who use the Zelle payment network experienced an outage that lasted for nearly a day. The outage came months after a headline-grabbing outage involving Zelle and Bank of America in January 2023. What is interesting about these Zelle outages is that they provide us with some high-profile examples of the risks and challenges of services with complex dependencies. They also offer a useful case study on the importance of “mean time to innocence” (MTTI) from a technical (not blame!) perspective.

In this post, we’ll take a closer look at the outage, gleaning three key takeaways for teams responsible for infrastructure uptime and end-user experience.

Scope of the outage

The Chase/Zelle outage lasted nearly a day, impacting Zelle transactions initiated by Chase customers. Given the nature of Zelle, this outage directly impacted end users’ ability to conduct financial transactions in the middle of a work week.

Here’s a breakdown of the associated timeline of events:

  • The incident began around 10 a.m. ET on Tuesday, July 25, 2023.
  • Zelle tweeted in the early afternoon that the issue was with Chase, not the Zelle payment network.
Message from Zelle support indicating a Chase outage.
  • The incident was still unresolved at 10 p.m. ET, but reports of issues on Downdetector had dropped significantly by then.
  • Complaints continued into Wednesday, July 26, 2023, but the outage seemed to be resolved that day.
  • Chase confirmed no other services were affected.

What was the root cause of the outage?

Chase and Zelle both acknowledged that the issue was on Chase’s side. Chase did not identify the underlying cause, but we can rule out Zelle infrastructure and external networks with this information.

Peter Tapling, a former Early Warning executive, and Richard Crone, CEO of Crone Consulting LLC, were quoted in a related Yahoo! Finance article offering insight into how the combination of modern payment-processing services and legacy banking systems may have led to the outage. Tapling noted that modern payment infrastructure like the Federal Reserve’s FedNow and The Clearing House Real-Time Payments (RTP) typically have network-level resilience. However, core banking infrastructure isn’t typically as modern. It’s also not easy to upgrade or replace.

That means that modern peer-to-peer (P2P) payment apps like Zelle have a lot going on under the hood that could fail. In addition to the standard complexities in maintaining a high-traffic, distributed system, a Zelle transaction also depends on the availability of bank data. If any of the pieces involved in checking balances, validating information, or completing the transaction fails, the system is effectively “down” from the perspective of the end user.

What can we learn from the outage?

This outage is a helpful case study in MTTI, dependencies, and communication. Even if you don’t work in fintech, there are plenty of valuable insights here. Let’s consider our top three takeaways from the incident.

Takeaway #1: Map your dependencies


An application dependency map from SolarWinds® Server & Application Monitor (SAM).

In the world of payment processing, where decades-old backend systems and modern payment networks all come together to reconcile a transaction, dependencies can create unexpected failures and debugging challenges. While massive payment networks have unique nuances, the issue of dependency risk isn’t exclusive to fintech.

Even in a relatively simple, modern web application, the end user experience may involve dependencies on components such as:

  • Web server (such as Apache or Nginx)
  • A database
  • An identity provider
  • Network connectivity
  • A content delivery network (CDN)

To mitigate risk and enable effective troubleshooting, teams should map all the dependencies involved in delivering service to an end user. Additionally, your monitoring strategy should account for behavior from the end user’s perspective. If your end users are in Los Angeles, uptime measured from Chicago or New York won’t mean much.
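A dependency map doesn't have to start as anything fancier than a data structure. The sketch below (the component names are invented for illustration) shows how even a flat map lets you compute which user-facing services are at risk when one component fails:

```python
# Hypothetical dependency map: each service lists the components it
# depends on, so a failure can be traced to everything it affects.
DEPENDENCIES = {
    "checkout-page":     ["web-server", "identity-provider", "orders-db", "cdn"],
    "web-server":        ["network"],
    "orders-db":         ["network"],
    "identity-provider": ["network"],
    "cdn":               [],
    "network":           [],
}

def blast_radius(component, deps=DEPENDENCIES):
    """Return every service that directly or transitively depends on
    `component` -- i.e., everything at risk if it goes down."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for svc, reqs in deps.items():
            if svc not in affected and (component in reqs or affected & set(reqs)):
                affected.add(svc)
                changed = True
    return affected

# sorted(blast_radius("network")) ->
#   ['checkout-page', 'identity-provider', 'orders-db', 'web-server']
```

Keeping a map like this current (ideally generated from real telemetry rather than maintained by hand) is what turns "everything is down" into "the network is down, and these four services are downstream of it."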

Takeaway #2: Get to (technical) “mean time to innocence” quickly


A graph of an HTTPS check in Pingdom®.

For a brand in the payments industry where trust and reliability are essential, it’s easy to understand why Zelle wanted to clarify that the issue was NOT with their network. Given that Chase owns part of Zelle’s parent company, Early Warning Services, maybe even Chase was incentivized to make that clear.

Nonetheless, finger-pointing isn’t a sound site reliability engineering or infrastructure management practice. So, you won’t see us advocating for blaming a person or team. However, identifying a problem’s technical root cause is an essential aspect of incident response and service restoration.

From that perspective, MTTI is critical. Incident responders need to leverage variable isolation to understand where to focus their energy and restore service. That starts with clear indicators of current service health.

Synthetic monitoring with checks of specific protocols, transaction monitoring that mimics specific user journeys, and status pages can be beneficial here. If a check is “green”, responders can quickly move on to the next suspect service or component to continue troubleshooting.
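In code, that isolation step amounts to running per-component checks and seeing which ones come back red. A toy sketch, with invented components and hard-coded check results standing in for real probes:

```python
def isolate_fault(checks):
    """Given (component, check_fn) pairs, return the components that fail.
    Green checks let responders rule a component out immediately -- the
    technical side of mean time to innocence."""
    suspects = []
    for component, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False          # a crashing check counts as a failure
        if not ok:
            suspects.append(component)
    return suspects

# Hypothetical results: DNS and the CDN answer, but the origin does not.
checks = [
    ("dns",      lambda: True),
    ("cdn",      lambda: True),
    ("origin",   lambda: False),
    ("database", lambda: True),
]
# isolate_fault(checks) -> ['origin']
```

With three components cleared in one pass, responders spend their limited time on the origin rather than re-litigating DNS.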

Takeaway #3: Design out large-scale failures

A silver lining during the outage was that only Chase users were affected, and Zelle payments were the only affected feature. That means the overall system and interconnections between services were designed to limit the impact of this incident to a relatively narrow scope. While that didn’t help the affected users directly, it does mean that the overall negative impact was isolated.

It also means there were potential workarounds for users during the outage, such as using a different transfer method or a different bank. While these workarounds certainly aren’t ideal, they’re better than nothing.

Teams looking to learn from this outage should consider how they can design their systems to reduce the blast radius of any particular system failure. Ideally, this should emerge naturally in a microservices architecture that embraces loose coupling, but that isn’t always true. Accounting for variables such as network connections, cloud providers, and DNS can also be tricky.

To ensure you’re balancing risk and effort appropriately, be intentional about understanding dependencies, potential failure modes, and acceptable downtime. Then, test your assumptions. Wherever practical, reduce the risk of one system failure causing another system to go offline. To ensure you’re being realistic with your assumptions, consider leveraging chaos engineering to inject faults and see what breaks in a test environment.
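One concrete way to reduce blast radius, in the loose-coupling spirit described above, is a circuit breaker: after repeated failures, callers stop hammering a broken dependency and serve a fallback instead. A minimal sketch of the pattern (not any specific library's API):

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after `max_failures` consecutive
    errors, so one component's outage doesn't exhaust retries, threads,
    or connection pools upstream."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, don't pile on
            self.opened_at = None      # half-open: give fn another chance
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

Note how the outage is contained: callers degrade to the fallback (a cached response, a "feature unavailable" notice) instead of going down with the dependency, much like Chase's other services staying up while Zelle transfers failed.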

How Pingdom can help

Creating a transaction check in Pingdom.

Pingdom is a simple but powerful website monitoring tool that can help teams quickly understand the health of their web applications. With transaction monitoring, you can map out specific user journeys. And if they break, you can drill down fast. With monitoring from over 100 locations and support for multiple checks, Pingdom can help you reduce MTTI and MTTR, improve root cause analysis, and increase uptime and service quality in production.

To try Pingdom for yourself, claim your free 30-day trial today!

Coffee Meets Bagel Outage (August 2023) Recap
https://www.pingdom.com/outages/coffee-meets-bagel-outage-august-2023-recap%c2%a0/ (Thu, 25 Jan 2024)

On August 27, 2023, CoffeeMeetsBagel (CMB)—a popular dating app—services went down in one of the more extensive outages of the year. Users couldn’t log in to the app, and services remained unavailable for over a week. Given CMB’s previous history of technical issues and the extent of the outage, the incident became a significant customer service fiasco for the company. 

In this article, we’ll use CMB’s FAQ and other sources to unpack the outage details. Then, we’ll look at three key takeaways you can learn from the incident to help improve your infrastructure monitoring and business processes. 

Scope of the outage 

The CoffeeMeetsBagel status page shows the outage started in the last week of August 2023. (Source: CoffeeMeetsBagel)

According to the CoffeeMeetsBagel status page, the outage began on August 27, 2023, and lasted just over a week until September 3, 2023. During the outage, users could not sign in or use the application. While we don’t have a precise count of users affected, CMB hit 10 million users in 2019, so the impact of the downtime was certainly not narrow. 

The immediate effect of the outage was CMB users being unable to use the app to find a match and set up dates. For several days after the outage, issues such as missing chats, fewer “bagels” in the matching system, and missing “boosts” remained. During and after the outage, users took to forums like Reddit to complain, inquire about status, and discuss alternatives to the platform.  

Additionally, recent history fueled customer concerns about application reliability and security. The dating site had been impacted by previous headline-grabbing events, such as a 2019 data breach, so user frustration was compounded by concerns that the app has had too many technical challenges.

Root cause of the outage 

A threat actor deleted CMB data and files. While we don’t have all the details, this was clearly an incident caused by a malicious actor rather than a system failure, a configuration error made by a legitimate user (such as Facebook’s 2021 outage), or a vaguely defined “technical issue” (like Instagram’s 2023 outage).  

According to Himalayas, the dating service uses multiple languages and frameworks, including Python, PHP, Go, and Java. It also stores data with Redis, PostgreSQL, Cassandra, and other popular services. Of course, an application can tie those different components together in many ways that a threat actor could exploit. Unfortunately, it’s not clear from the information available exactly how CMB systems were compromised in this case.   

Based on the official FAQ stating CMB “quickly re-established a secure environment for [its] technology team to restore [its] production service,” it seems plausible a threat actor compromised an account or service critical to maintaining CMB production services.  

What can you learn from the outage? 

The CMB outage is another opportunity for IT teams to learn from incidents that impact other organizations. Here are three key takeaways from the outage you can use to improve your processes and uptime.  

Lesson #1: Emphasize all phases of the incident response life cycle 

The NIST incident response life cycle phases. (Source: NIST SP 800-61r2)

Incidents like the CMB outage remind us to review incident response basics like the incident response life cycle. Using NIST’s Computer Security Incident Handling Guide as a reference, the phases of the life cycle are: 

  • Preparation 
  • Detection and analysis 
  • Containment, eradication, and recovery 
  • Post-incident activity  

During the CMB outage, the recovery aspect of the life cycle was where users felt the most pain. For an app with millions of users, a week of service disruption is crippling. Teams should ensure they can quickly restore services if an incident takes them offline. Or, to put it another way: Test your backup and recovery plan! 

Of course, what qualifies as a “quick” restoration of services is fuzzy. That’s where thinking deeply about your recovery time objectives (RTOs) and recovery point objectives (RPOs) comes into play.  
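As a trivial worked example (the numbers are invented), the relationship between backup cadence and those objectives can be sanity-checked directly: worst-case data loss is one full backup interval, and worst-case downtime is the restore time you actually measured in a drill:

```python
def meets_objectives(backup_interval_h, restore_time_h, rpo_h, rto_h):
    """Sanity-check a backup plan against RPO/RTO (all values in hours).
    Worst-case data loss = one backup interval; worst-case downtime =
    the restore time measured in a real recovery drill."""
    return {
        "rpo_met": backup_interval_h <= rpo_h,
        "rto_met": restore_time_h <= rto_h,
    }

# Nightly backups and a 4-hour tested restore vs. a 24h RPO / 8h RTO:
# meets_objectives(24, 4, 24, 8) -> {'rpo_met': True, 'rto_met': True}
```

The key input is restore_time_h, which you only know if you've tested the recovery plan; a week-long restoration like CMB's suggests that number was never measured.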

Additionally, effective detection can reduce the time a threat actor has to do damage. For effective detection, organizations turn to tools such as intrusion detection systems (IDS), security information and event management (SIEM) platforms, and endpoint detection and response (EDR) solutions.

While detection and recovery often drive headlines, it’s also important to execute well in the other life cycle phases. Root cause analysis and lessons-learned exercises are common post-incident activities that can drive organizational changes to reduce the risk of repeat issues. Similarly, activities in the preparation phase—like training, simulations, and vulnerability scans—can help teams mitigate risks before a threat actor exploits them.

Lesson #2: Store (or don’t store!) data wisely 

Fortunately, no payment data was compromised during the CMB outage, in part because the dating platform uses third-party payment processors and does not store payment data itself. Using a secure third party is often an easy decision for businesses that need to accept payments online.

However, there is a more general lesson here: Storing data comes with a risk proportional to the data’s sensitivity.  

Organizations operate in an environment where data is the new gold. As a result, storing sensitive data can lead to increased negative impact in the event of a breach. Reduce the risk of sensitive data exposure by ensuring your teams are intentional about data classification and retention. To take the intentionality even further, determine if there is data your organization doesn’t even need to store in the first place.  

Lesson #3: Make it right with your users 

If you’re running a business, things will occasionally go wrong. How you engage your users after an incident is just as important as how you handle the incident itself. In the case of CMB, the company provided active premium and mini subscribers with a free 14-day extension to compensate for the outage. Ideally, this helped CMB retain some users who would have otherwise walked away. 

Another way to make it right with your users is to be transparent in your communications. Looking at comments on posts related to the incident on the CMB subreddit, we see that tech-savvy, highly invested users particularly value transparency, and they can often be the loudest voices of discontent. Despite CMB being a dating site, commenters call out site reliability engineering and web development issues as they speculate on the root cause.

If you have a highly technical user base, remember that their expectations for your communication during an outage may be higher than the average consumer’s. Here are a few ways you can boost transparency during and after an outage:

  • Maintain a status page and update it at least every four hours.
  • Let users know if their data was compromised. 
  • Explain what happened and what you’re doing to prevent it in the future. 

How Pingdom can help 

A real-user monitoring (RUM) experience dashboard in Pingdom. (Source) 

SolarWinds® Pingdom® is a simple and scalable end-user experience monitoring platform that enables teams to detect problems so they can respond to them quickly. With Pingdom, you can monitor services from over 100 locations using synthetic and real-user monitoring. In the event of an extended outage, Pingdom’s public status page makes it easy for teams to provide users with up-to-date information about service status.  

To try Pingdom for yourself, claim your free 30-day trial today!

Instagram Outage (May 2023) Recap
https://www.pingdom.com/outages/instagram-outage-may-2023-recap/ (Thu, 10 Aug 2023)

One of Meta’s biggest assets, Instagram, experienced an outage on May 21, 2023, that left users unable to access the social media platform. While outages are never ideal, the social media giant reacted quickly and restored service in about two hours.

Limited details have been made available regarding the underlying cause of the outage. However, there’s still plenty to discuss and learn from the incident. In this post, we’ll examine what happened and what you can learn from the outage.

Scope of the outage

Much of the data around the outage came from users reporting availability issues on social media and website availability trackers. The Verge reported on the outage, receiving correspondence from a Meta spokesperson. The outage started just after 6:00 p.m. ET and ended around 7:30 p.m. Instagram Comms tweeted at 8:19 p.m. ET to confirm the outage was over.

The Verge cited over 175,000 user reports during the peak of the outage. With over 2 billion monthly Instagram users worldwide, the outage likely impacted significantly more people than those who reported issues.

What was the root cause of the outage?

Meta is well known for sharing details about its engineering efforts, often on its public-facing engineering site. It has even posted about the causes of outages, such as when it cited a router configuration change as the cause of a 2021 outage. However, the cause of the Instagram outage in May 2023 was described simply as a “technical issue.”

That wording suggests we can rule out external root causes such as DDoS attacks. And the fact that other Meta services weren’t affected—as they were when Facebook, Instagram, and WhatsApp all went down simultaneously in 2021—implies the issue didn’t involve a shared data center or network backbone, which would likely have taken other Meta services down too.

Unfortunately, we don’t have a detailed analysis to lean on as we did with the Roblox Halloween outage, so we can’t derive a root cause after eliminating those possibilities.

Instead, let’s look at the Instagram tech stack to understand (in theory) what might have gone wrong.

The Instagram architecture and tech stack

In a world where serverless and microservices get a lot of attention, Instagram and Facebook buck that trend with an architecture that HAMY Labs describes as “service-based monoliths.” In short, Instagram uses a monolith for its core app logic and discrete services for specialized workloads like machine learning and video encoding.

Given that Instagram publishes more data—in one day—than what’s contained in the Library of Congress, we have compelling evidence that Instagram’s monoliths can scale. This also means there is probably some risk of the core app logic breaking in a way that causes a large-scale service disruption. Instagram uses a Python- and Django-based tech stack to build its core application. So, many common web app issues—such as an incorrect path in a Django URL pattern or an incorrectly defined view class—could lead to a service disruption.
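To make the URL-pattern risk concrete, here is a toy, framework-free router (not Instagram's or Django's actual code; the routes are invented) showing how one bad pattern can 404 an entire class of pages:

```python
import re

# A toy URL router. In a monolith, every route flows through one table
# like this, so a single bad pattern can take down a whole page class.
ROUTES = [
    (r"^/$",         "home"),
    (r"^/p/(\d+)/$", "post_detail"),
    # A typo such as r"^/p/(\d+)$" (missing trailing slash) would 404
    # every post URL generated with the slash -- an outage in miniature.
]

def resolve(path):
    """Return the view name for path, or '404' if nothing matches."""
    for pattern, view in ROUTES:
        if re.match(pattern, path):
            return view
    return "404"

# resolve("/p/42/") -> 'post_detail'; resolve("/p/42") -> '404'
```

Django's URL resolver works on the same match-in-order principle, which is why route changes in a monolith deserve the same review scrutiny as schema migrations.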

What can you learn from the outage?

While we can’t say what specifically caused this Instagram outage, the incident does help us reinforce some important lessons in observability. Let’s look at our top three takeaways from the incident.

1. Prioritize monitoring of the end-user experience

Given the broad social media response, the Instagram outage clearly impacted end users. Hopefully, user reports weren’t the first way that Instagram heard of the outage. What’s the lesson? Make sure you can detect user-impacting issues before your users need to tell you about them.

Getting this right means knowing what and how to monitor your infrastructure. There is a lot you could monitor. However, regardless of your underlying architecture and tech stack, the end-user experience is what ultimately matters.

Your monitoring and observability strategy should account for where your users are located and how they access your applications. For example, network issues in Europe could create a service outage for all your European users while North American users are unaffected. If you’re only monitoring metrics and uptime from North America, the issue may go undetected until the complaints from Europe start rolling in.

Similarly, users of your website might be unaffected by a bug that breaks the native mobile app. If you’re only monitoring the web app, then you’ll be caught by surprise.

Two key tactics for proactive monitoring with the end-user experience in mind are:

  1. Real-user monitoring (RUM) provides detailed metrics related to how end users interact with your apps and sites—including active sessions, bounce rate, and page views—along with the ability to filter based on criteria like device type, browser, and location.
  2. Synthetic monitoring (SM) simulates user interactions on your site so you can know if something breaks before your users do. With SM, you can be notified once a workflow is slow or unresponsive so you can quickly respond and resolve the issue. You can also create reports for users and customers demonstrating compliance with service-level agreements (SLAs).

2. Assume outages will occur

Despite our best efforts, outages happen. Outages occur for tech giants like Instagram and small IT teams alike. While you should design your infrastructure to be as resilient as is reasonable (there is always a point of diminishing returns), contingencies are essential. Therefore, you should set expectations for stakeholders and the teams responsible for maintaining your infrastructure about how you’ll handle the inevitable outage.

Defining an error budget is an often-overlooked practice, but it helps build a culture that proactively accounts for outages. SLAs and service-level objectives (SLOs) should lead to the establishment of an error budget. An error budget defines the maximum time a service can fail without breaching service commitments. Ensure all stakeholders understand this, and then optimize your processes and infrastructure to stay under budget. The power of error budgets is that they foster a realistic approach to reliability, balancing user experience with engineering effort in a way that makes business sense.
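The arithmetic behind an error budget is simple: the budget is whatever slice of time your SLO leaves over. A minimal sketch, where the SLO values and the 30-day window are illustrative examples:

```python
def error_budget_minutes(slo, period_days=30):
    """Allowed downtime, in minutes, for a given availability SLO over a period.

    Example: a 99.9% SLO over 30 days leaves (1 - 0.999) * 43,200 minutes,
    i.e. about 43 minutes of acceptable downtime.
    """
    total_minutes = period_days * 24 * 60
    return (1.0 - slo) * total_minutes
```

Tracking how much of this budget an incident consumed turns "we had an outage" into a concrete, stakeholder-friendly number.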

3. Communicate when something goes wrong

Uncertainty can frustrate and confuse end users. When a service is down, they might not know if the issue is with their network or your application. A quick check of comments related to outages for platforms like Instagram will remind you of just how true this is. The problem is that when there is a significant outage, teams are often more focused on resolving the issue than on outbound communication.

In addition to outbound communication (like tweets or emails to users), public-facing status pages are a great way to inform your users about an ongoing incident. A status page can display key service metrics and clarify your services’ current status. While outbound communications are still important, status pages give users a source of truth that can remove ambiguity when issues occur.

How Pingdom can help

Pingdom® is a simplified end-user experience monitoring platform that makes it easy to monitor what matters when it comes to user-facing applications. With Pingdom, you can implement synthetic and real-user monitoring from a single pane of glass. With the ability to monitor services from over 100 locations across the globe, you can easily diagnose application performance regardless of your users’ regions. Additionally, Pingdom offers a public status page feature so you can keep your users informed about service availability.


Real-user monitoring (RUM) experience dashboard in Pingdom. (Source)

To try Pingdom for yourself, claim your free 30-day trial today!

The post Instagram Outage (May 2023) Recap appeared first on pingdom.com.

Twitter Outages – Why Is Twitter Going Down So Often Recently? https://www.pingdom.com/outages/twitter-outages/ Mon, 07 Aug 2023 20:07:06 +0000


Twitter has been in the news a lot lately. It’s one of the biggest internet companies in the world, so the public notices right away when its servers are down, and outages can affect its brand reputation, stock price, revenue, and user base. In recent months, users have noticed more Twitter outages and service disruptions than usual.

While it’s impossible to avoid outages, looking at why Twitter may be experiencing issues can be a good learning experience for all web development teams. In this post, we’ll cover some of the most recent Twitter outages and help you understand why they occur and how they could have been prevented.

Significant Twitter outages

First, let’s look at some of the most recent and significant Twitter outages.

December 28 outage

On December 28, 2022, Twitter users could not use the service for almost five hours. Instead of their timelines, many users saw an error message.

Some users also reported seeing “Rate Limit Exceeded” when they tried to access Twitter during this time, which indicates Twitter’s servers could not handle the number of incoming requests, leading to a complete service disruption. Even users who could access the website reported extremely slow load times and frequent connection issues.

Twitter reported upgrading its back-end server architecture shortly after the outage to optimize speed and performance.

January 23 outage

On January 23, Android users reported they could not post tweets, and some reported tweets weren’t loading for them at all. Users saw the message, “Oops, something went wrong.” It appeared loading and sending tweets weren’t working as intended in Twitter’s Android app.

February 8 and February 18 outages

A similar problem where tweets wouldn’t load also occurred on February 8. Users reported seeing error messages and were unable to post.

Another outage occurred roughly ten days later when timelines and replies on Twitter threads started breaking.

March 6 outage

On March 6, 2023, Twitter again went down for a few hours; users couldn’t use the site normally and experienced problems accessing links, images, and videos. Thousands were affected by this outage, and people in some regions reported the website was slower than usual.

Compared to 2022, when Twitter experienced nine service disruptions for the entire year, the frequency of Twitter outages has increased significantly during the past eight months. According to NetBlocks, an organization devoted to tracking internet outages, Twitter experienced at least four widespread outages in February 2023.

Potential causes of recent Twitter outages

Many of these outages happened because of architecture changes in Twitter’s APIs, configuration changes in its back-end systems, and routine server updates. Let’s look at each cause in detail.

Internal configuration changes

The outage on March 6, 2023, was caused by a configuration change made by an engineer attempting to fix a separate issue. The change touched only one part of the system, a stand-alone API. So how did a slight misconfiguration bring an entire service down? Because the effect of the change wasn’t isolated within Twitter’s back-end system, it escalated to other services, eventually causing a multi-hour outage affecting millions of users.

The effects of internal configuration changes can be isolated by refining the back-end architecture. If your back-end system consists of discrete microservices communicating with each other, a change in one microservice shouldn’t bring another down. Even in a monolithic architecture, process separation and proper error handling can contain the blast radius of an internal configuration change.
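One common way to keep a failing dependency from cascading is a circuit breaker: after repeated failures, callers stop hitting the dependency and serve a fallback instead. A minimal sketch, where the class and thresholds are illustrative, not Twitter’s actual implementation:

```python
class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive errors,
    short-circuit calls and return a fallback instead of cascading."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, func, fallback):
        if self.failures >= self.max_failures:
            return fallback  # circuit open: stop hitting the failing dependency
        try:
            result = func()
            self.failures = 0  # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            return fallback
```

The point of the pattern is containment: the caller stays up and serves degraded data while the broken dependency recovers, instead of every request piling onto it.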

External integrations

Another possible cause of Twitter outages is external integrations. Twitter integrates many third-party APIs and services for some of its features; integrations let users share YouTube videos and Xbox gameplay clips and send tweets directly to Slack, Discord, and ServiceNow. But if your system is too tightly coupled to third-party integrations, you run a higher risk: a bug in a third-party service shouldn’t disrupt your whole system, yet this appears to be what happened to Twitter on February 8.

Poorly maintained code

As you scale to your next million users, your codebase will grow immensely in size and complexity. The scale at which Twitter operates is beyond imagination. Picture this: tweets are constantly fetched, content is dynamically generated for millions of users, threads are loading up with messages, video is streaming worldwide, and more. However, having a perfect codebase to handle this might be impossible, which is where routine refactoring can help.

Over the last few years, Twitter has continuously evolved its architecture, scale, and features. The codebase is mind-bogglingly large, which contributes to the complexity of feature updates and routine maintenance. But failing to update and maintain code leads to errors and unanticipated side effects.

Routinely refactoring the codebase, especially its more complex pieces, is essential. Maintaining code quality prevents unknown errors and bugs from surfacing and causing a total outage, and it helps you iterate and update your codebase faster.

Brittle APIs

APIs are the pillars of back-end services in any system, and you must handle exceptions properly when developing and designing them. If an API is too brittle, it can easily break things and cause a service outage. The Twitter outage on March 1, 2023, reportedly resulted from an issue in the timeline API. Due to this issue, the timeline stopped working altogether, causing tweets and replies to break and disappear.
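A defensive pattern against brittle endpoints is graceful degradation: if a handler raises, return a degraded-but-valid response rather than letting the failure propagate to every caller. The decorator and the timeline handler below are hypothetical illustrations, not Twitter’s API:

```python
import functools

def degrade_to(fallback):
    """Decorator: if a handler raises, log the error and return a degraded
    response instead of propagating a 500 to every caller."""
    def wrap(handler):
        @functools.wraps(handler)
        def inner(*args, **kwargs):
            try:
                return handler(*args, **kwargs)
            except Exception as exc:
                print(f"handler degraded: {exc}")
                return fallback
        return inner
    return wrap

# Hypothetical timeline endpoint whose upstream store is unavailable:
@degrade_to(fallback={"tweets": [], "degraded": True})
def get_timeline(user_id):
    raise TimeoutError("upstream timeline store unavailable")  # simulated failure
```

Callers that receive `{"tweets": [], "degraded": True}` can render an empty timeline with a notice instead of breaking entirely.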

Preventing outages

The best defense against outages is a strong back-end architecture and system design, with robust and reliable APIs forming the stable backbone of a disruption-free system. However, there will always be edge cases, flukes, errors, and bugs you can’t plan or test for, and they can cause an outage later. You and your development team must be the first to know about those cases. You want to avoid learning about problems from your users or reading about them on a social media site. There are plenty of availability and performance monitoring tools you can easily integrate to help you understand your system better.

Pingdom® is one such tool, designed to alert you about potential problems and detect issues that could later result in downtime. It provides actionable insights into your application’s uptime and performance you can use to prevent a potential outage from occurring.

Why preventing outages matters

The recent outages experienced by Twitter show the effect outages can have on even a large tech company and serve as a warning to developers and engineers. However, the varying causes of Twitter’s recent outages aren’t unusual; they’re common problems many development and engineering teams face daily.

The overall lesson from these incidents is that back-end systems must be designed from the start to tolerate internal configuration changes and changes in external integrations. The code must also be well designed and built around robust APIs. Investing in strong monitoring tools is as important as investing in a reliable technical architecture. In today’s fast-paced digital world, even a brief outage can have significant consequences for businesses and their customers. The organizations that survive and thrive are those taking proactive steps to prevent outages and help ensure their services remain reliable and accessible.

This post was written by Siddhant Varma. Siddhant is a full-stack JavaScript developer with expertise in front-end engineering. He’s worked with scaling multiple startups in India and has experience building products in the ed-tech and healthcare industries. Siddhant has a passion for teaching and a knack for writing. He’s also taught programming to many graduates, helping them become better future developers.

The post Twitter Outages – Why Is Twitter Going Down So Often Recently? appeared first on pingdom.com.

Microsoft Outage (Jan. 2023) Recap https://www.pingdom.com/outages/microsoft-outage-jan.2023-recap/ Thu, 13 Jul 2023 13:30:47 +0000


Change comes with risk. This is as true for a one-person IT department as it is for a cloud giant. And earlier this year, a Microsoft outage demonstrated how high-performing teams aren’t immune to change-induced service disruptions.

Less than two weeks after a Windows update unexpectedly deleted icons and shortcuts from users’ PCs, Microsoft experienced a different change-induced issue. This time it was related to their own infrastructure. Specifically, a WAN update took down multiple popular services belonging to Microsoft 365. The downtime lasted for several hours on January 25, 2023.

This article will unpack what happened during this outage and why. You’ll learn what you can do to reduce the risk and impact of outages in your environments.

Scope of the outage

The Microsoft outage—which is logged under Microsoft service incident number MO502273—impacted multiple services, including these popular applications:

  • Teams
  • Exchange Online
  • Outlook Online
  • SharePoint Online
  • OneDrive for Business
  • Power BI
  • Microsoft Graph
  • Microsoft Intune
  • Microsoft Defender for Identity

Microsoft Store and Xbox Live apps that depend on Azure servers were also affected.

With Microsoft Teams alone serving over 280 million monthly users, it’s safe to say the outage had a widespread impact.

Overall, the incident lasted nearly seven and a half hours. However, the broadest impact occurred during a 90-minute window between the beginning of the incident and the rollback of the change. Here’s how the timeline came together:

  • 07:31 UTC: The Microsoft 365 Status Twitter account tweets that they are investigating issues impacting multiple Microsoft 365 services.
  • 08:15 UTC: A networking issue is identified as the likely cause of the outage.
  • 09:26 UTC: Microsoft announces that a change suspected to have caused the outage has been rolled back.
  • 14:31 UTC: Microsoft confirms that impacted services are recovered and stable.

Root cause analysis

As with many outages, some wondered if this one was the result of a cyberattack. Others questioned if the outage was related to the recent large-scale layoffs at the tech giant. In a statement given to BBC, Microsoft indicated that a cyberattack was not involved. Instead, it determined the root cause to be a WAN update. Rolling back the update resolved the issue.

WAN and BGP

Further analysis showed that the network issues caused by the update were related to the rapid readvertising of Border Gateway Protocol (BGP) router prefixes, leading to significant packet loss. Let’s take a step back to unpack BGP.

BGP is the routing protocol for internet traffic, facilitating communication between autonomous systems (ASes). An AS is a group of routable addresses under a single routing policy belonging to one organization. ASes use BGP—specifically external BGP (eBGP)—to advertise routing information to their peers, build routing tables, and make decisions about path selection. All of these are important processes that keep the internet running.

After the WAN update, multiple network prefixes that Microsoft ASes were advertising were withdrawn and readvertised several times. The cycle of advertisements impacted route selection on other ASes and led to a flood of traffic to systems that couldn’t keep up, resulting in large-scale packet loss.
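Routers mitigate exactly this kind of churn with route-flap damping (described in RFC 2439): each withdraw/re-advertise cycle adds a penalty to a prefix, the penalty decays exponentially over time, and prefixes whose penalty exceeds a threshold are suppressed. The toy model below illustrates only the mechanism; the constants are simplified, and real implementations track reuse limits and per-peer state:

```python
class FlapDamping:
    """Toy model of BGP route-flap damping: each flap adds a penalty, the
    penalty halves every `half_life` seconds, and a prefix whose penalty
    exceeds SUPPRESS_LIMIT is suppressed (not advertised onward)."""

    PENALTY_PER_FLAP = 1000
    SUPPRESS_LIMIT = 2000

    def __init__(self, half_life=900):
        self.half_life = half_life
        self.penalty = 0.0
        self.last_update = 0.0

    def _decay(self, now):
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record one withdraw/re-advertise cycle at time `now` (seconds)."""
        self._decay(now)
        self.penalty += self.PENALTY_PER_FLAP

    def suppressed(self, now):
        self._decay(now)
        return self.penalty >= self.SUPPRESS_LIMIT
```

A single flap stays below the limit, but rapid repeated flaps like those in the Microsoft incident would cross it, which is why peers may stop propagating an unstable prefix until it settles down.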

Learning from the outage

Even if you don’t operate IT infrastructure at the same scale as Microsoft, you can learn plenty of good lessons from this incident. Here are our top three takeaways:

Takeaway #1: Expect complex systems to have issues.

IT Ops is hard. You’re dealing with complex dependency chains, dynamic workloads, and rapid changes. When dealing with complex systems, you should expect something to go wrong occasionally. The key is to plan for incidents, implementing contingencies that intelligently balance risk.

Microsoft Azure CTO Mark Russinovich made this clear back in 2020 in a post aptly titled “Advancing the outage experience—automation, communication, and transparency”:

We will never be able to prevent all outages. In addition to Microsoft trying to prevent failures, when building reliable applications in the cloud your goal should be to minimize the effects of any single failing component.

Mark Russinovich, CTO of Microsoft Azure

While Russinovich focused on cloud infrastructure in his post, his fundamental idea holds true for most IT infrastructure.

Error budgets are a great way to balance planning for this complexity without overengineering a system or becoming too change-averse. Error budgets place a ceiling on how much failure or downtime can occur without violating agreements. You can’t predict every possible failure, but you can architect to stay within your error budgets and take corrective action if you don’t. This practice can help you strike a balance that makes business sense.

Takeaway #2: Test your updates.

Updates are a regular part of IT life, and patches are often necessary to keep your infrastructure secure and performant. However, updates also come with risks. Microsoft’s outage provides a textbook example of why testing before you roll out an upgrade is important.

Thorough testing of a patch before you deploy it to production is ideal, as it can help you identify major problems before they cause downtime. That’s why testing is a key part of many patching guidelines, frameworks, and best practices. However, you also need to weigh the tradeoff between the thoroughness of your testing and the time it takes for a patch to reach production. Even a National Institute of Standards and Technology (NIST) patching guide acknowledges the tradeoffs between early deployment and more testing.

A QA or staging environment that mirrors production is a great way to test an upgrade before release. Unfortunately, you don’t always have a test environment when you’re working in IT, especially if you’re patching hardware appliances or network infrastructure. In these cases, teams can reduce risk with the following:

  • Incremental rollouts: If the upgrade applies to multiple sites, servers, or appliances, roll it out incrementally to a small subset. Verify the change works as expected, then proceed to the next batch. This limits the risk of an issue impacting your entire deployment.
  • Performance monitoring: Monitor performance after an upgrade to detect anomalies before they become outright service disruptions. This also enables you to verify stability after recovery from an incident.
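The incremental-rollout idea can be expressed as a simple loop: deploy to a small batch, verify health, and halt before the next batch if anything looks wrong. A minimal sketch, where the hostnames and the `deploy`/`healthy` callbacks are placeholders for your own tooling:

```python
def rolling_update(hosts, deploy, healthy, batch_size=2):
    """Deploy in small batches; stop and report at the first unhealthy batch
    so a bad change never reaches the whole fleet."""
    done = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy(host)
        if not all(healthy(host) for host in batch):
            return done, batch  # (verified hosts, failing batch)
        done.extend(batch)
    return done, None  # every batch verified healthy
```

The design choice here is the early return: a failure in batch two means batches three and four are never touched, which caps the blast radius of a bad update.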

Takeaway #3: Have a rollback plan.

Microsoft was able to restore service by rolling back their update. Even with thorough testing in place, you will always have the potential need to roll back a change. If you only take one thing away from Microsoft’s outage, let it be this: Always have a rollback plan.

Ideally, your rollback plan should ensure that you can restore service after a failure while meeting your recovery time objective (RTO) and recovery point objective (RPO). If possible, aim to automate the process of rolling back an update. This will minimize the time it takes to recover. Additionally, for mission-critical infrastructure, you should thoroughly test your rollback plan to make sure it will work when you need it to.
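In its simplest form, an automated rollback is just “apply, verify, revert on failure.” The sketch below is a generic illustration rather than a recipe for any particular deployment tool; `apply_version` and `health_check` stand in for whatever your tooling provides:

```python
def deploy_with_rollback(apply_version, health_check, new, previous, retries=3):
    """Apply a change, verify health, and automatically roll back on failure.

    Returns the version left running after the attempt.
    """
    apply_version(new)
    for _ in range(retries):
        if health_check():
            return new  # change verified healthy; keep it
    apply_version(previous)  # automated rollback keeps recovery time short
    return previous
```

Because the rollback path runs without human intervention, the time to restore service is bounded by your health-check window rather than by how quickly someone notices the failure.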

How Pingdom® can help

Because modern IT is a complex web of dependencies, a change that causes a service disruption might go unnoticed for much longer than you (or your users) can stomach. With multiple checks—including ping, HTTP, and DNS—from over 100 servers across the globe, Pingdom enables you to quickly detect and respond if something goes wrong in production.


Additionally, real user monitoring (RUM) helps teams understand application performance from an end user’s perspective. In the case of the Microsoft outage, Pingdom’s RUM capabilities could help a team identify who is impacted by an outage. You would be able to visually explore performance impacts across regions and adapt your remediation rollout.

The post Microsoft Outage (Jan. 2023) Recap appeared first on pingdom.com.

Average Cost of Downtime per Industry https://www.pingdom.com/outages/average-cost-of-downtime-per-industry/ Mon, 09 Jan 2023 11:32:47 +0000


When I’m online shopping, reading an article, or watching a video and the website becomes unresponsive, it takes me about eight seconds before I refocus elsewhere. From a business’s point of view, when customers lose interest because a server goes down, the lost business can substantially impact revenue.

The average cost of downtime across all industries has historically been about $5,600 per minute, but recent studies have shown this cost has grown to about $9,000 per minute.

Of course, other factors play a role in this, such as:

  • The size of the business: There’s a significant difference in downtime costs between larger and smaller companies. Relatively small businesses’ downtime costs fall in the range of $137 to $427 per minute, whereas for larger businesses, downtime can cost over $16,000 per minute ($1 million per hour) for even a short outage. Some examples include:
    • In March 2015, Apple lost approximately $25 million over an outage lasting 12 hours
    • In March 2019, Facebook had a 14-hour downtime costing them nearly $90 million
    • In August 2016, Delta lost almost $150 million during five hours of downtime
  • Industry vertical: An industry vertical is a collection of companies specialized in a common niche of the market. Examples include finance, government, healthcare, manufacturing, media, retail, and transportation. Because these are higher-risk industries, their average cost of downtime tends to exceed $5 million per hour.
  • Business model: The business model plays a big role, especially whether your storefront is primarily virtual or physical. An e-commerce business that relies on being digitally available to customers is more susceptible to losing revenue and customers from downtime than a store with a physical location. When your entire business is digitally based, any time your website is inaccessible to customers results in lost revenue.

When considering these factors, the full range of downtime costs across all businesses ends up falling between $2,300 to $9,000 per minute.

Calculating Downtime 

If you’re curious about how much your company’s downtime costs are, you can approximate them using the formula:

Downtime Cost = Minutes of Downtime x Cost per Minute

The cost per minute for small businesses lands around $427, whereas, for larger businesses, it’ll be closer to $9,000. While this formula will offer you quantitative data, it’s important to account for the other impacts resulting from outages. Some additional costs of downtime include:

  • Customer Impact: This focuses on damage to the company’s reputation in terms of reliability, accessibility, and quality, as well as customer churn. With infinite options for every need in today’s digital markets, having a reliable platform with minimized downtime is necessary for low customer turnover.
  • Company Productivity: Not only does downtime affect customers, but it also affects the company internally. Outages hinder progress toward organizational goals and ultimately delay the process of reaching results at a timely rate. On a smaller scale, it requires more attention from employees to resolve, which lowers day-to-day productivity.
  • Employee Turnover: When a company has an outage, or frequent outages, employees work twice as hard to get everything back online and communicate with frustrated customers. With every outage, employee pride declines as the reliability of the organization decreases. These factors contribute to lower employee retention, and replacing an employee costs an estimated $15,000.
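Plugging the formula above into code makes the stakes concrete. A quick sketch using the per-minute figures quoted earlier (illustrative averages, not your actual costs):

```python
def downtime_cost(minutes_of_downtime, cost_per_minute):
    """Direct downtime cost, per the formula:
    Downtime Cost = Minutes of Downtime x Cost per Minute."""
    return minutes_of_downtime * cost_per_minute

# A hypothetical one-hour outage at the quoted rates:
large_business = downtime_cost(60, 9000)  # $540,000
small_business = downtime_cost(60, 427)   # $25,620
```

Remember that this captures only the direct cost; reputation damage, lost productivity, and employee turnover come on top of it.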

Cost of Downtime per Industry

About 98% of organizations claim a single hour of downtime costs over $100,000. Looking at each industry’s breakdown, we’ll find out if this holds true.

In the IT industry, downtime is typically calculated at about $5,600 per minute. Depending on the company’s size, the full range of its lost revenue spans from $145,000 to $450,000 per hour.

In the auto industry, downtime costs rise to about $50,000 per minute, which translates to about $3 million per hour. Nearly 70% of the time, downtime issues can be attributed to people being unaware of their equipment’s maintenance or update requirements; another example of why it’s crucial to catch outages before they happen.

In the manufacturing industry, the cost of downtime is approximately $260,000 per hour. Manufacturers experience approximately 800 hours of downtime every year due to maintenance, tool breaks, adjustments, etc. Because so much of the business relies on technology running smoothly, the repercussions can be costly when you’re unable to catch it ahead of time.

In the enterprise industry, the cost of downtime is valued at over $1 million per hour and can reach $5 million, excluding fines or penalties. When taking into consideration server outages specifically, the average cost of downtime can go as high as $300,000 per minute due to lost business, productivity disruptions, and remediation efforts. In the past seven years, hourly downtime costs have risen 32% because of how much each industry now relies on its digital marketplace.         

The cost of downtime for a few more industries include:

  • Media at $90,000 per hour
  • Health care at $636,000 per hour
  • Retail at $1.1 million per hour
  • Telecommunications at $2 million per hour
  • The energy industry at $2.48 million per hour

And one of the most expensive downtime costs is in the brokerage service industry, which clocks in at $6.48 million per hour.

No matter what industry you’re in, downtime is expensive. In an increasingly virtual world, SolarWinds™ Pingdom® enables you to deliver exceptional customer experiences with enhanced visibility into how consumers interact with your business. Using features like synthetic and real user monitoring, you can catch issues with your digital storefront before they affect customers and gain visibility into how users experience your website. Use Pingdom to avoid downtime and make the most of your business. Download a free trial here.


Sources:

https://www.atlassian.com/incident-management/kpis/cost-of-downtime

https://www.the20.com/blog/the-cost-of-it-downtime/

https://garvey.com/7-downtime-statistics-find-terrifying/

https://www.iiot-world.com/predictive-analytics/predictive-maintenance/the-actual-cost-of-downtime-in-the-manufacturing-industry/

The post Average Cost of Downtime per Industry appeared first on pingdom.com.

Historical Internet Outages: The 12 Most Impactful https://www.pingdom.com/outages/internet-outages-the-12-most-impactful/ Wed, 23 Nov 2022 17:02:21 +0000


An internet outage can have major consequences for a digital business, especially when it happens during peak usage times and on holidays. Outages can lead to revenue loss, complaints, and customer churn. 

Of course, internet outages regularly impact companies across all verticals, including some of the largest internet companies in the world. And they can happen when you least expect them. 

Read on to learn about some of the most impactful internet outages to date and some steps you can take to keep your business out of harm’s way.

Historical Internet Outages You Need to Know About 

1. Amazon Web Services 

Amazon Web Services (AWS) experienced a major outage in December 2021, lasting for several hours. The outage impacted operations for many leading businesses, including Netflix, Disney, Spotify, DoorDash, and Venmo. 

Amazon blamed the outage on an automation error that caused multiple systems to act abnormally. The outage also prevented users from accessing some cloud services.

This outage proved even the largest and most established cloud providers are susceptible to downtime.

2. Facebook 

Facebook also suffered a major outage in 2021, leaving billions of users unable to access its services, including its main social network, Instagram, and WhatsApp.

According to Facebook, the cause of the outage was a configuration change on its backbone routers responsible for transmitting traffic across its data centers. The outage lasted roughly six hours, an eternity for a social network.

3. Fastly 

Cloud service provider Fastly had its network go down in June 2021, taking down several sizeable global news websites, including the New York Times and CNN. It also impacted retailers like Target and Amazon, and several other organizations.

The outage resulted from a faulty software update stemming from a misconfiguration, which caused disruptions across multiple servers.

4. British Airways 

British Airways experienced a massive IT failure in 2017 during one of the busiest travel weekends in the United Kingdom. 

This event created a nightmare scenario for the organization and its customers. Altogether, it grounded 672 flights and stranded tens of thousands of customers.

According to the company, the outage ensued when an engineer disconnected the data center’s power supply. A massive power surge came next, bringing the business’s network down in the process.

5. Google

Google had a major service outage in 2020. It only lasted about forty-five minutes, but it still impacted users worldwide. 

Services including Gmail, YouTube, and Google Calendar all crashed. So did Google Home apps. The outage also impacted third-party applications using Google for authentication.

The issue happened due to inadequate storage capacity for the company’s authentication services.

6. Dyn

Undoubtedly, one of the biggest distributed denial-of-service (DDoS) attacks in history occurred in 2016 against Dyn, a major DNS provider.

The attack occurred in three waves, overwhelming the company’s servers. As a result, many internet users were unable to access partnering platforms like Twitter, Spotify, and Netflix. 

7. Verizon Fios

Verizon had a major internet outage in January 2021, which disrupted tens of thousands of customers along the East Coast.

While the internet outage lasted only about an hour, Verizon experienced a sharp drop in traffic volume. Naturally, many customers complained about the loss of service. 

At first, the company reported the incident was the result of someone cutting fiber cables. However, the actual cause was unrelated: a “software issue” during routine network maintenance activities.

8. Microsoft 

Another major internet outage occurred at Microsoft when its Azure service went down in December 2021. Azure’s Active Directory service crashed for about ninety minutes.

Compared to some other outages, this one was relatively small. Nonetheless, it prevented users from signing in to Microsoft services such as Office 365. Although applications remained online, users couldn’t access them, making this a major productivity killer for many organizations worldwide.

9. Comcast

There was an internet outage at Comcast in November 2021, which happened when its San Francisco backbone shut down for about two hours.

Following the outage, a broader issue occurred, spanning multiple U.S. cities, including hubs like Philadelphia and Chicago. Several thousand customers lost service, leaving them unable to access basic network functionality during the height of the pandemic. 

10. Akamai Edge DNS

Akamai, a global content delivery provider, experienced an outage with its DNS service in 2021. The Akamai outage resulted from a faulty software configuration update that triggered a bug in its Secure Edge Content Delivery Network.

In a similar fashion to other service provider outages, Akamai’s incident caused widespread damage. Other websites—including American Airlines, Fox News, and Steam—all experienced performance issues following the incident.

11. Cox Communications

Cox Communications reported a major internet outage in March 2022, impacting nearly seven thousand customers in the Las Vegas region. 

The problem resulted from an NV Energy backhoe damaging a transmission line and triggering a power event. The surge caused cable modems to reset, and many customers tried to reconnect simultaneously. As a result, it took several hours for service to resume.

12. Slack

The Slack outage in January 2021 created havoc for distributed workers who rely on the platform for communication and collaboration.

The platform’s outage impacted organizations across the US, UK, Germany, Japan, and India, with interruptions occurring for about two and a half hours. Slack says the issue came from scaling problems on the AWS Transit Gateway, which couldn’t accommodate a spike in traffic.

Best Practices for Avoiding Internet Outages

At the end of the day, there’s nothing you can do to prevent outages entirely, especially if your business relies on multiple third-party systems. Eventually, your company or a partner will experience some level of service disruption. It’s best to plan for outages and, where possible, enable systems to “fail gracefully.”

As part of your resiliency planning, here are some steps to mitigate damage, maximize uptime, and keep your organization safe, along with some best practices to help you avoid disruptions from network and connectivity issues. 

Set Up a Backup Internet Solution

It’s impossible to protect your business from local internet outages completely. They can stem from issues like local construction, service disruptions, and more. 

Consider setting up a backup internet solution as a workaround, so you never lose connectivity. For example, you may choose to combine broadband with a wireless failover solution.

Consider a Multi-Cloud Strategy

If your business is in the cloud, it’s a good idea to explore a multi-cloud strategy. By spreading your workloads across multiple cloud providers, you can prevent cloud service disruptions from knocking your digital applications offline. This approach can also improve uptime and resiliency.

Use Website Performance and Availability Monitoring

One of the best ways to protect your business is to use website performance and availability monitoring. It provides real-time visibility into how end users are interacting with and experiencing your website.

A robust website performance and availability monitoring solution can provide actionable insights into the health and stability of your website. As a result, you can track uptime and performance over time and troubleshoot issues when they occur.
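To make the idea concrete, here’s a rough sketch of the kind of check an availability monitor performs under the hood. This is a hedged illustration only (it isn’t how Pingdom works internally), and the URL and polling interval are placeholders:

```python
import time
import urllib.request
import urllib.error

def check_site(url, timeout=10):
    """Fetch a URL and report whether it's up, plus response time in ms."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except (urllib.error.URLError, TimeoutError):
        status = None  # unreachable, refused, or DNS/TLS failure
    elapsed_ms = (time.monotonic() - start) * 1000
    up = status is not None and 200 <= status < 400
    return {"url": url, "up": up, "status": status, "response_ms": round(elapsed_ms, 1)}

# A monitor would poll on a schedule and record the history, e.g.:
# while True:
#     print(check_site("https://example.com"))  # placeholder URL
#     time.sleep(60)
```

Real monitoring services add what a loop like this lacks: probes from many geographic locations, alerting, and historical trend data.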

The Pingdom Approach to Website Performance Monitoring

SolarWinds® Pingdom® provides real-time and historical end-user experience monitoring, giving your team deep visibility from a single pane of glass. With Pingdom, you can protect against the kind of outages that make headlines for the wrong reasons.

When you’re ready to jump in, try Pingdom by requesting a free trial today.

This post was written by Justin Reynolds. Justin is a freelance writer who enjoys telling stories about how technology, science, and creativity can help workers be more productive. In his spare time, he likes seeing or playing live music, hiking, and traveling.

The post Historical Internet Outages: The 12 Most Impactful appeared first on pingdom.com.

The Roblox Outage https://www.pingdom.com/outages/the-roblox-outage/ Wed, 26 Oct 2022 15:19:00 +0000
Just before Halloween 2021, Roblox engineers experienced a horror story: a service outage that also took down critical monitoring systems. It seemed like the issue was a hardware problem, but it wasn’t. Users were frustrated, and the clock was ticking. After three full days of downtime, service was finally restored on Halloween day.

While the incident itself was an IT nightmare, Roblox’s detailed technical post-mortem several months later was an excellent way to bounce back. The post-mortem was a pleasant, late Christmas gift to users and SysAdmins everywhere.

Thanks to transparency from Roblox, we have an outage case study to learn from. In this post, we’ll summarize the scope and root cause of the outage, explain what other ITOps teams can learn from it, and consider how SolarWinds® Pingdom® can help you reduce the risk of extended downtime in your environment.

The scope of the outage

The outage began on October 28, 2021, and was resolved 73 hours later on October 31. The outage affected 50 million Roblox users. While we don’t have a specific dollar cost for the outage, it was a significant incident for Roblox and HashiCorp.

Roblox uses HashiCorp’s HashiStack to manage its global infrastructure. The stack consists of these components:

  • Nomad: Schedules containers on specific hardware nodes and checks container health.
  • Vault: A secrets management solution for securing sensitive data like credentials.
  • Consul: An identity-based networking solution that provides service discovery, health checks, and session locking.

Roblox was one of HashiCorp’s hallmark customers. Therefore, the Roblox outage impacted the reputation of HashiCorp as well. Fortunately, HashiCorp engineers worked with Roblox to troubleshoot and triage the issue, demonstrating their commitment to customer success even when the going gets tough.

The root cause of the outage

From a technical perspective, two issues together formed the root cause of the Roblox outage:

  1. Roblox enabled a new streaming feature on Consul at a time when database reads and writes were unusually high.
  2. The BoltDB database, used by Consul, experienced a performance issue.

While the above two factors made up the root cause, several other factors contributed to the length of the outage. For example, because Roblox’s monitoring systems depended on the same systems that were down, visibility was limited.

Additionally, the issues were technically nuanced and took time to debug. Several initial attempts to restore service failed, and one of the recovery attempts (adding faster hardware) may have worsened things.

What can you learn from the outage?

Roblox did a great job making the details of the outage public, which gives other ITOps teams plenty to study. The top four lessons we can learn from the Roblox outage include:

Visibility is vital.

In the Roblox outage, there was an unfortunate chicken-and-egg problem. Roblox had monitoring in place, but the monitoring tooling depended on the systems that were down. Therefore, they didn’t have deep visibility into the problem when they began troubleshooting.

For IT operations, there are two takeaways from this. First, make sure you’re monitoring all your critical infrastructure. Second, mitigate single points of failure that could knock your monitoring offline along with your production systems.

Limit single points of failure.

Roblox was susceptible to this outage because they ran their backend services on a single Consul cluster. They’ve learned from the outage and have since built infrastructure for an additional backend data center. Other ITOps teams should consider doing the same for their critical systems. If it’s mission-critical, aim for N+1 at a minimum. Of course, everything is ultimately a business decision. If you can’t justify the cost of N+1 (or better), ensure you’re comfortable with the downtime risk.

Know your tradeoffs.

Roblox chose to avoid the public cloud for its core infrastructure. After the outage, some questioned if that was a wise choice, but Roblox has made a purposeful decision to stay the course. Public cloud infrastructure isn’t immune to outages, and Roblox considered the tradeoffs. They feel that maintaining control of their core infrastructure is more important than a public cloud’s benefits.

For other organizations, the calculations on the tradeoffs will be different. Weighing the tradeoffs to make a decision based on your preferences, expertise, resources, and risk appetite is what’s important.

Learn from your mistakes.

Continuous improvement leads to long-term success. After the outage, Roblox was transparent about what happened and implemented solutions (like the additional backend infrastructure) to prevent repeat failures. Mistakes happen, and no system is perfect. Teams that recognize this and adopt practices like blameless postmortems tend to do better in the long run.

How Pingdom can help

Pingdom is a simple but robust website and web application monitoring platform that provides ITOps teams visibility into availability and performance. As a cloud platform, Pingdom helps you avoid the chicken-and-egg problem Roblox experienced during its outage. Because Pingdom is decoupled from your critical infrastructure, you can limit single points of failure that could take out your servers and monitoring tools at the same time.

Additionally, Pingdom can monitor site availability from over 100 locations across the globe. That means your team can detect performance issues and outages that may only affect users in specific geographic areas. It can also make troubleshooting tough-to-diagnose issues easier because you can immediately isolate symptoms based on region. To see what SolarWinds Pingdom can do for you, sign up for a free 30-day trial today.

The post The Roblox Outage appeared first on pingdom.com.

Internet Availability Threats Following the Russian invasion of Ukraine https://www.pingdom.com/outages/internet-availability-threats-following-the-russian-invasion-of-ukraine/ Mon, 19 Sep 2022 15:09:00 +0000
Since Russia’s invasion of Ukraine on February 24, 2022, denial-of-service (DoS) attacks impacting availability have been rising. The attacks aren’t only affecting Russia and Ukraine either. Public and private organizations in multiple industries have been impacted, and several nations—including the U.S. and U.K.—have issued warnings about cyberthreats from Russia.

A recent report from the Uptime Institute notes some 60% of outages cost at least $100,000, and 15% cost at least $1 million. Falling victim to one of these attacks can significantly impact your bottom line.

In this post, we’ll take a closer look at recent attacks attributed to Russia, how different DoS attacks work, and what you can do to help keep your site available.

What Attacks Have Been Seen in the Wild?

Let’s start by looking at some of the significant availability threats and attacks that have occurred in recent months.

BrownFlood Malicious JavaScript Code

According to CERT-UA#4553, attackers have been embedding malicious JavaScript code in compromised websites (mostly sites running WordPress) to perform distributed DoS (DDoS) attacks against various Ukrainian Government and pro-Ukraine sites. When a visitor accesses a compromised site, this executes the malicious BrownFlood code, and the visitor’s computer subsequently spams the target sites with requests.

Example of a site running BrownFlood code to perform a DDoS attack against target sites.
Source: https://cert.gov.ua/article/39923

Researchers have discovered versions of BrownFlood using Math.random() to generate URIs in their requests, but attackers may modify this approach in the future. Malicious BrownFlood code can be embedded in HTML, JavaScript, and other website assets and is often obfuscated with base64 encoding.

DDoS Attacks Against Ukrainian Banks

The U.K. Foreign, Commonwealth & Development Office and National Cyber Security Centre indicated Russia was likely responsible for DDoS attacks against several Ukrainian banks before the invasion. The attacks occurred February 15–16, 2022, and the report suggests the Russian Main Intelligence Directorate (GRU) was involved.

While the details of the DDoS attacks are unclear, The Register confirmed the websites of PrivatBank and Oschadbank were knocked offline at the time. PrivatBank ATMs and online transactions were also affected.

Attacks Against Russia

Separate from attacks by Russia, there are many “hacktivists” launching attacks against Russia. An NBC News report indicated that DDoS attacks are the most visible way pro-Ukraine hackers are attempting to make an impact. One of the things making DDoS attacks so appealing to hacktivists is they’re relatively technically simple to execute and lead to immediate results.

Politically or Militarily Motivated Attacks

Unfortunately, the Russia-Ukraine War isn’t the first instance of politically or militarily motivated DDoS attacks. As evidence of the evolving digital world, recent history has brought several other examples, including:

  • DDoS attack against Israeli communications providers: In March 2022, a DDoS attack—labeled the largest ever against Israel—against Israeli communications providers took down multiple government websites and led to a state of emergency.
  • DDoS attack against NotePad++ because of their Free Uyghur edition: The popular NotePad++ text editor fell victim to a DDoS attack after releasing its 7.8.1 version, titled the “Free Uyghur edition.”
  • DDoS attack against PopVote: In 2014, before the Hong Kong protests known as the “Umbrella Movement,” the PopVote public opinion voting system was hit by the largest DDoS attack in history at the time.

What’s the takeaway here? While it’s true organizations dealing with finance, government, and critical infrastructure directly related to the Russia-Ukraine War have an increased risk, they’re far from the only ones needing to be concerned.

In fact, The Hacker News reports education, tech, healthcare, and online gaming are all common targets. Additionally, you don’t have to be a large enterprise to be a target. Small and medium-sized enterprises are common targets of threat actors looking to compromise availability.

How Do the Attacks Work?

Now that we know the what, let’s look at the how. Attacks against availability all fall into the category of DoS attacks, but a “DoS attack” can take several different forms, and threat actors use a variety of techniques to implement them.

Let’s cover the three main categories of DoS attacks and how they work.

Volume-Based DoS Attacks

Volume-based DoS attacks are classic DoS. They generate more load (requests) than a server can handle, and as the server tries to keep up with the malicious requests, its bandwidth is exhausted. As a result, legitimate users are unable to access the server.

Common examples of volume-based DoS attacks include:

  • UDP floods: This type of attack exploits how responses to UDP packets work. If no service is listening on a port, the server sends back an Internet Control Message Protocol (ICMP) error response. In a UDP flood, attackers spam random ports, and the resulting error responses consume the server’s bandwidth and lead to denial of service.
  • ICMP (ping) floods: An ICMP echo request (a ping) elicits an echo reply. With an ICMP flood, attackers exploit this behavior by spamming a site with pings to overload it.
  • NTP amplification: Network Time Protocol (NTP) amplification attacks involve spoofing a victim’s IP address in NTP request packets. The attacker sends small NTP requests (such as monlist queries) to NTP servers, but the servers send their much larger replies to the victim’s address, amplifying the attack traffic.

Protocol Attacks

Protocol attacks exploit vulnerabilities in the network and transport layers of the OSI model. Instead of compromising availability by exhausting bandwidth, protocol attacks exhaust processing capacity.

Synchronize (SYN) flood attacks are a great example of a protocol attack. In a SYN flood attack, the attacker sends many TCP SYN packets—these packets are the first step in the TCP handshake—with spoofed IP addresses. The server responds to the spoofed addresses with a SYN-ACK. Because the IP addresses were spoofed, the server never receives the acknowledge (ACK) response to the SYN-ACK. Eventually, if the server is left waiting for too many responses, it may crash or hang.
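The article doesn’t prescribe detection tooling, but on a Linux host one rough way to spot a SYN flood from the defender’s side is to watch for a spike in half-open (SYN_RECV) sockets, for example by parsing /proc/net/tcp. The alert threshold below is an arbitrary placeholder you’d tune for your own servers:

```python
# TCP socket states as hex-encoded in /proc/net/tcp on Linux
SYN_RECV = "03"

def count_half_open(proc_net_tcp_text):
    """Count sockets in the SYN_RECV state from /proc/net/tcp contents.

    A sudden spike in half-open connections can indicate a SYN flood:
    the server has replied SYN-ACK but the final ACK never arrives.
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == SYN_RECV:
            count += 1
    return count

# On a live host, you might poll this and alert past a threshold:
# with open("/proc/net/tcp") as f:
#     if count_half_open(f.read()) > 1000:  # placeholder threshold
#         print("possible SYN flood in progress")
```

Kernel-level mitigations such as SYN cookies handle the flood itself; a counter like this only gives you early visibility that one is underway.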

Layer 7 Attacks

Layer 7 (also known as the application layer) attacks exploit application layer protocols like HTTP to force a server to exhaust its resources. Application layer attacks are particularly tricky to deal with for two reasons:

  1. It’s hard to distinguish legitimate layer 7 traffic from malicious layer 7 traffic.
  2. A relatively small request can trigger a large amount of server resource consumption.

For example, an attacker may send HTTP GET or POST requests to the same API endpoints as legitimate users would. The attacker, however, sends requests at a much higher frequency or with specifically crafted headers and payloads.
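To make the frequency signal concrete, here’s a hedged sketch (not from the article) of a per-client sliding-window counter that flags clients sending requests far faster than a human would. The limits are placeholders you’d tune against your own traffic patterns:

```python
import time
from collections import defaultdict, deque

class RequestRateTracker:
    """Flag clients whose request rate exceeds a per-window limit.

    A crude heuristic for layer 7 floods: requests that each look
    legitimate but arrive at an abnormally high frequency.
    """
    def __init__(self, max_requests=100, window_seconds=10.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client IP -> recent request timestamps

    def record(self, client_ip, now=None):
        """Record one request; return True if the client looks abusive."""
        now = time.monotonic() if now is None else now
        q = self.hits[client_ip]
        q.append(now)
        # Drop timestamps that have aged out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_requests
```

In practice this logic usually lives in a WAF or reverse proxy rather than application code, but the principle, counting requests per client per window, is the same.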

How Can You Protect Your Sites?

Fortunately, there are several ways you can protect sites against availability threats from Russia and beyond. The following tools and techniques can help make your websites more resilient: 

  • Using a load balancer to distribute traffic across multiple servers
  • Using a web application firewall (WAF) to provide traffic filtering and monitoring
  • Applying Geo-IP filtering to block traffic from specific origins
  • Turning off unused or unneeded services, which could otherwise be exploited (for example, ping responses)
  • Using a content delivery network (CDN) to serve static content
  • Implementing CAPTCHA-style challenges to prevent automated attacks
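The list above is deliberately tool-agnostic. As one illustration of the throttling idea behind WAF rules and rate limiting, here’s a minimal token-bucket sketch; the rate and burst values are placeholders, not recommendations:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allow short bursts, cap the sustained rate."""
    def __init__(self, rate_per_sec, burst, now=None):
        self.rate = rate_per_sec            # tokens added per second
        self.capacity = burst               # maximum bucket size
        self.tokens = float(burst)          # start full: a burst is allowed
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Consume one token if available; False means throttle the request."""
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Production systems typically apply one bucket per client IP (or API key) at the load balancer or WAF layer, so one abusive source can’t drain capacity for everyone else.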

Additionally, it’s essential to remember the need for real-time monitoring. Proactive monitoring allows you to quickly detect and respond to anomalies, which can make a world of difference when mitigating the impact of an attack. Specifically, sites should implement the following:

  • Resource monitoring: Real-time monitoring of server resources like CPU and network I/O can help detect anomalies, possibly indicating a DoS attack is in progress. For example, a spike in CPU utilization might be related to a SYN flood. The sooner you detect it, the sooner you can mitigate it.
  • Uptime and performance monitoring: In addition to detecting outages, website monitoring with tools like Pingdom® enables you to detect “brownouts” (degraded performance) before they become “blackouts” (complete outages).
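As a rough sketch of the resource-monitoring idea above (an assumption-laden illustration, not a reference implementation), an anomaly check can be as simple as comparing each new sample against a rolling baseline; the window and threshold are placeholders:

```python
from collections import deque

class SpikeDetector:
    """Flag samples that deviate sharply from a rolling baseline.

    Suitable for coarse alerts on metrics like CPU utilization or
    network I/O, where a sudden jump may indicate an attack in progress.
    """
    def __init__(self, window=60, threshold=2.0):
        self.samples = deque(maxlen=window)  # rolling history of recent samples
        self.threshold = threshold           # alert multiple of the baseline mean

    def observe(self, value):
        """Record a sample; return True if it spikes above the baseline."""
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(value)
        if baseline is None or baseline == 0:
            return False  # not enough history to judge yet
        return value > baseline * self.threshold
```

Dedicated monitoring platforms replace this with far more robust statistics, but the core loop, baseline, compare, alert, is the same.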

Detect Availability Issues Globally and Locally With Pingdom

DoS threats from Russia significantly affect website availability and performance across the internet. The landscape is constantly changing, with attacks and countermeasures forming a constant cat-and-mouse game. With the Pingdom State of the Internet live map, you can stay up to date with a real-time view of website outages around the world.

If you’re interested in going beyond a global view and monitoring your own site’s availability, sign up for a 30-day free trial of Pingdom today!

The post Internet Availability Threats Following the Russian invasion of Ukraine appeared first on pingdom.com.
