Live Weather & Live Traffic Server Monitoring - Auto Reset For Outages or Degradation

Wouldn’t a group at Microsoft be responsible for the day-to-day, online aspects of MSFS? From what we’ve seen so far, it seems like no one (or group) is responsible for this important function.

This applies to every feature that relies on remote connections and servers, where someone needs to monitor server activity and operational status:

-Live air traffic
-Live weather
-Multiplayer
-Streaming scenery data (Bing?)
-Streaming Photogrammetry data
-Login Authentication/Authorization
-The Marketplace
-The ones I’ve missed

I mean not on the Development side (Asobo), but on the Production side (Microsoft), for all the functions listed above that rely on online services or data being provided by servers. These make up such an important part of the Sim, yet it seems like there’s no one responsible for this aspect.

Wouldn’t a group be responsible for tracking status of the online services? Notifying the appropriate party to initiate a repair, communicating to the community, following up and staying on top of problems?

Internally, do you think there’s a dashboard or some software that shows the status of the various MSFS servers and online data sources in one place? That they may get an alert that a particular server is down or service is degraded?

It seems that Microsoft should know that a server or an online data source has failed without us having to tell them. Advanced monitoring tools should notify someone who can take action to restore it, or automate the reset process. I’m sure Microsoft has the ability to do so.

Makes me wonder with all of the Associates they have involved with MSFS on a daily basis, wouldn’t someone notice a server outage? Or notice a degradation of performance? For instance no Live Air Traffic. And if they do notice it, can they report it and get it fixed?

I’m just curious about this. What do yooze guys and gals think?

8 Likes

I can only speak to my experience as an IT Operations guy for way too long. What you wrote is my job (but for real airplanes). Typically, development teams will set up monitoring for the “whole application”. This means in our case: “can you start the game and fly” or “no one can start the game and fly”. Deeper level features can be monitored of course, but that’s not something I typically see until something breaks, or until most of the bigger bugs are worked out and the team has time to set up these deeper level feature monitors.

Probably not the answer you wanted, but it’s what I’ve seen 95% of my 25+ year career.

Ultimately, we’re entering new territory where a game/sim depends on many external interfaces ALL working at the same time to create the immersion promised to us. Most games are tied to one company, whereas MSFS is connected to weather, flight data, Bing maps, and so on.

It’ll get better, but we have to continue to bring these issues and outages to Asobo’s attention. Please create Zendesk tickets.

7 Likes

The sim is deployed on an all-encompassing platform called Game Stack.

Note the list of sub-servicers at the bottom of the screen.

In general, if aspects of the game (or the game itself) are unavailable, the two places to look are XBox Live Status, and Playfab Status (which matches the sub-servicers list above). A partial outage in either one can be tied to potential sim unavailability or service degradation.
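
To illustrate the idea that a partial outage in any one upstream can degrade the sim, here is a minimal sketch that rolls several component statuses up into one overall status. The status names and the severity order are my own assumptions for illustration; they are not taken from the Xbox Live or PlayFab status pages, and nothing here fetches real data.

```python
from typing import List, Mapping

# Hypothetical status labels, ordered from least to most severe.
SEVERITY = {
    "operational": 0,
    "degraded": 1,
    "partial_outage": 2,
    "major_outage": 3,
}


def worst_status(component_status: Mapping[str, str]) -> str:
    """Return the most severe status across all components.

    An empty mapping is treated as "operational". An unknown status
    label raises KeyError rather than being silently ignored.
    """
    return max(
        component_status.values(),
        key=SEVERITY.__getitem__,
        default="operational",
    )


def affected(component_status: Mapping[str, str]) -> List[str]:
    """Return the names of components that are not fully operational."""
    return sorted(
        name
        for name, status in component_status.items()
        if status != "operational"
    )
```

Fed something like `{"Xbox Live": "operational", "PlayFab": "partial_outage"}`, the roll-up reports the worst of the two, which matches the "check both places" advice above.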

Beyond that, I have no further information.

6 Likes

Thank you very much! I appreciate the thoughtful, detailed reply.

1 Like

Great info! Very cool. Thank you very much.

We use Nagios at work. Various feeds come in to it, flagging if a server becomes unresponsive (not responding to pings) or a service dies (a TCP port is no longer accepting connections). We know when a service has gone down before our end users do.
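
For anyone curious what the "is the TCP port still accepting connections" check amounts to, here is a minimal Python sketch. It is not how Nagios does it internally, and the function names and the (name, host, port) tuples are made up for illustration.

```python
import socket
from typing import Iterable, List, Tuple


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; success means
        # something is listening and accepting connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def services_down(
    services: Iterable[Tuple[str, str, int]], timeout: float = 3.0
) -> List[str]:
    """Return the names of services whose port did not accept a connection.

    Each entry is a hypothetical (name, host, port) tuple.
    """
    return [
        name
        for name, host, port in services
        if not tcp_port_open(host, port, timeout)
    ]
```

A scheduler would call `services_down(...)` every minute or so and raise an alert when the list is non-empty. Note that an open port only proves the process is listening, not that it is returning correct data.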

Perhaps this thread could be re-written as the question “Why does the community need to inform MS/Asobo when a service underpinning the sim has stopped working?”

I realise some of these failures I refer to are not as simple as using ICMP, or attempting to connect to a service, but even so.

It’s also fair to say that we on the outside cannot readily tell how proactive they are rather than merely being reactive.

2 Likes

From my perspective as a Network engineer who works for an ISP at a Datacenter in Network Operations, this is likely more of a communications issue than the end user having to report all the issues.

This happens to some extent in every datacenter I am familiar with. It certainly happens in my current workplace. We may have the alarms and be working on the issue, but that does not mean we have actually communicated with our helpdesk/tech support to inform them that there is an issue. When you have literally 1000s of alarms coming in with major systems offline, communication outside of the department becomes secondary. Whether this is right or wrong is an entirely different debate.

You really have 2 issues here, both of which the end user tends to consider as an issue with the server.

The first is not really an issue at the datacenter, and that is general internet connectivity issues that can occur between the end user and the datacenter. These are not monitored for and the datacenter would not know about unless reported by an end user. In some cases the datacenter NOC may be able to reach out to the affected intermediary carriers and assist with resolution, but that is not guaranteed.

The second is actual issues at the datacenter. These are being monitored for by an internal NOC group. More than likely they are known about and being worked on before any end user reports them.

I can easily see where this could create the impression from the outside, that NOC is oblivious and that the customer has to point out all the issues to them.

4 Likes

Yes, not sure whether you meant that literally or not, but I just renamed it.

Thank you for your input. Great to hear from people that actually have real world experience that applies to this topic.

I really appreciate your insight!

1 Like

Thank you very much for the insightful reply!

Given your experience and professional knowledge in this area, do you think that someone (or a group) at Microsoft should be aware of a failure like Live Air Traffic quickly without the Community having to notify them?

I’d like to hear from all others regarding this question as well.

Perhaps they should have written a stipulation into the data provider’s Service Level Agreement that it is their responsibility to notify Microsoft if the service goes down or they can’t provide the data that Microsoft is expecting. That would cover the source of the data.

Also, within Microsoft’s Data Center, they should be monitoring to see if the inbound data falls below an expected volume.

And they should be able to monitor the outbound volume of Live Air Traffic that is being sent out to all users from the Data Center.
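
As a rough sketch of what "alert when the data volume falls below an expected level" could look like, here is a small Python example. The class name, the window size, and the 50% threshold are all assumptions of mine; I have no idea what Microsoft actually measures or how.

```python
from collections import deque
from statistics import mean


class VolumeMonitor:
    """Flag an interval whose volume drops well below the recent average."""

    def __init__(self, window: int = 12, min_ratio: float = 0.5) -> None:
        # Rolling window of recent "healthy" per-interval counts.
        self.samples = deque(maxlen=window)
        # Alert when a count is below this fraction of the baseline.
        self.min_ratio = min_ratio

    def observe(self, count: float) -> bool:
        """Record one interval's count; return True if it looks like a drop.

        No alert is raised until the window is full, so the baseline is
        built from a complete set of samples.
        """
        alert = False
        if len(self.samples) == self.samples.maxlen:
            baseline = mean(self.samples)
            alert = count < baseline * self.min_ratio
        if not alert:
            # Keep alerting intervals out of the baseline so a long
            # outage does not gradually become the new "normal".
            self.samples.append(count)
        return alert
```

The same object could watch either direction: feed it the number of aircraft position updates received from the provider per minute for the inbound side, or the number sent to clients for the outbound side. A real setup would also need to account for normal daily swings in traffic, which this sketch ignores.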

You would think that they would know almost immediately when there’s a problem, even before users notice missing Live Air Traffic in the Sim.

Of course, this applies to all of the data that is needed from external interfaces, some of which I listed in my first post.

Sir, I agree 100%.

Let us take the Live Traffic issue of this last weekend (ending Monday Sep 27, 2021). The reports from users started trickling in, here on the forum. The number of reports increased and a community manager named Ollie or something like that suggested that we needed to upgrade the XBox app to a version that only (as we later found out) existed if you sign up for the insider program. Days went by with no acknowledgement from MS/Asobo.

I tried to ping a community manager and was informed that we cannot contact community managers directly. I think that is odd. Isn’t that why they are here? Are they not the liaison between the companies and us?

If it were not for the help of a forum mod who jumped in, asked the right questions, was able to contact the right people, I bet the Live Traffic would still be broken.

My points are that:

A. Someone from the companies should be a lot more active on these forums.
B. We should have the ability to alert them of issues, just like we can alert the Moderators of forum rule breaches.
C. They should believe, within reason, that when we say there is an issue, that there is an issue. At least acknowledge that they read our reports.
D. The excuse of “It is a weekend” should be abandoned. We are not talking about a flower shop here…these are multi-billion dollar companies.

2 Likes

D. The excuse of “It is a weekend” should be abandoned. We are not talking about a flower shop here…these are multi-billion dollar companies.

Sorry, what? It’s a piece of entertainment software; no one needs to work weekends for a non-critical issue like the spotty live traffic no longer working. It won’t stop you flying and using the sim.

I’m sure they knew very well that the servers or something had gone wrong and just didn’t deem it important enough to fix.

Get some perspective. Contrary to some people’s expectations on these forums, Asobo is made up of humans with families and lives.

2 Likes

My point is that people ARE working on the weekends already. To think otherwise is kinda foolish.

1 Like

Running to the defense of those that don’t need defending.

Humans sometimes decide that they want to work in a job that deems them Essential Personnel. They work weekends, holidays and overtime because their job requires it of them. They get paid for their efforts.

This is big business and downtime is not acceptable. Someone is already on call for server outages. There are Service Level Agreements in place that require specific response times. If someone or some group hadn’t dropped the ball, it would’ve been fixed much quicker.

For one reason or another, they didn’t catch the failure like they should have. Let’s just hope that they learn from their mistake and tighten up the procedures so it doesn’t happen again.

2 Likes

If the underlying issue was in house, absolutely they should and probably did know within a very small time frame. If it was something outside their control, like, say, a bad BGP route out on the internet disrupting connections from some subset of ISPs, it could have been user reports that alerted them.

Important to note that being aware of an issue, and being able to isolate and fix that issue are not the same thing. They could be aware within 5 minutes and it could still take days to fix depending on the root cause.

Either way, in my opinion the issue is not how quickly they were engaged, but rather how long it took them to communicate to the end user that they were aware of the issue. I am sure, like most large companies, even if they know, any statement has to go through PR before being released to us.

3 Likes

@Skedge7226

Excellent post… makes perfect sense.

Thank you!

True, but in this case, the fix occurred just about the same time as start of business on Monday in Europe. It is almost as if something was switched back on.

1 Like

It was somewhat tongue in cheek, but it’s still a valid question, in my opinion.

Communication seems to be a big hang up, but I note there have been recent attempts to address this, with the Thursday updates including a triage section, and the new forum tags indicating whether a bug has been logged or not.

Both are welcome changes.

3 Likes

Apparently some humans don’t realise they’re talking about a $60 piece of entertainment software either…

1 Like

How do we get MSFS monitored by downdetector.com?

I can’t believe these commonsense monitoring and problem resolution processes are not in place at MS already. The tools and best practices have been out there for a long time.

4 Likes