We've just emerged from a week of hell in which Azure and Microsoft have completely lost my trust. It's raised a lot of questions about Azure and Marketplace and came very close to making front-page news.
There are obviously certain details that I can't talk about but I'll say this. The upper echelons at Microsoft were made fully aware of the damage they were causing and the impact that our five-day outage was having on several very large players and also on hundreds of individuals. They were completely ineffectual and did nothing to resolve the situation.
Image by dexmac from Pixabay
A Word About Billing
I don't think I've talked about how bad Microsoft's billing systems are, so it's worth spending a little time here. I've dealt with billing from hundreds of companies over the years but nothing has ever approached the complete obscurity of their billing.
It's not just the big things that are obscure either. Even when you obtain a small pay-as-you-go service, such as a cloud PC, they refer to it as a 3 year reservation, even though you've not signed up for three years. When you get a product from the Azure Marketplace, it's just billed as "Azure Marketplace" with no mention of which product it is.
Microsoft billing makes it more or less impossible to be sure what you're paying.
How it all Started
Some years back, we had a manager who liked to trial services without documenting them. He'd often forget to cancel these services and they'd eventually start charging us. There are lots of managers like this and while it's not the way I like to do things, it's not necessarily a major character flaw. If Microsoft's billing were clearer, it wouldn't be a problem at all.
When he left the company, we were left with Microsoft bills coming from all directions with no useful descriptions on them simply because Microsoft doesn't label their invoices properly.
We began a multi-year project to discover and close down the unused services. We engaged a few contractors and several people at Microsoft, most of whom were unable to make sense of their own bills. Little by little we got those services removed (we still have at least two that Microsoft can't identify).
One of the more effective clean-ups occurred a little over a year ago when we spent months on Teams calls with Microsoft and involved several of their people. We got a lot of services removed then, though as it turned out, removed and cancelled are two very different things.
It's not unusual for us to have services with delayed payments at Microsoft because we're always having to get them to explain their charges, something that could be avoided with clearly written invoices.
The Cause
Last Friday, we detected the shutdown of an external service, "SendGrid", which had been enabled through Azure Marketplace. I can't say more on this other than that the absence of this service had the potential to affect payrolls around the country.
We raised calls with both Microsoft and Twilio SendGrid. Each blamed the other, but both also highlighted billing as a cause. We spent Friday and Saturday going through all of our unpaid Microsoft invoices and paying everything by credit card. Microsoft's payment systems are a little problematic and we ended up paying for a few things twice.
Nevertheless, we quickly ended up with a payments panel full of green-tick icons.
You can imagine our surprise when despite our efforts and Microsoft's assurances, the service remained suspended on Monday.
We spent the remainder of Monday and half of Tuesday in a blame game with Microsoft. They kept blaming SendGrid and SendGrid kept blaming them. They asked us to provide all kinds of files to see if we had a technical glitch on our side (despite us telling them that it was clearly a billing issue and that they were wasting valuable time).
We were also able to send them very clear screenshots and files which made it clear that the problem was on Microsoft's end. We had a billing person assigned to our call and they were reluctant to involve any technical people, so we ended up going in circles even though we escalated to their bosses.
By late Monday, we had a Priority 1 case but they still told us that their technicians were too busy to be engaged. We sent them several screenshots but it wasn't until late Tuesday that they agreed to a Teams session to actually look at the problem.
The Problem but not the Fix
It turned out that when Microsoft had done a review of our bills many months back, they'd somehow turned off visibility of a bill. That bill, for a paltry $18, was outstanding but not visible to us. We couldn't pay what we couldn't see. Microsoft spent some time trying to make the bill visible but ultimately couldn't. In fact, they couldn't even produce the invoice for us and I have doubts that the bill actually existed.
We were obviously willing to pay but Microsoft waived that fee. After all, the combined 'lost wages' of everyone engaged on the problem, plus the productivity and reputational losses we were suffering were considerably higher.
All good right? Wrong. The service remained stubbornly suspended.
Not Quite a Fix
By Wednesday, with the organisation in panic mode and the problem having been pushed higher within Microsoft via Twitter and LinkedIn, we were able to talk to more senior people but they were uninterested in our plight.
It had been determined that Microsoft's billing issue had resulted in a service being disconnected and that the service in question (SendGrid) could not be reconnected. The option was there but it was greyed out. We could see a simple file change that would resolve the problem but Microsoft was still unwilling to engage the right technical people - despite us having paid support.
We'd also found out from SendGrid that, while they couldn't reactivate our existing service, we could create another, but they had no facilities to migrate the 100+ templates from one system to another. There's essentially no backup and transfer for that service. (Shame on you, SendGrid.)
We have some pretty capable people in our team and they spent Wednesday rebuilding the service on SendGrid (without any connection to Marketplace) and manually moving templates by copying and pasting HTML. There was no further contact with Microsoft - it was obviously too hard for them.
We got up and running again by the end of Wednesday but our faith in Microsoft is gone and I can't see us ever using their Marketplace again.
Even today, our service remains suspended. (On the plus side, I pointed out to Microsoft that their error message spelled "subscription" wrong and they at least fixed that.)
The Past is the Future
We moved this particular service from Domino to Azure in 2018. In the 17 years it ran on Domino, it racked up a total outage of 5 hours. In one week on Azure, it racked up 40 hours (and there have been several other Azure outages prior to this).
It doesn't make sense to move the application back to Domino as the life expectancy of the application is drawing to a close, but what we can do is move the external portions, such as SendGrid, back to Domino because it provides a reliable service with proven DR capabilities.
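To put those two outage figures on the same scale, here's a rough back-of-the-envelope sketch that turns them into availability percentages. It uses only the numbers quoted above; the function name, the average year length of 365.25 days and the assumption that the service was expected to be up around the clock are mine, so treat it as illustrative rather than a formal SLA calculation.

```python
# Rough availability comparison using the outage figures quoted in the post.
# Assumptions (not from the post): a 24x7 service and 365.25-day average years.

HOURS_PER_YEAR = 365.25 * 24
HOURS_PER_WEEK = 7 * 24


def availability(downtime_hours: float, period_hours: float) -> float:
    """Return the percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


domino = availability(5, 17 * HOURS_PER_YEAR)   # 5 hours down in 17 years
azure_week = availability(40, HOURS_PER_WEEK)   # 40 hours down in one week

print(f"Domino, 17 years:   {domino:.4f}% available")
print(f"Azure, outage week: {azure_week:.1f}% available")
```

On those assumptions, the Domino years work out to roughly 99.9966% availability, while the outage week on Azure comes in at about 76.2%. The Azure figure covers that single week only, not the whole period since 2018.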
It also makes it harder to recommend Azure as a platform in the future.
My advice: Think hard before you trust bigger, cloud-based services. Don't trust their ability to engage in a crisis, don't trust their customer care or their billing and certainly don't trust their DR capabilities.