Is there any way to determine the reason for an Azure App Service restart?

Problem Description

I have a bunch of websites running on a single instance of Azure App Service, and they're all set to Always On. They all suddenly restarted at the same time, causing everything to go slow for a few minutes as everything hit a cold request.

I would expect this if the service had moved me to a new host, but that didn't happen -- I'm still on the same hostname.

CPU and memory usage were normal at the time of the restart, and I didn't initiate any deployments or anything like that. I don't see an obvious reason for the restart.

Is there any logging anywhere that I can see to figure out why they all restarted? Or is this just a normal thing that App Service does from time to time?

Recommended Answer

So, it seems the answer to this is "no, you can't really know why, you can just infer that it did."

I mean, you can add some Application Insights logging like

    // This goes in Global.asax.cs. It needs System, System.Threading,
    // System.Web.Hosting (for HostingEnvironment), and
    // Microsoft.ApplicationInsights.Extensibility; "log" is whatever logger
    // the app already uses.
    private void Application_End()
    {
        log.Warn($"The application is shutting down because of '{HostingEnvironment.ShutdownReason}'.");

        TelemetryConfiguration.Active.TelemetryChannel.Flush();

        // Server Channel flush is async; wait a little while and hope for the best
        Thread.Sleep(TimeSpan.FromSeconds(2));
    }

and you will end up with "The application is shutting down because of 'ConfigurationChange'." or "The application is shutting down because of 'HostingEnvironment'.", but it doesn't really tell you what's going on at the host level.

What I needed to do was accept that App Service is going to restart things from time to time, and ask myself why I cared. App Service is supposed to be smart enough to wait for the application pool to be warmed up before sending requests to it (like overlapped recycling). Yet, my apps would sit there CPU-crunching for 1-2 minutes after a recycle.

It took me a while to figure out, but the culprit was that all of my apps have a rewrite rule to redirect from HTTP to HTTPS. This does not work with the Application Initialization module: it sends a request to the root, and all it gets is a 301 redirect from the URL Rewrite module; the ASP.NET pipeline isn't hit at all, so the hard work never actually gets done. App Service/IIS then thinks the worker process is ready and sends traffic to it. But the first "real" request follows the 301 redirect to the HTTPS URL, and bam! That user hits the pain of a cold start.

I added a rewrite rule described here to exempt the Application Initialization module from needing HTTPS, so when it hits the root of the site, it will actually trigger the page load and thus the whole pipeline:

<rewrite>
  <rules>
    <clear />
    <rule name="Do not force HTTPS for application initialization" enabled="true" stopProcessing="true">
      <match url="(.*)" />
      <conditions>
        <add input="{HTTP_HOST}" pattern="localhost" />
        <add input="{HTTP_USER_AGENT}" pattern="Initialization" />
      </conditions>
      <action type="Rewrite" url="{URL}" />
    </rule>
    <rule name="Force HTTPS" enabled="true" stopProcessing="true">
      <match url="(.*)" ignoreCase="false" />
      <conditions>
        <add input="{HTTPS}" pattern="off" />
      </conditions>
      <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" appendQueryString="true" redirectType="Permanent" />
    </rule>
  </rules>
</rewrite>

It's one of many entries in a diary of moving old apps into Azure -- turns out there's a lot of things you can get away with when something's running on a traditional VM that seldom restarts, but it'll need some TLC to work out the kinks when migrating to our brave new world in the cloud....

--

UPDATE 10/27/2017: Since writing this, Azure has added a new tool under "Diagnose and solve problems". Click "Web App Restarted", and it'll tell you the reason, usually storage latency or infrastructure upgrades. The above still stands, though, in that when moving to Azure App Service, the best way forward is to coax your app into being comfortable with random restarts.

--

UPDATE 2/11/2018: After migrating several legacy systems to a single instance of a medium App Service Plan (with plenty of CPU and memory overhead), I was having a vexing problem where my deployments from staging slots would go seamlessly, but whenever I'd get booted to a new host because of Azure infrastructure maintenance, everything would go haywire with downtime of 2-3 minutes. I was driving myself nuts trying to figure out why this was happening, because App Service is supposed to wait until it receives a successful response from your app before booting you to the new host.

I was so frustrated by this that I was ready to classify App Service as enterprise garbage and go back to IaaS virtual machines.

It turned out to be multiple issues, and I suspect others will come across them while porting their own beastly legacy ASP.NET apps to App Service, so I thought I'd run through them all here.

The first thing to check is that you're actually doing real work in your Application_Start. For example, I'm using NHibernate, which while good at many things is quite a pig at loading its configuration, so I make sure to actually create the SessionFactory during Application_Start so that the hard work is done up front.
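
For illustration, here's a minimal sketch of what that eager initialization can look like in Global.asax.cs; the static SessionFactory property is a stand-in for wherever your app actually keeps it, and it assumes NHibernate is configured from hibernate.cfg.xml:

using System.Web;
using NHibernate;
using NHibernate.Cfg;

public class Global : HttpApplication
{
    // One ISessionFactory for the lifetime of the app domain.
    public static ISessionFactory SessionFactory { get; private set; }

    protected void Application_Start()
    {
        // Pay the expensive mapping/configuration cost during warmup,
        // so the first real user request doesn't have to.
        var configuration = new Configuration().Configure(); // reads hibernate.cfg.xml
        SessionFactory = configuration.BuildSessionFactory();
    }
}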

The second thing to check, as mentioned above, is that you don't have a rewrite rule for SSL that is interfering with App Service's warmup check. You can exclude the warmup checks from your rewrite rule as shown above. Or, in the time since I originally wrote that workaround, App Service has added an HTTPS Only flag that lets you do the HTTPS redirect at the load balancer instead of within your web.config file. Since it's handled at a layer of indirection above your application code, you don't have to think about it, so I would recommend the HTTPS Only flag as the way to go.
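
For reference, that flag can also be set from PowerShell with the same AzureRm cmdlets the swap script below uses. Treat this as a sketch rather than a tested script, and note that the httpsOnly property requires a newer ApiVersion than the 2015-08-01 used elsewhere here:

Set-AzureRmResource `
    -PropertyObject @{ httpsOnly = $true } `
    -ResourceType "microsoft.web/sites" `
    -ResourceGroupName "YOUR-RESOURCE-GROUP" `
    -ResourceName "YOUR-APP-NAME" `
    -ApiVersion 2018-02-01 `
    -Force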

The third thing to consider is whether or not you're using the App Service Local Cache option. In brief, this is an option where App Service will copy your app's files to the local storage of the instance it's running on rather than serving them off of a network share, and it's a great option to enable if your app doesn't care about losing changes written to the local filesystem. It speeds up I/O performance (which is important because, remember, App Service runs on potatoes) and eliminates restarts caused by maintenance on the network share. But there is a specific subtlety regarding App Service's infrastructure upgrades that is poorly documented and that you need to be aware of. Specifically, the Local Cache is initiated in the background in a separate app domain after the first request, and you're switched over to that app domain when the local cache is ready. That means App Service will hit a warmup request against your site, get a successful response, and point traffic to that instance, but (whoops!) now Local Cache is grinding I/O in the background, and if you have a lot of sites on the instance, you've ground to a halt, because App Service I/O is horrendous. If you don't know this is happening, it looks spooky in the logs, because it's as if your app is starting up twice on the same instance (because it is). The solution is to follow this Jet blog post and create an application initialization warmup page that monitors the environment variable that tells you when the Local Cache is ready. This way, you can force App Service to delay booting you to the new instance until the Local Cache is fully prepped. Here's one that I use, which makes sure I can talk to the database too:

// Requires references to NHibernate and Newtonsoft.Json.
using System;
using System.Net;
using System.Web;
using NHibernate;
using Newtonsoft.Json;

public class WarmupHandler : IHttpHandler
{
    public bool IsReusable
    {
        get
        {
            return false;
        }
    }

    // Assumed to be populated externally (e.g. by a DI-aware handler factory).
    public ISession Session
    {
        get;
        set;
    }

    public void ProcessRequest(HttpContext context)
    {
        if (context == null)
        {
            throw new ArgumentNullException("context");
        }

        var response = context.Response;

        var localCacheVariable = Environment.GetEnvironmentVariable("WEBSITE_LOCAL_CACHE_OPTION");
        var localCacheReadyVariable = Environment.GetEnvironmentVariable("WEBSITE_LOCALCACHE_READY");
        var databaseReady = true;

        try
        {
            // A trivial query to prove the database connection is actually warm;
            // User is one of the app's NHibernate entities.
            using (var transaction = this.Session.BeginTransaction())
            {
                var query = this.Session.QueryOver<User>()
                    .Take(1)
                    .SingleOrDefault<User>();
                transaction.Commit();
            }
        }
        catch
        {
            databaseReady = false;
        }

        var result = new
        {
            databaseReady,
            machineName = Environment.MachineName,
            localCacheEnabled = "Always".Equals(localCacheVariable, StringComparison.OrdinalIgnoreCase),
            localCacheReady = "True".Equals(localCacheReadyVariable, StringComparison.OrdinalIgnoreCase),
        };

        response.ContentType = "application/json";

        // Warm = database reachable, and Local Cache (if enabled) finished copying.
        var warm = result.databaseReady && (!result.localCacheEnabled || result.localCacheReady);

        response.StatusCode = warm ? (int)HttpStatusCode.OK : (int)HttpStatusCode.ServiceUnavailable;

        var serializer = new JsonSerializer();
        serializer.Serialize(response.Output, result);
    }
}

Also remember to map a route for the warmup page and add the application initialization section to your web.config:

<applicationInitialization doAppInitAfterRestart="true">
  <add initializationPage="/warmup" />
</applicationInitialization>
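
The /warmup path itself also has to resolve to the handler. One way to map that route (a sketch; the MyApp namespace and assembly name are placeholders, not from the original) is a handler entry in the same web.config:

<system.webServer>
  <handlers>
    <add name="Warmup" verb="GET" path="warmup" type="MyApp.WarmupHandler, MyApp" />
  </handlers>
</system.webServer>

Note that IIS constructs a handler mapped this way with its parameterless constructor, so the Session property on WarmupHandler has to be wired up separately (for example, via a custom handler factory if you're using a DI container). Once it's wired up, you can sanity-check it by hitting /warmup yourself: while the Local Cache is still copying (or the database is unreachable) it responds 503 with a body along these lines (values illustrative), flipping to 200 once everything is ready:

{"databaseReady":true,"machineName":"RD0003FF1A2B3C","localCacheEnabled":true,"localCacheReady":false}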

The fourth thing to consider is that sometimes App Service will restart your app for seemingly garbage reasons. It seems that setting the fcnMode property to Disabled can help; it prevents the runtime from restarting your app if someone diddles with configuration files or code on the server. If you're using staging slots and doing deployments that way, this shouldn't bother you. But if you expect to be able to FTP in and diddle with a file and see that change reflected in production, then don't use this option:

     <httpRuntime fcnMode="Disabled" targetFramework="4.5" />

The fifth thing to consider, and this was primarily my problem all along, is whether or not you are using staging slots with the AlwaysOn option enabled. The AlwaysOn option works by pinging your site every minute or so to make sure it's warm so that IIS doesn't spin it down. Inexplicably, this isn't a sticky setting, so you may have turned on AlwaysOn on both your production and staging slots so you don't have to mess with it every time. This causes a problem with App Service infrastructure upgrades when they boot you to a new host. Here's what happens: let's say you have 7 sites hosted on an instance, each with its own staging slot, everything with AlwaysOn enabled. App Service does the warmup and application initialization to your 7 production slots and dutifully waits for them to respond successfully before redirecting traffic over. But it doesn't do this for the staging slots. So it directs traffic over to the new instance, but then AlwaysOn kicks in 1-2 minutes later on the staging slots, so now you have 7 more sites starting up at the same time. Remember, App Service runs on potatoes, so all this additional I/O happening at the same time is going to destroy the performance of your production slots and will be perceived as downtime.

The solution is to keep AlwaysOn off on your staging slots so you don't get nailed by this simultaneous I/O frenzy after an infrastructure update. If you are using a swap script via PowerShell, maintaining this "Off in staging, On in production" is surprisingly verbose to do:

Login-AzureRmAccount -SubscriptionId {{ YOUR_SUBSCRIPTION_ID }}

$resourceGroupName = "YOUR-RESOURCE-GROUP"
$appName = "YOUR-APP-NAME"
$slotName = "YOUR-SLOT-NAME-FOR-EXAMPLE-STAGING"

$props = @{ siteConfig = @{ alwaysOn = $true; } }

Set-AzureRmResource `
    -PropertyObject $props `
    -ResourceType "microsoft.web/sites/slots" `
    -ResourceGroupName $resourceGroupName `
    -ResourceName "$appName/$slotName" `
    -ApiVersion 2015-08-01 `
    -Force

Swap-AzureRmWebAppSlot `
    -SourceSlotName $slotName `
    -ResourceGroupName $resourceGroupName `
    -Name $appName

$props = @{ siteConfig = @{ alwaysOn = $false; } }

Set-AzureRmResource `
    -PropertyObject $props `
    -ResourceType "microsoft.web/sites/slots" `
    -ResourceGroupName $resourceGroupName `
    -ResourceName "$appName/$slotName" `
    -ApiVersion 2015-08-01 `
    -Force

This script sets the staging slot to have AlwaysOn turned on, does the swap so that staging is now production, then sets the staging slot to have AlwaysOn turned off, so it doesn't blow things up after an infrastructure upgrade.

Once you get this working, it is indeed nice to have a PaaS that handles security updates and hardware failures for you. But it's a little bit more difficult to achieve in practice than the marketing materials might suggest. Hope this helps someone.

--

UPDATE 07/17/2020: In the blurb above, I talk about needing to diddle with "AlwaysOn" if you're using staging slots, because it would swap along with the slots, and having it on in all slots can cause performance issues. At some point (it isn't clear to me when), they seem to have fixed this so that "AlwaysOn" is no longer swapped. My script still does the AlwaysOn juggling, but in effect it's a no-op now. So the advice to keep AlwaysOn off for your staging slots still stands, but you shouldn't have to do this little juggle in a script anymore.
