8. Deep dive into k8s: resource control (QoS and eviction) and its source code analysis

Another weekend, a chance to sit down and quietly savor some source code. This article deals with resource reclamation, so it will be fairly long. We'll see what k8s does when resources run short, what it weighs when reclaiming resources, and why our pods sometimes get killed seemingly out of nowhere.

limit & request

In k8s, CPU and memory resources are constrained primarily through limits and requests, defined in the YAML file as follows:

spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory

At scheduling time, kube-scheduler computes only against the requests values; what actually caps resource usage is the limit.

Here is an example from the official docs:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"

In this example, the args pass -cpus "2", telling the stress tool inside the container to try to burn two CPUs. But the limit is 1 and the request is 0.5.

Once the pod is created, checking resource usage with kubectl top shows that CPU consumption never exceeds 1:

NAME                        CPU(cores)   MEMORY(bytes)
cpu-demo                    974m         <something>

This shows the pod's CPU is capped at one CPU: even though the container tries to use more, it cannot.

When a container specifies a limit but no request, the request defaults to the same value as the limit.
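
Before moving on, it may help to see how these two values actually land on the node. Here is a simplified sketch (my own, modeled on the MilliCPUToShares/MilliCPUToQuota helpers in pkg/kubelet/cm, not the verbatim implementation) of how kubelet maps CPU requests and limits to cgroup values: the request becomes the relative cpu.shares weight, while the limit becomes a hard CFS quota:

```go
package main

import "fmt"

// milliCPUToShares mirrors the idea of kubelet's MilliCPUToShares:
// requests.cpu becomes the relative cgroup cpu.shares weight.
func milliCPUToShares(milliCPU int64) int64 {
	const minShares, sharesPerCPU, milliCPUToCPU = 2, 1024, 1000
	if milliCPU == 0 {
		return minShares
	}
	if shares := milliCPU * sharesPerCPU / milliCPUToCPU; shares > minShares {
		return shares
	}
	return minShares
}

// milliCPUToQuota mirrors the idea of kubelet's MilliCPUToQuota:
// limits.cpu becomes a CFS quota over a 100ms period, i.e. the hard cap.
func milliCPUToQuota(milliCPU int64) (quota, period int64) {
	period = 100000 // 100ms CFS period, in microseconds
	if milliCPU == 0 {
		return 0, period // no limit set → no quota
	}
	quota = milliCPU * period / 1000
	return quota, period
}

func main() {
	fmt.Println(milliCPUToShares(500)) // requests.cpu "0.5" → 512 shares
	quota, period := milliCPUToQuota(1000)
	fmt.Println(quota, period) // limits.cpu "1" → 100000/100000, i.e. capped at 1 CPU
}
```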

QoS model and eviction

Next, let's look at the different QoS classes that arise from the different ways of setting requests and limits.

Kubernetes has three QoS classes:

  1. Guaranteed: for every container in the pod, the limit and request of every resource are equal and non-zero;

  2. Burstable: the pod does not meet the Guaranteed criteria, but at least one of its containers sets requests or limits;

  3. BestEffort: neither requests nor limits are set anywhere in the pod;

When the host runs short of resources, kubelet performs eviction (i.e. resource reclamation) on pods in QoS order: BestEffort > Burstable > Guaranteed.

Eviction comes in two modes, Soft and Hard. Soft eviction lets you set a grace period for the eviction: the threshold must hold for that user-configured grace period before eviction runs. Hard eviction executes immediately.

So when does eviction happen? We can set thresholds for it. Say the hard eviction threshold for memory is 100M: once the machine's available memory falls below 100M, kubelet ranks every pod on the machine by a combination of QoS class and memory usage, and evicts the top-ranked pods to free up enough memory.

A threshold is written as [eviction-signal][operator][quantity].

eviction-signal

According to the official docs, the eviction signals are: memory.available, nodefs.available, nodefs.inodesFree, imagefs.available, imagefs.inodesFree, and pid.available.

nodefs and imagefs denote two filesystem partitions:

nodefs: the filesystem that kubelet uses for volumes, daemon logs, and so on.

imagefs: the filesystem that the container runtime uses to store images and container writable layers.

operator

The required relational operator, such as "<".

quantity

The size of the threshold, either an absolute quantity such as 1Gi or a percentage such as 10%.
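
Putting the three parts together, thresholds are typically passed to kubelet as command-line flags; for example (values here are illustrative, not recommendations):

```
--eviction-hard=memory.available<100Mi,nodefs.available<10%
--eviction-soft=memory.available<300Mi
--eviction-soft-grace-period=memory.available=2m
```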

If kubelet cannot reclaim memory before the node experiences a system OOM, the oom_killer computes an oom_score based on the percentage of the node's memory each container uses, then kills the container with the highest score.
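
The kernel combines that usage-based score with each process's oom_score_adj, which is exactly the knob the kubelet policy code below sets. A rough illustration with made-up numbers (the kernel's actual badness computation is more involved):

```go
package main

import "fmt"

func main() {
	// Suppose a process uses ~50% of node memory → usage-based score ≈ 500.
	base := 500
	fmt.Println(base + 1000) // BestEffort pod (adj 1000): 1500 → killed first
	fmt.Println(base - 998)  // Guaranteed pod (adj -998): -498 → killed last
}
```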

QoS source code analysis

The QoS code lives under the pkg\apis\core\v1\helper\qos\ package:

qos#GetPodQOS

//pkg\apis\core\v1\helper\qos\qos.go
func GetPodQOS(pod *v1.Pod) v1.PodQOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	allContainers := []v1.Container{}
	// gather all of the pod's containers, init containers included
	allContainers = append(allContainers, pod.Spec.Containers...)
	allContainers = append(allContainers, pod.Spec.InitContainers...)
	// iterate over all containers
	for _, container := range allContainers {
		// process requests
		// walk the cpu/memory entries in requests and accumulate their values
		for name, quantity := range container.Resources.Requests {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.DeepCopy()
				if _, exists := requests[name]; !exists {
					requests[name] = delta
				} else {
					delta.Add(requests[name])
					requests[name] = delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		// walk the cpu/memory entries in limits and accumulate their values
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}
		// if limits do not cover both cpu and memory, the pod cannot be Guaranteed
		if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
			isGuaranteed = false
		}
	}
	// neither requests nor limits are set anywhere → BestEffort
	if len(requests) == 0 && len(limits) == 0 {
		return v1.PodQOSBestEffort
	}
	// Check if requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	// limits and requests are set and match for every resource → Guaranteed
	if isGuaranteed &&
		len(requests) == len(limits) {
		return v1.PodQOSGuaranteed
	}
	return v1.PodQOSBurstable
}

The inline comments above cover it; the logic is quite simple.
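
As a quick sanity check, a hypothetical snippet like the following (mine, assuming k8s.io/api, k8s.io/apimachinery, and this qos helper package are importable) exercises GetPodQOS:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	qos "k8s.io/kubernetes/pkg/apis/core/v1/helper/qos"
)

func main() {
	pod := &v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{{
		Name: "demo",
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("500m")},
			Limits:   v1.ResourceList{v1.ResourceCPU: resource.MustParse("1")},
		},
	}}}}
	// Limits do not cover both cpu and memory, so this pod is not Guaranteed;
	// since requests are set, it is Burstable.
	fmt.Println(qos.GetPodQOS(pod)) // Burstable
}
```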

Next comes the QoS OOM scoring mechanism: pods are scored to decide which ones can be killed first; the higher the score, the more likely the pod is to be killed.

policy

//\pkg\kubelet\qos\policy.go
// the higher the score, the more likely to be killed
const (
	// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
	KubeletOOMScoreAdj int = -999
	// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)

policy#GetContainerOOMScoreAdjust

//\pkg\kubelet\qos\policy.go
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	//static pods, mirror pods, and critical (high-priority) pods get guaranteedOOMScoreAdj directly
	if types.IsCriticalPod(pod) {
		// Critical pods should be the last to get killed.
		return guaranteedOOMScoreAdj
	}
	//look up the pod's QoS class; only Guaranteed and BestEffort are handled here
	switch v1qos.GetPodQOS(pod) {
	case v1.PodQOSGuaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case v1.PodQOSBestEffort:
		return besteffortOOMScoreAdj
	}
	memoryRequest := container.Resources.Requests.Memory().Value()
	//the smaller the memory request, the higher the score
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity

	//ensure a Burstable pod always ends up with a higher OOM score than a Guaranteed one
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}

	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}

This method scores the different pods. Static pods, mirror pods, and high-priority (critical) pods are given the guaranteed score directly;

it then calls qos's GetPodQOS to get the pod's QoS class: Guaranteed maps to -998 and BestEffort to 1000. A Burstable pod is scored by its memory request: the smaller the request relative to node capacity, the higher the score. If the score would fall below 1000 + guaranteedOOMScoreAdj, i.e. 2, it is clamped to 2 so that a Burstable pod never scores at or below a Guaranteed one.
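
Plugging illustrative numbers of my own into the Burstable formula above:

```go
package main

import "fmt"

func main() {
	memoryRequest := int64(256 << 20) // a container requesting 256Mi
	memoryCapacity := int64(8 << 30)  // on an 8Gi node
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// 1000 - 31 = 969: the smaller the request, the higher the score,
	// so this container is OOM-killed before one that requested more memory.
	fmt.Println(oomScoreAdjust)
}
```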

Eviction Manager source code analysis

When kubelet instantiates its Kubelet object, it calls eviction.NewManager to create an evictionManager. Then, when kubelet starts working in its Run method, it spawns a goroutine that runs updateRuntimeUp every 5s.

Inside updateRuntimeUp, once the runtime is confirmed to be up, initializeRuntimeDependentModules is called to finish initializing the runtime-dependent modules.

initializeRuntimeDependentModules then calls the evictionManager's Start method to start it.

The code is as follows; we'll leave a detailed analysis of the kubelet flow to a later article:

func NewMainKubelet(...){
	...
	evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock, etcHostsPathFunc)

	klet.evictionManager = evictionManager
    ...
}

func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ...
    go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
    ...
}

func (kl *Kubelet) updateRuntimeUp() {
    ...
    kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
    ...
}


func (kl *Kubelet) initializeRuntimeDependentModules() {
    ...
    kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
    ...
}

Now let's head to \pkg\kubelet\eviction\eviction_manager.go and see how the Start method implements eviction.

managerImpl#Start

// start a control loop to monitor and respond to low resource availability
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
	thresholdHandler := func(message string) {
		klog.Infof(message)
		m.synchronize(diskInfoProvider, podFunc)
	}
	//whether to use kernel memcg notifications
	if m.config.KernelMemcgNotification {
		for _, threshold := range m.config.Thresholds {
			if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
				notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
				if err != nil {
					klog.Warningf("eviction manager: failed to create memory threshold notifier: %v", err)
				} else {
					go notifier.Start()
					m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
				}
			}
		}
	}
	// start the eviction manager monitoring
	// spawn a goroutine: an endless loop running synchronize every monitoringInterval (10s)
	go func() {
		for {
			//synchronize is the main eviction control loop; it returns the evicted pods, or nil
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				klog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
				time.Sleep(monitoringInterval)
			}
		}
	}()
}

The synchronize method below is quite long, so bear with me:

managerImpl#synchronize

  1. Register a ranking method for each of the eviction signals introduced above, plus the node-level resource reclaim methods

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	if m.dedicatedImageFs == nil {
    		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
    		if ok != nil {
    			return nil
    		}
    		m.dedicatedImageFs = &hasImageFs
    		//register the pod ranking method for each eviction signal
    		m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
    		// register node-level resource reclaim methods, e.g. imagefs.available maps to deleting unused containers and unused images
    		m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
    	}
    	...
    }
    

    Here is the implementation of buildSignalToRankFunc:

    func buildSignalToRankFunc(withImageFs bool) map[evictionapi.Signal]rankFunc {
    	signalToRankFunc := map[evictionapi.Signal]rankFunc{
    		evictionapi.SignalMemoryAvailable:            rankMemoryPressure,
    		evictionapi.SignalAllocatableMemoryAvailable: rankMemoryPressure,
    		evictionapi.SignalPIDAvailable:               rankPIDPressure,
    	}
    	if withImageFs {
    		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
    		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, resourceInodes)
    	} else {
    		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
    		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
    	}
    	return signalToRankFunc
    }
    

    This method collects each eviction signal's ranking method (MemoryAvailable, NodeFsAvailable, ImageFsAvailable, and so on) into a map and returns it; a sketch of what one of these rank functions does follows.
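
    To get a feel for the ranking, here is a self-contained sketch (mine, not the k8s code) of the ordering rankMemoryPressure applies: pods exceeding their memory request first, then lower priority, then the largest usage above request. The real code composes comparators via orderedBy; the podInfo struct here is an illustrative stand-in for the (pod, stats) pair:

    package main
    
    import (
    	"fmt"
    	"sort"
    )
    
    // podInfo stands in for the pod plus its stats that the real rank functions consume.
    type podInfo struct {
    	name             string
    	memUsage, memReq int64
    	priority         int32
    }
    
    func main() {
    	pods := []podInfo{
    		{"within-request", 100, 200, 0},
    		{"over-request", 300, 100, 0},
    		{"over-request-high-prio", 500, 100, 10},
    	}
    	sort.SliceStable(pods, func(i, j int) bool {
    		ei, ej := pods[i].memUsage > pods[i].memReq, pods[j].memUsage > pods[j].memReq
    		if ei != ej {
    			return ei // pods exceeding their memory request are evicted first
    		}
    		if pods[i].priority != pods[j].priority {
    			return pods[i].priority < pods[j].priority // lower priority evicted first
    		}
    		// larger consumption above request evicted first
    		return pods[i].memUsage-pods[i].memReq > pods[j].memUsage-pods[j].memReq
    	})
    	fmt.Println(pods) // over-request, over-request-high-prio, within-request
    }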

  2. Fetch all active pods and the overall stats

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//fetch the currently active pods
    	activePods := podFunc()
    	updateStats := true
    	//fetch the node-wide summary, i.e. nodeStats and podStats
    	summary, err := m.summaryProvider.Get(updateStats)
    	if err != nil {
    		klog.Errorf("eviction manager: failed to get summary stats: %v", err)
    		return nil
    	}
    	//if the notifiers have gone more than 10s without a refresh, update them
    	if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
    		m.thresholdsLastUpdated = m.clock.Now()
    		for _, notifier := range m.thresholdNotifiers {
    			if err := notifier.UpdateThreshold(summary); err != nil {
    				klog.Warningf("eviction manager: failed to update %s: %v", notifier.Description(), err)
    			}
    		}
    	}
    	...
    }
    
  3. Build the corresponding statistics from the summary into the observations object

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//build per-signal statistics (SignalMemoryAvailable, SignalNodeFsAvailable, etc.) from the summary into observations
    	observations, statsFunc := makeSignalObservations(summary)
    	...
    }
    

    Here is an excerpt from makeSignalObservations:

    func makeSignalObservations(summary *statsapi.Summary) (signalObservations, statsFunc) {
    	...
    	if memory := summary.Node.Memory; memory != nil && memory.AvailableBytes != nil && memory.WorkingSetBytes != nil {
    		result[evictionapi.SignalMemoryAvailable] = signalObservation{
    			available: resource.NewQuantity(int64(*memory.AvailableBytes), resource.BinarySI),
    			capacity:  resource.NewQuantity(int64(*memory.AvailableBytes+*memory.WorkingSetBytes), resource.BinarySI),
    			time:      memory.Time,
    		}
    	}
    	...
    }
    

    This method packages the resource usage in the summary into result, keyed by eviction signal, and returns it; note how memory capacity is derived as available bytes plus working-set bytes.

  4. Use the collected observations to decide which thresholds have been reached

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
        //determine which thresholds the observations have crossed, and collect them
    	thresholds = thresholdsMet(thresholds, observations, false)
    
        if len(m.thresholdsMet) > 0 {
    		//minimum eviction reclaim policy
    		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
    		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    	}
    	...
    }
    

    thresholdsMet

    func thresholdsMet(thresholds []evictionapi.Threshold, observations signalObservations, enforceMinReclaim bool) []evictionapi.Threshold {
    	results := []evictionapi.Threshold{}
    	for i := range thresholds {
    		threshold := thresholds[i]
    		observed, found := observations[threshold.Signal]
    		if !found {
    			klog.Warningf("eviction manager: no observation found for eviction signal %v", threshold.Signal)
    			continue
    		}
    		thresholdMet := false
    		// resolve the threshold into an absolute quantity based on the observed capacity
    		quantity := evictionapi.GetThresholdQuantity(threshold.Value, observed.capacity)
    		//minimum eviction reclaim policy; see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim
    		if enforceMinReclaim && threshold.MinReclaim != nil {
    			quantity.Add(*evictionapi.GetThresholdQuantity(*threshold.MinReclaim, observed.capacity))
    		}
    		//Cmp returns 1 when quantity is greater than observed.available
    		thresholdResult := quantity.Cmp(*observed.available)
    		//check the operator
    		switch threshold.Operator {
    		//for operator "<": the threshold is met when thresholdResult > 0, i.e. available < quantity
    		case evictionapi.OpLessThan:
    			thresholdMet = thresholdResult > 0
    		}
    		//appending to results marks this threshold as met
    		if thresholdMet {
    			results = append(results, threshold)
    		}
    	}
    	return results
    }
    

    thresholdsMet walks all of the thresholds and looks up each eviction signal's resource situation in observations. Since, as described above, a threshold may be set as an absolute value like 1Gi or as a percentage, GetThresholdQuantity is called to resolve it, yielding quantity;

    then, per the minimum eviction reclaim policy, it decides whether the amount that must be reclaimed should be raised; for details see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim;

    finally quantity is compared against available, and every threshold that has been reached is added to the results collection and returned. A worked example of the comparison follows.
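
    As a worked example of this comparison (numbers of my own, purely illustrative): a threshold of memory.available<10% on a node observed at 8Gi capacity resolves to roughly 819Mi, so 500Mi available means the threshold is met:

    package main
    
    import "fmt"
    
    func main() {
    	capacity := int64(8 << 30)      // 8Gi observed capacity
    	quantity := capacity * 10 / 100 // "10%" resolved against capacity ≈ 819Mi
    	available := int64(500 << 20)   // 500Mi currently available
    	// operator "<": met when available drops below the resolved quantity
    	fmt.Println(available < quantity) // true → this threshold joins the results
    }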

  5. Record when each eviction signal was first observed, and map the eviction signals to their node conditions

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	now := m.clock.Now()
    	//record when each eviction signal was first observed, defaulting to now
    	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
    
    	// the set of node conditions that are triggered by currently observed thresholds
    	// kubelet maps the triggered eviction signals to the corresponding node conditions
    	nodeConditions := nodeConditions(thresholds)
    	if len(nodeConditions) > 0 {
    		klog.V(3).Infof("eviction manager: node conditions - observed: %v", nodeConditions)
    	}
    	...
    }
    

    nodeConditions

    func nodeConditions(thresholds []evictionapi.Threshold) []v1.NodeConditionType {
    	results := []v1.NodeConditionType{}
    	for _, threshold := range thresholds {
    		if nodeCondition, found := signalToNodeCondition[threshold.Signal]; found {
    			//skip if results already contains this nodeCondition
    			if !hasNodeCondition(results, nodeCondition) {
    				results = append(results, nodeCondition)
    			}
    		}
    	}
    	return results
    }
    

    The nodeConditions method simply maps each signal through signalToNodeCondition, which is defined as follows:

    	signalToNodeCondition = map[evictionapi.Signal]v1.NodeConditionType{}
    	signalToNodeCondition[evictionapi.SignalMemoryAvailable] = v1.NodeMemoryPressure
    	signalToNodeCondition[evictionapi.SignalAllocatableMemoryAvailable] = v1.NodeMemoryPressure
    	signalToNodeCondition[evictionapi.SignalImageFsAvailable] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalNodeFsAvailable] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalImageFsInodesFree] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalNodeFsInodesFree] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalPIDAvailable] = v1.NodePIDPressure
    

    In other words, each eviction signal maps to MemoryPressure, DiskPressure, or PIDPressure. Tabulated:

    | Eviction Signal | Node Condition |
    | --- | --- |
    | SignalMemoryAvailable, SignalAllocatableMemoryAvailable | MemoryPressure |
    | SignalNodeFsAvailable, SignalNodeFsInodesFree, SignalImageFsAvailable, SignalImageFsInodesFree | DiskPressure |
    | SignalPIDAvailable | PIDPressure |

  6. Merge this round's node conditions with the previously observed ones, the newest winning

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//merge this round's node conditions with the previously observed ones, keeping the latest timestamps
    	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
    	...
    }
    
  7. Prevent node resources oscillating around a threshold from flapping the node condition

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//PressureTransitionPeriod defaults to 5 minutes
    	//it prevents node conditions from flapping when resources oscillate around a threshold
    	//see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#oscillation-of-node-conditions
    	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    	if len(nodeConditions) > 0 {
    		klog.V(3).Infof("eviction manager: node conditions - transition period not met: %v", nodeConditions)
    	}
    	...
    }
    

    nodeConditionsObservedSince

    func nodeConditionsObservedSince(observedAt nodeConditionsObservedAt, period time.Duration, now time.Time) []v1.NodeConditionType {
    	results := []v1.NodeConditionType{}
    	for nodeCondition, at := range observedAt {
    		duration := now.Sub(at)
    		if duration < period {
    			results = append(results, nodeCondition)
    		}
    	}
    	return results
    }
    

    Conditions whose last observation is older than the transition period (5 minutes) are filtered out.

  8. Apply the eviction-soft grace period

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//apply the eviction-soft grace period: only thresholds that have stayed crossed beyond their configured grace period join the set
    	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    	...
    }
    

    thresholdsMetGracePeriod

    func thresholdsMetGracePeriod(observedAt thresholdsObservedAt, now time.Time) []evictionapi.Threshold {
    	results := []evictionapi.Threshold{}
    	for threshold, at := range observedAt {
    		duration := now.Sub(at)
    		//soft eviction thresholds must stay crossed for their grace period before they can trigger
    		if duration < threshold.GracePeriod {
    			klog.V(2).Infof("eviction manager: eviction criteria not yet met for %v, duration: %v", formatThreshold(threshold), duration)
    			continue
    		}
    		results = append(results, threshold)
    	}
    	return results
    }
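
    A quick timing sketch of this check (values of my own): a soft threshold first observed 60 seconds ago with a 90-second grace period is not yet actionable, so it is skipped this round:

    package main
    
    import (
    	"fmt"
    	"time"
    )
    
    func main() {
    	firstObservedAt := time.Now().Add(-60 * time.Second) // threshold crossed 60s ago
    	gracePeriod := 90 * time.Second                      // configured eviction-soft grace period
    	fmt.Println(time.Since(firstObservedAt) < gracePeriod) // true → not yet met this round
    }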
    
  9. Store the state, then compare and update

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	// update internal state
    	m.Lock()
    	m.nodeConditions = nodeConditions
    	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    	m.thresholdsMet = thresholds
    
    	// keep only the thresholds whose underlying stats have been updated since the last observations
    	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    	debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
    
    	//store this round's observations as the last observations
    	m.lastObservations = observations
    	m.Unlock()
    	...
    }
    
  10. After sorting, find the first threshold to relieve, and its corresponding resource

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//no starved eviction signals this round → end here
    	if len(thresholds) == 0 {
    		klog.V(3).Infof("eviction manager: no resources are starved")
    		return nil
    	}
    
    	//sort by eviction priority, then take the first reclaimable threshold from the set
    	sort.Sort(byEvictionPriority(thresholds))
    	thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
    	if !foundAny {
    		return nil
    	}
    	...
    }
    

    getReclaimableThreshold

    func getReclaimableThreshold(thresholds []evictionapi.Threshold) (evictionapi.Threshold, v1.ResourceName, bool) {
    	//walk the thresholds and resolve each eviction signal to its resource
    	for _, thresholdToReclaim := range thresholds {
    		if resourceToReclaim, ok := signalToResource[thresholdToReclaim.Signal]; ok {
    			return thresholdToReclaim, resourceToReclaim, true
    		}
    		klog.V(3).Infof("eviction manager: threshold %s was crossed, but reclaim is not implemented for this threshold.", thresholdToReclaim.Signal)
    	}
    	return evictionapi.Threshold{}, "", false
    }
    

    Here is the definition of signalToResource:

    	signalToResource = map[evictionapi.Signal]v1.ResourceName{}
    	signalToResource[evictionapi.SignalMemoryAvailable] = v1.ResourceMemory
    	signalToResource[evictionapi.SignalAllocatableMemoryAvailable] = v1.ResourceMemory
    	signalToResource[evictionapi.SignalImageFsAvailable] = v1.ResourceEphemeralStorage
    	signalToResource[evictionapi.SignalImageFsInodesFree] = resourceInodes
    	signalToResource[evictionapi.SignalNodeFsAvailable] = v1.ResourceEphemeralStorage
    	signalToResource[evictionapi.SignalNodeFsInodesFree] = resourceInodes
    	signalToResource[evictionapi.SignalPIDAvailable] = resourcePids
    

    signalToResource groups the eviction signals into memory, ephemeral-storage, inodes, and pids.

  11. Reclaim node-level resources

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	//first try to reclaim node-level resources
	if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
		klog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return nil
	}
	...
}
```

**reclaimNodeLevelResources**

```go
func (m *managerImpl) reclaimNodeLevelResources(signalToReclaim evictionapi.Signal, resourceToReclaim v1.ResourceName) bool {
	//invoke the methods registered by buildSignalToNodeReclaimFuncs
	nodeReclaimFuncs := m.signalToNodeReclaimFuncs[signalToReclaim]
	for _, nodeReclaimFunc := range nodeReclaimFuncs {
		// delete unused images, or pods and containers that are already dead
		if err := nodeReclaimFunc(); err != nil {
			klog.Warningf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
		}

	}
	//after reclaiming, re-check usage; if no threshold is still met, finish here
	if len(nodeReclaimFuncs) > 0 {
		summary, err := m.summaryProvider.Get(true)
		if err != nil {
			klog.Errorf("eviction manager: failed to get summary stats after resource reclaim: %v", err)
			return false
		}

		observations, _ := makeSignalObservations(summary)
		debugLogObservations("observations after resource reclaim", observations)

		thresholds := thresholdsMet(m.config.Thresholds, observations, false)
		debugLogThresholdsWithObservation("thresholds after resource reclaim - ignoring grace period", thresholds, observations)

		if len(thresholds) == 0 {
			return true
		}
	}
	return false
}
```

First, the resource-freeing methods for the signal to relieve are looked up in signalToNodeReclaimFuncs; they were registered earlier in buildSignalToNodeReclaimFuncs, e.g.:

```
nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
```

These invoke the corresponding GC methods, freeing resources by deleting unused containers and unused images.

It then checks whether any threshold is still exceeded after the reclaim; if not, the round ends right there.

  12. Fetch the matching ranking function and sort the pods

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//fetch the rank function for this eviction signal, registered in buildSignalToRankFunc
    	rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
    	if !ok {
    		klog.Errorf("eviction manager: no ranking function for signal %s", thresholdToReclaim.Signal)
    		return nil
    	}
    
    	//no active pods → nothing to evict, return
    	if len(activePods) == 0 {
    		klog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
    		return nil
    	}
    
    	//rank the pods by the starved resource
    	rank(activePods, statsFunc)
    	...
    }
    
  13. Evict the top-ranked pod and return

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	for i := range activePods {
    		pod := activePods[i]
    		gracePeriodOverride := int64(0)
    		if !isHardEvictionThreshold(thresholdToReclaim) {
    			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
    		}
    		message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
    		//kill pod
    		if m.evictPod(pod, gracePeriodOverride, message, annotations) {
    			metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
    			return []*v1.Pod{pod}
    		}
    	}
    	...
    }
    

    As soon as one pod has been evicted, the method returns: synchronize evicts at most one pod per round, so the manager can re-observe resource usage before deciding whether further evictions are needed.

And that completes the walkthrough of the eviction manager.

Summary

This article explained how resource control works: the limit and request settings influence the priority with which pods are killed, so setting sensible limits and requests makes our pods less likely to be killed. Through the source we saw that limits and requests determine the QoS class and thus the OOM score, which in turn decides the order in which pods are killed.

We then went through the source to see how thresholds are defined in k8s and on what basis pods are killed when resources run short; that took most of the article. The source also shows how much thought goes into eviction: how node-condition oscillation is handled, which kinds of resources are reclaimed first, how minimum-reclaim is implemented, and so on.

Reference

https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/

https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/

https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

https://zhuanlan.zhihu.com/p/38359775

https://cloud.tencent.com/developer/article/1097431

https://developer.aliyun.com/article/679216

https://my.oschina.net/jxcdwangtao/blog/841937
