Another weekend, a good time to sit down and quietly savor some source code. This article covers resource reclamation. There is a lot of ground to cover, so the post will be fairly long. We will see what k8s does when resources run short, what it weighs when reclaiming them, why our pods sometimes get killed seemingly out of nowhere, and so on.
limit & request
In Kubernetes, CPU and memory resources are constrained mainly through limits and requests, defined in the YAML file as follows:
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
At scheduling time, kube-scheduler computes only against the requests values, while the actual cap on resource usage is enforced by limits.
Below is an example adapted from the official documentation:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
```
In this example, the args pass -cpus 2, telling the container to stress 2 CPUs. But the limit is 1 and the request is 0.5.
After creating this pod, running kubectl top to inspect resource usage shows that CPU usage never exceeds 1:
```
NAME       CPU(cores)   MEMORY(bytes)
cpu-demo   974m         <something>
```
This shows the pod's CPU is capped at 1 CPU; even if the container wants more, it cannot get it.
If a container sets a limit but no request, the request defaults to the limit's value.
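Returning to the scheduling rule from earlier, it can be illustrated with a toy fit check (all names here are invented for illustration, not the scheduler's real API): a node accepts a pod if the sum of the scheduled pods' requests plus the new pod's request fits within capacity; limits never enter the calculation.

```go
package main

import "fmt"

// PodReq holds a pod's CPU request and limit, in millicores.
type PodReq struct {
	RequestMilliCPU, LimitMilliCPU int64
}

// fits mirrors the scheduler's idea: only requests count against capacity.
func fits(nodeMilliCPU int64, scheduled []PodReq, pod PodReq) bool {
	var used int64
	for _, p := range scheduled {
		used += p.RequestMilliCPU
	}
	return used+pod.RequestMilliCPU <= nodeMilliCPU
}

func main() {
	node := int64(2000) // a 2-core node
	running := []PodReq{{RequestMilliCPU: 500, LimitMilliCPU: 1000}}

	// Requests sum to 500+1500 = 2000 <= 2000: schedulable,
	// even though the limits (1000+3000) exceed the node.
	fmt.Println(fits(node, running, PodReq{RequestMilliCPU: 1500, LimitMilliCPU: 3000})) // true

	// Requests would sum to 2100 > 2000: not schedulable.
	fmt.Println(fits(node, running, PodReq{RequestMilliCPU: 1600, LimitMilliCPU: 1600})) // false
}
```

This is also why over-committing limits is normal: the limits on a node may add up to far more than its capacity, as long as the requests fit.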
QoS Classes and Eviction
The different combinations of requests and limits give rise to three QoS classes in Kubernetes:
- Guaranteed: every container in the pod has limits and requests set for every resource, equal and non-zero;
- Burstable: the pod does not meet the Guaranteed criteria, but at least one container has requests or limits set;
- BestEffort: no container in the pod sets requests or limits.
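These rules can be sketched with a simplified classifier. It is a hypothetical re-implementation for illustration only, ignoring the request-defaulting behavior mentioned above; the real logic is in GetPodQOS, analyzed below.

```go
package main

import "fmt"

// Resources holds one container's CPU/memory requests and limits,
// simplified: a zero value means "not set".
type Resources struct {
	RequestCPU, RequestMem int64
	LimitCPU, LimitMem     int64
}

// qosClass mirrors the three QoS rules described above.
func qosClass(containers []Resources) string {
	anySet := false
	guaranteed := true
	for _, r := range containers {
		if r.RequestCPU != 0 || r.RequestMem != 0 || r.LimitCPU != 0 || r.LimitMem != 0 {
			anySet = true
		}
		// Guaranteed requires limits set for both resources and equal to requests.
		if r.LimitCPU == 0 || r.LimitMem == 0 ||
			r.RequestCPU != r.LimitCPU || r.RequestMem != r.LimitMem {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	fmt.Println(qosClass([]Resources{{RequestCPU: 500, RequestMem: 256, LimitCPU: 500, LimitMem: 256}})) // Guaranteed
	fmt.Println(qosClass([]Resources{{RequestCPU: 500, LimitCPU: 1000}}))                                 // Burstable
	fmt.Println(qosClass([]Resources{{}}))                                                                // BestEffort
}
```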
When the node is under resource pressure, kubelet evicts pods (i.e. reclaims their resources) in QoS order: BestEffort > Burstable > Guaranteed.
Eviction comes in two modes, Soft and Hard. Soft eviction lets you attach a grace period to the eviction: kubelet waits for that user-configured grace period to elapse before evicting, while hard eviction acts immediately.
So when does eviction happen? We can configure thresholds for it. For example, with a memory eviction hard threshold of 100M, once the machine's available memory falls below 100M, kubelet ranks all pods on the machine by QoS class and memory usage, and evicts the top-ranked pods to free enough memory.
Thresholds are defined as [eviction-signal][operator][quantity].
eviction-signal
Per the official documentation, the eviction signals are memory.available, nodefs.available, nodefs.inodesFree, imagefs.available, imagefs.inodesFree, and pid.available.
nodefs and imagefs denote two filesystem partitions:
nodefs: the filesystem kubelet uses for volumes, daemon logs, and so on.
imagefs: the filesystem the container runtime uses for images and container writable layers.
operator
The desired relational operator, such as "<".
quantity
The size of the threshold: either an absolute quantity, such as 1Gi, or a percentage, such as 10%.
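Concretely, thresholds are handed to the kubelet through flags; the values below are illustrative, not recommendations:

```
--eviction-hard=memory.available<100Mi,nodefs.available<10%
--eviction-soft=memory.available<300Mi
--eviction-soft-grace-period=memory.available=1m30s
--eviction-minimum-reclaim=memory.available=0Mi,nodefs.available=500Mi
```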
If kubelet cannot reclaim memory before the node experiences a system OOM, the oom_killer computes an oom_score for each container based on the percentage of the node's memory it is using, then kills the container with the highest score.
QoS Source Analysis
The QoS code lives in the pkg/apis/core/v1/helper/qos package:
qos#GetPodQOS
```go
// pkg/apis/core/v1/helper/qos/qos.go
func GetPodQOS(pod *v1.Pod) v1.PodQOSClass {
    requests := v1.ResourceList{}
    limits := v1.ResourceList{}
    zeroQuantity := resource.MustParse("0")
    isGuaranteed := true
    allContainers := []v1.Container{}
    // collect the regular containers and the init containers
    allContainers = append(allContainers, pod.Spec.Containers...)
    allContainers = append(allContainers, pod.Spec.InitContainers...)
    // iterate over every container
    for _, container := range allContainers {
        // process requests
        // walk the cpu/memory entries in requests and accumulate their values
        for name, quantity := range container.Resources.Requests {
            if !isSupportedQoSComputeResource(name) {
                continue
            }
            if quantity.Cmp(zeroQuantity) == 1 {
                delta := quantity.DeepCopy()
                if _, exists := requests[name]; !exists {
                    requests[name] = delta
                } else {
                    delta.Add(requests[name])
                    requests[name] = delta
                }
            }
        }
        // process limits
        qosLimitsFound := sets.NewString()
        // walk the cpu/memory entries in limits and accumulate their values
        for name, quantity := range container.Resources.Limits {
            if !isSupportedQoSComputeResource(name) {
                continue
            }
            if quantity.Cmp(zeroQuantity) == 1 {
                qosLimitsFound.Insert(string(name))
                delta := quantity.DeepCopy()
                if _, exists := limits[name]; !exists {
                    limits[name] = delta
                } else {
                    delta.Add(limits[name])
                    limits[name] = delta
                }
            }
        }
        // if limits does not set both cpu and memory, the pod cannot be Guaranteed
        if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
            isGuaranteed = false
        }
    }
    // neither requests nor limits set anywhere: BestEffort
    if len(requests) == 0 && len(limits) == 0 {
        return v1.PodQOSBestEffort
    }
    // Check that requests match limits for all resources.
    if isGuaranteed {
        for name, req := range requests {
            if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
                isGuaranteed = false
                break
            }
        }
    }
    // limits and requests both fully set and equal: Guaranteed
    if isGuaranteed &&
        len(requests) == len(limits) {
        return v1.PodQOSGuaranteed
    }
    return v1.PodQOSBurstable
}
```
The inline comments cover it; it is quite straightforward, so I will not go over it again.
Next comes the QoS OOM scoring mechanism: each pod gets a score that decides which pods get killed first, and the higher the score, the sooner the pod is killed.
policy
```go
// pkg/kubelet/qos/policy.go
// the higher the score, the more likely the process is to be killed
const (
    // KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
    KubeletOOMScoreAdj int = -999
    // KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
    KubeProxyOOMScoreAdj int = -999
    guaranteedOOMScoreAdj int = -998
    besteffortOOMScoreAdj int = 1000
)
```
policy#GetContainerOOMScoreAdjust
```go
// pkg/kubelet/qos/policy.go
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
    // critical pods (static, mirror, and high-priority pods) go straight to guaranteedOOMScoreAdj
    if types.IsCriticalPod(pod) {
        // Critical pods should be the last to get killed.
        return guaranteedOOMScoreAdj
    }
    // look up the pod's QoS class; only Guaranteed and BestEffort are handled here
    switch v1qos.GetPodQOS(pod) {
    case v1.PodQOSGuaranteed:
        // Guaranteed containers should be the last to get killed.
        return guaranteedOOMScoreAdj
    case v1.PodQOSBestEffort:
        return besteffortOOMScoreAdj
    }
    memoryRequest := container.Resources.Requests.Memory().Value()
    // the less memory the container requests, the higher its score
    oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
    // ensure a Burstable container always scores higher than a Guaranteed one
    if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
        return (1000 + guaranteedOOMScoreAdj)
    }
    if int(oomScoreAdjust) == besteffortOOMScoreAdj {
        return int(oomScoreAdjust - 1)
    }
    return int(oomScoreAdjust)
}
```
This method scores the containers of different pods. Static pods, mirror pods, and critical (high-priority) pods are given guaranteedOOMScoreAdj directly.
It then calls GetPodQOS to look up the pod's QoS class. For a Burstable pod, the score is derived from the container's memory request relative to node capacity: the less memory requested, the higher the score. If the result would fall below 1000 + guaranteedOOMScoreAdj, i.e. 2, it is clamped to 2 so a Burstable container never scores at or below a Guaranteed one.
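The Burstable arithmetic can be reproduced in isolation. The constants are copied from policy.go; the helper name and the sample capacities are mine:

```go
package main

import "fmt"

const (
	guaranteedOOMScoreAdj = -998
	besteffortOOMScoreAdj = 1000
)

// burstableOOMScoreAdjust reproduces the arithmetic from
// GetContainerOOMScoreAdjust for a Burstable container.
func burstableOOMScoreAdjust(memoryRequest, memoryCapacity int64) int {
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// Clamp so Burstable stays strictly above Guaranteed (-998 + 1000 = 2).
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return 1000 + guaranteedOOMScoreAdj
	}
	// Keep Burstable strictly below BestEffort (1000).
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}

func main() {
	capacity := int64(8) << 30 // an 8Gi node
	fmt.Println(burstableOOMScoreAdjust(int64(4)<<30, capacity)) // half the node requested -> 500
	fmt.Println(burstableOOMScoreAdjust(int64(8)<<30, capacity)) // whole node requested -> clamped to 2
	fmt.Println(burstableOOMScoreAdjust(0, capacity))            // nothing requested -> 999
}
```

So a Burstable container that requests half the node's memory ends up with oom_score_adj 500, while one requesting almost nothing sits just under BestEffort at 999.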
Eviction Manager Source Analysis
When kubelet instantiates the Kubelet object, it calls eviction.NewManager to create an evictionManager. Then, when kubelet's Run method starts working, it spawns a goroutine that runs updateRuntimeUp every 5s.
In updateRuntimeUp, once the runtime is confirmed to be up, initializeRuntimeDependentModules is called to initialize the runtime-dependent modules.
initializeRuntimeDependentModules in turn calls the evictionManager's Start method.
The code is below; we will leave the full kubelet flow for a later article:
```go
func NewMainKubelet(...) {
    ...
    evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock, etcHostsPathFunc)
    klet.evictionManager = evictionManager
    ...
}

func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ...
    go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
    ...
}

func (kl *Kubelet) updateRuntimeUp() {
    ...
    kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
    ...
}

func (kl *Kubelet) initializeRuntimeDependentModules() {
    ...
    kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
    ...
}
```
Now let's open pkg/kubelet/eviction/eviction_manager.go and see how the Start method implements eviction.
managerImpl#Start
```go
// Start a control loop that monitors and responds to low-resource conditions
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
    thresholdHandler := func(message string) {
        klog.Infof(message)
        m.synchronize(diskInfoProvider, podFunc)
    }
    // whether to use kernel memcg notifications
    if m.config.KernelMemcgNotification {
        for _, threshold := range m.config.Thresholds {
            if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
                notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
                if err != nil {
                    klog.Warningf("eviction manager: failed to create memory threshold notifier: %v", err)
                } else {
                    go notifier.Start()
                    m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
                }
            }
        }
    }
    // start the eviction manager monitoring
    // launch a goroutine whose loop runs synchronize once per monitoringInterval (10s)
    go func() {
        for {
            // synchronize is the main eviction control loop; it returns the evicted pods, or nil
            if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
                klog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
                m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
            } else {
                time.Sleep(monitoringInterval)
            }
        }
    }()
}
```
The synchronize method below is long, so bear with me; we will walk through it piece by piece.
managerImpl#synchronize
As covered earlier, each eviction signal gets its own pod-ranking function, and node-level resource reclaim functions are registered per signal:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    if m.dedicatedImageFs == nil {
        hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
        if ok != nil {
            return nil
        }
        m.dedicatedImageFs = &hasImageFs
        // register the pod-ranking function for each eviction signal
        m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
        // register the node-level reclaim functions; e.g. imagefs.available maps to deleting unused containers and images
        m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
    }
    ...
}
```
Here is the implementation of buildSignalToRankFunc:

```go
func buildSignalToRankFunc(withImageFs bool) map[evictionapi.Signal]rankFunc {
    signalToRankFunc := map[evictionapi.Signal]rankFunc{
        evictionapi.SignalMemoryAvailable:            rankMemoryPressure,
        evictionapi.SignalAllocatableMemoryAvailable: rankMemoryPressure,
        evictionapi.SignalPIDAvailable:               rankPIDPressure,
    }
    if withImageFs {
        signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
        signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
        signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, v1.ResourceEphemeralStorage)
        signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, resourceInodes)
    } else {
        signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
        signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
        signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
        signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
    }
    return signalToRankFunc
}
```
This method collects the ranking function for each eviction signal (MemoryAvailable, NodeFsAvailable, ImageFsAvailable, etc.) into a map and returns it.
Next, fetch all active pods along with the node's overall stats:
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // get the currently active pods
    activePods := podFunc()
    updateStats := true
    // get the node's overall picture, i.e. nodeStats and podStats
    summary, err := m.summaryProvider.Get(updateStats)
    if err != nil {
        klog.Errorf("eviction manager: failed to get summary stats: %v", err)
        return nil
    }
    // if the notifiers have not been refreshed for more than 10s, update them
    if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
        m.thresholdsLastUpdated = m.clock.Now()
        for _, notifier := range m.thresholdNotifiers {
            if err := notifier.UpdateThreshold(summary); err != nil {
                klog.Warningf("eviction manager: failed to update %s: %v", notifier.Description(), err)
            }
        }
    }
    ...
}
```
Build per-signal statistics from the summary into an observations object:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // build the observations (SignalMemoryAvailable, SignalNodeFsAvailable, etc.) from the summary
    observations, statsFunc := makeSignalObservations(summary)
    ...
}
```
Here is an excerpt from makeSignalObservations:

```go
func makeSignalObservations(summary *statsapi.Summary) (signalObservations, statsFunc) {
    ...
    if memory := summary.Node.Memory; memory != nil && memory.AvailableBytes != nil && memory.WorkingSetBytes != nil {
        result[evictionapi.SignalMemoryAvailable] = signalObservation{
            available: resource.NewQuantity(int64(*memory.AvailableBytes), resource.BinarySI),
            capacity:  resource.NewQuantity(int64(*memory.AvailableBytes+*memory.WorkingSetBytes), resource.BinarySI),
            time:      memory.Time,
        }
    }
    ...
}
```
This method packages the resource usage found in the summary into the result map, keyed by eviction signal.
Next, use the observations to determine which thresholds have been crossed:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // determine from the observations which thresholds have been met
    thresholds = thresholdsMet(thresholds, observations, false)
    if len(m.thresholdsMet) > 0 {
        // minimum eviction reclaim policy
        thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
        thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    }
    ...
}
```
thresholdsMet
```go
func thresholdsMet(thresholds []evictionapi.Threshold, observations signalObservations, enforceMinReclaim bool) []evictionapi.Threshold {
    results := []evictionapi.Threshold{}
    for i := range thresholds {
        threshold := thresholds[i]
        observed, found := observations[threshold.Signal]
        if !found {
            klog.Warningf("eviction manager: no observation found for eviction signal %v", threshold.Signal)
            continue
        }
        thresholdMet := false
        // resolve the threshold into an absolute quantity against the observed capacity
        quantity := evictionapi.GetThresholdQuantity(threshold.Value, observed.capacity)
        // minimum eviction reclaim policy; see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim
        if enforceMinReclaim && threshold.MinReclaim != nil {
            quantity.Add(*evictionapi.GetThresholdQuantity(*threshold.MinReclaim, observed.capacity))
        }
        // Cmp returns 1 if quantity is greater than observed.available
        thresholdResult := quantity.Cmp(*observed.available)
        // check the operator
        switch threshold.Operator {
        // for the "<" operator, the threshold is met when thresholdResult is greater than 0
        case evictionapi.OpLessThan:
            thresholdMet = thresholdResult > 0
        }
        // appending to results marks the threshold as met
        if thresholdMet {
            results = append(results, threshold)
        }
    }
    return results
}
```
thresholdsMet walks the thresholds and looks up the resource situation for each eviction signal in observations. Because a threshold may be expressed as an absolute value such as 1Gi or as a percentage, as explained above, GetThresholdQuantity first resolves it into a concrete quantity.
If the minimum-eviction-reclaim policy applies, the quantity is raised by that amount; for details see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim.
Finally, quantity is compared against available; if the threshold has been crossed, it is appended to the results set.
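The conversion GetThresholdQuantity performs can be sketched like this. The function below is a simplified stand-in of my own, not the real resource.Quantity-based implementation: a percentage becomes a fraction of the observed capacity, an absolute value passes through unchanged.

```go
package main

import "fmt"

// thresholdQuantity resolves a threshold to bytes: either an absolute
// value, or a percentage of the observed capacity (simplified sketch).
func thresholdQuantity(absolute int64, percentage float64, capacity int64) int64 {
	if percentage > 0 {
		return int64(percentage * float64(capacity))
	}
	return absolute
}

func main() {
	capacity := int64(10) << 30 // a 10Gi filesystem

	// a threshold like "nodefs.available<10%" on a 10Gi disk resolves to 1Gi
	fmt.Println(thresholdQuantity(0, 0.10, capacity))

	// a threshold like "memory.available<100Mi" is already absolute
	fmt.Println(thresholdQuantity(int64(100)<<20, 0, capacity))
}
```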
Record the first time each eviction signal was observed, and map the eviction signals to node conditions:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    now := m.clock.Now()
    // record when each eviction signal was first observed, defaulting to now
    thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
    // the set of node conditions that are triggered by currently observed thresholds
    // kubelet maps each eviction signal to the corresponding node condition
    nodeConditions := nodeConditions(thresholds)
    if len(nodeConditions) > 0 {
        klog.V(3).Infof("eviction manager: node conditions - observed: %v", nodeConditions)
    }
    ...
}
```
nodeConditions
```go
func nodeConditions(thresholds []evictionapi.Threshold) []v1.NodeConditionType {
    results := []v1.NodeConditionType{}
    for _, threshold := range thresholds {
        if nodeCondition, found := signalToNodeCondition[threshold.Signal]; found {
            // skip node conditions already present in results
            if !hasNodeCondition(results, nodeCondition) {
                results = append(results, nodeCondition)
            }
        }
    }
    return results
}
```
The nodeConditions method simply maps each signal through signalToNodeCondition, which is defined as:

```go
signalToNodeCondition = map[evictionapi.Signal]v1.NodeConditionType{
    evictionapi.SignalMemoryAvailable:            v1.NodeMemoryPressure,
    evictionapi.SignalAllocatableMemoryAvailable: v1.NodeMemoryPressure,
    evictionapi.SignalImageFsAvailable:           v1.NodeDiskPressure,
    evictionapi.SignalNodeFsAvailable:            v1.NodeDiskPressure,
    evictionapi.SignalImageFsInodesFree:          v1.NodeDiskPressure,
    evictionapi.SignalNodeFsInodesFree:           v1.NodeDiskPressure,
    evictionapi.SignalPIDAvailable:               v1.NodePIDPressure,
}
```
In other words, the eviction signals collapse into the MemoryPressure, DiskPressure, or PIDPressure node conditions.
Merge this round's node conditions with the previously observed ones, keeping the most recent timestamps:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // merge this round's node conditions with the last observed ones, keeping the latest
    nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
    ...
}
```
Prevent node conditions from flapping when resources oscillate around the threshold:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // PressureTransitionPeriod defaults to 5 minutes
    // this keeps node conditions from flapping when resources oscillate around the threshold
    // see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#oscillation-of-node-conditions
    nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    if len(nodeConditions) > 0 {
        klog.V(3).Infof("eviction manager: node conditions - transition period not met: %v", nodeConditions)
    }
    ...
}
```
nodeConditionsObservedSince
```go
func nodeConditionsObservedSince(observedAt nodeConditionsObservedAt, period time.Duration, now time.Time) []v1.NodeConditionType {
    results := []v1.NodeConditionType{}
    for nodeCondition, at := range observedAt {
        duration := now.Sub(at)
        if duration < period {
            results = append(results, nodeCondition)
        }
    }
    return results
}
```
Conditions last observed more than 5 minutes ago are filtered out.
Apply the eviction-soft grace periods:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // keep only the thresholds whose eviction-soft-grace-period has elapsed
    thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    ...
}
```
thresholdsMetGracePeriod
```go
func thresholdsMetGracePeriod(observedAt thresholdsObservedAt, now time.Time) []evictionapi.Threshold {
    results := []evictionapi.Threshold{}
    for threshold, at := range observedAt {
        duration := now.Sub(at)
        // soft eviction thresholds only trigger after their grace period has passed
        if duration < threshold.GracePeriod {
            klog.V(2).Infof("eviction manager: eviction criteria not yet met for %v, duration: %v", formatThreshold(threshold), duration)
            continue
        }
        results = append(results, threshold)
    }
    return results
}
```
Update internal state, then compare against the previous round:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // update internal state
    m.Lock()
    m.nodeConditions = nodeConditions
    m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    m.thresholdsMet = thresholds
    // keep only the thresholds whose stats have been updated since the last round
    thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
    // remember this round's observations as the baseline for the next round
    m.lastObservations = observations
    m.Unlock()
    ...
}
```
Sort the thresholds and pick the first one to reclaim, along with its resource:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // if no eviction signals remain, this round ends here
    if len(thresholds) == 0 {
        klog.V(3).Infof("eviction manager: no resources are starved")
        return nil
    }
    // sort, then take the first element of the thresholds set
    sort.Sort(byEvictionPriority(thresholds))
    thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
    if !foundAny {
        return nil
    }
    ...
}
```
getReclaimableThreshold
```go
func getReclaimableThreshold(thresholds []evictionapi.Threshold) (evictionapi.Threshold, v1.ResourceName, bool) {
    // walk the thresholds, mapping each eviction signal to its resource
    for _, thresholdToReclaim := range thresholds {
        if resourceToReclaim, ok := signalToResource[thresholdToReclaim.Signal]; ok {
            return thresholdToReclaim, resourceToReclaim, true
        }
        klog.V(3).Infof("eviction manager: threshold %s was crossed, but reclaim is not implemented for this threshold.", thresholdToReclaim.Signal)
    }
    return evictionapi.Threshold{}, "", false
}
```
Here is how signalToResource is defined:

```go
signalToResource = map[evictionapi.Signal]v1.ResourceName{
    evictionapi.SignalMemoryAvailable:            v1.ResourceMemory,
    evictionapi.SignalAllocatableMemoryAvailable: v1.ResourceMemory,
    evictionapi.SignalImageFsAvailable:           v1.ResourceEphemeralStorage,
    evictionapi.SignalImageFsInodesFree:          resourceInodes,
    evictionapi.SignalNodeFsAvailable:            v1.ResourceEphemeralStorage,
    evictionapi.SignalNodeFsInodesFree:           resourceInodes,
    evictionapi.SignalPIDAvailable:               resourcePids,
}
```
signalToResource groups the eviction signals into memory, ephemeral-storage, inodes, and pids.
Reclaim node-level resources
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // try to reclaim node-level resources first
    if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
        klog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
        return nil
    }
    ...
}
```
**reclaimNodeLevelResources**
```go
func (m *managerImpl) reclaimNodeLevelResources(signalToReclaim evictionapi.Signal, resourceToReclaim v1.ResourceName) bool {
    // look up the reclaim functions registered in buildSignalToNodeReclaimFuncs
    nodeReclaimFuncs := m.signalToNodeReclaimFuncs[signalToReclaim]
    for _, nodeReclaimFunc := range nodeReclaimFuncs {
        // delete unused images, or dead pods and containers
        if err := nodeReclaimFunc(); err != nil {
            klog.Warningf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
        }
    }
    // re-check resource usage after reclaiming; if no threshold is still met, we are done
    if len(nodeReclaimFuncs) > 0 {
        summary, err := m.summaryProvider.Get(true)
        if err != nil {
            klog.Errorf("eviction manager: failed to get summary stats after resource reclaim: %v", err)
            return false
        }
        observations, _ := makeSignalObservations(summary)
        debugLogObservations("observations after resource reclaim", observations)
        thresholds := thresholdsMet(m.config.Thresholds, observations, false)
        debugLogThresholdsWithObservation("thresholds after resource reclaim - ignoring grace period", thresholds, observations)
        if len(thresholds) == 0 {
            return true
        }
    }
    return false
}
```
First, the reclaim functions for the starved signal are looked up in signalToNodeReclaimFuncs; they were registered in buildSignalToNodeReclaimFuncs above, e.g.:
```
nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
```
These functions call the corresponding GC routines, deleting unused containers and images to free resources.
It then checks whether any threshold is still exceeded after reclaiming; if not, this round ends here.
Look up the ranking function and sort the pods
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // fetch the eviction signal's ranking function, registered in buildSignalToRankFunc
    rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
    if !ok {
        klog.Errorf("eviction manager: no ranking function for signal %s", thresholdToReclaim.Signal)
        return nil
    }
    // nothing to do if there are no active pods
    if len(activePods) == 0 {
        klog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
        return nil
    }
    // rank the pods by the starved resource
    rank(activePods, statsFunc)
    ...
}
```
Evict a pod from the sorted order and return it:

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    for i := range activePods {
        pod := activePods[i]
        gracePeriodOverride := int64(0)
        if !isHardEvictionThreshold(thresholdToReclaim) {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
        // kill the pod
        if m.evictPod(pod, gracePeriodOverride, message, annotations) {
            metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
            return []*v1.Pod{pod}
        }
    }
    ...
}
```
As soon as one pod has been evicted, synchronize returns.
And that concludes the eviction manager.
Summary
This article looked at how resource control works. The limit and request settings influence a pod's eviction priority, so setting sensible limits and requests makes your pods less likely to be killed. Walking through the source showed how limit and request determine the QoS class and OOM score, which in turn determine how readily a pod is killed.
We then traced through the source how k8s defines eviction thresholds and on what criteria pods are killed when resources run low, which took up most of the article. The source also shows how much care eviction takes: how node-condition oscillation is handled, which kinds of resources are reclaimed first, how minimum-reclaim is implemented, and so on.
Reference
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
https://zhuanlan.zhihu.com/p/38359775
https://cloud.tencent.com/developer/article/1097431