StableDiffusion

Models

Checkpoint

WebUI

# ubuntu 22.04
sudo apt install libgl1 libglib2.0-0 libgoogle-perftools4 libtcmalloc-minimal4
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui
# switch to a specific release
git checkout -b v1-10-1 tags/v1.10.1
./webui.sh
# chinese interface
https://github.com/VinsonLaro/stable-diffusion-webui-chinese
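Startup options go in webui-user.sh, which webui.sh sources at launch. A minimal sketch; the flags shown are common AUTOMATIC1111 options, so verify them against your version:

```shell
# webui-user.sh -- sourced by webui.sh at startup.
# --listen binds 0.0.0.0 so other LAN hosts can reach the UI;
# --xformers enables memory-efficient attention (requires the xformers package);
# --medvram lowers VRAM usage at some speed cost.
export COMMANDLINE_ARGS="--listen --xformers --medvram"
```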

Using CUDA in WSL2

Installation

Install a recent driver on Windows. No driver is needed inside WSL2; just install the cuda-toolkit directly.

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-wsl-ubuntu-12-8-local_12.8.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-8-local_12.8.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
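After installation the toolkit lands under /usr/local/cuda-12.8, which is not on PATH by default. A quick check (a sketch; adjust the version suffix to your install):

```shell
# Add the toolkit to the current shell's PATH (assumes the 12.8 install above).
export PATH=/usr/local/cuda-12.8/bin:$PATH
# Print the compiler release line if the toolkit is visible.
command -v nvcc >/dev/null && nvcc --version | tail -n 1 || echo "nvcc not on PATH"
```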

Using nvidia-smi

sudo ln -s /usr/lib/wsl/lib/nvidia-smi /usr/local/bin/nvidia-smi

Uninstall

sudo apt --purge remove cuda-toolkit-12-8
dpkg -l | grep cuda
sudo dpkg --purge cuda-repo-wsl-ubuntu-12-8-local cuda-toolkit-12-8-config-common cuda-toolkit-12-config-common cuda-toolkit-config-common cuda-visual-tools-12-8

Run the check command again to confirm that no CUDA-related packages remain on the system.

dpkg -l | grep cuda

nvidia-container-toolkit

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-the-nvidia-container-toolkit

References

https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl
https://docs.nvidia.com/cuda/wsl-user-guide/index.html#wsl-2-support-constraints
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local


Replacing docker-ce with nerdctl

nerdctl is a Docker-CLI-compatible command-line tool for managing containerd containers.

nerdctl

Rootful mode is used here.

mkdir nerdctl
wget -c https://github.com/containerd/nerdctl/releases/download/v1.7.7/nerdctl-1.7.7-linux-amd64.tar.gz
tar -xzvf nerdctl-1.7.7-linux-amd64.tar.gz -C nerdctl
sudo cp -a nerdctl/nerdctl /usr/local/bin/

Create a docker alias command that points to nerdctl.

# sudo vim /usr/local/bin/docker
# sudo chmod +x /usr/local/bin/docker


#!/bin/bash
COMMAND="nerdctl"
if [[ $EUID -ne 0 ]]; then
    sudo $COMMAND "$@"
else
    $COMMAND "$@"
fi

containerd

Reuse k3s's containerd by pointing nerdctl.toml at its containerd socket.

mkdir /etc/nerdctl


cat << EOF > /etc/nerdctl/nerdctl.toml
debug          = false
debug_full     = false
address        = "unix:///run/k3s/containerd/containerd.sock"
namespace      = "k8s.io"
cni_path       = "/var/lib/nerdctl/cni/bin"
cni_netconfpath = "/var/lib/nerdctl/cni/net.d"
EOF
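Before relying on this configuration, it is worth confirming that the k3s containerd socket actually exists; a small guarded check:

```shell
# The socket path must match the `address` field in nerdctl.toml.
SOCK=/run/k3s/containerd/containerd.sock
if [ -S "$SOCK" ]; then
    echo "k3s containerd socket found"
else
    echo "socket missing - is k3s running?"
fi
```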

At this point docker ps/exec/image commands work. If you only need to manage k8s/k3s containers, you can stop here.

For the full docker experience, with docker run/build support, you also need to add the CNI plugins and BuildKit.

cni plugin

To create and start containers with docker run, the CNI plugins must be configured, so that nerdctl, when creating a container, can tell containerd where to find the CNI configuration for the container's network.

The nerdctl.toml above already sets the paths; install the CNI plugins into the matching directory.

cni_path       = "/var/lib/nerdctl/cni/bin"
cni_netconfpath = "/var/lib/nerdctl/cni/net.d"
mkdir cni
wget -c https://github.com/containernetworking/plugins/releases/download/v1.5.1/cni-plugins-linux-amd64-v1.5.1.tgz
tar -xzvf cni-plugins-linux-amd64-v1.5.1.tgz -C cni
sudo mkdir -p /var/lib/nerdctl/cni/bin
sudo cp cni/* /var/lib/nerdctl/cni/bin/
# docker run --rm -ti alpine sh
# docker network ls
NETWORK ID      NAME      FILE
17f29b073143    bridge    /var/lib/nerdctl/cni/net.d/nerdctl-bridge.conflist
                host
                none

This shows that, with no extra configuration, nerdctl creates a default bridge network the first time docker run is executed.

# ifconfig
nerdctl0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 10.4.0.1  netmask 255.255.255.0  broadcast 10.4.0.255
        inet6 fe80::a096:c4ff:fe7e:a72f  prefixlen 64  scopeid 0x20<link>
        ether a2:96:c4:7e:a7:2f  txqueuelen 1000  (Ethernet)
        RX packets 15  bytes 1000 (1000.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13  bytes 1494 (1.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Checking the host network, a new nerdctl0 bridge has appeared, the same experience as Docker's docker0.
Note that nerdctl.toml deliberately places the CNI paths under /var/lib/nerdctl: the defaults are /opt/cni/bin and /etc/cni/net.d, and if k8s on the host uses those same paths, the two setups could interfere.

buildkit

Install buildctl and buildkitd.

mkdir buildkit
wget -c https://github.com/moby/buildkit/releases/download/v0.15.2/buildkit-v0.15.2.linux-amd64.tar.gz
tar -xzvf buildkit-v0.15.2.linux-amd64.tar.gz -C buildkit
sudo cp buildkit/bin/buildctl /usr/local/bin/
sudo cp buildkit/bin/buildkitd /usr/local/bin

Configure the buildkitd worker to use k3s's containerd.

# mkdir /etc/buildkit
# vim /etc/buildkit/buildkitd.toml

[worker.oci]
  enabled = false

[worker.containerd]
  enabled = true
  namespace = "k8s.io"
  address = "/run/k3s/containerd/containerd.sock"

Configure buildkitd as a systemd service.

# vim /etc/systemd/system/buildkit.service
# systemctl start buildkit.service
# systemctl enable buildkit.service


[Unit]
Description=BuildKit
After=network.target local-fs.target

[Service]
#uncomment to enable the experimental sbservice (sandboxed) version of containerd/cri integration
#Environment="ENABLE_CRI_SANDBOXES=sandboxed"
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/buildkitd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment out TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
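Once the service is up, buildctl can confirm that the containerd worker is registered. A guarded sketch (`buildctl debug workers` lists the active workers):

```shell
# Lists buildkitd workers when buildkit is installed and running;
# prints a note otherwise instead of failing.
if command -v buildctl >/dev/null; then
    sudo -n buildctl debug workers 2>/dev/null || echo "run as root: buildctl debug workers"
else
    echo "buildctl not installed"
fi
```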

In rootful mode, nerdctl communicates with buildkitd over /run/buildkit/buildkitd.sock.

multi-platform

To add cross-platform support to docker run/build/image, first register binfmt_misc handlers backed by QEMU on the host; the tonistiigi/binfmt project can do this.

When the host encounters a binary for another platform, binfmt_misc automatically invokes QEMU (a container process is also a host process, so this works inside containers too).

# docker run --privileged --rm tonistiigi/binfmt:master --install all


# ls -1 /proc/sys/fs/binfmt_misc/qemu*
/proc/sys/fs/binfmt_misc/qemu-aarch64
/proc/sys/fs/binfmt_misc/qemu-arm
/proc/sys/fs/binfmt_misc/qemu-mips64
/proc/sys/fs/binfmt_misc/qemu-mips64el
/proc/sys/fs/binfmt_misc/qemu-ppc64le
/proc/sys/fs/binfmt_misc/qemu-riscv64
/proc/sys/fs/binfmt_misc/qemu-s390x


# docker run --rm --platform=arm64 alpine uname -a
Linux 20910835f3ab 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 8 17:47:08 UTC 2024 aarch64 Linux

summary

The combination above of nerdctl, containerd, the CNI plugins, and BuildKit reproduces the docker-ce experience. For k8s/k3s development and debugging, sharing the cluster's containerd is very convenient.


Playing with CPU benchmarks

Environment

win10/hyperv/intel i5 4590T
win11/hyperv/amd ryzen 7 5825u

The time command
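The runs below wrap a CPU-bound job between two date calls and time it. The actual workload is not shown in the post, so the pattern is sketched here with a stand-in job:

```shell
# Measurement pattern for the results below: timestamp, timed workload,
# timestamp. The sha256 pipeline is only a stand-in for the real workload.
date
time dd if=/dev/zero bs=1M count=64 2>/dev/null | sha256sum > /dev/null
date
```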

Results

## intel 1C1T

Fri Sep  6 10:45:28 CST 2024
162.27s user 7.78s system 97% cpu 2:54.08 total
Fri Sep  6 10:48:22 CST 2024

## intel 2C2T

Fri Sep  6 11:44:51 CST 2024
165.11s user 9.07s system 184% cpu 1:34.64 total
Fri Sep  6 11:46:25 CST 2024

## intel 4C4T

Fri Sep  6 10:51:47 CST 2024
181.02s user 12.18s system 351% cpu 54.935 total
Fri Sep  6 10:52:42 CST 2024

## amd 1C1T (SMT disabled)

Thu Sep  5 22:58:40 EDT 2024
80.03s user 6.29s system 95% cpu 1:30.80 total
Thu Sep  5 23:00:10 EDT 2024

## amd 1C2T

Thu Sep  5 23:04:24 EDT 2024
92.03s user 7.57s system 179% cpu 55.421 total
Thu Sep  5 23:05:19 EDT 2024

## amd 2C4T

Thu Sep  5 22:54:27 EDT 2024
114.52s user 9.26s system 355% cpu 34.789 total
Thu Sep  5 22:55:01 EDT 2024

## amd 4C8T

Thu Sep  5 23:29:44 EDT 2024
144.99s user 13.63s system 596% cpu 26.607 total (can't even saturate 600% CPU?)
Thu Sep  5 23:30:10 EDT 2024

## amd 8C16T

Thu Sep  5 23:34:57 EDT 2024
210.34s user 21.47s system 933% cpu 24.830 total
Thu Sep  5 23:35:22 EDT 2024
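The totals above can be turned into speedup and efficiency numbers. For example, the intel 1C1T run (2:54.08 = 174.08s) against the 4C4T run (54.935s):

```shell
# Parallel speedup and efficiency from the intel wall-clock totals above.
awk 'BEGIN {
    t1 = 174.08; t4 = 54.935; n = 4
    s = t1 / t4
    printf "speedup: %.2fx, efficiency: %.1f%%\n", s, 100 * s / n
}'
# -> speedup: 3.17x, efficiency: 79.2%
```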

Is Hyper-V pinning cores?


Troubleshooting unreachable Flannel Pod IPs after a node network restart

The problem

My lab cluster runs k3s v1.25.7+k3s1 with the default flannel VXLAN network mode.

NAME          STATUS   ROLES                  AGE    VERSION
whoops-home   Ready    <none>                 253d   v1.25.7+k3s1
whoops-k3s    Ready    control-plane,master   268d   v1.25.7+k3s1

The node whoops-k3s runs CentOS 7.9. While tinkering with the network I changed the ifcfg-eth0 config and ran systemctl restart network. I didn't notice anything at first; the next day, nginx requests into the cluster were returning 504.

This small cluster runs at home on a NAS and a soft router. whoops-home hosts an nginx exposed through an intranet tunnel, and all cluster traffic enters through that nginx and then the traefik ingress controller.

The path looks roughly like this:

域名流量->nginx(whoops-home)->traefik(nodeport)->pod

Nginx returning 504 means nginx can't reach traefik. nginx sits on whoops-home and traefik on whoops-k3s. A quick check showed that pods on whoops-k3s cannot be pinged from whoops-home, while pods on whoops-home can be pinged from whoops-k3s.

So did the network restart break the flannel VXLAN on whoops-k3s?

Root cause analysis

First, check netfilter's FORWARD chain. No problem:

iptables -t filter -nvL | grep FORWARD
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)

Check the routes. Also fine: whoops-k3s owns the 10.42.0.0/24 subnet and whoops-home owns 10.42.1.0/24.

default via 192.168.31.1 dev eth0
10.42.0.0/24 dev cni0 proto kernel scope link src 10.42.0.1
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
...

Check the FDB? That shouldn't be it: whoops-k3s can ping pods on whoops-home, and if the ARP/FDB entries were broken, how would those packets get out? A look confirms there's indeed no problem.

sudo bridge fdb show | grep flannel.1
# 192.168.31.166 is whoops-home
d6:f8:ec:13:3c:07 dev flannel.1 dst 192.168.31.166 self permanent

VXLAN traffic goes through the flannel.1 device, which is created by flanneld. Did the network restart affect it? But if its configuration were wrong, shouldn't flanneld detect that and fix it on its own? Let's try restarting k3s.

systemctl restart k3s

Fixed! Restarting k3s/flanneld worked. Because flannel.1 got reconfigured? I reproduced the issue and specifically checked flannel.1: the subnet configuration, routes, ARP entries, and device state were identical before and after the network restart, so whether or not flannel.1 gets reconfigured shouldn't matter. Besides, flanneld shouldn't be so fragile that restarting the node's network service takes down the whole node's pod network.

flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.42.0.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::b45b:f1ff:fe24:82e4  prefixlen 64  scopeid 0x20<link>
        ether b6:5b:f1:24:82:e4  txqueuelen 0  (Ethernet)
        RX packets 237089  bytes 37524217 (35.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 174244  bytes 32444345 (30.9 MiB)
        TX errors 29  dropped 5 overruns 0  carrier 29  collisions 0

Time to capture packets!

tcpdump -i flannel.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
22:28:45.852513 IP 10.42.1.0 > 10.42.0.210: ICMP echo request, id 12, seq 48, length 64
22:28:46.872698 IP 10.42.1.0 > 10.42.0.210: ICMP echo request, id 12, seq 49, length 64
22:28:47.896618 IP 10.42.1.0 > 10.42.0.210: ICMP echo request, id 12, seq 50, length 64
22:28:49.851093 IP 10.42.1.0 > 10.42.0.210: ICMP echo request, id 13, seq 1, length 64
22:28:50.872665 IP 10.42.1.0 > 10.42.0.210: ICMP echo request, id 13, seq 2, length 64
22:28:51.896473 IP 10.42.1.0 > 10.42.0.210: ICMP echo request, id 13, seq 3, length 64

The ping packets from whoops-home reach flannel.1 but are never delivered to the pod's interface. Where are they dropped? What exactly did the network restart do? Do I really have to study how the kernel implements VXLAN, or debug the flannel.1 device?

network.service

Read the /etc/rc.d/init.d/network script directly to see what a network restart actually does.

start)
    apply_sysctl
...
stop)
   sysctl -w net.ipv4.ip_forward=0 > /dev/null 2>&1
...
restart|force-reload)
    cd "$CWD"
    $0 stop
    $0 start
...

There it is: stop disables net.ipv4.ip_forward, so of course flannel.1 no longer forwards packets to the pod interfaces. Restarting k3s/flanneld set the parameter back at runtime, which is why that fixed things. Reproducing it confirmed the theory.
I don't understand why network.service stop disables this, but shouldn't start at least restore the previous value? Sorry: a sysctl that isn't persisted is never restored by start's apply_sysctl.
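The failure mode can be observed directly: the forwarding switch lives in /proc (equivalent to the sysctl), and reading it is harmless:

```shell
# Read the current forwarding state; network.service stop set this to 0,
# which is exactly what broke flannel.1's forwarding to the pod interfaces.
cat /proc/sys/net/ipv4/ip_forward
# Reproduce the outage (as root):  echo 0 > /proc/sys/net/ipv4/ip_forward
# Restore connectivity:            echo 1 > /proc/sys/net/ipv4/ip_forward
```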

The fix

echo net.ipv4.ip_forward=1 >> /etc/sysctl.conf
sysctl -p

Verified: restarting the network no longer affects pod connectivity!

Summary

I had studied flannel's VXLAN mode in detail before; hitting this case deepened that understanding.
