内核驱动开发记录

前言

如果你刚接触Linux内核驱动开发,那么这篇博客应该对你有所帮助!祝你好运

推荐阅读:
《C和指针》 《C专家编程》 《C陷阱与缺陷》
《Linux设备驱动程序》 《linux内核设计与实现》 《深入理解linux内核》

《Linux内核源代码情景分析》

《Debug Hacks中文版—深入调试的技术和工具》
第一要义:学会放弃
第二要义:不要修改代码屎山
第三要义:遇到无法解决的问题/bug,备份代码后重构代码
第四要义:若bug实在无法解决,尝试不同的实现方式,不过于追求简洁与优雅

相关驱动:
块设备驱动,网卡驱动
内核版本:5.4/4.19

参考驱动代码:

nvme驱动 r8169网卡驱动

驱动开发背景:
块设备驱动特点:一次只能发送一条指令,需同时实现协议栈请求与自定义ioctl请求处理
网卡驱动特点:驱动移植,将以字符设备实现的网卡驱动移植为以网络设备实现的网卡驱动,将用户态代码移植至内核态

一:银河麒麟操作系统+飞腾处理器

内核切换后显卡驱动失效问题
内核切换后网卡驱动失效问题
内核切换后glibc冲突问题
networking restart失败
内核切换后安装常见软件均失败

切换内核时make install显示:

Error! Bad return status for module build on kernel:4.19.0(aarch64)
Consult /var/lib/dkms/nvidia/460.32.03/build/make.log for more information.

解决:
删除 /var/lib/dkms/nvidia整个文件夹即可

update-grub时未发现刚install的内核,只有原系统的image
解决:编译出的内核Image未被安装至/boot目录。将编译目录 arch/arm64/boot 下的 Image 复制至 /boot,并按对应格式重命名。不过/boot目录下原系统同时有Image和uImage两个文件,而arm64目录下只有一个Image,于是直接复制两份,分别命名为Image和uImage。

国产操作系统特有的问题带来的麻烦——此处省略一万字

二:用户空间访问问题

块设备驱动提供ioctl接口供用户直接向特定扇区读写数据,在ioctl系统调用参数中传递用户缓冲区的地址。ioctl内核处理函数中直接将请求打包成相应的命令,加入SSD请求队列等待执行。由于未实现中断功能,利用定时器定时轮询对应位置查看命令执行情况并在设备空闲时发送请求。此时调用copy_from_user函数失败。这是因为定时器回调函数是异步执行的,它类似于一种“软件中断”,而且是处于非进程的上下文中,所以回调函数必须遵守以下规则:

  1. 没有 current 指针、不允许访问用户空间。因为没有进程上下文,相关代码和被中断的进程没有任何联系。

  2. 不能执行休眠(或可能引起休眠的函数)和调度。

  3. 任何被访问的数据结构都应该针对并发访问进行保护,以防止竞争条件。

参考博客:Linux内核timer执行上下文、内核定时器的使用(好几个例子add_timer)

解决:
如果是写命令,在用户空间上下文中将数据拷贝到额外申请的内核缓冲区(kmalloc而不是dma_alloc_coherent),再将内核缓冲区的地址写入命令。如果是读命令,中断上下文中先将读到的数据拷贝到内核缓冲区,用户上下文中再将数据写入用户缓冲区。
ioctl请求的一般执行过程为:①解析参数 ②申请内核缓冲区存放数据 ③ 将请求封装成命令,加入命令队列 ④等待命令执行完成,并执行必要的数据传输工作。
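下面给一个最小化的示意(其中 hps_dev、hps_queue_cmd 等结构与函数名均为假设),演示写命令如何在进程上下文里完成用户空间拷贝,之后定时器/轮询逻辑只接触内核缓冲区:

#include <linux/slab.h>
#include <linux/uaccess.h>

/* 示意:写命令的ioctl处理,用户空间拷贝只发生在进程上下文 */
static long hps_ioctl_write(struct hps_dev *dev, void __user *uptr, size_t len)
{
	void *kbuf;
	long ret;

	kbuf = kmalloc(len, GFP_KERNEL);	/* 进程上下文,可睡眠 */
	if (!kbuf)
		return -ENOMEM;

	if (copy_from_user(kbuf, uptr, len)) {	/* 只能在进程上下文调用 */
		kfree(kbuf);
		return -EFAULT;
	}

	ret = hps_queue_cmd(dev, kbuf, len);	/* 封装成命令加入命令队列,由轮询逻辑异步下发 */
	if (ret)
		kfree(kbuf);
	return ret;
}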

三:模块卸载出错

有时候编写代码时并不会检查函数返回值等信息:资源并没有申请成功,.remove函数中却照常进行了资源释放;或者申请了资源却忘了在remove函数中释放,便会造成空指针、死机等一系列后果。
例如在.probe函数中注册了定时器却忘了注销、注册块设备失败却照常注销块设备、get_device与put_device数目不一致等。

add_timer(&hps_dev->timer);                         //注册定时器
del_timer(&hps_dev->timer); //注销定时器
device_add_disk(hps_dev->dev, hps_dev->disk, hps_attr_groups); // 注册块设备
del_gendisk(hps_dev->disk); // 移除块设备
hps_dev->dev = get_device(&pdev->dev); // 增加设备计数
put_device(hps_dev->dev); // 减小设备计数

所以比较安全的方式就是增加出错处理代码或在资源释放时判断是否持有。

ret = request();
if (ret != 0) {
	// 出错处理
}

if (hold(p))
	release(p);
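更进一步,probe中常见的写法是用goto标签集中做回滚,只释放已经申请成功的资源。下面是一个示意(hps_major、hps_dev、hps_timer_fn等名字均为假设):

static int hps_probe(struct platform_device *pdev)
{
	int ret;

	ret = register_blkdev(0, "hps");
	if (ret < 0)
		return ret;		/* 第一步就失败,直接返回,无需回滚 */
	hps_major = ret;

	hps_dev->disk = alloc_disk(1);
	if (!hps_dev->disk) {
		ret = -ENOMEM;
		goto err_unregister;	/* 只回滚前面已经成功的register_blkdev */
	}

	timer_setup(&hps_dev->timer, hps_timer_fn, 0);
	mod_timer(&hps_dev->timer, jiffies + HZ);
	return 0;

err_unregister:
	unregister_blkdev(hps_major, "hps");
	return ret;
}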

四:DMA缓冲区大小问题

dma_alloc_coherent函数一次申请超过4M的空间便会报错,可以配置CMA使DMA能够申请更大的空间,这可能需要重新编译内核(如果在高版本内核的宿主机上编译,建议先通过deb安装包切换至相近版本的内核,再通过源码编译内核;跨大版本源码编译过程中可能会出现一些奇奇怪怪的问题)。

dev->data_buf = dma_alloc_coherent(dev->dev, MAX_DATA_SIZE, &dev->data_buf_dma_addr, GFP_KERNEL);

参考博客:
dma_alloc_coherent 申请内存用法和问题总结
在Linux内核模块中使用CMA内存分配

大致步骤如下:

cat /boot/config-$(uname -r) | grep DMA_CMA  # 查看当前内核是否开启了CMA功能
# 如果不存在则需要重新编译内核,建议配置内核时顺带勾选kgdb选项
# CONFIG_DMA_CMA选项在menuconfig中找不到时,可直接通过vim .config修改,也可将其他CMA选项打开
# 编译内核时会选择CMA空间大小,后面也可以在grub配置文件中修改相应大小
cat /proc/meminfo # 查看是否修改成功
CmaTotal: 1048576 kB
CmaFree: 1048576 kB
dmesg | grep cma # 显示内核启动时cma相关信息
配置成功后就能申请更大的连续物理内存

五:linux内存页大小问题

在用户程序中使用mmap系统调用,传入的映射长度为32k,可是内核mmap处理函数中vm_area的长度却显示为64k。

void* mmap(void* start,size_t length,int prot,int flags,int fd,off_t offset);
int munmap(void* start,size_t length);

解决:

getconf PAGE_SIZE # 查看当前页大小
65536

mmap必须以PAGE_SIZE为单位进行映射,内存也只能以页为单位进行映射;若要映射非PAGE_SIZE整数倍的地址范围,要先进行内存对齐,强行以PAGE_SIZE的整数倍进行映射。由于传入的32k小于页大小64k,内核将映射长度向上取整为一页(64k)。故重新编译内核,在menuconfig中修改page size。
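对齐这一步在内核里可以直接用PAGE_ALIGN宏完成,下面只是一个示意(hps_align_len为假设的辅助函数):

#include <linux/mm.h>	/* PAGE_SIZE / PAGE_MASK / PAGE_ALIGN */

/* 示意:把映射长度向上对齐到页边界,页大小64k时32k会被对齐成64k */
static unsigned long hps_align_len(unsigned long len)
{
	return PAGE_ALIGN(len);	/* 等价于 (len + PAGE_SIZE - 1) & PAGE_MASK */
}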

六:网络不稳定,同属于一个局域网却不能ping通

通过打印内核启动时的信息可以发现网卡在一直自动协商连接速度,通过ethtool工具关闭自动协商功能即可。

# 将网卡eth2速度固定1000M,双工,且关闭自动协商
sudo ethtool -s eth2 speed 1000 duplex full autoneg off

Linux 关闭链路自动协商功能

七:BUG: scheduling while atomic

在定时器的回调函数中调用msleep函数,造成BUG: scheduling while atomic错误,输出一长串的警告(dump_stack、backtrace之类的)。这是因为定时器回调函数处于原子上下文,不允许调度,调用休眠函数后就会引发这个问题。但很奇怪的是,网卡驱动中的start_xmit函数中调用休眠函数也会报这个错误,这个函数也处于原子上下文吗?实际上是的:start_xmit在持有发送队列锁、关闭软中断的情况下被调用,同样不允许休眠。可以通过preempt_count判断内核当前处于原子上下文还是进程上下文。

如果是分配内存,可以将GFP_KERNEL标志改成GFP_ATOMIC
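一个简单的对照示意(出错处理方式因所在函数而定,返回值仅作演示):

void *buf;

/* 进程上下文:可以睡眠等待内存回收 */
buf = kmalloc(len, GFP_KERNEL);

/* 原子上下文(定时器回调、start_xmit等):不允许睡眠,失败概率更高,必须处理NULL */
buf = kmalloc(len, GFP_ATOMIC);
if (!buf)
	return NETDEV_TX_BUSY;	/* 示意:具体返回值取决于所在函数 */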

内存申请 GFP_KERNEL GFP_ATOMIC
BUG: scheduling while atomic 分析
用户空间与内核空间,进程上下文与中断上下文
Linux中的preempt_count

八:设备名混淆错误

在参照r8169驱动代码编写网卡驱动时,误以为dev变量为struct device类型的通用设备,后面才发现为struct net_device类型,但内核中全用dev来命名,比较容易混淆。最奇怪的是,一个用来解析协议类型的函数竟然顺带对skb的dev赋值,使得这个错误比较难被发现。

skb->protocol = eth_type_trans(skb, dev);
__be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
{
unsigned short _service_access_point;
const unsigned short *sap;
const struct ethhdr *eth;

skb->dev = dev;
skb_reset_mac_header(skb);

eth = (struct ethhdr *)skb->data;
skb_pull_inline(skb, ETH_HLEN);

if (unlikely(is_multicast_ether_addr_64bits(eth->h_dest))) {
if (ether_addr_equal_64bits(eth->h_dest, dev->broadcast))
skb->pkt_type = PACKET_BROADCAST;
else
skb->pkt_type = PACKET_MULTICAST;
}
else if (unlikely(!ether_addr_equal_64bits(eth->h_dest,
dev->dev_addr)))
skb->pkt_type = PACKET_OTHERHOST;

/*
* Some variants of DSA tagging don't have an ethertype field
* at all, so we check here whether one of those tagging
* variants has been configured on the receiving interface,
* and if so, set skb->protocol without looking at the packet.
*/
if (unlikely(netdev_uses_dsa(dev)))
return htons(ETH_P_XDSA);

if (likely(eth_proto_is_802_3(eth->h_proto)))
return eth->h_proto;

/*
* This is a magic hack to spot IPX packets. Older Novell breaks
* the protocol design and runs IPX over 802.3 without an 802.2 LLC
* layer. We look for FFFF which isn't a used 802.2 SSAP/DSAP. This
* won't work for fault tolerant netware but does for the rest.
*/
sap = skb_header_pointer(skb, 0, sizeof(*sap), &_service_access_point);
if (sap && *sap == 0xFFFF)
return htons(ETH_P_802_3);

/*
* Real 802.2 LLC
*/
return htons(ETH_P_802_2);
}

九:运算符优先级

#include <stdio.h>

int main() {
    int type = 0x80001;
    if (type & 0xf == 0x1) {
        printf("branch 1\n");
    } else if (type & 0xf == 0x2) {
        printf("branch 2\n");
    } else {
        printf("branch 3\n");
    }
    if ((type & 0xf) == 0x1) {
        printf("branch 1\n");
    } else if ((type & 0xf) == 0x2) {
        printf("branch 2\n");
    } else {
        printf("branch 3\n");
    }
    printf("%d %d\n", 0xf == 0x1, type & 0xf == 0x1);
    return 0;
}
// branch 3
// branch 1
// 0 0

相等运算符(==)的优先级高于按位与(&),故 type & 0xf == 0x1 实际等价于 type & (0xf == 0x1):先计算 0xf == 0x1 得到0,再与type按位与,结果恒为0,永远走不到期望的分支。

十:网卡驱动提供修改MTU接口

提供简单Set方法,并在探测函数中设置min_mtu max_mtu mtu值

// 示例代码 r8169.c
.ndo_change_mtu = rtl8169_change_mtu,
5855 static int rtl8169_change_mtu(struct net_device *dev, int new_mtu)
5856 {
5864 dev->mtu = new_mtu;
5867 return 0;
5868 }

7626 /* MTU range: 60 - hw-specific max */
7627 dev->min_mtu = ETH_ZLEN;
7629 dev->max_mtu = jumbo_max;

十一:收包与napi

由于网卡对于收包速度要求较高,且并没有实现中断功能,故只能通过轮询的方式收包。
有两种方式实现该功能:
1 使用kthread_run另开一个线程,持续查询是否有数据包到来
2 使用napi机制
在刚开始的时候我错误理解了napi poll的时机,采用的是定时器+napi的方式,每次在定时器过期时调用napi_schedule_irqoff函数,启动轮询,而在每次轮询中调用napi_complete函数并返回1。但问题是定时器中断频率太低,远远达不到性能需求。后面看了看napi相关文章才发现,只要轮询函数中每次返回的值为当前权值/额度,内核就会认为网卡还有消息包需要处理,会将该任务重新加入轮询队列,故没必要使用定时器,也没必要在轮询函数中调用napi_complete函数。
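按这个理解可以写出一个最小化的poll函数示意(hps_priv、hps_rx_one均为假设的结构与函数):只要本次处理量等于budget就直接返回budget,内核会把该napi重新放回轮询队列继续调用;只有处理量小于budget时才调用napi_complete_done退出轮询。

static int hps_napi_poll(struct napi_struct *napi, int budget)
{
	struct hps_priv *priv = container_of(napi, struct hps_priv, napi);
	int work_done = 0;

	while (work_done < budget) {
		struct sk_buff *skb = hps_rx_one(priv);	/* 假设:无包可收时返回NULL */
		if (!skb)
			break;
		skb->protocol = eth_type_trans(skb, priv->ndev);
		napi_gro_receive(napi, skb);
		work_done++;
	}

	if (work_done < budget)
		napi_complete_done(napi, work_done);
	return work_done;
}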
复制粘贴一些napi相关函数注释加深印象(源码在线查询网站:bootlin)

/**
* netif_napi_add() - initialize a NAPI context
* @dev: network device
* @napi: NAPI context
* @poll: polling function
* @weight: default weight
*
* netif_napi_add() must be used to initialize a NAPI context prior to calling
* *any* of the other NAPI-related functions.
*/
static inline void
netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight)

/**
* napi_enable - enable NAPI scheduling
* @n: NAPI context
*
* Resume NAPI from being scheduled on this context.
* Must be paired with napi_disable.
*/
void napi_enable(struct napi_struct *n)

/**
* napi_schedule - schedule NAPI poll
* @n: NAPI context
*
* Schedule NAPI poll routine to be called if it is not already
* running.
*/
static inline void napi_schedule(struct napi_struct *n)
{
if (napi_schedule_prep(n))
__napi_schedule(n);
}

/**
* napi_schedule_irqoff - schedule NAPI poll
* @n: NAPI context
*
* Variant of napi_schedule(), assuming hard irqs are masked.
*/
static inline void napi_schedule_irqoff(struct napi_struct *n)
{
if (napi_schedule_prep(n))
__napi_schedule_irqoff(n);
}

/**
* napi_schedule_prep - check if napi can be scheduled
* @n: napi context
*
* Test if NAPI routine is already running, and if not mark
* it as running. This is used as a condition variable to
* insure only one NAPI poll instance runs. We also make
* sure there is no pending NAPI disable.
*/
bool napi_schedule_prep(struct napi_struct *n)

/**
* napi_complete - NAPI processing complete
* @n: NAPI context
*
* Mark NAPI processing as complete.
* Consider using napi_complete_done() instead.
* Return false if device should avoid rearming interrupts.
*/
static inline bool napi_complete(struct napi_struct *n)
{
return napi_complete_done(n, 0);
}

static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
unsigned int length)
{
return __napi_alloc_skb(napi, length, GFP_ATOMIC);
}
/**
* __napi_alloc_skb - allocate skbuff for rx in a specific NAPI instance
* @napi: napi instance this buffer was allocated for
* @len: length to allocate
* @gfp_mask: get_free_pages mask, passed to alloc_skb and alloc_pages
*
* Allocate a new sk_buff for use in NAPI receive. This buffer will
* attempt to allocate the head from a special reserved region used
* only for NAPI Rx allocation. By doing this we can save several
* CPU cycles by avoiding having to disable and re-enable IRQs.
*
* %NULL is returned if there is no free memory.
*/
struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
gfp_t gfp_mask)

int netif_rx(struct sk_buff *skb);
int __netif_rx(struct sk_buff *skb);
int netif_receive_skb(struct sk_buff *skb);
int netif_receive_skb_core(struct sk_buff *skb);
void netif_receive_skb_list_internal(struct list_head *head);
void netif_receive_skb_list(struct list_head *head);
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);

/**
* netif_rx - post buffer to the network code
* @skb: buffer to post
*
* This function receives a packet from a device driver and queues it for
* the upper (protocol) levels to process via the backlog NAPI device. It
* always succeeds. The buffer may be dropped during processing for
* congestion control or by the protocol layers.
* The network buffer is passed via the backlog NAPI device. Modern NIC
* driver should use NAPI and GRO.
* This function can used from interrupt and from process context. The
* caller from process context must not disable interrupts before invoking
* this function.
*
* return values:
* NET_RX_SUCCESS (no congestion)
* NET_RX_DROP (packet was dropped)
*
*/
int netif_rx(struct sk_buff *skb)

/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
*
* netif_receive_skb() is the main receive data processing function.
* It always succeeds. The buffer may be dropped during processing
* for congestion control or by the protocol layers.
*
* This function may only be called from softirq context and interrupts
* should be enabled.
*
* Return values (usually ignored):
* NET_RX_SUCCESS: no congestion
* NET_RX_DROP: packet was dropped
*/
int netif_receive_skb(struct sk_buff *skb)


/**
* netif_receive_skb_core - special purpose version of netif_receive_skb
* @skb: buffer to process
*
* More direct receive version of netif_receive_skb(). It should
* only be used by callers that have a need to skip RPS and Generic XDP.
* Caller must also take care of handling if ``(page_is_)pfmemalloc``.
*
* This function may only be called from softirq context and interrupts
* should be enabled.
*
* Return values (usually ignored):
* NET_RX_SUCCESS: no congestion
* NET_RX_DROP: packet was dropped
*/
int netif_receive_skb_core(struct sk_buff *skb)

gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
gro_result_t ret;

skb_mark_napi_id(skb, napi);
trace_napi_gro_receive_entry(skb);

skb_gro_reset_offset(skb, 0);

ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
trace_napi_gro_receive_exit(ret);

return ret;
}

相关博客:
NAPI机制
Linux NAPI机制分析

十二:mac设置问题

在编写网卡驱动代码时,并没有仔细设置mac地址的值,在完成网卡驱动大致功能时,发现只能通过UDP进行通信(禁用arp时),而无法使用TCP进行通信。使用tcpdump -i 指定网卡 追踪网卡数据包时发现client向server发起第一次握手,但server迟迟没有回应,client一次次重复发送SYN数据包,最终TCP连接失败,同时也能看见一些ARP数据包显示oui unknown。使用arp -a显示当前ip与mac地址映射,发现对应IP地址的mac地址显示incomplete

问题原因:有时候设备随机设置的mac不是单播地址,属于无效的mac地址。需设置成有效的mac地址(例如复制现有网卡mac地址的前24位,后24位任意填写即可)。

组织唯一标识符(OUI)由IEEE(电气和电子工程师协会)分配给厂商,它包含24位。厂商再用剩下的24位(EUI,扩展唯一标识符)为其生产的每个网卡分配一个全球唯一的全局管理地址,一般来说大厂商都会购买多个OUI。

I/G(Individual/Group)位,如果I/G=0,则是某台设备的MAC地址,即单播地址;如果I/G=1,则是多播地址(组播+广播=多播)。

G/L(Global/Local,也称为U/L位,其中U表示Universal)位,如果G/L=0,则是全局管理地址,由IEEE分配;如果G/L=1,则是本地管理地址,是网络管理员为了加强自己对网络管理而指定的地址。
参考博客:
arp详细讲解
MAC地址格式详解
TCP/IP协议——ARP详解
linux中ifconfig命令详解
内核相关函数:


/**
* is_valid_ether_addr - Determine if the given Ethernet address is valid
* @addr: Pointer to a six-byte array containing the Ethernet address
*
* Check that the Ethernet address (MAC) is not 00:00:00:00:00:00, is not
* a multicast address, and is not FF:FF:FF:FF:FF:FF.
*
* Return true if the address is valid.
*
* Please note: addr must be aligned to u16.
*/
static inline bool is_valid_ether_addr(const u8 *addr)
{
/* FF:FF:FF:FF:FF:FF is a multicast address so we don't need to
* explicitly check for it here. */
return !is_multicast_ether_addr(addr) && !is_zero_ether_addr(addr);
}

/**
* eth_random_addr - Generate software assigned random Ethernet address
* @addr: Pointer to a six-byte array containing the Ethernet address
*
* Generate a random Ethernet address (MAC) that is not multicast
* and has the local assigned bit set.
*/
static inline void eth_random_addr(u8 *addr)
{
get_random_bytes(addr, ETH_ALEN);
addr[0] &= 0xfe; /* clear multicast bit */
addr[0] |= 0x02; /* set local assignment bit (IEEE802) */
}
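结合到驱动里,一个常见的做法是在probe时校验并兜底,下面是示意(ndev为假设的net_device指针):

/* 示意:确保网卡拥有合法的单播MAC地址 */
if (!is_valid_ether_addr(ndev->dev_addr))
	eth_hw_addr_random(ndev);	/* 随机生成:清多播位、置本地管理位 */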

十三:BAR基址寄存器与总线地址

摘录:
物理地址与总线地址

  1. 物理地址是与CPU相关的。在CPU的地址信号线上产生的就是物理地址。在程序指令中的虚拟地址经过段映射和页面映射后,就生成了物理地址,这个物理地址被放到CPU的地址线上。 (从CPU端看)
  2. 总线地址,顾名思义,是与总线相关的,就是总线的地址线或在地址周期上产生的信号。 外设使用的是总线地址。(从设备端看)
  3. 物理地址与总线地址之间的关系由系统的设计决定的。在x86平台上,物理地址与PCI总线地址是相同的。 在其他平台上,也许会有某种转换,通常是线性的转换。

对于处理器来说,虚拟地址 逻辑地址都是一个输入源,处理器对这些地址进行转换(比如利用MMU),转换为物理地址,真正处理器发出的地址是物理地址。

假如某个PCI设备具有DMA能力,要去操作RAM,这时该设备看到的RAM的地址就应该是由系统总线映射到PCI总线上的总线地址。
映射关系由PCI控制器地址窗口来配置,一般是一个偏移量,所以这时映射到PCI总线上的RAM的总线地址就不是RAM在处理器系统地址空间上的物理地址(也可以称为系统总线地址)了。
因此总线地址 != 物理地址。
当然PCI控制器地址窗口可以配置为平映射,这时总线地址就跟物理地址相同了。

TLP能根据地址被路由到对应设备的BAR空间中去。比如说现在有一个mem read request,如果路由地址(地址信息包含在TLP中)是0x71000000,而有一个设备func0的mem空间范围是0x70000000~0x80000000,那么这个TLP就会被这个func处理。从func0的0x71000000对应的地址读取相应数据。这就是TLP中的地址字段和BAR空间的地址之间的关系
TLP中的地址哪里来?ATU(Address Translation Unit)转换过来的。这个问题就是这么的简单。ATU是什么?是一个地址转换单元,负责将一段存储器域的地址转换到PCIe总线域地址,除了地址转换外,还能提供访问类型等信息,这些信息都是ATU根据总线上的信号自己做的,数据都打包到TLP中,不用软件参与。软件需要做的是配置ATU,所以如果ATU配置完成,并且能正常工作,那么CPU访问PCIe空间就和访问本地存储器空间方法是一样的,只要读写即可。

BAR寄存器数据的初始化
BAR寄存器的数据是怎么初始化,由谁进行初始化的?因为初始化的数据是PCIE设备所在的总线域的地址空间,所以肯定不会是EP自己进行初始化,因为如果这样EP是不知道其他PCIE设备对应的总线地址空间的,所以可能会引起总线地址空间的冲突,所以BAR寄存器的初始化是由内核进行初始化的,在系统开机时,内核会遍历查找各个PCIE设备,然后为PCIE设备分配对应的总线地址空间。
BAR寄存器存储的总线地址和应用程序内存地址的关系
BAR寄存器存储的总线地址,应用程序是不能直接利用的,应用程序首先要做的就是读出BAR寄存器的值,然后用mmap函数建立应用程序内存空间和总线地址空间的映射关系。这样应用程序往PCIE设备内存读写数据的时候,直接利用PCIE设备映射到应用程序中的内存地址即可。但是应用程序的内存地址该由谁解析到PCIE设备对应的总线空间给EP呢,这个工作是由北桥或者是RC(root complex)来完成的,解析到总线地址空间之后,EP会把总线的地址空间解析成PCIE设备对应的设备内存地址。

RC访问EP演示样例(黑色箭头):
(1)首先,RC端需要配置outbound(一般内核中配好),EP端需要配置inbound(0x5b000000 inbound到BAR2),这样就建立了RC端0x20100000(BAR2)到EP端0x5b000000的映射
(2)在EP端改动0x5b000000内存的内容,在RC端0x20100000能够看到对应的变化,从RC端读/写0x20100000和从EP端读/写0x5b000000,结果是一样的

对于EP,inbound一般是将BAR空间的总线地址与存储器域相应地址区间映射起来,outbound一般是访问主机内存地址。以linux 为RC+块设备为EP举例,在上电时linux内核会为块设备的各个BAR空间分配总线地址,块设备驱动的探测函数将总线地址与主机虚拟地址映射起来,EP自动将总线地址与存储器域相应分配的地址区间映射起来(配置inbound寄存器)。当主机想要读/写块设备数据时,将申请一块DMA缓冲区,并将缓冲区物理地址传递给块设备,这个物理地址就是PCI域的总线地址。以写操作为例,块设备若想读取主机内存中DMA缓冲区的数据,需进行一次映射,将相应的存储器域地址区间与DMA缓冲区物理地址进行一次映射(注意这两端地址区间不相同)。一般来说有两个相关的寄存器进行这样的偏移/映射(inbound寄存器与outbound寄存器),不搞固件 逻辑什么的,不太了解细节。

Generally speaking, if your card has SoC, the FW on the SoC will configure the iATU mapping with BAR match mode. And don’t let host side driver to configure it.
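落到驱动代码层面,探测函数中获取并映射BAR空间的典型写法大致如下(示意,bar号与hps_map_bar函数名为假设,之后用readl/writel访问即可):

static void __iomem *hps_map_bar(struct pci_dev *pdev, int bar)
{
	resource_size_t start = pci_resource_start(pdev, bar);
	resource_size_t len   = pci_resource_len(pdev, bar);

	dev_info(&pdev->dev, "BAR%d: start=%pa len=%pa\n", bar, &start, &len);
	return pci_iomap(pdev, bar, len);	/* 返回内核虚拟地址 */
}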
参考博客:
物理地址和总线地址区别
什么是物理地址、虚拟地址、总线地址
PCI设备配置空间、BAR空间、BUS总线的理解整理
PCIe实践之路:BAR空间和TLP
PCIE BAR空间
pcie inbound和outbound关系
PCI/PCIe iATU

十四 诡异的问题 【未解决】

在实现网卡驱动的数据收发功能时,使用start_xmit napi poll函数完成协议栈数据的收发,使用ioctl完成用户自定义数据收发。二者使用不同的消息通道,但大致功能一样,故将两个功能集成在一个函数中,使用参数type区分协议栈与用户自定义数据。
大致代码如下:

#define PROTOCOL_TYPE 0
#define IOCTL_TYPE 1
int send_recv_api(xxx, int type){
    if(type == PROTOCOL_TYPE){
        // 处理数据
        hardware_api(data, type1);
    }else if(type == IOCTL_TYPE){
        // 处理数据
        hardware_api(data, type2);
    }
}

然后网卡驱动加载,在未开启arp时,程序运行正常,实际上此时协议栈也未对数据进行处理与回复,两台主机加载驱动模块,不能相互ping通,主机并不会响应另一台主机的ARP request。而后将两个网卡的ARP功能打开,再次ping,接收ARP request的主机便当场死机。
由于暂时未调用用户自定义数据收发,type一定为PROTOCOL_TYPE,故在函数前面加上判断

#define PROTOCOL_TYPE 0
#define IOCTL_TYPE 1
int send_recv_api(xxx, int type){
    if(type != PROTOCOL_TYPE){
        // 打印错误信息
        return;
    }
    if(type == PROTOCOL_TYPE){
        // 处理数据
        hardware_api(data, type1);
    }else if(type == IOCTL_TYPE){
        // 处理数据
        hardware_api(data, type2);
    }
}

而后程序就能正常进行通信了,但问题在于错误信息并没有被打印,也就是说并没有进入这个分支
故将return语句注释掉,然后程序又崩溃了,而且崩溃时并未打印相关信息。
控制变量几次都是一样的效果,这就非常令人迷惑。不考虑问题的原因,单看问题的表象。
既然return语句有用,证明进入了该分支,那就应该有相应的打印语句
既然没有打印相关信息,证明就没进入过该分支

然后我怀疑打印函数是不是有问题,故在if语句前后打印信息,都打印成功。
由于函数基本逻辑比较简单,一般的想法便是程序哪块有点细节上的问题,例如赋值运算符与相等运算符,逗号运算符,运算符优先级等,但一直没找到程序的问题在哪里,而且越找类似的悖论就越多。
为什么进入这个分支却不打印信息呢? kgdb看崩溃时的输出也没看到进入其他分支的打印信息
搞了很久都没有解决这个问题,终于决定放弃了。最后想不如写成两个函数算了,没用几分钟就写完了。蚌

十五:利用信号量实现同步/互斥

场景:块设备一次只能处理一条命令,待一个请求执行完成后才能执行下一个请求

请求开始
执行必要的数据传输(写类型命令)
发送请求至块设备进行处理
等待设备返回请求执行结果
执行必要的数据传输(读类型命令)
请求完成

协议栈请求可直接使用blk_mq_start_request函数开始请求,blk_mq_end_request函数结束请求,对于ioctl用户自定义请求,我刚开始的实现方式如下:

代表用户进程执行的内核ioctl处理函数(进程上下文):
    解析用户请求
    申请内核缓冲区
    将请求数据从用户空间拷贝至内核缓冲区(写类型请求)
    发送请求至请求通道:
        当前请求通道空闲:
            将请求数据从内核缓冲区拷贝至DMA缓冲区(写类型请求)
            发送给块设备执行请求
        当前请求通道忙碌:压入内核FIFO队列
    每隔一段时间查看当前请求是否完成
    将请求数据从内核缓冲区拷贝至用户空间(读类型请求)
    请求完成
定时器轮询函数(中断上下文):
    定时接收块设备的请求执行结果
    若请求执行完成:
        将请求数据从DMA缓冲区拷贝至内核缓冲区(读类型请求)
        标识请求已经完成
        若FIFO队列存在请求,发送下一请求给块设备

申请内核缓冲区,并进行额外数据拷贝的原因是:请求并不是一开始就执行,它有可能被压入FIFO队列,在定时器轮询函数(中断上下文)执行请求,而在中断上下文又无法访问用户空间(参见 二:用户空间访问问题)。虽然以上的实现方式可以基本实现预期功能,但额外的数据拷贝,内核缓冲区,FIFO队列总是显得笨拙。在重新审视代码设计后发现,这个场景不就是同步与互斥吗?故修改代码,使用信号量完成预期功能。

同步信号量:sync  (初值为0) 等待请求执行完成
互斥信号量:mutex (初值为1) 请求通道同一时间只能由一个进程访问

代表用户进程执行的内核ioctl处理函数(进程上下文):
    down_interruptible(&mutex)
    解析用户请求
    将请求数据从用户空间拷贝至DMA缓冲区(写类型请求)
    发送请求至块设备
    down_interruptible(&sync)
    将请求数据从DMA缓冲区拷贝至用户空间(读类型请求)
    请求完成
    up(&mutex)
定时器轮询函数(中断上下文):
    定时接收块设备的请求执行结果
    若请求执行完成:
        up(&sync)
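对应的C代码骨架大致如下(示意,hps_dev、hps_cmd、hps_submit_cmd均为假设;超时等待也可以换成down_interruptible死等):

static long hps_ioctl_cmd(struct hps_dev *dev, struct hps_cmd *cmd)
{
	long ret;

	if (down_interruptible(&dev->mutex))	/* 互斥:同一时间只允许一个请求占用通道 */
		return -ERESTARTSYS;

	hps_submit_cmd(dev, cmd);		/* 写SQ,发给设备 */

	ret = down_timeout(&dev->sync, 5 * HZ);	/* 同步:等待轮询函数up(&sync),带超时 */

	up(&dev->mutex);
	return ret;
}

/* 定时器轮询回调(中断上下文)里发现命令完成时: up(&dev->sync); */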
// 信号量
// jiffies参数指超时时间(节拍数)
extern int __must_check down_timeout(struct semaphore *sem, long jiffies);

static struct semaphore sema;	/* 示例用的信号量 */

int __init down_timeout_init(void)
{
	int ret;
	long iffies = 1000; //1000个时钟节拍,即是4s
	sema_init(&sema, 5); //信号量初始化,count = 5

	/* 输出初始化后信号量的信息 */
	printk("after sema_init, sema.count: %d\n", sema.count);
	ret = down_timeout(&sema, iffies); //获取信号量

	/* 输出down_timeout操作后信号量的信息 */
	printk("first down_timeout, ret = %d\n", ret);
	printk("first down_timeout, sema.count: %d\n", sema.count);

	sema_init(&sema, 0); //信号量初始化,count = 0
	ret = down_timeout(&sema, iffies);

	printk("second down_timeout, ret = %d\n", ret);
	printk("second down_timeout, sema.count: %d\n", sema.count);
	return 0;
}
// 定时器
struct timer_list {
/*
* All fields that change during normal runtime grouped to the
* same cacheline
*/
struct hlist_node entry;
unsigned long expires; // 过期时间
void (*function)(struct timer_list *);
u32 flags;

#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
};

#include <linux/timer.h>

/* 定义一个定时器(注意:下面是旧内核API的示例,4.19/5.4中应使用timer_setup,回调参数为struct timer_list *,且没有data成员) */
static struct timer_list timer;

/* 参数是timer中的变量data */
void function_handle(unsigned long data){
/* 做你想做的事 */
......

/* 因为内核定时器是一个单次的定时器,所以如果想要多次重复定时需要在定时器绑定的函数结尾重新装载时间,并启动定时 */
/* Kernel Timer restart */
mod_timer(&timer, jiffies + HZ);
}

int xxxx_init(void){
/* 具体任务的注册等 */
......

init_timer(&timer); /* 初始化定时器 */
timer.function = function_handle; /* 绑定定时时间到后的执行函数 */
timer.expires = jiffies + HZ; /* 定时的时间点,HZ是jiffies时钟的周期,当前时间的1s之后 */
timer.data = 0; /* function_handle的参数*/
add_timer(&timer); /* 添加并启动定时器 */
}


void xxxx_exit(void){
/* 具体任务的卸载等 */
......

/* 删除定时器 */
del_timer(&timer);
}

两者对时间参数的用法不同:down_timeout的jiffies参数是相对的超时节拍数,而定时器的expires是绝对时间点(jiffies + 偏移)。

Linux内核定时器
Linux内核API down_timeout
聊聊Linux内核信号量
Linux内核API

十六:内核线程

承接第七点,使用内核线程进行轮询
打印相关信息

dbg("poll task:%d  irq:%d  atomic:%d",in_task(),in_interrupt(),in_atomic());
// poll task:1 irq:0 atomic:0

可以看出创建的内核线程处于进程上下文,不属于中断上下文,也不属于原子上下文,可以放心使用休眠函数,每执行一次轮询都可以调用msleep进行一次休眠。
在卸载驱动时,一直卡在rmmod操作处,发现忘了在remove函数中stop内核线程,但即使stop了内核线程,也还是卡在卸载驱动处。
阅读kthread_stop问题探讨才知道要在内核线程的while循环中调用kthread_should_stop()函数,查询是否应该退出线程。

int demo_thread1(void *data) {
pr_info("%s ===>\n", __func__);
while(!kthread_should_stop()) {
pr_info("%s, I am alive\n", __func__);
}
pr_info("%s <===\n", __func__);
return 23;
}
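结合本节场景,一个更完整的示意如下(hps_priv、hps_poll_once均为假设),probe中创建线程、remove中停止:

static int hps_poll_fn(void *data)
{
	struct hps_priv *priv = data;

	while (!kthread_should_stop()) {	/* remove中的kthread_stop依赖该判断退出循环 */
		hps_poll_once(priv);		/* 假设:轮询一次硬件状态 */
		msleep(1);			/* 进程上下文,允许休眠 */
	}
	return 0;
}

/* probe:  priv->task = kthread_run(hps_poll_fn, priv, "hps_poll"); */
/* remove: kthread_stop(priv->task); */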

十七:竞争

在编写简单的块设备驱动时,使用轮询而不是中断的方式接收返回结果。在发送SQ后,定时轮询特定的BAR空间位置,获取对应的CQ。如何判断收到CQ呢?每次收到CQ后将该位置清零,若下次发现相应位置的数据的命令ID位恰好为期望的命令ID,则表示收到CQ。

data = readl(addr);
cid = (data & CID_MASK) >> CID_OFFSET;	/* 注意运算符优先级,移位前需先加括号取出掩码位 */
if (cid == expect_cid) {
	/* 接收处理 */
}

以上的处理流程似乎没有啥问题,但CQ由多个32位数组成,cid所属32位数写入成功并不代表所有的CQ都写入完成。故程序运行几十分钟会出现一次异常情况:其他32位数为0(即还没有写入,32位是写入数据的基本单位)。
可选的解决方案有:

  1. 使用中断告知数据已到达
  2. 对所有32位数都进行非零判断(需保证数据永远不为0)
  3. 增加一个标志位,在全部数据写入成功后设置该标志位(见下面的示意)
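方案3的一个示意如下(寄存器偏移与位定义均为假设):设备把done标志放在最后写入的那个32位里,驱动先检查标志,再读取其余字段。

u32 status = readl(cq_base + CQ_STATUS_OFF);
if (status & CQ_DONE_BIT) {
	u32 cid  = (status & CID_MASK) >> CID_OFFSET;
	u32 data = readl(cq_base + CQ_DATA_OFF);	/* 此时其余32位已全部写入 */

	writel(0, cq_base + CQ_STATUS_OFF);		/* 清零,等待下一条CQ */
	/* 处理 cid / data ... */
}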

十八:调试相关

一:通过编译输出找出可能的问题(make gcc相关)
利用gcc 警告选项组合与标准错误重定向分析代码问题

#示例
EXTRA_CFLAGS := -g -Wint-to-pointer-cast -Wno-unused-parameter -Wno-sign-compare -Wno-unused-function -Wno-format-extra-args -Wall-w

【GCC】gcc警告选项汇总–编辑中|gcc编译选项

make xxx 2> build_output.txt

将Linux 标准输出,错误输出重定向到文件

make -n:仅输出指令调用,但不执行,便于观察Makefile的修改是否生效

- Call the first target specified in the Makefile (usually named "all"):
make

- Call a specific target:
make {{target}}

- Call a specific target, executing 4 jobs at a time in parallel:
make -j{{4}} {{target}}

- Use a specific Makefile:
make --file {{file}}

- Execute make from another directory:
make --directory {{directory}}

- Force making of a target, even if source files are unchanged:
make --always-make {{target}}

- Override variables defined in the Makefile by the environment:
make --environment-overrides {{target}}

make-选项

Makefile设置头文件路径

目录结构
--driver
----src(源文件目录)
------Makefile
----include(头文件目录)

通过Makefile位置找到头文件位置

mkfile_path := $(abspath $(lastword $(MAKEFILE_LIST)))
include_dir := $(abspath $(mkfile_path)/../../include)
ccflags-y += -g -I$(include_dir)

如何获取Makefile的当前相对目录?
Makefile 关于realpath的研究

二:运行调试

自定义内核打印函数,方便调试

// 输出函数
#ifdef DEBUG_OPT
#define PRINT_DBG(format, arg...) printk(KERN_DEBUG format, ##arg);
#else
#define PRINT_DBG(format, arg...) \
do \
{ \
} while (0)
#endif

#ifdef WARN_OPT
#define PRINT_WARN(format, arg...) printk(KERN_WARNING format, ##arg);
#else
#define PRINT_WARN(format, arg...) \
do \
{ \
} while (0)
#endif

#ifdef INFO_OPT
#define PRINT_INFO(format, arg...) printk(KERN_INFO format, ##arg);
#else
#define PRINT_INFO(format, arg...) \
do \
{ \
} while (0)
#endif
obj-m += test.o
# -DINFO_OPT -DDEBUG_OPT -DWARN_OPT 输出选项
ccflags-y := -DWARN_OPT
test-y := test1.o test2.o
all:
	make -C /lib/modules/$(shell uname -r)/build/ M=$(PWD) modules
clean:
	make -C /lib/modules/$(shell uname -r)/build/ M=$(PWD) clean

do {…} while (0) 在宏定义中的作用
使用dmesg查看内核打印信息

dmesg -w -T -H	# 实时查看dmesg输出
Usage:
dmesg [options]

Display or control the kernel ring buffer.

Options:
-C, --clear clear the kernel ring buffer
-c, --read-clear read and clear all messages
-D, --console-off disable printing messages to console
-E, --console-on enable printing messages to console
-F, --file <file> use the file instead of the kernel log buffer
-f, --facility <list> restrict output to defined facilities
-H, --human human readable output
-k, --kernel display kernel messages
-L, --color[=<when>] colorize messages (auto, always or never)
colors are enabled by default
-l, --level <list> restrict output to defined levels
-n, --console-level <level> set level of messages printed to console
-P, --nopager do not pipe output into a pager
-p, --force-prefix force timestamp output on each line of multi-line messages
-r, --raw print the raw message buffer
-S, --syslog force to use syslog(2) rather than /dev/kmsg
-s, --buffer-size <size> buffer size to query the kernel ring buffer
-u, --userspace display userspace messages
-w, --follow wait for new messages
-x, --decode decode facility and level to readable string
-d, --show-delta show time delta between printed messages
-e, --reltime show local time and time delta in readable format
-T, --ctime show human-readable timestamp (may be inaccurate!)
-t, --notime don't show any timestamp with messages
--time-format <format> show timestamp using the given format:
[delta|reltime|ctime|notime|iso]
Suspending/resume will make ctime and iso timestamps inaccurate.

-h, --help display this help
-V, --version display version

Supported log facilities:
kern - kernel messages
user - random user-level messages
mail - mail system
daemon - system daemons
auth - security/authorization messages
syslog - messages generated internally by syslogd
lpr - line printer subsystem
news - network news subsystem

Supported log levels (priorities):
emerg - system is unusable
alert - action must be taken immediately
crit - critical conditions
err - error conditions
warn - warning conditions
notice - normal but significant condition
info - informational
debug - debug-level messages

For more details see dmesg(1).

kmsg日志的存储与读取
linux内核调试之kmsg和dmesg

调试环境下,或许可以删除之前的日志便于之后的日志查看

至今我都没有找到能稳定获取内核崩溃前日志的方法
获取Linux内核卡死前的日志

调试方法:
1 检查异常情况,打印相关信息,立即返回(面向printk开机关机编程)
2 配置kgdb,使用kgdb打印内核崩溃时输出信息或通过断点调试
kgdb使用经验:

开发机:开发代码,运行驱动进行测试
调试机:使用kgdb通过串口调试开发机

两机器开发目录的绝对路径保持一致,将开发机代码拷贝至调试机同一位置
1 开发机
echo g > /proc/sysrq-trigger
此时开发机卡死
2 调试机
gdb vmlinux
set serial baud 115200
target remote /dev/ttyAMA1
看到输出后按c继续运行
此时开发机恢复正常
3 开发机
加载驱动 insmod xx.ko
查看插入驱动后代码段位置
cat /sys/module/驱动名/sections/.text
得到地址0x123456789
echo g > /proc/sysrq-trigger
4 调试机
add-symbol-file ko文件绝对路径 0x123456789
使用b打断点
按c继续运行
5 开发机
运行测试方法,击中断点
6 调试机
触发断点,打印相关信息

不过不知道为什么,按n有时候会进入中断(entry handler),导致无法调试
相关命令:
lx-dmesg
lx-symbols

注意驱动初始化时不能有错误,要不然无法得到驱动符号地址,为了方便,我们将测试的函数放在remove函数中,然后通过rmmod xx来触发断点。

调试过程中,被调试的内核运行在目标机上,GDB 调试器运行在开发机上。

使用 KGDB 调试 Kernel On Red Hat Linux

十九:调试经历

有些时候与其他开发人员沟通比一个人调试高效,但需确保自己已完全理解问题,可以完整表述问题且尝试了许多方法仍未解决
有可能程序实现并没有问题,程序已准确实现你心中的概念,但你对业务的理解有偏差,故不能单纯靠代码调试解决问题
调试经历1
我编写的驱动在openEuler上运行,但一插入驱动就报空指针错误,开始远程调试

编译运行环境,在x64 Ubuntu上交叉编译,在arm 嵌入式openEuler运行

1:首先dmesg查看内核日志,但很奇怪的是没有任何打印,驱动一加载就显示空指针错误,没有模块初始化时我加的输出语句
检查日志输出级别也正常,可以输出info级别的消息,这相当于还没进入任何我编写的驱动部分就空指针异常了,而且崩溃时调用栈也全是不认识的函数。

所以我提出使用hello world简单驱动程序测试当前编译运行环境是否正常,而后发现hello world驱动程序出现同样的问题,后对方进行一些调整,解决问题

2:驱动插入时BAR空间的信息全部错误(BAR起始地址、结束地址、标志位、大小),即pci_dev的resource成员的信息都很奇怪,但从lspci -v看来又是正确的,非常奇怪,怀疑是linux底层pci枚举出现问题。后来想先将ioremap函数参数写成定值,暂时跳过该步,但接下来device_add_disk又发生非法地址访问错误,觉得BAR问题应该绕不开了,只能先解决这个问题。

先切换运行环境内核版本为4.19,驱动运行正常,故代码没有太大问题,还是出在运行环境上。后来对方应该是通过分离内核源码、重新配置的方式单独编译驱动,不再统一编译,解决了该问题,具体细节我也不太了解。

交叉编译,嵌入式,国产全是bug多发地

调试经历2
进行块设备驱动的验收,采用的数据收发流程和nvme类似,写sq,读cq。由于没有实现中断,只能轮询相应位置读取cq命令。但诡异的是用一个简单程序(在初始化函数中写sq,然后while循环读cq,直至读到相应cq命令)可以正常收发数据,而我的内核线程却一直轮询不到正确的数据,但观察串口打印确实执行了相应命令,cq相应位置的数据也确实发生了改变,唯独缺少了那个正确的数据。
示例

cq位置
我的程序输出
x
x
x
y
y
y

测试程序输出
x
x
x
z
y
y
y

后面对测试程序进行修改,和我内核线程一样使用一定的休眠延时,同样没有收到cq。合理推测这数据会过期,只能在一定时间内取到,但不理解为什么要把cq数据设置为定时的数据。直到后面与固件开发人员交流后才发现原来cq逻辑是这样的:cq对应位置是一个fifo,刚开始数据确实在fifo中,但随即device会将fifo数据传递给host的fifo,所以我们本不应该在cq fifo读到数据,只是频率太高,凑巧读到了。

调试经历3
由于块设备驱动一次只能发送一条指令,对于协议栈请求,我的处理方式如下:

int allow = 1;

请求到达:
if (allow == 1) {
    allow = 0;
    执行请求
} else {
    将请求入队
}

请求处理完成:
allow = 1;
下一个请求出队,处理该请求

请求队列的tagset.queue_depth设的比较大,一次可能有多个请求到达。所以问题就很明显了,并发导致竞态。
当执行mkfs命令格式化文件系统时,一时间有多个请求到达,可能有多个请求同时进入了allow==1分支,导致向SQ写入的命令非常诡异

解决方法也比较简单,就是使用原子变量

atomic_t allow;
atomic_set(&allow, 1);

请求到达:
// atomic_cmpxchg(v, old, new)
// 执行原子比较交换,如果原子变量v的值等于old,那么把原子变量v的值设置为new,返回值总是原子变量v的旧值
if (atomic_cmpxchg(&allow, 1, 0) == 1) {
    执行请求
} else {
    将请求入队
}

请求处理完成:
atomic_set(&allow, 1);
下一个请求出队,处理该请求

这种模式的问题如果放在多线程或锁专题的实验里一眼就能看出来,但实际工程中却比较容易犯:写项目代码时没有人提示你需要处理并发问题。就像运算符优先级、==与=之类的问题,理论上都不会犯,但实际写代码时就是容易写出类似的代码,这也是工程实践的意义。

二十:设置文件系统块大小

由于块设备驱动最小只支持64K数据读取,而修改内核页大小设置又觉得有点麻烦,后面发现mkfs就有设置块大小的选项

mkfs.ext4: invalid option -- '-'
Usage: mkfs.ext4 [-c|-l filename] [-b block-size] [-C cluster-size]
[-i bytes-per-inode] [-I inode-size] [-J journal-options]
[-G flex-group-size] [-N number-of-inodes] [-d root-directory]
[-m reserved-blocks-percentage] [-o creator-os]
[-g blocks-per-group] [-L volume-label] [-M last-mounted-directory]
[-O feature[,...]] [-r fs-revision] [-E extended-option[,...]]
[-t fs-type] [-T usage-type ] [-U UUID] [-e errors_behavior][-z undo_file]
[-jnqvDFSV] device [blocks-count]

示例

mkfs.ext4   -b 65536 /dev/nvme1n1p1
Warning: blocksize 65536 not usable on most systems.
mke2fs 1.45.5 (07-Jan-2020)
/dev/nvme1n1p1 contains a ext4 file system
created on Fri Apr 21 17:00:47 2023
Proceed anyway? (y,N) y
mkfs.ext4: 65536-byte blocks too big for system (max 4096)
Proceed anyway? (y,N) y
Warning: 65536-byte blocks too big for system (max 4096), forced to continue
Creating filesystem with 163824 64k blocks and 164352 inodes
Filesystem UUID: 0d389dff-8d36-43b1-8362-2482ef7002b7
Superblock backups stored on blocks:
65528

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

本来以为大功告成了,但挂载时出错

1
2
3
4
5
mount -t ext4   /dev/nvme1n1p1 test
mount: test: wrong fs type, bad option, bad superblock on /dev/nvme1n1p1, missing codepage or helper program, or other error.

dmesg输出
[18738.045713] EXT4-fs (nvme1n1p1): bad block size 65536

当时想研究一下ext4文件系统的实现细节,看一看哪个环节出了错误,后面才看到有现成的命令输出文件系统信息,所以说遇到问题先找工具,看看有没有现成的,不要重复造轮子

tldr dumpe2fs
dumpe2fs
Print the super block and blocks group information for ext2/ext3/ext4 filesystems. Unmount the partition before running this command using umount {{device}}. More information: https://manned.org/dumpe2fs.

- Display ext2, ext3 and ext4 filesystem information:
dumpe2fs {{/dev/sdXN}}

- Display the blocks which are reserved as bad in the filesystem:
dumpe2fs -b {{/dev/sdXN}}

- Force display filesystem information even with unrecognizable feature flags:
dumpe2fs -f {{/dev/sdXN}}

- Only display the superblock information and not any of the block group descriptor detail information:
dumpe2fs -h {{/dev/sdXN}}

- Print the detailed group information block numbers in hexadecimal format:
dumpe2fs -x {{/dev/sdXN}}
dumpe2fs /dev/nvme1n1p1
dumpe2fs 1.45.5 (07-Jan-2020)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 0d389dff-8d36-43b1-8362-2482ef7002b7
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 164352
Block count: 163824
Reserved block count: 8191
Free blocks: 159068
Free inodes: 164341
First block: 0
Block size: 65536
Fragment size: 65536
Group descriptor size: 64
Reserved GDT blocks: 2
Blocks per group: 65528
Fragments per group: 65528
Inodes per group: 54784
Inode blocks per group: 214
Flex block group size: 16
Filesystem created: Sun Apr 23 11:16:03 2023
Last mount time: n/a
Last write time: Sun Apr 23 11:16:03 2023
Mount count: 0
Maximum mount count: -1
Last checked: Sun Apr 23 11:16:03 2023
Check interval: 0 (<none>)
Lifetime writes: 261 kB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: bd1ab816-317b-43f9-b664-1d1f1bf2962a
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0x4a3a900f
Journal features: (none)
Journal size: 256M
Journal length: 4096
Journal sequence: 0x00000001
Journal start: 0


Group 0: (Blocks 0-65527) csum 0x3132
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-3
Block bitmap at 4 (+4), csum 0x0e2f600d
Inode bitmap at 7 (+7), csum 0x5eb14e2c
Inode table at 10-223 (+10)
64872 free blocks, 54773 free inodes, 2 directories, 54773 unused inodes
Free blocks: 656-65527
Free inodes: 12-54784
Group 1: (Blocks 65528-131055) csum 0x80b5 [INODE_UNINIT]
Backup superblock at 65528, Group descriptors at 65529-65529
Reserved GDT blocks at 65530-65531
Block bitmap at 5 (bg #0 + 5), csum 0x1c5bc20f
Inode bitmap at 8 (bg #0 + 8), csum 0x00000000
Inode table at 224-437 (bg #0 + 224)
61428 free blocks, 54784 free inodes, 0 directories, 54784 unused inodes
Free blocks: 69628-131055
Free inodes: 54785-109568
Group 2: (Blocks 131056-163823) csum 0x06dc [INODE_UNINIT]
Block bitmap at 6 (bg #0 + 6), csum 0xa7a42c48
Inode bitmap at 9 (bg #0 + 9), csum 0x00000000
Inode table at 438-651 (bg #0 + 438)
32768 free blocks, 54784 free inodes, 0 directories, 54784 unused inodes
Free blocks: 131056-163823
Free inodes: 109569-164352

常规的方法行不通,就去网上找找有没有其他方法,然后就看到了这篇帖子
How can I mount filesystems with > 4KB block sizes?
5分钟搞懂用户空间文件系统FUSE工作原理

使用用户文件系统的方式挂载分区

fuseext2 /dev/nvme1n1p1 test_dir -o rw+
fuse-umfuse-ext2: version:'0.4', fuse_version:'29' [main (fuse-ext2.c:331)]
fuse-umfuse-ext2: enter [do_probe (do_probe.c:30)]
fuse-umfuse-ext2: leave [do_probe (do_probe.c:55)]
fuse-umfuse-ext2: opts.device: /dev/nvme1n1p1 [main (fuse-ext2.c:358)]
fuse-umfuse-ext2: opts.mnt_point: test_dir [main (fuse-ext2.c:359)]
fuse-umfuse-ext2: opts.volname: [main (fuse-ext2.c:360)]
fuse-umfuse-ext2: opts.options: rw+ [main (fuse-ext2.c:361)]
fuse-umfuse-ext2: parsed_options: rw,fsname=/dev/nvme1n1p1 [main (fuse-ext2.c:362)]
fuse-umfuse-ext2: mounting read-write [main (fuse-ext2.c:376)]

此时就可以正常使用分区且lsblk可以看到挂载目录

二十一:数据落盘

背景:块设备为保证数据落盘,关机前需发送shutdown指令,执行完shutdown指令后设备不再处理请求
第一个解决方案:关机前执行程序,手动发送shutdown指令
由于执行完shutdown指令后协议栈仍然会下发请求,故需记录当前是否执行shutdown指令,若执行了shutdown指令则直接结束请求。
这种方式的问题在于:①手动执行shutdown指令比较麻烦 ②关机时操作系统会将页缓存的数据刷入硬盘,若执行了shutdown指令则请求不能执行
第二个解决方案:驱动关机时自动发送shutdown指令
刚开始的时候我以为驱动在关机时也会执行remove函数(不知道为什么),所以在remove函数中向设备发送shutdown指令,但关机时从设备的串口打印并没有看到执行shutdown指令,说明关机并不会执行remove函数。后来在nvme驱动代码中找了一下,看到了shutdown回调,故在shutdown函数中发送shutdown指令并等待执行完成。这样实现后就能从串口中看到关机时首先执行了许多write指令,最后执行一个shutdown指令。

struct module;

/**
* struct pci_driver - PCI driver structure
* @node: List of driver structures.
* @name: Driver name.
* @id_table: Pointer to table of device IDs the driver is
* interested in. Most drivers should export this
* table using MODULE_DEVICE_TABLE(pci,...).
* @probe: This probing function gets called (during execution
* of pci_register_driver() for already existing
* devices or later if a new device gets inserted) for
* all PCI devices which match the ID table and are not
* "owned" by the other drivers yet. This function gets
* passed a "struct pci_dev \*" for each device whose
* entry in the ID table matches the device. The probe
* function returns zero when the driver chooses to
* take "ownership" of the device or an error code
* (negative number) otherwise.
* The probe function always gets called from process
* context, so it can sleep.
* @remove: The remove() function gets called whenever a device
* being handled by this driver is removed (either during
* deregistration of the driver or when it's manually
* pulled out of a hot-pluggable slot).
* The remove function always gets called from process
* context, so it can sleep.
* @suspend: Put device into low power state.
* @suspend_late: Put device into low power state.
* @resume_early: Wake device from low power state.
* @resume: Wake device from low power state.
* (Please see Documentation/power/pci.rst for descriptions
* of PCI Power Management and the related functions.)
* @shutdown: Hook into reboot_notifier_list (kernel/sys.c).
* Intended to stop any idling DMA operations.
* Useful for enabling wake-on-lan (NIC) or changing
* the power state of a device before reboot.
* e.g. drivers/net/e100.c.
* @sriov_configure: Optional driver callback to allow configuration of
* number of VFs to enable via sysfs "sriov_numvfs" file.
* @err_handler: See Documentation/PCI/pci-error-recovery.rst
* @groups: Sysfs attribute groups.
* @driver: Driver model structure.
* @dynids: List of dynamically added device IDs.
*/
struct pci_driver {
struct list_head node;
const char *name;
const struct pci_device_id *id_table; /* Must be non-NULL for probe to be called */
int (*probe)(struct pci_dev *dev, const struct pci_device_id *id); /* New device inserted */
void (*remove)(struct pci_dev *dev); /* Device removed (NULL if not a hot-plug capable driver) */
int (*suspend)(struct pci_dev *dev, pm_message_t state); /* Device suspended */
int (*suspend_late)(struct pci_dev *dev, pm_message_t state);
int (*resume_early)(struct pci_dev *dev);
int (*resume)(struct pci_dev *dev); /* Device woken up */
void (*shutdown)(struct pci_dev *dev);
int (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
const struct pci_error_handlers *err_handler;
const struct attribute_group **groups;
struct device_driver driver;
struct pci_dynids dynids;
};

static void nvme_shutdown(struct pci_dev *pdev)
{
struct nvme_dev *dev = pci_get_drvdata(pdev);
nvme_dev_disable(dev, true);
}

static struct pci_driver nvme_driver = {
.name = "nvme",
.id_table = nvme_id_table,
.probe = nvme_probe,
.remove = nvme_remove,
.shutdown = nvme_shutdown,
.driver = {
.pm = &nvme_dev_pm_ops,
},
.sriov_configure = pci_sriov_configure_simple,
.err_handler = &nvme_err_handler,
};

二十二 kfifo误区

写块设备驱动代码时使用kfifo保证一次只发一个请求,搜Kfifo用法时候看到“内核无锁队列”这几个词,自然而然没有对Kfifo加锁,但驱动写完测试的时候出队的结果千奇百怪,刚开始还找是不是其他模块出了问题,通过打印发现就是队列取出的数据有问题,所以看一看Kfifo源码,看到了以下注释:
include/linux/kfifo.h

/*
* Note about locking: There is no locking required until only one reader
* and one writer is using the fifo and no kfifo_reset() will be called.
* kfifo_reset_out() can be safely used, until it will be only called
* in the reader thread.
* For multiple writer and one reader there is only a need to lock the writer.
* And vice versa for only one writer and multiple reader there is only a need
* to lock the reader.
*/

/**
* kfifo_in_spinlocked - put data into the fifo using a spinlock for locking
* @fifo: address of the fifo to be used
* @buf: the data to be added
* @n: number of elements to be added
* @lock: pointer to the spinlock to use for locking
*
* This macro copies the given values buffer into the fifo and returns the
* number of copied elements.
*/
#define kfifo_in_spinlocked(fifo, buf, n, lock) \
({ \
unsigned long __flags; \
unsigned int __ret; \
spin_lock_irqsave(lock, __flags); \
__ret = kfifo_in(fifo, buf, n); \
spin_unlock_irqrestore(lock, __flags); \
__ret; \
})

/**
* kfifo_out_spinlocked - get data from the fifo using a spinlock for locking
* @fifo: address of the fifo to be used
* @buf: pointer to the storage buffer
* @n: max. number of elements to get
* @lock: pointer to the spinlock to use for locking
*
* This macro get the data from the fifo and return the numbers of elements
* copied.
*/
#define kfifo_out_spinlocked(fifo, buf, n, lock) \
__kfifo_uint_must_check_helper( \
({ \
unsigned long __flags; \
unsigned int __ret; \
spin_lock_irqsave(lock, __flags); \
__ret = kfifo_out(fifo, buf, n); \
spin_unlock_irqrestore(lock, __flags); \
__ret; \
}) \
)

所以说有多个写者/读者的时候还是需要加锁的,只是单个读者+单个写者不需要加锁
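按这个结论,本节场景(ioctl与协议栈两个写者 + 一个轮询读者)只需要对写端加锁,示意如下(hps_cmd为假设的命令结构):

#include <linux/kfifo.h>
#include <linux/spinlock.h>

static DEFINE_KFIFO(cmd_fifo, struct hps_cmd *, 64);
static DEFINE_SPINLOCK(fifo_lock);

static void hps_enqueue_cmd(struct hps_cmd *cmd)	/* 写者:可能来自多个上下文 */
{
	kfifo_in_spinlocked(&cmd_fifo, &cmd, 1, &fifo_lock);
}

static struct hps_cmd *hps_dequeue_cmd(void)		/* 读者只有轮询函数一个,按上面的注释可不加锁 */
{
	struct hps_cmd *cmd = NULL;

	kfifo_out(&cmd_fifo, &cmd, 1);
	return cmd;
}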
相关博客推荐:
linux内核之无锁缓冲队列kfifo原理(结合项目实践)
kfifo(内核无锁队列)

+--------------------------------------------------------------+
|          |<----------data----->|                              |
+--------------------------------------------------------------+
 <--读取-->|                     |<--写入-->
           ^                     ^                              ^
           |                     |                              |
          out                    in                            size

二十三 kmalloc vmalloc kvmalloc

由于刚开始写驱动代码,所有的内存申请都使用kmalloc。在块设备驱动中ioctl请求需申请内核缓冲区存放中间数据,有时候发现申请内存失败。

[  720.936419] 8_ioctl_rw_test: page allocation failure: order:10, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[ 720.936427] CPU: 3 PID: 5741 Comm: 8_ioctl_rw_test Tainted: G OE 5.4.18-35-generic #21-KYLINOS
[ 720.936429] Hardware name: GITSTAR GITSTAR-MF20A/GM9-2665, BIOS 03.12 02/15/22 23:09:42
[ 720.936430] Call trace:
[ 720.936435] dump_backtrace+0x0/0x178
[ 720.936437] show_stack+0x14/0x20
[ 720.936440] dump_stack+0xac/0xd0
[ 720.936444] warn_alloc+0xec/0x158
[ 720.936446] __alloc_pages_slowpath+0x9ec/0xa18
[ 720.936447] __alloc_pages_nodemask+0x244/0x2a8
[ 720.936449] alloc_pages_current+0x7c/0xe8
[ 720.936452] kmalloc_order+0x1c/0x88
[ 720.936454] __kmalloc+0x1cc/0x208
[ 720.936461] hps_ioctl_cmd+0x60/0x308 [Hps]
[ 720.936463] hps_ioctl+0xe8/0xf8 [Hps]
[ 720.936466] blkdev_ioctl+0x4b0/0xac0
[ 720.936469] block_ioctl+0x34/0x40
[ 720.936471] do_vfs_ioctl+0x370/0x7a8
[ 720.936472] ksys_ioctl+0x78/0xa8
[ 720.936474] sys_ioctl+0xc/0x18
[ 720.936476] el0_svc_naked+0x30/0x34
[ 720.936477] Mem-Info:
[ 720.936481] active_anon:259253 inactive_anon:130401 isolated_anon:0
active_file:173204 inactive_file:345209 isolated_file:0
unevictable:16 dirty:1 writeback:0 unstable:0
slab_reclaimable:16313 slab_unreclaimable:22005
mapped:113679 shmem:12956 pagetables:7084 bounce:0
free:19674 free_pcp:1903 free_cma:326

使用cat /proc/buddyinfo发现高阶内存比较少,确实很可能出现申请不到内存的现象
相关博客:Linux | 内存 | 由内存页不足(page allocation failure)引起程序杀死(OOM Killer)
没找到啥具有可行性的方法,并且这错误也不是稳定出现,就没咋管,后面请求密集了发现这问题稳定复现了,就不得不想想怎么解决了。
刚开始我的代码申请失败直接返回,没有啥错误处理

void *buffer = kmalloc(length, GFP_KERNEL); 
if (buffer == NULL)
{
PRINT_WARN("alloc kernel buffer failed.");
return -1;
}

后面想想为啥申请几M内存会申请不了,难道kmalloc和dma_alloc_coherent一样要求物理内存连续吗?搜一搜发现还真是,并且发现了kvmalloc这个智能函数,把kmalloc kfree改成kvmalloc kvfree了
vmalloc & kmalloc
最终测试阶段,值得把之前默认做出的这些选择重新审视一遍。
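把前面那段kmalloc示例改成kvmalloc的示意如下(PRINT_WARN是前文自定义的打印宏):

#include <linux/mm.h>	/* kvmalloc / kvfree */

void *buffer = kvmalloc(length, GFP_KERNEL);	/* kmalloc失败时自动退回vmalloc */
if (buffer == NULL)
{
	PRINT_WARN("alloc kernel buffer failed.");
	return -ENOMEM;
}
/* ... 使用完毕后 */
kvfree(buffer);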

static inline void *kvmalloc(size_t size, gfp_t flags)
{
return kvmalloc_node(size, flags, NUMA_NO_NODE);
}
/**
* kvmalloc_node - attempt to allocate physically contiguous memory, but upon
* failure, fall back to non-contiguous (vmalloc) allocation.
* @size: size of the request.
* @flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
* @node: numa node to allocate from
*
* Uses kmalloc to get the memory but if the allocation fails then falls back
* to the vmalloc allocator. Use kvfree for freeing the memory.
*
* Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
* __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
* preferable to the vmalloc fallback, due to visible performance drawbacks.
*
* Please note that any use of gfp flags outside of GFP_KERNEL is careful to not
* fall back to vmalloc.
*
* Return: pointer to the allocated memory of %NULL in case of failure
*/
void *kvmalloc_node(size_t size, gfp_t flags, int node)


/**
* vmalloc - allocate virtually contiguous memory
* @size: allocation size
*
* Allocate enough pages to cover @size from the page level
* allocator and map them into contiguous kernel virtual space.
*
* For tight control over page level allocator and protection flags
* use __vmalloc() instead.
*
* Return: pointer to the allocated memory or %NULL on error
*/
void *vmalloc(unsigned long size)
{
return __vmalloc_node_flags(size, NUMA_NO_NODE,
GFP_KERNEL);
}

/**
* kmalloc - allocate memory
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate.
*
* kmalloc is the normal method of allocating memory
* for objects smaller than page size in the kernel.
*
* The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN
* bytes. For @size of power of two bytes, the alignment is also guaranteed
* to be at least to the size.
*
* The @flags argument may be one of the GFP flags defined at
* include/linux/gfp.h and described at
* :ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>`
*
* The recommended usage of the @flags is described at
* :ref:`Documentation/core-api/memory-allocation.rst <memory-allocation>`
*
* Below is a brief outline of the most useful GFP flags
*
* %GFP_KERNEL
* Allocate normal kernel ram. May sleep.
*
* %GFP_NOWAIT
* Allocation will not sleep.
*
* %GFP_ATOMIC
* Allocation will not sleep. May use emergency pools.
*
* %GFP_HIGHUSER
* Allocate memory from high memory on behalf of user.
*
* Also it is possible to set different flags by OR'ing
* in one or more of the following additional @flags:
*
* %__GFP_HIGH
* This allocation has high priority and may use emergency pools.
*
* %__GFP_NOFAIL
* Indicate that this allocation is in no way allowed to fail
* (think twice before using).
*
* %__GFP_NORETRY
* If memory is not immediately available,
* then give up at once.
*
* %__GFP_NOWARN
* If allocation fails, don't issue any warnings.
*
* %__GFP_RETRY_MAYFAIL
* Try really hard to succeed the allocation but fail
* eventually.
*/
static __always_inline void *kmalloc(size_t size, gfp_t flags)

二十四 延时函数相关博客

How to sleep in the Linux kernel?

delays - Information on the various kernel delay / sleep mechanisms
-------------------------------------------------------------------

This document seeks to answer the common question: "What is the
RightWay (TM) to insert a delay?"

This question is most often faced by driver writers who have to
deal with hardware delays and who may not be the most intimately
familiar with the inner workings of the Linux Kernel.


Inserting Delays
----------------

The first, and most important, question you need to ask is "Is my
code in an atomic context?" This should be followed closely by "Does
it really need to delay in atomic context?" If so...

ATOMIC CONTEXT:
You must use the *delay family of functions. These
functions use the jiffie estimation of clock speed
and will busy wait for enough loop cycles to achieve
the desired delay:

ndelay(unsigned long nsecs)
udelay(unsigned long usecs)
mdelay(unsigned long msecs)

udelay is the generally preferred API; ndelay-level
precision may not actually exist on many non-PC devices.

mdelay is macro wrapper around udelay, to account for
possible overflow when passing large arguments to udelay.
In general, use of mdelay is discouraged and code should
be refactored to allow for the use of msleep.

NON-ATOMIC CONTEXT:
You should use the *sleep[_range] family of functions.
There are a few more options here, while any of them may
work correctly, using the "right" sleep function will
help the scheduler, power management, and just make your
driver better :)

-- Backed by busy-wait loop:
udelay(unsigned long usecs)
-- Backed by hrtimers:
usleep_range(unsigned long min, unsigned long max)
-- Backed by jiffies / legacy_timers
msleep(unsigned long msecs)
msleep_interruptible(unsigned long msecs)

Unlike the *delay family, the underlying mechanism
driving each of these calls varies, thus there are
quirks you should be aware of.


SLEEPING FOR "A FEW" USECS ( < ~10us? ):
* Use udelay

- Why not usleep?
On slower systems, (embedded, OR perhaps a speed-
stepped PC!) the overhead of setting up the hrtimers
for usleep *may* not be worth it. Such an evaluation
will obviously depend on your specific situation, but
it is something to be aware of.

SLEEPING FOR ~USECS OR SMALL MSECS ( 10us - 20ms):
* Use usleep_range

- Why not msleep for (1ms - 20ms)?
Explained originally here:
http://lkml.org/lkml/2007/8/3/250
msleep(1~20) may not do what the caller intends, and
will often sleep longer (~20 ms actual sleep for any
value given in the 1~20ms range). In many cases this
is not the desired behavior.

- Why is there no "usleep" / What is a good range?
Since usleep_range is built on top of hrtimers, the
wakeup will be very precise (ish), thus a simple
usleep function would likely introduce a large number
of undesired interrupts.

With the introduction of a range, the scheduler is
free to coalesce your wakeup with any other wakeup
that may have happened for other reasons, or at the
worst case, fire an interrupt for your upper bound.

The larger a range you supply, the greater a chance
that you will not trigger an interrupt; this should
be balanced with what is an acceptable upper bound on
delay / performance for your specific code path. Exact
tolerances here are very situation specific, thus it
is left to the caller to determine a reasonable range.

SLEEPING FOR LARGER MSECS ( 10ms+ )
* Use msleep or possibly msleep_interruptible

- What's the difference?
msleep sets the current task to TASK_UNINTERRUPTIBLE
whereas msleep_interruptible sets the current task to
TASK_INTERRUPTIBLE before scheduling the sleep. In
short, the difference is whether the sleep can be ended
early by a signal. In general, just use msleep unless
you know you have a need for the interruptible variant.

一文入门linux内核高精度定时器hrtimer机制
usleep_range()函数
LINUX内核中使用USLEEP_RANGE(MIN, MAX)的注意事项

上下文切换

根据 Tsuna 的测试报告,每次上下文切换都需要几十纳秒到几微秒的CPU时间,这些时间对CPU来说,
就好比人类对1分钟或10分钟的感觉概念。在分秒必争的计算机处理环境下,浪费太多时间在切换上,
只会降低真正处理任务的时间,表象上导致延时、排队、卡顿现象发生。

二十五 奇怪的255扇区

fio引发的一些问题

二十六 内核调试工具 crash

内核调试工具crash使用

二十七 未知的bug

 编写的块设备驱动在运行fio测试时电脑会卡死,鼠标动不了,键盘没反应。本来以为有了crash这个工具后一切bug都会迎刃而解,但实际上并没有任何变化,并没有自动重启,/var/crash也没有任何新增文件。
 那是不是crash并没有成功安装呢? 就像内核调试工具crash使用一样,自己写一个bug,在请求大小为100k时设置req为空。然后进行fio测试,bs=100k时内核panic,自动重启,/var/crash目录下新增相关的文件,证明crash还是有效的。
 买了一本《Debug Hacks中文版—深入调试的技术和工具》影印版,只要16块钱,看了看相关博客:
解读内核 sysctl 配置中 panic、oops 相关项目
Linux Watchdog 机制
使用sysctl配置内核参数
FT2000+模块在麒麟系统下串口输出功能调试
每日一小时linux(1)–sysRq
【开发工具】【sysrq】魔术键(sysRq)的使用
宋宝华: Kernel Oops和Panic是一回事吗?
串口通信简介——发展历史与基本概念
linux下查看与设置串口(或串口终端)信息和属性

 我想到了3个找到bug的途径:① 将串口输出至另一台主机 ② 开启watchdog功能 ③ sysrq魔术键

首先是串口打印,飞腾主板上是RS232(DB9)接口的串口,刚开始没找到相应的串口线,后面使用一个双母线+RS232转usb线连接到笔记本,但ttyS0一直没反应,相关命令为:

 dmesg | grep ttyS0
[ 0.946297] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A

cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:64 rx:0 RTS|CTS|DTR|DSR|CD
1: uart:unknown port:000002F8 irq:3
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3
4: uart:unknown port:00000000 irq:0
5: uart:unknown port:00000000 irq:0
6: uart:unknown port:00000000 irq:0
7: uart:unknown port:00000000 irq:0
8: uart:unknown port:00000000 irq:0
9: uart:unknown port:00000000 irq:0

cat /proc/tty/drivers
/dev/tty /dev/tty 5 0 system:/dev/tty
/dev/console /dev/console 5 1 system:console
/dev/ptmx /dev/ptmx 5 2 system
/dev/vc/0 /dev/vc/0 4 0 system:vtmaster
ttyprintk /dev/ttyprintk 5 3 console
max310x /dev/ttyMAX 204 209-224 serial
serial /dev/ttyS 4 64-111 serial
pty_slave /dev/pts 136 0-1048575 pty:slave
pty_master /dev/ptm 128 0-1048575 pty:master
unknown /dev/tty 4 1-63 console


可以用stty指令,查看或设置某个串口的波特率等信息。

stty查看串口参数: stty -F /dev/ttySn -a #ttySn为要查看的串口

stty设置串口参数:

stty -F /dev/ttyS0 ispeed 115200 ospeed 115200 cs8 -parenb -cstopb -echo
该命令将串口1(/dev/ttyS0)输入输出都设置成115200波特率,8位数据模式,

奇偶校验位-parenb,停止位-cstopb,同时-echo禁止终端回显。

一般情况下设置波特率和数据模式这两个参数就可以了,如果显示数据乱码,可能还需要设置其它参数,使用man查看stty其它设置选项。

问了问客服,换成ttyAMA1接口,好像arm的就是ttyAMA1
使用stty -F /dev/ttyAMA1 -a查看波特率
What is the difference between ttyS0, ttyUSB0 and ttyAMA0 in Linux?

1
2
3
ttyS0 is the device for the first UART serial port on x86 and x86_64 architectures. If you have a PC motherboard with serial ports you'd be using a ttySn to attach a modem or a serial console.
ttyUSB0 is the device for the first USB serial convertor. If you have an USB serial cable you'd be using a ttyUSBn to connect to the serial port of a router.
ttyAMA0 is the device for the first serial port on ARM architecture. If you have an ARM-based TV box with a serial console and running Android or OpenELEC, you'd be using a ttyAMAn to attach a console to it.

虽然现在串口可以用了,但修改启动选项更改console一直没有生效,好像一直是printk:ttyAMA0 enabled,暂时放弃该方法。

而后尝试第二个方法:watchdog。理论上只有发生Hardlockup(中断不可响应)时sysrq魔术键才会失效,故只需要把SoftLockup、Hardlockup检测机制都打开,并设置lockup时panic,应该所有问题都能解决吧。
linux虚拟机相关内核编译选项

ls /proc/sys/kernel/ | grep "watchdog"
nmi_watchdog
soft_watchdog
watchdog
watchdog_cpumask
watchdog_thresh

ls /proc/sys/kernel/ | grep "panic"
hardlockup_panic
hung_task_panic
panic
panic_on_io_nmi
panic_on_oops
panic_on_rcu_stall
panic_on_unrecovered_nmi
panic_on_warn
panic_print
softlockup_panic
unknown_nmi_panic

但问题是在麒麟系统+飞腾处理器的主机上显示的结果不是这样,grep “watchdog”没有任何显示。出师未捷身先死! 看了看系统的config,只开启了softlockup detector选项,竟然没有hardlockup detector选项,只能感慨一句国产的真是好啊! watchdog方法胎死腹中。

第三个方法sysrq实际上我已经预感到没啥用了,cat /proc/sys/kernel/sysrq刚开始显示176,后在 /etc/sysctl.conf中写入kernel.sysrq=1,正常使用电脑时使用alt+sysrq+c确实能自动重启,但当电脑卡死时sysrq键没有任何反应,失败!

既然以上3个方法都没啥用,就尝试其他方法:① 重新编译内核 ② 查看块设备固件代码 ③ 寻找正确设置串口打印的方法

麒麟操作系统切换内核有太多不可控的因素了,之前切换内核后全是问题,如果不是实在没办法了都不想动内核一点。设备固件代码调试起来比较麻烦,修改后还要烧录,比较耗时。至于串口,打算使用vmware来实验一下串口功能,可惜vmware好像不能虚拟arm的主机,而且我有点怀疑卡死的时候真有内核打印吗?

解决问题的路上总是有一系列失败的尝试,但我认为这些经历比那些成功的经历更有意义!

串口虚拟机测试
使用输出文件的方式使用vmware的串口 VMware虚拟串口的设置与使用
发现新增的串口是ttyS1而不是ttyS0
修改/etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash console=ttyS1,115200 debug loglevel=7"
# 修改grub前
dmesg | grep console
[ 0.171761] printk: console [tty0] enabled
[ 3.699707] systemd[1]: Starting Set the console keyboard layout...

# 修改grub后
dmesg | grep console
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0 root=UUID=5a55429a-0931-4339-814e-a532007c5cc4 ro quiet splash console=ttyS1,115200 debug loglevel=7 crashkernel=512M-:192M
[ 0.096099] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0 root=UUID=5a55429a-0931-4339-814e-a532007c5cc4 ro quiet splash console=ttyS1,115200 debug loglevel=7 crashkernel=512M-:192M
[ 1.144103] printk: console [ttyS1] enabled
[ 7.214703] systemd-sysv-generator[408]: Native unit for console-setup.service already exists, skipping.
[ 7.225267] systemd-sysv-generator[408]: Ignoring S01console-setup.sh symlink in rc2.d, not generating console-setup.service.
[ 7.225654] systemd-sysv-generator[408]: Ignoring S01console-setup.sh symlink in rc3.d, not generating console-setup.service.
[ 7.225979] systemd-sysv-generator[408]: Ignoring S01console-setup.sh symlink in rc4.d, not generating console-setup.service.
[ 7.226323] systemd-sysv-generator[408]: Ignoring S01console-setup.sh symlink in rc5.d, not generating console-setup.service.
[ 7.231177] systemd[1]: unit_file_build_name_map: normal unit file: /lib/systemd/system/console-getty.service

串口的输出文件显示了内核打印信息

所以说什么时候输出了printk: console [ttyAMA1] enabled,什么时候我的串口设置就成功了

可惜没找到方法,不论怎么设置都是printk: console [ttyAMA0] enabled,无奈

既然找不到问题原因,就只能大刀阔斧重构代码了,就当刚开始写代码,重新写一份再修bug

---------- Update, much later ----------

Coming back to the project after a few months of internship, the host-freeze problem never showed up again even though the driver code had not changed at all, so it is reasonable to conclude the freeze was caused by the device firmware. Good debugging methods really matter in development!

Twenty-seven: Measuring driver execution time

Reference blog posts:
宋宝华:关于Ftrace的一个完整案例
perf-tools

Near the end of driver development you usually need to measure how long each part of the driver takes and find the performance bottleneck; in practice funcgraph from perf-tools is enough. Insert dump_stack() into the nvme driver's nvme_queue_rq(), run fio to obtain the call stack (you can test in a virtual machine using the approach from the earlier "fio引发的一些问题" section), and then pick an upper-layer function such as aio_write() to trace.

libaio engine:

[  +0.000000] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[ +0.000001] Call Trace:
[ +0.000001] dump_stack+0x6d/0x9a
[ +0.000002] nvme_queue_rq.cold+0x28/0x97 [nvme]
[ +0.000002] __blk_mq_try_issue_directly+0x116/0x1c0
[ +0.000001] blk_mq_request_issue_directly+0x4b/0xe0
[ +0.000001] blk_mq_try_issue_list_directly+0x46/0xb0
[ +0.000001] blk_mq_sched_insert_requests+0xae/0x100
[ +0.000001] blk_mq_flush_plug_list+0x1e8/0x290
[ +0.000001] blk_flush_plug_list+0xe3/0x110
[ +0.000001] blk_finish_plug+0x26/0x34
[ +0.000001] blkdev_write_iter+0xbd/0x140
[ +0.000001] aio_write+0xec/0x1a0
[ +0.000001] ? __switch_to_asm+0x40/0x70
[ +0.000001] ? __check_object_size+0x5d/0x150
[ +0.000001] ? _copy_to_user+0x2c/0x30
[ +0.000001] ? aio_read_events+0x215/0x320
[ +0.000001] ? _cond_resched+0x19/0x30
[ +0.000001] ? io_submit_one+0x7b/0xb50
[ +0.000001] io_submit_one+0x449/0xb50
[ +0.000000] ? wait_woken+0x80/0x80
[ +0.000002] __x64_sys_io_submit+0x90/0x180
[ +0.000001] ? __x64_sys_io_submit+0x90/0x180
[ +0.000001] ? __x64_sys_io_getevents+0x5f/0xb0
[ +0.000001] ? exit_to_usermode_loop+0x8f/0x160
[ +0.000001] do_syscall_64+0x57/0x190
[ +0.000001] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ +0.000000] RIP: 0033:0x7f27ccbd473d

sync engine:

[  +0.000001] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[ +0.000001] Workqueue: kblockd blk_mq_run_work_fn
[ +0.000001] Call Trace:
[ +0.000001] dump_stack+0x6d/0x9a
[ +0.000002] nvme_queue_rq.cold+0x5/0xa [nvme]
[ +0.000002] blk_mq_dispatch_rq_list+0x93/0x550
[ +0.000001] ? __switch_to_asm+0x34/0x70
[ +0.000000] ? __switch_to_asm+0x40/0x70
[ +0.000001] ? __switch_to_asm+0x34/0x70
[ +0.000001] ? blk_mq_flush_busy_ctxs+0x18d/0x1c0
[ +0.000002] blk_mq_sched_dispatch_requests+0x162/0x180
[ +0.000001] __blk_mq_run_hw_queue+0x5a/0x110
[ +0.000000] blk_mq_run_work_fn+0x1b/0x20
[ +0.000002] process_one_work+0x1eb/0x3b0
[ +0.000000] worker_thread+0x4d/0x400
[ +0.000002] kthread+0x104/0x140
[ +0.000001] ? process_one_work+0x3b0/0x3b0
[ +0.000001] ? kthread_park+0x90/0x90
[ +0.000001] ret_from_fork+0x22/0x40

Choose aio_write as the function to trace:

./kernel/funcgraph aio_write > write.txt
Tracing "aio_write"... Ctrl-C to end.
1) | aio_write() {
1) ==========> |
1) | smp_irq_work_interrupt() {
1) | irq_enter() {
1) 0.121 us | rcu_irq_enter();
1) 0.391 us | }
1) | __wake_up() {
1) | __wake_up_common_lock() {
1) 0.091 us | _raw_spin_lock_irqsave();
1) | __wake_up_common() {
1) | autoremove_wake_function() {
1) | default_wake_function() {
1) | try_to_wake_up() {
1) 0.091 us | _raw_spin_lock_irqsave();
1) | select_task_rq_fair() {
1) | select_idle_sibling() {
1) 0.101 us | available_idle_cpu();
1) 0.341 us | }
1) 0.591 us | }
1) 0.091 us | _raw_spin_lock();
1) 0.210 us | update_rq_clock();
1) | ttwu_do_activate() {
1) | activate_task() {
1) | psi_task_change() {
1) 0.100 us | record_times();
1) 0.260 us | record_times();
1) 0.231 us | record_times();
1) 0.100 us | record_times();
1) 1.533 us | }
...

You can also measure individual functions by hand with ktime_get_ns(); the kernel's timekeeping documentation summarizes the available clocks, and a small sketch follows the excerpt below.

ktime_t ktime_get(void)
CLOCK_MONOTONIC

Useful for reliable timestamps and measuring short time intervals accurately. Starts at system boot time but stops during suspend.

ktime_t ktime_get_boottime(void)
CLOCK_BOOTTIME

Like ktime_get(), but does not stop when suspended. This can be used e.g. for key expiration times that need to be synchronized with other machines across a suspend operation.

ktime_t ktime_get_real(void)
CLOCK_REALTIME

Returns the time in relative to the UNIX epoch starting in 1970 using the Coordinated Universal Time (UTC), same as gettimeofday() user space. This is used for all timestamps that need to persist across a reboot, like inode times, but should be avoided for internal uses, since it can jump backwards due to a leap second update, NTP adjustment settimeofday() operation from user space.

ktime_t ktime_get_clocktai(void)
CLOCK_TAI

Like ktime_get_real(), but uses the International Atomic Time (TAI) reference instead of UTC to avoid jumping on leap second updates. This is rarely useful in the kernel.

ktime_t ktime_get_raw(void)
CLOCK_MONOTONIC_RAW

Like ktime_get(), but runs at the same rate as the hardware clocksource without (NTP) adjustments for clock drift. This is also rarely needed in the kernel.

https://www.kernel.org/doc/html/latest/core-api/timekeeping.html
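
A minimal sketch of timing one code path with ktime_get_ns() (monotonic time in nanoseconds). struct hps_dev and hps_submit_io() are hypothetical stand-ins for whatever driver path is being measured, not names from the real driver.

#include <linux/blkdev.h>
#include <linux/ktime.h>
#include <linux/printk.h>

struct hps_dev;                                                /* hypothetical device context */
void hps_submit_io(struct hps_dev *dev, struct request *req);  /* hypothetical path under test */

static void hps_timed_submit(struct hps_dev *dev, struct request *req)
{
        u64 start, elapsed_ns;

        start = ktime_get_ns();                 /* CLOCK_MONOTONIC timestamp in ns */
        hps_submit_io(dev, req);
        elapsed_ns = ktime_get_ns() - start;

        pr_info("hps_submit_io took %llu ns\n", elapsed_ns);
}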

Another option is a controlled comparison: to estimate, say, how much time the driver spends in memory copies, comment out the copy, rerun the benchmark, and take the difference between the two results.

Testing preempt_count; BUG_ON(), WARN_ON(), panic() and dump_stack() are the usual debugging helpers:

/* Inserted into nvme_queue_rq() for testing: trigger on specific request sizes. */
if (blk_rq_bytes(req) == 100 * 1024) {
        dump_stack();                            /* print the call stack, keep running */
}
BUG_ON(blk_rq_bytes(req) == 101 * 1024);         /* oops when the condition is true */
WARN_ON(blk_rq_bytes(req) == 102 * 1024);        /* warning plus stack, keep running */
printk(KERN_WARNING "nvme_queue_rq preempt_count:%d", preempt_count());

BUG_ON output:

[  361.694454] nvme_queue_rq preempt_count:0
[ 361.694464] nvme_queue_rq preempt_count:0
[ 361.694476] nvme_queue_rq preempt_count:0
[ 361.694503] nvme_queue_rq preempt_count:0
[ 361.695012] nvme_queue_rq preempt_count:0
[ 361.695434] nvme_queue_rq preempt_count:0
[ 361.695448] nvme_queue_rq preempt_count:0
[ 361.695451] nvme_queue_rq preempt_count:0
[ 361.695456] nvme_queue_rq preempt_count:0
[ 361.695473] nvme_queue_rq preempt_count:0
[ 361.695479] nvme_queue_rq preempt_count:0
[ 361.695490] nvme_queue_rq preempt_count:0
[ 361.695499] nvme_queue_rq preempt_count:0
[ 361.697490] nvme_queue_rq preempt_count:0
[ 361.697516] nvme_queue_rq preempt_count:0
[ 361.697529] nvme_queue_rq preempt_count:0
[ 361.697539] nvme_queue_rq preempt_count:0
[ 361.697598] nvme_queue_rq preempt_count:0
[ 361.769666] ------------[ cut here ]------------
[ 361.769668] kernel BUG at /root/kernel/linux-5.4/drivers/nvme/host/pci.c:934!
[ 361.769676] invalid opcode: 0000 [#1] SMP NOPTI
[ 361.769678] CPU: 0 PID: 356 Comm: kworker/0:1H Kdump: loaded Tainted: G W OE 5.4.0 #1
[ 361.769679] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[ 361.769684] Workqueue: kblockd blk_mq_run_work_fn
[ 361.769688] RIP: 0010:nvme_queue_rq+0x13d/0x1d0 [nvme]
[ 361.769689] Code: 00 00 48 83 f8 ff 74 24 48 89 45 a0 41 81 7c 24 28 00 90 01 00 0f 84 4d 19 00 00 41 81 7c 24 28 00 94 01 00 0f 85 48 19 00 00 <0f> 0b 4c 89 e6 4c 89 ff 41 bd 0a 00 00 00 e8 f0 d8 ff ff 4c 89 e7
[ 361.769690] RSP: 0018:ffffb9f900af7cd8 EFLAGS: 00010246
[ 361.769691] RAX: 0000000000008801 RBX: ffffb9f900af7d98 RCX: ffffeb84c6c27c80
[ 361.769692] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff905f20db2060
[ 361.769692] RBP: ffffb9f900af7d48 R08: ffff905f20db2080 R09: 00000001b09f9000
[ 361.769693] R10: ffff905f20db20a0 R11: ffff905f00228000 R12: ffff905ec515c780
[ 361.769694] R13: 0000000000000000 R14: ffff905ec55a0100 R15: ffff905f0e2a8000
[ 361.769695] FS: 0000000000000000(0000) GS:ffff905f2f000000(0000) knlGS:0000000000000000
[ 361.769695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 361.769696] CR2: 00007ff8baf2f024 CR3: 000000033ea30000 CR4: 0000000000340ef0
[ 361.769719] Call Trace:
[ 361.769724] blk_mq_dispatch_rq_list+0x93/0x550
[ 361.769727] ? __switch_to_asm+0x34/0x70
[ 361.769728] ? __switch_to_asm+0x40/0x70
[ 361.769728] ? __switch_to_asm+0x34/0x70
[ 361.769729] ? blk_mq_flush_busy_ctxs+0x18d/0x1c0
[ 361.769731] blk_mq_sched_dispatch_requests+0x162/0x180
[ 361.769732] __blk_mq_run_hw_queue+0x5a/0x110
[ 361.769733] blk_mq_run_work_fn+0x1b/0x20
[ 361.769735] process_one_work+0x1eb/0x3b0
[ 361.769736] worker_thread+0x4d/0x400
[ 361.769738] kthread+0x104/0x140
[ 361.769739] ? process_one_work+0x3b0/0x3b0
[ 361.769740] ? kthread_park+0x90/0x90
[ 361.769741] ret_from_fork+0x22/0x40
[ 361.769743] Modules linked in: nvme(OE) nvme_core(OE) xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc overlay vmw_vsock_vmci_transport vsock nls_iso8859_1 crct10dif_pclmul ghash_clmulni_intel snd_ens1371 snd_ac97_codec gameport ac97_bus snd_pcm aesni_intel crypto_simd cryptd glue_helper snd_seq_midi snd_seq_midi_event snd_rawmidi binfmt_misc snd_seq vmw_balloon snd_seq_device input_leds joydev snd_timer serio_raw snd soundcore vmw_vmci mac_hid sch_fq_codel vmwgfx ttm drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt msr parport_pc ramoops ppdev lp parport reed_solomon efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul mptspi mptscsih mptbase psmouse e1000 scsi_transport_spi ahci libahci i2c_piix4 pata_acpi [last unloaded: nvme_core]

WARN_ON output:

[  +0.000005] WARNING: CPU: 0 PID: 355 at /root/kernel/linux-5.4/drivers/nvme/host/pci.c:935 nvme_queue_rq+0xd3/0x1d0 [nvme]
[ +0.000000] Modules linked in: nvme(OE) nvme_core(OE) xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc overlay vmw_vsock_vmci_transport vsock nls_iso8859_1 crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper snd_ens1371 snd_ac97_codec gameport ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event vmw_balloon snd_rawmidi snd_seq binfmt_misc joydev input_leds snd_seq_device serio_raw snd_timer snd vmw_vmci soundcore mac_hid sch_fq_codel vmwgfx ttm drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt msr parport_pc efi_pstore ppdev ramoops lp parport reed_solomon ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul psmouse mptspi mptscsih mptbase e1000 scsi_transport_spi ahci libahci i2c_piix4 pata_acpi [last unloaded: nvme_core]
[ +0.000025] CPU: 0 PID: 355 Comm: kworker/0:1H Kdump: loaded Tainted: G W OE 5.4.0 #1
[ +0.000001] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[ +0.000002] Workqueue: kblockd blk_mq_run_work_fn
[ +0.000002] RIP: 0010:nvme_queue_rq+0xd3/0x1d0 [nvme]
[ +0.000001] Code: 01 00 75 7c 41 8b 44 24 28 3d 00 90 01 00 0f 84 be 19 00 00 3d 00 94 01 00 0f 84 e2 00 00 00 3d 00 98 01 00 0f 85 b7 19 00 00 <0f> 0b e9 b0 19 00 00 41 bd 0a 00 00 00 48 8b 45 d0 65 48 33 04 25
[ +0.000000] RSP: 0018:ffffbd5b80aefcd0 EFLAGS: 00010246
[ +0.000001] RAX: 0000000000019800 RBX: ffffbd5b80aefd98 RCX: 0000000000018800
[ +0.000000] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff98adcadfa400
[ +0.000000] RBP: ffffbd5b80aefd48 R08: ffff98ade7b19000 R09: 00000001c138f000
[ +0.000001] R10: ffff98ade7b19020 R11: ffff98adcadfa300 R12: ffff98ad88ab0a80
[ +0.000000] R13: 0000000000000000 R14: ffff98ad87530100 R15: ffff98ade2d64000
[ +0.000001] FS: 0000000000000000(0000) GS:ffff98adef000000(0000) knlGS:0000000000000000
[ +0.000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000000] CR2: 0000349586b16000 CR3: 0000000307660000 CR4: 0000000000340ef0
[ +0.000002] Call Trace:
[ +0.000003] blk_mq_dispatch_rq_list+0x93/0x550
[ +0.000002] ? __switch_to_asm+0x34/0x70
[ +0.000000] ? __switch_to_asm+0x40/0x70
[ +0.000001] ? __switch_to_asm+0x34/0x70
[ +0.000001] ? blk_mq_flush_busy_ctxs+0x18d/0x1c0
[ +0.000001] blk_mq_sched_dispatch_requests+0x162/0x180
[ +0.000001] __blk_mq_run_hw_queue+0x5a/0x110
[ +0.000001] blk_mq_run_work_fn+0x1b/0x20
[ +0.000001] process_one_work+0x1eb/0x3b0
[ +0.000001] worker_thread+0x4d/0x400
[ +0.000002] kthread+0x104/0x140
[ +0.000000] ? process_one_work+0x3b0/0x3b0
[ +0.000001] ? kthread_park+0x90/0x90
[ +0.000001] ret_from_fork+0x22/0x40
[ +0.000001] ---[ end trace a60efe659916a654 ]---

Twenty-eight: fio issues requests of varying sizes

fio下发的请求大小不确定

Setting the queue's max_segments limit appears to have solved this; a sketch of how a block driver pins the limit follows.
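
A minimal sketch of how a block driver might pin these limits when setting up its request queue; the function name and the particular values are illustrative, not taken from the real driver.

#include <linux/blkdev.h>

static void hps_set_queue_limits(struct request_queue *q)
{
        blk_queue_max_segments(q, 1);               /* at most one scatter/gather segment per request */
        blk_queue_max_segment_size(q, 128 * 1024);  /* each segment no larger than 128 KiB */
        blk_queue_max_hw_sectors(q, 256);           /* payload capped at 256 * 512 B = 128 KiB */
}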

Miscellaneous

  1. Use iperf to test TCP/UDP performance and tcpdump to capture packets; if apt is deadlocked on dependencies pulled in by other software, remove the packages you do not need and install again, for example:
    apt autoremove
    apt -f install
    apt update
    apt upgrade
  2. The IPs of different NICs should preferably sit on different subnets, otherwise all sorts of strange problems appear.
  3. Use -D with gcc to define macros on the command line:
# Makefile
CFLAGS += -DM1 -DM2
test: test.c
	gcc test.c -o test $(CFLAGS)

/* test.c */
#include <stdio.h>

int main() {
#ifdef M1
    printf("M1\n");
#endif
#ifdef M2
    printf("M2\n");
#endif
}
  4. You can share a phone's network over USB (the phone's hotspot settings include a USB tethering option); once the phone is attached to the Linux machine, set the IP and it is online.
    Linux通过手机USB网络共享上网
    Related commands:

    route
    arp -a
    ifconfig
    tcpdump -i <interface>
    ip route add default via 192.168.xxx.xxx dev <interface>

    As long as one of the interconnected machines can reach the Internet, the others can set that machine's IP as their default route and configure a proper DNS server to get online; prefer well-known DNS servers (114.114.114.114, 8.8.8.8).
    盘点国内外优秀公共DNS

  5. UDP packet-loss test
    The server receives packets with recvfrom and leaves its while loop when the returned length is smaller than the agreed size, then prints how many packets it received.
    The client uses sendto to send packets of the agreed length to the server.
    An end program sends the server one packet shorter than the agreed length to mark the end of the test.
    Comparing the number of packets the server received with the number the client sent gives the loss rate; the client may send from several threads in parallel.

  6. When defining constants with macros, prefer uncommon values over everyday ones such as 0 or 1.

  7. Watch out for writing = where == is intended; writing 0 == n lets the compiler catch the slip.

  8. printf is buffered: output goes into a buffer first and is only flushed to the underlying file when one of the following holds (a short demonstration follows this item):
    1 the buffer is full
    2 a "\n" is written (line-buffered mode, i.e. stdout attached to a terminal)
    3 fflush is called to flush the buffer manually
    4 scanf needs to read from the buffer, which also flushes the pending output
    C printf()函数不显示输出内容
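
A small user-space demonstration of the rules above (a sketch; it assumes stdout is attached to a terminal and is therefore line-buffered):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("working...");   /* no newline: the text sits in the stdio buffer */
    fflush(stdout);         /* rule 3: force it out so progress is visible */
    sleep(2);
    printf("done\n");       /* rule 2: the '\n' flushes a line-buffered stream */
    return 0;
}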

The missing-output problem from section fourteen, however, is not caused by this, because the macro implementation already appends a newline:

#define printf(fmt, arg...) printk(KERN_ALERT "%s> " fmt "\n", __FUNCTION__, ##arg)
  9. Copying files via a USB stick, a 6 GB file came out as only 4 GB; the cause turned out to be a limit of the filesystem itself (FAT32). Reformat the stick with another filesystem or split the file.
  10. Disable Ubuntu's unattended-upgrades (automatic background updates):
    sudo apt remove unattended-upgrades
  11. An external drive cannot be ejected and is shown as in use by a process (System):

    Turn off the indexing option.

Some commands:

dd if=/dev/zero of=file4 bs=1G count=1 # create a file of a given size
scp file yy@192.168.2.168:/home/yy/kylin/test_driver/test
chmod 777 . -R # give rwx on everything under the current directory
chown yy:yy . -R # change the owner of everything under the current directory
sysctl -w net.core.rmem_default=1821440 # default socket receive buffer size
sysctl -w net.core.wmem_default=1821440 # default socket send buffer size
lsof -i:7890 # show the process listening on port 7890 (may need root)

Transferring files with FileZilla fails with "Failed to convert command to 8 bit charset":
FileZilla出现Failed to convert command to 8 bit charset
linux-网络数据包抓取-tcpdump
tcp参数设置
linux socket 缓存: core rmem_default rmem_max

A reboot solves some problems; powering off and back on solves most problems.
Reinstalling the system solves nearly all problems; replacing the machine solves every problem.

#undef MACRO_NAME // remove a macro definition
#include <stdio.h>

#define ERROR_TYPE1_BASE 100
#define ERROR_TYPE1_1 ERROR_TYPE1_BASE + 1
#define ERROR_TYPE1_2 ERROR_TYPE1_BASE + 2

#define ERROR_TYPE2_BASE 200
#define ERROR_TYPE2_1 ERROR_TYPE2_BASE + 1
#define ERROR_TYPE2_2 ERROR_TYPE2_BASE + 2

#define ERROR_TYPE3_BASE 300
enum ERROR_TYPE3 { ERROR_TYPE3_1 = ERROR_TYPE3_BASE + 1, ERROR_TYPE3_2 };

int main() { printf("%d %d %d %d %d %d", ERROR_TYPE1_1, ERROR_TYPE1_2, ERROR_TYPE2_1, ERROR_TYPE2_2, ERROR_TYPE3_1, ERROR_TYPE3_2); }
/*
101 102 201 202 301 302
*/

Defining the error return values of different categories in different numeric ranges (with enums given explicit starting values) makes it possible to locate the failing layer quickly, no matter how deeply the functions are nested.

When defining macros, remember the parentheses around arguments and the whole body so that operator precedence behaves as expected and strange bugs are avoided; a small illustration follows.
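
For example, a small user-space illustration of what missing parentheses do to operator precedence:

#include <stdio.h>

#define SQUARE_BAD(x) x * x        /* no parentheses */
#define SQUARE_OK(x)  ((x) * (x))  /* argument and body fully parenthesized */

int main(void) {
    printf("%d\n", SQUARE_BAD(1 + 2)); /* expands to 1 + 2 * 1 + 2 = 5 */
    printf("%d\n", SQUARE_OK(1 + 2));  /* expands to ((1 + 2) * (1 + 2)) = 9 */
    return 0;
}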

Q: Kylin desktop OS V10 SP1 (builds after 2107) keeps prompting for security authorization when installing applications; how can this be turned off?
A: Set the "check application source" level to off: Start menu -> Settings -> Security and Updates -> Security Center -> Application Control and Protection -> Check application source, then choose the option that allows applications from any source to be installed.

quiet - this option tells the kernel to NOT produce any output (a.k.a. Non verbose mode). If you boot without this option, you’ll see lots of kernel messages such as drivers/modules activations, filesystem checks and errors. Not having the quiet parameter may be useful when you need to find an error.

What do the nomodeset, quiet and splash kernel parameters mean?

Linux kernel - SSE register return with SSE disabled
SSE : streaming SIMD extensions

SSE is a SIMD instruction set Intel introduced with the Pentium 3 in 1999. It adds eight 128-bit registers, xmm0~xmm7, each holding four 32-bit single-precision floats.

When the Linux kernel does a context switch it would have to save all of these registers, which adds overhead, so the kernel avoids floating-point computation and is compiled with the options:

-mno-sse -mno-mmx -mno-sse2

Linux kernel - SSE register return with SSE disabled

During driver development, be clear about your working steps, the problems you run into and how you intend to solve them; when you hit a wall, discuss the possible causes with people around you and keep your lead updated on progress. Sometimes communication solves more than grinding away at code.

tldr df
df
Gives an overview of the filesystem disk space usage. More information: https://www.gnu.org/software/coreutils/df.

- Display all filesystems and their disk usage:
df

- Display all filesystems and their disk usage in human-readable form:
df -h

- Display the filesystem and its disk usage containing the given file or directory:
df {{path/to/file_or_directory}}

- Display statistics on the number of free inodes:
df -i

- Display filesystems but exclude the specified types:
df -x {{squashfs}} -x {{tmpfs}}

User-space program exercising the NIC driver's ioctl:

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <net/route.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#define TEST_CMD 0x89F3
int main(int argc, char *argv[]) {
    struct ifreq ifr = {0};
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_addr.sa_family = AF_INET;
    int data = 1;
    ifr.ifr_ifru.ifru_data = (void *)&data;  // pass user data to the driver
    strcpy(ifr.ifr_name, "网卡名");           // the interface name to talk to
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        printf("create socket fd failed");
        return -1;
    }
    int ret = ioctl(fd, TEST_CMD, &ifr);
    if (ret < 0) {
        printf("ioctl failed, %d %d\n", ret, errno);
    }
    close(fd);

    return 0;
}
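
For reference, a minimal sketch of the matching kernel side: the private command handled in a NIC driver's ndo_do_ioctl callback. struct hps_priv, hps_do_ioctl and the meaning given to the value are assumptions made for illustration, not the real driver's code.

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/uaccess.h>

#define TEST_CMD 0x89F3                 /* SIOCDEVPRIVATE + 3, matching the user program */

struct hps_priv {                       /* hypothetical per-device private data */
        int test_flag;
};

static int hps_do_ioctl(struct net_device *ndev, struct ifreq *ifr, int cmd)
{
        struct hps_priv *priv = netdev_priv(ndev);
        int val;

        if (cmd != TEST_CMD)
                return -EOPNOTSUPP;

        /* ifr_data carries the user-space pointer the program stored in ifru_data */
        if (copy_from_user(&val, ifr->ifr_data, sizeof(val)))
                return -EFAULT;

        priv->test_flag = val;          /* illustrative use of the value */
        return 0;
}

/* hooked up through net_device_ops, e.g. .ndo_do_ioctl = hps_do_ioctl */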

UDP broadcast programs:

// Sender
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#include <sys/types.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <strings.h>

int main() {
    int sock = -1;
    if ((sock = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
        printf("socket error");
        return -1;
    }
    const int opt = 1;
    // mark the socket as allowed to send broadcasts
    int nb = 0;
    nb = setsockopt(sock, SOL_SOCKET, SO_BROADCAST, (char *)&opt, sizeof(opt));
    if (nb == -1) {
        printf("set socket error...");
        return -1;
    }
    // broadcast address and the port the clients listen on
    struct sockaddr_in addrto;
    bzero(&addrto, sizeof(struct sockaddr_in));
    addrto.sin_family = AF_INET;
    addrto.sin_addr.s_addr = inet_addr("192.168.2.255");
    addrto.sin_port = htons(7890);
    int nlen = sizeof(addrto);
    const char *msg = "abcdef";
    while (1) {
        sleep(1);
        // send the message to the broadcast address
        int ret = sendto(sock, msg, strlen(msg), 0, (struct sockaddr *)&addrto, nlen);
        if (ret < 0) {
            printf("send error\n");
        } else {
            printf("ok\n");
        }
    }
    return 0;
}

// Receiver
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MAXBUF 2000

int main() {
    struct sockaddr_in server_addr;
    struct sockaddr_in client_addr;
    socklen_t addr_size = sizeof(client_addr);
    char buf[MAXBUF];
    int cc;
    int server_sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = htonl(INADDR_ANY); // not a specific address such as inet_addr("192.168.2.123")
    server_addr.sin_port = htons(7890);
    cc = bind(server_sock, (struct sockaddr *)&server_addr, sizeof(struct sockaddr));
    if (cc < 0) {
        printf("bind error\n");
    } else {
        printf("bind success\n");
    }
    printf("listen packet\n");
    while (1) {
        cc = recvfrom(server_sock, buf, MAXBUF - 1, 0, (struct sockaddr *)&client_addr, &addr_size);
        if (cc < 0)
            continue;
        buf[cc] = '\0'; // terminate so the payload can be printed as a string
        printf(" recv msg is %s\n", buf);
    }
    return 0;
}

The key points are that the sender enables SO_BROADCAST with setsockopt, and the receiver binds INADDR_ANY rather than the address of a specific NIC.


Reference: UDP之广播

UDP multicast

Multicast background:
组播IP地址到底是谁的IP??
组播IGMP-原理介绍+报文分析+配置示例

Sample code (a minimal receiver sketch also follows below):
Linux C/C++编程:Udp组播(多播)
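
As a minimal sketch (not taken from the linked post), the receiving side joins a multicast group with IP_ADD_MEMBERSHIP; the group address 239.0.0.1 and port 7890 are only examples:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);           /* receive on any local interface */
    addr.sin_port = htons(7890);
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    /* join the group so the kernel delivers its datagrams to this socket */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1"); /* example group address */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);      /* let the kernel choose the interface */
    setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char buf[2000];
    while (1) {
        ssize_t n = recvfrom(sock, buf, sizeof(buf) - 1, 0, NULL, NULL);
        if (n > 0) {
            buf[n] = '\0';
            printf("recv msg is %s\n", buf);
        }
    }
    return 0;
}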

Wireshark capture:
Sender: 192.168.252.135
Receiver: 192.168.252.141

Recommended reading

linux内核可加载模块的makefile
linux内核makefile概览
内核模块中使用本地头文件
Linux下头文件搜索路径
linux 内核头文件及内核库文件
Linux内核头文件
Linux errno详解
Linux内核API
Linux Kernel 学习的一些资源
如何编译 Linux 内核
Ubuntu Linux内核版本升级或降级到指定版本(基于ubuntu 18.04示例)
Multi-Queue Block IO Queueing Mechanism (blk-mq)