Article Archive

The TCP Nagle algorithm

Sockets have the Nagle algorithm enabled by default. The feature is mainly there to improve throughput, but it increases the transmission latency of small packets. For latency-sensitive communication, the Nagle algorithm needs to be turned off:

int flags = 1;
rc = setsockopt(rs->ss_fd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags));
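
A slightly fuller sketch of the same idea: create a TCP socket, disable Nagle, and read the option back to confirm it took effect (only the TCP_NODELAY setsockopt call is from the original snippet; the rest is illustrative):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Disable the Nagle algorithm: small writes go out immediately. */
    int flags = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags)) < 0) {
        perror("setsockopt(TCP_NODELAY)");
        return 1;
    }

    /* Read the option back to confirm it is set. */
    int val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0) {
        perror("getsockopt(TCP_NODELAY)");
        return 1;
    }
    printf("TCP_NODELAY = %d\n", val);
    return 0;
}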

The kernel sets the socket's Nagle behavior through the following call stack:

setsockopt()          // net/socket.c
  tcp_setsockopt()    // net/ipv4/tcp.c

/*
 * Socket option code for TCP.
 */
static int do_tcp_setsockopt(struct sock *sk, int level,   // net/ipv4/tcp.c
        int optname, char __user *optval, unsigned int optlen)
{
    //...
    switch (optname) {
    case TCP_NODELAY:
        if (val) {
            /* TCP_NODELAY is weaker than TCP_CORK, so that
             * this option on corked socket is remembered, but
             * it is not activated until cork is cleared.
             *
             * However, when TCP_NODELAY is set we make
             * an explicit push, which

»» Continue reading

Correct usage of select

On Linux, the main header files are:
/usr/include/sys/select.h
/usr/include/bits/select.h

The select-related APIs are:

int  select(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, struct timeval *timeout);
void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);

According to the man pages, nfds is described as:

nfds is the highest-numbered file descriptor in any of the three sets, plus 1.

So nfds is not the size of the fd_set (it could not express that anyway); it is the highest-numbered file descriptor + 1, where the highest-numbered file descriptor is the largest fd value placed in any of the fd_sets. Furthermore, that maximum fd must stay below FD_SETSIZE:

An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior.

The reason an fd must not reach FD_SETSIZE is that fd_set is a fixed-size bitmap whose size is derived from FD_SETSIZE. For an fd greater than or equal to FD_SETSIZE, FD_SET writes to memory that does not belong to the bitmap.

So the fd should be checked for validity before every FD_SET call.
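
A minimal sketch of that check, using a hypothetical wrapper around FD_SET and computing nfds as the highest fd plus one (the names here are illustrative, not from the original post):

#include <stdio.h>
#include <sys/select.h>

/* Only add fds that the fixed-size fd_set bitmap can actually hold. */
static int fd_set_checked(int fd, fd_set *set, int *maxfd)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;            /* would be undefined behavior */
    FD_SET(fd, set);
    if (fd > *maxfd)
        *maxfd = fd;
    return 0;
}

int main(void)
{
    fd_set rfds;
    int maxfd = -1;
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };

    FD_ZERO(&rfds);
    if (fd_set_checked(0, &rfds, &maxfd) < 0)   /* watch stdin */
        return 1;

    /* nfds is the highest-numbered fd plus 1, not the set's capacity. */
    int ready = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    printf("select returned %d\n", ready);
    return 0;
}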

Asymmetric TCP buffer sizes

While writing unit tests today I noticed that setsockopt and getsockopt do not agree for SO_SNDBUF/SO_RCVBUF: the value returned by getsockopt is not the value passed to setsockopt. The following code demonstrates the problem:

on = 1024;
BUG_ON(tcp_setopt(sfd, TP_NOBLOCK, &on, sizeof(on)));
BUG_ON(tcp_getopt(sfd, TP_NOBLOCK, &on, &optlen));
BUG_ON(on != 1024);
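
The same behavior can be reproduced with the plain socket API (a minimal sketch using SO_SNDBUF; the value printed by getsockopt is typically larger than the one requested):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 1;

    int snd = 1024;
    socklen_t len = sizeof(snd);

    /* Ask for a 1024-byte send buffer... */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, sizeof(snd));

    /* ...and read back what the kernel actually granted. */
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, &len);
    printf("SO_SNDBUF = %d\n", snd);   /* usually not 1024 */
    return 0;
}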

The value the kernel ends up applying is adjusted, for example it may be doubled. There is a Red Hat bug report about this: https://bugzilla.redhat.com/show_bug.cgi?id=170694. My kernel is CentOS 6.5, 2.6.32-431.5.1.el6.x86_64 #1 SMP. Digging into the kernel source (linux-2.6.32-358.6.2.el6/net/core/):

/*
 * This is meant for all protocols to use and covers goings on
 * at the socket level. Everything here is generic.
 */
int sock_setsockopt(struct socket *sock, int level, int optname,
                    char __user *optval, unsigned int optlen)
{
    // ...
    case SO_SNDBUF:
        /* Don't error on this BSD doesn't and if you think about it
           this is right. Otherwise apps have to play 'guess the biggest size'

»» Continue reading

close() on a socket shared by multiple processes

When a parent and child process share a socket descriptor and one of them calls close, is a FIN sent, affecting the other process's use of the socket? No: close only decrements the reference count and does nothing else; the actual teardown happens only when the last reference is released.
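
A user-space sketch of this behavior (hypothetical; it assumes cfd is an already connected TCP socket, and the child's close does not disturb the parent):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* cfd is assumed to be a connected TCP socket shared via fork(). */
static void demo(int cfd)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: closing only drops its reference; no FIN is sent
         * because the parent still holds the file open. */
        close(cfd);
        _exit(0);
    }

    waitpid(pid, NULL, 0);

    /* Parent: the connection is still usable. */
    const char msg[] = "still alive\n";
    if (write(cfd, msg, strlen(msg)) < 0)
        perror("write");

    /* Only this close releases the last reference and triggers the FIN. */
    close(cfd);
}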

Looking at the source, you can follow this call stack:

close        ------------ open.c
__close_fd   ------------ fs/file.c
filp_close   ------------ fs/open.c
fput         ------------ fs/file_table.c
__fput       ------------ fs/file_table.c   (the last op)

First, a socket descriptor is itself a kind of Unix file; its file_operations are implemented in net/socket.c:

/*
 *  Socket files have a set of 'special' operations as well as the generic file ones.
 *  These don't appear in the operation structures but are done directly via the socketcall()
 *  multiplexor.
 */
static const struct file_operations socket_file_ops = {
    .owner          = THIS_MODULE,
    .llseek         = no_llseek,
    .aio_read       = sock_aio_read,
    .aio_write      = sock_aio_write,
    .poll           = sock_poll,
    .unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl   = compat_sock_ioctl,
#endif
    .mmap           = sock_mmap,
    .open           = sock_no_open,    /* special open code to disallow open

»» Continue reading

TCP congestion control

Transmission of packets by a TCP sender is restricted by the congestion window. On reception of an ACK, the sender may increase the congestion window. At the start of a TCP connection the sender is in slow start, where the congestion window grows exponentially from a small value (typically 2) until a threshold is reached. After reaching the threshold, TCP enters the congestion avoidance phase and increases the congestion window linearly.
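
A toy illustration of that growth pattern (a sketch counting whole segments, not the kernel's implementation; the initial window and ssthresh values are assumptions):

#include <stdio.h>

/* Simulate window growth per RTT: slow start doubles the window,
 * congestion avoidance adds one segment per RTT. */
int main(void)
{
    int cwnd = 2;         /* initial congestion window, in segments */
    int ssthresh = 64;    /* slow start threshold (assumed) */

    for (int rtt = 0; rtt < 12; rtt++) {
        printf("rtt %2d: cwnd = %d (%s)\n", rtt, cwnd,
               cwnd < ssthresh ? "slow start" : "congestion avoidance");
        if (cwnd < ssthresh)
            cwnd *= 2;    /* exponential growth */
        else
            cwnd += 1;    /* linear growth */
    }
    return 0;
}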

Congestion avoidance state machine

Events such as the arrival of duplicate ACKs, SACK, and Explicit Congestion Notification indicate possible congestion caused by the sender transmitting too many packets. These events are processed through

»» Continue reading

The new hlist API for the Linux kernel

While writing a list implementation today I looked at the Linux kernel's list code for reference, and noticed that the kernel's hlist implementation still has room for improvement. The kernel version here is redhat 2.6.32-358.14.1.el6.x86_64. The key pieces are these two APIs, list_for_each_entry and hlist_for_each_entry:

/**
 * list_for_each_entry - iterate over list of given type
 * @pos:    the type * to use as a loop cursor.
 * @head:   the head for your list.
 * @member: the name of the list_struct within the struct.
 */
#define list_for_each_entry(pos, head, member)                          \
        for (pos = list_entry((head)->next, typeof(*pos), member);      \
             &pos->member != (head);                                    \
             pos = list_entry(pos->member.next, typeof(*pos), member))

/**
 * hlist_for_each_entry - iterate over list of given type
 * @tpos:   the type * to use as a loop cursor.
 * @pos:    the &struct hlist_node to use as a loop cursor.
 * @head:
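
For context, a usage sketch of the two iterators in this kernel generation (hypothetical struct and list heads; the point is the extra struct hlist_node cursor that hlist_for_each_entry requires on top of the typed cursor):

#include <linux/kernel.h>
#include <linux/list.h>

/* A hypothetical element that can live on both kinds of lists. */
struct foo {
    int key;
    struct list_head  list;   /* for the doubly linked list */
    struct hlist_node node;   /* for the hash list */
};

static LIST_HEAD(foo_list);
static HLIST_HEAD(foo_hlist);

static void walk(void)
{
    struct foo *f;
    struct hlist_node *pos;   /* extra cursor needed only by the hlist variant */

    /* One cursor is enough for the regular list. */
    list_for_each_entry(f, &foo_list, list)
        pr_info("list key=%d\n", f->key);

    /* The hlist variant threads a second, untyped cursor through the walk. */
    hlist_for_each_entry(f, pos, &foo_hlist, node)
        pr_info("hlist key=%d\n", f->key);
}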

»» Continue reading

NAPI: the new device driver packet processing framework

Before NAPI existed, network packet reception went roughly like this:

  1. A packet arrives at the NIC and raises a hardware interrupt; the interrupt handler processes the packet and calls netif_rx() to put it on a per-CPU queue.
  2. A softirq is then triggered, which calls net_rx_action(); the kernel moves the packets from the per-CPU queue up to the network layer.

The problems NAPI aims to solve (a driver-side sketch follows this list):

  • Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.
  • Packet throttling: When the system is overwhelmed and must drop packets, it's better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before

    »» Continue reading
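
As referenced above, a pseudocode-level sketch of the NAPI driver pattern of that era (a hypothetical driver; netif_napi_add()/napi_schedule()/napi_complete() are the standard 2.6.32-era calls, while everything prefixed my_ is made up and stubbed out):

#include <linux/netdevice.h>
#include <linux/interrupt.h>

#define MY_NAPI_WEIGHT 64

struct my_priv {
    struct napi_struct napi;
    /* ... device state, RX ring, etc. ... */
};

/* Hardware-specific helpers, stubbed here. */
static void my_disable_rx_irq(struct my_priv *priv) { }
static void my_enable_rx_irq(struct my_priv *priv) { }
static struct sk_buff *my_rx_ring_pop(struct my_priv *priv) { return NULL; }

/* Hardware interrupt: do not process packets here, just mask the
 * device's RX interrupt and hand the work to the poll function. */
static irqreturn_t my_isr(int irq, void *dev_id)
{
    struct my_priv *priv = dev_id;

    my_disable_rx_irq(priv);
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}

/* Softirq context: pull up to 'budget' packets out of the RX ring. */
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_priv *priv = container_of(napi, struct my_priv, napi);
    int work_done = 0;

    while (work_done < budget) {
        struct sk_buff *skb = my_rx_ring_pop(priv);
        if (!skb)
            break;
        netif_receive_skb(skb);
        work_done++;
    }

    if (work_done < budget) {
        napi_complete(napi);          /* ring drained, re-arm interrupts */
        my_enable_rx_irq(priv);
    }
    return work_done;
}

/* At probe time the driver would register the poll function with:
 *     netif_napi_add(netdev, &priv->napi, my_poll, MY_NAPI_WEIGHT);
 */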

Unreliable Guide To Locking

- Rusty Russell, Unreliable Guide To Locking

In short, there are just two key points (illustrated in the sketch after this list):

  1. When data is shared across contexts, preemption must be avoided.
  2. Whether the code is allowed to sleep while taking and holding the lock.
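
A minimal sketch of how those two points map onto lock choice (an assumed example, not from the guide: a mutex where sleeping is allowed, spin_lock_irqsave where the data is also touched from interrupt context):

#include <linux/mutex.h>
#include <linux/spinlock.h>

/* Point 2: only process context touches this, and sleeping is fine,
 * so a mutex works. */
static DEFINE_MUTEX(cfg_lock);
static int cfg_value;

static void update_config(int v)
{
    mutex_lock(&cfg_lock);     /* may sleep while waiting */
    cfg_value = v;
    mutex_unlock(&cfg_lock);
}

/* Point 1: this counter is also updated from an interrupt handler, so
 * process context must disable local interrupts (and with them, preemption)
 * while holding the spinlock, otherwise the handler could deadlock on it. */
static DEFINE_SPINLOCK(stat_lock);
static unsigned long rx_packets;

static void account_rx(void)            /* called from both contexts */
{
    unsigned long flags;

    spin_lock_irqsave(&stat_lock, flags);
    rx_packets++;
    spin_unlock_irqrestore(&stat_lock, flags);
}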

http://blog.pipul.org/2012/11/锁/

Linux kernel design patterns

  1. http://lwn.net/Articles/336224/
  2. http://lwn.net/Articles/336255/
  3. http://lwn.net/Articles/336262/

The Secret to 10 Million Concurrent Connections - The Kernel is the Problem, Not the Solution

- The Secret to 10 Million Concurrent Connections - The Kernel is the Problem, Not the Solution

Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible you say. Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they may be unfamiliar.

To learn how it’s done we turn to Robert Graham, CEO of Errata Security, and his absolutely fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.

Robert has a brilliant way

»» Continue reading

Page 4 of 5