1. The Problem
On a weekend afternoon, our customer support staff reported that many players couldn't log in to our game.
The service responsible for login processing is called GameLogic, of which we deploy several instances per game zone. Each instance has mainly two threads:
- Main Thread, which reads/writes user data by interacting with two different MySQL servers. One of them is local, while the other is deployed remotely in another IDC.
- Refresh Thread, which refreshes the MySQL connections every 10 seconds.
We manage MySQL connections ourselves, which is obviously not a good idea, but that's not what we want to discuss today.
At the time the problem happened, we found the following log entries worth noting:
- A lot of login requests were queued in the Main Thread.
- There were log entries about the remote MySQL connection:
The last packet successfully received from the server was 1,200,245 milliseconds ago.
- The Refresh Thread had exited hours earlier due to an uncaught exception.
1.1. The Puzzle
It's pretty obvious that the problem was caused by the connection to the remote MySQL server. Being in another IDC, it's quite common for the network to jitter now and then. Once any of the MySQL connections is blocked, players have difficulty logging in.
But one thing confused me: why hadn't this happened more often? That afternoon can't have been the only time the network failed us.
So we need to dig deeper.
2. Technical Background
2.1. Interaction Flow between MySQL Clients and Servers
Ordinarily, a client interacts with a MySQL server in the following steps (a sketch in JDBC terms follows the list):
1. The client connects to the MySQL server.
2. The client writes COM_STMT_PREPARE and COM_STMT_EXECUTE commands to the server.
3. The client tries to read (and waits for) data from the server.
4. The server processes the commands and writes data back to the client.
5. The client successfully reads the data.
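In Connector/J terms, the flow looks roughly like the sketch below. The URL, credentials, and query are placeholders, and COM_STMT_PREPARE is only sent over the wire when server-side prepared statements are enabled via useServerPrepStmts=true:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LoginQuery {
    public static void main(String[] args) throws Exception {
        // Step 1: connect to the MySQL server (placeholder URL and credentials).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://remote-db.example.com:3306/game?useServerPrepStmts=true",
                "user", "password")) {
            // Step 2: COM_STMT_PREPARE is sent when preparing,
            // COM_STMT_EXECUTE when the statement is executed.
            try (PreparedStatement ps =
                         conn.prepareStatement("SELECT name FROM player WHERE id = ?")) {
                ps.setLong(1, 42L);
                // Steps 3-5: the client blocks here until the server's response arrives.
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name"));
                    }
                }
            }
        }
    }
}
```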
2.2. Connection Timeout in Step 1
At step 1, we can set a timeout using the connectTimeout parameter. By default, however, this parameter is not set, so this is a potential blocking point. For example, when the machine on which the MySQL server is deployed is shut down, we have to wait for the system default TCP connect timeout before step 1 fails.
The system default TCP connect timeout on Linux is controlled by the tcp_syn_retries variable, which is the number of SYN retries before the connection attempt fails. Different OSes interpret this variable differently, according to this article. On CentOS, the default value is 5, and the total timeout is about 20 seconds.
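Rather than relying on the OS default, a connect timeout can be set explicitly on the JDBC URL; the value below is only illustrative:

```java
// connectTimeout is in milliseconds; the default of 0 means step 1 can block indefinitely.
String url = "jdbc:mysql://remote-db.example.com:3306/game?connectTimeout=3000";
```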
2.3. Read Timeout & Reset in Steps 3-5
If the network breaks at step 4, step 5 won't happen until either:
- the client receives a RST, or
- the read timeout (controlled by the socketTimeout parameter, see the example below) is reached.
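As with connectTimeout, the read timeout can be bounded on the JDBC URL via socketTimeout; the values here are only illustrative:

```java
// socketTimeout is in milliseconds; the default of 0 means a read can block forever.
String url = "jdbc:mysql://remote-db.example.com:3306/game"
        + "?connectTimeout=3000&socketTimeout=10000";
```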
2.3.1. How RST is sent and received
The RST message is actually a flag in TCP packets. When one TCP peer, say A, sends a TCP packet to another peer, say B, and B's OS finds that no such connection exists, B sends a RST back to A.
When a process is blocked reading from a TCP connection, it depends on the TCP Keep Alive mechanism to periodically send keep-alive packets to the peer, so as to trigger a RST if the connection is already considered broken on the peer side.
The MySQL Connector provides a tcpKeepAlive parameter we can use to enable this feature. By default, this parameter is ON.
At the connection level, we can only configure whether TCP Keep Alive is on or off. At the OS level, however, we can also configure the following parameters (which affect all connections on the host):
- /proc/sys/net/ipv4/tcp_keepalive_time, which is 1200 on our OS.
- /proc/sys/net/ipv4/tcp_keepalive_intvl, which is 75
- /proc/sys/net/ipv4/tcp_keepalive_probes, which is 9
The meaning is: for each TCP connection that has TCP Keep Alive enabled, send a keep-alive packet after 1200 seconds of idle time. If it is not answered, resend every 75 seconds; after the 9th unanswered probe, consider the connection broken.
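Connector/J only exposes the on/off switch (tcpKeepAlive), but on JDK 11+ on Linux the three values above can also be tuned per socket instead of OS-wide. This is only a sketch of the mechanism with placeholder values, not something the driver configures for us:

```java
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("remote-db.example.com", 3306)) { // placeholder host
            socket.setKeepAlive(true);                                    // what tcpKeepAlive=true enables
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 60);     // per-socket tcp_keepalive_time
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 10); // per-socket tcp_keepalive_intvl
            socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);     // per-socket tcp_keepalive_probes
        }
    }
}
```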
3. The Root Cause
3.1. What really happened today
The 1,200,245 milliseconds in the log (roughly 1200 seconds) must be the tcp_keepalive_time.
After step 3, the network must have broken.
In step 4, when the server finished processing the commands, it tried to write back to the client and failed after net_write_timeout. After that, the server considered the connection broken and closed it.
| The net_write_timeout variable is set by the client when calling the executeXXX functions provided by the MySQL Connector. The default value of net_write_timeout is equal to the netTimeoutForStreamingResults property, which is 600 seconds. |
Sometime after net_write_timeout expired and before keep-alive was triggered, the network recovered. So the client kept waiting for the read to finish.
Then keep-alive was triggered, which caused the client to send keep-alive packets, and the server replied with a RST immediately, since it had already closed the connection.
Finally, after receiving the RST, an exception was thrown by the MySQL Connector.
3.2. Why didn't it happen more often?
The Refresh Thread periodically sends a SELECT NOW() command to the server, which involves two steps:
- First, acquiring a PreparedStatement from the connection. This sends a COM_STMT_PREPARE command to the server, which will trigger the RST.
- Second, calling executeQuery (which triggers a write and a read). Note, however, that the MySQL Connector serializes executeQuery calls, so only one thread can be inside executeQuery at a time.
The Refresh Thread helps trigger the RST in its first step. If the Refresh Thread is blocked in its second step, it doesn't help at all.
So as long as the Refresh Thread is alive, it reduces how often the problem occurs. But since it had died, the problem surfaced.
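For reference, here is a rough sketch of what the Refresh Thread's loop looks like, with the obvious fix of catching exceptions so the thread cannot silently die. The structure around the SELECT NOW() query is an assumption, not our exact code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RefreshThread extends Thread {
    private final Connection remoteConn; // connection to the MySQL server in the other IDC

    public RefreshThread(Connection remoteConn) {
        this.remoteConn = remoteConn;
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            try {
                // Step 1: preparing sends COM_STMT_PREPARE, which can trigger the pending RST.
                try (PreparedStatement ps = remoteConn.prepareStatement("SELECT NOW()");
                     // Step 2: executeQuery writes, then blocks reading the response.
                     ResultSet rs = ps.executeQuery()) {
                    rs.next();
                }
            } catch (Exception e) {
                // Log and keep going instead of letting an uncaught exception kill the thread,
                // which is what happened in our incident.
                e.printStackTrace();
            }
            try {
                Thread.sleep(10_000); // refresh every 10 seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```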
4. Conclusion
MySQL connections between two IDCs are fragile. The following situations may result in long blocking:
- When connecting to MySQL, if the machine on which MySQL is deployed is shut down.
- When reading from MySQL, if the machine has crashed or the network is broken.
- When writing to MySQL, if the machine has crashed or the network is broken.
So we'd better use different threads to handle local and remote MySQL connections; otherwise, when the network is poor, unexpected pauses are unavoidable.
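For example, remote MySQL access could be pushed onto its own thread and the wait bounded, so that a hang on the cross-IDC link cannot stall login processing on the Main Thread. This is only an illustrative sketch; the names and the timeout value are assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RemoteDbGateway {
    // One thread reserved for the remote IDC so a hang there cannot stall the Main Thread.
    private final ExecutorService remoteExecutor = Executors.newSingleThreadExecutor();

    public String loadUserData(long playerId) throws Exception {
        Future<String> future = remoteExecutor.submit(() -> queryRemoteMysql(playerId));
        try {
            return future.get(3, TimeUnit.SECONDS); // bound how long login waits on the remote DB
        } catch (TimeoutException e) {
            future.cancel(true); // give up on this attempt; only the dedicated worker stays blocked
            throw e;
        }
    }

    private String queryRemoteMysql(long playerId) {
        // Placeholder for the real JDBC call against the remote MySQL server.
        return "user-data-for-" + playerId;
    }
}
```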
