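Our company is tightening information security. We use GitLab for code management, and every remote operation a user performs (push, sync) has to be recorded.
GitLab keeps its various backend logs under /var/log/gitlab/.
A quick search on Baidu turned up the following mapping between operations and log files:
1) Syncing or pushing code from a git client, as well as operations in the web UI, are recorded in /var/log/gitlab/gitlab-rails/production.log
2) Creation and deletion of users and projects are recorded in /var/log/gitlab/gitlab-rails/application.log
3) Updates to branches, tags, and commits are recorded in /var/log/gitlab/gitlab-shell/gitlab-shell.log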
…
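There are other kinds of logs as well, but since I mainly need to record code syncs and pushes, I started with production.log.
I used a git client to pull code from the remote GitLab and to push code to it, and found that production.log gained the following entries.
Sync (pull or clone) operation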
Started GET "xxxx.git/info/refs?service=git-upload-pack" for 192.168.xx.xx at 2017-04-24 19:00:34 +0800
Processing by Projects::GitHttpController#info_refs as */*
Parameters: {"service"=>"git-upload-pack", "namespace_id"=>"xxxx", "project_id"=>"xxxx.git"}
Completed 200 OK in 34ms (Views: 0.6ms | ActiveRecord: 0.0ms | Elasticsearch: 0.0ms)
Push operation
Started GET "xxxx.git/info/refs?service=git-receive-pack" for 192.168.xx.xx at 2017-04-24 19:02:34 +0800
Processing by Projects::GitHttpController#info_refs as */*
Parameters: {"service"=>"git-receive-pack", "namespace_id"=>"xxxx", "project_id"=>"xxxx.git"}
Completed 200 OK in 34ms (Views: 0.6ms | ActiveRecord: 0.0ms | Elasticsearch: 0.0ms)
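From the records above you can extract the client IP, the time, the operation type, and the project name, but not the username. (This is where I hit the first pitfall: I had assumed namespace_id was the username, but it is actually the name of the group or of the member who created the project. Whichever user performs the operation, the same namespace_id is shown, so the user is not recorded here.)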
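As a rough illustration of what production.log does give you, the sketch below pulls the client IP, time, and operation type out of a "Started" line. The regular expression is inferred only from the sample lines above, and the function name parse_started_line is made up for this example; other GitLab versions may format the line differently.

```python
import re

# Pattern inferred from the sample "Started ..." lines above (an assumption,
# not a documented format).
STARTED_RE = re.compile(
    r'Started (?P<method>\w+) "(?P<url>[^"]+)" for (?P<ip>\S+) at (?P<time>.+)$'
)


def parse_started_line(line):
    """Return method, path, service, client IP and time, or None if no match."""
    match = STARTED_RE.match(line.strip())
    if match is None:
        return None
    # Split the URL into the path and the query string.
    path, _, query = match.group("url").partition("?")
    service = None
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        if key == "service":
            service = value
    return {
        "method": match.group("method"),
        "path": path,
        "service": service,  # git-upload-pack (pull/clone) or git-receive-pack (push)
        "ip": match.group("ip"),
        "time": match.group("time"),
    }


sample = ('Started GET "xxxx.git/info/refs?service=git-upload-pack" '
          'for 192.168.xx.xx at 2017-04-24 19:00:34 +0800')
print(parse_started_line(sample))
```

Note that nothing in this line identifies the GitLab user.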
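It took me quite a while to realize that production.log does not record the GitLab username. Without knowing who performed an operation, the log is useless for auditing, so I had to look for another way.
Luckily, I later found another log that looks a lot like production.log, in the same directory: /var/log/gitlab/gitlab-rails/production_json.log. It contains one JSON record per request. I repeated the push test and found that it does record user information!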
{"method":"GET","path":"/test_user/test_project.git/info/refs","format":"*/*","controller":"Projects::GitHttpController","action":"info_refs","status":200,"duration":268.22,"view":0.48,"db":14.41,"time":"2019-06-27T10:59:56.324Z","params":[{"key":"service","value":"git-receive-pack"},{"key":"namespace_id","value":"test_user"},{"key":"project_id","value":"test_project.git"}],"remote_ip":"192.168.XX.XX","user_id":3,"username":"test_user","ua":"git/2.21.0.windows.1","queue_duration":null,"correlation_id":"b02c02f9-0167-49bf-965f-e4cc86d6751f"}
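The useful fields in this record are:
time: when the request was made (in UTC, note the trailing Z)
remote_ip: the client IP
username: the GitLab user
service (in params): the operation type, git-upload-pack for a pull or clone and git-receive-pack for a push
namespace_id and project_id (in params): the repository path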
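A minimal sketch of extracting these fields with the standard json module follows. The sample string is a trimmed copy of the record above; the function name parse_rails_record and the fixed UTC+8 offset are assumptions made for this example.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: the server's local time zone is UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))


def parse_rails_record(line):
    """Extract the audit fields from one production_json.log line, or None."""
    record = json.loads(line)
    if record.get("action") != "info_refs" or record.get("status") != 200:
        return None
    # Turn the list of {"key": ..., "value": ...} pairs into a dict.
    params = {p["key"]: p["value"] for p in record.get("params", [])}
    project_id = params.get("project_id", "")
    if project_id.endswith(".git"):
        project_id = project_id[:-len(".git")]
    # The "time" field is UTC (trailing Z); convert it to local time.
    utc_time = datetime.strptime(
        record["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "user": record.get("username"),
        "ip": record.get("remote_ip"),
        "action": params.get("service"),
        "project": "%s/%s" % (params.get("namespace_id"), project_id),
        "time": utc_time.astimezone(LOCAL_TZ).strftime("%Y-%m-%d %H:%M:%S"),
    }


sample = (
    '{"method":"GET","path":"/test_user/test_project.git/info/refs",'
    '"action":"info_refs","status":200,"time":"2019-06-27T10:59:56.324Z",'
    '"params":[{"key":"service","value":"git-receive-pack"},'
    '{"key":"namespace_id","value":"test_user"},'
    '{"key":"project_id","value":"test_project.git"}],'
    '"remote_ip":"192.168.XX.XX","user_id":3,"username":"test_user"}'
)
print(parse_rails_record(sample))
```

Looking parameters up by key rather than by list position keeps the code working if the order of the params entries ever changes.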
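Then came the second pitfall.
My git client had always talked to the GitLab remote over HTTP. (Git offers two transport protocols here: SSH and HTTP.)
Because HTTP asks for a username and password every time, I switched to SSH and repeated the sync and push operations, only to find that nothing new showed up in production_json.log.
It turns out that when SSH is used, the records are not saved in production_json.log but in gitlab-shell.log:
(the log directory is /var/log/gitlab/gitlab-shell)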
time="2019-07-02T11:17:48+08:00" level=info msg="executing git command" command="gitaly-receive-pack unix:/var/opt/gitlab/gitaly/gitaly.socket {\"repository\":{\"storage_name\":\"default\",\"relative_path\":\"test_user/test_project.git\",\"git_object_directory\":\"\",\"git_alternate_object_directories\":[],\"gl_repository\":\"project-5\",\"gl_project_path\":\"test_user/test_project\"},\"gl_repository\":\"project-5\",\"gl_project_path\":\"test_user/test_project\",\"gl_id\":\"key-3\",\"gl_username\":\"test_user\",\"git_config_options\":[],\"git_protocol\":null}" pid=23657 user="user with id key-3"
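The useful fields in this record are:
time: when the command ran (already in local time, +08:00)
command: gitaly-receive-pack for a push, gitaly-upload-pack for a pull or clone
relative_path: the repository path
gl_username: the GitLab user
Compared with the HTTP record, the client IP is missing, but that is a minor loss for this purpose.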
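One possible way to read such a line is to pull out the quoted command value, undo the escaping, and parse the embedded JSON instead of matching each field with its own regular expression. The sketch below assumes the line layout shown in the sample above; the function name parse_shell_line is hypothetical, and the unescaping only handles escaped double quotes, which is a simplification.

```python
import json
import re
from datetime import datetime

# Matches command="..." including escaped quotes inside the value.
COMMAND_RE = re.compile(r'command="((?:[^"\\]|\\.)*)"')
TIME_RE = re.compile(r'time="([^"]+)"')


def parse_shell_line(line):
    """Extract the audit fields from one gitlab-shell.log line, or None."""
    command_match = COMMAND_RE.search(line)
    time_match = TIME_RE.search(line)
    if command_match is None or time_match is None:
        return None
    # Simplified unescaping: only \" is turned back into ".
    command = command_match.group(1).replace('\\"', '"')
    name = command.split(" ", 1)[0]
    if name not in ("gitaly-upload-pack", "gitaly-receive-pack"):
        return None
    brace = command.find("{")
    if brace == -1:
        return None
    payload = json.loads(command[brace:])
    when = datetime.fromisoformat(time_match.group(1))
    return {
        "user": payload.get("gl_username"),
        "path": payload.get("repository", {}).get("relative_path"),
        "action": name.replace("gitaly-", "git-"),
        "time": when.strftime("%Y-%m-%d %H:%M:%S"),
    }


# Shortened version of the sample line above.
sample = (
    r'time="2019-07-02T11:17:48+08:00" level=info msg="executing git command" '
    r'command="gitaly-receive-pack unix:/var/opt/gitlab/gitaly/gitaly.socket '
    r'{\"repository\":{\"relative_path\":\"test_user/test_project.git\"},'
    r'\"gl_username\":\"test_user\",\"git_protocol\":null}" '
    r'pid=23657 user="user with id key-3"'
)
print(parse_shell_line(sample))
```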
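To summarize:
Operations over HTTP are recorded in /var/log/gitlab/gitlab-rails/production_json.log, including the username and the client IP.
Operations over SSH are recorded in /var/log/gitlab/gitlab-shell/gitlab-shell.log, including the username but not the client IP.
(PS: downloading source code directly from the GitLab web page is also recorded in production_json.log, with the repository and branch, username, IP, and so on. I won't go into detail here.)
Now that we have found the native logs, we can use Python to read them, filter out the key information, and store it in a database.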
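First, I noticed that at around 1 a.m. every day, GitLab compresses the previous day's log file into a gz archive.
The previous day's file is log.1.gz, the one from two days ago is log.2.gz, and so on.
So we can write a Python script that reads the log.1.gz files and run it every day through a Linux crontab job, so that no day's records are missed.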
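Reading a rotated archive does not require unpacking it first. The sketch below uses gzip.open in text mode to iterate over the lines directly; the helper name iter_gz_lines and the temporary demo file are made up for illustration, and a real run would point at one of the .log.1.gz files instead.

```python
import gzip
import os
import tempfile


def iter_gz_lines(path):
    """Yield the lines of a gzip-compressed text file without extracting it."""
    # errors="replace" keeps one badly encoded line from aborting the whole run.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Demo with a throwaway archive standing in for a rotated log file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.log.1.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("first line\nsecond line\n")

for text in iter_gz_lines(demo_path):
    print(text)
```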
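How the Python script is designed:
For HTTP: read production_json.log.1.gz, keep the info_refs records whose status is 200, and extract the time, client IP, username, operation type, and repository path.
For SSH: read gitlab-shell.log.1.gz, keep the lines containing gitaly-upload-pack or gitaly-receive-pack, and extract the time, username, operation type, and repository path.
The code: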
# read_log.py
# author: ChenYongHang
# date: 2019-07-10
import re
import gzip
import json
import datetime

import mysql  # the author's own MySQL wrapper class (not a public package)

git_url = 'http://10.1.4.243'
git_token = 'DhAzm5niQksAL_X4kgAj'
RAIL_PATH = "/var/log/gitlab/gitlab-rails/production_json.log.1.gz"
SHELL_PATH = "/var/log/gitlab/gitlab-shell/gitlab-shell.log.1.gz"


# Read the records of operations made over HTTP
def read_rail():
    conn = mysql.DateBase()
    # Text mode decompresses on the fly, so no extracted file is written to disk
    with gzip.open(RAIL_PATH, 'rt', encoding='utf-8') as f:
        for line in f:
            if "info_refs" not in line:
                continue
            try:
                record = json.loads(line)  # parse as JSON; never eval() log content
            except ValueError:
                continue  # skip malformed lines
            if record.get('status') != 200:  # only 200 means the operation succeeded
                continue
            # Look the parameters up by key instead of by list position
            params = {p['key']: p['value'] for p in record['params']}
            # "time" is in UTC; shift it by 8 hours to get local time
            temp_time = record['time'][0:19].replace('T', ' ')
            time = (datetime.datetime.strptime(temp_time, "%Y-%m-%d %H:%M:%S")
                    + datetime.timedelta(hours=8)).strftime("%Y-%m-%d %H:%M:%S")
            ip = record['remote_ip']
            user = record['username']
            action = params['service']
            project = params['project_id'].replace('.git', '')
            path = "%s/%s" % (params['namespace_id'], params['project_id'])
            # Values are interpolated into the SQL text, which breaks if one contains
            # a double quote; prefer placeholders if the wrapper supports them
            sql = 'insert into gitlab_operations(project_name,path, action, method, author, ip, time) values("%s","%s","%s","%s","%s","%s","%s");' % (
                project, path, action, "git-http", user, ip, time)
            conn.execute(sql)  # run the insert statement against MySQL
    conn.close()  # close the database connection


# Read the records of operations made over SSH
def read_shell():
    conn = mysql.DateBase()
    with gzip.open(SHELL_PATH, 'rt', encoding='utf-8') as f:
        for line in f:
            if "gitaly-receive-pack" in line:
                action = "git-receive-pack"
            elif "gitaly-upload-pack" in line:
                action = "git-upload-pack"
            else:
                continue  # not a push/pull record; go on to the next line
            date_match = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", line)
            line = line.replace("\\", "")  # strip the escaping around the embedded JSON
            path_match = re.search(r'"relative_path":"(.*?)"', line)
            user_match = re.search(r'"gl_username":"(.*?)"', line)
            if not (date_match and path_match and user_match):
                continue  # skip lines that lack any of the expected fields
            date = date_match.group(0).replace('T', ' ')
            path = path_match.group(1)
            # Take the last path segment so projects in subgroups also work
            project = path.split('/')[-1].replace(".git", "")
            user = user_match.group(1)
            sql = 'insert into gitlab_operations(project_name,path, action, method, author, time) values("%s","%s","%s","%s","%s","%s");' % (
                project, path, action, "git-ssh", user, date)
            conn.execute(sql)
    conn.close()


if __name__ == '__main__':
    read_shell()
    read_rail()
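Because usernames and paths come straight from log files, building the insert statement by string formatting can break (or be abused) when a value contains a quote. A safer pattern is to pass the values as query parameters. The sketch below demonstrates the idea with the standard sqlite3 module as a stand-in, since the MySQL wrapper used above is private; the column types and the function name save_operation are assumptions, and MySQL drivers such as PyMySQL use %s placeholders instead of ?.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gitlab_operations (
        project_name TEXT, path TEXT, action TEXT,
        method TEXT, author TEXT, ip TEXT, time TEXT
    )
    """
)


def save_operation(conn, op):
    """Insert one operation record using placeholders instead of string formatting."""
    conn.execute(
        "INSERT INTO gitlab_operations "
        "(project_name, path, action, method, author, ip, time) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            op["project_name"], op["path"], op["action"],
            op["method"], op["author"], op.get("ip"), op["time"],
        ),
    )
    conn.commit()


# The odd author name shows that quotes in values are stored safely.
save_operation(conn, {
    "project_name": "test_project",
    "path": "test_user/test_project.git",
    "action": "git-receive-pack",
    "method": "git-ssh",
    "author": 'test"user',
    "time": "2019-07-02 11:17:48",
})
print(conn.execute("SELECT author, ip FROM gitlab_operations").fetchall())
```

SSH records have no client IP, so op.get("ip") stores NULL for them.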
Set up the scheduled run on the Linux server:
# crontab -e
# Add one line to run the script at 6 a.m. every day
00 06 * * * python3 /xxx/read_log.py