1. Basic Environment
1.1 Data Flow Diagram
1.1 Data collection layer: this layer mainly collects logs and can also do some simple processing and filtering. There are usually two approaches:
(1) A log-collection agent on the client, such as filebeat, logstash, Flume, Logagent, rsyslog, or fluentd; (2) writing logs from the application directly into a message queue or the ES cluster.
1.2 Message queue: because logstash in the processing layer cannot persist data, a message queue layer (usually Kafka or Redis) is added as a buffer and persistent store, so that logs are not lost on their way into the ES cluster when something goes wrong.
1.3 Data processing: before logs are written into the ES cluster, some processing is usually needed, such as dropping invalid log lines, normalizing formats, and routing different log types to different indices. Because this step can be time- and resource-consuming, the official recommendation is to collect logs with the lightweight filebeat and do the processing on the server side. Logstash is very powerful, but its long-criticized weakness is that parsing significantly slows down its log consumption rate. Newer versions of ES address this by introducing the ingest node, which can also be used to process data.
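As a rough illustration of the ingest-node approach, the sketch below defines an ingest pipeline; the pipeline name nginx-access and the grok pattern are assumptions, not part of this setup. Documents indexed with ?pipeline=nginx-access are then parsed on the ES side:

```bash
# hypothetical pipeline; parses a raw access-log line held in the "message" field
curl -XPUT 'http://localhost:9200/_ingest/pipeline/nginx-access' -H 'Content-Type: application/json' -d '
{
  "description" : "parse nginx access logs at ingest time",
  "processors" : [
    { "grok" : { "field" : "message", "patterns" : ["%{COMBINEDAPACHELOG}"] } }
  ]
}'
```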
1.4 Data storage: the ES cluster stores data in a distributed fashion, uses Lucene as its underlying search engine, and exposes a rich query API on top of it for quickly retrieving the data you need.
1.5 Data presentation: with a little configuration, Kibana or Grafana can visualize the data stored in ES; you can also build your own visualizations by calling the API.
1.2 Environment and Components
CentOS 6.6 64bit(192.168.31.33,192.168.31.30,192.168.31.31)
jdk1.8.0_51
elasticsearch-5.4.3
logstash-2.4.1
kibana-5.4.3
kafka_2.12-2.4.0
filebeat-5.6.9
2. Server-Side Component Deployment
2.1 JDK Deployment
Download the Linux JDK, extract it to the target directory (usually under /usr/local/), then configure the environment variables in /etc/profile:
```bash
JAVA_HOME=/usr/local/jdk1.8.0_51
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export PATH
export CLASSPATH
```
Then run a command so the environment variables take effect.
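A minimal sketch, assuming the variables above were appended to /etc/profile:

```bash
source /etc/profile
java -version   # verify that the JDK is picked up
```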
2.2 Kafka Deployment

```bash
tar -zxf kafka_2.12-2.4.0.tgz
mv kafka_2.12-2.4.0 /usr/local/kafka
mkdir /data/zookeeper
chown -R osadmin:osadmin /data/zookeeper
```

ZooKeeper configuration, /usr/local/kafka/config/zookeeper.properties (identical on all three nodes):

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
server.0=192.168.31.33:2888:3888
server.1=192.168.31.30:2888:3888
server.2=192.168.31.31:2888:3888
```

Write each node's id into the ZooKeeper data directory, then start ZooKeeper:

```bash
echo 0 > /data/zookeeper/myid   # on 192.168.31.33
echo 1 > /data/zookeeper/myid   # on 192.168.31.30
echo 2 > /data/zookeeper/myid   # on 192.168.31.31
nohup /usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties &> /usr/local/kafka/logs/zookeeper.log &
```

Kafka broker configuration, /usr/local/kafka/config/server.properties. Only broker.id, host.name and advertised.host.name differ per node (0/192.168.31.33, 1/192.168.31.30, 2/192.168.31.31):

```
broker.id=0
port=9092
advertised.host.name=192.168.31.33
host.name=192.168.31.33
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/usr/local/kafka/logs
num.partitions=10
num.recovery.threads.per.data.dir=4
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=192.168.31.33:2181,192.168.31.30:2181,192.168.31.31:2181
zookeeper.connection.timeout.ms=6000
delete.topic.enable=true
```

Start Kafka and verify it with the usual topic and consumer commands:

```bash
nohup /usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties &> /usr/local/kafka/logs/kafka.log &

# topic management
/usr/local/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
/usr/local/kafka/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
/usr/local/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181
/usr/local/kafka/bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic test

# consumer groups
/usr/local/kafka/bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list
/usr/local/kafka/bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group logstash

# console producer / consumer
/usr/local/kafka/bin/kafka-console-producer.sh --broker-list 192.168.31.33:9092 --topic test
/usr/local/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
```
2.3 Logstash Deployment

```bash
tar -zxf logstash-2.3.4.tar.gz
mv logstash-2.3.4 /usr/local/logstash
chown -R osadmin:osadmin /usr/local/logstash/
mkdir /usr/local/logstash/config
```
Grok patterns can be tested with an online grok debugger. Server-side configuration template, /usr/local/logstash/config/logstash.conf:
```
input {
    kafka {
        zk_connect => "192.168.31.33:2181,192.168.31.30:2181,192.168.31.31:2181"
        topic_id => "nginx"
        type => "json"
        reset_beginning => false
        consumer_threads => 2
        decorate_events => true
    }
}
filter {
    json {
        source => "message"
        remove_field => ["message"]
    }
    # when http_x_forwarded_for is empty ("-"), fall back to remote_addr
    if [http_x_forwarded_for] == "-" {
        mutate {
            update => { "http_x_forwarded_for" => "%{remote_addr}" }
        }
    }
    geoip {
        source => "http_x_forwarded_for"
        target => "geoip"
        database => "/usr/local/logstash/maps/GeoLiteCity.dat"
    }
}
output {
    elasticsearch {
        hosts => ["192.168.31.33:9200","192.168.31.30:9200","192.168.31.31:9200"]
        index => "%{[fields][appid]}-%{+YYYY.MM.dd}"
    }
}
```

Test the configuration, then start logstash:

```bash
/usr/local/logstash/bin/logstash -f /usr/local/logstash/config/logstash-nginx.conf --configtest --verbose
/usr/local/logstash/bin/logstash -f /usr/local/logstash/config/logstash-nginx.conf
```
2.4 Elasticsearch Deployment
Installation:
```bash
tar -zxf elasticsearch-5.4.3.tar.gz
mv elasticsearch-5.4.3 /usr/local/elasticsearch
mkdir /usr/local/elasticsearch/{logs,plugins} /data
chown -R osadmin:osadmin /usr/local/elasticsearch/
```
Configuration:
System settings (on every node):

```bash
ulimit -n 655350
vim /etc/security/limits.conf
* soft nofile 655350
* hard nofile 655350
vim /etc/security/limits.d/90-nproc.conf
vim /etc/sysctl.conf
vm.max_map_count=262144
sysctl -p
# disable swap
vim /etc/fstab
swapoff -a
```

Elasticsearch configuration, /usr/local/elasticsearch/config/elasticsearch.yml. Only node.name differs per node (master_10.201.3.33 / master_10.201.3.30 / master_10.201.3.31):

```
path.data: /data
path.logs: /usr/local/elasticsearch/logs
path.plugins: /usr/local/elasticsearch/plugins
network.host: 0.0.0.0
http.port: 9200
bootstrap.mlockall: true
indices.fielddata.cache.size: 75%
indices.breaker.fielddata.limit: 85%
threadpool.search.queue_size: 10000
cluster.name: elk-cluster
node.name: "master_10.201.3.33"
node.master: true
node.data: true
discovery.zen.ping.multicast.enabled: true
discovery.zen.ping.unicast.hosts: ["10.201.1.33", "10.201.3.30", "10.201.3.31"]
```
What each setting means:
- path.data: where index data is stored
- path.logs: where log files are stored
- path.plugins: where plugins are installed
- network.host: listen address
- http.port: listen port
- bootstrap.mlockall: lock memory so that ES does not use swap
- indices.fielddata.cache.size: maximum memory a node spends on fielddata (old entries are evicted once the threshold is reached)
- indices.breaker.fielddata.limit: fielddata circuit-breaker limit as a share of the JVM heap (make sure indices.breaker.fielddata.limit is larger than indices.fielddata.cache.size)
- threadpool.search.queue_size: search queue size (increase it when Kibana issues many queries)
- cluster.name: cluster name (nodes with the same cluster.name automatically form one cluster)
- node.name: the node's name within the cluster
- node.master: allow the node to be elected master
- node.data: allow the node to store data
- discovery.zen.ping.multicast.enabled: allow node discovery via multicast
- discovery.zen.ping.unicast.hosts: initial list of cluster nodes (speeds up discovery)
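As a quick sanity check on two of these settings, a sketch (run against any node):

```bash
# confirm that memory locking (bootstrap.mlockall) actually succeeded on every node
curl 'http://localhost:9200/_nodes?filter_path=**.mlockall&pretty'
# watch per-node fielddata memory usage
curl 'http://localhost:9200/_cat/fielddata?v'
```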
Key ES terminology:
- Index: an index, roughly analogous to a database (db)
- Type: different types can be defined within an index to store differently structured data
- Document: the main unit of data for indexing and searching
- Field: an individual field within a document
- Term: a term, the unit of search, representing a word in the text
- Token: an occurrence of a term in a field, including the term text, start/end offsets, type, and so on
- Segment: corresponds to an index in Lucene; the smallest search unit in ES
- Template: an index template, used to define field formats and preprocess data
- Internally, Lucene uses an inverted index data structure that maps terms to documents
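To make the terms concrete, a small sketch (the index, type, and field names here are hypothetical): indexing one document creates the index and type on the fly, and a search then matches terms built from its fields:

```bash
# index a document with id 1 into index "nginx-2017.08.01", type "logs"
curl -XPUT 'http://localhost:9200/nginx-2017.08.01/logs/1' -d '{"remote_addr" : "1.2.3.4", "status" : 200}'
# search on the "status" field; the matching document is returned
curl 'http://localhost:9200/nginx-2017.08.01/_search?q=status:200&pretty'
```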
Elasticsearch memory settings
```bash
# /usr/local/elasticsearch/config/jvm.options
-Xms31g
-Xmx31g
```
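Heap sizes just under 32 GB keep compressed object pointers enabled, which is presumably why 31g is used here. Heap pressure can be watched with _cat/nodes, for example:

```bash
curl 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'
```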
Elasticsearch index management tool (curator) installation
```bash
pip install elasticsearch-curator==3.5.1

# close indices older than 30 days
curator --timeout 36000 --host localhost close indices --older-than 30 --time-unit days --timestring '%Y.%m.%d' --prefix sd-3-centos33-nginx
# delete indices matching a prefix
curator --timeout 36000 --host localhost delete indices --time-unit days --timestring %Y.%m.%d --prefix sd-3-centos33-nginx-

# show indices older than 7 days
curator_cli show_indices --verbose --header --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":7}]'
# show ls-ssj indices younger than 7 days
curator_cli show_indices --verbose --filter_list '[{"filtertype":"age","source":"creation_date","direction":"younger","unit":"days","unit_count":7},{"filtertype":"pattern","kind":"prefix","value":"ls-ssj"}]'
# show open ls-ssj indices
curator_cli show_indices --verbose --filter_list '[{"filtertype":"pattern","kind":"prefix","value":"ls-ssj"},{"filtertype":"opened","exclude":"False"}]'
# close open nginx-jr indices older than 7 days
curator_cli close --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":7},{"filtertype":"opened","exclude":"False"},{"filtertype":"pattern","kind":"prefix","value":"nginx-jr"}]'
# delete closed ls-ssj indices older than 30 days
curator_cli delete_indices --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":30},{"filtertype":"pattern","kind":"prefix","value":"ls-ssj"},{"filtertype":"opened","exclude":"True"}]'
# reopen nginx-jr indices for a specific day
curator_cli open --filter_list '[{"filtertype":"pattern","kind":"suffix","value":"2017.07.31"},{"filtertype":"pattern","kind":"prefix","value":"nginx-jr"}]'
```
Start:
```bash
/usr/local/elasticsearch/bin/elasticsearch -d
```
Common ES commands:
```bash
# cluster health, node stats, hot threads
curl -k -u user:password https://127.0.0.1:9200/_cluster/health?pretty
curl -k -u user:password https://127.0.0.1:9200/_nodes/stats?pretty
curl -k -u user:password http://127.0.0.1:9200/_nodes/hot_threads

# index templates: list and upload
curl -k -u user:password https://127.0.0.1:9200/_template?pretty
curl -XPUT -k -u user:password https://127.0.0.1:9200/_template/logstash -d @logstash.json

# create an index
curl -k -u user:password -XPUT https://localhost:9200/indexname -d '{"settings" : {"index" : {"number_of_shards" : 5, "number_of_replicas" : 1 }}}'

# reindex
curl -k -u user:password -XPOST 'https://localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "source": { "index": "twitter" },
  "dest": { "index": "new_twitter" }
}'

# close / open / delete an index
curl -k -u user:password -XPOST https://localhost:9200/indexname/_close
curl -k -u user:password -XPOST https://localhost:9200/indexname/_open
curl -k -u user:password -XDELETE https://localhost:9200/indexname

# clear an index's fielddata cache
curl -s -k -u user:password https://localhost:9200/indexname/_cache/clear?field_data=true

# raise max_result_window
curl -k -u user:password -XPUT https://localhost:9200/nginx-jr-*/_settings -d '{"index":{"max_result_window":"100000"}}'

# disable / enable shard allocation
curl -XPUT -u user:admin -k 'https://localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'
curl -XPUT -u user:admin -k 'https://localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'

# adjust or disable the refresh interval
curl -XPUT 127.0.0.1:9200/_settings -d '{"index" : {"refresh_interval" : "60s"}}'
curl -XPUT 127.0.0.1:9200/_settings -d '{"index" : {"refresh_interval" : "-1"}}'

# store throttle
curl -XPUT http://127.0.0.1:9200/_cluster/settings -d '{"persistent":{"indices.store.throttle.max_bytes_per_sec" : "80mb"}}'

# set the replica count of an index
curl -XPUT -u user:password -k https://localhost:9200/index_name/_settings -d '{"index" : {"number_of_replicas" : 0}}'

# recovery tuning
curl -XPUT -uadmin -k 'https://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "indices.recovery.max_bytes_per_sec": "500mb",
    "cluster.routing.allocation.node_initial_primaries_recoveries": 25,
    "cluster.routing.allocation.node_concurrent_recoveries": 10,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 10,
    "indices.recovery.concurrent_streams": 25
  }
}'
```

The _cat APIs:

```bash
curl -u user:password -k https://127.0.0.1:9200/_cat?help
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}

curl -u user:password -k https://127.0.0.1:9200/_cat/allocation?help
shards       | s              | number of shards on node
disk.indices | di,diskIndices | disk used by ES indices
disk.used    | du,diskUsed    | disk used (total, not just ES)
disk.avail   | da,diskAvail   | disk available
disk.total   | dt,diskTotal   | total capacity of all volumes
disk.percent | dp,diskPercent | percent disk used
host         | h              | host of node
ip           |                | ip of node
node         | n              | name of node

curl -u user:password -k https://127.0.0.1:9200/_cat/allocation
curl -u user:password -k https://127.0.0.1:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
curl -u user:password -k https://127.0.0.1:9200/_cat/allocation?h=ip,n,h,s
```

The processes can be started and controlled through supervisorctl, or managed through the web UI at http://10.201.3.33:9001:

```bash
pip install -U setuptools
pip install supervisor
echo_supervisord_conf > /etc/supervisord.conf
supervisord
supervisorctl [start|stop|restart|reread|update] program_name
```
3. Client-Side Component Deployment
3.1 Filebeat Deployment

```bash
yum install filebeat -y
vim /etc/filebeat/filebeat.yml
```

```yaml
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/nginx/access.log
  fields:
    appid: nginx
  tail_files: true

output.kafka:
  hosts: ["192.168.31.33:9092","192.168.31.30:9092","192.168.31.31:9092"]
  topic: 'nginx'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
```

```bash
/etc/init.d/filebeat start
```
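To verify that filebeat is actually shipping events into Kafka, the topic can be tailed with the console consumer; this is a sketch, and newer Kafka versions take --bootstrap-server rather than --zookeeper:

```bash
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server 192.168.31.33:9092 --topic nginx --from-beginning
```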
4. Data Presentation
4.1 Kibana Deployment

```bash
tar -zxf kibana-5.4.3-linux-x86_64.tar.gz
mv kibana-5.4.3-linux-x86_64 /usr/local/kibana
chown -R kibana:kibana /usr/local/kibana/
```

/usr/local/kibana/config/kibana.yml:

```
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"
```

Kibana is then reachable at http://ip:5601.
The nginx log dashboard looks like this:
5. Other Issues
5.1 Bottlenecks
When heavy log writes coincide with heavy queries that cannot be served from the cache, disk I/O pressure spikes; if this situation lasts for a while, large amounts of data can pile up in Kafka. Possible optimizations:
(1) Scale the ES cluster out or up. (2) Put the Kafka logs on a separate disk to spread the I/O load. (3) Tune the ES cluster, e.g. increase the index refresh interval, close indices periodically, force segment merges periodically, and so on (see the sketch below). (4) Add a cache layer to the storage devices with an open-source caching solution (flashcache, bcache, ...), which helps a lot during traffic bursts.
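A sketch of two of the ES-side tweaks mentioned in (3), using the settings API shown earlier (the index name is hypothetical):

```bash
# lengthen the refresh interval during write-heavy periods
curl -XPUT 127.0.0.1:9200/_settings -d '{"index" : {"refresh_interval" : "60s"}}'
# force-merge the segments of an index that is no longer being written to
curl -XPOST 'http://127.0.0.1:9200/nginx-2017.08.01/_forcemerge?max_num_segments=1'
```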
5.2 Common Failures and How to Handle Them
High JVM usage: once JVM heap usage exceeds 75%, old GC is triggered; if old GC takes too long, all requests to the cluster can end up blocked. The main ways to reduce JVM memory usage are:
(1) Close indices with a large document count. (2) Increase the index refresh interval. (3) Reduce the number of index replicas. (4) Restart the cluster (rolling restart of all node processes). A few of these, expressed as commands, are sketched below.
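A sketch of (1) and (3) using the commands from section 2.4 (the index name is hypothetical), plus a quick way to see heap pressure per node:

```bash
# per-node heap usage
curl -u user:password -k 'https://127.0.0.1:9200/_cat/nodes?v&h=name,heap.percent'
# close a large index
curl -u user:password -k -XPOST https://localhost:9200/indexname/_close
# drop the replicas of an index
curl -u user:password -k -XPUT https://localhost:9200/indexname/_settings -d '{"index" : {"number_of_replicas" : 0}}'
```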
Node kicked out of the cluster: when checking the cluster state you may find nodes that have been removed. This usually happens in two cases: the process has died, causing a heartbeat timeout, or the node is overloaded, causing a heartbeat timeout. Recovery procedure:
(1) Use the _cat/nodes request to find the current master node and, on the master, disable index initialization and rebalancing (shard allocation). Start or restart the ES process on the node that was kicked out, wait for it to rejoin the cluster, and check how many indices need to be re-initialized. Then, depending on the situation:
1) If data safety is the priority, re-enable index rebalancing and wait for the cluster to return to normal; the more data there is, the longer this takes.
2) If restoring the cluster state quickly is the priority, set the replica count of the indices that are currently yellow and have many documents to 0, re-enable allocation, wait for the shards in the initializing_shards state to finish, then disable allocation again, and re-enable it once the backlog in Kafka has been consumed. The relevant commands are sketched below.
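A sketch of the allocation toggling described above, reusing the cluster-settings commands from section 2.4:

```bash
# on the master: disable allocation before restarting the evicted node
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'
# watch the node list and recovery progress while it rejoins
curl 'http://localhost:9200/_cat/nodes?v'
curl 'http://localhost:9200/_cat/recovery?v'
# optionally set replicas of a large yellow index to 0 to speed things up
curl -XPUT 'http://localhost:9200/indexname/_settings' -d '{"index" : {"number_of_replicas" : 0}}'
# re-enable allocation once the node is back
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'
```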
Some indices stuck with unassigned shards: if the cluster health shows shards in the UNASSIGNED state, they have to be rerouted manually, otherwise the affected index cannot be read or written. The procedure:
(1) Find the indices and shards in the UNASSIGNED state via the _cat/shards API. (2) Set the index replica count to 0. (3) Reroute the shards. (4) Restore the replica count and wait for the cluster state to return to normal. (5) Or automate the recovery with the script below:
```bash
#!/bin/bash
# reroute all UNASSIGNED shards to a given node (replace "nodename" with the target node name)
for index in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort | uniq); do
    for shard in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | grep $index | awk '{print $2}' | sort | uniq); do
        echo $index $shard
        curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
            "commands" : [ {
                "allocate" : {
                    "index" : "'$index'",
                    "shard" : "'$shard'",
                    "node" : "nodename",
                    "allow_primary" : true
                }
            } ]
        }'
        sleep 5
    done
done
```
Periodic maintenance: to keep the cluster stable in the long run, it is best to run a few maintenance operations on a schedule:
(1) Periodically close indices. (2) Periodically clear index caches. (3) Periodically merge segments. (4) Periodically delete indices when capacity runs short.
All of the above can be done with curator, the officially recommended index management tool; it can be installed directly with pip, and requires Python 2.7 or later. A possible cron setup is sketched below.
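One way to schedule this, as a sketch, is a couple of /etc/crontab entries wrapping the curator_cli calls from section 2.4 (the nginx- prefix and the retention windows are assumptions):

```bash
# close indices older than 7 days at 01:00, delete indices older than 30 days at 02:00
0 1 * * * root curator_cli close --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":7},{"filtertype":"pattern","kind":"prefix","value":"nginx-"}]'
0 2 * * * root curator_cli delete_indices --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":30},{"filtertype":"pattern","kind":"prefix","value":"nginx-"}]'
```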