[00:42:09] (CR) Alex Monk: "Ottomata: I've looked into it. The reason we can't use monitoring::graphite_threshold is that it relies upon monitoring::service, which re" [puppet] - https://gerrit.wikimedia.org/r/311747 (owner: Alex Monk)
[00:50:30] (CR) Alex Monk: "Actually, would it just silently not attempt to share the resource and be okay with that? If so we might be able to use it." [puppet] - https://gerrit.wikimedia.org/r/311747 (owner: Alex Monk)
[01:44:36] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:56:57] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:09:29] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:18:12] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:08] (PS1) Matanya: standard packages: re-add intel-microcode [puppet] - https://gerrit.wikimedia.org/r/312714
[02:21:49] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:33:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:38:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:43:10] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[03:07:20] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38999 MB (3% inode=99%)
[03:22:06] RECOVERY - Disk space on maps-test2001 is OK: DISK OK
[03:38:29] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:43:30] PROBLEM - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[03:44:12] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error
[03:47:33] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[03:52:34] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[04:00:43] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdl1]
[04:01:03] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[04:01:45] RECOVERY - Disk space on ms-be1002 is OK: DISK OK
[06:26:12] RECOVERY - Disk space on scb1002 is OK: DISK OK
[06:51:06] (CR) Hashar: [C: -1] "So I knew that something was eventually going to wrong, and on benefit of doubt CR+1 anyway." [puppet] - https://gerrit.wikimedia.org/r/312705 (owner: EBernhardson)
[06:57:18] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 531 bytes in 0.017 second response time
[07:07:48] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:11:51] (CR) Jcrespo: [C: 1] "That was my fault when merging 3 different patches into 1. Maybe this can be cherry-picked in beta, not sure we can merge it these week in" [puppet] - https://gerrit.wikimedia.org/r/312652 (owner: Hashar)
[07:20:07] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.062 second response time
[07:33:14] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[08:42:18] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:15] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[10:38:13] PROBLEM - Disk space on scb1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%)
[11:00:10] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1202
[11:05:13] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 858982 Threads: 1 Questions: 187272113 Slow queries: 6682 Opens: 6879 Flush tables: 1 Open tables: 571 Queries per second avg: 218.016 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:21:02] (PS1) Urbanecm: Change protection level autoreview in arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/312734 (https://phabricator.wikimedia.org/T146575)
[11:22:00] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665350 (abian)
[11:26:27] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665357 (abian)
[11:31:34] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665360 (abian)
[11:51:33] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665376 (Esc3300)
[11:52:46] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (Esc3300) Maybe related: T146529
[12:07:49] (CR) Luke081515: [C: 1] Change protection level autoreview in arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/312734 (https://phabricator.wikimedia.org/T146575) (owner: Urbanecm)
[12:31:21] (PS1) Dereckson: Disable Upload Wizard blacklist issues on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/312735 (https://phabricator.wikimedia.org/T146417)
[13:14:48] mhhh scb spammed with celery logs 🤔
[13:25:37] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:45:21] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665415 (Sjoerddebruin) p:Triage>High Lots of reports coming in.
[13:48:34] looks like it ran out of disk space for a couple of hours until logrotate kicked in, I'm going to stop puppet and redirect celery logs to /srv/log/celery /cc akosiaris halfak Amir1
[13:49:12] godog: Thanks. I'll take care of it
[13:49:48] We probably need to shut up celery
[13:50:27] let me check the puppet config
[13:50:39] Amir1: yeah it is logging like crazy, and two almost identical lines
[13:50:56] It's probably in debug mode
[13:51:01] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[13:52:08] celery? all lines have INFO which I'm assuming is the level
[13:52:20] The log level is ERROR https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/worker.pp#L2
[13:52:26] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:52:26] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:52:57] it's probably overwritten somewhere in hiera, or the web daemon is spitting out a lot
[13:53:15] let me check logs directly
[13:54:36] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:55:32] looks like the loglevel is forced INFO in ores_celery.py
[13:55:32] application.worker_main(argv=["celery_worker", "--loglevel=INFO"])
[13:57:14] godog: it shouldn't happen because that part only applies when being run directly, not as a daemon
[13:57:42] godog: I can't check logs, it says needs root and I don't have
[13:58:15] Can you send me several lines in private? (you can delete IP, etc. I signed the NDA though)
[13:59:24] sure
[14:00:33] but yeah you are right that shouldn't apply
[14:18:52] Amir1: ok I'll let you debug why loglevel isn't ERROR as it should be but INFO instead, though I'm going ahead and moving the logging to /srv where there's space, or scb will run out of space
[14:22:40] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665507 (Esc3300) Which ones show it? - https://grafana.wikimedia.org/dashboard/db/wikidata-query-servi...
[14:24:16] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (Multichill) I guess Varnish 5xx rate at https://grafana.wikimedia.org/dashboard/db/wikidata-query...
[14:26:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[14:27:03] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665511 (Esc3300) Yeah, looks like it: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?p
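A minimal sketch of the celery log-level question discussed above (13:52–13:57): only the worker_main() call is quoted from ores_celery.py; the module layout, the app name, and the config-driven setLevel() call are assumptions for illustration. It shows why a --loglevel forced in the __main__ path should only take effect when the file is run directly, not under the daemon.

    # Illustrative sketch; only the worker_main() line is quoted from the log above.
    import logging
    from celery import Celery

    application = Celery("ores_celery")  # hypothetical app name

    # In production the level is meant to come from configuration
    # (ERROR, per modules/ores/manifests/worker.pp); hard-coded here as a stand-in.
    logging.getLogger().setLevel(logging.ERROR)

    if __name__ == "__main__":
        # Reached only when running `python ores_celery.py` directly.
        # A daemonized worker started by the service manager never executes this,
        # which is why the forced INFO level here was not expected to matter.
        application.worker_main(argv=["celery_worker", "--loglevel=INFO"])

If the daemon were in fact launched through this __main__ path, the --loglevel=INFO argument would override the configured ERROR level, which would match the INFO-only lines observed on scb100*.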
[14:28:56] !log temporarily redirect ores-celery-worker logs to /srv/log/celery/syslog.log and remove old daemon.log.1 to avoid scb100* filling up the / filesystem
[14:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:29:24] RECOVERY - Disk space on scb1001 is OK: DISK OK
[14:30:27] !log putting wdqs1002 in maintenance mode, server looks unstable, investigating... - T146576
[14:30:28] T146576: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576
[14:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:31:24] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665514 (Esc3300)
[14:34:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:34:30] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (jcrespo) According to icinga/grafana, several servers have lag (including that one).
[14:37:58] PROBLEM - WDQS HTTP on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.007 second response time
[14:38:30] PROBLEM - WDQS SPARQL on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.007 second response time
[14:39:15] ^ I'm on the WDQS issue...
[14:39:29] !log restarting blazegraph on wdqs1002 - T146576
[14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:39:34] T146576: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576
[14:40:00] Operations, ORES, User-Ladsgroup: ores celery worker spamming scb logs - https://phabricator.wikimedia.org/T146583#2665522 (fgiunchedi)
[14:40:02] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665536 (Gehel) logs on wdqs1002 show error opening sockets: ``` Sep 25 14:35:08 wdqs1002 bash[10426]: 20...
[14:48:26] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (EricJ) I was getting the same issue intermittently when hitting the sparql endpoint programmatica...
[14:53:20] RECOVERY - WDQS HTTP on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9779 bytes in 0.007 second response time
[14:53:51] RECOVERY - WDQS SPARQL on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9779 bytes in 0.006 second response time
[14:54:29] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665541 (Gehel) Restarting blazegraph on wdqs1002 seems to have solved the issue. I'll dig into the logs t
[15:03:30] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665542 (Gehel) It seems that wdqs1002 started showing replication lag issues around 16:00 UTC on Saturday...
[15:04:45] godog: I was afk, I hope it doesn't happen again
[15:04:55] I'll debug and fix it
[15:05:10] probably a deployment
[15:06:06] Operations, ORES, User-Ladsgroup: ores celery worker spamming scb logs - https://phabricator.wikimedia.org/T146583#2665546 (Ladsgroup)
[15:20:14] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient
[15:22:23] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:22:37] RECOVERY - Disk space on stat1002 is OK: DISK OK
[15:24:01] !log executed https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Fixing_HDFS_mount_at_.2Fmnt.2Fhdfs on stat1002 (fusermount didn't succeed to umount though)
[15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:24:21] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:26:56] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[15:27:04] thanks stat1002
[15:28:00] Is someone from Ops around just to sit while I deploy for https://phabricator.wikimedia.org/T146581 ?
[15:28:27] Caused downtime today
[15:34:21] _joe_: around for ^
[15:34:24] ?
[16:03:53] I'm still waiting for someone :D
[16:04:13] Amir1: ops is out all week
[16:04:33] they are offsite this week
[16:04:36] obviously, but there should probably be someone around in case of emergency
[16:05:07] Oh
[16:05:10] (+ it's the weekend)
[16:06:35] yep
[16:09:14] Amir1: i'd try pinging the person on clinic duty (see channel topic)
[16:09:52] MatmaRex: I tried but his is not in the channel (probably due to time zone differences)
[16:10:15] s/his/he
[16:10:21] uh, he's not online at all. whoops…
[16:10:43] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:12:00] the funny thing is I can do it, but someone from ops should be around in case something goes really really wrong :D
[16:12:47] (Also deployed in labs, works just fine)
[16:13:11] yeah, the clinic duty person is not online 24/7
[16:13:37] Amir1, what sort of change to the currently-deployed version is involved?
[16:14:06] Krenair: I want to only deploy this: https://gerrit.wikimedia.org/r/#/c/312740/
[16:14:58] I agree, but what if a DDoS attacker chooses this time?
[16:15:36] if your concern was security-related maybe we shouldn't be having this discussion here
[16:15:46] maybe I'm too pessimistic
[16:16:05] it looks simple, you've tested it in labs
[16:18:06] how many nodes run this service?
[16:18:51] two in eqiad, scb100[12] (+ two in codfw)
[16:19:04] Okay, I'll deploy it
[16:19:06] are you capable of depooling each and applying the change there, then repooling?
[16:19:16] We have canary nodes
[16:19:36] I think scb1002 is the canary one
[16:25:58] !log deploying 9cc2009 on ORES nodes (T146581)
[16:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:26:13] T146581: celery log level is INFO causing disruption on ORES service - https://phabricator.wikimedia.org/T146581
[16:28:29] deployed in the canary node (scb2001)
[16:30:46] deploying to all nodes
[16:35:42] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[16:43:03] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:45:05] Okay, everything is fine for now
[16:50:58] Time to enjoy what's left of the weekend
[16:51:00] o/
[17:08:18] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:24:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:27:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:32:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[17:40:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:42:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[17:56:18] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:21:13] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[18:31:32] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:40:16] (PS1) Hashar: puppetmaster: polish git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/312747
[18:40:18] (PS1) Hashar: puppetmaster: git-sync-upstream early abort [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946)
[18:42:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4043897 keys - replication_delay is 641
[18:47:16] (PS2) Hashar: puppetmaster: git-sync-upstream early abort [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946)
[18:51:25] (CR) Hashar: "cherry picked on beta puppet master" [puppet] - https://gerrit.wikimedia.org/r/312747 (owner: Hashar)
[18:51:38] (CR) Hashar: "cherry picked on beta puppet master" [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: Hashar)
[18:56:30] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:10:00] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4008356 keys - replication_delay is 0
[20:29:57] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:43:07] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:55:08] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:08:15] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[21:19:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[21:22:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[22:12:49] (CR) Alex Monk: ""always cause rebase to rebase"? always causes git to rebase?" [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: Hashar)
[22:24:07] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]