[00:16:27] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[00:23:47] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751220 (10hoo) >>! In T116487#1751166, @Krenair wrote: >>>! In T116487#1751145, @hoo wrote: >> While (probably) not a formal requirement, I think you should also get access to t...
[00:27:26] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751230 (10Krenair) >>! In T116487#1751220, @hoo wrote: >>>! In T116487#1751166, @Krenair wrote: >>>>! In T116487#1751145, @hoo wrote: >>> While (probably) not a formal requireme...
[00:28:53] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751233 (10JanZerebecki) For the shell group I count 2 wmf employees, 1 disabled account. None for the gerrit group. The mediawiki gerrit group includes the wmf ldap group, but t...
[00:33:09] ori: Grafana uses POST for larger queries to graphite, and for the labs graphite we use the grafana proxy.
[00:33:21] However POST is disabled, so the dashboard doesn't work at https://grafana.wikimedia.org/dashboard/db/labs-project-board
[00:33:35] because POST to /api/datasources/proxy/3/ is rejected
[00:34:47] Actually, I'll change the datasource back to not use the proxy
[00:35:02] I was hoping it would be less buggy when using a request directly to the internal graphite server but it's just as buggy
[00:35:03] https://grafana-admin.wikimedia.org/dashboard/db/labs-project-board
[00:35:19] Some panels are randomly missing there because graphite.wmflabs is responding with PNG instead of JSON sometimes
[00:35:25] it's not deterministic though
[00:35:46] so naturally grafana JS gets confused and can't read the response
[01:51:15] 7Puppet, 6Labs, 6Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1751263 (10zhuyifei1999) >>! In T116442#1751011, @Negative24 wrote: > `role::phabricator::main` isn't the right Puppet class to use in Labs. I'm pretty sure the error had to do with the site variables....
[02:23:32] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 07m 59s)
[02:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:25] 7Puppet, 6Labs, 6Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1751267 (10Negative24) Its not documented. But its not hard. Essentially its just: 1. Apply `role::phabricator::labs` and run puppet 2. `cd /srv/phab/phabricator` and run `sudo bin/storage upgrade` 3....
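The dashboard breakage discussed at 00:33 comes down to Grafana sending POST render queries to Graphite, either through its datasource proxy or directly, and sometimes getting a PNG back instead of JSON. A minimal sketch of the kind of request involved, assuming a reachable graphite host (hostname and metric name below are illustrative, not the actual labs setup):

```
# Query via Grafana's datasource proxy (proxy id 3 is taken from the log above;
# the metric target is purely illustrative). This is the POST that was rejected.
curl -s -X POST 'https://grafana.wikimedia.org/api/datasources/proxy/3/render' \
  --data-urlencode 'target=example.metric.path' \
  --data-urlencode 'from=-1h' \
  --data-urlencode 'format=json'

# The same render query sent straight to a graphite-web instance; if the backend
# answers with image/png rather than application/json, Grafana's JS cannot parse
# the response, which matches the "randomly missing panels" symptom.
curl -s -X POST 'http://graphite.example/render' \
  --data-urlencode 'target=example.metric.path' \
  --data-urlencode 'format=json' \
  -o /dev/null -w '%{content_type}\n'
```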
[02:28:08] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-25 02:28:08+00:00
[02:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:39:06] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[02:39:15] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[02:41:55] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:37] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[03:10:55] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[03:14:16] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[03:19:36] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[03:21:15] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out
[03:25:47] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[03:26:26] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.000 second response time on port 9042
[03:26:36] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[03:26:45] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[03:35:26] PROBLEM - puppet last run on mw2038 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:36] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:07] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: puppet fail
[04:01:47] RECOVERY - puppet last run on mw2038 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[04:01:56] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[04:29:07] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[05:17:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[05:21:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:28:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:31:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[05:35:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[05:36:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Oct 25 05:36:33 UTC 2015 (duration 36m 32s)
[05:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:42:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[05:47:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[05:52:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[06:08:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[06:13:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[06:30:25] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:26] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[06:31:06] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:16] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:56] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:25] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:15] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:26] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:13] 6operations, 10Traffic, 7Availability, 5Patch-For-Review: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1751407 (10aaron)
[06:55:47] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:06] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:57:05] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:05] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:15] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:56] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:57] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:03:56] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25)
[08:19:36] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[08:54:40] (03PS1) 10Ori.livneh: Replace dynamic gdash.wikimedia.org with static mirror [puppet] - 10https://gerrit.wikimedia.org/r/248678 (https://phabricator.wikimedia.org/T104365)
[08:57:58] (03PS2) 10Ori.livneh: Replace dynamic gdash.wikimedia.org with static mirror [puppet] - 10https://gerrit.wikimedia.org/r/248678 (https://phabricator.wikimedia.org/T104365)
[09:07:27] (03CR) 10Ori.livneh: [C: 032] Replace dynamic gdash.wikimedia.org with static mirror [puppet] - 10https://gerrit.wikimedia.org/r/248678 (https://phabricator.wikimedia.org/T104365) (owner: 10Ori.livneh)
[09:35:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 7 below the confidence bounds
[09:45:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[09:47:15] ori: I wonder how long before you get ticked off enough by the daily logrotate based puppet failures to do something about them
[09:47:22] ticking-off-based-development!
[09:48:05] i fixed it before but godog (I think) reverted
[09:48:33] haha reallyyy
[09:49:39] https://gerrit.wikimedia.org/r/#/c/219788/
[09:50:04] revert (with reason): https://gerrit.wikimedia.org/r/#/c/221149/
[09:50:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds
[09:51:06] hmm we can probably re-revert it by directing the passenger logs to elsewhere
[09:52:35] yeah we could. i didn't even want to use chronolog, though. i think the best solution is logrotate's "copytruncate" option, which truncates the log file in place after creating a copy
[09:53:28] ah nice
[09:53:53] filippo vetoed that because that approach is susceptible to dropping log records during the brief gap between the copy and truncate operations
[09:55:29] I wonder how many times people look at the apache log files
[09:55:34] IMO, we have holes in our log coverage big enough to drive a truck through
[09:55:37] this is not one of them
[09:55:48] so i think the opposition is academic and fussy
[09:55:58] you know godog, always walking around with his monocle
[09:56:25] yay, monocling his way around airbnb rentals...
[09:57:17] it's literally a one line fix
[09:57:43] do it and let's see if godog still wants to veto
[09:57:52] i think the real reason people resist fixing it is that it has become something like an IRC timekeeping device, like a sundial
[09:58:04] I think we can use _joe_ waking up as an alternative
[09:58:32] Could also have a bot that just announces the UTC midnight
[09:58:58] "we do."
[09:59:17] hahaha
[10:00:30] well i taunted godog above
[10:00:38] to bait him into re-considering
[10:00:42] so let's see :)
[10:01:32] oh shit did the clock change?
[10:01:55] I don't think the taunt is complete unless you actually submit a patch and add him as cc
[10:01:59] what clock?
[10:02:17] the local clock
[10:02:33] (it didn't; dst ends nov. 1)
[10:02:34] ?
[10:02:38] hahaha
[10:02:47] did you just forget an hour pass by instead?
[10:03:17] no, i hadn't been paying attention to the time
[10:03:48] yeah for a moment I was surprised too since it was 3:03 AM and the last time I looked at the time it was 1:18AM
[10:03:53] it just ended in europe
[10:03:55] so i wasn't totally off
[10:04:05] Sunday, October 25, 2015 at 1:00 UTC, clocks will be set back one hour when Daylight Saving Time (DST) ends in most of Europe.
[10:04:39] 10 hrs ago
[10:05:12] I wonder how a world without measured time will look like
[10:05:22] where the unit of measurement is the day rather than the hour and minute
[10:05:41] known but uncared for, like how we treat milliseconds most of the time
[10:06:19] i would have gone with "known but uncared for, like how we treat icinga alerts"
[10:06:41] yeah, I was thinking at some point of 'labs issues' but that's probably not strictly true
[10:07:47] it's the alert that is over-sensitive
[10:07:51] it's not a genuine spike
[10:08:09] the graphite 5xx one?
[10:08:10] yeah
[10:08:22] although the reverse could be true too
[10:08:30] we used to completely ignore the labstore on high load alert allll the time
[10:08:39] and then we actually fixed the systems
[10:08:46] and the original parameters were actually sane for a sane system
[10:09:03] and was alerting all the time because we were operating close to max capacity all the time
[10:09:26] I shall now attempt to fall asleep.
[10:10:03] night!
[10:39:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 19 data above and 9 below the confidence bounds
[10:45:36] ori: consider the copytruncate veto lifted :) my reasoning at the time IIRC was that we were working on upgrading the puppetmasters anyways, not the case ATM
[10:50:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 9 below the confidence bounds
[11:00:41] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 9 below the confidence bounds
[11:04:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 19 data above and 9 below the confidence bounds
[11:10:16] the high 503 is applebot doing malformed requests and us not returning 404s
[11:11:12] it's been like that since 5UTC
[11:33:29] (03PS2) 10Glaisher: Remove Page and Index namespaces from $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709)
[11:44:22] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[12:09:02] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - https://phabricator.wikimedia.org/T107072#1751635 (10jcrespo) db1035 continues having issues: Compare db1044 (same shard and hardware): {F2767546...
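The fix discussed between 09:52 and 10:45 is logrotate's copytruncate mode: the live log is copied aside and then truncated in place, so the writing process (the puppetmaster's Apache/Passenger here) never has to reopen its file handle. A minimal sketch of such a stanza; the path and rotation schedule are assumptions for illustration, not the actual puppetmaster configuration:

```
# /etc/logrotate.d/puppetmaster-apache  (illustrative path and schedule)
/var/log/apache2/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    # Copy the log aside, then truncate the original in place. No restart or
    # reload of the writing process is needed, at the cost of possibly losing
    # records written between the copy and the truncate (the concern raised
    # in the conversation above).
    copytruncate
}
```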
[12:14:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[12:23:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[12:26:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[12:35:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[12:40:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[12:46:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[12:49:12] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:49:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[13:15:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[13:21:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[13:26:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[13:27:23] akosiaris, jynus - maps replication failed again with FATAL: could not receive data from WAL stream: ERROR: requested WAL segment ... has already been removed
[13:27:37] i will try to re-sync it, but any suggestions are welcome
[13:27:47] * yurik is googling how to restart postgres replication
[13:31:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[13:35:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 8 below the confidence bounds
[13:47:12] o.O
[13:52:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[14:03:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[14:06:22] _O_O_o
[15:27:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[15:42:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[16:16:12] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: puppet fail
[16:44:21] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[16:45:02] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[17:30:55] (03PS1) 10Ori.livneh: gdash: add notice pointing users to grafana.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/248697 (https://phabricator.wikimedia.org/T104365)
[17:31:21] (03CR) 10Ori.livneh: [C: 032 V: 032] gdash: add notice pointing users to grafana.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/248697 (https://phabricator.wikimedia.org/T104365) (owner: 10Ori.livneh)
[17:37:53] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps DWDM]BR
[17:38:10] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1751905 (10ori) I replaced gdash with a static HTML mirror, migrated it to krypton, and added a site notice pointing users to grafana. Closing as resolved.
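For the question yurik raises at 13:27, the usual way to recover a standby once the master has already recycled the WAL segments it needs is to re-seed it with a fresh base backup rather than restart the existing stream. A minimal sketch, assuming PostgreSQL 9.x-era streaming replication; the hostnames, paths, and version are illustrative and not the actual maps deployment procedure:

```
# On the standby whose WAL stream broke (hostnames and paths are illustrative):
sudo service postgresql stop

# Move the stale data directory aside and take a fresh base backup from the
# master; --xlog-method=stream also copies the WAL needed for startup.
sudo mv /var/lib/postgresql/9.4/main /var/lib/postgresql/9.4/main.broken
sudo -u postgres pg_basebackup -h master.example -U replication \
    -D /var/lib/postgresql/9.4/main --xlog-method=stream --progress

# recovery.conf (with primary_conninfo pointing at the master) must be present
# in the new data directory before restarting, e.g. copied from the old one.
sudo -u postgres cp /var/lib/postgresql/9.4/main.broken/recovery.conf \
    /var/lib/postgresql/9.4/main/
sudo service postgresql start
```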
[17:38:35] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1751906 (10ori) 5Open>3Resolved a:5fgiunchedi>3ori
[17:46:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[18:03:29] (03PS1) 10Cmjohnson: Removing dns entries for es1001-es1010 [dns] - 10https://gerrit.wikimedia.org/r/248700
[18:06:31] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for es1001-es1010 [dns] - 10https://gerrit.wikimedia.org/r/248700 (owner: 10Cmjohnson)
[18:21:35] (03PS1) 10Cmjohnson: Adding dns entries for ms-be1019-1020 [dns] - 10https://gerrit.wikimedia.org/r/248702
[18:23:11] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for ms-be1019-1020 [dns] - 10https://gerrit.wikimedia.org/r/248702 (owner: 10Cmjohnson)
[18:25:27] 6operations, 10ops-eqiad: Rack/Setup ms-be1019-ms-1021 - https://phabricator.wikimedia.org/T116542#1751953 (10Cmjohnson) 3NEW a:3Cmjohnson
[18:31:40] (03PS1) 10Cmjohnson: Addin dhcp and netboot cfg for ms-be1019-1021 [puppet] - 10https://gerrit.wikimedia.org/r/248705
[18:34:20] (03PS2) 10Cmjohnson: Addin dhcp and netboot cfg for ms-be1019-1021 [puppet] - 10https://gerrit.wikimedia.org/r/248705
[18:35:22] (03CR) 10Cmjohnson: [C: 032] Addin dhcp and netboot cfg for ms-be1019-1021 [puppet] - 10https://gerrit.wikimedia.org/r/248705 (owner: 10Cmjohnson)
[18:37:31] 6operations, 10ops-eqiad: Rack/Setup ms-be1019-ms-1021 - https://phabricator.wikimedia.org/T116542#1751972 (10Cmjohnson) All on-site work has been completed including bios setup and raid cfg. partman and dhcp https://gerrit.wikimedia.org/r/#/c/248705/ dns https://gerrit.wikimedia.org/r/#/c/248702/ mgmt dns...
[18:40:41] 6operations, 10ops-eqiad: Decommission es1001-es1010 - https://phabricator.wikimedia.org/T113080#1751976 (10Cmjohnson) 5Open>3Resolved Completed
[19:25:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [500.0]
[19:29:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:00:02] yurik: that probably is the reason https://dpaste.de/mHgH . Dropping and recreating an index in a 100GB table is not exactly the best thing you can do to a database
[20:00:34] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1752045 (10Dzahn) a:5Dzahn>3None
[20:11:12] (03PS1) 10Papaul: Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712)
[20:18:46] (03PS2) 10Papaul: Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712)
[20:40:03] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:08:02] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:36:18] 7Puppet: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752131 (10siebrand) 3NEW
[22:37:15] 7Puppet, 6operations: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752138 (10yuvipanda)
[22:38:25] 7Puppet, 6operations: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752131 (10yuvipanda) I already see `--no-80chars-check` in .puppet-lint.rc in operations/puppet.git. Is that not being read by the strict check?
[22:48:31] 7Puppet, 6operations: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752144 (10scfc) I think this is a failure in the Jenkins job. According to https://integration.wikimedia.org/ci/job/translatewiki-puppetlint-strict/2322/console: ``` [translatewiki-puppetl...
[22:49:57] 7Puppet, 6operations, 10Continuous-Integration-Config: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752145 (10scfc)
[22:51:32] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752152 (10Yurik)
[22:55:11] PROBLEM - salt-minion processes on lvs3001 is CRITICAL: Timeout while attempting connection
[22:55:22] PROBLEM - salt-minion processes on lvs3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:57:02] PROBLEM - pybal on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:58:41] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[23:00:22] RECOVERY - salt-minion processes on lvs3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:16:22] RECOVERY - salt-minion processes on lvs3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
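The cause akosiaris points to at 20:00 (a large index rebuild generating WAL faster than the standby could consume it, so the master recycled segments the standby still needed) is commonly mitigated by keeping more WAL on the master. A small sketch of the relevant postgresql.conf knobs; the values are illustrative, not the maps servers' actual configuration:

```
# postgresql.conf on the master (illustrative values, 9.x-era settings)
wal_keep_segments = 512     # retain roughly 8 GB of WAL (512 x 16 MB) for lagging standbys
max_wal_senders   = 5       # allow a few concurrent streaming connections
# On PostgreSQL 9.4+, a physical replication slot makes the master retain WAL
# until the standby has consumed it, at the risk of filling the disk if the
# standby stalls for a long time:
#   SELECT pg_create_physical_replication_slot('maps_standby');
```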