[00:00:14] w00t ori! [00:00:17] Now, everybody reload at the same time, and we can break it again, come on! [00:00:25] HaeB, heh, was just talking to JB about it [00:02:00] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [00:02:01] RECOVERY - HHVM rendering on mw2190 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.305 second response time [00:02:01] RECOVERY - HHVM rendering on mw2091 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.319 second response time [00:02:07] PROBLEM - puppet last run on mw1222 is CRITICAL Puppet has 1 failures [00:02:21] I think these outages used to happen more 10 years ago [00:02:26] RECOVERY - HHVM rendering on mw2048 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.333 second response time [00:02:26] RECOVERY - HHVM rendering on mw2035 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.320 second response time [00:02:27] RECOVERY - HHVM rendering on mw2032 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.344 second response time [00:02:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1401 bytes in 0.216 second response time [00:02:27] RECOVERY - HHVM rendering on mw2106 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.322 second response time [00:02:28] RECOVERY - HHVM rendering on mw2020 is OK: HTTP OK: HTTP/1.1 200 OK - 72492 bytes in 1.572 second response time [00:02:36] PROBLEM - HHVM busy threads on mw1022 is CRITICAL 40.00% of data above the critical threshold [86.4] [00:02:52] foks: me too ;) so how about "We're back up after a brief outage. Free knowledge back at your fingertips!" [00:02:57] PROBLEM - HHVM busy threads on mw1045 is CRITICAL 40.00% of data above the critical threshold [86.4] [00:03:05] HaeB, sounds good. [00:03:10] The first round is on me! [00:03:30] noice [00:04:26] bblack: How did you break it [00:04:27] RECOVERY - HHVM busy threads on mw1022 is OK Less than 30.00% above the threshold [57.6] [00:06:11] https://gdash.wikimedia.org/dashboards/reqerror/ [00:06:18] looking better! :) [00:06:52] Bsadowski1: it wasn't me, for once :) [00:08:02] So who was it? :P [00:08:18] RECOVERY - HHVM busy threads on mw1103 is OK Less than 30.00% above the threshold [57.6] [00:08:36] Yeah, who broke all the wikis [00:08:38] ? [00:08:40] >_> [00:08:42] <_< [00:08:47] RECOVERY - HHVM busy threads on mw1045 is OK Less than 30.00% above the threshold [57.6] [00:08:48] Was it a HHMV patch? 
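(Editor's note on the "HHVM busy threads" alerts above: checks of this kind report what fraction of recent Graphite datapoints sit above a threshold — CRITICAL here at "40.00% of data above the critical threshold [86.4]", recovering once "Less than 30.00% above the threshold [57.6]". A minimal sketch of that logic, using the thresholds quoted in the alerts; the function name, window semantics and sample data are assumptions, not the production check_graphite plugin:)
```python
# Sketch of a percent-over-threshold check, assuming a window of Graphite
# datapoints. Thresholds are taken from the alerts above; everything else
# is illustrative.
def busy_threads_status(datapoints, warn=57.6, crit=86.4,
                        warn_pct=30.0, crit_pct=40.0):
    """Return an icinga-style status string for a window of datapoints."""
    if not datapoints:
        return "UNKNOWN no data"
    over_crit = 100.0 * sum(1 for v in datapoints if v > crit) / len(datapoints)
    over_warn = 100.0 * sum(1 for v in datapoints if v > warn) / len(datapoints)
    if over_crit >= crit_pct:
        return "CRITICAL %.2f%% of data above the critical threshold [%s]" % (over_crit, crit)
    if over_warn >= warn_pct:
        return "WARNING %.2f%% of data above the threshold [%s]" % (over_warn, warn)
    return "OK Less than %.2f%% above the threshold [%s]" % (warn_pct, warn)

# Example: 4 of 10 samples over 86.4 -> CRITICAL at 40.00%, like mw1022 above.
print(busy_threads_status([90, 95, 88, 91, 40, 35, 50, 45, 60, 55]))
```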
[00:09:26] PROBLEM - HHVM processes on mw1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [00:16:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:17:03] Bsadowski1: Krenair post-mortem coming [00:23:14] it was me [00:26:47] RECOVERY - puppet last run on mw1222 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:28:33] (03PS1) 10BBlack: maps.wm.o: turn back on, but only for beta+self referer [puppet] - 10https://gerrit.wikimedia.org/r/231726 (https://phabricator.wikimedia.org/T105076) [00:29:58] 6operations, 6Discovery, 10Maps: Determine limited deploy options - https://phabricator.wikimedia.org/T109159#1541953 (10greg) via hangout or somesuch :) [00:32:47] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [00:33:56] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 34.94 ms [00:37:50] 6operations, 6Discovery, 10Maps: Determine limited deploy options - https://phabricator.wikimedia.org/T109159#1541961 (10Yurik) Ok, so it seems the agreement has been to allow only requests with REFERRER set to either *.wmflabs.org, or to maps.wikimedia.org. [00:38:07] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 2 processes with command name hhvm [00:38:14] (03CR) 10BBlack: [C: 032] maps.wm.o: turn back on, but only for beta+self referer [puppet] - 10https://gerrit.wikimedia.org/r/231726 (https://phabricator.wikimedia.org/T105076) (owner: 10BBlack) [00:57:06] RECOVERY - Disk space on mw1132 is OK: DISK OK [00:57:17] RECOVERY - Disk space on mw1114 is OK: DISK OK [00:57:27] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.080 second response time [00:57:28] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67024 bytes in 3.535 second response time [00:57:47] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 67024 bytes in 1.425 second response time [00:57:47] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [00:58:42] !log stopping kafka broker on analytics1012, it is causing consumption problems with camus, will look into why later. [00:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:04:46] PROBLEM - Kafka Broker Server on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [01:05:00] that's me [01:11:15] 6operations, 6Discovery, 10Maps, 10Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1542011 (10Yurik) 5Open>3Resolved per IRC with @bblack, closing this task as complete. > there are outstanding spinoff iss... [01:14:46] RECOVERY - Kafka Broker Server on analytics1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [01:15:18] !log starting broker on analytics1012, camus wasn't happy about that either. hrm.
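(Editor's note on the maps change above: Yurik's T109159 comment and bblack's merged puppet change turn maps.wikimedia.org back on only for requests whose Referer is a *.wmflabs.org (beta) page or maps.wikimedia.org itself. A minimal sketch of that allowlist rule; the regex and helper below are illustrative assumptions, not the VCL shipped in gerrit change 231726:)
```python
# Illustrative Referer allowlist for the maps service as described above:
# only *.wmflabs.org and maps.wikimedia.org referers are let through.
import re

ALLOWED_REFERER = re.compile(
    r'^https?://([^/]+\.)?wmflabs\.org/|^https?://maps\.wikimedia\.org/'
)

def referer_allowed(referer):
    """True if the request's Referer header matches the maps allowlist."""
    return bool(referer) and ALLOWED_REFERER.search(referer) is not None

assert referer_allowed("https://maps-beta.wmflabs.org/some/page")
assert referer_allowed("https://maps.wikimedia.org/")
assert not referer_allowed("https://example.com/embed")
assert not referer_allowed(None)  # no Referer at all is rejected in this sketch
```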
[01:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:15:33] 6operations, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542029 (10Yurik) [01:16:44] 6operations, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542035 (10Yurik) [01:37:45] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542059 (10Dzahn) ``` #!/bin/bash # import a mailman list - config and archives # dzahn@wikimedia.org - 20150814 - T108073 LISTNAME=$1 IMPO... [01:46:16] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542069 (10Dzahn) one issue with a list that has "locked" in the name , which stopped the import script [01:46:36] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:52:27] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:05:36] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied [02:15:35] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542086 (10Dzahn) ``` ==> /var/log/mailman/mischief <== Aug 15 02:14:58 2015 (3431) Hostile listname: wikitech-announce.disabled.T100503 ``... [02:25:44] !log l10nupdate@tin Synchronized php-1.26wmf18/cache/l10n: l10nupdate for 1.26wmf18 (duration: 06m 37s) [02:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:05] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf18) at 2015-08-15 02:29:05+00:00 [02:30:22] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542087 (10Dzahn) "wikiit-l" broke everything, the listinfo page, manual ./list_lists and even a service restart because that also tries ./... 
[02:30:24] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1542090 (10Dzahn) [02:30:26] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542088 (10Dzahn) 5Open>3Resolved a:3Dzahn [02:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 78 data above and 9 below the confidence bounds [02:31:07] PROBLEM - High load average on labstore1002 is CRITICAL 100.00% of data above the critical threshold [24.0] [03:03:57] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL 50.00% of data above the critical threshold [60.0] [03:45:28] RECOVERY - Persistent high iowait on labstore1002 is OK Less than 50.00% above the threshold [40.0] [04:01:26] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK No anomaly detected [04:15:46] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [04:38:17] !log killing some rsync processes on labstore1002 because iowaits are through the roof [04:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:43:30] PROBLEM - Last backup of the maps filesystem on labstore1002 is CRITICAL - Last run result was exit-code [04:44:07] PROBLEM - Last backup of the tools filesystem on labstore1002 is CRITICAL - Last run result was exit-code [04:45:17] RECOVERY - Disk space on labstore1002 is OK: DISK OK [04:49:07] PROBLEM - Last backup of the others filesystem on labstore1002 is CRITICAL - Last run result was exit-code [05:03:08] PROBLEM - Disk space on mw1123 is CRITICAL: DISK CRITICAL - free space: / 8178 MB (3% inode=93%) [05:08:37] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL 55.56% of data above the critical threshold [60.0] [05:14:26] RECOVERY - Persistent high iowait on labstore1002 is OK Less than 50.00% above the threshold [40.0] [05:33:08] RECOVERY - High load average on labstore1002 is OK Less than 50.00% above the threshold [16.0] [05:41:57] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Aug 15 05:41:57 UTC 2015 (duration 41m 56s) [05:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:45:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 28926 seconds ago, expected 28800 [05:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 29226 seconds ago, expected 28800 [05:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 29526 seconds ago, expected 28800 [06:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 29826 seconds ago, expected 28800 [06:05:18] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 30126 seconds ago, expected 28800 [06:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 30426 seconds ago, expected 28800 [06:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 30725 seconds ago, expected 28800 [06:20:08] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 31025 seconds ago, expected 28800 [06:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 31329 seconds ago, expected 28800 [06:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last 
ran 31626 seconds ago, expected 28800 [06:31:18] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:31:46] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw1203 is CRITICAL Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:33:17] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 31925 seconds ago, expected 28800 [06:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 32225 seconds ago, expected 28800 [06:45:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 32525 seconds ago, expected 28800 [06:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 32825 seconds ago, expected 28800 [06:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 33126 seconds ago, expected 28800 [06:56:06] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:07] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw1203 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:06] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 33426 seconds ago, expected 28800 [07:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 33726 seconds ago, expected 28800 [07:07:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [07:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34029 seconds ago, expected 28800 [07:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34326 seconds ago, expected 28800 [07:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34626 seconds ago, expected 28800 [07:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34926 seconds ago, expected 28800 [07:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 35226 seconds ago, expected 28800 [07:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 35526 seconds ago, expected 28800 [07:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 35826 seconds ago, expected 28800 [07:41:07] PROBLEM - High load average on labstore1002 is CRITICAL 55.56% of data above the critical threshold [24.0] [07:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 36125 seconds ago, expected 28800 [07:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 36425 seconds ago, expected 28800 [07:50:37] RECOVERY - High load average on labstore1002 is OK Less than 50.00% above the threshold [16.0] [07:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 36725 seconds ago, expected 28800 [08:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37025 seconds ago, expected 28800 [08:04:16] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [08:05:16] 
PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37326 seconds ago, expected 28800 [08:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37626 seconds ago, expected 28800 [08:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37926 seconds ago, expected 28800 [08:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 38226 seconds ago, expected 28800 [08:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 38526 seconds ago, expected 28800 [08:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 38826 seconds ago, expected 28800 [08:31:37] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 10.71% of data above the critical threshold [100000000.0] [08:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 39126 seconds ago, expected 28800 [08:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 39426 seconds ago, expected 28800 [08:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 39726 seconds ago, expected 28800 [08:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40026 seconds ago, expected 28800 [08:54:27] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0] [08:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40325 seconds ago, expected 28800 [09:00:07] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40625 seconds ago, expected 28800 [09:01:17] (03Abandoned) 10Giuseppe Lavagetto: Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/227725 (owner: 10Giuseppe Lavagetto) [09:01:36] (03Abandoned) 10Giuseppe Lavagetto: Assign mobileapps service to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/227726 (owner: 10Giuseppe Lavagetto) [09:01:50] (03Abandoned) 10Giuseppe Lavagetto: Setup LVS for mobileapps service on sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/227727 (owner: 10Giuseppe Lavagetto) [09:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40925 seconds ago, expected 28800 [09:10:01] (03PS1) 10Giuseppe Lavagetto: puppet_compiler: Create the workdir as well [puppet] - 10https://gerrit.wikimedia.org/r/231755 [09:10:07] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 41225 seconds ago, expected 28800 [09:10:25] (03PS2) 10Giuseppe Lavagetto: puppet_compiler: Create the workdir as well [puppet] - 10https://gerrit.wikimedia.org/r/231755 [09:15:07] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 41525 seconds ago, expected 28800 [09:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 41825 seconds ago, expected 28800 [09:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 42126 seconds ago, expected 28800 [09:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 42426 seconds ago, expected 28800 [09:35:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 42726 seconds ago, expected 28800 [09:39:27] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542247 (10Kghbln) It's down again. Perhaps some kind of monitoring could be implemented to detect this until the migration to a stable successor system is implemented. 
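(Editor's note on the repeating payments1001 check_puppetrun alerts: this is a plain freshness check — the agent's last successful run is N seconds old and the check expects at most 28800 seconds, i.e. 8 hours, so the reported age simply keeps growing until puppet runs again at 12:35. A small sketch of that arithmetic, assuming a last-run timestamp is available; the names are illustrative, not the fundraising check script itself:)
```python
# Sketch of the staleness arithmetic behind "Puppet last ran 28926 seconds
# ago, expected 28800": CRITICAL once the last run is older than 8 hours.
import time

EXPECTED_MAX_AGE = 28800  # 8 hours, the "expected" value in the alerts

def check_puppetrun(last_run_epoch, now=None):
    """Return (status, message) for a puppet last-run freshness check."""
    now = time.time() if now is None else now
    age = int(now - last_run_epoch)
    if age > EXPECTED_MAX_AGE:
        return ("CRITICAL",
                "Puppet last ran %d seconds ago, expected %d" % (age, EXPECTED_MAX_AGE))
    return ("OK", "Puppet last ran %d seconds ago" % age)

# Example matching the first alert in this log: a run 28926 seconds ago.
print(check_puppetrun(last_run_epoch=0, now=28926))
```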
[09:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43026 seconds ago, expected 28800 [09:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43326 seconds ago, expected 28800 [09:45:36] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542259 (10Kghbln) Diffusion is a showstopper since it does not allow to download code, raw diffs only. That's what makes it kinda useless and ridiculous. [09:48:02] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542270 (10Glaisher) >>! In T83702#1542247, @Kghbln wrote: > Perhaps some kind of monitoring could be implemented to detect this until the migration to a stable successor syst... [09:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43626 seconds ago, expected 28800 [09:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43926 seconds ago, expected 28800 [09:57:45] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542278 (10Kghbln) Ah, nobody there at operations. Was not aware of this. :p Yeah, since Diffusion is kinda useless since it does not allow downloads and also has a pretty con... [10:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 44226 seconds ago, expected 28800 [10:04:48] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542286 (10Glaisher) >>! In T83702#1542278, @Kghbln wrote: > Anybody aware of the fact that about every extension's page points to git.wikimedia.org? Yes. See {T108864}. [10:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 44525 seconds ago, expected 28800 [10:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 44825 seconds ago, expected 28800 [10:10:54] (03PS1) 10Yuvipanda: quarry: Remove duplication of clone_path and other variables [puppet] - 10https://gerrit.wikimedia.org/r/231759 [10:10:56] (03PS1) 10Yuvipanda: ores: Add role+class for the precached daemon [puppet] - 10https://gerrit.wikimedia.org/r/231760 [10:10:58] (03PS1) 10Yuvipanda: ores: Mark all roles requiring ores::base properly [puppet] - 10https://gerrit.wikimedia.org/r/231761 [10:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 45125 seconds ago, expected 28800 [10:17:33] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1542291 (10Krenair) [10:18:09] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542292 (10Kghbln) Cool, it's in the making. :) [10:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 45426 seconds ago, expected 28800 [10:24:27] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 45726 seconds ago, expected 28800 [10:26:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed [10:27:07] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46026 seconds ago, expected 28800 [10:32:57] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [10:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46326 seconds ago, expected 28800 [10:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46626 seconds ago, expected 28800 [10:43:16] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46926 seconds ago, expected 28800 [10:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 47226 seconds ago, expected 28800 [10:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 47526 seconds ago, expected 28800 [10:55:38] PROBLEM - puppet last run on snapshot1002 is CRITICAL puppet fail [11:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 47826 seconds ago, expected 28800 [11:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 48126 seconds ago, expected 28800 [11:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 48426 seconds ago, expected 28800 [11:12:18] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:13:57] RECOVERY - RAID on snapshot1002 is OK no RAID installed [11:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 48726 seconds ago, expected 28800 [11:15:56] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.091 second response time [11:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49025 seconds ago, expected 28800 [11:23:46] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49325 seconds ago, expected 28800 [11:25:27] RECOVERY - RAID on snapshot1002 is OK no RAID installed [11:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49626 seconds ago, expected 28800 [11:30:42] 500 seem back on track, the small recent increase may have been created by some bots [11:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49925 seconds ago, expected 28800 [11:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 50225 seconds ago, expected 28800 [11:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 50526 seconds ago, expected 28800 [11:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 50826 seconds ago, expected 28800 [11:54:17] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 51126 seconds ago, expected 28800 [11:56:06] RECOVERY - RAID on snapshot1002 is OK no RAID installed [12:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 51426 seconds ago, expected 28800 [12:02:38] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 51726 seconds ago, expected 28800 [12:07:47] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:17] RECOVERY - Router interfaces on cr1-ulsfo is OK host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 1, unused: 0 [12:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52026 seconds ago, expected 28800 [12:13:17] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:13:56] RECOVERY - Router interfaces on mr1-codfw is OK host 208.80.153.196, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 [12:14:17] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52326 seconds ago, expected 28800 [12:15:27] RECOVERY - RAID on snapshot1002 is OK no RAID installed [12:16:16] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 15 minutes ago with 0 failures [12:17:31] apergos: snapshot1002 alerts? [12:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52626 seconds ago, expected 28800 [12:25:06] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52926 seconds ago, expected 28800 [12:26:58] RECOVERY - RAID on snapshot1002 is OK no RAID installed [12:28:59] 6operations, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542421 (10BBlack) To import some questions from IRC earlier: 1. Does maps needs its own cache cluster? - My opinion is that yes, it does, especially in early days when we don... 
[12:29:10] 6operations, 10Traffic, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542422 (10BBlack) [12:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 53226 seconds ago, expected 28800 [12:30:28] 6operations, 10Traffic, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542014 (10BBlack) [12:35:16] RECOVERY - check_puppetrun on payments1001 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:36:47] RECOVERY - Router interfaces on cr1-eqdfw is OK host 208.80.153.198, interfaces up: 33, down: 0, dormant: 0, excluded: 2, unused: 0 [12:37:07] RECOVERY - Router interfaces on cr1-eqord is OK host 208.80.154.198, interfaces up: 33, down: 0, dormant: 0, excluded: 3, unused: 0 [12:44:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [12:45:07] RECOVERY - Disk space on uranium is OK: DISK OK [12:46:06] !log restarted gitblit on antimony, because Java is Awesome [12:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.347 second response time [12:53:07] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:53:13] ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1002 is CRITICAL - Last run result was exit-code Coren Last backup missed because out of snapshot space - cleaned, but needs more frequent cleanup. (https://phabricator.wikimedia.org/T109176) [12:53:13] ACKNOWLEDGEMENT - Last backup of the others filesystem on labstore1002 is CRITICAL - Last run result was exit-code Coren Last backup missed because out of snapshot space - cleaned, but needs more frequent cleanup. (https://phabricator.wikimedia.org/T109176) [12:53:13] ACKNOWLEDGEMENT - Last backup of the tools filesystem on labstore1002 is CRITICAL - Last run result was exit-code Coren Last backup missed because out of snapshot space - cleaned, but needs more frequent cleanup. (https://phabricator.wikimedia.org/T109176) [13:02:07] PROBLEM - check_puppetrun on bellatrix is CRITICAL Puppet has 28 failures [13:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:07:07] RECOVERY - check_puppetrun on bellatrix is OK Puppet is currently enabled, last run 104 seconds ago with 0 failures [13:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:15:08] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 212 seconds ago with 0 failures [13:16:32] (03PS1) 10Faidon Liambotis: Switch US/TX to codfw [dns] - 10https://gerrit.wikimedia.org/r/231772 [13:19:49] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations: John Lewis sudo as 'list' on mailman staging VM - https://phabricator.wikimedia.org/T108349#1542501 (10JohnLewis) And confirmed (late). ``` johnflewis@fermium:~$ sudo service mailman status ● mailman.service - LSB: Mailman Master Queue Runner ``` Tha... [13:21:02] (03CR) 10Alex Monk: "Is this going to start sending traffic to through to the codfw apaches and databases, or just caches?" [dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [13:23:07] (03CR) 10Faidon Liambotis: "Just caches." 
[dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [13:23:18] ^ok :-) [13:28:15] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1542502 (10JohnLewis) Looking at usage; it's sparse so we can easily add a new IP via hiera alone once the autobound{lists} variables in role::lists are in hiera. [13:35:08] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:39:47] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:45:16] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 225 seconds ago with 0 failures [14:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:06:15] (03Abandoned) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 (owner: 10Merlijn van Deen) [14:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:10:17] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 207 seconds ago with 0 failures [14:15:58] RECOVERY - RAID on snapshot1002 is OK no RAID installed [14:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 169 seconds ago with 0 failures [14:48:14] 6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1542524 (10jcrespo) 3NEW [14:49:11] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1542532 (10jcrespo) Import on dbstore2002 finished. We have a 3-day lag on dbstore2002, but I prefer that than performing another manual import because it will be less error-prone a... [14:49:30] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1542536 (10jcrespo) 5Open>3Resolved [14:49:32] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1542537 (10jcrespo) [14:49:51] !log stopping kafka broker on analytics1012 to again try to figure out why camus can't consume from it [14:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:34] 6operations, 10hardware-requests: dbproxy servers for codfw - https://phabricator.wikimedia.org/T109116#1542547 (10jcrespo) Status: "It's complicated" [14:52:17] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1542548 (10jcrespo) [14:52:18] 6operations, 10hardware-requests: dbproxy servers for codfw - https://phabricator.wikimedia.org/T109116#1542549 (10jcrespo) [14:53:07] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:57] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:55:47] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 26 minutes ago with 0 failures [14:56:57] RECOVERY - RAID on snapshot1002 is OK no RAID installed [15:05:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:10:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:15:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 178 seconds ago with 0 failures [15:23:07] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:58] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:06] RECOVERY - RAID on snapshot1002 is OK no RAID installed [15:25:48] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 56 minutes ago with 0 failures [15:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:35:07] RECOVERY - Disk space on mw1123 is OK: DISK OK [15:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:43:16] (03PS1) 10BBlack: bits-legacy: remove beacon/statsv support [puppet] - 10https://gerrit.wikimedia.org/r/231777 [15:43:18] (03PS1) 10BBlack: bits-legacy: remove special https://bits redirects for secure wikis [puppet] - 10https://gerrit.wikimedia.org/r/231778 [15:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 212 seconds ago with 0 failures [15:52:04] (03PS1) 10Ottomata: Add param for auto.leader.rebalance.enable [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231780 [15:53:05] (03CR) 10Ottomata: [C: 032] Add param for auto.leader.rebalance.enable [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231780 (owner: 10Ottomata) [15:53:53] 6operations, 7HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1542593 (10Chmarkine) How about mapping download.Wikipedia.org to the text cluster, and then have it redirect to https://dumps.wikimedia.org? [15:55:52] (03PS1) 10Ottomata: Disable kafka auto leader rebalance [puppet] - 10https://gerrit.wikimedia.org/r/231781 [15:58:16] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:11] (03CR) 10Ottomata: [C: 032] Disable kafka auto leader rebalance [puppet] - 10https://gerrit.wikimedia.org/r/231781 (owner: 10Ottomata) [16:00:17] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [16:05:03] !log starting rolling restart of kafka brokers to apply auto leader rebalance enable = false [16:05:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:57] RECOVERY - puppet last run on analytics1012 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:07:26] <_joe_> !log removing manually core dumps from last night's outage on all appservers in eqiad, they occupy on average 30 GB/server [16:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:15:06] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 216 seconds ago with 0 failures [16:23:37] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
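(Editor's note on the kafka changes above: the two merged puppet patches add and then disable auto.leader.rebalance.enable in the brokers' server.properties, after which the brokers are roll-restarted to pick up the setting. A small sketch of verifying that setting on a broker; the properties path comes from the Kafka process check earlier in this log, while the parsing helper is an assumption for illustration:)
```python
# Sketch: confirm auto.leader.rebalance.enable=false landed in a broker's
# /etc/kafka/server.properties before or after the rolling restart.
def read_properties(path):
    """Parse a java-style .properties file into a dict (comments ignored)."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "=" in line:
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()
    return props

if __name__ == "__main__":
    props = read_properties("/etc/kafka/server.properties")
    # Kafka's shipped default for this property is "true" if unset.
    rebalance = props.get("auto.leader.rebalance.enable", "true")
    print("auto leader rebalance enabled:", rebalance)
```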
[16:25:37] RECOVERY - RAID on snapshot1002 is OK no RAID installed [16:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:45:08] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 198 seconds ago with 0 failures [16:57:21] (03PS1) 10Ottomata: Split webrequest camus import job into multiple jobs for different size topics [puppet] - 10https://gerrit.wikimedia.org/r/231785 [16:57:46] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:05:47] RECOVERY - RAID on snapshot1002 is OK no RAID installed [17:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:13:56] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:06] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 197 seconds ago with 0 failures [17:15:47] RECOVERY - RAID on snapshot1002 is OK no RAID installed [17:23:27] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542643 (10Yurik) 3NEW a:3BBlack [17:24:26] PROBLEM - puppet last run on snapshot1002 is CRITICAL puppet fail [17:26:32] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542651 (10Krenair) >>! In T109004#1542259, @Kghbln wrote: > Diffusion is a showstopper since it does not allow to download code, raw diffs only. That's what makes it kinda usel... [17:27:19] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542652 (10Yurik) [17:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:37:56] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:41:58] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542658 (10BBlack) I don't see that behavior, at least in Chrome, clicking your link out of gmail or phab. [17:43:47] RECOVERY - RAID on snapshot1002 is OK no RAID installed [17:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 215 seconds ago with 0 failures [17:46:07] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542660 (10BBlack) ... but in any case, we *do* want to block referer in the long run to keep usage to our wikis only. the right way around this is to do it like a wiki: put a page up in... 
[17:46:57] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 24.14% of data above the critical threshold [100000000.0] [17:49:05] (03CR) 10Ottomata: [C: 032] Split webrequest camus import job into multiple jobs for different size topics [puppet] - 10https://gerrit.wikimedia.org/r/231785 (owner: 10Ottomata) [17:58:38] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:02:38] (03PS1) 10Ottomata: Revert split of webrequest imports in multiple Camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/231795 [18:04:17] (03CR) 10Ottomata: [C: 032] Revert split of webrequest imports in multiple Camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/231795 (owner: 10Ottomata) [18:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:10:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:12:07] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 203 seconds ago with 0 failures [18:18:16] RECOVERY - RAID on snapshot1002 is OK no RAID installed [18:27:38] ACKNOWLEDGEMENT - Outgoing network saturation on labstore1002 is CRITICAL 28.57% of data above the critical threshold [100000000.0] Coren Culprit killed, waiting for rolling average to go down. [18:33:27] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0] [18:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:40:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 207 seconds ago with 0 failures [18:53:17] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:56] PROBLEM - check_puppetrun on fdb2001 is CRITICAL Puppet has 34 failures [18:55:07] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:59:56] PROBLEM - check_puppetrun on fdb2001 is CRITICAL Puppet has 34 failures [19:04:56] PROBLEM - check_puppetrun on fdb2001 is CRITICAL Puppet has 34 failures [19:08:27] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:09:56] RECOVERY - check_puppetrun on fdb2001 is OK Puppet is currently enabled, last run 292 seconds ago with 0 failures [19:10:27] RECOVERY - RAID on snapshot1002 is OK no RAID installed [19:16:27] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:18] RECOVERY - RAID on snapshot1002 is OK no RAID installed [19:22:07] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542775 (10Kghbln) Well, talking about shared hosting without command line access. Still according to my experience the predominant environment out there even if some people ref... [19:22:27] PROBLEM - YARN NodeManager Node-State on analytics1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:24:18] RECOVERY - YARN NodeManager Node-State on analytics1041 is OK YARN NodeManager analytics1041.eqiad.wmnet:8041 Node-State: RUNNING [19:38:31] 6operations, 6Multimedia: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1542817 (10Tgr) >>!
In T92322#1419551, @Tgr wrote: > https://gerrit.wikimedia.org/r/#/c/222224/ will make this easy again. It did not, it killed statsd without sampling, and with samp... [19:44:37] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:46:36] RECOVERY - RAID on snapshot1002 is OK no RAID installed [19:53:27] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:17] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 28 minutes ago with 0 failures [20:16:38] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:36] RECOVERY - RAID on snapshot1002 is OK no RAID installed [20:29:16] 6operations, 10Math: Install-more-LaTeX-packages - https://phabricator.wikimedia.org/T109195#1542849 (10Krenair) [20:29:26] 6operations, 10Math: Install more LaTeX packages - https://phabricator.wikimedia.org/T109195#1542852 (10Krenair) [20:52:47] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:38] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:46] PROBLEM - SSH on snapshot1002 is CRITICAL - Socket timeout after 10 seconds [20:55:37] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [20:55:38] RECOVERY - SSH on snapshot1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [20:56:36] RECOVERY - RAID on snapshot1002 is OK no RAID installed [21:14:46] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:21] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1542881 (10Peachey88) > Maybe when those eqiad ones get reinstalled in T104747 they can be renamed to mw*? Or looking at it the other way shouldn't the codfw boxes be renamed inline with the naming conventions? [21:23:28] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:25:17] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [21:25:28] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1542882 (10Krenair) [21:27:14] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1542884 (10Krenair) [21:28:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed [21:35:10] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1542891 (10ori) >>! In T105009#1542881, @Peachey88 wrote: >> Maybe when those eqiad ones get reinstalled in T104747 they can be renamed to mw*? > > Or looking at it the other way shouldn't the codfw boxes be renamed inline w... [21:51:53] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542895 (10demon) I see a "view raw file" button on [[ /diffusion/MW/browse/master/README | this file ]]. [21:55:47] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail [22:12:56] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:16:37] RECOVERY - RAID on snapshot1002 is OK no RAID installed [22:21:57] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1542896 (10GWicke) Another global outage triggered by a puppet config deploy: https://wikitech.wikimedia.org/wiki/Incident_documentation/2... [22:22:47] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:28:37] Hello. [22:28:43] I think something's gone badly wrong. [22:30:20] ? [22:30:54] Krenair: A few e-mails didn't get delivered, so I got my e-mail revoked and had to re-confirm it again [22:30:59] For, like, third time this year [22:32:56] odder, BounceHandler will do that if your email provider starts sending bounces [22:33:11] (again.) [22:33:12] what's going badly wrong about this? [22:33:33] Krenair: Yeah, but it's getting a bit irritating when you have to send 10+ e-mails [22:33:49] Why are your mails failing to deliver? [22:34:01] Yes, why are they. [22:36:52] Krenair: I get my messages, like talk page message notifications and Echo mentions, okay, never been any problems [22:37:35] But I just tried sending a couple of e-mails, and failed to send more than 1 [22:39:22] "Mail delivery failed: returning message to sender" apparently is the reason [22:39:26] not very useful [22:40:33] legoktm, around? [22:40:47] odder: yahoo, by any chance? [22:40:51] Ish [22:41:05] valhallasw`cloud: nope. [22:41:57] odder: the emails to yourself bounced? Or to other people? [22:42:02] odder: hm, wait, I think it's actually an spf issue, not a dmarc one, so it's not necessarily just yahoo [22:42:26] legoktm: I tried sending e-mail out, and get copies sent to myself [22:42:32] but I'm confused so I'll let legoktm try to understand it [22:42:49] but they apparently bounced, which resulted in my having to re-confirm my e-mail address [22:42:53] Did you receive those copies? [22:43:00] None [22:43:35] legoktm: Funnily enough, I did receive the e-mail reconfirmation mail [22:44:18] > $wgBounceRecordLimit = 5; [22:44:26] so after 5 bounces I think we unconfirm [22:44:32] yes, it's been triggering that according to BounceHandler.log [22:45:39] after 5 in a row, or 5 in total? [22:45:48] because I reconfirmed my e-mail, and only managed to send one e-mail [22:46:01] and then it unconfirmed you again? [22:46:10] I think it's 5 over some time period [22:46:11] Yep. [22:47:27] https://gerrit.wikimedia.org/r/#/c/168337/2/BounceHandler.php 7 days apparently [22:48:13] we should probably clear previous bounce records once you reconfirm [22:48:34] and you should talk to your mail provider about why they keep bouncing ;) [22:49:08] Would we keep a copy of the full bounce message somewhere? [22:49:37] https://phabricator.wikimedia.org/T99767 we don't [22:49:41] legoktm: Then why would I get all other e-mails okay? [22:49:50] I don't know :/ [22:50:30] unfortunately their customer service seem to be sleeping. how dare they. [22:51:07] legoktm, we don't in BH, sure, but maybe ops would have a copy somewhere on a mail server? 
[22:51:26] maybe, idk [22:52:10] odder: so in the meantime, let me remove your bounce records so you can at least go 5 more bounces before having to reconfirm [22:52:30] legoktm: thanks [22:56:24] !log removed 13 bounce_records for User:odder from bouncehandler database [22:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:18] 13, wow [23:03:57] odder: hope that helps, if it keeps happening we can look into saving the full bounce message from your emails if we have to... [23:04:06] * legoktm goes offline [23:04:24] legoktm: It does, thanks a lot [23:22:26] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:24:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed [23:51:36] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:52:27] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:16] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 24 minutes ago with 0 failures [23:55:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed
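(Editor's note on the BounceHandler thread above: with $wgBounceRecordLimit = 5 and the 7-day window from gerrit change 168337, an address is unconfirmed once five bounces are recorded within the window, which is why clearing odder's 13 old records buys another five before the next unconfirm. A minimal sketch of that counting rule; the real extension is PHP and stores records in the bouncehandler database, so everything below is illustrative:)
```python
# Sketch of the BounceHandler unconfirm rule discussed above: unconfirm an
# address once $wgBounceRecordLimit (5) bounces are recorded within 7 days.
from datetime import datetime, timedelta

BOUNCE_RECORD_LIMIT = 5                      # $wgBounceRecordLimit
BOUNCE_RECORD_PERIOD = timedelta(days=7)     # window from the linked change

def should_unconfirm(bounce_timestamps, now=None):
    """True if enough recent bounces were recorded to unconfirm the email."""
    now = now or datetime.utcnow()
    recent = [t for t in bounce_timestamps if now - t <= BOUNCE_RECORD_PERIOD]
    return len(recent) >= BOUNCE_RECORD_LIMIT

# Example: five bounces in the last few days trips the limit; clearing the
# records (an empty list) resets the counter, as legoktm did for odder.
now = datetime(2015, 8, 15, 23, 0)
recent_bounces = [now - timedelta(days=d) for d in range(5)]
assert should_unconfirm(recent_bounces, now=now)
assert not should_unconfirm([], now=now)
```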