[00:00:14] w00t ori! [00:00:17] Now, everybody reload at the same time, and we can break it again, come on! [00:00:25] HaeB, heh, was just talking to JB about it [00:02:00] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [00:02:01] RECOVERY - HHVM rendering on mw2190 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.305 second response time [00:02:01] RECOVERY - HHVM rendering on mw2091 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.319 second response time [00:02:07] PROBLEM - puppet last run on mw1222 is CRITICAL Puppet has 1 failures [00:02:21] I think these outages used to happen more 10 years ago [00:02:26] RECOVERY - HHVM rendering on mw2048 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.333 second response time [00:02:26] RECOVERY - HHVM rendering on mw2035 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.320 second response time [00:02:27] RECOVERY - HHVM rendering on mw2032 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.344 second response time [00:02:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1401 bytes in 0.216 second response time [00:02:27] RECOVERY - HHVM rendering on mw2106 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.322 second response time [00:02:28] RECOVERY - HHVM rendering on mw2020 is OK: HTTP OK: HTTP/1.1 200 OK - 72492 bytes in 1.572 second response time [00:02:36] PROBLEM - HHVM busy threads on mw1022 is CRITICAL 40.00% of data above the critical threshold [86.4] [00:02:52] foks: me too ;) so how about "We're back up after a brief outage. Free knowledge back at your fingertips!" [00:02:57] PROBLEM - HHVM busy threads on mw1045 is CRITICAL 40.00% of data above the critical threshold [86.4] [00:03:05] HaeB, sounds good. [00:03:10] The first round is on me! [00:03:30] noice [00:04:26] bblack: How did you break it [00:04:27] RECOVERY - HHVM busy threads on mw1022 is OK Less than 30.00% above the threshold [57.6] [00:06:11] https://gdash.wikimedia.org/dashboards/reqerror/ [00:06:18] looking better! :) [00:06:52] Bsadowski1: it wasn't me, for once :) [00:08:02] So who was it? :P [00:08:18] RECOVERY - HHVM busy threads on mw1103 is OK Less than 30.00% above the threshold [57.6] [00:08:36] Yeah, who broke all the wikis [00:08:38] ? [00:08:40] >_> [00:08:42] <_< [00:08:47] RECOVERY - HHVM busy threads on mw1045 is OK Less than 30.00% above the threshold [57.6] [00:08:48] Was it a HHMV patch? 
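(Editor's note on the "HHVM busy threads" alerts above: checks of this kind report what fraction of recent Graphite datapoints sit above a threshold — CRITICAL here at "40.00% of data above the critical threshold [86.4]", recovering once "Less than 30.00% above the threshold [57.6]". A minimal sketch of that logic, using the thresholds quoted in the alerts; the function name, window semantics and sample data are assumptions, not the production check_graphite plugin:)
```python
# Sketch of a percent-over-threshold check, assuming a window of Graphite
# datapoints. Thresholds are taken from the alerts above; everything else
# is illustrative.
def busy_threads_status(datapoints, warn=57.6, crit=86.4,
                        warn_pct=30.0, crit_pct=40.0):
    """Return an icinga-style status string for a window of datapoints."""
    if not datapoints:
        return "UNKNOWN no data"
    over_crit = 100.0 * sum(1 for v in datapoints if v > crit) / len(datapoints)
    over_warn = 100.0 * sum(1 for v in datapoints if v > warn) / len(datapoints)
    if over_crit >= crit_pct:
        return "CRITICAL %.2f%% of data above the critical threshold [%s]" % (over_crit, crit)
    if over_warn >= warn_pct:
        return "WARNING %.2f%% of data above the threshold [%s]" % (over_warn, warn)
    return "OK Less than %.2f%% above the threshold [%s]" % (warn_pct, warn)

# Example: 4 of 10 samples over 86.4 -> CRITICAL at 40.00%, like mw1022 above.
print(busy_threads_status([90, 95, 88, 91, 40, 35, 50, 45, 60, 55]))
```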
[00:09:26] PROBLEM - HHVM processes on mw1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [00:16:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:17:03] Bsadowski1: Krenair post-mortem coming [00:23:14] it was me [00:26:47] RECOVERY - puppet last run on mw1222 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:28:33] (03PS1) 10BBlack: maps.wm.o: turn back on, but only for beta+self referer [puppet] - 10https://gerrit.wikimedia.org/r/231726 (https://phabricator.wikimedia.org/T105076) [00:29:58] 6operations, 6Discovery, 10Maps: Determine limited deploy options - https://phabricator.wikimedia.org/T109159#1541953 (10greg) via hangout or somesuch :) [00:32:47] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [00:33:56] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 34.94 ms [00:37:50] 6operations, 6Discovery, 10Maps: Determine limited deploy options - https://phabricator.wikimedia.org/T109159#1541961 (10Yurik) Ok, so it seems the agreement has been to allow only requests with REFERRER set to either *.wmflabs.org, or to maps.wikimedia.org. [00:38:07] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 2 processes with command name hhvm [00:38:14] (03CR) 10BBlack: [C: 032] maps.wm.o: turn back on, but only for beta+self referer [puppet] - 10https://gerrit.wikimedia.org/r/231726 (https://phabricator.wikimedia.org/T105076) (owner: 10BBlack) [00:57:06] RECOVERY - Disk space on mw1132 is OK: DISK OK [00:57:17] RECOVERY - Disk space on mw1114 is OK: DISK OK [00:57:27] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.080 second response time [00:57:28] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67024 bytes in 3.535 second response time [00:57:47] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 67024 bytes in 1.425 second response time [00:57:47] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [00:58:42] !log stopping kafka broker on analytics1012, it is causing consumption problems with camus, will look into why later. [00:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:04:46] PROBLEM - Kafka Broker Server on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [01:05:00] that's me [01:11:15] 6operations, 6Discovery, 10Maps, 10Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1542011 (10Yurik) 5Open>3Resolved per IRC with @bblack, closing this task as complete. > there are outstanding spinoff iss... [01:14:46] RECOVERY - Kafka Broker Server on analytics1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [01:15:18] !log starting broker on analytics1012, camus wasn't happy about that either. hrm.
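(Editor's note on the maps change above: Yurik's T109159 comment and bblack's merged puppet change turn maps.wikimedia.org back on only for requests whose Referer is a *.wmflabs.org (beta) page or maps.wikimedia.org itself. A minimal sketch of that allowlist rule; the regex and helper below are illustrative assumptions, not the VCL shipped in gerrit change 231726:)
```python
# Illustrative Referer allowlist for the maps service as described above:
# only *.wmflabs.org and maps.wikimedia.org referers are let through.
import re

ALLOWED_REFERER = re.compile(
    r'^https?://([^/]+\.)?wmflabs\.org/|^https?://maps\.wikimedia\.org/'
)

def referer_allowed(referer):
    """True if the request's Referer header matches the maps allowlist."""
    return bool(referer) and ALLOWED_REFERER.search(referer) is not None

assert referer_allowed("https://maps-beta.wmflabs.org/some/page")
assert referer_allowed("https://maps.wikimedia.org/")
assert not referer_allowed("https://example.com/embed")
assert not referer_allowed(None)  # no Referer at all is rejected in this sketch
```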
[01:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:15:33] 6operations, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542029 (10Yurik) [01:16:44] 6operations, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542035 (10Yurik) [01:37:45] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542059 (10Dzahn) ``` #!/bin/bash # import a mailman list - config and archives # dzahn@wikimedia.org - 20150814 - T108073 LISTNAME=$1 IMPO... [01:46:16] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542069 (10Dzahn) one issue with a list that has "locked" in the name , which stopped the import script [01:46:36] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:52:27] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:05:36] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied [02:15:35] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542086 (10Dzahn) ``` ==> /var/log/mailman/mischief <== Aug 15 02:14:58 2015 (3431) Hostile listname: wikitech-announce.disabled.T100503 ``... [02:25:44] !log l10nupdate@tin Synchronized php-1.26wmf18/cache/l10n: l10nupdate for 1.26wmf18 (duration: 06m 37s) [02:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:05] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf18) at 2015-08-15 02:29:05+00:00 [02:30:22] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542087 (10Dzahn) "wikiit-l" broke everything, the listinfo page, manual ./list_lists and even a service restart because that also tries ./... 
[02:30:24] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1542090 (10Dzahn) [02:30:26] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1542088 (10Dzahn) 5Open>3Resolved a:3Dzahn [02:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 78 data above and 9 below the confidence bounds [02:31:07] PROBLEM - High load average on labstore1002 is CRITICAL 100.00% of data above the critical threshold [24.0] [03:03:57] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL 50.00% of data above the critical threshold [60.0] [03:45:28] RECOVERY - Persistent high iowait on labstore1002 is OK Less than 50.00% above the threshold [40.0] [04:01:26] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK No anomaly detected [04:15:46] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [04:38:17] !log killing some rsync processes on labstore1002 because iowaits are through the roof [04:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:43:30] PROBLEM - Last backup of the maps filesystem on labstore1002 is CRITICAL - Last run result was exit-code [04:44:07] PROBLEM - Last backup of the tools filesystem on labstore1002 is CRITICAL - Last run result was exit-code [04:45:17] RECOVERY - Disk space on labstore1002 is OK: DISK OK [04:49:07] PROBLEM - Last backup of the others filesystem on labstore1002 is CRITICAL - Last run result was exit-code [05:03:08] PROBLEM - Disk space on mw1123 is CRITICAL: DISK CRITICAL - free space: / 8178 MB (3% inode=93%) [05:08:37] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL 55.56% of data above the critical threshold [60.0] [05:14:26] RECOVERY - Persistent high iowait on labstore1002 is OK Less than 50.00% above the threshold [40.0] [05:33:08] RECOVERY - High load average on labstore1002 is OK Less than 50.00% above the threshold [16.0] [05:41:57] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Aug 15 05:41:57 UTC 2015 (duration 41m 56s) [05:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:45:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 28926 seconds ago, expected 28800 [05:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 29226 seconds ago, expected 28800 [05:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 29526 seconds ago, expected 28800 [06:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 29826 seconds ago, expected 28800 [06:05:18] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 30126 seconds ago, expected 28800 [06:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 30426 seconds ago, expected 28800 [06:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 30725 seconds ago, expected 28800 [06:20:08] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 31025 seconds ago, expected 28800 [06:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 31329 seconds ago, expected 28800 [06:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last 
ran 31626 seconds ago, expected 28800 [06:31:18] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:31:46] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw1203 is CRITICAL Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:33:17] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 31925 seconds ago, expected 28800 [06:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 32225 seconds ago, expected 28800 [06:45:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 32525 seconds ago, expected 28800 [06:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 32825 seconds ago, expected 28800 [06:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 33126 seconds ago, expected 28800 [06:56:06] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:07] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw1203 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:06] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 33426 seconds ago, expected 28800 [07:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 33726 seconds ago, expected 28800 [07:07:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [07:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34029 seconds ago, expected 28800 [07:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34326 seconds ago, expected 28800 [07:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34626 seconds ago, expected 28800 [07:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 34926 seconds ago, expected 28800 [07:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 35226 seconds ago, expected 28800 [07:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 35526 seconds ago, expected 28800 [07:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 35826 seconds ago, expected 28800 [07:41:07] PROBLEM - High load average on labstore1002 is CRITICAL 55.56% of data above the critical threshold [24.0] [07:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 36125 seconds ago, expected 28800 [07:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 36425 seconds ago, expected 28800 [07:50:37] RECOVERY - High load average on labstore1002 is OK Less than 50.00% above the threshold [16.0] [07:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 36725 seconds ago, expected 28800 [08:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37025 seconds ago, expected 28800 [08:04:16] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [08:05:16] 
PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37326 seconds ago, expected 28800 [08:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37626 seconds ago, expected 28800 [08:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 37926 seconds ago, expected 28800 [08:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 38226 seconds ago, expected 28800 [08:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 38526 seconds ago, expected 28800 [08:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 38826 seconds ago, expected 28800 [08:31:37] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 10.71% of data above the critical threshold [100000000.0] [08:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 39126 seconds ago, expected 28800 [08:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 39426 seconds ago, expected 28800 [08:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 39726 seconds ago, expected 28800 [08:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40026 seconds ago, expected 28800 [08:54:27] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0] [08:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40325 seconds ago, expected 28800 [09:00:07] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40625 seconds ago, expected 28800 [09:01:17] (03Abandoned) 10Giuseppe Lavagetto: Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/227725 (owner: 10Giuseppe Lavagetto) [09:01:36] (03Abandoned) 10Giuseppe Lavagetto: Assign mobileapps service to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/227726 (owner: 10Giuseppe Lavagetto) [09:01:50] (03Abandoned) 10Giuseppe Lavagetto: Setup LVS for mobileapps service on sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/227727 (owner: 10Giuseppe Lavagetto) [09:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 40925 seconds ago, expected 28800 [09:10:01] (03PS1) 10Giuseppe Lavagetto: puppet_compiler: Create the workdir as well [puppet] - 10https://gerrit.wikimedia.org/r/231755 [09:10:07] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 41225 seconds ago, expected 28800 [09:10:25] (03PS2) 10Giuseppe Lavagetto: puppet_compiler: Create the workdir as well [puppet] - 10https://gerrit.wikimedia.org/r/231755 [09:15:07] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 41525 seconds ago, expected 28800 [09:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 41825 seconds ago, expected 28800 [09:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 42126 seconds ago, expected 28800 [09:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 42426 seconds ago, expected 28800 [09:35:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 42726 seconds ago, expected 28800 [09:39:27] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542247 (10Kghbln) It's down again. Perhaps some kind of monitoring could be implemented to detect this until the migration to a stable successor system is implemented. 
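(Editor's note on the repeating payments1001 check_puppetrun alerts: this is a plain freshness check — the agent's last successful run is N seconds old and the check expects at most 28800 seconds, i.e. 8 hours, so the reported age simply keeps growing until puppet runs again at 12:35. A small sketch of that arithmetic, assuming a last-run timestamp is available; the names are illustrative, not the fundraising check script itself:)
```python
# Sketch of the staleness arithmetic behind "Puppet last ran 28926 seconds
# ago, expected 28800": CRITICAL once the last run is older than 8 hours.
import time

EXPECTED_MAX_AGE = 28800  # 8 hours, the "expected" value in the alerts

def check_puppetrun(last_run_epoch, now=None):
    """Return (status, message) for a puppet last-run freshness check."""
    now = time.time() if now is None else now
    age = int(now - last_run_epoch)
    if age > EXPECTED_MAX_AGE:
        return ("CRITICAL",
                "Puppet last ran %d seconds ago, expected %d" % (age, EXPECTED_MAX_AGE))
    return ("OK", "Puppet last ran %d seconds ago" % age)

# Example matching the first alert in this log: a run 28926 seconds ago.
print(check_puppetrun(last_run_epoch=0, now=28926))
```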
[09:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43026 seconds ago, expected 28800 [09:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43326 seconds ago, expected 28800 [09:45:36] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542259 (10Kghbln) Diffusion is a showstopper since it does not allow to download code, raw diffs only. That's what makes it kinda useless and ridiculous. [09:48:02] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542270 (10Glaisher) >>! In T83702#1542247, @Kghbln wrote: > Perhaps some kind of monitoring could be implemented to detect this until the migration to a stable successor syst... [09:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43626 seconds ago, expected 28800 [09:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 43926 seconds ago, expected 28800 [09:57:45] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542278 (10Kghbln) Ah, nobody there at operations. Was not aware of this. :p Yeah, since Diffusion is kinda useless since it does not allow downloads and also has a pretty con... [10:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 44226 seconds ago, expected 28800 [10:04:48] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542286 (10Glaisher) >>! In T83702#1542278, @Kghbln wrote: > Anybody aware of the fact that about every extension's page points to git.wikimedia.org? Yes. See {T108864}. [10:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 44525 seconds ago, expected 28800 [10:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 44825 seconds ago, expected 28800 [10:10:54] (03PS1) 10Yuvipanda: quarry: Remove duplication of clone_path and other variables [puppet] - 10https://gerrit.wikimedia.org/r/231759 [10:10:56] (03PS1) 10Yuvipanda: ores: Add role+class for the precached daemon [puppet] - 10https://gerrit.wikimedia.org/r/231760 [10:10:58] (03PS1) 10Yuvipanda: ores: Mark all roles requiring ores::base properly [puppet] - 10https://gerrit.wikimedia.org/r/231761 [10:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 45125 seconds ago, expected 28800 [10:17:33] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1542291 (10Krenair) [10:18:09] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1542292 (10Kghbln) Cool, it's in the making. :) [10:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 45426 seconds ago, expected 28800 [10:24:27] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 45726 seconds ago, expected 28800 [10:26:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed [10:27:07] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46026 seconds ago, expected 28800 [10:32:57] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [10:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46326 seconds ago, expected 28800 [10:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46626 seconds ago, expected 28800 [10:43:16] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 46926 seconds ago, expected 28800 [10:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 47226 seconds ago, expected 28800 [10:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 47526 seconds ago, expected 28800 [10:55:38] PROBLEM - puppet last run on snapshot1002 is CRITICAL puppet fail [11:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 47826 seconds ago, expected 28800 [11:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 48126 seconds ago, expected 28800 [11:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 48426 seconds ago, expected 28800 [11:12:18] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:13:57] RECOVERY - RAID on snapshot1002 is OK no RAID installed [11:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 48726 seconds ago, expected 28800 [11:15:56] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61459 bytes in 0.091 second response time [11:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49025 seconds ago, expected 28800 [11:23:46] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49325 seconds ago, expected 28800 [11:25:27] RECOVERY - RAID on snapshot1002 is OK no RAID installed [11:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49626 seconds ago, expected 28800 [11:30:42] 500 seem back on track, the small recent increase may have been created by some bots [11:35:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 49925 seconds ago, expected 28800 [11:40:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 50225 seconds ago, expected 28800 [11:45:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 50526 seconds ago, expected 28800 [11:50:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 50826 seconds ago, expected 28800 [11:54:17] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:55:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 51126 seconds ago, expected 28800 [11:56:06] RECOVERY - RAID on snapshot1002 is OK no RAID installed [12:00:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 51426 seconds ago, expected 28800 [12:02:38] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 51726 seconds ago, expected 28800 [12:07:47] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:17] RECOVERY - Router interfaces on cr1-ulsfo is OK host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 1, unused: 0 [12:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52026 seconds ago, expected 28800 [12:13:17] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:13:56] RECOVERY - Router interfaces on mr1-codfw is OK host 208.80.153.196, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 [12:14:17] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52326 seconds ago, expected 28800 [12:15:27] RECOVERY - RAID on snapshot1002 is OK no RAID installed [12:16:16] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 15 minutes ago with 0 failures [12:17:31] apergos: snapshot1002 alerts? [12:20:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52626 seconds ago, expected 28800 [12:25:06] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 52926 seconds ago, expected 28800 [12:26:58] RECOVERY - RAID on snapshot1002 is OK no RAID installed [12:28:59] 6operations, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542421 (10BBlack) To import some questions from IRC earlier: 1. Does maps needs its own cache cluster? - My opinion is that yes, it does, especially in early days when we don... 
[12:29:10] 6operations, 10Traffic, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542422 (10BBlack) [12:30:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet last ran 53226 seconds ago, expected 28800 [12:30:28] 6operations, 10Traffic, 3Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542014 (10BBlack) [12:35:16] RECOVERY - check_puppetrun on payments1001 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:36:47] RECOVERY - Router interfaces on cr1-eqdfw is OK host 208.80.153.198, interfaces up: 33, down: 0, dormant: 0, excluded: 2, unused: 0 [12:37:07] RECOVERY - Router interfaces on cr1-eqord is OK host 208.80.154.198, interfaces up: 33, down: 0, dormant: 0, excluded: 3, unused: 0 [12:44:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [12:45:07] RECOVERY - Disk space on uranium is OK: DISK OK [12:46:06] !log restarted gitblit on antimony, because Java is Awesome [12:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61460 bytes in 0.347 second response time [12:53:07] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:53:13] ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1002 is CRITICAL - Last run result was exit-code Coren Last backup missed because out of snapshot space - cleaned, but needs more frequent cleanup. (https://phabricator.wikimedia.org/T109176) [12:53:13] ACKNOWLEDGEMENT - Last backup of the others filesystem on labstore1002 is CRITICAL - Last run result was exit-code Coren Last backup missed because out of snapshot space - cleaned, but needs more frequent cleanup. (https://phabricator.wikimedia.org/T109176) [12:53:13] ACKNOWLEDGEMENT - Last backup of the tools filesystem on labstore1002 is CRITICAL - Last run result was exit-code Coren Last backup missed because out of snapshot space - cleaned, but needs more frequent cleanup. (https://phabricator.wikimedia.org/T109176) [13:02:07] PROBLEM - check_puppetrun on bellatrix is CRITICAL Puppet has 28 failures [13:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:07:07] RECOVERY - check_puppetrun on bellatrix is OK Puppet is currently enabled, last run 104 seconds ago with 0 failures [13:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:15:08] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 212 seconds ago with 0 failures [13:16:32] (03PS1) 10Faidon Liambotis: Switch US/TX to codfw [dns] - 10https://gerrit.wikimedia.org/r/231772 [13:19:49] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations: John Lewis sudo as 'list' on mailman staging VM - https://phabricator.wikimedia.org/T108349#1542501 (10JohnLewis) And confirmed (late). ``` johnflewis@fermium:~$ sudo service mailman status ● mailman.service - LSB: Mailman Master Queue Runner ``` Tha... [13:21:02] (03CR) 10Alex Monk: "Is this going to start sending traffic to through to the codfw apaches and databases, or just caches?" [dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [13:23:07] (03CR) 10Faidon Liambotis: "Just caches." 
[dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [13:23:18] ^ok :-) [13:28:15] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1542502 (10JohnLewis) Looking at usage; it's sparse so we can easily add a new IP via hiera alone once the autobound{lists} variables in role::lists are in hiera. [13:35:08] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:39:47] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [13:45:16] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 225 seconds ago with 0 failures [14:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:06:15] (03Abandoned) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 (owner: 10Merlijn van Deen) [14:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:10:17] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 207 seconds ago with 0 failures [14:15:58] RECOVERY - RAID on snapshot1002 is OK no RAID installed [14:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 169 seconds ago with 0 failures [14:48:14] 6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1542524 (10jcrespo) 3NEW [14:49:11] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1542532 (10jcrespo) Import on dbstore2002 finished. We have a 3-day lag on dbstore2002, but I prefer that than performing another manual import because it will be less error-prone a... [14:49:30] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1542536 (10jcrespo) 5Open>3Resolved [14:49:32] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1542537 (10jcrespo) [14:49:51] !log stopping kafka broker on analytics1012 to again try to figure out why camus can't consume from it [14:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:34] 6operations, 10hardware-requests: dbproxy servers for codfw - https://phabricator.wikimedia.org/T109116#1542547 (10jcrespo) Status: "It's complicated" [14:52:17] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1542548 (10jcrespo) [14:52:18] 6operations, 10hardware-requests: dbproxy servers for codfw - https://phabricator.wikimedia.org/T109116#1542549 (10jcrespo) [14:53:07] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:57] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:55:47] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 26 minutes ago with 0 failures [14:56:57] RECOVERY - RAID on snapshot1002 is OK no RAID installed [15:05:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:10:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:15:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 178 seconds ago with 0 failures [15:23:07] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:58] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:06] RECOVERY - RAID on snapshot1002 is OK no RAID installed [15:25:48] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 56 minutes ago with 0 failures [15:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:35:07] RECOVERY - Disk space on mw1123 is OK: DISK OK [15:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [15:43:16] (03PS1) 10BBlack: bits-legacy: remove beacon/statsv support [puppet] - 10https://gerrit.wikimedia.org/r/231777 [15:43:18] (03PS1) 10BBlack: bits-legacy: remove special https://bits redirects for secure wikis [puppet] - 10https://gerrit.wikimedia.org/r/231778 [15:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 212 seconds ago with 0 failures [15:52:04] (03PS1) 10Ottomata: Add param for auto.leader.rebalance.enable [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231780 [15:53:05] (03CR) 10Ottomata: [C: 032] Add param for auto.leader.rebalance.enable [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231780 (owner: 10Ottomata) [15:53:53] 6operations, 7HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1542593 (10Chmarkine) How about mapping download.Wikipedia.org to the text cluster, and then have it redirect to https://dumps.wikimedia.org? [15:55:52] (03PS1) 10Ottomata: Disable kafka auto leader rebalance [puppet] - 10https://gerrit.wikimedia.org/r/231781 [15:58:16] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:11] (03CR) 10Ottomata: [C: 032] Disable kafka auto leader rebalance [puppet] - 10https://gerrit.wikimedia.org/r/231781 (owner: 10Ottomata) [16:00:17] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [16:05:03] !log starting rolling restart of kafka brokers to apply auto leader rebalance enable = false [16:05:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:57] RECOVERY - puppet last run on analytics1012 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:07:26] <_joe_> !log removing manually core dumps from last night's outage on all appservers in eqiad, they occupy on average 30 GB/server [16:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:15:06] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 216 seconds ago with 0 failures [16:23:37] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
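(Editor's note on the kafka changes above: the two merged puppet patches add and then disable auto.leader.rebalance.enable in the brokers' server.properties, after which the brokers are roll-restarted to pick up the setting. A small sketch of verifying that setting on a broker; the properties path comes from the Kafka process check earlier in this log, while the parsing helper is an assumption for illustration:)
```python
# Sketch: confirm auto.leader.rebalance.enable=false landed in a broker's
# /etc/kafka/server.properties before or after the rolling restart.
def read_properties(path):
    """Parse a java-style .properties file into a dict (comments ignored)."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "=" in line:
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()
    return props

if __name__ == "__main__":
    props = read_properties("/etc/kafka/server.properties")
    # Kafka's shipped default for this property is "true" if unset.
    rebalance = props.get("auto.leader.rebalance.enable", "true")
    print("auto leader rebalance enabled:", rebalance)
```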
[16:25:37] RECOVERY - RAID on snapshot1002 is OK no RAID installed [16:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [16:45:08] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 198 seconds ago with 0 failures [16:57:21] (03PS1) 10Ottomata: Split webrequest camus import job into multiple jobs for different size topics [puppet] - 10https://gerrit.wikimedia.org/r/231785 [16:57:46] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:05:47] RECOVERY - RAID on snapshot1002 is OK no RAID installed [17:10:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:13:56] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:06] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 197 seconds ago with 0 failures [17:15:47] RECOVERY - RAID on snapshot1002 is OK no RAID installed [17:23:27] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542643 (10Yurik) 3NEW a:3BBlack [17:24:26] PROBLEM - puppet last run on snapshot1002 is CRITICAL puppet fail [17:26:32] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542651 (10Krenair) >>! In T109004#1542259, @Kghbln wrote: > Diffusion is a showstopper since it does not allow to download code, raw diffs only. That's what makes it kinda usel... [17:27:19] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542652 (10Yurik) [17:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:37:56] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:40:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [17:41:58] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542658 (10BBlack) I don't see that behavior, at least in Chrome, clicking your link out of gmail or phab. [17:43:47] RECOVERY - RAID on snapshot1002 is OK no RAID installed [17:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 215 seconds ago with 0 failures [17:46:07] 6operations, 3Discovery-Maps-Sprint: Varnish referrer filter is blocking links - https://phabricator.wikimedia.org/T109187#1542660 (10BBlack) ... but in any case, we *do* want to block referer in the long run to keep usage to our wikis only. the right way around this is to do it like a wiki: put a page up in... 
[17:46:57] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 24.14% of data above the critical threshold [100000000.0] [17:49:05] (03CR) 10Ottomata: [C: 032] Split webrequest camus import job into multiple jobs for different size topics [puppet] - 10https://gerrit.wikimedia.org/r/231785 (owner: 10Ottomata) [17:58:38] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:02:38] (03PS1) 10Ottomata: Revert split of webrequest imports in multiple Camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/231795 [18:04:17] (03CR) 10Ottomata: [C: 032] Revert split of webrequest imports in multiple Camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/231795 (owner: 10Ottomata) [18:05:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:10:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:12:07] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 203 seconds ago with 0 failures [18:18:16] RECOVERY - RAID on snapshot1002 is OK no RAID installed [18:27:38] ACKNOWLEDGEMENT - Outgoing network saturation on labstore1002 is CRITICAL 28.57% of data above the critical threshold [100000000.0] Coren Culprit killed, waiting for rolling average to go down. [18:33:27] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0] [18:35:07] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:40:06] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:45:07] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 207 seconds ago with 0 failures [18:53:17] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:56] PROBLEM - check_puppetrun on fdb2001 is CRITICAL Puppet has 34 failures [18:55:07] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:59:56] PROBLEM - check_puppetrun on fdb2001 is CRITICAL Puppet has 34 failures [19:04:56] PROBLEM - check_puppetrun on fdb2001 is CRITICAL Puppet has 34 failures [19:08:27] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:09:56] RECOVERY - check_puppetrun on fdb2001 is OK Puppet is currently enabled, last run 292 seconds ago with 0 failures [19:10:27] RECOVERY - RAID on snapshot1002 is OK no RAID installed [19:16:27] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:18] RECOVERY - RAID on snapshot1002 is OK no RAID installed [19:22:07] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542775 (10Kghbln) Well, talking about shared hosting without command line access. Still according to my experience the predominant environment out there even if some people ref... [19:22:27] PROBLEM - YARN NodeManager Node-State on analytics1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:24:18] RECOVERY - YARN NodeManager Node-State on analytics1041 is OK YARN NodeManager analytics1041.eqiad.wmnet:8041 Node-State: RUNNING [19:38:31] 6operations, 6Multimedia: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1542817 (10Tgr) >>!
In T92322#1419551, @Tgr wrote: > https://gerrit.wikimedia.org/r/#/c/222224/ will make this easy again. It did not, it killed statsd without sampling, and with samp... [19:44:37] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:46:36] RECOVERY - RAID on snapshot1002 is OK no RAID installed [19:53:27] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:17] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 28 minutes ago with 0 failures [20:16:38] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:36] RECOVERY - RAID on snapshot1002 is OK no RAID installed [20:29:16] 6operations, 10Math: Install-more-LaTeX-packages - https://phabricator.wikimedia.org/T109195#1542849 (10Krenair) [20:29:26] 6operations, 10Math: Install more LaTeX packages - https://phabricator.wikimedia.org/T109195#1542852 (10Krenair) [20:52:47] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:38] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:46] PROBLEM - SSH on snapshot1002 is CRITICAL - Socket timeout after 10 seconds [20:55:37] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [20:55:38] RECOVERY - SSH on snapshot1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [20:56:36] RECOVERY - RAID on snapshot1002 is OK no RAID installed [21:14:46] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:21] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1542881 (10Peachey88) > Maybe when those eqiad ones get reinstalled in T104747 they can be renamed to mw*? Or looking at it the other way shouldn't the codfw boxes be renamed inline with the naming conventions? [21:23:28] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:25:17] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [21:25:28] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1542882 (10Krenair) [21:27:14] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1542884 (10Krenair) [21:28:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed [21:35:10] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1542891 (10ori) >>! In T105009#1542881, @Peachey88 wrote: >> Maybe when those eqiad ones get reinstalled in T104747 they can be renamed to mw*? > > Or looking at it the other way shouldn't the codfw boxes be renamed inline w... [21:51:53] 6operations, 10Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1542895 (10demon) I see a "view raw file" button on [[ /diffusion/MW/browse/master/README | this file ]]. [21:55:47] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail [22:12:56] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:16:37] RECOVERY - RAID on snapshot1002 is OK no RAID installed [22:21:57] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1542896 (10GWicke) Another global outage triggered by a puppet config deploy: https://wikitech.wikimedia.org/wiki/Incident_documentation/2... [22:22:47] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:28:37] Hello. [22:28:43] I think something's gone badly wrong. [22:30:20] ? [22:30:54] Krenair: A few e-mails didn't get delivered, so I got my e-mail revoked and had to re-confirm it again [22:30:59] For, like, third time this year [22:32:56] odder, BounceHandler will do that if your email provider starts sending bounces [22:33:11] (again.) [22:33:12] what's going badly wrong about this? [22:33:33] Krenair: Yeah, but it's getting a bit irritating when you have to send 10+ e-mails [22:33:49] Why are your mails failing to deliver? [22:34:01] Yes, why are they. [22:36:52] Krenair: I get my messages, like talk page message notifications and Echo mentions, okay, never been any problems [22:37:35] But I just tried sending a couple of e-mails, and failed to send more than 1 [22:39:22] "Mail delivery failed: returning message to sender" apparently is the reason [22:39:26] not very useful [22:40:33] legoktm, around? [22:40:47] odder: yahoo, by any chance? [22:40:51] Ish [22:41:05] valhallasw`cloud: nope. [22:41:57] odder: the emails to yourself bounced? Or to other people? [22:42:02] odder: hm, wait, I think it's actually an spf issue, not a dmarc one, so it's not necessarily just yahoo [22:42:26] legoktm: I tried sending e-mail out, and get copies sent to myself [22:42:32] but I'm confused so I'll let legoktm try to understand it [22:42:49] but they apparently bounced, which resulted in my having to re-confirm my e-mail address [22:42:53] Did you receive those copies? [22:43:00] None [22:43:35] legoktm: Funnily enough, I did receive the e-mail reconfirmation mail [22:44:18] > $wgBounceRecordLimit = 5; [22:44:26] so after 5 bounces I think we unconfirm [22:44:32] yes, it's been triggering that according to BounceHandler.log [22:45:39] after 5 in a row, or 5 in total? [22:45:48] because I reconfirmed my e-mail, and only managed to send one e-mail [22:46:01] and then it unconfirmed you again? [22:46:10] I think it's 5 over some time period [22:46:11] Yep. [22:47:27] https://gerrit.wikimedia.org/r/#/c/168337/2/BounceHandler.php 7 days apparently [22:48:13] we should probably clear previous bounce records once you reconfirm [22:48:34] and you should talk to your mail provider about why they keep bouncing ;) [22:49:08] Would we keep a copy of the full bounce message somewhere? [22:49:37] https://phabricator.wikimedia.org/T99767 we don't [22:49:41] legoktm: Then why would I get all other e-mails okay? [22:49:50] I don't know :/ [22:50:30] unfortunately their customer service seem to be sleeping. how dare they. [22:51:07] legoktm, we don't in BH, sure, but maybe ops would have a copy somewhere on a mail server? 
[22:51:26] maybe, idk [22:52:10] odder: so in the meantime, let me remove your bounce records so you can at least go 5 more bounces before having to reconfirm [22:52:30] legoktm: thanks [22:56:24] !log removed 13 bounce_records for User:odder from bouncehandler database [22:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:18] 13, wow [23:03:57] odder: hope that helps, if it keeps happening we can look into saving the full bounce message from your emails if we have to... [23:04:06] * legoktm goes offline [23:04:24] legoktm: It does, thanks a lot [23:22:26] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:24:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed [23:51:36] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:52:27] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:16] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 24 minutes ago with 0 failures [23:55:17] RECOVERY - RAID on snapshot1002 is OK no RAID installed
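(Editor's note on the BounceHandler thread above: with $wgBounceRecordLimit = 5 and the 7-day window from gerrit change 168337, an address is unconfirmed once five bounces are recorded within the window, which is why clearing odder's 13 old records buys another five before the next unconfirm. A minimal sketch of that counting rule; the real extension is PHP and stores records in the bouncehandler database, so everything below is illustrative:)
```python
# Sketch of the BounceHandler unconfirm rule discussed above: unconfirm an
# address once $wgBounceRecordLimit (5) bounces are recorded within 7 days.
from datetime import datetime, timedelta

BOUNCE_RECORD_LIMIT = 5                      # $wgBounceRecordLimit
BOUNCE_RECORD_PERIOD = timedelta(days=7)     # window from the linked change

def should_unconfirm(bounce_timestamps, now=None):
    """True if enough recent bounces were recorded to unconfirm the email."""
    now = now or datetime.utcnow()
    recent = [t for t in bounce_timestamps if now - t <= BOUNCE_RECORD_PERIOD]
    return len(recent) >= BOUNCE_RECORD_LIMIT

# Example: five bounces in the last few days trips the limit; clearing the
# records (an empty list) resets the counter, as legoktm did for odder.
now = datetime(2015, 8, 15, 23, 0)
recent_bounces = [now - timedelta(days=d) for d in range(5)]
assert should_unconfirm(recent_bounces, now=now)
assert not should_unconfirm([], now=now)
```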