[00:00:42] !log pausing replication on dbstore2002
[00:00:48] Logged the message, Master
[00:04:26] PROBLEM - jmxtrans on analytics1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:04:45] PROBLEM - jmxtrans on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:04:56] PROBLEM - jmxtrans on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:05:15] PROBLEM - jmxtrans on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:09:56] RECOVERY - jmxtrans on analytics1022 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:14:56] RECOVERY - jmxtrans on analytics1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:17:16] RECOVERY - jmxtrans on analytics1012 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:17:27] RECOVERY - jmxtrans on analytics1021 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:43:07] (PS1) Springle: eventlogging purge no longer on m2-master [software] - https://gerrit.wikimedia.org/r/221561
[00:50:18] Ops-Access-Requests, operations, Patch-For-Review: Deployment Access to tin for Ellery Wulczyn - https://phabricator.wikimedia.org/T103782#1408802 (RobH) a: DarTar Assigning to @Dartar as @ellery's manager. Please reply back to task with approval for this access, or deny said request. Once you have...
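The jmxtrans PROBLEM/RECOVERY lines above are the output of a Nagios-style process-count check (check_procs matching the java command and the `-jar.+jmxtrans-all.jar` argument regex). A minimal sketch of that logic, not the actual Icinga configuration, reading a `ps`-style listing on stdin:

```shell
#!/bin/sh
# Sketch of the process-count logic behind the jmxtrans alerts above.
# Reads a "comm args" process listing on stdin, e.g. from: ps -eo comm,args
jmxtrans_check() {
    n=$(grep -cE '^java .*-jar.+jmxtrans-all\.jar')
    if [ "${n:-0}" -ge 1 ]; then
        echo "PROCS OK: $n process with command name java"
        return 0
    else
        echo "PROCS CRITICAL: 0 processes with command name java"
        return 2   # Nagios/Icinga exit code for CRITICAL
    fi
}
```

Usage would be `ps -eo comm,args | jmxtrans_check`; the alerts flip to RECOVERY as soon as one matching java process reappears.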
[00:51:44] !log restart replication on dbstore2002
[00:51:51] Logged the message, Master
[00:52:14] !log restart eventlogging auto-purge on m4
[00:52:20] Logged the message, Master
[01:00:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 46.15% of data above the critical threshold [500.0]
[01:04:54] Ops-Access-Requests, operations, Patch-For-Review: Deployment Access to tin for Ellery Wulczyn - https://phabricator.wikimedia.org/T103782#1408807 (Krenair) >>! In T103782#1405312, @RobH wrote: > * There are no objections on this or any sub-tasks. Please note that the sub-tasks can be hidden and viewa...
[01:16:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:14:05] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[02:14:42] ^ looking
[02:15:46] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 495 bytes in 3.003 second response time
[02:16:04] works for me, testing with curl
[02:16:11] works ok for icinga too, apparently :P
[02:17:08] yeah
[02:17:21] I still think this is some kind of monitoring flaw. whatever it is, we saw it briefly friday too
[02:18:10] anyways, there are multiple other things up in the air to be addressed on the LVS/pybal front. So long as these don't become real outages, I'm inclined to wait and see if they go away when other stuff is fixed this week.
[02:18:20] hm i didn't get paged about it friday. was it at night?
[02:18:48] * jgage checks email
[02:18:56] sorry not friday, wednesday
[02:19:04] about this same time, + ~30minutes
[02:19:28] interesting
[02:19:48] I think, looking at my phone, anyways
[02:20:36] !log l10nupdate Synchronized php-1.26wmf11/cache/l10n: (no message) (duration: 05m 53s)
[02:20:44] Logged the message, Master
[02:21:03] whatever evening it was, it went off and I thought it was esams at first until godog pointed out it was eqiad heh
[02:23:48] !log LocalisationUpdate completed (1.26wmf11) at 2015-06-29 02:23:47+00:00
[02:23:55] Logged the message, Master
[02:24:44] it was the start of looking into lvs/pybal there, which turned into the creation of multiple new pybal fix tickets, and a few live hacks to make things better
[02:25:01] but tbh I don't think any are directly related to the specific monitoring flap of ipv6/eqiad/text
[02:26:41] https://phabricator.wikimedia.org/T82747 + https://phabricator.wikimedia.org/T103921
[02:28:27] heh v4 health checks for v6 vips
[02:30:22] bblack: I didn't get the last part, not the same root cause as wed this time around?
[02:31:06] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures
[02:31:22] godog: I don't think I've found a root cause for this ipv6/eqiad/text flap at all. the one above is basically the same as the one from weds. everything else is just fallout from noticing other problems while looking into it, which probably aren't directly related.
[02:32:25] that flap could still be a specific problem to neon monitoring somehow, too
[02:32:52] (as in, I don't see any correlating evidence from pybal healthchecks or catchpoint saying it's real)
[02:35:00] (PS4) Andrew Bogott: Add a labsproject fact that doesn't rely on ldap config. [puppet] - https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684)
[02:35:02] (PS1) Andrew Bogott: Use the labsproject fact rather than $::instanceproject from ldap [puppet] - https://gerrit.wikimedia.org/r/221562
[02:35:04] (PS1) Andrew Bogott: Add an additional puppet config to use with minimal runs. [puppet] - https://gerrit.wikimedia.org/r/221563
[02:38:08] bblack: I see, it could be the check or neon too indeed, what seems recurring is that just mobile text and upload are involved in timing out
[02:39:40] also rendering and appservers but just once/twice
[02:41:44] are they all ipv6 only?
[02:43:52] the vast majority yeah
[02:47:36] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[02:58:28] operations, ops-eqiad: ms-be1015 idrac not working, no more sessions - https://phabricator.wikimedia.org/T104161#1408842 (fgiunchedi) NEW a: Cmjohnson
[03:04:56] RECOVERY - Disk space on analytics1012 is OK: DISK OK
[03:05:36] RECOVERY - Disk space on analytics1018 is OK: DISK OK
[03:06:45] RECOVERY - Disk space on analytics1021 is OK: DISK OK
[03:08:03] !log jmxtrans filled disks on all kafka brokers, 21GB log files. removed logs and restarted services.
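The "works for me, testing with curl" verification above amounts to hitting the VIP by hand and classifying the response the way the Icinga alert text reads. A small sketch of that classification (status strings modeled on the alert lines above, not taken from the actual check's code; the curl invocation is one plausible way to probe over IPv6):

```shell
#!/bin/sh
# Classify an HTTP status code into the OK/CRITICAL strings seen in the
# LVS alerts above. Probe manually with something like:
#   curl -6 -s -o /dev/null -w '%{http_code}' --max-time 5 http://text-lb.eqiad.wikimedia.org/
# (curl prints 000 when the connection fails or times out)
lvs_http_status() {
    case "$1" in
        000)     echo "CRITICAL: Connection timed out" ;;
        2??|3??) echo "HTTP OK: HTTP/1.1 $1" ;;
        *)       echo "HTTP CRITICAL: HTTP/1.1 $1" ;;
    esac
}
```

A 301 (the TLS redirect the RECOVERY line reports) counts as healthy, which is why both curl and icinga saw it "work" moments after the timeout.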
[03:08:10] Logged the message, Master
[03:08:55] RECOVERY - Disk space on analytics1022 is OK: DISK OK
[03:12:30] thanks for the heads up
[03:12:38] * jgage hits tab
[04:51:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 29 04:51:48 UTC 2015 (duration 51m 47s)
[04:51:59] Logged the message, Master
[05:28:15] (PS2) Ori.livneh: Use INotify to watch for configuration file changes [debs/pybal] - https://gerrit.wikimedia.org/r/213223
[06:01:55] (PS3) Ori.livneh: Use INotify to watch for configuration file changes [debs/pybal] - https://gerrit.wikimedia.org/r/213223
[06:01:57] (PS1) Ori.livneh: Make util.getboolean handle booleans gracefully [debs/pybal] - https://gerrit.wikimedia.org/r/221569
[06:02:17] (CR) Ori.livneh: [C: 2] "Tested" [debs/pybal] - https://gerrit.wikimedia.org/r/221569 (owner: Ori.livneh)
[06:02:35] (Merged) jenkins-bot: Make util.getboolean handle booleans gracefully [debs/pybal] - https://gerrit.wikimedia.org/r/221569 (owner: Ori.livneh)
[06:02:39] (CR) Ori.livneh: "Finally got around to testing this." [debs/pybal] - https://gerrit.wikimedia.org/r/213223 (owner: Ori.livneh)
[06:03:00] (CR) Ori.livneh: [C: 2] Use INotify to watch for configuration file changes [debs/pybal] - https://gerrit.wikimedia.org/r/213223 (owner: Ori.livneh)
[06:03:17] (Merged) jenkins-bot: Use INotify to watch for configuration file changes [debs/pybal] - https://gerrit.wikimedia.org/r/213223 (owner: Ori.livneh)
[06:09:25] PROBLEM - puppet last run on mw2109 is CRITICAL Puppet has 1 failures
[06:24:17] RECOVERY - puppet last run on mw2109 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:30:56] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail
[06:31:26] PROBLEM - puppet last run on analytics1038 is CRITICAL puppet fail
[06:33:17] PROBLEM - puppet last run on db1028 is CRITICAL Puppet has 1 failures
[06:33:17] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[06:33:36] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 1 failures
[06:34:06] PROBLEM - puppet last run on db1016 is CRITICAL Puppet has 1 failures
[06:34:16] PROBLEM - puppet last run on lvs2001 is CRITICAL Puppet has 1 failures
[06:34:25] PROBLEM - puppet last run on db1034 is CRITICAL Puppet has 1 failures
[06:34:35] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 1 failures
[06:34:35] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures
[06:34:36] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures
[06:34:36] PROBLEM - puppet last run on elastic1030 is CRITICAL Puppet has 1 failures
[06:34:45] PROBLEM - puppet last run on elastic1022 is CRITICAL Puppet has 1 failures
[06:34:46] PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 1 failures
[06:34:56] PROBLEM - puppet last run on tin is CRITICAL Puppet has 1 failures
[06:35:26] PROBLEM - puppet last run on mw1211 is CRITICAL Puppet has 1 failures
[06:35:26] PROBLEM - puppet last run on mw2146 is CRITICAL Puppet has 1 failures
[06:35:26] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 2 failures
[06:35:26] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 1 failures
[06:35:26] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:35:37] PROBLEM - puppet last run on mw1126 is CRITICAL Puppet has 1 failures
[06:36:16] PROBLEM - puppet last run on mw1129 is CRITICAL Puppet has 1 failures
[06:36:26] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures
[06:36:26] PROBLEM - puppet last run on mw2095 is CRITICAL Puppet has 1 failures
[06:46:25] RECOVERY - puppet last run on db1028 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:49] (CR) Alexandros Kosiaris: [C: -1] "Please just hold this just a bit. see https://gerrit.wikimedia.org/r/#/c/221065/ as to why" [puppet] - https://gerrit.wikimedia.org/r/221380 (owner: BBlack)
[06:47:35] RECOVERY - puppet last run on db1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:05] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:52:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 5 below the confidence bounds
[07:04:58] (PS1) Jcrespo: Enable jessie installer for db1022 [puppet] - https://gerrit.wikimedia.org/r/221572
[07:06:25] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[07:06:25] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:06:25] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:06:25] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:06:26] RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[07:06:26] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:07:07] RECOVERY - puppet last run on mw1211 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[07:07:07] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:07:15] RECOVERY - puppet last run on mw2146 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[07:07:15] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:16] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:07:27] RECOVERY - puppet last run on mw1126 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:07:45] RECOVERY - puppet last run on db1016 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[07:07:55] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:56] RECOVERY - puppet last run on mw1129 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:08:06] RECOVERY - puppet last run on elastic1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:15] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:08:16] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:08:45] RECOVERY - puppet last run on analytics1038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:09:02] (PS2) Alexandros Kosiaris: lvs: split monitors to respective files [puppet] - https://gerrit.wikimedia.org/r/221356
[07:09:04] (PS3) Alexandros Kosiaris: Merge all lvs::monitor_service manifests into one [puppet] - https://gerrit.wikimedia.org/r/221357
[07:09:06] (PS3) Alexandros Kosiaris: Remove more unused lvs::monitor manifests [puppet] - https://gerrit.wikimedia.org/r/221363
[07:09:06] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:09:08] (PS4) Alexandros Kosiaris: Merge lvs::hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[07:09:10] (PS1) Alexandros Kosiaris: lvs: add lb suffix to misc_web [puppet] - https://gerrit.wikimedia.org/r/221573
[07:13:44] (CR) Alexandros Kosiaris: [C: 2] lvs: split monitors to respective files [puppet] - https://gerrit.wikimedia.org/r/221356 (owner: Alexandros Kosiaris)
[07:14:11] (CR) Alexandros Kosiaris: [C: 2] Merge all lvs::monitor_service manifests into one [puppet] - https://gerrit.wikimedia.org/r/221357 (owner: Alexandros Kosiaris)
[07:23:39] (CR) Merlijn van Deen: [C: -1] "I think it should be (it's in ruby stdlib 1.9.3 and 2.0, at least), so using that sounds like a good idea." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: Andrew Bogott)
[07:25:55] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[07:26:28] (CR) Merlijn van Deen: Add a labsproject fact that doesn't rely on ldap config. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: Andrew Bogott)
[07:29:46] (PS1) Giuseppe Lavagetto: etcd: add conf1002 to the server list [dns] - https://gerrit.wikimedia.org/r/221575
[07:29:48] (PS1) Giuseppe Lavagetto: etcd: add conf1003 to the server list [dns] - https://gerrit.wikimedia.org/r/221576
[07:29:50] (PS1) Giuseppe Lavagetto: etcd: add conf100{2,3} to the client list [dns] - https://gerrit.wikimedia.org/r/221577
[07:29:55] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]
[07:30:15] (PS5) Alexandros Kosiaris: Merge lvs::monitor hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[07:31:58] (PS6) Alexandros Kosiaris: Merge lvs::monitor hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[07:32:09] (CR) Alexandros Kosiaris: [C: 2 V: 2] Merge lvs::monitor hashes [puppet] - https://gerrit.wikimedia.org/r/221361 (owner: Alexandros Kosiaris)
[07:37:02] (PS4) Alexandros Kosiaris: Remove more unused lvs::monitor manifests [puppet] - https://gerrit.wikimedia.org/r/221363
[07:37:09] (CR) Alexandros Kosiaris: [C: 2 V: 2] Remove more unused lvs::monitor manifests [puppet] - https://gerrit.wikimedia.org/r/221363 (owner: Alexandros Kosiaris)
[07:37:49] (PS2) Alexandros Kosiaris: lvs: add lb suffix to misc_web [puppet] - https://gerrit.wikimedia.org/r/221573
[07:37:55] (CR) Alexandros Kosiaris: [C: 2 V: 2] lvs: add lb suffix to misc_web [puppet] - https://gerrit.wikimedia.org/r/221573 (owner: Alexandros Kosiaris)
[07:39:52] (PS2) Alexandros Kosiaris: Add ensure parameter to ntp::daemon [puppet] - https://gerrit.wikimedia.org/r/220761
[07:39:54] (PS4) Alexandros Kosiaris: Use hiera to disable ntp fleet wise, with exceptions [puppet] - https://gerrit.wikimedia.org/r/220772
[07:40:10] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add ensure parameter to ntp::daemon [puppet] - https://gerrit.wikimedia.org/r/220761 (owner: Alexandros Kosiaris)
[07:40:43] (CR) Alexandros Kosiaris: [C: 2 V: 2] Use hiera to disable ntp fleet wise, with exceptions [puppet] - https://gerrit.wikimedia.org/r/220772 (owner: Alexandros Kosiaris)
[07:42:52] (CR) Alexandros Kosiaris: "Indeed, adding those" [puppet] - https://gerrit.wikimedia.org/r/220772 (owner: Alexandros Kosiaris)
[07:42:55] (PS5) Alexandros Kosiaris: Use hiera to disable ntp fleet wise, with exceptions [puppet] - https://gerrit.wikimedia.org/r/220772
[07:43:29] (CR) Alexandros Kosiaris: [C: 2 V: 2] Use hiera to disable ntp fleet wise, with exceptions [puppet] - https://gerrit.wikimedia.org/r/220772 (owner: Alexandros Kosiaris)
[07:46:45] !log disabling ntp everywhere expect selected hosts in anticipation for the leap second
[07:46:50] Logged the message, Master
[07:51:03] Ops-Access-Requests, operations, Patch-For-Review: Deployment Access to tin for Ellery Wulczyn - https://phabricator.wikimedia.org/T103782#1408982 (DarTar) approved.
[07:51:17] (PS1) Giuseppe Lavagetto: etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578
[07:53:35] RECOVERY - Router interfaces on cr2-eqiad is OK host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0
[07:55:34] (PS2) Giuseppe Lavagetto: etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578
[07:55:36] (PS1) Giuseppe Lavagetto: etcd: add conf1003 to the cluster [puppet] - https://gerrit.wikimedia.org/r/221579
[07:56:15] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: No response from NTP server
[07:57:08] <_joe_> akosiaris: do we have a leap second party tomorrow?
[07:57:18] <_joe_> it's gonna be at 3 AM there right?
[07:58:07] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: No response from NTP server
[08:00:55] actually, db1022 is giving some strange mysql results- i am going to depool it as soon as possible
[08:01:17] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:01:56] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: No response from NTP server
[08:02:58] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: No response from NTP server
[08:04:02] operations, Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1408991 (MoritzMuehlenhoff) I checked and the data for CVE-2015-4000 in the Debian Security Tracker seems slightly off: The current 1.0.2c from unstable includes the same fix (https://git.openssl.org/?p=ope...
[08:04:26] operations, Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1408992 (MoritzMuehlenhoff) I checked and the data for CVE-2015-4000 in the Debian Security Tracker seems slightly off: The current 1.0.2c from unstable includes the same fix (https://git.openssl.org/?p=ope...
[08:04:32] _joe_: yes. I 'll be at that party
[08:07:02] _joe_: it's 2 AM
[08:08:01] moritzm: he was talking UTC+3
[08:08:08] but you are UTC+2 so...
[08:08:16] both of you are correct!
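The "2 AM vs 3 AM" exchange above is just timezone arithmetic: the 2015 leap second was inserted at 23:59:60 UTC on June 30, the instant before 2015-07-01 00:00:00 UTC. A sketch with GNU date (`-d` is GNU-specific; the zone names are example picks for a UTC+2 and a UTC+3 location in summer time, not hosts from the log):

```shell
#!/bin/sh
# The leap second immediately precedes 2015-07-01 00:00:00 UTC; render
# that instant in two European summer-time zones to see why both
# answers in the channel were correct.
t='2015-07-01 00:00:00 UTC'
TZ=Europe/Berlin date -d "$t" '+%H:%M'   # CEST, UTC+2
TZ=Europe/Athens date -d "$t" '+%H:%M'   # EEST, UTC+3
```

So the same event lands at 02:00 local for someone on UTC+2 and 03:00 for someone on UTC+3.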
[08:08:47] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: No response from NTP server
[08:11:15] <_joe_> yeah I'll be at the party
[08:11:39] (PS1) Jcrespo: Depool db1022 for reinstall [mediawiki-config] - https://gerrit.wikimedia.org/r/221580
[08:12:38] <_joe_> !log adding conf1002 to the etcd cluster as a member
[08:12:42] Logged the message, Master
[08:15:11] (CR) Jcrespo: [C: 2] Depool db1022 for reinstall [mediawiki-config] - https://gerrit.wikimedia.org/r/221580 (owner: Jcrespo)
[08:15:34] ^gonna deploy this right away
[08:16:58] (CR) Giuseppe Lavagetto: [C: 2] etcd: add conf1002 to the server list [dns] - https://gerrit.wikimedia.org/r/221575 (owner: Giuseppe Lavagetto)
[08:20:58] !log jynus Synchronized wmf-config/db-eqiad.php: Depool db1022 for reinstall (duration: 00m 12s)
[08:21:03] Logged the message, Master
[08:22:59] (PS3) Giuseppe Lavagetto: etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578
[08:23:44] (CR) jenkins-bot: [V: -1] etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578 (owner: Giuseppe Lavagetto)
[08:25:02] (PS4) Giuseppe Lavagetto: etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578
[08:25:43] (CR) jenkins-bot: [V: -1] etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578 (owner: Giuseppe Lavagetto)
[08:27:23] (PS5) Giuseppe Lavagetto: etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578
[08:30:59] (PS6) Giuseppe Lavagetto: etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578
[08:31:56] (PS3) Hashar: contint: Create symlink for composer in /usr/local/bin/ [puppet] - https://gerrit.wikimedia.org/r/220658 (owner: Legoktm)
[08:32:13] (CR) Hashar: [V: 2] "Rebased" [puppet] - https://gerrit.wikimedia.org/r/220658 (owner: Legoktm)
[08:32:45] (CR) Giuseppe Lavagetto: [C: 2] etcd: configure and add conf1002 [puppet] - https://gerrit.wikimedia.org/r/221578 (owner: Giuseppe Lavagetto)
[08:35:06] PROBLEM - NTP on wtp1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:35:45] PROBLEM - NTP on elastic1021 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:07] PROBLEM - NTP on mc1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:07] PROBLEM - NTP on elastic1027 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:15] PROBLEM - NTP on mw1068 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:27] PROBLEM - NTP on logstash1004 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:27] PROBLEM - NTP on labsdb1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:37] PROBLEM - NTP on mw1025 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:37] PROBLEM - NTP on elastic1018 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:55] PROBLEM - NTP on db1059 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:55] PROBLEM - NTP on etcd1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:56] PROBLEM - NTP on db1015 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:56] PROBLEM - NTP on mw1052 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:56] PROBLEM - NTP on helium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:56] PROBLEM - NTP on cp3014 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:56] PROBLEM - NTP on cp3016 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:57] PROBLEM - NTP on db1018 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:57] PROBLEM - NTP on elastic1012 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:58] PROBLEM - NTP on db1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:58] PROBLEM - NTP on db1046 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:36:59] PROBLEM - NTP on lvs3001 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:05] PROBLEM - NTP on db2018 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:05] PROBLEM - NTP on mw1228 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:15] PROBLEM - NTP on mw1173 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:15] PROBLEM - NTP on mw1008 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:15] PROBLEM - NTP on mw1061 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:15] PROBLEM - NTP on mw2083 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:16] PROBLEM - NTP on mw1222 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:16] PROBLEM - NTP on mw1144 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:16] PROBLEM - NTP on cp2026 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:16] PROBLEM - NTP on mw2096 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:16] PROBLEM - NTP on ms-fe2001 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:25] PROBLEM - NTP on cp2005 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:25] PROBLEM - NTP on mw2093 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:25] PROBLEM - NTP on mw2045 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:27] <_joe_> akosiaris: we might want to silence that alarm?
[08:37:27] PROBLEM - NTP on mw2066 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:27] PROBLEM - NTP on es2001 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:27] PROBLEM - NTP on mw2212 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:27] PROBLEM - NTP on mw2104 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:27] PROBLEM - NTP on mc2011 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:27] PROBLEM - NTP on db2059 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:28] PROBLEM - NTP on mw2079 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:28] PROBLEM - NTP on mw2073 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:28] <_joe_> :P
[08:37:29] PROBLEM - NTP on db1040 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:29] PROBLEM - NTP on mw1126 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:30] PROBLEM - NTP on mw1217 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:46] PROBLEM - NTP on mw2123 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:46] PROBLEM - NTP on mw2003 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:46] PROBLEM - NTP on db2064 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:46] PROBLEM - NTP on subra is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:46] PROBLEM - NTP on mw2127 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:46] PROBLEM - NTP on mw2022 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:46] PROBLEM - NTP on mw2206 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:47] PROBLEM - NTP on mw2184 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:47] PROBLEM - NTP on labvirt1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:48] PROBLEM - NTP on db1034 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:48] PROBLEM - NTP on mw2134 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:37:49] PROBLEM - NTP on mw2114 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:06] PROBLEM - NTP on iron is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:06] PROBLEM - NTP on tin is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:07] PROBLEM - NTP on db1051 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:16] PROBLEM - NTP on cp4008 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:18] PROBLEM - NTP on mw1175 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:18] PROBLEM - NTP on mw1054 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:18] PROBLEM - NTP on mw1153 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:25] PROBLEM - NTP on mw1166 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:36] PROBLEM - NTP on elastic1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:36] PROBLEM - NTP on mw1060 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:37] PROBLEM - NTP on db1023 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:37] PROBLEM - NTP on dataset1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:37] PROBLEM - NTP on mw1088 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:37] PROBLEM - NTP on gallium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:37] PROBLEM - NTP on mw1170 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:38] PROBLEM - NTP on db1021 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:45] PROBLEM - NTP on ruthenium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:45] PROBLEM - NTP on mw1176 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:55] PROBLEM - NTP on mw1211 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:56] PROBLEM - NTP on db1042 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:38:56] PROBLEM - NTP on mw1213 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:05] PROBLEM - NTP on mw1235 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:05] PROBLEM - NTP on mw1065 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:05] PROBLEM - NTP on polonium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:05] PROBLEM - NTP on mw1118 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:05] PROBLEM - NTP on cp1061 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:05] PROBLEM - NTP on mw1249 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:05] PROBLEM - NTP on logstash1006 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:06] PROBLEM - NTP on logstash1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:06] PROBLEM - NTP on db1028 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:11] (PS2) Giuseppe Lavagetto: etcd: add conf1003 to the cluster [puppet] - https://gerrit.wikimedia.org/r/221579
[08:39:15] PROBLEM - NTP on mw1129 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:15] PROBLEM - NTP on mw1195 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:15] PROBLEM - NTP on holmium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:15] PROBLEM - NTP on mw1119 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:15] PROBLEM - NTP on mw1011 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:17] PROBLEM - NTP on mw1120 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:28] PROBLEM - NTP on cp4019 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:28] PROBLEM - NTP on cp4004 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:28] PROBLEM - NTP on cp4014 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:28] PROBLEM - NTP on cp4001 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:36] that's me not having scheduled downtime
[08:39:36] PROBLEM - NTP on lithium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:36] PROBLEM - NTP on db1067 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:40] I am already doing it
[08:39:42] (CR) Giuseppe Lavagetto: [C: 2] etcd: add conf1003 to the server list [dns] - https://gerrit.wikimedia.org/r/221576 (owner: Giuseppe Lavagetto)
[08:39:45] PROBLEM - NTP on mw1177 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:45] PROBLEM - NTP on mw1123 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:45] PROBLEM - NTP on wtp1005 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:46] I 'll kill icinga-wm in the meantime
[08:39:47] PROBLEM - NTP on cp1058 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:47] PROBLEM - NTP on cp3041 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:47] PROBLEM - NTP on cp3008 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:39:58] <_joe_> akosiaris: uhm ok
[08:40:08] <_joe_> I'll check etcd status by hand
[08:40:25] PROBLEM - NTP on sodium is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:26] PROBLEM - NTP on pc1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:26] PROBLEM - NTP on wtp2015 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:27] PROBLEM - NTP on snapshot1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:27] PROBLEM - NTP on elastic1024 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:28] _joe_: why not icinga web interface ?
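The admission above ("that's me not having scheduled downtime") points at the step that would have suppressed this alert flood: scheduling downtime before the mass NTP stop. That can be scripted against Icinga's external command file with the standard SCHEDULE_SVC_DOWNTIME command; the command-file path, host, service, author, and comment below are illustrative assumptions, not taken from this infrastructure.

```shell
#!/bin/sh
# Write a SCHEDULE_SVC_DOWNTIME external command in Icinga's format:
# [timestamp] SCHEDULE_SVC_DOWNTIME;host;service;start;end;fixed;trigger;duration;author;comment
schedule_svc_downtime() {    # $1=command file  $2=host  $3=service  $4=seconds
    now=$(date +%s)
    printf '[%s] SCHEDULE_SVC_DOWNTIME;%s;%s;%s;%s;1;0;0;ops;leap second prep\n' \
        "$now" "$2" "$3" "$now" "$((now + $4))" >> "$1"
}

# Demo against a temp file; on a real Icinga host this would target the
# configured command file (e.g. something like /var/lib/icinga/rw/icinga.cmd).
cmdfile=$(mktemp)
schedule_svc_downtime "$cmdfile" mw1008 NTP 7200
cat "$cmdfile"
```

Looping that over the affected hosts before disabling ntpd would have kept icinga-wm quiet without killing the bot.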
[08:40:28] PROBLEM - NTP on cp2013 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:28] PROBLEM - NTP on db2040 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:29] PROBLEM - NTP on db2068 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:29] PROBLEM - NTP on cp2020 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:30] PROBLEM - NTP on mw2092 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:30] PROBLEM - NTP on install2001 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:31] PROBLEM - NTP on db1052 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:31] PROBLEM - NTP on elastic1019 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:32] PROBLEM - NTP on mw1208 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:40:38] <_joe_> akosiaris: yeah I meant that way
[08:40:44] lol, ok
[08:41:03] <_joe_> that means doing clicks and not just switching terminals
[08:41:26] <_joe_> my scheduler tells me it's much more expensive
[08:41:43] ah, context switching
[08:41:46] yup it is
[08:43:10] (03CR) 10Hashar: [C: 031] ";)" [puppet] - 10https://gerrit.wikimedia.org/r/221150 (https://phabricator.wikimedia.org/T102544) (owner: 10Andrew Bogott)
[08:43:30] (03CR) 10Hashar: [C: 031] Remove the wait-on-NFS code from labs instance firstboot. [puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) (owner: 10Andrew Bogott)
[08:43:46] <_joe_> did you notice I just assumed my brain is powered by an OS kernel and you did find that completely normal?
[08:44:05] there are 4 hosts on "puppet cert -l" on palladium
[08:44:24] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: add conf1003 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/221579 (owner: 10Giuseppe Lavagetto)
[08:44:30] will investigate later
[08:46:34] and 2 unaccepted keys on salt
[08:46:50] <_joe_> jynus: maybe failed installations?
[08:47:07] _joe_, I will investigate later
[08:47:29] I want to finish my installation later- just wanted to "soft log"
[08:47:41] s/later/first/
[08:49:03] <_joe_> ok
[08:49:21] <_joe_> just suggesting btw, I'm working on the etcd cluster atm
[08:49:31] <_joe_> !log joined conf1003 to the etcd cluster
[08:49:35] Logged the message, Master
[08:49:43] sure, sorry if it disturbed you, I didn't mean to
[08:50:32] <_joe_> oh you didn't :)
[08:51:45] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: add conf100{2,3} to the client list [dns] - 10https://gerrit.wikimedia.org/r/221577 (owner: 10Giuseppe Lavagetto)
[08:53:52] (03PS1) 10Giuseppe Lavagetto: etcd: add conf100{2,3} to the client lists of the other datacenters [dns] - 10https://gerrit.wikimedia.org/r/221584
[08:55:44] what version of rsvg are we running in prod?
[08:55:59] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: add conf100{2,3} to the client lists of the other datacenters [dns] - 10https://gerrit.wikimedia.org/r/221584 (owner: 10Giuseppe Lavagetto)
[08:57:43] should I back up the ssh host keys?
[09:01:43] matanya: 2.36.1 on precise, 2.40.2 on trusty
[09:02:19] thank you moritzm
[09:05:28] (03PS1) 10Giuseppe Lavagetto: etcd: remove join-time params from conf1002 [puppet] - 10https://gerrit.wikimedia.org/r/221585
[09:05:30] (03PS1) 10Giuseppe Lavagetto: etcd: remove join-time params from conf1003 [puppet] - 10https://gerrit.wikimedia.org/r/221586
[09:06:44] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: remove join-time params from conf1002 [puppet] - 10https://gerrit.wikimedia.org/r/221585 (owner: 10Giuseppe Lavagetto)
[09:08:25] (03PS1) 10Alexandros Kosiaris: package_builder: Update README.md [puppet] - 10https://gerrit.wikimedia.org/r/221587
[09:12:22] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: remove join-time params from conf1003 [puppet] - 10https://gerrit.wikimedia.org/r/221586 (owner: 10Giuseppe Lavagetto)
[09:19:32] (03PS2) 10Alexandros Kosiaris: package_builder: Update README.md [puppet] - 10https://gerrit.wikimedia.org/r/221587
[09:19:34] (03PS1) 10Alexandros Kosiaris: apt::daemon: force hiera lookup for ensure argument [puppet] - 10https://gerrit.wikimedia.org/r/221588
[09:20:08] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] package_builder: Update README.md [puppet] - 10https://gerrit.wikimedia.org/r/221587 (owner: 10Alexandros Kosiaris)
[09:20:19] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] apt::daemon: force hiera lookup for ensure argument [puppet] - 10https://gerrit.wikimedia.org/r/221588 (owner: 10Alexandros Kosiaris)
[09:29:46] (03PS1) 10Alexandros Kosiaris: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/221589
[09:31:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/221589 (owner: 10Alexandros Kosiaris)
[09:43:20] (03CR) 10Hashar: [C: 031] Use the labsproject fact rather than $::instanceproject from ldap [puppet] - 10https://gerrit.wikimedia.org/r/221562 (owner: 10Andrew Bogott)
[09:43:26] (03PS10) 10Addshore: rsync wikidata json dumps to labs /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885)
[09:43:38] (03CR) 10Jcrespo: "Change has been applied to the master, but not to the slave- purging is not being done there currently." [software] - 10https://gerrit.wikimedia.org/r/221561 (owner: 10Springle)
[10:16:59] <_joe_> !log starting removal of etcd1003 from the etcd cluster
[10:17:03] Logged the message, Master
[10:20:50] (03PS1) 10Giuseppe Lavagetto: etcd: removing etcd1003 from the client list [dns] - 10https://gerrit.wikimedia.org/r/221595
[10:20:52] (03PS1) 10Giuseppe Lavagetto: etcd: removing etcd1003 from the servers list [dns] - 10https://gerrit.wikimedia.org/r/221596
[10:23:14] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: removing etcd1003 from the client list [dns] - 10https://gerrit.wikimedia.org/r/221595 (owner: 10Giuseppe Lavagetto)
[10:26:08] (03PS10) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065
[10:31:13] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: No anomaly detected
[10:32:22] <_joe_> !log effectively removing etcd1003 from the cluster
[10:32:27] Logged the message, Master
[10:38:48] (03PS1) 10Giuseppe Lavagetto: etcd: decommission etcd1003 [puppet] - 10https://gerrit.wikimedia.org/r/221599
[10:39:27] PROBLEM - etcd service on etcd1003 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state inactive
[10:46:35] <_joe_> and I scheduled downtime...
[10:49:01] (03PS1) 10Matanya: add unibas.ch to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600
[10:50:11] (03PS2) 10Giuseppe Lavagetto: etcd: decommission etcd1003 [puppet] - 10https://gerrit.wikimedia.org/r/221599
[10:50:49] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Update jenkins-debian-glue packages on Jessie to v0.13.0 - https://phabricator.wikimedia.org/T102106#1409305 (10hashar)
[10:51:02] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins: Please refresh Jenkins package on apt.wikimedia.org to 1.609.1 - https://phabricator.wikimedia.org/T103343#1409307 (10hashar)
[10:52:07] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: decommission etcd1003 [puppet] - 10https://gerrit.wikimedia.org/r/221599 (owner: 10Giuseppe Lavagetto)
[10:53:24] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[10:55:14] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 9.03 ms
[10:57:31] (03Abandoned) 10Hashar: (WIP) ferm: lame tests for service template (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/205274 (owner: 10Hashar)
[11:00:02] <_joe_> !log shutting down etcd1003, cleaning exported resources
[11:00:06] Logged the message, Master
[11:04:25] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:06:18] <_joe_> this should go away in a few minutes, I ran puppet node clean
[11:45:22] PROBLEM - puppet last run on mw1058 is CRITICAL: Puppet has 1 failures
[11:46:16] 6operations, 7discovery-system: Install etcd in multiple rows/racks - https://phabricator.wikimedia.org/T101713#1409408 (10Joe) I added conf1002 and conf1003 too, and removed etcd1003 so that now the ganeti-based hosts don't have the quorum by themselves. I may remove them altogether at a later point in time
[11:46:32] 6operations, 10Traffic, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1409411 (10Joe)
[11:46:33] 6operations, 7discovery-system: Install etcd in multiple rows/racks - https://phabricator.wikimedia.org/T101713#1409409 (10Joe) 5Open>3Resolved a:3Joe
[11:47:03] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Create a confd template for pybal files that will work with our etcd schema. - https://phabricator.wikimedia.org/T101858#1409416 (10Joe)
[11:47:13] 6operations, 7discovery-system: Build a python-etcd deb package for all current WMF platforms - https://phabricator.wikimedia.org/T101971#1409418 (10Joe)
[11:53:08] (03PS21) 10Yuvipanda: labs: Centralize config of which projects have NFS enabled [puppet] - 10https://gerrit.wikimedia.org/r/218637 (https://phabricator.wikimedia.org/T102403)
[11:58:58] mark: interrupting rsync now
[11:59:06] !log interrupt rsync on labstore1001 to prevent it from copying mwoffliner files
[11:59:11] Logged the message, Master
[11:59:39] mark: interesting, the exclusions file already contains mwoffliner
[11:59:48] then it's not working :)
[12:00:21] oh
[12:00:23] RECOVERY - puppet last run on mw1058 is OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:00:24] mark: indeed, it has others/mwoffliner
[12:00:24] because it's on others/
[12:00:26] but this predates that
[12:00:27] should be project/mwoffliner
[12:00:28] same for video
[12:00:29] yep
[12:00:37] let me make that change
[12:00:47] just have both
[12:00:49] yeah
[12:00:55] I wonder if root/exclusions is puppetized?
[12:01:03] don't think so
[12:01:08] cool
[12:01:11] then it wouldn't be in /root :P
[12:01:37] * matanya heard his project mentioned
[12:01:41] we have helper scripts in /root on labcontrol1001 that are puppetized
[12:02:24] then we need to change that
[12:02:24] mark: should I add maps too? that has an others entry, no project entry
[12:02:36] yes
[12:02:45] hmm, tools also has an others entry but no maps entry
[12:02:52] but don't we want to rsync the snapshot of tools?
[12:03:01] what do you mean?
[12:03:09] > - /others/tools
[12:03:12] is in the excludes file
[12:03:22] sorry, i can't follow
[12:03:24] which means the intent was to not rsync a copy of the tools backup?
[12:03:36] the intent is to save the files on the corrupted fs
[12:03:53] right, so we want to save the files belonging to the tools project as well, right?
[12:04:00] yes
[12:04:12] which are on that fs as well right?
[12:05:16] yes, they are on /mnt/broken/project/tools
[12:05:25] so they won't be hit by the exclusion file, I guess.
[12:05:34] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: removing etcd1003 from the servers list [dns] - 10https://gerrit.wikimedia.org/r/221596 (owner: 10Giuseppe Lavagetto)
[12:05:41] I'm happy with the exclusion file as is, mark do you want to take a look before I start it again?
[12:06:01] seems fine
[12:06:12] alright
[12:06:35] !log excluded maps, mwoffliner and video project from rsync of broken FS to speed it up
[12:06:39] Logged the message, Master
[12:06:55] !log restarting rsync with new exclusions file on labstore1002 to codfw
[12:06:59] Logged the message, Master
[12:08:00] I am performing a full backup of db1022 sqldata to dbstore1001, limiting rate to 20MB/s, ETA 2h
[12:09:56] mark: do you know approximately how long the 'sending incremental file list' stage lasts?
[12:10:03] like, should I stare at it or context switch?
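The fix discussed above hinges on how rsync matches exclude rules: patterns in an `--exclude-from` file are matched against paths relative to the transfer root, so data that used to live under `others/` and now lives under `project/` needs an entry for both locations. A minimal sketch — the file path, entries, and the commented rsync invocation are illustrative, not the real labstore configuration:

```shell
# Hypothetical exclusions file mirroring the entries agreed on above
# ("just have both" the others/ and project/ paths).
cat > /tmp/exclusions <<'EOF'
- /others/mwoffliner
- /project/mwoffliner
- /others/video
- /project/video
- /others/maps
- /project/maps
EOF

# The actual transfer would look something like (hosts are made up):
#   rsync -a --exclude-from=/tmp/exclusions /mnt/broken/ labstore2001:/backup/
# Leading "/" in a rule anchors it to the transfer root (/mnt/broken/),
# which is why /project/tools is still copied: nothing matches it.
grep -c '^- /' /tmp/exclusions
```

The count printed at the end (6) is just a sanity check that all rules were written.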
[12:10:07] don't wait
[12:10:09] I guess I should context switch and check back on it later
[12:10:09] ok
[12:10:15] it's probably just going over all the files that haven't changed
[12:10:17] which might take days
[12:10:21] heh
[12:10:27] until it gets past mwoffliner
[12:10:29] restarting the rsync isn't cheap, I guess.
[12:10:34] no
[12:10:45] mark: can we just rm -rf it? I guess we can't since it's mounted ro
[12:10:53] let's not
[12:10:59] yeah
[12:11:33] I literally had a dream the other night where I ran my script that lists projects with full NFS and it returned '10'
[12:11:39] * YuviPanda sighs
[12:11:56] currently at 34!
[12:12:11] brb
[12:12:58] (03PS13) 10Giuseppe Lavagetto: move text backend_random into "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220645 (owner: 10BBlack)
[12:23:07] reported https://phabricator.wikimedia.org/T104189
[12:28:18] (03CR) 10Giuseppe Lavagetto: [C: 032] "Re-verified with the compiler, the small unevenness in weights for the random backend is only present in esams, and it's a 28% discrepancy." [puppet] - 10https://gerrit.wikimedia.org/r/220645 (owner: 10BBlack)
[12:40:32] (03PS4) 10Giuseppe Lavagetto: varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815
[12:40:55] YuviPanda: heads up, i am going to kill NFS
[12:41:06] matanya: on the video project?
[12:41:07] :)
[12:41:09] yes
[12:41:20] matanya: sure! everything or everything except /data/scratch?
[12:41:47] YuviPanda: No, the other meaning of kill (saturate)
[12:41:58] matanya: oh, uhm. why?
[12:42:18] i will be copying all the data i need from NFS to /data/scratch
[12:42:39] <_joe_> ionice won't help
[12:42:45] matanya: oh, don't do that. we can do that on the serverside easier
[12:42:54] <_joe_> use scp locally and limit bandwidth maybe?
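The server-side approach _joe_ suggests can be sketched as follows; this is a reference sketch only, with hypothetical paths, and mirrors the 20MB/s cap used for the db1022 backup logged above:

```shell
# Illustrative commands, not taken from the log; paths/hosts are made up.
#
# Run the copy on the file server itself in the "idle" I/O scheduling
# class (-c3): it only gets disk time when nothing else wants it, so
# NFS clients are not starved.  rsync's --bwlimit is in KB/s, so 20480
# is roughly 20 MB/s:
#
#   ionice -c3 rsync -a --bwlimit=20480 /srv/project/video/ /srv/scratch/video/
#
# scp can rate-limit too; its -l option takes kbit/s, so ~160000 kbit/s
# is about the same 20 MB/s:
#
#   scp -l 160000 big-file.tar labstore1002:/srv/scratch/
```

Note that `ionice` classes only take effect with I/O schedulers that honor them (CFQ/BFQ), which is one reason the bandwidth cap is worth having as well.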
[12:43:02] that was the plan
[12:43:03] <_joe_> yeah server-side you can use ionice
[12:43:21] <_joe_> which is gonna be soo much better
[12:43:36] ok, can i please get help with that ?
[12:43:47] can't do it on my own ofc
[12:44:29] matanya: hmm, can you file a bug detailing exactly what you wanted to do? and we'll figure out a way of doing that without killing anything :0
[12:44:54] sure, thanks much, giving heads up is a good idea :)
[12:45:16] matanya: totally is :)
[12:45:25] /data/scratch is on NFS too ?
[12:46:48] YuviPanda: ^
[12:46:51] <_joe_> matanya: it's just not backed up
[12:47:04] <_joe_> it is
[12:47:06] ah, so that won't help much
[12:47:28] any local storage i can use ?
[12:47:38] matanya: yes it is, just not backed up and on a slightly different setup
[12:47:48] matanya: local storage > /data/scratch > /data/project
[12:48:25] local storage == not on labstore ?
[12:48:57] matanya: yes, local storage == on instances, but that's capped atm and I guess increasing that causes local problems...
[12:49:49] (03CR) 10Jcrespo: [C: 031] "Ignore my last comment, I was looking at the wrong host." [software] - 10https://gerrit.wikimedia.org/r/221561 (owner: 10Springle)
[12:49:53] so i am stuck here, i have 20G locally, that is not enough for more than 2-3 files at the same time
[12:49:58] what would you suggest ?
[12:50:03] <_joe_> 20 G?
[12:50:16] /dev/vda3 19G 1.7G 16G 10% /
[12:50:18] <_joe_> don't you have more than that locally?
[12:50:25] nope
[12:50:27] <_joe_> matanya: you can add additional juice
[12:50:42] srv ?
[12:50:45] <_joe_> with the labs::lvm classes AFAIR
[12:50:56] <_joe_> but YuviPanda will be more precise
[12:51:15] matanya: oh you can allocate up to 160G
[12:51:26] matanya: labs::lvm::srv or a similarly named class
[12:52:01] (03CR) 10Yuvipanda: "The reasons for this being in puppet than hiera is that we explicitly want these to be ops controlled - if your project needs NFS, you com" [puppet] - 10https://gerrit.wikimedia.org/r/218637 (https://phabricator.wikimedia.org/T102403) (owner: 10Yuvipanda)
[12:52:13] matanya: is 160G enough for your needs?
[12:52:24] i will try, i hope it does
[12:52:38] alright!
[12:52:43] it won't be enough for wikimania, but that will be handled on its own
[12:52:51] ok!
[12:53:10] <_joe_> matanya: you can use multiple instances in case of need and do nfs between them!
[12:53:26] * _joe_ hides before yuvi kills him
[12:53:32] good point! :D
[12:53:35] (03PS22) 10Yuvipanda: labs: Centralize config of which projects have NFS enabled [puppet] - 10https://gerrit.wikimedia.org/r/218637 (https://phabricator.wikimedia.org/T102403)
[12:53:47] (03CR) 10Yuvipanda: [C: 032 V: 032] "Merge and pray!" [puppet] - 10https://gerrit.wikimedia.org/r/218637 (https://phabricator.wikimedia.org/T102403) (owner: 10Yuvipanda)
[12:54:08] (03CR) 10Andrew Bogott: """ is more distinctive than "unknown". I'll put in a check to error out if that value is unset in labs." [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott)
[12:55:23] NFS is disabled for new projects by default now! \o/
[12:55:44] YuviPanda: how can i decide what size the /srv gets ?
[12:55:55] matanya: it allocates as much as possible by default.
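Since the class "allocates as much as possible by default" and the follow-up notes it is "just LVM", resizing afterwards is standard LVM work. A hypothetical sketch — the volume group and logical volume names below are assumed for illustration, not taken from the log:

```shell
# Illustrative only; VG/LV names are made up, run on the instance itself.
#
# Inspect what the flavor's disk allows and what is already allocated:
#   vgs          # free extents left in the volume group
#   lvs          # existing logical volumes and their sizes
#
# Grow the /srv logical volume by 40G, then grow the filesystem to match:
#   lvextend -L +40G /dev/vd/srv-volume
#   resize2fs /dev/vd/srv-volume
#
# Shrinking an ext4 volume is also possible but requires unmounting and
# running resize2fs *before* lvreduce, in that order.
```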
[12:56:03] matanya: it's just LVM though, so you can just use default LVM tools
[12:56:06] i got 60GB
[12:56:08] to make it as big as or as small as you would like
[12:56:17] matanya: yeah, so your instance flavor had 80G set
[12:56:21] and 20G / and 60G /srv
[12:56:28] the default xlarge one has 160G
[12:56:47] lets try with 60
[12:56:51] see what happens
[12:57:52] (03CR) 10Merlijn van Deen: "I think it's better to always error out than to subtly fail (as $::labsproject is used for directory names, this will subtly fail with '' " [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott)
[13:09:35] (03CR) 10Yuvipanda: "https://tools.wmflabs.org/watroles/role/misc::labsdebrepo easy way to find which ones, and it turns out both are instances I am responsibl" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn)
[13:09:54] (03PS4) 10BBlack: ssl_ciphersuite: re-order ECDSA ahead of RSA [puppet] - 10https://gerrit.wikimedia.org/r/220377
[13:14:36] (03PS1) 10Joal: Add projectview to metrics website [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T101118)
[13:16:17] (03PS1) 10Glaisher: Allow ukwiki sysops to add/remove users to accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221612 (https://phabricator.wikimedia.org/T104034)
[13:17:54] (03CR) 10BBlack: [C: 032] ssl_ciphersuite: re-order ECDSA ahead of RSA [puppet] - 10https://gerrit.wikimedia.org/r/220377 (owner: 10BBlack)
[13:20:48] (03PS5) 10Giuseppe Lavagetto: varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815
[13:21:56] (03PS6) 10Giuseppe Lavagetto: varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815
[13:22:33] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: 16.67% of data above the critical threshold [500.0]
[13:23:14] (03CR) 10Yuvipanda: "I clearly misspoke - two are mine (shinken and ircnotifier), one is _joe_'s and one is mutante's" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn)
[13:24:57] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815 (owner: 10Giuseppe Lavagetto)
[13:25:20] (03PS5) 10Andrew Bogott: Add a labsproject fact that doesn't rely on ldap config. [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684)
[13:25:22] (03PS2) 10Andrew Bogott: Use the labsproject fact rather than $::instanceproject from ldap [puppet] - 10https://gerrit.wikimedia.org/r/221562
[13:25:24] (03PS2) 10Andrew Bogott: Add an additional puppet config to use with minimal runs. [puppet] - 10https://gerrit.wikimedia.org/r/221563
[13:26:01] (03CR) 10jenkins-bot: [V: 04-1] Add a labsproject fact that doesn't rely on ldap config. [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott)
[13:26:14] (03CR) 10jenkins-bot: [V: 04-1] Use the labsproject fact rather than $::instanceproject from ldap [puppet] - 10https://gerrit.wikimedia.org/r/221562 (owner: 10Andrew Bogott)
[13:26:19] (03CR) 10jenkins-bot: [V: 04-1] Add an additional puppet config to use with minimal runs. [puppet] - 10https://gerrit.wikimedia.org/r/221563 (owner: 10Andrew Bogott)
[13:29:12] (03PS7) 10Yuvipanda: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn)
[13:32:22] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: Less than 1.00% above the threshold [250.0]
[13:33:05] (03PS2) 10DCausse: Upgrade extra and experimental-highlighter to 1.6.0 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/221136 (https://phabricator.wikimedia.org/T103598)
[13:33:11] (03CR) 10Yuvipanda: move misc/labsdebrepo out of misc to module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn)
[13:33:42] (03PS10) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143)
[13:34:29] (03CR) 10jenkins-bot: [V: 04-1] nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar)
[13:37:05] (03PS11) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143)
[13:37:53] (03PS6) 10Andrew Bogott: Add a labsproject fact that doesn't rely on ldap config. [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684)
[13:37:55] (03PS3) 10Andrew Bogott: Use the labsproject fact rather than $::instanceproject from ldap [puppet] - 10https://gerrit.wikimedia.org/r/221562
[13:37:57] (03PS3) 10Andrew Bogott: Add an additional puppet config to use with minimal runs. [puppet] - 10https://gerrit.wikimedia.org/r/221563
[13:38:25] andrewbogott: why only error out in the case of labs?
[13:38:58] (03CR) 10Hashar: "The change should now be in sync with what is deployed on labnodepool1001.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar)
[13:39:00] hm, I guess it's hard to only error out when $::labsproject is actually used
[13:40:10] (03PS2) 10Jcrespo: Enable jessie installer for db1022 [puppet] - 10https://gerrit.wikimedia.org/r/221572
[13:40:33] PROBLEM - puppet last run on ganeti2004 is CRITICAL: Puppet has 1 failures
[13:40:50] valhallasw: why would labsproject be set on production?
[13:41:08] (03CR) 10Jcrespo: [C: 032] Enable jessie installer for db1022 [puppet] - 10https://gerrit.wikimedia.org/r/221572 (owner: 10Jcrespo)
[13:41:12] PROBLEM - puppet last run on cp3047 is CRITICAL: Puppet has 1 failures
[13:41:13] andrewbogott: not, but a manifest might use $::labsproject anyway
[13:41:27] a broken manifest
[13:43:04] yeah, that'd already be broken
[13:43:11] andrewbogott: mm. is strict_variables used? in that case, nil/undef (rather than '') might be a better choice
[13:43:15] yes, but it'd be silently broken
[13:43:32] Sorry, I don’t understand what you’re proposing
[13:44:05] silently returning an empty string means things may/may not work when $::labsproject is used
[13:44:39] (on anything that's not labs, which also included e.g. vagrant)
[13:44:59] (not sure, actually about vagrant, but ok)
[13:45:24] So you just want it to return undef instead?
[13:45:54] PROBLEM - puppet last run on cp4017 is CRITICAL: Puppet has 1 failures
[13:45:56] I think undef is caught when strict_variables is set
[13:46:03] PROBLEM - puppet last run on mw1152 is CRITICAL: Puppet has 1 failures
[13:46:15] PROBLEM - puppet last run on mw2156 is CRITICAL: Puppet has 1 failures
[13:46:39] hm, or not. "(This does not affect referencing variables that are explicitly set to undef)."
[13:46:56] anyway, undef makes more sense semantically, but will still break. I guess there's no way around that.
[13:47:33] can a fact return undef?
[13:47:36] (03PS1) 10Giuseppe Lavagetto: varnish: qualify ls command in confd template [puppet] - 10https://gerrit.wikimedia.org/r/221618
[13:47:41] (03CR) 10jenkins-bot: [V: 04-1] varnish: qualify ls command in confd template [puppet] - 10https://gerrit.wikimedia.org/r/221618 (owner: 10Giuseppe Lavagetto)
[13:48:44] (03PS3) 10Hashar: nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673)
[13:48:57] I think you can just return nil
[13:49:24] (03CR) 10Merlijn van Deen: [C: 031] "There doesn't seem to be a way to catch $::labsproject use in manifests that are also used on prod (strict_variables = True /probably/ won" [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott)
[13:50:57] (03PS3) 10Hashar: nodepool: element to prepare an image for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/220445
[13:51:20] (03PS2) 10Hashar: nodepool: add diskimage 'devuser' element [puppet] - 10https://gerrit.wikimedia.org/r/220446 (https://phabricator.wikimedia.org/T102880)
[13:52:13] (03PS7) 10Andrew Bogott: Add a labsproject fact that doesn't rely on ldap config. [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684)
[13:52:15] (03PS4) 10Andrew Bogott: Use the labsproject fact rather than $::instanceproject from ldap [puppet] - 10https://gerrit.wikimedia.org/r/221562
[13:52:17] (03PS4) 10Andrew Bogott: Add an additional puppet config to use with minimal runs. [puppet] - 10https://gerrit.wikimedia.org/r/221563
[13:53:13] (03CR) 10Merlijn van Deen: [C: 031] Add a labsproject fact that doesn't rely on ldap config. [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott)
[13:55:23] RECOVERY - puppet last run on ganeti2004 is OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[13:55:43] RECOVERY - puppet last run on mw1152 is OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:55:52] RECOVERY - puppet last run on cp3047 is OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[13:56:02] RECOVERY - puppet last run on mw2156 is OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[13:57:44] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: qualify ls command in confd template [puppet] - 10https://gerrit.wikimedia.org/r/221618 (owner: 10Giuseppe Lavagetto)
[13:58:17] (03CR) 10Merlijn van Deen: "LGTM, but given that the tools changes touch gridengine, I'd like to be sure the change is a null change or that it can be reverted correc" [puppet] - 10https://gerrit.wikimedia.org/r/221562 (owner: 10Andrew Bogott)
[14:00:32] RECOVERY - puppet last run on cp4017 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:00:54] andrewbogott: once https://gerrit.wikimedia.org/r/220991 is merged, please ping me to test https://gerrit.wikimedia.org/r/221562 on toolsbeta
[14:01:59] valhallasw: ok
[14:02:05] thanks!
[14:02:30] (03PS4) 10Ottomata: Refactor eventlogging monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/221277
[14:05:34] (03CR) 10Tim Landscheidt: "I think someone who would be able to cheat the NFS configuration on their project's "Hiera:" page without understanding what they are doin" [puppet] - 10https://gerrit.wikimedia.org/r/218637 (https://phabricator.wikimedia.org/T102403) (owner: 10Yuvipanda)
[14:06:04] (03CR) 10Ottomata: [C: 032] Refactor eventlogging monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/221277 (owner: 10Ottomata)
[14:08:34] andrewbogott: we can try your labs project variable changes on either the beta cluster or integration labs projects :)
[14:09:27] hashar: thanks. I think it’s pretty safe but I’m not sure I want to merge it today… it’s part of a bigger picture that I need to think about more.
[14:09:48] andrewbogott: just proposing :-]
[14:10:06] (03CR) 10Yuvipanda: ":) It's about what is self serve and what isn't and NFS shouldn't be. The failure mode for this code right now is also only soft - puppet " [puppet] - 10https://gerrit.wikimedia.org/r/218637 (https://phabricator.wikimedia.org/T102403) (owner: 10Yuvipanda)
[14:10:12] if we get puppet passing on beta, we can cherry pick your patches then check what is going on on different flavors of role classes
[14:10:25] andrewbogott: why the move from LDAP to facter, btw?
[14:12:12] A few reasons. We’re en route to eliminating ldap node definitions altogether, which won’t happen soon but would be nice. In order to support proper instance creation the project name has to be in metadata (because we don’t have ldap info yet at that point) and I don’t like having the same info in two places
[14:12:46] And also I’m tinkering with the idea of having a special puppet run that uses a generic (non-host-specific) catalog. Which means not learning anything instance-specific from ldap.
[14:12:47] 6operations, 10Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1409722 (10faidon) >>! In T96852#1406463, @BBlack wrote: > I think at least one of the reasons for the 3 hosts idea was that if one underlying ganeti box died, we could still have 2x instances of various type...
[14:13:18] andrewbogott: ah, right. although I think the solution to 'self hosted puppetmasters are bad!' is just the 'let them autoupdate and if they have a conflicting change well then you need to recreate your setup I guess'
[14:13:57] YuviPanda: I spend many hours every month fixing self-hosted puppetmaster problems. It’s not enough to tell people to piss off.
[14:14:23] I definitely think it should be...
[14:14:36] self hosted puppetmasters should be a very 'if you do this you take responsibility for it' setup
[14:14:49] that's easy to say when you have +2 on ops/puppet ;-)
[14:14:50] similar to 'we do sysadmin stuff for you if you use tools, but outside of that you are on your own'
[14:15:33] it's basically impossible to develop your own puppet manifest if you a) don't have +2 on ops/puppet and b) don't have your own puppetmaster
[14:15:38] ok, except many of the instances that I had to fix last week were /your/ instances :)
[14:15:42] RECOVERY - Check status of defined EventLogging jobs on analytics1010 is OK: All defined EventLogging jobs are running.
[14:15:58] andrewbogott: the marathon ones I deleted and the ores ones were *not* self hosted puppetmasters...
[14:16:07] andrewbogott: and I'm saying you shouldn't have fixed them :)
[14:16:16] it should not be your or our responsibility
[14:16:19] also, people break puppet in other ways, not just via self-hosted puppet.
[14:16:35] It’s easy enough to put a typo in hiera, or include two classes that can’t overlap
[14:16:44] and then suddenly, plop, that instance is never updated again.
[14:16:54] And the user doesn’t notice since we’re removing puppet status from wikitech :) [14:17:26] they didn't notice before either, did they? :) [14:17:59] valhallasw: I'm not saying self hosted puppetmasters are evil, but that now with autoupdate on and they fail only if you have a local hack that conflicts with a global change, and that's something that should be your responsibility... [14:18:39] andrewbogott: most of the other opsen's response to their self hosted puppetmasters was also to just delete it, I think? [14:19:04] YuviPanda: I think that's too easy. We *want* people to puppetize their infrastructure, so we should provide a way to do that that's not 'oh, sorry, we merged something and you now lost everything you built' [14:19:11] *shrug* They didn’t delete them until I asked them to, which still counts as ‘fixing’ in my book [14:19:41] valhallasw: there's a puppet module called puppetception that I started building... [14:19:45] valhallasw: that's probably the solution for that [14:20:11] PROBLEM - Check status of defined EventLogging jobs on hafnium is CRITICAL Stopped EventLogging jobs: reporter/statsd [14:21:27] andrewbogott: heh :) I still think the problem is a social one and not a technical one (beyond turning on autoupdate). I'll respond on the thread later today... [14:21:47] YuviPanda: mmm. maybe. doesn't it just move the conflict to the puppet rather than the git level? [14:21:55] well, less risk for conflicts, I guess [14:22:15] valhallasw: yes. also the auto update script rebases and refuses to do anything destructive... [14:22:47] YuviPanda: it’s definitely a social problem. But after 3 years of asking people to behave differently and getting no results /even from my own team/ I’m pretty sure it’s time to consider a different approach. [14:24:04] andrewbogott: hmm, I'm just worried that a two stage puppet run introduces more complexities and more differences from production than we already have... [14:24:23] yeah, me too. 
I’m not sure it’s a good idea, still tinkering. [14:24:28] yeah [14:24:41] andrewbogott: I'll update the script to do autostashing later. [14:24:56] valhallasw: so my problem with the puppetception module is that I'll have to reimplement everything from scratch [14:25:06] valhallasw: ops/puppet has nice stuff (like a uwsgi module, nginx module, etc) that I can just use [14:25:11] valhallasw: those are not available from puppetception [14:25:36] YuviPanda: mm. Yeah, either you merge the manifests somehow, and you have the same issues as now, or you don't merge them and you lose all the goodies [14:25:39] Better than autostashing would be sending an email or a wikitech alert to an instance owner. I don’t think that’s really possible though. [14:26:26] andrewbogott: we can do that probably... [14:26:41] valhallasw: indeed. you can futz around with git based merging but you have the same fundamental problem [14:27:49] fwiw, I think andrew's approach 'a basic puppet manifest that makes sure you can login to fix your mess' and another one for the rest is pretty sane [14:28:02] YuviPanda: if you can arrange for an email-scolding that’d be awesome. [14:28:24] it has problems - if you move the NFS setup to the basic one then the 'other' one cannot refer to anything in the basic one, for example. no 'do this after that' [14:28:30] If it’s sent to all project admins, that might also motivate people to clean up unused projects. [14:28:39] andrewbogott: indeed, I think it should send to all projectadmins... [14:28:39] (03PS11) 10ArielGlenn: rsync wikidata json dumps to labs /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: 10Addshore) [14:29:01] andrewbogott: I think that's a better alternative than the two stage one, and I think we can hook it up with LDAP to do that. can you open a bug? [14:29:09] sure [14:29:24] YuviPanda: everything in 'the rest' is always after everything in 'the basic one', I'd think?
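[editor's note] The rebase-plus-autostash update flow discussed above can be sketched with plain git. This is an illustrative stand-in, not the actual labs puppetmaster auto-update script: repository names, file names, and identities are all made up.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# "upstream" stands in for ops/puppet; "checkout" for a self-hosted copy.
git init -q upstream
cd upstream
git config user.email ops@example.org
git config user.name ops
echo 'include base' > site.pp
echo 'notes' > notes.txt
git add .
git commit -qm 'initial'
cd ..
git clone -q upstream checkout
cd checkout
git config user.email labs@example.org
git config user.name labs

# Upstream merges a change while the checkout carries an uncommitted local hack.
( cd ../upstream && echo 'include new_role' >> site.pp && git commit -qam 'upstream change' )
echo 'local hack' >> notes.txt

# --autostash stashes the dirty tree, rebases onto upstream, then reapplies
# the stash, so a non-conflicting local hack survives the automatic update.
git pull --rebase --autostash -q
grep 'include new_role' site.pp
grep 'local hack' notes.txt
```

A conflicting local hack still fails at the reapply step, which is the "you need to recreate your setup" case from the conversation.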
and once your ldap pubkey change is merged, people can login even without nfs [14:29:45] valhallasw: it has been merged :) [14:29:46] (03CR) 10ArielGlenn: [C: 032] rsync wikidata json dumps to labs /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: 10Addshore) [14:29:56] valhallasw: anyway, it's just a complication that we can live without if we find alternative solutions :) [14:30:03] fair enough [14:30:27] https://phabricator.wikimedia.org/T104199 [14:34:41] !log rebooting and reinstalling db1022 [14:34:46] Logged the message, Master [14:36:44] heyaaaa _joe_, yt? [14:36:47] admin groups q [14:40:23] (03PS1) 10Ottomata: Remove eventlogging::reporter from hafnium [puppet] - 10https://gerrit.wikimedia.org/r/221631 [14:42:14] hey andrewbogott, we no longer need to set $wgOpenStackManagerProxyGateways['pmtpa'] right? [14:42:31] Surely not [14:45:30] (03PS1) 10Ottomata: Explicitly use eventlogging role so that hieradata for admin::groups is applied [puppet] - 10https://gerrit.wikimedia.org/r/221634 [14:45:59] (03CR) 10Ottomata: [C: 032] Remove eventlogging::reporter from hafnium [puppet] - 10https://gerrit.wikimedia.org/r/221631 (owner: 10Ottomata) [14:46:12] (03CR) 10jenkins-bot: [V: 04-1] Explicitly use eventlogging role so that hieradata for admin::groups is applied [puppet] - 10https://gerrit.wikimedia.org/r/221634 (owner: 10Ottomata) [14:46:48] (03PS2) 10Ottomata: Explicitly use eventlogging role so that hieradata for admin::groups is applied [puppet] - 10https://gerrit.wikimedia.org/r/221634 [14:49:26] (03CR) 10Ottomata: [C: 032] Explicitly use eventlogging role so that hieradata for admin::groups is applied [puppet] - 10https://gerrit.wikimedia.org/r/221634 (owner: 10Ottomata) [14:50:29] Who is SWATing? 
[14:54:19] jouncebot, next [14:54:19] In 0 hour(s) and 5 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150629T1500) [14:54:26] I guess I will [14:54:31] Krenair: thanks. [14:55:01] PROBLEM - puppet last run on ms-be2008 is CRITICAL puppet fail [14:55:22] Krenair: You need to update ContentTranslation submodule for two patches I cherry-picked for SWAT. [14:55:49] Krenair: I wonder how submodule thingy will work from now. Is this okay? [14:57:22] kart_, I don't think we need to do such an update anymore. [14:57:49] Just scap then? [14:58:13] I don't think this needs a full scap? [14:58:21] just a few sync-files [14:59:35] Yes. Small changes. sync-files should be Okay [15:00:05] manybubbles anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150629T1500). [15:02:43] !log krenair Synchronized php-1.26wmf11/extensions/ContentTranslation/modules/tools/ext.cx.tools.formatter.js: https://gerrit.wikimedia.org/r/#/c/221604/ (duration: 00m 14s) [15:02:48] Logged the message, Master [15:02:50] kart_, ^ [15:03:18] please check that didn't break it [15:03:58] Krenair: checking. [15:06:40] Krenair: testing. Give me few more minutes. [15:06:59] k [15:08:41] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1409821 (10Eevans) OK, so catching up with where we are according to The Plan: >>! In T102015#1397474, @Eevans wrote: > # ~~We attempt... [15:10:02] RECOVERY - Check status of defined EventLogging jobs on hafnium is OK All defined EventLogging jobs are runnning. [15:10:42] Krenair: oh. you sync one file? [15:10:57] one of the commits, yep [15:11:10] want the other to test at the same time? [15:11:19] Krenair: okk. then it is okay. [15:11:25] Krenair: go for second. 
[15:12:02] !log krenair Synchronized php-1.26wmf11/extensions/ContentTranslation/modules/tools/ext.cx.tools.link.js: https://gerrit.wikimedia.org/r/#/c/221605 (duration: 00m 13s) [15:12:06] kart_, ^ [15:12:06] Logged the message, Master [15:12:31] RECOVERY - puppet last run on ms-be2008 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:12:33] Krenair: okay. Testing. [15:12:44] (03CR) 10Manybubbles: [C: 04-1] "No mergies unless as part of the Elasticsearch 1.6 upgrade to production." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/221136 (https://phabricator.wikimedia.org/T103598) (owner: 10DCausse) [15:14:01] 6operations, 10ops-codfw, 7Database: Faulty memory on es2004 - https://phabricator.wikimedia.org/T103843#1409827 (10Papaul) Discussed this with Jynus on IRC , I am waiting on Jynus to get memory purchase permission. [15:14:28] (03PS4) 10Ottomata: Add new projectview to projectcounts aggregation [puppet] - 10https://gerrit.wikimedia.org/r/220752 (https://phabricator.wikimedia.org/T101118) (owner: 10Joal) [15:15:23] 6operations, 10ops-codfw, 7Database: Faulty memory on es2004 - https://phabricator.wikimedia.org/T103843#1409834 (10jcrespo) a:3jcrespo [15:15:50] (03CR) 10Ottomata: [C: 032] Add new projectview to projectcounts aggregation [puppet] - 10https://gerrit.wikimedia.org/r/220752 (https://phabricator.wikimedia.org/T101118) (owner: 10Joal) [15:17:23] (03PS1) 10Ottomata: Need to include the new projectview and projectcount aggregator classes [puppet] - 10https://gerrit.wikimedia.org/r/221638 [15:17:25] Krenair: looks good. 
[15:17:36] (03CR) 10Alex Monk: [C: 032] CX: Add eswiki-recommender campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221047 (owner: 10KartikMistry) [15:18:08] (03Merged) 10jenkins-bot: CX: Add eswiki-recommender campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221047 (owner: 10KartikMistry) [15:18:24] (03CR) 10Ottomata: [C: 032] Need to include the new projectview and projectcount aggregator classes [puppet] - 10https://gerrit.wikimedia.org/r/221638 (owner: 10Ottomata) [15:18:46] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/221047/ (duration: 00m 13s) [15:18:47] kart_, ^ can you test that? [15:18:50] Logged the message, Master [15:19:07] Krenair: no. But, that's fine :) [15:19:10] PROBLEM - puppet last run on stat1002 is CRITICAL puppet fail [15:19:15] ok [15:19:57] (03CR) 10Alex Monk: [C: 032] More wikitech cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221009 (https://phabricator.wikimedia.org/T75939) (owner: 10Alex Monk) [15:20:10] (03PS1) 10Ottomata: s/projectviews/projectview/ [puppet] - 10https://gerrit.wikimedia.org/r/221639 [15:20:21] (03Merged) 10jenkins-bot: More wikitech cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221009 (https://phabricator.wikimedia.org/T75939) (owner: 10Alex Monk) [15:20:33] (03CR) 10Ottomata: [C: 032 V: 032] s/projectviews/projectview/ [puppet] - 10https://gerrit.wikimedia.org/r/221639 (owner: 10Ottomata) [15:20:58] !log krenair Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/221009/ (duration: 00m 11s) [15:21:03] Logged the message, Master [15:22:30] (03CR) 10Alex Monk: [C: 032] More high-resolution logos for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221113 (https://phabricator.wikimedia.org/T102852) (owner: 10Odder) [15:22:57] (03Merged) 10jenkins-bot: More high-resolution logos for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221113 
(https://phabricator.wikimedia.org/T102852) (owner: 10Odder) [15:23:51] !log krenair Synchronized w/static/images/project-logos/zhwiki-hans.png: https://gerrit.wikimedia.org/r/#/c/221113/ (duration: 00m 12s) [15:23:55] Logged the message, Master [15:24:13] !log krenair Synchronized w/static/images/project-logos/zhwiki-hans-1.5x.png: https://gerrit.wikimedia.org/r/#/c/221113/ (duration: 00m 12s) [15:24:17] Logged the message, Master [15:24:33] !log krenair Synchronized w/static/images/project-logos/zhwiki-hans-2x.png: https://gerrit.wikimedia.org/r/#/c/221113/ (duration: 00m 14s) [15:24:38] Logged the message, Master [15:25:10] (03PS1) 10Muehlenhoff: Update to 3.19.8-ckt2 [debs/linux] - 10https://gerrit.wikimedia.org/r/221642 [15:25:51] (03CR) 10Alex Monk: [C: 032] Allow ukwiki sysops to add/remove users to accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221612 (https://phabricator.wikimedia.org/T104034) (owner: 10Glaisher) [15:26:15] (03Merged) 10jenkins-bot: Allow ukwiki sysops to add/remove users to accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221612 (https://phabricator.wikimedia.org/T104034) (owner: 10Glaisher) [15:26:58] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/221612/ (duration: 00m 12s) [15:27:02] Logged the message, Master [15:27:07] 7Puppet, 6Labs: Allow per-host hiera overrides via wikitech - https://phabricator.wikimedia.org/T104202#1409856 (10yuvipanda) 3NEW [15:27:10] sync-dir? 
[15:27:41] (03CR) 10Alex Monk: [C: 032] Set wikidata's logo specifically for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221405 (https://phabricator.wikimedia.org/T54214) (owner: 10Alex Monk) [15:27:54] Reedy, yeah, I just realised I'll have to do that for the next one [15:27:57] :/ [15:28:06] (03Merged) 10jenkins-bot: Set wikidata's logo specifically for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221405 (https://phabricator.wikimedia.org/T54214) (owner: 10Alex Monk) [15:28:09] lol [15:29:10] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:29:27] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/221405/ (duration: 00m 15s) [15:29:32] Logged the message, Master [15:30:21] (03PS11) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [15:30:49] At least incubator is happy now. [15:31:06] (03CR) 10jenkins-bot: [V: 04-1] WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 (owner: 10Alexandros Kosiaris) [15:32:13] (03PS1) 10Giuseppe Lavagetto: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221643 [15:32:56] (03CR) 10jenkins-bot: [V: 04-1] WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221643 (owner: 10Giuseppe Lavagetto) [15:33:00] Hmm. [15:33:10] Well, that wikidata.png is supposed to be redundant now (in favour of wikidatawiki.png) [15:33:16] except there's still cached references to it [15:34:44] cmjohnson1: Do you still need additional info regarding labnet1002? [15:35:47] andrewbogott: i don't think so [15:35:55] i think we can try installing now [15:35:56] Next sync-dir/scap will remove it from the servers... am wondering if actually it should stay around for a month or something instead [15:36:11] cmjohnson1: great! 
[15:36:31] cmjohnson1: should I give that a try or leave it to you? [15:36:57] i will try now ..not sure if I have it right [15:38:33] Krenair: yeah, that'd be typical for how we handle most static assets. Keep around for 5ish weeks in case it's referenced by varnish. [15:39:03] Okay, I'll upload a patchset to put the image back in place in the directory for now [15:41:01] (03CR) 10BBlack: [C: 032] Import openssl-1.0.2c [debs/openssl] - 10https://gerrit.wikimedia.org/r/221619 (owner: 10BBlack) [15:41:09] (03CR) 10BBlack: [V: 032] Import openssl-1.0.2c [debs/openssl] - 10https://gerrit.wikimedia.org/r/221619 (owner: 10BBlack) [15:41:20] (03CR) 10BBlack: [C: 032 V: 032] Import openssl_1.0.2c-1.debian.tar.xz [debs/openssl] - 10https://gerrit.wikimedia.org/r/221620 (owner: 10BBlack) [15:42:31] (03PS1) 10Alex Monk: Re-add wikidata.png, it'll still have cached references etc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221646 [15:43:13] (03CR) 10Alex Monk: [C: 032] "to revert in approx. 5 weeks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221646 (owner: 10Alex Monk) [15:43:19] (03Merged) 10jenkins-bot: Re-add wikidata.png, it'll still have cached references etc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221646 (owner: 10Alex Monk) [15:43:40] cmjohnson1: thank you! [15:43:47] anyone want anything else in swat? [15:43:53] YuviPanda, andrewbogott ? [15:44:13] Krenair: oh, that wikitech patch for shell access? [15:44:27] (nothing for me) [15:44:44] Krenair: enough time left for me to backport and stuff? [15:44:54] YuviPanda, should just be a backport now [15:44:57] sure [15:45:07] doing so now [15:45:10] the submodule update... seems to have started being handled automatically? very strangely. [15:45:33] kind of scary that gerrit just miraculously started being helpful and no one has been able to tell me why [15:45:49] Krenair: https://gerrit.wikimedia.org/r/#/c/221648/ [15:45:54] shall I self merge that? 
[15:46:00] I'll do it [15:47:25] Krenair: thanks! [15:47:31] Krenair: the auto submodule update seems scary [15:48:40] (03PS12) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [15:49:31] !log krenair Synchronized php-1.26wmf11/extensions/OpenStackManager: https://gerrit.wikimedia.org/r/#/c/221648/ (duration: 00m 13s) [15:49:37] Logged the message, Master [15:49:40] YuviPanda, could you test that please? [15:50:14] Krenair: wheee [15:50:14] doing [15:50:25] (03PS2) 10BBlack: Delete loginlb from LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/221380 [15:50:36] (03PS3) 10BBlack: Delete login-lb/login-addrs from DNS [dns] - 10https://gerrit.wikimedia.org/r/221378 [15:51:51] !log disabling puppet on caches temporarily ... [15:51:55] Logged the message, Master [15:52:12] (03CR) 10BBlack: [C: 032] Delete loginlb from LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/221380 (owner: 10BBlack) [15:53:22] (03PS1) 10Negative24: hiera: phab-02 ssh on port 222 [puppet] - 10https://gerrit.wikimedia.org/r/221649 [15:54:09] (03PS2) 10Negative24: hiera: phab-02 ssh on port 222 [puppet] - 10https://gerrit.wikimedia.org/r/221649 [15:54:44] Krenair: all good!
[15:54:45] wheee [15:54:55] great [15:55:17] (03PS1) 10Ottomata: Don't prepend seqs for kafka forwarder [puppet] - 10https://gerrit.wikimedia.org/r/221650 [15:55:49] (03CR) 10Ottomata: [C: 032 V: 032] Don't prepend seqs for kafka forwarder [puppet] - 10https://gerrit.wikimedia.org/r/221650 (owner: 10Ottomata) [15:55:57] (03CR) 10Yuvipanda: [C: 032 V: 032] hiera: phab-02 ssh on port 222 [puppet] - 10https://gerrit.wikimedia.org/r/221649 (owner: 10Negative24) [15:56:20] (03PS1) 10Jcrespo: Update db1022 to mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/221651 (https://phabricator.wikimedia.org/T101516) [15:56:40] (03PS3) 10Yuvipanda: hiera: phab-02 ssh on port 222 [puppet] - 10https://gerrit.wikimedia.org/r/221649 (owner: 10Negative24) [15:56:46] (03CR) 10Yuvipanda: [V: 032] hiera: phab-02 ssh on port 222 [puppet] - 10https://gerrit.wikimedia.org/r/221649 (owner: 10Negative24) [15:58:02] PROBLEM - puppet last run on lvs1001 is CRITICAL Puppet last ran 2 days ago [15:59:43] Krenair: around? [15:59:46] kart_, yep [16:00:11] (03PS2) 10Jcrespo: Update db1022 to mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/221651 (https://phabricator.wikimedia.org/T101516) [16:00:22] Krenair: it looks CX code is not updated for Amir. [16:00:23] (03CR) 10Jcrespo: [C: 032] Update db1022 to mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/221651 (https://phabricator.wikimedia.org/T101516) (owner: 10Jcrespo) [16:00:29] Krenair: while I see it updated. [16:00:41] hi kart_ [16:00:46] hi Krenair [16:00:55] something crazy is happening [16:01:00] okay, so [16:01:02] RL caching [16:01:13] Krinkle_ would be able to tell you all about this [16:01:17] is there anything I can do to force it to work? [16:01:19] Krenair: good idea to run scap? [16:01:24] how is it possible that we see different things? [16:01:31] You can try adding debug=true to the URL? [16:01:34] I did [16:01:42] and it still gave you old code? 
[16:01:42] otherwise I wouldn't be able to see the code [16:01:48] (at least not in a readable way) [16:01:50] yes [16:01:52] I see old code [16:01:54] Krenair: and one file is updated. One isn't. [16:01:56] kart_ sees new code [16:02:01] kart_: scap doesn't do anything to clear RL caches if that's the problem [16:02:20] bd808: I see. Thanks. [16:02:22] kart_ and I are looking at (presumably) the same file, and we see different content. [16:02:27] well, I assume it'd be RL caching involved, since we're talking about CX's JS files [16:02:35] Yes. [16:02:47] one other JS file was updated correctly. [16:02:59] Krinkle_: around? ^ [16:03:02] (03PS13) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:03:02] do you have an idea? [16:03:37] are we connected to different servers because we are in different countries? (or am I just making crazy stuff up?) [16:03:44] "how is it possible that we see different things?" -- are you both logged in (or logged out)? [16:03:50] logged in [16:04:00] ContentTranslation only works for logged in users. [16:04:01] bd808: CX works with logged in. [16:04:12] that would rule out varnish issues then [16:04:59] could be an issue of this, if I'm understanding correctly: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#The_Resource_Loader_.28RL.29_and_l10n [16:05:53] I don't think that it has much to do with localization update. [16:06:21] Can you give us a clear description of the problem and a URL to reproduce? [16:06:22] (03CR) 10BBlack: [C: 032] Delete login-lb/login-addrs from DNS [dns] - 10https://gerrit.wikimedia.org/r/221378 (owner: 10BBlack) [16:06:36] !log re-enabling puppet on caches [16:06:40] Logged the message, Master [16:06:51] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:56] So far I see "something crazy is happening" and "how is it possible that we see different things?"
without further context [16:07:37] bd808: try this: [16:08:01] (03PS14) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:08:12] bd808: go to https://simple.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures [16:08:44] check Content Translation [16:08:58] save the preferences [16:09:11] {{done}} [16:09:14] then go do [16:09:26] https://simple.wikipedia.org/wiki/Special:CX [16:09:47] Type ASCII in the From: field. [16:09:54] then "Start translation". [16:10:01] bd808: done? [16:10:25] enabled the beta feature [16:10:32] trying the rest now [16:13:26] bd808: is it working? [16:14:14] I have a split screen that is showing me the enwiki ASCII article on one side and a place to make a simple english translation on the other [16:14:58] bd808: cool [16:15:04] now add &debug=true to the URL [16:15:12] and reload [16:15:17] and open the JS debugger [16:15:56] ok [16:16:25] bd808: can you please find the ext.cx.tools.formatter.js? [16:16:29] in the JS debugger [16:16:50] got it [16:17:33] bd808: go to line 194 [16:17:36] what do you see? [16:17:46] this.$section = section.jquery ? section : getParentSection(); [16:18:08] bd808: sorry, 196 [16:18:18] if ( this.$section.is( 'h1, h2, h3, h4, h5, h6, figure, table' ) ) { [16:18:22] !!! [16:18:25] you see the new code [16:18:30] I still see something old [16:18:33] let me re-check [16:20:39] That script is getting cached by varnish -- "X-Cache [16:20:39] cp1055 hit (4), cp1065 frontend miss (0)" [16:20:51] bd808: I see "if ( $.isEmpty( this.$section ) ||" on line 196 [16:20:59] and "this.$section.is( 'h1, h2, h3, h4, h5, h6, figure, table' )" on line 197 [16:21:02] that's the old code [16:21:10] kart_ and bd808 are seeing the new code [16:21:12] I see the old code. [16:21:18] how are we not getting authdns alerts? 
[16:21:44] PROBLEM - puppet last run on baham is CRITICAL Puppet has 1 failures [16:23:15] bd808: do you have any idea what could be causing this? [16:23:15] PROBLEM - puppet last run on radon is CRITICAL Puppet has 1 failures [16:23:28] how is it possible that we see different versions of the same file? [16:23:37] ah they're just not very fast, it was still on 1/3 on icinga [16:24:14] RECOVERY - puppet last run on baham is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:24:17] aharoni: I'm seeing varnish cache hit headers for https://simple.wikipedia.org/static/1.26wmf11/extensions/ContentTranslation/modules/tools/ext.cx.tools.formatter.js. That makes me think you and I could see different versions if we are hitting different varnish clusters. My X-Cache header says cp1055 [16:24:41] bd808: OK, is there any way to synchronize them? [16:24:46] Or just waiT? [16:25:07] Is it right that it takes more than an hour for it to synchronize? [16:25:21] Kartik deployed two patches in SWAT today. [16:25:23] there's no great way to issue a purge for a URL like that sadly. [16:25:29] One works well for me. The other doesn't [16:25:43] RECOVERY - puppet last run on radon is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:25:46] So how much time is it supposed to take until it's really updated? 
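[editor's note] The X-Cache header bd808 quotes ("cp1055 hit (4), cp1065 frontend miss (0)") explains how two users can see different file versions: each comma-separated entry is one varnish layer, and different users can land on different caches. A tiny illustrative helper for reading such a header (the function name is made up; real headers could also contain "hit" inside a hostname, which this naive substring check would misread):

```shell
# "hit" in any layer means the object was served from varnish cache
# rather than fetched from the application servers.
xcache_status() {
    case "$1" in
        *hit*) echo "cache hit" ;;
        *)     echo "cache miss" ;;
    esac
}

xcache_status "cp1055 hit (4), cp1065 frontend miss (0)"   # -> cache hit
xcache_status "cp1068 miss (0), cp1065 frontend miss (0)"  # -> cache miss
```

Since two clients can hit different cache hosts holding different cached copies of the same URL, "I see old code, you see new code" is exactly the symptom to expect until the cached object expires or is purged.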
[16:26:33] We talked some about this in the past -- https://phabricator.wikimedia.org/T99096 [16:27:15] 7Puppet, 6Labs, 3Labs-Sprint-104: Allow per-host hiera overrides via wikitech - https://phabricator.wikimedia.org/T104202#1410013 (10yuvipanda) [16:27:24] The TL;DR is that when we used bits to serve static things it had a more volatile varnish cache [16:27:46] today we serve static from the main text varnishes and things could stick around a lot longer [16:28:08] which makes backports of static assets a hit or miss proposition [16:28:30] (03PS1) 10BBlack: authdns scripts: use "service" to restart [puppet] - 10https://gerrit.wikimedia.org/r/221655 [16:29:03] (03CR) 10BBlack: [C: 032 V: 032] authdns scripts: use "service" to restart [puppet] - 10https://gerrit.wikimedia.org/r/221655 (owner: 10BBlack) [16:29:04] 6operations, 10RESTBase-Cassandra, 6Services, 7RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1410015 (10Eevans) 3NEW a:3fgiunchedi [16:29:46] aharoni: that text cluster caching thing *should* only affect ?debug=true requests [16:30:00] (03PS1) 10RobH: account cleanups: sumanah & mglaser [puppet] - 10https://gerrit.wikimedia.org/r/221657 [16:31:10] 6operations, 6Labs, 3Labs-Sprint-104, 5Patch-For-Review: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1410027 (10yuvipanda) [16:33:21] bd808: very weird, but OK for most end-users, I guess [16:33:53] makes debugging harder for me than it should be :/ [16:33:58] cache invalidation is hard and we have layers upon layers of caches [16:34:32] this is one of the most frequent replies to my bug reports :) [16:35:16] 6operations, 10RESTBase-Cassandra, 6Services, 7RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1410038 (10Eevans) One option, would be to implement a JMX-based collector that writes to Graphite, in Java.
I believe such a collector could be written... [16:35:18] (03PS2) 10BBlack: update-ocsp: require proxy argument [puppet] - 10https://gerrit.wikimedia.org/r/221422 [16:35:35] (03CR) 10BBlack: [C: 032 V: 032] update-ocsp: require proxy argument [puppet] - 10https://gerrit.wikimedia.org/r/221422 (owner: 10BBlack) [16:35:42] (03PS2) 10BBlack: update-ocsp: support multi-cert fetches [puppet] - 10https://gerrit.wikimedia.org/r/221423 [16:35:50] (03PS2) 10RobH: account cleanups: sumanah & mglaser [puppet] - 10https://gerrit.wikimedia.org/r/221657 [16:36:15] (03CR) 10BBlack: [C: 032 V: 032] update-ocsp: support multi-cert fetches [puppet] - 10https://gerrit.wikimedia.org/r/221423 (owner: 10BBlack) [16:37:42] (03CR) 10RobH: [C: 032] account cleanups: sumanah & mglaser [puppet] - 10https://gerrit.wikimedia.org/r/221657 (owner: 10RobH) [16:38:02] (03PS3) 10RobH: account cleanups: sumanah & mglaser [puppet] - 10https://gerrit.wikimedia.org/r/221657 [16:38:05] (03PS15) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:38:11] (03CR) 10RobH: [V: 032] account cleanups: sumanah & mglaser [puppet] - 10https://gerrit.wikimedia.org/r/221657 (owner: 10RobH) [16:40:43] (03PS16) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:42:36] (03PS1) 10ArielGlenn: rsync of phab dumps from iridium to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/221658 [16:44:23] 6operations, 6Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1410069 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/221658/ [16:44:31] (03PS1) 10RobH: shell accounts cleanup: maryana & howief [puppet] - 10https://gerrit.wikimedia.org/r/221659 [16:44:53] robh: https://gerrit.wikimedia.org/r/#/c/221657/ -- shouldn't they be in the 'absent' group? [16:47:14] and aren't their keys supposed to be removed? 
[16:47:36] robh: Are you mostly the guy who buys certs these days, or is mutante doing that? [16:47:54] (03PS2) 10Qgil: rsync of phab dumps from iridium to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/221658 (https://phabricator.wikimedia.org/T103028) (owner: 10ArielGlenn) [16:48:40] andrewbogott: https://phabricator.wikimedia.org/T104211 [16:49:13] * andrewbogott closes browser tab [16:49:45] (03PS17) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:50:32] YuviPanda: is it possible to do a browser redirect to tools.wmflabs.org without needing a cert? [16:50:38] 6operations, 10Datasets-General-or-Unknown, 6Labs, 10wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#1410096 (10ArielGlenn) should we host the lastest dump on dumps.wm.org? [16:50:39] andrewbogott: unfortunately not [16:50:46] Because ssl doesn’t play nice with redirects? [16:52:53] andrewbogott: kind of [16:53:40] (03PS2) 10BBlack: Use overridden direct DNS for all LVS [puppet] - 10https://gerrit.wikimedia.org/r/221303 (https://phabricator.wikimedia.org/T103921) [16:54:52] 6operations, 10Datasets-General-or-Unknown, 6Labs, 10wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#1410103 (10Krenair) Maybe. Might need to keep in mind that you can only login to the database as wikiadmin from silver (e.g. it won't work from tin... 
[16:54:54] andrewbogott: because when the browser hits the initial link in https it tries to connect and fails cert validation [16:55:06] andrewbogott: and so this prevents the browser from actually receiving the redirect [16:55:09] 7Puppet: Allow per-host hiera customizations on wikitech - https://phabricator.wikimedia.org/T97055#1410104 (10scfc) [16:55:11] 7Puppet, 6Labs, 3Labs-Sprint-104: Allow per-host hiera overrides via wikitech - https://phabricator.wikimedia.org/T104202#1410105 (10scfc) [16:55:15] ah, right, because even a redirect would have to pass the cert test. [16:55:39] (03CR) 10BBlack: [C: 032] Use overridden direct DNS for all LVS [puppet] - 10https://gerrit.wikimedia.org/r/221303 (https://phabricator.wikimedia.org/T103921) (owner: 10BBlack) [16:56:41] (03PS2) 10BBlack: enable SPDY header compression [puppet] - 10https://gerrit.wikimedia.org/r/221376 [16:57:21] (03CR) 10BBlack: [C: 032 V: 032] enable SPDY header compression [puppet] - 10https://gerrit.wikimedia.org/r/221376 (owner: 10BBlack) [16:58:31] 6operations, 6Labs: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1410114 (10ArielGlenn) I'm going through all the labs instances and: converting those that still talk to virt1001 to the new saltmaster, generating shorter keys as we have for production, and testing. there a...
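[editor's note] The point made here can be demonstrated locally: the TLS handshake happens before any HTTP response, so a client that rejects the cert never sees the 301. This sketch uses a throwaway self-signed cert and made-up names and port; it assumes openssl, python3, and curl are available.

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# A self-signed cert, standing in for the mismatched/untrusted cert discussed.
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem \
    -days 1 -subj '/CN=old.example.org' 2>/dev/null

# Minimal HTTPS server that answers every request with a 301 redirect.
python3 - <<'EOF' &
import http.server, ssl

class Redirect(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301)
        self.send_header('Location', 'https://tools.wmflabs.org/')
        self.end_headers()
    def log_message(self, *args):
        pass

srv = http.server.HTTPServer(('127.0.0.1', 18443), Redirect)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain('cert.pem', 'key.pem')
srv.socket = ctx.wrap_socket(srv.socket, server_side=True)
srv.handle_request()   # consumed by the failed handshake below
srv.handle_request()   # serves the -k request
EOF
server_pid=$!
sleep 1

# With validation on, curl aborts during the handshake and never sees the 301.
verified=$(curl -sm5 https://127.0.0.1:18443/ 2>/dev/null || echo HANDSHAKE_FAILED)
echo "$verified"

# Only with validation disabled (-k) does the redirect come through.
code=$(curl -skm5 -o /dev/null -w '%{http_code}' https://127.0.0.1:18443/)
echo "redirect status with -k: $code"
wait "$server_pid"
```

This is why a "don't look here, look over here" redirect page cannot paper over a bad cert: browsers abort at the handshake, exactly as described at 16:55:15.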
[17:00:29] (03PS2) 10RobH: shell accounts cleanup: maryana & howief [puppet] - 10https://gerrit.wikimedia.org/r/221659 [17:01:02] (03CR) 10RobH: [C: 032] shell accounts cleanup: maryana & howief [puppet] - 10https://gerrit.wikimedia.org/r/221659 (owner: 10RobH) [17:01:06] (03PS3) 10Rush: rsync of phab dumps from iridium to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/221658 (https://phabricator.wikimedia.org/T103028) (owner: 10ArielGlenn) [17:01:15] (03CR) 10Rush: [C: 031] "seems ok, one note :)" [puppet] - 10https://gerrit.wikimedia.org/r/221658 (https://phabricator.wikimedia.org/T103028) (owner: 10ArielGlenn) [17:01:44] PROBLEM - puppet last run on bast1001 is CRITICAL Puppet has 1 failures [17:02:09] (03Abandoned) 10BBlack: Disallow indexing of non-canonical domains not covered by TLS Cert wildcards [puppet] - 10https://gerrit.wikimedia.org/r/219121 (owner: 10BBlack) [17:02:21] (03PS3) 10BBlack: No need for wgSecureLogin on our wikis, HTTPS is forced everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) [17:05:06] (03PS4) 10BBlack: No need for wgSecureLogin on our wikis, HTTPS is forced everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) [17:05:27] (03CR) 10BBlack: "Removed the labs part, as labs doesn't have working HTTPS right now..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) (owner: 10BBlack) [17:08:23] (03PS2) 10BBlack: resolv.conf: lower timeout from 3s to 1s, ++attempts [puppet] - 10https://gerrit.wikimedia.org/r/221304 (https://phabricator.wikimedia.org/T103921) [17:08:55] andrewbogott: its Rob [17:11:42] (03PS1) 10RobH: disabling avar's access [puppet] - 10https://gerrit.wikimedia.org/r/221662 [17:12:34] PROBLEM - puppet last run on analytics1010 is CRITICAL Puppet has 1 failures [17:14:14] robh: can I assign https://phabricator.wikimedia.org/T104211 to you? 
[17:14:42] (03CR) 10John F. Lewis: [C: 04-1] "should be listed in the absent group" [puppet] - 10https://gerrit.wikimedia.org/r/221662 (owner: 10RobH) [17:14:45] PROBLEM - puppet last run on stat1003 is CRITICAL Puppet has 1 failures [17:15:05] PROBLEM - puppet last run on analytics1002 is CRITICAL Puppet has 1 failures [17:15:07] 6operations, 6Labs, 10RESTBase, 10Traffic, and 2 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#1410178 (10GWicke) p:5Triage>3Normal [17:15:10] andrewbogott: uhh, why? [17:15:16] the question is do you want a new cert? [17:15:28] if you want one, then i guess yes, but if its going to deprecate and stop support and im not spending money [17:15:31] then no. [17:15:57] its like 115 annually for a non wildcard cert iirc. [17:16:00] robh: It’s just a page saying “don’t look here, look over here.” But it’s still up and throwing cert warnings now, which seems rude. [17:16:11] Maybe we can gather stats about whether anyone is actually hitting it. [17:16:27] basically dont assign it to me unless you already have a purchase justification [17:16:29] =] [17:16:41] ok, I’ll bring it up in the meeting. [17:16:42] cuz all i do is put the pricing on there and get mark to approve and if its not clear [17:16:44] 6operations, 6Labs, 10RESTBase, 10Traffic, and 2 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#1410181 (10GWicke) @bblack, do you think it is feasible / advisable to add a `/api/rest_v1/` rewrite in the wikitech nginx config? [17:16:47] How much $$$ are we talking? [17:16:48] he'll totally ask me, to ask you, etc.... [17:16:53] PROBLEM - puppet last run on stat1002 is CRITICAL Puppet has 1 failures [17:16:54] its like 115 annually for a non wildcard cert iirc.
[17:16:58] ok [17:17:09] but now i wanna check [17:18:44] andrewbogott: so my price was right, but we'll have to add toolserver to globalsign [17:18:46] but thats not a big deal [17:19:43] PROBLEM - puppet last run on analytics1001 is CRITICAL Puppet has 1 failures [17:19:55] PROBLEM - puppet last run on analytics1004 is CRITICAL Puppet has 1 failures [17:21:04] (03PS1) 10RobH: disabling werdna's access (user accounts cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/221663 [17:23:46] 6operations, 10RESTBase, 6Services, 7RESTBase-architecture: Update restbase100[1-6] to the 3.19 kernel - https://phabricator.wikimedia.org/T102234#1410218 (10GWicke) [17:25:57] robh: can you please add the accounts to the absent group if you're removing them :) [17:26:12] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: automated invocation of Cassandra repair jobs - https://phabricator.wikimedia.org/T92355#1410236 (10GWicke) [17:26:29] 6operations, 10RESTBase, 10RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1410238 (10GWicke) [17:26:39] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1410240 (10GWicke) [17:27:03] (03PS1) 10RobH: user accounts cleanup: diederik [puppet] - 10https://gerrit.wikimedia.org/r/221665 [17:27:04] JohnFLewis: absent group? bleh.... 
[17:27:27] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1410249 (10GWicke) [17:27:28] i didnt notice it [17:27:35] I mentioned it earlier but you missed it and I -1'd one of your patches with it so :) [17:27:48] well i was just patching along and in office [17:27:50] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1410253 (10GWicke) [17:27:50] so i cannot hear pings [17:27:58] office is horrible for productivity you see. [17:28:21] I've been told :) [17:28:58] (03PS2) 10RobH: user accounts cleanup: diederik [puppet] - 10https://gerrit.wikimedia.org/r/221665 [17:29:04] paravoid, YuviPanda, I propose we skip this week’s labs checkin and just contribute to the phab sprint. because 1) no Coren and 2) we have an outage to fix [17:29:13] we have an outage? [17:29:17] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1410264 (10GWicke) a:5RobH>3Eevans [17:29:28] mark: ^^^ [17:29:30] YuviPanda: did you fix it already? [17:29:33] andrewbogott: the toollabs homepage thing? I failed over, it's ok now. [17:29:40] ok, nevermind then :) [17:29:42] I’ll call in [17:29:46] 6operations, 10Mathoid, 10RESTBase, 6Services: Document and hook up public mathoid end point in RB - https://phabricator.wikimedia.org/T102030#1410269 (10GWicke) [17:30:32] (03PS3) 10RobH: user accounts cleanup: diederik [puppet] - 10https://gerrit.wikimedia.org/r/221665 [17:31:25] (03CR) 10RobH: [C: 032] "diederik has been gone for over a year, disabling." 
[puppet] - 10https://gerrit.wikimedia.org/r/221665 (owner: 10RobH) [17:34:37] 6operations, 3Labs-Sprint-104, 5Patch-For-Review: Install molly-guard on production hosts - https://phabricator.wikimedia.org/T103873#1410285 (10yuvipanda) [17:35:16] (03PS1) 10RobH: absented user group cleanup [puppet] - 10https://gerrit.wikimedia.org/r/221667 [17:36:01] (03PS2) 10RobH: absented user group cleanup [puppet] - 10https://gerrit.wikimedia.org/r/221667 [17:36:32] (03CR) 10RobH: [C: 032] absented user group cleanup [puppet] - 10https://gerrit.wikimedia.org/r/221667 (owner: 10RobH) [17:37:25] (03PS18) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [17:41:54] (03Abandoned) 10RobH: disabling avar's access [puppet] - 10https://gerrit.wikimedia.org/r/221662 (owner: 10RobH) [17:42:30] (03PS2) 10RobH: disabling werdna's access (user accounts cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/221663 [17:43:15] abandoned, robh? [17:45:00] (03PS3) 10RobH: disabling werdna's access (user accounts cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/221663 [17:45:18] i had a clusterfuck of bad commits and merges and rebases [17:45:30] so i ditched the one that was fubaring my others and will redo it ;] [17:45:35] PROBLEM - Check status of defined EventLogging jobs on analytics1010 is CRITICAL Stopped EventLogging jobs: processor/server-side-events-kafka processor/client-side-events-kafka forwarder/server-side-raw forwarder/client-side-raw [17:46:02] (03CR) 10RobH: [C: 032] disabling werdna's access (user accounts cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/221663 (owner: 10RobH) [17:46:09] that is fine, my fault. am experimenting on analytics1010 [17:46:19] (03CR) 10RobH: "If it turns out andrew isn't gone, this can be reverted."
[puppet] - 10https://gerrit.wikimedia.org/r/221663 (owner: 10RobH) [17:46:21] 6operations, 3Labs-Sprint-104: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1410360 (10Andrew) [17:48:13] RECOVERY - puppet last run on analytics1010 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:48:28] (03PS1) 10BBlack: primary ssl services -> unified-only, not SNI [puppet] - 10https://gerrit.wikimedia.org/r/221670 [17:50:15] RECOVERY - puppet last run on stat1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:50:43] RECOVERY - puppet last run on analytics1002 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:52:25] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:52:33] RECOVERY - puppet last run on analytics1001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:53:30] (03PS19) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [17:55:15] RECOVERY - puppet last run on analytics1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:57:14] RECOVERY - puppet last run on bast1001 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:58:17] 10Ops-Access-Requests, 6operations: Grant dcausse sudo on the search cluster - https://phabricator.wikimedia.org/T104222#1410752 (10Manybubbles) 3NEW [18:00:22] !log powering down ms-be1015 [18:00:26] Logged the message, Master [18:06:12] (03PS20) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [18:10:28] hello, any ops wants to merge a contint puppet change that is already deployed? 
https://gerrit.wikimedia.org/r/#/c/220658/ :) [18:11:29] they're in a meeting, so chances are both reduced and increased [18:11:35] heh [18:15:57] (03PS4) 10Andrew Bogott: contint: Create symlink for composer in /usr/local/bin/ [puppet] - 10https://gerrit.wikimedia.org/r/220658 (owner: 10Legoktm) [18:16:51] 6operations: Redis lua sandbox bypass - https://phabricator.wikimedia.org/T101397#1410841 (10MoritzMuehlenhoff) 5Open>3Resolved All updated [18:16:52] (03CR) 10Andrew Bogott: [C: 032] contint: Create symlink for composer in /usr/local/bin/ [puppet] - 10https://gerrit.wikimedia.org/r/220658 (owner: 10Legoktm) [18:17:22] andrewbogott: woo, thanks :) and maybe https://gerrit.wikimedia.org/r/#/c/220694/ too? :-) [18:17:31] I’m in a meeting :p [18:17:48] no rush :) [18:17:49] 6operations, 10Traffic: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1410845 (10BBlack) 3NEW [18:18:04] legoktm: that one needs a manual rebase [18:18:22] (03PS21) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [18:18:40] * legoktm does [18:18:51] !log ori Synchronized php-1.26wmf11/includes/db/LoadBalancer.php: I0e5f2d3b2: Use APC for caching slave lag times (duration: 01m 09s) [18:18:55] Logged the message, Master [18:19:04] legoktm: rush is in the meeting as well, I think? [18:19:11] (terrible joke about chasemp's nick) [18:19:18] lol [18:19:21] YuviPanda: good try [18:19:39] YuviPanda: are you sure you're not a dad? [18:19:42] (03PS3) 10Legoktm: contint: Add 'libffi-dev' package [puppet] - 10https://gerrit.wikimedia.org/r/220694 (https://phabricator.wikimedia.org/T103775) [18:19:49] greg-g: I had a scare once but I'm pretty sure I'm not. [18:19:53] :P [18:19:54] andrewbogott: weird, it rebased fine for me locally [18:20:04] it’s because of the cassandra submodule thing [18:20:04] YuviPanda: TMI [18:20:22] ah [18:20:23] (03CR) 10Legoktm: "PS3: Rebased." 
[puppet] - 10https://gerrit.wikimedia.org/r/220694 (https://phabricator.wikimedia.org/T103775) (owner: 10Legoktm) [18:20:24] greg-g: yw :) [18:21:08] greg-g: btw, no Staging / deployment-prep related stuff in https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q1_Goals#Release_Engineering, so not much dependence on ops merging patches... [18:22:19] (03CR) 10Andrew Bogott: [C: 032] contint: Add 'libffi-dev' package [puppet] - 10https://gerrit.wikimedia.org/r/220694 (https://phabricator.wikimedia.org/T103775) [18:22:26] YuviPanda: yeah, nothing more than the standard churn [18:22:31] cool [18:22:31] unfortunately [18:22:34] ty :) [18:22:35] yeah :( [18:23:23] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 2 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1410880 (10yuvipanda) [18:26:08] (03PS22) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [18:33:09] (03PS3) 10Yuvipanda: base: Install molly-guard everywhere [puppet] - 10https://gerrit.wikimedia.org/r/221111 (https://phabricator.wikimedia.org/T103873) [18:33:23] (03CR) 10Yuvipanda: [C: 032 V: 032] "Sneaking this in while everyone else is in a meeting!"
[puppet] - 10https://gerrit.wikimedia.org/r/221111 (https://phabricator.wikimedia.org/T103873) (owner: 10Yuvipanda) [18:35:04] 6operations, 3Labs-Sprint-104, 5Patch-For-Review: Install molly-guard on production hosts - https://phabricator.wikimedia.org/T103873#1410950 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Done [18:35:19] (03PS23) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [18:39:41] 7Puppet, 6operations, 6Labs: Labs puppet breaks for projects without a Hiera: page on wikitech - https://phabricator.wikimedia.org/T101913#1410960 (10yuvipanda) 5Open>3Resolved a:3yuvipanda [18:42:44] PROBLEM - puppet last run on restbase1003 is CRITICAL Puppet has 1 failures [18:46:23] RECOVERY - Host ms-be1015 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [18:48:04] RECOVERY - High load average on ms-be1015 is OK - load average: 52.25, 20.36, 7.48 [18:52:45] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1411002 (10BBlack) 5stalled>3Open [18:53:12] (03CR) 10Milimetric: [C: 031] "Change looks good to me (I don't have +2 here btw) except the naming problems.
It seems the names of the folders leading up to here devia" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T101118) (owner: 10Joal) [18:53:25] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1411003 (10BBlack) p:5Triage>3Normal [18:53:28] (03CR) 10Milimetric: [C: 04-1] "oops, meant to -1 for the name" [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T101118) (owner: 10Joal) [18:54:30] 6operations, 10Traffic: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1411011 (10BBlack) p:5Triage>3High [18:56:14] RECOVERY - Check status of defined EventLogging jobs on analytics1010 is OK All defined EventLogging jobs are runnning. [18:57:28] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums: terbium et al - php-luasandbox must install without errors and luasandbox must be enabled - https://phabricator.wikimedia.org/T101583#1411040 (10akosiaris) 5Open>3Resolved This bug has been fixed in 2.0.10. I just ran ``` apt-get install php-luasand... [18:59:51] 6operations, 10ops-eqiad: ms-be1015 idrac not working, no more sessions - https://phabricator.wikimedia.org/T104161#1411071 (10Cmjohnson) 5Open>3Resolved Fixed [19:02:10] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1411103 (10Cmjohnson) The card is installed and connected to the up-link module on asw-b4-eqiad. The ports have been enabled. I do not get any link lights on the card. I verified the card is being seen by... [19:02:48] (03PS24) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [19:05:13] (03PS2) 10BBlack: primary ssl services -> unified-only, not SNI [puppet] - 10https://gerrit.wikimedia.org/r/221670 [19:10:29] (03CR) 10Rush: [C: 031] "Is there a way to know if it indeed falling back to secondary for success? 
If this is a mitigation for that we should probably pursue tha" [puppet] - 10https://gerrit.wikimedia.org/r/221304 (https://phabricator.wikimedia.org/T103921) (owner: 10BBlack) [19:13:45] (03CR) 10BBlack: "In the pybal case that triggered all of this, I'm pretty confident it's losing the first request to the 3s timeout. Investigating why chr" [puppet] - 10https://gerrit.wikimedia.org/r/221304 (https://phabricator.wikimedia.org/T103921) (owner: 10BBlack) [19:16:44] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1411211 (10BBlack) (or alternatively, should I just ignore the udeb error because we don't care about 1.0.2c during install-time?) [19:19:05] (03PS1) 10Alex Monk: More wikitech cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221729 (https://phabricator.wikimedia.org/T75939) [19:22:34] (03PS2) 10Joal: Add Pageviews/LegacyPageviews to metrics website [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T104003) [19:22:52] (03CR) 10Joal: Add Pageviews/LegacyPageviews to metrics website (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T104003) (owner: 10Joal) [19:23:32] (03CR) 10Rush: "Read the ticket again and makes total sense to me what you are doing" [puppet] - 10https://gerrit.wikimedia.org/r/221304 (https://phabricator.wikimedia.org/T103921) (owner: 10BBlack) [19:25:45] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1411259 (10MoritzMuehlenhoff) >>! In T104143#1411211, @BBlack wrote: > (or alternatively, should I just ignore the udeb error because we don't care about 1.0.2c during install-time?) That would work as well;...
[19:26:02] (03PS1) 10TheDJ: Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) [19:26:05] (03CR) 10jenkins-bot: [V: 04-1] Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) (owner: 10TheDJ) [19:29:49] (03PS3) 10BBlack: resolv.conf: lower timeout from 3s to 1s, ++attempts [puppet] - 10https://gerrit.wikimedia.org/r/221304 (https://phabricator.wikimedia.org/T103921) [19:32:03] (03CR) 10BBlack: [C: 032] resolv.conf: lower timeout from 3s to 1s, ++attempts [puppet] - 10https://gerrit.wikimedia.org/r/221304 (https://phabricator.wikimedia.org/T103921) (owner: 10BBlack) [19:32:23] (03PS2) 10TheDJ: Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) [19:32:36] (03PS25) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [19:35:19] (03PS3) 10BBlack: primary ssl services -> unified-only, not SNI [puppet] - 10https://gerrit.wikimedia.org/r/221670 [19:36:22] 10Ops-Access-Requests, 6operations: Grant dcausse sudo on the search cluster - https://phabricator.wikimedia.org/T104222#1411317 (10Krenair) [19:36:59] 6operations, 10Traffic: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1411321 (10aaron) T102916 is likely a side effect of excess purges. 
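[Editor's note: the resolv.conf change merged above lowers the glibc resolver's per-nameserver timeout and raises the retry count. A sketch of the resulting options line; the exact attempts value is an assumption, the commit message only says "++attempts":]

```
# /etc/resolv.conf (sketch, not the actual deployed file)
# per-nameserver timeout lowered from 3s to 1s, retry attempts raised
options timeout:1 attempts:3
```

[With timeout:1, a dead first nameserver costs one second instead of three before the resolver falls back to the next entry, which matches the pybal symptom BBlack describes below.]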
[19:39:03] RECOVERY - puppet last run on lvs1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:43:12] (03PS4) 10BBlack: primary ssl services -> unified-only, not SNI [puppet] - 10https://gerrit.wikimedia.org/r/221670 [19:43:14] PROBLEM - puppet last run on mw2182 is CRITICAL puppet fail [19:43:14] (03PS1) 10BBlack: create ssl::unified as a non-SNI alternative to ssl::sni [puppet] - 10https://gerrit.wikimedia.org/r/221741 [19:48:16] 6operations, 10Traffic: Preload HSTS - https://phabricator.wikimedia.org/T104244#1411365 (10BBlack) 3NEW [19:49:49] 6operations, 10Traffic: Fix/decom multiple-subdomain wikis in wikimedia.org - https://phabricator.wikimedia.org/T102826#1411376 (10BBlack) [19:49:52] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom www.$lang hostnames/redirects - https://phabricator.wikimedia.org/T102815#1411377 (10BBlack) [19:49:54] 6operations, 10Traffic, 7HTTPS: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1411378 (10BBlack) [19:49:55] 6operations, 10Traffic: Preload HSTS - https://phabricator.wikimedia.org/T104244#1411375 (10BBlack) [19:50:09] 6operations, 10Traffic: Preload HSTS - https://phabricator.wikimedia.org/T104244#1411379 (10BBlack) p:5Triage>3Normal [19:57:15] (03PS26) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [19:57:45] RECOVERY - puppet last run on mw2182 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150629T2000). Please do the needful. 
[20:10:29] hey godog [20:12:03] (03PS2) 10BBlack: redirects: use separate ServerAlias directives for each alias [puppet] - 10https://gerrit.wikimedia.org/r/221291 [20:12:38] (03PS1) 10Ori.livneh: Add tessera module [puppet] - 10https://gerrit.wikimedia.org/r/221747 [20:13:10] 6operations, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1411415 (10kevinator) [20:13:32] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 7Monitoring: Replace uses of monitoring::ganglia with monitoring::graphite_* - https://phabricator.wikimedia.org/T90642#1411423 (10kevinator) [20:13:47] akosiaris, apergos git deploy restart for parsoid deploy failed for me as well /cc cscott [20:13:52] will use the workaround documented in that ticket. [20:19:12] !log deployed parsoid sha ea98be88 [20:19:17] Logged the message, Master [20:19:37] (03PS2) 10Ori.livneh: Add tessera module [puppet] - 10https://gerrit.wikimedia.org/r/221747 [20:20:28] (03PS2) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [20:20:33] (03CR) 10jenkins-bot: [V: 04-1] Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [20:20:51] (03PS3) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [20:21:00] paravoid: ^ fixed the error (hopefully - my perl is non-existent atm) [20:23:14] (03PS4) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [20:23:19] (03CR) 10Ori.livneh: Add tessera module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/221747 (owner: 10Ori.livneh) 
[20:24:14] (03CR) 10Yuvipanda: "Updated to hopefully work correctly on two cases:" [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [20:24:39] (03CR) 10Yuvipanda: [C: 04-1] Labs: small race condition fix in replica-addusers.pl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [20:27:45] 7Blocked-on-Operations, 6operations, 10Parsoid: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1411461 (10cscott) From IRC, @ssastry confirmed that `git deploy service restart` doesn't work for him, either: ``` (04:13:47... [20:46:12] (03PS1) 10Jdlrobson: Enable Gather flagging on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221754 (https://phabricator.wikimedia.org/T97704) [20:46:30] (03PS2) 10Jdlrobson: Enable Gather flagging on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221754 (https://phabricator.wikimedia.org/T97704) [20:47:32] 6operations, 10Traffic, 7HTTPS: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1411534 (10Reedy) Do we have any DNS or varnish/apache stats of hits to these URLs? Presumably they're still not needed, but I honestly couldn't tell you the amount of people still... [20:53:53] ori: yo! 
[20:57:32] (03PS1) 10Ottomata: Enable auto commit for kafka consumer input in eventlogging processors on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/221756 [20:58:04] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [20:58:17] (03CR) 10Ottomata: [C: 032 V: 032] Enable auto commit for kafka consumer input in eventlogging processors on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/221756 (owner: 10Ottomata) [20:58:33] RECOVERY - Host mw1085 is UP: PING WARNING - Packet loss = 64%, RTA = 2.37 ms [21:00:26] godog: sent a tessera puppet module yr way ;) [21:02:56] ori: sweet, I'll take a look later, thanks! [21:04:47] (03CR) 10Yuvipanda: "Nothing uses star.wmflabs.crt, and I'm ok babysitting this on the 5 instances that have the cert :)" [puppet] - 10https://gerrit.wikimedia.org/r/221167 (https://phabricator.wikimedia.org/T104017) (owner: 10RobH) [21:07:07] (03CR) 10BryanDavis: Labs: small race condition fix in replica-addusers.pl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [21:08:08] (03CR) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [21:08:44] (03PS2) 10Yuvipanda: changing *.wmflabs.org from sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/221167 (https://phabricator.wikimedia.org/T104017) (owner: 10RobH) [21:08:50] (03CR) 10Yuvipanda: [C: 032 V: 032] changing *.wmflabs.org from sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/221167 (https://phabricator.wikimedia.org/T104017) (owner: 10RobH) [21:15:05] (03PS2) 10Filippo Giunchedi: diamond: add cassandra collector for basic metrics [puppet] - 10https://gerrit.wikimedia.org/r/220650 (https://phabricator.wikimedia.org/T78514) [21:15:07] (03PS1) 10Eevans: testing a higher reporting interval [puppet] -
10https://gerrit.wikimedia.org/r/221763 [21:15:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] diamond: add cassandra collector for basic metrics [puppet] - 10https://gerrit.wikimedia.org/r/220650 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [21:16:03] (03PS2) 10Filippo Giunchedi: cassandra: testing a higher reporting interval [puppet] - 10https://gerrit.wikimedia.org/r/221763 (owner: 10Eevans) [21:16:26] (03Abandoned) 10Chad: Elastic: Don't hold data on master nodes [puppet] - 10https://gerrit.wikimedia.org/r/218421 (owner: 10Chad) [21:16:30] (03PS3) 10Filippo Giunchedi: cassandra: testing a higher reporting interval [puppet] - 10https://gerrit.wikimedia.org/r/221763 (https://phabricator.wikimedia.org/T102015) (owner: 10Eevans) [21:16:58] (03PS4) 10Filippo Giunchedi: cassandra: testing a higher reporting interval [puppet] - 10https://gerrit.wikimedia.org/r/221763 (https://phabricator.wikimedia.org/T102015) (owner: 10Eevans) [21:17:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: testing a higher reporting interval [puppet] - 10https://gerrit.wikimedia.org/r/221763 (https://phabricator.wikimedia.org/T102015) (owner: 10Eevans) [21:17:49] (03PS5) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [21:18:07] bd808: ^ if you still want to help a perl noobie [21:18:14] who is also trying to avoid learning it [21:18:42] np, it'll slip out of your mind pretty quickly [21:18:54] godog: :D [21:18:59] I already learnt about qw! [21:21:11] (03PS4) 10Chad: Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 [21:21:16] IdiotPanda: hehe lots of amusing readings in perl-doc, e.g. perldoc perlretut [21:22:02] _joe_: 207140 should be trivial and zero actual change. [21:22:30] also perlop [21:23:21] IdiotPanda: I *think* that will work, yeah.
I had to lookup if chown/chmod in perl grokked operating on file handles but it looks like they do. [21:23:38] bd808: yeah, I searched that too, since otherwise I'd have to make it fchmod [21:23:48] bd808: but why was it MYCNF instead of $mycnf? [21:23:57] convention [21:24:57] bd808: HAHA, and they're equivalent?! [21:25:01] just file handles are ALLCAPS [21:25:32] (03CR) 10Tim Landscheidt: [C: 04-1] Labs: small race condition fix in replica-addusers.pl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [21:26:08] IdiotPanda: "An older style is to use a bareword as the filehandle, as ..." -- http://perldoc.perl.org/functions/open.html [21:26:13] (03CR) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [21:26:26] "Note that it's a global variable, so this form is not recommended in new code." [21:26:40] (03PS1) 10Chad: Elastic: Unify default plugins directory in /srv/deployment [puppet] - 10https://gerrit.wikimedia.org/r/221766 [21:26:56] but it looks right to someone like me who learned Perl ~20 years ago [21:27:03] bd808: aaah, I see :) [21:27:06] bd808: so my code is ok [21:27:51] IdiotPanda: get off bd808s lawn [21:27:54] yeah I think it's doing the right thing. Maybe add "my" to limit the var scope [21:28:09] "my $tmpfile, $tmpfilepath = tempfile(DIR => $dir);" [21:28:30] oh, I see [21:28:36] my is needed by default? [21:28:45] and applies to tuple unpacking? [21:29:57] I ... think so yes [21:30:03] !log restarting restbase1004 to apply new metrics reporting interval [21:30:07] Logged the message, Master [21:30:24] bd808: ugh, no 'my' in a lot of other places. 
[21:30:32] rewriting this in python is on the cards [21:30:36] (03PS1) 10Ori.livneh: Add tessera.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/221767 [21:31:13] pathetically eclectic rubbish lister is a perfectly good language for system scripts ;) [21:31:38] (03PS3) 10Ori.livneh: Add tessera module and role; apply on graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/221747 [21:31:46] bd808: :) that isn't the only problem with this, of course. it assumes all projects have NFS homes and that everyone wants a mysql account by default :) [21:31:49] (03CR) 10Milimetric: [C: 031] "Sweet. Ottomata you can merge this." [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T104003) (owner: 10Joal) [21:32:04] (03PS6) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [21:32:18] bd808: ^ :) also see Tim's question? [21:32:38] (03CR) 10Ori.livneh: "Should be good to go now." [puppet] - 10https://gerrit.wikimedia.org/r/221747 (owner: 10Ori.livneh) [21:33:32] ToothPainPanda: the my bit is not a big deal at all IMO if the current script has mixed global/local usage [21:33:52] bd808: yeah, but I learnt about 'my'! [21:34:03] it's like "var" in js [21:34:42] right [21:34:48] I hope perl doesn't do lifting [21:35:12] bd808: btw, I wrote https://tools.wmflabs.org/watroles/role/role::puppet::self the other day to replace writing SMW queries [21:35:33] can filter roles or variables (click around) [21:37:51] (03PS1) 10RobH: disabling avar's shell account [puppet] - 10https://gerrit.wikimedia.org/r/221769 [21:38:28] ToothPainPanda: thinking about Tim's comment....
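[Editor's note: the race-condition fix under review (replica-addusers.pl, Gerrit 218880) writes the credentials file to a tempfile and renames it into place. Since a Python rewrite "is on the cards", here is a rough Python sketch of that write-tempfile-then-rename pattern; the function name `write_atomically` and the 0600 mode are illustrative assumptions, not the actual script's code:]

```python
import os
import tempfile

def write_atomically(path, content, mode=0o600):
    """Write `content` to `path` via a same-directory temp file plus rename.

    tempfile.mkstemp() creates the file with a random name and O_EXCL,
    so a pre-planted symlink cannot redirect the write; os.replace()
    then swaps the finished file into place atomically (POSIX rename
    semantics, within one filesystem).
    """
    directory = os.path.dirname(path) or "."
    fd, tmppath = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            # set permissions on the open descriptor, like Perl's chmod on a handle
            os.fchmod(fd, mode)
            f.write(content)
        # atomic: readers see the old file or the new one, never a partial write
        os.replace(tmppath, path)
    except BaseException:
        os.unlink(tmppath)
        raise
```

[The temp file must live in the same directory as the target, because rename() cannot cross filesystems; that is also why the Perl code passes DIR => $dir to tempfile().]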
[21:38:33] (03CR) 10RobH: [C: 032] disabling avar's shell account [puppet] - 10https://gerrit.wikimedia.org/r/221769 (owner: 10RobH) [21:39:03] (03PS1) 10Yuvipanda: labstore: Move projects config to labstore module [puppet] - 10https://gerrit.wikimedia.org/r/221770 [21:39:32] Reedy: I hopefully will get on his lawn and off it sometime this year [21:40:30] ToothPainPanda: you are operating on the open file handle so if they did symlink attack then your handle should still be to the original inode, but the rename later would maybe do something weird. Not my areas of expertise really but Faidon will point out if it's broke for sure. [21:40:41] bd808: oh yeah, the rename... [21:40:55] (03PS2) 10Yuvipanda: labstore: Move projects config to labstore module [puppet] - 10https://gerrit.wikimedia.org/r/221770 [21:42:20] (03PS3) 10Yuvipanda: labstore: Move projects config to labstore module [puppet] - 10https://gerrit.wikimedia.org/r/221770 [21:43:28] ToothPainPanda: perl's tempfile() makes random names AFAIK. 
I don't know that you can do much better really [21:43:37] true [21:44:20] (03CR) 10Tim Landscheidt: Labs: small race condition fix in replica-addusers.pl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [21:44:22] (03PS4) 10Yuvipanda: labstore: Move projects config to labstore module [puppet] - 10https://gerrit.wikimedia.org/r/221770 [21:44:32] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Move projects config to labstore module [puppet] - 10https://gerrit.wikimedia.org/r/221770 (owner: 10Yuvipanda) [21:44:48] bd808: tim says it's ok :) [21:44:56] bd808: I'll still wait for paravoid to look however [21:48:37] PROBLEM - Apache HTTP on mw1085 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.090 second response time [21:48:53] PROBLEM - HHVM rendering on mw1085 is CRITICAL - Socket timeout after 10 seconds [21:49:14] (03PS1) 10Yuvipanda: labstore: Disable NFS (except scratch) for maps-team [puppet] - 10https://gerrit.wikimedia.org/r/221771 (https://phabricator.wikimedia.org/T103757) [21:49:19] (03CR) 10jenkins-bot: [V: 04-1] labstore: Disable NFS (except scratch) for maps-team [puppet] - 10https://gerrit.wikimedia.org/r/221771 (https://phabricator.wikimedia.org/T103757) (owner: 10Yuvipanda) [21:49:28] (03PS2) 10Yuvipanda: labstore: Disable NFS (except scratch) for maps-team [puppet] - 10https://gerrit.wikimedia.org/r/221771 (https://phabricator.wikimedia.org/T103757) [21:49:34] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Disable NFS (except scratch) for maps-team [puppet] - 10https://gerrit.wikimedia.org/r/221771 (https://phabricator.wikimedia.org/T103757) (owner: 10Yuvipanda) [21:50:04] (03PS1) 10Filippo Giunchedi: cassandra: enable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/221773 (https://phabricator.wikimedia.org/T78514) [21:50:49] (03PS2) 10Filippo Giunchedi: cassandra: enable diamond collector [puppet] - 
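[Editor's aside: the tempfile-plus-rename concern above is the classic atomic-update pattern. A minimal Python sketch of the safe version, assuming a hypothetical target path: create a randomly named temp file in the destination directory (so a pre-planted symlink can't redirect the write), work through the open handle, then atomically replace:]

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path atomically: temp file in the same directory,
    then os.replace(), which is atomic on POSIX filesystems."""
    dirname = os.path.dirname(os.path.abspath(path))
    # mkstemp opens a randomly named file with O_EXCL, so the handle is
    # guaranteed to be a fresh regular file, not an attacker's symlink.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp)
        raise


atomic_write("demo.conf", "user=example\n")
print(open("demo.conf").read())  # prints user=example
```

Because the rename stays within one directory (and thus one filesystem), there is no cross-device copy step during which the target could be observed half-written.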
10https://gerrit.wikimedia.org/r/221773 (https://phabricator.wikimedia.org/T78514) [21:50:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: enable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/221773 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [21:51:13] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [21:51:15] RECOVERY - HHVM rendering on mw1085 is OK: HTTP OK: HTTP/1.1 200 OK - 69002 bytes in 0.140 second response time [21:53:30] 6operations, 6Labs, 3Labs-Sprint-104, 5Patch-For-Review: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1411788 (10yuvipanda) 5Open>3Resolved Done! [21:53:33] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1411790 (10yuvipanda) [21:54:36] (03PS1) 10RobH: disable qchris's shell account [puppet] - 10https://gerrit.wikimedia.org/r/221775 [21:55:21] robh: Will that still allow me to log into ytterbium? [21:55:24] (03PS2) 10RobH: disable qchris's shell account [puppet] - 10https://gerrit.wikimedia.org/r/221775 [21:55:45] lol. [21:55:47] (Phabricator does not allow me to see the task) [21:55:47] qchris: ahhhhh yay, i didnt merge [21:55:53] qchris: i was about to kill it ALL man [21:56:02] so... are you still working for us and what are you supposed to have? [21:56:08] (its not merged yet ;) [21:56:28] As the change says ... I am a volunteer these days, not a contractor any longer. [21:56:36] But I am still a Gerrit admin and have to log in to [21:56:40] qchris stopped getting paid, but didn't stop doing (high-impact) work [21:56:45] ytterbium from time to time to look at the logs.
[21:56:54] ok, so you need to gerrit admin still and such, well, glad i didnt merge [21:57:06] qchris: for the record, i wasnt gunning for you, it just turned up in a task with a bunch of other users =] [21:57:16] robh: no worries :-) [21:57:30] +1 to qchris retaining gerrit shell access. [21:57:35] otherwise i'm all alone! [21:57:45] oh, +1 to making it all chads problem! [21:57:45] but i thought you loved java long time? [21:57:45] ;D [21:58:08] qchris: ok, so im going to leave your access intact for gerrit [21:58:10] what about deploy? [21:58:29] also you are in restricted (private data) which is the terbium and the like [21:58:40] i guess i shouldnt have killed that task i made, reopening so we can get it on record as a volunteer you have X [22:00:21] I have a nice big table on meta you could use :p [22:04:00] robh: When leaving the WMF, ottomata wanted me to retain access to cluster and deploying analytics code. [22:04:04] Not sure ... [22:04:17] since I did not use it the past two months, I guess those are fine to kill. [22:04:20] (03Abandoned) 10RobH: disable qchris's shell account [puppet] - 10https://gerrit.wikimedia.org/r/221775 (owner: 10RobH) [22:04:36] well, i'll just put them all on the task [22:04:47] and you guys can spell out which ones aren't needed there [22:05:05] I lack permission to view the task, [22:05:30] so I guess you'll have to have this discussion without me :-) [22:05:58] Krenair: heh, who had root as staff, but is now volunteer and still has it? [22:06:02] Krenair: we've had it go the other way [22:06:09] Ryan [22:06:10] qchris: ... oh.. ill make this one open for you [22:06:12] ryan doesnt have root [22:06:36] robh: laner, not kaldari [22:06:39] he may have kept it post employee, but we had him as a contractor [22:06:40] yep [22:06:41] i know [22:06:47] he's in the ops group [22:07:05] (03CR) 10BBlack: [C: 04-1] "I really haven't properly reviewed all of this, but noted tempfile discussion in -ops and took a peek at just that.
In a broader scope (a" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [22:07:09] ahh, roots group, not root root, true [22:07:16] wait, so... [22:07:19] .... that should likely be fixed! [22:07:26] laner doesn't actually have root on the servers? [22:07:27] i dunno if he works in that stuff anymore [22:07:30] just in the ops group? [22:07:42] well, he doesnt know the root password, but as you point out he is in the sudo group [22:07:43] so meh [22:07:54] for all intents and purposes he indeed still does [22:07:56] so he can sudo as root [22:07:59] you are correct ;D [22:08:13] yeah, I'm gonna go ahead and count that as root for these purposes [22:08:19] robh: plus midom [22:08:27] was he ever staff? [22:08:27] domas was never paid! [22:08:36] he has always been board or volunteer iirc [22:08:38] hehe [22:09:01] I still log in sometimes! [22:09:09] try to avoid being destructive! [22:09:11] now he's just a bored volunteer ;) [22:09:16] I have lots of fun at work [22:09:17] heh [22:09:24] we run this facebook-like social network [22:09:26] that is called facebook [22:09:33] \o/ [22:09:51] "It's just like Facebook but with Domas" [22:09:55] domas: facebook is much better than facebook [22:10:11] qchris: https://phabricator.wikimedia.org/T104254 is open enough to view now [22:10:12] 6operations, 6Security, 7Security-General: determine validity of Christian Aistleitner (qchris's) shell account - https://phabricator.wikimedia.org/T104254#1411866 (10RobH) [22:10:23] if you had invented facebook you would have invented facebook, &c. [22:10:24] the security was inherited from the parent [22:10:27] now, how many rainbow colored images did wikimedia deliver [22:10:32] and how many did facebook, eh!? [22:10:36] robh: Thanks. [22:10:37] where's the pure joy and awesomeness?!!? [22:11:01] domas: the extension hasn't passed security review yet ;) [22:11:11] JohnFLewis: 2018?
[22:11:19] probably [22:11:38] robh, wow, that is quite a list of groups! [22:11:45] okay, so which of those would not be given to volunteers? [22:12:01] JohnFLewis: funny, our intern built that rainbow thing as an employee-only experiment a week or two ago... :) [22:12:11] what rainbow thing? [22:12:27] ori: WHICH PLANET ARE YOU AT?! [22:12:35] ugh, rainbows [22:13:12] Krenair: my point was more of 'its unusual for someone to quit and retain all these rights' [22:13:13] domas: The only Facebook account i have is on https://reviews.facebook.net :) [22:13:16] just poorly worded [22:13:23] ostriches, got an overdose of them? [22:13:26] http://www.theatlantic.com/technology/archive/2015/06/were-all-those-rainbow-profile-photos-another-facebook-experiment/397088/ [22:13:26] Oh, actually, not true. I think I created one recently for the HHVM group. [22:13:31] but qchris is obviously a special exception, much like domas =] [22:13:35] robh, okay, that sounds much more reasonable :p [22:13:45] qchris: I paved the road for you! [22:14:02] though I never quit [22:14:19] domas: Thanks :-) [22:14:24] * qchris hugs domas. [22:14:57] * domas didn't do anything truly bad with cluster access yet [22:15:11] there's still time before your H1 :P [22:15:20] are you doing peer reviews?!!? [22:15:24] domas: you probably ssh to random mw* machines and check when hhvm was last restarted. 'a week! we did it 10 seconds ago' [22:15:27] domas, well - you promoted yourself to Occitan admins [22:15:35] MaxSem: lol [22:15:45] forgot about that one! [22:15:57] funny, I had less trouble for taking whole es.wikipedia down [22:16:04] than that tiny occitan incident [22:16:23] ugh, rainbows < inorite? [22:16:41] guillom: hah [22:16:54] wasn't guillom pouring oil into the flame for the occitan one?! [22:17:22] What Occitan thing are we talking about? [22:17:35] the one over which my root access was threatened... [22:17:47] (seriously!) [22:17:48] I don't think so. When was that? 
[22:17:53] I do not recall this. [22:17:55] guillom, https://oc.wikipedia.org/wiki/Discussion_Utilizaire:Midom [22:17:57] ages ago! [22:17:57] It was before me? [22:18:02] robh: no, rather new [22:18:04] * hoo recalls it [22:18:28] oh, 2012 [22:18:37] Oh come on, there was no attempt to anger the community, there was lack of effort to please the community :-) Midom (d) 23 genièr de 2012 a 19.24 (UTC) [22:18:41] ok, i LOL'd [22:19:06] I plead not guilty: https://oc.wikipedia.org/wiki/Especial:Contribucions/Guillom [22:19:16] (Unless it was elsewhere and I genuinely don't remember.) [22:19:18] it poured into meta and few other places [22:19:27] was escalated via ED [22:19:29] and CTO [22:19:31] and what not [22:19:34] Ugh. [22:19:48] well, 2012 i wasnt yet in SF but in DC in my lonely little hermit hole [22:20:01] so it passed me by without notice [22:21:55] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L10637 [22:23:07] lol [22:23:34] uhm, and it doesn't work anymore on hhvm [22:24:38] so... ocwiki is broken by hhvm and nobody has realised? [22:24:57] not broken, but not smitten anymore [22:25:14] 6operations: Create instrumentation to monitor load on geoiplookup.wikimedia.org - https://phabricator.wikimedia.org/T104258#1411912 (10AndyRussG) 3NEW [22:25:46] 6operations, 6Security, 7Security-General: determine validity of Christian Aistleitner (qchris's) shell account - https://phabricator.wikimedia.org/T104254#1411921 (10QChris) I only still use gerrit-admin and I still use access to the bastion hosts to connect to `ytterbium`. If I read it correctly, I cur... 
[22:26:41] 6operations: Create instrumentation to monitor load on geoiplookup.wikimedia.org - https://phabricator.wikimedia.org/T104258#1411925 (10AndyRussG) [22:30:03] PROBLEM - puppet last run on graphite2001 is CRITICAL Puppet last ran 4 days ago [22:30:32] 6operations: Create instrumentation to monitor load on geoiplookup.wikimedia.org - https://phabricator.wikimedia.org/T104258#1411936 (10BBlack) I'm not sure "fallback" is the entire story here. We have to use either a geoiplookup.wm.o or http://foo/geoiplookup hit in order to get accurate info for IPv6 clients... [22:32:33] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:33:02] 6operations, 10Gather, 7Database, 7Schema-change: Update Gather DB schema for flagging backend - https://phabricator.wikimedia.org/T103611#1411950 (10Tgr) I doubt locking would be a concern for Gather; we are talking about a table with 5000 rows and a couple hundred write queries a day. That said, splitti... [22:34:05] jynus: do you have time to discuss https://phabricator.wikimedia.org/T103611 ? [22:34:32] tgr, I was going to go to sleep [22:34:39] robh: module names in puppet should use underscores instead of dashes right? [22:34:50] but I see your answer [22:34:58] nothing to add [22:35:36] jynus: should I wait for a DBA to do the deploy? given the scale, seems unnecessary to me [22:36:32] I can do that tomorrow in a few hours [22:36:35] JohnFLewis: i think you are right, yes [22:36:41] the only exception being install-server =P [22:36:49] the rest are _ instead of - [22:37:05] standardise? :p [22:37:08] can you do it before the next branch is cut? that would make things a lot simpler [22:37:27] tgr, when is that?
[22:37:40] though im not sure if thats a puppet standard in naming or not [22:38:13] hm [22:38:32] jynus: 18h UTC [22:39:05] Module names should only contain lowercase letters, numbers, and underscores, and should begin with a lowercase letter; that is, they should match the expression [a-z][a-z0-9_]*. Note that these are the same restrictions that apply to class names, but with the added restriction that module names cannot contain the namespace separator (::) as modules cannot be nested. [22:39:14] from puppet conventions [22:39:16] https://docs.puppetlabs.com/puppet/latest/reference/modules_fundamentals.html#allowed-module-names [22:39:18] should be doable, tgr [22:39:25] cool, thanks! [22:39:36] JohnFLewis: so yea, install-server is a poor exception [22:40:04] robh: shall we/I be risky and patch a fix for its name? (if that won't gain hate :) ) [22:40:42] i think it'd be ok, but I'd let more than you and i oversight it [22:40:56] ie: feel free to make it and i'll review, but we should get someone else [22:41:04] i dont think messing with that would affect any of the apt configuration [22:41:15] but since the apt configuration is not my strong point... [22:41:26] and install server runs on our same apt server host =] [22:41:46] though i dont think it would at all [22:42:23] (its a lot of config line changes in manifests throughout the module) [22:44:37] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1412018 (10Jdforrester-WMF) a:5Jdforrester-WMF>3None The subsidiary tasks remain undone. [22:52:00] (03PS1) 10Matanya: access: remove qchris from all groups except gerrit-admin and bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/221786 [22:52:46] robh: you on the clinic this week too ? [22:52:57] nope, faidon is but need something? 
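[Editor's aside: the naming rule quoted above from the Puppet docs is mechanically checkable. A small sketch of the `[a-z][a-z0-9_]*` convention, in Python since the rule itself is language-agnostic:]

```python
import re

# Puppet's module-name convention: lowercase letters, digits and
# underscores only, starting with a lowercase letter, and no namespace
# separator (::) since modules cannot be nested.
MODULE_NAME = re.compile(r"[a-z][a-z0-9_]*\Z")


def valid_module_name(name):
    """True if name satisfies the Puppet module-name convention."""
    return MODULE_NAME.match(name) is not None


print(valid_module_name("install_server"))  # True
print(valid_module_name("install-server"))  # False -- hyphen not allowed
print(valid_module_name("role::foo"))       # False -- no :: in module names
```

This is exactly why the install-server module discussed here is the odd one out: the hyphen fails the expression even though the rest of the repo's modules pass.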
[22:53:02] ahh, bad topic [22:53:08] yep [22:53:14] thanks [22:53:31] !log canary deploy of restbase 32db4ce1e1 on restbase1001.eqiad [22:53:36] Logged the message, Master [22:53:37] ohh, i see the patch for qchris [22:53:42] i'll merge that in and handle, thanks! [22:54:00] sure, i have some others in the queue if interested [22:54:53] feel free to link, if i have the subject matter expertise to review im happy to [22:55:12] (03CR) 10RobH: [C: 032] access: remove qchris from all groups except gerrit-admin and bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/221786 (owner: 10Matanya) [22:55:55] robh: https://gerrit.wikimedia.org/r/#/c/220990/ [22:56:07] 6operations, 6Security, 5Patch-For-Review, 7Security-General: determine validity of Christian Aistleitner (qchris's) shell account - https://phabricator.wikimedia.org/T104254#1412049 (10RobH) 5Open>3Resolved I've gone ahead and merge'd Matanya's patchset live, after @qchris's update to the task. Thank... [22:56:20] (03PS1) 10John F. Lewis: install-server: rename module to install_server [puppet] - 10https://gerrit.wikimedia.org/r/221787 [22:56:25] and https://gerrit.wikimedia.org/r/#/c/218905/ though godog would probably want to look at that too [22:56:53] (03PS2) 10John F. Lewis: install-server: rename module to install_server [puppet] - 10https://gerrit.wikimedia.org/r/221787 [22:56:57] hrmm [22:57:07] matanya: i think technically we have to have a manager approval on that one [22:57:14] yes [22:57:18] but no comment [22:57:26] so raising your attention [22:57:26] robh: https://gerrit.wikimedia.org/r/#/c/221787/ [22:57:52] jouncebot, next [22:57:52] In 0 hour(s) and 2 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150629T2300) [22:58:05] heh robh, nvm that is messed up because I didn't do it from the modules dir :) [22:58:20] JohnFLewis: it should be moves, not added [22:58:27] matanya: I know [22:58:32] k [22:58:59] manybubbles: who is dcausse ?
[22:59:11] does he have any shell account ? [22:59:19] robh: fixed [22:59:26] I think that's a new wmf person? [22:59:50] someone in search & discovery engineering [22:59:52] 10Ops-Access-Requests, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Grant access to HTTP request logs - https://phabricator.wikimedia.org/T103872#1412059 (10RobH) a:5Jdouglas>3Tfinc While the three days has passed and there is a patchset pending, this task hasn't yet received... [22:59:55] i think so too, but no evidence [23:00:01] matanya: so yea, updated the task that as soon as we have tomasz approve its good to go [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150629T2300). [23:00:04] legoktm: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:12] but technically its an escalation of his current access, hence needs manager approval. [23:00:26] (I also had to check and make sure he had signed the L3 doc, but he had ;) [23:00:48] thanks robh [23:00:53] JohnFLewis: holy shit thats a lot of files [23:00:54] 10Ops-Access-Requests, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Grant access to HTTP request logs - https://phabricator.wikimedia.org/T103872#1412066 (10Jdouglas) Tomasz is out on family leave; bumping up to Wes. [23:01:02] robh: its a large module :) [23:01:36] blerg didn't rename the other stuff [23:01:36] 10Ops-Access-Requests, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Grant access to HTTP request logs - https://phabricator.wikimedia.org/T103872#1412067 (10Jdouglas) a:5Tfinc>3Wwes [23:01:38] (03CR) 10RobH: [C: 04-1] "The patchset looks great, my -1 is only due to the lack of manager approval on task.
Once that is done (and opsen review for no objection" [puppet] - 10https://gerrit.wikimedia.org/r/220990 (owner: 10Matanya) [23:01:51] legoktm, you doing that deploy? [23:02:08] JohnFLewis: indeed it is, but i really like matching what the puppet standard is, so reviewing =] [23:04:00] though since its a ton of them and will break the shit out of the server, I think I [23:04:05] PROBLEM - puppet last run on mw1103 is CRITICAL Puppet has 1 failures [23:04:12] 'll just +1 this with a note that if someone else also +1 im happy to merge [23:04:13] PROBLEM - puppet last run on analytics1021 is CRITICAL Puppet has 1 failures [23:04:14] PROBLEM - puppet last run on mw1194 is CRITICAL Puppet has 1 failures [23:04:14] PROBLEM - puppet last run on tmh1001 is CRITICAL Puppet has 1 failures [23:04:16] (still reviewing) [23:04:23] PROBLEM - puppet last run on analytics1031 is CRITICAL Puppet has 1 failures [23:04:24] PROBLEM - puppet last run on stat1001 is CRITICAL Puppet has 1 failures [23:04:24] PROBLEM - puppet last run on mw1075 is CRITICAL Puppet has 1 failures [23:04:24] PROBLEM - puppet last run on mw1070 is CRITICAL Puppet has 1 failures [23:04:25] hrmm [23:04:31] ok, who just merged something to break puppet? [23:04:31] (03PS5) 10John F. Lewis: install-server: rename module to install_server [puppet] - 10https://gerrit.wikimedia.org/r/221787 [23:04:34] PROBLEM - puppet last run on mw1073 is CRITICAL Puppet has 1 failures [23:04:39] (it wasnt the install server module yet ;) [23:04:41] (03PS6) 10John F.
Lewis: install-server: rename module to install_server [puppet] - 10https://gerrit.wikimedia.org/r/221787 [23:04:44] PROBLEM - puppet last run on mw1083 is CRITICAL Puppet has 1 failures [23:04:45] PROBLEM - puppet last run on mw1179 is CRITICAL Puppet has 1 failures [23:04:45] PROBLEM - puppet last run on mw1199 is CRITICAL Puppet has 1 failures [23:05:04] PROBLEM - puppet last run on mw2084 is CRITICAL Puppet has 1 failures [23:05:04] PROBLEM - puppet last run on mw2049 is CRITICAL Puppet has 1 failures [23:05:04] PROBLEM - puppet last run on mw2198 is CRITICAL Puppet has 1 failures [23:05:04] PROBLEM - puppet last run on mw2061 is CRITICAL Puppet has 1 failures [23:05:04] PROBLEM - puppet last run on mw2178 is CRITICAL Puppet has 1 failures [23:05:04] PROBLEM - puppet last run on mw2174 is CRITICAL Puppet has 1 failures [23:05:04] PROBLEM - puppet last run on mw1230 is CRITICAL Puppet has 1 failures [23:05:05] PROBLEM - puppet last run on mw1058 is CRITICAL Puppet has 1 failures [23:05:05] PROBLEM - puppet last run on mw1136 is CRITICAL Puppet has 1 failures [23:05:11] (03CR) 10Catrope: [C: 032] Remove $wgPopupsSurveyLink as trial is complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220121 (https://phabricator.wikimedia.org/T103283) (owner: 10Prtksxna) [23:05:15] PROBLEM - puppet last run on mw1101 is CRITICAL Puppet has 1 failures [23:05:23] PROBLEM - puppet last run on mw2165 is CRITICAL Puppet has 1 failures [23:05:23] PROBLEM - puppet last run on mw2110 is CRITICAL Puppet has 1 failures [23:05:23] PROBLEM - puppet last run on mw2131 is CRITICAL Puppet has 1 failures [23:05:23] PROBLEM - puppet last run on mw2101 is CRITICAL Puppet has 1 failures [23:05:23] PROBLEM - puppet last run on mw2120 is CRITICAL Puppet has 1 failures [23:05:24] PROBLEM - puppet last run on mw2035 is CRITICAL Puppet has 1 failures [23:05:24] PROBLEM - puppet last run on mw2055 is CRITICAL Puppet has 1 failures [23:05:25] PROBLEM - puppet last run on mw2203 is CRITICAL 
Puppet has 1 failures [23:05:25] PROBLEM - puppet last run on mw2077 is CRITICAL Puppet has 1 failures [23:05:26] PROBLEM - puppet last run on mw2111 is CRITICAL Puppet has 1 failures [23:05:26] PROBLEM - puppet last run on mw2047 is CRITICAL Puppet has 1 failures [23:05:27] PROBLEM - puppet last run on mw2052 is CRITICAL Puppet has 1 failures [23:05:27] PROBLEM - puppet last run on mw1102 is CRITICAL Puppet has 1 failures [23:05:28] PROBLEM - puppet last run on mw1094 is CRITICAL Puppet has 1 failures [23:05:36] (03Merged) 10jenkins-bot: Remove $wgPopupsSurveyLink as trial is complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220121 (https://phabricator.wikimedia.org/T103283) (owner: 10Prtksxna) [23:05:43] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures [23:05:43] PROBLEM - puppet last run on mw1078 is CRITICAL Puppet has 1 failures [23:05:43] PROBLEM - puppet last run on mw1018 is CRITICAL Puppet has 1 failures [23:05:43] PROBLEM - puppet last run on mw1127 is CRITICAL Puppet has 1 failures [23:05:43] PROBLEM - puppet last run on mw1191 is CRITICAL Puppet has 1 failures [23:05:44] PROBLEM - puppet last run on mw1128 is CRITICAL Puppet has 1 failures [23:05:44] PROBLEM - puppet last run on mw1085 is CRITICAL Puppet has 1 failures [23:05:53] PROBLEM - puppet last run on mw1137 is CRITICAL Puppet has 1 failures [23:05:53] PROBLEM - puppet last run on mw1095 is CRITICAL Puppet has 1 failures [23:05:53] PROBLEM - puppet last run on mw2135 is CRITICAL Puppet has 1 failures [23:05:54] PROBLEM - puppet last run on mw2177 is CRITICAL Puppet has 1 failures [23:05:54] PROBLEM - puppet last run on mw2025 is CRITICAL Puppet has 1 failures [23:05:54] PROBLEM - puppet last run on analytics1012 is CRITICAL Puppet has 1 failures [23:06:04] PROBLEM - puppet last run on mw1232 is CRITICAL Puppet has 1 failures [23:06:05] PROBLEM - puppet last run on mw1182 is CRITICAL Puppet has 1 failures [23:06:05] PROBLEM - puppet last run on mw1245 is CRITICAL 
Puppet has 1 failures [23:06:06] PROBLEM - puppet last run on mw2202 is CRITICAL Puppet has 1 failures [23:06:06] PROBLEM - puppet last run on mw2119 is CRITICAL Puppet has 1 failures [23:06:06] PROBLEM - puppet last run on mw2103 is CRITICAL Puppet has 1 failures [23:06:14] PROBLEM - puppet last run on mw2005 is CRITICAL Puppet has 1 failures [23:06:14] PROBLEM - puppet last run on tmh1002 is CRITICAL Puppet has 1 failures [23:06:14] PROBLEM - puppet last run on mw1062 is CRITICAL Puppet has 1 failures [23:06:15] PROBLEM - puppet last run on mw1019 is CRITICAL Puppet has 1 failures [23:06:15] robh: last merge was you with that access change and looking at the servers, seems they correlate [23:06:15] PROBLEM - puppet last run on mw1214 is CRITICAL Puppet has 1 failures [23:06:18] Hm. [23:06:23] PROBLEM - puppet last run on analytics1034 is CRITICAL Puppet has 1 failures [23:06:24] PROBLEM - puppet last run on mw1252 is CRITICAL Puppet has 1 failures [23:06:24] PROBLEM - puppet last run on mw1169 is CRITICAL Puppet has 1 failures [23:06:32] really? [23:06:34] PROBLEM - puppet last run on mw1047 is CRITICAL Puppet has 1 failures [23:06:34] PROBLEM - puppet last run on mw1015 is CRITICAL Puppet has 1 failures [23:06:35] PROBLEM - puppet last run on mw2029 is CRITICAL Puppet has 1 failures [23:06:37] seems borked that my change would do that [23:06:44] PROBLEM - puppet last run on mw1157 is CRITICAL Puppet has 1 failures [23:06:44] PROBLEM - puppet last run on mw1130 is CRITICAL Puppet has 1 failures [23:06:44] PROBLEM - puppet last run on mw1218 is CRITICAL Puppet has 1 failures [23:06:45] PROBLEM - puppet last run on mw1233 is CRITICAL Puppet has 1 failures [23:06:45] PROBLEM - puppet last run on mw1184 is CRITICAL Puppet has 1 failures [23:06:45] Jun 29 23:00:12 mw2025 puppet-agent[4235]: /usr/local/sbin/enforce-users-groups returned 1 instead of one of [0] [23:06:53] ohh, it was removing him from deployers... 
[23:06:54] PROBLEM - puppet last run on mw1234 is CRITICAL Puppet has 1 failures [23:06:54] PROBLEM - puppet last run on mw2182 is CRITICAL Puppet has 1 failures [23:06:54] PROBLEM - puppet last run on mw2172 is CRITICAL Puppet has 1 failures [23:06:54] PROBLEM - puppet last run on mw2181 is CRITICAL Puppet has 1 failures [23:06:54] PROBLEM - puppet last run on mw1017 is CRITICAL Puppet has 1 failures [23:06:55] PROBLEM - puppet last run on analytics1036 is CRITICAL Puppet has 1 failures [23:06:55] PROBLEM - puppet last run on analytics1039 is CRITICAL Puppet has 1 failures [23:07:02] lemme run on one and see what it errors as [23:07:05] PROBLEM - puppet last run on mw1246 is CRITICAL Puppet has 1 failures [23:07:05] PROBLEM - puppet last run on mw1096 is CRITICAL Puppet has 1 failures [23:07:13] PROBLEM - puppet last run on mw2142 is CRITICAL Puppet has 1 failures [23:07:14] PROBLEM - puppet last run on mw2039 is CRITICAL Puppet has 1 failures [23:07:14] PROBLEM - puppet last run on mw2138 is CRITICAL Puppet has 1 failures [23:07:14] PROBLEM - puppet last run on mw2098 is CRITICAL Puppet has 1 failures [23:07:14] PROBLEM - puppet last run on mw2046 is CRITICAL Puppet has 1 failures [23:07:14] PROBLEM - puppet last run on mw2053 is CRITICAL Puppet has 1 failures [23:07:14] PROBLEM - puppet last run on mw1036 is CRITICAL Puppet has 1 failures [23:07:15] PROBLEM - puppet last run on mw1216 is CRITICAL Puppet has 1 failures [23:07:23] PROBLEM - puppet last run on mw1132 is CRITICAL Puppet has 1 failures [23:07:25] PROBLEM - puppet last run on mw2041 is CRITICAL Puppet has 1 failures [23:07:34] PROBLEM - puppet last run on mw1161 is CRITICAL Puppet has 1 failures [23:07:34] PROBLEM - puppet last run on mw1048 is CRITICAL Puppet has 1 failures [23:07:34] PROBLEM - puppet last run on mw1013 is CRITICAL Puppet has 1 failures [23:07:34] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [23:07:44] PROBLEM - puppet last run on mw2133 is CRITICAL 
Puppet has 1 failures [23:07:44] PROBLEM - puppet last run on mw2122 is CRITICAL Puppet has 1 failures [23:07:44] PROBLEM - puppet last run on mw2089 is CRITICAL Puppet has 1 failures [23:07:44] PROBLEM - puppet last run on mw2072 is CRITICAL Puppet has 1 failures [23:07:44] PROBLEM - puppet last run on mw2037 is CRITICAL Puppet has 1 failures [23:07:45] PROBLEM - puppet last run on mw2170 is CRITICAL Puppet has 1 failures [23:07:45] PROBLEM - puppet last run on mw2169 is CRITICAL Puppet has 1 failures [23:07:46] PROBLEM - puppet last run on mw1257 is CRITICAL Puppet has 1 failures [23:07:47] granted, this is non ideal [23:07:51] but its non outage at least [23:07:54] PROBLEM - puppet last run on mw1072 is CRITICAL Puppet has 1 failures [23:07:54] PROBLEM - puppet last run on mw1124 is CRITICAL Puppet has 1 failures [23:07:56] just no puppet updates. [23:08:04] PROBLEM - puppet last run on mw2034 is CRITICAL Puppet has 1 failures [23:08:04] PROBLEM - puppet last run on mw1115 is CRITICAL Puppet has 1 failures [23:08:13] PROBLEM - puppet last run on mw1035 is CRITICAL Puppet has 1 failures [23:08:13] PROBLEM - puppet last run on mw1109 is CRITICAL Puppet has 1 failures [23:08:14] PROBLEM - puppet last run on mw1200 is CRITICAL Puppet has 1 failures [23:08:14] PROBLEM - puppet last run on mw1067 is CRITICAL Puppet has 1 failures [23:08:14] PROBLEM - puppet last run on mw1221 is CRITICAL Puppet has 1 failures [23:08:14] PROBLEM - puppet last run on mw1197 is CRITICAL Puppet has 1 failures [23:08:14] PROBLEM - puppet last run on mw1138 is CRITICAL Puppet has 1 failures [23:08:20] hey guys [23:08:23] PROBLEM - puppet last run on mw2197 is CRITICAL Puppet has 1 failures [23:08:23] PROBLEM - puppet last run on mw2160 is CRITICAL Puppet has 1 failures [23:08:23] PROBLEM - puppet last run on mw1244 is CRITICAL Puppet has 1 failures [23:08:23] PROBLEM - puppet last run on mw2214 is CRITICAL Puppet has 1 failures [23:08:23] PROBLEM - puppet last run on mw1040 is CRITICAL 
Puppet has 1 failures [23:08:24] PROBLEM - puppet last run on mw1147 is CRITICAL Puppet has 1 failures [23:08:24] PROBLEM - puppet last run on mw1256 is CRITICAL Puppet has 1 failures [23:08:27] did you know there are some puppet failures? [23:08:31] just a few ;] [23:08:33] PROBLEM - puppet last run on mw1178 is CRITICAL Puppet has 1 failures [23:08:34] PROBLEM - puppet last run on mw2124 is CRITICAL Puppet has 1 failures [23:08:35] PROBLEM - puppet last run on mw2179 is CRITICAL Puppet has 1 failures [23:08:35] PROBLEM - puppet last run on mw2099 is CRITICAL Puppet has 1 failures [23:08:38] i merged a change to remove deployer from someone [23:08:41] that may have caused this, checking now [23:08:44] PROBLEM - puppet last run on mw2014 is CRITICAL Puppet has 1 failures [23:08:45] PROBLEM - puppet last run on mw1192 is CRITICAL Puppet has 1 failures [23:08:58] cuz the user didnt remove, just the deployer right [23:09:03] PROBLEM - puppet last run on mw2183 is CRITICAL Puppet has 1 failures [23:09:03] PROBLEM - puppet last run on mw2106 is CRITICAL Puppet has 1 failures [23:09:03] PROBLEM - puppet last run on mw1006 is CRITICAL Puppet has 1 failures [23:09:03] PROBLEM - puppet last run on mw1041 is CRITICAL Puppet has 1 failures [23:09:04] PROBLEM - puppet last run on mw1038 is CRITICAL Puppet has 1 failures [23:09:04] PROBLEM - puppet last run on mw1141 is CRITICAL Puppet has 1 failures [23:09:06] Error: /Stage[main]/Admin/Exec[enforce-users-groups-cleanup]/returns: change from notrun to 0 failed: /usr/local/sbin/enforce-users-groups returned 1 instead of one of [0] [23:09:13] PROBLEM - puppet last run on mw1063 is CRITICAL Puppet has 1 failures [23:09:13] PROBLEM - puppet last run on mw1012 is CRITICAL Puppet has 1 failures [23:09:13] PROBLEM - puppet last run on mw1134 is CRITICAL Puppet has 1 failures [23:09:13] PROBLEM - puppet last run on mw1145 is CRITICAL Puppet has 1 failures [23:09:13] PROBLEM - puppet last run on mw1140 is CRITICAL Puppet has 1 failures
[23:09:13] domas if you want robh can test by removing your ops group :) [23:09:14] PROBLEM - puppet last run on mw1031 is CRITICAL Puppet has 1 failures [23:09:14] PROBLEM - puppet last run on mw1224 is CRITICAL Puppet has 1 failures [23:09:23] PROBLEM - puppet last run on mw1028 is CRITICAL Puppet has 1 failures [23:09:24] PROBLEM - puppet last run on mw1089 is CRITICAL Puppet has 1 failures [23:09:24] PROBLEM - puppet last run on mw2102 is CRITICAL Puppet has 1 failures [23:09:24] PROBLEM - puppet last run on mw2112 is CRITICAL Puppet has 1 failures [23:09:24] PROBLEM - puppet last run on mw2028 is CRITICAL Puppet has 1 failures [23:09:33] I DON'T LIKE YOU ALREADY [23:09:47] !log stop ircecho on neon, icinga spam [23:09:52] Logged the message, Master [23:09:53] it's totally the qchris removal from deployers [23:09:56] and a cleanup script [23:09:59] Krenair: yeah, I'll do it [23:10:02] chatting with chase about it [23:10:03] ah ok [23:10:04] oh look, morebots still alive [23:10:05] looking [23:10:07] legoktm, looks like Roan already was [23:10:10] domas: but I like you :( [23:10:10] <3 <3 <3 [23:10:12] but then robh broke all the things [23:10:13] Yeah I'm just doing it [23:10:28] * hoo wonders whether greg-g is around [23:10:35] i totally did break all puppet [23:10:35] But sync-file is still hanging on an unresponsive host [23:10:38] the site is totally ok [23:10:46] sorry about this [23:10:47] And with godog's disabling of ircecho, even when it finishes, it might not log here [23:10:51] robh, so, I guess what's up is that his account will still be present on servers he's no longer supposed to get access to [23:11:00] RoanKattouw: it should still be running on tin?
[23:11:05] as he's no longer a deployer, and many other things [23:11:06] it's an issue where a cleanup script is getting info it doesn't expect [23:11:11] It just finished [23:11:17] as we remove from deployer, it tries to clean it up, but he isn't a removed user [23:11:17] oh [23:11:27] !log Synced wmf-config/CommonSettings.php: Remove survey access point in Popups [23:11:32] RoanKattouw: mhh that should affect just icinga-wm [23:11:33] Logged the message, Mr. Obvious [23:11:41] So, since the site is up, we are taking a moment to try to fix this in place rather than revert my qchris permission change. [23:11:42] !log ssh: connect to host mw2027.codfw.wmnet port 22: Connection timed out [23:11:45] RoanKattouw: i.e. not logmsgbot [23:11:48] Logged the message, Mr. Obvious [23:12:03] Hmm maybe it's because of my DONOLOGMSG setting from earlier [23:12:04] hi domas [23:12:11] hi! [23:12:26] heh... I've not broken things like this in a long time =P [23:12:28] robh: yeah another puppet run doesn't recover by itself sadly :( [23:12:33] Error: /usr/local/sbin/enforce-users-groups returned 1 instead of one of [0] [23:12:37] godog: nope, it's the user cleanup script [23:12:45] it expects that when cleaning up a user, the user is totally gone [23:12:46] 6operations, 6Commons, 6Multimedia, 6Performance-Team, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1412111 (10ori) a:3Joe [23:12:50] domas: you have some really old things in wmf-config repo, got a sec to see what can go ?
[23:12:53] 6operations, 6Commons, 6Multimedia, 6Performance-Team, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1412112 (10Jdforrester-WMF) a:5Joe>3None [23:12:53] but not in some odd state where the user lost one group but not all [23:12:59] this is my assumption based on [23:13:04] Error: /usr/local/sbin/enforce-users-groups returned 1 instead of one of [0] [23:13:04] Error: /Stage[main]/Admin/Exec[enforce-users-groups-cleanup]/returns: change from notrun to 0 failed: /usr/local/sbin/enforce-users-groups returned 1 instead of one of [0] [23:13:05] matanya: like what?! [23:13:24] chase wrote the script so i totally pinged him to check it out, but i am also grepping it [23:13:42] 6operations, 6Commons, 6Multimedia, 6Performance-Team, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1412116 (10Jdforrester-WMF) a:3Joe Bah, edit conflicts. [23:13:46] uhm... wikitech flaky? [23:13:56] (03CR) 10Catrope: [C: 032] More wikitech cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221729 (https://phabricator.wikimedia.org/T75939) (owner: 10Alex Monk) [23:14:03] (03Merged) 10jenkins-bot: More wikitech cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221729 (https://phabricator.wikimedia.org/T75939) (owner: 10Alex Monk) [23:14:03] * hoo eyes RoanKattouw [23:14:10] he hasn't sync'd it yet [23:14:15] what's up with wikitech hoo? [23:14:24] i even see the code it's having issues with [23:14:29] robh: yeah the script is set -e and mv /etc/sudoers.d/qchris /home/qchris fails [23:14:30] It just logged me out [23:14:31] but im a shite dev and cannot fix it ;D [23:14:38] after only logging in 30s ago [23:14:46] Syncing now [23:14:59] domas: e.g: if ( $wmgPrivateWikiUploads ) { [23:14:59] # mav forced me to --midom [23:14:59] $wgFileExtensions[] = 'ppt'; [23:15:01] MrIGetPingedALot [23:15:01] robh, run the script without dryrun?
[23:15:08] * hoo rages [23:15:15] JohnFLewis: -P [23:15:17] Krenair: uh, i'd rather we fix this permanently [23:15:19] matanya: we need powerpoints, jeees [23:15:20] cuz it's going to happen again [23:15:24] this was a matter of time kinda thing [23:15:27] I think that's the point of the script [23:15:30] matanya: isn't it all .pptx nowadays!!?! [23:15:30] I can schedule a deploy because wikitech <3 [23:15:45] RoanKattouw: confirmed working now that startup module cache updated [23:15:52] !log catrope Synchronized wmf-config/: wikitech cleanup (duration: 01m 08s) [23:15:57] Reedy: the purist inside me screams out silently :) [23:15:59] Logged the message, Master [23:16:03] * can't... doh [23:16:07] OK yeah so logmsgbot does still work [23:16:11] My env was just screwed up [23:16:11] matanya: you can delete all of them, as long as you're ready to take the blame [23:16:24] more than willing to [23:16:45] heh, mav. [23:16:51] he was our first.... accountant? [23:16:55] or what was his role [23:16:58] iirc [23:17:03] volunteer, obviously [23:17:16] that was, when? 2004 ? [23:17:27] matanya: wouldn't be too surprised [23:17:56] 11 files changed, 1287 insertions(+), 1300 deletions(-) [23:17:59] this will be fun to review [23:18:14] whitespace? [23:18:33] Anyone able to edit wikitech? [23:18:55] If not, I'm just going to schedule my deploy tomorrow or orally (well, over IRC) to Greg [23:19:08] Nope [23:19:12] uh oh... [23:19:17] hoo: Works for me. [23:19:20] I can't stay logged in on wikitech [23:19:27] https://wikitech.wikimedia.org/w/index.php?title=User:Jforrester&diff=168382&oldid=102068 [23:19:40] https://wikitech.wikimedia.org/w/index.php?title=User:Alex_Monk/sandbox&oldid=168384 [23:19:45] hoo: use a clean profile, worked for Krenair [23:19:46] James_F: show off [23:19:55] JohnFLewis: :-) [23:19:58] Krenair: show off too [23:20:04] just because we can't :) [23:20:06] matanya: Tried Firefox and Epiphany [23:20:14] What, showing that VE can be used to make edits?
(03PS1) 10Rush: admin: enforce user removal test for sudo perms [puppet] - 10https://gerrit.wikimedia.org/r/221790 [23:20:53] PROBLEM - puppet last run on mw2205 is CRITICAL Puppet has 1 failures [23:20:55] Reedy: yes [23:21:37] !log ori Synchronized php-1.26wmf11/includes/resourceloader: I0e5f2d3b2: resourceloader: Add timing metrics for key operations (duration: 01m 12s) [23:21:41] Logged the message, Master [23:22:01] (03CR) 10Filippo Giunchedi: [C: 031] "*rubberstamp*" [puppet] - 10https://gerrit.wikimedia.org/r/221790 (owner: 10Rush) [23:22:24] woooo, works for me now [23:22:36] In an incognito window within epiphany... but it works [23:22:43] (03CR) 10Rush: [C: 032 V: 032] admin: enforce user removal test for sudo perms [puppet] - 10https://gerrit.wikimedia.org/r/221790 (owner: 10Rush) [23:23:03] I can login to wikitech but I can't see anything related to labs projects (eg https://wikitech.wikimedia.org/wiki/Special:NovaResources is empty) [23:23:45] bd808: were you already logged in? I have had to logout and in again [23:23:47] for similar [23:23:53] Did the cleanup change cache config? [23:23:54] 7Blocked-on-Operations, 6operations, 10Deployment-Systems, 6Scrum-of-Scrums: Update wikitech wiki with deployment train - https://phabricator.wikimedia.org/T70751#1412155 (10Krenair) [23:23:54] *nod* tried that twice [23:24:36] bummer then maybe it's a keystone problem idk [23:27:16] ori: is wmgUseGeSHi still needed in CommonSettings.php ? [23:28:13] yes matanya [23:28:16] it needs renaming [23:28:28] to pygment ?
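The question above — whether a config variable like wmgUseGeSHi is still needed — can be attacked with a grep pass over the repo instead of reading it variable by variable. A rough sketch, using a temporary stand-in file rather than the real wmf-config checkout (the file contents and the `wmg` naming convention are taken from the snippets in this log):

```shell
#!/bin/sh
# First-pass audit: flag $wmg* variables that are assigned in the config
# files but never read anywhere else. Not a real wmf-config tool.
repo=$(mktemp -d)
cat > "$repo/CommonSettings.php" <<'EOF'
$wmgUseGeSHi = true;
$wmgPrivateWikiUploads = false;
if ( $wmgPrivateWikiUploads ) {
    $wgFileExtensions[] = 'ppt';
}
EOF

# Assigned variables: left-hand side of "$wmgName = ..."
assigned=$(grep -hoE '^\$wmg[A-Za-z0-9_]+ =' "$repo"/*.php | tr -d '$ =' | sort -u)

for var in $assigned; do
    uses=$(grep -c "\$$var" "$repo"/*.php)        # all mentions
    defs=$(grep -cE "^\\\$$var =" "$repo"/*.php)  # assignments only
    if [ "$uses" -le "$defs" ]; then
        echo "possibly unused: \$$var"
    fi
done
rm -rf "$repo"
```

Run against the sample file this flags $wmgUseGeSHi (assigned, never read) and not $wmgPrivateWikiUploads (read in the `if`). A real audit would also have to grep the extensions that consume these variables, which is exactly why the rename question is hard to answer by eye.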
[23:28:31] matanya: that is not the right approach, though [23:28:46] enlighten me please :) [23:28:50] if you comb over those files variable by variable you'll drive yourself and everyone else nuts [23:29:00] try to think of a programmatic way to assess which variables are in use [23:29:15] !log deployed restbase 32db4ce1e1 [23:29:19] Logged the message, Master [23:29:56] I have a vague knowledge of what is in use, but naming-wise it's hard to figure out; thanks for pointing that out though, ori. point taken [23:30:47] godog: did you silence nagios? it's gonna blast a bunch of clears [23:30:55] icinga that is [23:30:57] shame on me. [23:31:04] he killed it [23:31:09] robh: no I've stopped ircecho on neon [23:31:25] didn't it come back? [23:31:30] good enough, i wouldn't turn it back on yet [23:31:33] oh, then quit again [23:31:39] 387 unhandled in icinga still [23:31:43] they haven't all called back in yet [23:32:14] (all puppet fails ;) [23:33:06] I really want to condense those into one notice [23:33:21] my last gig we had a "5 puppet failures have occurred" [23:33:34] instead of every host individually which is just unmanageable [23:33:38] tho it gets attention so idk [23:34:23] maybe break into service group puppet errors [23:34:28] so one error for mw, etc.. [23:35:17] robh: once puppet recovers, in spirit you have to merge another patch that breaks it :) [23:35:30] that's got to be tradition somewhere along the lines [23:35:47] every time we break puppet, it breaks us a bit in return. [23:40:38] this is where i debate if i should just fire off puppet [23:40:53] then recall how long a manual run on all hosts takes and decide to wait a little while longer [23:41:23] !log restarted cassandra instance on restbase1004.eqiad; log showed many small writes and clients saw timeouts [23:41:27] Logged the message, Master [23:41:46] robh: may not be puppet recovering, icinga checks are staggered so could be that too [23:41:47] what's up with mw2027 btw?
icinga reports it as down for over 7h, but apparently it's still supposed to be receiving deploys as Roan ran into it earlier [23:45:57] RECOVERY - puppet last run on mw1141 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:46:23] RECOVERY - puppet last run on mw2033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:46:28] dcausse: around ? [23:48:06] I think it was down last week too [23:48:11] Unless that was another machine [23:48:32] no open machines, lesse about it [23:48:36] * robh goes to poke at it [23:48:50] !log start upgrading restbase1* to cassandra 2.1.7 [23:48:55] Logged the message, Master [23:50:11] hrmm, it's a blank screen and unresponsive [23:50:18] no open ticket, so going to power cycle it [23:52:56] no open tasks that is, not no open machines =P [23:54:45] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 43.06 ms [23:55:59] shall we place bets on how long until 2027 goes down? [23:57:22] yeah it does seem to go down a lot, faulty perhaps [23:57:24] !log mw2027 was offline (blank screen on serial console). mgmt powercycled [23:57:28] Logged the message, Master [23:58:05] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:58:07] i see a few mentions in the server admin log [23:58:18] but not so much that it's insane... but folks may not log each reboot