[00:00:47] PROBLEM - puppet last run on mw2047 is CRITICAL puppet fail
[00:02:35] (CR) Yuvipanda: "Attempting to test this on toolsbeta." [puppet] - https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) (owner: Yuvipanda)
[00:03:05] (PS1) BBlack: remove more torrus api-cluster refs (followup fix for 6254a447?) [puppet] - https://gerrit.wikimedia.org/r/204198
[00:04:41] (CR) BBlack: [C: 2 V: 2] remove more torrus api-cluster refs (followup fix for 6254a447?) [puppet] - https://gerrit.wikimedia.org/r/204198 (owner: BBlack)
[00:08:07] RECOVERY - puppet last run on netmon1001 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[00:12:32] bblack: yes, ^ recovery confirmed
[00:12:34] thx
[00:16:58] (PS9) BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - https://gerrit.wikimedia.org/r/204068
[00:18:00] (PS1) Ori.livneh: Create Application class [software/brrd] - https://gerrit.wikimedia.org/r/204199
[00:18:17] (CR) Ori.livneh: [C: 2 V: 2] Create Application class [software/brrd] - https://gerrit.wikimedia.org/r/204199 (owner: Ori.livneh)
[00:18:37] RECOVERY - puppet last run on mw2047 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[00:21:52] (PS1) Ori.livneh: Update upstart job def for brrd [puppet] - https://gerrit.wikimedia.org/r/204200
[00:22:02] (PS2) Ori.livneh: Update upstart job def for brrd [puppet] - https://gerrit.wikimedia.org/r/204200
[00:22:36] (CR) Ori.livneh: [C: 2 V: 2] Update upstart job def for brrd [puppet] - https://gerrit.wikimedia.org/r/204200 (owner: Ori.livneh)
[00:28:41] (CR) Chad: Add submodules to master checkoutMediaWiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/204080 (https://phabricator.wikimedia.org/T88442) (owner: Thcipriani)
[00:32:35] (PS10) BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - https://gerrit.wikimedia.org/r/204068
[00:35:10] (PS1) Dzahn: integration: move redirect out of .htaccess [puppet] - https://gerrit.wikimedia.org/r/204202
[00:36:13] (CR) Dzahn: "[gallium:/srv/org/wikimedia/integration] $ cat .htaccess" [puppet] - https://gerrit.wikimedia.org/r/204202 (owner: Dzahn)
[00:41:33] (PS2) Dzahn: various role classes: moar small lint fixes [puppet] - https://gerrit.wikimedia.org/r/202653 (https://phabricator.wikimedia.org/T93645)
[00:41:49] (CR) Dzahn: [C: 2] various role classes: moar small lint fixes [puppet] - https://gerrit.wikimedia.org/r/202653 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[00:43:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60633 bytes in 1.156 second response time
[00:45:59] (PS3) Dzahn: drop shop & store entries from most projects [dns] - https://gerrit.wikimedia.org/r/196605 (https://phabricator.wikimedia.org/T92438)
[00:46:51] (PS1) Alex Monk: Add AffCom user group application contact page on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/204205 (https://phabricator.wikimedia.org/T95789)
[00:46:53] (CR) Dzahn: "@Faidon ok:)" [dns] - https://gerrit.wikimedia.org/r/196605 (https://phabricator.wikimedia.org/T92438) (owner: Dzahn)
[00:50:17] (Abandoned) Dzahn: color root shell in red [puppet] - https://gerrit.wikimedia.org/r/198425 (owner: Dzahn)
[00:51:02] (Abandoned) Dzahn: use 208.80.153.224 for text-lb.codfw.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/196075 (https://phabricator.wikimedia.org/T92377) (owner: Dzahn)
[00:55:01] (PS2) Alex Monk: Add AffCom user group application contact page on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/204205 (https://phabricator.wikimedia.org/T95789)
[00:56:19] (CR) Alex Monk: [C: -1] Add AffCom user group application contact page on meta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/204205 (https://phabricator.wikimedia.org/T95789) (owner: Alex Monk)
[01:08:04] <^d> mutante: Has anything changed with the gerrit ssl cert recently?
[01:08:12] <^d> (like, last couple of days?)
[01:08:35] not that I've heard about
[01:08:43] ^d: same, not that i heard about
[01:08:54] why?
[01:09:43] <^d> I have a stupid little PHP script on my machine that fetches some JSON from Gerrit for me. It started blowing up like yesterday
[01:10:04] <^d> Warning: file_get_contents(): SSL operation failed with code 1. OpenSSL Error messages:
[01:10:05] <^d> error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed in fetch_missing_repos.php on line 17
[01:11:37] there is https://phabricator.wikimedia.org/T82319 but that's not a new thing, it doesnt explain things changing just a few days ago
[01:12:14] <^d> I'm totally willing to accept it's possibly something I screwed up on my machine :p
[01:12:47] ^d: i see "ssl3" in there
[01:13:04] we disabled SSLv3
[01:13:04] <^d> Yeah, hmm
[01:13:10] also not a few days ago.. but .. wait..
[01:14:08] I think openssl always calls that function ssl3_get_server_certificate even when it's using TLS?
[01:15:48] <^d> https://phabricator.wikimedia.org/P520 - my openssl config for PHP
[01:16:02] cant find the other bug i meant
[01:16:09] anyways that change on gerrit was in Oct 2014
[01:18:34] ^d: is this using curl from php?
[01:18:51] <^d> fopen equivalent
[01:18:58] shouldn't your openssl have a capath? how else would it have a set of root certs to validate us against?
[01:19:02] "curl used to include a list of accepted CAs, but no longer bundles ANY CA certs. So by default it'll reject all SSL certificates as unverifiable."
[01:19:11] like /etc/ssl/certs/
[01:19:15] so you would have to set the capath, yea
[01:19:42] <^d> Gah, default config changed under me at some point
[01:21:03] (PS11) BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - https://gerrit.wikimedia.org/r/204068
[01:22:34] operations, Wikimedia-Mailing-lists: Update mailman listinfo.txt template - https://phabricator.wikimedia.org/T96108#1208549 (Dzahn) a:Dzahn
[01:40:48] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=223.60 Read Requests/Sec=133.20 Write Requests/Sec=36.10 KBytes Read/Sec=873.60 KBytes_Written/Sec=448.75
[01:41:39] now that might have been me running a script
[01:42:27] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=3.50 Read Requests/Sec=1.10 Write Requests/Sec=8.40 KBytes Read/Sec=6.40 KBytes_Written/Sec=136.40
[02:30:31] (PS1) Chad: Use Diffusion to support r1234 links in Gerrit [puppet] - https://gerrit.wikimedia.org/r/204211
[02:32:02] operations, Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1208595 (Gage) NEW a:Gage
[02:32:19] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 09m 03s)
[02:32:19] blah
[02:32:41] Logged the message, Master
[02:39:02] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-15 02:37:59+00:00
[02:39:14] Logged the message, Master
[02:52:07] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=189.40 Read Requests/Sec=95.70 Write Requests/Sec=3.30 KBytes Read/Sec=11046.00 KBytes_Written/Sec=32.55
[02:54:32] operations, Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1208607 (Gage)
[02:56:57] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=84.40 Read Requests/Sec=21.00 Write Requests/Sec=78.80 KBytes Read/Sec=692.40 KBytes_Written/Sec=533.30
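[Editor's note] The SSL failure ^d debugged above comes down to PHP's file_get_contents() having no CA store configured, and the fix discussed is pointing openssl.capath at a directory of system certs such as /etc/ssl/certs. The same verification model, sketched in Python rather than PHP (the capath line is commented out because /etc/ssl/certs is the Debian/Ubuntu convention mentioned in the log, not guaranteed everywhere):

```python
import ssl

# A default-configured context verifies the peer certificate against the
# system CA store and checks the hostname -- the part ^d's PHP setup lost
# when its default openssl config changed under him.
ctx = ssl.create_default_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname is True

# Rough equivalent of setting openssl.capath=/etc/ssl/certs in php.ini:
# ctx.load_verify_locations(capath="/etc/ssl/certs")
# and then, e.g.:
# urllib.request.urlopen("https://gerrit.wikimedia.org/", context=ctx)
```

Without a capath/cafile, verification fails for every server, which is exactly the "certificate verify failed" symptom in the warning above.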
[03:04:18] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 09m 22s)
[03:04:32] Logged the message, Master
[03:05:17] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=201.80 Read Requests/Sec=136.80 Write Requests/Sec=11.60 KBytes Read/Sec=3515.20 KBytes_Written/Sec=604.70
[03:08:27] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=88.70 Read Requests/Sec=96.60 Write Requests/Sec=11.20 KBytes Read/Sec=1232.00 KBytes_Written/Sec=292.85
[03:11:11] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-15 03:10:08+00:00
[03:11:20] Logged the message, Master
[03:13:17] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=44.00 Read Requests/Sec=93.70 Write Requests/Sec=5.20 KBytes Read/Sec=5864.00 KBytes_Written/Sec=25.35
[03:18:16] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=207.00 Read Requests/Sec=153.20 Write Requests/Sec=27.20 KBytes Read/Sec=15444.00 KBytes_Written/Sec=283.35
[03:21:37] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=69.70 Read Requests/Sec=1.80 Write Requests/Sec=7.60 KBytes Read/Sec=9.60 KBytes_Written/Sec=828.20
[03:26:27] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=101.60 Read Requests/Sec=146.80 Write Requests/Sec=2.90 KBytes Read/Sec=4511.60 KBytes_Written/Sec=32.55
[03:28:06] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=97.30 Read Requests/Sec=18.80 Write Requests/Sec=72.80 KBytes Read/Sec=79.60 KBytes_Written/Sec=547.45
[03:34:27] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=181.90 Read Requests/Sec=94.80 Write Requests/Sec=1.30 KBytes Read/Sec=755.60 KBytes_Written/Sec=30.80
[03:36:06] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=5.90 Read Requests/Sec=0.40 Write Requests/Sec=2.80 KBytes Read/Sec=2.00 KBytes_Written/Sec=135.60
[04:12:00] (PS1) EBernhardson: Invalidate flow cache by bumping cache version [mediawiki-config] - https://gerrit.wikimedia.org/r/204215
[04:13:51] (CR) Mattflaschen: [C: 2] Invalidate flow cache by bumping cache version [mediawiki-config] - https://gerrit.wikimedia.org/r/204215 (owner: EBernhardson)
[04:13:56] (Merged) jenkins-bot: Invalidate flow cache by bumping cache version [mediawiki-config] - https://gerrit.wikimedia.org/r/204215 (owner: EBernhardson)
[04:16:41] !log ebernhardson Synchronized wmf-config/CommonSettings.php: Bump flow cache version (duration: 00m 11s)
[04:16:48] Logged the message, Master
[06:01:57] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds
[06:14:47] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 9.398 second response time
[06:18:04] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Apr 15 06:17:01 UTC 2015 (duration 17m 0s)
[06:19:46] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds
[06:29:57] PROBLEM - puppet last run on db1028 is CRITICAL Puppet has 1 failures
[06:30:57] PROBLEM - puppet last run on db1021 is CRITICAL Puppet has 1 failures
[06:31:16] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures
[06:32:06] PROBLEM - puppet last run on logstash1002 is CRITICAL Puppet has 2 failures
[06:32:07] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures
[06:33:47] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:35:07] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures
[06:35:27] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 3 failures
[06:35:37] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:35:47] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:35:47] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:35:47] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures
[06:35:47] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 2 failures
[06:35:47] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures
[06:35:57] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:36:07] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 1 failures
[06:36:17] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:36:27] PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 1 failures
[06:36:28] PROBLEM - puppet last run on mw1065 is CRITICAL Puppet has 1 failures
[06:36:47] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:36:56] PROBLEM - puppet last run on mw1175 is CRITICAL Puppet has 1 failures
[06:45:37] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.293 second response time
[06:45:47] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on logstash1002 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:07] RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:18] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:50:36] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds
[07:06:38] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[07:06:46] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:06:46] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[07:06:46] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:06:57] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:07:07] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[07:07:07] RECOVERY - puppet last run on db1028 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:07:16] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[07:07:17] RECOVERY - puppet last run on mw1065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:37] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[07:07:37] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[07:07:47] RECOVERY - puppet last run on mw1175 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:47] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:08:07] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:17] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:08:17] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:19:19] operations, Services, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1208803 (mobrovac) After sleeping on it, I realised it's just a matter of format and we can go either way. I still think we should expose a proper endpoint for this (either sp...
[07:19:31] _joe_: ^^
[07:25:16] operations, Services, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1208804 (Joe) Having an endpoint exposing this gives us a lot of flexibility/autodiscovery ability that does NOT depend on people using swagger/spec to define what we should m...
[07:25:23] <_joe_> mobrovac: thanks
[07:26:34] (CR) Alexandros Kosiaris: [C: -1] "Is there a plan to have something in the main namespaces of the module (or the role) ? If yes, just introduce it in this commit so we can " [puppet] - https://gerrit.wikimedia.org/r/204161 (owner: MaxSem)
[07:28:26] <_joe_> mobrovac: I don't care about having to parse a slightly more complicated yaml than a json already taylored to my needs
[07:29:17] _joe_: ok cool, i can also go either way, really don't care
[07:29:32] i cared yesterday but today i don't any more
[07:29:46] too much discussion
[07:29:53] <_joe_> eheh
[07:30:26] <_joe_> still, I'd prefer the application to do that translation work
[07:30:36] <_joe_> it's an API and I'm your client, right?]
[07:30:47] oh please write that in the ticket
[07:30:56] (poossibly also explaining why)
[07:31:17] _joe_: yes, it's a contract between services and the monitoring tool
[07:31:59] i've been basing my comments on that, why gabriel argues for exposing full specs so that any entity may use them as they wish
[07:32:06] s/why/but
[07:32:39] and i agree that eventually all services should expose their full specs
[07:32:47] but the template is not yet there to offer that
[07:32:58] <_joe_> I think one doesn't exclude the other
[07:33:23] <_joe_> but in my experience, you will have situations where you want to define what to monitor as a "decorator" to your method
[07:33:47] <_joe_> (sorry for the java/pythonesque terminology, I have no idea how you call those in nodejs)
[07:34:16] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.090 second response time
[07:34:18] (CR) Giuseppe Lavagetto: [C: -1] "I think this patch goes completely in the right direction. I have a few comments, but I guess we may go in a slightly different direction:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/204068 (owner: BBlack)
[07:34:21] no worries
[07:35:06] <_joe_> and I had to write some nodejs code at $WORK~1
[07:35:52] <_joe_> because well, some libraries were so ugly that even I knew how to make them better :P
[07:36:10] hahaha
[07:36:52] * mobrovac has to think of a project where he can write some c++ just for the pleasure of it
[07:38:33] where were you during the HHVM migration!
[07:38:53] <_joe_> oh well. I have a wishlist for HHVM :P
[07:39:29] <_joe_> mobrovac: btw, something tells me you'll have to work on that code of mine sooner than later. It's a node library for etcd. I really hoped someone did something better but it doesn't look like it. There is one in coffeescript which is obviously better, though
[07:40:07] :)
[07:40:52] _joe_: https://www.npmjs.com/package/node-etcd ?
[07:41:08] <_joe_> that one is in coffeescript and it's better
[07:41:45] <_joe_> mobrovac: but devs refused to use coffeescript
[07:42:16] ah right
[07:42:28] but it compiles to a nodejs module, so ...
[07:42:32] <_joe_> yes
[07:42:43] <_joe_> I didn't say it made sense :)
[07:42:49] <_joe_> so yes, I'd use that
[07:43:26] btw are we set on etcd or are options still being explored?
[07:45:32] both ;)
[07:45:42] <_joe_> ahah
[07:45:54] <_joe_> nope, not true. I like a _lot_ zookeeper
[07:46:50] <_joe_> I just find it unconfortable its api, and the fact you can't query it via curl
[07:47:25] <_joe_> also, I'm a bit scared by it being java, of course :)
[07:47:25] zk still suffers from SPOF IIRC
[07:47:57] <_joe_> mobrovac: I don't think so, but I may remember incorrectly. I plan to start testing next week
[07:48:16] PROBLEM - RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:36] "unconfortable its api, and the fact you can't query it via curl" -> seems like reason enough not to use it
[07:49:16] PROBLEM - SSH on ms-be1016 is CRITICAL - Socket timeout after 10 seconds
[07:49:27] PROBLEM - configured eth on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:28] PROBLEM - swift-account-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:36] PROBLEM - swift-object-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:47] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:57] PROBLEM - swift-account-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:06] PROBLEM - swift-account-reaper on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:07] PROBLEM - swift-container-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:18] PROBLEM - swift-container-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:27] PROBLEM - swift-container-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:27] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:36] PROBLEM - swift-account-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:36] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:46] PROBLEM - swift-object-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:46] PROBLEM - dhclient process on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:47] PROBLEM - swift-object-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:57] PROBLEM - Disk space on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:51:06] PROBLEM - salt-minion processes on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:51:37] PROBLEM - DPKG on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:51:37] PROBLEM - swift-object-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:31] <_joe_> !log powercycling ms-be1016, console shows BUG: soft lockup - CPU#XX stuck for YYs! being emitted continuously
[08:02:57] RECOVERY - DPKG on ms-be1016 is OK: All packages OK
[08:02:57] RECOVERY - swift-object-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[08:02:57] RECOVERY - swift-account-reaper on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[08:03:07] RECOVERY - swift-container-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[08:03:17] RECOVERY - swift-container-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[08:03:17] RECOVERY - swift-container-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[08:03:18] RECOVERY - puppet last run on ms-be1016 is OK Puppet is currently enabled, last run 24 minutes ago with 0 failures
[08:03:27] RECOVERY - swift-account-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[08:03:36] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[08:03:37] RECOVERY - swift-object-auditor on ms-be1016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[08:03:46] RECOVERY - dhclient process on ms-be1016 is OK: PROCS OK: 0 processes with command name dhclient
[08:03:47] RECOVERY - SSH on ms-be1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[08:03:47] RECOVERY - swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[08:03:48] RECOVERY - Disk space on ms-be1016 is OK: DISK OK
[08:03:57] RECOVERY - configured eth on ms-be1016 is OK - interfaces up
[08:03:57] RECOVERY - swift-account-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[08:03:57] RECOVERY - salt-minion processes on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:04:08] RECOVERY - swift-object-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[08:04:26] RECOVERY - very high load average likely xfs on ms-be1016 is OK - load average: 8.70, 2.86, 1.01
[08:04:27] RECOVERY - RAID on ms-be1016 is OK Active: 6, Working: 6, Failed: 0, Spare: 0
[08:04:27] RECOVERY - swift-account-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[08:05:07] seems the log bot is dead again
[08:05:15] !log testing !log
[08:05:36] Didn't log the message, master.
[08:05:46] stupid bot
[08:05:57] restarting it
[08:07:05] !log testing !log
[08:07:12] logmsgbot: ping
[08:07:27] LoginError: (, {u'result': u'WrongPass'})
[08:07:32] so that is wikitech being dead again
[08:07:34] or ldap
[08:08:35] if I am right, someone need to restart keystone or ldap on virt1000
[08:09:09] Sigh
[08:09:14] I'm in a bus
[08:09:22] Someone else with root needs to
[08:11:00] I'll take a look
[08:12:02] !log bounce keystone on virt1000
[08:12:07] hashar: https://integration.wikimedia.org/ci/job/mwext-testextension-zend/114/console is dead too. known?
[08:12:16] sad_trombone.wav
[08:12:35] Logged the message, Master
[08:14:31] godog: our error!
[08:14:38] err
[08:14:39] our hero
[08:15:39] haha it is a fine line between the two
[08:27:16] operations: Java security updates (CPU 2014) - https://phabricator.wikimedia.org/T96125#1208853 (MoritzMuehlenhoff)
[08:27:38] godog: hey! I have a graphite / statsd question (from pov of a developer) got a moment?
[08:34:37] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds
[08:37:21] YuviPanda: sure, in 5
[08:37:55] godog: cool
[08:43:33] YuviPanda: hey, shoot
[08:43:44] godog: so if you look at http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1429086474.982&target=tools.tools-services-02.WebServiceMonitor.manifestscollected
[08:43:54] it says something about 1900 a minute of sorts, right?
[08:43:58] but what’s actually happening
[08:44:05] is that it’s reporting about 300 something every 10 seconds
[08:44:13] and I assume that it would get averaged every flush period
[08:44:15] instead of ‘added'
[08:44:31] which is what seems to be happening instead (added every 60s?)
[08:44:36] I’m wondering what I’m doing wrong
[08:45:27] https://github.com/wikimedia/operations-software-tools-manifest/blob/master/tools/manifest/collector.py#L60
[08:45:31] I wonder if the ‘incr’ is the problem
[08:45:36] and I should ‘set’ it somehow instead
[08:45:43] this is using your python-statsd package for client
[08:46:07] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 6.222 second response time
[08:46:11] YuviPanda: looking
[08:48:42] godog: cool
[08:48:53] YuviPanda: we're flushing at 60s ATM btw
[08:49:03] right
[08:49:04] YuviPanda: how often does the collector run?
[08:49:11] godog: approximately every 10s
[08:49:17] this isn’t a diamond collector tho
[08:50:58] YuviPanda: but yeah in this case it'll accumulate the counter until the flush and then reset it, so you might get duplicates in there if it is collecting every 10s
[08:51:06] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds
[08:51:09] booo :(
[08:51:26] godog: but isn’t it supposed to flush the value as being the average of all the values it gets?
[08:52:24] YuviPanda: counters no, they are only increasing
[08:52:40] right, so that’s part of the problem - I’m not sure which bit to use :D
[08:53:00] godog: so if I want it to ‘flush average of all values received since last flush’ what should I use?
[08:54:28] YuviPanda: why the average though? you can use a set and get an unique count
[08:55:14] godog: because the deamon works like: 1. do things, 2. sleep 10s, 3. go back to 1
[08:55:37] so a set would all be unique only inside (1) and there can be upto 6 of them every minute
[08:55:41] so average is my best bet
[08:56:41] YuviPanda: the count will be unique inside a flush interval btw
[08:57:04] true, but I don’t want to be sending potentially count packets when I can send only 1 packet
[08:57:27] and I do think average is what actually makes more sense than set here
[08:57:48] if first 3 report 300 and the last 2 report 200, I want to see that drop I think
[08:58:35] operations, MediaWiki-extensions-Graph, Services, service-template-node, service-runner: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1208877 (mobrovac) @akosiaris has started work on this, ETA: end of April
[09:00:42] operations, Mobile-Apps, Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1208880 (mobrovac) Another status update: T95533 has been resolved, allowing us to move forward on this front. ETA for deployment: end of April.
[09:01:42] YuviPanda: I think I'm missing what you are interested in knowning from the metric
[09:01:57] godog: it’s basically ‘number of valid service manifests in all of toollabs'
[09:02:19] godog: if you want something that’s more concrete
[09:02:21] godog: look at http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1429088531.664&target=tools.tools-services-02.WebServiceMonitor.startsuccess
[09:02:26] that’s number of webservices restarted
[09:02:37] hmm actually
[09:02:41] *that* is averaging itself
[09:02:59] * YuviPanda and giving it floating point values?!
[09:05:24] operations, Mobile-Apps, Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1208896 (mobrovac)
[09:06:38] godog: alright, I gotta sleep. I’ll think about it some more and read up docs and write an email if needed :)
[09:06:39] thanks
[09:06:52] godog: thanks for statsite - it’s made graphite in general a lot more usable \o/
[09:07:49] YuviPanda: np, I think we'll need to make some adjustments too, also consider gauges for what you're trying to do
[09:08:06] YuviPanda: it won't average though, just update
[09:08:08] godog: yeah, but wouldn’t that only count the last entry before flush?
[09:08:09] yeah
[09:08:21] so if I have a drop in the first 5 runs in a minute it’ll just no twork
[09:08:23] *not
[09:08:24] I love how null are really null
[09:08:38] operations, Continuous-Integration, Continuous-Integration-Isolation: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1208904 (hashar) Should we start drawing a network diagram representing the different lan / vlan we have and the traffic flows between...
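[Editor's note] The behaviour YuviPanda and godog are circling above is standard statsd aggregation: within one flush interval a counter sums every incr(), a gauge keeps only the last value set, and neither emits an average. A small simulation of one 60s window with six ~10s collector runs; the per-run values are invented to match the ~300-per-run / ~1900-per-minute figures from the discussion, and the flush logic is a sketch of the semantics, not statsite's actual code:

```python
# One 60s flush window: six collector runs at ~10s intervals, each
# reporting its manifest count. Illustrative values, not real data.
runs = [300, 320, 310, 320, 330, 320]

counter_total = sum(runs)        # incr(): values accumulate until flush
gauge_value = runs[-1]           # gauge: only the last set() survives
mean = sum(runs) / len(runs)     # the per-run average YuviPanda wanted

assert counter_total == 1900     # the ~6x inflation seen on the graph
assert gauge_value == 320        # a dip early in the window is invisible
assert round(mean, 2) == 316.67
```

A timer metric is a third option here: statsd-style timers emit per-window statistics (including a mean), which gets the desired average at the cost of extra emitted series.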
[09:10:40] hashar: yup
[09:10:42] best feature
[09:22:17] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.251 second response time
[09:26:32] operations, Continuous-Integration, Continuous-Integration-Isolation, Patch-For-Review, Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1208920 (hashar) From upstream at http://lists.openstack.org/pipermail/openstack-infra/2015-April/0026...
[09:32:06] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds
[09:36:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[09:41:26] <_joe_> we don't keep a list of all our datacenters anywhere in our code?
[09:41:43] <_joe_> meh
[09:42:10] what do you mean?
[09:42:28] <_joe_> in puppet, we don't have any list that includes all our datacenters
[09:42:53] should be easy to add to realm.pp
[09:42:53] <_joe_> or even dividing between "caching" ones and "main" ones
[09:43:05] well that distinction is only meaningful per context
[09:43:14] <_joe_> yes, it's just strange it's not there :)
[09:43:17] what does that mean even :)
[09:44:07] <_joe_> well, a distinction is that some datacenters are serving mediawiki/services directly, and the others don't
[09:44:50] <_joe_> but I don't have a compelling use of it, right now, so nevermind
[09:45:09] <_joe_> I do have an use for the list of our datacenters though, adding it.
[09:46:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[09:48:05] even the list is a bit dependent on context :)
[09:48:47] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[09:52:22] what are you trying to do?
[09:52:54] 6operations: Encrypted password storage - https://phabricator.wikimedia.org/T96130#1208952 (10MoritzMuehlenhoff) 3NEW [09:53:21] <_joe_> paravoid: I'm trying to elaborate on https://gerrit.wikimedia.org/r/#/c/204068/ [09:53:28] <_joe_> which has a couple of issues [09:54:27] <_joe_> and in writing one function, I would've loved not to hardcode the list of all our datacenters inside the function itself [09:55:32] <_joe_> but it was completely secondary to my problem, and I got derailed as usual :) [09:57:05] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 5Patch-For-Review: Provide Debian package python-pymysql for jessie-wikimedia - https://phabricator.wikimedia.org/T96131#1208959 (10hashar) 3NEW [09:59:34] 6operations: Encrypted password storage - https://phabricator.wikimedia.org/T96130#1208966 (10Joe) I think it would be nice/necessary to be able to have ACLs on different sections of the store, and be able to select what each user/group of users will be able to read and/or write. I know pwstore allows this, just... 
[10:00:00] (03PS1) 10Filippo Giunchedi: eventlogging: adjust counters thresholds [puppet] - 10https://gerrit.wikimedia.org/r/204237 (https://phabricator.wikimedia.org/T90111) [10:04:47] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.104 second response time [10:05:47] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [10:07:47] PROBLEM - puppet last run on mw2013 is CRITICAL puppet fail [10:13:48] (03PS5) 10Hashar: Initial Debian packaging [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) [10:14:47] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60626 bytes in 1.898 second response time [10:15:09] (03CR) 10Hashar: "Added python-pymysql and python-daemon to Build-Deps since dh_python2 does not find them." [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) (owner: 10Hashar) [10:16:05] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation: Provide Debian package python-pymysql for jessie-wikimedia - https://phabricator.wikimedia.org/T96131#1208971 (10hashar) [10:19:56] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [10:21:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60647 bytes in 0.784 second response time [10:25:47] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:25:50] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1208974 (10hashar) [10:26:22] !log bounce jobchron on mw1001 [10:26:34] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 5Patch-For-Review, 
7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1028174 (10hashar) [10:26:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [10:26:44] morebots: I am disappoint [10:26:44] I am a logbot running on tools-exec-02. [10:26:44] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [10:26:44] To log a message, type !log . [10:26:59] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1028174 (10hashar) I have updated the dependency table in the task details to take into account Jessie ins... [10:27:20] godog: let me look at the logs [10:27:45] ssh tools-login.eqiad.wmflabs [10:27:47] become morebots [10:28:04] LoginError: (, {u'result': u'WrongPass'}) [10:28:14] godog: I guess keystone / ldap whatever needs yet another restart [10:28:32] we should probably get a monitoring probe for that service [10:29:37] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60634 bytes in 1.297 second response time [10:30:04] hashar: indeed that would be nice [10:30:17] !log restart keystone on virt1000 (#2) [10:30:39] Logged the message, Master [10:30:41] ! [10:30:43] magic [10:31:02] !log bounce jobchron on mw1001 [10:31:07] Logged the message, Master [10:31:19] meanwhile I feel really more at ease with debian packaging [10:36:24] 6operations, 10MediaWiki-JobRunner: jobchron logs are not rotated - https://phabricator.wikimedia.org/T96132#1208986 (10fgiunchedi) 3NEW [10:37:58] (03CR) 10Alexandros Kosiaris: [C: 032] "Stupid mistake. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/203228 (owner: 10Hashar) [10:44:17] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds [10:48:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor dependency comment, other LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [10:50:46] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [10:51:14] <_joe_> why is the puppet master in labs going down repeatedly? has anyone looked? [10:56:39] <_joe_> looking [10:57:16] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.720 second response time [11:02:27] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds [11:16:07] <_joe_> !log restarted apache2 on virt1000, passenger gone to hell [11:16:57] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.085 second response time [11:22:16] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds [11:23:39] (03CR) 10Alexandros Kosiaris: [C: 032] ssh::userkey: Allow a prefix to be specified for a key [puppet] - 10https://gerrit.wikimedia.org/r/202731 (owner: 10Alexandros Kosiaris) [11:38:26] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.355 second response time [11:41:53] (03CR) 10Alexandros Kosiaris: [C: 032] Specify ssh userkey policy for ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/202730 (owner: 10Alexandros Kosiaris) [11:43:37] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds [11:45:03] (03PS7) 10Alexandros Kosiaris: Specify ssh userkey policy for ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/202730 [11:45:31] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Specify ssh userkey 
policy for ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/202730 (owner: 10Alexandros Kosiaris) [11:47:00] (03PS7) 10Alexandros Kosiaris: ssh::userkey: Allow a prefix to be specified for a key [puppet] - 10https://gerrit.wikimedia.org/r/202731 [11:48:31] (03PS1) 10Filippo Giunchedi: graphite: switch remaining machines to statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/204247 [11:49:44] (03PS2) 10Filippo Giunchedi: graphite: switch remaining machines to statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/204247 [11:49:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: switch remaining machines to statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/204247 (owner: 10Filippo Giunchedi) [11:51:37] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.042 second response time [12:01:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [12:02:52] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [12:06:02] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:06:11] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:10:19] (03PS7) 10Alexandros Kosiaris: Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 [12:10:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 (owner: 10Alexandros Kosiaris) [12:10:52] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds [12:20:43] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.149 second response time [12:21:08] (03CR) 10Hashar: "Thanks Alexandros. 
I think we can express the resources dependency explicitly:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [12:26:01] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:26:01] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:31:37] 7Puppet, 6operations: Prepend timestamp in /var/log/puppet.log - https://phabricator.wikimedia.org/T75989#1209049 (10hashar) 5Open>3declined Shelling out to date for each line is not clever. Maybe we could extend the puppet console log formatter with our own class but I have no idea whether it is doable no... [12:35:32] PROBLEM - puppetmaster https on virt1000 is CRITICAL - Socket timeout after 10 seconds [12:37:02] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.077 second response time [12:37:02] akosiaris: do you know anything about what’s happening, other than a million alerts firing? 
[12:37:30] abogott: keystone is not answering, obviously neither is nova [12:37:37] but that is not standard, rather intermittent [12:38:01] at first due to a keystone debug log I thought it was LDAP not answering [12:38:07] but it seems this is working fine [12:38:21] I have gone through restarting keystone half a dozen times [12:38:29] ok [12:38:37] also nova-* services [12:38:56] a few minutes ago I killed a couple of mysqldumps on nova and keystone databases [12:39:22] 7Puppet, 6Labs: Puppet logs should be timestamped in a human-readable way - https://phabricator.wikimedia.org/T88108#1209056 (10scfc) [12:39:25] 7Puppet, 6operations: Prepend timestamp in /var/log/puppet.log - https://phabricator.wikimedia.org/T75989#1209057 (10scfc) [12:39:48] but that was like a shotgun approach, I did not really believe mysql was the problem [12:39:58] akosiaris: is it intermittent? [12:40:03] yes [12:40:08] ok [12:40:38] for example now nova list is returning just fine [12:40:53] a few minutes ago it would stall and never return a single result [12:41:15] yeah, naturally it’s working fine now that I’m trying to see the issue :) [12:41:31] how was memory on virt1000 when things were at their worst? [12:42:02] oom did not come out if that is what you are asking [12:42:24] ok [12:42:46] btw, we noticed this due to puppetmaster failing [12:43:09] while it seems it was a more deeply hidden problem [12:43:33] did you restart opendj by chance? [12:43:39] the machine was in heavy iowait on one of the CPUs [12:43:45] yes, on neptunium [12:43:52] ah, ok, that explains the dns issue... [12:44:11] and of course it did not help [12:44:32] (there’s a dumb issue with pdns where it can’t recover if ldap restarts.) [12:44:51] 7Puppet, 6operations: Prepend timestamp in /var/log/puppet.log - https://phabricator.wikimedia.org/T75989#1209071 (10scfc) I had the same wish because I had assumed that "manual" Puppet runs (`sudo puppet agent -tv`) do not log to `/var/log/syslog`. 
But they do, so all Puppet runs are logged with individually... [12:48:33] andrewbogott: so, how long are the mysqldumps in /usr/local/sbin/db-bak.sh supposed to last ? [12:49:27] There’s a huge amount of data in that db that we can just drop. [12:49:49] Since there’s latent wikitech data there that’s no longer used. [12:49:54] 9:45 mysqldump --single-transaction -u root keystone -c [12:50:07] almost 10 hours... I 'd say we should [12:50:10] springle: is there any reason I can’t just drop the wikitech data from the virt1000 db? [12:50:36] akosiaris: so, that almost fits. If mysql was hammered such that keystone couldn’t query it... [12:50:49] Although that doesn’t explain the puppet issue, since as far as I know puppet doesn’t use mysql [12:51:13] <_joe_> puppet was blocked communicating with keystone [12:51:13] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1209083 (10hashar) a:5hashar>3None [12:51:19] <_joe_> to set some nova values [12:51:55] puppet is calling nova directly, right ? [12:51:55] Are we talking about the local puppet run on virt1000, or the puppetmaster? [12:52:02] hmmm [12:52:03] <_joe_> puppetmaster [12:52:09] keystone | Query | 40302 | Sending data | SELECT /*!40001 SQL_NO_CACHE */ * FROM `token` [12:52:17] virt1000 is the puppetmaster for labs too, right? [12:52:21] yes [12:52:27] ok that would explain the heavy IOwait [12:52:49] I’m trying to think why the puppetmaster would hit keystone. It should be talking straight to ldap. [12:53:08] On the other hand, puppet client runs have a fact which hits the metadata service and /that/ probably hits keystone. Is it possible that that’s what you were seeing? 
[12:53:27] /usr/bin/python /usr/bin/nova --os-region-name eqiad --os-auth-url http://virt1000.wikimedia.org:35357/v2.0 --os-password --os-username novaadmin --os-tenant-name editor-engagement meta mwui set puppetstatus=failed [12:53:41] that was seen all over for all VMs more or less [12:53:46] ah! [12:53:57] Yes, puppetmaster hitting metadata [12:53:57] puppet calls it for some reason [12:54:02] So, ok, this all adds up. [12:54:21] puppetstatus=failed is set by labstatus.rb isn't it ? [12:54:31] so me killing mysqldumps actually fixed the problem ? [12:54:42] akosiaris: maybe :) [12:54:44] <_joe_> lol [12:54:52] oh, come on... [12:55:06] then, that mysql is misbehaving somehow... :-( [12:55:26] or being abused! [12:55:36] yeah, point taken [12:55:46] as always, I fix applications, not databases [12:55:52] :) [12:56:26] Probably https://phabricator.wikimedia.org/T92693 is the proper fix for this [12:56:31] that and thinning out the obsolete data in that db [12:56:47] Which I’ve talked to sean about a few times but probably he’s waiting for me to do it and I’m waiting for him to do it. [12:56:52] So I’ll just do it right now :) [12:57:25] heh, I guess i know it’s backed up at least [12:57:33] I would say purging all expired tokens from that table regularly would be a sane thing to do [12:57:50] akosiaris: keystone tokens you mean? 
[12:58:09] !log dropping labswiki and labswiki_eqiad from mysql on virt1000 [12:58:12] I am afraid to do a select count(*) but show table status says 4763172 rows with a data length of 71531757568 bytes [12:58:16] andrewbogott: yes [12:58:18] Logged the message, Master [12:58:31] * andrewbogott is more-or-less terrified of the ‘drop database’ command [12:58:31] !log restarted opendj on neptunium [12:58:36] Logged the message, Master [12:58:42] !log restarted keystone, nova services on virt1000 [12:58:47] Logged the message, Master [12:59:00] this is actually not the correct timestamp but at least here it is [12:59:09] !log restarted pdns on virt1000 and labcontrol2001 to recover from the opendj restart [12:59:15] Logged the message, Master [12:59:15] <_joe_> andrewbogott: don't do it, then [12:59:29] `expires` datetime DEFAULT NULL, [12:59:44] there... the table even has a nice field for the DELETE from where query ;-) [12:59:52] _joe_: I suspect that backing up those tables was part of what made the mysql backup job run forever and gobble resources [13:00:21] andrewbogott: that is not what I witnessed [13:00:33] I saw the token table being backed up for like forever [13:00:38] well, half a day [13:01:37] and /usr/local/sbin/db-bak.sh is not backing up everything, it is specifically backing up labswiki, keystone, nova and then glance and mysql [13:02:01] funny thing is, it is done in a nice -n 19 wrapper [13:02:08] akosiaris: I just typed ‘select * from token’ and now it’s hanging. That’s because that table is… super big? [13:02:25] andrewbogott: yup [13:02:30] I suggest ctrl-c [13:02:35] so keystone doesn’t clean up after itself, ever [13:02:39] and reissuing with something like limit 10; [13:02:53] if that is true (which it might very well be), it is unfortunate [13:03:31] http://www.sebastien-han.fr/blog/2012/12/12/cleanup-keystone-tokens/ [13:03:32] ahahaha [13:03:37] deja vu ? 
[13:04:03] don't forget whoever mentioned earlier that keystone fix -> restart pdns as well [13:04:04] ha! Yes, that certainly fits [13:04:19] The setup runs for 2 months now and already 1970938 and I don’t run a public cloud. I can’t imagine the nightmare with a public cloud… [13:04:58] andrewbogott: sorry about waking you up btw [13:05:13] what puzzles me is why today and not yesterday or last week or something though [13:05:18] akosiaris: np, I woke up around 5 mins before you called [13:05:40] akosiaris: I think this has been happening, a little bit, periodically. And we just hit a limit where it was finally too much and the problem became more severe. [13:06:32] akosiaris: so, for the moment I think I support a cron that deletes expired tokens. are you deep enough in that you can feed me a mysql command that will do that? [13:06:55] I think so [13:07:03] oh, I guess that page includes the exact command for that, huh? [13:07:07] so, that guy proposes mysql -u${mysql_user} -p${mysql_password} -h${mysql_host} -e 'USE keystone ; DELETE FROM token WHERE NOT DATE_SUB(CURDATE(),INTERVAL 2 DAY) <= expires;' [13:07:23] but there are some improvements we can do [13:08:01] for example use keystone is not needed in the query [13:08:23] mysql keystone -e 'DELETE etc' is slightly better [13:08:51] also that 2 days thing should be more configurable [13:09:01] but otherwise the basic premise seems correct to me [13:10:23] btw, I am starting to think keystone actually cleans up after itself [13:10:42] 2015-03-06 05:16:06 seems to be the earliest of tokens we got [13:11:05] no scratch that [13:11:15] it was me misreading some fields [13:11:21] 2012-08-18 01:34:54 [13:11:35] is the earliest we got, so no, it does not cleanup [13:12:59] btw andrewbogott driver = keystone.token.backends.memcache.Token [13:13:08] we could keep the tokens in memcached as well [13:13:35] not that I like the idea of adding one more part in that machinery, but someone obviously has done it [13:14:11] 
yeah, I was thinking about memcached, but… since wikitech is on a different server now we don’t currently depend on memcached on virt1000. Nice to keep it that way [13:21:02] (03PS1) 10Andrew Bogott: Clean up expired keystone tokens. [puppet] - 10https://gerrit.wikimedia.org/r/204256 [13:21:08] akosiaris: ^ [13:21:32] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [13:21:42] (03PS2) 10Andrew Bogott: Clean up expired keystone tokens. [puppet] - 10https://gerrit.wikimedia.org/r/204256 [13:22:34] (03CR) 10jenkins-bot: [V: 04-1] Clean up expired keystone tokens. [puppet] - 10https://gerrit.wikimedia.org/r/204256 (owner: 10Andrew Bogott) [13:23:02] (03CR) 10Thcipriani: Add submodules to master checkoutMediaWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204080 (https://phabricator.wikimedia.org/T88442) (owner: 10Thcipriani) [13:23:56] akosiaris: is there any reason why this isn’t a keystone bug? [13:24:39] (03PS1) 10ArielGlenn: html dumps will be served from host where they are produced, via proxy [puppet] - 10https://gerrit.wikimedia.org/r/204257 [13:25:08] (03PS3) 10Andrew Bogott: Clean up expired keystone tokens. [puppet] - 10https://gerrit.wikimedia.org/r/204256 [13:25:30] (03CR) 10jenkins-bot: [V: 04-1] html dumps will be served from host where they are produced, via proxy [puppet] - 10https://gerrit.wikimedia.org/r/204257 (owner: 10ArielGlenn) [13:25:40] 6operations, 5Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1209143 (10Gage) I spoke with Tobias from Strongswan on IRC about this: <+ecdsa> jgage: The log you posted shows a rekey collision for the IPv6 SA, but that seems to be han... 
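The predicate in the cleanup query proposed at 13:07 can be sanity-checked in isolation. The sketch below assumes the blog post's 2-day grace period and mirrors SQL's three-valued NULL semantics, under which a row with a NULL `expires` never matches the WHERE clause and is therefore kept:

```python
from datetime import date, timedelta

def would_delete(expires, today, grace_days=2):
    """Model of the row filter in:
    DELETE FROM token WHERE NOT DATE_SUB(CURDATE(), INTERVAL 2 DAY) <= expires

    In SQL, NULL <= anything is NULL, and NOT NULL is still NULL,
    so a NULL `expires` (None here) never matches and the row is kept.
    """
    if expires is None:
        return False
    cutoff = today - timedelta(days=grace_days)
    return not (cutoff <= expires)  # i.e. expires is older than the cutoff

today = date(2015, 4, 15)
print(would_delete(date(2012, 8, 18), today))  # True: the 2012-era token is purged
print(would_delete(date(2015, 4, 14), today))  # False: expired yesterday, inside the grace window
print(would_delete(None, today))               # False: NULL expires is left alone
```

The double negation in the original query is just "expires < cutoff" written awkwardly; the grace window means tokens stay around for two days past expiry, which matches akosiaris's "it deletes expired tokens 2 days after they are expired" reading.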
[13:25:59] (03CR) 10Alexandros Kosiaris: contint: make Jessie slaves package builders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [13:26:37] andrewbogott: a bug ? well a missing feature I would say (other would too I suppose) [13:26:48] ok, logging [13:27:00] (03PS3) 10Thcipriani: Add submodules to master checkoutMediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204080 (https://phabricator.wikimedia.org/T88442) [13:27:42] andrewbogott: are you sure about those 2 days ? [13:27:51] perhaps it should be more ? [13:27:56] akosiaris: no, that’s just a c/p [13:28:04] I guess we can make it a lot bigger and still get the benefit. [13:28:11] yup [13:28:46] hmm so it deletes expired tokens 2 days after they are expired [13:29:08] it actually feels right... [13:29:17] yeah, 2 days is quite a bit. [13:29:26] if they're expired they're useless for ongoing access for a long-running whatever somehow? [13:29:37] (03PS2) 10ArielGlenn: html dumps will be served from host where they are produced, via proxy [puppet] - 10https://gerrit.wikimedia.org/r/204257 [13:29:52] or is it possible for $something to use a token to open some connection and keep it alive past expiry so long as the token isn't deleted? [13:30:11] I don’t know. I certainly can’t think of anything like that. [13:30:22] This is all REST stuff so nothing should persist. [13:30:31] ok [13:30:34] I think so too [13:30:39] then again, famous last words [13:31:11] (03CR) 10BBlack: [C: 031] Clean up expired keystone tokens. [puppet] - 10https://gerrit.wikimedia.org/r/204256 (owner: 10Andrew Bogott) [13:32:22] (03CR) 10Alexandros Kosiaris: [C: 031] "Premise seems just fine." [puppet] - 10https://gerrit.wikimedia.org/r/204256 (owner: 10Andrew Bogott) [13:34:12] (03CR) 10Andrew Bogott: [C: 032] Clean up expired keystone tokens. 
[puppet] - 10https://gerrit.wikimedia.org/r/204256 (owner: 10Andrew Bogott) [13:34:32] akosiaris: “Provision the ssh key added” shall I merge? [13:35:06] akosiaris: for the package builder role on ci, should I just require => Class['contint::packages::labs'] so ? [13:35:52] should realize it before role::package::builder , but then the module uses ensure_package and I am afraid 'cowbuilder' will end up being realized earlier [13:37:57] 7Blocked-on-Operations, 6operations, 5Patch-For-Review: Install nodejs, nginx and other dependencies on francium - https://phabricator.wikimedia.org/T94457#1209183 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/204257/ nginx setup for the html dumps producing host, which will temporarily be serving its du... [13:38:38] andrewbogott: yes please [13:39:17] akosiaris: you disabled puppet on virt1000 earlier? [13:39:24] hashar: yeah that require seems fine. I doubt cowbuilder will be realized beforehand though [13:39:35] andrewbogott: ah yes, I did, should have enabled it [13:39:41] ok, I’ll re-enable [13:39:52] akosiaris: giving it a try [13:40:04] (03CR) 10Hashar: contint: make Jessie slaves package builders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [13:40:13] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:40:13] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:42:11] hashar: btw, I think force => true would force that symlink anyway [13:42:44] (03PS6) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [13:42:58] no wait... it would be deleting an entire directory if the race condition you describe happens... hmmm not sure what will happen [13:43:11] akosiaris: yeah I thought about that. 
But if the dir is created, the cow images are created which takes a while then they are deleted and recreated. that is annoying [13:43:42] (03PS2) 10Hashar: package_builder: fix dependency order for hooks [puppet] - 10https://gerrit.wikimedia.org/r/203228 [13:43:48] (03PS1) 10Andrew Bogott: Move the keystone token cron into openstack::database-server [puppet] - 10https://gerrit.wikimedia.org/r/204259 [13:44:07] (03CR) 10Hashar: "Cherry picked to production to get rid of the parent change that needs further work." [puppet] - 10https://gerrit.wikimedia.org/r/203228 (owner: 10Hashar) [13:44:28] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] package_builder: fix dependency order for hooks [puppet] - 10https://gerrit.wikimedia.org/r/203228 (owner: 10Hashar) [13:45:41] (03CR) 10Andrew Bogott: [C: 032] Move the keystone token cron into openstack::database-server [puppet] - 10https://gerrit.wikimedia.org/r/204259 (owner: 10Andrew Bogott) [13:47:49] (03CR) 10Chad: Add submodules to master checkoutMediaWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204080 (https://phabricator.wikimedia.org/T88442) (owner: 10Thcipriani) [13:48:59] (03PS12) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [13:49:49] (03CR) 10jenkins-bot: [V: 04-1] r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 (owner: 10BBlack) [13:50:33] <_joe_> bblack: I'll take a look shortly [13:51:12] _joe_: I need to iterate at least once, and then push it through puppet-compiler yet, to find stupid things :) [13:52:29] (03PS3) 10Alexandros Kosiaris: ganeti: Reference correctly the ganeti cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/203035 [13:52:32] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60609 bytes in 0.180 second response time [13:52:51] <_joe_> bblack: I'm dealing with yet-another-small-change in hhvm 3.6 vs 3.3 [13:53:05] 
akosiaris: thanks for sorting out the keystone issue. I guess we’ll check back tomorrow to make sure the cron actually worked. [13:53:19] hm, or I could change it to ’14’ so that it fires in five minutes [13:53:20] * andrewbogott does that [13:53:33] andrewbogott: run it manually [13:53:43] the very first time at least [13:54:04] !log purging expired keystone tokens on virt1000 [13:54:10] Logged the message, Master [13:54:28] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: Reference correctly the ganeti cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/203035 (owner: 10Alexandros Kosiaris) [13:55:47] * andrewbogott runs a big expensive query on virt1000, most likely reproducing the same failure that got us here to begin with… [13:56:55] 6operations, 5Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1209206 (10Gage) I reduced some timeouts in order to recreate the problem; config changes suggested by ecdsa have not yet been made: ``` conn %default ikelifetime=6m... [14:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150415T1400). Please do the needful. [14:00:11] nope^ [14:01:39] (03PS5) 10Filippo Giunchedi: graphite: introduce carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 (https://phabricator.wikimedia.org/T85908) [14:01:43] 6operations, 10ops-eqiad, 10ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1209209 (10Cmjohnson) New disk is on-line nclosure Device ID: N/A Slot Number: 3 Drive's position: DiskGroup: 1, Span: 0, Arm: 0 Enclosure position: N/A Device Id: 3 WWN: 5000c500794334bb Sequence... [14:03:54] (03PS1) 10Alexandros Kosiaris: Followup fix for 7ba51bc [puppet] - 10https://gerrit.wikimedia.org/r/204263 [14:04:54] paravoid: thanks for fixing labvirt networking. Looks good now. 
[14:05:23] great [14:05:31] (03CR) 10Alexandros Kosiaris: [C: 032] Followup fix for 7ba51bc [puppet] - 10https://gerrit.wikimedia.org/r/204263 (owner: 10Alexandros Kosiaris) [14:08:53] (03CR) 10Hashar: [C: 04-1] "I have applied patchset 6 and ran it on integration-slave-jessie-1001.eqiad.wmflabs and cowbuilder ends up being installed first :(" [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [14:09:13] (03PS13) 10BBlack: r::c::config::active_nodes -> hiera cache::$cluster::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [14:09:20] akosiaris: packages are still realized first regardless of the require => Class[..] :-(((( [14:10:30] hashar: ok, gimme a sec to sort something out and I 'll look into it [14:10:41] (03PS1) 10Alexandros Kosiaris: Typo fix in ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/204264 [14:10:42] PROBLEM - puppet last run on cp3041 is CRITICAL puppet fail [14:12:08] (03CR) 10Alexandros Kosiaris: [C: 032] Typo fix in ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/204264 (owner: 10Alexandros Kosiaris) [14:13:19] (03PS1) 10Filippo Giunchedi: gdash: display udp errors in graphite dashboard [puppet] - 10https://gerrit.wikimedia.org/r/204265 [14:13:51] RECOVERY - puppet last run on ganeti1002 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:14:00] (03PS2) 10Filippo Giunchedi: gdash: display udp errors in graphite dashboard [puppet] - 10https://gerrit.wikimedia.org/r/204265 [14:14:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: display udp errors in graphite dashboard [puppet] - 10https://gerrit.wikimedia.org/r/204265 (owner: 10Filippo Giunchedi) [14:14:22] RECOVERY - puppet last run on ganeti2004 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:14:22] RECOVERY - puppet last run on ganeti2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:22] RECOVERY - puppet last 
run on ganeti2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:22] RECOVERY - puppet last run on ganeti2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:04] (03CR) 10Hashar: "Found out package_builder init has:" [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [14:15:22] RECOVERY - puppet last run on ganeti1003 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [14:16:02] RECOVERY - puppet last run on ganeti2006 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:16:52] RECOVERY - puppet last run on ganeti2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:31] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [14:21:11] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-JobRunner, 7Graphite: jobrunner metrics audit - https://phabricator.wikimedia.org/T95913#1209253 (10fgiunchedi) picking this up, related https://gerrit.wikimedia.org/r/204237 https://gerrit.wikimedia.org/r/203839 https://gerrit.wikimedia.org/r/203847... [14:21:18] 6operations, 10ops-eqiad, 10ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1209254 (10Cmjohnson) 5Open>3Resolved package updates were successful..resolving this ticket [14:26:10] (03CR) 10Hashar: "So it is all good to me and I would +2 it but I prefer who ever handles the deployment / maintenance of jouncebot to trigger the merge. Ju" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/203985 (owner: 10BryanDavis) [14:28:52] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:29:24] 6operations, 5Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1209294 (10Gage) Ok, good news. 
Further discussion with ecdsa has revealed that this problem is fixed in 5.3.0, which is released but not yet packaged for Debian. Bug: http... [14:30:54] (03PS14) 10BBlack: r::c::config::active_nodes -> hiera cache::$cluster::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [14:32:17] (03CR) 10Thcipriani: Add submodules to master checkoutMediaWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204080 (https://phabricator.wikimedia.org/T88442) (owner: 10Thcipriani) [14:32:22] RECOVERY - puppet last run on ganeti1004 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:32:34] 10Ops-Access-Requests, 6operations, 10Continuous-Integration: Add user wmde-fisch to LDAP group wmde - https://phabricator.wikimedia.org/T95546#1209309 (10hashar) The Jenkins account shows up with the 'wmde' group at https://integration.wikimedia.org/ci/user/wmde-fisch/ @WMDE-Fisch should thus be able to co... [14:33:42] RECOVERY - puppet last run on ganeti1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:42] (03PS1) 10Filippo Giunchedi: restbase: add ganglia cluster [puppet] - 10https://gerrit.wikimedia.org/r/204274 [14:43:33] ori: are VE preconnects to rest.wikimedia.org active already? 
[14:49:58] (03PS1) 10Filippo Giunchedi: statsite: default to localhost, override as needed [puppet] - 10https://gerrit.wikimedia.org/r/204275 [14:52:07] * anomie sees nothing for SWAT this morning [14:54:36] !log running deleteEmptyAccounts.php --fix on metawiki (CentralAuth) [14:54:41] Logged the message, Master [14:57:12] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [14:57:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150415T1500). Please do the needful. [15:00:28] <^d> I'll take it [15:00:30] <^d> No patches! [15:00:39] (03PS15) 10BBlack: r::c::config::active_nodes -> hiera cache::$cluster::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [15:01:04] <^d> bblack: hiera refactors usually take 10+ patches but they're so worth it in the end :) [15:13:34] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Other than removing the default value for mission-critical lookups, I think this patch is now good to be merged." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/204068 (owner: 10BBlack) [15:15:30] (03PS1) 10Andrew Bogott: Install labvirt-star cert on labvirt nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/204279 [15:15:57] (03PS16) 10BBlack: r::c::config::active_nodes -> hiera cache::$cluster::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [15:16:35] 6operations, 5Interdatacenter-IPsec: Update 3.19 kernel to 3.19.4 - https://phabricator.wikimedia.org/T96146#1209407 (10MoritzMuehlenhoff) 3NEW [15:16:41] (03PS2) 10Filippo Giunchedi: statsite: default to localhost, override as needed [puppet] - 10https://gerrit.wikimedia.org/r/204275 [15:17:59] 6operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1209414 (10coren) NFS indeed does not allow us to know which enduser is responsible for any specific traffic, as an unavoidable consequence of the levels of abstraction t... [15:18:55] 6operations, 5Interdatacenter-IPsec: Update 3.19 kernel to 3.19.4 - https://phabricator.wikimedia.org/T96146#1209416 (10BBlack) In practice, getting this to the to-be-ipsec nodes will take quite some time for cache reboots once it's in the repo and package updated on the hosts... [15:19:54] 6operations, 5Interdatacenter-IPsec: Update 3.19 kernel to 3.19.4 - https://phabricator.wikimedia.org/T96146#1209417 (10BBlack) (I mention the above mainly as a side note about having ipsec rollout date depend on the fix or not) [15:20:43] (03CR) 10Andrew Bogott: [C: 032] Install labvirt-star cert on labvirt nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/204279 (owner: 10Andrew Bogott) [15:22:11] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:27:33] (03PS1) 10Andrew Bogott: Use the already-existing $certname var in libvirtd.conf [puppet] - 10https://gerrit.wikimedia.org/r/204282 [15:27:52] !log disabling puppet on caches JIC for https://gerrit.wikimedia.org/r/204068 merge [15:27:57] Logged the message, Master [15:28:34] (03PS17) 10BBlack: r::c::config::active_nodes -> hiera cache::$cluster::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [15:30:06] (03CR) 10Giuseppe Lavagetto: [C: 031] r::c::config::active_nodes -> hiera cache::$cluster::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 (owner: 10BBlack) [15:30:36] (03CR) 10BBlack: [C: 032] r::c::config::active_nodes -> hiera cache::$cluster::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 (owner: 10BBlack) [15:30:49] (03PS1) 10Alexandros Kosiaris: Typo fixes in role::ganeti [puppet] - 10https://gerrit.wikimedia.org/r/204283 [15:31:49] (03CR) 10Andrew Bogott: [C: 032] Use the already-existing $certname var in libvirtd.conf [puppet] - 10https://gerrit.wikimedia.org/r/204282 (owner: 10Andrew Bogott) [15:32:00] (03CR) 10Alexandros Kosiaris: [C: 032] Typo fixes in role::ganeti [puppet] - 10https://gerrit.wikimedia.org/r/204283 (owner: 10Alexandros Kosiaris) [15:35:08] !log re-enabling puppet on caches, canary nodes were no-op \o/ [15:35:15] Logged the message, Master [15:39:01] ottomata: want to try sending varnish stats straight to graphite? 
[15:39:02] PROBLEM - nova-compute process on labvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [15:40:54] (03PS1) 10Alexandros Kosiaris: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/204285 [15:44:17] (03CR) 10Alexandros Kosiaris: [C: 032] Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/204285 (owner: 10Alexandros Kosiaris) [15:45:07] godog: today is a bad day :( analytics cluster is really unhappy right now with too many jobs running, am trying to help the production ones through, then will figure out some better queues for users [15:45:18] but, you are welcome to just try it on your own [15:45:23] and i can help with any qs you might have [15:46:53] (03PS3) 10Filippo Giunchedi: statsite: default to localhost, override as needed [puppet] - 10https://gerrit.wikimedia.org/r/204275 [15:47:54] ottomata: ack, let me know if you free up today or we can pick it up tomorrow too! [15:56:11] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [15:58:08] (03PS4) 10Gage: mailman: SENDER_HEADERS use from only [puppet] - 10https://gerrit.wikimedia.org/r/154846 (https://bugzilla.wikimedia.org/46049) (owner: 10John F. Lewis) [16:00:14] (03CR) 10Gage: [C: 032] mailman: SENDER_HEADERS use from only [puppet] - 10https://gerrit.wikimedia.org/r/154846 (https://bugzilla.wikimedia.org/46049) (owner: 10John F. Lewis) [16:04:11] (03CR) 10Mobrovac: restbase: add ganglia cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204274 (owner: 10Filippo Giunchedi) [16:05:46] Coren: so, I’m trying to make a self-signed cert, and no matter what I try the Subject and Issuer are the same… but the cert I’m trying to replicate has issuer CN=Wikimedia CA [16:05:48] any idea? [16:06:32] Needs moar contekts. [16:07:40] I need a replacement for https://dpaste.de/FJnj [16:07:43] As a rule, the steps are (a) generate key, (b) create csr, (c) sign csr with same key.
[16:07:48] that uses labvirt* instead of virt* [16:08:11] That's... not a self-signed cert. :-) [16:08:47] Wikimedia CA is… us, isn’t it? [16:09:05] Or is ‘signed by us’ different from self-signed? [16:09:29] wikimedia CA ? [16:09:33] Yep. :-) "self-signed cert" means a certificate that signs /itself/. You want a cert signed by our CA. :-) [16:09:43] (03CR) 10Nuria: eventlogging: adjust counters thresholds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204237 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [16:09:47] Coren: ok! How do I do that? :) [16:10:06] we have a CA now? [16:10:07] (Which, afaik, we don't *have*). Someone somewhere has created a cert with that issuer. You need to locate it and use its key. [16:10:13] greg-g: not really [16:10:24] akosiaris: We totally should, though. [16:10:34] a real CA ? [16:10:43] as in a CA that is in browsers ? [16:10:44] akosiaris: An internal one. [16:10:45] Um… it’s surely not a real CA [16:10:48] not for browsers [16:11:04] we got an internal CA for a few very specific things [16:11:07] we actually got 2 [16:11:20] one that I personally lost the keys for like 1.5 years ago [16:11:27] akosiaris: Heh. [16:11:27] and one that I created to replace that first one [16:11:37] akosiaris: What's its dn? [16:11:44] so andrewbogott you have me to talk to [16:11:53] Coren: hmm lemme check [16:12:18] Coren: is that a different key from virt-star.eqiad.wmnet.key?
Subject: C=US, ST=California, L=San Francisco, O=Wikimedia Foundation, OU=Operations, CN=WMF CA 2014-2017 [16:12:36] Coren: ^ [16:12:47] openssl x509 -in files/ssl/wmf_ca_2014_2017.crt -text [16:13:10] andrewbogott: It is - the virt-star one was /signed/ by one with a dn of C=US, ST=California, L=San Francisco, O=Wikimedia Foundation, CN=Wikimedia CA [16:13:11] this one is actually the new one and it lives entirely in the private puppet repo [16:13:26] andrewbogott: If you need to have the same signer, you need to locate that cert and key [16:13:34] the one without 2014-2017 is the old one [16:13:43] akosiaris: Do we deploy that root cert and trust it? [16:13:44] and I suggest killing it [16:13:50] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [16:13:51] Coren: yes [16:14:04] (03CR) 10Filippo Giunchedi: eventlogging: adjust counters thresholds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204237 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [16:14:05] andrewbogott: You can use a cert signed by that key then. [16:14:14] require certificates::wmf_ca [16:14:22] and require certificates::wmf_ca_2014_2017 respectively [16:14:23] (03CR) 10Filippo Giunchedi: restbase: add ganglia cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204274 (owner: 10Filippo Giunchedi) [16:14:24] old and new [16:14:34] It doesn’t have to be the same, it just has to play nice with wmf-ca.pem [16:14:38] s/require/class/ but you get the picture [16:14:57] andrewbogott: Wait. wmf-ca.pem? Where does that come from? [16:15:05] * andrewbogott is pretty lost [16:15:17] 6operations, 10ops-eqiad, 6Labs: labvirt100x boxes 'no carrier' on eth1 - https://phabricator.wikimedia.org/T95973#1209547 (10Cmjohnson) 5Open>3Resolved This should be resolved now thanks to Faidon's fix. [16:15:23] andrewbogott: what are you trying to do ? [16:15:25] andrewbogott: OKay.
The nutshell: [16:15:29] * akosiaris reading backlog [16:15:41] andrewbogott: You need a cert that is signed by an authority the clients will recognize. [16:16:01] andrewbogott: If the clients have our internal CA certs, then any cert signed with them will work. [16:16:29] ottomata: an1020 has been fixed. [16:16:45] So… on the old virt nodes, we have libvirtd settings: key_file = "/var/lib/nova/virt-star.eqiad.wmnet.key" cert_file = "/etc/ssl/localcerts/virt-star.eqiad.wmnet.crt" ca_file = "/etc/ssl/certs/wmf-ca.pem" [16:16:48] andrewbogott: So you can create a CSR for your labvirt* cert, and sign it with our CA [16:16:53] That doesn’t work on labvirt100x because of the wrong hostname [16:17:08] So I presume I need a new, similar cert but for labvirt* instead of virt* [16:17:18] you presume correctly [16:17:19] That ca_file, where does it come from? That's the one whose key you need. [16:18:25] You need to sign your labvirt* cert with the key to that one. :-) [16:18:41] PROBLEM - Hadoop NodeManager on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:18:41] PROBLEM - puppet last run on analytics1020 is CRITICAL puppet fail [16:18:41] ok… and it sounds like akosiaris thinks that key is lost forever, yes? [16:19:05] hm [16:19:08] I would say is pretty sure vs thinks [16:19:11] I'm not sure it's the same, but if it is then yes - you'll need to also change the ca_file everywhere. [16:19:42] actually, I have to stop working on this and go to the quarterly review meeting.
[16:19:51] akosiaris, if you have time to sort this out, the patch that needs fixing is https://gerrit.wikimedia.org/r/#/c/204279/ [16:19:59] (03CR) 10Filippo Giunchedi: [C: 031] * Simplify package build, also the stepping stone for adding a systemd unit file (T95055) [debs/ircecho] - 10https://gerrit.wikimedia.org/r/204045 (owner: 10Muehlenhoff) [16:20:07] otherwise I will return to this later on and struggle :/ [16:20:28] andrewbogott: where did that cert come from ? [16:20:46] andrewbogott: actually, get to your meeting and we will talk later [16:20:53] I made it, it’s signed with labvirt-star.eqiad.wmnet.key [16:20:59] which is a copy of virt-star.eqiad.wmnet.key [16:21:10] and the cert doesn’t work because… wrong CA [16:21:22] (03CR) 10Filippo Giunchedi: Add a systemd unit file (T95055) (032 comments) [debs/ircecho] - 10https://gerrit.wikimedia.org/r/204054 (owner: 10Muehlenhoff) [16:21:31] you self-signed the certificate ? [16:21:37] that will not work [16:21:45] So I see! [16:21:53] * andrewbogott reboots in hopes of getting laptop camera to work [16:22:17] (03CR) 10Filippo Giunchedi: * Simplify package build, also the stepping stone for adding a systemd unit file (T95055) (031 comment) [debs/ircecho] - 10https://gerrit.wikimedia.org/r/204045 (owner: 10Muehlenhoff) [16:22:23] akosiaris: Terminology woes; andrewbogott thought "signed ourselves" and "self-signed" were the same. [16:22:31] !log running revision render thin-out script on wikipedia HTML [16:22:38] Logged the message, Master [16:22:53] Which, admittedly, sounds reasonable unless you are familiar with PKI.
:-) [16:23:23] (03PS2) 1020after4: Trebuchet: run all state changing git commands with umask 002 [puppet] - 10https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: 10BryanDavis) [16:23:50] RECOVERY - Hadoop NodeManager on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:23:51] RECOVERY - puppet last run on analytics1020 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:27:23] (03CR) 10Alexandros Kosiaris: "This will not work. The certificate needs to be signed by a valid CA and not be self-signed. We got 2 WMF internal CAs, one we try to depr" [puppet] - 10https://gerrit.wikimedia.org/r/204279 (owner: 10Andrew Bogott) [16:29:08] !log demon Synchronized php-1.26wmf1/extensions/CentralAuth/: (no message) (duration: 00m 13s) [16:29:13] Logged the message, Master [16:29:38] (03CR) 10Filippo Giunchedi: [C: 031] "minor nit but LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [16:31:12] (03PS1) 10Alex Monk: Note which LDAP groups are allowed in HTTP login prompts mentioning labs [puppet] - 10https://gerrit.wikimedia.org/r/204291 [16:33:58] 6operations: Encrypted password storage - https://phabricator.wikimedia.org/T96130#1209635 (10Dzahn) We had an existing ticket for this in RT, it used to be https://rt.wikimedia.org/Ticket/Display.html?id=6665 which was imported over to phab as T83410 Let's merge them? [16:37:49] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1209651 (10csteipp) >>! In T95229#1207763, @GWicke wrote: >> Graphoid is 530 kloc's of javascript. > > If the codebase is too large to review, then why don't w... 
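The cert confusion above comes down to self-signing versus signing with an internal CA: a self-signed cert has Subject == Issuer, while a CA-signed cert carries the CA's DN as Issuer. A minimal sketch of the three steps Coren outlines (generate key, create CSR, sign the CSR), driving the openssl CLI from Python with a throwaway CA — the CA name, hostname, and file names here are all made up for illustration, not the actual WMF CA or labvirt material:

```python
import os
import subprocess
import tempfile

def openssl(*args):
    """Run an openssl subcommand, raising on failure, returning stdout."""
    return subprocess.run(
        ("openssl",) + args, check=True, capture_output=True, text=True
    ).stdout

with tempfile.TemporaryDirectory() as d:
    ca_key, ca_crt = os.path.join(d, "ca.key"), os.path.join(d, "ca.crt")
    host_key, csr, crt = (os.path.join(d, n)
                          for n in ("host.key", "host.csr", "host.crt"))

    # A CA root cert is the one case where self-signing is intended:
    # here Subject == Issuer by construction.
    openssl("req", "-x509", "-newkey", "rsa:2048", "-nodes", "-days", "1",
            "-keyout", ca_key, "-out", ca_crt,
            "-subj", "/CN=Example Internal CA")

    # (a) generate key and (b) create a CSR for the host cert in one step.
    openssl("req", "-new", "-newkey", "rsa:2048", "-nodes",
            "-keyout", host_key, "-out", csr,
            "-subj", "/CN=labvirt-star.example.wmnet")

    # (c) sign the CSR with the *CA's* key, not the host's own key:
    # the resulting cert's Issuer is the CA, so it is not self-signed.
    openssl("x509", "-req", "-in", csr, "-days", "1",
            "-CA", ca_crt, "-CAkey", ca_key, "-CAcreateserial", "-out", crt)

    issuer = openssl("x509", "-in", crt, "-noout", "-issuer")
    subject = openssl("x509", "-in", crt, "-noout", "-subject")
    print(issuer.strip())
    print(subject.strip())
```

Signing the host CSR with a copy of the *host* key (as happened with labvirt-star above) reproduces step (c) with the wrong key, which is exactly what yields Subject == Issuer again.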
[16:44:48] !log restarted eventlogging && deployed d241d75ee2fab554bc47cf8d1ba83f5df2130633 [16:45:57] Logged the message, Master [16:47:00] PROBLEM - NTP on analytics1020 is CRITICAL: NTP CRITICAL: Offset unknown [16:48:24] hey cmjohnson1 [16:48:24] you working on an20? [16:48:24] that's an odd error [16:48:40] no, I just plugged the eth cable in ...you may want to reboot it [16:49:18] ooo [16:49:19] ok [16:49:31] !log rebooting analytics1020 [16:49:37] Logged the message, Master [16:49:49] too bad you didn't do an apt-get upgrade first :) [16:50:50] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:54] haha [16:50:56] oh? [16:53:04] it has pending updates for openjdk, among many others [16:53:21] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [16:57:30] 10Ops-Access-Requests, 6operations: Requesting access to tin.eqiad.wmnet for mforns - https://phabricator.wikimedia.org/T96163#1209702 (10mforns) 3NEW [16:59:30] 10Ops-Access-Requests, 6operations: Requesting access to hafnium for mforns - https://phabricator.wikimedia.org/T96164#1209712 (10mforns) 3NEW [17:04:19] thanks cmjohnson1, an20 is looking much better [17:04:52] cool, i'm pretty sure just plugging the network cable in willy-nilly was the problem [17:06:29] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1209731 (10GWicke) @csteipp, it's not an either-or. If we have doubts about the XSS cleanliness of the output, then additional sanitization can help to further...
[17:12:30] 6operations, 5Interdatacenter-IPsec: Update 3.19 kernel to 3.19.4 - https://phabricator.wikimedia.org/T96146#1209737 (10MoritzMuehlenhoff) AFAICS the aes256gcm bug is bypassed with https://phabricator.wikimedia.org/rOPUP1ab5d2ccdb85b37c220c49a3e6678688098dcaeb so that shouldn't be a blocker [17:15:00] !log running migrateAccount.php --auto (CentralAuth) [17:15:07] Logged the message, Master [17:26:05] (03CR) 10Filippo Giunchedi: "some general comments" (031 comment) [software/sentry] - 10https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [17:31:36] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1209804 (10JohnLewis) [17:42:15] (03CR) 10Mobrovac: [C: 031] restbase: add ganglia cluster [puppet] - 10https://gerrit.wikimedia.org/r/204274 (owner: 10Filippo Giunchedi) [17:45:05] bd808: I have tested https://gerrit.wikimedia.org/r/#/c/204098/ and it works well on labs-vagrant feel free to merge it [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150415T1800). Please do the needful. 
[18:16:12] twentyafterfour: will be back in ~30 min, in case anything needed for wikidata [18:25:38] !log running forceRenameUsers.php (SUL finalization) on test* wikis [18:25:44] Logged the message, Master [18:27:08] (03PS2) 10Yuvipanda: tools: Remove remnants of portgranter code [puppet] - 10https://gerrit.wikimedia.org/r/204014 (https://phabricator.wikimedia.org/T93046) [18:27:35] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Remove remnants of portgranter code [puppet] - 10https://gerrit.wikimedia.org/r/204014 (https://phabricator.wikimedia.org/T93046) (owner: 10Yuvipanda) [18:30:29] (03PS3) 10Yuvipanda: tools: Separate registration / unregistreation for proxylistener [puppet] - 10https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) [18:30:35] (03CR) 10Yuvipanda: tools: Separate registration / unregistreation for proxylistener [puppet] - 10https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) (owner: 10Yuvipanda) [18:30:44] Coren: ^ can you +1? [18:31:00] * Coren reads [18:33:15] (03CR) 10coren: [C: 031] "Reasonably sane, but not tested by me. :-)" [puppet] - 10https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) (owner: 10Yuvipanda) [18:39:53] (03CR) 10Yuvipanda: [C: 032] "Alright, i'm slowly and very carefully doing this now :)" [puppet] - 10https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) (owner: 10Yuvipanda) [18:41:42] twentyafterfour: have you started deploying yet? [18:42:52] legoktm: haven't started scapping yet no, everything ok? [18:44:03] twentyafterfour: I have a i18n update (WikimediaMessages) that should go out asap, I'm putting up the patch now, can you include it in your scap? [18:44:36] legoktm: sure thing [18:47:29] twentyafterfour: the active branches are wmf1 and wmf2 right? 
[18:48:37] legoktm: yes I just cut wmf2 a little while ago and I'm about to phase out 1.25wmf24 [18:49:08] wmf2 isn't yet active anywhere [18:49:33] (It's currently checking out all the submodules for wmf2) [18:50:23] * aude back [18:51:07] hmm, wikimedia.org has become very laggy for me all of a sudden [18:51:19] tinet.ams05.atlas.cogentco.com (130.117.14.50) 14.209 ms 17.279 ms 13.724 ms [18:51:25] 10 xe-7-2-2.was10.ip4.gtt.net (141.136.111.14) 887.090 ms 907.144 ms 783.867 ms [18:51:33] twentyafterfour: the submodule bumps are wmf2: https://gerrit.wikimedia.org/r/204317 and wmf1: https://gerrit.wikimedia.org/r/204318 [18:51:48] seems somewhere in between those two, because it makes a huge jump in ping response from there [18:53:39] twentyafterfour: do you want me to merge those or will you take care of it? [18:53:42] thedj: do you have a traceroute? [18:53:58] legoktm: I can take care of it for you no problem [18:54:14] * aude can't help but know we've had problems with the route to stuff like gerrit (ssh) and labs [18:54:37] aude: aude http://pastebin.ca/2973441 [18:55:01] thanks :D [18:56:07] thedj: hmm [18:56:37] not sure but if it's an ongoing problem then probably ask paravoid and/or create a task [18:56:50] sometimes we can work around the issue or try to deal with it somehow [18:57:23] i can't even fetch right now :( [18:57:46] your IP in private please :) [18:59:47] GTT ? 
[19:01:12] gtt/cogent [19:01:44] or maybe just ziggo/gtt, unsure yet [19:03:51] (03PS2) 10Dereckson: Set meta namespace and site name on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203785 (https://phabricator.wikimedia.org/T94142) [19:05:09] (03PS1) 1020after4: Add 1.26wmf2 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204319 [19:05:11] (03PS1) 1020after4: Wikipedias to 1.26wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204320 [19:05:13] (03PS1) 1020after4: Group0 to 1.26wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204321 [19:05:15] (03PS1) 1020after4: Remove 1.25wmf20 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204322 [19:06:50] (03PS3) 10Dereckson: Set meta namespace and site name on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203785 (https://phabricator.wikimedia.org/T94142) [19:08:13] (03CR) 10Dereckson: "PS2: Rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203785 (https://phabricator.wikimedia.org/T94142) (owner: 10Dereckson) [19:09:54] 6operations, 10hardware-requests, 5Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1210155 (10hashar) labnodepool1001 has been installed and is ready for service implementation scandium (zuul mergers) should land in la... [19:11:24] (03PS1) 10Yuvipanda: Revert "tools: Separate registration / unregistreation for proxylistener" [puppet] - 10https://gerrit.wikimedia.org/r/204323 [19:11:40] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "tools: Separate registration / unregistreation for proxylistener" [puppet] - 10https://gerrit.wikimedia.org/r/204323 (owner: 10Yuvipanda) [19:15:35] Coren: I reverted it for now, needs identd debugging [19:16:03] What issue did you run into? 
[19:16:11] PROBLEM - puppet last run on cp3049 is CRITICAL puppet fail [19:17:00] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [19:17:07] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1210167 (10Jalexander) IANAL but my recommendation is to leave it, the risks are too high until we get a demand, if we get a demand then we have a legal requirement. [19:17:17] Coren: basically identd can't figure out which user a connection is coming from [19:17:58] Coren: I think the problem is that the client closes connection too early [19:18:14] Ah. Right, identd must have an actively open socket to work. [19:18:29] 6operations: Update DNS for the Wikipedia store, before May 31 - https://phabricator.wikimedia.org/T96182#1210171 (10vshchepakina) 3NEW a:3Jgreen [19:18:41] PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail [19:19:11] PROBLEM - puppet last run on cp3047 is CRITICAL puppet fail [19:19:34] Coren: yeah, am reworking it all now :) [19:20:10] PROBLEM - puppet last run on cp3017 is CRITICAL puppet fail [19:21:50] well [19:21:52] not ‘all’ [19:21:54] but enough bits [19:26:26] !log tools -exec-03 drained, rebooting [19:27:21] PROBLEM - puppet last run on analytics1027 is CRITICAL Puppet last ran 4 hours ago [19:33:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:33:22] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:33:50] RECOVERY - puppet last run on analytics1027 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:34:02] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:35:31] RECOVERY - puppet last run on cp3047 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:35:49] anyone have a clue about
https://phabricator.wikimedia.org/T96114 CSS isn't loading [19:35:57] see the last comment from Krenair [19:36:21] RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:43:48] (03PS1) 10Yuvipanda: Revert "Revert "tools: Separate registration / unregistreation for proxylistener"" [puppet] - 10https://gerrit.wikimedia.org/r/204329 [19:44:01] Coren: ^ wanna take a look? :) [19:44:28] YuviPanda: Can you give me a minute? I'm in the middle of shuffling jobs around. [19:44:40] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "tools: Separate registration / unregistreation for proxylistener"" [puppet] - 10https://gerrit.wikimedia.org/r/204329 (owner: 10Yuvipanda) [19:44:43] Coren: sure [19:45:34] (03PS2) 10Yuvipanda: Revert "Revert "tools: Separate registration / unregistreation for proxylistener"" [puppet] - 10https://gerrit.wikimedia.org/r/204329 [19:46:44] (03PS1) 10Ottomata: Add 2 new FairScheduler queues: priority and production [puppet] - 10https://gerrit.wikimedia.org/r/204330 [19:47:18] YuviPanda: Looking now while I wait for draining. [19:47:24] (03CR) 10Ottomata: [C: 032 V: 032] Add 2 new FairScheduler queues: priority and production [puppet] - 10https://gerrit.wikimedia.org/r/204330 (owner: 10Ottomata) [19:47:25] grr.. gnome-terminal crashed and took out weechat. I should really use screen :-o [19:47:44] Coren: cool. 
I just added a recv on the client and a send on the server [19:47:50] * YuviPanda hasn’t really done socket programming as such [19:48:55] (03CR) 1020after4: [C: 032] Add 1.26wmf2 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204319 (owner: 1020after4) [19:49:04] (03CR) 1020after4: [C: 032] Remove 1.25wmf20 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204322 (owner: 1020after4) [19:49:42] YuviPanda: I don't know about python, but if you want to make certain you don't have half-closed sockets with pending data, in C you'd normally do an explicit shutdown(sock, 2) when you want to make something synchronous. [19:50:00] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1210258 (10Slaporte) 5Open>3declined a:3Slaporte Hi @jeremyb, please report cases of potential copyright infringement through our standard DMCA process where appropriate: http://wikime... [19:50:01] Coren: I think the .close() would be the equivalent [19:50:11] oooh no [19:50:24] (03PS1) 10Chad: Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 [19:50:37] YuviPanda: Python might do it implicitly in the .close() though I wouldn't know for certain. [19:50:46] (03Merged) 10jenkins-bot: Add 1.26wmf2 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204319 (owner: 1020after4) [19:51:05] ori: is the VE preconnect patch live? [19:51:15] (03CR) 10jenkins-bot: [V: 04-1] Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad) [19:51:52] Coren: read docs, doesn’t :) good catch [19:52:16] I don't know python all that well, but I know sockets. 
:-) [19:52:35] (03PS2) 10Chad: Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 [19:53:12] (03PS3) 10Yuvipanda: Revert "Revert "tools: Separate registration / unregistreation for proxylistener"" [puppet] - 10https://gerrit.wikimedia.org/r/204329 [19:53:13] Coren: ^ [19:54:41] (03CR) 10coren: [C: 031] "Unreverse not unnaproved." [puppet] - 10https://gerrit.wikimedia.org/r/204329 (owner: 10Yuvipanda) [19:55:06] :-) Because "revert revert" :-) [19:55:20] (03PS4) 10Yuvipanda: Revert "Revert "tools: Separate registration / unregistreation for proxylistener"" [puppet] - 10https://gerrit.wikimedia.org/r/204329 [19:55:21] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1210276 (10Slaporte) 5declined>3Open a:5Slaporte>3None [19:55:39] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "Revert "tools: Separate registration / unregistreation for proxylistener"" [puppet] - 10https://gerrit.wikimedia.org/r/204329 (owner: 10Yuvipanda) [19:58:30] !log twentyafterfour Started scap: testwiki to php-1.26wmf2 and rebuild l10n cache [19:59:52] I negate your negation with negation. ftw [20:00:10] <3 triple negatives [20:01:59] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia, 10hardware-requests, 3Multimedia-Sprint-2015-03-25: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1210290 (10RobH) a:5RobH>3Tgr @tgr: has the setup in labs been puppetized at this time? We tend to not allocate bare met... 
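The half-closed-socket point Coren makes maps to Python roughly like this: close() simply tears the connection down, while shutdown() (the Python equivalent of C's shutdown(sock, 2), or SHUT_WR for the write side only) signals EOF to the peer while the socket object stays alive; the "recv on the client and a send on the server" YuviPanda added makes the exchange synchronous. A toy sketch of that pattern — the message and ack format are invented here, this is not the actual proxylistener code:

```python
import socket
import threading

def server(listener):
    conn, _ = listener.accept()
    with conn:
        # Read until the client half-closes its write side
        # (recv then returns b"").
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:
                break
            chunks.append(data)
        # The client socket is still open, so a service that needs a live
        # connection (like an identd lookup) can still inspect it here,
        # and we can still send an acknowledgement back.
        conn.sendall(b"ok:" + b"".join(chunks))

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=server, args=(listener,), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"register tool-x")
# Half-close: delivers EOF to the server but keeps our read side usable.
# A bare close() at this point could drop the connection before the
# server (or identd) is done with it.
client.shutdown(socket.SHUT_WR)
reply = client.recv(1024)   # block until the server acknowledges
client.close()
print(reply)
```

Waiting for the ack before close() is what makes the registration synchronous: the client cannot race ahead of the server's bookkeeping.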
[20:03:06] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1210295 (10RobH)
[20:03:08] 6operations, 10hardware-requests: hardware for global ganglia aggregator in eqiad - https://phabricator.wikimedia.org/T95792#1210292 (10RobH) 5Open>3declined a:3RobH update from in person & irc conversations: the ganglia aggregator for codfw is install2001, so this should ideally go on carbon (which som...
[20:04:19] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1210296 (10csteipp) @gwicke and I talked in person and agreed that if all service output that is html, svg, or any other xml-derived format is run through an ex...
[20:05:21] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1210297 (10Krenair) 5Open>3declined a:3Krenair
[20:13:12] (03CR) 10Chad: "Seems like it'll work: http://puppet-compiler.wmflabs.org/715/change/204331/html/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad)
[20:14:39] <^d> YuviPanda: ^ was deceptively easy :)
[20:15:05] (03CR) 10Thcipriani: [C: 031] "Seems like the best way to make this work across N environments vs beta + prod." [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad)
[20:16:52] (03PS3) 10Chad: Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331
[20:21:42] 6operations, 10ops-codfw: rack/wire/initial setup of db2043-db2070 - https://phabricator.wikimedia.org/T89368#1210313 (10Papaul) db2051 to db2070 Rack table update racking complete wiring complete
[20:25:45] (03PS1) 10Yuvipanda: tools: Fix scope issue + do not explicitly shutdown socket [puppet] - 10https://gerrit.wikimedia.org/r/204333
[20:25:56] (03CR) 10jenkins-bot: [V: 04-1] tools: Fix scope issue + do not explicitly shutdown socket [puppet] - 10https://gerrit.wikimedia.org/r/204333 (owner: 10Yuvipanda)
[20:25:58] (03PS2) 10Yuvipanda: tools: Fix scope issue + do not explicitly shutdown socket [puppet] - 10https://gerrit.wikimedia.org/r/204333
[20:26:04] (03PS3) 10Alex Monk: Add AffCom user group application contact page on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204205 (https://phabricator.wikimedia.org/T95789)
[20:27:25] (03CR) 10Yuvipanda: [C: 032] tools: Fix scope issue + do not explicitly shutdown socket [puppet] - 10https://gerrit.wikimedia.org/r/204333 (owner: 10Yuvipanda)
[20:28:01] !log deployed parsoid version ac7a01b9
[20:31:40] YuviPanda, hmm ... why is that not getting logged you know?
[20:31:49] morebots is probably dead
[20:32:08] (https://wikitech.wikimedia.org/wiki/Morebots)
[20:32:15] I’ll restart it in a while if nobody does
[20:32:25] ah, ok.
[20:32:26] but someone needs to own that and fix it. nobody really does atm
[20:35:45] YuviPanda, I'll update the SAL page directly for now
[20:46:23] !log twentyafterfour Finished scap: testwiki to php-1.26wmf2 and rebuild l10n cache (duration: 47m 53s)
[20:47:33] testwiki still shows 1.26wmf1
[20:47:47] * twentyafterfour wonders where I screwed up
[20:48:28] ?
[20:49:44] i see 1.26wmf2
[20:49:51] oh weird - https://test.wikipedia.org/wiki/Special:Version shows "This is a test of release of MediaWiki 1.26wmf1 (49cbab3). " at the top but "MediaWiki 1.26wmf2 (8e57fcd)
[20:49:54] 18:50, 15 April 2015" further down
[20:50:10] strange
[20:50:22] yeah
[20:50:23] that's a central notice banner
[20:50:29] not sure how it gets set
[20:50:49] (03CR) 1020after4: [C: 032] Wikipedias to 1.26wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204320 (owner: 1020after4)
[20:50:54] It's not, it's a sitenotice https://test.wikipedia.org/wiki/MediaWiki:Sitenotice
[20:51:02] (03Merged) 10jenkins-bot: Wikipedias to 1.26wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204320 (owner: 1020after4)
[20:51:15] mere {{CURRENTVERSION}}
[20:51:45] and where is CURRENTVERSION defined? maybe it's cached?
[20:53:17] 6operations: Update DNS for the Wikipedia store, before May 31 - https://phabricator.wikimedia.org/T96182#1210351 (10Jgreen) p:5Normal>3Triage a:5Jgreen>3None
[20:54:25] seems like it's cached somewhere - I logged in to testwiki and now the sitenotice shows wmf2 at the top of the page.. but https://test.wikipedia.org/wiki/MediaWiki:Sitenotice shows 1.26wmf1 (with a different hash now!) in the page body
[20:55:10] 6operations: Give Google webmaster tools access to jon katz (Read only is fine) - https://phabricator.wikimedia.org/T90980#1210357 (10Philippe-WMF) FWIW, and late, but... approved from my end. pb ___________________ Philippe Beaudette Director, Community Advocacy Wikimedia Foundation, Inc. 415-839-6885, x 664...
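[Editor's note] The exchange above (the sitenotice still showing 1.26wmf1 after the deploy) comes down to parser caching: {{CURRENTVERSION}} is expanded at render time and the rendered output is cached, so readers keep seeing the old string until the page is re-rendered. The following is a toy model only, not MediaWiki code; names like `render` and `purge` are illustrative stand-ins for the parser cache and action=purge behaviour.

```ruby
# Toy model of parser-cache staleness: the first render of a page is
# memoized, so a later call with a newer version string still returns the
# old cached output until the cache entry is purged.
PARSER_CACHE = {}

def render(title, wikitext, current_version)
  PARSER_CACHE[title] ||= wikitext.gsub('{{CURRENTVERSION}}', current_version)
end

def purge(title)
  PARSER_CACHE.delete(title)
end

notice = 'This is a test of release of MediaWiki {{CURRENTVERSION}}'
puts render('MediaWiki:Sitenotice', notice, '1.26wmf1')
puts render('MediaWiki:Sitenotice', notice, '1.26wmf2')  # still the wmf1 render
purge('MediaWiki:Sitenotice')
puts render('MediaWiki:Sitenotice', notice, '1.26wmf2')  # fresh wmf2 render
```

This matches what was observed: logging in bypassed one cache layer (the page top updated), while the stored render of MediaWiki:Sitenotice itself still carried the old version.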
[20:57:21] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf1
[20:59:26] (03CR) 1020after4: [C: 032] Group0 to 1.26wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204321 (owner: 1020after4)
[20:59:56] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1210388 (10Eevans) Some additional information: The new local-only option introduced in 2.1.4 does //not// support authentication, or encryption. So th...
[21:00:19] (03Merged) 10jenkins-bot: Group0 to 1.26wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204321 (owner: 1020after4)
[21:01:17] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf2
[21:03:11] !log twentyafterfour Purged l10n cache for 1.25wmf24
[21:04:18] (03PS5) 10Andrew Bogott: Have sink create ldap host entries. [puppet] - 10https://gerrit.wikimedia.org/r/202582
[21:08:36] (03PS1) 10Yuvipanda: tools: Fix missing import [puppet] - 10https://gerrit.wikimedia.org/r/204337
[21:10:01] PROBLEM - puppet last run on mw1008 is CRITICAL Puppet has 1 failures
[21:10:49] (03PS2) 10Yuvipanda: tools: Fix missing import [puppet] - 10https://gerrit.wikimedia.org/r/204337
[21:11:01] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix missing import [puppet] - 10https://gerrit.wikimedia.org/r/204337 (owner: 10Yuvipanda)
[21:12:21] !log cleaned up /srv/mediawiki/php-1.25wmf20
[21:13:50] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1210443 (10BBlack) >>! In T95229#1210296, @csteipp wrote: > Ops, from my perspective, it would be really great to be able to plan for using alternate, unauthent...
[21:16:05] (03PS7) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067
[21:17:54] (03CR) 10Rush: [C: 031] Set up ssh keys so that designate can clear salt and puppet certs. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott)
[21:24:41] RECOVERY - puppet last run on mw1008 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[21:24:56] (03PS8) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067
[21:25:20] 6operations, 10hardware-requests, 5Patch-For-Review: Decom/repurpose rbf* hosts - https://phabricator.wikimedia.org/T95153#1210535 (10RobH) 5Open>3Resolved added back to spares
[21:25:59] (03PS1) 10Yuvipanda: tools: Fix more silly copy paste errors [puppet] - 10https://gerrit.wikimedia.org/r/204340
[21:26:10] (03CR) 10jenkins-bot: [V: 04-1] tools: Fix more silly copy paste errors [puppet] - 10https://gerrit.wikimedia.org/r/204340 (owner: 10Yuvipanda)
[21:26:16] (03PS2) 10Yuvipanda: tools: Fix more silly copy paste errors [puppet] - 10https://gerrit.wikimedia.org/r/204340
[21:27:56] (03CR) 10Yuvipanda: [C: 032] tools: Fix more silly copy paste errors [puppet] - 10https://gerrit.wikimedia.org/r/204340 (owner: 10Yuvipanda)
[21:34:21] <^d> Can someone kick hhvm on mw1191? Complaints of full TC cache on fluorine.
[21:37:29] ^d: done
[21:37:33] <^d> thx
[21:37:34] !log restarted hhvm on mw1191
[21:41:11] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1210586 (10csteipp) > What we're already possibly-planning around is including *.wikimedia.org in all of the certs so that potentially one IP + one cert can han...
[21:42:03] (03PS1) 10Yuvipanda: tools: Set portreleaser to be epilog script for web queues [puppet] - 10https://gerrit.wikimedia.org/r/204366 (https://phabricator.wikimedia.org/T96059)
[21:42:42] (03CR) 10Dzahn: Hiera-ize the mediawiki-installation dsh group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad)
[21:44:51] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1210595 (10GWicke) It would be great to use this for upload especially.
[21:47:38] (03PS9) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067
[21:48:46] (03CR) 10jenkins-bot: [V: 04-1] Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott)
[21:49:22] (03CR) 10Rush: [C: 031] "niiiice" [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott)
[21:51:05] (03CR) 10Dzahn: [C: 04-1] Set up ssh keys so that designate can clear salt and puppet certs. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott)
[21:51:38] andrewbogott: "@resolve" is specific to ferm
[21:51:50] so I have to hardcode an ip?
[21:51:51] http://ferm.foo-projects.org/download/2.1/ferm.html#_resolve__hostname1_hostname2________type__
[21:53:13] that, or you have to do the DNS lookup differently
[21:53:21] in an .erb template
[21:54:36] mutante: scope.function_ipresolve?
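[Editor's note] The point being made above is that ferm's `@resolve` only works inside ferm configuration; when a Puppet-managed file needs an IP, the lookup has to happen in Ruby (an ERB template or a custom parser function). A minimal sketch of that approach, using Ruby's stdlib resolver; the helper name `ipresolve` only echoes the strongswan module's custom function mentioned in the discussion, and is not that module's actual code.

```ruby
require 'resolv'

# Resolve a hostname to an IP at catalog-compile time, the way an ERB
# template (or a custom parser function) would, instead of relying on
# ferm's @resolve. Resolv consults /etc/hosts before DNS and raises
# Resolv::ResolvError if the name cannot be resolved.
def ipresolve(hostname)
  Resolv.getaddress(hostname)
end

# In a template this would look something like:
#   from="<%= ipresolve(@designate_host) %>" ssh-rsa AAAA... designate
puts ipresolve('localhost')
```

As bblack notes in the discussion, compile-time resolution bakes the answer into the catalog, which is why passing a hostname down to the target host and resolving there is preferable whenever the consuming tool accepts names.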
[21:54:37] Socket.gethostbyname("hal")
[21:56:06] http://grokbase.com/p/gg/puppet-users/136n1jtdcg/how-to-resolve-hostnames-to-ip-addresses-in-templates
[21:57:23] andrewbogott: i suppose that works too, because i see we use it in the strongswan module
[21:57:32] one is in puppet itself the other in .erb
[21:57:52] well, no, both are used in templates
[21:57:55] which is very new :)
[21:58:13] in general, it would be better to avoid resolving DNS inside of puppet if we can, in cases where possible
[21:58:21] modules/strongswan/lib/puppet/parser/functions/ipresolve.rb: newfunction(:ipresolve,
[21:58:32] bblack: really? It’s /better/ to have the ip hard-coded?
[21:58:34] ^ ah, so that is our own function
[21:58:41] That seems fragile if we want to move something
[21:59:27] how about still avoiding to have it in the manifests, so put it in hiera, but use the IP?
[21:59:27] andrewbogott: no, it's /better/ to pass a hostname down to whatever-configuration on the host, and let it resolve that on the target host :)
[21:59:43] can an ssl cert’s “from=“ resolve?
[21:59:53] but if unavoidable, we can do that in puppet because the tool requires IPs as inputs
[22:00:57] (it would be much better if the tool didn't, though, but I can see how a firewall is kind of a special case. Still...)
[22:01:44] (03PS10) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067
[22:02:36] (03CR) 10jenkins-bot: [V: 04-1] Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott)
[22:03:09] goddamn I am tired of this patch
[22:05:14] (03PS11) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067
[22:06:21] (03PS2) 10Yuvipanda: tools: Set portreleaser to be epilog script for web queues [puppet] - 10https://gerrit.wikimedia.org/r/204366 (https://phabricator.wikimedia.org/T96059)
[22:09:44] (03CR) 10Andrew Bogott: [C: 032] Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott)
[22:10:01] Coren: wanna +1 https://gerrit.wikimedia.org/r/#/c/204366/?
[22:10:07] then it’s only the monitoring script left
[22:13:32] (03CR) 10MaxSem: "I also think that having 2 "api" entry points on the same domain is going to be misleading, can we use "rest" or something?" [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke)
[22:13:45] (03CR) 10coren: [C: 031] tools: Set portreleaser to be epilog script for web queues [puppet] - 10https://gerrit.wikimedia.org/r/204366 (https://phabricator.wikimedia.org/T96059) (owner: 10Yuvipanda)
[22:14:05] (03PS3) 10Yuvipanda: tools: Set portreleaser to be epilog script for web queues [puppet] - 10https://gerrit.wikimedia.org/r/204366 (https://phabricator.wikimedia.org/T96059)
[22:14:22] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Set portreleaser to be epilog script for web queues [puppet] - 10https://gerrit.wikimedia.org/r/204366 (https://phabricator.wikimedia.org/T96059) (owner: 10Yuvipanda)
[22:15:00] PROBLEM - puppet last run on virt1000 is CRITICAL puppet fail
[22:15:11] (03CR) 10GWicke: "@MaxSem: The idea is to have all APIs share the same root eventually, so that clients can just point to http://project.org/api/ for all th" [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke)
[22:16:12] (03PS1) 10Andrew Bogott: Avoid duplicate definition of puppetmaster::certmanager [puppet] - 10https://gerrit.wikimedia.org/r/204397
[22:17:25] (03CR) 10Andrew Bogott: [C: 032] Avoid duplicate definition of puppetmaster::certmanager [puppet] - 10https://gerrit.wikimedia.org/r/204397 (owner: 10Andrew Bogott)
[22:20:00] RECOVERY - puppet last run on virt1000 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:29:21] PROBLEM - puppet last run on labcontrol2001 is CRITICAL puppet fail
[22:34:59] (03PS2) 10Dereckson: User rights configuration on ne.wikipedia - Filemover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203335 (https://phabricator.wikimedia.org/T95103)
[22:36:44] (03PS2) 10Dzahn: sshd: set Message Authentication Code ciphers [puppet] - 10https://gerrit.wikimedia.org/r/185329
[22:58:21] (03CR) 10Nuria: [C: 031] eventlogging: adjust counters thresholds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204237 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi)
[23:01:07] Hmm, no jouncebot
[23:01:08] (03PS3) 10Dzahn: sshd: use Chacha20-poly1305,AES-GCM ciphers [puppet] - 10https://gerrit.wikimedia.org/r/185325
[23:01:12] I'm doing SWAT today
[23:01:17] YuviPanda: No jouncebot?
[23:03:08] Dereckson: You around for your config patches?
[23:03:32] Hi. Yup.
[23:04:39] And my test plan is ready to check the changes when deployed. http://etherpad.wikimedia.org/p/deploy-20150416-SWAT-evening
[23:05:43] Awesome
[23:05:46] I'll start merging them now
[23:05:52] And once they merge I'll deploy them all at once
[23:06:00] (03CR) 10Catrope: [C: 032] Namespace configuration on ru.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202912 (https://phabricator.wikimedia.org/T95110) (owner: 10Dereckson)
[23:06:01] RoanKattouw: possibly. No idea why everyone asks me :)
[23:06:05] (03CR) 10Catrope: [C: 032] User rights configuration on ne.wikipedia - Filemover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203335 (https://phabricator.wikimedia.org/T95103) (owner: 10Dereckson)
[23:06:13] I can take a look in an hour or so
[23:06:15] YuviPanda: Do you know who owns it?
[23:06:21] (03CR) 10Catrope: [C: 032] Namespace configuration on it.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203354 (https://phabricator.wikimedia.org/T93870) (owner: 10Dereckson)
[23:06:22] Nobody atm
[23:06:25] (03CR) 10Catrope: [C: 032] Logo configuration on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203422 (https://phabricator.wikimedia.org/T75424) (owner: 10Dereckson)
[23:06:30] (03CR) 10Catrope: [C: 032] Set meta namespace and site name on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203785 (https://phabricator.wikimedia.org/T94142) (owner: 10Dereckson)
[23:11:36] RoanKattouw: it is unowned and used very similar to morebots
[23:11:52] Which is also dead atm
[23:12:39] (03PS1) 10Andrew Bogott: Install the cert_manager with a file resource [puppet] - 10https://gerrit.wikimedia.org/r/204411
[23:12:41] (03PS1) 10Andrew Bogott: Just hardcode the designate ip. [puppet] - 10https://gerrit.wikimedia.org/r/204412
[23:15:34] (03CR) 10Andrew Bogott: [C: 032] Install the cert_manager with a file resource [puppet] - 10https://gerrit.wikimedia.org/r/204411 (owner: 10Andrew Bogott)
[23:17:29] (03Merged) 10jenkins-bot: Namespace configuration on ru.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202912 (https://phabricator.wikimedia.org/T95110) (owner: 10Dereckson)
[23:17:32] (03Merged) 10jenkins-bot: Namespace configuration on it.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203354 (https://phabricator.wikimedia.org/T93870) (owner: 10Dereckson)
[23:17:35] (03Merged) 10jenkins-bot: Logo configuration on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203422 (https://phabricator.wikimedia.org/T75424) (owner: 10Dereckson)
[23:17:37] (03Merged) 10jenkins-bot: Set meta namespace and site name on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203785 (https://phabricator.wikimedia.org/T94142) (owner: 10Dereckson)
[23:18:09] (03CR) 10Andrew Bogott: [C: 032] Just hardcode the designate ip. [puppet] - 10https://gerrit.wikimedia.org/r/204412 (owner: 10Andrew Bogott)
[23:21:59] (03CR) 10Catrope: [C: 032] "Jenkins?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203335 (https://phabricator.wikimedia.org/T95103) (owner: 10Dereckson)
[23:22:32] Oh, right
[23:22:43] Oh sorry
[23:22:45] seen the dep
[23:22:50] Yeah
[23:22:58] Do you want me to do the dependencies too? Or just the one?
[23:23:13] I offer to postpone this deploy with the other two another day.
[23:23:24] OK cool
[23:23:31] I'll do the other 4 now then
[23:25:40] !log catrope Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 14s)
[23:26:16] Dereckson: There you go
[23:26:31] Thank you. Checking.
[23:27:14] superm401: Around for your SWAT?
[23:27:22] RoanKattouw, yep
[23:27:46] !log catrope Synchronized php-1.26wmf1/extensions/Citoid: SWAT (duration: 00m 12s)
[23:27:46] Cool
[23:28:35] !log catrope Synchronized php-1.26wmf1/extensions/Flow: SWAT (duration: 00m 16s)
[23:28:43] superm401: There you go ---^^
[23:29:12] RoanKattouw, works, thanks.
[23:29:34] Sweet
[23:29:44] !log catrope Synchronized php-1.26wmf2/extensions/Citoid: SWAT (duration: 00m 14s)
[23:32:41] Changes verified, all seems to work fine.
[23:33:42] I were a little afraid on or.wikt, as they don't have any test on the community portal, but [[Special:All pages]] gives me some results.
[23:34:20] (including a village pump, I will suggest to create a redirect from the community portal link to this page on the Phabricator task)
[23:34:32] s/any test/any text
[23:39:00] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures
[23:39:01] PROBLEM - puppet last run on cp3017 is CRITICAL puppet fail
[23:39:27] !log catrope Synchronized php-1.26wmf1/extensions/VisualEditor: SWAT (duration: 00m 12s)
[23:39:36] !log catrope Synchronized php-1.26wmf2/extensions/VisualEditor: SWAT (duration: 00m 12s)
[23:40:36] MaxSem: you might want to weigh in at https://phabricator.wikimedia.org/T95229
[23:40:41] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1210921 (10Dzahn) now, also see T96182, which says " A Record to point to our new IP address: 23.227.38.32" in contradiction to what Andrew and myself ha...
[23:41:48] (03CR) 10Andrew Bogott: [C: 032] Create the .ssh dir before sticking a key in it [puppet] - 10https://gerrit.wikimedia.org/r/204418 (owner: 10Andrew Bogott)
[23:43:20] gwicke, I agree with you however all my reasons are already mentioned
[23:44:24] MaxSem: a 'me too' is fine as well ;)
[23:45:16] !log catrope Synchronized php-1.26wmf1/extensions/VisualEditor: Revert SWAT for VE wmf1, caused JS errors (duration: 00m 12s)
[23:45:40] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:49:20] 6operations: Update DNS for the Wikipedia store, before May 31 - https://phabricator.wikimedia.org/T96182#1210927 (10Dzahn) At this point there are 3 different answers we have recevied: a) use the existing CNAME shopwikipedia.myshopify.com. and just move that from "shop" to "store" this seems logical because i...
[23:49:50] (03PS1) 10Andrew Bogott: Fix insertion of the designate ip into the certmanager key [puppet] - 10https://gerrit.wikimedia.org/r/204422
[23:50:37] 6operations: Update DNS for the Wikipedia store, before May 31 - https://phabricator.wikimedia.org/T96182#1210936 (10Dzahn) Since it's already so confusing, It's probably better if we keep all the updates in one place, i'd suggest we keep using T92438.
[23:50:57] (03CR) 10Andrew Bogott: [C: 032] Fix insertion of the designate ip into the certmanager key [puppet] - 10https://gerrit.wikimedia.org/r/204422 (owner: 10Andrew Bogott)
[23:51:12] RoanKattouw: thank you for the deploy.
[23:51:30] No problem
[23:58:40] RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures