[00:02:48] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 3 failures [00:08:19] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:13:40] grep -r ports.conf and i get "Exports configuration" hits, totally correct, just not expected:) [00:13:45] laters//away [00:24:45] mutante: Do you know of any DNS or firewall changes in Eqiad that might cause https://phabricator.wikimedia.org/T92351 it started between Feb 25 and March 6 [00:24:50] ebernhardson: thank you! [00:55:10] !log ori Synchronized docroot/foundation/misc/blank.gif: (no message) (duration: 00m 05s) [00:55:16] Logged the message, Master [01:22:55] !log reinstalling cp4007 + cp4015 [01:23:01] Logged the message, Master [02:00:26] RECOVERY - Disk space on fluorine is OK: DISK OK [02:03:24] ori: Hm.. used http://wikimediafoundation.org/misc/blank.gif instead of http://performance.wikimedia.org/blank.gif? [02:04:18] !log l10nupdate Synchronized php-1.25wmf19/cache/l10n: (no message) (duration: 00m 02s) [02:04:27] Logged the message, Master [02:05:25] !log LocalisationUpdate completed (1.25wmf19) at 2015-03-11 02:04:22+00:00 [02:05:31] Logged the message, Master [02:05:50] !log l10nupdate Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 01s) [02:05:56] Logged the message, Master [02:07:02] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-11 02:05:58+00:00 [02:07:08] Logged the message, Master [02:11:18] !log ori Synchronized php-1.25wmf19/extensions/WikimediaEvents: 2nd iteration of HTTPS test (duration: 00m 05s) [02:11:26] Logged the message, Master [02:11:33] !log ori Synchronized php-1.25wmf20/extensions/WikimediaEvents: 2nd iteration of HTTPS test (duration: 00m 05s) [02:11:38] Logged the message, Master [02:31:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Mar 11 02:30:39 UTC 2015 (duration 30m 38s) [02:31:48] Logged the message, Master [03:32:47] PROBLEM - puppet 
last run on mw1116 is CRITICAL: CRITICAL: puppet fail [03:35:21] icinga-wm: shh, i bet you it's not [03:36:07] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [03:36:19] ^ ran puppet.. shrug [03:40:17] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [03:45:17] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [03:50:17] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [03:55:17] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:00:14] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:03:24] PROBLEM - salt-minion processes on cp4016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [04:05:23] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:09:53] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [04:10:23] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:11:04] RECOVERY - salt-minion processes on cp4016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:14:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [04:15:23] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:19:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [04:20:23] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:24:53] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [04:25:23] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:29:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [04:30:23] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [04:34:53] RECOVERY - check_puppetrun on backup4001 is OK: 
OK: Puppet is currently enabled, last run 167 seconds ago with 0 failures [04:35:24] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 116 seconds ago with 0 failures [05:30:35] legoktm: grrrit-wm is still dead :( [05:30:48] YuviPanda: is it gerrit-to-redis? [05:30:53] I restarted that also [05:31:45] [bigbrother] info: Restarting job 'gerrit-to-redis' [05:31:51] was that you? [05:31:53] legoktm: yeah, because I just killed it [05:31:54] yeah [05:31:59] no effects. hmm [05:32:37] legoktm: y’know, I’m wondering if I should just take redis out of the equation for grrrit-wm [05:32:42] and just have a small in-memory queue [05:32:49] nobody else is consuming gerrit-to-redis [05:33:07] YuviPanda: is gerrit-to-redis python or node? [05:33:14] gerrit-to-redis is python [05:33:16] grrrit-wm: is node [05:33:23] wikibugs is also down [05:33:29] yeah... [05:34:06] legoktm: redis *is* 7p [05:34:07] up [05:35:15] hm [05:35:18] no wikibugs is kinda working [05:35:53] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 2 failures [05:35:54] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 2 failures [05:35:54] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 2 failures [05:36:19] gah [05:36:24] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: Puppet has 2 failures [05:36:24] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 2 failures [05:36:24] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: puppet fail [05:36:33] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 2 failures [05:36:34] that’s me [05:36:43] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 2 failures [05:36:43] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 2 failures [05:36:46] or maybe not? 
[05:36:54] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 2 failures [05:37:03] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 2 failures [05:38:08] hmm [05:38:11] it might be transient [05:38:55] it's not me :) [05:39:01] bblack: yeah, transient... [05:39:26] bblack: I deleted a file in the patch I merged, and this is all where puppet was still trying to retrieve the deleted files [05:40:04] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:47:35] !log yuvipanda Synchronized README: (no message) (duration: 00m 07s) [05:47:46] Logged the message, Master [05:47:49] :P [05:47:55] !log testing sync-file to make sure I didn’t break anything [05:48:00] we always get some puppet failures around 0625 utc, it's something related to cron.daily [05:48:00] Logged the message, Master [05:48:21] jgage: nah, this is a transient one caused by https://gerrit.wikimedia.org/r/#/c/195617/ [05:48:31] hm ok [05:48:38] I moved things from a file to hiera, and those nodes had code that referenced the file but the file was gone by then... [05:48:39] well done puppet [05:49:13] legoktm: well, that sync worked. [05:49:18] I still haven’t broken anything, dammit!
[05:53:35] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [05:53:44] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [05:53:54] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:54:06] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [05:54:14] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:54:15] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [05:54:15] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [05:54:26] uh [05:54:35] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [05:54:45] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:54:51] armed and dangerous [05:54:55] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:55:04] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [05:55:10] legoktm: Are you awake? [05:55:15] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [05:55:34] Bsadowski1: yes [05:55:35] Uh, is the AbuseFilter hits enabled for all wikis that have it on irc.wikimedia.org [05:55:36] ? [05:55:45] I believe so [05:55:56] What about global abusefilter's? [05:56:16] Would those show up also? 
[05:57:07] I think so, not 100% sure [05:58:16] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [06:02:55] PROBLEM - Varnish HTTP bits on cp4003 is CRITICAL: Connection refused [06:03:24] PROBLEM - Varnishkafka log producer on cp4003 is CRITICAL: Connection refused by host [06:03:34] PROBLEM - configured eth on cp4003 is CRITICAL: Connection refused by host [06:03:44] PROBLEM - dhclient process on cp4003 is CRITICAL: Connection refused by host [06:03:55] PROBLEM - puppet last run on cp4003 is CRITICAL: Connection refused by host [06:03:58] bblack: ^ that you, I suppose? [06:04:05] PROBLEM - salt-minion processes on cp4003 is CRITICAL: Connection refused by host [06:04:06] * YuviPanda treis to bring grrrit-wm back up again [06:04:24] PROBLEM - DPKG on cp4003 is CRITICAL: Connection refused by host [06:04:34] PROBLEM - Disk space on cp4003 is CRITICAL: Connection refused by host [06:04:45] PROBLEM - HTTPS on cp4003 is CRITICAL: Return code of 255 is out of bounds [06:05:04] PROBLEM - RAID on cp4003 is CRITICAL: Connection refused by host [06:05:36] RECOVERY - Disk space on cp4003 is OK: DISK OK [06:05:45] RECOVERY - configured eth on cp4003 is OK: NRPE: Unable to read output [06:05:54] RECOVERY - dhclient process on cp4003 is OK: PROCS OK: 0 processes with command name dhclient [06:06:06] RECOVERY - RAID on cp4003 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [06:06:15] RECOVERY - salt-minion processes on cp4003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:06:34] RECOVERY - DPKG on cp4003 is OK: All packages OK [06:07:05] RECOVERY - HTTPS on cp4003 is OK: SSLXNN OK - 36 OK [06:07:15] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:07:15] RECOVERY - Varnish HTTP bits on cp4003 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.171 second response time [06:07:45] RECOVERY - Varnishkafka log producer on cp4003 is OK: PROCS OK: 2 
processes with command name varnishkafka [06:08:31] that's me [06:08:49] (sorry, unavoidable when neon wins the race on discovering a new install before it's done puppeting :P) [06:10:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [06:11:09] bblack: :) [06:15:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [06:17:05] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:20:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [06:20:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [06:25:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [06:25:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [06:28:24] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:35] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:54] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:04] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [06:30:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [06:30:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [06:30:55] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [06:35:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [06:35:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [06:35:14] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:35:15] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [06:35:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail 
[06:39:08] bblack: uh, do you think we should page Jeff_Green? [06:39:55] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [06:40:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [06:40:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [06:40:14] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [06:40:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [06:40:15] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 285 seconds ago with 0 failures [06:40:54] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:44:54] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 295 seconds ago with 0 failures [06:45:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [06:45:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [06:45:14] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [06:45:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [06:45:25] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:45:34] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:15] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:50:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [06:50:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [06:50:14] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 
36 seconds ago with 0 failures [06:50:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [06:52:08] Ops-Access-Requests, operations, MediaWiki-extensions-ContentTranslation, LE-Sprint-84, Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1108590 (santhosh) [06:55:14] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 210 seconds ago with 0 failures [06:55:14] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 140 seconds ago with 0 failures [06:55:15] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 209 seconds ago with 0 failures [07:07:05] (CR) Yuvipanda: "Test" [puppet] - https://gerrit.wikimedia.org/r/195617 (owner: Yuvipanda) [07:07:13] legoktm: ’tis back! [07:08:29] <_joe_> mornin [07:11:26] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:26] PROBLEM - salt-minion processes on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:26] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:44] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:54] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:55] PROBLEM - SSH on rhenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:12:05] PROBLEM - dhclient process on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:14] PROBLEM - configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:47] (CR) Yuvipanda: "test" [puppet] - https://gerrit.wikimedia.org/r/195617 (owner: Yuvipanda) [07:14:47] <_joe_> what's rhenium doing anyways?
[07:14:50] * _joe_ looks [07:15:55] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [07:16:54] (CR) Yuvipanda: "test" [puppet] - https://gerrit.wikimedia.org/r/195617 (owner: Yuvipanda) [07:17:06] <_joe_> yuvi is in a deadlock [07:19:34] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:38] (PS19) Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340 [07:25:29] (CR) Yuvipanda: "Is currently cherry-picked on beta \o/ and seems to work." [puppet] - https://gerrit.wikimedia.org/r/195340 (owner: Yuvipanda) [07:26:09] (PS7) Yuvipanda: redis: Have redis machines overcommit if persistance is enabled [puppet] - https://gerrit.wikimedia.org/r/194095 (https://phabricator.wikimedia.org/T91498) [07:26:46] _joe_: hi, bored this morning? ;) [07:32:01] (PS1) Matanya: memcached: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195857 (https://phabricator.wikimedia.org/T91908) [07:36:38] (CR) Yuvipanda: "(I'll merge this today when more opsen are around. Just in case)" [puppet] - https://gerrit.wikimedia.org/r/194095 (https://phabricator.wikimedia.org/T91498) (owner: Yuvipanda) [07:37:45] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: No response from NTP server [07:37:51] good morning [07:38:30] hi hashar, pushed a zuul lint change yesterday, would be happy if you could take a glance [07:38:41] (PS1) Matanya: ferm: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195858 (https://phabricator.wikimedia.org/T91908) [07:38:41] (PS1) Matanya: ferm: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195858 (https://phabricator.wikimedia.org/T91908) [07:38:52] why two ?
[07:39:00] (PS2) Yuvipanda: statsdlb: fix string containing only a variable [puppet] - https://gerrit.wikimedia.org/r/195534 (owner: Matanya) [07:39:00] (PS2) Yuvipanda: statsdlb: fix string containing only a variable [puppet] - https://gerrit.wikimedia.org/r/195534 (owner: Matanya) [07:39:02] matanya: will try :) I have a couple thousand notification emails pending [07:39:50] thanks [07:42:36] (PS1) Matanya: etherpad: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195859 (https://phabricator.wikimedia.org/T91908) [07:42:36] (PS1) Matanya: etherpad: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195859 (https://phabricator.wikimedia.org/T91908) [07:52:09] (PS1) Matanya: ocg: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195860 (https://phabricator.wikimedia.org/T91908) [08:00:03] (PS1) Matanya: shinken: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195862 (https://phabricator.wikimedia.org/T91908) [08:02:15] (Abandoned) Giuseppe Lavagetto: Adding new dns entries for mc1007-1016 (Do not merge until _joe_ reviews) [dns] - https://gerrit.wikimedia.org/r/190358 (owner: Cmjohnson) [08:04:51] (PS1) Matanya: racktables: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195863 (https://phabricator.wikimedia.org/T91908) [08:06:18] operations, wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1108642 (Joe) These are going to be jessie, and I am going to install them today as they are my missing piece before I can mass-reinstall the appservers.
[08:09:49] (PS1) Matanya: wikimania_scholarships: resource attributes quoting and minor lint [puppet] - https://gerrit.wikimedia.org/r/195864 (https://phabricator.wikimedia.org/T91908) [08:20:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:20:23] hey guys [08:20:31] (CR) Yuvipanda: "Done differently in I976f4c29d6730bd563ae6fb7a33c86b6249705d2" [puppet] - https://gerrit.wikimedia.org/r/194068 (https://phabricator.wikimedia.org/T91370) (owner: Yuvipanda) [08:20:39] (Abandoned) Yuvipanda: tools: Puppetize LVM volume for redis data [puppet] - https://gerrit.wikimedia.org/r/194068 (https://phabricator.wikimedia.org/T91370) (owner: Yuvipanda) [08:20:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:20:56] could somebody tell me where I can find the user/pass combo for cassandra on restbase100x hosts? [08:22:22] nevermind, got in [08:22:22] :) [08:22:26] <_joe_> mobrovac: I guess in the puppet/private repo [08:22:36] <_joe_> which I don't think you have access to atm [08:22:47] you're right, I don't [08:22:51] * _joe_ should set up pwstore so that such info can be shared [08:23:38] good idea [08:23:39] +1 [08:24:36] <_joe_> mobrovac: then we need to do a big keysigning party [08:24:52] <_joe_> mmmh maybe the hackathon may be a good time to do that [08:25:07] yep, definitely [08:25:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:25:17] <_joe_> in the meantime, we have keybase!
[08:25:17] some wine and keysigning :) [08:25:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:30:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [08:30:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [08:30:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:30:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:33:29] (PS1) Yuvipanda: beta: Add new mwdeploy key [puppet] - https://gerrit.wikimedia.org/r/195865 [08:35:08] Puppet, operations, Beta-Cluster: Use keyholder for deploy key management - https://phabricator.wikimedia.org/T92367#1108691 (yuvipanda) NEW a:yuvipanda [08:35:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [08:35:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [08:35:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:35:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [08:35:24] (PS2) Yuvipanda: beta: Add new mwdeploy key [puppet] - https://gerrit.wikimedia.org/r/195865 (https://phabricator.wikimedia.org/T92367) [08:35:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:36:48] Puppet, operations, Patch-For-Review: Convert host lists in dsh/files/groups to hiera - https://phabricator.wikimedia.org/T92259#1108702 (yuvipanda) Thanks for the patch, @Dzahn! Can you also email ops@ and engineering@ to see if anyone else still needs it? IIRC @ssastry is still using dsh for parsoid...
[08:40:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [08:40:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [08:40:15] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:40:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [08:40:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:45:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [08:45:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [08:45:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:45:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [08:45:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:47:05] (03PS1) 10Matanya: wikistats: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195866 (https://phabricator.wikimedia.org/T91908) [08:50:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [08:50:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [08:50:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [08:50:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [08:50:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [08:50:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [08:50:15] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:50:16] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [08:50:45] (03PS1) 10Giuseppe Lavagetto: redis: correct naming of the new redis hosts [dns] - 10https://gerrit.wikimedia.org/r/195868 [08:50:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:55:14] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last 
run 160 seconds ago with 0 failures [08:55:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [08:55:14] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 242 seconds ago with 0 failures [08:55:15] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [08:55:15] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 87 seconds ago with 0 failures [08:55:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [08:55:16] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [08:55:16] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [08:55:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [08:56:43] (PS3) Yuvipanda: beta: Add new mwdeploy key [puppet] - https://gerrit.wikimedia.org/r/195865 (https://phabricator.wikimedia.org/T92367) [08:56:54] YuviPanda: hey there :) [08:57:02] (CR) Yuvipanda: [C: 2 V: 2] beta: Add new mwdeploy key [puppet] - https://gerrit.wikimedia.org/r/195865 (https://phabricator.wikimedia.org/T92367) (owner: Yuvipanda) [08:57:06] * YuviPanda waves at hashar [08:57:09] hello! :) [08:57:25] would you have some spare time to merge in a tiny change for me please? https://gerrit.wikimedia.org/r/#/c/195287/ that gets rid of some prod/labs inconsistency for the Zuul server [08:58:29] (PS4) Yuvipanda: Make zuul use master branch on both prod and labs [puppet] - https://gerrit.wikimedia.org/r/195287 (https://phabricator.wikimedia.org/T91984) (owner: Hashar) [08:58:30] hashar: yay for removing more labs/prod differences [08:58:55] (CR) Yuvipanda: [C: 2 V: 2] Make zuul use master branch on both prod and labs [puppet] - https://gerrit.wikimedia.org/r/195287 (https://phabricator.wikimedia.org/T91984) (owner: Hashar) [08:59:00] hashar: done [08:59:13] YuviPanda: Danke! [08:59:20] hashar: yw!
[08:59:27] I’m not messing with parsoid until tomorrow as requested by the VE team [08:59:41] yeah sorry about the parsoid role [08:59:49] ’tis ok! [08:59:52] I have been busy packaging Zuul as a Debian package [08:59:57] doing reviews today [09:00:06] CI is heavily overburdened, etc. I understand :) [09:00:15] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [09:00:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [09:00:15] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [09:00:15] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [09:00:16] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [09:00:17] well Timo and Kunal are doing great [09:00:52] indeed, *still* overburdened :) 3 people of which 2 are ‘part time’ isn’t enough. [09:00:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:01:06] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [09:02:07] definitely [09:02:12] though there are more folks being involved [09:02:35] yup, yup [09:02:43] I’m still futzing with scap on deployment-prep, btw [09:02:48] ;( [09:03:01] what is going on there? 
[09:03:13] it has suffered from network resolution issues / random ssh permission denied for ages [09:03:29] as well as triggering the l10n compilation which takes a while and causes the Jenkins job to fail [09:05:15] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [09:05:15] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 202 seconds ago with 0 failures [09:05:16] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [09:05:16] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 276 seconds ago with 0 failures [09:05:16] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 251 seconds ago with 0 failures [09:05:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:06:13] hashar: I’m getting rid of beta/scap, replacing it with the scap module. [09:06:17] hashar: scap is succeeding, though. [09:06:20] (for now) [09:06:32] and also making it use the keyholder system [09:08:22] (CR) Giuseppe Lavagetto: [C: 2] redis: correct naming of the new redis hosts [dns] - https://gerrit.wikimedia.org/r/195868 (owner: Giuseppe Lavagetto) [09:09:06] YuviPanda: the keyholder for the ssh agent right?
[09:09:14] hashar: yup, yup [09:10:14] RECOVERY - check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 223 seconds ago with 0 failures [09:10:14] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 212 seconds ago with 0 failures [09:10:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:11:30] tip of the day: a Gerrit search to do codereview https://gerrit.wikimedia.org/r/#/q/is:open+reviewer:self+label:Code-Review%253D0%252Cuser%253Dself,n,z [09:15:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:19:27] (03PS1) 10Yuvipanda: keyholder: Fix bash error [puppet] - 10https://gerrit.wikimedia.org/r/195870 [09:19:55] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:20:19] (03PS2) 10Yuvipanda: keyholder: Fix bash error [puppet] - 10https://gerrit.wikimedia.org/r/195870 [09:20:31] (03PS3) 10Yuvipanda: keyholder: Fix bash error [puppet] - 10https://gerrit.wikimedia.org/r/195870 [09:20:40] (03CR) 10Yuvipanda: [C: 032 V: 032] keyholder: Fix bash error [puppet] - 10https://gerrit.wikimedia.org/r/195870 (owner: 10Yuvipanda) [09:20:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:24:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:25:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:27:26] (03PS1) 10Yuvipanda: keyholder: Revoke Yuvi's bash license. [puppet] - 10https://gerrit.wikimedia.org/r/195871 [09:27:36] (03CR) 10jenkins-bot: [V: 04-1] keyholder: Revoke Yuvi's bash license. [puppet] - 10https://gerrit.wikimedia.org/r/195871 (owner: 10Yuvipanda) [09:27:38] (03PS2) 10Yuvipanda: keyholder: Revoke Yuvi's bash license. 
[puppet] - 10https://gerrit.wikimedia.org/r/195871 [09:29:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:30:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [09:30:14] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [09:30:14] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [09:30:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [09:30:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:32:25] (03CR) 10Yuvipanda: [C: 032] "Tested *this* time" [puppet] - 10https://gerrit.wikimedia.org/r/195871 (owner: 10Yuvipanda) [09:34:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:35:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [09:35:14] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [09:35:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [09:35:15] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [09:35:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [09:35:54] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [09:39:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:39:55] <_joe_> grr I have no way to get into frack to see what's going on [09:40:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [09:40:14] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:40:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [09:40:15] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [09:40:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [09:40:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail 
[09:40:54] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:42:44] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [09:44:02] (PS1) Giuseppe Lavagetto: redis: set up servers in codfw [puppet] - https://gerrit.wikimedia.org/r/195873 [09:44:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:45:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [09:45:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [09:45:15] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [09:45:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [09:45:25] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [09:48:14] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[09:49:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:50:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [09:50:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [09:50:15] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [09:50:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [09:50:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [09:50:42] (PS1) Matanya: logstash: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195874 (https://phabricator.wikimedia.org/T91908) [09:53:22] (PS2) Giuseppe Lavagetto: redis: set up servers in codfw [puppet] - https://gerrit.wikimedia.org/r/195873 [09:54:54] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 242 seconds ago with 0 failures [09:55:14] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 171 seconds ago with 0 failures [09:55:14] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 81 seconds ago with 0 failures [09:55:15] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [09:55:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [09:55:15] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 81 seconds ago with 0 failures [09:55:46] (PS3) Giuseppe Lavagetto: redis: set up servers in codfw [puppet] - https://gerrit.wikimedia.org/r/195873 [10:00:14] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 173 seconds ago with 0 failures [10:00:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [10:01:03] (CR) Giuseppe Lavagetto: [C: 2] redis: set up servers in codfw [puppet] - https://gerrit.wikimedia.org/r/195873 (owner: Giuseppe Lavagetto) [10:04:52] (PS1) 
Giuseppe Lavagetto: netboot: fix typo [puppet] - https://gerrit.wikimedia.org/r/195875 [10:05:14] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 201 seconds ago with 0 failures [10:05:31] (CR) Giuseppe Lavagetto: [C: 2 V: 2] netboot: fix typo [puppet] - https://gerrit.wikimedia.org/r/195875 (owner: Giuseppe Lavagetto) [10:15:50] operations: increase misc-web-lb cp pool from 2 to 3 systems? - https://phabricator.wikimedia.org/T86718#1108868 (mark) @RobH: So those cp servers are the old Squids that we had many of. How many unused ones of those do we still have left? [10:25:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [10:25:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [10:25:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [10:25:15] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [10:30:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [10:30:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [10:30:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [10:30:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [10:30:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [10:30:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [10:30:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [10:30:24] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:33:15] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [10:35:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [10:35:14] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [10:35:15] PROBLEM - 
check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [10:35:15] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [10:35:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [10:35:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [10:35:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:35:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [10:37:14] operations, wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1108873 (Joe) I prepared everything, but sadly all these servers offer me a blank console so that I can't check what happens when I reboot them. Releasing the ticket for @RobH and @papaul consideration [10:37:36] operations, wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1108875 (Joe) a:Joe>RobH [10:38:26] operations, CA-team, MediaWiki-Core-Team, SUL-Finalization: db1068 (s4/commonswiki slave) is missing data about at least 6 users - https://phabricator.wikimedia.org/T91920#1108878 (Krenair) >>! In T91920#1107363, @Springle wrote: >>>! In T91920#1105596, @Krenair wrote: >> Are the other two db1062 an... 
[10:40:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [10:40:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [10:40:14] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [10:40:15] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [10:40:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [10:40:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [10:40:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [10:40:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [10:40:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:43:58] all FR hosts puppet fail ? [10:45:15] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [10:45:15] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [10:45:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [10:45:15] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 105 seconds ago with 0 failures [10:45:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [10:45:16] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [10:45:16] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [10:45:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:45:17] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [10:50:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [10:50:14] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [10:50:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [10:50:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [10:50:15] PROBLEM - 
check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [10:50:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [10:50:15] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [10:50:24] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:55:14] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 160 seconds ago with 0 failures [10:55:14] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 158 seconds ago with 0 failures [10:55:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [10:55:15] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:55:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [10:55:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [10:55:15] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:55:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [11:00:15] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [11:00:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [11:00:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [11:00:24] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [11:05:17] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [11:05:17] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [11:05:18] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 213 seconds ago with 0 failures [11:05:18] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 208 seconds ago with 0 failures [11:10:15] RECOVERY - 
check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 188 seconds ago with 0 failures [11:10:15] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 189 seconds ago with 0 failures [11:15:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [11:19:26] operations, HTTPS, HTTPS-by-default: Enable HSTS and point rel=canonical to HTTPS for all Russian Wikimedia projects - https://phabricator.wikimedia.org/T90527#1108914 (Nemo_bis) [11:25:57] (CR) Alexandros Kosiaris: [C: 1] "My personal plan is to kill racktables really soon (I should commit to a date btw) and make this argument moot. Until then, +1" [puppet] - https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: Chmarkine) [11:28:18] where can i see servermon alex? [11:29:07] ops-pmtpa: Dear pmtpa@rt.wikimedia.org, Call for Submissions on Various Academic Disciplines - https://phabricator.wikimedia.org/T92372#1108929 (emailbot) [11:30:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:33:10] (CR) Alexandros Kosiaris: [C: -1] "Aside from the inline comments, I am a bit unclear on why this is needed. I get from the topic that you want to override selectively the d" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/195492 (owner: Thcipriani) [11:35:45] <_joe_> mark: servermon.wikimedia.org [11:36:26] the racktables like bits of that seem empty, correct? [11:42:39] (CR) Yuvipanda: "We're getting rid of all 'global wikitech variables' and moving them to hiera (already done for salt master key fingerprint). However, the" [puppet] - https://gerrit.wikimedia.org/r/195492 (owner: Thcipriani) [11:46:14] akosiaris: is there any way how I can see how servermon works for racktables like functionality? [11:47:49] mark with actual data in it ? 
[11:48:06] not really, with sample data yes [11:48:23] I can populate the service with some sample data for you mark [11:53:33] sample data is fine [11:53:39] i just want to see how it works [11:53:48] i'm one of the main users of racktables, and i haven't seen it yet [11:53:58] i'm confident servermon is tons better but I'd love to see it before racktables is gone ;) [11:54:58] ok will do [11:55:32] what happened to the migration script? [11:55:36] I thought you got it working in SF [11:55:50] <_joe_> I was sold it was 100% working bulletproof [11:55:55] <_joe_> :D [11:57:03] 100% working ... I got like an XLS that needs to be looked into by mark and chris cause some boxes don't get migrated [11:57:36] minor fixes here and there as well, otherwise it is in a rather pleasing to me state [12:00:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:00:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:01:36] so why sample data then? [12:03:14] good question. Not feeling 100% confident about their validity would be the answer [12:03:28] then again, I will never be 100% sure about their validity, so... 
[12:05:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:05:14] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:05:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:06:16] what, as a good computer scientist I expect you deliver a completely solid mathematical proof with your conversion script [12:08:50] we really need to bring back quips, eh [12:10:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:10:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:10:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:15:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:15:14] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:15:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:17:35] <_joe_> akosiaris: convert your script to haskell, it will help with the proof [12:20:15] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:20:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:20:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:20:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:21:45] _joe_: I am more familiar with standard ML as far as S/M goes [12:22:11] garbage in garbage out is not hard to prove... 
[12:22:15] <_joe_> akosiaris: I know exactly 0 about both [12:25:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:25:15] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [12:25:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:25:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:25:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:25:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [12:27:25] egads the icingaspam [12:28:21] hi jeff [12:28:56] akosiaris: why can't i seem to find the citoid::port fact in ops/puppet's hiera? [12:29:35] when the citoid module clearly takes that arg [12:30:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:30:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [12:30:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:30:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:30:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:30:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:30:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [12:30:28] mobrovac: I think the default just works, and hence isn’t override [12:30:29] n [12:30:51] ah so we're just being lazy and hardcoding stuff? [12:30:52] mobrovac: ^ as YuviPanda pointed out [12:30:52] oki [12:30:56] thnx YuviPanda [12:31:13] yw. this isn’t hardcoding, btw. 
if you wanna see hardcoding, look at the state of our puppet repo a year or two ago :P [12:31:28] <_joe_> one may suffice :) [12:31:31] mobrovac: in general, we set the defaults to work for prod, and override where necessary [12:31:35] (for labs, mostly) [12:31:45] i see [12:31:49] <_joe_> which is a sensible choice ;) [12:32:14] there’s the underlying assumption that ops/puppet is basically wmf/ops/puppet, so most modules won’t really work anywhere without our infrastructure :) [12:32:47] <_joe_> YuviPanda: well, if the modules are general enough, it's not going to be like that [12:33:08] true, actually [12:33:10] yes, for example the bacula module is generic enough [12:33:13] the citoid module can easily be used elsewhere [12:33:15] if needed [12:33:17] so I take that back. [12:33:19] not really [12:33:27] well [12:33:31] anywhere with a trebuchet setup? :) [12:33:32] aka, who else is going to install citoid ? [12:33:36] and trebuchet [12:33:41] and zotero [12:33:46] right. [12:33:47] still [12:33:48] and xulrunner [12:33:50] it’s much better than I had imagined. [12:33:55] I can keep going on [12:34:01] akosiaris: heh, that went from ‘anywhere’ to ‘ugh, almost nowhere’ very fast :P [12:34:13] I still can’t believe we’re running xulrunner in a server process... [12:34:14] oh well [12:34:26] akosiaris: _joe_ I am still pleasantly surprised at how far we’ve come :) [12:34:38] I am hoping gwicke and mobrovac will rewrite zotero into citoid to be honest [12:34:45] mobrovac: ^ see what I did there ;-) [12:34:49] ? 
[12:34:54] * mobrovac hides [12:34:58] :D [12:35:10] you can /me hides but you can never /hide [12:35:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:35:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [12:35:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [12:35:15] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [12:35:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:35:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:35:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:35:16] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: puppet fail [12:35:16] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 279 seconds ago with 0 failures [12:35:20] <_joe_> akosiaris: a lot of people will use zotero, are you kidding? [12:35:23] akosiaris: I responded on your question about salt master pubkey, btw. [12:35:24] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:35:49] YuviPanda: yeah I saw. I was thinking on the security repercussions of that change [12:36:00] <_joe_> I mean there are people running nixos in production, why can't we find someone running zotero? [12:36:21] akosiaris: yup, since it allows people to silently change salt masters. [12:36:44] I was more worried if it can happen in production than anything else [12:36:48] aaaah [12:36:49] I see [12:36:57] but I think not [12:37:07] there are at least two other variables that need to be overridden [12:37:14] I was thinking labs because you can now change salt masters by simply exploiting wikitech [12:37:22] akosiaris: yup, salt master and finger. 
those already have been hieraized [12:37:31] not for production [12:37:33] (PS3) Glaisher: Manage username blacklist (TitleBlacklist) from Meta-Wiki only [mediawiki-config] - https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: Legoktm) [12:37:34] but if you’ve exploited wikitech labs is screwed anyway (hello, LDAP) [12:37:50] akosiaris: ah, right. [12:38:11] (CR) Glaisher: [C: 1] "Though we might want to wait sometime before actually deploying this. Some wikis also have abusefilters for preventing certain usernames. " [mediawiki-config] - https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: Legoktm) [12:38:33] salt::master should be generalized enough to work for both labs and prod, but as bd808 says, one must know when to stop shaving a yak, lest you peel all the skin off the yak and it gores you to death / PETA comes after you. [12:38:35] (CR) Alexandros Kosiaris: "OK, nitpicks then. Otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/195492 (owner: Thcipriani) [12:39:30] akosiaris: ty :) [12:40:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:40:14] RECOVERY - check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 258 seconds ago with 0 failures [12:40:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [12:40:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:40:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:40:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:40:15] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 225 seconds ago with 0 failures [12:40:16] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: puppet fail [12:40:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:42:36] matanya: I was 
looking at https://gerrit.wikimedia.org/r/#/c/195874 [12:42:50] what is the logic behind quoting numeric arguments ? [12:43:40] (CR) Alexandros Kosiaris: [C: -1] "Numeric arguments should not be quoted" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/195874 (https://phabricator.wikimedia.org/T91908) (owner: Matanya) [12:43:51] akosiaris: link in a sec [12:44:05] Ops-Access-Requests, operations, MediaWiki-extensions-ContentTranslation, LE-Sprint-84, Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1109102 (Milimetric) This seems to have spiraled into a lot bigger thing than I imagined. I would like to... [12:44:14] (CR) Nemo bis: "Yes. Maybe the stewards could do that? Either way, [[meta:SN]] is a good place to discuss that." [mediawiki-config] - https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: Legoktm) [12:44:35] <_joe_> can I strongly object to the idea of quoting ensure => present ? [12:44:39] <_joe_> it's horrible [12:44:55] yes [12:45:05] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:45:08] i'll support [12:45:14] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [12:45:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [12:45:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:45:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:45:15] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: puppet fail [12:45:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:45:27] bblack, ping [12:46:14] so _joe_ and mark please say so on the ticket [12:46:16] YuviPanda, forgot to ask you - do we still have labs instances installed with the default admin password, or is there a way to have it autogenerated and presented on wikitech? Or some other magic way? 
[12:46:29] <_joe_> mark: we may want to edit https://wikitech.wikimedia.org/wiki/Puppet_coding#Resources [12:46:30] yurik: ? you mean for vagrant? [12:46:34] yep [12:46:40] yurik: it’s still the default. [12:46:43] :) [12:46:44] sigh [12:46:53] security through admin/vagrant obscurity )) [12:47:05] patches welcome :P [12:47:18] i hear you... no idea how to fix it though ) [12:47:25] indeed :) [12:47:42] wwwweeeeeellllll, you can fairly trivially fix it by having the password be autogenerated and put on the filesystem [12:47:58] YuviPanda, isn't there a way to enter a value on the roles page in wikitech? [12:48:10] there is, but that is public as well :P [12:48:21] so not much help [12:48:31] what password are you talking about ? [12:48:41] it doesn't need to be stored there - just a way to enter it, and it becomes a blank box after clicking "provision" [12:48:49] akosiaris: labs-vagrant’s default mediawiki install’s default user’s password [12:49:05] <_joe_> labs-vagrant? [12:49:10] yurik: nope, that functionality doesn’t exist :) [12:49:11] we have labs? [12:49:18] <_joe_> you mean mediawiki/vagrant? [12:49:30] _joe_: nope, labs-vagrant. labs_vagrant module. [12:49:35] akosiaris: i can't find the link, but the logic is numbers are strings, and as such should be quoted [12:49:40] _joe_, https://wikitech.wikimedia.org/wiki/Labs-vagrant [12:50:02] _joe_: is a hack, basically. lets people use roles / puppet code from mediawiki/vagrant in labs easily. 
[12:50:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [12:50:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [12:50:14] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:50:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:50:15] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [12:50:15] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:50:16] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: puppet fail [12:50:18] it replaced mediawiki_singlenode [12:50:19] it ain't a hack! it's the best way to deploy MW on labs [12:50:23] matanya: then you would not be able to perform operations on them like multiplication/addition etc [12:50:45] yurik: I was saying ‘hack’ in the most positive of ways :) [12:50:45] it's awesome, just needs a bit of TLC [12:50:52] akosiaris: grep some cron resources [12:51:06] I remember being super excited for weeks after building that :) [12:51:13] YuviPanda, imho, we should kill all other lab roles out there [12:51:25] <_joe_> I don't get you guys [12:51:25] <_joe_> :) [12:51:25] yurik: yeah, mediawiki_singlenode needs to die. [12:51:33] <_joe_> YuviPanda: ewww [12:51:35] and hope ori can merge vagrant & production puppets [12:51:47] <_joe_> YuviPanda: https://github.com/duritong/trocla may help I guess? [12:52:21] _joe_: nah. labs-vagrant is just for testing, and my position is that if you set up a testing install of mediawiki, *log in and change the password* :P [12:52:41] _joe_: can you please reason why quoting present is ugly ? [12:52:55] _joe_: and since it’s a mw password, a ‘fix’ would be fairly simple too - just autogenerate a password during install, and save it in the filesystem readable only by root. [12:53:00] but trocla looks interesting by itself... 
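Editor's sketch, not part of the channel log: the fix YuviPanda describes at 12:52:55 (autogenerate a password during install and save it on the filesystem, readable only by root) could look roughly like the Python below. The function name `write_generated_password`, its `path` argument and the 24-byte default are hypothetical and chosen only for illustration; how labs-vagrant would actually hook this into its install step is not shown in the log and is not assumed here.

```python
import os
import secrets


def write_generated_password(path: str, nbytes: int = 24) -> str:
    """Generate a random password and store it in a file only its owner can read.

    Hypothetical helper: run as root, the resulting file is readable only by root.
    """
    # token_urlsafe draws from the OS CSPRNG and returns URL-safe text.
    password = secrets.token_urlsafe(nbytes)
    # O_EXCL refuses to overwrite an existing file; mode 0o600 grants
    # read/write to the owner and nothing to group or others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(password + "\n")
    return password
```

Creating the file with its restrictive mode in the same `os.open` call, rather than writing it first and tightening permissions afterwards, avoids a window in which other users could read the password.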
[12:53:16] yurik: I don’t think vagrant / prod puppet roles will ever merge, btw. They have very different purposes :) [12:53:24] yurik: and the prod mediawiki roles are a *lot* cleaner now... [12:53:25] <_joe_> matanya: because it's a constant term in the puppet DSL [12:53:25] <_joe_> it's not a random string [12:53:58] matanya: you probably refer to https://docs.puppetlabs.com/puppet/latest/reference/lang_datatypes.html#numbers, but as you can see numbers are not quoted in there [12:54:01] _joe_: so you prefer the other way around ? [12:54:12] <_joe_> matanya: yes [12:54:20] sigh, i was hoping to write service puppet only once to both, or at least have some code sharing [12:54:32] yes, that akosiaris , thanks [12:54:48] <_joe_> akosiaris: let's not let puppetlabs enter what was a rational and decent conversation until now :P [12:55:01] _joe_: deal :-) [12:55:11] * YuviPanda rewrites _joe_ in clojure + haskell [12:55:14] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 135 seconds ago with 0 failures [12:55:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [12:55:15] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 209 seconds ago with 0 failures [12:55:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [12:55:15] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 153 seconds ago with 0 failures [12:55:35] btw they are casting between strings and numbers kind of implicitly until it stops working [12:55:55] I personally don't care either way, but i care about consistency [12:55:58] s/kind of// [12:56:14] so, please, let's decide and stick with the decision [12:56:46] * YuviPanda wonders if he should start bitching about what a waste of time aligning arrows are... 
[12:56:47] * YuviPanda doesn't [12:57:47] Puppet, operations, Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1109218 (Joe) I have just noticed that we have this in our puppet guidelines, but my personal "standard" is as follows: - never quote what is going to be transformed in ruby ba... [12:58:07] <_joe_> akosiaris, mark if you want to add your opinions there [12:58:55] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [13:00:14] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 292 seconds ago with 0 failures [13:00:15] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 287 seconds ago with 0 failures [13:01:17] operations, Staging: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1109229 (Springle) >>! In T91797#1096926, @thcipriani wrote: > Maybe it would be a better to say that I expected that adding `role::mariadb` to a fresh server would... 
[13:05:28] (PS2) Matanya: logstash: resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195874 (https://phabricator.wikimedia.org/T91908) [13:09:46] (PS1) Nemo bis: Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) [13:09:52] operations, codfw-appserver-setup, wikis-in-codfw: Set up load balancing for appservers in dallas - https://phabricator.wikimedia.org/T92377#1109243 (Joe) NEW a:Joe [13:09:57] (CR) jenkins-bot: [V: -1] Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: Nemo bis) [13:10:30] (PS1) Giuseppe Lavagetto: mediawiki: add appserver cluster IPs in codfw [dns] - https://gerrit.wikimedia.org/r/195887 [13:11:29] akosiaris: any reason why apt's module sets $http_proxy = "http://webproxy.${::site}.wmnet:8080" , while zotero's hiera says http_proxy: url-downloader.wikimedia.org:8080 ? [13:12:31] mobrovac: different services. url-downloader is used by mediawiki services to download well URLs, webproxy is used for machines to access the internet as needed [13:12:46] i see [13:12:53] both proxy software but different configuration [13:13:01] squid to be exact [13:13:21] so presumably citoid should use url-downloader? [13:13:25] exactly [13:13:38] ok, let's hard-code that baby in the puppet class [13:13:39] :) [13:13:53] ahaha [13:15:03] (CR) Springle: [C: 2] Add /etc/mysql dir before linking inside it [puppet/mariadb] - https://gerrit.wikimedia.org/r/194925 (owner: Thcipriani) [13:15:05] you mean let’s parameterize that baby in the puppet class :P [13:15:06] what's with the /usr/lib/php5/sessionclean cronspam? 
[13:15:53] YuviPanda: with lang=wmf set, yes
[13:15:57] :P
[13:16:29] operations, Staging: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1109268 (yuvipanda) >>! In T91797#1109229, @Springle wrote: > While I sympathize with the request, if this happens it needs to default to off or be put into a separ...
[13:16:34] mobrovac: :)
[13:19:04] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:22:52] hm
[13:23:31] is there any way to read the value of $lvs_service_ips outside of modules/lvs/manifests/configuration.pp ?
[13:23:54] you can include the class and just reference it
[13:23:56] I believe so, but I don't remember how
[13:23:56] it's not pretty but it works
[13:24:04] and is done all over the place actually
[13:24:06] ah right, no actions there
[13:24:07] there are several examples
[13:24:23] this will move to hiera eventually
[13:24:56] so, that *should be* a prime candidate for converting to hiera
[13:25:01] yep
[13:25:29] (CR) Florianschmidtwelzow: [C: -1] Set $wgRateLimits['badcaptcha'] to counter bots (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: Nemo bis)
[13:26:57] mobrovac: take a look at network.pp too. prime candidate for converting to hiera :)
[13:27:16] Puppet, operations, Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1109282 (mark) +2.
[13:27:24] yep
[13:27:39] plenty of work to be done here guys
[13:27:40] :)
[13:27:59] operations, Staging: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1109283 (Springle) # Provision box, sign puppet, first run, etc # xtrabackup clone & prepare data from another server # Start MariaDB service, wait for replic...
[13:28:12] mobrovac: any / all help welcome :)
[13:28:44] i should have seen that coming :D
[13:32:54] (PS2) Thcipriani: Add master_key param for salt_minion module [puppet] - https://gerrit.wikimedia.org/r/195492
[13:34:53] (CR) JanZerebecki: [C: 1] "As both of them are in the past that woun't change behavior now." [puppet] - https://gerrit.wikimedia.org/r/195836 (https://phabricator.wikimedia.org/T92358) (owner: Dzahn)
[13:37:59] (CR) JanZerebecki: "Btw. this was last touched by Brian in 2013." [puppet] - https://gerrit.wikimedia.org/r/195836 (https://phabricator.wikimedia.org/T92358) (owner: Dzahn)
[13:38:19] Puppet, operations, Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1109298 (akosiaris) Up to now, I mostly follow this as well, though I do align arrows :P * I am fine with ensure => 'present' or ensure => present although my tendency is ensur...
[13:39:07] (CR) JanZerebecki: "Err i mean Bryan, sorry." [puppet] - https://gerrit.wikimedia.org/r/195836 (https://phabricator.wikimedia.org/T92358) (owner: Dzahn)
[13:44:48] (PS2) Nemo bis: Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376)
[13:48:14] yurik: pong (no, I haven't looked at any of your VCL-related things yet)
[13:48:51] bblack, thx, dan has been asking, so I wanted to get a status update )
[13:49:50] akosiaris: any good reason we keep svn anyway ?
[13:50:18] Our svn repos?
[13:50:28] Historical value?
[13:50:43] It was going to be imported to phabricator...
[13:50:58] yes, that is my question
[13:51:07] can we remove that after that ?
[13:51:17] Are we moving to svn?
:)
[13:52:07] I think we can after
[13:52:21] I dunno where they got with the migration
[13:52:32] and a question to bblack : modules/varnish/manifests/common/vcl.pp line 35 is there a good reason the vcl unit test is not in the module layout ?
[13:55:00] YuviPanda: Who's the third non-parttime CI sysadmin?
[13:55:15] Krinkle: not sysadmin, ‘people working on it’. hashar mentioned you and legoktm :)
[13:55:23] so I said ‘2 other'
[13:55:30] Right
[13:55:31] :)
[13:56:28] and finally a question to ori : modules/apache/manifests/mod.pp has wrong relation sides, what is the reason for that
[13:58:40] Reedy, matanya: https://phabricator.wikimedia.org/diffusion/SVN/
[13:58:48] (PS1) Mobrovac: Puppetise Citoid's configuration [puppet] - https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875)
[13:58:55] akosiaris: ^^
[13:59:16] so safe to get rid of svn Krenair ?
[13:59:20] <_joe_> matanya: what do you mean?
[13:59:28] <_joe_> matanya: I am the one you should ask to btw
[13:59:36] even better :)
[13:59:42] I think he means the old separate svn server
[13:59:47] SSL certs expired etc
[14:00:01] yes, the puppet role, the server, the classes etc
[14:00:04] chasemp: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150311T1400).
[14:00:26] Nothing to see here today.
[14:00:28] <_joe_> matanya: what do you mean by "wrong relation sides"?
[14:00:37] oh, that part :)
[14:01:09] _joe_: lines 38 and onward, the relation is right_to_left_relationship
[14:01:15] (<-)
[14:01:26] instead the other way around
[14:01:49] <_joe_> matanya: and why is that a problem?
[14:02:01] lint of course :)
[14:02:18] <_joe_> oh I wouldn't care tbh
[14:02:34] http://puppet-lint.com/checks/right_to_left_relationship/
[14:02:59] <_joe_> did I already say I think I know better than puppet-lint authors?
[14:03:06] <_joe_> :)
[14:03:20] <_joe_> they are ruby coders, after all
[14:03:31] yes, that is a sign you are a good sysadmin, you think you know better than the software author :D
[14:05:01] and a general fun reading : http://serverfault.com/questions/84685/early-signs-of-a-bad-sysadmin/
[14:05:11] <_joe_> matanya: anyway, create a patch and add me and ori as reviewers
[14:05:42] <_joe_> matanya: specifically, puppet-lint defined a default coding standard, one we can decide to overrule if we want
[14:06:07] yes, i know. I like following standards
[14:06:34] (CR) JanZerebecki: [C: -1] "Correction: Those variables are currently not used. So it would be better to remove them." [puppet] - https://gerrit.wikimedia.org/r/195836 (https://phabricator.wikimedia.org/T92358) (owner: Dzahn)
[14:06:52] <_joe_> btw, the advice about that disappeared from http://docs.puppetlabs.com/guides/style_guide.html AFAICT
[14:07:08] <_joe_> oh no it's still there
[14:13:08] _joe_: to make sure i got your intentions: class apache::mod::authz_svn { package { 'libapache2-svn' :} -> apache::mod_conf { 'authz_svn': } } is the right order ?
[14:15:10] <_joe_> matanya: it is, yes
[14:15:17] thanks
[14:19:21] operations, Analytics-EventLogging, Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event - https://phabricator.wikimedia.org/T91918#1109436 (ggellerman) a:Nuria
[14:20:16] (PS3) Yuvipanda: Fatalmonitor: Count 'repeated N times: ' in error messages [puppet] - https://gerrit.wikimedia.org/r/195657 (owner: 20after4)
[14:20:24] (CR) Yuvipanda: [C: 2 V: 2] Fatalmonitor: Count 'repeated N times: ' in error messages [puppet] - https://gerrit.wikimedia.org/r/195657 (owner: 20after4)
[14:20:26] (PS1) KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-dan] - https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493)
[14:23:04] operations, Patch-For-Review: dysprosium net / disk issues for reuse as cache box - https://phabricator.wikimedia.org/T83070#1109442 (BBlack) a:BBlack>Cmjohnson
[14:23:32] operations, ops-eqiad: dysprosium net / disk issues for reuse as cache box - https://phabricator.wikimedia.org/T83070#908855 (BBlack)
[14:26:16] (PS1) Matanya: apache mod: correct relationship declarations [puppet] - https://gerrit.wikimedia.org/r/195898
[14:26:47] (CR) Ottomata: "I don't use dsh at all anymore." [puppet] - https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[14:27:36] (PS1) Giuseppe Lavagetto: lvs: add loadbalancers for appservers, api and rendering [puppet] - https://gerrit.wikimedia.org/r/195899 (https://phabricator.wikimedia.org/T92377)
[14:28:15] <_joe_> ottomata: can you log into archiva?
[14:28:20] <_joe_> I get repeated timeouts
[14:32:53] operations: increase misc-web-lb cp pool from 2 to 3 systems? - https://phabricator.wikimedia.org/T86718#1109475 (BBlack) If we add 1-2 boxes to this (4 is our usual minimum for a prod cache cluster, e.g.
2 per row/rack if possible), I'll need to merge it into the current ongoing work on netboot/disk-setup/et...
[14:33:06] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago
[14:33:38] operations: Put archiva.wikimedia.org behind misc-web-lb and force https - https://phabricator.wikimedia.org/T88139#1109478 (Ottomata) declined>Open
[14:34:00] operations: Put archiva.wikimedia.org behind misc-web-lb and force https - https://phabricator.wikimedia.org/T88139#1004787 (Ottomata)
[14:34:16] operations: Put archiva.wikimedia.org behind misc-web-lb and force https - https://phabricator.wikimedia.org/T88139#1004787 (Ottomata) We should do this, login, duh
[14:36:02] operations: Put archiva.wikimedia.org behind misc-web-lb and force https - https://phabricator.wikimedia.org/T88139#1109493 (Ottomata) @robh can we get an ssl cert for archiva.wikimedia.org?
[14:38:28] (PS3) Yuvipanda: Add master_key param for salt_minion module [puppet] - https://gerrit.wikimedia.org/r/195492 (owner: Thcipriani)
[14:38:37] thcipriani: ^ gonna merge now
[14:38:48] (CR) Yuvipanda: [C: 2 V: 2] Add master_key param for salt_minion module [puppet] - https://gerrit.wikimedia.org/r/195492 (owner: Thcipriani)
[14:39:24] YuviPanda: neat.
[14:39:39] Ops-Access-Requests, operations, MediaWiki-extensions-ContentTranslation, LE-Sprint-84, Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1109503 (Ottomata) As far as I can tell, this is not really blocked on ops, as they could just sign the do...
[14:40:04] thcipriani: btw, my deployment server patch has now spawned 6 other patches, all merged. I’m stuck with ssh key setup for mwdeploy atm, though...
[14:40:08] (testing on deployment-prep)
[14:41:43] (PS2) Mobrovac: Puppetise Citoid's configuration [puppet] - https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875)
[14:42:34] YuviPanda: gotcha.
I'm working on some staging-specific puppet role to handle mariadb setup without manual steps.
[14:42:59] thcipriani: hmm, I think we should stick to ‘same as prod’ in this case as well, which is ‘some manual steps'
[14:43:09] thcipriani: for redis as well, for example, restarts are manual.
[14:43:32] I’m very wary of a staging puppet module, considering we’re just now trying to get rid of the beta/ module :)
[14:43:44] we
[14:43:50] we could perhaps script the steps needed...
[14:44:01] right. That was one of my questions: what to do when staging goals conflict.
[14:44:53] thcipriani: oh, right. I think ideally we should strive to make it automatic in *both* prod and staging :D but in cases when that isn’t desired (having puppet thrash around dbs does sound scary), we should stick to ‘close to prod, assuming prod has good reasons, if they do not, change prod as well, and then stick to prod'
[14:47:22] thcipriani: also ‘when they do have good reasons, document them, see if we can turn it into a script that we then put it in the puppet repo'
[14:47:37] (CR) Subramanya Sastry: "We (parsoid) still use dsh."
[puppet] - https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[14:48:20] * thcipriani nods
[14:49:18] (PS1) KartikMistry: Add initial Debian packaging [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/195905 (https://phabricator.wikimedia.org/T91493)
[14:50:00] (PS2) KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/195905 (https://phabricator.wikimedia.org/T91493)
[14:50:22] * anomie sees nothing for SWAT this morning
[14:52:44] (PS1) BBlack: depool cp3021 + cp4004 for reinstall [puppet] - https://gerrit.wikimedia.org/r/195906
[14:53:05] (CR) BBlack: [C: 2 V: 2] depool cp3021 + cp4004 for reinstall [puppet] - https://gerrit.wikimedia.org/r/195906 (owner: BBlack)
[14:54:44] (CR) Alexandros Kosiaris: [C: -1] Puppetise Citoid's configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) (owner: Mobrovac)
[14:57:45] (PS2) Giuseppe Lavagetto: mediawiki: add appserver cluster IPs in codfw [dns] - https://gerrit.wikimedia.org/r/195887 (https://phabricator.wikimedia.org/T92377)
[14:58:11] operations, Citoid, Patch-For-Review, VisualEditor 2014/15 Q3 blockers: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1109521 (Jdforrester-WMF)
[14:59:02] Ops-Access-Requests, operations, Citoid, Services, VisualEditor 2014/15 Q3 blockers: Give mvolz access to sha machine i.e.
http://citoid.wikimedia.org/ - https://phabricator.wikimedia.org/T89057#1109531 (Jdforrester-WMF)
[14:59:22] Blocked-on-Operations, operations, Citoid, Scrum-of-Scrums, and 2 others: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1109532 (Jdforrester-WMF)
[14:59:54] operations, Citoid, VisualEditor 2014/15 Q3 blockers: Puppetize zotero - https://phabricator.wikimedia.org/T89867#1109533 (Jdforrester-WMF)
[15:00:02] operations, Citoid, hardware-requests, Patch-For-Review, VisualEditor 2014/15 Q3 blockers: Assign hardware for the zotero service - https://phabricator.wikimedia.org/T89869#1109534 (Jdforrester-WMF)
[15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150311T1500).
[15:00:06] (CR) Dzahn: [C: 1] mediawiki: add appserver cluster IPs in codfw [dns] - https://gerrit.wikimedia.org/r/195887 (https://phabricator.wikimedia.org/T92377) (owner: Giuseppe Lavagetto)
[15:00:09] operations, Citoid, VisualEditor 2014/15 Q3 blockers: Configure zotero to use an outbound proxy - https://phabricator.wikimedia.org/T89874#1109535 (Jdforrester-WMF)
[15:11:06] operations, Citoid, Patch-For-Review, VisualEditor 2014/15 Q3 blockers: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1109577 (Jdforrester-WMF) a:mobrovac
[15:14:13] off the top of your head, how many repos do we host in Gerrit?
[15:15:40] greg-g: 1229 ?
[15:16:54] looks right from my copy/paste, line count, divide by 3 (two lines per repo, one blank line between)
[15:16:55] akosiaris: thanks :)
[15:17:29] ssh -p 29418 akosiaris@gerrit.wikimedia.org gerrit ls-projects | wc -l
[15:17:42] one line per repo, wc -l works wonders
[15:17:55] USERINFO is weird and empty parent projects count as well
[15:18:22] * greg-g didn't know of gerrit ls-projects, fancy
[15:18:52] I get 1325 listed on git.wikimedia.org web UI
[15:19:08] interesting
[15:19:18] that is like 100 more ?
[15:19:26] oh wait
[15:19:46] deleted projects do not get delete on git.wikimedia.org
[15:19:54] or github/wikimedia for that matter
[15:20:12] get deleted*
[15:21:22] https://phabricator.wikimedia.org/P387 <- the 1325
[15:21:26] bad grrrit-wm
[15:22:13] paravoid: I’m fiddling with it now...
[15:22:25] removed about 60% of moving parts in it earlier this morning.
[15:22:31] so it should be a lot more reliable once it comes back up
[15:23:07] Coren: labstore1003's puppet is failing, labmon1001 has a disk space alert
[15:23:22] ori/_joe_: osmium last ran puppet 8 days ago
[15:24:20] so, the "off the top of your head" was meant to indicate "exact number not needed" but thanks guys, didn't mean to send you down a spiral ;)
[15:25:22] paravoid: 1001 is not a real issue; but Ima go ack the alert. 1003 I'm still trying to figure out. _joe_ tried to help me with understanding the hiera woe yesterday but his suggested fix didn't do the trick.
[15:25:40] don't ack the alert unless there's a phab task on how to fix the alert
[15:26:07] greg-g: you didn't really expect us to say something like 200ish ?
[15:26:28] akosiaris: well, I just wrote "~1200" in the email I just sent :P
[15:27:04] yeah sure, but we couldn't do that. It just would not feel right
[15:27:25] not to mention that I actually had 0 idea. My take was around 300
[15:27:41] and it was obviously wrong
[15:27:44] * greg-g nods
[15:27:56] paravoid: Ohwait. Labmon1001.
Nevermind, I got confused with my labstore1001 work - *that* would have been a false positive, labmon I don't know yet. :-)
[15:28:16] * greg-g is management, he works in squishy numbers all the time, it's ok :)
[15:29:37] heh, I just searched "icinga phabricator" to see if there was an icinga plugin for creating phab tasks from alerts, and all of the search results are from us
[15:31:43] who still uses dsh sometimes?
[15:32:45] on my list so far: mediawiki-installation, parsoid.
[15:33:16] but suggesting to delete all other groups (that are most likely outdated too)
[15:33:52] later we can replace the remaining ones and rm the dsh module altogether ?
[15:36:03] mutante: +1
[15:36:17] mutante: should email engineering@ and ops@, though...
[15:37:00] (CR) Ottomata: [C: 2] Add cron job to drop refined webrequest partitions and data [puppet] - https://gerrit.wikimedia.org/r/195918 (owner: Ottomata)
[15:37:43] (CR) Dzahn: "@subbu amended to not leave the parsoid group untouched" [puppet] - https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[15:39:38] (PS1) Ottomata: Name refinery drop cron jobs accordingly [puppet] - https://gerrit.wikimedia.org/r/195922
[15:39:46] mutante: please do delete analytics dsh, i'm sure it is way out of date
[15:40:04] (CR) Ottomata: [C: 2 V: 2] Name refinery drop cron jobs accordingly [puppet] - https://gerrit.wikimedia.org/r/195922 (owner: Ottomata)
[15:40:12] YuviPanda: ottomata: ok :)
[15:40:34] mutante: thanks for getting rid of them :D
[15:41:14] (PS1) Ottomata: Fix variable name conflict [puppet] - https://gerrit.wikimedia.org/r/195923
[15:41:29] (CR) Ottomata: [C: 2 V: 2] Fix variable name conflict [puppet] - https://gerrit.wikimedia.org/r/195923 (owner: Ottomata)
[15:44:35] Ops-Access-Requests, operations, Citoid: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1109647 (Mvolz) NEW
[15:45:55] Ops-Access-Requests, operations, Citoid: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1109659 (Mvolz) @mobrovac, do you have a shell account on the production machines yet? If not you need to follow: https://wikitech.wikimedia.org/wiki/Requesting_shell_access...
[15:46:26] Ops-Access-Requests, operations, Citoid, Services: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1109660 (Mvolz)
[15:46:33] operations: revoke/delete bugzilla ssl certs - https://phabricator.wikimedia.org/T92041#1109662 (Dzahn)
[15:48:35] (PS1) BBlack: repool cp3021,cp4004, mark amssq3[78] [puppet] - https://gerrit.wikimedia.org/r/195928
[15:48:48] (CR) BBlack: [C: 2 V: 2] repool cp3021,cp4004, mark amssq3[78] [puppet] - https://gerrit.wikimedia.org/r/195928 (owner: BBlack)
[15:49:41] (PS2) Dzahn: planet: rm SSL settings and ports.conf.ssl [puppet] - https://gerrit.wikimedia.org/r/195809
[15:50:16] (CR) Dzahn: [C: 2] planet: rm SSL settings and ports.conf.ssl [puppet] - https://gerrit.wikimedia.org/r/195809 (owner: Dzahn)
[15:51:08] operations, ops-codfw: codw pfw* serial connections problem - https://phabricator.wikimedia.org/T84737#1109669 (faidon) Anytime is fine, let me know if there are any problems.
[15:51:14] Coren: re: labmon1001 pointed out by paravoid, are you taking a look or should I?
[15:51:20] bblack: i'll merge that on palladium
[15:51:32] ok, sorry :)
[15:51:40] YuviPanda: Already on it; it just needs a bit of logrotate tweak; I'll point you at the changeset shortly.
[15:51:47] Coren: ok
[15:51:56] bblack: np, done
[15:53:57] (PS1) Alexandros Kosiaris: Grant mobrovac access to citoid hosts [puppet] - https://gerrit.wikimedia.org/r/195932 (https://phabricator.wikimedia.org/T92389)
[15:56:38] Ops-Access-Requests, operations, MediaWiki-extensions-ContentTranslation, LE-Sprint-84, Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1109685 (KartikMistry) I've signed the document :)
[16:01:28] operations, Analytics, Scrum-of-Scrums, Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1109694 (dr0ptp4kt)
[16:02:08] (PS1) Glaisher: Update 'interface_editor' to 'interface-editor' at ckbwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/195934 (https://phabricator.wikimedia.org/T85731)
[16:03:17] (PS1) Dzahn: etherpad: remove SSL cert and settings frm role [puppet] - https://gerrit.wikimedia.org/r/195936 (https://phabricator.wikimedia.org/T85788)
[16:03:19] (CR) Alexandros Kosiaris: [C: 2] "Thank you!!!!" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis)
[16:03:20] operations, ops-codfw: setup and deploy mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1109702 (Joe) Spoke with mark: we can enable console redirection.
[16:03:27] operations, ops-codfw: setup and deploy mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1109703 (Joe) a:Joe>None
[16:03:55] _joe_: well it would be good to standardize across all servers
[16:04:04] greg-g: Can I do an emergency deploy of https://gerrit.wikimedia.org/r/195935 ?
VisualEditor isn't currently broken in wmf20, but only because the broken code went unsynced by accident, so once twentyafterfour does the deploy train later today it will break
[16:04:04] so if we enable it there, let's make sure it doesn't cause problems on other servers
[16:04:26] <_joe_> mark: ok, btw all eqiad appservers had it
[16:04:33] <_joe_> so I assumed it was the standard
[16:04:35] akosiaris: https://gerrit.wikimedia.org/r/#/c/195936/1
[16:04:40] weird
[16:04:45] we should really find an automated way to do bios settings
[16:05:33] RoanKattouw: doit
[16:05:36] k
[16:05:53] (PS1) Glaisher: Set $wmgAbuseFilterEmergencyDisableThreshold to 0.30 at commons [mediawiki-config] - https://gerrit.wikimedia.org/r/195938 (https://phabricator.wikimedia.org/T87431)
[16:06:07] bblack found a way to do BIOS settings with the Lifecycle Controller to enable HyperThreading
[16:06:13] RoanKattouw: Good timing, I was just about to branch 1.25wmf21 ... is this fixed in master?
[16:06:24] twentyafterfour: So before you do that
[16:06:24] it just requires that one extra reboot but is automatic
[16:06:26] Two things
[16:06:35] One, I want to make sure that https://gerrit.wikimedia.org/r/#/c/195553/ works correctly
[16:06:43] I can do that after you run the script
[16:06:46] (CR) Alexandros Kosiaris: [C: 2] etherpad: remove SSL cert and settings frm role [puppet] - https://gerrit.wikimedia.org/r/195936 (https://phabricator.wikimedia.org/T85788) (owner: Dzahn)
[16:06:53] Two, wmf20 seems messed up
[16:07:03] None of the wmf20 extension branches had commits that adjusted their .gitreview files
[16:07:09] This was fine in wmf19 but broke in wmf20
[16:07:10] (CR) Dzahn: [C: 2] delete etherpad SSL cert [puppet] - https://gerrit.wikimedia.org/r/195303 (https://phabricator.wikimedia.org/T92045) (owner: Dzahn)
[16:07:15] So let's make sure that doesn't happen again
[16:07:29] twentyafterfour: Can you wait a few minutes so I can deploy that fix, and then run
the script?
[16:07:43] RoanKattouw: Indeed, something went wrong with the make-wmf-branch script, I'm not entirely sure it's going to work this time either
[16:07:50] greg-g: (BTW, the VE snafu is something that https://gerrit.wikimedia.org/r/#/c/195553/ should prevent)
[16:07:55] OK
[16:07:58] RoanKattouw: I'll wait, let me know when you're done
[16:08:03] I will be around to help you with any problems you run into
[16:09:06] (CR) Dzahn: "just needed this before: https://gerrit.wikimedia.org/r/195936" [puppet] - https://gerrit.wikimedia.org/r/195303 (https://phabricator.wikimedia.org/T92045) (owner: Dzahn)
[16:11:20] twentyafterfour: OK I haven't finished deploying yet because I need to watch some Jenkins paint dry, but I realized it doesn't interfere with your thing
[16:11:37] (CR) BryanDavis: [C: 1] wikimania_scholarships: resource attributes quoting and minor lint [puppet] - https://gerrit.wikimedia.org/r/195864 (https://phabricator.wikimedia.org/T91908) (owner: Matanya)
[16:11:54] twentyafterfour: So whenever you're ready, go ahead and run your script (be sure to update it first, because I changed it recently), and ping me if you run into any problems
[16:11:54] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time
[16:12:12] (PS1) coren: Add logrotate fragment for graphite-web [puppet] - https://gerrit.wikimedia.org/r/195940
[16:12:15] PROBLEM - uWSGI web apps on labmon1001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running.
[16:12:18] YuviPanda: ^^
[16:12:39] RoanKattouw: ok, I'm submitting some patches to other scripts in that repo, then I'll branch
[16:12:51] OK
[16:13:09] YuviPanda: Oh bah whitespace.
[16:13:16] Coren: I’m eating food atm, will look in a bit. Can you also take care of the other labmon alerts that just showed up?
[16:13:47] YuviPanda: On it.
[16:13:59] ty
[16:18:56] operations, Patch-For-Review: revoke / delete etherpad.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92045#1109737 (Dzahn)
[16:19:48] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.079 second response time
[16:21:41] !log catrope Synchronized php-1.25wmf20/extensions/VisualEditor/: Update and unbreak VE (duration: 00m 06s)
[16:21:48] Logged the message, Master
[16:21:54] RoanKattouw: Woo.
[16:22:29] twentyafterfour: OK I'm done with my deploy, keep me posted on your make-wmf-branch progress (cc greg-g )
[16:22:54] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet last ran 8 days ago
[16:23:11] (PS1) Giuseppe Lavagetto: redis: fix hostnames in the dhcp file as well [puppet] - https://gerrit.wikimedia.org/r/195943
[16:23:24] <_joe_> grr
[16:23:29] <_joe_> Puppet last ran 8 days ago
[16:23:52] operations, Analytics, Scrum-of-Scrums, Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1109760 (dr0ptp4kt) There's an update. See https://lists.wikimedia.org/pipermail/analytics/2015-March/003583.html. it wil...
[16:23:55] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[16:24:02] _joe_: i just ran it
[16:24:31] it was probably me who disabled it, though i don't remember why. sorry 'bout that.
[16:25:34] (CR) Tim Landscheidt: "See also T47828, T47829 and T47827."
[puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis)
[16:25:39] <_joe_> ori: I just found this morning that rhenium is running completely and happily unpuppetized in prod
[16:25:56] heh
[16:26:10] <_joe_> sorry ruthenium
[16:26:26] PROBLEM - Disk space on ms-be2009 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error
[16:26:26] PROBLEM - RAID on ms-be2009 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[16:27:30] (CR) Giuseppe Lavagetto: [C: 2] redis: fix hostnames in the dhcp file as well [puppet] - https://gerrit.wikimedia.org/r/195943 (owner: Giuseppe Lavagetto)
[16:28:50] it's in puppet
[16:29:20] without a role though
[16:31:01] _joe_: what exactly was wrong with the hostnames?
[16:31:17] <_joe_> rbd vs rdb
[16:31:31] oh, damn, i'm even looking at it now and didn't notice
[16:31:33] sorry dude =[
[16:31:42] RoanKattouw: make-wmf-branch seems to be progressing without errors this time.
[16:31:47] <_joe_> eheh it can happen
[16:31:59] <_joe_> you've been almost-consistent though :)
[16:31:59] Good
[16:32:08] i just copied and pasted the lines one under the other and still didn't see ;_:
[16:32:25] i obviously have too many tabs and tasks open, swapdeathing my brain
[16:34:46] <_joe_> robh: that has to do with how our brain interpolates while we read
[16:35:07] <_joe_> it's much easier not to see wehn two letters are swapped in the middle of a word
[16:35:33] ist wyh yuo can stlli raea tihs just fnei
[16:35:49] win 15
[16:36:17] <_joe_> lose 16
[16:36:52] yea, the irssi game :p
[16:37:13] (PS1) BBlack: various s/esams.wm.o/esams.wmnet/ fixups for cache hostnames [puppet] - https://gerrit.wikimedia.org/r/195946
[16:38:37] RoanKattouw: so there is one problem with your change ...
[16:38:39] (CR) BBlack: [C: 2] various s/esams.wm.o/esams.wmnet/ fixups for cache hostnames [puppet] - https://gerrit.wikimedia.org/r/195946 (owner: BBlack)
[16:38:49] it didn't sync the submodule remotes' url
[16:38:56] sub-sub-modules
[16:39:10] submodules should die in a fire
[16:39:19] bblack: I agree
[16:39:21] sub-sub sounds horrible
[16:40:55] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:41:12] twentyafterfour: Oh right, and I'm guessing the push failed because of that?
[16:41:25] Things you don't notice in dry-run mode :D
[16:42:26] No, it shouldn't have, should it
[16:42:32] twentyafterfour: Can you explain the problem in more detail?
[16:42:40] I don't really understand what's going on with submodule URL remapping there
[16:43:11] (PS1) RobH: adding katrik to statistics-users [puppet] - https://gerrit.wikimedia.org/r/195948
[16:43:37] RoanKattouw: the problem is when pushing it asks for a username and password because it's using the https:// git url
[16:43:43] and it needs to use ssh:// git urls
[16:43:47] I'll fix it
[16:43:52] (CR) RobH: [C: 2] adding katrik to statistics-users [puppet] - https://gerrit.wikimedia.org/r/195948 (owner: RobH)
[16:43:54] (PS2) Alexandros Kosiaris: adding kartik to statistics-users [puppet] - https://gerrit.wikimedia.org/r/195948 (owner: RobH)
[16:44:01] OK
[16:44:13] Oh right
[16:44:19] (ssh git protocol uses my ssh agent key, https doesn't have that ability)
[16:44:21] Yeah, you're right, that was stupid of me
[16:44:27] I see that now
[16:44:57] I'm not sure why all the checkouts use https in the first place, I guess because we can't do 'anonymous' git clone from ssh
[16:45:06] exactly
[16:45:08] Yeah
[16:45:13] twentyafterfour: Oh and I forgot something else
[16:45:16] https avoids the agent
[16:45:24] twentyafterfour: When we branch the sub-submodule repo, that adds a commit
[16:45:26] which is not the most secure thing in the world
[16:45:37]
twentyafterfour: So we then have to commit an update to the submodule to change its pointer to the sub-submodule
[16:45:52] RoanKattouw: right
[16:46:05] akosiaris: I don't think we allow pushing over https anyway (but I could be mistaken)
[16:46:06] so I need to add a git commit in the extension submod?
[16:46:13] Yeah
[16:46:25] kart_ had that problem, I think d^ changed some repos to https:// to resolve that
[16:46:27] operations, Patch-For-Review: revoke / delete etherpad.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92045#1109824 (Dzahn)
[16:46:28] RoanKattouw: yes you're right, it doesn't let me push over https, I tried entering my password
[16:46:51] you need to get a special password for gerrit to do push over https
[16:46:59] there is pushing over https
[16:47:14] twentyafterfour: I would like to commit that submodule change together with the fixGitReview() change, in one "Creating new FOO branch" commit
[16:47:15] settings->http password
[16:47:15] gerrit lets you set a password for it
[16:47:23] The way I structured the code makes that a bit difficult
[16:47:40] you have to set a separate password
[16:47:43] Ops-Access-Requests, operations, MediaWiki-extensions-ContentTranslation, LE-Sprint-84, Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1109825 (RobH) This is also a perfect example of why all access requests should be one person, one ticket....
[16:47:44] (03PS1) 10Cmjohnson: Adding dns entries including ipv6 for cp1071-1074 [dns] - 10https://gerrit.wikimedia.org/r/195950 [16:47:46] for "http auth" [16:47:48] twentyafterfour: I'm about to disappear for dinner, but I can fix the code if you want, or you can do it if you're already working on it and understand what needs to be done [16:48:33] RoanKattouw: I think I can get it, if I can't figure it out I'll make a ticket and assign it to you :) [16:48:46] OK cool [16:48:53] twentyafterfour: Let's at least manually fix it for wmf21 [16:49:22] RoanKattouw: that's what I am doing... I manually sync'd the submodule url, I'll make a manual commit and push after the script finishes [16:49:37] OK cool [16:49:57] Thanks man [16:51:59] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1109863 (10DarTar) @dr0ptp4kt, awesome, can you document this parameter scheme somewhere on mediawiki.org or wikitech for fu... [16:52:43] 10Ops-Access-Requests, 6operations, 10Citoid, 6Services, 5Patch-For-Review: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1109865 (10RobH) @mobrovac, We've updated our procedure for access requests, and we'll need you to please review the following: https://wikit... [16:53:02] (03PS2) 10Dzahn: varnish: catch planet.wm.o as well [puppet] - 10https://gerrit.wikimedia.org/r/195646 (https://phabricator.wikimedia.org/T92051) (owner: 10John F. Lewis) [16:53:51] (03CR) 10RobH: [C: 04-1] "task still has pending steps for access being granted, but otherwise this looks good. 
(So once the ticket is resolved, my -1 can be remove" [puppet] - 10https://gerrit.wikimedia.org/r/195932 (https://phabricator.wikimedia.org/T92389) (owner: 10Alexandros Kosiaris) [16:54:17] (03CR) 10Dzahn: [C: 032] varnish: catch planet.wm.o as well [puppet] - 10https://gerrit.wikimedia.org/r/195646 (https://phabricator.wikimedia.org/T92051) (owner: 10John F. Lewis) [16:54:59] (03CR) 10Cmjohnson: [C: 032] Adding dns entries including ipv6 for cp1071-1074 [dns] - 10https://gerrit.wikimedia.org/r/195950 (owner: 10Cmjohnson) [16:55:20] Coren: did you manage the rest of the errors? [16:55:57] Coren: also https://github.com/graphite-project/graphite-web/pull/531 looks like it is related. [16:56:04] 10Ops-Access-Requests, 6operations, 10Citoid, 6Services, 5Patch-For-Review: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1109871 (10mobrovac) @RobH reviewed and signed a while ago. Adding @robla-wmf, could you please approve? [16:57:34] 10Ops-Access-Requests, 6operations, 10Citoid, 6Services, 5Patch-For-Review: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1109873 (10RobLa-WMF) approved [17:01:47] (03PS3) 10Jforrester: Add Draft namespace on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193827 (https://phabricator.wikimedia.org/T91223) (owner: 10Gerrit Patch Uploader) [17:02:05] RECOVERY - Disk space on ms-be2009 is OK: DISK OK [17:02:09] (03CR) 10Jforrester: "And done $wmgVisualEditorNamespaces in PS3." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193827 (https://phabricator.wikimedia.org/T91223) (owner: 10Gerrit Patch Uploader) [17:02:39] (03CR) 10Yuvipanda: [C: 04-1] Add logrotate fragment for graphite-web (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/195940 (owner: 10coren) [17:02:54] (03CR) 10BryanDavis: [C: 031] logstash: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195874 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [17:04:43] ok I'm apparently not allowed to update a branch pointer on extensions/VisualEditor? I was able to create the branch but pushing a new commit to that branch gives me this: [17:04:44] remote: Branch refs/heads/wmf/1.20wmf21: [17:04:46] remote: You are not allowed to perform this operation. [17:04:48] remote: To push into this reference you need 'Push' rights. [17:04:50] remote: User: twentyafterfour [17:04:59] twentyafterfour: 1.20wmf21?! [17:05:06] twentyafterfour: 1.25wmf21 surely? [17:05:07] doh [17:05:13] (03CR) 10coren: Add logrotate fragment for graphite-web (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/195940 (owner: 10coren) [17:05:31] James_f: that could be the problem ;) [17:05:57] YuviPanda: There's a bigger issue though - uwsgi doesn't survive a reload even though it can stop/start just fine. I'm trying to debug why atm. [17:06:10] oh [17:06:17] there’s uwsgictl vs service uwsgi [17:06:42] Coren: poke if you need any help. I’m going to file bugs about the exceptions. [17:06:47] Coren: do !log your actions here as well :) [17:07:49] YuviPanda: Yeah, same behaviour with uwsgictl and service uwsgi. [17:08:09] alright [17:08:13] Stop/start would work but be annoying as it'd cause 502s for several seconds every time. [17:13:19] 6operations, 5Patch-For-Review: revoke / delete etherpad.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92045#1109957 (10RobH) 5Open>3Resolved a:3RobH confirmed cert revoked from rapidssl. 
resolving task. [17:14:33] akosiaris: re https://gerrit.wikimedia.org/r/#/c/195896/ [17:14:40] i'd like those things to be in hiera as well [17:14:58] but without the values from lvs::config, that means hard-coding them there [17:15:16] well, parametrise them [17:15:21] yup. That is the idea. hiera being the configuration store [17:15:59] so i don't need to worry about "oh zotero's ip has changed but nobody set citoid's fact in hiera" ? [17:16:10] (03CR) 10Ori.livneh: [C: 031] dsh: delete most groups [puppet] - 10https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: 10Dzahn) [17:16:44] 6operations, 10ops-codfw: setup and deploy mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1109981 (10Papaul) ok will start changing those settings on all the mw servers [17:17:10] oh it can happen, but the idea is that it is going to be way more difficult [17:17:27] and of course hiera can be used to populate lvs::configuration in the future [17:17:49] that is the goal at least [17:18:23] ok gr8 [17:18:24] thnx [17:18:26] 6operations, 7Graphite: Graphite web exceptions filling up /var/log on labmon1001 - https://phabricator.wikimedia.org/T92406#1110002 (10yuvipanda) 3NEW [17:18:38] <_joe_> akosiaris: yeah one day [17:18:44] (03PS2) 10Yuvipanda: Add logrotate fragment for graphite-web [puppet] - 10https://gerrit.wikimedia.org/r/195940 (https://phabricator.wikimedia.org/T92406) (owner: 10coren) [17:18:53] <_joe_> when I have the balls to gut our lvs puppet classes [17:18:54] !log trying other ways to restart uwsgi on labmod1001 [17:18:57] <_joe_> :) [17:19:00] Logged the message, Master [17:19:21] <_joe_> Coren: have you tried to turn it on and off again? 
[17:19:30] 6operations, 7Graphite, 5Patch-For-Review: Graphite web exceptions filling up /var/log on labmon1001 - https://phabricator.wikimedia.org/T92406#1110020 (10yuvipanda) Is probably an instance of https://github.com/graphite-project/graphite-web/pull/531 [17:19:46] _joe_: That actually works, but is undesirable. :-) [17:21:04] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.017 second response time [17:21:15] YuviPanda: Well, I can do pre-stop post-start for the time being. It's ugly, but it'd solve the immediate issue. [17:21:42] (03PS1) 10Dzahn: depool cp1053 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195957 [17:21:46] YuviPanda: But right now for sure any attempt to get uwsgi to reload its config and reopen log files has it commit sppuku instead. [17:22:24] 6operations: revoke/delete bugzilla ssl certs - https://phabricator.wikimedia.org/T92041#1110031 (10RobH) 5Open>3Resolved a:3RobH certs revoked from rapidssl: bugzilla.wikimedia.org bug-attachment.wikimedia.org resolving task [17:22:28] Coren: I vaguely recall ori saying something about our uwsgi module and how it has some quirks about how it was started... [17:22:32] well, started/stopped/restarted [17:22:37] wonder if that’s related [17:22:43] Sounds like it is. [17:22:45] 6operations, 5Patch-For-Review: https://planet.wikimedia.org/ redirect broken - https://phabricator.wikimedia.org/T92051#1110034 (10Dzahn) a:3JohnLewis that patch fixed it. redirects again to: http://meta.wikimedia.org/wiki/Planet_Wikimedia thanks [17:22:55] 6operations, 5Patch-For-Review: https://planet.wikimedia.org/ redirect broken - https://phabricator.wikimedia.org/T92051#1110036 (10Dzahn) 5Open>3Resolved [17:23:14] Resolve the space issue with stop-start and open a task to look into the deeper restart issue then? We want to do this soon because them logs are growing fast. 
[17:23:56] 6operations, 10Wikimedia-Planet: https://planet.wikimedia.org/ redirect broken - https://phabricator.wikimedia.org/T92051#1102438 (10Dzahn) [17:24:18] 10Ops-Access-Requests, 6operations, 10Citoid, 6Services, 5Patch-For-Review: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1110043 (10RobH) @mobrovac: You totally did, and I somehow missed it, sorry about that! With @robla's approval, and alex's patchset (https://g... [17:24:27] 10Ops-Access-Requests, 6operations, 10Citoid, 6Services, 5Patch-For-Review: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1110045 (10RobH) p:5Triage>3Normal [17:24:37] Coren: uwsgictl [17:24:59] ori: Does the same. 'uwsgictl restart' causes complete self-destruction. [17:25:10] hm [17:25:23] robh: cheers! [17:25:27] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.255 second response time [17:26:34] Huh, that's odd... [17:26:58] (03CR) 10RobH: [C: 031] delete metrics.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/195304 (https://phabricator.wikimedia.org/T73156) (owner: 10Dzahn) [17:27:33] ori: Actually, that's a lie. It does that only when there was a failed restart via service previously. [17:27:39] ori: I just now noticed. [17:28:15] (03CR) 10Dzahn: [C: 032] depool cp1053 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195957 (owner: 10Dzahn) [17:29:21] (03CR) 10RobH: [C: 032] "After grepping through the manifests, it seems this is indeed no longer installed via install_certificate within puppet. As such, I'll me" [puppet] - 10https://gerrit.wikimedia.org/r/195304 (https://phabricator.wikimedia.org/T73156) (owner: 10Dzahn) [17:30:41] YuviPanda: It seems that uwsgictl fails only if service uwsgi got involved first. 
[17:31:31] (03PS3) 10coren: Add logrotate fragment for graphite-web [puppet] - 10https://gerrit.wikimedia.org/r/195940 (https://phabricator.wikimedia.org/T92406) [17:32:23] (03PS4) 10coren: Add logrotate fragment for graphite-web [puppet] - 10https://gerrit.wikimedia.org/r/195940 (https://phabricator.wikimedia.org/T92406) [17:32:46] YuviPanda: That one seems like it will work ^^. Worth a try anyways. [17:33:14] Coren: why daily and size 300? [17:33:32] Because "no more tha 300M but at least once daily regardless" [17:33:42] hmmm [17:33:44] 6operations, 5Patch-For-Review: revoke / delete metrics.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92044#1110130 (10RobH) I've gone ahead and removed the public cert and private key from the repos and merged, as well as shredding the files on stat1001 (where they used to reside.) [17:33:47] It'll rotate if it hits either. [17:34:08] (03CR) 10Ori.livneh: [C: 04-2] "backend ports is expected to be an array, so to use it with validate_re we have to explicitly convert it to a string." [puppet] - 10https://gerrit.wikimedia.org/r/195534 (owner: 10Matanya) [17:34:27] Prevents runaway logging from filling the fs even if it happens in a single day. [17:34:41] 6operations, 5Patch-For-Review: revoke / delete metrics.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92044#1110134 (10RobH) [17:34:59] Coren: hmm, ok. [17:35:10] (03PS5) 10Yuvipanda: Add logrotate fragment for graphite-web [puppet] - 10https://gerrit.wikimedia.org/r/195940 (https://phabricator.wikimedia.org/T92406) (owner: 10coren) [17:35:16] 6operations, 5Patch-For-Review: revoke / delete metrics.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92044#1102217 (10RobH) [17:35:34] (03CR) 10Yuvipanda: [C: 031] "We should probably keep a little bit more logs (5 days maybe?) but this is ok too." 
[puppet] - 10https://gerrit.wikimedia.org/r/195940 (https://phabricator.wikimedia.org/T92406) (owner: 10coren) [17:35:50] 6operations, 5Patch-For-Review: revoke / delete metrics.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92044#1102217 (10RobH) [17:36:36] (03CR) 10coren: [C: 032] "Now that's there's an actual logrotate config, it's a simple matter to tweak at need." [puppet] - 10https://gerrit.wikimedia.org/r/195940 (https://phabricator.wikimedia.org/T92406) (owner: 10coren) [17:37:08] I'm babysitting its first run, make sure uwsgi remains healthy. [17:38:27] (03CR) 10Ori.livneh: [C: 04-2] "It's not really correcting it because the order is the same, just expressed differently. I had the arrows pointing the other way here as a" [puppet] - 10https://gerrit.wikimedia.org/r/195898 (owner: 10Matanya) [17:38:52] (03CR) 10Steinsplitter: [C: 031] Set $wmgAbuseFilterEmergencyDisableThreshold to 0.30 at commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195938 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [17:40:49] Someone is working on some cool new logging thing where there's a slick frontend for logs, right? What is that called? 
[17:41:28] ragesoss: logstash [17:41:41] https://logstash.wikimedia.org/ [17:42:43] (03PS2) 10Ori.livneh: Set up a beacon namespace on bits [puppet] - 10https://gerrit.wikimedia.org/r/192370 [17:42:51] thanks much [17:45:25] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [17:50:35] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [17:51:45] 6operations, 6MediaWiki-Core-Team, 10hardware-requests, 5Patch-For-Review: Fluorine needs bigger disks - https://phabricator.wikimedia.org/T92417#1110216 (10Andrew) 3NEW [17:52:01] 6operations, 6MediaWiki-Core-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1110224 (10Andrew) [17:52:14] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.024 second response time [17:52:49] robh: Do we have any standard server types with 4Tb drives, or bigger? Asking re: https://phabricator.wikimedia.org/T92417 [17:52:58] nope [17:53:08] dang [17:53:15] those disks arent cheap, and they tend to be slow [17:53:25] so we dont keep them onhand [17:53:37] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1110226 (10Ottomata) [17:53:46] Hm. So maybe we could split logs up between two different boxes… [17:54:05] or scale logging into a storage type box with raid [17:54:09] 2u box [17:54:11] YuviPanda: That poor little box. 
[17:54:29] so the 1U boxes tend to house either 4 lff or 6-8 sff disks [17:54:37] not much capacity overall, and spindle count is low [17:54:52] jumping to 2u gives us db-class boxes in regards to spindle count [17:55:19] robh: I’m guessing we don’t have any of /those/ sitting around [17:55:22] which, logging tends to be a lot of writes, I'd think it may need a proper hardware raid if you want to expand to keeping more history [17:55:23] uh... [17:55:26] you know, we might [17:55:35] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 1.234 second response time [17:55:37] old r510 db class boxes that are now out of warranty [17:55:40] (03PS1) 10coren: role::labs::nfs::dumps fix typo in hiera key [puppet] - 10https://gerrit.wikimedia.org/r/195964 [17:55:44] and are underpowered for general db use [17:55:50] springle would know, we should ask him on task [17:55:59] robh: great, I will do that. [17:56:08] I don't have any that he has released into spares for me, but that doesn't mean he doesn't have that planned [17:56:36] but those would be excellent for log storage... i think... lemme check storage space [17:57:09] andrewbogott: and i may be totalllllllly wrong [17:57:30] db1001 as example has only a 1.2TB data /a [17:57:32] robh: I need to regroup with the folks who actually use those logs. Maybe they don’t /want/ 180 days, in which case this is moot :) [17:58:01] yea... they are smaller 600gb disks too [17:58:10] what you would need is more like our db-slave boxen [17:58:20] of which we have no spare, and actually had to expand the cluster [17:58:23] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1110245 (10BBlack) I'm pretty much still backlogged on HTTPS-by-default work for the next week or two and not really working on complex VCL projects. There... 
[17:58:41] andrewbogott: yea, you are at the tipping point for 1u to 2u box jumps for storage [17:58:47] it tends to also be a price point jump [17:59:41] robh: ok, that sounds difficult and expensive enough that we should pursue other options (like not logging so much, or using two boxes.) [17:59:42] thx [17:59:59] yea, 2tb tends to be the limit where suddenly disks are slow [18:00:04] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150311T1800). [18:00:11] so we do order 3tb machines now [18:00:16] but they are misc and not fast disks [18:00:45] 6operations, 6MediaWiki-Core-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1110251 (10Andrew) OK, current log retention policy looks like this: API logs: 30 days api-feature-usage logs: 90 days xff logs: 88 days everything else: 180 days If we were... [18:03:28] (03CR) 10coren: [C: 032] "Typo typo" [puppet] - 10https://gerrit.wikimedia.org/r/195964 (owner: 10coren) [18:04:01] 6operations, 6MediaWiki-Core-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1110267 (10hoo) IMO 30 days are enough for API logs (and probably also for XFF logs, although I think we decided to no longer collect these at all?). Other logs (like exception an... [18:06:45] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:47] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1110293 (10faidon) This task completely lacks a rationale. Why should Varnish do that? 
[18:08:58] 6operations: revoke/delete SSL cert techblog.wikimedia.org - https://phabricator.wikimedia.org/T92021#1110294 (10RobH) 5Open>3Resolved a:3RobH techblog.wikimedia.org certificate revoked from rapidssl [18:09:37] !log branching wmf/1.25wmf21 [18:09:42] Logged the message, Master [18:09:45] 6operations, 6MediaWiki-Core-Team, 10hardware-requests, 5Patch-For-Review: Fluorine needs bigger disks - https://phabricator.wikimedia.org/T92417#1110300 (10RobH) We don't have any 4TB disks on site, but they could be ordered. What is the overall capacity and raid requirements for the logging server? [18:10:00] 6operations, 6MediaWiki-Core-Team, 10hardware-requests, 5Patch-For-Review: Fluorine needs bigger disks - https://phabricator.wikimedia.org/T92417#1110301 (10RobH) I forgot to ask speed requirements for disks. [18:11:08] (03PS1) 10BBlack: depool cp3005 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195968 [18:11:21] (03CR) 10BBlack: [C: 032 V: 032] depool cp3005 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195968 (owner: 10BBlack) [18:12:23] (03CR) 10Faidon Liambotis: "Brandon, could you elaborate why you don't like the idea?" [puppet] - 10https://gerrit.wikimedia.org/r/192370 (owner: 10Ori.livneh) [18:12:33] my +1/-1 is going to be conditional on that :) [18:13:32] 6operations, 6MediaWiki-Core-Team, 10hardware-requests, 5Patch-For-Review: Fluorine needs bigger disks - https://phabricator.wikimedia.org/T92417#1110320 (10Andrew) This might be moot -- sounds like we're maybe just retaining way more logs than anyone actually wants. Stay tuned... [18:14:15] paravoid: I did already, I think ori copied it to the ticket and nuria quoted me in the changset. 
My complains amount to generic "analytics is crazy" complaints, they're not worth blocking on [18:14:39] ah [18:15:06] (hence why my compaint came with a +1, it was more of a show of protest than a real block) [18:16:06] bblack: a real concern for sure (as in what you meantion has alreday happened), but this changeset i think makes things better rather than worst [18:16:09] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1110328 (10Yurik) Zero partners have been heavily stressing the need for the comprehensive data - tagging all traffic will give us the ability to estimate ov... [18:17:31] bblack: by the way , been enjoying vcl for now some days and i have made sure all changes we wanted to do with cookies are "doable", if i want to set a cookie in any mobile/text request where do i put the code in this templates: https://doc.wikimedia.org/puppetsource/templates/varnish/ [18:17:48] bblack: do i create a new template and include if from "mobile" and "text" [18:18:23] nuria: submit a patchset placing it directly in the current mobile/text templates, in the operations/puppet repository [18:18:46] bblack: so a new template that is external to both? [18:19:06] oh you mean it's shared? [18:19:38] well, there's lots of duplication between the two as it is, something that needs to be addressed eventually. but yes, if you want to include a new file to not add more duplication, that's fine too [18:19:39] bblack: so cookie setting needs to happen for "text" requests and "mobile" requests so it is shared [18:19:47] cookie setting? 
[18:19:54] paravoid: ya i know [18:20:17] pravoid, bblack everybody's favorite addition to the payload: [18:20:24] between this and yurik's incredibly vague reply for a pretty substantial resource request, this might be my cue to go to dinner [18:20:44] paravoid: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_visit_solution [18:20:50] paravoid, why vague? ) [18:21:09] paravoid: mine comes with a through description of the lovely feature [18:21:12] let's do an expensive operation for every text request because... we might want to use it for some image thingy thing [18:21:17] so does mine )) [18:21:41] paravoid, you are not being fair - i said the primary reason is to have per parter analytics [18:21:42] yurik: that is yours, mine is is not a 'we might' [18:21:52] its not "might" either [18:21:52] for non-mobile traffic? [18:21:58] paravoid, yes [18:22:01] paravoid: the cookie thing is here: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_visit_solution [18:22:04] paravoid: both mobile and text [18:22:06] they are doing ip-based whitelisting [18:22:11] we need to give them hard numbers [18:22:28] yurik: compare your task with Nuria's page [18:22:46] your task is literally 4 sentences spread out into two posts [18:23:29] please try to elaborate more in your requests [18:24:00] you should also have a task for those other two things you want [18:24:08] and add "blocked by" there as well [18:24:41] how are we supposed to prioritize this? [18:24:56] * andre__ recommends https://www.mediawiki.org/wiki/Phabricator/Project_management#Use_plain_language.2C_define_actions_and_expected_results [18:25:02] (03CR) 10Nikerabbit: Fatalmonitor: Count 'repeated N times: ' in error messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195657 (owner: 1020after4) [18:25:25] "Zero partners have been heavily stressing the need for the comprehensive data" -- what's the priority this has across Zero & Analytics? 
[18:25:47] if it's p: Low on the Zero or Analytics side for example, then it makes no sense for us to prioritize this now either [18:26:19] so yeah, I'm still calling this completely vague [18:26:28] and I'm going for dinner :) [18:26:59] James_F: Can you check that the visual editor sub-submodule branching worked as intended? [18:27:03] (03PS3) 10Ori.livneh: Set up a beacon namespace on bits [puppet] - 10https://gerrit.wikimedia.org/r/192370 [18:27:13] * James_F looks. [18:27:17] I'll check [18:27:21] (03CR) 10Ori.livneh: [C: 032] " ori: I'm ok with it" [puppet] - 10https://gerrit.wikimedia.org/r/192370 (owner: 10Ori.livneh) [18:27:24] I think it all worked but extra eyes couldn't hurt ;) [18:27:29] (03CR) 10Ori.livneh: [V: 032] " ori: I'm ok with it" [puppet] - 10https://gerrit.wikimedia.org/r/192370 (owner: 10Ori.livneh) [18:28:07] twentyafterfour: Looks perfect [18:28:11] now I just have to figure out how to clean up all these wmf/1.20wmf21 branchs [18:28:14] Yeah exactly [18:28:44] I *think* it's: [18:29:00] for r in $(cat listOfRepos); do cd $r; git push origin :wmf/1.20wmf21; cd -; done [18:29:09] yurik:can you identify every zero-partner by IP? Cause if that is the case we could do the "tagging" post-request with raw data in the cluster (asking to triple check, this might have been adressed) [18:29:28] 6operations, 10Wikimedia-Blog: Delete stat1002:/a/squid/archive/blog - https://phabricator.wikimedia.org/T92331#1110363 (10Tbayer) [18:29:30] twentyafterfour: Looks good to me. 
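Editorial sketch, not part of the channel log: the one-liner quoted above (`cd $r; git push origin :wmf/1.20wmf21; cd -`) keeps going if a `cd` fails, so the push would run from whatever directory the shell is in. A minimal alternative, assuming Python is available on the deploy host, builds one `git -C <repo> push origin --delete <branch>` command per repository and only prints them unless told otherwise. The repository paths are hypothetical placeholders, and `delete_branch_commands` and `run_commands` are illustrative names, not part of make-wmf-branch.

```python
import subprocess
from typing import Iterable, List


def delete_branch_commands(repos: Iterable[str], branch: str,
                           remote: str = "origin") -> List[List[str]]:
    """Build one `git push --delete` argv per repository; nothing is executed."""
    return [["git", "-C", repo, "push", remote, "--delete", branch]
            for repo in repos]


def run_commands(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failing push instead of continuing.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical repository paths; the real list is not in the log.
    repos = ["extensions/VisualEditor", "extensions/WikimediaEvents"]
    run_commands(delete_branch_commands(repos, "wmf/1.20wmf21"))
```

Defaulting to a dry run leaves room to review the list first, which matters if some repositories turn out to hold a legitimate branch of the same name.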
[18:30:00] Yeah that seems to be correct [18:30:38] (03Abandoned) 10Ori.livneh: Set up beacon endpoint for virtual media views [puppet] - 10https://gerrit.wikimedia.org/r/190821 (https://phabricator.wikimedia.org/T89088) (owner: 10Gilles) [18:31:02] nuria: we had to do it in varnish for caching reasons, because the page looks different for each carrier (that's been fixed somewhat now, but we still need to know if it was a zero carrier at all or not at the cache level) [18:31:09] RoanKattouw: Do we definitely not have "real" wmf/1.20wmf21 branches somewhere? [18:31:28] bblack: k [18:31:38] Hmm we might [18:31:46] !log cp1053 - comment in pybal for reinstall [18:31:51] Logged the message, Master [18:31:52] James_F: RoanKattouw: the new make-wmf-branch code is https://gerrit.wikimedia.org/r/#/c/195972/ [18:31:55] twentyafterfour: Yeah so what I just said, don't run that [18:32:13] twentyafterfour: We have plenty of real 1.20wmf21 branches lying around [18:32:25] RoanKattouw: Or was someone converting them to tags? [18:32:32] Apparently all the old branches disappeared from VE somehow without being converted to tags :( [18:32:43] 6operations, 10Wikimedia-Blog: Delete stat1002:/a/squid/archive/blog - https://phabricator.wikimedia.org/T92331#1110388 (10Tbayer) Since this may refer to to blog.wikimedia.org: From the perspective of the blog team, I can confirm that I'm not aware of this data and that we are not using it. (Personally I am k... [18:32:56] RoanKattouw: Yeah, that was because I was deleting them manually. Not a widespread issue. [18:33:19] RoanKattouw: really? 
I didn't think we used the wmf/1.NwmfN branching convention in the 1.20 series (I didn't see any real wmf/1.20wmf21 branches, just my accidental branches) [18:33:20] Right, OK [18:34:03] but I could be way wrong [18:34:26] (03CR) 10Gilles: "Awesome, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/190821 (https://phabricator.wikimedia.org/T89088) (owner: 10Gilles) [18:34:54] We've used them for ages [18:35:08] Just deleted/converted to tags [18:35:21] Or should've been... [18:37:20] 6operations, 10Fundraising Dash: Create sandbox site for Dash - https://phabricator.wikimedia.org/T87809#1110423 (10atgo) Hey @jgreen do you have any ideas for when you'll be able to look at this? [18:38:39] Reedy: Should have been, yeah, except VE has none of them [18:39:01] Reedy: It was basically impossible to find VE versions as recent as 1.25wmf12 [18:39:10] I never did any of that cleanup... :/ [18:39:20] I see a few 1.20wmfN branches but wmf10 is the highest number [18:40:15] Oh, hmm [18:40:20] Maybe 1.20wmf21 never existed [18:40:25] But didn't you get errors trying to push it? [18:40:27] Different people have done it over the time [18:40:34] Some messed up [18:40:55] so should I clean up all of these old branches while I'm at it? 
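Editorial sketch, not part of the channel log: deciding which old branches to clean up comes down to parsing names of the form `wmf/1.NwmfN` and comparing them numerically. The following is a minimal illustration under stated assumptions: the regular expression reflects only the branch names that appear in this conversation, the cutoff is an assumed parameter, and `parse_wmf_branch` and `branches_older_than` are hypothetical helper names.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches names such as "wmf/1.25wmf21" (assumed convention, taken from the log).
BRANCH_RE = re.compile(r"^wmf/(\d+)\.(\d+)wmf(\d+)$")


def parse_wmf_branch(name: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, wmf_number) as integers, or None if the name does not match."""
    match = BRANCH_RE.match(name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def branches_older_than(names: Iterable[str],
                        cutoff: Tuple[int, int] = (1, 25)) -> List[str]:
    """Select wmf branches whose (major, minor) is strictly below the cutoff."""
    selected = []
    for name in names:
        version = parse_wmf_branch(name)
        # Integer tuples compare numerically, so 1.9 sorts before 1.25.
        if version is not None and version[:2] < cutoff:
            selected.append(name)
    return selected
```

Comparing integer tuples rather than strings avoids misordering versions such as 1.9 and 1.25, and names that do not match the pattern (for example `master`) are never selected.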
[18:40:58] I guess 1.20 takes us back a while [18:41:02] In the case of cherry-picks that means that MW's wmf branches' submodule pointers now point to things that can't be downloaded [18:41:12] I should see if I can salvage those branches [18:41:21] Those commits might still exist on the server if they haven't been gc'ed [18:42:15] well they exist on the server because that's where I'm pulling them from (none of my operations are on /srv/mediawiki-staging/php-*) [18:42:16] Hmm I misspoke, VE has everything back to 1.23wmf10 apparently [18:42:32] twentyafterfour: Yeah sorry I'm talking about two different issues at once [18:42:36] And mostly talking to myself [18:42:40] paravoid, point taken, will clean up ) [18:43:29] so I can probably write a script to convert all the old branches to tags [18:43:35] shouldn't be difficult [18:44:19] say, maybe everything < 1.25? [18:44:20] nuria, it is a massive pain to do it afterthefact - that's why we use bblack's excellent netmapper. Problem is, IPs change over time, so tagging them afterwards would be a much much more complicated project [18:44:25] !log cp1053 - reinstalling, PXE boot [18:44:27] (03PS2) 10Rush: admin module enable user cleanup [puppet] - 10https://gerrit.wikimedia.org/r/195656 [18:44:29] Logged the message, Master [18:44:30] Hmm I must have misremembered, now I don't see any missing branches [18:45:55] (03CR) 10Rush: [C: 032] admin module enable user cleanup [puppet] - 10https://gerrit.wikimedia.org/r/195656 (owner: 10Rush) [18:46:11] chasemp: ^ !very cool [18:46:45] (03CR) 10Dzahn: ""awarded a token"" [puppet] - 10https://gerrit.wikimedia.org/r/195656 (owner: 10Rush) [18:47:07] yurik: Ok, your call but if IP config has to be kept up to date on varnish too, doing it afterthefact might be an easier option coding wise, given that you could tag hourly per partition. But up to you and ops, after looking at VCL code this week I understand how to add something there is not simple. 
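Editorial sketch, not part of the channel log: the tagging under discussion maps a client IP to a carrier and records the result in the `X-Analytics` header. The snippet below is only an illustration of that idea using Python's standard `ipaddress` module. It is not how netmapper or the Varnish VCL implement it; the carrier names and networks are made-up documentation ranges, the `xcs` key is taken from the task title quoted in the log, and the `;` separator between header pairs is an assumption.

```python
import ipaddress
from typing import Dict, List, Optional, Tuple

# Hypothetical carrier table; real carrier IDs and networks are not in the log.
CARRIER_NETWORKS: Dict[str, List[str]] = {
    "carrier-a": ["192.0.2.0/24"],
    "carrier-b": ["198.51.100.0/24", "2001:db8::/32"],
}


def build_index(table: Dict[str, List[str]]) -> List[Tuple[object, str]]:
    """Parse the CIDR strings once so lookups do not re-parse them."""
    return [(ipaddress.ip_network(cidr), carrier)
            for carrier, cidrs in table.items()
            for cidr in cidrs]


def lookup_carrier(index: List[Tuple[object, str]], client_ip: str) -> Optional[str]:
    """Return the carrier whose network contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    for network, carrier in index:
        # Only compare addresses and networks of the same IP version.
        if addr.version == network.version and addr in network:
            return carrier
    return None


def tag_request(headers: Dict[str, str], index: List[Tuple[object, str]],
                client_ip: str) -> Dict[str, str]:
    """Return a copy of headers with an xcs pair appended to X-Analytics if matched."""
    tagged = dict(headers)
    carrier = lookup_carrier(index, client_ip)
    if carrier is not None:
        pair = f"xcs={carrier}"
        existing = tagged.get("X-Analytics")
        tagged["X-Analytics"] = f"{existing};{pair}" if existing else pair
    return tagged
```

A linear scan is fine for a handful of networks; a production lookup running on every request would presumably want a prefix-tree or similar structure, which is outside the scope of this sketch.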
[18:47:40] nuria, thing is, it is already being done for all mobile traffic [18:47:50] we just need to pass all other traffic through the same tagging [18:48:06] currently there is a filter "if its m.wikipedia, tag" [18:48:19] yurik: but the volume of bits traffic and mobile traffic is very different [18:48:33] yurik: but again up to ops [18:48:45] nuria, true, but now we are not talking about how hard to code, but about performance - and from the looks of it, tagging is extremelly fast [18:49:35] its a C lib dict lookup [18:49:51] customized to deal with IPs [18:49:59] yurik: I guess it depends how complicated is the tagging logic but if ya, varnish seems to do everything real fast [18:50:33] simple - calls an external C library, passing in the IP, and sets the X-Analytics header [18:51:14] so i am not worried much about the performance, rather about finding brandon's time to do it [18:51:16] (03CR) 10BryanDavis: Fatalmonitor: Count 'repeated N times: ' in error messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195657 (owner: 1020after4) [18:51:32] yurik: Ya the difference is that now we are doing it once per page request (main document) and adding bits you wil do it, say, 20 times per page, likely more [18:52:16] yurik: which, again, might not be significant at all, we can measure it [18:52:29] yes, totally understand that, but again, it is really a perf question, and brandon has said before that it is very fast ) [18:52:55] (03CR) 10Dzahn: [C: 032] delete files/apache/blog_ports.conf [puppet] - 10https://gerrit.wikimedia.org/r/195808 (owner: 10Dzahn) [18:53:49] yurik: so (looking at a page load) about 20 times per mobile page and 50 times per desktop page, that does not seem like a deal breaker, true [18:54:13] exactly ) [18:54:28] and 50 seems exceesssive [18:54:53] yurik: no, that's a low bound actually for a page w/o many images [18:55:10] yurik: check the desktop site [18:55:18] ouch... 
ori has a lot of work to do to make it even marginally faster ) [18:56:35] nuria, actualy many of them are "data:" - i'm guessing they are part of a CSS file [18:57:29] yurik: but it's the same, they come from bits at the end, right? [18:57:41] yurik: for barak obama's page count is >120 [18:58:06] * yurik hides and waits for a local mysql slave copy [18:58:13] yurik: or maybe not, some of those might be base64 imgs [18:58:18] exactly [18:58:25] (PS1) Dzahn: repool cp1053 after jessie reinstall [puppet] - https://gerrit.wikimedia.org/r/195981 [18:59:06] nuria, yurik: In Chrome's network view, you can click the filter icon (picture of a sieve) and then check the "Hide data URIs" checkbox on the far right [18:59:37] nuria doesn't know what a sieve is but she's learning .... [19:00:12] Or a funnel [19:00:25] I guess it looks more like a funnel than a sieve [19:00:28] ah super handy RoanKattouw [19:00:41] It also allows you to filter by category (images, XHR, etc) [19:01:11] yurik: so still 100 req, so I would definitely measure the change [19:01:23] yurik: now, this is with empty cache of course [19:01:31] RoanKattouw, thx, didn't know that trick [19:03:54] (PS1) BBlack: repool cp3005, tag cp4012 [puppet] - https://gerrit.wikimedia.org/r/195983 [19:04:16] (PS1) Dzahn: depool cp1061 for reinstall [puppet] - https://gerrit.wikimedia.org/r/195984 [19:04:24] (CR) BBlack: [C: 2 V: 2] repool cp3005, tag cp4012 [puppet] - https://gerrit.wikimedia.org/r/195983 (owner: BBlack) [19:05:52] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:06:06] (PS2) Dzahn: repool cp1053 after jessie reinstall [puppet] - https://gerrit.wikimedia.org/r/195981 [19:07:08] (CR) Dzahn: [C: 2] repool cp1053 after jessie reinstall [puppet] - https://gerrit.wikimedia.org/r/195981 (owner: Dzahn) [19:08:55] Coren: ^ puppet failures on labstore2001 [19:09:35] Codfw? 
That one is due for a reinstall; but lemme see if that's something scary. [19:11:04] Coren: alright. you shuld ack it in that cas [19:11:25] YuviPanda: Once I know what it is. Right now, all I see is a surprisingly obscure message. [19:11:32] ealright [19:11:33] alright [19:11:56] (CR) Dzahn: "http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1053.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1426023366&g=" [puppet] - https://gerrit.wikimedia.org/r/195981 (owner: Dzahn) [19:11:59] Ah. It's the user group cleanup script that isn't working. [19:13:19] Ah, and by design - it's meant to exit quietly on all labstores. [19:13:21] (PS5) Ori.livneh: fix up ordering for salt-minion package, config, service [puppet] - https://gerrit.wikimedia.org/r/162860 (owner: ArielGlenn) [19:13:41] I expect it shouldn't have been on labstore2001 at all to begin with. [19:14:09] That, or it was deployed just now and is about to make puppet fail on all four. [19:14:59] (PS2) Dzahn: depool cp1061 for reinstall [puppet] - https://gerrit.wikimedia.org/r/195984 [19:15:02] (CR) Ori.livneh: [C: 2] fix up ordering for salt-minion package, config, service [puppet] - https://gerrit.wikimedia.org/r/162860 (owner: ArielGlenn) [19:15:17] Coren: I think chasemp was doing something about that? [19:15:24] ok so I don't have enough permissions on gerrit to delete all the garbage branches [19:15:28] (keyword matching on ‘cleanup’ ‘user’) [19:16:09] I know it was planned at least; the fix is trivial: just need to have it exit 0 when it bails out rather than exit 1 [19:16:20] (CR) Dzahn: [C: 2] depool cp1061 for reinstall [puppet] - https://gerrit.wikimedia.org/r/195984 (owner: Dzahn) [19:16:20] AFAIK that doesn't land on labstore hosts [19:16:25] I don't see admin module included [19:16:31] twentyafterfour: ^d can fix that for you if he's around [19:17:00] chasemp: labstore2001 isn't configured with ldap thingy as the older labstores are. 
[19:17:11] twentyafterfour: Or RoanKattouw probably has the gerrit admin superpowers too [19:17:58] ^d: halp? I can't delete the metric crapton of garbage branches that I accidentally created earlier... need to `git push origin :wmf/1.20wmf21` on every deployed extension [19:18:01] We should make twentyafterfour a Gerrit admin though [19:18:06] +1 [19:18:27] chasemp: I can just ack the error; this script is meant to run for a while and go away or not? [19:18:33] hey yeah then I could really do some damage [19:18:49] Coren: well I thought admin was commented out on //all// labstores but maybe not [19:18:50] with great power comes late nights fixing stuff :) [19:18:55] (lol [19:19:02] and as a butt-saver I put in teh hard bail out in case it was accidentally there [19:19:16] so I need to adjust if it's on labstore2* but not labstore1* [19:19:19] is that teh case? [19:19:21] bd808: Ito late nights fixing stuff so I guess I'm made for it [19:19:33] that should have said I love late nights fixing stuff [19:19:51] chasemp: It wouldn't have mangled 2001 since it specifically doesn't include the ldap thing nor will it ever - it's not going to be used for labs storage directly before the idmap thing has been ripped out entirely. [19:20:04] twentyafterfour: That's why we picked you :) [19:20:53] Coren: I don't see labstore2001 in puppet, can you help me out? [19:21:00] site.pp I mean [19:21:47] chasemp: IT's never been explicitly configured; it's a default box that was used for experiments. [19:22:03] chasemp: Which are being applied to 1002 instead. [19:22:05] ok so how is it getting admin then? from defaults in site.pp I guess? [19:22:11] * Coren nods. [19:22:19] I made it exit 1 as a "hey this shouldnt be here" [19:22:25] but can made exit 0 and it's a noop for this case [19:22:38] but won't account cleanup [19:22:54] Matching for labstore100* would also work. [19:23:18] I can do that but you are super duper sure that's never going to go nuts? [19:23:44] Yes. 
That box will never have the idmap mess that prevents it from being safe. [19:23:51] k :) no problem then [19:23:56] thanks for explaining [19:24:40] Thanks for being paranoid. Better to have a 'this can't happen' pop up that was not dangerous than not having it. :-) [19:25:42] (PS1) Rush: admin module user cleanup only ignore labstore1* hosts [puppet] - https://gerrit.wikimedia.org/r/195989 [19:26:04] oh, I forgot icinga, there may be 3x amssq alerts inc [19:26:09] (CR) coren: [C: 1] "Is safe." [puppet] - https://gerrit.wikimedia.org/r/195989 (owner: Rush) [19:26:38] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:03] I think I caught the other two in time, ignore the above alert! [19:27:21] (CR) Rush: [C: 2] admin module user cleanup only ignore labstore1* hosts [puppet] - https://gerrit.wikimedia.org/r/195989 (owner: Rush) [19:27:59] hey thanks bblack [19:28:48] RECOVERY - Host amssq39 is UP: PING OK - Packet loss = 0%, RTA = 89.17 ms [19:34:39] chasemp: could you review when you have the chance? [19:35:15] operations, Patch-For-Review: revoke / delete metrics.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92044#1110694 (Dzahn) p:Triage>Normal [19:35:38] operations: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#1110698 (RobH) Open>declined a:RobH This is now outdated, as stated, since task T84958 covers the hardware order. 
[19:35:41] operations, Icinga: "NRPE: Unable to read output" should not be OK for "configured eth" check - https://phabricator.wikimedia.org/T92293#1110701 (Dzahn) p:Triage>Normal [19:38:06] operations, ops-eqiad: wipe search* and searchidx* hosts - https://phabricator.wikimedia.org/T92434#1110707 (RobH) NEW a:Cmjohnson [19:43:04] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:45:37] (PS1) Dzahn: delete analytics dsh group [puppet] - https://gerrit.wikimedia.org/r/195992 [19:46:10] ori: seems ok (AFAIK) kinda weird the issue reference is to a dupe and not the primary and then I don't see anything in the issue about whether this is for sure going in? but mainly wanted to ask, the debian/ will b reviwed separately? [19:46:15] (CR) Dzahn: [C: 2] delete analytics dsh group [puppet] - https://gerrit.wikimedia.org/r/195992 (owner: Dzahn) [19:48:12] thcipriani: aaargh [19:48:14] > Mar 11 19:47:12 deployment-mediawiki03 sshd[21390]: Failed publickey for mwdeploy from 10.68.16.58 port 42133 ssh2: RSA f0:54:06:fa:17:27:97:a2:cc:69:a0:a7:df:4c:0a:e3 [19:48:27] root@deployment-mediawiki03:/home/yuvipanda# ssh-keygen -lf /home/mwdeploy/.ssh/authorized_keys [19:48:27] 2048 f0:54:06:fa:17:27:97:a2:cc:69:a0:a7:df:4c:0a:e3 root@deployment-salt (RSA) [19:48:30] this is a fresh instance [19:48:42] so it does not have /etc/ssh/userkeys/ thing [19:49:21] !log cp1061 - comment in pybal, reinstalling [19:49:24] (CR) Rush: [C: 1] "No reason to stall this that I know of." [debs/statsite] - https://gerrit.wikimedia.org/r/193095 (https://phabricator.wikimedia.org/T90111) (owner: Filippo Giunchedi) [19:49:29] Logged the message, Master [19:49:33] YuviPanda: so it's weird. I can totally get in form the mwdeploy user on deployment bastion. The finger print of the .ssh/id_rsa for mwdeploy does not at all match that public key. 
[19:49:33] thcipriani: and it *is* offering the correct key [19:49:33] debug1: Offering RSA public key: /etc/keyholder.d/mwdeploy_rsa [19:49:40] just a heads up, my deploy is going to run a bit long today, as usual :-/ [19:49:45] thcipriani: to mediawiki03? [19:49:53] thcipriani: so problem is beta::scap::target installs a different key [19:50:03] thcipriani: which puts the private key under /home/mwdeploy/.ssh [19:50:11] which is what I’m trying to get rid of, and that’s working for you [19:50:25] what I’m trying to make happen is to make it use the keyholder setup [19:50:26] with [19:50:26] debug1: Offering RSA public key: /etc/keyholder.d/mwdeploy_rsa [19:50:37] operations, Graphite, Patch-For-Review: Graphite web exceptions filling up /var/log on labmon1001 - https://phabricator.wikimedia.org/T92406#1110771 (coren) Open>Resolved a:coren Logrotate settings now work. [19:50:38] basically [19:50:38] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -v mwdeploy@deployment-mediawiki03 [19:50:45] operations, Graphite: Graphite web exceptions filling up /var/log on labmon1001 - https://phabricator.wikimedia.org/T92406#1110774 (coren) [19:51:15] operations, Graphite: Graphite web exceptions filling up /var/log on labmon1001 - https://phabricator.wikimedia.org/T92406#1110779 (yuvipanda) @Coren: We should investigate the actual exceptions too, I think. [19:51:36] YuviPanda: right, it's strange, the authorized_keys file doesn't seem to contain the old public key, so I _shouldn't_ be able to get in at all. [19:51:50] (CR) Odder: [C: 1] Set $wmgAbuseFilterEmergencyDisableThreshold to 0.30 at commons [mediawiki-config] - https://gerrit.wikimedia.org/r/195938 (https://phabricator.wikimedia.org/T87431) (owner: Glaisher) [19:52:21] thcipriani: check /etc/ssh/userkeys/mwdeploy :) that’s wh ere beta::scap::target puts them [19:52:28] ^ yeah that [19:52:43] it's a hack-a-licious setup [19:53:01] bd808: yup, I’m trying to get rid of it... 
[19:53:01] operations: Puppet should actively purge sudo and access rights not enumerated by the admins module - https://phabricator.wikimedia.org/T88826#1110783 (chasemp) [19:53:03] operations, Security: Define in Puppet or remove rogue user accounts not currently defined in admin/data.yaml - https://phabricator.wikimedia.org/T90923#1110781 (chasemp) Open>Resolved So far no serious issues: https://gerrit.wikimedia.org/r/#/c/195656/ [19:53:14] hack-a-licious. Got it. [19:53:16] bd808: and have *mostly* succeeded, except for this strange key mismatch... [19:53:43] YuviPanda: wrong key offered or not accepted? [19:53:46] deployment-mediawiki03 should be clear of beta::scap::target influences though [19:53:58] bd808: so for all I can see, the right key is being offered, and then *not* being accepted... [19:54:10] hmm... [19:54:20] can you ssh directly using the proper key? [19:54:41] That's where I would start. [19:55:17] right key in the sense fingerprint comparison of pubkey in /home/mwdeploy/.ssh/authorized_keys + the message from sshd match... [19:55:20] bd808: nope, I can’t.. [19:55:26] so I figure the problem isn’t with keyholder [19:55:31] *nod* [19:55:37] but it’s some tiny thing with the keys themselves that I’m missing... [19:55:42] and will facepalm when I find out [19:55:45] where did you put the public key? [19:56:03] labs doesn't look in ~/.ssh for authorized_keys [19:56:12] they have to go in /etc [19:56:20] ... [19:56:20] oh [19:56:23] why [19:56:25] when [19:56:26] whattt [19:56:29] because labs [19:56:42] this is what _I_ was confused about! [19:56:44] WAAAAT [19:56:44] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:56:51] the keys are managed by wikitech and ldap data [19:56:57] …... 
[19:57:02] arghabharga [19:57:17] * bd808 knows more about labs than the labs root :) [19:57:37] I dunno what everyone was smoking when they gave me root [19:57:58] I put on my best emperor palpantine impression as well. cloak, moonlight, etc. [19:58:01] but YAY bd808 [19:58:14] I only know about how it works because of setting up the prior hack [19:58:17] also, jesus fucking christ, labs. [19:59:06] * bd808 hands wat a pamphlet [19:59:39] "Hack-a-licious: a story of labs" [20:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150311T2000). [20:00:13] operations: Puppet should actively purge sudo and access rights not enumerated by the admins module - https://phabricator.wikimedia.org/T88826#1110801 (chasemp) Open>Resolved a:chasemp Any //user// specific sudo rights will be /etc/sudoers.d/$user on the end system. I recently enabled account cleanu... [20:00:52] bd808: I’m going to try finding out why this is the case [20:01:48] (PS1) BBlack: tag amssq39-41 [puppet] - https://gerrit.wikimedia.org/r/195998 [20:01:50] (PS1) BBlack: fix salt-minion on jessie [puppet] - https://gerrit.wikimedia.org/r/195999 [20:02:02] (CR) BBlack: [C: 2 V: 2] tag amssq39-41 [puppet] - https://gerrit.wikimedia.org/r/195998 (owner: BBlack) [20:02:04] YuviPanda: because the ssh keys are normally exported from nfs where they are managed by the wikitech upload-your-key process [20:02:29] right, but why won’t it look in /home/$user/.ssh as well, if it is looking in /etc/ssh/userkeys [20:02:35] But you can slide keys in on specific hosts too which is how the root key is propigated [20:03:03] because then revoking keys via wikitech might not work [20:03:14] and it would be easier to backdoor keys in various vms [20:03:28] (CR) BBlack: [C: 2] fix salt-minion on jessie [puppet] - https://gerrit.wikimedia.org/r/195999 (owner: BBlack) [20:03:38] 
You either have key management or you don't bascially [20:04:08] bd808: well, you can do that now with putting them in /etc/ssh/userkeys [20:04:25] only if you have root in the right places [20:04:44] hmm [20:04:47] that makes sense. [20:04:54] in ~/.ssh you'd just need a user-to-user exploit to get their account [20:05:06] and then probably get paswordless sudo from that [20:05:10] well, if you’re added to a labs project, by default you have root :D [20:05:19] depends on the project [20:05:39] outside of tools, nothing atm [20:05:54] also ~/.ssh is project local [20:06:07] which has other potential confusion [20:06:11] right [20:06:18] so you’re getting it on everywhere... [20:06:29] operations: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1110814 (vshchepakina) NEW [20:06:35] another thing we can do now is potentially get rid of mwdeploy from LDAP... [20:06:57] probably [20:07:09] as long as it never touches nfs in the project [20:07:17] s/touches/changes/ [20:07:29] (CR) RobH: [C: 2] "confirmed that the certificate isn't installed elsewhere in our puppet manifests, so +2 and merging this." [puppet] - https://gerrit.wikimedia.org/r/195309 (https://phabricator.wikimedia.org/T92043) (owner: Dzahn) [20:07:30] yeah [20:07:40] bd808: well, we’ll set uid / gid for mwdeploy too... 
[20:07:46] and the idmapd mess is… slowly being sorted out [20:09:35] (PS1) Mobrovac: Activate the RESTBase Virtual REST Service on test.wp [mediawiki-config] - https://gerrit.wikimedia.org/r/196000 [20:10:44] * gwicke looks around for a job runner config change review (https://gerrit.wikimedia.org/r/#/c/195364/) [20:11:44] !log deployed parsoid sha 73bf3162 [20:11:48] Logged the message, Master [20:12:06] operations, Patch-For-Review: delete / revoke stats.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92043#1110825 (RobH) [20:12:14] operations, Patch-For-Review: delete / revoke stats.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92043#1102189 (RobH) [20:12:16] (CR) Aaron Schulz: [C: 1] Add restbase job runners [puppet] - https://gerrit.wikimedia.org/r/195364 (owner: GWicke) [20:13:04] operations, Patch-For-Review: delete / revoke stats.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92043#1110829 (RobH) a:RobH I merged @dzahn's patchset and also removed the key from the private repo, then shredded them off stat1001. in process of revoking the cert with rapidssl. [20:13:11] operations: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1110832 (JohnLewis) @vshchepakina as the domain is currently just a cname to the shop's URL (which is shopwikipedia...), I can create the changes to move this to the main cname to the .wikipedia.org domain and chang... [20:13:20] operations, Patch-For-Review: delete / revoke stats.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92043#1110833 (RobH) p:Triage>Normal [20:14:06] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [20:14:43] (PS1) Yuvipanda: ssh: Allow keys in /etc/ssh/userkeys for prod as well [puppet] - https://gerrit.wikimedia.org/r/196001 [20:14:49] paravoid: chasemp ^ allows prod keys in /etc/ssh/userkeys [20:14:50] thoughts? 
[20:15:03] (CR) GWicke: Activate the RESTBase Virtual REST Service on test.wp (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/196000 (owner: Mobrovac) [20:15:06] operations, Patch-For-Review: delete / revoke stats.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92043#1110838 (RobH) Open>Resolved revocation via rapidssl complete, resolving [20:15:06] (unifying with labs, for mwdeploy, and saw comments in admin::user about how that should also use this) [20:15:09] YuviPanda: https://gerrit.wikimedia.org/r/#/q/topic:ssh-userkey,n,z [20:15:16] paravoid: hahaha [20:15:23] also oh wow, that’s a lot of changes [20:15:31] heh I was looking for paravoid's changes to point to [20:15:32] :) [20:15:45] (CR) Mobrovac: Activate the RESTBase Virtual REST Service on test.wp (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/196000 (owner: Mobrovac) [20:16:37] chasemp: paravoid I guess I can’t just merge these, since if they fuck up we are basically locked out (esp. if it happens on admin) [20:16:55] operations, Patch-For-Review: revoke / delete metrics.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T92044#1110841 (RobH) Open>Resolved a:RobH confirmed revocation of cert with rapidssl, resolving. [20:16:56] (PS2) Mobrovac: Activate the RESTBase Virtual REST Service on test.wp [mediawiki-config] - https://gerrit.wikimedia.org/r/196000 [20:16:57] you can, if you do it carefully :) [20:17:01] (PS1) 20after4: Add 1.25wmf21 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/196003 [20:17:02] aaron: thanks re https://gerrit.wikimedia.org/r/#/c/195364/! [20:17:03] (PS1) 20after4: Wikipedias to 1.25wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/196004 [20:17:04] and you can review for sure :) [20:17:05] (PS1) 20after4: Group0 to 1.25wmf21 [mediawiki-config] - https://gerrit.wikimedia.org/r/196005 [20:17:30] paravoid: yeah, doing that now. 
[20:17:46] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [20:17:52] paravoid: a big change I’ve been working on for the last 2-3 days (have already split off 4-5 changes off it) needs this now... [20:17:54] (CR) 20after4: [C: 2] Add 1.25wmf21 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/196003 (owner: 20after4) [20:17:56] so I shall review [20:17:57] robh: up for merging https://gerrit.wikimedia.org/r/#/c/195364/ ? [20:18:00] (Merged) jenkins-bot: Add 1.25wmf21 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/196003 (owner: 20after4) [20:19:07] uh, i can look at it, but im not familar with that infrastructure, so i'll be reading how it works for awhile [20:19:11] (CR) GWicke: [C: 2] Activate the RESTBase Virtual REST Service on test.wp [mediawiki-config] - https://gerrit.wikimedia.org/r/196000 (owner: Mobrovac) [20:19:37] robh: Aaron reviewed it already [20:19:54] so? [20:19:58] but is lacking the +2 powers [20:20:19] I have been scolded in the past for merging things I don't fully understand [20:20:26] so I don't do it now, sorry ;D [20:20:41] like i said, im happy to review the stuff and try to understand what its doing [20:20:49] but i assume it wont be fast. [20:21:02] (Merged) jenkins-bot: Activate the RESTBase Virtual REST Service on test.wp [mediawiki-config] - https://gerrit.wikimedia.org/r/196000 (owner: Mobrovac) [20:21:05] (if it needs to happen right now i can try to find an ops person to review for you, being clinic duty person ;) [20:21:12] !log twentyafterfour Started scap: testwiki to php-1.25wmf21 and rebuild l10n cache [20:21:19] Logged the message, Master [20:21:30] robh: that might be more helpful actually [20:22:20] robh: although really, if there is anybody you can trust on job queue issues it's Aaron [20:22:47] Its not a question of trust, its a fact of I've been told I'm not to merge things I dont understand fully. 
which seems like a fairly sane policy [20:23:08] if it breaks, then aaron has to walk me through fixing whatever breaks as remote hands [20:23:13] which is non ideal in most cases [20:23:26] if this is fully test infrastructure and will have zero production impact, then my point is slightly moot [20:23:31] maybe he should have +2 for job runner related things then [20:23:32] (so let me know if thats the case) [20:23:38] twentyafterfour: scap-ing? [20:24:10] robh: the patch configures additional runners for a job that's currently run in the default queue [20:24:13] mobrovac: yes [20:24:15] (PS1) Dzahn: repool cp1061 after jessie reinstall [puppet] - https://gerrit.wikimedia.org/r/196006 [20:24:46] robh: it follows the parsoid pattern you see in the same file [20:25:07] ok, i'll stop pinging opsen and ill familarize myself with the file [20:25:07] twentyafterfour: oki, need to sync-file 2 files, so just let me know once you're done, pls [20:25:11] also, those jobs are only enabled on test.wikipedia.org so far, where they are confirmed working fine [20:25:13] which is what i offered and you said to ping folks i thought [20:25:17] so i must have misunderstood [20:25:50] so i'll look at it [20:26:13] thanks! [20:26:38] but thats not a promise to merge, i dont understand this infrastructure at all, so no promises [20:26:57] my concern is i merge ans shit breaks [20:27:11] gwicke: if this does break things, you are comfortable backign this out and the resulting fixes? [20:27:22] (i should have asked that sooner ;) [20:27:37] (PS1) John F. Lewis: shop: change main shop domain [dns] - https://gerrit.wikimedia.org/r/196007 (https://phabricator.wikimedia.org/T92438) [20:27:57] (PS1) John F. 
Lewis: apache: remove shop.wp.o funnel + shop.wm.o [puppet] - https://gerrit.wikimedia.org/r/196008 (https://phabricator.wikimedia.org/T92438) [20:28:15] robh: yes [20:28:57] robh: IMHO a puppet committer should always be responsible if things go wrong [20:29:26] gwicke: why are you pushing robh so hard? [20:29:39] operations: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1110859 (JohnLewis) a:JohnLewis I've submitted the patches for DNS and apache changes. https://gerrit.wikimedia.org/r/196007 and https://gerrit.wikimedia.org/r/196008 [20:29:43] he's been told to not merge stuff he doesn't understand, like everyone has, and there's nothing wrong with that [20:30:13] _joe_ is doing most if not all of the mediawiki puppet module reviews these days and he's not even listed as a reviewer [20:30:21] paravoid: I don't think there is anything wrong with that policy per se [20:30:56] then what? [20:31:54] I can review what is purely a comparison to parsoid and syntax/formatting, and that is not nothing, but its lacking an in depth and detailed understanding of the infrastructure involved [20:32:07] which makes me wonder if its not worth just waiting ofr someone with subject expertise [20:32:16] like Aaron [20:32:24] oh come on [20:32:42] so wait for the appropriate reviewer to be awake and have time? this doesn't seem like an emergency [20:32:54] would you feel comfortable with Aaron merging this? [20:32:55] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:33:04] (CR) Yuvipanda: "This never actually got 'merged' and baby-sat. @Alex?" 
[puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:33:06] you clearly have a differing opinion on how things should operate, you shouldn't raise it on every possible occurence [20:33:15] aaron does not have +2 in ops/puppet and has not requested it [20:33:17] I also offered to start asking around ops if anyone had time to look at this now, since you asked me to. (I'm really not trying to be a blocker, I promise.) [20:33:30] *if* he requests it, we can have a reasonable conversation about this [20:34:01] robh: I appreciate the effort, thanks [20:34:22] we'd like to ramp up the updates to other wikis, but I'd hate to swamp the default queue [20:34:24] (CR) Dzahn: "@Yuvi different expectations on the meaning of +2" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:34:27] aaron isn't everyone obviously, but I can tell you for sure that we are not going to handout +2s in ops/puppet to everyone [20:35:15] ops/puppet is special. it's very easy to break a whole lot of things very very quickly and efficiently. [20:35:16] (CR) Yuvipanda: "I expect anyone who +2s to actually merge and babysit. Is that not correct?" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:35:33] paravoid: indeed, but that's a strawman you are arguing against there [20:35:51] gwicke: i can try to catch _joe_ tomorrow during the day here in europe for this particular case [20:36:06] or just add him as a reviewer for starters? 
[20:36:12] he's not listed as one in the patchset [20:36:22] I've also been waking up early on my ops clinic week to touch base via pm with blockers on tasks when i dont see movement [20:36:30] well, I wasn't aware of the new policy of _joe_ being the sole puppet reviewer now [20:36:38] so if this had a phab task with ops patch for review, it would automatically land on my radar [20:36:38] he is not the sole puppet reviewer [20:36:44] added him now [20:36:52] he's the one who's being doing most of the mediawiki work, though [20:37:06] that's fairly well known, esp. to someone who attends ops meetings [20:37:10] (CR) Dzahn: "The other interpretation is more literal that it means "approved to be merged" +2 Looks good to me, approved, vs. somebody else must appro" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:37:53] (CR) Yuvipanda: "So who is supposed to submit / babysit them instead?" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:38:35] it's fine to forget, just don't make a big deal out of it? [20:39:08] (CR) Dzahn: "the owner of the patch (i'm trying to describe how people use it differently, not even an opinion)" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:39:13] (CR) Yuvipanda: "I don't think +2 in that 'literal' sense works for ops/puppet. I wonder if Alex was just waiting for jenkins and forgot :)" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:39:23] (PS1) Nuria: Adding a Last-Access cookie to text and mobile requests [puppet] - https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) [20:39:31] YuviPanda/mutante: there's IRC too you know :P [20:39:43] (CR) Yuvipanda: "Does anyone use it like that? 
People usually use +1 for that" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:39:51] (CR) Dzahn: "no, i don't think he forgot, i got quite a few +2 without submit reviews from him and then submitted them" [puppet] - https://gerrit.wikimedia.org/r/195913 (owner: Faidon Liambotis) [20:39:54] (CR) Southparkfan: [C: 1] "Looks good." [dns] - https://gerrit.wikimedia.org/r/196007 (https://phabricator.wikimedia.org/T92438) (owner: John F. Lewis) [20:40:01] paravoid: but Gerrit is cool-er :p [20:40:22] (CR) Nuria: "Feedback welcome, i am sure performance wise many things can be improved." [puppet] - https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: Nuria) [20:40:33] paravoid: if we spam gerrit we spam IRC too, but not vice versa... [20:40:33] MAXIMUM SPAMMING [20:40:54] yes, with the added bonus that grrrit-wm appends my name on everyone of this comments [20:40:54] paravoid: I'm just trying to get a patch merged, and do the prerequisite lobbying [20:40:54] which raises my irc client :) [20:41:14] paravoid: I did say maximum spamming :P [20:41:14] YuviPanda: more maximum spamming: submit patches per comment and associate phab tickets ;) [20:41:14] IRC will be gone tomorrow, the gerrit comment is in the place it talks about [20:41:14] heh [20:41:28] Gerrit too will be gone some day... [20:41:34] so will we be all [20:41:36] (CR) MarkTraceur: [C: -2] "Issues remain in UploadWizard that I think block this patch. See Phabricator ticked (T88918) for more information." [mediawiki-config] - https://gerrit.wikimedia.org/r/190744 (https://phabricator.wikimedia.org/T88918) (owner: Gerardduenas) [20:41:44] only because we refuse to import comments [20:42:08] we will all be gone some day because we refuse to import comments? 
[20:42:08] :) [20:42:15] nuria: nice work on that patch :) [20:42:15] !log twentyafterfour Finished scap: testwiki to php-1.25wmf21 and rebuild l10n cache (duration: 20m 59s) [20:42:21] Logged the message, Master [20:42:28] gwicke: you have a different opinion on how infrastructure-related changes should happen and you keep finding/raising issues about this all the time [20:42:35] on random tasks, meetings & gerrit patchsets [20:42:45] and this has got to stop, really [20:42:45] bblack: thanks man, boy, is VCL dry.... [20:43:09] paravoid: I'm not trying to have a general discussion here [20:43:13] mobrovac: go ahead and sync files, and let me know when you're finished so I can do the last steps of my deployment [20:43:17] bblack: will tests on beta labs once i incorporate your suggestions [20:43:44] !log mobrovac Synchronized wmf-config/InitialiseSettings.php: Activate the RESTBase Virtual REST Service on test.wp (duration: 00m 07s) [20:43:56] Logged the message, Master [20:43:56] (CR) BBlack: "re appending to the cookie, we have a vmod loaded to handle this, there's an example of using it for the Set-Cookie case in our geoip code" [puppet] - https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: Nuria) [20:43:56] and honestly, by being so persistent you push everyone into the defensive and achieving the exact opposite result [20:44:16] !log mobrovac Synchronized wmf-config/CommonSettings.php: Activate the RESTBase Virtual REST Service on test.wp (duration: 00m 06s) [20:44:22] Logged the message, Master [20:44:22] (Abandoned) Yuvipanda: ssh: Allow keys in /etc/ssh/userkeys for prod as well [puppet] - https://gerrit.wikimedia.org/r/196001 (owner: Yuvipanda) [20:44:37] anyway, I shouldn't be making this conversation here/now [20:44:37] twentyafterfour: ok, done, cheers [20:44:37] I'm going to check out :) [20:44:51] :) [20:44:57] paravoid: goodnight! 
[20:46:11] paravoid: And don't worry too much; I actually think that things are working pretty well overall. [20:46:41] (03CR) 10Yuvipanda: [C: 031] ssh: introduce ssh::userkey resource [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [20:46:52] paravoid: I’m up for slowly deploying the userkey stuff tomorrow. [20:46:57] * YuviPanda reviews the series now [20:47:17] (03CR) 10Yuvipanda: [C: 031] ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (owner: 10Faidon Liambotis) [20:47:20] 6operations: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1110944 (10Dzahn) Usually i would argue that only wikis should be in wikipedia.org and other services should be in wikimedia.org. This is different though because we already have shop.wikipedia.org and store.wikipedi... [20:48:02] (03CR) 10Yuvipanda: [C: 031] ssh: change userkeys' path hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/183816 (owner: 10Faidon Liambotis) [20:48:36] (03CR) 10Yuvipanda: [C: 031] ssh: support /etc/ssh/userkeys in production too [puppet] - 10https://gerrit.wikimedia.org/r/183817 (owner: 10Faidon Liambotis) [20:49:09] (03CR) 10Yuvipanda: [C: 031] reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (owner: 10Faidon Liambotis) [20:49:33] (03CR) 10BBlack: "I don't have time to dive deep on reviewing this just yet, but my other thought is that we should probably avoid the Set-Cookie code at al" [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [20:49:35] (03CR) 10Dzahn: [C: 032] repool cp1061 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196006 (owner: 10Dzahn) [20:50:11] (03CR) 10Yuvipanda: openstack: transition nova to ssh::userkey (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183819 (owner: 10Faidon Liambotis) [20:51:01] (03CR) 10Yuvipanda: [C: 031] "<3" 
[puppet] - 10https://gerrit.wikimedia.org/r/183820 (owner: 10Faidon Liambotis) [20:51:36] (03CR) 10Yuvipanda: authdns: transition to ssh::userkey (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183821 (owner: 10Faidon Liambotis) [20:51:58] (03CR) 10Yuvipanda: [C: 031] puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (owner: 10Faidon Liambotis) [20:52:42] !log cp1061 repooled in pybal [20:52:48] Logged the message, Master [20:54:10] (03CR) 10Yuvipanda: [C: 031] "This we should do most carefully, of course. But root logins are still enabled, so maybe not that big of a catastrophe." [puppet] - 10https://gerrit.wikimedia.org/r/183823 (owner: 10Faidon Liambotis) [20:57:23] 7Puppet, 6Labs: Puppet Trebuchet provider compares refname with commit sha1 and does NOT refresh the git repo! - https://phabricator.wikimedia.org/T77002#1111012 (10chasemp) p:5High>3Normal [20:59:38] (03PS9) 10Legoktm: mediawiki: add configs to support the Dallas DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194830 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [20:59:59] (03PS2) 10Yuvipanda: ssh: introduce ssh::userkey resource [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [21:02:02] !log twentyafterfour Started scap: Sync security patches [21:02:08] Logged the message, Master [21:02:56] (03CR) 10Dzahn: "http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1061.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1426023366&g=" [puppet] - 10https://gerrit.wikimedia.org/r/196006 (owner: 10Dzahn) [21:03:53] 6operations, 6Phabricator, 7Mail: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#1111066 (10chasemp) @faidon can you comment on this? 
[21:04:14] (03PS1) 10Dzahn: depool cp1052 for reinstall (text) [puppet] - 10https://gerrit.wikimedia.org/r/196022 [21:05:55] (03CR) 10Dzahn: [C: 032] depool cp1052 for reinstall (text) [puppet] - 10https://gerrit.wikimedia.org/r/196022 (owner: 10Dzahn) [21:08:01] 6operations, 6WMF-Legal, 10Wikimedia-General-or-Unknown, 7Documentation: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#1111076 (10chasemp) In regards to >>! In T67270#1008668, @chasemp wrote: > This kind of stalled out here. > > Am I right in thinking that this would... [21:11:27] twentyafterfour: looks like you're still mid-deploy? we can push back our 1:1, or maybe even move it to a non-deploy day for you [21:12:10] greg-g: well I'm almost done with deploy. but moving it to monday or tuesday would be good in the future [21:12:20] 6operations, 10Deployment-Systems, 7Graphite: [scap] Deploy events aren't showing up in graphite/gdash - https://phabricator.wikimedia.org/T64667#1111094 (10chasemp) p:5High>3Normal reducing priority to reflect the obvious back burner status [21:12:42] (03PS1) 10BBlack: remove more old cp* public subnet DNS [dns] - 10https://gerrit.wikimedia.org/r/196027 [21:12:44] andrewbogott_afk: btw, I’m going to work on https://phabricator.wikimedia.org/T85279 tomorrow (hopefully, once a bunch of other stuff gets resolved) [21:12:58] twentyafterfour: /me nods [21:13:53] (03PS1) 10Dzahn: LVS: add text and bits for codfw [puppet] - 10https://gerrit.wikimedia.org/r/196036 (https://phabricator.wikimedia.org/T92377) [21:15:07] (03PS2) 10Dzahn: LVS: add text and bits for codfw [puppet] - 10https://gerrit.wikimedia.org/r/196036 (https://phabricator.wikimedia.org/T92377) [21:16:58] (03CR) 10BBlack: [C: 032] remove more old cp* public subnet DNS [dns] - 10https://gerrit.wikimedia.org/r/196027 (owner: 10BBlack) [21:18:17] !log twentyafterfour Finished scap: Sync security patches (duration: 16m 14s) [21:18:22] Logged the message, Master [21:20:02] 6operations: Cannot 
use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1111110 (10yuvipanda) (a salt upgrade is in the works atm, thanks to @ArielGlenn) [21:20:21] 6operations, 6Phabricator, 7Mail: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#1111112 (10faidon) Phabricator is almost certainly generating its own Message-IDs, I don't think it's an Exim issue. Can you troubleshoot this on the Phab side first? [21:24:45] (03PS1) 10Dzahn: LVS: add api,apaches and rendering for codfw [puppet] - 10https://gerrit.wikimedia.org/r/196067 (https://phabricator.wikimedia.org/T92377) [21:25:50] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/196067 for LVS config using these IPs" [dns] - 10https://gerrit.wikimedia.org/r/195887 (https://phabricator.wikimedia.org/T92377) (owner: 10Giuseppe Lavagetto) [21:27:51] (03CR) 1020after4: [C: 032] Wikipedias to 1.25wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196004 (owner: 1020after4) [21:27:58] (03Merged) 10jenkins-bot: Wikipedias to 1.25wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196004 (owner: 1020after4) [21:29:08] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf20 [21:29:16] Logged the message, Master [21:29:24] (03CR) 1020after4: [C: 032] Group0 to 1.25wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196005 (owner: 1020after4) [21:29:31] (03Merged) 10jenkins-bot: Group0 to 1.25wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196005 (owner: 1020after4) [21:29:58] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf21 [21:30:02] Logged the message, Master [21:30:49] !log twentyafterfour Purged l10n cache for 1.25wmf19 [21:30:54] Logged the message, Master [21:31:08] greg-g: give me one more minute ... 
[21:31:13] twentyafterfour: no worries [21:32:54] (03PS1) 10EBernhardson: Enable Flow post editing for autoconfirmed users on Mediawiki, English, Russian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196068 (https://phabricator.wikimedia.org/T90670) [21:33:06] 6operations, 6Phabricator, 7Mail: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#1111158 (10chasemp) a:3chasemp >>! In T75713#1111112, @faidon wrote: > Phabricator is almost certainly generating its own Message-IDs, I don't think it's an Exim issue. Can yo... [21:34:00] (03PS1) 10Dzahn: add loadbalancer service records for codfw [dns] - 10https://gerrit.wikimedia.org/r/196069 (https://phabricator.wikimedia.org/T92377) [21:34:09] (03CR) 10jenkins-bot: [V: 04-1] add loadbalancer service records for codfw [dns] - 10https://gerrit.wikimedia.org/r/196069 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:36:18] (03CR) 10Dzahn: "error: Name 'text-lb.codfw.wikimedia.org.': resolver plugin 'geoip' rejected resource name 'text-addrs/codfw'. this is config-geo , right?" 
[dns] - 10https://gerrit.wikimedia.org/r/196069 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:38:24] (03PS1) 10BBlack: rename/tag cp3019+cp3020 [puppet] - 10https://gerrit.wikimedia.org/r/196071 [21:38:26] (03PS1) 10BBlack: more general s/wikimedia.org/wmnet/ fixups for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/196072 [21:38:39] (03CR) 10BBlack: [C: 032 V: 032] rename/tag cp3019+cp3020 [puppet] - 10https://gerrit.wikimedia.org/r/196071 (owner: 10BBlack) [21:39:27] (03CR) 10jenkins-bot: [V: 04-1] more general s/wikimedia.org/wmnet/ fixups for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/196072 (owner: 10BBlack) [21:40:08] (03PS2) 10BBlack: more general s/wikimedia.org/wmnet/ fixups for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/196072 [21:40:37] !log finished train deployment [21:40:44] Logged the message, Master [21:41:21] (03CR) 10John F. Lewis: "@dzahn yeah there needs to be a 'codfw' line for the relevant config-geo clauses (as with eqiad, ulsfo and esams)" [dns] - 10https://gerrit.wikimedia.org/r/196069 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:41:36] (03CR) 10Hashar: "Joe pointed out that arguments to ensure should probably be unquoted: https://phabricator.wikimedia.org/T91908#1109218" [puppet] - 10https://gerrit.wikimedia.org/r/195769 (owner: 10Matanya) [21:42:28] (03CR) 10BBlack: [C: 032] more general s/wikimedia.org/wmnet/ fixups for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/196072 (owner: 10BBlack) [21:45:07] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1111238 (10hashar) I tend to like unquoted parameters to ensure =>, or to generalized what joe said about not quoting ruby barewords. For boolean, I remember at least one occurre... 
[21:48:53] (03PS1) 10Dzahn: use 208.80.153.224 for text-lb.codfw.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/196075 (https://phabricator.wikimedia.org/T92377) [21:49:03] (03CR) 10jenkins-bot: [V: 04-1] use 208.80.153.224 for text-lb.codfw.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/196075 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:50:23] (03PS2) 10Dzahn: use 208.80.153.224 for text-lb.codfw.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/196075 (https://phabricator.wikimedia.org/T92377) [21:57:10] (03PS1) 10Dzahn: config-geo: add text-addrs v4 and v6 for codfw [dns] - 10https://gerrit.wikimedia.org/r/196076 (https://phabricator.wikimedia.org/T92377) [21:57:18] (03CR) 10jenkins-bot: [V: 04-1] config-geo: add text-addrs v4 and v6 for codfw [dns] - 10https://gerrit.wikimedia.org/r/196076 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:58:11] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/196076/1" [dns] - 10https://gerrit.wikimedia.org/r/196069 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:58:39] (03CR) 10Dzahn: "fatal: plugin_geoip: resource 'text-addrs': the dcmap does not match the datacenters list" [dns] - 10https://gerrit.wikimedia.org/r/196076 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:59:27] (03CR) 10Ori.livneh: ssh: introduce ssh::userkey resource (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [21:59:33] (03PS2) 10Dzahn: config-geo: add codfw with text-addrs v4 and v6 [dns] - 10https://gerrit.wikimedia.org/r/196076 (https://phabricator.wikimedia.org/T92377) [21:59:42] (03CR) 10jenkins-bot: [V: 04-1] config-geo: add codfw with text-addrs v4 and v6 [dns] - 10https://gerrit.wikimedia.org/r/196076 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [22:01:57] mutante: now you have the opposite problem, all of them need definitions for it [22:02:28] (e.g. 
bits, etc) [22:03:22] bblack: heh, i just noticed yea. the order of things here seems the tricky part, not adding a new DC a lot :) [22:11:16] !log cp1052 - comment in pybal, reinstalling [22:11:21] Logged the message, Master [22:19:55] (03PS1) 10Dzahn: repool cp1052 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196082 [22:21:13] (03PS1) 10Kaldari: Turning on WikiGrokDebug config var for English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196083 [22:28:10] 6operations, 10ops-codfw: install cable covers in enclosure's sidewalls - https://phabricator.wikimedia.org/T84072#1111376 (10RobH) p:5Normal>3Lowest [22:29:26] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1111389 (10Fjalapeno) I was going to pick this up but its not clear what the actual issue is from the ticket description. @... [22:37:11] 10Ops-Access-Requests, 6operations, 10MediaWiki-extensions-ContentTranslation, 3LE-Sprint-84, 5Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1111396 (10RobH) @nemo_bis: point taken I've added the following to scope: This document's scope applies t... [22:49:02] (03PS3) 10Mobrovac: Puppetise Citoid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) [22:49:12] (03CR) 10Dzahn: [C: 032] repool cp1052 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196082 (owner: 10Dzahn) [22:59:01] 6operations, 3HTTPS-by-default, 5Patch-For-Review: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1111435 (10Dzahn) cp1053 cp1061 cp1052 cp1054 cp1057 cp1056 done. 
i stopped adding the bug number to commit messages because it would be too spammy [23:00:04] RoanKattouw, ^d, Krenair, tgr: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150311T2300). Please do the needful. [23:04:53] (03CR) 10Dzahn: [C: 032] "correct, fixes "optional parameter listed before required parameter"" [puppet] - 10https://gerrit.wikimedia.org/r/195536 (owner: 10Matanya) [23:11:06] !log reinstalling rdb2001 [23:11:13] Logged the message, Master [23:12:09] RoanKattouw, ^d, tgr: who's doing that? [23:12:29] And where is jamesofur? [23:12:35] Oh ahm [23:12:36] here [23:12:38] Maybe not me? [23:12:51] * RoanKattouw is feeling very sleepy [23:13:18] I wonder how jouncebot decides which people to mention... [23:13:42] I just assumed it was programmed to mention people who had volunteered for that time slot of SWAT deploys [23:14:05] It also mentioned one person with a patch, but not all [23:14:05] I think it looks at who has patches in too right? [23:14:10] ah ok [23:14:15] SWAT volunteers plus the first patch owner, I think [23:14:20] first patch owner. what. [23:14:27] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1111463 (10dr0ptp4kt) I'll update the ticket description. [23:14:29] * jamesofur rolls his eyes a bit [23:14:32] oh well, something to fix later I guess [23:15:15] tgr, okay so your requested patch is just fixing a maintenance script... [23:15:18] are you going to do that? [23:15:55] I can do that, sure [23:16:24] after the others are done [23:16:40] kaldari, have you made submodule updates? [23:17:33] hello [23:17:37] It was supposed to mention all the patch owners but I think the xpath expression that it uses to find them in the html is easily confused [23:18:16] Krenair: no, do you want me to? 
I usually let the SWAT deployer do that in case there are any security or other cherry-picks to the repos I don’t know about. [23:18:41] Krenair: It just looks at the schedule, is all [23:18:49] It pinging patch owners is a cool feature request actually [23:19:27] Or I guess it may already be a feature that just doesn't work :) [23:19:31] rebasing over security updates happens on tin, I don't think it's relevant to the public submodule update commits [23:19:42] For the record, submodule updates are not affected by security patches because... yeah what Krenair said [23:19:49] See, I need sleep, I'm late to everything :D [23:20:11] (03CR) 10Alex Monk: [C: 032] Turning on WikiGrokDebug config var for English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196083 (owner: 10Kaldari) [23:20:17] (03Merged) 10jenkins-bot: Turning on WikiGrokDebug config var for English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196083 (owner: 10Kaldari) [23:20:23] kaldari, yes please [23:20:53] Krenair: NP. one min.... [23:21:38] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/196083/ (duration: 00m 07s) [23:21:44] Logged the message, Master [23:21:56] * ebernhardson looks for tests in the jouncebot code that proves these xpaths do anything appropriate...but no :P [23:23:02] Krenair: My deployment repo is fresh, so it’s going to take a long time to do the submodule update. [23:23:35] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1111532 (10dr0ptp4kt) [23:23:47] Sorry about that [23:23:56] RoanKattouw: g'night :) [23:24:01] okay [23:24:34] kaldari, you doing the wmf21 patches? [23:24:45] yes, doing wmf21 right now [23:24:48] will do wmf20 [23:24:51] thanks! 
[23:28:31] (03PS1) 10Dzahn: fix rdb codfw hostnames in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/196104 (https://phabricator.wikimedia.org/T86887) [23:28:57] Sorry... my computer kernel panic'd ... [23:29:19] will do jamesofur's patch next [23:29:28] (03CR) 10Dzahn: [C: 032] fix rdb codfw hostnames in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/196104 (https://phabricator.wikimedia.org/T86887) (owner: 10Dzahn) [23:29:33] Thanks Krenair [23:30:06] (03CR) 10Alex Monk: [C: 032] Disable anonymous page creation on swWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195197 (https://phabricator.wikimedia.org/T44894) (owner: 10Jalexander) [23:31:40] of course jenkins is going to put it in the queue to wait until after unrelated things are done :/ [23:32:04] why do our wmf deployment branches still run zend phpunit tests? [23:33:43] kaldari, how's it going? it's the git submodule update that takes a long time, right? [23:33:52] Krenair: I’m down to V now [23:34:01] unfortunately, W is a long one [23:35:46] Krenair: why do we do core tests for a config change? 
[23:36:02] jamesofur, we don't [23:36:05] (03Merged) 10jenkins-bot: Disable anonymous page creation on swWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195197 (https://phabricator.wikimedia.org/T44894) (owner: 10Jalexander) [23:36:16] but your config change is in the queue behind a core change [23:36:16] oh, it was just in queue [23:36:18] * jamesofur nods [23:36:27] yeah, didn't notice the queue nature right away [23:37:19] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/195197/3 (duration: 00m 06s) [23:37:23] jamesofur, please check ^ [23:37:26] Logged the message, Master [23:37:27] (03CR) 10Dzahn: [C: 032] "correct, fixes 3 x "optional parameter listed before required parameter"" [puppet] - 10https://gerrit.wikimedia.org/r/195531 (owner: 10Matanya) [23:37:41] Krenair: verified [23:37:50] thank you [23:38:32] (03CR) 10Dzahn: "rdb2001 gets a DHCP ACK but still does not boot into an installer afterwards" [puppet] - 10https://gerrit.wikimedia.org/r/196104 (https://phabricator.wikimedia.org/T86887) (owner: 10Dzahn) [23:39:33] Krenair: done: https://gerrit.wikimedia.org/r/#/c/196106/ [23:39:51] Krenair: because we run zend on the cluster! [23:41:00] legoktm, on... silver? where else? [23:41:02] !log krenair Synchronized php-1.25wmf20/extensions/WikiGrok/includes/Hooks.php: https://gerrit.wikimedia.org/r/#/c/196103/ (duration: 00m 08s) [23:41:07] Logged the message, Master [23:41:09] kaldari, ^ [23:41:09] Krenair: terbium, tin, etc. [23:41:22] still terbium and tin? sigh [23:41:29] oh oh [23:41:42] Krenair: looks like it needs a revert [23:41:53] (03CR) 10Dzahn: "Mar 11 23:37:29 carbon dhcpd: DHCPREQUEST for 10.192.0.119 (208.80.154.10) from b0:83:fe:e4:6a:74 via 10.192.0.3" [puppet] - 10https://gerrit.wikimedia.org/r/196104 (https://phabricator.wikimedia.org/T86887) (owner: 10Dzahn) [23:41:57] https://en.wikipedia.org/wiki/Main_Page [23:42:09] any known issue with en wiki? 
[23:42:10] * jamesofur facepalms [23:42:13] i get Exception encountered, of type "BadMethodCallException" [23:42:31] !log krenair Synchronized php-1.25wmf20/extensions/WikiGrok/includes/Hooks.php: revert (duration: 00m 05s) [23:42:37] Logged the message, Master [23:42:43] tis back [23:42:43] now fixed [23:42:47] Krenair: guess we should do wmf21 first :P [23:43:13] What just happened? [23:43:26] kaldari, let's clean up wmf20 first :) [23:43:27] Krenair: OK, I guess just hold off on my commits and only do the config change for now [23:43:30] 6operations, 10Wikimedia-Blog: Delete stat1002:/a/squid/archive/blog - https://phabricator.wikimedia.org/T92331#1111628 (10kevinator) I can't tell if this was an attempt to plug the blog into webstatscollector. However, the dates are about right: here are only 5 files dated early February 2013. Also seems the... [23:43:43] Krenair: Taking a look now… [23:43:44] I checked out the previous version of WikiGrok on tin and synced it quickly [23:44:27] 6operations, 6Phabricator, 7Mail: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#1111636 (10chasemp) There is a directive called `metamta.domain` in the configuration that is set to: `phabricator.wikimedia.org`. This does cause new thread emails to get a co... [23:45:18] Krenair: I see the problem.... [23:46:32] kaldari, oohhh... [23:46:43] That missing $out in Hooks::isUIEnabled? [23:46:45] $out not defined :P [23:46:51] * Krenair rolls eyes [23:47:41] wonder how that got past Jenkins [23:48:11] kaldari, I wonder how it got past your local testing :P [23:48:28] Krenair: yeah, need to not make last minute changes :( [23:49:40] kaldari: congrtz, you deserve a shirt from bd808 ! 
:) [23:51:41] (03PS3) 10Ori.livneh: Added the jobchron daemon that complements jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/195337 (owner: 10Aaron Schulz) [23:51:50] (03CR) 10Dzahn: "actually, on second attempt and after restarting atftp i did see the installer pop up but then nothing.. partman?" [puppet] - 10https://gerrit.wikimedia.org/r/196104 (https://phabricator.wikimedia.org/T86887) (owner: 10Dzahn) [23:52:36] (03PS4) 10Aaron Schulz: Added the jobchron daemon that complements jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/195337 [23:53:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Added the jobchron daemon that complements jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/195337 (owner: 10Aaron Schulz) [23:54:52] (03PS1) 10BBlack: tag amssq43-47 as Jessie [puppet] - 10https://gerrit.wikimedia.org/r/196115 [23:55:07] (03CR) 10BBlack: [C: 032 V: 032] tag amssq43-47 as Jessie [puppet] - 10https://gerrit.wikimedia.org/r/196115 (owner: 10BBlack) [23:55:41] ugh, one can never get distracted [23:56:07] Nemo_bis? [23:59:19] !log powercycling rbf2001, attempt reinstall (wrong IP?) [23:59:26] Logged the message, Master