[00:03:54] I'd still like to deploy https://gerrit.wikimedia.org/r/#/c/202218/ (simple debug logging to assess whether I need to partly revert an as-of-yet un-deployed patch or not)
[00:04:01] but have to wait for jenkins...
[00:05:36] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[00:06:16] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[00:06:27] (PS1) Yuvipanda: Clear manifests before every collect [software/tools-manifest] - https://gerrit.wikimedia.org/r/202291 (https://phabricator.wikimedia.org/T95210)
[00:06:30] legoktm: ^
[00:07:07] YuviPanda: looks sane
[00:07:10] (CR) Dzahn: [C: 2] bugzilla: use https by default for static [puppet] - https://gerrit.wikimedia.org/r/201964 (owner: John F. Lewis)
[00:07:14] legoktm: you have +2 there
[00:07:32] YuviPanda: nope, just -/+1
[00:07:45] legoktm: oh, try now?
[00:07:52] operations/software/tools-manifest? I would be surprised if legoktm had +2 there :)
[00:08:04] (CR) Legoktm: [C: 2] Clear manifests before every collect [software/tools-manifest] - https://gerrit.wikimedia.org/r/202291 (https://phabricator.wikimedia.org/T95210) (owner: Yuvipanda)
[00:08:05] not badly
[00:08:05] :o
[00:08:08] but, nonetheless
[00:08:15] Krenair, just to confirm, you're not going to do the MessagePoster tonight, right?
[00:08:17] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:08:20] YuviPanda: is it ok for me to have +2 in an operations/ repo?
[00:08:29] superm401, wasn't planning to after you said you weren't
[00:08:38] legoktm, operations/mediawiki-config :)
[00:08:40] YuviPanda: also, jenkins-bot needs submit permissions
[00:08:53] legoktm: Krenair there’s a ‘toollabs-trusted’ group I just created.
[00:08:56] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:08:57] That's what I thought, just checking.
[00:08:58] has legoktm and scfc
[00:09:01] Krenair: oh right :P
[00:09:17] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:09:40] YuviPanda, not visible?
[00:09:44] legoktm: jenkinsbot should have it...
[00:09:46] RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[00:09:47] https://gerrit.wikimedia.org/r/#/admin/groups/uuid-a5a048979af724462c3eac0c2a5b8f226c1f63ed - "The page you requested was not found, or you do not have permission to view this page."
[00:09:51] hidden ACLs are bad...
[00:10:08] Krenair: try again?
[00:10:11] Krenair: I had no idea it was hidden
[00:10:15] YuviPanda, thanks :)
[00:10:16] apparently you’ve to explicitly make them public
[00:10:22] I wonder how many groups we have like that
[00:10:37] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:10:40] added a description too
[00:10:49] how do you create groups anyway?
[00:11:02] isn't that restricted to https://gerrit.wikimedia.org/r/#/admin/groups/119,members ?
[00:11:02] Krenair: I have a ‘create new group’ link
[00:11:04] under people
[00:11:05] probably
[00:11:12] (CR) Legoktm: [C: 2] Clear manifests before every collect [software/tools-manifest] - https://gerrit.wikimedia.org/r/202291 (https://phabricator.wikimedia.org/T95210) (owner: Yuvipanda)
[00:11:23] but you're not there...
hmm
[00:11:25] (Merged) jenkins-bot: Clear manifests before every collect [software/tools-manifest] - https://gerrit.wikimedia.org/r/202291 (https://phabricator.wikimedia.org/T95210) (owner: Yuvipanda)
[00:11:32] Krenair: included groups in there is administrators, and included in there is ldap/ops
[00:11:36] and I’m in ldap/ops
[00:11:46] I don't think ldap/ops is included in gerrit administrators
[00:11:54] aha: https://gerrit.wikimedia.org/r/#/admin/projects/All-Projects,access
[00:11:54] it is
[00:11:54] https://gerrit.wikimedia.org/r/#/admin/groups/1,members
[00:12:07] it’s just chained away
[00:12:18] YuviPanda, ldap/ops is not listed there...
[00:12:25] really? i see it
[00:12:30] ldap groups aren't listed unless you're in them
[00:12:41] awww gerrit
[00:12:42] ldap/ops is included alongside Project and Group Creators in the All-Projects list for Create Group
[00:12:43] err
[00:12:51] legoktm, whaaaat
[00:12:53] really?
[00:16:56] Krenair: he just left the office
[00:16:58] btw
[00:17:24] ebernhardson or legoktm?
[00:17:30] Krenair: lego
[00:17:32] ah
[00:18:07] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:21:06] Krenair: but: yes, all opsen can create gerrit repos
[00:21:41] operations, Phabricator, Wikimedia-Bugzilla: Bugzilla HTML static version and database dump - https://phabricator.wikimedia.org/T1198#1184431 (JohnLewis)
[00:28:06] but they should request them on wiki
[00:31:54] opsen can do a *LOT* of things they should probably never do :)
[00:32:22] sometimes the applications make it far too tempting
[00:32:32] the console is more scary
[00:34:27] !log haedus/capella: disabling puppet.
reclaim
[00:34:32] (PS1) Yuvipanda: Introduce Tool objects [software/tools-manifest] - https://gerrit.wikimedia.org/r/202293
[00:34:34] Logged the message, Master
[00:36:24] !log krenair Synchronized php-1.25wmf24/includes/Title.php: debug logging - https://gerrit.wikimedia.org/r/#/c/202290/ (duration: 00m 15s)
[00:36:27] Logged the message, Master
[00:38:19] (PS2) Yuvipanda: Introduce Tool objects [software/tools-manifest] - https://gerrit.wikimedia.org/r/202293
[00:38:36] (CR) jenkins-bot: [V: -1] Introduce Tool objects [software/tools-manifest] - https://gerrit.wikimedia.org/r/202293 (owner: Yuvipanda)
[00:39:19] (PS3) Yuvipanda: Introduce Tool objects [software/tools-manifest] - https://gerrit.wikimedia.org/r/202293
[00:39:51] !log krenair Synchronized php-1.25wmf23/includes/Title.php: debug logging - https://gerrit.wikimedia.org/r/#/c/202218/ (duration: 00m 11s)
[00:39:56] Logged the message, Master
[00:40:02] Krinkle, ^ fyi
[00:40:08] it'll be in fluorine:/a/mw-log/AdHocDebug.log
[00:40:40] !log radon: revoke salt key, puppet cert
[00:40:41] (PS4) Yuvipanda: Introduce Tool objects [software/tools-manifest] - https://gerrit.wikimedia.org/r/202293
[00:40:44] Krenair: OK
[00:40:46] Logged the message, Master
[00:41:49] Krenair: Lots of data :)
[00:42:07] yeah :S
[00:42:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds
[00:42:19] haha oh dear
[00:43:16] Krinkle, {"file":"\/srv\/mediawiki\/php-1.25wmf23\/includes\/MediaWiki.php","line":539,"function":"newFromText","class":"Title","type":"::","args":[0,"REDIR"]},{"file":"\/srv\/mediawiki\/php-1.25wmf23\/includes\/MediaWiki.php","line":422,"function":"main","class":"MediaWiki","object":{},"type":"->","args":[]}
[00:44:09] Krinkle, https://github.com/wikimedia/mediawiki/blob/master/includes/MediaWiki.php#L532 :)
[00:44:29] xD
[00:44:37] seriously... who wrote that.
[00:44:54] and managed to get it approved into the MediaWiki class.
[00:45:31] There's an answer to that question
[00:45:45] https://gerrit.wikimedia.org/r/#/c/88203/
[00:46:04] Ah, well, not exactly
[00:46:40] nah it was the same problem before
[00:46:51] https://gerrit.wikimedia.org/r/#/c/24026/
[00:47:29] well, it *is* a dummy title
[00:47:36] yeah but...
[00:47:41] parameters, the wrong way around
[00:47:48] for Title::newFromText
[00:47:53] very lucky that worked at all
[00:48:06] Yeah, but who's to say he didn't mean [[Redir:]] as dummy title :P
[00:48:27] (PS1) Dzahn: remove radon, keep mgmt [dns] - https://gerrit.wikimedia.org/r/202296 (https://phabricator.wikimedia.org/T88818)
[00:48:57] I think the NS_ constant (which is just 0) as the first parameter shows that
[00:50:13] I don't think the NS parameter can be an arbitrary string anyway, can it?
[00:50:14] (PS2) Dzahn: remove radon, keep mgmt [dns] - https://gerrit.wikimedia.org/r/202296 (https://phabricator.wikimedia.org/T88818)
[00:50:24] (CR) Dzahn: [C: 2] remove radon, keep mgmt [dns] - https://gerrit.wikimedia.org/r/202296 (https://phabricator.wikimedia.org/T88818) (owner: Dzahn)
[00:51:01] Krenair: Kind of, it's filtered away
[00:51:06] becomes [[0]]
[00:51:14] Not Redir:0
[00:51:29] (Restored) Dzahn: remove haedus/capella, decom [dns] - https://gerrit.wikimedia.org/r/201397 (https://phabricator.wikimedia.org/T94474) (owner: Dzahn)
[00:51:47] krenair@fluorine:/a/mw-log$ cat AdHocDebug.log | grep -v appendRightsInfo | grep -v '0,"REDIR"' -c
[00:51:47] 10
[00:51:54] intval('text') == 0
[00:51:54] so basically 2 main ones
[00:51:56] one is almost fixed on master
[00:51:57] Not 1 like in javascript
[00:52:25] so REDIR casts to NS_MAIN oddly enough
[00:52:41] Wouldn't parseInt on 'text' be NaN?
[00:52:48] >>> parseInt( 'text' )
[00:52:49] Krenair: (number) NaN
[00:52:50] parseInt yes
[00:52:53] >> +'foo'
[00:52:54] Krinkle: (number) NaN
[00:53:08] Hm..
[00:53:13] :P
[00:53:40] >> +!!'foo'
[00:53:41] Krenair: (number) 1
[00:53:45] Right
[00:53:48] booleans
[00:54:00] >> Number(true)
[00:54:00] Krinkle: (number) 1
[00:54:03] >> Number(false)
[00:54:03] Krinkle: (number) 0
[00:54:06] yeah
[00:54:22] Krinkle, so anyway, won't this code trigger a fatal exception on the latest master?
[00:54:23] it's only php that can make numbers out of non-numerical strings
[00:54:27] Krenair: Yes
[00:54:30] !log haedus/capella - shutdown -h
[00:54:35] Logged the message, Master
[00:54:39] Uncaught exception InvalidArgumentException: Title::newFromText given something that isn't a string
[00:55:11] posted a comment
[00:55:17] Krenair: eh.. that's hardly the fault of a commit from 2012
[00:55:26] The thing that added the exception made it break
[00:55:53] (CR) Dzahn: [C: 2] remove haedus/capella, decom [dns] - https://gerrit.wikimedia.org/r/201397 (https://phabricator.wikimedia.org/T94474) (owner: Dzahn)
[00:55:53] If I saw that code when it was originally posted, I would've -1'd it, had I seen the order of those params
[00:56:03] Yeah
[00:56:15] I will upload a fix soon.
[00:56:26] want to see if I can reproduce it first
[00:56:47] (PS1) Yuvipanda: Better symlink race protection [software/tools-manifest] - https://gerrit.wikimedia.org/r/202297 (https://phabricator.wikimedia.org/T95210)
[00:56:48] k, I'll review
[00:57:53] operations, Phabricator, Patch-For-Review: reclaim radon as spare re-use server 'radon' as phab failover - https://phabricator.wikimedia.org/T88818#1184482 (Dzahn) revoked salt key and puppet cert, it was already powered down removed from DNS still needs (maybe): racktables, switchport, disk wipe?
[00:59:00] operations, Phabricator, Patch-For-Review: reclaim radon as spare re-use server 'radon' as phab failover - https://phabricator.wikimedia.org/T88818#1184483 (Dzahn) Open>Resolved nevermind, disks are also wiped already. i think this is done with all the "reclaim as spare" steps.
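Editor's note: the coercion results quoted above, and the effect of calling Title::newFromText with its arguments swapped, can be reproduced with a small JavaScript sketch. The helper names `phpLikeIntval`, `looseTitleSketch` and `strictTitleSketch` are hypothetical and written only for illustration; they imitate the behaviour described in the log and are not MediaWiki's implementation.

```javascript
// The JavaScript coercions evaluated by the bot in the log.
function jsCoercions() {
  return {
    parseIntText: parseInt('text', 10), // NaN
    unaryPlusFoo: +'foo',               // NaN
    unaryPlusBool: +!!'foo',            // 1
    numberTrue: Number(true),           // 1
    numberFalse: Number(false),         // 0
  };
}

// Hypothetical helper: rough imitation of PHP's intval() on strings.
// It keeps a leading integer prefix and falls back to 0, which is the
// intval('text') == 0 behaviour mentioned in the log.
function phpLikeIntval(value) {
  const match = /^\s*[+-]?\d+/.exec(String(value));
  return match ? parseInt(match[0], 10) : 0;
}

// Hypothetical lenient factory: accepts anything, so swapped arguments
// (0, 'REDIR') quietly become title "0" in namespace 0.
function looseTitleSketch(text, defaultNamespace = 0) {
  return { namespace: phpLikeIntval(defaultNamespace), text: String(text) };
}

// Hypothetical strict factory: rejects non-string text, analogous to the
// InvalidArgumentException quoted in the log.
function strictTitleSketch(text, defaultNamespace = 0) {
  if (typeof text !== 'string') {
    throw new TypeError('text must be a string');
  }
  return looseTitleSketch(text, defaultNamespace);
}
```

With the swapped call from the backtrace, `looseTitleSketch(0, 'REDIR')` yields a title named "0" in namespace 0, while `strictTitleSketch(0, 'REDIR')` throws, which is the difference between the old behaviour and the new exception discussed above.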
[00:59:18] (PS1) John F. Lewis: remove rbf* production dns [dns] - https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153)
[00:59:19] (CR) jenkins-bot: [V: -1] remove rbf* production dns [dns] - https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153) (owner: John F. Lewis)
[00:59:48] Krinkle, haha, yeah
[00:59:58] 1) get latest master of mediawiki core
[01:00:04] 2) document.cookie = "forceHTTPS=true;"; in JS console
[01:00:16] 3) refresh wiki, make sure you're on HTTP
[01:00:21] 4) Exception encountered, of type "InvalidArgumentException"
[01:00:32] ok, will patch
[01:03:31] operations, Patch-For-Review: reclaim / decom haedus and capella - https://phabricator.wikimedia.org/T94474#1184487 (Dzahn) revoked puppet cert, salt keys.. < mutante> !log haedus/capella - shutdown -h removed from DNS (kept mgmt) the steps needed for "reclaim" should be done
[01:03:42] operations, Patch-For-Review: reclaim / decom haedus and capella - https://phabricator.wikimedia.org/T94474#1184488 (Dzahn) Open>Resolved
[01:03:46] (PS3) Yuvipanda: Set has_ganglia=false for labs [puppet] - https://gerrit.wikimedia.org/r/201942 (https://phabricator.wikimedia.org/T95107) (owner: Gergő Tisza)
[01:04:49] (PS2) John F. Lewis: remove rbf* production dns [dns] - https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153)
[01:07:40] woah, wtf
[01:07:42] Krinkle
[01:07:44] 2015-04-07 00:56:19 mw1251 commonswiki:
[01:07:45] 2015-04-07 00:56:19 mw1251 commonswiki:
[01:08:01] 2015-04-07 00:43:34 mw1186 ptwiki:
[01:08:02] 2015-04-07 00:43:36 mw1186 ptwiki:
[01:08:27] json_encode( debug_backtrace() ) resulting in that?
:/
[01:08:44] uploaded https://gerrit.wikimedia.org/r/#/c/202299/ btw
[01:08:54] (CR) Dzahn: "i think in this case mgmt should go as well, because they are not coming back with the same names, unlike "misc" servers with individual n" [dns] - https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153) (owner: John F. Lewis)
[01:09:43] Krenair: Hm.. try matching to a different error around the same time for that server/wiki combo
[01:09:50] in the other logs
[01:09:54] (this would be easier in logstash)
[01:11:17] yeah... considering some of these logs are GBs huge
[01:11:24] -rw-r--r-- 1 udp2log udp2log 35G Apr 7 01:10 api-feature-usage.log
[01:11:25] :|
[01:11:41] operations, RESTBase: Set up multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1184502 (GWicke) NEW
[01:11:50] !log rbf eqiad and codfw - disable puppet (T95153)
[01:11:55] (PS3) John F. Lewis: remove rbf* production dns [dns] - https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153)
[01:11:56] Logged the message, Master
[01:12:11] operations, RESTBase: Set up multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1184511 (GWicke) @fgiunchedi or @akosiaris, is this something you are potentially interested in taking on?
[01:12:48] that makes me feel much better about the size we can let AdHocDebug.log (currently 15M, 20 minutes after deployment) grow to
[01:13:10] operations, RESTBase: Set up multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1184512 (GWicke)
[01:13:22] Krenair: Shall we deploy those fixes now and see if we can make AdHocDebug.log stop?
[01:13:37] there's much more hidden ones, like in Special:GlobalRenameProgress
[01:14:03] Aye.
Let's file some #Wikimedia-log-errors then :)
[01:14:14] but, they're **much** less frequent
[01:14:25] And assuming they don't get fixed in time, we'll need to revert the exception thrower in master
[01:14:27] krenair@fluorine:/a/mw-log$ cat AdHocDebug.log | grep -v appendRightsInfo | grep -v '0,"REDIR"' -c
[01:14:27] 23
[01:14:34] !log rbf2001/rbf2002 - stop redis server
[01:14:38] Logged the message, Master
[01:14:45] yeah... and the 1.25 branch is soon too
[01:14:49] The exception solves nothing
[01:14:54] Let's revert it on master.
[01:15:00] yeah
[01:15:00] And fix up all our callers
[01:16:02] Let's keep the MapCacheLRU part which is harmless, but revert the rest
[01:17:30] Krinkle, also, possibly raise a warning where we currently throw the exception?
[01:17:51] Krenair: I actually wanted to back out the wfWarn from MapCacheLRU
[01:17:56] I think that's an antipattern.
[01:17:58] It's of no value.
[01:18:08] PROBLEM - salt-minion processes on rbf2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:18:08] Garbage in, garbage out, and it works as expected.
[01:18:19] ok, fine
[01:18:22] it's not something we should throw a mediawiki logic level warning for
[01:18:32] PHP does this perfectly
[01:18:37] PROBLEM - salt-minion processes on rbf2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:18:38] and that way it has stack traces
[01:19:23] you want to make the revert commit or shall I?
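Editor's note: the `grep -v ... | grep -v ... -c` pipelines used on AdHocDebug.log count the lines left over once the already-explained callers are filtered out. A minimal JavaScript sketch of the same idea follows; `countUnexplained` is a hypothetical name and the sample lines are made up for illustration, not taken from the real log file.

```javascript
// Count the lines that match none of the exclusion patterns, similar in
// spirit to chaining `grep -v PATTERN` and finishing with `-c`.
// Pass regular expressions without the `g` flag, since RegExp.test()
// is stateful for global patterns.
function countUnexplained(logText, excludePatterns) {
  const lines = logText.split('\n');
  // A trailing newline produces one empty final element that is not a line.
  if (lines.length > 0 && lines[lines.length - 1] === '') {
    lines.pop();
  }
  return lines.filter(
    (line) => !excludePatterns.some((pattern) => pattern.test(line))
  ).length;
}

// Made-up sample input standing in for the debug log.
function sampleLog() {
  return [
    'caller A via appendRightsInfo',
    'caller B with args [0,"REDIR"]',
    'caller C in SpecialGlobalRenameProgress',
    'caller D that still needs a fix',
    'example host examplewiki: ',
  ].join('\n') + '\n';
}
```

Calling `countUnexplained(sampleLog(), [/appendRightsInfo/, /0,"REDIR"/, /SpecialGlobalRenameProgress/, /: $/])` leaves only the "caller D" line, so the count is 1.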
[01:20:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[01:21:25] (PS2) Yuvipanda: Better symlink race protection [software/tools-manifest] - https://gerrit.wikimedia.org/r/202297 (https://phabricator.wikimedia.org/T95210)
[01:21:29] ACKNOWLEDGEMENT - salt-minion processes on rbf2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn https://phabricator.wikimedia.org/T95153
[01:21:29] ACKNOWLEDGEMENT - salt-minion processes on rbf2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn https://phabricator.wikimedia.org/T95153
[01:22:34] krenair@fluorine:/a/mw-log$ cat AdHocDebug.log | grep -v appendRightsInfo | grep -v '0,"REDIR"' | grep -v SpecialGlobalRenameProgress | grep -ve ": $" -c
[01:22:34] 9
[01:23:49] ACKNOWLEDGEMENT - Host ganeti1003 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T94825
[01:30:34] Krenair: Go for it :)
[01:31:32] ACKNOWLEDGEMENT - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn https://phabricator.wikimedia.org/T94042
[01:31:32] ACKNOWLEDGEMENT - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn https://phabricator.wikimedia.org/T94042
[01:31:32] ACKNOWLEDGEMENT - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn https://phabricator.wikimedia.org/T94042
[01:31:32] ACKNOWLEDGEMENT - puppet last run on ganeti2003 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn https://phabricator.wikimedia.org/T94042
[01:31:32] ACKNOWLEDGEMENT - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn https://phabricator.wikimedia.org/T94042
[01:31:33] ACKNOWLEDGEMENT - puppet last run on ganeti2005 is CRITICAL: CRITICAL: Puppet has 3 failures
daniel_zahn https://phabricator.wikimedia.org/T94042
[01:31:33] ACKNOWLEDGEMENT - puppet last run on ganeti2006 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn https://phabricator.wikimedia.org/T94042
[01:40:28] Krinkle, ok, let's deploy the backports for MediaWiki's call first?
[01:44:48] (PS1) Springle: repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/202304 (https://phabricator.wikimedia.org/T94805)
[01:45:03] meh
[01:45:14] Krinkle, take a look at the difference between PS1 and PS2 of https://gerrit.wikimedia.org/r/#/c/202303/
[01:45:24] PS1 is a simple gerrit revert, PS2 changes some stuff
[01:45:45] we should probably change the commit message as well
[01:46:02] (CR) Springle: [C: 2] repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/202304 (https://phabricator.wikimedia.org/T94805) (owner: Springle)
[01:46:07] (Merged) jenkins-bot: repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/202304 (https://phabricator.wikimedia.org/T94805) (owner: Springle)
[01:46:28] Krenair: We can keep the InvalidArgumentException I suppose
[01:46:43] (PS1) Yuvipanda: Validate tool accounts before accepting them [software/tools-manifest] - https://gerrit.wikimedia.org/r/202305
[01:46:56] Krinkle, keep the type it actually throws? or the doc change?
[01:47:08] e.g. throw InvalidArgumentException instead of MWException
[01:47:23] !log springle Synchronized wmf-config/db-eqiad.php: repool db1035, warm up (duration: 00m 12s)
[01:47:30] Logged the message, Master
[01:47:32] The message could also be better. we never repeat the method name. Could just be '$text must be a string'
[01:47:41] currently checking for !is_object
[01:47:45] and warns for string
[01:47:48] scap got all pretty and ascii-fied :)
[01:47:50] and later we'll make it throw for any non-string
[01:50:50] springle, yeah it was https://gerrit.wikimedia.org/r/#/c/201829/ :)
[01:51:18] :D
[01:51:51] Krinkle, shall we move to -dev?
[01:51:56] k
[01:59:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[02:09:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[02:26:08] PROBLEM - Disk space on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:17] PROBLEM - configured eth on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:27] PROBLEM - Hadoop DataNode on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:27] PROBLEM - dhclient process on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:38] PROBLEM - DPKG on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:48] PROBLEM - Hadoop NodeManager on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:58] PROBLEM - puppet last run on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:18] PROBLEM - salt-minion processes on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:37] PROBLEM - RAID on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:38] PROBLEM - SSH on analytics1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:35:08] !log l10nupdate Synchronized php-1.25wmf23/cache/l10n: (no message) (duration: 09m 00s)
[02:35:18] Logged the message, Master
[02:41:47] !log LocalisationUpdate completed (1.25wmf23) at 2015-04-07 02:40:43+00:00
[02:41:50] Logged the message, Master
[02:46:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[02:55:27] PROBLEM - NTP on analytics1020 is CRITICAL: NTP CRITICAL: No response from NTP server
[03:02:34] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 06m 16s)
[03:02:42] Logged the message, Master
[03:07:16] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-07 03:06:13+00:00
[03:07:23] Logged the message, Master
[03:27:45] (PS1) Yuvipanda: Send stats about webservices, manifests and errors! [software/tools-manifest] - https://gerrit.wikimedia.org/r/202318
[03:47:38] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1184582 (yuvipanda) So I gave the group (prateek, moiz and ori) direct push access to the public repo, and they can update it as they s...
[03:49:37] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: puppet fail
[03:51:12] (PS2) Yuvipanda: Send stats about webservices, manifests and errors! [software/tools-manifest] - https://gerrit.wikimedia.org/r/202318 (https://phabricator.wikimedia.org/T95256)
[03:53:28] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[04:02:03] (CR) Tim Landscheidt: Send stats about webservices, manifests and errors!
(1 comment) [software/tools-manifest] - https://gerrit.wikimedia.org/r/202318 (https://phabricator.wikimedia.org/T95256) (owner: Yuvipanda)
[04:03:19] (CR) Yuvipanda: Send stats about webservices, manifests and errors! (1 comment) [software/tools-manifest] - https://gerrit.wikimedia.org/r/202318 (https://phabricator.wikimedia.org/T95256) (owner: Yuvipanda)
[04:06:27] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[04:22:39] (PS1) Yuvipanda: Add minimial setup.py [software/tools-manifest] - https://gerrit.wikimedia.org/r/202320
[04:22:58] operations, Datasets-General-or-Unknown, Patch-For-Review: dumps.wikimedia.org: The "/mediawiki" redirect should not add duplicate slash in path - https://phabricator.wikimedia.org/T73152#1184588 (Dzahn)
[04:23:20] (CR) Dzahn: [C: 1] nginx: don't add duplicate slash [puppet] - https://gerrit.wikimedia.org/r/202074 (https://phabricator.wikimedia.org/T73152) (owner: Southparkfan)
[04:24:47] (PS5) Tim Landscheidt: Tools: Factor out registering with proxies [puppet] - https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954)
[04:24:49] (PS2) Tim Landscheidt: Tools: Make list of proxies for portgrabber configurable [puppet] - https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954)
[04:24:51] (PS1) Tim Landscheidt: Tools: Make proxylistener project-independent [puppet] - https://gerrit.wikimedia.org/r/202322 (https://phabricator.wikimedia.org/T87387)
[04:25:03] (PS1) Yuvipanda: Rename packages to tools.manifest [software/tools-manifest] - https://gerrit.wikimedia.org/r/202323
[04:25:48] (PS2) Dzahn: dumps: nginx, don't add duplicate slash [puppet] - https://gerrit.wikimedia.org/r/202074 (https://phabricator.wikimedia.org/T73152) (owner: Southparkfan)
[04:25:58] (CR) Tim Landscheidt: "Just rebased for the depending patches, not tested for
uwsgi-python and nodejs." [puppet] - https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) (owner: Tim Landscheidt)
[04:26:27] (CR) Dzahn: [C: 2] dumps: nginx, don't add duplicate slash [puppet] - https://gerrit.wikimedia.org/r/202074 (https://phabricator.wikimedia.org/T73152) (owner: Southparkfan)
[04:26:33] (CR) Tim Landscheidt: "Just rebased for the depending patch, not tested uwsgi-python and nodejs." [puppet] - https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: Tim Landscheidt)
[04:27:14] (CR) Tim Landscheidt: "Tested on Toolsbeta; as the proxylistener service isn't subscribed to its source file, merging this will not trigger a restart of proxylis" [puppet] - https://gerrit.wikimedia.org/r/202322 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[04:29:37] operations, Datasets-General-or-Unknown, Patch-For-Review: dumps.wikimedia.org: The "/mediawiki" redirect should not add duplicate slash in path - https://phabricator.wikimedia.org/T73152#1184606 (Dzahn) merged. watched puppet apply on dataset1001. looks fixed to me, thanks :)
[04:31:11] operations, Datasets-General-or-Unknown, Patch-For-Review: dumps.wikimedia.org: The "/mediawiki" redirect should not add duplicate slash in path - https://phabricator.wikimedia.org/T73152#1184609 (Dzahn) Open>Resolved a:Dzahn - rewrite ^/(other/)?mediawiki(|/.*)$ $scheme://rel...
[04:33:10] (PS1) Ori.livneh: Blackhole the slow parse log on private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/202325
[04:33:25] ^ bd808
[04:33:50] (DefaultSettings.php says:
[04:33:52] 5269 * Log destinations may be one of the following:
[04:33:52] 5270 * - false to completely remove from the output, including from $wgDebugLogFile.
[04:33:53] )
[04:34:27] let me check to be sure that it still does that :)
[04:35:13] (CR) Alex Monk: "...
Sure hope we're not actually publicising it already somewhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/202325 (owner: Ori.livneh)
[04:36:01] It does! https://github.com/wikimedia/mediawiki/blob/master/includes/debug/logger/LegacyLogger.php#L136
[04:36:27] Krenair: we're not, but why would that be so bad?
[04:36:30] (CR) BryanDavis: [C: 1] Blackhole the slow parse log on private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/202325 (owner: Ori.livneh)
[04:37:00] (CR) Alex Monk: [C: 1] "actually this is probably okay because we're just sending it nowhere rather than back to the normal log" [mediawiki-config] - https://gerrit.wikimedia.org/r/202325 (owner: Ori.livneh)
[04:37:37] ori, because I don't read everything properly before dumping my thoughts onto gerrit :p
[04:37:49] me too! sometimes i even +2
[04:37:57] heh
[04:38:44] (CR) Dzahn: [C: 1] "yep,looks like the entire "rbf" type of server is not going to come back. bf was for BloomFilter and it has been removed in https://gerrit" [dns] - https://gerrit.wikimedia.org/r/202298 (https://phabricator.wikimedia.org/T95153) (owner: John F. Lewis)
[04:42:44] (CR) Dzahn: [C: 2] dumps: ferm service for rsyncd clients using hiera [puppet] - https://gerrit.wikimedia.org/r/188204 (owner: Dzahn)
[04:42:57] mutante: I see alerts for rbf*
[04:43:38] paravoid: i'll take care. right after that dumps change there
[04:43:45] (which should be noop)
[04:43:50] rbf is decom
[04:43:57] I know, I opened the ticket :)
[04:44:06] i only stopped it in codfw though, you never know :p
[04:44:07] but something has gone wrong with the decom process
[04:44:08] ok
[04:44:21] it's late there, right
[04:44:46] it's ok:)
[04:44:48] I have no idea what your working hours are but don't feel obligated to do it if your workday is over, it's really no emergency :)
[04:45:25] no worries..
*nod*
[04:49:38] !log rbf - puppetstoredconfigclean.rb, remove from icinga
[04:49:44] Logged the message, Master
[04:49:49] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail
[04:50:02] and that's me :p
[04:50:56] so the rbf hosts should be gone soon and the other one i'm looking at
[04:52:08] also DNS https://gerrit.wikimedia.org/r/#/c/202298/
[05:13:50] (PS1) Dzahn: Revert "dumps: ferm service for rsyncd clients using hiera" [puppet] - https://gerrit.wikimedia.org/r/202326
[05:16:03] (CR) Dzahn: [C: 2] "https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Limitations" [puppet] - https://gerrit.wikimedia.org/r/202326 (owner: Dzahn)
[05:18:14] (CR) Dzahn: "this wasn't critical either way, ferm service doesnt create rules until base::firewall is used, it was just about the puppet run" [puppet] - https://gerrit.wikimedia.org/r/202326 (owner: Dzahn)
[05:18:55] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:21:41] (CR) Dzahn: ""Could not find data item dumps::rsync_clients in any Hiera data file" can't use role based lookup yet. https://wikitech.wikimedia.org/wik" [puppet] - https://gerrit.wikimedia.org/r/188204 (owner: Dzahn)
[05:27:26] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:31:26] PROBLEM - DPKG on ruthenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:34:30] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1184736 (Prtksxna) @msyed and I will be doing the deployment on Tuesday, 7 April 2015, at approximately 16:00:00 UTC. We will — * Build...
[05:38:06] RECOVERY - DPKG on ruthenium is OK: All packages OK
[06:01:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 7 06:00:13 UTC 2015 (duration 0m 12s)
[06:01:23] Logged the message, Master
[06:14:22] (PS2) Ori.livneh: Blackhole the slow parse log on private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/202325
[06:29:07] PROBLEM - Hadoop DataNode on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:35] PROBLEM - configured eth on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:36] PROBLEM - puppet last run on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:36] PROBLEM - dhclient process on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:36] PROBLEM - SSH on analytics1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:29:56] PROBLEM - salt-minion processes on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:06] PROBLEM - Hadoop NodeManager on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:06] PROBLEM - RAID on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:25] PROBLEM - Disk space on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:26] PROBLEM - DPKG on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:26] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: puppet fail [06:31:06] PROBLEM - puppet last run on virt1001 is CRITICAL: CRITICAL: puppet fail [06:31:25] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:45] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] RECOVERY - Hadoop DataNode on analytics1017 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [06:32:46] RECOVERY - configured eth on analytics1017 is OK: NRPE: Unable to read output [06:32:46] RECOVERY - dhclient process on analytics1017 is OK: PROCS OK: 0 processes with command name dhclient [06:32:47] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:55] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [06:32:55] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [06:33:15] RECOVERY - salt-minion processes on analytics1017 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:33:17] RECOVERY - Hadoop NodeManager on analytics1017 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:33:25] RECOVERY - RAID on analytics1017 is OK: OK: no disks configured for RAID [06:33:36] RECOVERY - Disk space on analytics1017 is OK: DISK OK [06:33:36] RECOVERY - DPKG on analytics1017 is OK: All packages OK [06:34:46] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:06] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:16] <_joe_> good morning [06:35:21] hi _joe_ [06:35:25] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:37] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures 
[06:35:46] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:46:05] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:46:45] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:45] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:56] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:05] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:16] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:25] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:35] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:48:05] RECOVERY - puppet last run on virt1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:06] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:52:27] (03PS6) 10Yuvipanda: Tools: Factor out registering with proxies [puppet] - 10https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [06:52:50] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Factor out registering with proxies [puppet] - 10https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [06:59:39] (03CR) 10Yuvipanda: [C: 04-1] "I think all config files should be yaml, at least for toollabs, and so should this be." 
[puppet] - 10https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [07:05:49] (03PS2) 10Yuvipanda: Rename package to tools.manifest [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202323 [07:09:09] (03CR) 10Alex Monk: "Is this still useful?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134124 (owner: 10Reedy) [07:12:09] <_joe_> !log powercycling mw2128, network driver crashes [07:12:15] Logged the message, Master [07:12:23] what kind of crashes? [07:13:49] <_joe_> paravoid: from what I see in dmesg, it seems to be a deadlock, but I'd like to confirm it's not a hardware issue [07:13:59] ok [07:14:07] _joe_: are you still fiddling with lvs2003? [07:14:21] (03PS1) 10Faidon Liambotis: otrs: kill exim_messages_in check [puppet] - 10https://gerrit.wikimedia.org/r/202331 [07:14:33] (03CR) 10Faidon Liambotis: [C: 032] otrs: kill exim_messages_in check [puppet] - 10https://gerrit.wikimedia.org/r/202331 (owner: 10Faidon Liambotis) [07:14:49] <_joe_> paravoid: nope, why? [07:14:56] puppet disabled [07:15:06] <_joe_> uhm that's my fault in fact [07:15:13] I know, that's why I'm asking :) [07:16:06] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Puppet last ran 4 days ago [07:17:46] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [07:18:28] !log powercycling analytics1020, unresponsive [07:18:33] Logged the message, Master [07:23:36] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [07:23:50] Warning! Power management firmware not responsive. [07:23:50] Disconnect and reconnect system input power. [07:24:09] Warning! Power management firmware initialization error. 
[07:24:11] <_joe_> ouch [07:24:12] awesomeness [07:30:05] 6operations, 10ops-eqiad, 10Analytics-Cluster: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1184828 (10faidon) 3NEW [07:37:33] <_joe_> and mw2128 is not coming up either [07:39:39] (03PS1) 10KartikMistry: Beta: Enable sv-da MT in Apertium [puppet] - 10https://gerrit.wikimedia.org/r/202340 [07:42:03] (03CR) 10Ecdsa: [C: 04-1] "The use of _null_ in the IKE and ESP proposals in the config snippet (and the documentation at https://wikitech.wikimedia.org/wiki/IPsec) " [puppet] - 10https://gerrit.wikimedia.org/r/201135 (owner: 10Gage) [07:42:42] oh wow [07:42:46] <_joe_> lol [07:42:49] that's someone from the strongswan project [07:42:50] (03PS1) 10KartikMistry: CX: Enable Swedish (sv) in target and sv-da MT [puppet] - 10https://gerrit.wikimedia.org/r/202341 (https://phabricator.wikimedia.org/T95108) [07:42:53] <_joe_> I was about to say the same [07:43:04] <_joe_> nice gerrit nick btw [07:43:06] <_joe_> :) [07:43:09] (03PS1) 10Yuvipanda: Handle webservice calls erroring out [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202342 [07:44:03] (03CR) 10KartikMistry: [C: 04-1] "To be deployed on 09/04/2015 (Thursday)." 
[puppet] - 10https://gerrit.wikimedia.org/r/202341 (https://phabricator.wikimedia.org/T95108) (owner: 10KartikMistry) [07:45:11] 6operations, 10ops-codfw: mw2128 not rebooting after network driver crash, blank console - https://phabricator.wikimedia.org/T95264#1184867 (10Joe) 3NEW [07:46:59] ACKNOWLEDGEMENT - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T95264 [07:49:17] (03PS1) 10Yuvipanda: Rename service monitor to web service monitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202343 [07:58:05] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Prove that old-bugzilla is 100 % superseded - https://phabricator.wikimedia.org/T95265#1184877 (10Nemo_bis) 3NEW [08:01:03] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Complete migration of all Bugzilla users and their data - https://phabricator.wikimedia.org/T95266#1184887 (10Nemo_bis) 3NEW [08:01:18] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Complete migration of all Bugzilla users and their data - https://phabricator.wikimedia.org/T95266#1184887 (10Nemo_bis) [08:01:28] (03PS1) 10Faidon Liambotis: Repurpose rhenium, format as jessie [puppet] - 10https://gerrit.wikimedia.org/r/202345 (https://phabricator.wikimedia.org/T95243) [08:02:07] (03CR) 10Faidon Liambotis: [C: 032] Repurpose rhenium, format as jessie [puppet] - 10https://gerrit.wikimedia.org/r/202345 (https://phabricator.wikimedia.org/T95243) (owner: 10Faidon Liambotis) [08:05:17] PROBLEM - Host rhenium is DOWN: CRITICAL - Host Unreachable (208.80.154.52) [08:07:40] ACKNOWLEDGEMENT - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% Faidon Liambotis T95263 [08:09:08] RECOVERY - Host rhenium is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [08:09:29] ACKNOWLEDGEMENT - RAID on vanadium is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0 Faidon Liambotis T94926 [08:09:47] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Provide alternate search interface for bugzilla 
database - https://phabricator.wikimedia.org/T95267#1184908 (10Nemo_bis) 3NEW [08:11:40] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1184917 (10Nemo_bis) 5declined>3Open This makes sense again if old-bugzilla is at risk of removal. [08:11:46] 6operations, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1184919 (10akosiaris) @Prtksxna, yeah I will be around, just make sure to reach me if you need be before 17:00 UTC as I will be unavailabl... [08:13:08] PROBLEM - Host rhenium is DOWN: PING CRITICAL - Packet loss = 100% [08:13:17] (03CR) 10Hashar: "Looks good https://doc.wikimedia.org/mw-tools-scap/ . Thanks!" [tools/scap] - 10https://gerrit.wikimedia.org/r/202240 (owner: 10Hashar) [08:14:57] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1184924 (10Nemo_bis) [08:15:01] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1184925 (10Nemo_bis) [08:15:58] RECOVERY - Host rhenium is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms [08:16:55] (03CR) 10Tim Landscheidt: "I don't mind in this case where the file isn't touched by a human, but do you know a configuration apart from /etc/puppet/statsd.yaml that" [puppet] - 10https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [08:17:42] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1184927 (10Nemo_bis) Does the script keep emails in place? Without user IDs, the dump is not worth much. 
[08:19:28] PROBLEM - RAID on rhenium is CRITICAL: Connection refused by host [08:19:37] PROBLEM - configured eth on rhenium is CRITICAL: Connection refused by host [08:19:57] PROBLEM - puppet last run on rhenium is CRITICAL: Connection refused by host [08:20:37] PROBLEM - salt-minion processes on rhenium is CRITICAL: Connection refused by host [08:20:48] PROBLEM - DPKG on rhenium is CRITICAL: Connection refused by host [08:45:51] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=5 dev=sdf failed - https://phabricator.wikimedia.org/T95268#1184953 (10fgiunchedi) 3NEW [08:46:29] ACKNOWLEDGEMENT - Disk space on ms-be1005 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdf1 is not accessible: Input/output error Filippo Giunchedi T95268 [08:46:29] ACKNOWLEDGEMENT - RAID on ms-be1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi T95268 [08:46:29] ACKNOWLEDGEMENT - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi T95268 [08:56:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think that the approach to the patch here is wrong." [puppet] - 10https://gerrit.wikimedia.org/r/193834 (owner: 10ArielGlenn) [08:59:17] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures [09:09:18] (03PS4) 10Giuseppe Lavagetto: analytics: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) [09:26:25] 6operations, 10hardware-requests, 5Patch-For-Review: Repurpose rhenium as "network insight" host - https://phabricator.wikimedia.org/T95243#1185029 (10mark) That's fine. [09:28:16] (03PS1) 10Faidon Liambotis: pmacct: rewrite the module/role [puppet] - 10https://gerrit.wikimedia.org/r/202356 [09:28:53] 6operations, 10hardware-requests, 5Patch-For-Review: Repurpose rhenium as "network insight" host - https://phabricator.wikimedia.org/T95243#1185036 (10faidon) 5Open>3Resolved a:3faidon Thanks! 
This is done then :) [09:31:36] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Provide alternate search interface for bugzilla database - https://phabricator.wikimedia.org/T95267#1185045 (10Aklapper) p:5Normal>3Triage Could you document some of the concrete searches here that you exclusively use old-bugzilla for? [09:35:27] (03CR) 10Faidon Liambotis: [C: 032] pmacct: rewrite the module/role [puppet] - 10https://gerrit.wikimedia.org/r/202356 (owner: 10Faidon Liambotis) [09:36:23] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Prove that old-bugzilla is 100 % superseded - https://phabricator.wikimedia.org/T95265#1185054 (10Aklapper) p:5Normal>3Triage This does not sound reasonable to me so I propose declining this. In general, new systems have improvements / advantages and regre... [09:36:57] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures [09:38:06] (03PS1) 10Faidon Liambotis: Fix the usual stupid puppet include loop bug [puppet] - 10https://gerrit.wikimedia.org/r/202359 [09:38:38] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:40:32] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Complete migration of all Bugzilla users and their data - https://phabricator.wikimedia.org/T95266#1185061 (10Aklapper) p:5Normal>3Triage > old-bugzilla can't be removed until 100 % users have been actually created on Phabricator Realistically speaking, t... 
[09:40:50] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Fix the usual stupid puppet include loop bug [puppet] - 10https://gerrit.wikimedia.org/r/202359 (owner: 10Faidon Liambotis) [09:48:26] (03PS1) 10Hashar: Revert "base: Allow hiera to override ldap use_dnsmasq variable" [puppet] - 10https://gerrit.wikimedia.org/r/202361 [09:48:34] (03PS1) 10Ori.livneh: [WIP] Add a script for storing NavTiming metrics using RRD [puppet] - 10https://gerrit.wikimedia.org/r/202362 [09:52:41] (03PS1) 10Tim Landscheidt: Tools: Let bigbrother ignore empty lines and comments [puppet] - 10https://gerrit.wikimedia.org/r/202363 (https://phabricator.wikimedia.org/T94990) [09:54:10] (03Abandoned) 10Hashar: Revert "base: Allow hiera to override ldap use_dnsmasq variable" [puppet] - 10https://gerrit.wikimedia.org/r/202361 (owner: 10Hashar) [09:54:20] !log swift weight ms-be10[678] to 2000 [09:54:23] Logged the message, Master [09:56:41] (03CR) 10Hashar: "By renaming the use_dnsmasq variable to use_dnsmasq_server instance, that caused a wild range of instances to magically switch to the new " [puppet] - 10https://gerrit.wikimedia.org/r/202278 (https://phabricator.wikimedia.org/T95240) (owner: 10Yuvipanda) [09:57:19] (03CR) 10Ori.livneh: [WIP] Add a script for storing NavTiming metrics using RRD (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/202362 (owner: 10Ori.livneh) [09:58:24] (03PS1) 10Faidon Liambotis: Remove ferm rule for rsync & udpmcast from hooft [puppet] - 10https://gerrit.wikimedia.org/r/202365 [09:58:39] (03PS1) 10Giuseppe Lavagetto: analytics: correctly define dependencies [puppet] - 10https://gerrit.wikimedia.org/r/202366 [09:59:27] (03CR) 10Faidon Liambotis: [C: 032] Remove ferm rule for rsync & udpmcast from hooft [puppet] - 10https://gerrit.wikimedia.org/r/202365 (owner: 10Faidon Liambotis) [10:07:34] lovely puppet master: Invalid line 18: allow_ip :( [10:12:01] (03CR) 10Hashar: "As a side effect, that kills puppetmaster::self on labs because some of the removed dummy 
classes are not in labs/private.git" [puppet] - 10https://gerrit.wikimedia.org/r/202006 (owner: 10Faidon Liambotis) [10:13:04] (03PS1) 10Giuseppe Lavagetto: phabricator: use hiera for admin [puppet] - 10https://gerrit.wikimedia.org/r/202369 (https://phabricator.wikimedia.org/T86774) [10:14:03] # line_rate metric is per second, so we need to alert if this [10:14:06] # metric goes over 0.000046296 / second. Let's round [10:14:07] I really don't get that [10:14:08] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 817.342198142 [10:14:09] # down to warning on 0.00004, or critical on 0.00008. [10:14:12] warning => '0.00004', [10:14:14] critical => '0.00008', [10:14:30] the intention is: [10:14:31] # Alert if CirrusSearch-slow.log shows more than [10:14:31] # 10 slow searches within an hour. [10:14:48] RECOVERY - Host db2042 is UP: PING OK - Packet loss = 0%, RTA = 43.58 ms [10:15:24] (03CR) 10Hashar: "Forget me, something else is wrong on my puppetmaster :)" [puppet] - 10https://gerrit.wikimedia.org/r/202006 (owner: 10Faidon Liambotis) [10:15:47] 0.000046296 / second = 0.166../hour [10:16:12] or 1 every 6 hours? [10:16:23] I really don't get that [10:17:39] (03CR) 10Filippo Giunchedi: "are we going to reimage it? if so then it would work, if not there's e.g. the puppet certificate to change too, I don't think we support r" [puppet] - 10https://gerrit.wikimedia.org/r/192827 (https://phabricator.wikimedia.org/T90676) (owner: 10John F. 
Lewis) [10:18:37] PROBLEM - DPKG on db2042 is CRITICAL: Connection refused by host [10:18:49] PROBLEM - configured eth on db2042 is CRITICAL: Connection refused by host [10:18:57] PROBLEM - Disk space on db2042 is CRITICAL: Connection refused by host [10:19:27] PROBLEM - dhclient process on db2042 is CRITICAL: Timeout while attempting connection [10:19:38] PROBLEM - puppet last run on db2042 is CRITICAL: Timeout while attempting connection [10:19:57] PROBLEM - salt-minion processes on db2042 is CRITICAL: Timeout while attempting connection [10:20:08] PROBLEM - RAID on db2042 is CRITICAL: Timeout while attempting connection [10:20:58] PROBLEM - Host db2042 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:38] RECOVERY - Host db2042 is UP: PING OK - Packet loss = 0%, RTA = 43.15 ms [10:29:11] 6operations, 5Patch-For-Review, 7Swift: swift eqiad capacity planning - https://phabricator.wikimedia.org/T1268#1185140 (10fgiunchedi) avg 77% used, new machines currently at weight 2000 and rebalancing ``` $ sudo swift-recon -d --human-readable ==============================================================... [10:32:55] (03PS2) 10Giuseppe Lavagetto: analytics: correctly define dependencies [puppet] - 10https://gerrit.wikimedia.org/r/202366 [10:33:44] (03CR) 10Giuseppe Lavagetto: [C: 032] analytics: correctly define dependencies [puppet] - 10https://gerrit.wikimedia.org/r/202366 (owner: 10Giuseppe Lavagetto) [10:36:59] (03PS1) 10Filippo Giunchedi: admin: add filippo aliases [puppet] - 10https://gerrit.wikimedia.org/r/202375 [10:37:01] (03PS1) 10Filippo Giunchedi: tungsten: deprovision roles [puppet] - 10https://gerrit.wikimedia.org/r/202376 (https://phabricator.wikimedia.org/T90591) [10:37:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: add filippo aliases [puppet] - 10https://gerrit.wikimedia.org/r/202375 (owner: 10Filippo Giunchedi) [10:37:55] _joe_: ^ [10:38:27] <_joe_> yeah right now I have a crazy dependency cycle in puppet, that didn't show up in the compiler, WTF? 
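Editorial note on the line_rate thresholds questioned at 10:14-10:16: the unit conversion can be checked with a short sketch. It assumes line_rate is a plain events-per-second rate, as the quoted comment says. The function names are hypothetical and come from no Wikimedia repository. The per-day figure is included only as one possible reading of where 0.000046296 came from; the log does not confirm it.

```python
# Sanity check for the per-second alert thresholds quoted in the log.
# Assumption: line_rate is a simple events-per-second rate.
# All names here are illustrative, not taken from the puppet repository.

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def per_second_to_per_hour(rate_per_second: float) -> float:
    """Convert an events-per-second rate to events per hour."""
    return rate_per_second * SECONDS_PER_HOUR


def per_second_to_per_day(rate_per_second: float) -> float:
    """Convert an events-per-second rate to events per day."""
    return rate_per_second * SECONDS_PER_DAY


def per_hour_to_per_second(events_per_hour: float) -> float:
    """Convert an events-per-hour budget to a per-second rate."""
    return events_per_hour / SECONDS_PER_HOUR


if __name__ == "__main__":
    quoted = 0.000046296  # value from the quoted puppet comment
    print("quoted rate per hour:", per_second_to_per_hour(quoted))
    print("quoted rate per day: ", per_second_to_per_day(quoted))
    # Stated intention in the log: more than 10 slow searches within an hour.
    print("10/hour as per-second rate:", per_hour_to_per_second(10))
```

Under these assumptions the quoted value works out to about one event every six hours, which matches the objection raised in the channel, while ten events per hour would correspond to a per-second rate of roughly 0.0028.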
[10:42:33] (03PS1) 10Giuseppe Lavagetto: analytics: fix puppet dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/202377 [10:43:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] analytics: fix puppet dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/202377 (owner: 10Giuseppe Lavagetto) [10:43:49] <_joe_> godog: I merged your alias addition, fyi [10:44:18] _joe_: sweet, thanks [10:46:22] (03PS5) 10Giuseppe Lavagetto: analytics: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) [10:59:10] !log install db2042, puppet sign, etc [10:59:16] Logged the message, Master [10:59:17] ah that was you :) [10:59:47] eh? [11:00:05] there were alerts about it before and I was trying to login and debug :) [11:00:06] oh critical? [11:00:08] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Provide alternate search interface for bugzilla database - https://phabricator.wikimedia.org/T95267#1185185 (10Nemo_bis) I use everything at https://old-bugzilla.wikimedia.org/query.cgi?format=advanced : mostly search by summary, product, components, comment,... 
[11:00:27] :) [11:00:28] they didn't page, just echoed here [11:01:45] RECOVERY - RAID on db2042 is OK: OK: no RAID installed [11:02:15] RECOVERY - configured eth on db2042 is OK: NRPE: Unable to read output [11:02:34] RECOVERY - dhclient process on db2042 is OK: PROCS OK: 0 processes with command name dhclient [11:02:54] RECOVERY - DPKG on db2042 is OK: All packages OK [11:03:05] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:03:15] RECOVERY - Disk space on db2042 is OK: DISK OK [11:04:29] (03PS6) 10Giuseppe Lavagetto: analytics: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) [11:04:42] (03CR) 10Giuseppe Lavagetto: [C: 032] "http://puppet-compiler.wmflabs.org/679/change/201477/html/" [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) (owner: 10Giuseppe Lavagetto) [11:12:24] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [11:12:35] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [11:15:47] <_joe_> that's me ^^ [11:16:34] PROBLEM - NTP on db2042 is CRITICAL: NTP CRITICAL: Offset unknown [11:19:22] (03PS1) 10Giuseppe Lavagetto: stat100*: fix cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/202381 [11:19:57] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] stat100*: fix cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/202381 (owner: 10Giuseppe Lavagetto) [11:20:04] RECOVERY - NTP on db2042 is OK: NTP OK: Offset -0.01058888435 secs [11:20:21] (03PS2) 10Glaisher: Add 'editeditorprotected' protection level on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) [11:20:36] (03CR) 10Glaisher: "done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [11:21:09] <_joe_> eww how many typos did I 
do? [11:24:17] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:23] (03PS1) 10Giuseppe Lavagetto: hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/202382 [11:32:03] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Provide alternate search interface for bugzilla database - https://phabricator.wikimedia.org/T95267#1185208 (10Aklapper) Bugzilla's search is more powerful than Phabricator's. The question is how often this power is actually needed and by how many users. :) I... [11:32:39] (03Abandoned) 10Faidon Liambotis: VCL: Be consistent about using vmod_header to append header items [puppet] - 10https://gerrit.wikimedia.org/r/184997 (owner: 10Ori.livneh) [11:33:27] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/202382 (owner: 10Giuseppe Lavagetto) [11:36:04] (03PS2) 10Giuseppe Lavagetto: phabricator: use hiera for admin [puppet] - 10https://gerrit.wikimedia.org/r/202369 (https://phabricator.wikimedia.org/T86774) [11:36:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] phabricator: use hiera for admin [puppet] - 10https://gerrit.wikimedia.org/r/202369 (https://phabricator.wikimedia.org/T86774) (owner: 10Giuseppe Lavagetto) [11:40:52] <_joe_> no more explicit declarations of the admin class in site.pp [11:40:57] nice [11:41:17] <_joe_> now I need to remove the 100s of inclusions and I'm done with it [11:41:38] and add to standard, right? [11:41:42] <_joe_> yep [11:42:06] <_joe_> with a guard to ensure it's not installed wherever we don't want it to [11:42:17] <_joe_> like maybe labstore? 
and labs of course [11:42:30] labstore is going to get fixed soon, last I heard [11:47:57] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [11:49:18] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1185242 (10Aklapper) [11:51:21] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1185246 (10Aklapper) >>! In T85141#1184927, @Nemo_bis wrote: > Does the script keep emails in place? Without user IDs, the dump is not worth much. Depends how strong your use case i... [11:51:49] (03PS1) 10Springle: assign db2042 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/202383 [11:52:16] (03CR) 10Springle: [C: 032] assign db2042 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/202383 (owner: 10Springle) [11:57:33] !log xtrabackup clone db2016 to db2042 [11:57:40] Logged the message, Master [12:04:35] PROBLEM - mysqld processes on db2042 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [12:04:54] RECOVERY - salt-minion processes on db2042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:05:21] 6operations: Make sure that Anasuya has been removed from fdcsupport@ alias - https://phabricator.wikimedia.org/T95212#1185273 (10Aklapper) [12:07:12] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Provide alternate search interface for bugzilla database - https://phabricator.wikimedia.org/T95267#1185277 (10Qgil) As time goes, the technical superiority of Bugzilla's search will be hampered by the fact that a bigger percentage of reports will be found onl... 
[12:07:20] ACKNOWLEDGEMENT - mysqld processes on db2042 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Sean Pringle cloning [12:17:50] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Complete migration of all Bugzilla users and their data - https://phabricator.wikimedia.org/T95266#1185310 (10Qgil) I think we discuss somewhere already that complete migration of Bugzilla users was not a goal. Many of those "users" consist in a single and dep... [12:21:12] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 5Patch-For-Review, 7notice: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1185325 (10Qgil) [12:23:58] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: Puppet has 3 failures [12:44:13] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: Puppet has 3 failures [12:52:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [12:59:10] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1185416 (10JohnLewis) >>! In T85141#1184927, @Nemo_bis wrote: > Does the script keep emails in place? Currently the plan is to include the user ids in the original dump but whether t... [13:06:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:08:44] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1185441 (10JohnLewis) While replication is probably the wrong word here, it is the intended plan to make the dump available on labs on the dbstores. @Andrew can advise further with... [13:16:19] (03PS2) 10Nikerabbit: Add skins to wgLocalisationUpdateRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169716 (https://bugzilla.wikimedia.org/67154) (owner: 10Reedy) [13:17:53] (03CR) 10Nikerabbit: "Oops I forgot. 
Added a request in SWAT for today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169716 (https://bugzilla.wikimedia.org/67154) (owner: 10Reedy) [13:32:26] !log scheduled downtime for barium to replace disk. [13:32:29] Logged the message, Master [13:37:04] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [13:37:54] <_joe_> hey Jeff_Green , are you on top of it? [13:39:44] _joe_: that is actually cmjohnson1 [13:40:01] but for some reason scheduled downtime did not work ? [13:44:06] (03CR) 10Giuseppe Lavagetto: "This change was completely wrong. You should've added the new snapshot-admins to the list in hiera, and not here. This change and the foll" [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [13:48:15] jouncebot, next [13:48:16] In 1 hour(s) and 11 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150407T1500) [13:48:54] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 2.94 ms [13:53:53] (03CR) 10Ottomata: [C: 031] admin: add jkatz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/202248 (https://phabricator.wikimedia.org/T94939) (owner: 10Dzahn) [13:54:17] _joe_: i guess analytics+hiera went all smooth? [13:55:20] <_joe_> ottomata: yep, but please look at the commit history, the analytics code contained an antipattern that made it work differently depending if "include standard" happened before or after "include role::analytics" [13:57:22] that sounds familiar, looking [13:59:08] hm, _joe_, not sure where to look, differences in site.pp? 
[14:00:10] <_joe_> ottomata: no, role/analytics.pp and role/analytics/*.pp [14:02:17] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1185616 (10JohnLewis) [14:03:16] andrewbogott: poke me when you're around [14:04:45] andrewbogott: not in particular no (re: duty handoff) [14:04:57] hm, interesting. ok, thanks _joe_ [14:05:12] JohnFLewis: what’s up? [14:06:16] andrewbogott: read the last comment/my comment on https://phabricator.wikimedia.org/T30339 [14:10:02] JohnFLewis: I don’t exactly understand the point of that… is it just for posterity? [14:10:11] As far as I know everything from bz has been imported into phab [14:11:19] andrewbogott: yeah but some people want to use the database to make tools for things like retrieving votes and better search interfaces because it's better than Phabricator? I don't get it either but it's something volunteers want :) [14:12:06] 6operations: iridium "standard" conflict with exim in role - https://phabricator.wikimedia.org/T92879#1185671 (10Joe) The problem here was simply that standard was included _before_ the role declaration, so all role-bound hiera variables were not available at the compiler when evaluating the standard class.... [14:12:28] 6operations: iridium "standard" conflict with exim in role - https://phabricator.wikimedia.org/T92879#1185672 (10Joe) a:3Joe [14:12:39] JohnFLewis: volunteers want it but haven’t commented on the ticket? or am I missing someone? 
[14:12:42] jouncebot: next [14:12:42] In 0 hour(s) and 47 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150407T1500) [14:12:47] ^ Krinkle [14:13:12] andrewbogott: volunteers have made it known, Nemo_bis is just one of them [14:13:54] he can probably provide more insight on why/who wants it, I'm just trying to organise and make sure things are handled in a way that everyone is happy [14:14:19] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1185679 (10Andrew) Since bugzilla is a dead end and everything was migrated to phab, I don't understand what this is for. Can someone explain? [14:16:18] (03PS2) 10Andrew Bogott: admin: add jkatz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/202248 (https://phabricator.wikimedia.org/T94939) (owner: 10Dzahn) [14:16:56] (03PS3) 10Alexandros Kosiaris: Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 [14:16:58] (03PS1) 10Alexandros Kosiaris: ssh: allow parameterization of authorized_keys [puppet] - 10https://gerrit.wikimedia.org/r/202392 [14:17:00] (03PS1) 10Alexandros Kosiaris: sodium: specify the position of authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202393 [14:17:02] (03PS1) 10Alexandros Kosiaris: ssh: remove lucid special casing for authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202394 [14:17:04] (03PS1) 10Giuseppe Lavagetto: phabricator: re-include standard [puppet] - 10https://gerrit.wikimedia.org/r/202395 (https://phabricator.wikimedia.org/T92879) [14:17:24] (03CR) 10Andrew Bogott: [C: 032] admin: add jkatz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/202248 (https://phabricator.wikimedia.org/T94939) (owner: 10Dzahn) [14:18:08] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster, 5Patch-For-Review: Requesting access to analytics-users 
(stat1002) for Jkatz - https://phabricator.wikimedia.org/T94939#1185701 (10Andrew) 5Open>3Resolved [14:18:18] (03PS2) 10Giuseppe Lavagetto: phabricator: re-include standard [puppet] - 10https://gerrit.wikimedia.org/r/202395 (https://phabricator.wikimedia.org/T92879) [14:18:42] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] phabricator: re-include standard [puppet] - 10https://gerrit.wikimedia.org/r/202395 (https://phabricator.wikimedia.org/T92879) (owner: 10Giuseppe Lavagetto) [14:20:44] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1185719 (10Krenair) Presumably for easy searching of old BZ data that will be hard to find in Phabricator, and to get at data (e.g. non-migrated users' email domains) that's not cu... [14:21:37] 6operations, 5Patch-For-Review: iridium "standard" conflict with exim in role - https://phabricator.wikimedia.org/T92879#1185727 (10Joe) This is now fixed. Please not that this pitfall was clearly stated here: https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Limitations "Any hiera lookup happening before the... 
[14:22:00] operations, Patch-For-Review: iridium "standard" conflict with exim in role - https://phabricator.wikimedia.org/T92879#1185730 (Joe) Open>Resolved
[14:23:37] (PS5) Mobrovac: Enable group1 wikis in RESTBase [puppet] - https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: GWicke)
[14:27:53] operations, ops-eqiad, Incident-20150401-LabsNFS-Overload: Verify visually that the labstore shelves' wiring is stable - https://phabricator.wikimedia.org/T94828#1185800 (coren)
[14:37:49] operations, ops-eqiad, Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1185852 (coren) NEW
[14:38:04] operations, ops-fundraising: upgrade tellurium.frack.eqiad.wmnet to Trusty - https://phabricator.wikimedia.org/T95294#1185861 (Jgreen) NEW
[14:39:12] operations, ops-eqiad, Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1185872 (coren)
[14:50:44] Krenair: Since you have patches in it, do you want to SWAT this morning?
[14:51:03] aha, I was about to ask who was doing it this morning :)
[14:51:14] ok
[14:51:54] Krenair: you doing swat?
[14:52:21] yes
[14:52:32] k
[14:57:49] (PS5) BBlack: scale varnish->varnish backend weight for prod 2layer clusters [puppet] - https://gerrit.wikimedia.org/r/201714 (https://phabricator.wikimedia.org/T86663)
[15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, superm401: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150407T1500). Please do the needful.
[15:00:46] operations, ops-eqiad: check Temperature Alarm: asw-d-eqiad. - https://phabricator.wikimedia.org/T94997#1186005 (RobH) So these alarms are still happening. Perhaps we should have @Mark advise if this is too high for this, or perhaps its just an overly cautious threshhold?
[15:01:17] ok, ay
[15:01:20] Blocked-on-Operations, Ops-Access-Requests, operations: Access to francium - https://phabricator.wikimedia.org/T94093#1186013 (Kelson) @GWicke Can you please give me a root access? So I can test to create a WPEN ZIM file and we can have a look how to allow multiples processes to share the same local P...
[15:01:25] so
[15:01:49] superm401's changes are being merged
[15:02:00] Nikerabbit, you there?
[15:02:06] Krenair: ay ay captain
[15:02:12] haha
[15:02:49] Krenair: but due to nature of the patch, I believe it cannot be tested until LU actually runs next time
[15:02:54] right.
[15:03:06] I think that's probably OK.
[15:03:11] unless you know magic commands to run it manually... but that takes time
[15:03:28] Let's not run LU manually in a swat window :)
[15:03:37] I promise to look at logs tomorrow
[15:03:42] Krenair, you +2'ed the bump before I updated it. It's not a big difference though. You +2'ed X, the update is Revert(Revert(X)). So basically the same thing. But it would be better to do the latter.
[15:04:02] I think the second upload already cancelled the merge.
[15:04:39] superm401, I can read the deployment calendar, I cannot read minds
[15:05:15] Krenair, sorry, I should have mentioned this when I did the revert last night.
[15:05:28] It's just so the bump matches the tip of the Flow release branch.
[15:05:31] right, I see what's going on
[15:06:45] superm401, remind me, are we syncing these as separate patches or as one?
[15:06:50] andrewbogott: also Daniel wanted me to poke you with https://phabricator.wikimedia.org/T94717
[15:06:56] Ops-Access-Requests, operations, Continuous-Integration-Isolation: Grant hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1186041 (hashar) NEW a:RobH
[15:07:11] I have a feeling I double checked this last time
[15:07:16] hashar: why are you assigning me access requests?
[15:07:32] andrewbogott: he thinks Moritz might have to be renamed because of umlaut
[15:07:43] Ops-Access-Requests, operations, Continuous-Integration-Isolation: Grant hashar root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1186055 (RobH) a:RobH>None
[15:07:47] robh: it is a mistake sorry. I created that one as a subtask of another which is assigned to you :D
[15:07:49] Krenair, it doesn't matter, as long as no one tries to use the Flow one first. A sync-dir or scap of all of MediaWiki dir should be fine.
[15:07:51] heh, cool
[15:08:04] just making sure you didnt think i was doing all access requests ;D
[15:09:24] superm401, scap would just be to do the i18n update right?
[15:09:48] Krenair, there is i18n change, but only for one error message that should not happen, so it's necessarily required.
[15:10:06] is or is not?
[15:10:28] Sorry, "it's not".
[15:10:35] I would just skip it.
[15:10:49] Given the nature of the only change to those (new message that should not appear at all if it's done properly), I think it's probably fine to not run the proper i18n for that (as users will not see it)
[15:11:24] Blocked-on-Operations, Ops-Access-Requests, operations: Access to francium - https://phabricator.wikimedia.org/T94093#1186058 (GWicke) >>! In T94093#1186013, @Kelson wrote: > @GWicke > > Can you please give me a root access? PM sent.
[15:13:18] uh oh
[15:13:21] bd808
[15:13:31] (PS1) Nuria: Adding metric 'wikimetrics successful reports' to statsd labs [puppet] - https://gerrit.wikimedia.org/r/202404 (https://phabricator.wikimedia.org/T94193)
[15:13:42] Krenair: what's up?
[15:13:45] 15:12:10 sync-dir failed: Command 'find '/srv/mediawiki-staging/php-1.25wmf24' -name '*.php' -or -name '*.inc' -or -name '*.phtml' -or -name '*.php5' | xargs -n1 -P6 -exec php -l >/dev/null' returned non-zero exit status 124
[15:14:09] yuck
[15:14:33] maybe scap log say smore
[15:14:40] says*
[15:15:04] probably not. The >/dev/null would swallow output
[15:15:15] :(
[15:15:28] Someone can run it manually I guess.
[15:15:41] Do we often sync-dir a whole branch?
[15:15:46] (PS1) Giuseppe Lavagetto: snapshot: fixup for I1863679d and Icc9aba2 [puppet] - https://gerrit.wikimedia.org/r/202406
[15:15:48] (PS1) Giuseppe Lavagetto: standard: include admin wherever needed [puppet] - https://gerrit.wikimedia.org/r/202407 (https://phabricator.wikimedia.org/T86774)
[15:16:06] * aude normally not but can't see why it is a problem
[15:16:06] scap doesn't run the linter on the release branches
[15:16:34] Ew... what the heck is that
[15:16:41] Parse error: syntax error, unexpected T_STRING in /srv/mediawiki-staging/php-1.25wmf24/extensions/DonationInterface/vendor/psr/log/Psr/Log/LoggerTrait.php on line 13
[15:16:49] (I ran it without the >/dev/null)
[15:16:51] ah
[15:16:57] trait?
[15:17:00] php5.3 trying to lint php5.4 files
[15:17:04] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[15:17:04] yeah
[15:17:27] I don't think I can sync this commit by individual files
[15:17:30] ^ 5xx is me, it's transient spike
[15:17:34] RECOVERY - Host mw2128 is UP: PING OK - Packet loss = 0%, RTA = 43.08 ms
[15:17:40] that file is part of the PSR-3 core library but not used by us
[15:18:03] Running a sync-dir for a whole branch seems nasty
[15:18:08] yeah :(
[15:18:14] bd808, fyi it's https://gerrit.wikimedia.org/r/#/c/202258/1 I'm trying to do
[15:18:41] I reviewed it all yesterday, etc. and I expected sync-dir php-1.25wmf24 to do the trick
[15:18:43] That needs a scap anyway
[15:18:49] due to the i18n changes?
[15:18:53] yeah
[15:18:54] yes
[15:18:54] has messages
[15:18:57] we discussed this already
[15:19:23] it adds a new (error) message that should only really happen in development
[15:19:29] looks like I have no choice but to scap though
[15:20:23] bd808, superm401: ok?
[15:20:43] Sure, I have no problem with it being scapped.
[15:21:01] Doesn't bother me. It's been taking ~40m now that we have codfw in the mix
[15:21:15] sigh :(
[15:21:16] which sucks and needs to be looked into
[15:21:23] !log krenair Started scap: https://gerrit.wikimedia.org/r/#/c/202258/1
[15:21:31] Logged the message, Master
[15:21:32] I'll comment about the lint failure at https://gerrit.wikimedia.org/r/#/c/192380/
[15:22:00] we have the same file in mediawiki/vendor
[15:22:12] this is a problem with tin honestly
[15:22:30] tin is running 5.3 and not hhvm as the php interpreter
[15:22:34] PROBLEM - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:48] and editing Composer imports is madness
[15:23:06] also we don't lint the release branch in a scap
[15:23:14] why not
[15:23:22] we assume that Jenkins did that
[15:23:33] haha, whoops
[15:23:33] <_joe_> < bd808> tin is running 5.3 and not hhvm as the php interpreter <== if it's a problem, please open a ticket
[15:23:45] assuming Jenkins did anything...
[15:23:57] _joe_, pretty sure there already is one
[15:24:03] RECOVERY - Host mw2128 is UP: PING OK - Packet loss = 0%, RTA = 42.95 ms
[15:24:22] https://phabricator.wikimedia.org/search/query/2JBP4AKG3rhQ/#R -> https://phabricator.wikimedia.org/T87036
[15:25:06] (CR) Ottomata: [C: 2] Adding metric 'wikimetrics successful reports' to statsd labs [puppet] - https://gerrit.wikimedia.org/r/202404 (https://phabricator.wikimedia.org/T94193) (owner: Nuria)
[15:25:32] <_joe_> Krenair: meh, noone notified/pinged me
[15:25:49] Krenair: do I need to wait until scap finishes? Sauna is warming up ;)
[15:26:17] Ops-Access-Requests, operations, Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1186082 (MoritzMuehlenhoff) Authentication works with Chrome, but fails with Firefox/Iceweasel (https://bugzilla.mozilla.org/show_bug.cgi?id=41489 filed in 2000). Since the...
[15:26:21] bd808, if we're going to start requiring 5.4 features, shouldn't we update PHPVersionError?
[15:26:40] We don't require them or use them
[15:26:42] Nikerabbit, if this takes 40 minutes, you might as well come back later and I'll do your patch later
[15:26:52] This is a dormant file in the PSR-3 library
[15:26:55] bd808, it's pretty subtle to import libraries, parts of which wouldn't run on our system, even if we're currently not using them.
[15:27:01] bd808, easy to start using it later by accident.
[15:27:37] Krenair: in any case, I will be on/off here the evening
[15:27:43] okay
[15:27:44] PROBLEM - DPKG on mw2128 is CRITICAL: Connection refused by host
[15:27:53] PROBLEM - nutcracker process on mw2128 is CRITICAL: Connection refused by host
[15:27:54] PROBLEM - salt-minion processes on mw2128 is CRITICAL: Connection refused by host
[15:27:54] PROBLEM - HHVM processes on mw2128 is CRITICAL: Connection refused by host
[15:28:03] PROBLEM - puppet last run on mw2128 is CRITICAL: Connection refused by host
[15:28:14] PROBLEM - HHVM rendering on mw2128 is CRITICAL: Connection refused
[15:28:14] PROBLEM - RAID on mw2128 is CRITICAL: Connection refused by host
[15:28:24] PROBLEM - configured eth on mw2128 is CRITICAL: Connection refused by host
[15:28:40] I don't want to get into a slippery slope debate about it, but honestly we should update all the misc servers to have a PHP that hasn't been end of life'd upstream. PHP 5.3 is ancient and crusty
[15:28:53] PROBLEM - nutcracker port on mw2128 is CRITICAL: Connection refused by host
[15:28:54] PROBLEM - Disk space on mw2128 is CRITICAL: Connection refused by host
[15:29:01] Ops-Access-Requests, operations, Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1186085 (Dzahn) wow @ HTTP authentication does not support non-ISO-8859-1 characters , still being open since 15 years. thanks for the link! and thanks Andrew for doing...
[15:29:04] PROBLEM - Apache HTTP on mw2128 is CRITICAL: Connection refused
[15:29:04] PROBLEM - dhclient process on mw2128 is CRITICAL: Connection refused by host
[15:29:12] 15:28:37 Finished mw-update-l10n (duration: 06m 06s)
[15:29:12] 15:28:37 Started cache_git_info
[15:29:23] I don't run this command often enough to remember which stage of the process that is
[15:29:31] But that whole like of work is a bit muddied by hhvm/trusty vs php5/jessie vs hhvm/jessie as the options
[15:29:47] bd808, I didn't realize all PHP 5.3 was EOL. All the more reason.
[15:29:47] Krenair: the start
[15:30:09] Krenair: prepping tin before syncing
[15:30:20] it is syncing stuff now
[15:30:28] superm401, do you subscribe to wikitech-l?
[15:30:42] (PS1) RobH: archiva.wikimedia.org certificate [puppet] - https://gerrit.wikimedia.org/r/202411 (https://phabricator.wikimedia.org/T88139)
[15:30:43] Krenair, yeah.
[15:31:22] PHP 5.3.29 was the last 5.3 release. 2014-08-14 -- https://php.net/archive/2014.php#id2014-08-14-1
[15:32:11] The security updates now only come from Debian backports from newer PHP versions
[15:33:41] (CR) RobH: [C: 2] archiva.wikimedia.org certificate [puppet] - https://gerrit.wikimedia.org/r/202411 (https://phabricator.wikimedia.org/T88139) (owner: RobH)
[15:34:15] operations: Force https for archiva.wikimedia.org - https://phabricator.wikimedia.org/T88139#1186123 (RobH)
[15:34:40] Alarm keeps going off here. I'll brb.
[15:35:01] ottomata: the change we did for cirrus serach is still unmerged: https://gerrit.wikimedia.org/r/#/c/200593/
[15:35:05] *search
[15:35:20] operations: Force https for archiva.wikimedia.org - https://phabricator.wikimedia.org/T88139#1186129 (RobH) a:RobH>Ottomata Since this wasn't just a 'buy cert' task, I'm assigning it back to @ottomata. The archiva.wikimedia.org cert and key are now in the pub/private repos. Please ensure you use the G...
[15:36:28] Back
[15:39:54] PROBLEM - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:10] btw bd808, what exactly is happening in the sync-proxies step?
[15:40:30] are we actually pushing files out to those, and asking all other servers to download from them?
[15:40:35] yes
[15:41:10] pulling from tin to them but the point is that we provide a list of possible hosts to the MW servers when they sync
[15:41:12] !log re-weighted pybal esams/upload (all-1 to all-3)
[15:41:18] Logged the message, Master
[15:41:33] There is logic to find the "closest" (first server with fewest network hops) to sync from
[15:42:30] (CR) 20after4: [C: 2] "I think that this is slightly better than the way we've been doing it." [mediawiki-config] - https://gerrit.wikimedia.org/r/134124 (owner: Reedy)
[15:42:31] Krenair: There is a high level list of the operations at -- https://doc.wikimedia.org/mw-tools-scap/api.html#scap.Scap
[15:42:41] RECOVERY - Host mw2128 is UP: PING OK - Packet loss = 0%, RTA = 43.28 ms
[15:42:45] interesting, thanks
[15:43:06] (CR) Chad: "Any other thoughts here?" [puppet] - https://gerrit.wikimedia.org/r/199936 (owner: Chad)
[15:43:53] !log krenair Finished scap: https://gerrit.wikimedia.org/r/#/c/202258/1 (duration: 22m 29s)
[15:43:57] Logged the message, Master
[15:44:10] superm401, hey
[15:44:13] please check
[15:44:39] (Merged) jenkins-bot: Add script to update version of all group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/134124 (owner: Reedy)
[15:44:45] 22.5 minutes ain't bad
[15:45:19] Krenair, is there a recent Wikitech discussion about the PHP 5.4 thing? I don't see anything since October.
[15:45:24] Checking now
[15:45:26] superm401, (core side only, not flow yet)
[15:46:15] superm401, in october yeah
[15:46:51] https://lists.wikimedia.org/pipermail/wikitech-l/2014-October/079151.html etc.
[15:47:15] Krenair, works. Tested with https://www.mediawiki.org/wiki/User_talk:Mattflaschen_%28WMF%29#Test_message
[15:47:28] thank goodness
[15:48:22] That was posted via mediawiki.feedback.
[15:49:03] oh how was it so quick?
[15:49:33] !log krenair Synchronized php-1.25wmf24/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/202262/2 (duration: 00m 14s)
[15:49:33] superm401, please test
[15:49:39] Logged the message, Master
[15:50:13] is scap done?
[15:50:16] yes
[15:50:18] * aude got distracted
[15:50:21] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[15:51:24] Krenair, doesn't work yet. I'll clear cache and try again.
[15:52:47] yeah will probably need a proper refresh
[15:52:52] maybe even a new browser session
[15:54:09] * Krenair waves to stephanebisson
[15:54:41] PROBLEM - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100%
[15:55:01] any luck superm401?
[15:55:02] Krenair, works, thanks.
[15:55:07] https://www.mediawiki.org/wiki/Talk:Sandbox
[15:55:10] great
[15:55:15] Nikerabbit, you still there?
[15:55:41] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:56:42] guess not
[15:56:46] aude
[15:56:49] here
[15:56:51] Krenair: ay ay captain
[15:56:55] oh, ok
[15:57:01] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[15:57:22] (CR) Alex Monk: [C: 2] Add skins to wgLocalisationUpdateRepositories [mediawiki-config] - https://gerrit.wikimedia.org/r/169716 (https://bugzilla.wikimedia.org/67154) (owner: Reedy)
[15:58:37] oh wtf that jenkins queue is ridiculous
[15:58:48] :(
[15:58:58] oh new style, I like it
[15:59:07] I think Krinkle did it
[15:59:20] or contributed in some way
[15:59:31] RECOVERY - Host mw2128 is UP: PING OK - Packet loss = 0%, RTA = 45.30 ms
[15:59:42] Krenair: I contributed our dashboard upstream (Yay) and then OpenStack added new features to it
[16:01:11] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[16:01:40] (CR) Yuvipanda: "All the new stuff? ;)" [puppet] - https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: Tim Landscheidt)
[16:01:41] PROBLEM - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100%
[16:02:20] When I updated the CI stack last week, the dashboard broke a little bit, so it was easiest to pull in all upstream changes
[16:03:00] Which gave us filtering and collapsing, and it now visualises the dependencies
[16:04:54] twentyafterfour, you merged https://gerrit.wikimedia.org/r/#/c/134124/ while I was deploying? :/
[16:09:15] Krenair: still waiting for Jenkins?
[16:09:23] Krinkle, I dislike how it seems to show 0 minutes remaining for everything that's blocking our patch
[16:09:37] for probably the last several minutes
[16:09:41] Krenair: The ETA numbers did not change over the last 4 months
[16:09:44] They were already there
[16:10:03] now I actually notice them :P
[16:10:05] Krenair: The queueing has also not changed. It combined these in the same queue for a few months, but things are slower in general.
[16:10:17] Krenair: It says 0 because the zend job averages 8 minutes
[16:10:22] so it's estimated time is up
[16:10:32] it's taking longer
[16:10:41] if you click on that job it's at 65% phpunit right now
[16:10:42] still running
[16:10:46] probably the VM is a bit slow
[16:11:24] akosiaris: I am pushing changes to the public Transparency Report repository now. https://phabricator.wikimedia.org/T89640
[16:11:29] moizsyed: ^
[16:13:55] (Merged) jenkins-bot: Add skins to wgLocalisationUpdateRepositories [mediawiki-config] - https://gerrit.wikimedia.org/r/169716 (https://bugzilla.wikimedia.org/67154) (owner: Reedy)
[16:14:00] yay
[16:15:14] yay
[16:15:14] ok
[16:15:29] prtksxna: I'm around if needed
[16:15:33] Kind of. Just woke up
[16:15:45] Morning YuviPanda o/
[16:15:56] * aude trying hard not to get distracted
[16:16:00] still here :)
[16:16:06] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/169716/ (duration: 00m 14s)
[16:16:11] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[16:16:13] Logged the message, Master
[16:16:20] YuviPanda: ( -_-)旦~
[16:16:24] Krenair: ok, we are done?
[16:16:36] I think so
[16:16:39] moizsyed: YuviPanda: Still stuck at 23% :P
[16:16:55] !log https://gerrit.wikimedia.org/r/#/c/134124/ was merged but has not been synced
[16:16:58] Logged the message, Master
[16:20:00] !log krenair Synchronized php-1.25wmf24/extensions/Wikidata: https://gerrit.wikimedia.org/r/#/c/202398/1 (duration: 00m 18s)
[16:20:02] aude, please test
[16:20:04] Logged the message, Master
[16:20:16] not much to test but looking
[16:20:56] operations, Datasets-General-or-Unknown, Wikidata, § Wikidata-Sprint-2015-03-24: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1186337 (hoo) @Daniel: Could you have a quick look at this? Looks fixed to me, but I think you're the only one who can tell for...
[16:21:16] think it's ok
[16:21:29] thanks :)
[16:21:47] operations, Datasets-General-or-Unknown, Wikidata, § Wikidata-Sprint-2015-03-24, Wikidata-Sprint-2015-04-07: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1186344 (Tobi_WMDE_SW)
[16:22:47] nuria, aye, thanks I know
[16:22:51] James_F, Krinkle: will do VE next, then changes for DismissableSitenotice
[16:23:04] OK :)
[16:27:44] moizsyed: YuviPanda https://github.com/wikimedia/wikimedia-TransparencyReport/commits/master
[16:34:05] prtksxna, is that supposed to be fully ready to release?
[16:34:14] operations, Patch-For-Review: iridium "standard" conflict with exim in role - https://phabricator.wikimedia.org/T92879#1186421 (chasemp) >>! In T92879#1185727, @Joe wrote: > This is now fixed. > > Please not that this pitfall was clearly stated here: > https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Limi...
[16:34:21] Krenair: Yup
[16:35:28] ottomata: an1021 kafka broker alert
[16:35:47] ottomata: also an1020 is out of commission, hardware issue, T95263
[16:36:02] Jeff_Green: what's with the barium/backup4001 diskspace warnings?
[16:37:55] paravoid: looking
[16:38:12] ACKNOWLEDGEMENT - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100% Faidon Liambotis T95264
[16:39:54] James_F, finally
[16:40:01] Fun.
[16:40:06] paravoid: should be better now
[16:40:23] Jeff_Green: yup!
[16:40:48] Krenair: (Yay debug=true)
[16:41:33] !log krenair Synchronized php-1.25wmf24/extensions/VisualEditor/lib/ve: https://gerrit.wikimedia.org/r/#/c/202400/ (duration: 00m 12s)
[16:41:33] James_F, please check
[16:41:40] Logged the message, Master
[16:41:48] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1186462 (Prtksxna) The report is [[ http://transparency.wikimedia.org/ | public ]] now. Thanks @akosiaris @MSyed @yuvipanda!
[16:42:07] down to 6 active alerts in total
[16:42:10] prtksxna: yw
[16:42:15] almost there
[16:42:22] maybe we can reach 3 tomorrow
[16:42:43] Number of active alerts is CRITICAL: 66.666666667% below threshold
[16:43:29] YuviPanda: You can take away our push rights now.
[16:43:34] not enough 6s
[16:43:37] but you can't
[16:43:38] take away
[16:43:41] our FREEEEEEEEEEEEEEEEDOM
[16:44:01] ori: prtksxna isn't in the USA
[16:44:26] one at a time! we just got you, right?
[16:44:39] :P
[16:45:10] manybubbles: hey, here?
[16:45:16] lol
[16:45:29] paravoid: in three hours I'll be fully paying attention
[16:45:30] James_F, looks good to me...
[16:45:35] manybubbles: heh ok
[16:45:38] And me.
[16:45:44] manybubbles: I'm trying to figure out the CirrusSearch-slow.log_line_rate alert
[16:45:56] prtksxna: I am happy that everything worked out well :-)
[16:45:57] paravoid: might be able to ignore it - its a bit sensitive
[16:46:01] manybubbles: the "metric goes over 0.000046296 / second" part...
[16:46:10] paravoid: its too sensitive
[16:46:22] akosiaris: Me too. Thanks for simplifying the process to begin with.
[16:46:31] manybubbles: yes, I'd like to bump it a bit, but I don't understand the where the current values come from tbh
[16:46:44] paravoid: will respond in 15 minutes
[16:46:49] manybubbles: ok :)
[16:47:00] :-)
[16:47:00] manybubbles: not urgent, obviously
[16:48:17] Krinkle
[16:48:21] !log krenair Synchronized php-1.25wmf24/includes/skins/Skin.php: https://gerrit.wikimedia.org/r/#/c/202313/ (duration: 00m 12s)
[16:48:24] Logged the message, Master
[16:48:28] Krenair: Yup.
[16:48:39] please check :)
[16:49:32] operations, Labs, hardware-requests: eqiad: (6) labs virt nodes - https://phabricator.wikimedia.org/T89752#1186497 (Cmjohnson) labvirt1001 wmf4669 u13/14 ge-3/0/8 labvirt1002 wmf4670 u15/16 ge-3/0/20 labvirt1003 wmf4671 u17/18 ge-3/0/21 labvirt1004 wmf4672 u33/34 ge-5/0/18 labvirt1005 wmf4673 u35/36...
[16:49:43] Krenair: Aye, my example test was a 23 wiki. Will find a 24 wiki now
[16:50:01] oh testwiki is affected, good
[16:50:44] Krenair: Yep, fixed on wmf24 / testwiki
[16:51:12] !log krenair Synchronized php-1.25wmf23/includes/skins/Skin.php: https://gerrit.wikimedia.org/r/#/c/202391/ (duration: 00m 13s)
[16:51:16] Logged the message, Master
[16:51:18] Krinkle, please check
[16:52:32] Krenair: Fixed on wmf23 / frwiki
[16:52:41] great
[16:53:25] Krinkle, what should we do about the Title::newFromText stuff? move to evening swat?
[16:53:39] Krenair: Was it not scheduled?
[16:53:49] it was, but we're an hour over schedule
[16:54:08] Krenair: Well, if we do it in the evening, we won't see the results until tomorrow.
[16:54:14] true
[16:54:15] the rare ones anyway
[16:54:17] Let's do it now :)
[16:54:26] (if you're up for it now)
[16:54:27] we have another hour until train deploy, so I'll do it now
[16:54:35] Ah right
[16:54:51] RECOVERY - Host ganeti1003 is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms
[16:55:25] Krinkle, I take it you don't mind that I didn't sync-file the credits :)
[16:55:33] or was it the release notes
[16:55:36] I think it was the release notes
[16:55:38] Sure, fine :)
[16:56:06] operations, ops-eqiad: ganeti1003 DIMM problem - https://phabricator.wikimedia.org/T94825#1186529 (Cmjohnson) Open>Resolved DIMM swapped error did not re-appear. Sending Dell part back FEDEX Track is 9611918 2393026 47880909 USPS 9202 3946 5301 2426 3897 94
[16:56:41] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:56:50] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:57:30] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:59:17] operations, ops-eqiad, ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1186561 (Cmjohnson) The disk that failed was an add-on of a 3TB disk and not covered under warranty. We do not have any 3TB disks on-site to swap out and will need to order more.
[17:01:26] operations, ops-eqiad, ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1186565 (Cmjohnson) The disk that failed was an add-on of a 3TB disk and not covered under warranty. We do not have any 3TB disks on-site to swap out and will need to order more. https://rt.wik...
[17:02:19] (CR) Glaisher: [C: 1] Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: Nemo bis)
[17:05:19] operations, Security, Wikimedia-Shop, HTTPS, Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1186570 (vshchepakina) James is correct: We understand that we cannot use store.wikipedia.org, and we are OK with that. Now we are just asking we switc...
[17:06:10] operations, ops-eqiad: labnodepool1001 setup tasks: labels/ports/racktables - https://phabricator.wikimedia.org/T95048#1186588 (RobH) Resolved>Open My mistake, this has to be in labs nodes vlan, so it needs to be in Row B and in the labs hosts vlan. Chatted with chris via irc so he is aware of this.
[17:06:13] operations, Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1186590 (RobH)
[17:12:50] (CR) Yuvipanda: "No variables were renamed - _server was just an intermediary that was used to allow hiera overrides because you can not reassign variables" [puppet] - https://gerrit.wikimedia.org/r/202278 (https://phabricator.wikimedia.org/T95240) (owner: Yuvipanda)
[17:17:39] Krinkle, so I think everything got merged
[17:19:00] PROBLEM - mailman I/O stats on sodium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=92.80 Read Requests/Sec=51.50 Write Requests/Sec=29.40 KBytes Read/Sec=2142.40 KBytes_Written/Sec=302.40
[17:20:59] Krenair: cool
[17:21:27] !log krenair Synchronized php-1.25wmf24/includes/api/ApiQuerySiteinfo.php: https://gerrit.wikimedia.org/r/#/c/202333/ (duration: 00m 10s)
[17:21:32] Logged the message, Master
[17:23:20] Krinkle, https://gerrit.wikimedia.org/r/#/c/202333/ doesn't appear to have had an effect :/
[17:24:37] oh right
[17:24:40] because I'm stupid
[17:25:21] !log krenair Synchronized php-1.25wmf24/includes/api/ApiQuerySiteinfo.php: actually apply the change this time (duration: 00m 11s)
[17:25:26] Logged the message, Master
[17:25:39] that's better
[17:26:59] !log krenair Synchronized php-1.25wmf23/includes/api/ApiQuerySiteinfo.php: https://gerrit.wikimedia.org/r/#/c/202332/ (duration: 00m 13s)
[17:27:02] Logged the message, Master
[17:27:34] Krinkle, okay so that line is gone from the new logs now
[17:27:50] Krenair: Hm. did you skip it earlier?
[17:27:58] I forgot to run the rebase part :)
[17:28:09] Only did git log HEAD..origin/wmf/1.25wmf24, then sync
[17:28:14] no git rebase origin/wmf/1.25wmf24
[17:28:20] so it did nothing
[17:28:21] (PS1) RobH: setting the mgmt and production dns entries for labnodepool1001 [dns] - https://gerrit.wikimedia.org/r/202446 (https://phabricator.wikimedia.org/T95045)
[17:28:38] Krenair: Hm.. you didnt use git pull?
[17:28:44] that does rebase right
[17:29:15] operations, Continuous-Integration-Isolation, Patch-For-Review: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1186679 (RobH)
[17:29:16] Krinkle, no...
[17:29:51] (CR) RobH: [C: 2] setting the mgmt and production dns entries for labnodepool1001 [dns] - https://gerrit.wikimedia.org/r/202446 (https://phabricator.wikimedia.org/T95045) (owner: RobH)
[17:29:52] On tin it's configured to do rebase I htink
[17:29:59] To preserve security patches
[17:30:20] The instructions are a bit unclear on this
[17:30:57] You must run fetch and log to detect any other changes which aren't supposed to be there
[17:31:24] No patches on core, currently. Ping me about extensions if you need patches.
[17:31:26] and you must also run rebase in case there are live hack commits (e.g. security patches) that need to be preserved
[17:31:44] I usually do git remote update; git log ..origin
[17:31:49] but I think most people do git pull
[17:31:52] csteipp, we're still rebasing over some on wmf23 actually, despite them all being public in wmf24
[17:31:59] operations, Continuous-Integration-Isolation, Patch-For-Review: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1186702 (RobH)
[17:32:00] Oh, sorry :)
[17:32:17] Either way the instructions are messed up
[17:32:33] The stuff it says you must do, does everything you need
[17:32:47] but it suggests git pull before any of that
[17:32:50] I blame bad wiki editors
[17:33:10] Krinkle, alright this looks fine, am going to do the fun part
[17:33:33] oh?
[17:33:57] Krinkle, the includes/MediaWiki.php change
[17:34:07] to fix csteipp's param order for Title::newFromText
[17:35:28] !log krenair Synchronized php-1.25wmf24/includes/MediaWiki.php: https://gerrit.wikimedia.org/r/#/c/202301/ (duration: 00m 15s)
[17:35:33] Logged the message, Master
[17:36:17] works fine
[17:36:23] fixes logs as expected
[17:36:27] still doing swat?
[17:36:41] Yeah :/ [17:36:44] ugh [17:37:16] sort of [17:37:23] Krenair: wmf23 is still flooding quite rapidly [17:37:32] yep [17:37:35] !log krenair Synchronized php-1.25wmf23/includes/MediaWiki.php: https://gerrit.wikimedia.org/r/#/c/202302/ (duration: 00m 12s) [17:37:38] Logged the message, Master [17:37:48] Not seeing any new wmf24 entries thoguh [17:37:59] Krinkle, now it should be better [17:38:06] (03PS1) 10RobH: setting labnodepool1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/202450 (https://phabricator.wikimedia.org/T95045) [17:38:20] works fine [17:38:24] no more log spam [17:38:40] still the odd entry we should clean up, but the main offenders are gone [17:38:46] Yeah [17:38:53] let's see if anything pops up in the next few hours [17:39:14] yep [17:39:21] will keep an eye on the logs, am done with deployments [17:39:56] (03CR) 10RobH: [C: 032] setting labnodepool1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/202450 (https://phabricator.wikimedia.org/T95045) (owner: 10RobH) [17:40:14] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1186746 (10RobH) [17:41:37] twentyafterfour, remember you need to do https://gerrit.wikimedia.org/r/#/c/134124/ :) [17:42:01] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1186777 (10RobH) [17:42:54] * Krenair returns to inbox madness [17:44:36] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3§ Wikidata-Sprint-2015-03-24, 10Wikidata-Sprint-2015-04-07: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1186803 (10daniel) Fore redirects, the encoding {"entity"} is correct. There is no "old... 
[17:47:25] 6operations, 7Graphite, 5Patch-For-Review: replace txstatsd - https://phabricator.wikimedia.org/T90111#1186832 (10Krinkle) Here are the properties the Nagf dashboard for Labs uses: https://github.com/wikimedia/nagf/blob/b91fb1622/inc/Nagf.php#L48-L109 https://tools.wmflabs.org/nagf/ https://tools.wmflabs.or... [17:47:42] Krenair: I'll deploy it when my window opens ~13 minutes from now [17:48:53] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1186847 (10Andrew) [17:49:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1170954 (10Andrew) [17:51:22] 6operations, 6Collaboration-Team, 10Flow: Flow Exception Caught: DB connection error: Can't connect to MySQL server - https://phabricator.wikimedia.org/T95121#1186886 (10EBernhardson) 5Open>3Invalid a:3EBernhardson If this happens again we can reopen ticket. Possibly a one-off error that might just be... [17:52:34] 6operations, 10ops-eqiad: labnodepool1001 setup tasks: labels/ports/racktables - https://phabricator.wikimedia.org/T95048#1186900 (10RobH) port moved to row b ge-3/0/16 per chris's update via irc and set in labs-hosts vlan. however, that isn't what its attempting to hit the dhcp server on [17:55:50] (03CR) 10Legoktm: [C: 032] Introduce Tool objects [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202293 (owner: 10Yuvipanda) [17:55:59] (03Merged) 10jenkins-bot: Introduce Tool objects [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202293 (owner: 10Yuvipanda) [17:57:02] 6operations, 10ops-eqiad: labnodepool1001 setup tasks: labels/ports/racktables - https://phabricator.wikimedia.org/T95048#1186927 (10RobH) I can see the system hit dhcp on the analytics vlan, but I cannot find the mac address in the ethernet switching table for the mac address 84:2b:2b:fd:be:fd on the row b sw... 
[17:57:34] (03CR) 10Legoktm: [C: 04-1] "Additional dependencies should be specified in a requirements.txt" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202318 (https://phabricator.wikimedia.org/T95256) (owner: 10Yuvipanda) [17:58:58] (03CR) 10Legoktm: [C: 04-1] "skipsdist = True should be removed from tox.ini now." [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202320 (owner: 10Yuvipanda) [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150407T1800). Please do the needful. [18:00:20] (03CR) 10Legoktm: [C: 031] Rename package to tools.manifest [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202323 (owner: 10Yuvipanda) [18:02:36] ottomata: ping? [18:02:56] !log twentyafterfour Synchronized multiversion/updateGroup1: Deploying https://gerrit.wikimedia.org/r/#/c/134124/ (duration: 00m 12s) [18:02:59] Logged the message, Master [18:03:18] (03CR) 10Legoktm: Handle webservice calls erroring out (031 comment) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202342 (owner: 10Yuvipanda) [18:05:36] paravoid: hiya [18:05:42] ottomata: hey [18:05:50] ottomata: an1021 kafka broker alert [18:05:53] ottomata: also an1020 is out of commission, hardware issue, T95263 [18:05:53] haha [18:06:00] ! [18:06:01] ottomata: also stat1002 alerts [18:06:06] looking [18:06:07] geez [18:06:34] (03CR) 10Yuvipanda: "I'll add a requirements.txt file in the same patch as setup.py" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202318 (https://phabricator.wikimedia.org/T95256) (owner: 10Yuvipanda) [18:06:42] :) [18:07:34] geez ok. [18:08:24] paravoid: did you get an alert email about an21? or just see it in icinga? 
[18:08:35] 13:14 < icinga-wm> PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 817.342198142 [18:08:59] I also have an icinga tab open somewhere almost permanently [18:09:20] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1934.10432325 [18:09:24] paravoid, if you ever feel like it, you may run [18:09:25] kafka preferred-replica-election [18:09:29] from any of the brokers [18:09:33] pretty much at any time [18:09:35] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration#Replica_Elections [18:10:49] 6operations, 10ops-eqiad, 10Analytics-Cluster: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1186954 (10Ottomata) a:3Cmjohnson [18:11:26] 6operations, 10ops-eqiad, 10Analytics-Cluster: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1184828 (10Ottomata) @Cmjohnson could you take a look at this? Danke! 
[18:13:51] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient [18:14:04] (03CR) 10coren: Better symlink race protection (031 comment) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202297 (https://phabricator.wikimedia.org/T95210) (owner: 10Yuvipanda) [18:14:12] (03CR) 10coren: Handle webservice calls erroring out (031 comment) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202342 (owner: 10Yuvipanda) [18:14:22] 6operations, 7network: AS3491 peering change - https://phabricator.wikimedia.org/T95322#1186972 (10Andrew) 3NEW [18:14:32] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:16:11] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: puppet fail [18:16:37] (03CR) 10Yuvipanda: Better symlink race protection (031 comment) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202297 (https://phabricator.wikimedia.org/T95210) (owner: 10Yuvipanda) [18:17:18] (03PS1) 1020after4: Group1 wikis to 1.25wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202459 [18:17:56] (03CR) 10coren: [C: 031] Validate tool accounts before accepting them [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202305 (owner: 10Yuvipanda) [18:18:57] (03CR) 10coren: "I know this only monitors web services for now, but is the continuous job monitoring intended to go in a different class in the end?" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202343 (owner: 10Yuvipanda) [18:19:00] RECOVERY - Disk space on stat1002 is OK: DISK OK [18:19:56] (03CR) 10Yuvipanda: "Yup! It'll also have different syntax as well - named 'workers' similar to https://devcenter.heroku.com/articles/procfile" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202343 (owner: 10Yuvipanda) [18:20:06] (03CR) 10coren: [C: 031] "Yes, thank you. The pun made me chuckle, then blanch a little. 
:-)" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202323 (owner: 10Yuvipanda) [18:20:13] (03CR) 1020after4: [C: 032] Group1 wikis to 1.25wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202459 (owner: 1020after4) [18:20:49] (03CR) 10Yuvipanda: "I will claim lack of american / western historical knowledge in defense :)" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202323 (owner: 10Yuvipanda) [18:21:18] mutante: I can't +2 your one-line https://gerrit.wikimedia.org/r/#/c/199581/ DNS change, who can? bblack or the departent store mogul JohnLewis ? [18:22:03] spagewmf: I wish I could +2 there :p [18:22:27] andrewbogott is on ops duty so ask him [18:23:08] JohnLewis: thanks, I saw your name on https://gerrit.wikimedia.org/r/#/projects/operations/dns,dashboards/default and love the [18:23:19] "Never knowingly undersold" motto :) [18:25:01] (03CR) 10Andrew Bogott: [C: 032] point dev.wikimedia to cluster, not misc-web [dns] - 10https://gerrit.wikimedia.org/r/199581 (https://phabricator.wikimedia.org/T372) (owner: 10Dzahn) [18:25:28] (03Merged) 10jenkins-bot: Group1 wikis to 1.25wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202459 (owner: 1020after4) [18:26:01] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [18:26:13] andrewbogott <3 [18:26:19] spagewmf: welcome, connection went :p [18:27:01] also what do you mean you saw my name spagewmf? 
[18:28:20] 10ops-fundraising, 10Fundraising-Backlog: Need Civi access for Donor Services agent - https://phabricator.wikimedia.org/T95011#1187013 (10Jgreen) [18:28:31] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: puppet fail [18:28:53] 10ops-fundraising, 10Fundraising-Backlog: Need Civi access for Donor Services agent - https://phabricator.wikimedia.org/T95011#1187014 (10Jgreen) a:5atgo>3Jgreen [18:28:54] JohnFLewis: I mean I clicked the magnifying glass icon on the gerrit patch, that took me to the operations/dns dashboard, and there I saw a trusted familiar name :) [18:30:06] spagewmf: oh I see :) yeah I submit patches and review them though I don't have the final authority to kill stuff ;) [18:31:41] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [18:33:21] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [18:33:51] (03PS2) 10Alexandros Kosiaris: ssh: remove lucid special casing for authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202394 [18:33:53] (03PS2) 10Alexandros Kosiaris: ssh: allow parameterization of authorized_keys [puppet] - 10https://gerrit.wikimedia.org/r/202392 [18:33:55] (03PS2) 10Alexandros Kosiaris: sodium: specify the position of authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202393 [18:39:23] (03CR) 10coren: [C: 031] Better symlink race protection [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202297 (https://phabricator.wikimedia.org/T95210) (owner: 10Yuvipanda) [18:44:16] (03PS1) 10Ori.livneh: Update eventlogging service alias to point to eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/202464 [18:44:19] ottomata: yt? ^ [18:45:30] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:46:22] twentyafterfour: still deploying? 
[18:46:29] no hurry [18:49:13] (03PS1) 1020after4: Add git-new-workdir to tin (via deployment::deployment_server) [puppet] - 10https://gerrit.wikimedia.org/r/202467 [18:49:19] 6operations, 10ops-codfw: prepare equipment list for eqdfw - https://phabricator.wikimedia.org/T91077#1187044 (10Papaul) @RobH: The PDU's were shipped with no power cords and there are using the same power cords cr1-codfw and cr2-codfw are using (nema 5-15p to c19). Before shipping we need to purchase those p... [18:49:47] papaul: good catch! ^ [18:50:03] we'll order the longer kind since its a caching site and we wont really know the setup [18:50:09] just safer to be longer [18:50:13] aude: almost done [18:50:17] (03CR) 10jenkins-bot: [V: 04-1] Add git-new-workdir to tin (via deployment::deployment_server) [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4) [18:50:31] I paused to submit that patch ^ :-/ [18:50:37] 6operations, 10ops-codfw: prepare equipment list for eqdfw - https://phabricator.wikimedia.org/T91077#1187048 (10RobH) good catch, we'll order the longer (6') type, since thats the standard and having some slack at the peering site won't be an issue. [18:50:43] twentyafterfour: ok [18:51:16] robh: ok [18:51:18] (03PS2) 1020after4: Add git-new-workdir to tin (via deployment::deployment_server) [puppet] - 10https://gerrit.wikimedia.org/r/202467 [18:51:21] papaul: its not nema though [18:51:28] the nema connector is the standard wall outlet [18:51:31] we use c13 to c19 [18:51:36] ok [18:51:46] sorry, c14 [18:51:55] c14 is the side witht he male ends [18:51:56] c14 to c19? [18:51:58] yep [18:52:00] ok [18:52:36] its two per mc80 right? 
[18:52:40] mx80 even [18:52:40] yes [18:52:52] so we need 4 [18:53:15] fyi: this is where im pulling the pricing [18:53:15] * MC8 blinks [18:53:16] http://www.stayonline.com/power-iec-c14-c19-cords.aspx [18:53:23] just nice to see what im talking about [18:54:04] ori, hiya [18:54:07] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.25wmf24 [18:54:12] aude: done [18:54:18] Logged the message, Master [18:54:24] ottomata: loose end from EL migration G202464 [18:54:33] errr https://gerrit.wikimedia.org/r/#/c/202464 [18:54:38] twentyafterfour: great [18:54:59] papaul: wait, so how did they get plugged in @ codfw for testing? [18:55:49] oh, huh k [18:56:07] huh, didn't know that was a thing, ori, hafnium grahpite consumer should use that then , eh? [18:56:09] RobH: you are referring to the mx80 or pdu's? [18:56:11] (03CR) 10Ottomata: [C: 032] Update eventlogging service alias to point to eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/202464 (owner: 10Ori.livneh) [18:56:13] mx80s [18:56:14] ottomata: yep [18:56:25] the pdus you dont really ahve a way to test [18:56:45] robh:no [18:57:05] i can use the short cable that came with cr2-codfw [18:57:21] so you had a single power cable to fire up both mx80s that are being shipped? [18:57:24] ori, done. [18:57:29] well, merged your change. [18:57:52] (03PS3) 10Yuvipanda: Rename package to tools.manifest [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202323 [18:57:52] is that enough, or is some action required? isn't there some additional thing that needs to be done? 
[18:57:54] (03PS2) 10Yuvipanda: Add minimial setup.py & requirements.txt [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202320 [18:57:56] (03PS2) 10Yuvipanda: Handle webservice calls erroring out [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202342 [18:57:56] papaul: and i was wrong when i correcte dyou about nema [18:57:58] (03PS2) 10Yuvipanda: Rename service monitor to web service monitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202343 [18:58:04] i forgot that the horizontal PDUs we got were nema [18:58:15] robh: lol [18:58:23] cuz we never ordered them before [18:59:09] robh: the mx80s are already in plunged in there are using the regular cables that i use to plug the servers [19:00:12] the mx80s use c19 plugs [19:00:16] right? [19:00:22] how can they use normal c13 server cables? [19:01:23] robH: no the mx80 are not using the c19 [19:01:40] ok, you can see why that is contridictory to 'robh: the mx80s are already in plunged in there are using the regular cables that i use to plug the servers ' [19:01:41] ;D [19:01:58] I'll put in for nema20 to c19 cables [19:02:04] glad you noticed we needed them [19:02:07] cuz i totally forgot. [19:02:22] ok, who's got experience setting up nginx ssl sites with our puppet configs? [19:02:34] i'm poking around and getting something, but not sure which pieces to use [19:02:46] ori: yes, i ran authdns-update [19:02:52] ottomata: cool, thanks [19:02:55] yup! 
:) [19:05:23] (03PS4) 10Yuvipanda: Rename package to tools.manifest [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202323 [19:05:25] (03PS3) 10Yuvipanda: Handle webservice calls erroring out [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202342 [19:05:27] (03PS3) 10Yuvipanda: Rename service monitor to web service monitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202343 [19:06:15] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1187062 (10RobH) [19:06:47] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179075 (10RobH) network switch port setup has an issue, described in sub-task T95048, once resolved installation can continue. [19:11:57] robh: right now i am using c13 to c14 cables to plug he mx80 to codfw for me to be able to plug the mx80 to the pdu's you order i will need nema 5-15p to c13 and to plug the pdu's to there power i will need nema 5-15p to c14 [19:12:02] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: puppet fail [19:12:04] yep [19:12:14] look at the picture i send you [19:16:22] PROBLEM - mailman I/O stats on sodium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=174.50 Read Requests/Sec=50.60 Write Requests/Sec=60.90 KBytes Read/Sec=2656.00 KBytes_Written/Sec=1009.65 [19:30:21] (03PS1) 10Nemo bis: [English Planet] Add Timo Tijhof (second blog) [puppet] - 10https://gerrit.wikimedia.org/r/202471 [19:31:57] (03CR) 10Dzahn: "@Krinkle, you want that added?" 
[puppet] - 10https://gerrit.wikimedia.org/r/202471 (owner: 10Nemo bis) [19:32:12] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:17] (03PS1) 10Yuvipanda: [WIP] Write logs on tool's homedir more securely [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 [19:32:22] RECOVERY - Host mw2128 is UP: PING WARNING - Packet loss = 93%, RTA = 45.89 ms [19:32:22] Coren: ^ WIP patch [19:32:36] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Write logs on tool's homedir more securely [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 (owner: 10Yuvipanda) [19:32:38] I’ll probably want to clean that up to be more python-y (passing a callback seems more rubyish/js-ey) [19:35:24] (03PS2) 10Yuvipanda: [WIP] Write logs on tool's homedir more securely [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 [19:36:08] (03PS1) 10Andrew Bogott: Include labs ldap client config on holmium [puppet] - 10https://gerrit.wikimedia.org/r/202473 [19:36:29] (03CR) 10Krinkle: "I don't intend to use it much. It's a CodePen feed where I'd blog about general engineering things. Not related to Wikimedia in any way." 
[puppet] - 10https://gerrit.wikimedia.org/r/202471 (owner: 10Nemo bis) [19:36:40] (03PS1) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [19:37:23] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1187197 (10Dzahn) [19:37:42] (03CR) 10Andrew Bogott: [C: 031] base: add nmap to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/201725 (owner: 10Dzahn) [19:39:02] PROBLEM - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Moritz Muehlenhoff in ops - https://phabricator.wikimedia.org/T94717#1170954 (10Dzahn) Andrew has renamed the user, thanks! Moritz reported there is an issue with the login on Gerrit. Is that resolved? [19:39:31] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) (owner: 10Ottomata) [19:39:47] (03CR) 10Andrew Bogott: [C: 032] Include labs ldap client config on holmium [puppet] - 10https://gerrit.wikimedia.org/r/202473 (owner: 10Andrew Bogott) [19:40:39] (03PS2) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [19:42:12] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) (owner: 10Ottomata) [19:44:22] (03PS3) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [19:44:32] (03CR) 10Yuvipanda: WIP: Proper labs_storage class 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [19:45:07] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) (owner: 10Ottomata) [19:45:25] Coren: reviewing PS3 now [19:46:29] (03PS4) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [19:47:10] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1187230 (10Dzahn) >>! In T30339#1185679, @Andrew wrote: > Since bugzilla is a dead end and everything was migrated to phab, I don't understand what this is for. Can someone explai... [19:47:17] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) (owner: 10Ottomata) [19:48:07] 6operations, 6Commons, 6Multimedia: 220px of Fachada_e_lateral_da_Catedral_São_Sebastião_após_pintura,_Coronel_Fabriciano_MG.JPG not purging - https://phabricator.wikimedia.org/T95333#1187236 (10Bawolff) 3NEW [19:48:11] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1187245 (10Dzahn) I suggest closing this as duplicate of T85141 or rejected. it's something in between. 
[19:48:29] 6operations, 6Commons, 6Multimedia: 220px of Fachada_e_lateral_da_Catedral_São_Sebastião_após_pintura,_Coronel_Fabriciano_MG.JPG not purging - https://phabricator.wikimedia.org/T95333#1187247 (10Bawolff) [19:49:30] (03PS5) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [19:50:55] (03PS6) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [19:51:29] (03CR) 10Yuvipanda: [C: 04-1] "The bash scripts still need a lotta inline docs as to what they're doing, I think." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [19:51:36] Coren: ^ I’m heading to the office now [19:51:52] (03CR) 10Chad: "One inline nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4) [19:51:55] Ah, right. I'll have those full of # for you [19:53:14] (03CR) 1020after4: "ok" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4) [19:55:14] 6operations, 10ops-codfw: mw2128 not rebooting after network driver crash, blank console - https://phabricator.wikimedia.org/T95264#1187274 (10Papaul) The system was rebooting in PXE mode and it was a blank black page. After discussing with Joe on IRC, I setup the boot option to boot first to C drive and NIC i... [19:57:19] Coren: can you also look at the seteuid in https://gerrit.wikimedia.org/r/#/c/202472/ and see if that seems sane? [19:59:40] (03CR) 10coren: [C: 031] "Should be okay, with one inline comment picking a minor nit." 
(031 comment) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 (owner: 10Yuvipanda) [19:59:50] (03PS3) 1020after4: Add git-new-workdir to tin (via deployment::deployment_server) [puppet] - 10https://gerrit.wikimedia.org/r/202467 [19:59:59] Coren: ah, fair enough :) [20:00:40] In practice, it wouldn't change anything in our context because this runs with ruid=euid=0 [20:01:37] Coren: yeah [20:01:47] Coren: should I setegid as well? [20:01:53] hmm, I probably should [20:02:27] hmm [20:02:32] processes don’t really have ‘groups' [20:02:44] (03PS2) 10Ori.livneh: Add a script for storing NavTiming metrics using RRD [puppet] - 10https://gerrit.wikimedia.org/r/202362 [20:02:45] Not unless you also setgroups() to do it cleanly. File permissions will normally override anyways. [20:02:58] 6operations, 6Commons, 6Multimedia: 220px of Fachada_e_lateral_da_Catedral_São_Sebastião_após_pintura,_Coronel_Fabriciano_MG.JPG not purging - https://phabricator.wikimedia.org/T95333#1187295 (10Bawolff) [20:03:51] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 1 failures [20:04:14] YuviPanda: In this case, probably not worth it. [20:04:43] Coren: right. [20:04:54] * Coren ponders. [20:05:05] I wonder if I should use seteuid instead of sudo for the webservice call as well [20:05:29] Actually, I'm not sure we audit our file permissions against root group write, so even if you don't setgroup() I'd setegid() to the primary group anyways. [20:05:43] We dont no [20:06:12] YuviPanda: You should, but changing the uid for real is a bit tricker and requires a fork() to do right. [20:06:29] YuviPanda: The gain is no dependency on sudo and therefore ldap. [20:06:54] YuviPanda: I can show you the C code to do it, but I'm not sure how to do it in python properly. 
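A minimal sketch of the temporary seteuid/setegid switch discussed above, written as a context manager rather than a callback. It is not the code in the WIP patch; the name `effective_ids` is hypothetical. It assumes the caller starts with enough privilege to switch to the target ids (for example ruid=euid=0, as Coren notes), and it does not touch supplementary groups, which would need `os.setgroups()`.

```python
import os
from contextlib import contextmanager


@contextmanager
def effective_ids(uid, gid):
    """Temporarily run the enclosed block with the given effective uid/gid."""
    saved_uid, saved_gid = os.geteuid(), os.getegid()
    # Change the group first: once the effective uid is unprivileged,
    # the process may no longer be allowed to change its effective gid.
    os.setegid(gid)
    try:
        os.seteuid(uid)
        try:
            yield
        finally:
            # Restore in reverse order: regain the original euid first.
            os.seteuid(saved_uid)
    finally:
        os.setegid(saved_gid)
```

A caller would wrap only the file operations that must happen as the tool user, for instance `with effective_ids(tool_uid, tool_gid): ...`, where `tool_uid` and `tool_gid` are placeholders for values looked up elsewhere.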
[20:07:38] (Also: much, much faster because you avoid invoking sudo entirely which is a whole exec() and so one) [20:08:22] yeah [20:12:02] (03PS7) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [20:13:57] (03CR) 10Thcipriani: "typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4) [20:18:42] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:21:27] Coren: yeah no dependency on LDAP would be great :D can you do the c snippet? I'm sure I can translate [20:21:37] YuviPanda: Sure thing. [20:22:19] (03PS3) 10Ori.livneh: Add a script for storing NavTiming metrics using RRD [puppet] - 10https://gerrit.wikimedia.org/r/202362 [20:22:25] Coren: I'm hoping to move all webservices over by end of week [20:23:32] (03PS1) 10Ori.livneh: Make ircecho deduplicate statuses on all lines in buffer [puppet] - 10https://gerrit.wikimedia.org/r/202581 [20:23:37] YuviPanda: ^ [20:23:51] :D [20:24:46] ori: can you add a comment? [20:25:01] Hmm this whole thing is a pile in need of rewriting anyway... [20:25:04] So maybe not [20:25:26] too lazy [20:25:48] Hehe [20:25:59] I'll merge shortly a [20:26:14] grrrit-wm lagged? [20:26:44] (03CR) 10Ori.livneh: [C: 032] Make ircecho deduplicate statuses on all lines in buffer [puppet] - 10https://gerrit.wikimedia.org/r/202581 (owner: 10Ori.livneh) [20:28:06] something is lagged [20:28:11] (03PS4) 1020after4: Add git-new-workdir to tin (via deployment::deployment_server) [puppet] - 10https://gerrit.wikimedia.org/r/202467 [20:30:22] oh, durr. YuviPanda, sorry, I didn't mean to merge that. [20:30:28] it's good to go, but still accidental. 
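Coren's C snippet does not appear in this log. As a stand-in, here is a minimal Python sketch of the fork-then-drop approach mentioned at 20:06:12, which avoids sudo entirely. The function name `run_as` is hypothetical, and the guard around `os.setgroups()` is an assumption added so the sketch can also be exercised without root.

```python
import os


def run_as(uid, gid, argv):
    """Fork, permanently drop to uid/gid in the child, then exec argv.

    Returns the child's exit status, or the negated signal number if the
    child was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child: every path out of this block ends the process.
        try:
            if os.geteuid() == 0:
                # Only root may reset supplementary groups.
                os.setgroups([gid])
            # Group before user: after setuid() to an unprivileged uid
            # the process could no longer change its gid.
            os.setgid(gid)
            os.setuid(uid)
            os.execvp(argv[0], argv)
        finally:
            # Reached only if dropping privileges or exec failed.
            os._exit(127)
    # Parent: keeps its own ids and waits for the child.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Because the privilege change happens only in the forked child, the parent never has to regain its ids. On Python 3.9 and later, `subprocess.run` also accepts `user=`, `group=` and `extra_groups=`, which covers the same ground without a hand-written fork.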
[20:31:04] (03CR) 10Ori.livneh: [C: 032] Blackhole the slow parse log on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202325 (owner: 10Ori.livneh) [20:32:37] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1187424 (10Dzahn) ``` #!/bin/bash # delete security bugs from bugzilla database for T85141 # secbugs.list is P440 # dzahn 20150407 declare -a tables=(bugs bugs_activity bugs_fulltext... [20:32:46] (03PS1) 10Andrew Bogott: Have sink create ldap host entries. [puppet] - 10https://gerrit.wikimedia.org/r/202582 [20:33:47] ori: is ok [20:34:40] (03CR) 1020after4: "puppet-lint and parser-validate are both weak. But thcipriani is strong." [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4) [20:37:55] (03Merged) 10jenkins-bot: Blackhole the slow parse log on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202325 (owner: 10Ori.livneh) [20:42:54] (03PS8) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [20:45:06] (03CR) 10Thcipriani: "1 more inline note, then LGTM. 
Also, yes, I am suitably ashamed of being such a pedant :(" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4) [20:49:44] (03PS9) 10Ottomata: [WIP] Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [20:51:06] (03PS10) 10Ottomata: Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [20:53:10] !log ori Synchronized wmf-config/InitialiseSettings.php: I60ef00d2b: Blackhole the slow parse log on private wikis (duration: 00m 13s) [20:53:13] Logged the message, Master [20:55:51] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [21:08:09] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1187562 (10Dzahn) @Eloquence @csteipp I have attempted to sanitize the Bugzilla database by removing security bugs. The goal would be to be able to put a SQL dump file on dumps.wm.or... 
[21:08:29] (03PS1) 10Chad: Add REL1_25 branches to ExtDist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202591 [21:09:22] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:09:25] (03PS11) 10Ottomata: Set up https with archiva certificate for archiva.wikmedia.org [puppet] - 10https://gerrit.wikimedia.org/r/202474 (https://phabricator.wikimedia.org/T88139) [21:09:27] (03CR) 10Legoktm: [C: 031] Add REL1_25 branches to ExtDist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202591 (owner: 10Chad) [21:09:50] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1187575 (10JohnLewis) http://www.ravenbrook.com/tool/bugzilla-schema/?action=single&version=3.4.2&view=View+schema is also an official schema of a few versions back which I based most... [21:10:51] (03PS2) 10Andrew Bogott: Have sink create ldap host entries. [puppet] - 10https://gerrit.wikimedia.org/r/202582 [21:12:53] (03CR) 10Platonides: [C: 031] Add REL1_25 branches to ExtDist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202591 (owner: 10Chad) [21:14:47] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1187598 (10JohnLewis) [21:14:51] 6operations, 10Wikimedia-Bugzilla: Replicate Bugzilla database to Labs testing instance - https://phabricator.wikimedia.org/T30339#1187596 (10JohnLewis) 5Open>3declined a:3JohnLewis [21:17:31] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1187606 (10JohnLewis) [21:18:42] (03CR) 10Legoktm: "Probably want to wait a bit...ExtDist will make 1.25 the default and confuse people, last time we disabled it: 89bcb29a3573bba1258f79da9e9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202591 (owner: 10Chad) [21:20:30] (03PS3) 10Andrew Bogott: Have 
sink create ldap host entries. [puppet] - 10https://gerrit.wikimedia.org/r/202582 [21:24:19] !log Manually started dumpwikidatattl.sh as datasets on snapshot1003 [21:24:22] Logged the message, Master [21:33:15] (03PS2) 10Legoktm: Add REL1_25 branches to ExtDist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202591 (owner: 10Chad) [21:34:17] (03CR) 10Legoktm: [C: 031] Add REL1_25 branches to ExtDist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202591 (owner: 10Chad) [21:42:27] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1187667 (10csteipp) @Dzahn, how did you generate secbugs.list? In general, greping for a few of the security bug titles in the final output would be a good final check. I'm assuming... [21:54:27] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1187711 (10Dzahn) >>! In T85141#1187667, @csteipp wrote: > @Dzahn, how did you generate secbugs.list? So we have https://static-bugzilla.wikimedia.org/ and that was created by a scr... [22:00:17] 6operations, 5Patch-For-Review: Force https for archiva.wikimedia.org - https://phabricator.wikimedia.org/T88139#1187741 (10Dzahn) Make sure the intermediate certificate is actually added to the certificate chain. Either you need to specify it in manifests/certs.pp or with install_certficate. Common error we r... [22:02:41] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1187751 (10Dzahn) >>! In T85141#1187667, @csteipp wrote: > I'm assuming you stripped out logincookies and tokens tables too? And stripped out any hidden comments? We did that using t... [22:11:18] paravoid, do you remember why you didn't like hash-based sharding for the tile service? 
As discussed with MaxSem [22:13:42] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1187772 (10Andrew) I have resumed the process with shopify, this time asking to add 'store.wikimedia.org' to the cert. I will update here as events warr... [22:24:57] (03PS2) 10Gage: IPsec: improved cipher selection [puppet] - 10https://gerrit.wikimedia.org/r/201135 [22:25:54] (03PS1) 10Aaron Schulz: Removed unused var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202605 [22:32:47] (03CR) 10Gage: "@Ecdsa: Thank you for the review! It was unclear to me that multiple encryption algorithms may be included in one proposal, but I have con" [puppet] - 10https://gerrit.wikimedia.org/r/201135 (owner: 10Gage) [22:48:34] jouncebot, next [22:48:34] In 0 hour(s) and 11 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150407T2300) [22:48:52] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail [22:58:52] I just added a SWAT item for Flow. Currently the only thing listed [23:00:04] RoanKattouw, ^d, Krenair, superm401: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150407T2300). Please do the needful. [23:00:34] (03PS1) 10Thcipriani: Wrap lvs class in has_lvs hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/202611 [23:01:29] hm [23:02:03] (03PS2) 10Thcipriani: Wrap lvs class in has_lvs hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/202611 [23:02:16] I'm fine with doing it myself if that's simpler. 
[23:02:23] probably is, yeah [23:07:32] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [23:11:41] (03PS3) 10Thcipriani: Wrap lvs class in has_lvs hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/202611 (https://phabricator.wikimedia.org/T91560) [23:12:05] (03PS1) 10Yuvipanda: puppetmaster: Guess int/bool types from ldap variable values [puppet] - 10https://gerrit.wikimedia.org/r/202613 (https://phabricator.wikimedia.org/T95240) [23:12:10] andrewbogott_afk: thcipriani ^ [23:12:18] thcipriani: thanks for that pointer :) [23:13:36] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Guess int/bool types from ldap variable values [puppet] - 10https://gerrit.wikimedia.org/r/202613 (https://phabricator.wikimedia.org/T95240) (owner: 10Yuvipanda) [23:13:56] YuviPanda: yw, I was surprised to find it wasn't something you could set in whatever ldap library. [23:14:31] (03PS2) 10Yuvipanda: puppetmaster: Guess int/bool types from ldap variable values [puppet] - 10https://gerrit.wikimedia.org/r/202613 (https://phabricator.wikimedia.org/T95240) [23:16:04] thcipriani: ldap was actually spitting out =,= string pairs [23:16:17] thcipriani: so they were a poor serialization format, so no way to actually put any type info there [23:16:23] (unless it was json or something) [23:17:37] (03PS3) 10Yuvipanda: puppetmaster: Guess int/bool types from ldap variable values [puppet] - 10https://gerrit.wikimedia.org/r/202613 (https://phabricator.wikimedia.org/T95240) [23:17:54] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Guess int/bool types from ldap variable values [puppet] - 10https://gerrit.wikimedia.org/r/202613 (https://phabricator.wikimedia.org/T95240) (owner: 10Yuvipanda) [23:18:42] YuviPanda: yeah, seems to do the same thing in ruby (as evidenced by puppet code) so it's probably something in libldap. [23:19:01] thcipriani: no, it’s mostly that LDAP doesn’t support dictionary types... 
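The type guessing under discussion can be sketched roughly as follows. This is a minimal illustration and not the code from change 202613: the helper names `guess_type` and `parse_ldap_vars` and the sample variable names are hypothetical, and it assumes each LDAP variable arrives as a plain `key=value` string with no type information attached.

```python
def guess_type(raw):
    """Guess a Python type for a raw string value (hypothetical helper)."""
    lowered = raw.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    try:
        # Strings that parse cleanly as integers become ints.
        return int(raw)
    except ValueError:
        # Anything else is left as the original string.
        return raw


def parse_ldap_vars(pairs):
    """Turn a list of 'key=value' strings into a dict with guessed types."""
    result = {}
    for pair in pairs:
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = pair.partition("=")
        result[key.strip()] = guess_type(value)
    return result
```

Called with `["has_lvs=false", "port=8080", "realm=labs"]`, this would give a boolean, an integer and a string respectively. Guessing is inherently lossy (a value that is meant to be the string "true" cannot be told apart from a boolean), which is why a typed format such as JSON would be the cleaner option.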
[23:19:15] thcipriani: so since we want to support an arbitrary number of keys / values, they have to be key value pairs somehow [23:19:28] so they just made up their own format (puppet) and our ldap vars were in that format [23:21:35] !log mattflaschen Synchronized php-1.25wmf24/extensions/Flow/: Deploy Flow for LQT/Echo conversion feature (duration: 00m 13s) [23:21:40] Logged the message, Master [23:21:46] YuviPanda: ah, I see. What a weird problem. [23:22:06] thcipriani: yup. so puppet made up its own format and then did their own implementation, and now we’re just matching :) [23:22:13] thcipriani: ideally, it would just be JSON or something :) [23:22:25] Done SWAT (i.e. the Flow deploy) [23:35:52] (03PS5) 1020after4: Add git-new-workdir to tin (via deployment::deployment_server) [puppet] - 10https://gerrit.wikimedia.org/r/202467 [23:36:38] (03CR) 1020after4: Add git-new-workdir to tin (via deployment::deployment_server) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4) [23:36:42] (03PS3) 10Yuvipanda: Write logs to the tool's homedir more securely [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 [23:36:48] Coren: ^ [23:36:51] am testing now [23:37:05] legoktm: ori ^ (for python style, I’m using a ‘with’ here, not sure if that’s the best thing to use) [23:37:14] (03CR) 10jenkins-bot: [V: 04-1] Write logs to the tool's homedir more securely [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 (owner: 10Yuvipanda) [23:37:19] I bet [23:38:03] (03PS4) 10Yuvipanda: Write logs to the tool's homedir more securely [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 [23:38:57] (03CR) 10coren: [C: 031] "Pretty, to boot." [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 (owner: 10Yuvipanda) [23:40:23] (03CR) 10Ori.livneh: "* There's no point in returning self from __enter__.
The return value of __enter__ is meant to be used in 'with foo() as x:' constructions" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 (owner: 10Yuvipanda) [23:40:34] ori: yeah, there’s a far prettier way to do this, doing now [23:40:54] i feel like i’ve been away from python long enough that I’m quite rusty [23:43:19] (03PS5) 10Yuvipanda: Write logs to the tool's homedir more securely [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/202472 [23:43:45] (03CR) 10Thcipriani: [C: 031] Add git-new-workdir to tin (via deployment::deployment_server) [puppet] - 10https://gerrit.wikimedia.org/r/202467 (owner: 1020after4)
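The review point about `__enter__` can be illustrated with a generator-based context manager, which needs no class and no return value when the caller does not use `as`. This is only a sketch under assumptions, not the code from change 202472: the names `temporary_umask` and `append_log` are hypothetical, and tightening the umask is just one plausible reading of "more securely". It assumes a POSIX system.

```python
import contextlib
import os


@contextlib.contextmanager
def temporary_umask(mask):
    """Apply `mask` as the process umask for the duration of the block."""
    previous = os.umask(mask)
    try:
        # Nothing useful to hand back, so yield no value; callers write
        # `with temporary_umask(...):` without an `as` clause.
        yield
    finally:
        # Always restore the earlier umask, even if the block raised.
        os.umask(previous)


def append_log(path, message):
    """Append one line to `path`; a newly created file gets no group/other bits."""
    with temporary_umask(0o077):
        with open(path, "a") as handle:
            handle.write(message + "\n")
```

The inner `open(...) as handle` is the case where the value returned from `__enter__` matters, while the outer `with` only needs the setup and teardown.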