[00:00:05] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151112T0000). Please do the needful.
[00:00:24] silly jouncebot, I removed half of those deployers earlier
[00:01:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[00:04:53] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[00:07:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[00:07:43] (PS1) Jhobs: Deploy QuickSurveys to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/252606 (https://phabricator.wikimedia.org/T110661)
[00:10:00] Krenair: are you doing the SWAT deploy tonight?
[00:10:03] no
[00:11:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 2 below the confidence bounds
[00:25:33] RoanKattouw, ostriches: We still on for evening SWAT?
[00:25:50] Sure, I can do it
[00:25:55] thanks
[00:26:09] * ostriches is off today
[00:26:37] ostriches: my bad, keep forgetting I floated the holiday today
[00:26:45] I don't think that what you're trying to do will work
[00:26:47] Let me check real quick
[00:27:08] RoanKattouw: if not, take a look at patch set 1?
[00:27:49] twentyafterfour: Is there a task for the jobqueue issue?
[00:28:13] Hmm, wait, can we use wg vars directly these days?
[00:28:16] Krinkle: I'm not aware of one
[00:28:22] Does that actually work now?
[00:28:45] I was basing it off of what another coworker said and by scrolling up a few lines in CommonSettings.php
[00:28:58] Yeah I see the comment now
[00:29:08] I'm willing to try your patch as is, but I think Florian was wrong
[00:29:10] I'm not sure though
[00:29:21] I don't *think* you can use wg vars directly for extension settings
[00:29:30] But perhaps you can in the new magical extension.json world
[00:29:39] Krenair: Do you know?
[00:29:48] If it doesn't work, do you think we can rollback to PS1 and redeploy within the window?
[00:29:54] twentyafterfour: Oh, it was resolved?
[00:29:58] Yes, we can
[00:30:00] Yes, you can use wg directly in InitialiseSettings
[00:30:00] Let's try it
[00:30:06] We're doing it already IIRC
[00:30:14] Hah, OK
[00:30:14] TIL
[00:30:25] (CR) Catrope: [C: 2] Refactor wmgQuickSurveysConfig to wgQuickSurveysConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) (owner: Jhobs)
[00:30:28] excellent
[00:30:43] I would check it on tin though, before running the sync
[00:30:55] Right
[00:31:06] (Merged) jenkins-bot: Refactor wmgQuickSurveysConfig to wgQuickSurveysConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) (owner: Jhobs)
[00:31:26] RoanKattouw: remember you can stage on tin, sync from mw1017 and use debug to test wikis other than test wikis
[00:31:42] Yup, works
[00:32:03] Krinkle: Yes, but in my case I just needed to var_dump() the value of a global, which is easily done with eval.php on tin
[00:32:10] ah yeah
[00:32:15] you can, but for this a simple eval.php should do everything needed
[00:32:19] yeah
[00:32:41] Yeah, this should work fine as long as the extension is included in CommonSettings before the $wgConf->export() part I think.
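The ordering concern raised at 00:32:41 (extension loaded before the per-wiki settings are exported) can be illustrated with a toy model. This is only a sketch in Python, not MediaWiki's actual implementation: `export_settings`, `load_extension_defaults`, and the survey value are hypothetical names made up for the example, and the real `$wgConf` logic is considerably richer.

```python
def export_settings(settings, wiki):
    """Toy stand-in for a per-wiki export: pick the wiki-specific value
    for each setting, falling back to its 'default' entry."""
    resolved = {}
    for name, per_wiki in settings.items():
        if wiki in per_wiki:
            resolved[name] = per_wiki[wiki]
        elif "default" in per_wiki:
            resolved[name] = per_wiki["default"]
    return resolved


def load_extension_defaults(global_vars):
    """Hypothetical extension setup: assigns its own default value."""
    global_vars["wgQuickSurveysConfig"] = []


# Assumed, illustrative per-wiki settings table.
settings = {
    "wgQuickSurveysConfig": {
        "default": [],
        "testwiki": [{"name": "example survey"}],
    },
}

# Extension first, export second: the per-wiki value survives.
extension_first = {}
load_extension_defaults(extension_first)
extension_first.update(export_settings(settings, "testwiki"))

# Export first, extension second: the extension default clobbers it.
export_first = {}
export_first.update(export_settings(settings, "testwiki"))
load_extension_defaults(export_first)

print(extension_first["wgQuickSurveysConfig"])
print(export_first["wgQuickSurveysConfig"])
```

Printing the resolved value on the staging host, as was done here with var_dump() in eval.php, is the equivalent sanity check before syncing.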
[00:32:46] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Fix typo in QuickSurveys config (duration: 00m 31s)
[00:32:48] or something
[00:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:33:03] jhobs: Deployed
[00:33:15] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[00:34:35] Krinkle: yes resolved by https://gerrit.wikimedia.org/r/#/c/252600/
[00:34:38] err
[00:35:11] https://gerrit.wikimedia.org/r/#/c/252588/
[00:36:27] RoanKattouw: confirmed working on testwiki, thanks!
[00:37:40] !log restbase: canary deploy of restbase-deploy ce4abe784 on restbase1001
[00:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:21] !log restbase: starting full deploy of restbase-deploy ce4abe784
[00:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:56:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 1 below the confidence bounds
[01:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151112T0100).
[01:02:07] !log restbase: finished full deploy of restbase-deploy ce4abe784
[01:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:08:17] (PS2) Jhobs: Deploy QuickSurveys to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/252606 (https://phabricator.wikimedia.org/T110661)
[01:22:18] (CR) Bmansurov: Deploy QuickSurveys to English Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/252606 (https://phabricator.wikimedia.org/T110661) (owner: Jhobs)
[01:25:58] (CR) Jhobs: Deploy QuickSurveys to English Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/252606 (https://phabricator.wikimedia.org/T110661) (owner: Jhobs)
[01:36:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[01:41:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[01:52:58] operations, Patch-For-Review: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1800198 (ori) >>! In T118331#1799944, @aaron wrote: > Action is to investigate if jobs are not being run or not undelayed What does "not undelayed" mean?
[02:06:33] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[02:17:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [5000000.0]
[02:23:45] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 06m 56s)
[02:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:02] (PS1) Ori.livneh: decom abacist role [puppet] - https://gerrit.wikimedia.org/r/252621
[02:29:14] (PS2) Ori.livneh: decom abacist role [puppet] - https://gerrit.wikimedia.org/r/252621
[02:29:20] (CR) Ori.livneh: [C: 2 V: 2] decom abacist role [puppet] - https://gerrit.wikimedia.org/r/252621 (owner: Ori.livneh)
[02:30:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000000.0]
[02:35:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[02:36:14] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:38:21] (PS1) Ori.livneh: delete abacist module/role; decommed and unused [puppet] - https://gerrit.wikimedia.org/r/252622
[02:38:49] (PS2) Ori.livneh: delete abacist module/role; decommed and unused [puppet] - https://gerrit.wikimedia.org/r/252622
[02:39:07] (CR) Ori.livneh: [C: 2 V: 2] delete abacist module/role; decommed and unused [puppet] - https://gerrit.wikimedia.org/r/252622 (owner: Ori.livneh)
[02:43:43] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[03:02:43] operations, Patch-For-Review: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1800242 (aaron) When a job with jobReleaseTimestamp set actually becomes poppable (which should be around the nominal specified
time).
[03:05:14] (CR) Bmansurov: [C: 1] Deploy QuickSurveys to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/252606 (https://phabricator.wikimedia.org/T110661) (owner: Jhobs)
[03:44:16] (PS1) Ori.livneh: Rename ::redis to ::redis::legacy [puppet] - https://gerrit.wikimedia.org/r/252626
[03:59:23] (PS2) Ori.livneh: Rename ::redis to ::redis::legacy [puppet] - https://gerrit.wikimedia.org/r/252626
[04:03:34] (CR) Yuvipanda: [C: 1] "+1 on refactoring :) Puppet compile all the things, and I'll watch out for alerts on the labs stuff we care about." [puppet] - https://gerrit.wikimedia.org/r/252626 (owner: Ori.livneh)
[04:06:28] (PS1) Glaisher: Revert "Increase abusefilter emergency disable threshold on MediaWiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/252627
[04:07:35] (PS3) Ori.livneh: Rename ::redis to ::redis::legacy [puppet] - https://gerrit.wikimedia.org/r/252626
[04:19:22] (CR) Ori.livneh: "gallium.wikimedia.org: https://puppet-compiler.wmflabs.org/1256/" [puppet] - https://gerrit.wikimedia.org/r/252626 (owner: Ori.livneh)
[04:20:02] (PS4) Ori.livneh: Rename ::redis to ::redis::legacy [puppet] - https://gerrit.wikimedia.org/r/252626
[04:21:42] (CR) Ori.livneh: [C: 2] Rename ::redis to ::redis::legacy [puppet] - https://gerrit.wikimedia.org/r/252626 (owner: Ori.livneh)
[04:27:20] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: puppet fail
[04:27:44] hmmm
[04:27:47] * ori checks on that host
[04:28:10] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:28:17] race conditions
[04:28:23] that's so crazy, though
[04:28:26] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[redis-server] is already declared in file /etc/puppet/modules/redis/manifests/init.pp:21; cannot redeclare at /etc/puppet/modules/redis/manifests/legacy.pp:23 on node mc1016.eqiad.wmnet
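The "Duplicate declaration" error pasted at 04:28:26 comes from the rule that a given resource (type plus title) may be declared only once per catalog. A minimal Python sketch of that uniqueness check follows; the `Catalog` class and its `declare` method are invented for illustration and are not Puppet's real compiler API. Only the two manifest paths are taken from the error message.

```python
class DuplicateDeclaration(Exception):
    """Raised when the same (type, title) pair is declared twice."""


class Catalog:
    """Toy catalog: maps (resource type, title) to where it was declared."""

    def __init__(self):
        self.resources = {}

    def declare(self, rtype, title, source):
        key = (rtype, title)
        if key in self.resources:
            raise DuplicateDeclaration(
                f"{rtype}[{title}] is already declared in {self.resources[key]}; "
                f"cannot redeclare at {source}"
            )
        self.resources[key] = source


catalog = Catalog()
catalog.declare("Package", "redis-server", "redis/manifests/init.pp:21")

try:
    # A second class declaring the same package in the same catalog fails.
    catalog.declare("Package", "redis-server", "redis/manifests/legacy.pp:23")
    error = None
except DuplicateDeclaration as exc:
    error = str(exc)

print(error)
```

In this model the failure disappears once only one of the two classes declares the package, which is the shape of the later patch that moves Package['redis-server'] out of redis::legacy.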
[04:28:40] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:29:00] all of these are ephemeral and the result of a race condition
[04:30:19] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:30:21] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:30:59] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:31:19] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[04:32:00] (PS1) Ori.livneh: fix-up for I90842b834: fix hieradata for codfw memcaches [puppet] - https://gerrit.wikimedia.org/r/252629
[04:32:12] (CR) Ori.livneh: [C: 2 V: 2] fix-up for I90842b834: fix hieradata for codfw memcaches [puppet] - https://gerrit.wikimedia.org/r/252629 (owner: Ori.livneh)
[04:33:19] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:33:59] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[04:34:09] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[04:34:11] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:34:30] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[04:34:41] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:35:09] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:54:04] (PS1) Ori.livneh: Move Package['redis-server'] and File[$dir] from redis::legacy to redis [puppet] - https://gerrit.wikimedia.org/r/252631
[05:32:05] (PS1) Andrew Bogott: Add hiera config for labtest cluster.
[puppet] - https://gerrit.wikimedia.org/r/252633
[05:32:37] (PS2) Andrew Bogott: Add hiera config for labtest cluster. [puppet] - https://gerrit.wikimedia.org/r/252633
[06:18:56] operations, Varnish, Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1800378 (intracer) Don't know if it' correct place to write. We assumed that we can use images directly from commons on external websites,...
[06:22:00] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[06:30:30] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:09] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:00] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:29] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:30] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:30] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:41] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:16] (PS1) Yuvipanda: docker: Add role + class for setting up registry [puppet] - https://gerrit.wikimedia.org/r/252640
[06:36:37] _joe_: godog ^ alert about carbon
[06:37:03] <_joe_> YuviPanda: uj? where?
[06:37:15] <_joe_> oh, saw it, will fix later
[06:37:15] > PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[06:37:18] ok
[06:37:40] (PS2) Yuvipanda: docker: Add role + class for setting up registry [puppet] - https://gerrit.wikimedia.org/r/252640
[06:42:40] (PS3) Yuvipanda: docker: Add role + class for setting up registry [puppet] - https://gerrit.wikimedia.org/r/252640
[06:43:39] (PS4) Yuvipanda: docker: Add role + class for setting up registry [puppet] - https://gerrit.wikimedia.org/r/252640
[06:44:30] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[06:45:51] (PS5) Yuvipanda: docker: Add role + class for setting up registry [puppet] - https://gerrit.wikimedia.org/r/252640
[06:47:35] (PS6) Yuvipanda: docker: Add role + class for setting up registry [puppet] - https://gerrit.wikimedia.org/r/252640
[06:47:51] (CR) Yuvipanda: [C: 2 V: 2] docker: Add role + class for setting up registry [puppet] - https://gerrit.wikimedia.org/r/252640 (owner: Yuvipanda)
[06:51:05] (PS1) Yuvipanda: docker: Conform to puppet's stupid conventions [puppet] - https://gerrit.wikimedia.org/r/252641
[06:51:16] (PS2) Yuvipanda: docker: Conform to puppet's stupid conventions [puppet] - https://gerrit.wikimedia.org/r/252641
[06:51:31] (CR) Yuvipanda: [C: 2 V: 2] docker: Conform to puppet's stupid conventions [puppet] - https://gerrit.wikimedia.org/r/252641 (owner: Yuvipanda)
[06:55:50] (PS1) Yuvipanda: docker: Fix registry package name [puppet] - https://gerrit.wikimedia.org/r/252642
[06:56:15] (PS2) Yuvipanda: docker: Fix registry package name [puppet] - https://gerrit.wikimedia.org/r/252642
[06:56:24] (CR) Yuvipanda: [C: 2 V: 2] docker: Fix registry package name [puppet] - https://gerrit.wikimedia.org/r/252642 (owner: Yuvipanda)
[06:58:01] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:57] (PS1) Yuvipanda: docker: Make sure data directory exists [puppet] -
https://gerrit.wikimedia.org/r/252643
[07:01:09] (PS2) Yuvipanda: docker: Make sure data directory exists [puppet] - https://gerrit.wikimedia.org/r/252643
[07:01:19] (CR) jenkins-bot: [V: -1] docker: Make sure data directory exists [puppet] - https://gerrit.wikimedia.org/r/252643 (owner: Yuvipanda)
[07:01:55] (CR) jenkins-bot: [V: -1] docker: Make sure data directory exists [puppet] - https://gerrit.wikimedia.org/r/252643 (owner: Yuvipanda)
[07:03:01] (PS3) Yuvipanda: docker: Make sure data directory exists [puppet] - https://gerrit.wikimedia.org/r/252643
[07:04:02] (CR) Yuvipanda: [C: 2] docker: Make sure data directory exists [puppet] - https://gerrit.wikimedia.org/r/252643 (owner: Yuvipanda)
[07:22:30] (PS1) Yuvipanda: docker: Use ssl for the registry [puppet] - https://gerrit.wikimedia.org/r/252645
[07:26:41] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:26:41] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:09] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:15] (PS2) Giuseppe Lavagetto: pybal: add support for instrumentation [puppet] - https://gerrit.wikimedia.org/r/252244
[07:27:39] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:27:41] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:00] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:49] (PS2) Yuvipanda: docker: Use ssl for the registry [puppet] - https://gerrit.wikimedia.org/r/252645
[07:30:36] (CR) Giuseppe Lavagetto: [C: 2] pybal: add support for instrumentation [puppet] -
https://gerrit.wikimedia.org/r/252244 (owner: Giuseppe Lavagetto)
[07:32:13] (PS3) Yuvipanda: docker: Use ssl for the registry [puppet] - https://gerrit.wikimedia.org/r/252645
[07:32:39] (CR) Yuvipanda: [C: 2 V: 2] docker: Use ssl for the registry [puppet] - https://gerrit.wikimedia.org/r/252645 (owner: Yuvipanda)
[07:35:32] (PS1) Ori.livneh: redis: add doc-block [puppet] - https://gerrit.wikimedia.org/r/252646
[07:35:38] <_joe_> !log restarting pybal on lvs1004
[07:35:43] :)
[07:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:36:56] (PS2) Ori.livneh: Move Package['redis-server'] and File[$dir] from redis::legacy to redis [puppet] - https://gerrit.wikimedia.org/r/252631
[07:37:25] <_joe_> ori: redis::legacy?
[07:37:29] <_joe_> what's this?
[07:38:26] _joe_: explained in detail in the commit message for https://gerrit.wikimedia.org/r/#/c/252626/
[07:38:37] <_joe_> ok, thanks
[07:39:01] basically the redis module sucks, i want to renovate it, but it has many users
[07:39:35] i could have gone with the '.*_new' approach (i.e., create a redis_new module) but those have a way of becoming permanent and not replacing the code they were meant to replace
[07:39:37] <_joe_> ori: yeah I was reading the code
[07:39:44] <_joe_> err, commit message
[07:39:55] <_joe_> and I think the problem with _new is basically the same
[07:40:04] <_joe_> and I prefer the ::legacy naming anyways
[07:40:19] <_joe_> it's gonna be less ugly once we migrate
[07:40:24] yep
[07:40:36] i think it's the right balance: share code with current users but make it possible to build something new too
[07:40:40] gotta run
[07:40:42] <_joe_> this is doable here since it's just a class
[07:40:46] <_joe_> yeah good night
[07:43:22] hmm
[07:43:32] so I don't think I can have docker registry say
[07:43:42] 'readonly for everyone, readwrite with pw'
[07:43:57] without also going into doing full blown authz/n with an external JWT REST
SERVER
[07:45:36] (PS1) Yuvipanda: docker: Restart registry when config changes [puppet] - https://gerrit.wikimedia.org/r/252647
[07:45:58] (PS2) Yuvipanda: docker: Restart registry when config changes [puppet] - https://gerrit.wikimedia.org/r/252647
[07:46:42] (CR) Yuvipanda: [C: 2 V: 2] docker: Restart registry when config changes [puppet] - https://gerrit.wikimedia.org/r/252647 (owner: Yuvipanda)
[07:47:39] (PS1) Giuseppe Lavagetto: pybal: remove explicit monitoring_port [puppet] - https://gerrit.wikimedia.org/r/252648
[07:48:36] (PS2) Giuseppe Lavagetto: pybal: remove explicit monitoring_port [puppet] - https://gerrit.wikimedia.org/r/252648
[07:49:01] (CR) Giuseppe Lavagetto: [C: 2] pybal: remove explicit monitoring_port [puppet] - https://gerrit.wikimedia.org/r/252648 (owner: Giuseppe Lavagetto)
[08:24:50] <_joe_> YuviPanda:
[08:25:04] _joe_: ?
[08:33:22] (PS1) ArielGlenn: make neodymium secondary salt master for all minions [puppet] - https://gerrit.wikimedia.org/r/252651
[08:36:39] (CR) ArielGlenn: [C: 2] make neodymium secondary salt master for all minions [puppet] - https://gerrit.wikimedia.org/r/252651 (owner: ArielGlenn)
[08:39:31] operations, Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1800496 (ArielGlenn) https://gerrit.wikimedia.org/r/#/c/252651/ added as secondary salt master to all hosts, tested on one client, responds to commands from neodymium
[08:40:37] (CR) Alexandros Kosiaris: [C: 2] Mathoid: Enable texvcinfo generation [puppet] - https://gerrit.wikimedia.org/r/252429 (owner: Mobrovac)
[08:40:45] (PS3) Alexandros Kosiaris: Mathoid: Enable texvcinfo generation [puppet] - https://gerrit.wikimedia.org/r/252429 (owner: Mobrovac)
[08:42:01] !log restarted salt-master on palladium
[08:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:51:01] PROBLEM - puppet last run
on scandium is CRITICAL: CRITICAL: puppet fail
[08:52:34] PROBLEM - salt-minion processes on analytics1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:58:12] RECOVERY - salt-minion processes on analytics1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:04:07] (CR) Alexandros Kosiaris: "Was not discussed in the ops meeting due to *my* negligence. Rescheduling for the next one. Sorry :-(" [puppet] - https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914) (owner: Yurik)
[09:06:03] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:28:38] (PS3) Ori.livneh: Move Package['redis-server'] and File[$dir] from redis::legacy to redis [puppet] - https://gerrit.wikimedia.org/r/252631
[09:28:45] (CR) Ori.livneh: [C: 2 V: 2] Move Package['redis-server'] and File[$dir] from redis::legacy to redis [puppet] - https://gerrit.wikimedia.org/r/252631 (owner: Ori.livneh)
[09:30:13] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[09:35:12] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: puppet fail
[09:37:03] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:39:48] !log corrected apt sources on db2057, db2058, db2062, db2065, db2066, db2067, db2068, db2069
[09:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:43:09] (PS1) Hashar: Drop role::zuul::production [puppet] - https://gerrit.wikimedia.org/r/252656
[09:44:35] !log ori@tin Synchronized php-1.27.0-wmf.5/extensions/UniversalLanguageSelector: I9206a79: Only use jQuery.tipsy for undo popover (duration: 01m 15s)
[09:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:45:22] (CR) Hashar: [C: 1 V: 1] "Puppet compilation:
https://puppet-compiler.wmflabs.org/1265/gallium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/252656 (owner: Hashar)
[09:46:12] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[09:46:51] (PS2) Jcrespo: Adding ferm to db1035, activating performance_schema [puppet] - https://gerrit.wikimedia.org/r/252435
[09:47:56] (CR) Alexandros Kosiaris: [C: 2] Drop role::zuul::production [puppet] - https://gerrit.wikimedia.org/r/252656 (owner: Hashar)
[09:51:59] !log uploading new mariadb package linked to openssl instead of yassl
[09:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:53:25] !log rebooting/upgrading/restarting mariadb on db1035 after maintenance finished
[09:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[09:58:33] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:08:34] (PS1) Phuedx: Remove Browse experiment config [mediawiki-config] - https://gerrit.wikimedia.org/r/252660 (https://phabricator.wikimedia.org/T113686)
[10:09:45] (PS1) Ori.livneh: Deactivate my own key on suspicion my computer was compromised [puppet] - https://gerrit.wikimedia.org/r/252661
[10:11:42] !log ori@tin Synchronized php-1.27.0-wmf.6/extensions/UniversalLanguageSelector: I9206a79: Only use jQuery.tipsy for undo popover (duration: 00m 31s)
[10:11:44] (CR) Ori.livneh: [C: 2] Deactivate my own key on suspicion my computer was compromised [puppet] - https://gerrit.wikimedia.org/r/252661 (owner: Ori.livneh)
[10:11:46] (CR) Yuvipanda: [C: 2] Deactivate my own key on suspicion my computer was compromised [puppet] - https://gerrit.wikimedia.org/r/252661 (owner: Ori.livneh)
[10:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:15:13] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[10:23:56] (CR) Joal: "Having this field as top level instead of in X-Analytics header is fine by me." [puppet] - https://gerrit.wikimedia.org/r/252427 (owner: BBlack)
[11:29:07] Puppet, Ruby: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#1800675 (zeljkofilipin) stalled>Open
[12:23:35] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1800807 (BBlack) >>! In T118365#1800106, @GWicke wrote: > @bblack: The basic issue is that we are using a blanket limit across different APIs with vastly different costs. Some batch APIs let...
[12:39:31] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: puppet fail
[12:40:52] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail
[13:06:12] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:07:42] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[13:13:36] (PS1) Alexandros Kosiaris: Remove nuria and bmansurov's SSH keys [puppet] - https://gerrit.wikimedia.org/r/252680
[13:14:21] (CR) Muehlenhoff: [C: 1] Remove nuria and bmansurov's SSH keys [puppet] - https://gerrit.wikimedia.org/r/252680 (owner: Alexandros Kosiaris)
[13:14:55] (CR) Alexandros Kosiaris: [C: 2] Remove nuria and bmansurov's SSH keys [puppet] - https://gerrit.wikimedia.org/r/252680 (owner: Alexandros Kosiaris)
[13:16:07] (PS1) Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638)
[13:41:51] (PS1) Zfilipin: Added test target to Rakefile [puppet] -
https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993)
[13:43:46] (PS2) Zfilipin: Added test target to Rakefile [puppet] - https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993)
[13:44:21] (CR) Zfilipin: "Patch set 2 renames rakefile to Rakefile." [puppet] - https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993) (owner: Zfilipin)
[13:59:53] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /transform/wikitext/to/html{/title}{/revision} is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200)
[14:00:23] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /transform/wikitext/to/html{/title}{/revision} is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200)
[14:00:41] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /transform/wikitext/to/html{/title}{/revision} is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200)
[14:01:23] PROBLEM - Restbase endpoints health on restbase-test2001 is CRITICAL: /transform/wikitext/to/html{/title}{/revision} is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200)
[14:02:08] known ^^
[14:02:12] * mobrovac on it
[14:02:43] mobrovac: sweet, thanks! was about to ask
[14:03:22] PROBLEM - Restbase endpoints health on restbase-test2002 is CRITICAL: /transform/wikitext/to/html{/title}{/revision} is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200)
[14:04:42] PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: /transform/wikitext/to/html{/title}{/revision} is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200)
[14:06:12] _joe_: houston we have a service_checker problem!
[14:10:00] <_joe_> mobrovac: uhm, how so?
[14:10:12] <_joe_> mobrovac: another one?
:P
[14:10:23] haha
[14:10:58] _joe_: so, you see all these alerts ^^
[14:11:15] it's because we have a real request validator now in RB
[14:11:44] and it doesn't like the fact that service_checker sends it the content as app/json, and not formdata
[14:12:34] <_joe_> not formdata?
[14:12:41] <_joe_> what do you mean?
[14:13:04] <_joe_> you require that the client sends you a specific Accept: header?
[14:13:05] multipart-formdata
[14:13:25] no no
[14:13:30] <_joe_> maybe we didn't add the Accept header?
[14:13:35] for that specific endpoint
[14:13:58] we require that the users sends the payload as url-encoded
[14:14:08] <_joe_> please explain with more verbosity
[14:14:10] but service_checker sends it as json
[14:14:13] haha
[14:14:17] <_joe_> ok
[14:14:28] <_joe_> let me take a look at the spec
[14:14:39] <_joe_> if that is specified there, it's easy for me to manage
[14:14:56] (CR) Zfilipin: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993) (owner: Zfilipin)
[14:16:34] _joe_: https://github.com/wikimedia/restbase/blob/master/specs/mediawiki/v1/content.yaml#L892-L893
[14:17:29] <_joe_> mobrovac: kk
[14:18:02] <_joe_> mobrovac: uff, check_service could soon turn into a python software of its own
[14:18:32] yup
[14:18:43] and that's good(TM) for the health of our system :)
[14:21:07] _joe_: hmm things just got weirder
[14:21:44] _joe_: i can get the request to pass even with app/json headers when i do a manual request
[14:21:55] but it fails in service_checker
[14:22:27] is it possible that you do not set the content-type: app/json header when sending the request?
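The JSON versus url-encoded distinction discussed above comes down to how the same payload is serialized and which Content-Type header accompanies it. The following is a minimal Python sketch using only the standard library; `encode_payload` and the field names `wikitext` and `body_only` are assumptions chosen for illustration, not taken from service_checker or from the RESTBase spec linked in the log.

```python
import json
from urllib.parse import urlencode


def encode_payload(payload, content_type):
    """Serialize a dict for an HTTP request body and return the matching
    Content-Type header alongside the encoded bytes."""
    if content_type == "application/json":
        body = json.dumps(payload)
    elif content_type == "application/x-www-form-urlencoded":
        body = urlencode(payload)
    else:
        raise ValueError(f"unsupported content type: {content_type}")
    return {"Content-Type": content_type}, body.encode("utf-8")


# Hypothetical request parameters, for illustration only.
payload = {"wikitext": "== Heading ==", "body_only": True}

json_headers, json_body = encode_payload(payload, "application/json")
form_headers, form_body = encode_payload(
    payload, "application/x-www-form-urlencoded"
)

print(json_headers, json_body)
print(form_headers, form_body)
```

Note that the two encodings also differ in how a Python boolean is rendered: `json.dumps` emits lowercase `true`, while `urlencode` stringifies the value as `True`.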
[14:22:33] <_joe_> mobrovac: uhm, give me half an hour and I'll check
[14:22:47] <_joe_> mobrovac: perfectly possible I did fuckup somehow
[14:27:11] (PS3) Jcrespo: Adding ferm to db1035, activating performance_schema [puppet] - https://gerrit.wikimedia.org/r/252435
[14:28:52] (CR) Jcrespo: [C: 2] Adding ferm to db1035, activating performance_schema [puppet] - https://gerrit.wikimedia.org/r/252435 (owner: Jcrespo)
[14:31:42] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[14:57:34] _joe_: there's also a problem in our validation, so hang on with those changes
[14:58:07] _joe_: btw, we will need to make changes to the script eventually so that it can send app/json and multipart/form-data as well
[14:58:14] <_joe_> mobrovac: yeah I'm pretty caught up at the moment btw
[14:58:31] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:02:16] (PS1) Jcrespo: Repool db1035 with low weight, after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/252691
[15:27:37] (CR) Chad: "It will, which is incorrect obvs."
[mediawiki-config] - https://gerrit.wikimedia.org/r/252399 (owner: Florianschmidtwelzow) [15:28:07] (CR) Jcrespo: [C: 2] Repool db1035 with low weight, after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/252691 (owner: Jcrespo) [15:30:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1035 after maintenance (duration: 00m 29s) [15:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:56] mw1041 failed [15:31:12] checking it [15:32:54] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [15:33:32] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [15:34:00] it is on dsh but it is down for 5 hours [15:34:13] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [15:35:13] RECOVERY - Restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [15:35:36] allow me to check first the deployment, will go back to mw later [15:36:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [5000000.0] [15:36:50] Hey all, is there any reason why all of our role manifests are at the top level in manifests/role rather than grouped in subdirs?
[15:37:03] RECOVERY - Restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [15:37:08] ‘cause I want to move the labs roles into a labs/ subdir [15:37:32] RECOVERY - Restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [15:38:11] it looks kernel-crashed [15:39:09] !log powercycling mw1041 [15:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:53] <_joe_> mobrovac: I guess service checker wasn't wrong afterall [15:40:13] _joe_: nope :) sorry for bothering [15:40:22] <_joe_> no np [15:40:34] <_joe_> I am actually happy it helps us find bugs [15:40:59] _joe_: actually, it was python - we validated booleans on smallcaps only, but python's got True and False [15:41:16] <_joe_> lol [15:41:32] <_joe_> we might want to fix that [15:41:42] RECOVERY - Host mw1041 is UP: PING WARNING - Packet loss = 93%, RTA = 0.24 ms [15:41:50] we put case-insensitive checks now so it's good [15:42:43] andrewbogott: I think it's because most match the path. If they don't, they should be subdir'd [15:43:12] do you know if puppet runs common-sync? [15:43:33] I think I was told that, but I may got it wrong [15:45:43] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [15:46:13] jynus: I think it does on the scap master(s) only on fresh install [15:46:20] !log restbase canary deploy on rb1001 of e5e45d11ad3f [15:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:46] scap master == tin? [15:47:02] I will run it anyway, just in case [15:47:33] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:47:43] jynus: tin and mira [15:48:34] !log restbase start deploy of e5e45d11ad3f [15:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:56] Permission denied. as which user I am supposed to run common-sync? 
[15:49:42] mwdeploy I think [15:50:53] oh, I almost forgot about a meeting [15:51:12] (03PS1) 10Andrew Bogott: Move role/nova.pp into role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/252702 [15:51:29] can someone finish syncing mw1041, and a general check of the last crash? [15:51:39] jynus: I believe that if you’re in the deployer group you can sync-common as yourself [15:51:43] jynus: I’ll try [15:51:50] I can do it as root [15:51:57] but last time I was told not to [15:52:06] because I created root owned files [15:52:18] btw, we are talking about ‘sync-common’ and not ‘common-sync’ right? [15:52:24] just do whatever, running out of time [15:52:27] andrewbogott: sync-command :) [15:52:31] *common [15:52:50] make sure it has the latest deploy that if failed because it had crashed [15:53:07] JohnFLewis: you’re just making it worse! :) [15:53:11] in paricular, the latest version of wmf-config/db-eqiad.php [15:53:39] then checkiing the kernel logs to see if there is a larger issue why it crased [15:53:43] jynus: I’m getting some pushback about unable to delete empty dirs and also quite a lot of [15:53:44] rsync: delete_file: unlink(docroot/wikipedia.org/cases/case_007.php) failed: Permission denied (13) [15:53:52] andrewbogott, that is what I got too [15:53:56] hm [15:54:04] so, I think that may be an actual failure [15:54:06] maybe depool it [15:54:10] but, let me scare up someone from releng [15:54:10] to be sure? [15:54:25] oh, they don’t have logins probably [15:54:28] um... [15:54:30] then analyze it [15:54:45] jynus: at this point I clearly know less than you. Yeah, depooling sounds right. 
[15:54:45] andrewbogott: I'd imagine they do [15:55:06] a depool then investigate sounds best though, you know, safe side :) [15:55:13] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [15:55:13] yep [15:56:24] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1801065 (10JanZerebecki) 3NEW [15:56:32] ostriches: have a minute? jynus and I are trying to sync-common a specific host and can’t tell if we’re seeing bad behavior or… normal behavior [15:56:45] andrewbogott: which box? [15:56:53] bd808: mw1041 [15:56:54] mw1041 [15:57:03] was just powercycle after crashing [15:57:09] Yo [15:57:23] I'll take a look [15:57:28] (03PS1) 10JanZerebecki: add wikilovesmonument.org [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) [15:57:51] (sorry I cannot continue myself, if you depool it, I can in 1 hour) [15:58:00] ostriches: bd808 is on it, thanks for appearing on demand though :) [15:58:00] (03PS1) 10Filippo Giunchedi: restbase: remove non existant cassandra nodes from seeds [puppet] - 10https://gerrit.wikimedia.org/r/252704 [15:58:24] andrewbogott: hmmm... has this server been depooled for quite a while? [15:58:36] gwicke: ^ [15:58:43] bd808, my last deployment failed [15:58:58] bd808: since we were discussing whether or not to depool it… we were not thinking that it was already depooled :) [15:59:01] but it was on dsh [15:59:19] sorry about that [15:59:26] (03CR) 10GWicke: [C: 031] restbase: remove non existant cassandra nodes from seeds [puppet] - 10https://gerrit.wikimedia.org/r/252704 (owner: 10Filippo Giunchedi) [15:59:33] godog: thank you! [15:59:42] there are .git directories laying about. That may be a result of some changes we made in scap [16:00:05] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151112T1600). [16:00:06] (03CR) 10John F. Lewis: [C: 04-1] "just a comment on the MXes. likely what the domain currently uses but should be changed over :)" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [16:01:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: remove non existant cassandra nodes from seeds [puppet] - 10https://gerrit.wikimedia.org/r/252704 (owner: 10Filippo Giunchedi) [16:01:45] andrewbogott, jynus: some errors there are legitimate due to messed up file permissions. some things in /srv/mediawiki/docroot are owned by root. [16:02:04] others I think were caused by a scap change and I can clean up myself [16:02:07] bd808: would that be a result of someone doing a sync as root, earlier? [16:02:29] andrewbogott: the script *should* always sudo to mwdeploy [16:02:34] hm [16:02:54] !log restbase end deploy of e5e45d11ad3f [16:02:58] bd808, jynus, are you curious about what happened, or shall I just wipe out those files and resync? [16:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:26] I guess we should definitely depool before doing that :) [16:03:28] andrewbogott: the offending files in /srv/mediawiki/docroot/wikipedia.org don't seem to be on tin at all [16:03:28] gwicke: np, merged I'll be running puppet [16:03:40] * andrewbogott doesn’t know how to pool/depool but can look it up. [16:03:43] andrewbogott only thing I want to know is if there is a reason for it failing: hw, kernel, etc. 
[16:03:54] the goofy files are /srv/mediawiki/docroot/wikipedia.org/cases/* [16:04:18] MatmaRex: jhobs ping for SWAT [16:04:30] thcipriani: i'm here [16:04:34] it seemed kernel-crashed [16:04:52] kernel: [28960486.777380] mce: [Hardware Error]: Machine check events logged [16:05:11] there are a ton of those [16:06:25] jynus: I don’t know especially what that means but it seems a good reason to distrust that box [16:06:27] bd808: should I hold on SWAT while you're investigating this? looks like mw1041 is in the mediawiki-installation dsh group. [16:06:55] thcipriani: you are going to get a failure from it but I don't think it will hurt anything [16:07:06] thcipriani: I think we should depool that box and set it aside for now. Is that a thing you know how to do? [16:07:10] looks like there are just some extra things that rsync can't clean up [16:08:01] andrewbogott: I saw some documentation for that at some point, but no, not a thing I know how to do. [16:08:14] hi thcipriani [16:08:26] 6operations: Hardware errors on mw1041 - https://phabricator.wikimedia.org/T118469#1801095 (10Andrew) 3NEW [16:08:30] jynus: https://phabricator.wikimedia.org/T118469 [16:08:49] !log cleaned up stale branch checkouts on mw1041: php-1.26wmf16 php-1.26wmf17 php-1.26wmf18 php-1.26wmf19 php-1.26wmf20 php-1.26wmf21 [16:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:03] MatmaRex: hi, so I noticed you patch depends on a commit in UploadWizard master that isn't deployed for .5 but is deployed for .6 [16:09:10] (03CR) 10Bartosz Dziewoński: "Should be safe now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248360 (https://phabricator.wikimedia.org/T98933) (owner: 10Bartosz Dziewoński) [16:09:29] thcipriani: oh, hmmm. hold on [16:09:58] thcipriani: i was under the impression we don't have UploadWizard on any wikipedias. 
but that might not be true, now that i think of it [16:10:53] thcipriani: it is on rowiki, apparently D: wut [16:11:03] thcipriani: okay, i'm moving to evening then. sorry about that [16:11:11] thanks for spotting [16:11:13] MatmaRex: no problem, thanks for double checking! [16:11:25] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1801111 (10daniel) >>! In T116247#1799843, @Ottomata wrote: > Is it time to consider creating a standalone repo for these schemas? In my oppinion,... [16:11:31] !log depooled T118469. https://phabricator.wikimedia.org/T118469 [16:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:42] andrewbogott: can you use your root powers to delete /srv/mediawiki/docroot/wikipedia.org/cases on mw1041 (or move it out of /srv/mediawiki at least). It was created by some root on 2015-08-31T22:40Z [16:11:54] bd808: ok [16:12:29] bd808: and sync-common right after? [16:12:43] I'll run the sync and check the output [16:13:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252606 (https://phabricator.wikimedia.org/T110661) (owner: 10Jhobs) [16:13:14] bd808: ok,have at [16:13:24] (03Merged) 10jenkins-bot: Deploy QuickSurveys to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252606 (https://phabricator.wikimedia.org/T110661) (owner: 10Jhobs) [16:13:45] andrewbogott: w00t. sync-common is happy again [16:13:54] great. 
[16:14:03] but those files were weird [16:14:07] I’m going to leave it depooled, though, until we know something about those kernel issues [16:14:18] I saved the mystery files in ~root/filesfromwikipedia.org [16:14:19] somebody testing something to do with languages [16:14:40] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1801114 (10Ottomata) @daniel This schema repo will be used by many codebases. EventLogging, Mediawiki, analytics refinery, etc. etc. Anyone creati... [16:14:58] 6operations: Hardware errors on mw1041 - https://phabricator.wikimedia.org/T118469#1801115 (10Andrew) Almost certainly unrelated, but in order to get sync-common to run gracefully on that server I had to move some files out of the way: root@mw1041:/srv/mediawiki/docroot/wikipedia.org# mv cases/ ~/filesfromwikip... [16:16:08] hmm, I have a feeling that whenever fatalmonitor actually outputs something on fluorine it's going to be not good. Lots of time generating output... [16:16:41] 6operations: Hardware errors on mw1041 - https://phabricator.wikimedia.org/T118469#1801124 (10Andrew) Depooled for now. [16:17:58] andrewbogott: thx ^ [16:18:32] cmjohnson1: can I leave that box in your hands now? I haven’t done much to investigate the problem other than to scratch my head and say “that doesn’t look good" [16:18:38] Who's got a sec. to merge something for me and hashar for puppetz? [16:18:41] yep..i will take it [16:18:46] cmjohnson1: great, thanks [16:18:53] ostriches: link? [16:18:55] https://gerrit.wikimedia.org/r/#/c/244498/ [16:19:32] andrewbogott: lots of cleanup, but we're set to take care of that on ytterbium and gallium ourselves, just need the merge to puppetmaster. [16:19:56] is this because things are moving to scandium? 
[16:20:20] (03PS5) 10BBlack: geoip.inc: use X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/252442 (https://phabricator.wikimedia.org/T89688) [16:20:31] (03CR) 10BBlack: [C: 032 V: 032] geoip.inc: use X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/252442 (https://phabricator.wikimedia.org/T89688) (owner: 10BBlack) [16:20:58] andrewbogott: No, it's because no jenkins jobs rely on us replicating repos in real-time to the nodes anymore. [16:21:10] And nothing uses them on gallium for sure, which is the only place we're replicating anymore. [16:21:23] oh good: 31341410 fwrite(): send of 61 bytes failed with errno=32 Broken pipe in /srv/mediawiki/php-1.27.0-wmf.5/vendor/nmred/kafka-php/src/Kafka/Socket.php on line 330 I think twentyafterfour posted this problem yesterday, I don't know what the conclusion of that was though. [16:22:08] ‘k [16:22:18] tldr: Stop replicating data nobody needs :p [16:22:20] thcipriani: if you guys are having kafka problems, lemme know [16:22:29] Think of the electrons! [16:22:31] (03PS7) 10Andrew Bogott: contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [16:22:45] thcipriani: I think that's nothing to worry about? [16:23:40] hmm, k. 31 million records in the logs took a lot of time for fluorine to process. [16:23:43] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 15.79% of data above the critical threshold [100000000.0] [16:23:48] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1801151 (10daniel) @Ottomata If we have good versioned dependencies between the modules, that should work too. My concern is making sure that code, s... 
[16:23:52] 6operations: Hardware errors on mw1041 - https://phabricator.wikimedia.org/T118469#1801153 (10Cmjohnson) a:3Cmjohnson Taking this task to troubleshoot. [16:23:53] I got a vague response in #wikimedia-operations when I brought it up yesterday [16:23:57] anywho, back to swat. [16:23:57] (03CR) 10Andrew Bogott: [C: 032] contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [16:24:16] thcipriani: indeed it's clogging up fatalmonitor but really you should probably use kibana instead ;) [16:24:47] I tend to use both, fatalmonitor seems to blow up in a more obvious way (to me at least) :) [16:24:58] fatalmonitor on fluorine doesn't get as much love these days. [16:25:10] * ostriches also remembers when fatalmonitor was in ~catrope/bin/ :p [16:25:41] jhobs: anything but a sync need to be done in this instance? [16:26:08] ostriches: merged, no errors on gallium. [16:26:09] thcipriani: i'm not sure what you mean by that (deploy noob) [16:26:31] (03CR) 10Andrew Bogott: "puppet compiler shows this as a no-op on labcontrol1001, labnet1001, labvirt1001" [puppet] - 10https://gerrit.wikimedia.org/r/252702 (owner: 10Andrew Bogott) [16:26:32] andrewbogott: Ty sir! [16:26:40] hashar: ^ merged :) [16:26:51] jhobs: no database changes or maintenance scripts: that kinda thing. 
[16:27:21] thcipriani: correct, just enabling the extension on enwiki and a small config change to use a live survey [16:27:46] jhobs: kk, I figured, just wanted to doublecheck :) [16:27:47] thcipriani: https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors caught problems that fatalmointor missed [16:27:57] andrewbogott: Ran puppet on ytterbium, removed the one replication.config stanza I wanted :) [16:28:04] 6operations, 10Traffic, 7Varnish: Varnish GeoIP is broken for HTTPS+IPv6 traffic - https://phabricator.wikimedia.org/T89688#1801163 (10BBlack) 5Open>3Resolved Basic isssue here seems resolved with https://gerrit.wikimedia.org/r/#/c/252442/ . Obviously, we still don't have good v6 data, but that's the mm... [16:28:53] !log gerrit: reloading replication plugin to pick up new config [16:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:21] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Deploy QuickSurveys to English Wikipedia [[gerrit:252606]] (duration: 00m 29s) [16:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:29] ^ jhobs check please [16:31:12] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1801180 (10BBlack) Oh and missing in the final paragraph above: the limiter is again only miss/pass traffic, vs the 34K/sec being all requests (~90%+ of which are cache hits). So ballpark term... [16:31:49] thcipriani: does it take a few minutes to propagate servers or something? Doesn't appear to be working on enwiki yet [16:32:12] jhobs: should be sync'd lemme double check [16:33:26] thcipriani: it's off testwiki, but definitely not on enwiki for me yet (testing by looking for mw.config.values.wgEnabledQuickSurveys in my js console) [16:34:46] jhobs: InitializeSettings is definitely up-to-date for at least a few app servers I spot-checked. 
?debug=true as a url parameter might help (if you're not using that already) [16:35:58] thcipriani: that did the trick and everything looks great, thanks! [16:36:09] jhobs: awesome! Thanks for checking. [16:42:53] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [16:46:26] (03CR) 10Legoktm: "Er, it won't. The "latest stable MediaWiki" is an i18n message in WikimediaMessages, which isn't being changed yet. And the default branch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252399 (owner: 10Florianschmidtwelzow) [17:00:04] Deploy window Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151112T1700) [17:01:05] (03CR) 10Giuseppe Lavagetto: "This is scheduled to go out in 1 hour, and I was asked to assist with this." [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [17:01:50] deploying new parsoid code now. [17:02:44] !log starting parsoid deploy [17:04:03] is the bot down? [17:04:18] will manually update the sal page later. [17:05:47] hmm .. why is sync stuck on the last 2 nodes .. will keep retrying a bit more. 
[17:06:07] (03PS2) 10Rush: Move role/nova.pp into role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/252702 (owner: 10Andrew Bogott) [17:08:12] (03PS3) 10Andrew Bogott: Move some labs openstack classes into role/labs/openstack [puppet] - 10https://gerrit.wikimedia.org/r/252702 [17:08:20] oh wrong chan [17:08:30] quick survey broke layout of the articles [17:08:34] @media all and (min-width:768px){.ext-qs-loader-bar,.ext-quick-survey-panel{margin-left:1.4em;width:300px;clear:right;float:right}.infobox,.last-modified-bar,h2{clear:both}} [17:08:44] https://phabricator.wikimedia.org/T118475 [17:09:27] jhobs: ^^ [17:09:59] <_joe_> eheh I was about to comment "css, here is a thing I don't feel comfortable helping with" [17:10:15] (03CR) 10jenkins-bot: [V: 04-1] Move some labs openstack classes into role/labs/openstack [puppet] - 10https://gerrit.wikimedia.org/r/252702 (owner: 10Andrew Bogott) [17:10:19] hmm .. can someone help with this? git deploy sync is done on 42/44 nodes .. wtp1007 and wtp2010 aren't getting synced .. i've retried often enough. [17:10:29] <_joe_> apergos: ^^ [17:10:34] legoktm: thedj, thanks i'll take a look [17:10:36] <_joe_> salt troubles I guess [17:10:47] subbu: I'll have a look [17:10:53] thanks [17:10:58] <_joe_> apergos: thanks! [17:11:09] legoktm, thedj: in the future, I probably won't be in -ops all that often, so -mobile is probably a better place for this :) [17:11:12] <_joe_> i'll bbiab [17:11:52] apergos, should i "git deploy abort" the current attempt? or wait for a bit and retry git deploy sync after you investigate? [17:12:40] subbu: wait please, I want to see where we are first [17:12:41] thanks [17:12:48] sure. 
[17:17:08] (PS4) Andrew Bogott: Move some labs openstack classes into role/labs/openstack [puppet] - https://gerrit.wikimedia.org/r/252702 [17:18:00] subbu: you can retry the fetch to them now [17:18:18] will do [17:18:32] nothing useful whatsoever in the logs, but the master was restarted earlier today, looks like they ended up wedged. [17:19:07] apergos, yay .. fetch went through. [17:19:11] good [17:19:35] checkout as well. thanks. all good now. [17:19:42] great, happy trails [17:25:00] andrewbogott, back [17:25:28] !log finished parsoid deploy sha 392e25eb [17:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:25:42] PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: puppet fail [17:28:20] I see T118469 [17:31:37] (PS5) Andrew Bogott: Move some labs openstack classes into role/labs/openstack [puppet] - https://gerrit.wikimedia.org/r/252702 [17:33:31] (CR) Chad: [C: 1] "Lego is right as usual :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/252399 (owner: Florianschmidtwelzow) [17:36:02] (CR) Andrew Bogott: [C: 2] Move some labs openstack classes into role/labs/openstack [puppet] - https://gerrit.wikimedia.org/r/252702 (owner: Andrew Bogott) [17:36:32] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on port 9042 [17:41:22] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: puppet fail [17:42:14] (PS1) Andrew Bogott: Labs: Clean up and qualify some paths after the role renames. [puppet] - https://gerrit.wikimedia.org/r/252720 [17:43:02] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [17:43:58] (CR) Andrew Bogott: [C: 2] Labs: Clean up and qualify some paths after the role renames.
[puppet] - https://gerrit.wikimedia.org/r/252720 (owner: Andrew Bogott) [17:44:05] (PS1) Zfilipin: Updated RuboCop to the newest released version [puppet] - https://gerrit.wikimedia.org/r/252721 (https://phabricator.wikimedia.org/T112651) [17:45:03] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: puppet fail [17:45:32] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:42] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: puppet fail [17:46:43] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:47:23] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 9.743 second response time [17:48:53] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:50:36] (PS1) Andrew Bogott: Catch up hiera settings with role::nova rename [puppet] - https://gerrit.wikimedia.org/r/252723 [17:52:26] (CR) Andrew Bogott: [C: 2] Catch up hiera settings with role::nova rename [puppet] - https://gerrit.wikimedia.org/r/252723 (owner: Andrew Bogott) [17:53:13] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:24] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:55:13] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:57:46] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1801337 (EBernhardson) We have already run into many annoyances with trying to keep schemas in line across repositories. I'd be happy to be proved...
[17:58:04] (PS1) Andrew Bogott: Obvious typo fix [puppet] - https://gerrit.wikimedia.org/r/252726 [17:58:22] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail [17:58:34] (CR) Giuseppe Lavagetto: "The relevant discussion seems is in T110070, and since this was scheduled, I see no reason not to deploy it." [puppet] - https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem) [17:58:44] <_joe_> MaxSem: let's go [17:58:52] (PS5) Giuseppe Lavagetto: Switch www.wikimedia.org to source control [puppet] - https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem) [17:58:56] Ops-Access-Requests, operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1801339 (RobH) @mobrovac, Can you please review https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups and advise if you need access to analytics-users, analytics-privatedata-...
[17:59:05] Ops-Access-Requests, operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1801340 (RobH) p:Triage>Normal [17:59:05] <_joe_> MaxSem: it will take some time to be deployed [17:59:13] (CR) Andrew Bogott: [C: 2] Obvious typo fix [puppet] - https://gerrit.wikimedia.org/r/252726 (owner: Andrew Bogott) [17:59:26] (PS6) Giuseppe Lavagetto: Switch www.wikimedia.org to source control [puppet] - https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem) [17:59:41] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Switch www.wikimedia.org to source control [puppet] - https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem) [17:59:50] Ops-Access-Requests, operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1801341 (RobH) a:mobrovac I've assigned this back to @mobrovac for his input and he can then assign to his manager for approval. Once we have this info, please place up for grabs or assign... [17:59:52] <_joe_> MaxSem: you here? [18:00:10] <_joe_> andrewbogott: don't merge the patch I submitted please [18:00:24] <_joe_> I need to disable puppet on the appservers first, just to be sure [18:00:27] _joe_: ok. I’m done for the moment anyway, I think [18:00:33] <_joe_> ok [18:01:17] _joe_, yep! [18:02:18] <_joe_> MaxSem: I'm erring on the side of caution, but I know apache configs tend to evade my syntax checking :) [18:02:43] _joe_, it was tested in beta [18:02:56] still, no 100% warranty, of course:) [18:03:01] <_joe_> MaxSem: yeah, I said, being overcautious [18:04:51] <_joe_> MaxSem: do you have ori's extension to be able to set the X-wikimedia-debug header?
[18:04:52] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.002 second response time [18:05:03] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:05:39] _joe_, I have [18:06:13] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: puppet fail [18:06:49] <_joe_> MaxSem: try it then, I guess the patch had issues [18:07:13] <_joe_> it gives me a 404 [18:07:31] err [18:08:09] (PS1) Andrew Bogott: Catch nodepool manifest up with labs role renaming [puppet] - https://gerrit.wikimedia.org/r/252727 [18:08:48] hmm, the files are all in place [18:08:49] .. [18:09:19] fix incoming [18:09:21] (fuck) [18:09:43] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [18:09:57] <_joe_> remove the /prod/ there [18:10:03] yup [18:10:20] changed that part at a last moment [18:11:00] <_joe_> MaxSem: should I fix that? [18:11:06] <_joe_> I can if needed [18:11:13] (PS2) Andrew Bogott: Catch up yet more manifests with labs role renaming [puppet] - https://gerrit.wikimedia.org/r/252727 [18:11:27] (PS1) MaxSem: Fix path [puppet] - https://gerrit.wikimedia.org/r/252729 [18:11:31] _joe_, ^^^ [18:11:50] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Fix path [puppet] - https://gerrit.wikimedia.org/r/252729 (owner: MaxSem) [18:12:44] <_joe_> MaxSem: let's see [18:12:48] )_joe_ let me know when it’s safe to merge again?
(Nothing is urgent enough for me to want to work around the default puppet-merge behavior) [18:12:55] <_joe_> andrewbogott: go [18:13:05] ‘kthanks [18:13:21] (PS3) Andrew Bogott: Catch up yet more manifests with labs role renaming [puppet] - https://gerrit.wikimedia.org/r/252727 [18:14:07] <_joe_> MaxSem: much better [18:14:12] <_joe_> :) [18:14:27] <_joe_> MaxSem: take a look, confirm it's ok, then I'll deploy to the rest of the cluster [18:14:43] _joe_, LGTM [18:14:58] (CR) Andrew Bogott: [C: 2] Catch up yet more manifests with labs role renaming [puppet] - https://gerrit.wikimedia.org/r/252727 (owner: Andrew Bogott) [18:15:25] funny, the rest of my patches don't have this /prod/ [18:16:08] (CR) MaxSem: [C: -1] "Blocked on portals." [puppet] - https://gerrit.wikimedia.org/r/252366 (owner: MaxSem) [18:17:32] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:17:52] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:18:47] (PS1) Muehlenhoff: Restrict LDAP access on the corp LDAP mirror [puppet] - https://gerrit.wikimedia.org/r/252730 [18:19:10] (PS1) MaxSem: Bump portals [mediawiki-config] - https://gerrit.wikimedia.org/r/252731 [18:19:29] (CR) MaxSem: [C: 2] Bump portals [mediawiki-config] - https://gerrit.wikimedia.org/r/252731 (owner: MaxSem) [18:19:49] (Merged) jenkins-bot: Bump portals [mediawiki-config] - https://gerrit.wikimedia.org/r/252731 (owner: MaxSem) [18:19:50] <_joe_> MaxSem: still running puppet now, it might take up to 20 minutes to be updated everywhere [18:19:59] ok [18:21:13] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:21:30] !log maxsem@tin Synchronized portals/: The rest of portals (duration: 00m 31s) [18:21:36] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:34] (03CR) 10MaxSem: "Not blocked." [puppet] - 10https://gerrit.wikimedia.org/r/252366 (owner: 10MaxSem) [18:22:58] (03PS1) 10Thcipriani: Revert "deployment::server: move IPv6 int to role" [puppet] - 10https://gerrit.wikimedia.org/r/252732 (https://phabricator.wikimedia.org/T118422) [18:26:19] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1801435 (10mobrovac) a:5mobrovac>3GWicke I need to be in the `analytics-privatedata-users` as my request is for querying webrequest logs. Assigning to @GWicke for approval. [18:26:28] _joe_, meanwhile there's a corresponding update to beta https://gerrit.wikimedia.org/r/#/c/252355/ and two changes to complete the migration: https://gerrit.wikimedia.org/r/252364 https://gerrit.wikimedia.org/r/252366 [18:26:46] <_joe_> should we perform the other changes as well? [18:27:39] ideally, yes [18:27:51] <_joe_> ok [18:28:05] <_joe_> we might pack them toghether once we are done with the first? [18:29:23] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1801451 (10GWicke) Approved. [18:29:35] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1801455 (10GWicke) a:5GWicke>3RobH [18:30:37] _joe_, I guess. 
the beta one can go right now as it does not touch prod [18:31:34] (03PS2) 10Giuseppe Lavagetto: Switch www.wikimedia.beta.wmflabs.org to Git [puppet] - 10https://gerrit.wikimedia.org/r/252355 (https://phabricator.wikimedia.org/T118009) (owner: 10MaxSem) [18:31:57] (03PS1) 10RobH: adding mobrovac to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/252738 (https://phabricator.wikimedia.org/T118399) [18:32:21] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1801467 (10RobH) [18:32:59] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Switch www.wikimedia.beta.wmflabs.org to Git [puppet] - 10https://gerrit.wikimedia.org/r/252355 (https://phabricator.wikimedia.org/T118009) (owner: 10MaxSem) [18:33:31] (03PS2) 10RobH: adding mobrovac to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/252738 (https://phabricator.wikimedia.org/T118399) [18:33:54] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1799234 (10RobH) All approvals are in and https://gerrit.wikimedia.org/r/#/c/252738/ is ready for merge on Monday (pending no objections). [18:36:35] (03PS1) 10RobH: updating mholloway ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/252740 [18:37:31] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Update mholloway's SSH key - https://phabricator.wikimedia.org/T118392#1801504 (10RobH) I confirmed that P2302 was created by @mholloway and his ssh key has been updated. (As phabricator and office wiki are both approved methods of pub key update.) htt... 
[18:37:44] (03PS2) 10RobH: updating mholloway ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/252740 [18:38:10] (03CR) 10RobH: [C: 032] updating mholloway ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/252740 (owner: 10RobH) [18:39:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Update mholloway's SSH key - https://phabricator.wikimedia.org/T118392#1801516 (10RobH) 5Open>3Resolved a:3RobH [18:49:49] <_joe_> MaxSem: I'll go on with the other two patches [18:49:58] ok [18:51:03] (03PS3) 10Giuseppe Lavagetto: Switch www.wikipedia.org to Git [puppet] - 10https://gerrit.wikimedia.org/r/252364 (owner: 10MaxSem) [18:51:06] question concerning Morning SWAT: Whats the status? [18:51:40] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/252364 (owner: 10MaxSem) [18:52:11] (03PS2) 10Giuseppe Lavagetto: Switch remaining portals to Git [puppet] - 10https://gerrit.wikimedia.org/r/252366 (owner: 10MaxSem) [18:52:21] thcipriani: You deployed the first patch, what's the status of the second? [18:53:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/252366 (owner: 10MaxSem) [18:54:03] Luke081515: I didn't see your patch there this morning. There were two there this morning when I started, but I didn't see yours. [18:54:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [5000000.0] [18:55:10] thcipriani: Can we deploy this at the evening swat? It's a throttle excemption. The only problem is, that I can't be available during the evening swat... [18:56:11] <_joe_> MaxSem: done, it will take some time to be applied everywhere (as usual, ~ 20 mins) [18:56:23] thank you _joe_! 
[18:57:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [18:58:42] Luke081515: blerg, ok, I can get it out here shortly, how long are you around? I could maybe sneak it in before the train if twentyafterfour doesn't mind. [18:59:09] thcipriani: I don't mind [19:00:00] twentyafterfour: thanks, should just take a second. Luke081515 I can babysit if you're not around: patch looks straightforward. [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151112T1900). [19:00:34] ok, thanks :) [19:00:39] ok, doing now. [19:01:15] (03CR) 10Thcipriani: [C: 032] "late SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) (owner: 10Luke081515) [19:02:00] (03Merged) 10jenkins-bot: Set throttle exception for University of Haifa wiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) (owner: 10Luke081515) [19:03:42] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [19:05:14] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: Set throttle exception for University of Haifa wiki event [[gerrit:252010]] (duration: 00m 29s) [19:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:23] ^ Luke081515 sync'd [19:05:28] great, thanks [19:05:59] Luke081515: sorry I missed you patch :( [19:06:23] no problem, becaue the late SWAT fixed the problem ;) [19:06:51] twentyafterfour: heads up, mira seemed to have some problems: exit 23 partial transfer :\ [19:08:42] er, wait, that problem was on tin... 
[19:09:56] looks related to the portals deploy: https://dpaste.de/117A [19:11:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [19:11:32] complete paste: https://dpaste.de/sKQi [19:15:26] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1801610 (10cscott) [19:16:36] thcipriani: the failed to set times thing is the problem that bd808 and ostriches were trying to solve by sudoing the branch checkout to make the branch owned by mwdeploy, but that caused other problems with the branch deployment on tuesday so I had to put the permissions back to the way they were before that change... so that will happen until we figure out a way to ignore those set time errors [19:17:42] kk. I'm not sure I understand the rational behind not letting a group with write permissions overwrite mtimes :\ [19:18:30] it's a rationale that is beyond our control, afaik [19:18:50] it's either unix permissions or it's rsync, not sure which one [19:18:58] that rationale is fundamental to unix but I too am not sure why [19:19:00] it's unix. I just don't get why. [19:20:04] serverfault doesn't seem to know why either: http://serverfault.com/questions/337766/how-to-allow-multiple-people-to-change-mtime-timestamp-of-a-file-through-sftp [19:20:20] possibly stupid question: do we need the mtimes? [19:20:50] rsync uses them as part of the algo to decide what needs to be updated [19:21:07] I guess we could use -c for content hashes instead but that could be slow considering we're talking 10s of gb. [19:21:11] We certainly need file mtimes for INitialzeSettings [19:21:34] can we just have rsyncd run as root? [19:21:47] I guess that's totally unsafe, but it'd be sorta safe if we ran it in a container, right? [19:21:47] _joe_, oh shi... 
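The mtime discussion above turns on how rsync decides what to transfer: by default it compares size and modification time, while -c compares file contents instead. A minimal sketch of the two strategies follows, assuming nothing about scap itself; the function names and the SHA-256 digest are illustrative stand-ins, not what rsync uses internally.

```python
import hashlib
import os
import tempfile


def quick_check_differs(src, dst):
    """Cheap comparison in the spirit of rsync's default check: size and mtime only."""
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def checksum_differs(src, dst):
    """Content comparison in the spirit of rsync -c: reads both files in full."""
    def digest(path):
        h = hashlib.sha256()  # stand-in digest, chosen for illustration
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()
    return digest(src) != digest(dst)


def demo():
    """Two files with identical bytes but different mtimes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.php")
        dst = os.path.join(tmp, "dst.php")
        for path in (src, dst):
            with open(path, "wb") as f:
                f.write(b"<?php // same bytes\n")
        # Force the timestamps apart, as happens when mtimes cannot be set on the target.
        os.utime(src, (1_000_000_000, 1_000_000_000))
        os.utime(dst, (1_000_000_500, 1_000_000_500))
        return quick_check_differs(src, dst), checksum_differs(src, dst)
```

With mismatched mtimes the cheap check flags the file even though the bytes match, which is why unset mtimes cause repeated transfers; the checksum check avoids that at the cost of reading every byte on both sides.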
[19:22:03] we're getting a redirect loop on images in portals [19:22:06] twentyafterfour: it would be the rsync pull that needs to run as root [19:22:14] which yeah might be the fix [19:22:22] <_joe_> MaxSem: uhm should I rollback? [19:22:37] <_joe_> MaxSem: I just tested the html tbh [19:22:56] _joe_, yep, unless we can figure out the cause in a couple minutes [19:23:10] <_joe_> MaxSem: an url with the redir loop please? [19:23:15] <_joe_> I might find out [19:23:23] _joe_, https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikiquote-logo_sister_1.5x.png [19:24:04] <_joe_> MaxSem: gimme 3 mins [19:25:19] <_joe_> MaxSem: only wikipedia.org has images, right? [19:25:41] _joe_, fuck - not redirect loop but a redir from www to en. yes, only wp.org has images bundled right now [19:25:57] <_joe_> MaxSem: yes I was about to tell you [19:26:02] <_joe_> it redirs to enwiki [19:26:05] twentyafterfour: having it run as root would take a change to modules/scap/manifests/master.pp and to scap, but may be the "right" fix as long as we can pin down the rsync sudo grant well enough to avoid something horrible being possible (eg overwriting /etc/shadow) [19:28:47] <_joe_> MaxSem: I have the fix [19:29:08] <_joe_> MaxSem: all those images are under /portals/assets right? [19:30:20] _joe_, yes [19:31:19] <_joe_> MaxSem: ok take a look at my change [19:31:45] (03PS1) 10Giuseppe Lavagetto: portals/wikipedia: fix images for the static portal [puppet] - 10https://gerrit.wikimedia.org/r/252760 [19:31:51] <_joe_> MaxSem: ^^ [19:31:55] bd808: that's why I said containers - though a chroot might be good enough - to isolate the root rsync process [19:31:56] <_joe_> please check for sanity [19:32:45] bd808: what if the sudo rule only allowed calling a script with no arguments, and the script would then execute the rsync in a chroot with a safe set of arguments? [19:33:04] err, my apache-fu is...
at 101 level:) but yeah, whatever that works [19:33:48] (03CR) 10Giuseppe Lavagetto: [C: 032] portals/wikipedia: fix images for the static portal [puppet] - 10https://gerrit.wikimedia.org/r/252760 (owner: 10Giuseppe Lavagetto) [19:34:04] <_joe_> MaxSem: as usual, 20 mins for this to take full effect [19:34:09] <_joe_> and I have to go to dinner [19:34:43] have a nice dinner _joe_ [19:34:50] <_joe_> MaxSem: can you point your browser to the test server and confirm this fixed the issue? [19:35:44] _joe_, it does, I already checked [19:36:20] <_joe_> MaxSem: cool [19:37:49] <_joe_> MaxSem: forcing puppet to run, but then I'm out [19:41:17] twentyafterfour: for this particular case, that might work. The only thing that scap would need to pass to the script would be the origin rsync host to fetch from. The script could be provisioned with Puppet and thus locked down to root +2 oversight. [19:41:17] <_joe_> MaxSem: the page is btw now cached for 1 hour [19:41:32] <_joe_> MaxSem: so to purge it, I guess you should ask bblack maybe [19:41:40] <_joe_> I really am too tired by now [19:41:57] thank you so much _joe_ [19:41:58] bd808: yeah I think that's the right solution [19:42:44] MaxSem: They're being served by the normal WMF appservers? [19:43:00] MaxSem: if so, use purgeList.php [19:43:11] <_joe_> MaxSem: if you see further problems while I'm away, ask any opsen around [19:43:13] thcipriani: you all done I assume?
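The locked-down wrapper idea discussed above (a sudo rule that permits exactly one script, with the origin rsync host as the only input) could look roughly like the sketch below. Everything named here is an assumption for illustration: the host allowlist, the `common` rsync module, and the `/srv/mediawiki/` destination are hypothetical, and the chroot step is left out.

```python
import re
import subprocess
import sys

# Hypothetical allowlist of deploy masters; the real set would come from Puppet.
ALLOWED_HOSTS = frozenset({"tin.eqiad.wmnet", "mira.codfw.wmnet"})

# Conservative hostname shape: rejects anything that could be parsed as an rsync option.
HOSTNAME_RE = re.compile(r"[a-z0-9]([a-z0-9.-]*[a-z0-9])?")


def build_rsync_command(origin_host):
    """Return a fixed rsync argv for the given origin host, or raise ValueError."""
    if not HOSTNAME_RE.fullmatch(origin_host) or origin_host not in ALLOWED_HOSTS:
        raise ValueError("origin host not allowed: %r" % (origin_host,))
    return [
        "/usr/bin/rsync",
        "--archive",                    # preserves mtimes, which needs owner or root
        "--delete",
        origin_host + "::common/",      # hypothetical rsync daemon module
        "/srv/mediawiki/",              # hypothetical destination
    ]


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: pull-from-master <origin-host>")
    try:
        cmd = build_rsync_command(sys.argv[1])
    except ValueError as err:
        sys.exit(str(err))
    # Argument list, no shell: the host can never be reinterpreted as shell syntax.
    subprocess.run(cmd, check=True)
```

Keeping every flag and path hard-coded means the sudo grant exposes only the choice of origin host, which is the narrowing the discussion was after.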
[19:43:23] twentyafterfour: yup [19:43:27] <_joe_> Reedy: he should wait that the pages are fixed on the cluster [19:43:40] True [19:43:45] <_joe_> MaxSem: you can verify that by running apache-fast-test from tin [19:43:48] But afterwards, he should be able to purge them himself [19:46:12] <_joe_> ah I messed it up apparently, groan [19:46:54] <_joe_> MaxSem: :/ sorry [19:47:39] <_joe_> too late in my evening :( [19:47:42] (03PS1) 10Giuseppe Lavagetto: wwwportals: remove bogus rewritecond [puppet] - 10https://gerrit.wikimedia.org/r/252767 [19:48:05] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/252767 (owner: 10Giuseppe Lavagetto) [19:50:27] <_joe_> MaxSem: ok now it's going to be fixed, and I need to go [19:50:38] <_joe_> people have been waiting for me for logn enough [20:16:43] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:18:33] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [20:21:59] 6operations: uwsgi takes a long time to restart (Debian Jessie in labs) - https://phabricator.wikimedia.org/T118495#1801807 (10Halfak) 3NEW a:3yuvipanda [20:22:37] 6operations: uwsgi takes a long time to restart - https://phabricator.wikimedia.org/T118495#1801815 (10yuvipanda) [20:22:58] 6operations: uwsgi takes a long time to restart - https://phabricator.wikimedia.org/T118495#1801807 (10yuvipanda) Ive noticed the same thing on other servers too - I think graphite, invisible-unicorn etc all take a loooong time to restart. [20:23:14] 6operations: uwsgi takes a long time to restart (Debian Jessie in labs) - https://phabricator.wikimedia.org/T118495#1801825 (10Halfak) [20:26:49] ok I got distracted by things in another channel. I guess it's cool to deploy the train now? 
[20:30:03] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:31:52] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [20:45:29] <_joe_> [20:48:43] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:43] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: puppet fail [20:54:22] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [20:54:33] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:59] 6operations, 7Database: defragment db1015, db1035 and db1027 - https://phabricator.wikimedia.org/T110504#1801937 (10jcrespo) db1035 defragmented and in production, 2 to go :-P [20:58:13] twentyafterfour: How's the train going? [20:58:21] I've got a hotfix for IE8 I'd like to unbreak asap [20:58:42] Krinkle: I found a bug in wmf.6 I'm trying to patch it up before it goes live to wikipedia and floods the logs [20:58:49] Okay [20:58:55] Krinkle: go ahead and unbreak it now if you'd like? [20:59:11] although IE8 cannot be unbroken really [20:59:16] OK. I'l add it in wmf.6 [20:59:26] and let wikipedia be broken a little while longer [20:59:34] until you roll it out there [20:59:49] ok yeah it won't be long I'm almost done with this patch [21:03:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [21:04:43] PROBLEM - YARN NodeManager Node-State on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:52] PROBLEM - Check size of conntrack table on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:04:53] PROBLEM - puppet last run on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:03] PROBLEM - Disk space on Hadoop worker on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:03] PROBLEM - Hadoop NodeManager on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:22] PROBLEM - RAID on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:42] PROBLEM - SSH on analytics1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:05:52] PROBLEM - DPKG on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:53] PROBLEM - configured eth on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:13] PROBLEM - Disk space on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:23] PROBLEM - dhclient process on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:30] can I get a quick review of my hotfix for (https://phabricator.wikimedia.org/T117770) here: https://gerrit.wikimedia.org/r/#/c/252776/ [21:06:32] PROBLEM - Hadoop DataNode on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:06:32] PROBLEM - salt-minion processes on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:05] ottomata: is that you?^ [21:07:33] twentyafterfour: No commit in master? 
[21:07:46] Krinkle: I'll merge to master [21:08:00] trying to get out of the habit of cherry picking [21:08:48] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1802040 (10RobH) a:3Ejegg [21:08:49] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1802039 (10RobH) @ejegg, We'll need a few things from you before we can implement this change: * This seems to be the first change to your access since we implemented the L3 document. Plea... [21:09:07] twentyafterfour: cherry-picking from master to wmf, not the other way around :P [21:09:31] Krinkle: but if you merge then you do it the other way around. no cherry pick [21:09:48] since master has everything on the branch, then merging the branch back into master is clean [21:10:17] merging back into master is not something we do though. [21:10:24] !log reboot analytics1040 it locked up [21:10:29] ottomata: ^ [21:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:41] Krinkle: that is going to have to change [21:10:49] I'm just trying to lead by example [21:11:01] cherry picking must die [21:11:07] twentyafterfour: Why? Why would we want master to be full of merge commits from unrelated temporal branches [21:11:38] Krinkle: it's already full of merge commits, every commit from gerrit is a merge commit [21:11:39] master has the change first, and then you cherry-pick to a deployment branch (or fast-forward), or if the fix is in deployment only, make a regular commit there. [21:11:45] Sure, but that's different. [21:11:46] (03PS1) 10RobH: adding ejegg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/252778 (https://phabricator.wikimedia.org/T118320) [21:11:58] how is it different? [21:12:05] merge commits afin I'm fine with. 
[21:12:16] Merge commits associated with branches nobody else uses is confusing and not needed. [21:12:24] Also, it's much harder to tests. [21:12:44] You won't know whether something works well until you know it works in master. You don't "work" in a wmf branch. [21:12:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [21:12:55] unless it's something that regressed within the branch. [21:13:03] RECOVERY - SSH on analytics1040 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [21:13:10] I think that's a principle, not something we'd want to stray away from. It's a good principle and keeps integiry. [21:13:13] RECOVERY - DPKG on analytics1040 is OK: All packages OK [21:13:20] analytics1040 is up [21:13:23] RECOVERY - configured eth on analytics1040 is OK: OK - interfaces up [21:13:26] Krinkle: it could be a feature branch then merge it into master and the production branch [21:13:42] RECOVERY - Disk space on analytics1040 is OK: DISK OK [21:13:43] RECOVERY - dhclient process on analytics1040 is OK: PROCS OK: 0 processes with command name dhclient [21:13:47] We weren't talking about feature branches. [21:13:48] so it rebooted, was it on porpouse? [21:13:53] RECOVERY - salt-minion processes on analytics1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:53] RECOVERY - Hadoop DataNode on analytics1040 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [21:14:02] RECOVERY - YARN NodeManager Node-State on analytics1040 is OK: OK: YARN NodeManager analytics1040.eqiad.wmnet:8041 Node-State: RUNNING [21:14:12] We're talking about single-commit improvements, drafted in a temporal deployment branch, merged there first, and then "forward ported" into master. 
[21:14:12] RECOVERY - Check size of conntrack table on analytics1040 is OK: OK: nf_conntrack is 0 % full [21:14:12] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [21:14:18] Krinkle: do you want me to make it a different branch? it's a hotfix [21:14:20] That is a fragile workflow I'd like to avoid. [21:14:23] RECOVERY - Disk space on Hadoop worker on analytics1040 is OK: DISK OK [21:14:23] RECOVERY - Hadoop NodeManager on analytics1040 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [21:14:42] RECOVERY - RAID on analytics1040 is OK: OK: optimal, 13 logical, 14 physical [21:14:43] PROBLEM - spamassassin on fermium is CRITICAL: PROCS CRITICAL: 0 processes with args spamd [21:14:52] this particular bug exists in master and the feature branch [21:15:26] I don't mind where it showed up first in Gerrit, that's okay either way. I just meant the general principle, because it sounded like you were talking about an example you're trying to indicate, an overall principle. I wasn't sure what you wanted to indicate by this example though. I interpreted it as wanting to start making commits in wmf branches and [21:15:26] applying to master later from there. [21:15:37] wmf branches are not feature branches. [21:15:43] They are behind master, not ahead. [21:16:18] well the principle I'm advocating is avoiding cherry-pick [21:16:21] I think it OOMed? [21:16:23] if you'd have a hotfix branch, then that hotfix branch would be based on masster not the wmf branch, where it is merged after that doesn't matter. [21:16:41] twentyafterfour: OK. what would you suggest instead? [21:16:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [21:17:07] Jenkins is really slow.. 
[21:17:10] what I just did is the clean way to avoid a cherry pick in this case [21:17:10] robh: can you restart spamd on fermium? [21:17:22] at least I thought so [21:17:23] no [21:17:27] JohnFLewis: nope [21:17:32] sorry [21:17:38] yes i can i misread as 'did you restart' [21:17:39] heh [21:17:40] Krinkle: lets take that discussion to #wikimedia-releng I guess, this is operational channel [21:17:49] JohnFLewis: will do now [21:18:24] RECOVERY - spamassassin on fermium is OK: PROCS OK: 3 processes with args spamd [21:19:16] JohnFLewis: seems better now =] [21:21:09] !log krinkle@tin Synchronized php-1.27.0-wmf.6/resources/lib/jquery.i18n/src/jquery.i18n.language.js: T118242 - Unbreak IE8 (duration: 00m 30s) [21:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:20] robh: thanks :) [21:24:31] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1802072 (10RobH) [21:26:38] 6operations: analytics1040 kernel panicked/OOMed and rebooted - https://phabricator.wikimedia.org/T118501#1802076 (10jcrespo) 3NEW [21:26:53] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1802083 (10RobH) p:5Triage>3Normal [21:27:08] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1802084 (10RobH) 5Open>3stalled [21:33:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [21:35:13] PROBLEM - puppet last run on restbase2003 is CRITICAL: CRITICAL: puppet fail [21:43:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [21:56:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 
below the confidence bounds [22:00:36] 6operations: analytics1040 kernel panicked/OOMed and rebooted - https://phabricator.wikimedia.org/T118501#1802172 (10Ottomata) We've also seen a couple of other analytics nodes have their root filesystems go into read only mode after OOM killer killed some stuff. Hm, sounds like we need to investigate Hadoop JV... [22:01:33] RECOVERY - puppet last run on restbase2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:01:57] 6operations, 6Analytics-Backlog, 10Analytics-Cluster: Audit Hadoop worker memory usage. - https://phabricator.wikimedia.org/T118501#1802174 (10Ottomata) a:3Ottomata [22:03:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 8 below the confidence bounds [22:07:39] twentyafterfour: I assume that for now that commit must be cherry-picked to master, right? [22:07:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 4 below the confidence bounds [22:08:41] Krinkle: I was planning to attempt merging it to master and cherry pick if I can't figure it out [22:08:49] k [22:08:56] at least I'll learn something in the process, I hope [22:09:41] opsen: can we kill that graphite anomaly detection alert? or at least make it less sensitive? 
it's been false-alarming for weeks [22:09:46] months even [22:10:18] I'd do it myself but I doubt I have permission to edit that (and I'm not sure where to look) [22:11:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [22:16:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 5 below the confidence bounds [22:17:12] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/252584/ [22:25:09] (03CR) 1020after4: graphite: Clarify description of graphite_threshold for reqstats.5xx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252584 (owner: 10Krinkle) [22:28:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 4 below the confidence bounds [22:37:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [22:42:21] 6operations, 10Beta-Cluster-Infrastructure, 10VisualEditor, 7Varnish: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802367 (10hashar) The error 429 is... 
[22:42:48] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.6/extensions/TranslationNotifications/TranslationNotificationJob.php: hotfix T116960 (duration: 00m 30s) [22:42:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [22:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:18] ok I'm syncing wmf.6 to wikipedia [22:44:27] 6operations, 10Beta-Cluster-Infrastructure, 10VisualEditor, 7Varnish: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802380 (10hashar) And `//etc/varnis... [22:44:31] jouncebot: refresh [22:44:34] I refreshed my knowledge about deployments. [22:45:21] (03PS1) 1020after4: wikipedia wikis to 1.27.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252858 [22:46:11] (03CR) 1020after4: [C: 032] wikipedia wikis to 1.27.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252858 (owner: 1020after4) [22:46:32] (03Merged) 10jenkins-bot: wikipedia wikis to 1.27.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252858 (owner: 1020after4) [22:46:54] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: wikipedia wikis to 1.27.0-wmf.6 [22:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:50:22] (03PS1) 10Yuvipanda: toollabs: Don't explicitly declare redis collector [puppet] - 10https://gerrit.wikimedia.org/r/252859 [22:57:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [23:00:05] bd808: Respected human, time to deploy Monolog config UBN! fix (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151112T2300). Please do the needful. 
[23:01:05] twentyafterfour: should I wait a bit? [23:01:59] bd808: I'm done [23:02:01] go for it :) [23:02:49] (03PS3) 10BryanDavis: Monolog: wrap channel handlers in a WhatFailureGroupHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252359 (https://phabricator.wikimedia.org/T118057) [23:02:57] (03CR) 10BryanDavis: [C: 032] Monolog: wrap channel handlers in a WhatFailureGroupHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252359 (https://phabricator.wikimedia.org/T118057) (owner: 10BryanDavis) [23:03:22] (03Merged) 10jenkins-bot: Monolog: wrap channel handlers in a WhatFailureGroupHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252359 (https://phabricator.wikimedia.org/T118057) (owner: 10BryanDavis) [23:04:02] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [23:05:37] !log bd808@tin Synchronized wmf-config/logging.php: Monolog: wrap channel handlers in a WhatFailureGroupHandler (b08aaf4) (duration: 00m 29s) [23:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:52] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:05:55] (03PS1) 10GWicke: Add /api/ listing to www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) [23:06:58] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1802450 (10Reedy) 5Resolved>3Open http://en.m.wikipedia.beta.wmflabs.org/ is broken [23:06:58] (03PS1) 10Rush: openstack::common bit of refactor [puppet] - 10https://gerrit.wikimedia.org/r/252864 [23:07:04] Logs still getting into Logstash. 
Looks good to me [23:07:16] 6operations, 10Wikimedia-DNS: Move wiki to www.wikimedia.org.uk to avoid confusion with Wikimedia Ukraine - https://phabricator.wikimedia.org/T22182#1802453 (10Pcoombe) 5Open>3Resolved a:3Pcoombe WMUK's site moved to https://wikimedia.org.uk/ some time ago, so I'm closing this as Resolved. https://uk.wik... [23:08:32] (03CR) 10Milimetric: [C: 031] "thanks Gabriel!" [puppet] - 10https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: 10GWicke) [23:10:00] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1802464 (10Jdforrester-WMF) [23:10:02] 6operations, 10Beta-Cluster-Infrastructure, 10VisualEditor, 7Varnish: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802463 (10Jdforrester-WMF) [23:11:55] (03PS2) 10Thcipriani: Move scap-specific items out of mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/252362 (https://phabricator.wikimedia.org/T116606) [23:14:42] 6operations, 10Beta-Cluster-Infrastructure, 10VisualEditor, 7Varnish: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802485 (10Krenair) [23:14:45] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1802483 (10Krenair) 5Open>3Resolved ``` I took a look through deployment-cache-text04:/root/.bash_history and found the comma... 
[23:19:23] operations, Beta-Cluster-Infrastructure, VisualEditor, Varnish: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802493 (Jdforrester-WMF) @Krenair... [23:19:25] (PS1) Ori.livneh: Add new key for self (ori) [puppet] - https://gerrit.wikimedia.org/r/252866 [23:19:28] (CR) Alex Monk: "This will just be another instance of T118410..." [puppet] - https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: GWicke) [23:22:15] operations, Beta-Cluster-Infrastructure, VisualEditor, Varnish: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802504 (Krenair) Open>Resolv... [23:23:53] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures [23:25:07] (CR) GWicke: "@Krenair, wikimedia.org does have RESTBase: https://wikimedia.org/api/rest_v1/?doc" [puppet] - https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: GWicke) [23:25:51] (CR) GWicke: "However, you are right that it might not work for *www.*wikimedia.org."
[puppet] - https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: GWicke) [23:28:37] (PS2) Rush: openstack::common bit of refactor [puppet] - https://gerrit.wikimedia.org/r/252864 [23:29:49] (CR) Chad: [C: 1] "seems legit" [puppet] - https://gerrit.wikimedia.org/r/252866 (owner: Ori.livneh) [23:30:46] (PS1) EBernhardson: Turn off language detection user test [mediawiki-config] - https://gerrit.wikimedia.org/r/252867 (https://phabricator.wikimedia.org/T118197) [23:34:56] (CR) Rush: [C: 2] openstack::common bit of refactor [puppet] - https://gerrit.wikimedia.org/r/252864 (owner: Rush) [23:37:54] RoanKattouw: ostriches: hi! I'm about to add a CentralNotice update (submodule bump) to this evening's SWAT, sound good? [23:37:56] :) [23:38:10] AndyRussG: Sure [23:38:15] Does it contain i18n changes? [23:38:29] RoanKattouw: fantastic thanks... Yes, I'm afraid it does 8p [23:38:35] OK [23:38:38] We'll deal [23:38:42] operations, Beta-Cluster-Infrastructure, VisualEditor, Varnish, Verified: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802581 (Ryasmeen) [23:38:43] It's good to know that going in though [23:38:48] RoanKattouw: however they're only for stuff that's ever seen on meta [23:39:01] Dunno if that might provide some shortcuts [23:39:10] Unfortunately not [23:39:17] ...mmm oh well [23:39:30] RoanKattouw: cool thanks!! [23:41:36] RoanKattouw: this is the CentralNotice wmf deploy merge... https://gerrit.wikimedia.org/r/#/c/252861/ IIRC now it doesn't get automatically pushed to the core submodules?
[23:42:18] Probably not [23:42:30] but let me check [23:46:36] I don't think it will update automatically because of the different branch names [23:48:05] Yeah it usually doesn't [23:48:10] But I remember a case where it did [23:49:40] Yeah me too! [23:49:52] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:50:44] (CR) Mark Bergsma: [C: 1] Add new key for self (ori) [puppet] - https://gerrit.wikimedia.org/r/252866 (owner: Ori.livneh) [23:53:58] operations, Beta-Cluster-Infrastructure, VisualEditor, Varnish, Verified: [Regression pre-wmf.7] Images for musical scores, formulæ, heiroglyphics, thumbnails are returning 429s in the Beta Cluster when using VE (and other times?) - https://phabricator.wikimedia.org/T118486#1802607 (greg) Thanks... [23:54:30] RoanKattouw: heh now I remember... Also the instructions for how to create the core patch for a submodule bump are gone, since for most extensions it's automatic! [23:55:13] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [23:57:56] RoanKattouw: preparing the core patch, seemed OK but I just got the version wrong... [23:58:08] Thanks [23:58:26] I'm still updating to the latest version of the wmf branch after fighting the office wifi