[00:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T0000). Please do the needful.
[00:00:34] I guess we're doing https://gerrit.wikimedia.org/r/250992
[00:00:52] andrewbogott: oh nevermind me, yes I found the ferm
[00:01:06] James_F
[00:02:21] Krenair: When CI works, yes.
[00:02:26] andrewbogott: hmm, not sure how to fix that
[00:02:31] andrewbogott: I guess we should make it an array?
[00:03:14] it is working, the tests just take a while
[00:03:15] (CR) Awight: [C: -1] "Thanks for helping with this! One small change, we don't want the empty category= param cos the code that uses this variable actually add" [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: Nemo bis)
[00:03:22] It’s already defined in hiera as labs_ldap_dns_host_secondary
[00:03:28] aaah
[00:03:30] ok
[00:03:30] so it’s just a second ferm line I think
[00:04:03] ok let me make a patch andrewbogott
[00:04:07] well, wait, I’m wrong — that’s probably a different IP from the one that’s making the nova query
[00:04:18] so probably need to add a labs_designate_secondary_hostname
[00:04:21] or something like that
[00:05:07] (CR) BryanDavis: "Posted for SWAT on 2015-11-06T00:00Z." [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[00:05:34] (PS1) Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167
[00:05:40] andrewbogott: ugh, yeah, just noticed that too
[00:05:49] Krenair: https://gerrit.wikimedia.org/r/251168
[00:09:13] (PS2) Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167
[00:09:23] andrewbogott: ^?
[00:10:08] andrewbogott: should I define it for codfw too?
[00:10:23] (03CR) 10Andrew Bogott: [C: 04-1] labs: Open up nova API access to other DNS host too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251167 (owner: 10Yuvipanda) [00:10:32] I don’t think it’s useful to do for codfw right now [00:11:30] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1784279 (10awight) I think we're prepared to make this change now. The sample rate is parsed out of the filenames, so that... [00:12:44] (03PS3) 10Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - 10https://gerrit.wikimedia.org/r/251167 [00:12:57] andrewbogott: % [00:13:00] ^ [00:13:25] (03CR) 10Andrew Bogott: [C: 031] labs: Open up nova API access to other DNS host too [puppet] - 10https://gerrit.wikimedia.org/r/251167 (owner: 10Yuvipanda) [00:13:37] (03PS4) 10Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - 10https://gerrit.wikimedia.org/r/251167 [00:13:48] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Open up nova API access to other DNS host too [puppet] - 10https://gerrit.wikimedia.org/r/251167 (owner: 10Yuvipanda) [00:17:11] (03PS1) 10Yuvipanda: dnsrecursor: Fix permissions for config YAML file [puppet] - 10https://gerrit.wikimedia.org/r/251171 [00:17:35] (03PS2) 10Yuvipanda: dnsrecursor: Fix permissions for config YAML file [puppet] - 10https://gerrit.wikimedia.org/r/251171 [00:17:37] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [00:19:06] RECOVERY - Recursive DNS on 208.80.155.118 is OK: DNS OK: 0.144 seconds response time. 
www.wikipedia.org returns 208.80.154.224 [00:19:11] andrewbogott: chasemp ^ dns fixed [00:19:17] (03CR) 10Yuvipanda: [C: 032] dnsrecursor: Fix permissions for config YAML file [puppet] - 10https://gerrit.wikimedia.org/r/251171 (owner: 10Yuvipanda) [00:20:02] ebernhardson: dcausse any luck with nobelium? :) [00:23:17] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: puppet fail [00:24:37] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [00:25:57] um [00:26:36] wtf is up on tin? [00:26:38] twentyafterfour, hi [00:27:06] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:28:27] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:32:05] YuviPanda: its bqck to importing [00:32:34] (03PS3) 10Andrew Bogott: Update keystone policy.json to allow the 'observer' role to observe. [puppet] - 10https://gerrit.wikimedia.org/r/251151 (https://phabricator.wikimedia.org/T104588) [00:33:46] (03CR) 10Andrew Bogott: [C: 032] Update keystone policy.json to allow the 'observer' role to observe. [puppet] - 10https://gerrit.wikimedia.org/r/251151 (https://phabricator.wikimedia.org/T104588) (owner: 10Andrew Bogott) [00:34:26] (03PS1) 10Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 [00:34:35] andrewbogott: ^^ [00:34:41] ebernhardson: cool! any vague ETAs? [00:35:20] (03CR) 10jenkins-bot: [V: 04-1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 (owner: 10Yuvipanda) [00:36:48] YuviPanda: will moving it to uswsgi change the port? Or is that determined by the code and not the wsgi framework? 
[00:37:29] andrewbogott: so when we initially built this uwsgi couldn't actually serve http directly
[00:37:31] now it can
[00:37:34] so I just got rid of nginx
[00:38:02] I also couldn't find the service definition file for the service earlier
[00:38:13] Sure, I’m just wondering about http-socket => '0.0.0.0:80',
[00:38:21] andrewbogott: ah yup, that's the one.
[00:38:27] andrewbogott: before it was using a unix socket
[00:38:30] that nginx listened on
[00:38:51] so the service is moving to 80?
[00:39:02] it was always on 80 no?
[00:39:04] oh wait
[00:39:06] maybe not
[00:39:13] was on 5668!
[00:39:14] good catch
[00:39:16] let me fix that
[00:39:32] It should stay on 5668 if that’s easy
[00:39:44] yeah
[00:39:44] (PS2) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[00:39:46] it is
[00:39:48] I just fixed it
[00:39:50] I just didn't notice it
[00:39:53] now to fix the crazy pep errors.....
[00:40:40] (CR) jenkins-bot: [V: -1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[00:41:04] YuviPanda: hard to say, looks like 37M content documents to go, then 92M docs in the general indices.
[00:41:42] but one doc does not equal another doc, wiktionary docs for example are typically very small
[00:42:59] getting an idea on insert speed is also odd because of that... but if we guess something like 1k doc/sec, which is probably at the higher end, 34 hours
[00:43:39] (PS3) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[00:43:48] andrewbogott: right
[00:43:51] err
[00:43:53] ebernhardson: right
[00:43:58] ebernhardson: another week?
[00:44:03] YuviPanda: hopefully less
[00:44:09] nice!
[00:44:17] is this with the no-nested-documents fix?
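(Editor's aside on the estimate at 00:42:59.) The back-of-the-envelope ETA can be reproduced with a few lines of Python. This is only a sketch: the function name `eta_hours` is made up for illustration, and the inputs are the rough figures quoted in the channel (37M content documents, 92M general documents, a guessed 1k docs/sec). With those round numbers the result is about 36 hours, in the same range as the 34 hours quoted; the quoted figure presumably used slightly different counts.

```python
def eta_hours(docs_remaining: int, docs_per_second: float) -> float:
    """Hours needed to index docs_remaining at a constant insert rate.

    Hypothetical helper for illustration; real insert speed varies with
    document size, as noted in the conversation.
    """
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return docs_remaining / docs_per_second / 3600.0


if __name__ == "__main__":
    content_docs = 37_000_000   # "37M content documents to go"
    general_docs = 92_000_000   # "92M docs in the general indices"
    guessed_rate = 1_000        # "something like 1k doc/sec"
    hours = eta_hours(content_docs + general_docs, guessed_rate)
    print(f"estimated time remaining: {hours:.1f} hours")
```

Since the rate is described as "probably at the higher end", the result is better read as a lower bound on the remaining time.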
[00:44:19] * ebernhardson should figure out why elasticsearch.*.elasticsearch.indices.indexing.* are all 0
[00:44:26] YuviPanda: it still has nested documents, it's just lazy loading them
[00:44:33] (CR) jenkins-bot: [V: -1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[00:44:41] * YuviPanda nods head
[00:44:45] it doesn't appear to need them at indexing time, but if we query all 1800 indices they will be loaded into memory and break things
[00:44:53] heh
[00:45:15] it might depend on actually querying them via geosearch, not sure yet. never used this lazy load option before :)
[00:45:20] did you guys figure out if es in codfw is going to be active-active with eqiad?
[00:45:45] YuviPanda: the intention was to serve search queries from codfw app/api servers from that cluster, if that's what you mean?
[00:45:54] aaah ok
[00:45:56] yeah, that is.
[00:46:12] also the ability to shift traffic over for major upgrades (like es 2.0 which was just released)
[00:46:25] when they shift the major number like that the protocol between nodes changes, have to do the whole cluster
[00:47:04] ah
[00:47:12] so you can turn it all to one, do upgrade, turn it back, repeat
[00:47:16] yea
[00:48:23] Krenair, what's the status on SWAT? Can I add one super-late item?
[00:48:39] Waiting for twentyafterfour to appear
[00:48:44] There's something wrong
[00:49:32] Okay
[00:49:46] See tin:/srv/mediawiki-staging/weird-rebase
[00:49:55] contains private info
[00:50:55] Okay, I'll just put it in for tomorrow.
[01:00:05] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T0100). Please do the needful.
[01:00:21] heh
[01:14:18] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:34:07] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:35:57] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [01:41:06] RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:52:00] 7Puppet, 6Labs, 6Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784550 (10Negative24) @mmodell Not at a computer but isn't that just the tracking tag var? [01:55:34] AaronSchulz, hi [02:04:04] !log Someone has left tin:/srv/mediawiki-staging/php-1.27.0-wmf.5 in a mess, see `git log origin/wmf/1.27.0-wmf.5..HEAD --oneline`. Note there is one commit waiting to be merged on tin (https://gerrit.wikimedia.org/r/#/c/251168/) that hasn't been yet because of this. [02:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:04:19] I'm going to sleep. [02:05:29] Hope nobody runs scap [02:05:41] (03PS1) 10Ori.livneh: Revert "Remove /etc/wikimedia-image-scaler" [puppet] - 10https://gerrit.wikimedia.org/r/251188 [02:05:44] or syncs anything else really [02:06:26] Krenair: good night. who was the last person to sync stuff? [02:06:31] aaron [02:07:06] k [02:07:21] (03PS2) 10Ori.livneh: Revert "Remove /etc/wikimedia-image-scaler" [puppet] - 10https://gerrit.wikimedia.org/r/251188 [02:07:28] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Remove /etc/wikimedia-image-scaler" [puppet] - 10https://gerrit.wikimedia.org/r/251188 (owner: 10Ori.livneh) [02:10:22] 7Puppet, 6Labs, 6Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784582 (10mmodell) @negative24: Production isn't deployed via puppet anymore. I just need to set up labs instances to clone the deployment repo instead of the individual tags. 
[02:10:59] Krenair: I'm here [02:11:33] Krenair: I'll take care of it [02:13:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [02:13:50] 7Puppet, 6Labs, 6Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784583 (10Negative24) Ah, ok. (I'm a little bit curious of how the deployments are deployed; are they just pulled via git or something else?) [02:15:21] (03PS1) 10Ori.livneh: Only set $wgDisableOutputCompression to 'true' on the scalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251189 [02:15:25] twentyafterfour: thanks [02:15:41] i'm going to sync a config change, but won't touch the branches [02:16:00] (03CR) 10Ori.livneh: [C: 032] Only set $wgDisableOutputCompression to 'true' on the scalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251189 (owner: 10Ori.livneh) [02:16:22] (03Merged) 10jenkins-bot: Only set $wgDisableOutputCompression to 'true' on the scalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251189 (owner: 10Ori.livneh) [02:17:34] !log ori@tin Synchronized wmf-config/CommonSettings.php: I3c397e892e: Only set $wgDisableOutputCompression to 'true' on the scalers (duration: 00m 18s) [02:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:23:53] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1784606 (10GWicke) @faidon: Until very recently (last days), there wasn't actually any REST proxy with schema validation in the EventLogging repository. @ottomata now has [a patc... 
[02:28:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [02:35:58] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [02:40:46] !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 10m 31s) [02:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:47:21] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-11-05 02:47:20+00:00 [02:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:53:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [02:58:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [03:00:56] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [03:10:06] (03PS4) 10Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 [03:11:48] (03CR) 10jenkins-bot: [V: 04-1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 (owner: 10Yuvipanda) [03:15:13] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 10m 17s) [03:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:18:01] ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run for unit replicate-maps was over 1 day, 1:00:00 ago Coren Looking into it. 
[03:21:45] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.5) at 2015-11-05 03:21:45+00:00 [03:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:23:01] ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run for unit replicate-maps was over 1 day, 1:00:00 ago Coren Previous run pre-empted by manual backup next backup starts at 04:00:00 UTC and will clear the alarm. - The acknowledgement expires at: 2015-11-06 04:30:00 UTC. [03:31:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [03:31:45] twentyafterfour: looks like it could use a hard reset back to the origin branch and cherry pick of the security change back? [03:39:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [03:48:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [03:54:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [03:59:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [04:00:26] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful [04:24:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [04:27:28] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [04:28:47] 6operations, 6Commons, 10Wikimedia-Media-storage, 5MW-1.27-release-notes, and 
2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1784708 (10aaron) {F2916164} Large 5.5Mb list of ~40K orphaned files in the "public" zone for all of Commons. Files in... [04:42:38] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [05:09:07] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 19.05% of data above the critical threshold [100000000.0] [05:26:57] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [05:52:36] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:21:06] PROBLEM - RAID on es2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [06:30:07] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail [06:30:47] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:48] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:07] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:28] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:17] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:37] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [06:46:19] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1784783 (10Joe) In my experience handling out 3 
million events/day to a piwik installation means sounding th... [06:53:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [06:56:17] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:56:47] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:56:48] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:58:36] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:13] 6operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1784793 (10Matanya) [07:01:14] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1784790 (10Matanya) 5Open>3declined a:3Matanya The video team hired by Wikimedia Mexico had encoded and uploaded the videos directly to commons. This task i... 
[07:03:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [07:08:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [07:14:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [07:19:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [07:23:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [07:25:57] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:26:56] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:27:17] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [07:27:37] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:57] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:58] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:33:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds [07:38:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds [07:46:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected 
[07:57:28] (03PS3) 10KartikMistry: WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [07:58:42] (03PS4) 10KartikMistry: WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [08:05:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 5 08:05:13 UTC 2015 (duration 5m 12s) [08:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:59] (03PS1) 10Muehlenhoff: Assign Salt grain through the role, not by host [puppet] - 10https://gerrit.wikimedia.org/r/251202 [08:20:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign Salt grain through the role, not by host [puppet] - 10https://gerrit.wikimedia.org/r/251202 (owner: 10Muehlenhoff) [08:29:12] (03CR) 10DCausse: [C: 031] Make three of the newer ES nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: 10EBernhardson) [08:33:48] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1784891 (10jcrespo) [08:34:49] ACKNOWLEDGEMENT - RAID on es2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo Reported for replacement: https://phabricator.wikimedia.org/T117848 [08:43:57] 7Puppet, 6Labs, 6Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784904 (10mmodell) negative24: #scap3 [09:00:45] (03PS1) 10Jcrespo: Add ip resolution for new codfw db servers; Poolings and depoolings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251205 [09:03:35] (03CR) 10Jcrespo: "Al ip resolution of codfw servers have to be added to eqiad too (eqiad would fail to find a master if we failover to codfw), but no server" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251205 (owner: 10Jcrespo) [09:21:00] (03PS2) 10Jcrespo: Add ip resolution for 
new codfw db servers; Poolings and depoolings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251205 [09:21:24] (03PS2) 10Giuseppe Lavagetto: terbium: move mediawiki monitoring [puppet] - 10https://gerrit.wikimedia.org/r/250931 (https://phabricator.wikimedia.org/T116728) [09:22:47] (03CR) 10Jcrespo: [C: 032] Add ip resolution for new codfw db servers; Poolings and depoolings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251205 (owner: 10Jcrespo) [09:25:18] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool db2048, db2049, db2050,db2055, db2057, db2063. Depool db2034, db2035, db2051 (duration: 00m 17s) [09:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:25:54] Thanks I have very good notes here, because if not, I could not be able to keep track [09:33:16] !log stopping mysql and cloning db2034 -> db2062, db2035 -> db2063, db2051 -> db2058 [09:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:35:02] (03CR) 10Nemo bis: "Awight, do you have other ideas on how to prevent the redirection to the canonical URL?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: 10Nemo bis) [09:37:38] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1784983 (10mark) This is clearly a system for analytics. Will it be implemented, maintained and supported by... 
[09:38:30] (03CR) 10Giuseppe Lavagetto: [C: 032] terbium: move mediawiki monitoring [puppet] - 10https://gerrit.wikimedia.org/r/250931 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [09:56:22] !log run removenode on cerium.eqiad.wmnet -- decomission was missed before reimaging [09:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:58:07] 6operations, 6Commons, 10Wikimedia-Media-storage, 5MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1785015 (10Ankry) >>! In T111838#1784708, @aaron wrote: > {F2916164} > > Large 5.5Mb list of ~40K orphaned files in the... [09:58:18] (03PS7) 10Alexandros Kosiaris: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [10:03:04] (03CR) 10Nemo bis: "Or in other words, is this guaranteed to add at least one URL parameter (which will prevent URL redirect)?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: 10Nemo bis) [10:04:41] (03PS2) 10Nemo bis: Use full URL in $wgNoticeHideUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) [10:05:40] 6operations, 6Commons, 10Wikimedia-Media-storage, 5MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1785028 (10jcrespo) Those 3 files and the ones on the description have a space character, could it be related to: T10767... 
[10:15:52] 6operations, 6Commons, 10Wikimedia-Media-storage, 5MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1785047 (10Ankry) Just for records: I have received information that on 8 October 2015 another file disappeared while b... [10:22:00] (03Abandoned) 10Giuseppe Lavagetto: terbium: remove role mediawiki::maintenance [puppet] - 10https://gerrit.wikimedia.org/r/250930 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [10:30:55] (03CR) 10Alexandros Kosiaris: [C: 031] openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [10:32:05] (03PS8) 10Muehlenhoff: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) [10:32:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [10:34:36] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100% [10:35:05] mmm [10:36:04] that is probably network saturation [10:37:24] (03PS2) 10Muehlenhoff: Add schema definitions for openldap/labs to be passed in $extra_schemas [puppet] - 10https://gerrit.wikimedia.org/r/250011 (https://phabricator.wikimedia.org/T101299) [10:38:23] no, it is definitely down [10:38:59] (03PS3) 10Muehlenhoff: Add schema definitions for openldap/labs to be passed in $extra_schemas [puppet] - 10https://gerrit.wikimedia.org/r/250011 (https://phabricator.wikimedia.org/T101299) [10:39:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add schema definitions for openldap/labs to be passed in $extra_schemas [puppet] - 10https://gerrit.wikimedia.org/r/250011 (https://phabricator.wikimedia.org/T101299) (owner: 
10Muehlenhoff) [10:54:20] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785142 (10jcrespo) 3NEW [10:56:02] (03CR) 10Alexandros Kosiaris: "I was under the impression that instead of relying on backports we import packages into the backports suite of our own repo. That used to " [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh) [11:00:13] 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1785162 (10mark) >>! In T116750#1774292, @MoritzMuehlenhoff wrote: > Hardware budget needed: 24 * 50 dollars if all members of the "ops" group receive a Yubikey Neo -> 1200 dollars. (Plus possible shipping costs... [11:03:26] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1785166 (10fgiunchedi) thanks Daniel! I'll track the swift expansion here [11:09:51] (03CR) 10Filippo Giunchedi: restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [11:10:33] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1785178 (10fgiunchedi) ok, looks like there's agreement on going with the `systemctl mask` idea, the same can be applied to restbase (easier once converted to sys... [11:15:22] 1.5 hours to clone a full db server. Maybe I can improve that? 
[11:20:02] moar parallelism
[11:22:33] I do not think that sending a tar.gz in parallel will win much :-)
[11:22:51] but I already get a x4 or x5 improvement in bytes sent
[11:26:39] godog: hello :) I got python-os-client-config prepared for jessie-wikimedia/backports at https://phabricator.wikimedia.org/T104967#1773653 in case you missed the mail notification
[11:27:26] jynus: oh well, 3 days to get my home server recovered from a single drive crash ;p
[11:27:50] hashar: yeah I saw that, I should be able to get to it today or tomorrow
[11:27:50] not ssd, I suppose
[11:27:56] no
[11:28:18] disks are sloowwww
[11:28:19] I do not know what it is an hd on my desktop/laptop anymore
[11:28:28] the whole process feels a lot like the big Labs NFS outage
[11:28:32] I was a costly investment
[11:28:45] <_joe_> yes you were!
[11:28:48] but my quality of life improved xinfinity
[11:28:49] i don't want to buy 8 TB of SSDs for home use :P
[11:28:58] <_joe_> mark: 8 TB??
[11:29:02] why do you need 9 TG?!
[11:29:05] i do have SSD in my laptop, sure
[11:29:39] <_joe_> jynus: for his collection of movie backups I guess :P
[11:29:50] i mostly watch netflix these days actually
[11:30:04] <_joe_> apart from jokes, I have 2 TB and it holds all my backups
[11:30:18] <_joe_> oh I actually have 4 now, scratch that
[11:30:21] it's 8 TB before raid1 ;)
[11:30:24] so 4 TB usable
[11:30:33] <_joe_> Time machine needed a bigger disk
[11:30:34] there we go :P
[11:30:38] yes, I have 3 TB for backups, but I do not need redundancy there
[11:31:00] yeah well, in my power outage, one drive died completely, another one was already a bit flaky
[11:31:07] so going from 3 drives to 1.5 in a raid5 setup wasn't great :P
[11:31:27] <_joe_> ewww
[11:31:51] godog: and if you have any motivation, I could use a backport of python-shade which is blocked by that python-os-client-config (all of that to bump Nodepool)
[11:33:02] oh
[11:33:20] jynus: random question from my coworking place: do we use SSD on our MariaDB servers?
[11:33:35] we were wondering if one could get a transparent mix of SSD / HD
[11:33:56] with more access data on the SSD and rest on the HD
[11:34:00] operations, Revscoring, Services, service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1785218 (akosiaris) >>! In T117560#1778582, @yuvipanda wrote: > From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis t...
[11:34:10] hashar depends on many factors
[11:34:26] buffer pool hit rate
[11:34:31] size of working set
[11:34:39] size of total db
[11:34:52] <_joe_> hashar: you'll never get a straight answer to such a question from a dba, since it truly depends on many factors. there is no silver bullet.
[11:35:10] I can imagine
[11:35:17] well, the question was: do we use SSD on our MariaDB servers?
[11:35:22] but still wondering whether we have SSD [11:35:23] yeah [11:35:25] the answer is yes [11:35:27] :-) [11:35:37] good enough for the DB / disk io newbie I am [11:35:45] hashar: if you want a better answer, you need to optimise the query :-) [11:35:52] ha ha [11:35:54] <_joe_> hashar: and about a mix of ssds/hds of course you can [11:36:12] facebook worked on a diskcache implementation [11:36:18] I do not know the state of that [11:36:37] you can do a poor man's substitution [11:36:46] puting certain tables on a different disk [11:37:03] a while ago I played with linux hybrid caching, bcache and lvm native, not impressed with the latter so far but bcache seemed to work ok [11:37:09] or you know, do something at disk level, but that can have mixed results [11:37:37] (this https://phabricator.wikimedia.org/T88992) [11:37:42] in most cases, investing on memory is more productive for the buck, but more expensive [11:39:00] godog, problem is that with mysql things get more complex, there are hot and cold areas even within files [11:39:26] been asking that since some Apple Mac Mini have SSD/HD system which are seen as a single disk. The OS takes care of offloading least recently / big files to the HD [11:40:07] <_joe_> hashar: that is the diskcache jynus was referring to, I was ofc referring to putting hot tables on SSDs directly :) [11:40:16] jynus: heh I'm not sure how it handles that, it might be blockwise [11:40:33] https://www.facebook.com/notes/mysql-at-facebook/releasing-flashcache/388112370932 [11:40:47] so do we purely use SSD or do we spread / shard tables between SSD and HD? 
[11:41:54] hashar, that is a money answer [11:42:48] if you have the money, going to SSDs is always going to be better [11:43:28] except some issues with the doublewrite area that some people experimented having wear issues [11:43:43] but it requires tuning [11:43:56] I should ask again the MariaDB folks here and come back with useful tech / doc instead of mumbling :/ [11:44:01] by default, mysql is tuned for HDs, so it does many sequential scans [11:44:39] for example, the transaction log is pure sequential writes [11:45:07] also, for example, we use RAID cache, which changes a lot things [11:46:04] (03PS1) 10Giuseppe Lavagetto: noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 [11:46:06] (03PS1) 10Giuseppe Lavagetto: noc: remove mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/251222 [11:46:08] (03PS1) 10Giuseppe Lavagetto: noc: remove mod_userdir inclusion [puppet] - 10https://gerrit.wikimedia.org/r/251223 [11:46:10] (03PS1) 10Giuseppe Lavagetto: mediawiki: extract php config from web::modules, use in noc [puppet] - 10https://gerrit.wikimedia.org/r/251224 [11:46:12] (03PS1) 10Giuseppe Lavagetto: noc: make noc virtualhost compatible with apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/251225 [11:46:14] (03PS1) 10Giuseppe Lavagetto: noc: add role to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/251226 [11:46:16] (03PS1) 10Giuseppe Lavagetto: noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 [11:48:04] and to be fair, moritz's answer is the right one 90% of the time [11:49:19] on my career as a consultant, only once I returned a report and said: your queries are perfect, we can only do things in hardware/rearchitecture [11:50:46] <_joe_> wow I am pretty amazed that actually happened [11:51:11] <_joe_> in my experience, you have to constantly analyze and optimize your queries as your dataset evolves/grows [11:51:30] <_joe_> so if someone can manage to keep a 
pristine record, it's pretty impressive [11:51:38] yes, basically they had been working before with an oracle employee [11:51:59] so there it is the mystery [11:52:52] it is the same as with programming- performance optimization never finishes, you only do the quick wins first [11:54:20] we are actually there on the external storage [11:54:39] (03PS2) 10Giuseppe Lavagetto: noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 [12:04:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds [12:05:41] the 12h bump [12:06:30] is that due to some cache expiring? [12:06:50] (03PS1) 10Filippo Giunchedi: fully deprovision tungsten [puppet] - 10https://gerrit.wikimedia.org/r/251228 (https://phabricator.wikimedia.org/T97274) [12:06:57] as far as I know it is request-based [12:07:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] fully deprovision tungsten [puppet] - 10https://gerrit.wikimedia.org/r/251228 (https://phabricator.wikimedia.org/T97274) (owner: 10Filippo Giunchedi) [12:07:34] (03PS1) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) [12:08:22] (03CR) 10jenkins-bot: [V: 04-1] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [12:10:15] (03PS2) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) [12:11:06] (03CR) 10jenkins-bot: [V: 04-1] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [12:18:20] (03PS3) 10Giuseppe Lavagetto: noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 
[12:18:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 (owner: 10Giuseppe Lavagetto) [12:26:48] (03PS2) 10Giuseppe Lavagetto: noc: remove mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/251222 [12:27:37] (03PS3) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) [12:28:36] (03CR) 10jenkins-bot: [V: 04-1] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [12:30:47] (03CR) 10Giuseppe Lavagetto: [C: 032] noc: remove mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/251222 (owner: 10Giuseppe Lavagetto) [12:33:18] <_joe_> !log manually disabling mod_cgi on terbium [12:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:38:12] (03PS2) 10Giuseppe Lavagetto: noc: remove mod_userdir inclusion [puppet] - 10https://gerrit.wikimedia.org/r/251223 [12:38:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "noop according to the puppet compiler" [puppet] - 10https://gerrit.wikimedia.org/r/251223 (owner: 10Giuseppe Lavagetto) [12:40:54] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail [12:42:13] (03PS2) 10Giuseppe Lavagetto: mediawiki: extract php config from web::modules, use in noc [puppet] - 10https://gerrit.wikimedia.org/r/251224 [12:43:27] 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1785348 (10mobrovac) >>! In T117560#1783344, @GWicke wrote: > @halfak, it's a general concern, but something computationally intense and research-driven like ORES is espe... [12:47:57] (03CR) 10BBlack: [C: 04-1] "The more think about this, the more I'm concerned about the DHE>1024 compatibility issue. 
Probably any client old/crappy enough that DHE+" [puppet] - 10https://gerrit.wikimedia.org/r/251153 (owner: 10BBlack) [12:49:39] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: extract php config from web::modules, use in noc [puppet] - 10https://gerrit.wikimedia.org/r/251224 (owner: 10Giuseppe Lavagetto) [12:50:39] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1785355 (10hashar) @robh scandium has been installed with Trusty. Would need to reimage it to Jessie instead (sorry). Some firewall rules have been... [12:51:09] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1785357 (10hashar) [12:54:51] (03PS2) 10Giuseppe Lavagetto: noc: make noc virtualhost compatible with apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/251225 [12:57:28] (03PS4) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) [12:57:52] (03CR) 10Giuseppe Lavagetto: [C: 032] noc: make noc virtualhost compatible with apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/251225 (owner: 10Giuseppe Lavagetto) [12:57:59] (03CR) 10Mobrovac: restbase: move to systemd unit file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [13:02:12] (03PS2) 10Giuseppe Lavagetto: noc: add role to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/251226 [13:05:24] (03CR) 10Giuseppe Lavagetto: [C: 032] noc: add role to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/251226 (owner: 10Giuseppe Lavagetto) [13:06:34] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:12:15] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: 
Puppet has 1 failures [13:12:48] (03PS1) 10Giuseppe Lavagetto: noc: puppetize dbtree directories [puppet] - 10https://gerrit.wikimedia.org/r/251233 [13:13:16] _joe_, did you give mw1152 the same sort of network and mysql access as terbium? [13:13:36] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] noc: puppetize dbtree directories [puppet] - 10https://gerrit.wikimedia.org/r/251233 (owner: 10Giuseppe Lavagetto) [13:14:03] <_joe_> Krenair: what do you mean? [13:15:26] <_joe_> is there some specific special access terbium has you are aware of? [13:17:16] manifests/role/mariadb.pp: srange => '@resolve((tin.eqiad.wmnet mira.codfw.wmnet terbium.eqiad.wmnet))', [13:17:17] manifests/role/ganglia.pp: '10.64.32.13', # terbium [13:17:40] maybe this: modules/ganglia/templates/deprecated/gmetad.conf.erb:trusted_hosts 208.80.152.165 208.80.154.149 208.80.154.14 10.64.32.13 #bastions, neon, terbium [13:18:13] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:18:39] <_joe_> Krenair: the ganglia thing doesn't make sense anymore [13:18:46] ok [13:19:18] <_joe_> and for mariadb, I am aware of that and it's an upcoming patch [13:20:24] <_joe_> ganglia I wasn't, tbh [13:21:38] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1785377 (10Krenair) >>! In T117394#1774385, @Krenair wrote: > IIRC, labswiki jobs are supposed to be running locally on silver only... Actually, we... 
[13:22:07] <_joe_> Krenair: it's an upcoming patch as I'm not sure that is needed anymore as well [13:26:14] (03CR) 10Jcrespo: [C: 032] Repool db2051; Depool db2042, 38, 39, 40 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251234 (owner: 10Jcrespo) [13:28:54] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail [13:34:27] 6operations, 10Wikimedia-Planet, 10procurement: ssl certificate renewal: *.planet.wikimedia.org - https://phabricator.wikimedia.org/T117866#1785390 (10RobH) 3NEW a:3mark [13:34:35] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [13:35:44] ^^^ check_puppetrun alerts noted, just a puppetmaster reboot [13:39:25] RECOVERY - check_puppetrun on payments2001 is OK: OK: Puppet is currently enabled, last run 94 seconds ago with 0 failures [13:40:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [13:40:25] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures [13:40:54] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1785403 (10jcrespo) These were the first occurrences: ``` { "_index": "logstash-2015.10.31", "_type": "mediawiki", "_id": "AVC8f0N1lAIL90ZzMe... [13:43:56] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2051; Depool db2042, 38, 39, 40 for cloning (duration: 00m 18s) [13:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:40] !log shuting down mysql and cloning db2042 -> db2062, db2038 -> db2059, db2039 -> db2060, db2040 -> db2061 [13:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:56] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:07:02] (03CR) 10Faidon Liambotis: "This was never really the case. 
The "backports" section of our repository exists for backports that don't exist in backports in Debian and" [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh) [14:08:16] (03CR) 10Faidon Liambotis: [C: 04-1] "Also, -1 again because 3/4 of the previous comments went unanswered, guessing that ori missed them since they were on the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh) [14:12:26] !log reinstalling db2056.codfw.wmnet [14:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:26] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1785440 (10RobH) [14:22:43] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616223 (10RobH) [14:25:06] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1785456 (10RobH) [14:43:16] PROBLEM - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /var 13294 MB (3% inode=99%) [14:44:12] (03PS1) 10Faidon Liambotis: Add loopback for cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/251242 [14:44:14] (03PS1) 10Faidon Liambotis: Allocate subnets/VLANs for cr2-esams neighbor links [dns] - 10https://gerrit.wikimedia.org/r/251243 [14:48:51] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1785475 (10Ottomata) > Until very recently (last days), there wasn't actually an EventBus-like REST proxy with schema validation in the EventLogging repository. Not quite true, t... 
[14:49:48] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/237380 (https://phabricator.wikimedia.org/T111377) (owner: 10Hashar) [14:50:48] (03CR) 10Faidon Liambotis: [C: 032] swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) (owner: 10Filippo Giunchedi) [14:53:16] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail [14:55:20] (03CR) 10Faidon Liambotis: [C: 032] Add loopback for cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/251242 (owner: 10Faidon Liambotis) [14:56:17] (03PS5) 10Giuseppe Lavagetto: gdash: deprecate reqerror dashboard, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 [14:56:30] ACKNOWLEDGEMENT - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /var 7402 MB (2% inode=99%): Filippo Giunchedi looking, effect of nodetool removenode [14:57:30] (03PS2) 10Faidon Liambotis: Allocate subnets/VLANs for cr2-esams neighbor links [dns] - 10https://gerrit.wikimedia.org/r/251243 [14:57:58] (03PS2) 10BBlack: ssl_ciphersuite: add DHE+3DES option only for "mid" [puppet] - 10https://gerrit.wikimedia.org/r/251153 [14:58:53] (03CR) 10Faidon Liambotis: [C: 032] Allocate subnets/VLANs for cr2-esams neighbor links [dns] - 10https://gerrit.wikimedia.org/r/251243 (owner: 10Faidon Liambotis) [14:59:37] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1785481 (10Jgreen) > Maybe we can doctor the last old files and the first new files by hand, so that they splice nearly perf... 
[15:00:03] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 [15:00:17] PROBLEM - Check size of conntrack table on chromium is CRITICAL: CRITICAL: nf_conntrack is 92 % full [15:00:46] RECOVERY - Disk space on praseodymium is OK: DISK OK [15:00:50] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [15:01:08] (03CR) 10BBlack: [C: 04-1] "This also needs appropriate backend definition stuff in the "directors" (which references the real "mw1152.eqiad.wmnet" uses a label like " [puppet] - 10https://gerrit.wikimedia.org/r/251227 (owner: 10Giuseppe Lavagetto) [15:02:17] RECOVERY - Check size of conntrack table on chromium is OK: OK: nf_conntrack is 1 % full [15:03:02] I was going to say: I didn't see much going on on chromium [15:05:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) (owner: 10Filippo Giunchedi) [15:06:43] (03PS2) 10Filippo Giunchedi: swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) [15:06:53] (03CR) 10Filippo Giunchedi: [V: 032] swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) (owner: 10Filippo Giunchedi) [15:07:08] (03PS2) 10Giuseppe Lavagetto: noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 [15:08:57] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [15:12:47] 7Puppet, 6operations, 5Patch-For-Review: merge swift_new and swift puppet modules/classes - https://phabricator.wikimedia.org/T107416#1785509 (10fgiunchedi) 5Open>3Resolved all done, `swift` and `swift_new` have been 
merged by @faidon and `nobootwait` added [15:13:21] (03CR) 10Faidon Liambotis: [C: 04-1] exim: Add and use $::other_site to provide LDAP fallback (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662) (owner: 10Alexandros Kosiaris) [15:13:34] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [15:14:36] (03CR) 10Faidon Liambotis: [C: 031] "Sounds fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto) [15:15:11] (03PS6) 10Giuseppe Lavagetto: gdash: deprecate reqerror dashboard, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 [15:16:15] (03CR) 10Giuseppe Lavagetto: [C: 032] gdash: deprecate reqerror dashboard, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto) [15:16:28] (03CR) 10Faidon Liambotis: [C: 04-2] "Honestly, I don't really like a) hardcoding "standard" to the role classes (it doesn't really belong there), b) hardcoding eth0 into the r" [puppet] - 10https://gerrit.wikimedia.org/r/250616 (owner: 10Dzahn) [15:17:46] (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [software/otrs] - 10https://gerrit.wikimedia.org/r/248916 (owner: 10Alexandros Kosiaris) [15:17:51] (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [software/otrs] - 10https://gerrit.wikimedia.org/r/248915 (owner: 10Alexandros Kosiaris) [15:18:08] Krenair: https://gerrit.wikimedia.org/r/#/c/245139/ ? [15:18:17] (03PS1) 10BBlack: config-geo: more of the middle US -> codfw [dns] - 10https://gerrit.wikimedia.org/r/251247 [15:18:43] ? 
[15:18:56] I responded there a while ago [15:19:15] not sure if you saw that [15:19:37] I saw it, haven't worked on it yet [15:19:43] ok [15:19:44] (03CR) 10Alexandros Kosiaris: exim: Add and use $::other_site to provide LDAP fallback (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662) (owner: 10Alexandros Kosiaris) [15:19:48] (03PS3) 10Giuseppe Lavagetto: noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 [15:20:15] (03CR) 10BBlack: [C: 031] noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 (owner: 10Giuseppe Lavagetto) [15:20:46] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:05] (03CR) 10Dzahn: "you say as a reason to not do this that "We generally have both standard and the IPv6 stuff in site.pp for all hosts." but the point is th" [puppet] - 10https://gerrit.wikimedia.org/r/250616 (owner: 10Dzahn) [15:22:35] (03PS2) 10BBlack: config-geo: more of the middle US -> codfw [dns] - 10https://gerrit.wikimedia.org/r/251247 (https://phabricator.wikimedia.org/T114659) [15:23:26] mutante: yeah well, I disagree with the point :) [15:23:32] (03CR) 10Giuseppe Lavagetto: [C: 032] noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 (owner: 10Giuseppe Lavagetto) [15:23:45] and I don't think the solution is to move this repetition of the ipv6 stanza all over random roles [15:23:56] that stanza is nothing specific to the roles itself [15:24:09] <_joe_> for ipv6 I agree, I am unsure about standard [15:24:16] neither is standard, which btw is defined on site.pp [15:24:23] <_joe_> repeating it in every node seems... 
wrong [15:24:30] so what you're actually doing makes the modules unusable from anywhere else [15:24:37] the roles, sorry [15:24:53] <_joe_> well, we could move standard to a proper location :) [15:25:01] why? [15:25:09] it's the least of our problems really [15:25:50] <_joe_> no I was saying if that is the reason not to include it [15:26:01] that's one of the reasons for sure [15:26:09] <_joe_> I'm actually totally neutral on the topic [15:26:23] <_joe_> I see pros and cons with both approaches [15:26:50] (03PS2) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 [15:27:01] so how do these roles work on labs right now? [15:27:09] that include standard? [15:27:20] <_joe_> labs has site.pp in its puppet tree [15:27:34] right [15:27:36] ew.. [15:27:40] <_joe_> actually, site.pp is the entry point in labs as well ;) [15:27:53] <_joe_> it's puppetlabs! [15:28:17] <_joe_> paravoid: if we properly used environments, maybe that could be an issue [15:28:20] <_joe_> but we don't [15:28:30] I doubt there is a way to "properly use environments" [15:28:37] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [15:28:56] but I'd be willing to be convinced otherwise ;) [15:30:02] <_joe_> well for example it's possible to test patches to production classes/modules without actually needing to merge them. 
We could use it to get rid of 99% of self-hosted puppetmasters in labs [15:34:41] (03PS3) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 [15:34:59] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [15:41:22] 6operations, 10CirrusSearch, 6Discovery, 5Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#1785591 (10chasemp) Is there a disadvantage to having 4 eligible masters? I know we have a minimum viability setting righ... [15:41:58] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1785592 (10Pcoombe) @awight Sounds like it would be safest to just take campaigns down if it's only for a short window. Plea... [15:43:22] (03CR) 10Rush: [C: 031] "sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/251019 (owner: 10Muehlenhoff) [15:45:38] (03PS1) 10Giuseppe Lavagetto: mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 [15:46:33] <_joe_> jynus: ^^ [15:46:53] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [15:47:43] (03CR) 10Jcrespo: [C: 04-1] "we do not need to add terbium." [puppet] - 10https://gerrit.wikimedia.org/r/251250 (owner: 10Giuseppe Lavagetto) [15:48:05] (03PS2) 10Giuseppe Lavagetto: mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 [15:48:07] <_joe_> jynus: heh I already noticed [15:48:08] <_joe_> :) [15:48:18] 10.x [15:48:24] has access [15:48:28] <_joe_> oh [15:48:41] <_joe_> so it's just a matter of firewall I guess [15:48:52] <_joe_> and that grants file is... useless? 
[15:49:24] no, that is needed [15:49:55] <_joe_> so what's the issue? just the ip wrong in the grant? I already corrected it [15:50:45] RECOVERY - Host db2034 is UP: PING WARNING - Packet loss = 64%, RTA = 34.79 ms [15:51:27] <_joe_> jynus: so... is the patch now correct? [15:53:46] I do not know what you are referring to [15:53:52] but by parts [15:54:58] mw1152.eqiad.wmnet doesn't need to be added to the firewall [15:55:08] (03PS1) 10Dzahn: ldap: split client.pp into one file per class [puppet] - 10https://gerrit.wikimedia.org/r/251251 [15:55:10] <_joe_> it doesn't? why? [15:55:21] because it is already included on the list of hosts that can access mysqls [15:55:21] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1785605 (10chasemp) a:5RobH>3Papaul for https://phabricator.wikimedia.org/T117097#1783632 thanks papaul [15:55:54] <_joe_> jynus: including silver? [15:56:02] <_joe_> I would've expected otherwise [15:56:08] <_joe_> from puppet at least [15:56:09] let me recheck, but I wouls say yes [15:56:54] nom you are right [15:57:06] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:06] no, you are right, it is separated [15:59:58] <_joe_> ok, so I'll go on with merging that patch [16:00:05] anomie ostriches thcipriani marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1600). Please do the needful. [16:00:05] Luke081515: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:08] (03CR) 10Jcrespo: [C: 031] mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 (owner: 10Giuseppe Lavagetto) [16:00:25] 6operations, 10Traffic, 5Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1785610 (10BBlack) This is what the US States look like, assuming patch 251247 above is applied: {F2918924} [16:02:05] 6operations, 10Traffic, 5Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1785615 (10chasemp) awesome [16:02:16] PROBLEM - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 37.50% of data under the critical threshold [90.0] [16:02:41] (03CR) 10Giuseppe Lavagetto: [C: 032] mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 (owner: 10Giuseppe Lavagetto) [16:03:32] jouncebot: why didn't you ping me! [16:03:46] ACKNOWLEDGEMENT - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 37.50% of data under the critical threshold [90.0] Filippo Giunchedi codfw swift expansion in progress [16:04:12] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1785625 (10Cmjohnson) This server is out of warranty. Is there a plan to replace these in the near term? I can send spare disks from eqiad to codfw if needed. [16:04:20] whoops, SWAT time. Luke081515|away jzerebecki matt_flaschen ready? [16:04:41] Yep [16:04:49] (03PS1) 10Dzahn: osm: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/251254 [16:04:52] y [16:05:06] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785627 (10Cmjohnson) a:3Papaul Papaul, Could you please troubleshoot this before you leave. 
Thanks [16:05:08] <_joe_> jynus: {{done}}, I just need the grant to be in effect now :) [16:05:37] 1 sec [16:05:55] cmjohnson1: working already on it [16:06:52] curl silver.wikimedia.org:3306 works, grant should too [16:07:59] <_joe_> ok thanks [16:08:26] http://cdn.debian.net/debian <-- looks nice and modern [16:09:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0]
[16:09:54] jzerebecki: I think jouncebot got confused because there are wrappers in the DOM for that SWAT section. The parser it uses is pretty sensitive to the DOM output. [16:10:39] * bd808 will look at it [16:11:36] hashar: use http://httpredir.debian.org/ though it sometimes has errors but recovers on retry [16:12:41] (PS1) BBlack: config-geo: ulsfo fallback to codfw before eqiad [dns] - https://gerrit.wikimedia.org/r/251255 [16:13:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [16:13:29] (CR) Hashar: "check experimental" [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/251245 (owner: Hashar) [16:13:38] operations, Continuous-Integration-Scaling, Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1785699 (fgiunchedi) [16:13:40] operations, Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1785696 (fgiunchedi) Open>Resolved a:fgiunchedi ok I've uploaded `python-os-client-config` ``` root@carbon:~# reprepro -C backports... [16:13:49] Krenair: what is the "weird-rebase" file here? [16:14:04] operations, Continuous-Integration-Scaling, Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491090 (fgiunchedi) all dependencies should be available now internally, please try to backport [16:15:32] James_F: it doesn't seem like your evening swat thing was deployed, is that right? [16:15:39] operations, ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1785703 (Cmjohnson) To purchase new disks replacements from newegg, the disks are Approx $244.00 each. I have 8 decommissioned ES hosts in eqiad that have those disks. I can send a dozen or so disks to...
[16:16:32] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [16:20:33] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1785720 (10faidon) More importantly, I don't understand why this is something Andrew has to do (and "soon") and not the services team "or else". Why is it a given that the Servic... [16:22:16] hmm we seem to be 22 commits ahead of wmf/1.27.0-wmf.5 and only one is marked security...these don't seem to be deployed. [16:23:17] that seems to be related to the weird-rebase file [16:23:32] thcipriani: Krenair and AaronSchulz were talking about that last night [16:24:06] AaronSchulz thought it needed to be reset to upstream and the one sec patch reapplied [16:24:09] yeah, I was just reading back scroll [16:24:26] RECOVERY - Host db2034 is UP: PING OK - Packet loss = 0%, RTA = 34.41 ms [16:24:28] that _does_ seem like the right thing to do here. [16:24:42] kk, doing that. [16:26:42] (03PS2) 10BBlack: config-geo: ulsfo fallback to codfw before eqiad [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) [16:27:59] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1785745 (10Nuria) > As mentioned, we might want to use a single node process exposing parsoid, restbase & eventbus for small (third party) installs, but might as well use the ne... [16:28:36] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:22] I am not going to ack^ db2034 for now- I think papaul is working on it [16:32:10] jynus: yes will let you know [16:34:38] (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar) [16:34:39] thank you very much! 
[16:35:26] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1785750 (10BBlack) It's been over a week since the email, which ended up going out a bit later after the releases than expect... [16:36:01] akosiaris: thank you again for the puppet package_builder class and the jessie-wikimedia WIKIMEDIA=yes stuff :} [16:36:15] akosiaris: it works! https://integration.wikimedia.org/ci/job/debian-glue/24/ [16:36:43] (03PS3) 10BBlack: HTTPS redirects: remove InstantCommons exception [puppet] - 10https://gerrit.wikimedia.org/r/224557 (https://phabricator.wikimedia.org/T102566) [16:36:58] jynus: just a quick update when i got here today db2034 was completely power off [16:36:59] thcipriani: note that where i checked the rest of the fleet is at 96d099dab949f5d430c01a7d6bc2d9722f622ed2 [16:37:22] so this will deploy some new commits [16:38:54] papaul, yep, I expected a full crash [16:39:18] hmm, yeah, there are 7 new commits counting James_F merged last night and the matt_flaschen one merged this morning and not counting security commits. [16:39:57] I made a couple, but mediawiki-config only [16:40:04] yup [16:41:36] sigh. OK. I'm going to apply the remainder of the security commits. I think I'm going to let twentyafterfour verify my thinking on the repo and run a full scap as part of the train since this window is almost over and I haven't untangled this ball of wax yet. [16:42:13] thcipriani: Yup, Krenair found production in an inconsistent state and didn't deploy, I think. [16:42:21] repo is mostly cleaned up, but I'm confused how it got in this state and I could use a little more time to sort everything out. [16:42:36] James_F: kk, thanks for confirming. [16:43:23] sorry jzerebecki matt_flaschen and Luke081515 I'm going to scrub this SWAT until I get it sorted. [16:43:58] ok [16:44:19] thcipriani, it's okay. Let me know if I can help. 
PM is fine if you want. [16:47:39] thcipriani: sigh indeed [16:54:50] (03PS1) 10Ori.livneh: Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 [16:54:57] bblack: ^ [16:56:18] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1785779 (10RobH) a:5hashar>3RobH [16:58:27] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1785789 (10Papaul) labtestmetal2001 ge-5/0/8 NIC1 ge-5/0/30 NIC2 labtestvirt2001 ge-5/0/17 NIC1 ge-5/0/ 31... [17:00:05] akosiaris moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1700). [17:03:58] (03CR) 10BryanDavis: [C: 04-1] "This is a WMF production cluster concentric change that will break beta cluster and other Labs projects that use these roles." 
[puppet] - 10https://gerrit.wikimedia.org/r/251019 (owner: 10Muehlenhoff) [17:11:04] !log nodetool decommission on praseodymium [17:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:35] (03PS1) 10Muehlenhoff: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) [17:16:50] (03CR) 10jenkins-bot: [V: 04-1] Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [17:17:06] PROBLEM - cassandra CQL 10.64.16.149:9042 on praseodymium is CRITICAL: Connection refused [17:17:16] expected ^ [17:17:24] (03PS1) 10Filippo Giunchedi: cassandra: add praseodymium instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251274 [17:17:26] (03PS1) 10Filippo Giunchedi: cassandra: add praseodymium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/251275 [17:19:20] ok I'm going to run scap to sync thcipriani's morning swat [17:19:45] jzerebecki: matt_flaschen Luke081515 ^ fyi [17:19:47] twentyafterfour: would you also include my SWAT patch from this morning? [17:19:54] it was not yet merged [17:19:55] (03CR) 10BBlack: [C: 04-1] "Doesn't this miss the previously-matching image/vnd.microsoft.icon?" [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh) [17:19:57] yes [17:20:04] I only merged matt_flaschen 's patch so far for SWAT. [17:20:08] oh [17:20:38] jzerebecki: what's your patch? [17:20:49] also, James_F 's patch from evening SWAT is there too. Both required submodule bumps and neither of those submodules have been bumped on tin yet. 
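A rough illustration of the question raised in BBlack's 17:19:55 review comment on the gzip Content-Type regexp. The actual regexp from the puppet change is not shown in this log, so both patterns below are hypothetical stand-ins: one keyed on `x-icon` only, and one that matches the substring `icon` anywhere in the type.

```python
import re

# Hypothetical patterns for illustration only; the real regexp in
# https://gerrit.wikimedia.org/r/251268 is not quoted in this log.
NARROW = re.compile(
    r"^(text/|application/(json|javascript)|image/(svg\+xml|x-icon))", re.I
)
# Same prefixes, but any Content-Type containing "icon" also matches.
BROAD = re.compile(
    r"^(text/|application/(json|javascript)|image/svg\+xml)|icon", re.I
)


def should_gzip(content_type, pattern):
    """Return True if the Content-Type header value matches the pattern."""
    return pattern.search(content_type) is not None


if __name__ == "__main__":
    for ctype in ("image/x-icon", "image/vnd.microsoft.icon", "image/png"):
        print(ctype, should_gzip(ctype, NARROW), should_gzip(ctype, BROAD))
```

Under these assumed patterns, the narrow form accepts `image/x-icon` but not `image/vnd.microsoft.icon`, while the broad form accepts both and still leaves `image/png` uncompressed.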
[17:21:03] twentyafterfour: https://gerrit.wikimedia.org/r/#q,251237,n,z [17:21:07] neither VisualEditor or Flow [17:21:16] (03PS2) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) [17:23:19] (03CR) 10BBlack: [C: 032] HTTPS redirects: remove InstantCommons exception [puppet] - 10https://gerrit.wikimedia.org/r/224557 (https://phabricator.wikimedia.org/T102566) (owner: 10BBlack) [17:23:40] twentyafterfour: that will also newly deploy commits by AaronSchulz, bd808 [17:24:26] twentyafterfour, I have a second commit on the schedule too: https://gerrit.wikimedia.org/r/#/c/251246 [17:24:32] Which is not merged yet. [17:24:34] 6operations, 10Traffic, 7HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1785899 (10BBlack) [17:24:39] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1785897 (10BBlack) 5Open>3Resolved a:3BBlack [17:25:25] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [17:25:56] jzerebecki: my commit was synced yesterday -- https://tools.wmflabs.org/sal/log/AVDTvYdp1oXzWjit6ReL [17:26:05] (03PS1) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251281 [17:26:07] (03PS1) 10Muehlenhoff: Fix traceback for verbose view of deployment result [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251282 [17:26:24] matt_flaschen: looking [17:26:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251281 (owner: 10Muehlenhoff) [17:27:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix traceback for verbose view of deployment result [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251282 (owner: 10Muehlenhoff) [17:27:46] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms 
[17:27:53] bd808: mw1001 wmf.5 is at 96d099dab949f5d430c01a7d6bc2d9722f622ed2 which does not contain your commmit [17:29:07] jzerebecki: how can you tell? we don't sync the .git data [17:30:20] bd808: ugh. then disregard what I said. [17:30:35] The only way you can check for things that have been changed on the live cluster is by looking at the files [17:30:44] and mw1111 has my patch applied [17:31:14] (I'm not saying this is good, but it is how things work right now) [17:31:14] (03PS1) 10Chad: Remove superprotect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 [17:31:21] yea I somehow suppressed that we sync individual files [17:31:57] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785919 (10Papaul) Checked the server, the server was completely off. Power of the server, the iLo configuration were stay in place. I couldn't ssh@localIP but i can... [17:33:14] James_F: Can I get you to revoke +superprotect from https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions/staff per https://gerrit.wikimedia.org/r/#/c/251286/ and https://www.mediawiki.org/wiki/WMF_Product_Development_Process/2015-11-05? [17:33:37] on a completely tangental note, there are 14 deploy branches on tin which seems quite excessive [17:33:52] (or someone else with +sysadmin on meta) [17:35:01] Jamesofur: ^ [17:35:21] 6operations, 5Patch-For-Review: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1785923 (10MoritzMuehlenhoff) This is down to these hosts now: conf100[1-3] rhenium planet1001 [17:36:26] ostriches: asking stewards might be easier and quicker really [17:36:35] #wikimedia-stewards? [17:36:41] Yes [17:37:03] bd808: that is true [17:38:26] RECOVERY - Disk space on labvirt1002 is OK: DISK OK [17:39:15] James_F, Jamesofur: nvm, asking stewards instead. 
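A minimal sketch of the point bd808 makes at 17:29-17:31: since the `.git` data is not synced to the app servers, the deployed state can only be checked by looking at file contents. The helper names and the idea of comparing against a local reference copy are assumptions for illustration, not part of scap or any tool mentioned here.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(expected_path, deployed_path):
    """True if both files have byte-identical content (hypothetical helper)."""
    return file_digest(expected_path) == file_digest(deployed_path)
```

Something like `same_content("checkout/Foo.php", "/path/on/server/Foo.php")` (both paths hypothetical) would then say whether a given patch is present on a host, without needing the commit hash.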
[17:39:34] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785933 (10jcrespo) The issue looks like a network/board problem, right? [17:39:50] I assume you want it revoked from staff too :) [17:40:07] ok merging https://gerrit.wikimedia.org/r/#/c/251246 [17:40:10] (They will likely ask me publicly or privately anyway) [17:40:18] Jamesofur: that's the only group it's on I think [17:40:48] It was only granted to staff unless someone sneaked it into Sysadmin later :) [17:41:07] It's only on staff. [17:41:12] I already checked the groups a few days ago [17:43:52] (03CR) 10Chad: [C: 032] Remove superprotect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 (owner: 10Chad) [17:44:12] (03Merged) 10jenkins-bot: Remove superprotect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 (owner: 10Chad) [17:44:17] weee [17:44:32] Let's do the super protect controversy again [17:44:48] * JohnFLewis reverts saying community consensus wasn't gathered [17:44:49] !log Deploying schema change on officewiki - flow (s3) [17:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:18] JohnFLewis: Yeah, let's do this again real soon :) [17:45:21] Next month? [17:45:29] JohnFLewis: where? [17:45:34] Noted on my calendar :) [17:45:43] oh, hah, sorry, misunderstood, stupid multitasking :) [17:46:11] greg-g: where is still valid! I see no RFC with consensus for reverting ;) [17:46:29] No !log from sync? [17:46:58] !log 17:45:13 Synchronized wmf-config/: Remove +superprotect, I579c11a2 (duration: 00m 18s) [17:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:14] Thanks, jynus. 
[17:47:43] (03PS2) 10Filippo Giunchedi: cassandra: add praseodymium instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251274 [17:47:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add praseodymium instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251274 (owner: 10Filippo Giunchedi) [17:50:36] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures [17:52:32] matt_flaschen, if you have 5 minutes, please help me create some traffic on officewiki related to flow [17:52:49] if not, we may not notice problems, etc [17:52:59] jynus, sure, like new posts, or just a lot of simultaneous GET requests? [17:53:18] nothing too formal, just create, edit a new page [17:54:02] this is such a trivial change, that either it is a too obvious error or it works [17:54:19] unlike the ES storage change, that will be more complex [17:54:55] jynus, seems fine: https://office.wikimedia.org/wiki/User_talk:Mattflaschen_(WMF)/Flow_Sandbox [17:55:26] (03PS2) 10Dzahn: ldap: split client.pp into one file per class [puppet] - 10https://gerrit.wikimedia.org/r/251251 [17:55:40] Forgot to add links before, that works too though. 
[17:57:55] RECOVERY - Host db2034 is UP: PING WARNING - Packet loss = 58%, RTA = 34.54 ms [17:58:55] Packet loss = 58%, nice [17:59:08] that is a 58% more than 0 :-) [18:00:01] matt_flaschen, I see no errors on the logs, so let's go with the real thing [18:00:50] jynus, +1 [18:01:08] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1200/" [puppet] - 10https://gerrit.wikimedia.org/r/251251 (owner: 10Dzahn) [18:02:50] (03PS2) 10Filippo Giunchedi: cassandra: add praseodymium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/251275 [18:02:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add praseodymium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/251275 (owner: 10Filippo Giunchedi) [18:03:37] so it will be a 1 second write block of flow [18:03:47] hopefuly the last time we have to do so [18:04:00] (thanks to the PK) [18:04:07] Great. :) [18:04:20] (03PS2) 10Dzahn: osm: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/251254 [18:04:21] thcipriani: so whhich patches were merged for swat? [18:04:36] (03PS3) 10Dzahn: osm: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/251254 [18:04:46] twentyafterfour, he said he only did one of mine (I have two total). [18:04:54] I don't think he merged anyone else's. 
[18:04:58] (03CR) 10Dzahn: [C: 032] "fixes the last "WARNING: unquoted file mode" across the repo" [puppet] - 10https://gerrit.wikimedia.org/r/251254 (owner: 10Dzahn) [18:05:16] twentyafterfour: yup: just the one of matt_flaschen 's and the one from James_F from evening SWAT [18:05:19] !log schema change on x1 - flowdb [18:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:38] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:05:46] twentyafterfour: and it looks like you merged the other of matt_flaschen 's patches for swat [18:05:49] (actually I am wrong, it is an 8 second process, but it is still online because it is a column addition) [18:05:58] PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused [18:06:33] twentyafterfour: the VisualEditor submodule changed but not updated was the evening SWAT patch. [18:06:37] PROBLEM - cassandra CQL 10.64.16.149:9042 on praseodymium is CRITICAL: Connection refused [18:06:56] thcipriani: thanks [18:07:04] np [18:07:10] thank you! 
[18:07:37] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 1 failures [18:07:52] traffic seems normal, lag has went back to 0 and no errors on the log [18:08:17] PROBLEM - service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra is inactive [18:09:21] (03CR) 10Yuvipanda: [C: 031] ldap: split client.pp into one file per class [puppet] - 10https://gerrit.wikimedia.org/r/251251 (owner: 10Dzahn) [18:09:37] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:11:18] ok looks like everything from swat is merged and ready to go [18:11:34] anyone else have a patch to deploy before I sync this thing? [18:13:58] (03PS1) 10coren: Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 [18:14:07] chasemp: ^^ [18:15:12] (03CR) 10Rush: Make host check_disk alerts optionally critical (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren) [18:16:00] (03CR) 10jenkins-bot: [V: 04-1] Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren) [18:16:14] (03CR) 10coren: Make host check_disk alerts optionally critical (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren) [18:16:54] (03PS2) 10coren: Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 [18:19:31] (03PS1) 10Dzahn: puppet-lint: re-enable 'unquoted file modes' check [puppet] - 10https://gerrit.wikimedia.org/r/251295 [18:21:09] (03CR) 10Rush: [C: 031] "as dicussed in -labs where a virt box w/ crit disk caused a partial outage" [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren) [18:21:11] (03PS3) 10Rush: Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren) [18:21:13] (03PS2) 10Dzahn: puppet-lint: re-enable 'unquoted file 
modes' check [puppet] - 10https://gerrit.wikimedia.org/r/251295 [18:21:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 216, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 Zayo (SO 580358) {#2909} [10Gbps DWDM]BR [18:21:17] (03PS1) 10coren: Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 [18:24:39] (03CR) 10Dzahn: [C: 031] "+1 for deploying it only to en.wp for now since that is the only thing that was requested and has the time constraint from external reques" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [18:25:22] (03CR) 10coren: [C: 032] Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren) [18:25:53] (03PS2) 10Ori.livneh: Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 [18:26:08] (03PS2) 10coren: Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 [18:26:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 118, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/1: down - Core: cr2-eqiad:xe-5/2/3 Zayo (SO 580358) {#11519} [10Gbps DWDM]BR [18:26:21] chasemp: ^^ throws the switch [18:26:38] (03CR) 10Ori.livneh: "bblack, looks like it, yeah. Amended to match 'icon' (which covers x-icon and image/vnd.microsoft.icon). 
There is no non-compressible mime" [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh) [18:26:50] bblack: ^ [18:26:54] (03CR) 10Dzahn: "i liked this from a production point of view and also want to do similar changes to clean up site.pp, but if we are breaking beta cluster " [puppet] - 10https://gerrit.wikimedia.org/r/251019 (owner: 10Muehlenhoff) [18:28:36] (03PS2) 10Krinkle: Fix varnishmedia comment [puppet] - 10https://gerrit.wikimedia.org/r/243838 (owner: 10Gilles) [18:28:56] (03CR) 10Dzahn: "this is good. it's just about the timing. needs announcement on mailing lists. if it would just add the new backend but not switch it yet," [puppet] - 10https://gerrit.wikimedia.org/r/251115 (https://phabricator.wikimedia.org/T116992) (owner: 10John F. Lewis) [18:29:17] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:29:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix varnishmedia comment [puppet] - 10https://gerrit.wikimedia.org/r/243838 (owner: 10Gilles) [18:29:36] PROBLEM - Restbase root url on aqs1001 is CRITICAL: Connection refused [18:30:40] (03CR) 10Dzahn: "yea.. hmm. an opinion from _joe_ would be great here" [puppet] - 10https://gerrit.wikimedia.org/r/247324 (owner: 10Chad) [18:30:54] (03CR) 10Rush: [C: 031] "at some point we need to review teh thresholds here but I think based on todays events getting this rolling as is seems practical" [puppet] - 10https://gerrit.wikimedia.org/r/251297 (owner: 10coren) [18:31:09] paravoid: have you seen https://phabricator.wikimedia.org/T107507#1534816 ? [18:31:32] uhm, I guess not [18:31:40] paravoid: i'm all for enabling backports unconditionally, but the "consensus" (?) 
was to disable it, which is why i took the middle road [18:31:52] enabling it seems perfectly fine [18:32:03] yeah my point was that your middle road isn't very different though [18:32:06] than the default [18:32:12] not compared to what we're doing [18:32:16] the default appears to have changed [18:32:24] backports used to be enabled [18:32:26] RECOVERY - service on praseodymium is OK: OK - cassandra-a is active [18:32:31] (03CR) 10Nemo bis: Remove superprotect (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 (owner: 10Chad) [18:32:51] so that default changed on upstream d-i between the jessie release candidates [18:33:11] this is all so deja vu, I remember saying this in a task somewhere [18:33:28] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1786221 (10Dzahn) @matanya thanks for the update. ok!. we are going to reclaim tungsten [18:33:47] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1786224 (10jcrespo) This has been resolved to me, unless, @papaul, you want to add anything strange that you found and may be the cause of the issue. I will keep an e... [18:35:16] this was https://bugs.debian.org/764982 btw [18:35:22] 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1786231 (10ori) 3NEW a:3Dzahn [18:35:23] (03PS3) 10Dzahn: puppet-lint: re-enable 'unquoted file modes' check [puppet] - 10https://gerrit.wikimedia.org/r/251295 (https://phabricator.wikimedia.org/T93645) [18:36:05] paravoid: i don't really care which alternative to the current status quo we pick, since they're all better, from my perspective [18:36:14] ah yes, I said that on this bug above :P [18:36:45] i just don't have the investment necessary to make sure this is adequately discussed, etc. 
so if you want to pull a "i'm faidon and i approve this message" thing and just pick some approach, i'd actually welcome that :) [18:36:48] hashar-away: yay! [18:37:06] * YuviPanda +1's ori [18:37:08] 6operations, 5Patch-For-Review: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1786251 (10Dzahn) [18:37:10] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1786252 (10Dzahn) [18:37:12] 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1786250 (10Dzahn) [18:37:21] 6operations, 5Patch-For-Review: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1763907 (10Dzahn) [18:37:22] 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1786231 (10Dzahn) [18:37:35] mutante: How about https://gerrit.wikimedia.org/r/#/c/224829/? :) [18:37:45] ori: independently I 've been looking into influxdb as well [18:37:55] great timing tbh [18:38:05] akosiaris: what have your impressions been so far? 
[18:38:31] so, I 've been only testing the basics with a collectd [18:38:41] I must say I like it way better than graphite [18:39:05] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused [18:39:10] for starters the fact that they tagging values instead of creating an hierarchy [18:39:27] but I 've seen that like a year ago [18:39:35] now at least it is stabler [18:39:54] I am hopeful for that thing [18:39:56] I like their activity level too the core group seems pretty cool [18:40:15] the fact that it was designed from the outset to be horizontally scalable seems like the most attractive property -- with graphite you can scale it but it's a choose-your-own-adventure story, requiring that we cobble together different software components. every time we hit a resource ceiling it's a new crisis. [18:40:19] I am wondering a bit how their sharding works though. still reading/testing on that front [18:40:22] ostriches: sorry, not right now. i actually have the day off :p [18:40:30] it was the usual "just this one thing" [18:40:34] ori: yup [18:40:38] totally agree [18:41:50] ostriches: it has a +1 from filippo and it looks sane to me, so i don't mind merging it. do you have a way to verify it in prod? [18:42:16] Yeah, if we run puppet on tin & mira I can pull the new version of scap and test immediately. [18:42:25] (03CR) 10Alexandros Kosiaris: [C: 032] scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:42:30] ostriches: I 'll merge [18:42:31] heh [18:42:33] ty! [18:42:34] thanks! 
[18:42:41] (03PS11) 10Alexandros Kosiaris: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:43:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:44:57] jouncebot: next [18:44:58] In 0 hour(s) and 15 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1900) [18:45:13] cool :) i like seeing that merged [18:45:26] be back tomorrow. cya [18:45:36] twentyafterfour: What all you gotta deploy today? [18:45:56] has the repo state been sorted out? [18:46:04] (03PS3) 10coren: Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 [18:46:23] ostriches: morning swat stuff didn't get sync'd yet [18:46:28] ori: yes [18:46:43] twentyafterfour: cool, thanks. i'll poke AaronSchulz to explain what happened. [18:46:57] ori: thcipriani and I got it straightened out. And yes please doo [18:47:06] (03CR) 10coren: [C: 032] Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 (owner: 10coren) [18:47:18] inquiring minds would like to know wtf [18:47:20] ostriches: wanna test ? run puppet on both tin and mira [18:47:33] I 've ran* [18:47:41] sigh... sorry 21:00 over here [18:47:48] not my best time of the day [18:48:06] Well, I don't wanna screw with twentyafterfour's train deploy. [18:48:09] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786298 (10faidon) I don't remember this IRC discussion. Who was attending it? A little more context please? :) In any case, I disagree with that consensus. I think enabling backports fleet-wid... 
[18:48:12] ori: ^ [18:48:20] thanks [18:48:22] akosiaris: So I might wait a min :) [18:48:34] ostriches: ok [18:49:58] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786306 (10coren) @faidon: That was mostly you and Moritz. Lemme see if I find quotables in my local logs. :-) [18:50:25] I was? [18:50:38] 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1786307 (10Dzahn) ^ see my change above. we have fixed all "unquoted file mode" warnings example: mode => 0644) across the repo. so we can re-enable that specific check [18:50:47] perhaps it was about ubuntu ? [18:50:48] ostriches: if it's not gonna take long I can wait (train deploy window isn't for another 10 minutes anyway) [18:50:48] it would surprise me but it's entirely plausible [18:50:52] paravoid: Yep, but that was some weeks ago. I'm looking at my local logs now to figure out when and see quote it. :-) [18:51:08] tbh I am not still feeling fully comfortable with -backports enabled [18:51:16] got a long history of not doing it in production [18:51:27] it was enabled by default [18:51:43] akosiaris: it just did not need an explicit action to enable before [18:51:43] and I am wary of enabling it in only half the fleet (jessie vs ubuntu) [18:52:17] that was in jessie pre-release [18:52:26] !log scap: deploying master@b44c268 [18:52:31] but it always needed to be enabled explicitly [18:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:41] back in wheezy and squeeze as well [18:53:33] Ew. We discuss "backport" a lot on IRC. A naive grep gives me a few hundred log files. [18:53:47] Oh, wait, I updated the task the same day pretty much - that should narrow the window. [18:53:49] I am not against being convinced we should be enable it, for the record. 
[18:53:58] * jzerebecki will be offline soon [18:54:48] Coren: could you review/merge https://gerrit.wikimedia.org/r/#/c/250378/ so we can close out https://phabricator.wikimedia.org/T115711 ? [18:55:09] ori: Sure, give me a minute and I'll look at it. [18:55:10] I found it. [18:55:39] paravoid: date/channel so I can follow along? [18:56:20] hi all [18:56:36] (03CR) 10coren: [C: 031] "With the caveat that this will prevent creation of the views, but will not remove extant ones from the replicas (that needs an interventio" [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo) [18:56:39] if anything goes wrong with the wikidata deployment, please ping me [18:56:51] DanielK_WMDE: I'm around [18:56:57] or is there anything especially dangerous today? [18:57:04] (03CR) 10coren: [C: 032] Delete user_daily_contribs from the views in labs [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo) [18:57:07] dewiki is ro? [18:57:14] hoo|busy: just the backport still in progress [18:57:30] ori: I'm doing a test run now to give it V+2, then I'll merge. [18:57:35] thanks [18:57:46] jzerebecki: Why is that? [18:57:53] (03PS1) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 [18:57:53] hoo|busy: ah, good to know :) i thought you were offline. wanted to make sure *someone* is around [18:58:01] happy if i can go offline in an hour [18:58:11] Katie told to be online, so here I am [18:58:15] + me [18:58:29] !log demon@tin Synchronized README: no-op, testing new scap code (duration: 00m 19s) [18:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:46] (03CR) 10Paladox: "I am not sure if this fixes the problem but may." [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [18:58:53] twentyafterfour: You may see mira complain for a bit about sync-master. 
Everything else should work and continue as normal. [18:58:59] * ostriches goes after trebuchet with a knife [18:58:59] ostriches: master-master sync in there yet? [18:59:20] The code's deployed to tin, mira didn't want to update. [18:59:21] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786335 (10MoritzMuehlenhoff) Using packages from backport selectively is fine with me, we already do it e.g. with openjdk-8 which we need for the cassandra cluster. It's a valid part of the Deb... [18:59:23] (03CR) 10Jcrespo: "tables and views already deleted." [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo) [18:59:59] * bd808 runs to look at the logs [19:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1900). [19:01:58] (03CR) 10Paladox: Fix replication in phabricator (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [19:02:25] bd808: https://phabricator.wikimedia.org/P2281 [19:02:37] Some are probably stale/depooled/etc. [19:02:52] ostriches: "['/srv/deployment/scap/scap/bin/sync-master', 'tin. [19:02:53] eqiad.wmnet'] on mira.codfw.wmnet returned [127]: bash: /srv/deployment/scap/sca [19:02:54] p/bin/sync-master: No such file or directory" [19:03:02] missing the new script [19:03:05] Yeah, I know, I said trebuchet was stupid. [19:03:10] mira didn't want to update. [19:03:36] twentyafterfour: so did you end up doing the hard reset + repick or use some other way? Looks like it was rebased against master instead of wmf5, so a few newer commits showed up but where not deployed. [19:04:09] AaronSchulz: thcipriani did the rebasing, I think everything got started over [19:04:59] AaronSchulz: I did a rebase to add in the commits that were made as part of SWAT, then I reset to the head of the .5 branch. 
[19:05:24] then I repicked the security patches on top. [19:06:42] and I'm gonna sync it all right now. [19:06:48] when I got there we were 22 commits ahead of origin/wmf/1.27.0-wmf.5 [19:07:07] I'm going to sync to group1 and let it bake for a while before syncing wmf.5 to group2 [19:07:09] (and 2 commits behind since those had been merged for SWAT) [19:08:41] (03CR) 10Paladox: Fix replication in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [19:10:14] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1786348 (10Milimetric) @Joe, @mark, there was more context to this issue in other tickets, but I'm happy to... [19:11:30] (03CR) 10Chad: [C: 04-1] "This will not fix the problem and makes an unrelated and incorrect change to the proxy config. Please abandon." [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [19:13:26] (03PS1) 1020after4: w/static/ symlinks for wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251308 [19:14:10] (03CR) 1020after4: [C: 032] w/static/ symlinks for wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251308 (owner: 1020after4) [19:14:48] (03Merged) 10jenkins-bot: w/static/ symlinks for wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251308 (owner: 1020after4) [19:15:44] !log twentyafterfour@tin Started scap: Sync everything just to be sure [19:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:44] (03CR) 10coren: [V: 032] Delete user_daily_contribs from the views in labs [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo) [19:20:13] twentyafterfour: Again, mira will probably complain about missing sync-master, it should just continue and be ok tho [19:20:13] Still trying to sort that 
[19:20:25] Ah, it finally caught up [19:20:26] Yay [19:20:26] :) [19:20:48] #eventualconsistency [19:21:03] #eventconsi [19:21:11] stency [19:21:14] (03PS2) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 [19:21:16] :) [19:22:18] So now we get to play "fix broken permissions" in mw-staging in prod like we did in beta :) [19:22:24] (03PS3) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 [19:22:29] Although the root dir should be ok now [19:22:33] With puppetz. [19:25:11] (03CR) 10Chad: [C: 04-1] Fix replication in phabricator (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [19:33:32] (03CR) 10Faidon Liambotis: [C: 04-1] "Well, right now it would mean that misses (and therefore, all logged in traffic), would go ulsfo->codfw->eqiad->appservers if you look at " [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) (owner: 10BBlack) [19:37:45] !log twentyafterfour@tin Finished scap: Sync everything just to be sure (duration: 22m 01s) [19:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:59] twentyafterfour: And? [19:38:00] :) [19:44:16] RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.015 second response time [19:45:45] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [19:47:38] (03PS1) 10BryanDavis: logstash: Exclude runJobs info events from logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251317 (https://phabricator.wikimedia.org/T113571) [19:47:49] ostriches: scap executed flawlessly [19:48:18] but, wtf? 
153 Notice: Undefined property: stdClass::$newContent in /srv/mediawiki/php-1.27.0-wmf.4/includes/page/WikiPage.php on line 2058 [19:48:41] (03CR) 10Paladox: Fix replication in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [19:49:12] twentyafterfour: Filed ages ago. [19:49:29] weird. it just showed up in logs suddenly [19:50:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [19:50:21] twentyafterfour: sync-master was good? no permission complaints? [19:51:04] 6operations, 6Commons, 10Wikimedia-Media-storage, 5MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1786485 (10matmarex) >>! In T111838#1785015, @Ankry wrote: > I have checked few random files from the list and they all... [19:51:34] ostriches: actually... I hadn't even noticed the scrollback [19:51:39] 19:29:41 ['/srv/deployment/scap/scap/bin/sync-master', 'tin.eqiad.wmnet'] on mira.codfw.wmnet returned [70]: 19:21:32 Copying to mira.codfw.wmnet from tin.eqiad.wmnet [19:51:41] 19:21:32 Started rsync master [19:51:43] rsync: failed to set times on "/srv/mediawiki-staging/live-1.5": Operation not permitted (1) [19:51:51] followed by a bunch more failed to set times on .... [19:52:26] twentyafterfour: Pastebin :) [19:54:07] (03CR) 10Chad: Fix replication in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [19:54:16] ostriches: https://phabricator.wikimedia.org/P2282 [19:55:18] the 'successful' output actually scrolled the errors up so far that I didn't even notice them :-/ obviously I'm not paying good enough attention [19:55:36] Nbd, the rest of it worked out fine. 
[19:55:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [19:56:25] 6operations, 7Database, 5Patch-For-Review: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1786507 (10MaxSem) 5Open>3Resolved I don't see this table on betalabs. [19:57:32] bd808: cc https://phabricator.wikimedia.org/P2282 :\ [19:58:04] ostriches: looking. I imaging that means the initial clone there is not owned by mwdeploy [19:58:30] the mtime stuff requires ownership rather than just group access [19:58:35] (03Abandoned) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [19:58:35] Yeah [19:59:35] ostriches: most files there are owned by either root or Krenair [19:59:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [19:59:58] so we need some chown help from a root to make things sane [20:00:16] or to nuke it all and start over [20:02:49] * YuviPanda waves very lately at dbrant [20:03:54] ostriches: YuviPanda has time to wave so he has time to chown :) [20:03:59] <_joe_> bd808: what do you need specifically? [20:04:27] _joe_: the files in /srv/mediawiki-staging on mira need to be owned by the mwdeploy user [20:04:44] * YuviPanda can do if _joe_ isn't on it [20:04:48] bd808: Also, we should make checkoutMediaWiki have you do that as mwdeploy as well [20:04:50] <_joe_> why the mdeploy user? group write permission is not enough? [20:04:58] Not to set mtimes. [20:05:01] YuviPanda: hey! i got it figured out, thanks [20:05:13] <_joe_> oh you manually set mtimes? 
[20:05:19] ostriches: yeah that will need to be fixed too [20:05:24] _joe_: rsync does [20:05:31] <_joe_> you don't just touch the file, you run setattr, via rsync [20:05:36] dbrant: haha ok [20:05:37] <_joe_> ok [20:05:45] <_joe_> yep you need that then [20:05:48] <_joe_> so, on mira? [20:06:02] _joe_: yeah. the errors we saw are at https://phabricator.wikimedia.org/P2282 [20:07:12] <_joe_> !log chown mwdeploy:wikidev recursively on mira for /srv/mediawiki-staging [20:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:30] <_joe_> {{done}} [20:07:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [20:09:53] ostriches: do you have time to scap again and check that out? [20:10:03] Yeah [20:10:21] !log demon@tin Started scap: no changes, testing permissions on mira co-master [20:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [20:14:36] (03PS1) 10Chad: checkoutMediaWiki: sudo as mwdeploy for most things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251327 [20:15:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [20:15:25] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:15:53] Heh, that might be a problem. [20:16:10] 20:11:42 Started sync-masters [20:16:11] sync-masters: 100% (ok: 1; fail: 0; left: 0) [20:16:11] 20:14:58 Finished sync-masters (duration: 03m 16s) [20:16:15] Yay! 
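
The exchange above turns on a POSIX detail: setting a file's timestamps to arbitrary values (which rsync does when it preserves mtimes) requires owning the file or having root-level privileges, while group write access only allows setting them to the current time. A minimal sketch of how one might list the files that would trip this before running a sync; the function name `find_foreign_owned` is hypothetical and is not part of scap or any tool mentioned in the log.

```python
import os


def find_foreign_owned(root, expected_uid):
    """Return every path under root (root included) not owned by expected_uid.

    Hypothetical helper, for illustration only. os.lstat is used so that
    symlinks are reported by their own owner rather than their target's.
    """
    foreign = []
    if os.lstat(root).st_uid != expected_uid:
        foreign.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != expected_uid:
                foreign.append(path)
    return sorted(foreign)
```

Assuming the account names from the log, a call might look like `find_foreign_owned("/srv/mediawiki-staging", pwd.getpwnam("mwdeploy").pw_uid)`; an empty list would suggest the recursive chown covered everything.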
[20:16:18] w00t [20:17:37] (03CR) 10Jhobs: [C: 031] Deploy QuickSurveys to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [20:17:50] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1786577 (10demon) [20:17:50] !log demon@tin Finished scap: no changes, testing permissions on mira co-master (duration: 07m 29s) [20:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [20:24:01] (03CR) 10Eevans: restbase: move to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [20:28:54] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [20:30:13] twentyafterfour: is group2 still going to wmf.5 today? [20:31:00] (03CR) 10BryanDavis: [C: 031] checkoutMediaWiki: sudo as mwdeploy for most things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251327 (owner: 10Chad) [20:36:37] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1786622 (10demon) Anything left on this? [20:46:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds [20:47:15] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786669 (10RobH) a:3Joe So the current summary, as I understand it is we need 2 identical machines (master/slave) in EQIAD to add to the rdb cluster. These two servers will be name... 
[20:47:28] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786676 (10RobH) [20:47:42] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1035333 (10RobH) [20:51:59] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786681 (10RobH) @aaron Can the redis system config be updated to use /srv rather than /a? My understanding is we've shifted nearly all other services to use /srv. [20:52:01] bd808: yes I just wanted to give it some time to be sure wmf.5 wasn't horribly broken [20:52:14] cool beans [21:00:53] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786698 (10mark) I understand that this is a bit urgent, so let's use one of our old spares, even if they're out of warranty. We can replace when we're out of the woods. [21:02:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [21:12:38] ok I guess it's baked long enough. I'm gonna deploy wmf.5 to group2 [21:13:50] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786755 (10RobH) a:5Joe>3RobH Update from IRC: @Mark stated he would like this to be hardware under warranty, and thus new, unless its an emergency. @Joe stated he would like to... [21:15:15] (03PS1) 1020after4: all wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251410 [21:15:33] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786761 (10RobH) 5Open>3stalled We'll allocate the two old boxes for now, and order new boxes. I'll put this task to stalled. 
I'll create a blocking task for the installation of... [21:26:09] 6operations: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1786794 (10RobH) 3NEW a:3RobH [21:26:41] (03CR) 1020after4: [C: 032] all wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251410 (owner: 1020after4) [21:27:02] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251410 (owner: 1020after4) [21:27:05] bd808: wmf.5 coming right up [21:27:33] or not [21:27:51] 21:27:10 sync-wikiversions failed: 'SyncWikiversions' object has no attribute '_get_target_list' [21:32:20] (patch coming up) [21:35:50] 6operations, 6Release-Engineering-Team, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1786833 (10Ciencia_Al_Poder) 5Resolved>3declined [21:36:36] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786836 (10aaron) >>! In T89400#1786681, @RobH wrote: > @aaron Can the redis system config be updated to use /srv rather than /a? My understanding is we've shifted nearly all other s... [21:42:21] (03PS1) 10RobH: setting wmf3153 (rdb1007) & wmf3154 (rdb1008) mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/251414 [21:42:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [21:46:24] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[21:51:00] (03PS3) 10BBlack: Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh) [21:51:15] (03CR) 10BBlack: [C: 032 V: 032] Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh) [21:51:32] thanks bblack [21:52:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [21:53:21] (03CR) 10BBlack: "This doesn't affect cache-tiering or routing, just frontend edge where traffic initially lands... ?" [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) (owner: 10BBlack) [21:54:26] np! [21:54:30] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: sync 1.27.0-wmf.5 to group2 [21:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:57:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [21:58:18] (03CR) 10RobH: [C: 032] setting wmf3153 (rdb1007) & wmf3154 (rdb1008) mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/251414 (owner: 10RobH) [21:59:44] 6operations: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1786917 (10RobH) [22:05:30] (03PS1) 10RobH: setting rdb1007/rdb1008 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/251421 [22:06:17] (03CR) 10RobH: [C: 032] setting rdb1007/rdb1008 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/251421 (owner: 10RobH) [22:07:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [22:12:23] req errors are fine [22:12:29] alert is too sensitive [22:13:38] yeah I see that 
alert for graphite1001 frequently ... [22:13:56] it's definitely too sensitive [22:14:27] it didn't use to be too sensitive. it's probably a testament to good work from releng that it has become too sensitive. we used to have real spikes of errors more frequently. [22:16:12] (03PS1) 10RobH: setting install params for rdb1007-1008 [puppet] - 10https://gerrit.wikimedia.org/r/251426 [22:16:37] 6operations: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1786994 (10RobH) [22:18:27] yeah I think that alert just looks for arbitrary pattern anomalies [22:18:52] as in, no absolute thresholds. So yeah, if things are generally good, very minor disturbances are going to become alerts. [22:19:22] (03CR) 10RobH: [C: 032] setting install params for rdb1007-1008 [puppet] - 10https://gerrit.wikimedia.org/r/251426 (owner: 10RobH) [22:20:55] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [22:27:36] (03PS1) 10Dduvall: Install libjpeg-dev for diagrams in documentation [puppet] - 10https://gerrit.wikimedia.org/r/251428 [22:28:56] (03CR) 10Ori.livneh: [C: 032] Install libjpeg-dev for diagrams in documentation [puppet] - 10https://gerrit.wikimedia.org/r/251428 (owner: 10Dduvall) [22:32:36] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
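
The point made above, that a purely relative anomaly check with no absolute threshold gets noisier as the baseline gets quieter, can be shown with a small sketch. The log does not say what model the real check uses, so this substitutes a plain mean ± k standard deviations band; the function names, the `floor` parameter, and the sample numbers are all assumptions for illustration.

```python
from statistics import mean, pstdev


def count_outside_band(history, recent, k=3.0):
    """Count recent points above and below a mean +/- k*stddev band.

    The band is derived only from `history`, so a very quiet history
    produces a very narrow band.
    """
    mu = mean(history)
    sigma = pstdev(history)
    upper, lower = mu + k * sigma, mu - k * sigma
    above = sum(1 for x in recent if x > upper)
    below = sum(1 for x in recent if x < lower)
    return above, below


def is_anomalous(history, recent, k=3.0, min_points=10, floor=None):
    """Flag an anomaly when at least min_points recent values exceed the band.

    `floor` is a hypothetical absolute threshold: points below it are
    ignored even if they sit outside the band.
    """
    upper = mean(history) + k * pstdev(history)
    high = [x for x in recent if x > upper]
    if floor is not None:
        high = [x for x in high if x >= floor]
    return len(high) >= min_points


# Illustrative numbers: an error ratio hovering around 1%, then a run at 1.3%.
quiet_history = [0.010, 0.011, 0.009, 0.010] * 5
small_bump = [0.013] * 10
```

With these made-up values, `is_anomalous(quiet_history, small_bump)` fires on a 0.3 point rise, while passing `floor=0.05` suppresses it. That is one way an absolute threshold could dampen the alert, not necessarily how the production check would be tuned.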
[22:35:04] !log ori@tin Synchronized php-1.27.0-wmf.5/extensions/WikimediaEvents: Ic99ac31f740956: Log backend response time on edit requests (duration: 00m 35s) [22:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:39:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0] [22:46:26] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:48:45] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [23:00:41] (03PS1) 10BryanDavis: monolog: Ensure that context data added by WebProcessor is utf-8 safe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251431 [23:00:48] ori: ^ [23:08:18] (03PS1) 10Gilles: Add libcurl-dev to Python Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/251432 (https://phabricator.wikimedia.org/T111005) [23:13:17] (03CR) 10Ori.livneh: [C: 032] Add libcurl-dev to Python Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/251432 (https://phabricator.wikimedia.org/T111005) (owner: 10Gilles) [23:13:56] bd808: looks ok to me, not sure how to test [23:14:18] if you tested it then let's do it [23:14:23] I pasted it into my mw-vagrant config and it didn't blow up [23:14:46] and seemed to do the wanted thing for urls with nasty chars in them [23:15:08] (03CR) 10Ori.livneh: [C: 032] monolog: Ensure that context data added by WebProcessor is utf-8 safe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251431 (owner: 10BryanDavis) [23:15:29] (03Merged) 10jenkins-bot: monolog: Ensure that context data added by WebProcessor is utf-8 safe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251431 (owner: 10BryanDavis) [23:16:34] !log ori@tin Synchronized wmf-config/logging.php: Ieb8c602a: monolog: Ensure that context data added by WebProcessor is utf-8 safe (duration: 00m 36s) [23:16:40] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:39] ori: https://fr.wikisource.org/wiki/R%C3%A9solution_179_du_conseil_de_s%C3%A9curit%C3%A9_des_nations_unies isn't adding to exception.log anymore :) [23:30:31] !log Logging volume into ELK cluster down dramatically; investigating [23:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:27] !log Decreased replica count of logstash-2015.10.13 and logstash-2015.10.14 to free disk space on cluster [23:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:15] (03PS5) 10Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 [23:44:17] (03PS1) 10Yuvipanda: Add .pep8 exception for line length [puppet] - 10https://gerrit.wikimedia.org/r/251435 [23:45:37] (03CR) 10jenkins-bot: [V: 04-1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 (owner: 10Yuvipanda) [23:45:53] fuck you too jenkins [23:46:21] legoktm: any idea why jenkins doesn't respect either tox.ini nor .pep8 in base of the project? [23:47:02] 6operations, 10CirrusSearch, 6Discovery, 5Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#1787259 (10EBernhardson) With four nodes we will need to increase `discovery.zen.minimum_master_nodes` to 3 to ensure ther... [23:47:12] (03PS6) 10Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 [23:47:24] YuviPanda: it uses the root one [23:47:40] do I have a wrong .pep8? [23:47:44] also does it use tox.ini or pep8? [23:47:46] .pep8 [23:48:02] there's a 'flake8' line in the root tox.ini [23:48:05] should it be a 'pep8' [23:48:07] ? 
[23:48:55] flake8 is pyflakes + pep8 [23:48:58] it looks like it's using pep8 and not flake8 [23:48:58] ugh [23:49:01] yeah [23:49:11] this looks like a misconfiguration somewhere... not sure where [23:49:24] see adding the .pep8 into the folder with the py file gets it to shut up [23:49:52] there are also 25 individual .pep8s scattered around the repo [23:50:04] (03PS7) 10Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 [23:51:03] (03CR) 10Yuvipanda: [C: 032] dynamicproxy: Move invisible-unicorn into puppet [puppet] - 10https://gerrit.wikimedia.org/r/251176 (owner: 10Yuvipanda) [23:54:14] (03PS1) 10Yuvipanda: dynamicproxy: Install proper flask package [puppet] - 10https://gerrit.wikimedia.org/r/251436 [23:54:45] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Install proper flask package [puppet] - 10https://gerrit.wikimedia.org/r/251436 (owner: 10Yuvipanda) [23:54:47] jouncebot: refresh [23:54:51] I refreshed my knowledge about deployments.