[00:43:09] (03CR) 10Smalyshev: [C: 031] wdqs: LVS and conftool configuration for new wdqs-internal service [puppet] - 10https://gerrit.wikimedia.org/r/424599 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [00:57:17] (03PS1) 10MaxSem: Add logging channel for preference stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425210 (https://phabricator.wikimedia.org/T190425) [01:22:38] (03CR) 10Samwilson: [C: 031] "Is it enough to just have the comments saying that these must be loaded in order then? It seems nice to properly check that the other is n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423809 (https://phabricator.wikimedia.org/T190353) (owner: 10Dereckson) [01:46:43] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4119091 (10awight) I ran a test on deployment-ores01.deployment-prep.eqiad... [01:47:55] (03CR) 10Dzahn: [C: 031] "lgtm, but we should verify the puppet run isn't broken for some reason due to adding the new "before => Class" part. 
which instances are a" [puppet] - 10https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: 10BryanDavis) [01:48:00] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4119092 (10awight) [01:50:01] (03CR) 10Dzahn: [C: 031] "just give me the name of an instance to test it on after merge and i'll handle it" [puppet] - 10https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: 10BryanDavis) [01:55:45] 10Operations, 10Availability, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#4119095 (10Dzahn) [01:55:47] 10Operations: upgrade bromine to stretch / reinstall - https://phabricator.wikimedia.org/T189910#4119093 (10Dzahn) 05Open>03Resolved this is done in parent task. bromine is on stretch now and has more space as well [01:58:23] 10Operations, 10monitoring: Netbox: add Icinga check for PosgreSQL - https://phabricator.wikimedia.org/T185504#4119096 (10Dzahn) https://exchange.nagios.org/directory/Plugins/Databases/PostgresQL [02:01:40] 10Operations, 10monitoring: Netbox: add Icinga check for PosgreSQL - https://phabricator.wikimedia.org/T185504#4119098 (10Dzahn) seems like the best one. hast the most votes: https://exchange.nagios.org/directory/Plugins/Databases/PostgresQL/check_postgres/details latest: https://github.com/bucardo/check_p... [02:34:07] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.28) (duration: 05m 39s) [02:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:50] (03CR) 10Krinkle: "If we do want a run-time exception, I would recommend to implement such logic in the respective run-times instead of here (e.g. 
in Echo an" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423809 (https://phabricator.wikimedia.org/T190353) (owner: 10Dereckson) [02:48:38] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4119135 (10mmodell) >>! In T181071#4118685, @awight wrote: > I can't tell... [02:52:43] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4119136 (10mmodell) >>! In T181071#4119091, @awight wrote: > There's no .... [04:06:15] (03CR) 10BryanDavis: "> just give me the name of an instance to test it on after merge and" [puppet] - 10https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: 10BryanDavis) [04:08:02] (03PS1) 10MusikAnimal: Enable PageAssessments on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) [04:20:28] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4119202 (10Joe) >>! In T189295#4117343, @Krinkle wrote: >>>! In T189295#4116309, @gerritbot wrote: >> Change 425027 had a related patc... [04:25:39] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4119217 (10Joe) Also a note on beta not running php7: when we migrated to HHVM it was made very clear to me and to Ori that we could n... 
[04:28:30] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4119220 (10Joe) [04:38:14] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4119223 (10awight) The weird part about this is just that we've been deplo... [05:02:21] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4119225 (10Joe) [05:11:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425213 [05:11:18] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425213 (owner: 10Marostegui) [05:11:32] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425213 (owner: 10Marostegui) [05:12:13] (03PS1) 10Marostegui: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425214 [05:14:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425214 (owner: 10Marostegui) [05:15:41] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425214 (owner: 10Marostegui) [05:17:19] !log Deploy alter table on s1 primary master (db1052) - T185128 T153182 [05:17:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 after alter table (duration: 01m 11s) [05:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:27] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - 
https://phabricator.wikimedia.org/T153182 [05:17:27] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [05:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:19] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425214 (owner: 10Marostegui) [05:20:01] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4119261 (10Marostegui) [05:24:22] (03PS1) 10Elukey: profile::kafka::mirror::alerts: escape some unusual chars in the query [puppet] - 10https://gerrit.wikimedia.org/r/425215 [05:24:58] (03CR) 10Elukey: [C: 032] profile::kafka::mirror::alerts: escape some unusual chars in the query [puppet] - 10https://gerrit.wikimedia.org/r/425215 (owner: 10Elukey) [05:43:05] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425216 (https://phabricator.wikimedia.org/T191275) [05:45:06] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Remove db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425216 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [05:46:32] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425216 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [05:47:10] 10Operations, 10User-Joe: build new version of mcrouter package - https://phabricator.wikimedia.org/T190979#4119289 (10Joe) p:05Triage>03High a:03Joe [05:48:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db2069 from config - T191275 (duration: 00m 59s) [05:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:11] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - 
https://phabricator.wikimedia.org/T191275 [05:48:34] 10Operations, 10User-Joe: build new version of mcrouter package - https://phabricator.wikimedia.org/T190979#4089852 (10Joe) Status update: - I've built 0.37.0 for stretch, will upload the package today if all goes well - I'm working on a jessie backport, which will be hopefully done shortly [05:49:06] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425216 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [05:49:12] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db2069 from config - T191275 (duration: 00m 58s) [05:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:19] (03PS3) 10Marostegui: mediawiki: increase speed of deleteAutoPatrolLogs in wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/425098 (owner: 10Ladsgroup) [05:51:28] (03CR) 10Marostegui: [C: 032] mediawiki: increase speed of deleteAutoPatrolLogs in wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/425098 (owner: 10Ladsgroup) [05:54:16] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4119309 (10Marostegui) [06:10:08] (03PS1) 10Ema: prometheus::class_config: sort targets [puppet] - 10https://gerrit.wikimedia.org/r/425218 [06:16:06] (03CR) 10Ema: "pcc looks good https://puppet-compiler.wmflabs.org/compiler02/10876/" [puppet] - 10https://gerrit.wikimedia.org/r/425218 (owner: 10Ema) [06:16:13] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Add UDP monitor for pybal - https://phabricator.wikimedia.org/T178151#4119325 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [06:16:28] <_joe_> ema: wait [06:16:39] <_joe_> ema: ruby has deterministic hash ordering [06:16:55] <_joe_> so I don't see how sorting would be necessary [06:17:04] _joe_: interesting [06:17:08] <_joe_> maybe the problem is the order in which we get the data from puppetdb 
[06:17:23] <_joe_> that might be changing from one request to the next, and that's bad [06:17:43] <_joe_> let's check that first, maybe [06:17:46] sure! [06:18:11] <_joe_> you can anyways merge your change for now [06:19:24] note that we call keys.sort in various places in the puppet repo, I think under the assumption that keys would return elements in non-deterministic order? [06:20:05] <_joe_> ema: that was the case in ruby 1.8 (and 1.9 maybe, I don't recall), not anymore today [06:20:30] good to know! [06:21:50] no rush anyways (and I need to go afk to run an errand, fun times) [06:21:52] bbiab [06:31:01] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4119330 (10Marostegui) [06:46:49] 10Operations, 10Ops-Access-Requests, 10Analytics: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4119350 (10Jonas) So what is the status here? [06:55:00] (03CR) 10Elukey: Modify eventlogging purging script to read from YAML whitelist (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) (owner: 10Mforns) [07:03:01] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119368 (10Marostegui) This is the list of slaves per section we'd need to depool before starting this maintenance: s1: db1089 main db1105 rc s2: db1060 vslow db1090 main db1105... [07:22:55] (03PS2) 10Muehlenhoff: Remove mw1259/mw1260 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/425019 (https://phabricator.wikimedia.org/T187466) [07:23:40] 10Operations, 10Ops-Access-Requests, 10Analytics: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4100865 (10elukey) So Jonas (user: jk) is already in analytics-privatedata-users, and as far as I can see access is already granted for notebook1003, s... 
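[Editor's aside, not part of the log: the Ruby behaviour _joe_ describes in the exchange above — Hash iteration order being deterministic since Ruby 1.9 because insertion order is preserved — can be sketched as below. The hostnames are made up for illustration.]

```ruby
# Since Ruby 1.9, Hash preserves insertion order, so iterating the same
# hash repeatedly yields the same sequence (in 1.8 the order was arbitrary).
hosts = {}
%w[cp3040 cp1068 cp2023].each { |name| hosts[name] = true }

# Keys come back in insertion order, deterministically:
puts hosts.keys.inspect        # prints ["cp3040", "cp1068", "cp2023"]

# An explicit keys.sort is still needed when a *sorted* order is wanted,
# e.g. to keep generated config files diff-stable across runs:
puts hosts.keys.sort.inspect   # prints ["cp1068", "cp2023", "cp3040"]
```

So sorting in the puppet repo guards against input order (e.g. what puppetdb returns), not against Ruby's own iteration order.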
[07:23:48] (03CR) 10Muehlenhoff: [C: 032] Remove mw1259/mw1260 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/425019 (https://phabricator.wikimedia.org/T187466) (owner: 10Muehlenhoff) [07:27:19] <_joe_> so, regarding the appservers upgrade to stretch [07:27:45] <_joe_> it kind-of superimposes with the memcached extension patch for HHVM [07:29:02] <_joe_> so my proposal would be: -we prepare hhvm builds for both jessie and stretch, and roll out the upgrade directly [07:29:34] <_joe_> - in parallel, we start reimaging canaries to stretch [07:30:02] (03PS3) 10Gehel: wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766) [07:30:05] <_joe_> once we're done with both things, we need to have a pause period where we do a careful rolling restart of our memcached cluster, over several days [07:30:15] <_joe_> and we can validate the appservers-on-stretch status [07:30:16] yeah, I was thinking of reimaging mw1265 and pooling it for some production traffic to see whether there are any issues [07:30:23] <_joe_> moritzm: +1 [07:30:23] the manifests per se should be fine by now [07:30:59] <_joe_> moritzm: I was wondering if we might want to use php 7.2, but that's for next Q, to be honest [07:31:08] I'll prepare a test build of hhvm with the memcached patch, shall I do that one for jessie or stretch? [07:31:19] (03CR) 10Gehel: "@Smalyshev: the generic endpoint is defined in templates/wmnet (L4851). There is some magic on the puppet side to map it to the closest DC" [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [07:31:22] <_joe_> moritzm: both?
yeah, let's focus on migrating to stretch, which is then the proper stepping stone [07:31:37] ok, can also prepare both [07:31:51] <_joe_> yeah the issue is [07:32:01] <_joe_> the memcached thing blocks the upgrade of the snapshots [07:32:16] <_joe_> which is the one with more unknowns (at least to me), coming from trustys [07:32:49] sounds good to me [07:33:00] <_joe_> and it will take 1-2 weeks to roll restart the memcached without user performance impact [07:33:01] ack, I'll finish upgrading HHVM in codfw, then prepare HHVM builds with the memcached patch and then look into upgrading mw1265 [07:33:29] (03PS2) 10Gehel: maps: add Java proxy to cleartables_sync cron [puppet] - 10https://gerrit.wikimedia.org/r/424247 (https://phabricator.wikimedia.org/T190193) [07:33:29] <_joe_> moritzm: well once you have the new memcached build, I want to do some testing before we release it in the wild :) [07:33:34] sure :-) [07:33:46] two qs about memcached: [07:34:21] (03PS3) 10Gehel: maps: add Java proxy to cleartables_sync cron [puppet] - 10https://gerrit.wikimedia.org/r/424247 (https://phabricator.wikimedia.org/T190193) [07:34:25] 1) when we release the patch for hhvm, is it going to handle compressed values with/without the related flags transparently? [07:34:33] (meaning, without any issue) [07:35:03] (03CR) 10Gehel: "@pnorman: thanks for catching the typo! Corrected..." [puppet] - 10https://gerrit.wikimedia.org/r/424247 (https://phabricator.wikimedia.org/T190193) (owner: 10Gehel) [07:35:07] 2) after restarting a memcached shard, is there any possibility that nutcracker on a hhvm-still-not-patched pushes content to it?
[07:35:27] we need to upgrade the fleet completely to prevent that [07:35:28] (probably silly qs but just wanted to discuss those) [07:35:32] (for 2) [07:35:47] <_joe_> elukey: 1 - that's one of the things I want to verify [07:35:55] <_joe_> 2 - obviously, what moritz said [07:36:12] super, just wanted to chat with you about those :) [07:37:09] !log upgrading API servers in codfw to ICU57-enabled build of HHVM [07:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:26] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119423 (10jcrespo) I would honestly move x1 replica (or the master directy), probably in a logical way, somewhere else- we don't want to serve the whole service from the same row,... [07:37:40] moritzm: when there is a biul ready, can we get that deployed in beta so I can test immediately there? [07:37:44] *build [07:37:58] (03CR) 10Dzahn: [C: 031] prometheus::class_config: sort targets [puppet] - 10https://gerrit.wikimedia.org/r/425218 (owner: 10Ema) [07:38:12] apergos: yeah, we can do that [07:38:14] of hhvm + memcached patch, I mean [07:38:29] great, I don't expect any problems, but better safe than sorry [07:38:31] (03CR) 10Jcrespo: [C: 031] wiki replicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/425095 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [07:39:34] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#4119427 (10MoritzMuehlenhoff) [07:40:17] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119429 (10Marostegui) >>! In T187962#4119423, @jcrespo wrote: > I would honestly move x1 replica (or the master directy), probably in a logical way, somewhere else- we don't want t... 
[07:43:08] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119434 (10jcrespo) I would do the second. [07:43:38] 10Operations, 10ops-eqiad, 10DBA: Rack and setup 8 new eqiad DBs - https://phabricator.wikimedia.org/T191792#4119436 (10jcrespo) [07:43:41] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119435 (10jcrespo) [07:43:46] (03PS1) 10Dzahn: icinga: import check_postgres.pl [puppet] - 10https://gerrit.wikimedia.org/r/425227 (https://phabricator.wikimedia.org/T185504) [07:46:26] (03CR) 10Vgutierrez: [C: 04-1] "Awesome testing effort, please fix the duped test method (check inline comment)" (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/423995 (owner: 10Mark Bergsma) [07:51:31] (03CR) 10Vgutierrez: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/421053 (owner: 10Mark Bergsma) [07:52:21] !log Updated operations/dumps/dcat (7ea4e75c..61154ca4) on snapshot1007 [07:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:36] (03PS4) 10Jcrespo: labsdb: Reduce the sleep timeouts of clients to prevent connection hogging [puppet] - 10https://gerrit.wikimedia.org/r/423494 [08:07:05] (03CR) 10Jcrespo: [C: 032] labsdb: Reduce the sleep timeouts of clients to prevent connection hogging [puppet] - 10https://gerrit.wikimedia.org/r/423494 (owner: 10Jcrespo) [08:08:05] (03PS3) 10Jcrespo: mariadb: migrate sanitarium to role/profile and abstract instances [puppet] - 10https://gerrit.wikimedia.org/r/425087 (https://phabricator.wikimedia.org/T190704) [08:08:55] (03CR) 10Hoo man: [C: 04-1] "If we want to do this, please add a changelog mentioning how the configuration evolved over time and what exactly is needed to adopt to th" (031 comment) [dumps/dcat] - 10https://gerrit.wikimedia.org/r/425065 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil) 
[08:10:15] (03CR) 10Hoo man: [C: 031] "@Lokal Profil: It's up to you whether you also want to add the nt configuration… from my point of view, this is fine to go." [puppet] - 10https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil) [08:10:36] there's been a few 500 errors on cache_text: https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?orgId=1&from=1523346969742&to=1523347618339&var-site=codfw&var-site=esams&var-site=eqsin&var-site=ulsfo&var-site=eqiad&var-cache_type=varnish-text&var-status_type=5 [08:11:01] affecting search.wikimedia.org apparently https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X?_g=h@97fe121&_a=h@7e5f62f [08:11:04] "false 500" IMHO [08:11:14] https://search.wikimedia.org/?site=wikipedia&lang=en&search=zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz [08:11:20] zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz [08:11:24] 10Operations, 10Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4119479 (10elukey) A bit of historic context about the why db1108 is not read-only: ``` # History context: there used to be a distinction b... 
[08:11:26] zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz&limit=31 [08:11:31] (sorry) [08:11:34] that produces a 400 on the backend [08:11:36] but gets transformed into a 500 [08:11:40] hahahah [08:11:56] It took me a bit to figure out what it was :D [08:12:08] I think the letter was 'z' [08:12:23] somebody hit us with a bunch of those (>150), and that's enough to create the spike on the dashboard [08:13:45] the body says "Backend failure: it returned HTTP code 400" but the headers, 500 [08:15:02] <_joe_> vgutierrez: is the 500 returned from mediawiki or ES? [08:17:50] <_joe_> because it's clearly a bug - we should tell the client the request was not valid [08:18:04] _joe_: I'd say mediawiki, I can repro hitting appservers.svc.eqiad.wmnet [08:18:34] it looks like mw is getting a 400 from ES and returns a 500 to the user [08:18:39] <_joe_> ok, so it's a bug in CirrusSearch IMHO [08:18:43] (03CR) 10Jcrespo: "Looks good: https://puppet-compiler.wmflabs.org/compiler02/10877/db1102.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/425087 (https://phabricator.wikimedia.org/T190704) (owner: 10Jcrespo) [08:18:53] (03CR) 10Jcrespo: [C: 032] mariadb: migrate sanitarium to role/profile and abstract instances [puppet] - 10https://gerrit.wikimedia.org/r/425087 (https://phabricator.wikimedia.org/T190704) (owner: 10Jcrespo) [08:19:04] <_joe_> of course fixing that will break all kind of tools that depend on that bug [08:19:25] <_joe_> https://xkcd.com/1172/ [08:21:07] hmmm [08:21:19] it looks like the 500 is triggered here: https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/search.wikimedia.org/index.php#L62-L64 [08:21:51] doh [08:21:54] <_joe_> uh that's search.wm.org, right [08:21:59] <_joe_> it's not 
"mediawiki" [08:22:05] we throw an error on large search queries, also I thought we disabled search.wikimedia.org? [08:22:30] <_joe_> dcausse: sadly, no [08:22:37] <_joe_> there is a long queue of requests there [08:22:44] :/ [08:22:48] <_joe_> vgutierrez: yeah let's not worry about that [08:22:57] ack, sorry about the noise then :) [08:23:01] <_joe_> it's the apple search api [08:23:05] dcausse: thing is, this is surely an error, but a client error. It should just pass the 400 along instead of turning it into a 500 [08:23:21] <_joe_> vgutierrez: it took me seeing the code to remember search.wm.org is that thing [08:23:24] <_joe_> :P [08:23:36] sure but I was not even aware that we had custom php code behind search.wikimedia.org [08:24:46] heh, T179266 [08:24:47] T179266: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266 [08:25:13] <_joe_> dcausse: yeah the whole thing is pretty raw :) [08:28:14] let's reopen T179266 then? [08:30:42] yeah, in particular if I read https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/search.wikimedia.org/index.php#L62-L64 correctly all non-200 responses are turned into 500 [08:31:54] (03PS1) 10Elukey: Set m4-master to db1107 rather than dbproxy [dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) [08:32:41] (03CR) 10Marostegui: [C: 031] "+1 as I rather have errors than a split brain" [dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) (owner: 10Elukey) [08:33:37] insta-code-review! <3 [08:36:49] (03CR) 10Jcrespo: [C: 031] "Could we then absorbe the proxy and not purchase a replacement?" [dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) (owner: 10Elukey) [08:37:43] (03CR) 10Marostegui: [C: 031] "> Could we then absorbe the proxy and not purchase a replacement?"
[dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) (owner: 10Elukey) [08:37:53] ema: reopened [08:38:59] (03CR) 10Elukey: "> > Could we then absorbe the proxy and not purchase a replacement?" [dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) (owner: 10Elukey) [08:39:26] (03CR) 10Marostegui: [C: 031] "> > > Could we then absorbe the proxy and not purchase a replacement?" [dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) (owner: 10Elukey) [08:39:29] dcausse: thanks [08:41:25] (03PS2) 10Gehel: tilerator: Add sudo rule for tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/425208 (owner: 10Catrope) [08:41:59] _joe_: so, back to https://gerrit.wikimedia.org/r/#/c/425218/. What's the best way to debug interactions with puppetdb and figure out whether results are returned in a predictable order or not? [08:43:43] (03CR) 10Marostegui: "I am still waiting for labsdb1011 to catch up" [puppet] - 10https://gerrit.wikimedia.org/r/425095 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [08:44:32] (03CR) 10Elukey: [C: 032] Set m4-master to db1107 rather than dbproxy [dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) (owner: 10Elukey) [08:45:09] (03PS3) 10Gehel: tilerator: Add sudo rule for tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/425208 (owner: 10Catrope) [08:45:21] <_joe_> ema: give me ~ 30 minutes and I can take a deeper look [08:45:43] <_joe_> but if you want to take a look yourself, you should probably try to get @cluster_sites sorted consistently [08:45:54] <_joe_> instead of sorting the final output there [08:46:01] (03CR) 10Gehel: [C: 032] tilerator: Add sudo rule for tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/425208 (owner: 10Catrope) [08:53:11] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Remove video scaler instances from deployment-prep - 
https://phabricator.wikimedia.org/T187063#3963166 (10EddieGP) According to openstack browser, both instances are gone. Is this resolved? [08:53:54] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Remove video scaler instances from deployment-prep - https://phabricator.wikimedia.org/T187063#4119525 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Ack, I removed those last week, closing the task. [08:54:40] moritzm: Thanks for cleaning those up :) [08:54:53] (03PS3) 10Vgutierrez: lvs: Get rid of interface names on site.pp [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) [08:55:07] yw :-) [08:56:05] (03PS2) 10Gehel: Allow for different storage_id in kartotherian and tilerator [puppet] - 10https://gerrit.wikimedia.org/r/425092 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [08:56:36] (03CR) 10Gehel: "This is functionally a noop, puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler03/10880/" [puppet] - 10https://gerrit.wikimedia.org/r/425092 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [08:56:43] (03CR) 10Gehel: [C: 032] Allow for different storage_id in kartotherian and tilerator [puppet] - 10https://gerrit.wikimedia.org/r/425092 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [08:58:02] 10Operations, 10Analytics-Kanban, 10Patch-For-Review: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4119539 (10elukey) Created https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration#Mysql_inserti... 
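[Editor's aside, not part of the log: the search.wikimedia.org bug discussed above is that docroot/search.wikimedia.org/index.php (a PHP file) collapses every non-200 backend status into a 500. The fix dcausse suggests — pass 4xx through as client errors, reserve 5xx for genuine backend failures — is sketched below in Ruby for illustration; `proxy_status` is a hypothetical helper, not code from the repo.]

```ruby
# Hypothetical status mapping for a thin search proxy: pass success and
# client errors through unchanged, and report anything else as a gateway
# error instead of blaming the client with a blanket 500.
def proxy_status(backend_status)
  case backend_status
  when 200..299 then backend_status  # success: pass through
  when 400..499 then backend_status  # client error (e.g. an oversized query): keep the 4xx
  else 502                           # genuine backend failure: gateway error
  end
end

puts proxy_status(400)  # prints 400 — the long zzz… query would no longer surface as a 500
```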
[08:59:42] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4119543 (10Joe) [09:01:44] (03CR) 10Jcrespo: [C: 031] "Note proxies are not going away immediately, rather we will not take into account it for the refresh (T191595)" [dns] - 10https://gerrit.wikimedia.org/r/425231 (https://phabricator.wikimedia.org/T188991) (owner: 10Elukey) [09:04:38] (03CR) 10Vgutierrez: "PS3 provides even more pcc happiness: http://puppet-compiler.wmflabs.org/10881/" [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) (owner: 10Vgutierrez) [09:07:27] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4119566 (10Joe) [09:08:12] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3995477 (10Joe) [09:09:48] (03PS3) 10Giuseppe Lavagetto: Upgrade to 0.37.0 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/423748 [09:10:40] 10Operations, 10Ops-Access-Requests: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4119571 (10ArielGlenn) [09:11:16] apergos: no idea which ops are awake now, but saw you active MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119555 (BethNaught) p:Triage>Unbreak! [09:11:40] all of europe is here, so that's most of us [09:12:07] 10Operations, 10MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119573 (10Peachey88) [09:16:58] 10Operations, 10MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119544 (10Marostegui) There is an alter table on the archive table on the enwiki master going on. 
It is almost done though. We didn't see issues on other masters when altering this ta... [09:17:58] marostegui and/or jynus: [Wsx7PgpAAEQAAEII7OoAAACE] 2018-04-10 08:53:14: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError" for en wp deletions, any ideas? [09:18:04] ah nm there you are [09:18:40] logstash tells me nothing by the way, not even those errors, weir [09:18:40] d [09:19:37] _joe_: yeah, so at least when it comes to querying puppetdb, ordering is not ensured [09:20:07] <_joe_> ema: there should be a way to get sorted results, though? [09:20:19] I've tried this, which is what I've seen on the wire: [09:20:20] curl -v -G localhost:8080/pdb/query/v4/resources --data-urlencode 'query=["and",["=","type","Class"],["=","title","Role::Cache::Text"],["=","exported",false]]' [09:20:37] 10Operations, 10MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119587 (10Marostegui) Alter table finished [09:20:39] (03PS4) 10Giuseppe Lavagetto: Upgrade to 0.37.0 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/423748 [09:20:52] if you run that a few times, you'll get the list of hosts in a different order [09:22:10] order-by looks promising though :) [09:22:39] 10Operations, 10MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119588 (10Marostegui) @BethNaught can you check if it works again? [09:22:54] 10Operations, 10MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119544 (10JEumerus) Works for me at this time. 
[09:23:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Upgrade to 0.37.0 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/423748 (owner: 10Giuseppe Lavagetto) [09:23:24] 10Operations, 10MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119591 (10Marostegui) Thanks - it should not give any errors again [09:25:57] (03PS5) 10Giuseppe Lavagetto: Upgrade to 0.37.0 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/423748 (https://phabricator.wikimedia.org/T190979) [09:27:26] 10Operations, 10MediaWiki-Page-deletion: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119604 (10jcrespo) p:05Unbreak!>03Normal Normal as the incident should be solved, we now have to research what actually happened. Errors started at 8:40 UTC, lasting until 9:19 U... [09:28:34] 10Operations, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119606 (10Aklapper) [09:29:49] 10Operations, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119609 (10Marostegui) This is what was being executed: `SET SESSION innodb_lock_wait_timeout=1; SET SESSION lock_wait_timeout=30; ALTER TABLE archive MODIFY C... [09:29:51] !log upgrading app servers in codfw to ICU57-enabled build of HHVM [09:29:54] (03PS1) 10Giuseppe Lavagetto: Jessie port for 0.37.0 [debs/mcrouter] (jessie) - 10https://gerrit.wikimedia.org/r/425232 [09:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:37] (03PS2) 10Elukey: role::configcluster_stretch: add IPv6 static addresses [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) [09:33:51] _joe_: almost... 
https://ask.puppet.com/question/29991/getting-error-unsupported-query-parameter-order-by/ [09:34:07] (03CR) 10jerkins-bot: [V: 04-1] role::configcluster_stretch: add IPv6 static addresses [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [09:34:19] <_joe_> ema: FML [09:35:04] heh [09:35:24] curl -G localhost:8080/pdb/query/v4/resources --data-urlencode 'query=["and",["=","type","Class"],["=","title","Role::Cache::Text"],["=","exported",false]]' --data-urlencode 'order-by=[{"field": "certname"}]' [09:36:38] oh wait a moment [09:37:14] --data-urlencode 'query=["and",["=","type","Class"],["=","title","Role::Cache::Text"],["=","exported",false]]&order-by=[{"field": "certname"}]' [09:37:23] ^ this seems to work [09:37:37] <_joe_> ema: I'm sure we use puppetdbquery [09:37:44] <_joe_> function query_resources or something [09:37:52] <_joe_> and you can set if you want ordering or not [09:38:07] <_joe_> we avoided it I think because queries are more expensive [09:38:55] (03PS1) 10Gehel: maps: tileshell has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/425233 (https://phabricator.wikimedia.org/T191807) [09:39:04] <_joe_> but in theory a db returns you the data in the natural order of the db [09:39:21] <_joe_> the problem here is, that order changes because of the way puppetdb 4.x works, I guess [09:39:24] 10Operations, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119620 (10jcrespo) What I saw was INSERTs into alter being blocked due to metadata locking, but that would not make sense except at the start of the command, o... [09:41:28] 10Operations, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119624 (10Marostegui) >>! 
In T191875#4119620, @jcrespo wrote: > What I saw was INSERTs into archive being blocked due to metadata locking, but that would not m... [09:42:48] _joe_: from puppetdbquery's README: "Sometimes puppetdb doesn't return items in the same order every run" [09:43:04] and then they show some fascinating hiera 5 syntax [09:43:22] <_joe_> ema: oh dear [09:44:26] query_resources does not seem to allow specifying that you want sanity in your life [09:44:29] Accepts two arguments or three argument, a query used to discover nodes, and a resource query [09:44:33] , and an optional a boolean to whether or not to group the result per host. [09:47:47] <_joe_> ok so [09:48:09] <_joe_> the right thing to do is to sort [09:48:15] <_joe_> $cluster_sites [09:48:19] <_joe_> let me check the code [09:49:56] <_joe_> so yeah, we want to sort results in get_clusters(), probably [09:50:06] (03PS1) 10ArielGlenn: remove dumps web server from dataset1001 and ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/425234 (https://phabricator.wikimedia.org/T182540) [09:50:25] <_joe_> ema: let me fix that [09:51:19] _joe_: hey, that should be fixed already! [09:51:39] function_query_resources([false, 'Class["Profile::Cumin::Target"]', false, 'certname asc']) [09:51:55] <_joe_> ? [09:52:01] <_joe_> yeah [09:52:09] <_joe_> I'm reading that code myself [09:52:33] <_joe_> ema: so let me circle back to your problem: where did you see those unexpected changes? [09:53:15] _joe_: for instance on bast3002, bast5001. 
Run puppet a couple of times, you'll see unexpected changes at every run [09:53:33] <_joe_> ema: I think the signature of query_resources changed [09:55:18] <_joe_> so what we do there is to get the parameters of Class["Profile::Cumin::Target"] on all nodes [09:55:33] <_joe_> the last parameter is just ignored I guess [09:56:28] <_joe_> ok I'm baffled, ruby magic going on here [09:57:10] also worth mentioning: ./modules/prometheus/templates/class_config.erb uses scope.function_ordered_yaml([all]) [09:57:29] ema, _joe_ in case you can create a subtask of T191388 [09:57:29] T191388: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388 [09:57:52] <_joe_> volans: we had almost zero before all the recent transitions FWIW [09:58:28] nope according to what puppet consider 'changed', Notify is considered changed [09:58:32] and we had that since forever [09:58:40] see the current subtask opened [09:58:56] <_joe_> volans: that's by design, I'm commenting on that ticket [10:00:02] 10Operations, 10Puppet: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393#4119709 (10Joe) There is no way in puppet 4.x to do it better as far as @ema and I determined when we looked into it. So that notify is there for a good reason. It's a hack, it... 
[10:00:24] <_joe_> volans: when I mean "changing catalogs" I mean "files on disk changed" [10:00:47] 10Operations, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119715 (10jcrespo) These were the queries ongoing at that time: {P6973} [10:05:02] correction: you won't see unexpected changes at *every* run on the bastions mentioned above, but "often" [10:06:11] in line with the "Sometimes" from puppetdbquery's README [10:07:09] 10Operations, 10Ops-Access-Requests, 10Analytics: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4119726 (10Jonas) 05Open>03Resolved a:03Jonas Thanks for looking into it! [10:08:43] if I see it correctly we call function_ordered_yaml() [10:09:26] volans: yeah, see my comment above. It doesn't seem to do what you'd think? [10:10:06] * volans wonders if it's recursive [10:10:20] volans: keys get sorted by ordered_yaml, not values [10:18:37] (03PS1) 10Elukey: role::configcluster_stretch: add zookeeper profiles for conf1004 [puppet] - 10https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) [10:20:44] (03CR) 10Sbisson: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/425233 (https://phabricator.wikimedia.org/T191807) (owner: 10Gehel) [10:22:25] (03PS1) 10Elukey: zookeeper: swap conf1001 with conf1004 [puppet] - 10https://gerrit.wikimedia.org/r/425239 (https://phabricator.wikimedia.org/T182924) [10:23:40] !log upgrading job runners in codfw to ICU57-enabled build of HHVM [10:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:25] (03PS1) 10Giuseppe Lavagetto: get_cluster: re-introduce sorting [puppet] - 10https://gerrit.wikimedia.org/r/425240 [10:25:36] <_joe_> ema: ^^ [10:25:53] <_joe_> this fixes the root issue [10:26:06] <_joe_> well, it should :P [10:26:20] looking [10:31:49] 10Operations, 10DBA, 10MediaWiki-Page-deletion,
10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119788 (10jcrespo) [10:34:21] _joe_: it's fun to look at yaml files diffs! https://puppet-compiler.wmflabs.org/compiler03/10884/ [10:34:41] <_joe_> yeah :/ [10:34:47] those +- and -- are confusing, but it all looks good to me [10:35:36] (03CR) 10Ema: [C: 031] get_cluster: re-introduce sorting [puppet] - 10https://gerrit.wikimedia.org/r/425240 (owner: 10Giuseppe Lavagetto) [10:39:34] (03Abandoned) 10Ema: prometheus::class_config: sort targets [puppet] - 10https://gerrit.wikimedia.org/r/425218 (owner: 10Ema) [10:42:19] (03CR) 10Giuseppe Lavagetto: [C: 032] get_cluster: re-introduce sorting [puppet] - 10https://gerrit.wikimedia.org/r/425240 (owner: 10Giuseppe Lavagetto) [10:45:06] (03CR) 10Ema: [C: 031] "One nit, LGTM otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) (owner: 10Vgutierrez) [10:45:49] 10Operations, 10DBA, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119820 (10Marostegui) These are the versions of the previous altered masters s1: 10.0.28 (the one that caused this) s2: 10.0.29 s3: 10.0.23 s4: 10.0.... [10:47:17] !log T188266 reimage labtestservices2002.wikimedia.org [10:47:19] (03CR) 10Giuseppe Lavagetto: [C: 032] Jessie port for 0.37.0 [debs/mcrouter] (jessie) - 10https://gerrit.wikimedia.org/r/425232 (owner: 10Giuseppe Lavagetto) [10:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:23] T188266: labtestn to Mitaka on Jessie - https://phabricator.wikimedia.org/T188266 [10:47:46] <_joe_> ema: confirmed subsequent puppet runs seem stable :) [10:47:56] _joe_: great, thanks! 
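For context on the fix just confirmed above ("get_cluster: re-introduce sorting", gerrit 425240): puppetdb v4 returns query results in no guaranteed order unless an `order-by` parameter is passed, so any file templated from an unsorted resource query (e.g. prometheus target lists) could come out differently on every puppet run even though the data was identical. A minimal stand-alone sketch of that mechanism and of the `certname asc`-style fix; all function and host names below are invented for illustration:

```python
# Simulate a puppetdb query with no order-by, then show that sorting the
# results before templating makes the rendered file deterministic.
import random

def fake_puppetdb_query(seed):
    """Stand-in for a /pdb/query/v4/resources call: same rows every time,
    nondeterministic order (no order-by)."""
    rows = [{"certname": f"cp30{i}.esams.wmnet"} for i in range(45, 50)]
    random.Random(seed).shuffle(rows)
    return rows

def render_targets(rows, sort=False):
    """Stand-in for templating a targets file from the query results."""
    if sort:
        # The fix: impose an explicit order, like 'certname asc' does.
        rows = sorted(rows, key=lambda r: r["certname"])
    return "\n".join(r["certname"] for r in rows)

# Ten "puppet runs", each seeing the rows in a different order.
unsorted_runs = {render_targets(fake_puppetdb_query(s)) for s in range(10)}
sorted_runs = {render_targets(fake_puppetdb_query(s), sort=True) for s in range(10)}

# Typically several distinct renderings without sorting; exactly one with it.
print(len(unsorted_runs), len(sorted_runs))
```

The same reasoning explains the ordered_yaml observation above: sorting the keys of a hash does nothing for the order of a list *value*, so the list itself has to be sorted before rendering.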
[10:48:08] <_joe_> wow, puppet is *slow* in eqsin [10:48:20] 10Operations, 10DBA, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4119830 (10jcrespo) CC @Anomie this is not directly related- maintenance was the direct cause, but I believe the new comment model may be creating wors... [10:49:43] I wish the perf team would look into that old speed of light slowness issue [10:50:02] <_joe_> ema: yeah, that's quite annoying, ain't it? [10:51:16] !log mobrovac@tin Started deploy [restbase/deploy@29df9db]: Use the MCS-provided content-type in the definition response - T191809 [10:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:23] T191809: Definition endpoint doesn't include Spec version in content-type - https://phabricator.wikimedia.org/T191809 [11:03:42] (03PS1) 10ArielGlenn: turn off public dumps mirror rsync access to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/425246 (https://phabricator.wikimedia.org/T182540) [11:05:32] (03PS1) 10Fdans: [wip] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [11:05:56] (03CR) 10jerkins-bot: [V: 04-1] [wip] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [11:07:30] !log upgrading mwdebug servers in codfw to ICU57-enabled build of HHVM [11:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:48] (03PS3) 10Mark Bergsma: Create FSM test cases according to the RFC 4271 definition [debs/pybal] - 10https://gerrit.wikimedia.org/r/423995 [11:08:50] (03PS2) 10Fdans: [wip] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [11:09:14] (03CR) 10jerkins-bot: [V: 04-1] 
[wip] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [11:11:38] (03CR) 10Mark Bergsma: "Good catch, thanks!" (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/423995 (owner: 10Mark Bergsma) [11:15:07] (03CR) 10ArielGlenn: [C: 032] writeuptopageid.1: Fix typo [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/425028 (owner: 10Hoo man) [11:15:35] !log mobrovac@tin Finished deploy [restbase/deploy@29df9db]: Use the MCS-provided content-type in the definition response - T191809 (duration: 24m 19s) [11:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:42] T191809: Definition endpoint doesn't include Spec version in content-type - https://phabricator.wikimedia.org/T191809 [11:19:45] (03PS7) 10Jcrespo: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/419709 [11:20:12] (03CR) 10Jcrespo: [C: 031] "Maybe we can try this later." 
[puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [11:21:54] (03PS2) 10Ema: check_http_varnish: bump check_interval [puppet] - 10https://gerrit.wikimedia.org/r/411033 [11:22:52] (03CR) 10Ema: [C: 032] check_http_varnish: bump check_interval [puppet] - 10https://gerrit.wikimedia.org/r/411033 (owner: 10Ema) [11:24:21] (03PS3) 10Fdans: [wip] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [11:24:49] (03CR) 10jerkins-bot: [V: 04-1] [wip] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [11:27:20] (03PS1) 10Deskana: Update wikis with consolidate editing feedback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425248 (https://phabricator.wikimedia.org/T168886) [11:28:37] (03CR) 10ArielGlenn: [V: 032 C: 032] writeuptopageid.1: Fix typo [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/425028 (owner: 10Hoo man) [11:45:23] (03PS4) 10Fdans: [wip] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [11:47:24] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4119938 (10MoritzMuehlenhoff) codfw has now also been upgraded to the ICU-enabled HHVM build (and the related Boost libraries) [11:47:55] PROBLEM - HHVM jobrunner on mw2264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [11:48:36] PROBLEM - DPKG on mw2264 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:48:36] PROBLEM - Nginx local proxy to apache on mw2264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.155 second response time [11:50:07] ^ that's me [11:50:59] (03PS1) 10Ema: 
lvs: use UDP monitor for logstash-gelf and logstash-udp2log [puppet] - 10https://gerrit.wikimedia.org/r/425251 [11:51:36] RECOVERY - DPKG on mw2264 is OK: All packages OK [11:51:45] RECOVERY - Nginx local proxy to apache on mw2264 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.155 second response time [11:51:55] RECOVERY - HHVM jobrunner on mw2264 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second response time [11:54:42] (03PS1) 10Rush: openstack: l3-agent custom rule behavior [puppet] - 10https://gerrit.wikimedia.org/r/425252 (https://phabricator.wikimedia.org/T168580) [11:55:05] PROBLEM - DPKG on mw2273 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:55:15] PROBLEM - Nginx local proxy to apache on mw2273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time [11:55:16] PROBLEM - HHVM processes on mw2273 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [11:55:25] PROBLEM - HHVM rendering on mw2273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [11:55:36] PROBLEM - Apache HTTP on mw2273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [11:56:16] (03PS1) 10Ema: lvs: use UDP monitor for logstash-{json,syslog}-udp [puppet] - 10https://gerrit.wikimedia.org/r/425253 [11:56:55] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:07] (03CR) 10Rush: [C: 032] openstack: l3-agent custom rule behavior [puppet] - 10https://gerrit.wikimedia.org/r/425252 (https://phabricator.wikimedia.org/T168580) (owner: 10Rush) [11:58:10] RECOVERY - DPKG on mw2273 is OK: All packages OK [11:58:21] RECOVERY - Nginx local proxy to apache on mw2273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.707 second response time [11:58:21] RECOVERY - HHVM processes on mw2273 is OK: PROCS OK: 6 processes with command name hhvm [11:58:30] RECOVERY - HHVM rendering on mw2273 
is OK: HTTP OK: HTTP/1.1 200 OK - 76349 bytes in 3.971 second response time [11:58:40] RECOVERY - Apache HTTP on mw2273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.116 second response time [11:58:50] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms [11:59:27] <_joe_> !log uploading mcrouter 0.37.0 to jessie-wikimedia (T190979) [11:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:34] T190979: build new version of mcrouter package - https://phabricator.wikimedia.org/T190979 [12:01:21] PROBLEM - HHVM rendering on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:31] PROBLEM - HHVM rendering on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:40] PROBLEM - HHVM rendering on mw2284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:46] <_joe_> !log uploading mcrouter 0.37.0 to stretch-wikimedia (T190979) [12:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:10] RECOVERY - HHVM rendering on mw2287 is OK: HTTP OK: HTTP/1.1 200 OK - 76347 bytes in 0.296 second response time [12:02:30] RECOVERY - HHVM rendering on mw2285 is OK: HTTP OK: HTTP/1.1 200 OK - 76347 bytes in 0.304 second response time [12:02:30] RECOVERY - HHVM rendering on mw2284 is OK: HTTP OK: HTTP/1.1 200 OK - 76347 bytes in 0.295 second response time [12:02:33] !log upgrading naos and wasat to ICU57-enabled build of HHVM [12:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:00] PROBLEM - puppet last run on mw2287 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [12:05:26] 10Operations: Update ICU version to 55.1 - https://phabricator.wikimedia.org/T143931#4119953 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff The app servers are now using ICU 57. 
[12:06:30] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4119957 (10Joe) [12:16:07] yay [12:25:14] 10Operations, 10Ops-Access-Requests: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4120019 (10ArielGlenn) [12:33:00] RECOVERY - puppet last run on mw2287 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:39:09] (03PS3) 10Marostegui: wiki replicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/425095 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [12:41:37] (03PS1) 10Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) [12:44:14] (03PS2) 10Gehel: maps: tileshell has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/425233 (https://phabricator.wikimedia.org/T191807) [12:45:36] (03CR) 10Gehel: [C: 032] maps: tileshell has moved to a new location [puppet] - 10https://gerrit.wikimedia.org/r/425233 (https://phabricator.wikimedia.org/T191807) (owner: 10Gehel) [12:45:57] (03PS4) 10Vgutierrez: lvs: Get rid of interface names on site.pp [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) [12:46:23] (03Abandoned) 10Elukey: zookeeper: swap conf1001 with conf1004 [puppet] - 10https://gerrit.wikimedia.org/r/425239 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [12:46:37] (03CR) 10Vgutierrez: lvs: Get rid of interface names on site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) (owner: 10Vgutierrez) [12:47:20] (03CR) 10Vgutierrez: [C: 031] "LGTM!" 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/423995 (owner: 10Mark Bergsma) [12:48:47] 10Operations, 10DBA, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120050 (10Anomie) >>! In T191875#4119830, @jcrespo wrote: > We can create a specific task for that. Please do. > Could the SELECT ... FOR UPDATE be... [12:50:38] 10Operations, 10Ops-Access-Requests: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4120055 (10Springle) ``` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCZwhGWhhv+9QdjhhShbLdSZSV349oFxPH73CfvI0jRsQFXsQIlPQaSeKcFqw+kjhUoxvfgCw3YWoExHTT6jxHUxrOswI6ZVPeicHNBQ4k... [12:50:39] (03CR) 10Vgutierrez: [C: 031] "Thanks for taking care of this :D" [puppet] - 10https://gerrit.wikimedia.org/r/425251 (owner: 10Ema) [12:55:01] (03PS2) 10Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) [12:57:59] (03PS2) 10Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) [12:58:25] (03PS1) 10Gehel: logstash: add icinga check of logstash TCP ports [puppet] - 10https://gerrit.wikimedia.org/r/425260 [12:59:18] XioNoX: ^ to ensure we don't have the same issue again... [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180410T1300). [13:00:05] Daimona, tgr, and Deskana: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:25] Hi [13:00:39] I can SWAT today [13:01:23] tgr, Deskana: want to deploy your changes, or should I? [13:01:42] zeljkof: I am not a deployer, so I am not capable. :-) [13:02:37] Deskana: Daimona: I will merge your changes and let you know when they are ready for testing at mwdebug1002 [13:02:44] Thanks! [13:02:46] Ok, thanks [13:02:48] tgr: around for swat? [13:04:13] gehel: nice! Also as pybal can now do UDP checks we should create a task to add them for the logstash services [13:05:41] zeljkof: gerrit/424622 isn't testable by me [13:05:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425248 (https://phabricator.wikimedia.org/T168886) (owner: 10Deskana) [13:05:51] XioNoX, gehel: https://gerrit.wikimedia.org/r/#/c/425251/ https://gerrit.wikimedia.org/r/#/c/425253/ [13:05:52] But was tested by at least 3 people on master [13:06:39] Daimona: ok, so it should be deployed as soon as it is merged? [13:06:43] ema: amazing! we only have to think about it and it gets fixed ! [13:06:49] Yeah, basically [13:07:06] (03Merged) 10jenkins-bot: Update wikis with consolidate editing feedback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425248 (https://phabricator.wikimedia.org/T168886) (owner: 10Deskana) [13:07:19] gehel: preemptive coding powered by ema [13:08:23] Deskana: your patch is at mwdebug1002, please test and let me know if I can deploy it [13:08:30] (03CR) 10Gehel: [C: 031] "LGTM (and big thanks ema / vgutierrez for that!)" [puppet] - 10https://gerrit.wikimedia.org/r/425251 (owner: 10Ema) [13:10:47] (03CR) 10jenkins-bot: Update wikis with consolidate editing feedback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425248 (https://phabricator.wikimedia.org/T168886) (owner: 10Deskana) [13:12:09] 10Operations, 10Ops-Access-Requests: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4120089 (10ArielGlenn) Key verified over google hangout. 
[13:12:23] Testing. [13:12:49] gehel, vgutierrez: I'll merge https://gerrit.wikimedia.org/r/#/c/425251/ then. The changes won't be applied till we restart pybal, so let's restart pybal on a secondary first and see how the monitor behaves in prod [13:13:00] ack [13:13:58] well first let's see if pcc is happy [13:14:17] (03PS1) 10Rush: openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) [13:14:44] (03CR) 10jerkins-bot: [V: 04-1] openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:14:49] No, it's not working. The feedback is being posted locally in production. On mwdebug1002 it should be posted on mediawiki.org, but is instead being discarded. [13:15:00] tgr: your patch will not be deployed if you are not around for swat [13:15:11] Deskana: should I revert? [13:15:25] Let me keep testing it for now. 
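On the mwdebug step above: a patch staged on mwdebug1002 is normally exercised by sending requests with an `X-Wikimedia-Debug` header so the caching layer routes them to that backend instead of the regular app servers. A sketch of building such a request; the header value format is an assumption (check the wikitech X-Wikimedia-Debug page for the current syntax), the helper name is made up, and nothing is actually sent here:

```python
# Build (but do not send) a request carrying the debug-routing header.
from urllib.request import Request

def debug_request(url, backend="mwdebug1002.eqiad.wmnet"):
    """Hypothetical helper: attach the X-Wikimedia-Debug header (assumed
    'backend=<host>' format) to route a request to a debug app server."""
    req = Request(url)
    req.add_header("X-Wikimedia-Debug", f"backend={backend}")
    return req

req = debug_request("https://en.wikipedia.org/wiki/Special:BlankPage")
# urllib stores header names capitalized, hence the lookup key below.
print(req.get_header("X-wikimedia-debug"))
```

To actually test, the built request would be passed to `urllib.request.urlopen`; the point is only that "please test on mwdebug1002" means repeating the failing action with this header set.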
[13:15:27] (03PS2) 10Rush: openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) [13:15:30] ok [13:15:35] (03PS2) 10Ema: lvs: use UDP monitor for logstash-gelf and logstash-udp2log [puppet] - 10https://gerrit.wikimedia.org/r/425251 [13:15:55] (03CR) 10jerkins-bot: [V: 04-1] openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:16:01] (03CR) 10Ema: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler03/10890/lvs1003.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/425251 (owner: 10Ema) [13:16:06] (03PS1) 10ArielGlenn: Create group of normal users with snapshot/dumps host access, add springle [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) [13:16:10] (03CR) 10Ema: [C: 032] lvs: use UDP monitor for logstash-gelf and logstash-udp2log [puppet] - 10https://gerrit.wikimedia.org/r/425251 (owner: 10Ema) [13:16:30] (03CR) 10jerkins-bot: [V: 04-1] Create group of normal users with snapshot/dumps host access, add springle [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) (owner: 10ArielGlenn) [13:16:32] (03PS3) 10Rush: openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) [13:16:59] (03CR) 10jerkins-bot: [V: 04-1] openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:17:33] zeljkof: Can we instead try deploying it to production, and see if it works then? It's a relatively safe patch to test it by deploying it, then revert if it doesn't work. [13:17:54] Deskana: sure; should I deploy? [13:18:05] zeljkof: Yes please! I'm ready. 
[13:18:06] (03PS4) 10Rush: openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) [13:18:24] Deskana: deploying... [13:18:55] XioNoX: if you have a minute to review https://gerrit.wikimedia.org/r/#/c/425260/ ... I'll merge it later today... [13:18:56] Daimona: 424622 is merged, [13:19:03] !log restart pybal on lvs1006 for config changes introduced by https://gerrit.wikimedia.org/r/#/c/425251/ [13:19:05] (03PS2) 10ArielGlenn: Create group of normal users with snapshot/dumps host access, add springle [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) [13:19:05] Yeah, finally [13:19:05] will deploy soon [13:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:10] Thanks [13:19:11] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:425248|Update wikis with consolidate editing feedback (T168886)]] (duration: 01m 00s) [13:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:18] T168886: Change in-editor feedback tool to point to mediawiki.org - https://phabricator.wikimedia.org/T168886 [13:19:21] Deskana: deployed [13:19:32] Testing... [13:19:55] (03Abandoned) 10Ppchelko: Remove special jobrunners for refreshLinks and htmlCacheUpdate. 
[puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [13:20:51] (03CR) 10Rush: [C: 032] openstack: when applicable setup bridge interface [puppet] - 10https://gerrit.wikimedia.org/r/425262 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:23:16] (03CR) 10BBlack: [C: 031] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) (owner: 10Vgutierrez) [13:23:24] gehel, vgutierrez: done, logstash-gelf_12201_udp looks good on lvs1006 (see curl http://localhost:9090/pools/logstash-gelf_12201_udp) [13:24:05] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler03/10889/" [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [13:24:07] (03PS3) 10Muehlenhoff: mediawiki::packages::fonts: Consistently use require_package [puppet] - 10https://gerrit.wikimedia.org/r/420670 [13:24:20] zeljkof: It definitely doesn't work. Please revert it. Sorry for the inconvenience. [13:24:20] (03CR) 10Elukey: "> pcc: https://puppet-compiler.wmflabs.org/compiler03/10889/" [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [13:24:31] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler03/10889/" [puppet] - 10https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [13:24:33] (03PS3) 10ArielGlenn: Create group of normal users with snapshot/dumps host access, add springle [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) [13:24:57] Deskana: no problem, reverting [13:25:02] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4120121 (10ArielGlenn) Not sure if the patchset gives access to the bastions, otherwise I think we're ok. 
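On the pybal UDP monitors just verified on lvs1006 (the `logstash-gelf_12201_udp` pool above): unlike a TCP or HTTP check, a UDP probe has no handshake, so "up" can only be inferred from an application-level reply or the absence of an ICMP error. pybal's real monitor lives in the pybal codebase and certainly differs from this; the toy probe below (all names invented) just shows the shape of such a check against a local UDP echo service:

```python
# A one-shot local UDP echo "service" plus a probe that calls it up only
# if the payload comes back. A silent-but-healthy UDP service would look
# "down" to this check -- the inherent weakness of UDP monitoring.
import socket
import threading

# Stand-in service: a one-shot UDP echo on an ephemeral localhost port.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
host, port = server.getsockname()

def echo_once():
    data, addr = server.recvfrom(1024)
    server.sendto(data, addr)

def udp_probe(host, port, payload=b"ping", timeout=2.0):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _ = s.recvfrom(1024)
            return data == payload
        except socket.timeout:
            return False

t = threading.Thread(target=echo_once)
t.start()
ok = udp_probe(host, port)
t.join()
server.close()
print(ok)
```

This is also why the earlier icinga change (gerrit 425260) checks the logstash TCP ports separately: a TCP connect gives a definite answer, while the UDP side can only be probed best-effort.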
[13:25:20] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4120122 (10ArielGlenn) [13:25:44] (03PS1) 10Zfilipin: Revert "Update wikis with consolidate editing feedback" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425264 [13:26:15] 10Operations, 10DBA, 10MediaWiki-Page-deletion: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4120123 (10jcrespo) p:05Triage>03Normal [13:26:33] (03PS2) 10Zfilipin: Revert "Update wikis with consolidate editing feedback" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425264 (https://phabricator.wikimedia.org/T168886) [13:26:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425264 (https://phabricator.wikimedia.org/T168886) (owner: 10Zfilipin) [13:27:30] zeljkof: Alright, I'm going away for now. Thanks for trying. No harm done. [13:28:02] (03Merged) 10jenkins-bot: Revert "Update wikis with consolidate editing feedback" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425264 (https://phabricator.wikimedia.org/T168886) (owner: 10Zfilipin) [13:28:33] 10Operations, 10DBA, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120153 (10jcrespo) I agree with everything you said, my comment was a quick sketch of what I wanted, and what you proposed was what I really wanted, c... [13:28:35] (03PS3) 10Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) [13:29:15] (03CR) 10jenkins-bot: Revert "Update wikis with consolidate editing feedback" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425264 (https://phabricator.wikimedia.org/T168886) (owner: 10Zfilipin) [13:29:18] ema: thanks! 
[13:29:25] Deskana: thanks for deploying with #releng ;) [13:29:40] Daimona: deploying 424622 [13:29:58] Alright, thanks, this one should be really quick to test :-) [13:29:58] !log zfilipin@tin Synchronized php-1.31.0-wmf.28/extensions/AbuseFilter/: SWAT: [[gerrit:424622|Disable search for global filters (T191539)]] (duration: 01m 01s) [13:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:05] T191539: Internal error when searching within global rules - https://phabricator.wikimedia.org/T191539 [13:30:19] I meant the next one [13:30:26] Daimona: deployed (it's the first patch) [13:30:34] Yeah [13:30:45] (03PS4) 10Elukey: Update kafka java.security file with Java 8 u162 changes [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:31:02] 10Operations, 10DBA, 10MediaWiki-Page-deletion: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4120159 (10jcrespo) I believe this have been happening for some time now, but this incident only made it more real (happening not only for large deletes, but for small on... [13:32:06] (03PS1) 10Rush: openstack: neutron adjust spacing for bridge resource [puppet] - 10https://gerrit.wikimedia.org/r/425266 (https://phabricator.wikimedia.org/T188266) [13:32:49] gehel: yw! 
[13:32:58] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:425264|Revert "Update wikis with consolidate editing feedback" (T168886)]] (duration: 00m 59s) [13:32:58] (03CR) 10Rush: [C: 032] openstack: neutron adjust spacing for bridge resource [puppet] - 10https://gerrit.wikimedia.org/r/425266 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:33:01] (03CR) 10Elukey: [C: 032] Update kafka java.security file with Java 8 u162 changes [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:33:02] Deskana: revert deployed [13:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:05] T168886: Change in-editor feedback tool to point to mediawiki.org - https://phabricator.wikimedia.org/T168886 [13:33:08] (03PS5) 10Elukey: Update kafka java.security file with Java 8 u162 changes [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:33:56] (03PS5) 10Vgutierrez: varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) [13:34:34] (03CR) 10Vgutierrez: [C: 032] varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [13:34:38] (03Abandoned) 10Rush: openstack: neutron router l3-agent HA [puppet] - 10https://gerrit.wikimedia.org/r/423032 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:35:39] !log restart kafka on kafka-jumbo1001 for openjdk upgrades [13:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:24] zeljkof: oops, sorry, timezone confusion [13:39:50] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/varnishxcache] [13:40:12] tgr: no problem, there is still time, are you deploying your commit, or should I? [13:40:36] (03PS4) 10Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) [13:41:27] !log upgraded HHVM on mediawiki-deployment04/05/06 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854) [13:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:33] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [13:41:50] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:41:56] zeljkof: I can do it, are you done with the other patches? [13:42:01] puppet on cp* it's on me [13:42:07] Daimona: the second commit is merged, will deploy it in a few minutes [13:42:15] Yup, finally [13:42:26] tgr: I need a few more minutes, I'll let you know [13:42:28] (03CR) 10Hashar: [C: 032] Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: 10Hashar) [13:42:33] thx [13:44:39] (03PS1) 10Muehlenhoff: Reimage mw1265 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/425269 (https://phabricator.wikimedia.org/T174431) [13:44:50] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:45:13] Daimona: the patch is at mwdebug1002 [13:45:19] Ok, testing [13:45:41] Yeah, it works as expected [13:45:43] Safe to deploy [13:46:10] Daimona: ok, deploying [13:46:18] TY [13:46:50] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:47:28] !log zfilipin@tin Synchronized 
php-1.31.0-wmf.28/extensions/AbuseFilter/: SWAT: [[gerrit:424767|Restore subtract method for backward compatibility (T191696)]] (duration: 01m 01s) [13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:35] (03CR) 10Gehel: [C: 04-1] "This file is already provided by the `check-postgres` package which is already deployed on our postgresql servers (https://github.com/wiki" [puppet] - 10https://gerrit.wikimedia.org/r/425227 (https://phabricator.wikimedia.org/T185504) (owner: 10Dzahn) [13:47:35] T191696: Abuse filter error (Exception caught: Unknown variable compute type subtract) - https://phabricator.wikimedia.org/T191696 [13:47:57] Daimona: deployed, please check and thanks for deploying with #releng ;) [13:48:21] tgr: I'm done, go ahead [13:48:30] Rechecked, works [13:48:47] Thank you, cya [13:50:13] (03CR) 10محمد شعیب: "> Thanks for this change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425062 (owner: 10محمد شعیب) [13:50:27] (03PS2) 10Gergő Tisza: Enable TemplateStyles on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425112 (https://phabricator.wikimedia.org/T188198) [13:51:07] (03CR) 10Gergő Tisza: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425112 (https://phabricator.wikimedia.org/T188198) (owner: 10Gergő Tisza) [13:51:36] !log disable puppet on primary LVS to merge safely gerrit/425040 T177961 [13:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:43] T177961: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961 [13:52:03] (03PS3) 10محمد شعیب: Fixing names of some Urdu projects from وکی to wiki along with namespace name in Urdu wiktionary. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/425062 [13:52:14] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PosgreSQL - https://phabricator.wikimedia.org/T185504#4120224 (10Gehel) We already have some puppet code to monitor postgres replication lag (https://github.com/wikimedia/puppet/blob/production/modules/postgresql/manifests/slave/mon... [13:52:25] (03Merged) 10jenkins-bot: Enable TemplateStyles on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425112 (https://phabricator.wikimedia.org/T188198) (owner: 10Gergő Tisza) [13:52:39] (03CR) 10jenkins-bot: Enable TemplateStyles on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425112 (https://phabricator.wikimedia.org/T188198) (owner: 10Gergő Tisza) [13:52:48] (03PS1) 10BBlack: upload: experimental reduction of fb traffic [puppet] - 10https://gerrit.wikimedia.org/r/425270 [13:53:20] (03CR) 10Vgutierrez: [C: 032] lvs: Get rid of interface names on site.pp [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) (owner: 10Vgutierrez) [13:53:29] (03PS5) 10Vgutierrez: lvs: Get rid of interface names on site.pp [puppet] - 10https://gerrit.wikimedia.org/r/425040 (https://phabricator.wikimedia.org/T177961) [13:54:00] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: CRITICAL - scalar(avg_over_time(kafka_server_ReplicaManager_UnderReplicatedPartitions{cluster=kafka_jumbo,instance=kafka-jumbo1005:7800,job=jmx_kafka}[30m])): 11.233333333333333 = 10.0 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_bro [13:54:00] 5 [13:54:40] checking kafka alarm [13:54:49] might be still recovering after my last restart [13:54:52] (03CR) 10BBlack: [C: 032] upload: experimental reduction of fb traffic [puppet] - 10https://gerrit.wikimedia.org/r/425270 (owner: 10BBlack) [13:54:57] (03PS2) 10BBlack: 
upload: experimental reduction of fb traffic [puppet] - 10https://gerrit.wikimedia.org/r/425270 [13:55:31] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: T188198 Enable TemplateStyles on ruwiki (duration: 01m 00s) [13:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:38] T188198: Enable TemplateStyles on ruwikion 2018-04-10 - https://phabricator.wikimedia.org/T188198 [13:57:49] (03PS1) 10Ppchelko: Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) [13:58:01] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: CRITICAL - scalar(avg_over_time(kafka_server_ReplicaManager_UnderReplicatedPartitions{cluster=kafka_jumbo,instance=kafka-jumbo1005:7800,job=jmx_kafka}[30m])): 11.233333333333333 = 10.0 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_bro [13:58:01] 5 [14:00:59] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: CRITICAL - scalar(avg_over_time(kafka_server_ReplicaManager_UnderReplicatedPartitions{cluster=kafka_jumbo,instance=kafka-jumbo1005:7800,job=jmx_kafka}[30m])): 11.233333333333333 = 10.0 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_bro [14:01:00] 5 [14:01:30] a bit spammy [14:01:37] duh, ruwiki has an abuse filter to prevent templatestyles so I can't test [14:01:48] I'll call it done [14:01:52] thx zeljkof [14:02:19] the metric recovered but since it looks for the past 30m of data it alarms [14:02:55] (03CR) 10Giuseppe Lavagetto: [C: 031] role::configcluster_stretch: add IPv6 static addresses [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [14:03:26] (03PS1) 
10ArielGlenn: Revert "remove snapshot01 from mediawiki scap list on beta for testing" [puppet] - 10https://gerrit.wikimedia.org/r/425272 [14:03:32] (03PS2) 10ArielGlenn: Revert "remove snapshot01 from mediawiki scap list on beta for testing" [puppet] - 10https://gerrit.wikimedia.org/r/425272 [14:03:48] (03PS3) 10Elukey: role::configcluster_stretch: add IPv6 static addresses [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) [14:04:00] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: CRITICAL - scalar(avg_over_time(kafka_server_ReplicaManager_UnderReplicatedPartitions{cluster=kafka_jumbo,instance=kafka-jumbo1005:7800,job=jmx_kafka}[30m])): 11.233333333333333 = 10.0 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_bro [14:04:00] 5 [14:04:16] (03CR) 10jerkins-bot: [V: 04-1] role::configcluster_stretch: add IPv6 static addresses [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [14:04:27] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/424546 (https://phabricator.wikimedia.org/T135991) [14:04:30] (03CR) 10Elukey: [V: 032 C: 032] role::configcluster_stretch: add IPv6 static addresses [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [14:04:32] (03CR) 10ArielGlenn: [C: 032] Revert "remove snapshot01 from mediawiki scap list on beta for testing" [puppet] - 10https://gerrit.wikimedia.org/r/425272 (owner: 10ArielGlenn) [14:05:11] (03PS3) 10ArielGlenn: Revert "remove snapshot01 from mediawiki scap list on beta for testing" [puppet] - 10https://gerrit.wikimedia.org/r/425272 [14:09:01] (03CR) 10Ayounsi: [C: 031] logstash: add icinga check of logstash TCP ports [puppet] - 
10https://gerrit.wikimedia.org/r/425260 (owner: 10Gehel) [14:11:49] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4120303 (10Tgr) [14:12:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: OK - scalar(avg_over_time(kafka_server_ReplicaManager_UnderReplicatedPartitions{cluster=kafka_jumbo,instance=kafka-jumbo1005:7800,job=jmx_kafka}[30m])) within thresholds https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1005 [14:12:53] (03PS2) 10Gehel: logstash: add icinga check of logstash TCP ports [puppet] - 10https://gerrit.wikimedia.org/r/425260 [14:13:27] (03CR) 10Gehel: [C: 032] logstash: add icinga check of logstash TCP ports [puppet] - 10https://gerrit.wikimedia.org/r/425260 (owner: 10Gehel) [14:13:57] (03PS6) 10Mforns: Modify eventlogging purging script to read from YAML whitelist [puppet] - 10https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) [14:14:45] (03CR) 10Mforns: "Answered your comments inline, thanks!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) (owner: 10Mforns) [14:17:38] !log installing python-crypto security updates on trusty [14:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:37] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4120330 (10Joe) [14:21:39] !log re-enable puppet on primary LVS [14:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:02] !log restarted nsca server on einsteinium [14:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:41] (03PS1) 10Cmjohnson: Adding dns entries for ms-be1040-43 [dns] - 10https://gerrit.wikimedia.org/r/425274 (https://phabricator.wikimedia.org/T191896) [14:26:17] PROBLEM - DPKG on labcontrol1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:26:48] PROBLEM - DPKG on labcontrol1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:27:08] RECOVERY - DPKG on labcontrol1003 is OK: All packages OK [14:27:49] RECOVERY - DPKG on labcontrol1004 is OK: All packages OK [14:29:51] (03CR) 10Krinkle: [C: 031] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [14:30:50] (03CR) 10Gehel: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/425253 (owner: 10Ema) [14:31:07] PROBLEM - puppet last run on labcontrol1003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. 
Failed resources (up to 3 shown): Package[openssl],Package[apparmor],Package[zsh],Package[openssh-client] [14:32:48] 10Operations, 10Pybal, 10Traffic: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4120349 (10Vgutierrez) p:05Triage>03Normal [14:33:43] (03PS2) 10Muehlenhoff: Remove Varnish config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424552 (https://phabricator.wikimedia.org/T188062) [14:33:59] (03CR) 10Muehlenhoff: "That slipped in accidentally, fixed in PS2." [puppet] - 10https://gerrit.wikimedia.org/r/424552 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [14:34:09] 10Operations, 10Patch-For-Review, 10User-Joe: build new version of mcrouter package - https://phabricator.wikimedia.org/T190979#4120366 (10Joe) I built and uploaded two packages for 0.37.0, both in jessie and stretch. I will try to document the build process and automate it as much as possible. [14:34:27] 10Operations, 10Patch-For-Review, 10User-Joe: build new version of mcrouter package - https://phabricator.wikimedia.org/T190979#4120367 (10Joe) 05Open>03Resolved [14:37:38] (03PS2) 10Jcrespo: dbstore: Reenable alerts for dbstore1001 after reset [puppet] - 10https://gerrit.wikimedia.org/r/425086 (https://phabricator.wikimedia.org/T186596) [14:37:53] (03CR) 10RobH: [C: 031] Adding dns entries for ms-be1040-43 [dns] - 10https://gerrit.wikimedia.org/r/425274 (https://phabricator.wikimedia.org/T191896) (owner: 10Cmjohnson) [14:38:33] (03CR) 10Jcrespo: [C: 032] dbstore: Reenable alerts for dbstore1001 after reset [puppet] - 10https://gerrit.wikimedia.org/r/425086 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [14:38:41] (03CR) 10Elukey: [C: 031] Reimage mw1265 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/425269 (https://phabricator.wikimedia.org/T174431) (owner: 10Muehlenhoff) [14:41:32] (03PS1) 10Vgutierrez: install_server: Reimage lvs5003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425278 
(https://phabricator.wikimedia.org/T191897) [14:45:20] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120406 (10RobH) [edit interfaces interface-range vlan-private1-a-codfw] member xe-2/0/0 { ... } + member ge-3/0/27; [edit interfaces ge-3/0/27] + description db2040;... [14:46:20] !log Stop MySQL on db2040 for server move - T191193 [14:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:26] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [14:48:44] (03PS1) 10Papaul: DNS: move db2040 from private1-c-codfw to private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/425279 (https://phabricator.wikimedia.org/T191193) [14:48:55] (03CR) 10BBlack: [C: 04-1] "Should disable pybal's BGP for the reinstall too, so we have a chance to check things out before it starts talking to routers" [puppet] - 10https://gerrit.wikimedia.org/r/425278 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [14:52:48] (03PS2) 10Vgutierrez: install_server: Reimage lvs5003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425278 (https://phabricator.wikimedia.org/T191897) [14:53:46] (03CR) 10Vgutierrez: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/425278 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [14:54:40] (03PS1) 10Giuseppe Lavagetto: Update rbenv ruby version to match production [puppet] - 10https://gerrit.wikimedia.org/r/425280 [14:54:48] <_joe_> git /win 19 [14:54:51] <_joe_> argh [14:55:43] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#4120427 (10Krinkle) [14:56:03] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:56:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2069 - 
https://phabricator.wikimedia.org/T191720#4120431 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [14:57:01] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4120437 (10Marostegui) Let's hope this time it finishes correctly! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuildi... [14:58:49] (03CR) 10BBlack: [C: 031] install_server: Reimage lvs5003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425278 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [15:06:19] !log Wiki replicas: ran `sudo maintain-views --table page_assessments --database arwiki` on all 3 servers for T191455 [15:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:25] T191455: trwiki_p.page_assessments and trwiki_p.page_assessments_projects missing on replicas - https://phabricator.wikimedia.org/T191455 [15:08:00] !log restarting Icinga on einsteinium, command file not working [15:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:29] (03CR) 10Vgutierrez: "pcc looks happy https://puppet-compiler.wmflabs.org/compiler02/10891/" [puppet] - 10https://gerrit.wikimedia.org/r/425278 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [15:14:29] (03CR) 10Marostegui: [C: 032] DNS: move db2040 from private1-c-codfw to private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/425279 (https://phabricator.wikimedia.org/T191193) (owner: 10Papaul) [15:16:57] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change db2040 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425285 (https://phabricator.wikimedia.org/T191193) [15:18:39] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Change db2040 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425285 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:19:00] PROBLEM - Host db2040.mgmt 
is DOWN: PING CRITICAL - Packet loss = 100% [15:19:16] ^ that is expected [15:20:06] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2040 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425285 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:20:20] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2040 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425285 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:21:42] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2040 IP as it is being moved to another rack - T191193 (duration: 00m 59s) [15:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:48] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [15:22:31] (03PS1) 10Andrew Bogott: labtestwiki: disable new account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425286 [15:22:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2040 IP as it is being moved to another rack - T191193 (duration: 00m 59s) [15:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:21] 10Operations, 10DBA, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120530 (10Marostegui) We have started an Incident Report for this: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Deleting_a_page... 
[15:23:30] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: disable new account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425286 (owner: 10Andrew Bogott) [15:26:12] !log Reimage lvs5003 as stretch [15:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:10] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage lvs5003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425278 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [15:27:16] (03PS3) 10Vgutierrez: install_server: Reimage lvs5003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425278 (https://phabricator.wikimedia.org/T191897) [15:29:04] (03PS4) 10Marostegui: wiki replicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/425095 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [15:29:30] RECOVERY - Host db2040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms [15:29:54] (03CR) 10Marostegui: [C: 032] wiki replicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/425095 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [15:30:57] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120557 (10Papaul) Move db2040 from C6 to A3 in racktables Please advice what is the next server [15:31:16] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120559 (10Papaul) [15:32:04] !log Reload haproxy on dbproxy1010 to depool labsdb1010 [15:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:25] (03CR) 10Rush: [C: 031] "Thank you Andrew, this makes me feel better." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/425286 (owner: 10Andrew Bogott) [15:38:13] 10Operations, 10DBA, 10MediaWiki-Page-deletion, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4120575 (10jcrespo) 05Open>03Resolved a:03Marostegui I am going to close this ticket as the initial report, "Deletion not working", was resolved... [15:42:45] !log disable puppet on analytics1003 and stop camus crons in preperation for spark 2 upgrade [15:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:44] (03PS1) 10Ottomata: Use spark2 for Refine job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) [15:45:36] (03CR) 10Krinkle: [C: 031] Set $wgPropagateErrors to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: 10Gergő Tisza) [15:45:54] marostegui, jynus: Somewhat crazy thought regarding T191875, is there any chance the columns weren't "ar_text mediumblob NOT NULL" and "ar_flags tinyblob NOT NULL" on that enwiki master before the ALTER? [15:45:55] T191875: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875 [15:46:40] anomie: You mean that maybe the alter wasn't online? [15:46:44] (03CR) 10Madhuvishy: [C: 031] remove dumps web server from dataset1001 and ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/425234 (https://phabricator.wikimedia.org/T182540) (owner: 10ArielGlenn) [15:47:05] (03CR) 10Madhuvishy: [C: 031] turn off public dumps mirror rsync access to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/425246 (https://phabricator.wikimedia.org/T182540) (owner: 10ArielGlenn) [15:47:11] anomie: we've have seen lag on the slaves and connections piling up if the alter wasn't fully online [15:47:17] marostegui: Yeah. If one of the columns had a different type or something. 
[15:47:32] (03PS1) 10Rduran: [WIP] Add integration tests to test agains MariaDB [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/425291 [15:47:47] anomie: We would have seen that as lag and as connections piling up [15:48:42] (03PS1) 10Elukey: Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) [15:48:55] (03CR) 10jerkins-bot: [V: 04-1] Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [15:48:59] moritzm: masters do not create lag [15:49:03] sorry, wrong person [15:49:10] marostegui: masters do not create lag [15:49:26] jynus: I mean on the slaves, if the table was locked on the master [15:49:34] I can check [15:49:37] on the backups [15:49:41] it is trivial now [15:50:19] I would prefer a simple explanation like that than the complex I can say now [15:50:57] But if the table was locked, we would have seen connections piling up and slaves with lag [15:51:09] (03PS2) 10Elukey: Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) [15:51:50] marostegui: not on archive, which is only used for delete and undelete [15:51:59] I am not saying it is that [15:52:10] I am saying, let me check because it is easy to discard [15:52:17] of course [15:52:33] what was the supposed state before/after? [15:52:41] and anomie's suggestion? 
[15:52:58] zcat dump.s1.2018-04-04--03-40-39/enwiki.archive-schema.sql.gz [15:53:14] gives the state of enwiki, at least on one replica on the 4 april [15:53:20] let me see [15:53:35] but by that, it may have changed [15:53:42] (03CR) 10Ema: [C: 031] varnish: varnishxcache post-removal cleanup [puppet] - 10https://gerrit.wikimedia.org/r/424611 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:53:50] jynus: Before should have had "ar_text mediumblob NOT NULL" and "ar_flags tinyblob NOT NULL". After should change NOT NULL to NULL. [15:54:11] `ar_text` mediumblob NOT NULL, [15:54:19] `ar_flags` tinyblob NOT NULL, [15:54:40] so I guess change hadn't happen there yet [15:55:13] I wouldn't discard "strange issue on mariadb that only happens with legacy internal types" [15:55:19] XDD [15:55:39] but that is not really actionable, other than failing over the master, which we are planning to do anyway [15:56:30] yep, it has been like that for some time [15:56:46] could enwiki had a special structure? maybe? [15:56:50] enwiki-master [15:57:36] but those fields specifically, it is unlikely [15:57:44] yeah, that is pretty unlikely I would say [15:58:10] there is https://phabricator.wikimedia.org/T104756 [15:58:17] but that is indexes, which we know is a thing [15:58:38] but not columns as that would have likely created issues with replication (specially ROW before) [15:59:27] (03PS2) 10Vgutierrez: varnish: varnishxcache post-removal cleanup [puppet] - 10https://gerrit.wikimedia.org/r/424611 (https://phabricator.wikimedia.org/T184942) [16:00:04] godog, moritzm, and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180410T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
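(Editor's aside: the schema check done above with `zcat` on the dump can be scripted. A minimal sketch, assuming a schema dump in plain `CREATE TABLE` form; the `SCHEMA` text below is an illustrative stand-in, not the real enwiki dump output:)

```python
import re

# Illustrative fragment of a CREATE TABLE statement, standing in for the
# decompressed output of the archive-schema dump quoted in the log above.
SCHEMA = """
CREATE TABLE `archive` (
  `ar_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `ar_text` mediumblob NOT NULL,
  `ar_flags` tinyblob NOT NULL,
  PRIMARY KEY (`ar_id`)
);
"""

def column_is_not_null(schema: str, column: str) -> bool:
    """Return True if the named column's definition carries NOT NULL."""
    # Match the backquoted column name, its type, and the rest of its
    # definition up to the next comma or newline.
    pattern = re.compile(r"`%s`\s+\w+(\(\d+\))?[^,\n]*" % re.escape(column))
    match = pattern.search(schema)
    if match is None:
        raise KeyError(column)
    return "NOT NULL" in match.group(0)

# Same two columns that were checked by hand in the conversation above.
for col in ("ar_text", "ar_flags"):
    print(col, "is", "NOT NULL" if column_is_not_null(SCHEMA, col) else "nullable")
```

With the sample schema both columns report NOT NULL, matching what the dump showed (i.e. the ALTER had not yet been applied on that replica as of 4 April).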
[16:00:39] anomie: the thing is it fits a metadata locking in result, but not in how it happens [16:00:52] if it was metadata locking, the alter wouldn't run at all [16:01:21] so I think it is an interaaction with the UPDATE, which is also considered a DDL [16:01:29] *FOR UPDATE [16:02:31] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/425293 [16:03:38] The FOR UPDATE for deletion was on revision and related tables, but not archive. [16:04:04] (03CR) 10Marostegui: [C: 032] Revert "wiki replicas: depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/425293 (owner: 10Bstorm) [16:05:09] !log Reload haproxy on dbproxy1010 to repool labsdb1010 [16:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:25] (03CR) 10Vgutierrez: [C: 032] varnish: varnishxcache post-removal cleanup [puppet] - 10https://gerrit.wikimedia.org/r/424611 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:05:38] PROBLEM - Long running screen/tmux on furud is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 22854, 1731443s 1728000s). [16:05:54] (03PS3) 10Vgutierrez: varnish: varnishxcache post-removal cleanup [puppet] - 10https://gerrit.wikimedia.org/r/424611 (https://phabricator.wikimedia.org/T184942) [16:07:39] !log labsdb1010 now has the latest views available, including the comment table [16:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120667 (10Papaul) switch port information when ready to move db2045. db2045 was on asw-c6-codfw ge-6/0/14 and now will be on asw-b3-codfw ge-3/0/ 20 new ip address will be :... 
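(Editor's aside: the metadata-locking theory discussed above — an open transaction making a DDL statement wait, and the waiting DDL then blocking everything behind it — can be illustrated with a toy lock model. This is a hedged sketch of MySQL/MariaDB-style MDL queueing, not the server's actual implementation; the statement names are illustrative, and as noted in the log the deletion's FOR UPDATE did not actually touch `archive`:)

```python
from collections import deque

class MetadataLock:
    """Toy FIFO model of metadata-lock queueing on one table:
    DML takes a shared lock, ALTER TABLE needs an exclusive lock, and
    requests are granted strictly in arrival order. The point it shows:
    one open transaction can make an ALTER wait, and the waiting ALTER
    then makes every later statement on the table wait too."""

    def __init__(self):
        self.shared_holders = set()
        self.exclusive_holder = None
        self.queue = deque()  # pending (name, mode) requests, FIFO

    def _grantable(self, mode):
        if mode == "shared":
            return self.exclusive_holder is None
        return self.exclusive_holder is None and not self.shared_holders

    def _grant(self, name, mode):
        if mode == "shared":
            self.shared_holders.add(name)
        else:
            self.exclusive_holder = name

    def acquire(self, name, mode):
        """Return True if granted immediately, False if the request queued."""
        if not self.queue and self._grantable(mode):
            self._grant(name, mode)
            return True
        self.queue.append((name, mode))
        return False

    def release(self, name):
        self.shared_holders.discard(name)
        if self.exclusive_holder == name:
            self.exclusive_holder = None
        # Grant whatever the head of the queue now allows, in order.
        while self.queue and self._grantable(self.queue[0][1]):
            self._grant(*self.queue.popleft())


mdl = MetadataLock()
mdl.acquire("open txn on archive", "shared")        # long-running transaction
mdl.acquire("ALTER TABLE archive ...", "exclusive") # queues behind the txn
mdl.acquire("DELETE (page deletion)", "shared")     # queues behind the ALTER
mdl.release("open txn on archive")
print(mdl.exclusive_holder)  # the ALTER finally runs; the DELETE still waits
```

The pile-up behaviour (small deletes stuck behind a DDL that is itself stuck) matches the symptom described in T191875/T191892, which is why the theory "fits in result" even though the trigger remained unconfirmed.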
[16:09:41] ^ quit my furud screens [16:10:28] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change db2045 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425298 (https://phabricator.wikimedia.org/T191193) [16:11:06] !log Stop MySQL on db2045 (s8 codfw master) to move it to another rack, this will break replication on codfw - T191193 [16:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:13] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [16:16:37] (03PS1) 10Bstorm: wiki replicas: depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/425301 (https://phabricator.wikimedia.org/T181650) [16:17:17] 10Operations, 10Ops-Access-Requests: Access to the deployment hosts for Imarlier - https://phabricator.wikimedia.org/T191704#4120678 (10demon) Approved by Releng. [16:19:07] 10Operations, 10Ops-Access-Requests: Access to the deployment hosts for Imarlier - https://phabricator.wikimedia.org/T191704#4120686 (10RobH) I'll note that @demon is currently @greg's delegate while Greg is on vacation! (So it counts!) 
[16:20:19] (03CR) 10Marostegui: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10892/" [puppet] - 10https://gerrit.wikimedia.org/r/425301 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:21:01] !log Reload haproxy on dbproxy1010 to depool labsdb1011 [16:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:08] PROBLEM - Host cp2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:22:14] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Change db2045 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425298 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [16:22:16] (03PS1) 10Papaul: DNS: move db2045 from private1-c-codfw to private1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/425303 (https://phabricator.wikimedia.org/T191193) [16:22:20] (03PS1) 10RobH: adding imarlier to deployment [puppet] - 10https://gerrit.wikimedia.org/r/425304 (https://phabricator.wikimedia.org/T191704) [16:22:51] (03PS2) 10RobH: adding imarlier to deployment [puppet] - 10https://gerrit.wikimedia.org/r/425304 (https://phabricator.wikimedia.org/T191704) [16:23:24] (03CR) 10Marostegui: [C: 032] DNS: move db2045 from private1-c-codfw to private1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/425303 (https://phabricator.wikimedia.org/T191193) (owner: 10Papaul) [16:23:39] (03CR) 10RobH: [C: 032] adding imarlier to deployment [puppet] - 10https://gerrit.wikimedia.org/r/425304 (https://phabricator.wikimedia.org/T191704) (owner: 10RobH) [16:23:41] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2045 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425298 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [16:23:55] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2045 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425298 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [16:24:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: 
Change db2045 IP as it is being moved to another rack - T191193 (duration: 00m 59s) [16:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:06] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [16:26:08] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2045 IP as it is being moved to another rack - T191193 (duration: 00m 59s) [16:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:18] RECOVERY - Host cp2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [16:27:23] 10Operations, 10Ops-Access-Requests: Access to the deployment hosts for Imarlier - https://phabricator.wikimedia.org/T191704#4120709 (10RobH) [16:28:28] (03CR) 10Nuria: [wip] Puppetize cron job archiving old MaxMind databases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [16:28:32] 10Operations, 10Ops-Access-Requests: Access to the deployment hosts for Imarlier - https://phabricator.wikimedia.org/T191704#4114326 (10RobH) 05Open>03Resolved I've gone ahead and prepared/merged the patchset giving @Imarlier deployment access. @Imarlier: All the usual precautions apply. You now have the... 
[16:32:09] (03PS2) 10Andrew Bogott: labtestwiki: disable new account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425286 [16:33:44] (03PS2) 10Cmjohnson: Adding dns entries for ms-be1040-43 [dns] - 10https://gerrit.wikimedia.org/r/425274 (https://phabricator.wikimedia.org/T191896) [16:34:07] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for ms-be1040-43 [dns] - 10https://gerrit.wikimedia.org/r/425274 (https://phabricator.wikimedia.org/T191896) (owner: 10Cmjohnson) [16:36:28] PROBLEM - Host db2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:36:52] ^ expected [16:37:19] RECOVERY - Host cp2022 is UP: PING WARNING - Packet loss = 44%, RTA = 36.25 ms [16:37:28] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 66 ESP OK [16:37:28] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [16:37:28] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [16:37:29] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [16:37:29] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK [16:37:29] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK [16:37:29] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [16:37:29] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK [16:37:30] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK [16:37:30] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [16:37:31] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK [16:37:31] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK [16:37:38] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [16:37:38] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [16:37:38] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [16:37:38] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK [16:37:38] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK [16:37:38] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK 
[16:37:39] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [16:37:39] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK [16:37:59] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [16:37:59] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK [16:37:59] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [16:38:08] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [16:38:08] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK [16:38:08] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [16:38:09] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK [16:38:09] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK [16:38:09] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [16:38:09] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK [16:38:09] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK [16:38:18] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [16:38:18] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK [16:38:18] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK [16:38:18] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [16:39:28] PROBLEM - HTTPS Unified RSA on cp2022 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -376631 seconds left [16:39:28] PROBLEM - HTTPS Unified ECDSA on cp2022 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -376631 seconds left [16:39:39] PROBLEM - Freshness of zerofetch successful run file on cp2022 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old! [16:39:58] PROBLEM - Freshness of OCSP Stapling files on cp2022 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old! 
[16:41:08] (03PS4) 10Gehel: maps: add Java proxy to cleartables_sync cron [puppet] - 10https://gerrit.wikimedia.org/r/424247 (https://phabricator.wikimedia.org/T190193) [16:41:21] (03CR) 10Andrew Bogott: [C: 032] labtestwiki: disable new account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425286 (owner: 10Andrew Bogott) [16:42:09] (03CR) 10Gehel: [C: 032] maps: add Java proxy to cleartables_sync cron [puppet] - 10https://gerrit.wikimedia.org/r/424247 (https://phabricator.wikimedia.org/T190193) (owner: 10Gehel) [16:42:39] (03Merged) 10jenkins-bot: labtestwiki: disable new account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425286 (owner: 10Andrew Bogott) [16:42:54] (03CR) 10jenkins-bot: labtestwiki: disable new account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425286 (owner: 10Andrew Bogott) [16:45:20] !log andrew@tin Synchronized wmf-config/CommonSettings.php: disable new accounts on labtestwikitech (duration: 01m 00s) [16:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:28] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:18] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [16:51:48] PROBLEM - HTTPS Unified ECDSA on cp2022 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -377372 seconds left [16:51:48] PROBLEM - HTTPS Unified RSA on cp2022 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -377372 seconds left [16:51:48] PROBLEM - Freshness of zerofetch successful run file on cp2022 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old! [16:52:08] PROBLEM - Freshness of OCSP Stapling files on cp2022 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old! [16:54:16] Question about graphite: is there an appropriate endpoint for pickle metrics? 
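On the graphite question above: carbon's pickle receiver (conventionally on port 2004) expects each message framed as a 4-byte big-endian length header followed by a pickled list of (path, (timestamp, value)) tuples. A minimal sketch of building such a payload follows; the metric name is made up, and the send is left commented out since whether graphite-in.eqiad.wmnet actually listens on 2004 is exactly what was being asked.

```python
import pickle
import socket
import struct
import time

def carbon_pickle_payload(metrics):
    """Frame metrics for carbon's pickle protocol: a pickled list of
    (path, (timestamp, value)) tuples, prefixed with a 4-byte
    big-endian length header."""
    payload = pickle.dumps(metrics, protocol=2)
    return struct.pack("!L", len(payload)) + payload

# Hypothetical metric for illustration.
metrics = [("test.ops.demo", (int(time.time()), 42.0))]
message = carbon_pickle_payload(metrics)

# Sending is commented out: the endpoint below is the one whose
# reachability was in question.
# with socket.create_connection(("graphite-in.eqiad.wmnet", 2004), timeout=5) as s:
#     s.sendall(message)
```

The plaintext line protocol (port 2003, `path value timestamp\n`) is the usual fallback if the pickle port turns out to be filtered.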
graphite-in.eqiad.wmnet:2004 doesn't appear to be listening, or is filtered. [16:54:28] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 10480.46 ms [16:54:38] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 36.57 ms [16:56:49] PROBLEM - Freshness of zerofetch successful run file on cp2022 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old! [16:56:49] (03PS1) 10Vgutierrez: Revert "install_server: Reimage lvs5003 as stretch" [puppet] - 10https://gerrit.wikimedia.org/r/425312 [16:57:09] PROBLEM - Freshness of OCSP Stapling files on cp2022 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old! [16:57:12] !log starting branch cut of 1.31.0-wmf.29 [16:57:14] 10Operations, 10ops-codfw, 10Traffic: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4120807 (10RobH) Ok, So I just took this over from Papaul. He replaced the bad memory on the A side earlier today, but after just clearing the log and rebooting, we have more memory errors: Record:... [16:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:38] PROBLEM - Check systemd state on cp2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:58:02] (03CR) 10Vgutierrez: [C: 032] Revert "install_server: Reimage lvs5003 as stretch" [puppet] - 10https://gerrit.wikimedia.org/r/425312 (owner: 10Vgutierrez) [16:59:38] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [16:59:45] everything ok with db2045, can I help? [16:59:48] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [16:59:58] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180410T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:07] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:00:18] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:00:18] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:00:27] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:00:27] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:00:37] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:00:44] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4120819 (10awight) >>! In T181071#4119135, @mmodell wrote: >>>! In T181071... [17:01:32] halfak: fyi I’m troubleshooting the blocker from yesterday by making some test deployments to ores1001. [17:02:58] RECOVERY - Freshness of zerofetch successful run file on cp2022 is OK: OK [17:04:16] (03CR) 10Herron: [C: 031] "Looks good to me. After merging I think we should squelch icinga and do some stress testing to make sure the master behaves as expected u" [puppet] - 10https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) (owner: 10Filippo Giunchedi) [17:05:17] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp2022_v4 [17:05:22] !log awight@tin Started deploy [ores/deploy@1e18fa6]: Test deploy virtualenv on ores1001, with logging [17:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:07] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. 
[17:07:27] moritzm: anything to do? ^^^ [17:07:50] !log awight@tin Finished deploy [ores/deploy@1e18fa6]: Test deploy virtualenv on ores1001, with logging (duration: 02m 28s) [17:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:42] (03CR) 10Joal: [C: 031] "LGTM !" [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [17:11:47] RECOVERY - Host db2045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [17:12:00] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4120845 (10Cmjohnson) [17:14:42] (03PS2) 10Ottomata: Use spark2 for Refine job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) [17:15:06] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4120855 (10Cmjohnson) @ayounsi These are racked in 10G racks and I would like to utilize the new switches....Can you assign ports please 1040 A7 u29/30 ( probably something in the xe-7/0/... [17:15:22] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4120858 (10awight) @mmodell: We're running the fetch check with "bash -x"... 
[17:16:03] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120861 (10Marostegui) [17:16:47] RECOVERY - Check systemd state on cp2022 is OK: OK - running: The system is fully operational [17:17:38] !log awight@tin Started deploy [ores/deploy@d35a1e6]: Test deploy virtualenv on ores1001, with logging and forced failure [17:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:21] !log awight@tin Finished deploy [ores/deploy@d35a1e6]: Test deploy virtualenv on ores1001, with logging and forced failure (duration: 02m 44s) [17:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:49] 10Operations, 10ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4120892 (10Vgutierrez) [17:22:49] !log shutting down cp2022 for main board replacement [17:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:34] (03Abandoned) 10Herron: puppet: change codfw puppet masters to use eqiad puppetdb server [puppet] - 10https://gerrit.wikimedia.org/r/395028 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [17:24:37] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [17:24:47] (03Abandoned) 10Herron: puppet: depool (via firewall) codfw puppetmaster for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/385976 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [17:26:18] 10Operations, 10DNS, 10Mail, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4120910 (10herron) 05Open>03stalled [17:26:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4120912 (10Papaul) moved db2045 from C6 to B3 in racktables Please update task with next server we need to move next week. 
thanks [17:28:52] PROBLEM - Host cp2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:29:43] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:29:52] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:29:52] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:29:52] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:29:52] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:29:52] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:29:53] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:29:53] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:29:53] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:29:53] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:02] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:30:02] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:30:02] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:30:02] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:30:02] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:03] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:03] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 
[17:30:03] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:03] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:05] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:05] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:05] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:12] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:12] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:12] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:30:12] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:30:12] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:12] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:22] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:22] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:22] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:22] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:22] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:30:42] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:30:42] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: 
cp2022_v4, cp2022_v6 [17:31:33] (03PS2) 10ArielGlenn: remove dumps web server from dataset1001 and ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/425234 (https://phabricator.wikimedia.org/T182540) [17:32:38] (03CR) 10ArielGlenn: [C: 032] remove dumps web server from dataset1001 and ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/425234 (https://phabricator.wikimedia.org/T182540) (owner: 10ArielGlenn) [17:35:38] (03CR) 10Smalyshev: [C: 031] wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [17:35:51] (03PS2) 10ArielGlenn: turn off public dumps mirror rsync access to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/425246 (https://phabricator.wikimedia.org/T182540) [17:36:44] (03CR) 10ArielGlenn: [C: 032] turn off public dumps mirror rsync access to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/425246 (https://phabricator.wikimedia.org/T182540) (owner: 10ArielGlenn) [17:37:02] PROBLEM - HTTP on ms1001 is CRITICAL: connect to address 208.80.154.16 and port 80: Connection refused [17:38:02] PROBLEM - HTTP on dataset1001 is CRITICAL: connect to address 208.80.154.11 and port 80: Connection refused [17:39:29] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4120939 (10awight) There's a wealth of surprising results on tin, /srv/dep... 
[17:40:15] (03PS3) 10Ottomata: Use spark2 for Refine job and banner-streaming job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) [17:40:52] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:43:07] !log add static route to neutron poc instance range for codfw 172.16.128.0/21 [17:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:22] Hi ops-team - We're about to deploy analytics hadoop patches [17:45:50] !log joal@tin Started deploy [analytics/refinery@b8ea97f]: Analytics weekly deploy - Move to spark 2 [17:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:28] (03PS3) 10Ottomata: Install the Spark 2 yarn shuffle service jar over Spark 1's [puppet] - 10https://gerrit.wikimedia.org/r/424593 (https://phabricator.wikimedia.org/T159962) [17:47:32] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms [17:47:39] !log joal@tin (no justification provided) [17:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:10] !log joal@tin (no justification provided) [17:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:55] Please excuse me for the spam - Wrong command (deploy-log, not log) [17:49:46] !log joal@tin Finished deploy [analytics/refinery@b8ea97f]: Analytics weekly deploy - Move to spark 2 (duration: 03m 55s) [17:49:51] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4120980 (10Marostegui) 05Open>03Resolved This is all good now! Thanks ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldri...
[17:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:42] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4121004 (10Cmjohnson) [17:53:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom spare server caesium - https://phabricator.wikimedia.org/T191358#4121002 (10Cmjohnson) 05Open>03Resolved disk removed from rack, added to spreadsheet, removed from racktables [17:55:08] 10Operations, 10Wikimedia-Apache-configuration, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063273 (10Dzahn) I investigated a bit on the part ".. on mwdebug1001 and mwdebug1002, .. behaves differently on... [17:56:22] RECOVERY - Host cp2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms [17:56:35] !log otto@tin Started deploy [analytics/refinery@b8ea97f]: refinery 0.0.60 - take 2^ [17:56:37] !log otto@tin Started deploy [analytics/refinery@b8ea97f]: refinery 0.0.60 - take 2 [17:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:28] !log otto@tin Finished deploy [analytics/refinery@b8ea97f]: refinery 0.0.60 - take 2 (duration: 01m 50s) [17:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:28] !log otto@tin Started deploy [analytics/refinery@b8ea97f]: refinery 0.0.60 - take 3 [17:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180410T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4121049 (10Cmjohnson) [18:00:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom spare server iodine - https://phabricator.wikimedia.org/T191359#4121047 (10Cmjohnson) 05Open>03Resolved removed from rack, spreadsheet and racktables updated [18:02:33] (03CR) 10Ottomata: [C: 032] Install the Spark 2 yarn shuffle service jar over Spark 1's [puppet] - 10https://gerrit.wikimedia.org/r/424593 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [18:02:57] 10Operations, 10Wikimedia-Apache-configuration, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4121053 (10Dzahn) The version of apache2.conf that canaries and mwdebug has matches the puppet repo template: me... [18:03:24] Krinkle: the reason why you get different results on mw1299 is because it's a jobrunner and not a regular appserver [18:03:27] https://phabricator.wikimedia.org/T190111#4121012 [18:03:37] (03PS1) 10Cmjohnson: Reoving mgmt dns for osmium [dns] - 10https://gerrit.wikimedia.org/r/425326 (https://phabricator.wikimedia.org/T175093) [18:03:38] that means it is getting a different apache2.conf in the end [18:04:12] (03PS2) 10Cmjohnson: Reoving mgmt dns for osmium [dns] - 10https://gerrit.wikimedia.org/r/425326 (https://phabricator.wikimedia.org/T175093) [18:04:23] !log otto@tin Finished deploy [analytics/refinery@b8ea97f]: refinery 0.0.60 - take 3 (duration: 04m 54s) [18:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:30] (03CR) 10Cmjohnson: [C: 032] Reoving mgmt dns for osmium [dns] - 10https://gerrit.wikimedia.org/r/425326 (https://phabricator.wikimedia.org/T175093) (owner: 10Cmjohnson) [18:04:46] Krinkle: if you do the same with mw1267 or mw1261 .. 
you should get results like on mwdebug [18:06:42] jouncebot: next [18:06:43] In 0 hour(s) and 53 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180410T1900) [18:07:12] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.32.133:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.32.133:6443}[5m]))): 85722.05804111244 = 50000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:07:22] PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.40:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.40:6443}[5m]))): 50140.899598393575 = 50000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:07:34] !log Stopping coal on graphite1001 to manually repopulate for T191239 [18:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:41] T191239: coal metrics changed after deploying new code - https://phabricator.wikimedia.org/T191239 [18:08:12] RECOVERY - etcd request latencies on argon is OK: OK - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.32.133:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:08:23] RECOVERY - etcd request latencies on neon is OK: OK - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.40:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.40:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:09:41] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - 
https://phabricator.wikimedia.org/T175093#4121068 (10Cmjohnson) [18:10:03] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4121073 (10Cmjohnson) [18:10:06] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Cmjohnson) 05Open>03Resolved [18:10:15] !log thcipriani@tin Started scap: testwiki to php-1.31.0-wmf.29 and rebuild l10n cache [18:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:32] PROBLEM - puppet last run on analytics1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:33] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:33] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:33] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:33] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:42] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [18:11:42] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 16%, RTA = 36.07 ms [18:11:43] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:43] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:43] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:11:43] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [18:11:43] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK [18:11:44] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK [18:11:44] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK [18:11:45] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK [18:11:45] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK [18:11:46] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [18:11:52] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:52] PROBLEM - puppet last run on analytics1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:52] PROBLEM - puppet last run on analytics1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:52] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK [18:11:52] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK [18:11:53] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK [18:11:53] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [18:12:02] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [18:12:02] PROBLEM - puppet last run on analytics1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:12:02] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:12:02] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:12:03] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [18:12:03] PROBLEM - puppet last run on analytics1075 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:12:03] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:12:04] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK [18:12:04] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK [18:12:12] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [18:12:12] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [18:12:12] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK [18:12:12] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [18:12:12] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [18:12:13] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [18:12:13] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [18:12:13] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK [18:12:22] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [18:12:22] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [18:12:22] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK [18:12:22] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK [18:12:22] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [18:12:22] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK [18:12:23] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK [18:12:23] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [18:12:23] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK [18:12:32] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK [18:12:32] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [18:12:32] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [18:12:32] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK [18:12:33] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 66 ESP OK [18:12:33] RECOVERY - IPsec on 
cp1073 is OK: Strongswan OK - 66 ESP OK [18:12:33] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [18:12:42] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK [18:12:42] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK [18:12:42] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK [18:12:42] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK [18:13:12] PROBLEM - HTTPS Unified RSA on cp2022 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -382256 seconds left [18:13:53] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [18:13:53] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK [18:14:03] PROBLEM - Freshness of OCSP Stapling files on cp2022 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old! [18:14:12] PROBLEM - HTTPS Unified ECDSA on cp2022 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -382316 seconds left [18:14:24] (03PS1) 10Cmjohnson: Removing mgmt dns nobelium [dns] - 10https://gerrit.wikimedia.org/r/425330 (https://phabricator.wikimedia.org/T191363) [18:15:30] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4121085 (10Dzahn) per the last ops meeting and joe's comments: - reinstall it one more time. back to stretch instead of jessie [18:15:50] 10Operations, 10ops-codfw, 10Traffic: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121087 (10Papaul) @BBlack we replaced the main board on cp2022 and the new NIC MAC address is:44:A8:42:2D:1E:80 I asked Dell tech to leave the memory for the other 3 servers cp2008, cp2011 and cp20... 
[18:16:03] (03PS2) 10Cmjohnson: Removing mgmt dns nobelium [dns] - 10https://gerrit.wikimedia.org/r/425330 (https://phabricator.wikimedia.org/T191363) [18:16:33] RECOVERY - puppet last run on analytics1063 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:33] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:33] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:33] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:33] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:42] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:42] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:42] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:52] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:52] RECOVERY - puppet last run on analytics1077 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:52] RECOVERY - puppet last run on analytics1074 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:57] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns nobelium [dns] - 10https://gerrit.wikimedia.org/r/425330 (https://phabricator.wikimedia.org/T191363) (owner: 10Cmjohnson) [18:17:02] RECOVERY - puppet last run on analytics1068 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:17:02] RECOVERY - puppet last run on analytics1061 is OK: 
OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:17:02] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:17:02] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:17:03] RECOVERY - puppet last run on analytics1075 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:17:21] (03PS1) 10Dzahn: deploy1001: reinstall with stretch instead of jessie [puppet] - 10https://gerrit.wikimedia.org/r/425331 (https://phabricator.wikimedia.org/T175288) [18:18:54] 10Operations, 10ops-codfw, 10Traffic: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121099 (10Papaul) Note: there is no need to re image the server because the MAC address is the same 44:A8:42:2D:1E:80; [18:19:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4121105 (10Cmjohnson) [18:19:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom spare server nobelium/wmf4543 - https://phabricator.wikimedia.org/T191363#4121103 (10Cmjohnson) 05Open>03Resolved dns removed...removed from rack, everything updated [18:21:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom spare server caesium - https://phabricator.wikimedia.org/T182805#4121113 (10Cmjohnson) [18:21:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom spare server caesium - https://phabricator.wikimedia.org/T182805#3834960 (10Cmjohnson) 05Open>03Resolved [18:23:56] (03PS1) 10Cmjohnson: Removing mgmt dns for wdqs1001/2 [dns] - 10https://gerrit.wikimedia.org/r/425333 (https://phabricator.wikimedia.org/T175595) [18:25:34] (03PS2) 10Cmjohnson: Removing mgmt dns for wdqs1001/2 [dns] - 10https://gerrit.wikimedia.org/r/425333 (https://phabricator.wikimedia.org/T175595) 
[18:27:32] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for wdqs1001/2 [dns] - 10https://gerrit.wikimedia.org/r/425333 (https://phabricator.wikimedia.org/T175595) (owner: 10Cmjohnson) [18:31:16] RECOVERY - Freshness of OCSP Stapling files on cp2022 is OK: OK [18:31:17] RECOVERY - HTTPS Unified ECDSA on cp2022 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345570 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 225 days) [18:31:17] RECOVERY - HTTPS Unified RSA on cp2022 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345570 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 225 days) [18:33:31] 10Operations, 10ops-eqiad, 10DBA: Rack and setup 8 new eqiad DBs - https://phabricator.wikimedia.org/T191792#4121164 (10Marostegui) [18:36:54] (03CR) 10Ottomata: [C: 032] Use spark2 for Refine job and banner-streaming job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [18:37:00] (03PS4) 10Ottomata: Use spark2 for Refine job and banner-streaming job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) [18:38:13] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4121210 (10BBlack) [18:38:20] 10Operations, 10ops-codfw, 10Traffic: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121208 (10BBlack) 05Open>03Resolved all green in icinga now and repooled, closing! [18:42:06] (03PS1) 10Herron: puppet: disable color output in puppet log /var/log/puppet.log [puppet] - 10https://gerrit.wikimedia.org/r/425335 [18:42:07] Krinkle: since you're awake do you have context on: https://gerrit.wikimedia.org/r/#/c/425026/ ? 
The reason I'm asking is that train today is going to take 45 minutes to regenerate l10n files since we're stuck on hhvm (this takes like 10 minutes on php5).
[18:47:19] (03CR) 10Dzahn: "Gehel, good point! in that case i will abandon this. using the package is the better option." [puppet] - 10https://gerrit.wikimedia.org/r/425227 (https://phabricator.wikimedia.org/T185504) (owner: 10Dzahn)
[18:47:21] (03Abandoned) 10Dzahn: icinga: import check_postgres.pl [puppet] - 10https://gerrit.wikimedia.org/r/425227 (https://phabricator.wikimedia.org/T185504) (owner: 10Dzahn)
[18:47:29] 10Operations, 10ops-eqsin, 10Traffic: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121254 (10Vgutierrez)
[18:47:33] thcipriani: What I know is what we agreed months ago: that we're migrating to php7 or hhvm, that php7 isn't ready yet, that it seems all use of PHP5 had been removed (aside from mwscript), and that a recent deployment for an ICU upgrade made it infeasible to keep maintaining PHP5 compat, so the switch was flipped and is not reversible at this point.
[18:47:45] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PosgreSQL - https://phabricator.wikimedia.org/T185504#4121257 (10Dzahn) Thanks Gehel for pointing that out. Using the existing package is the better option. Abandoned.
[18:47:50] Doing so now or bypassing it would likely cause widespread and unpredictable corruption.
[18:48:02] In a way that is not going to be detected by logstash or anything else we have.
[18:48:15] (03CR) 10Dzahn: [C: 032] deploy1001: reinstall with stretch instead of jessie [puppet] - 10https://gerrit.wikimedia.org/r/425331 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn)
[18:48:38] (I was going to say something, but it's roughly what Krinkle said, as I understand from a meeting yesterday - do not undo 425026 or risk data peril)
[18:49:36] well shoot. OK, what's the timeline for how long we're in this state? I.e.
no php7, stuck on hhvm?
[18:49:45] (for mwscript)
[18:49:50] thcipriani: I would recommend reaching out to Chad, who might know of a way to cut down that time. Last I checked, the main reason hhvm would take so long is due to use (or non-use) of its compilation cache, which has a CLI switch.
[18:50:15] I forget which way around the problem was.
[18:50:29] But I doubt it's hhvm itself generally being slower, which would make no sense :)
[18:50:45] some stats cache thing is the root of the issue IIRC
[18:51:27] so who can I talk to about timeline?
[18:52:55] that is, if we're on php7 next week, there's probably not a lot of point in pinpointing the problem and working around it in scap; if we're stuck here for a while, we'll have to do something, otherwise all full syncs take like 40 minutes longer
[18:53:10] i am reinstalling deploy1001 again with stretch right now (after we first went back to jessie) so that we can get rid of tin
[18:53:54] ah cool, that's good :)
[18:54:20] thcipriani: afaict it's "a week or 2" .. but that's really afaict and please ask j.oe as well
[18:54:44] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4121285 (10Sharvaniharan) I am unable to ssh into releases1001 or stat1006.eqiad.wmnet after the changes.. @MarcoAurelio @MoritzMuehlenhoff
[18:55:06] mutante: ok, I'll message him, thanks!
[18:58:42] From what I know, there are currently 4 groups of php7 issues: 1) profiling (doesn't affect cli), 2) memcached (affects everything, being worked on), 3) some code paths relating to dumps and possibly anything else, 4) random libs and extensions still failing their Jenkins job on php7 for unknown reasons.
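Krinkle's "compilation cache ... which has a CLI switch" remark above most plausibly refers to HHVM's bytecode repo. The following ini fragment is a sketch under that assumption only; the log never names the exact switch, and the cache path here is illustrative, not a path from this infrastructure:

```ini
; Assumption: HHVM's local bytecode repo is the cache in question.
; A writable local repo lets repeated CLI runs (such as an l10n
; rebuild) reuse previously compiled bytecode instead of recompiling.
hhvm.repo.local.mode = rw
; Illustrative path only:
hhvm.repo.local.path = /var/cache/hhvm/cli.hhbc
```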
[18:59:18] All tracked at https://phabricator.wikimedia.org/tag/php_7.0_support/ and https://phabricator.wikimedia.org/T172165
[18:59:36] thcipriani: one of the main issues that was found is compatibility of data saved in memcached between hhvm and php7
[18:59:38] issues #2-4 are imho blocking for scap to use it via a CLI script.
[19:00:04] thcipriani: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180410T1900).
[19:00:04] No GERRIT patches in the queue for this window AFAICS.
[19:02:32] 10Operations, 10ops-eqsin, 10Traffic: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121313 (10Volans) Reporting it here too for the future, to fix it's sufficient to replace the `--diff` of the above command with `--commit` and then re-run the `--diff` to ensure that this t...
[19:02:38] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4113459 (10Dzahn) @Sharvaniharan I checked on releases1001 but i don't see any failed attempts to login with your user. It looks like it's already failing before that at the con...
[19:03:10] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4121316 (10Dzahn) 05Resolved>03Open
[19:03:22] Krinkle: ok, so wait, you're saying that there are other issues beyond just swapping servers that will prevent scap from moving to php7, so we should work to determine the hhvm fix inside scap?
[19:03:57] also, if there are now dire consequences to using php5 as PHP in mwscript, should that script stop you from doing that?
[19:04:39] [[ "$PHP" == "php5" ]] && { echo "I'm afraid I can't do that, Dave"; exit 1; }
[19:04:48] thcipriani: I don't know about all of that, but I do know that I wouldn't even know where to begin confirming whether or not localisation rebuild is and will work correctly on php7 presently. It would be more unpredictable than the nightly l10nupdate.
[19:04:50] There was a change to mwscript to try and detect if it's on php5 or php7, which was then reverted again.
[19:06:08] Maybe if it ran in a firewalled container with no network access (assuming localisation rebuild works under those conditions).
[19:06:17] ... but it probably doesn't.
[19:06:30] which is exactly why it would likely cause cascading failures.
[19:06:39] starting with corrupting random memcached values.
[19:07:04] +1 for making /usr/bin/php5 an alias to fail across the fleet.
[19:07:10] * /bin/false
[19:07:14] 10Operations, 10ops-eqsin, 10Traffic: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121328 (10Vgutierrez) p:05Triage>03Normal
[19:07:33] and probably php7 as well, on mw-related nodes.
[19:07:42] FWIW I straced out what scap does yesterday; most of the writing it does is just to files on disk. It talks to etcd to get config values, but that's really about it. It doesn't write to very much.
[19:07:47] https://gist.github.com/thcipriani/3ec50480b0d5b1c1a3fcb7535c5aade1
[19:08:31] thcipriani: This is MediaWiki we're talking about. I could write a 400-page book about what it does before it even starts doing what you just described.
[19:08:55] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4121333 (10Sharvaniharan) ``` ## Use bastion-eqiad.wmflabs.org as proxy to labs Host bastlabs HostName bastion-eqiad.wmflabs.org User sharan IdentityFile ~/.ssh/id_rsa Host *.eqi...
[19:10:03] heh, this is true, I started trying to dig down through some layers to see what each mwscript is doing, and then gave up and straced it and all its subprocesses because of the 400 page book thing :)
[19:10:25] I agree that in terms of meaningful/important stuff it does, you nailed it.
[19:11:21] * hashar preorders the book
[19:11:24] But it won't get to that without establishing some database connections, probably instantiating a few Title and User objects, reading and writing to various memcached keys, possibly even a master write query in there somewhere because it detected something and decided to work on it, perhaps triggering a few deferred updates and jobs while at it, all innocent.
[19:11:54] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4121339 (10mmodell) >>! In T181071#4120939, @awight wrote: > It would be g...
[19:12:29] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4121341 (10mmodell)
[19:13:07] hashar: I'll have to disappoint you :) The 400 pages mostly represent its complexity, not my wisdom. I'd have to go insane and be imprisoned before I'd write it.
[19:13:53] In fact, I probably couldn't write it. But quite possibly, if we all work together, we could maybe actually write it. We'd never finish it, but in theory, if it were to be complete, it'd be 400 pages.
[19:16:24] ok, so from this discussion I guess the current plan of action will be to figure out how to run scap sync using mwscript with hhvm without it taking a huge amount of additional time (possibly achievable via command line flags) and in the interim no l10nupdates during SWAT (for realz this time).
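The guard floated in the channel above (the one-liner at 19:04:39, and the "+1 for making /usr/bin/php5 an alias to fail" suggestion) could be sketched as a shell function. This is purely illustrative of the idea discussed, not something deployed in this log; note that bash's `{ list; }` grouping requires a `;` before the closing `}`, which the pasted one-liner omitted:

```shell
#!/bin/sh
# Hypothetical php5 guard for a maintenance wrapper. Refuses to run
# when the selected runtime is php5, since (per the discussion above)
# running MediaWiki under php5 now risks corrupting memcached values.
check_runtime() {
  if [ "$PHP" = "php5" ]; then
    echo "I'm afraid I can't do that, Dave: php5 risks memcached corruption." >&2
    return 1
  fi
  return 0
}

# Example: blocked under php5, allowed under hhvm.
PHP=php5
check_runtime 2>/dev/null || echo "blocked php5"
PHP=hhvm
check_runtime && echo "hhvm allowed"
```

A blunter variant of the same idea is the one proposed in-channel: point /usr/bin/php5 at /bin/false fleet-wide so every invocation fails immediately.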
[19:16:44] !log thcipriani@tin Finished scap: testwiki to php-1.31.0-wmf.29 and rebuild l10n cache (duration: 66m 28s) [19:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:34] thcipriani: Aye, yeah. Feel free to CC me on an actionable, I can help move this along. [19:21:21] thcipriani: Also, the switch from CDB to PHP would likely help. Depending on how long it'll take for others to finalise php7, you might have a few weeks to work on that, wouldn't conflict. [19:21:46] I think chad was testing that on beta some weeks ago. not sure if there's a blocker or not. [19:31:40] I had issues with it [19:31:45] I gave up [19:31:47] hours/day [19:38:40] (03PS1) 10Herron: puppet-merge: continue despite errors during remote/ssh stage [puppet] - 10https://gerrit.wikimedia.org/r/425339 [19:40:39] (03PS1) 10Thcipriani: Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425341 [19:43:31] (03CR) 10Thcipriani: [C: 032] Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425341 (owner: 10Thcipriani) [19:45:02] (03Merged) 10jenkins-bot: Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425341 (owner: 10Thcipriani) [19:48:21] !log sbisson@tin Started deploy [tilerator/deploy@3326c14]: Deploying tilerator pre-i18n to maps-test* [19:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:48] !log sbisson@tin Finished deploy [tilerator/deploy@3326c14]: Deploying tilerator pre-i18n to maps-test* (duration: 00m 27s) [19:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:15] (03CR) 10jenkins-bot: Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425341 (owner: 10Thcipriani) [19:55:24] (03PS1) 10Thcipriani: Revert "Group0 to 1.31.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425343 [19:56:20] !log sbisson@tin Started deploy [tilerator/deploy@3326c14]: 
Deploying tilerator pre-i18n everywhere [19:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:08] !log sbisson@tin Finished deploy [tilerator/deploy@3326c14]: Deploying tilerator pre-i18n everywhere (duration: 00m 48s) [19:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:19] !log sbisson@tin Started deploy [kartotherian/deploy@6e4d666]: Deploying kartotherian pre-i18n everywhere [19:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:55] (03CR) 10Thcipriani: [C: 032] Revert "Group0 to 1.31.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425343 (owner: 10Thcipriani) [20:02:32] (03Merged) 10jenkins-bot: Revert "Group0 to 1.31.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425343 (owner: 10Thcipriani) [20:02:53] !log sbisson@tin Finished deploy [kartotherian/deploy@6e4d666]: Deploying kartotherian pre-i18n everywhere (duration: 04m 34s) [20:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:39] RECOVERY - Long running screen/tmux on furud is OK: OK: No SCREEN or tmux processes detected. 
[20:06:43] !log thcipriani@tin rebuilt and synchronized wikiversions files: testwiki back to 1.31.0-wmf.28 [20:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:16] 10Operations, 10DBA, 10MediaWiki-Page-deletion, 10Patch-For-Review, 10Wikimedia-Incident: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4121476 (10Peachey88) [20:13:08] (03CR) 10jenkins-bot: Revert "Group0 to 1.31.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425343 (owner: 10Thcipriani) [20:13:29] !log increasing sample change-prop sample rate to 20% (from 10) in dev environment -- T186751 [20:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:35] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [20:16:00] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:00] PROBLEM - MD RAID on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:00] PROBLEM - Confd template for /etc/dsh/group/jobrunner on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:00] PROBLEM - nutcracker port on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:09] PROBLEM - confd service on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:10] PROBLEM - dhclient process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:10] PROBLEM - Check whether ferm is active by checking the default input chain on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:19] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:19] PROBLEM - nutcracker process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:19] PROBLEM - configured eth on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:20] PROBLEM - Check systemd state 
on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:20] PROBLEM - Confd template for /etc/dsh/group/maps on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:30] PROBLEM - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:30] PROBLEM - Check size of conntrack table on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:30] PROBLEM - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:40] PROBLEM - Confd template for /etc/dsh/group/parsoid on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:40] PROBLEM - Disk space on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:40] PROBLEM - DPKG on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:49] PROBLEM - Confd template for /etc/dsh/group/cassandra on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:49] PROBLEM - Unmerged changes on repository mediawiki_config on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:16:50] PROBLEM - Confd template for /etc/dsh/group/ores on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:18:09] PROBLEM - puppet last run on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:24:29] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is CRITICAL: Return code of 255 is out of bounds [20:26:53] awwww.. i reinstalled deploy1001 just now [20:26:55] checking [20:26:59] it broken.. :/ [20:27:13] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMC] (Node ffff8a7bdf5ae230), AE_NOT_EXIST [20:27:16] bah [20:28:34] actually, after that it booted nevertheless and is sitting at login.. 
just has to be added to puppet [20:28:49] (03PS5) 10Ottomata: Use spark2 for Refine job and banner-streaming job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) [20:29:24] (03CR) 10jerkins-bot: [V: 04-1] Use spark2 for Refine job and banner-streaming job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [20:30:47] !log deploy1001 - reinstalled with jessie - re-adding to puppet (T175288) [20:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:54] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [20:30:57] !log deploy1001 - reinstalled with stretch - re-adding to puppet (T175288) [20:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:32] (03PS6) 10Ottomata: Use spark2 for Refine job and banner-streaming job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) [20:34:35] (03CR) 10Ottomata: [C: 032] Use spark2 for Refine job and banner-streaming job [puppet] - 10https://gerrit.wikimedia.org/r/425289 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [20:37:11] (03PS1) 10Ottomata: Blacklist jobqueue topics for main -> jumbo mirrormaker (again) [puppet] - 10https://gerrit.wikimedia.org/r/425410 (https://phabricator.wikimedia.org/T189464) [20:37:17] !log sbisson@tin Started deploy [kartotherian/deploy@bdf70ed]: Deploying kartotherian pre-i18n everywhere (downgrade snapshot) [20:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:09] PROBLEM - Check the NTP synchronisation status of timesyncd on deploy1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:40:19] PROBLEM - IPMI Sensor Status on deploy1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[20:40:46] (03CR) 10Ottomata: [C: 032] Blacklist jobqueue topics for main -> jumbo mirrormaker (again) [puppet] - 10https://gerrit.wikimedia.org/r/425410 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [20:41:02] !log sbisson@tin Finished deploy [kartotherian/deploy@bdf70ed]: Deploying kartotherian pre-i18n everywhere (downgrade snapshot) (duration: 03m 45s) [20:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:45] PROBLEM - configured eth on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:25] PROBLEM - Confd template for /etc/dsh/group/ores on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:25] PROBLEM - dhclient process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:35] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:35] PROBLEM - Confd template for /etc/dsh/group/jobrunner on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:35] PROBLEM - nutcracker port on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:35] PROBLEM - MD RAID on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:45] PROBLEM - Confd template for /etc/dsh/group/maps on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:46] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:07:46] PROBLEM - nutcracker process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:04] PROBLEM - Check size of conntrack table on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:04] PROBLEM - confd service on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:04] PROBLEM - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:05] PROBLEM - Confd template for 
/etc/dsh/group/zotero-translation-server on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:05] PROBLEM - puppet last run on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:14] PROBLEM - Confd template for /etc/dsh/group/parsoid on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:14] PROBLEM - Confd template for /etc/dsh/group/cassandra on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:14] PROBLEM - DPKG on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:08:14] PROBLEM - Disk space on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:12:17] !log sbisson@tin Started deploy [kartotherian/deploy@8f3a903]: Rollback kartotherian to v0.0.35 [21:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:05] (03PS6) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [21:13:07] (03PS4) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [21:13:09] (03PS6) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [21:13:11] (03PS1) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [21:13:21] (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:13:23] (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:13:25] (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:13:27] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:15:14] PROBLEM - Unmerged changes on repository mediawiki_config on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:16:04] PROBLEM - Check systemd state on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:16:25] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:18:44] !log sbisson@tin Finished deploy [kartotherian/deploy@8f3a903]: Rollback kartotherian to v0.0.35 (duration: 06m 27s) [21:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:34] PROBLEM - Confd template for /etc/dsh/group/ores on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:34] PROBLEM - dhclient process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:44] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:44] PROBLEM - MD RAID on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:44] PROBLEM - Confd template for /etc/dsh/group/jobrunner on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:44] PROBLEM - nutcracker port on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:45] PROBLEM - Check whether ferm is active by checking the default input chain on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:54] PROBLEM - Confd template for /etc/dsh/group/maps on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:54] PROBLEM - configured eth on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:20:54] PROBLEM - nutcracker process on deploy1001 is CRITICAL: 
Return code of 255 is out of bounds [21:20:54] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:04] PROBLEM - confd service on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:04] PROBLEM - Check systemd state on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:04] PROBLEM - Check size of conntrack table on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:04] PROBLEM - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:05] PROBLEM - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:14] PROBLEM - Confd template for /etc/dsh/group/parsoid on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:14] PROBLEM - Confd template for /etc/dsh/group/cassandra on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:14] PROBLEM - Disk space on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:14] PROBLEM - DPKG on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:21:16] ^ yes yes... 
it will change soon [21:21:19] still on that [21:21:34] and i kind of want to see them recover [21:24:03] (03PS1) 10Rush: openstack: apt proxy allowance for instances [puppet] - 10https://gerrit.wikimedia.org/r/425418 (https://phabricator.wikimedia.org/T188266) [21:25:59] (03CR) 10Rush: [C: 032] openstack: apt proxy allowance for instances [puppet] - 10https://gerrit.wikimedia.org/r/425418 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:26:19] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4121650 (10thcipriani) [21:26:46] (03PS2) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [21:26:58] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:28:14] PROBLEM - Unmerged changes on repository mediawiki_config on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:35:21] (03PS1) 10Andrew Bogott: bootstrap firstboot: use an apt proxy on labtest [puppet] - 10https://gerrit.wikimedia.org/r/425420 [21:35:23] (03PS1) 10Andrew Bogott: Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421 [21:35:36] (03CR) 10Andrew Bogott: [C: 04-2] "Not yet!" 
[puppet] - 10https://gerrit.wikimedia.org/r/425421 (owner: 10Andrew Bogott) [21:36:29] (03PS2) 10Andrew Bogott: bootstrap firstboot: use an apt proxy on labtest [puppet] - 10https://gerrit.wikimedia.org/r/425420 [21:36:31] (03PS2) 10Andrew Bogott: Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421 [21:38:34] (03PS3) 10Andrew Bogott: bootstrap firstboot: use an apt proxy on labtest [puppet] - 10https://gerrit.wikimedia.org/r/425420 [21:38:36] (03PS3) 10Andrew Bogott: Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421 [21:41:50] (03PS4) 10Andrew Bogott: bootstrap firstboot: use an apt proxy on labtest [puppet] - 10https://gerrit.wikimedia.org/r/425420 [21:41:52] (03PS4) 10Andrew Bogott: Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421 [21:43:46] (03CR) 10Andrew Bogott: [C: 032] bootstrap firstboot: use an apt proxy on labtest [puppet] - 10https://gerrit.wikimedia.org/r/425420 (owner: 10Andrew Bogott) [21:45:44] (03PS7) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [21:45:46] (03PS5) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [21:45:48] (03PS7) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [21:45:50] (03PS3) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [21:45:58] (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:46:00] (03CR) 10jerkins-bot: [V: 04-1] Add CLI 
script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:46:02] (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:46:06] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [21:47:06] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4121741 (10DFoy) @BBlack - not sure why OperaMini proxy IPs are no longer being exported. Can this information be re-established? My only... [22:23:24] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:23:44] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:23:44] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:24:04] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:24:14] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:24:14] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:25:04] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:25:44] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [22:25:44] RECOVERY - Disk space on stat1005 is OK: DISK OK [22:26:05] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [22:26:14] RECOVERY - DPKG on stat1005 is OK: All packages OK [22:26:14] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [22:26:24] RECOVERY - MD RAID on stat1005 is 
OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [22:30:04] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:35:41] (03PS1) 10Awight: Include packages needed by virtualenv [puppet] - 10https://gerrit.wikimedia.org/r/425442 (https://phabricator.wikimedia.org/T181071) [22:37:48] (03PS2) 10Dzahn: toolforge: add mr (Marathi) language pack and locale [puppet] - 10https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: 10BryanDavis) [22:39:08] (03CR) 10Dzahn: [C: 031] "ok! partially it was about how to convert that -* wildcard to actual numbers though" [puppet] - 10https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: 10BryanDavis) [22:39:59] (03Abandoned) 10Awight: Include packages needed by virtualenv [puppet] - 10https://gerrit.wikimedia.org/r/425442 (https://phabricator.wikimedia.org/T181071) (owner: 10Awight) [22:43:33] Hi again ops-team - We're gonna deploy a fix for some issues after today's deploy on the analytics cluster [22:43:34] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4121912 (10thcipriani) Profiling info for rebuilding a single file via the command: `mwscript rebuildLocalisationCache.php --wiki=enwiki --outdir=/tmp...
[22:45:03] !log joal@tin Started deploy [analytics/refinery@33448cd]: Deploying fixes after todays deploy errors [22:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:54] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS2914/IPv4: Active [22:48:54] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 116, down: 7, shutdown: 0 [22:49:49] !log joal@tin Finished deploy [analytics/refinery@33448cd]: Deploying fixes after todays deploy errors (duration: 04m 46s) [22:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:20] hmm https://en.wikipedia.org seems down for me [22:50:28] and phabricator [22:50:34] though gerrit works [22:50:44] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers dns5001.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers dns5001.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers dns5002.wikimedia.org are marked down but pooled [22:50:44] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers dns5001.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers dns5002.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers dns5002.wikimedia.org are marked down but pooled [22:50:52] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4121932 (10mmodell) What could it be wait4ing for? 
[23:01:05] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 313 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [23:01:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 5 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:01:35] looks like I'm the only one in swat [23:01:38] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4121945 (10Sharvaniharan) @Dzahn please let me know if being on a hangout or remote access would help you.. I am open to it. [23:02:02] MaxSem: Prod issues right now, possibly. [23:02:05] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [23:02:05] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 3.31 ms [23:02:05] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [23:02:05] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 73.29 ms [23:02:05] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 90.57 ms [23:02:05] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 231.54 ms [23:02:05] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 234.67 ms [23:02:13] are we OK to deploy, or these alerts mean problems? [23:02:15] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater] [23:02:15] Hmm. Gerrit's back, at least. 
[23:02:53] everything is back to normal [23:03:15] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53_udp: Servers dns5002.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers dns5002.wikimedia.org are marked down but pooled [23:03:49] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4121947 (10awight) Back to the drawing board. Including the packaged Pyth... [23:03:50] !log Seemingly from 22:53 - 23:03 global traffic dropped by 30-60%, presumably due to issues in eqiad where 10 Gbits dropped to 3 Gbits sharper than ever before. [23:04:06] I pushed a router change that wasn't appreciated, but rolled it back [23:04:15] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53_udp: Servers dns5002.wikimedia.org are marked down but pooled [23:04:15] Krinkle: stashbot (~stashbot@wikimedia/bot/stashbot) has quit [23:04:18] Edit rate looks back up now? [23:04:28] Yeah, things are slowly recovering. [23:04:36] !log Seemingly from 22:53 - 23:03 global traffic dropped by 30-60%, presumably due to issues in eqiad where 10 Gbits dropped to 3 Gbits sharper than ever before. [23:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:03] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4121949 (10mmodell) @awight: why not build the virtualenv on a developer m... 
[23:05:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 8 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [23:06:15] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 11 probes of 323 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [23:07:20] (03CR) 10Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz) [23:07:46] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4121952 (10awight) @mmodell: That would be wonderful, but virtualenvs are... [23:08:40] (03PS39) 10Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 [23:09:24] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy [23:09:37] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4121953 (10mmodell) So why even use virtualenv if they are based on site p... [23:12:24] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers dns5002.wikimedia.org are marked down but pooled [23:12:24] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy [23:12:32] (03CR) 10EddieGP: "Alright, then this shouldn't be a problem any longer. Also, the stretch instance in beta was dropped in the meantime anyways." 
[puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz) [23:14:07] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Krinkle) [23:14:10] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121980 (10Krinkle) [23:14:12] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4121981 (10Sharvaniharan) @Dzahn my new key for production is not an rsa key, it is a ed25519 key. While we are doing this exercise, could you please update it to the new rsa ke... [23:14:24] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy [23:14:26] XioNoX: I've uploaded what I got from Grafana at https://phabricator.wikimedia.org/T191940 - just bare minimum though. [23:16:35] Krinkle: thx, commenting [23:17:14] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Paladox) I guess this is why en.wikipedia.org and phabricator.wikimedia.org would not load for me? (though gerrit.wikimedia.org loaded for me) [23:17:49] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4121994 (10awight) That's right, we do use the --system-site-packages flag... [23:18:05] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Ghouston) They still don't load for me. I think this is about April 10, not April 11. 
[23:18:05] 08Warning Alert for device cr1-eqsin.wikimedia.org - Processor usage over 85% [23:18:42] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121999 (10Ghouston) Well, phabricator is fine. [23:19:55] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122000 (10Krinkle) [23:20:25] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers dns5002.wikimedia.org are marked down but pooled: dns_rec_53: Servers dns5001.wikimedia.org are marked down but pooled [23:21:19] XioNoX: btw, aside from eqiad which recovered, it seems 10min earlier, eqsin dropped as well, which is still low. Is that intentional, e.g. being rerouted? [23:21:25] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy [23:21:40] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122003 (10Ghouston) Just started working again. [23:22:05] so, can I deploy?
[23:22:55] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:24:54] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 12 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [23:27:15] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:27:38] Krinkle: yeah something is borked with eqsin [23:28:28] https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&panelId=12&fullscreen&orgId=1&from=now-3h&to=now&var-source=navtiming2_oversample&var-metric=responseStart [23:28:33] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122016 (10Ghouston) And now dead again. Affects www.wikipedia.org, commons, wikidata, wiktionary. [23:29:00] This suggests asia was down for ~ 40min even [23:29:03] Although it seems back up now [23:29:38] (03PS1) 10Ayounsi: Depolling eqsin due to router issue [dns] - 10https://gerrit.wikimedia.org/r/425445 [23:29:56] Meh, belay that, it's dropping again. 
[23:31:00] (03CR) 10Ayounsi: [C: 032] Depolling eqsin due to router issue [dns] - 10https://gerrit.wikimedia.org/r/425445 (owner: 10Ayounsi) [23:31:05] https://gerrit.wikimedia.org/r/#/c/425445/ I'm depolling eqsin until I can figure out what's going on [23:31:55] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 108 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [23:32:51] !log depolled eqsin due to router issue [23:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:24] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122020 (10Dzahn) Hi @Sharvaniharan , I see in your config you have > User sharan Your user name in production is "sharvaniharan" though. Please try changing that config... [23:33:53] Social media has been getting pinged about access to Wikipedia being down in places like Australia. [23:34:31] Scrolling up - looks like we are aware of an outage in Asia. Just wanted to confirm if the problem has been resolved. 
[23:36:55] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 13 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [23:37:28] varnent: yes, the singapore pop has been depolled [23:38:25] XioNoX: awesome - thank you [23:41:05] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 1 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [23:43:06] 10Operations, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122038 (10Krinkle) [23:43:29] MaxSem: forgot to reply, yes, it's fine [23:43:38] wee - thanks [23:43:50] (03PS2) 10MaxSem: Add logging channel for preference stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425210 (https://phabricator.wikimedia.org/T190425) [23:44:04] (03CR) 10MaxSem: [C: 032] Add logging channel for preference stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425210 (https://phabricator.wikimedia.org/T190425) (owner: 10MaxSem) [23:44:07] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122040 (10Sharvaniharan) I am still getting the same error after changing username to sharvaniharan sharvaniharan@bast4001.wikimedia.org: Permission denied (publickey,keyboard-... [23:44:21] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Krinkle) [23:44:35] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122042 (10Sharvaniharan) Do you want to do a quick hangout? 
[23:45:36] (03Merged) 10jenkins-bot: Add logging channel for preference stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425210 (https://phabricator.wikimedia.org/T190425) (owner: 10MaxSem) [23:49:01] (03CR) 10jenkins-bot: Add logging channel for preference stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425210 (https://phabricator.wikimedia.org/T190425) (owner: 10MaxSem) [23:50:37] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10ayounsi) This was caused by a change made for T191667, more specifically enabling nonstop-routing on cr1/2-eqiad. I applied the change to cr1-... [23:55:07] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4122066 (10thcipriani) >>! In T191921#4121932, @mmodell wrote: > What could it be wait4ing for? Probably a red herring in this instance. There are 7 `...