[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T0000).
[00:00:04] tgr: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:02:29] (CR) BBlack: "recheck" [dns] - https://gerrit.wikimedia.org/r/479892 (owner: BBlack)
[00:06:49] (CR) Gergő Tisza: [C: +2] Disable MediaViewer thumbnail URL guessing for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/479997 (https://phabricator.wikimedia.org/T212099) (owner: Gergő Tisza)
[00:09:16] (Merged) jenkins-bot: Disable MediaViewer thumbnail URL guessing for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/479997 (https://phabricator.wikimedia.org/T212099) (owner: Gergő Tisza)
[00:12:09] (CR) Paladox: add httpd for doc.wm.org, using php-fpm, php7.2 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/480270 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:13:22] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:479997|Disable MediaViewer thumbnail URL guessing for private wikis (T212099)]] (duration: 00m 45s)
[00:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:25] T212099: MediaViewer shows images in the thumbnail's size on private wikis - https://phabricator.wikimedia.org/T212099
[00:14:05] (PS4) Dzahn: add httpd for doc.wm.org, using php-fpm, php7.2 [puppet] - https://gerrit.wikimedia.org/r/480270 (https://phabricator.wikimedia.org/T211974)
[00:15:00] (CR) jerkins-bot: [V: -1] add httpd for doc.wm.org, using php-fpm, php7.2 [puppet] - https://gerrit.wikimedia.org/r/480270 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:17:34] (PS5) Dzahn: add httpd for doc.wm.org, using php-fpm, php7.2 [puppet] - https://gerrit.wikimedia.org/r/480270 (https://phabricator.wikimedia.org/T211974)
[00:18:22] (CR) jenkins-bot: Disable MediaViewer thumbnail URL guessing for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/479997 (https://phabricator.wikimedia.org/T212099) (owner: Gergő Tisza)
[00:18:41] (CR) Dzahn: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/480270 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:21:40] shdubsh: not sure whether it is operations related, but something (idk what, maybe job queue?) responsible for displaying categorization on recent changes as well as wikidata edits on other wmf projects has not been running for two days or so... is that a known issue?
[00:22:49] (CR) Tim Starling: [C: +2] Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[00:23:56] (Merged) jenkins-bot: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[00:26:44] (CR) Dzahn: [C: +2] add httpd for doc.wm.org, using php-fpm, php7.2 [puppet] - https://gerrit.wikimedia.org/r/480270 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:31:12] (CR) jenkins-bot: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[00:31:20] (PS1) Dzahn: doc: ensure apt-get update is executed before package install [puppet] - https://gerrit.wikimedia.org/r/480277
[00:32:13] (CR) Dzahn: [C: +2] "fix invalid relationship, code like existing for phabricator" [puppet] - https://gerrit.wikimedia.org/r/480277 (owner: Dzahn)
[00:40:50] (PS1) Dzahn: doc: try to break dependency cycle for php-fpm7.2 package [puppet] - https://gerrit.wikimedia.org/r/480278
[00:43:59] (CR) Dzahn: [C: +2] doc: try to break dependency cycle for php-fpm7.2 package [puppet] - https://gerrit.wikimedia.org/r/480278 (owner: Dzahn)
[00:52:39] (PS1) Tim Starling: Revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480279
[00:52:56] (CR) Tim Starling: [C: +2] Revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480279 (owner: Tim Starling)
[00:54:02] (Merged) jenkins-bot: Revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480279 (owner: Tim Starling)
[00:55:48] (PS1) Dzahn: doc: rename apache_worker to httpd_worker, wrong file [puppet] - https://gerrit.wikimedia.org/r/480280
[00:55:52] (CR) jenkins-bot: Revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480279 (owner: Tim Starling)
[00:56:17] (CR) Dzahn: [C: +2] doc: rename apache_worker to httpd_worker, wrong file [puppet] - https://gerrit.wikimedia.org/r/480280 (owner: Dzahn)
[01:08:19] (PS1) Dzahn: doc: fix for dependencies between APT repo and httpd [puppet] - https://gerrit.wikimedia.org/r/480281
[01:09:11] (CR) Dzahn: [C: +2] doc: fix for dependencies between APT repo and httpd [puppet] - https://gerrit.wikimedia.org/r/480281 (owner: Dzahn)
[01:15:32] Operations: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (greg) (This seems like a high level/meta task, right? And since there are already scap-specific tasks as subtasks, removing #scap from this task.)
[01:26:12] Danny_B: a phab task would probably be a good start
[01:26:56] also if it's an urgent problem, state that clearly, otherwise it won't get fixed before January as tomorrow is the last normal deploy of the year
[01:31:55] (PS1) Dzahn: doc: don't make APT repo dependent on package [puppet] - https://gerrit.wikimedia.org/r/480283
[01:32:57] (CR) Dzahn: [C: +2] doc: don't make APT repo dependent on package [puppet] - https://gerrit.wikimedia.org/r/480283 (owner: Dzahn)
[01:37:36] (PS1) Awight: [DNM] Rename JADE to Jade [mediawiki-config] - https://gerrit.wikimedia.org/r/480284 (https://phabricator.wikimedia.org/T212182)
[01:38:02] Operations, serviceops: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (Dzahn)
[01:42:42] (PS1) CRusnov: Update old hardware report to exclude certain device types. [software/netbox-reports] - https://gerrit.wikimedia.org/r/480286 (https://phabricator.wikimedia.org/T205899)
[01:59:50] (PS1) Dzahn: doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288
[02:00:49] (CR) jerkins-bot: [V: -1] doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288 (owner: Dzahn)
[02:02:11] (PS2) Dzahn: doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288
[02:03:05] (CR) jerkins-bot: [V: -1] doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288 (owner: Dzahn)
[02:06:56] (PS3) Dzahn: doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288
[02:07:52] (CR) jerkins-bot: [V: -1] doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288 (owner: Dzahn)
[02:12:08] (PS4) Dzahn: doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288
[02:14:30] (CR) Dzahn: [C: +2] doc: dependency issue using require_package, fix path worker.conf [puppet] - https://gerrit.wikimedia.org/r/480288 (owner: Dzahn)
[02:17:55] (PS1) Dzahn: doc: fix Invalid relationship, rm notify for apt_update_php [puppet] - https://gerrit.wikimedia.org/r/480289
[02:23:22] (CR) Dzahn: [C: +2] doc: fix Invalid relationship, rm notify for apt_update_php [puppet] - https://gerrit.wikimedia.org/r/480289 (owner: Dzahn)
[02:25:07] (PS1) Bstorm: sonofgridengine: restore ssh knownhosts hack for tools [puppet] - https://gerrit.wikimedia.org/r/480291 (https://phabricator.wikimedia.org/T212153)
[02:26:16] (CR) Bstorm: [C: +2] sonofgridengine: restore ssh knownhosts hack for tools [puppet] - https://gerrit.wikimedia.org/r/480291 (https://phabricator.wikimedia.org/T212153) (owner: Bstorm)
[02:26:26] (PS2) Bstorm: sonofgridengine: restore ssh knownhosts hack for tools [puppet] - https://gerrit.wikimedia.org/r/480291 (https://phabricator.wikimedia.org/T212153)
[02:27:35] (PS1) Dzahn: doc: remove single quotes around variable for package name [puppet] - https://gerrit.wikimedia.org/r/480292
[02:31:10] (PS2) Dzahn: doc: remove single quotes around variable for package name [puppet] - https://gerrit.wikimedia.org/r/480292
[02:31:56] (CR) Dzahn: [C: +2] doc: remove single quotes around variable for package name [puppet] - https://gerrit.wikimedia.org/r/480292 (owner: Dzahn)
[02:39:17] so after lots of frustration I found out that X-Wikimedia-Debug does not work on wikitech
[02:40:21] is there a way to do live testing there?
[02:41:41] the code to be tested is not too destructive, I'm just wondering if I can cause some unexpected issues by changing PHP code on labwiki001, and what's the best way to get the code there
[02:42:00] (would scap pull work?)
[02:48:25] tgr: I would imagine that scap pull works
[02:48:30] since that's normally how we deploy code there
[02:48:39] I guess wikitech is hardcoded to never use the mwdebug servers
[02:51:50] so IIRC
[02:51:57] wikitech runs on a single machine
[02:52:05] behind cache-misc rather than cache-web (?)
[02:52:19] X-Wikimedia-Debug doesn't quite make sense for it
[02:53:03] you could try sending a request to cache-web for wikitech with X-Wikimedia-Debug but I'm not sure what it would do.
[02:53:54] hm no wikitech is in text-lb*, interesting
[02:55:37] dunno how my head changed text-lb to cache-web. I should sleep
[02:56:21] * legoktm hugs Krenair
[03:18:58] thanks, I found another wiki where I can reproduce the issue
[03:19:20] it would be nice to get workarounds for wikitech (if they exist) documented though
[03:34:31] Operations, ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T212185 (ops-monitoring-bot)
[03:37:09] (PS1) Andrew Bogott: Horizon: remove 'multimedia' project [puppet] - https://gerrit.wikimedia.org/r/480407
[03:37:52] (CR) Andrew Bogott: [C: +2] Horizon: remove 'multimedia' project [puppet] - https://gerrit.wikimedia.org/r/480407 (owner: Andrew Bogott)
[04:53:04] (PS1) Tim Starling: Un-revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480419
[04:54:29] (PS5) Tim Starling: Class wrapper for ProductionServices.php etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/477956
[04:54:58] (PS5) Tim Starling: Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957
[04:55:15] (PS6) Tim Starling: Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137
[05:59:22] (PS1) Marostegui: Revert "db-codfw.php: Depool db2038" [mediawiki-config] - https://gerrit.wikimedia.org/r/480420
[06:00:53] (CR) Marostegui: [C: +2] Revert "db-codfw.php: Depool db2038" [mediawiki-config] - https://gerrit.wikimedia.org/r/480420 (owner: Marostegui)
[06:02:00] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2038" [mediawiki-config] - https://gerrit.wikimedia.org/r/480420 (owner: Marostegui)
[06:03:01] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2038 after mysql and kernel upgrade (duration: 00m 47s)
[06:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:26] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/480421 (https://phabricator.wikimedia.org/T86338)
[06:04:53] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/480421 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:05:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/480421 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:07:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 T86338 T202167 (duration: 00m 45s)
[06:07:07] (CR) Revi: [C: +1] "LGTM! (Question on the task but not blocking this patch.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/480106 (https://phabricator.wikimedia.org/T211991) (owner: Kosta Harlan)
[06:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:10] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:07:11] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:07:11] !log Deploy schema change on db1078 T86338 T202167
[06:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:45] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2038" [mediawiki-config] - https://gerrit.wikimedia.org/r/480420 (owner: Marostegui)
[06:07:47] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/480421 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:09:37] Operations, ops-eqiad, DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T212185 (Marostegui) p:Triage→High a:Cmjohnson @Cmjohnson I have triaged this as high priority as this is m3 primary master (phabricator master)
[06:11:27] (CR) Marostegui: [C: +1] "I don't really have much context about it, but I guess if we are only changing a logo..." [software/tendril] - https://gerrit.wikimedia.org/r/479741 (owner: Ladsgroup)
[06:18:27] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/480106 (https://phabricator.wikimedia.org/T211991) (owner: Kosta Harlan)
[07:27:34] (PS1) Elukey: profile::kafkatee::webrequest::ops: absent quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/480422 (https://phabricator.wikimedia.org/T211883)
[07:27:36] (PS1) Elukey: Set role(spare::system) for oxygen [puppet] - https://gerrit.wikimedia.org/r/480423 (https://phabricator.wikimedia.org/T211883)
[07:27:38] (PS1) Elukey: profile::kafkatee::webrequest::ops: remove unused code [puppet] - https://gerrit.wikimedia.org/r/480424 (https://phabricator.wikimedia.org/T211883)
[07:27:52] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/480425
[07:28:45] (CR) Elukey: [C: +2] profile::kafkatee::webrequest::ops: absent quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/480422 (https://phabricator.wikimedia.org/T211883) (owner: Elukey)
[07:28:59] (CR) Elukey: [C: +2] Set role(spare::system) for oxygen [puppet] - https://gerrit.wikimedia.org/r/480423 (https://phabricator.wikimedia.org/T211883) (owner: Elukey)
[07:31:45] (CR) Elukey: [C: +2] profile::kafkatee::webrequest::ops: remove unused code [puppet] - https://gerrit.wikimedia.org/r/480424 (https://phabricator.wikimedia.org/T211883) (owner: Elukey)
[07:37:08] Operations, Patch-For-Review, User-Elukey: Move oxygen to weblog1001 - https://phabricator.wikimedia.org/T211883 (elukey) Open→Resolved a:elukey Move completed, we are ready to decom oxygen :)
[07:47:34] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/480425 (owner: Marostegui)
[07:47:58] Operations, Traffic, Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (faidon) There were a few notices on the 15th and 16th of December. Did these arrive to maint-announce@?
[07:48:42] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/480425 (owner: Marostegui)
[07:49:10] !log Deploy schema change on db1075 (s3 master) T86338 T202167
[07:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[07:49:15] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[07:49:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 T86338 T202167 (duration: 00m 45s)
[07:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:36] (CR) Elukey: [C: +1] "It will surely be important with a more strict Auth scheme in place :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/480245 (owner: Ottomata)
[07:57:42] !log restart cassandra-{a,b} on aqs1004 for openjdk upgrades
[07:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:46] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/480425 (owner: Marostegui)
[07:57:47] moritzm: --^
[08:02:04] ack, thanks, let me know if there's any issues, otherwise I'll deploy the rest
[08:09:35] Operations, hardware-requests: Procure logstash hardware in eqiad - https://phabricator.wikimedia.org/T210498 (faidon) a:RobH Let's move this forward please!
[08:11:32] moritzm: everything looks good
[08:11:46] you can deploy to the other nodes :)
[08:11:57] (CR) Hashar: [C: +1] "I have no idea what that "cluster" is for. It used to be for grouping servers in Ganglia but that is long gone :)" [puppet] - https://gerrit.wikimedia.org/r/479845 (https://phabricator.wikimedia.org/T210486) (owner: Cwhite)
[08:16:20] elukey: done
[08:23:10] !log Enable GTID on es2 - T211973
[08:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:13] T211973: Check GTID, consistency options, notifications across the fleet and db-eqiad.php weights - https://phabricator.wikimedia.org/T211973
[08:24:12] (PS2) Filippo Giunchedi: role: add rsyslog::udp_localhost_compat to mediawiki hosts [puppet] - https://gerrit.wikimedia.org/r/480057 (https://phabricator.wikimedia.org/T211124)
[08:24:22] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/13982/" [puppet] - https://gerrit.wikimedia.org/r/480057 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[08:26:05] !log Enable GTID on es3 - T211973
[08:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:36] Operations, Core Platform Team, MediaWiki-Cache, Performance-Team, serviceops: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Joe)
[08:34:33] (PS1) Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[08:35:31] (PS1) Filippo Giunchedi: role: include kafka_shipper on mediawiki hosts [puppet] - https://gerrit.wikimedia.org/r/480434 (https://phabricator.wikimedia.org/T211124)
[08:35:50] (PS2) Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[08:38:09] (CR) Filippo Giunchedi: [C: +2] role: include kafka_shipper on mediawiki hosts [puppet] - https://gerrit.wikimedia.org/r/480434 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[08:38:15] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/13983/" [puppet] - https://gerrit.wikimedia.org/r/480434 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[08:38:49] Operations, Operations-Software-Development, Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (faidon)
[08:38:52] Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (faidon) Resolved→Open This task is great, and the table at the top is a very useful summary! The Q2 goal part of it has been completed indeed, so I can see the argument for the task being resolved. However, we hav...
[08:43:22] (PS3) Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[08:44:00] !log Enable GTID on s7 codfw master (db2040) - T211973
[08:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:05] T211973: Check GTID, consistency options, notifications across the fleet and db-eqiad.php weights - https://phabricator.wikimedia.org/T211973
[08:46:32] 1/win 24
[08:51:13] (PS4) Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[08:57:02] (PS5) Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[08:58:05] (CR) Muehlenhoff: "Looks good, two comments." (2 comments) [puppet/cdh] - https://gerrit.wikimedia.org/r/480433 (owner: Elukey)
[08:59:58] (PS4) Michael Große: Perform even more PHP constraint checks before falling back [mediawiki-config] - https://gerrit.wikimedia.org/r/478630 (https://phabricator.wikimedia.org/T209504)
[09:00:39] (CR) Vgutierrez: [C: +1] Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) (owner: Mark Bergsma)
[09:01:05] (CR) Elukey: Add remaining kerberos wrapped commands (2 comments) [puppet/cdh] - https://gerrit.wikimedia.org/r/480433 (owner: Elukey)
[09:01:23] (PS2) Muehlenhoff: Remove Diamond from further roles [puppet] - https://gerrit.wikimedia.org/r/480032 (https://phabricator.wikimedia.org/T183454)
[09:09:01] (CR) Muehlenhoff: [C: +2] Remove Diamond from further roles [puppet] - https://gerrit.wikimedia.org/r/480032 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[09:28:21] Operations, Services, Wikidata, Wikidata-Termbox-Hike, Service-deployment-requests: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (WMDE-leszek)
[09:33:21] (PS6) Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[09:36:20] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13988/" [puppet/cdh] - https://gerrit.wikimedia.org/r/480433 (owner: Elukey)
[09:38:03] !log swift eqiad-prod: initial weights for ms-be10[44-50].eqiad.wmnet - T209618
[09:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:06] T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618
[09:46:10] (PS1) Filippo Giunchedi: hieradata: add ms-be10[44-50].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/480446 (https://phabricator.wikimedia.org/T209618)
[09:46:43] (CR) Filippo Giunchedi: [C: +2] hieradata: add ms-be10[44-50].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/480446 (https://phabricator.wikimedia.org/T209618) (owner: Filippo Giunchedi)
[09:47:49] (CR) Muehlenhoff: [C: +1] Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433 (owner: Elukey)
[10:09:24] (CR) Ema: [C: +1] "One nit." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: Volans)
[10:12:48] (PS1) Ema: profile::spicerack: install python3-requests [puppet] - https://gerrit.wikimedia.org/r/480452 (https://phabricator.wikimedia.org/T205867)
[10:13:45] (PS2) Ema: profile::spicerack: install python3-requests [puppet] - https://gerrit.wikimedia.org/r/480452 (https://phabricator.wikimedia.org/T205867)
[10:17:11] (PS5) Ema: sre.hosts: add varnish upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886)
[10:17:42] (CR) Ema: sre.hosts: add varnish upgrade cookbook (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: Ema)
[10:21:44] (CR) Ema: remote: add more functionalities (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/480064 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[10:25:34] (CR) Muehlenhoff: sre.hosts: add varnish upgrade cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: Ema)
[10:26:12] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/480452 (https://phabricator.wikimedia.org/T205867) (owner: Ema)
[10:27:13] (CR) Addshore: [C: -1] Configure WikibaseQualityConstraints on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: Lucas Werkmeister (WMDE))
[10:33:20] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 3 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (mobrovac)
[10:33:24] (CR) Addshore: [C: -1] Configure WikibaseQualityConstraints on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: Lucas Werkmeister (WMDE))
[10:35:02] (PS3) Addshore: Configure WikibaseQualityConstraints on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: Lucas Werkmeister (WMDE))
[10:35:09] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (mobrovac)
[10:38:22] Operations, Wikibase-Containers, Wikidata, wikidata-tech-focus, Release Pipeline (Blubber): Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (Addshore) I wonder if there is a way to get some rough timeline for this from #operations or #release-engineering-...
[10:40:02] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/480077 (https://phabricator.wikimedia.org/T209886) (owner: Effie Mouzeli)
[10:49:10] (PS3) Volans: sre.hosts: add upgrade and reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886)
[10:49:18] (CR) Volans: sre.hosts: add upgrade and reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: Volans)
[11:02:17] (CR) Filippo Giunchedi: hiera: add alerting_host cluster definition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) (owner: Cwhite)
[11:06:58] (CR) Filippo Giunchedi: profile: enable statsd_exporter and add matching rules to logstash::collector (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[11:11:44] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Joe) Looking at the attached diagrams, it seems that the flow of a request is as follows: - page gets requested to MediaWiki - MW sends a req...
[11:15:10] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Addshore)
[11:15:26] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Joe) Also: it is stated in https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service that "In case of no configured server-side rendering...
[11:23:40] Operations, Performance-Team, monitoring, Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (fgiunchedi) >>! In T211747#4828329, @CDanis wrote: > We could try changing the Grafana datasource back to 'direct' (now called 'Browser' in 5.x) -- it was fl...
[11:37:15] (CR) Ema: [C: +2] profile::spicerack: install python3-requests [puppet] - https://gerrit.wikimedia.org/r/480452 (https://phabricator.wikimedia.org/T205867) (owner: Ema)
[11:41:42] !log installing remaining libgd2 security updates
[11:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:21] (PS3) Elukey: Enable Kerberos support for hdfs-balancer systemd timer [puppet] - https://gerrit.wikimedia.org/r/479696 (owner: Muehlenhoff)
[11:42:43] (PS3) Elukey: Remove unused cron and logrotate config, replaced by systemd timer [puppet] - https://gerrit.wikimedia.org/r/479697 (owner: Muehlenhoff)
[11:45:31] (CR) Elukey: [C: +2] Enable Kerberos support for hdfs-balancer systemd timer [puppet] - https://gerrit.wikimedia.org/r/479696 (owner: Muehlenhoff)
[11:45:33] (CR) Elukey: [C: +2] Remove unused cron and logrotate config, replaced by systemd timer [puppet] - https://gerrit.wikimedia.org/r/479697 (owner: Muehlenhoff)
[11:47:18] (PS1) Elukey: profile::hadoop::balancer: Remove unused logrotate config [puppet] - https://gerrit.wikimedia.org/r/480465
[11:48:10] (CR) Elukey: [C: +2] profile::hadoop::balancer: Remove unused logrotate config [puppet] - https://gerrit.wikimedia.org/r/480465 (owner: Elukey)
[11:50:14] (PS1) Elukey: Revert "profile::hadoop::balancer: Remove unused logrotate config" [puppet] - https://gerrit.wikimedia.org/r/480466
[11:50:18] (CR) Elukey: [V: +2 C: +2] Revert "profile::hadoop::balancer: Remove unused logrotate config" [puppet] - https://gerrit.wikimedia.org/r/480466 (owner: Elukey)
[11:50:33] (PS2) Muehlenhoff: Remove obsolete rsync::repo [puppet] - https://gerrit.wikimedia.org/r/470611
[11:56:50] jouncebot: next
[11:56:50] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1200)
[11:57:48] (CR) Ladsgroup: "This is the WMF guideline on using the logo: https://foundation.wikimedia.org/wiki/Visual_identity_guidelines" [software/tendril] - https://gerrit.wikimedia.org/r/479741 (owner: Ladsgroup)
[12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1200).
[12:00:04] Michael_WMDE and godog: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:24] 👋
[12:00:41] 👋
[12:00:50] (PS1) Elukey: Delete unused Analytics role (already moved to profile) [puppet] - https://gerrit.wikimedia.org/r/480469
[12:01:41] (CR) Elukey: [C: +2] Delete unused Analytics role (already moved to profile) [puppet] - https://gerrit.wikimedia.org/r/480469 (owner: Elukey)
[12:02:36] I can swap today, that is if Michael_WMDE and godog are not deployers :)
[12:03:02] please do, I am not a deployer
[12:03:22] technically I'm not, but I could, though I'd rather deploy under swat
[12:04:02] Michael_WMDE: please stand by, I'll let you know when your commit is at mwdebug1002, ready for testing
[12:04:17] * Michael_WMDE is standing by
[12:04:22] godog: sorry, did not understand you :) if you can deploy, you can do it, after I deploy Michael_WMDE's patch
[12:06:38] zeljkof: sounds good to me
[12:07:45] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/478630 (https://phabricator.wikimedia.org/T209504) (owner: Michael Große)
[12:09:02] (Merged) jenkins-bot: Perform even more PHP constraint checks before falling back [mediawiki-config] - https://gerrit.wikimedia.org/r/478630 (https://phabricator.wikimedia.org/T209504) (owner: Michael Große)
[12:09:17] (CR) jenkins-bot: Perform even more PHP constraint checks before falling back [mediawiki-config] - https://gerrit.wikimedia.org/r/478630 (https://phabricator.wikimedia.org/T209504) (owner: Michael Große)
[12:10:08] Michael_WMDE: the patch is at mwdebug1002, please test and let me know if I can deploy
[12:14:53] seems to still work for the things that work on the live instance (which is all I can test for this patch, the rest I will have to see in the statistics)
[12:15:05] Michael_WMDE: ok to deploy?
[12:15:28] zeljkof: ok to deploy :)
[12:15:41] zeljkof: Hey, when you're done, please let me know. I'll be deploying a security patch for T207814
[12:15:54] jouncebot: next
[12:15:54] In 0 hour(s) and 44 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1300)
[12:16:11] Amir1: will do
[12:16:16] Michael_WMDE: ok, deploying
[12:16:17] Thanks
[12:16:38] !log installing fuse updates from stretch point release
[12:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:13] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:478630|Perform even more PHP constraint checks before falling back (T209504)]] (duration: 00m 46s)
[12:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:16] T209504: Perform more constraint type checks in PHP before falling back to SPARQL - https://phabricator.wikimedia.org/T209504
[12:17:38] Michael_WMDE: it's deployed, please test and thanks for deploying with #releng :)
[12:18:05] zeljkof thank you
[12:18:09] godog, Amir1: I'm done, feel free to continue, please self organize - Amir1 is your patch urgent?
[12:18:38] godog, Amir1: please let me know when you finish, I need to start with train (cutting the branch...)
[12:19:40] ok, mine is not urgent though I'd like another set of eyes on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/480052
[12:21:23] zeljkof Amir1 perhaps?
[12:22:13] godog: uh, I've took a look but I really don't know much about logging :/ [12:22:37] zeljkof: ack, thanks anyways [12:22:39] I'll deploy [12:22:55] please do, looks like Amir1 is not around at the moment [12:23:45] (03PS2) 10Filippo Giunchedi: Use logging pipeline for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480052 (https://phabricator.wikimedia.org/T211124) [12:24:25] (03CR) 10Filippo Giunchedi: [C: 03+2] Use logging pipeline for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480052 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [12:30:18] ok I need a small change to the patch, doing [12:33:04] (03PS1) 10Filippo Giunchedi: Use array for wmgLogstashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480473 (https://phabricator.wikimedia.org/T211124) [12:34:05] (03CR) 10jerkins-bot: [V: 04-1] Use array for wmgLogstashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480473 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [12:34:37] (03PS2) 10BBlack: Eliminate {{zonename}} templating in favor of @Z [dns] - 10https://gerrit.wikimedia.org/r/479889 [12:34:39] (03PS2) 10BBlack: Switch to %include for geolang templating [dns] - 10https://gerrit.wikimedia.org/r/479890 [12:34:41] (03PS2) 10BBlack: Remove trailing serial comments [dns] - 10https://gerrit.wikimedia.org/r/479891 [12:34:43] (03PS6) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [12:34:45] (03PS2) 10BBlack: Minor improvements to check-gdnsd [dns] - 10https://gerrit.wikimedia.org/r/480260 [12:35:11] Amir1: zeljkof I also have one I would like to add to swat, please ping me when your all done! 
[12:35:20] (03CR) 10jenkins-bot: Use logging pipeline for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480052 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [12:35:33] will do, almost done here [12:36:16] (03PS2) 10Filippo Giunchedi: Use array and IP address for wmgLogstashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480473 (https://phabricator.wikimedia.org/T211124) [12:38:12] zeljkof: sorry I was pulled away [12:38:17] (03CR) 10Filippo Giunchedi: [C: 03+2] Use array and IP address for wmgLogstashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480473 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [12:38:21] addshore: do you want to go first? [12:38:23] I can wait [12:38:29] Amir1: no, jenkins will take a while to merge mine [12:38:32] godog: is deploying something [12:38:40] yes I'll be done shortly [12:38:42] Amir1, addshore ^ [12:38:47] ack [12:38:54] okay [12:39:21] (03Merged) 10jenkins-bot: Use array and IP address for wmgLogstashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480473 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [12:41:50] !log filippo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 45s) [12:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:46] Amir1 addshore good to go on my end [12:43:16] 10Operations, 10TechCom, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10mobrovac) As a further optimisation of both the architecture as well as parsing and load times, MW/Wikibase could populate a (hidden?) tag in... 
[12:43:41] okay, I quickly deploy my patch [12:48:27] (03CR) 10jenkins-bot: Use array and IP address for wmgLogstashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480473 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [12:52:51] !log deployed a patch on wmf.8 for T207814 [12:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:16] I'm done [12:53:19] addshore: ^ [12:53:21] thanks [12:53:25] still waiting for CI for mine :) [12:55:32] zeljkof: has the branch cut? I want to put the security patch on gerrit. It's in central auth extension so it's fine [12:55:54] Amir1: no, after swat [12:57:24] * addshore keeps waiting for CI [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1300) [13:00:15] addshore: I'll start the branch cut, swat window is up [13:00:27] zeljkof: okay, this backport is still merging on master too [13:00:28] is your patch urgent, or can it wait for the next swat window? 
[13:00:33] zeljkof: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/480467/ [13:00:55] not urgent urgent, but it would be good to get in both the branch and what is currently deployed [13:01:19] the merge on master should only have a couple of mins left [13:01:21] addshore: ok, we can coordinate during the day, I'll start with the train stuff [13:05:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce blubberoid LVS IP endpoints [dns] - 10https://gerrit.wikimedia.org/r/480123 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [13:05:13] (03PS2) 10Alexandros Kosiaris: Introduce blubberoid LVS IP endpoints [dns] - 10https://gerrit.wikimedia.org/r/480123 (https://phabricator.wikimedia.org/T205919) [13:05:16] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Introduce blubberoid LVS IP endpoints [dns] - 10https://gerrit.wikimedia.org/r/480123 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [13:06:54] zeljkof: iut is merged on master [13:07:02] im guessing the branch cut didnt finish yet, so it should be in the branch :) [13:07:06] addshore: cutting the branch right now [13:07:09] cool! [13:07:11] just started [13:07:15] (03PS1) 10BBlack: [WIP] authdns-local-update: use check-gdnsd/gen-zones [puppet] - 10https://gerrit.wikimedia.org/r/480477 [13:07:24] zeljkof: mind if I backport the other one to .8 now? or wait until after the train? [13:07:30] at extensions/Cite [13:08:16] addshore: after the train, please [13:08:20] ack! [13:08:35] if there are no problems, train will be done in an hour or so [13:08:39] so, FYI it is already merged on .8 but not deployed [13:08:41] if there are problems... 
[13:09:10] addshore: well, the script needs 15 minutes to run, I guess you can deploy now then [13:09:21] ack, will do [13:09:24] I'm not deploying anything for a while [13:12:12] !log addshore@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/Wikibase/lib/includes/Formatters/ControlledFallbackEntityIdFormatter.php: T201930 ControlledFallbackEntityIdFormatter, track unique value formats (duration: 00m 46s) [13:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:15] zeljkof: all done [13:12:15] T201930: Track entity page performance - https://phabricator.wikimedia.org/T201930 [13:12:25] addshore: great! [13:13:04] (03CR) 10BBlack: [C: 04-1] "I uploaded this mostly for context/WIP on the preceding changes in ops/dns. It still has major issues, chiefly that the metadata updates " [puppet] - 10https://gerrit.wikimedia.org/r/480477 (owner: 10BBlack) [13:15:59] (03PS1) 10KartikMistry: WIP: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 [13:16:54] (03CR) 10jerkins-bot: [V: 04-1] WIP: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [13:26:33] (03PS1) 10Alexandros Kosiaris: blubberoid: Bump CPU limit to 1800m [deployment-charts] - 10https://gerrit.wikimedia.org/r/480484 [13:28:22] (03PS4) 10Volans: remote: add more functionalities [software/spicerack] - 10https://gerrit.wikimedia.org/r/480064 (https://phabricator.wikimedia.org/T205884) [13:28:24] (03PS1) 10Volans: interactive: check TTY in ask_confirmation() [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) [13:28:30] (03PS1) 10Alexandros Kosiaris: Introduce blubberoid LVS service [puppet] - 10https://gerrit.wikimedia.org/r/480486 (https://phabricator.wikimedia.org/T205919) [13:29:08] (03CR) 10Volans: remote: add more functionalities (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/480064 
(https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:31:55] (03CR) 10jerkins-bot: [V: 04-1] remote: add more functionalities [software/spicerack] - 10https://gerrit.wikimedia.org/r/480064 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:31:57] (03CR) 10jerkins-bot: [V: 04-1] interactive: check TTY in ask_confirmation() [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:33:25] (03PS2) 10Alexandros Kosiaris: blubberoid: Bump CPU limit to 1800m [deployment-charts] - 10https://gerrit.wikimedia.org/r/480484 [13:35:31] (03PS1) 10Zfilipin: Group0 to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480487 [13:42:16] (03PS4) 10Ladsgroup: Configure WikibaseQualityConstraints on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: 10Lucas Werkmeister (WMDE)) [13:42:28] (03CR) 10Ladsgroup: [C: 03+2] "labs only patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: 10Lucas Werkmeister (WMDE)) [13:43:31] (03Merged) 10jenkins-bot: Configure WikibaseQualityConstraints on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: 10Lucas Werkmeister (WMDE)) [13:44:44] ^ rebased on deploy1001 [13:46:00] 10Operations, 10Wikibase-Containers, 10Wikidata, 10wikidata-tech-focus, 10Release Pipeline (Blubber): Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (10hashar) I don't think anyone from releng is looking into it. The base images are build with `docker-pkg` (just li... 
[13:47:05] (03PS1) 10Hashar: Add .gitreview file [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/480492 [13:48:14] !log zfilipin@deploy1001 Pruned MediaWiki: 1.33.0-wmf.4 (duration: 11m 13s) [13:48:30] 10Operations, 10Wikibase-Containers, 10Wikidata, 10serviceops, and 2 others: Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (10hashar) #serviceops should be able to help / review. [13:50:14] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [13:51:13] !log zfilipin@deploy1001 Started scap: testwiki to php-1.33.0-wmf.9 and rebuild l10n cache [13:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:03] (03CR) 10jenkins-bot: Configure WikibaseQualityConstraints on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: 10Lucas Werkmeister (WMDE)) [13:57:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce blubberoid LVS service [puppet] - 10https://gerrit.wikimedia.org/r/480486 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [14:00:04] zeljkof: #bothumor I � Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1400). [14:00:50] thanks jouncebot [14:03:07] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10Patch-For-Review: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10fgiunchedi) MW logging through the logging pipeline is deployed to group0 (testwiki / mediawikiwiki), results: https://logstash.wikimedia.... 
[14:04:07] (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/480496 (https://phabricator.wikimedia.org/T204745) [14:04:57] (03PS2) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/480496 (https://phabricator.wikimedia.org/T204745) [14:05:38] (03CR) 10Vgutierrez: [C: 03+1] interactive: check TTY in ask_confirmation() [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:05:53] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/480496 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [14:15:31] !log restart pybal on lvs1016, lvs2006 for blubberoid LVS deployment. T205919 [14:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:35] T205919: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 [14:18:17] (03PS1) 10Alexandros Kosiaris: lvs: Set blubberoid as critical in icinga [puppet] - 10https://gerrit.wikimedia.org/r/480497 (https://phabricator.wikimedia.org/T205919) [14:19:32] akosiaris: I'm surprised, shouldn't the currently active 'PyBal connections to etcd' criticals alert here on irc? [14:19:48] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add .gitreview file [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/480492 (owner: 10Hashar) [14:19:59] (03CR) 10Vgutierrez: [C: 03+1] Wait for onConfigUpdate initialization in setServers using inlineCallbacks [debs/pybal] - 10https://gerrit.wikimedia.org/r/477793 (owner: 10Mark Bergsma) [14:20:49] ema: maybe I was too successful? 
[14:21:00] I did disable puppet for a while on icinga1001 btw [14:21:18] akosiaris: mmmh, but I do see the criticals in the web ui [14:21:22] I do see during the manual run the + check_command nrpe_check!check_confd_srv_config-master_pybal_eqiad_blubberoid!10 [14:21:30] so it was added a few mins ago [14:22:31] 10Operations, 10Performance-Team, 10monitoring, 10Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (10CDanis) Agreed it's worth a try. Graphite datasource updated. [14:23:04] ema: can't say I disagree ... it should have indeed alerted [14:23:22] yeah the alert for lvs1006 is there since 23 minutes [14:23:36] perhaps icinga-wm committed seppuku? [14:23:47] lvs1006 PyBal connections to etcd CRITICAL 2018-12-18 14:02:27 irc notify-service-by-irc CRITICAL: 46 connections established with conf1004.eqiad.wmnet:4001 (min=47) [14:23:54] it does believe it notified [14:25:15] UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 174: ordinal not in range(128) [14:25:21] ema: your guess is correct [14:25:24] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.9 and rebuild l10n cache (duration: 34m 11s) [14:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:45] right, the last time icinga-wm said anything here was clearly too long ago, yesterday 19:36 CET [14:26:01] !log restart ircecho, seems to have croaked with Dec 17 19:39:52 icinga1001 ircecho[861]: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 174: ordinal not in range(128) [14:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:03] volans: ^ [14:26:15] you beloved irc bot is calling for ya :P [14:26:25] we need an icinga check for this! 
:) [14:26:34] thanks akosiaris [14:26:40] thanks for noticing it [14:27:13] ema: https://giant.gfycat.com/TemptingLiveHerculesbeetle.webm [14:28:23] ema: you should have a look at that python library for irc [14:28:42] somehow I feel you're being sarcastic [14:28:54] good for you, if you actually fell for it [14:28:58] ... [14:31:43] akosiaris: ema I’ve added support for python3 here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463794/ (which should fix that I think?) [14:32:50] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 37 connections established with conf2001.codfw.wmnet:2379 (min=37) [14:32:52] 10Operations, 10vm-requests: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 (10fselles) [14:33:02] 10Operations, 10serviceops, 10vm-requests: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 (10fselles) [14:33:06] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 47 connections established with conf1004.eqiad.wmnet:4001 (min=47) [14:33:38] welcome back icinga-wm [14:33:43] paladox: nice! [14:34:11] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) [14:34:26] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) [14:34:41] paladox: thanks! 
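Note on the ircecho crash above: byte 0xe2 is the lead byte of a three-byte UTF-8 sequence (an em dash, for example, encodes as e2 80 94), so the ASCII codec rejects it. The following is a minimal Python 3 sketch of the "skip the message if it cannot be decoded" idea. It is not the actual ircecho code; the helper name `decode_or_skip` and the sample alert text are hypothetical and used only for illustration.

```python
from typing import Optional


def decode_or_skip(raw: bytes, encoding: str = "utf-8") -> Optional[str]:
    """Decode one incoming line; return None when the bytes are not valid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Skip the undecodable line instead of letting the whole bot die.
        return None


# Hypothetical alert text. An em dash (U+2014) encodes to e2 80 94 in UTF-8.
alert = "lvs1006 \u2014 PyBal connections to etcd CRITICAL".encode("utf-8")

# The ASCII codec rejects the 0xe2 byte, which is the failure seen in the log.
try:
    alert.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

decoded = decode_or_skip(alert)        # valid UTF-8, so the text comes back
skipped = decode_or_skip(b"\xe2\x28")  # malformed sequence, so None
```

Catching only `UnicodeDecodeError` keeps other failures visible, unlike a blanket `except Exception: pass`.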
[14:35:06] Your welcome :) [14:35:58] (03CR) 10Zfilipin: [C: 03+2] Group0 to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480487 (owner: 10Zfilipin) [14:36:57] (03Merged) 10jenkins-bot: Group0 to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480487 (owner: 10Zfilipin) [14:37:32] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) Picked back up by the campsite to write an ADR for... [14:40:02] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.9 [14:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:22] 10Operations, 10serviceops, 10vm-requests: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 (10akosiaris) Add codfw in the mix as well, no reason to cap this to eqiad. Everything else LGTM [14:43:56] 10Operations, 10Traffic, 10media-storage: Update Subject Alternative Name field in TLS certificates for swift - https://phabricator.wikimedia.org/T212215 (10ema) p:05Triage→03Normal [14:45:35] (03CR) 10jenkins-bot: Group0 to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480487 (owner: 10Zfilipin) [14:52:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Set blubberoid as critical in icinga [puppet] - 10https://gerrit.wikimedia.org/r/480497 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [14:52:37] 10Operations, 10TechCom, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) @mobrovac Please note that the term box is shown based on user preferences (languages spoken), the initially served DOM however needs... 
[14:55:12] (03PS1) 10Alexandros Kosiaris: blubberoid: Add discovery RRs [dns] - 10https://gerrit.wikimedia.org/r/480505 (https://phabricator.wikimedia.org/T205919) [14:55:20] !log restart pybal on lvs1006, lvs2003 for blubberoid LVS deployment. T205919 [14:55:25] (03CR) 10jerkins-bot: [V: 04-1] blubberoid: Add discovery RRs [dns] - 10https://gerrit.wikimedia.org/r/480505 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [14:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:28] T205919: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 [14:59:41] (03PS2) 10Alexandros Kosiaris: blubberoid: Add discovery RRs [dns] - 10https://gerrit.wikimedia.org/r/480505 (https://phabricator.wikimedia.org/T205919) [15:00:16] (03CR) 10Elukey: "Only relevant change is:" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [15:00:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] blubberoid: Add discovery RRs [dns] - 10https://gerrit.wikimedia.org/r/480505 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [15:06:16] 10Operations: wmf-auto-restart fails on certain legacy services - https://phabricator.wikimedia.org/T212219 (10ema) [15:06:24] 10Operations: wmf-auto-restart fails on certain legacy services - https://phabricator.wikimedia.org/T212219 (10ema) p:05Triage→03Normal [15:09:44] akosiaris: oh nooo :( at least it should have logged the exception this time around [15:11:16] (03CR) 10Ottomata: "OH YA" [puppet] - 10https://gerrit.wikimedia.org/r/480245 (owner: 10Ottomata) [15:13:14] (03CR) 10Elukey: [C: 04-1] "No it doesn't look good, now I got what the code meant to do. Fixing.." 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [15:14:24] (03PS1) 10Alexandros Kosiaris: blubberoid: Add the discovery stanzas [puppet] - 10https://gerrit.wikimedia.org/r/480507 (https://phabricator.wikimedia.org/T205919) [15:15:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] blubberoid: Add the discovery stanzas [puppet] - 10https://gerrit.wikimedia.org/r/480507 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [15:15:49] (03PS7) 10Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 [15:15:59] (03PS3) 10Ottomata: Remove specific settings from profile::hadoop::common and put into hiera [puppet] - 10https://gerrit.wikimedia.org/r/480245 [15:16:01] !log mwscript extensions/WikibaseQualityConstraints/maintenance/ImportConstraintEntities.php --wiki=testwikidatawiki --config-format=wgConf | tee WikibaseQualityConstraints-config.php [15:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:34] (03CR) 10Elukey: "Better now :) https://puppet-compiler.wmflabs.org/compiler1002/13992/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [15:19:15] akosiaris: found it :( except Exception: pass [15:19:45] volans: lol [15:20:43] mmmh no wait a sec [15:23:02] akosiaris: do you know where are the definitions for the recommendation_api checks? 
[15:23:57] PROBLEM - Host ms-be2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:24:43] (03CR) 10Ema: ircecho: Convert script to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [15:25:04] (03PS1) 10Volans: ircecho: skip message if unable to decode it [puppet] - 10https://gerrit.wikimedia.org/r/480509 [15:25:09] akosiaris, ema, godog: hotfix (part1) ^^^ [15:27:24] (03PS1) 10Michael Große: Add proxy info to toolkit analyzer cron job [puppet] - 10https://gerrit.wikimedia.org/r/480510 (https://phabricator.wikimedia.org/T209399) [15:27:36] (03PS1) 10Andrew Bogott: nfs: added two more VMs to the maps share [puppet] - 10https://gerrit.wikimedia.org/r/480511 [15:28:26] (03CR) 10Andrew Bogott: [C: 03+2] nfs: added two more VMs to the maps share [puppet] - 10https://gerrit.wikimedia.org/r/480511 (owner: 10Andrew Bogott) [15:29:07] RECOVERY - Host ms-be2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms [15:30:25] !log empty ganeti2005, ganeti2006 for T210447 [15:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:28] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 [15:33:45] mobrovac: are EM DASH here intended? 
[15:33:46] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/recommendation-api/+/refs/heads/master/spec.yaml [15:34:33] (03PS3) 10Ayounsi: Assign public /29 for cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/479337 (https://phabricator.wikimedia.org/T207663) [15:35:54] (03PS1) 10Andrew Bogott: Horizon: enable more projects in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/480513 (https://phabricator.wikimedia.org/T204745) [15:37:23] (03PS1) 10Ayounsi: Depool codfw for row A recabling [dns] - 10https://gerrit.wikimedia.org/r/480514 (https://phabricator.wikimedia.org/T210447) [15:39:46] (03CR) 10Effie Mouzeli: [C: 03+2] aptrepo: add component/thumbor to stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/480077 (https://phabricator.wikimedia.org/T209886) (owner: 10Effie Mouzeli) [15:39:57] (03PS2) 10Effie Mouzeli: aptrepo: add component/thumbor to stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/480077 (https://phabricator.wikimedia.org/T209886) [15:41:17] (03CR) 10Michael Große: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/480510 (https://phabricator.wikimedia.org/T209399) (owner: 10Michael Große) [15:43:32] (03CR) 10Ayounsi: [C: 03+2] Depool codfw for row A recabling [dns] - 10https://gerrit.wikimedia.org/r/480514 (https://phabricator.wikimedia.org/T210447) (owner: 10Ayounsi) [15:43:50] (03PS1) 10Andrew Bogott: Horizon: move some utility projects [puppet] - 10https://gerrit.wikimedia.org/r/480517 [15:44:12] !log depool codfw for T210447 [15:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:15] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 [15:44:21] (03PS1) 10Ayounsi: Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/480518 (https://phabricator.wikimedia.org/T210447) [15:45:05] (03PS2) 10Andrew Bogott: Horizon: enable more projects in eqiad1 [puppet] - 
10https://gerrit.wikimedia.org/r/480513 (https://phabricator.wikimedia.org/T204745) [15:45:46] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: enable more projects in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/480513 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [15:45:48] (03CR) 10Ayounsi: [C: 03+2] Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/480518 (https://phabricator.wikimedia.org/T210447) (owner: 10Ayounsi) [15:45:58] (03PS2) 10Ayounsi: Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/480518 (https://phabricator.wikimedia.org/T210447) [15:46:05] (03PS1) 10Bstorm: sonofgridengine: removing commented manual db init [puppet] - 10https://gerrit.wikimedia.org/r/480519 (https://phabricator.wikimedia.org/T212153) [15:46:07] (03PS2) 10Andrew Bogott: Horizon: move some utility projects [puppet] - 10https://gerrit.wikimedia.org/r/480517 [15:47:07] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: move some utility projects [puppet] - 10https://gerrit.wikimedia.org/r/480517 (owner: 10Andrew Bogott) [15:47:10] !log redirect eqsin/ulsfo caches to eqiad for T210447 [15:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:14] (03PS3) 10Andrew Bogott: Horizon: move some utility projects [puppet] - 10https://gerrit.wikimedia.org/r/480517 [15:47:32] (03PS1) 10Muehlenhoff: Clarify expected format of service name in wmf-auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) [15:48:18] (03PS2) 10Bstorm: sonofgridengine: removing commented manual db init [puppet] - 10https://gerrit.wikimedia.org/r/480519 (https://phabricator.wikimedia.org/T212153) [15:49:36] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: removing commented manual db init [puppet] - 10https://gerrit.wikimedia.org/r/480519 (https://phabricator.wikimedia.org/T212153) (owner: 10Bstorm) [15:53:23] !log redirect ns1 to authdns1001 for T210447 
[15:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:26] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 [15:53:36] cscott: I <3 telnet.wikimedia.org but it's in need of attention; can you please comment on T204694? [15:53:36] T204694: cloudvps: telnet project trusty deprecation - https://phabricator.wikimedia.org/T204694 [15:56:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] ircecho: Convert script to python3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [15:56:45] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) [15:59:10] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) (owner: 10Muehlenhoff) [16:00:07] (03CR) 10Paladox: ircecho: Convert script to python3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [16:00:23] (03PS1) 10BBlack: mock_etc: warn about making puppet changes first [dns] - 10https://gerrit.wikimedia.org/r/480529 [16:01:46] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10akosiaris) I like black too but from but from https://black.readthedocs.io/en/stable/installation_and_usage.html it tied to having python 3.6 installed. ` Black can be instal... 
[16:02:06] (03CR) 10BBlack: [C: 03+2] mock_etc: warn about making puppet changes first [dns] - 10https://gerrit.wikimedia.org/r/480529 (owner: 10BBlack) [16:02:23] (03PS2) 10BBlack: mock_etc: warn about making puppet changes first [dns] - 10https://gerrit.wikimedia.org/r/480529 [16:02:31] (03PS4) 10Kosta Harlan: Production configuration for GrowthExperiments Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480106 (https://phabricator.wikimedia.org/T211991) [16:02:44] (03CR) 10BBlack: [V: 03+2 C: 03+2] mock_etc: warn about making puppet changes first [dns] - 10https://gerrit.wikimedia.org/r/480529 (owner: 10BBlack) [16:03:19] (03PS1) 10Bstorm: sonofgridengine: change the killmode of sge_shadowd to process [puppet] - 10https://gerrit.wikimedia.org/r/480530 (https://phabricator.wikimedia.org/T211258) [16:04:38] volans: em dash? [16:05:02] mobrovac: I've sent https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/480524 ;) [16:05:34] ah! [16:05:36] hehehe [16:05:38] neat find volans [16:06:37] how do people keep on typing it ? [16:06:42] something on Mac OS X ? [16:07:21] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) [16:07:24] copy paste? windows? 
dunno [16:07:29] !log starting codfw row A recabling - T210447 [16:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:34] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 [16:07:43] mobrovac: didn't find it, ircecho did crashing on it :D [16:07:59] :P [16:11:13] (03PS2) 10Ottomata: Add proxy info to toolkit analyzer cron job [puppet] - 10https://gerrit.wikimedia.org/r/480510 (https://phabricator.wikimedia.org/T209399) (owner: 10Michael Große) [16:11:22] (03CR) 10Kosta Harlan: Production configuration for GrowthExperiments Help Panel (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480106 (https://phabricator.wikimedia.org/T211991) (owner: 10Kosta Harlan) [16:11:27] (03CR) 10Ottomata: [C: 03+2] "Nice! Looks just right :)" [puppet] - 10https://gerrit.wikimedia.org/r/480510 (https://phabricator.wikimedia.org/T209399) (owner: 10Michael Große) [16:12:03] (03PS7) 10BBlack: cache_text: Vary for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/478680 (https://phabricator.wikimedia.org/T206339) [16:12:17] Phab seems down [16:12:19] I got 503: Request from 87.138.110.76 via cp3042 cp3042, Varnish XID 170596440 Error: 503, Backend fetch failed at Tue, 18 Dec 2018 16:11:45 GMT [16:12:26] So does prod. [16:12:43] I get [16:12:45] XioNoX: ^^^ [16:12:45] If you report this error to the Wikimedia System Administrators, please include the details below. [16:12:45] Request from 188.29.164.143 via cp3041 cp3041, Varnish XID 513507772 [16:12:46] Error: 503, Backend fetch failed at Tue, 18 Dec 2018 16:12:04 GMT [16:12:51] can be related? [16:12:52] Anyone doing anything funky? [16:12:53] commons and en wiki too [16:13:06] Request from 82.5.41.73 via cp3041 cp3041, Varnish XID 471859384 [16:13:08] Error: 503, Backend fetch failed at Tue, 18 Dec 2018 16:12:01 GMT [16:13:10] Shouldn't be affected by codfw stuff? 
[16:13:29] similar issue for me [16:13:30] Phabs back [16:13:30] all works fine for me [16:13:36] yeh that's esams and it's routed to eqiad [16:13:44] Aha, back for me. [16:13:48] same :) [16:14:01] esams hiccup? [16:14:13] phab doesn't live in esams, though [16:14:28] Does routing go through there? [16:14:42] maybe EU<->US hiccup on the broader internet, would affect both esams->eqiad for the cache 503, and also EU users -> phab(in eqiad) [16:14:48] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [16:14:53] PROBLEM - Host cp1075 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:00] phabricator uses the cache [16:15:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:15:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:15:33] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:15:37] oh right, I don't know what I was thinking, re caches+phab [16:15:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:15:53] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=eqsin 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:15:53] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:16:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:16:26] XioNoX: I expect the codfw/ulsfo alerts, but we should pause here and figure out why the rest [16:17:00] actually even ulsfo is unexpected I guess [16:17:03] bblack: ulsfo shouldn't alert [16:17:16] we didn't depool it, only redirect to eqiad [16:17:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:17:21] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [16:17:25] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:17:28] this seems pretty broad though, both text and upload clusters more or less implies it's not an applayer problem [16:17:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [16:17:35] looks like about ~30s of 500s [16:17:37] https://logstash.wikimedia.org/goto/4e21dd20a4657995e8bab929b3179e98 [16:17:51] PROBLEM - cache_text: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20fullscreenorgId=1var-instance=webrequestvar-host=All [16:18:14] cp1075 crashed too, which may be causing some confusion and a temporary spike for text [16:18:17] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:18:19] but not upload [16:18:30] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [16:18:39] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:18:57] 10Operations, 10monitoring: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10MoritzMuehlenhoff) [16:19:15] could it be related to the depool and or redirect of ulsfo/eqsin to eqiad? [16:19:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:19:31] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:19:32] possibly! [16:19:42] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10MoritzMuehlenhoff) [16:19:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:19:48] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) 05Open→03Resolved This is done (and the task is getting too big as well), I created T212231 for some followup work. [16:19:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:19:59] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2016_v4 [16:20:07] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20fullscreenorgId=1var-instance=webrequestvar-host=All [16:20:11] get ready for the ipsec alert shower [16:20:11] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 71 not-conn: cp2005_v4 [16:20:17] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2012_v4 [16:20:25] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [16:20:41] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_SHA2_High_Assurance_Server_CA.crt] [16:20:55] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled] [16:21:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:21:16] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&from=now-1h&to=now [16:21:27] ^ there was also a HUGE spike of PURGE requests in there, what's driving that? 
[16:21:43] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:21:53] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:21:59] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:21:59] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:01] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:03] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:03] PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:04] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:13] PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:13] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:15] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:15] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:17] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:25] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:37] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:37] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:37] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:37] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: 
cp1075_v4, cp1075_v6 [16:22:39] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:39] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:39] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:43] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:45] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:45] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:45] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:53] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:22:55] !log powercycle cp1075 from console (crashed, apparently) [16:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:57] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [16:22:57] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1075_v4, cp1075_v6 [16:24:10] bblack: we also had a temporary spike of traffic going over the eqsin-ulsfo tunnel (instead of eqsin-codfw transport), maybe a transient issue within telia? [16:24:13] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/13994/" [puppet] - 10https://gerrit.wikimedia.org/r/480252 (https://phabricator.wikimedia.org/T211859) (owner: 10Herron) [16:24:45] XioNoX: well we are working on codfw... 
[16:25:02] right now I have no idea what's going, far too many alerts [16:25:07] RECOVERY - cache_text: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20fullscreenorgId=1var-instance=webrequestvar-host=All [16:25:12] bblack: https://librenms.wikimedia.org/graphs/lazy_w=648/to=1545150300/device=92/type=device_bits/from=1545128700/legend=no/ [16:25:44] but I think the PURGE spike is probably a big sign. Other random network-layer things wouldn't cause that I don't think, unless they're looping multicasts. [16:25:57] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [16:26:09] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20fullscreenorgId=1var-instance=webrequestvar-host=All [16:26:13] RECOVERY - Host cp1075 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:26:15] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK [16:26:17] RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 36 ESP OK [16:26:17] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 36 ESP OK [16:26:17] RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 36 ESP OK [16:26:17] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 36 ESP OK [16:26:17] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 36 ESP OK [16:26:19] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 36 ESP OK [16:26:21] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 52 ESP OK [16:26:23] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 36 ESP OK [16:26:23] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 52 ESP OK 
[16:26:23] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 36 ESP OK [16:26:31] RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 36 ESP OK [16:26:33] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [16:26:33] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [16:26:35] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 36 ESP OK [16:26:45] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 36 ESP OK [16:26:49] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK [16:26:51] RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 36 ESP OK [16:26:53] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 36 ESP OK [16:26:55] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 36 ESP OK [16:26:55] RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 36 ESP OK [16:26:55] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 36 ESP OK [16:27:05] RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 36 ESP OK [16:27:05] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 52 ESP OK [16:27:07] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [16:27:09] RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 36 ESP OK [16:27:09] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 52 ESP OK [16:27:17] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK [16:27:21] so, let's dive into the purge thing first... 
[16:27:41] it ramps off around 16:12, to over 10x its normal rate [16:27:59] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?panelId=6&fullscreen&orgId=1&from=1545148922203&to=1545150415240&var-site=All&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5 [16:28:45] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200) [16:29:02] I think the first question here is probably: is this really a 10x jump in the rate of PURGE coming out of various apps/services, or did we cause some kind of multicast storm at the network layer, which is ongoing? [16:29:49] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Performance-Team, 10serviceops: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) a:03EvanProdromou [16:29:57] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [16:30:08] I think we've seen purge storms from the app, although not quite this bad, triggered by the editing of massively-referenced templates? [16:30:38] bblack: whatever caused that spike of multicast (visible on cr1/2-codfw graphs, it's not happening anymore [16:30:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [16:31:04] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Performance-Team, 10serviceops: Use a multi-dc aware store for ObjectCache's MainStash if needed. 
- https://phabricator.wikimedia.org/T212129 (10daniel) Calls to getMainObjectStash: https://codesearch.wmflabs.org/search/?q=getMainObjectStash&i=nope... [16:32:01] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (10RobH) 05Open→03Resolved a:03RobH This has been filled via #procurement task T204177, resolving. [16:32:29] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10RobH) a:03RobH [16:32:44] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=1545146941180&to=1545150690488 [16:33:18] ^ jobq stats mostly look decent-ish, but it does have a corresponding dropout around 16:12-ish for "dequeue rate", hard to say if cause or effect [16:33:56] (could be the effect of some general network issues, or could be a pointer than the cause was a huge job being dequeued and running and generating a ton of purges?) [16:34:41] XioNoX: the multicast is still coming at us, afaics: [16:34:44] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?panelId=6&fullscreen&orgId=1&from=now-1h&to=now&var-site=All&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5 [16:34:55] bblack: it's possible that the recabling caused a loop in the VC fabric, I don't see it on asw-a graphs, but I do see ton of multicast going in/out of many switches ports [16:35:30] XioNoX: but from the Varnish point of view, this elevated multicast traffic is hitting all traffic hosts at all datacenters [16:35:41] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 56 ESP OK [16:36:01] so it's not just a codfw thing. although perhaps there's a plausible scenario in which a codfw loop could cause it to loop multicasts that head to everywhere else, too? 
[16:36:17] it's not growing if it's a loop though, but perhaps a fixed window of those packets are still looping, in that scenario [16:36:31] bblack: is it possible to know the source? from the network graphs on routers ,multicast levels are back to normal after a big spike [16:36:43] (03PS1) 10Hashar: Option to use Docker cache [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/480533 (https://phabricator.wikimedia.org/T210438) [16:36:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:37:00] !log librsvg* 2.40.20-3+wmf1+stretch1 uploaded to components/thumor to stretch-wikimedia - T209886 [16:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:03] T209886: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 [16:37:05] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 72 ESP OK [16:37:08] you can use the site dropdown on my varnish graph link above, they're still all getting them (the multicasts) [16:37:18] which implies they're traversing routers still [16:38:21] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 56 ESP OK [16:39:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:41:05] bblack: is it possible that the graph has any kind of latency? I multicast levels are back to normal on all switches and routers afaik [16:41:45] it's possible, yes [16:41:51] good point! 
[16:42:02] but we must've queued a fairly massive spike that's playing out then [16:42:20] scb2003 and scb2005 stand out as the top source IPs of multicast, in a small sample [16:42:56] changeprop runs there, so that makes sense [16:43:20] I'm still not sure if cp/jobqueue emitted a massive storm of multicast purge on its own, or we looped some of its normal output temporarily [16:43:27] (and forwarded all that mess globally?) [16:43:36] (03CR) 10Dzahn: "how could we test that?" [puppet] - 10https://gerrit.wikimedia.org/r/470726 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [16:44:28] root@cp1077:~# cat /tmp/vhtcpd.stats [16:44:28] start:1540547188 uptime:4604250 purgers:2 recvd:9629400210 bad:4 filtered:0 [16:44:31] Purger0: input:9629400206 failed:0 q_size:248 q_mem:100690918 q_max_size:22195539 q_max_mem:3744223276 [16:44:34] Purger1: input:9629399957 failed:0 q_size:322 q_mem:25204550 q_max_size:6576033 q_max_mem:984732058 [16:44:58] so cp1077 for example has ~100MB worth of tiny purge packets enqueue in memory and playing out, still [16:45:22] hmm that doesn't seem right with that tiny q_size [16:45:41] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10RobH) [16:45:43] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) 05Stalled→03Resolved So yesterday eqsin sent out the email subject: COMPLETED - Non-Critical Upstream Provider Maintenance-SG Metro Area... [16:46:07] bblack: I still think we should finish the last few steps of the recabling, we're now at a stage where some switches are only connected to 1 spine. From all the steps the only one that could have caused a loop has been unplugged since. 
[16:46:12] maybe it needs a while to clean up from that (I think it has some efficiency hacks where it may hold onto queue memory allocations after a spike for quite a while) [16:46:37] RECOVERY - puppet last run on ms-be1036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:46:57] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:47:02] q_max_size says at one point it had ~22M purges enqueued, and q_max_mem puts that at 3.7GB of enqueued purge data [16:47:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:49:56] I rescheduled a couple of icinga checks to see if they finish clearing up [16:50:13] thx [16:50:37] XioNoX: anyways, yes, proceed I guess [16:50:40] 10Operations, 10Performance-Team, 10monitoring, 10Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (10CDanis) @Peter please let me know if you can still repro the 502s with the datasource in 'browser' mode. Is it possible that this correlates with the upgrad... 
[16:50:43] (03PS1) 10Ladsgroup: Add WikibaseQualityConstraints configs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480535 (https://phabricator.wikimedia.org/T209922) [16:50:45] ok [16:51:01] XioNoX: but assuming all continues smoothly, we should really try to figure out whether this was applayer-induced or network-induced [16:51:18] because if it's network, that's a huge risk, since it apparently stormed our whole global network due to one local problem [16:51:25] yeah, definitelly needs investigating [16:51:53] and why it didn't happen during the other 2 recabling (if recabling related) [16:51:59] I wonder if there's maybe some stuff we can do at the routers to mitigate the spread of that, in the case of misbehaving switches at one site [16:53:18] for the ~16:12 -> ~16:18 spike in eqiad most clusters have network rx spikes https://grafana.wikimedia.org/d/000000608/datacenter-overview?orgId=1&from=1545149150859&to=1545150443547&var-datasource=eqiad%20prometheus%2Fops&var-cluster=All [16:53:52] rx from the host's POV that is [16:54:47] (03PS1) 10Dzahn: cache/trafficserver: switch doc.wikimedia.org to doc1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/480536 (https://phabricator.wikimedia.org/T211974) [16:55:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:55:08] (03PS2) 10Hashar: Attempt to pull images before building [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) [16:56:00] godog: yeah the multicast may have flooded everywhere, which would explain a lot of the inexplicable fallouts [16:56:03] (03CR) 10Jeena Huneidi: [C: 03+1] blubberoid: Bump CPU limit to 1800m [deployment-charts] - 10https://gerrit.wikimedia.org/r/480484 (owner: 10Alexandros Kosiaris) [16:56:14] it also saturated some links at least very briefly I'm sure [16:56:29] (03CR) 10Hashar: "I have moved the part to support --cache to parent change: https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/480533" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) (owner: 10Hashar) [16:56:50] (03CR) 10Ottomata: Add remaining kerberos wrapped commands (032 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [16:57:25] yeah, and caused routing protocols to flap [16:57:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:57:31] DDOS_PROTOCOL_VIOLATION_SET: Protocol resolve:mcast-v4 is violated at fpc 0 for 717 times, started at 2018-12-18 16:11:41 UTC [16:57:57] so the routers caught it and probably helped mitigate it [16:58:06] but their default values are very lax [16:58:14] (03CR) 10Ottomata: Add remaining kerberos wrapped commands (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [16:58:47] (03PS1) 10Herron: add interface::add_ip6_mapped to default production node definition [puppet] - 10https://gerrit.wikimedia.org/r/480537 (https://phabricator.wikimedia.org/T102099) [16:59:39] 
bblack: indeed, the other aspect of this I was looking at is related to restbase async processing, which iirc happens in codfw (only?) and presumably not depooled/changed as part of frontend dns [16:59:55] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) [17:00:01] (03PS2) 10Herron: logstash::collector: pin curator to components/spicerack on stretch [puppet] - 10https://gerrit.wikimedia.org/r/480252 (https://phabricator.wikimedia.org/T211859) [17:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:02:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:02:19] (03CR) 10Filippo Giunchedi: [C: 03+1] ircecho: skip message if unable to decode it [puppet] - 10https://gerrit.wikimedia.org/r/480509 (owner: 10Volans) [17:03:22] (03CR) 10Hashar: "We will have to first migrate CI jobs to publish to that new host :)" [puppet] - 10https://gerrit.wikimedia.org/r/480536 (https://phabricator.wikimedia.org/T211974) (owner: 10Dzahn) [17:03:25] (03PS1) 10Dzahn: doc: add ensure /srv/org/wikimedia/doc exists and has backup [puppet] - 10https://gerrit.wikimedia.org/r/480539 (https://phabricator.wikimedia.org/T211974) [17:03:56] (03CR) 10Filippo Giunchedi: "We get elasticsearch from upstream's in thirdparty/elastic55 ATM, I'd expect curator to also come from there. 
This would also ensure that " [puppet] - 10https://gerrit.wikimedia.org/r/480252 (https://phabricator.wikimedia.org/T211859) (owner: 10Herron) [17:03:59] (03CR) 10jerkins-bot: [V: 04-1] doc: add ensure /srv/org/wikimedia/doc exists and has backup [puppet] - 10https://gerrit.wikimedia.org/r/480539 (https://phabricator.wikimedia.org/T211974) (owner: 10Dzahn) [17:04:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:05:04] (03PS2) 10Dzahn: doc: ensure /srv/org/wikimedia/doc exists and has backup [puppet] - 10https://gerrit.wikimedia.org/r/480539 (https://phabricator.wikimedia.org/T211974) [17:07:33] (03PS3) 10Dzahn: doc: ensure /srv/org/wikimedia/doc exists and has backup [puppet] - 10https://gerrit.wikimedia.org/r/480539 (https://phabricator.wikimedia.org/T211974) [17:08:35] (03CR) 10Dzahn: [C: 03+2] doc: ensure /srv/org/wikimedia/doc exists and has backup [puppet] - 10https://gerrit.wikimedia.org/r/480539 (https://phabricator.wikimedia.org/T211974) (owner: 10Dzahn) [17:09:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:14:45] (03PS10) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [17:15:12] (03CR) 10jerkins-bot: [V: 04-1] ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:17:23] (03PS11) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [17:20:45] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10thcipriani) [17:20:48] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10thcipriani) [17:21:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:23:16] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10RobH) [17:26:00] (03PS2) 10Cwhite: hiera: add alerting_host cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) [17:26:34] (03CR) 10Cwhite: hiera: add alerting_host cluster definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [17:26:36] so the recabling caused 
something to happen with multicast routing which led to the purge packets being sprayed everywhere and likely duplicated a bunch (but not infinitely)? is that what we think so far? [17:27:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:30:44] (03CR) 10Arturo Borrero Gonzalez: "The control group based kill doesn't kill every proc in the cgroup? Is sg_shadowd able to spawn procs outside of the cgroup?" [puppet] - 10https://gerrit.wikimedia.org/r/480530 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [17:30:51] (03PS2) 10Volans: ircecho: skip message if unable to decode it [puppet] - 10https://gerrit.wikimedia.org/r/480509 [17:31:37] (03CR) 10Volans: [C: 03+2] ircecho: skip message if unable to decode it [puppet] - 10https://gerrit.wikimedia.org/r/480509 (owner: 10Volans) [17:32:07] (03CR) 10Arturo Borrero Gonzalez: "> The control group based kill doesn't kill every proc in the cgroup?" [puppet] - 10https://gerrit.wikimedia.org/r/480530 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [17:32:13] (03PS3) 10Cwhite: hiera: add alerting cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) [17:32:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:34:27] !log triggered restart of ircecho on icinga1001 while applying https://gerrit.wikimedia.org/r/480509 [17:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:33] that was me ^^6 [17:34:39] (icinga-wm) [17:34:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:34:54] (03PS1) 10Hashar: admin: remove CI sudo rule for "nodepool" [puppet] - 10https://gerrit.wikimedia.org/r/480546 (https://phabricator.wikimedia.org/T209361) [17:35:01] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add alerting cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [17:35:27] (03PS5) 10Cwhite: profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) [17:35:33] (03CR) 10Filippo Giunchedi: [C: 03+1] ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:35:56] 10Operations, 10TechCom, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10mobrovac) >>! In T212189#4831314, @daniel wrote: > @mobrovac Please note that the term box is shown based on user preferences (languages spoke... [17:36:05] (03CR) 10Cwhite: profile: enable statsd_exporter and add matching rules to logstash::collector (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:37:14] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10Dzahn) @Hashar Here you go: - doc1001.eqiad.wmnet - stretch - php 7.2, not just 7.0 - php-fpm, not mod_php anymore... 
[17:38:27] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:39:55] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Refactor pipeline build step to be more isolated/secure/scalable - https://phabricator.wikimedia.org/T195050 (10thcipriani) 05Open→03Invalid The model of execution that this task refers to is not longer used/valid. [17:42:21] (03PS2) 10BBlack: [WIP] authdns-local-update: use check-gdnsd/gen-zones [puppet] - 10https://gerrit.wikimedia.org/r/480477 [17:44:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:46:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:47:16] (03PS1) 10WMDE-leszek: Added setting to adjust the range of PropertyIDs using new link formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480550 (https://phabricator.wikimedia.org/T201838) [17:47:18] (03PS1) 10WMDE-leszek: Beta: use the new link formatter to format P1 on wikidata beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480551 (https://phabricator.wikimedia.org/T201838) [17:47:19] PROBLEM - Host heze is DOWN: PING CRITICAL - Packet loss = 100% [17:47:24] (03PS2) 10Bstorm: sonofgridengine: change the killmode of sge_shadowd to process [puppet] - 10https://gerrit.wikimedia.org/r/480530 (https://phabricator.wikimedia.org/T211258) [17:47:29] RECOVERY - Host heze is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [17:47:52] 10Operations, 10Release Pipeline (Blubber): blubber template for nodejs should allow defining configuration files to copy to the container - https://phabricator.wikimedia.org/T211580 (10thcipriani) [17:48:01] it's possible that codfw rack A8 had a blip [17:48:39] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [17:49:06] (03CR) 10Bstorm: "> I just saw your comment in T211258, sorry for the noise" [puppet] - 10https://gerrit.wikimedia.org/r/480530 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [17:49:10] !log bounce rsyslog on lithium, tls listener timeout [17:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:49:38] (03CR) 10Herron: "> We get elasticsearch from upstream's in thirdparty/elastic55 ATM," [puppet] - 
10https://gerrit.wikimedia.org/r/480252 (https://phabricator.wikimedia.org/T211859) (owner: 10Herron) [17:49:40] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: change the killmode of sge_shadowd to process [puppet] - 10https://gerrit.wikimedia.org/r/480530 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [17:50:26] jouncebot: next [17:50:26] In 0 hour(s) and 9 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1800) [17:51:09] !log deploying php7 cache-splitter patch to cache_text - https://gerrit.wikimedia.org/r/c/operations/puppet/+/478680 - T206339 [17:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:13] T206339: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 [17:51:26] (03PS8) 10BBlack: cache_text: Vary for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/478680 (https://phabricator.wikimedia.org/T206339) [17:51:37] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1040 days) [17:52:22] (03CR) 10BBlack: [C: 03+2] cache_text: Vary for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/478680 (https://phabricator.wikimedia.org/T206339) (owner: 10BBlack) [17:52:50] (03PS8) 10Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 [17:53:00] bstorm_: can merge yours too? [17:53:07] Please! [17:53:10] I was about to :) [17:53:13] done! [17:53:23] thank you [17:53:57] oh, now really done heh. I typed "yes" instead of "multiple" [17:57:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:58:00] !log shutdown fpc4 for replacement - T210447 [17:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:03] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 [17:59:48] (03CR) 10CRusnov: [C: 03+1] "LGTM" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/479358 (https://phabricator.wikimedia.org/T182028) (owner: 10Volans) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T1800). [18:01:41] (03PS9) 10Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 [18:02:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:03:21] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:04:00] wdqs paged [18:04:04] yep [18:04:13] onimisionipe: related to your work? [18:04:27] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.015 second response time [18:04:29] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.002 second response time [18:04:31] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:04:34] was gonna ask that. is that why you needed that merge? 
[18:04:45] * apergos peeksin [18:05:04] I think the pending wdqs merge is a functional no-op that's just refactoring things, from the sounds of the existing code reviews [18:05:17] bblack: nope.. not merged yet [18:05:22] its a noop anyway [18:05:45] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:05:47] !log stat1004:~# umount /mnt/T211327 [18:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:54] bblack: yes it is [18:06:57] so we've got a flap on wdqs lvs, an alert on wdqs1004 giving 503s, and then also some k8s latency hit in eqiad [18:07:04] (03PS10) 10Elukey: [WIP] Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 [18:07:19] I can't see how those first two are related to the last, unless we've triggered a network issue again or just bad luck timing [18:08:13] anyways, proceeding with the no-op patch in the meantime [18:08:14] I don't think they are related, it's just that icinga/icinga-wm likes to realize stuff at the same time. 
[18:08:24] <_joe_> It can be the check went to wdqs1004 before pybal could depool it [18:08:37] (03CR) 10BBlack: [C: 03+2] wdqs: reduce hiera configs via profile defaults [puppet] - 10https://gerrit.wikimedia.org/r/479376 (https://phabricator.wikimedia.org/T210431) (owner: 10Mathew.onipe) [18:08:43] (03PS4) 10BBlack: wdqs: reduce hiera configs via profile defaults [puppet] - 10https://gerrit.wikimedia.org/r/479376 (https://phabricator.wikimedia.org/T210431) (owner: 10Mathew.onipe) [18:08:50] yeah probably [18:08:56] <_joe_> someone should look at wdqs1004 :) [18:09:06] _joe_: looking [18:10:33] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.032 second response time [18:10:54] so the one thing i knew about how to fix these was "restart wdqs-blazegraph" but i stepped back [18:11:35] holy crap, profile::wdqs has 27 classparams [18:11:46] anyways, merged! [18:11:50] mutante: Oh.. I see you did already [18:12:06] but they all have data types :) [18:12:20] onimisionipe: no, i did not restart it, just "status" [18:12:54] and it had told me it was running since 47s [18:14:04] mutante: that's weird. It must has restarted probably some recovery [18:15:14] bblack: yes! I know. WDQS puppet is complex. enough refactoring has been done already to reduce it. 
[18:16:11] oh, lool, an error is caused by a user agent that is called [18:16:17] "Toolforge - legacy code" [18:18:27] PROBLEM - IPMI Sensor Status on elastic2026 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] [18:18:31] i meant to say "look" not "lool" and i don't know if that would cause it but see those "MalformedQueryException" ones [18:20:43] onimisionipe: [wdqs1004:~] $ grep -E 'ERROR.*Toolforge' /var/log/syslog [18:21:18] mutante: I see it in wdqs-blazegraph logs [18:21:59] is that something new or normal background noise though [18:22:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:24:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:25:11] SMalyshev: ^ [18:27:29] PROBLEM - puppet last run on elastic2046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:28:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:28:54] onimisionipe: MalformedQueryException is normal, that's somebody sending broken sparql [18:29:27] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) [18:29:32] if there's a real lot of them (like hundreds) throttling should ban it, but occasional one is fine [18:29:49] SMalyshev: Cool! Thanks! 
[18:30:28] SMalyshev: it was all from a user agent saying Toolforge - legacy code and timing out [18:30:41] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [18:31:14] one every 3 or 4 seconds or so? [18:32:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:32:05] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) [18:32:23] not anymore right now though [18:32:23] mutante: IIRC that's just a tool that serves user requests - so probably somebody put something weird into the query field there [18:33:09] yea, that's exactly what it looks, the query has "?fakeVariable" in it [18:33:50] and it stopped [18:38:47] (03PS1) 10Ayounsi: Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/480565 [18:38:52] (03PS1) 10Ayounsi: Revert "Depool codfw for row A recabling" [dns] - 10https://gerrit.wikimedia.org/r/480566 [18:39:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:39:36] (03PS6) 10Dzahn: ci::master: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453554 [18:39:52] (03CR) 10Ayounsi: [C: 03+2] Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/480565 (owner: 10Ayounsi) [18:40:00] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool codfw for row A recabling" [dns] - 10https://gerrit.wikimedia.org/r/480566 (owner: 10Ayounsi) [18:40:02] (03PS2) 10Ayounsi: Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/480565 [18:40:07] (03PS2) 10Ayounsi: Revert "Depool codfw for row A recabling" [dns] - 10https://gerrit.wikimedia.org/r/480566 [18:40:59] !log Revert "Redirect eqsin/ulsfo caches to eqiad" - T210447 [18:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:02] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 [18:41:28] (03CR) 10Cwhite: [C: 03+2] hiera: add alerting cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [18:41:35] (03PS4) 10Cwhite: hiera: add alerting cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) [18:42:17] !log repool codfw - T210447 [18:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:45:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:45:49] (03PS1) 10Bstorm: sonofgridengine: change back to simple foreground service for the master [puppet] - 10https://gerrit.wikimedia.org/r/480567 (https://phabricator.wikimedia.org/T211258) [18:46:08] (03PS1) 10Ayounsi: Depool ulsfo for PDU work [dns] - 10https://gerrit.wikimedia.org/r/480568 (https://phabricator.wikimedia.org/T209101) [18:47:14] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: change back to simple foreground service for the master [puppet] - 10https://gerrit.wikimedia.org/r/480567 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [18:48:02] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for PDU work [dns] - 10https://gerrit.wikimedia.org/r/480568 (https://phabricator.wikimedia.org/T209101) (owner: 10Ayounsi) [18:48:41] !log depool ulsfo - T209101 [18:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:46] T209101: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 [18:48:50] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10Dzahn) - added "doc" as an official cluster prefix https://wikitech.wikimedia.org/w/index.php?title=Infrastructure_n... 
[18:49:24] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10Papaul) [18:50:20] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10Papaul) a:05Papaul→03ayounsi [18:51:14] !log installing libx11 security updates [18:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:07] RECOVERY - puppet last run on elastic2046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:54:19] (03PS1) 10Bstorm: sonofgridengine: correct default file configuration [puppet] - 10https://gerrit.wikimedia.org/r/480569 (https://phabricator.wikimedia.org/T211258) [18:54:36] (03PS7) 10Dzahn: ci::master: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453554 [18:55:11] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: correct default file configuration [puppet] - 10https://gerrit.wikimedia.org/r/480569 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [18:55:47] (03CR) 10Dzahn: [C: 03+2] ci::master: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [18:55:55] (03PS8) 10Dzahn: ci::master: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453554 [18:58:30] (03CR) 10Cwhite: [C: 03+2] hiera: add ci cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479845 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [18:58:37] (03PS2) 10Cwhite: hiera: add ci cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479845 (https://phabricator.wikimedia.org/T210486) [18:58:59] (03PS4) 10Ottomata: Remove specific settings from profile::hadoop::common and put into hiera [puppet] - 10https://gerrit.wikimedia.org/r/480245 [18:59:42] (03CR) 10Dzahn: [C: 03+2] "disabled puppet on contint1001, applied on contin2001, noop, applied on 
contint1001, noop" [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [19:00:53] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) [19:02:20] (03CR) 10Ottomata: [C: 03+2] Remove specific settings from profile::hadoop::common and put into hiera [puppet] - 10https://gerrit.wikimedia.org/r/480245 (owner: 10Ottomata) [19:02:24] (03PS5) 10Ottomata: Remove specific settings from profile::hadoop::common and put into hiera [puppet] - 10https://gerrit.wikimedia.org/r/480245 [19:02:27] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove specific settings from profile::hadoop::common and put into hiera [puppet] - 10https://gerrit.wikimedia.org/r/480245 (owner: 10Ottomata) [19:03:18] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) This has been completed 30min before schedule despite 2 issues: * A massive spike of multicast most likely due to the recabling flooded the entire infra causing v... [19:03:26] 10Operations, 10ops-codfw, 10netops: upgrade all codfw switch stacks to include additional 10G switch per row - https://phabricator.wikimedia.org/T196489 (10ayounsi) [19:03:30] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) 05Open→03Resolved [19:04:23] (03PS2) 10Dzahn: ci::httpd: add support for stretch/PHP 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/478125 [19:06:57] !log redirect ns1 back to authdns2001 - T210447 [19:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:00] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 [19:08:19] There is no more deployment windows until the end of the day. I would like to swat a config change. [19:08:25] greg-g: ^ [19:08:41] stephanebisson: what's up? 
[19:08:54] Can I swat config change? [19:08:55] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/13997/" [puppet] - 10https://gerrit.wikimedia.org/r/478125 (owner: 10Dzahn) [19:09:15] stephanebisson: I suppose, I like knowing what it is before saying yes usually :) [19:09:17] link to gerrit? [19:09:37] greg-g: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/480106/ [19:09:48] (03CR) 10Volans: "Reply inline" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/479358 (https://phabricator.wikimedia.org/T182028) (owner: 10Volans) [19:09:53] greg-g: It enables a feature on testwiki so we can start testing it there. [19:10:29] ah, simple enough, go for it [19:10:35] greg-g: thanks [19:11:21] (03CR) 10CRusnov: [C: 03+1] validator: allow to compare run results (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/479358 (https://phabricator.wikimedia.org/T182028) (owner: 10Volans) [19:11:45] PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:59] this is expected ^ [19:12:06] and mr1 is going to alert soon I guess [19:12:24] cf. 
#wikimedia-dcops [19:12:35] (03PS9) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [19:12:41] (03CR) 10Sbisson: [C: 03+2] Production configuration for GrowthExperiments Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480106 (https://phabricator.wikimedia.org/T211991) (owner: 10Kosta Harlan) [19:13:05] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [19:13:07] PROBLEM - Host dns4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:13:07] PROBLEM - Host dns4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:13:21] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [19:13:56] (03Merged) 10jenkins-bot: Production configuration for GrowthExperiments Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480106 (https://phabricator.wikimedia.org/T211991) (owner: 10Kosta Harlan) [19:14:33] PROBLEM - Host mr1-ulsfo.oob is DOWN: CRITICAL - Network Unreachable (198.24.47.102) [19:14:49] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:14:49] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:15:03] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:15:03] PROBLEM - Host lvs4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:15:03] PROBLEM - Host lvs4006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:15:03] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:03] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:20] (03CR) 10Cwhite: [C: 03+2] profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [19:16:26] (03PS6) 10Cwhite: profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 
(https://phabricator.wikimedia.org/T205870) [19:17:01] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:03] PROBLEM - Host bast4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:03] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:03] PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:03] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:03] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:03] RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 74.56 ms [19:17:15] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:25] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:25] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:25] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:25] PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:17:36] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Imarlier) [19:19:01] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:480106|Production configuration for GrowthExperiments Help Panel]] (duration: 00m 52s) [19:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:40] greg-g: Done. Thanks again. 
[19:20:45] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:21:50] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [19:21:55] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms [19:22:09] RECOVERY - Host bast4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.88 ms [19:22:09] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 77.46 ms [19:22:09] RECOVERY - Host cp4030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms [19:22:10] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.89 ms [19:22:10] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.35 ms [19:22:12] (03Abandoned) 10Aaron Schulz: [WIP] Initial debianization [debs/dynomite] - 10https://gerrit.wikimedia.org/r/421447 (owner: 10Aaron Schulz) [19:22:23] RECOVERY - Host cp4026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [19:22:33] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.84 ms [19:22:33] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.82 ms [19:22:33] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.81 ms [19:22:33] RECOVERY - Host cp4027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.81 ms [19:22:33] RECOVERY - Host cp4028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [19:22:33] (03Abandoned) 10Aaron Schulz: [WIP] Add dynomite module and dynomite_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/415789 (owner: 10Aaron Schulz) [19:22:49] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:22:57] (03CR) 10jenkins-bot: Production 
configuration for GrowthExperiments Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480106 (https://phabricator.wikimedia.org/T211991) (owner: 10Kosta Harlan) [19:23:29] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.92 ms [19:23:29] RECOVERY - Host dns4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.81 ms [19:23:56] 10Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (10MoritzMuehlenhoff) I've verified that none of the packages removed in 9.6 are present in our environment. These updates have been fully deployed: ` confuse dom4j libgd2 libopenmpt libtirpc libx11 serf spamassassin... [19:24:57] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms [19:25:13] RECOVERY - Host cp4022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.93 ms [19:25:13] RECOVERY - Host cp4024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.86 ms [19:25:27] RECOVERY - Host lvs4006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.83 ms [19:25:27] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 76.26 ms [19:25:27] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 79.96 ms [19:25:51] (03CR) 10Imarlier: [C: 03+1] Set xenon.period to 5 minutes to improve flamegraph resolution [puppet] - 10https://gerrit.wikimedia.org/r/472032 (owner: 10Aaron Schulz) [19:26:39] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:27:05] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:27:07] Is anyone able to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/472032 ? 
No one on the Perf team has merge rights in puppet. [19:28:03] It's just a quick config change to xenon (collects the flame graphs that appear at https://performance.wikimedia.org ) [19:28:49] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:29:15] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:27] !log migration of ulsfo pdus complete [19:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:40] marlier: yeah I can do it. any expected fallout from e.g. doubling the rate of xenon things? [19:30:47] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [19:30:47] RECOVERY - Host lvs4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.93 ms [19:31:17] (03PS2) 10BBlack: Set xenon.period to 5 minutes to improve flamegraph resolution [puppet] - 10https://gerrit.wikimedia.org/r/472032 (owner: 10Aaron Schulz) [19:32:56] marlier: ah I see that's in commentary already [19:35:04] (03CR) 10BBlack: [C: 03+2] Set xenon.period to 5 minutes to improve flamegraph resolution [puppet] - 10https://gerrit.wikimedia.org/r/472032 (owner: 10Aaron Schulz) [19:37:32] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) We also need to update the cloud-in4 filter in eqiad, cf. T211921 `lang=diff [edit firewall family inet filter cloud-i... 
[19:38:27] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@963d704]: Parse summaries from lead objects only (T202642) [19:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:30] T202642: Investigate how to fix the performance problems caused by CPU bound work on the MCS services - https://phabricator.wikimedia.org/T202642 [19:43:28] (03PS1) 10Dzahn: doc: add rsyncd config to let contint servers push docs [puppet] - 10https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) [19:43:53] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@963d704]: Parse summaries from lead objects only (T202642) (duration: 05m 26s) [19:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:56] T202642: Investigate how to fix the performance problems caused by CPU bound work on the MCS services - https://phabricator.wikimedia.org/T202642 [19:44:24] (03CR) 10jerkins-bot: [V: 04-1] doc: add rsyncd config to let contint servers push docs [puppet] - 10https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: 10Dzahn) [19:45:11] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:15] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.048 second response time [19:48:51] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time [19:49:14] 10Operations, 10Beta-Cluster-Infrastructure: Set up a process for regularly updating the Beta Cluster's copies of common.js, gadgets, and similar files - https://phabricator.wikimedia.org/T212244 (10Whatamidoing-WMF) p:05Triage→03Normal [19:49:55] PROBLEM - HHVM jobrunner on mw1336 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [19:50:03] RECOVERY 
- HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [19:50:57] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Services, 10Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10greg) p:05Triage→03Normal [19:51:07] RECOVERY - HHVM jobrunner on mw1336 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [19:55:43] (03PS2) 10Dzahn: doc: add rsyncd config to let contint servers push docs [puppet] - 10https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) [19:56:01] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [19:57:15] RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [19:57:51] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:53] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.048 second response time [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181218T2000) [20:01:25] PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [20:02:39] RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [20:02:47] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [20:03:59] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [20:10:43] (03PS1) 10Bstorm: sonofgridengine: foreground the shadow master process [puppet] - 10https://gerrit.wikimedia.org/r/480576 
(https://phabricator.wikimedia.org/T211258) [20:11:17] 10Operations, 10TechCom, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) >>! In T212189#4831959, @mobrovac wrote: >>>! In T212189#4831314, @daniel wrote: >> @mobrovac Please note that the term box is shown base... [20:13:06] (03PS1) 10RobH: adding dns entries for ulsfo pdus [dns] - 10https://gerrit.wikimedia.org/r/480577 (https://phabricator.wikimedia.org/T209101) [20:13:18] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [20:13:57] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) I'm keeping this open to track the additional steps of adding the brackets to the ps1 in each cabinet (only ps2 has them presently) and also then... [20:14:05] (03PS2) 10RobH: adding dns entries for ulsfo pdus [dns] - 10https://gerrit.wikimedia.org/r/480577 (https://phabricator.wikimedia.org/T209101) [20:14:38] (03CR) 10RobH: [C: 03+2] adding dns entries for ulsfo pdus [dns] - 10https://gerrit.wikimedia.org/r/480577 (https://phabricator.wikimedia.org/T209101) (owner: 10RobH) [20:19:42] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10aaron) [20:22:37] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10aaron) We need persistence and replication. The plan is to use the same store as session for th... 
[20:29:12] (03CR) 10Ottomata: [C: 03+1] Remove obsolete rsync::repo [puppet] - 10https://gerrit.wikimedia.org/r/470611 (owner: 10Muehlenhoff) [20:29:49] (03CR) 10Ottomata: "Oops, didn't mean to remove jenkins." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [20:32:48] (03CR) 10Ottomata: ":)" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [20:33:24] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) Both PDU sets are online via mgmt interfaces and can be remotely administered. I'm getting the same error code on serial, will troubleshoot it r... [20:33:29] !log otto@deploy1001 Started deploy [eventlogging/analytics@104adb5]: Send JSON string of event for validation errors in EventError [20:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:34] !log otto@deploy1001 Finished deploy [eventlogging/analytics@104adb5]: Send JSON string of event for validation errors in EventError (duration: 00m 04s) [20:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:42] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: foreground the shadow master process [puppet] - 10https://gerrit.wikimedia.org/r/480576 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [20:37:52] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10faidon) >>! In T211750#4817482, @Volans wrote: > On my side I've done a test on the `cumin` codebase with black. The results are: > - all the ignore comments for pylint or any... [20:38:27] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! 
In T212129#4832390, @aaron wrote: > We need persistence and replication. The plan is t... [20:47:35] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10faidon) >>! In T211750#4831642, @akosiaris wrote: > I like black too but from but from https://black.readthedocs.io/en/stable/installation_and_usage.html it tied to having pyth... [20:50:58] (03PS3) 10Hashar: doc: add rsyncd config to let contint servers push docs [puppet] - 10https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: 10Dzahn) [20:51:00] (03PS1) 10Hashar: profile: add spec for profile::doc [puppet] - 10https://gerrit.wikimedia.org/r/480587 [20:53:31] (03CR) 10Hashar: "That helps making sure the profile more or less work before sending patch for review." [puppet] - 10https://gerrit.wikimedia.org/r/480587 (owner: 10Hashar) [20:56:32] (03CR) 10Hashar: "I went ahead and added the 'doc-uploader' local user. I don't think we need a global UID, at least for now." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: 10Dzahn) [20:59:00] (03PS1) 10RobH: Revert "Depool ulsfo for PDU work" [dns] - 10https://gerrit.wikimedia.org/r/480588 [20:59:07] (03PS2) 10RobH: Revert "Depool ulsfo for PDU work" [dns] - 10https://gerrit.wikimedia.org/r/480588 [20:59:36] (03CR) 10RobH: [C: 03+2] Revert "Depool ulsfo for PDU work" [dns] - 10https://gerrit.wikimedia.org/r/480588 (owner: 10RobH) [21:03:17] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [21:07:53] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Eevans) >>! In T212129#4832455, @Joe wrote: > [ ... 
] > This needs a thorough discussion ASAP.... [21:10:50] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) Well that discussion was limited to Session storage, and I stand by the idea that service,... [21:11:43] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.98 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [21:12:14] ^ normal for repool, the ulsfo traffic going back home causes a dropoff in codfw [21:13:01] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/480064 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [21:13:31] (03PS1) 10Ladsgroup: Drop valid_tag and tag_summary from labs replicas [puppet] - 10https://gerrit.wikimedia.org/r/480590 (https://phabricator.wikimedia.org/T212254) [21:19:30] (03PS2) 10Cwhite: proton: enable statsd_exporter and add matching rules to profile::proton [puppet] - 10https://gerrit.wikimedia.org/r/480259 (https://phabricator.wikimedia.org/T205870) [21:19:55] (03CR) 10jerkins-bot: [V: 04-1] proton: enable statsd_exporter and add matching rules to profile::proton [puppet] - 10https://gerrit.wikimedia.org/r/480259 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:28:10] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [21:29:16] (03PS3) 10Cwhite: proton: enable statsd_exporter and add matching rules to profile::proton [puppet] - 10https://gerrit.wikimedia.org/r/480259 (https://phabricator.wikimedia.org/T205870) [21:40:35] (03PS6) 10Cwhite: hiera: remove diamond from labweb role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) [21:41:47] RECOVERY - Varnish traffic drop 
between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.06 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [21:48:46] 10Operations, 10monitoring: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10colewhite) p:05Triage→03Normal [21:55:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to go, now that the memcached collector is removed. No remaining dashboards references these metrics." [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [22:02:46] (03CR) 10Samwilson: [C: 03+1] php72: add RSVG [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/480159 (https://phabricator.wikimedia.org/T151656) (owner: 10MaxSem) [22:21:29] (03PS4) 10BBlack: validator: bail out on wrong IP version [dns] - 10https://gerrit.wikimedia.org/r/478957 (https://phabricator.wikimedia.org/T182028) (owner: 10Volans) [22:21:31] (03PS2) 10BBlack: validator: allow to compare run results [dns] - 10https://gerrit.wikimedia.org/r/479358 (https://phabricator.wikimedia.org/T182028) (owner: 10Volans) [22:21:53] (03PS3) 10BBlack: Eliminate {{zonename}} templating in favor of @Z [dns] - 10https://gerrit.wikimedia.org/r/479889 [22:21:55] (03PS3) 10BBlack: Switch to %include for geolang templating [dns] - 10https://gerrit.wikimedia.org/r/479890 [22:21:57] (03PS3) 10BBlack: Remove trailing serial comments [dns] - 10https://gerrit.wikimedia.org/r/479891 [22:21:59] (03PS7) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [22:22:01] (03PS3) 10BBlack: Minor improvements to check-gdnsd [dns] - 10https://gerrit.wikimedia.org/r/480260 [22:22:50] (03PS3) 10BBlack: [WIP] authdns-local-update: use check-gdnsd/gen-zones [puppet] - 10https://gerrit.wikimedia.org/r/480477 [22:23:45] (03CR) 10BBlack: [C: 03+2] validator: bail out on wrong IP version [dns] - 10https://gerrit.wikimedia.org/r/478957 
(https://phabricator.wikimedia.org/T182028) (owner: 10Volans) [22:23:55] (03CR) 10BBlack: [C: 03+2] validator: allow to compare run results (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/479358 (https://phabricator.wikimedia.org/T182028) (owner: 10Volans) [22:24:32] (03CR) 10BBlack: [C: 03+2] Eliminate {{zonename}} templating in favor of @Z [dns] - 10https://gerrit.wikimedia.org/r/479889 (owner: 10BBlack) [22:25:06] (03CR) 10Cwhite: [C: 03+2] hiera: remove diamond from labweb role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [22:25:28] (03CR) 10BBlack: [C: 03+2] Switch to %include for geolang templating [dns] - 10https://gerrit.wikimedia.org/r/479890 (owner: 10BBlack) [22:25:33] (03PS1) 10Andrew Bogott: nova: added Cern's nova-quota-sync [puppet] - 10https://gerrit.wikimedia.org/r/480656 (https://phabricator.wikimedia.org/T210215) [22:25:42] (03CR) 10BBlack: [C: 03+2] Remove trailing serial comments [dns] - 10https://gerrit.wikimedia.org/r/479891 (owner: 10BBlack) [22:26:05] bblack: thanks for adapting the validator too for {zonename} ;) [22:28:19] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) [22:28:33] (03Abandoned) 10Cwhite: memcached, redis: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [22:28:52] I'm not ready to merge my bigger bits yet, but I wanted to get caught up on all the stuff leading up to it, because cherry-picking this stuff around to test was getting annoying. 
[22:28:56] :) [22:28:57] (03PS2) 10Andrew Bogott: nova: add Cern's nova-quota-sync [puppet] - 10https://gerrit.wikimedia.org/r/480656 (https://phabricator.wikimedia.org/T210215) [22:29:52] (03PS3) 10Andrew Bogott: nova: add Cern's nova-quota-sync [puppet] - 10https://gerrit.wikimedia.org/r/480656 (https://phabricator.wikimedia.org/T210215) [22:30:18] (03PS1) 10Hashar: nodepool is gone, no need to assign a cluster [puppet] - 10https://gerrit.wikimedia.org/r/480659 (https://phabricator.wikimedia.org/T209361) [22:30:38] :) [22:30:40] (03CR) 10Andrew Bogott: [C: 03+2] nova: add Cern's nova-quota-sync [puppet] - 10https://gerrit.wikimedia.org/r/480656 (https://phabricator.wikimedia.org/T210215) (owner: 10Andrew Bogott) [22:31:44] (03CR) 10Andrew Bogott: [C: 03+2] "Thanks! I missed this one." [puppet] - 10https://gerrit.wikimedia.org/r/480659 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:31:53] (03PS2) 10Andrew Bogott: nodepool is gone, no need to assign a cluster [puppet] - 10https://gerrit.wikimedia.org/r/480659 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:32:08] relatedly, I donno if you saw, but I did rework https://gerrit.wikimedia.org/r/c/operations/dns/+/479892/7/utils/gen-zones.py for pathlib. pathlib is pretty nice in the abstract, but it has a few rough edges (I'd file bugs, but there's already bugs/discussions/patches ongoing that will eventually hit a future python stdlib) [22:32:51] the key thing is they didn't think hard enough about the use-cases for manipulating/reading symlinks for software whose use of the Path module might not be contained all within Python. [22:33:03] (03PS1) 10Hashar: nodepool: cleanup database related settings [puppet] - 10https://gerrit.wikimedia.org/r/480661 (https://phabricator.wikimedia.org/T212230) [22:33:45] (as in, the filesystem is an interface to work with other software, which I think must be common, and therefore the python software wants to be super explicit about symlinks. 
But pathlib tries too hard to make them invisible (can be worked around), and doesn't offer readlink (requires still using legacy os.readlink) [22:33:50] ) [22:34:18] but other than that, pathlib is pretty awesome over os.path and friends. [22:35:11] no, didn't see that patch yet, sorry. Yeah pathlib is in general much more usable than the old tooling, but I didn't use it yet in complex cases [22:35:31] *PS yet [22:35:32] (03PS1) 10Hashar: cumin: remove nodepool from misc-releng group [puppet] - 10https://gerrit.wikimedia.org/r/480663 (https://phabricator.wikimedia.org/T209361) [22:36:23] so in the *nix APIs for this, we have stat() and lstat() because such things are tricky. stat() sees through symlinks and reports on the target, whereas lstat() gives you a raw view with the stats of the actual symlink if it's a symlink. [22:36:39] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/480663 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:37:44] abstractions often go this way, where the whole abstraction looks more like stat() than lstat()'s view of the world. So long as there is some kind of is_symlink(), you can still work around that in general (notice in gen-zones, we have lots of is_symlink() spread around conditionals, to be careful of cases where is_dir() or is_file() would've said true for symlinks and we don't want that) [22:38:10] volans: guess you can just merge it, I am not sure who else would review the regex change in cumin :) [22:38:34] i'll look at it if you want more reviews :) [22:38:50] hashar: sure, just don't have my yubikey at hand, chaomodus wanna do the honor? :) [22:39:03] ah, another reviewer \o/ [22:39:11] but without a readlink(), it's hard to be able to check some things, especially in a security sensitive way (or in this case, in a way that prevents broken stuff from being deployed by enforcing similar rules). [22:39:51] the merging part?
okay [22:39:55] bblack: isn't Path.resolve() the equivalent of readlink()? [22:40:05] readlink() is the only way you can really get to the heart of "ok this file is a symlink sitting in directory foo, but where *exactly* does it point at, and is that absolute or relative" [22:40:31] ah ok resolve() returns an absolute path so does too much for you? [22:40:45] yes, and it sees through multiple layers of link indirection as well [22:40:50] ack [22:41:00] for some use cases, only the raw value of the immediate symlink in front of us will do [22:42:08] (in this case, we actually want to allow symlinks, but constrain that they not point at other symlinks (double-indirection) or non-files, and that they have no path components to them (stay within same directory)). [22:42:20] you can do some of that via resolve, but it leaves holes [22:42:43] (03CR) 10CRusnov: [C: 03+1] "Looks fine!" [puppet] - 10https://gerrit.wikimedia.org/r/480663 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:42:46] anyways, there's already upstream bugs about it and nothing to do here, just rambling on a topic :) [22:44:37] this rambling is on me, I suggested that :) [22:44:48] (03PS1) 10Cwhite: hiera: add management cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/480664 (https://phabricator.wikimedia.org/T210486) [22:45:13] chaomodus: you can go ahead and +2+submit+puppet-merge that ;) [22:46:45] kk.
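The symlink rules described in the pathlib discussion above (allow symlinks, but only ones whose raw target is a bare filename in the same directory, pointing at a regular file and not at another symlink) can be sketched roughly as follows. This is only an illustration under assumptions, not the code from gen-zones.py: the names symlink_problem and demo, the returned messages, and the file names are all hypothetical, and a POSIX filesystem is assumed. It uses os.readlink for the raw link value; Python 3.9 and later also provide Path.readlink().

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional


def symlink_problem(link: Path) -> Optional[str]:
    """Return why `link` breaks the rules, or None if it is acceptable.

    Hypothetical helper: the rule set follows the chat discussion,
    the function name and messages are made up for illustration.
    """
    # is_symlink() inspects the link itself (lstat-style view).
    if not link.is_symlink():
        return "not a symlink"
    # Raw value of this one link, with no resolution of any kind.
    raw = os.readlink(link)
    # A bare filename equals its own basename; "a/b", "/abs" and "dir/" do not.
    if os.path.basename(raw) != raw or raw in ("", ".", ".."):
        return "target has path components"
    target = link.parent / raw
    # Check for double indirection before is_file(), which follows symlinks.
    if target.is_symlink():
        return "target is itself a symlink"
    if not target.is_file():
        return "target is not a regular file"
    return None


def demo() -> Dict[str, Optional[str]]:
    """Build a throwaway directory with a few links and check each one."""
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "zone").write_text("data\n")
        (d / "sub").mkdir()
        (d / "sub" / "other").write_text("x\n")
        os.symlink("zone", d / "good")          # same dir, regular file
        os.symlink("good", d / "double")        # link to a link
        os.symlink("sub/other", d / "escapes")  # leaves the directory
        os.symlink("sub", d / "todir")          # points at a directory
        names = ("good", "double", "escapes", "todir", "zone")
        return {name: symlink_problem(d / name) for name in names}


if __name__ == "__main__":
    for name, problem in demo().items():
        print(name, "->", problem or "ok")
```

Checking the raw readlink value is what makes the "no path components" rule enforceable: Path.resolve() would return an absolute path with every layer of indirection already followed, so the shape of the immediate link is no longer visible.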
[22:46:51] (03PS2) 10Cwhite: hiera: add management cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/480664 (https://phabricator.wikimedia.org/T210486) [22:46:53] (03CR) 10CRusnov: [C: 03+2] cumin: remove nodepool from misc-releng group [puppet] - 10https://gerrit.wikimedia.org/r/480663 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:47:00] (03PS2) 10CRusnov: cumin: remove nodepool from misc-releng group [puppet] - 10https://gerrit.wikimedia.org/r/480663 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:47:08] thanks :) [22:47:14] shoot i should +2 after i rebase no [22:47:18] oh it's sticky [22:48:27] yeah it is a bit messy [22:48:48] operations/puppet.git only allow a change to be submitted if it is a fast forward from the branch [22:49:03] so whenever another change get submitted and thus the commit get merged, the branch has moved ahead [22:49:14] and all other changes become obsolete requiring a rebase [22:49:32] (03PS3) 10Cwhite: hiera: add management cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/480664 (https://phabricator.wikimedia.org/T210486) [22:49:47] :) yah okay done [22:50:56] (03PS6) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [22:51:18] (03PS1) 10Cwhite: hiera: add debmonitor cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/480666 (https://phabricator.wikimedia.org/T210486) [22:55:13] (03PS4) 10BBlack: Minor improvements to check-gdnsd [dns] - 10https://gerrit.wikimedia.org/r/480260 [22:55:15] (03PS8) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [22:55:40] 10Operations, 10monitoring, 10Patch-For-Review, 10Performance-Team (Radar): Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10colewhite) [22:56:00] 10Operations, 10monitoring, 10Patch-For-Review, 10Performance-Team (Radar): Provision >= 50% of statsd/Graphite-only 
metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10colewhite) [23:02:01] (03CR) 10Volans: [C: 03+1] "LGTM. A couple of nitpicks and questions inline" (034 comments) [dns] - 10https://gerrit.wikimedia.org/r/479892 (owner: 10BBlack) [23:18:59] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Volans) >>! In T211750#4832453, @faidon wrote: >>>! In T211750#4817482, @Volans wrote: >> On my side I've done a test on the `cumin` codebase with black. The results are: >> -... [23:25:42] (03CR) 10Volans: "Where this cluster definition will appear?" [puppet] - 10https://gerrit.wikimedia.org/r/480666 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [23:27:59] (03PS5) 10BBlack: Minor improvements to check-gdnsd [dns] - 10https://gerrit.wikimedia.org/r/480260 [23:28:01] (03PS9) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [23:29:33] (03CR) 10Cwhite: "> Where this cluster definition will appear?" [puppet] - 10https://gerrit.wikimedia.org/r/480666 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [23:29:43] (03CR) 10BBlack: [C: 03+2] Minor improvements to check-gdnsd [dns] - 10https://gerrit.wikimedia.org/r/480260 (owner: 10BBlack) [23:31:35] (03CR) 10BBlack: New zone generator gen-zones.py (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/479892 (owner: 10BBlack) [23:32:17] 10Operations, 10MediaWiki-Containers: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10thcipriani) I made a little command line tool to help me find image: https://gist.github.com/thcipriani/7d7633eb238cd868d5ba24d0f1069463 Then I wrapped that in a bash script... 
[23:34:35] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/480666 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [23:37:26] (03PS1) 10RobH: normalizing pdu names [dns] - 10https://gerrit.wikimedia.org/r/480672 (https://phabricator.wikimedia.org/T209101) [23:38:50] (03CR) 10RobH: [C: 03+2] normalizing pdu names [dns] - 10https://gerrit.wikimedia.org/r/480672 (https://phabricator.wikimedia.org/T209101) (owner: 10RobH) [23:46:29] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) Oh, the firmware needs to be updated. We've done this without downtime in the past on other servertechs, but since I'll be onsite to install the... [23:48:51] (03PS1) 10Volans: Edit Project Config [dns] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/480675 [23:49:07] no I didn't want to do that... damn gerrit :) [23:49:18] !log remove BGP session to AS50629 from cr2-esams (not in AMS-IX anymore) [23:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:22] (03Abandoned) 10Volans: Edit Project Config [dns] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/480675 (owner: 10Volans) [23:49:44] (03CR) 10Dzahn: [C: 03+2] profile: add spec for profile::doc [puppet] - 10https://gerrit.wikimedia.org/r/480587 (owner: 10Hashar) [23:50:19] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 407, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:52:34] (03CR) 10Dzahn: [C: 03+2] doc: add rsyncd config to let contint servers push docs [puppet] - 10https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: 10Dzahn) [23:52:42] (03PS4) 10Dzahn: doc: add rsyncd config to let contint servers push docs [puppet] - 10https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) [23:53:01] 10Operations, 
10Wikimedia-Mailing-lists: Request to create mailing list for Wikimedians of Chicago User Group - https://phabricator.wikimedia.org/T212266 (10Airplaneman) [23:56:10] (03PS10) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [23:56:12] (03PS1) 10BBlack: utils/*.sh: check explicitly for ops/dns repo root [dns] - 10https://gerrit.wikimedia.org/r/480677 [23:56:19] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Next), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10aaron) >>! In T211721#4822917, @Joe wrote: > To add to what @tgr found, we have to search... [23:56:39] (03CR) 10BBlack: [C: 03+2] utils/*.sh: check explicitly for ops/dns repo root [dns] - 10https://gerrit.wikimedia.org/r/480677 (owner: 10BBlack) [23:59:17] (03PS2) 10Dzahn: profile: add spec for profile::doc [puppet] - 10https://gerrit.wikimedia.org/r/480587 (owner: 10Hashar)