[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T0000).
[00:00:04] raynor and dcausse: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:28] o/
[00:00:54] \o im here in place of raynor
[00:01:08] (CR) Dzahn: [C: 2] "1002 isn't in production, ok, going ahead to check for issues" [puppet] - https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: Paladox)
[00:01:14] I guess I can swat
[00:01:19] o/ marostegui
[00:01:22] oh wrong user
[00:01:24] * mutante
[00:01:27] (PS6) CRusnov: Add management console report [software/netbox-reports] - https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899)
[00:03:46] paladox: many changes, no error :)
[00:03:51] :)
[00:04:32] i think you need to restart php7.2-fpm manually as when php-* are installed it's not restarted (that's intentional)
[00:04:43] sudo service php7.2-fpm restart
[00:04:47] (CR) DCausse: [C: 2] Beta cluster shows production content on mobile only for non-existent pages [mediawiki-config] - https://gerrit.wikimedia.org/r/479256 (https://phabricator.wikimedia.org/T207508) (owner: Jdlrobson)
[00:05:33] paladox: i prefer systemctl. systemctl status php7.2-fpm Active: active (running) 1min 59s ago
[00:05:40] ah yeh
[00:05:50] systemctl restart php7.2-fpm
[00:05:57] (Merged) jenkins-bot: Beta cluster shows production content on mobile only for non-existent pages [mediawiki-config] - https://gerrit.wikimedia.org/r/479256 (https://phabricator.wikimedia.org/T207508) (owner: Jdlrobson)
[00:06:17] paladox: done!
systemd[1]: Started The PHP 7.2 FastCGI Process Manager
[00:06:40] but it was already running and just got installed
[00:06:57] :)
[00:07:31] "ready to handle connections" before and after, just a new pid
[00:07:51] :)
[00:07:53] yep, expected
[00:08:01] jdlrobson: I'm going to deploy my config change while waiting for jenkins on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/479265
[00:08:32] sounds good!
[00:08:40] (CR) DCausse: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/479180 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:09:44] (Merged) jenkins-bot: [cirrus] fix temp clusters for codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/479180 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:12:48] we have recently upgraded our puppet stdlib by several versions. it's worth checking the new stuff you can use now. like more / better data types: https://gerrit.wikimedia.org/r/#/q/topic:puppet-stdlib+(status:open+OR+status:merged)
[00:13:14] (it was all for the /types/ dir initially)
[00:13:18] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] fix temp clusters for codfw (duration: 00m 52s)
[00:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:22] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381
[00:14:26] (PS3) Volans: validator: bail out on wrong IP version [dns] - https://gerrit.wikimedia.org/r/478957 (https://phabricator.wikimedia.org/T182028)
[00:14:28] (PS1) Volans: validator: allow to compare run results [dns] - https://gerrit.wikimedia.org/r/479358 (https://phabricator.wikimedia.org/T182028)
[00:15:50] (CR) jenkins-bot: Beta cluster shows production content on mobile only for non-existent pages [mediawiki-config] - https://gerrit.wikimedia.org/r/479256 (https://phabricator.wikimedia.org/T207508)
(owner: Jdlrobson)
[00:15:52] (CR) jenkins-bot: [cirrus] fix temp clusters for codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/479180 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:18:42] jdlrobson: it's live on mwdebug1002, is it possible for you to test?
[00:19:35] dcausse: did the config get deployed? I'm not seeing it on beta cluster..
[00:20:00] the non-config flag is working on mwdebug1002 yes
[00:20:01] so the config got merged so it should be synced automatically
[00:20:26] ok ill wait to verify that one then, but the others good for syncing!
[00:20:47] ok syncing
[00:22:28] thanks dcausse !
[00:22:36] !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/MobileFrontend/extension.json: T210390: Reset default mobilefrontend provider (duration: 00m 53s)
[00:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:40] T210390: PHPUnit: Cover methods in content-providers/McsContentProvider.php with unit tests - https://phabricator.wikimedia.org/T210390
[00:23:03] but shouldn't beta cluster config changes be instant?
[00:23:19] yes... that's what I would expect...
[00:23:32] hmm
[00:24:15] Operations, Product-Analytics, Patch-For-Review: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (EBjune) @mpopov are you waiting on anything from Guillaume here? He's on leave atm, and I want to make sure nothing's blocking from our end in getting t...
[00:25:35] dcausse: are you able to check the value of wgMFContentProviderClass on beta cluster?
[00:26:04] jdlrobson: I'm trying to find a mw machine there I can log in to check
[00:28:41] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:28:45] jdlrobson: it just appeared now
[00:28:54] yay!
[00:28:56] hurrah
[00:29:11] so it did
[00:29:12] thank you!
[00:29:17] yw!
[00:31:42] (PS1) Dmaza: Enable Block notice stats on top blocking wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234)
[00:32:39] !log elastic@codfw created cirrus metastore on psi&omega clusters
[00:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:08] (PS3) Jeena Huneidi: Initial Helm chart for Blubberoid. [deployment-charts] - https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708)
[00:36:09] !log changing two passwords for compromised accounts
[00:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:36] !log einsteinium - rm /lib/systemd/system/update-etcd-mw-config-lastindex.service ; systemctl reset-failed
[00:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:05] Operations, Phabricator, Patch-For-Review: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (Dzahn) deployed, puppet ran and applied many changes as expected, no puppet errors. looks to be working fine: root@phab1002:~# curl --verbose --header 'Host: phabricator.wikimedia.org...
[00:45:38] Operations, Phabricator, Patch-For-Review: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (Dzahn) Open>Resolved
[00:49:08] (PS4) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[00:50:05] (CR) jerkins-bot: [V: -1] gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116 (owner: Dzahn)
[00:52:48] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.8/includes/http/GuzzleHttpRequest.php: T211806 (duration: 00m 51s)
[00:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:51] T211806: Passing in the "body" request option as an array to send a POST request has been deprecated - https://phabricator.wikimedia.org/T211806
[00:59:57] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T0100).
[01:04:03] (CR) Ayounsi: [C: 1] Add management console report [software/netbox-reports] - https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[01:07:01] (PS5) Krinkle: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[01:07:39] (CR) Krinkle: "@Tim Agreed. I've deployed the CI patch, hopefully no more php55 now." [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[01:09:26] (CR) CRusnov: [V: 2 C: 2] "Good enough for me :) Merging this." [software/netbox-reports] - https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[01:10:00] (PS4) Krinkle: Class wrapper for ProductionServices.php etc.
[mediawiki-config] - https://gerrit.wikimedia.org/r/477956 (owner: Tim Starling)
[01:10:10] (PS4) Krinkle: Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957 (owner: Tim Starling)
[01:10:15] (PS5) Krinkle: Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137 (owner: Tim Starling)
[01:20:04] Operations, Operations-Software-Development, Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (crusnov)
[01:22:02] Operations, Operations-Software-Development, Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (crusnov)
[01:34:07] (CR) Tim Starling: Excimer and Tideways support (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/478137 (owner: Tim Starling)
[01:49:37] (CR) Tim Starling: Put profiler hostnames in ProductionServices.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/477957 (owner: Tim Starling)
[02:36:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[02:38:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:43:27] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:43:35] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:44:35] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:45:05] PROBLEM - citoid endpoints
health on scb1002 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[02:45:11] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[02:45:49] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[02:45:49] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:46:17] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[02:46:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:47:09] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[02:47:29] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[02:48:13] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[02:48:13] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[02:48:37] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[02:53:15] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:54:21] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:54:59] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:57:19] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy
[02:58:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api
(Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:59:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[03:07:29] (CR) Aezell: [C: 1] Enable Block notice stats on top blocking wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234) (owner: Dmaza)
[03:21:37] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41010312
[03:24:05] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 270768
[03:34:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 918.24 seconds
[04:10:32] (CR) Gergő Tisza: [C: 1] Fix minor tech debt around AuthManager audit logging [mediawiki-config] - https://gerrit.wikimedia.org/r/479156 (owner: Krinkle)
[04:13:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 249.04 seconds
[04:16:33] (CR) Krinkle: [C: 1] "LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[06:03:56] Operations, Performance-Team, TechCom, Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (Joe) >>! In T211721#4818615, @Eevans wrote: >>>! In T211721#4818580, @Joe wrote: >> I was asking because l...
[06:07:40] (PS1) Marostegui: db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/479370
[06:09:35] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/479370 (owner: Marostegui)
[06:10:42] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/479370 (owner: Marostegui)
[06:11:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1098:3316 db1098:3317 after kernel and mysql upgrade (duration: 00m 54s)
[06:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:09] (PS1) Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/479371 (https://phabricator.wikimedia.org/T86338)
[06:14:43] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/479371 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:15:46] (Merged) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/479371 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:16:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 T86338 T202167 (duration: 00m 51s)
[06:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:02] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:17:03] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:17:07] !log Deploy schema change on db1086 T86338 T202167
[06:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:54] (CR) jenkins-bot: db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/479370 (owner: Marostegui)
[06:18:56] (CR) jenkins-bot:
db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/479371 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:19:09] Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (Marostegui)
[06:22:09] PROBLEM - puppet last run on cp5005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:24:31] PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:25:37] (CR) Marostegui: "I am still thinking about from which angle we should approach this." [puppet] - https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) (owner: Banyek)
[06:28:35] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:34:55] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/479372
[06:36:10] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/479372 (owner: Marostegui)
[06:37:11] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/479372 (owner: Marostegui)
[06:38:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 T86338 T202167 (duration: 00m 51s)
[06:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:28] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:38:28] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:41:20] (PS1) Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/479373 (https://phabricator.wikimedia.org/T86338)
[06:41:57] (PS1) Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/479374 (https://phabricator.wikimedia.org/T86338)
[06:42:34] (CR) Marostegui: [C: 2] dbproxy1010: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/479373 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:43:02] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/479374 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:43:49] !log Depool labsdb1010 T86338
[06:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:53] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:44:02] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] -
https://gerrit.wikimedia.org/r/479372 (owner: Marostegui)
[06:44:04] (Merged) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/479374 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:44:18] (CR) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/479374 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:45:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 T86338 T202167 (duration: 00m 53s)
[06:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:14] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:46:13] !log Deploy schema change on db1079 with replication, lag will be generated on labsdb:s7 T86338 T202167
[06:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:35] RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:53:29] RECOVERY - puppet last run on cp5005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:39] Operations, Performance-Team, TechCom, Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (Tgr) Per the [[https://github.com/wikimedia/mediawiki/blob/master/includes/objectcache/ObjectCache.php#L37...
[06:59:53] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:53] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:35] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[07:06:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:03] Operations, ops-eqiad, Analytics, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192 (elukey) This server is going to be decommed very soon (OOW), I've acked the alarm a long time ago to avoid it spamming us. Good to close in my opinion, +1
[07:11:43] RECOVERY - Request latencies on neon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[07:15:22] (PS1) Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/479375
[07:15:47] (PS1) Mathew.onipe: wdqs: reduce hiera configs via profile defaults [puppet] - https://gerrit.wikimedia.org/r/479376 (https://phabricator.wikimedia.org/T210431)
[07:16:31] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/479377
[07:20:31] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/479377 (owner: Marostegui)
[07:21:33] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/479377 (owner: Marostegui)
[07:21:47] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/479377 (owner: Marostegui)
[07:22:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 T86338 T202167 (duration: 00m 52s)
[07:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:38] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[07:22:38] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[07:24:29] (CR) Mathew.onipe: "PCC proclaims NOOP: https://puppet-compiler.wmflabs.org/compiler1002/13919/" [puppet] - https://gerrit.wikimedia.org/r/479376 (https://phabricator.wikimedia.org/T210431) (owner: Mathew.onipe)
[07:25:02] (CR) Marostegui: [C: 2] Revert "dbproxy1010: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/479375 (owner: Marostegui)
[07:25:55] !log Repool labsdb1010 T86338
[07:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:29] (PS1) Marostegui: dbproxy1010: Depool labsdb1011 [puppet] -
https://gerrit.wikimedia.org/r/479378 (https://phabricator.wikimedia.org/T86338)
[07:28:15] (CR) Marostegui: [C: 2] dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/479378 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[07:28:36] !log Depool labsdb1011 T86338
[07:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:39] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[07:32:34] (PS1) Urbanecm: Add custom minerva logo for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/479379 (https://phabricator.wikimedia.org/T210979)
[07:42:54] (PS3) Muehlenhoff: Remove Diamond from ORES hosts [puppet] - https://gerrit.wikimedia.org/r/479189 (https://phabricator.wikimedia.org/T183454)
[07:44:05] (CR) Muehlenhoff: [C: 2] Remove Diamond from ORES hosts [puppet] - https://gerrit.wikimedia.org/r/479189 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[07:50:54] (CR) Vgutierrez: [C: 2] certcentral: Allow puppet_svc to be undef [puppet] - https://gerrit.wikimedia.org/r/479244 (https://phabricator.wikimedia.org/T207050) (owner: Vgutierrez)
[07:51:03] (PS3) Vgutierrez: certcentral: Allow puppet_svc to be undef [puppet] - https://gerrit.wikimedia.org/r/479244 (https://phabricator.wikimedia.org/T207050)
[07:52:14] (PS1) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/479380
[07:52:20] (PS2) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/479380
[07:54:31] ACKNOWLEDGEMENT - Device not healthy -SMART- on stat1004 is CRITICAL: cluster=analytics device=sde instance=stat1004:9100 job=node site=eqiad Muehlenhoff T211327 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004var-datasource=eqiad%2520prometheus%252Fops
[07:55:26] (CR) Vgutierrez: [C: 2] mx: Deploy
certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/479245 (https://phabricator.wikimedia.org/T207050) (owner: Vgutierrez)
[07:55:36] (PS3) Vgutierrez: mx: Deploy certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/479245 (https://phabricator.wikimedia.org/T207050)
[08:08:13] (PS3) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/479380
[08:08:23] !log installing nodejs updates on restbase1007
[08:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:54] (CR) Marostegui: [C: 2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/479380 (owner: Marostegui)
[08:09:52] !log Repool labsdb1011 T86338
[08:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:56] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[08:10:37] (PS1) Vgutierrez: mx: Use certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/479381 (https://phabricator.wikimedia.org/T207050)
[08:11:20] !log rolling reboot of scb in eqiad for kernel security update (combined with nodejs update)
[08:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:20] !log Drop unused flaggedrevs tables from srwikinews - T209761
[08:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:24] T209761: Drop FlaggedRevs tables in database for srwikinews - https://phabricator.wikimedia.org/T209761
[08:26:24] (CR) Vgutierrez: [C: 2] mx: Use certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/479381 (https://phabricator.wikimedia.org/T207050) (owner: Vgutierrez)
[08:26:32] (PS2) Vgutierrez: mx: Use certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/479381 (https://phabricator.wikimedia.org/T207050)
[08:26:54] !log Use certcentral managed TLS certificates in mx[12]001.wikimedia.org - T207050
[08:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:01] T207050: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050
[08:33:09] Operations, Wikimedia-Logstash, User-fgiunchedi, User-herron: cronspam from elasticsearch-curator on stretch - https://phabricator.wikimedia.org/T211859 (fgiunchedi) p:Triage>Normal
[08:40:09] (PS1) Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/479393 (https://phabricator.wikimedia.org/T86338)
[08:41:37] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/479393 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[08:41:43] Operations, Traffic: Fix CAA iodef tags - https://phabricator.wikimedia.org/T211860 (Vgutierrez)
[08:42:17] Operations, Traffic: Fix CAA iodef tags - https://phabricator.wikimedia.org/T211860 (Vgutierrez) p:Triage>Normal a:Vgutierrez
[08:44:26] (Merged) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/479393 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[08:45:10] !log installing openssl security updates on stretch
[08:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 T86338 T202167 (duration: 00m 53s)
[08:45:41] !log Deploy schema change on db1090:3317 T86338 T202167
[08:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:43] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[08:45:44] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[08:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:26] (PS1) Vgutierrez: Fix CAA iodef tag [dns] - https://gerrit.wikimedia.org/r/479394 (https://phabricator.wikimedia.org/T211860)
[08:48:53] Operations, Deployments, Keyholder: Make keyholder work with systemd - https://phabricator.wikimedia.org/T144043 (hashar)
[08:49:26] (CR) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/479393 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[08:50:14] Operations, media-storage, User-fgiunchedi: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (fgiunchedi)
[08:50:19] !log stress-test ms-be10[44-50] - T209618
[08:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:23] T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618
[08:52:23] (PS1) Mathew.onipe: wdqs: prefix exporter with wdqs_updater_ [puppet] - https://gerrit.wikimedia.org/r/479395 (https://phabricator.wikimedia.org/T208215)
[08:54:41] PROBLEM - very high load average likely xfs on ms-be1045 is CRITICAL: CRITICAL - load average: 203.19, 104.37, 42.11
[08:54:43] PROBLEM - very high load average likely xfs on ms-be1050 is CRITICAL: CRITICAL - load average: 203.25, 104.44, 42.12
[08:55:01] PROBLEM - very high load average likely xfs on ms-be1047 is CRITICAL: CRITICAL - load average: 202.35, 110.64, 45.52
[08:55:02] that's me ^ silencing
[08:55:19] PROBLEM - very high load average likely xfs on ms-be1044 is CRITICAL: CRITICAL - load average: 202.53, 115.45, 48.17
[08:55:31] PROBLEM - very high load average likely xfs on ms-be1046 is CRITICAL: CRITICAL - load average: 201.40, 118.19, 49.89
[08:55:31] PROBLEM - very high load average likely xfs on ms-be1048 is CRITICAL: CRITICAL - load average: 201.03, 117.67, 49.61
[08:55:51] PROBLEM - very high load average likely
xfs on ms-be1049 is CRITICAL: CRITICAL - load average: 201.60, 124.37, 53.52 [08:58:36] (03PS1) 10Vgutierrez: mx: Get rid of nginx [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/479396 (https://phabricator.wikimedia.org/T207050) [08:58:38] (03PS1) 10Vgutierrez: mx: Get rid of nginx [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/479397 (https://phabricator.wikimedia.org/T207050) [09:02:23] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10fgiunchedi) [09:02:55] (03PS2) 10Vgutierrez: mx: Get rid of nginx [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/479396 (https://phabricator.wikimedia.org/T207050) [09:02:57] (03PS2) 10Vgutierrez: mx: Get rid of nginx [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/479397 (https://phabricator.wikimedia.org/T207050) [09:02:58] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-fgiunchedi: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10fgiunchedi) 05Open>03Resolved This is completed, modulo ms-be2047 being diagnosed in T209921 [09:03:00] (03PS1) 10Vgutierrez: mx: Get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/479398 (https://phabricator.wikimedia.org/T207050) [09:04:19] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching), 10User-fgiunchedi: rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10fgiunchedi) 05Open>03Resolved Completed! 
[09:05:54] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2001.codfw.wmnet [09:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:59] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2002.codfw.wmnet [09:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:01] RECOVERY - very high load average likely xfs on ms-be1047 is OK: OK - load average: 3.51, 78.89, 78.11 [09:06:05] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2003.codfw.wmnet [09:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:10] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2004.codfw.wmnet [09:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:16] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2005.codfw.wmnet [09:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:21] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2006.codfw.wmnet [09:06:21] RECOVERY - very high load average likely xfs on ms-be1044 is OK: OK - load average: 2.69, 74.68, 76.75 [09:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:33] RECOVERY - very high load average likely xfs on ms-be1046 is OK: OK - load average: 2.26, 72.21, 75.95 [09:06:33] RECOVERY - very high load average likely xfs on ms-be1048 is OK: OK - load average: 2.24, 72.07, 75.77 [09:06:51] RECOVERY - very high load average likely xfs on ms-be1049 is OK: OK - load average: 1.62, 67.56, 74.50 [09:06:55] RECOVERY - very high load average likely xfs on ms-be1045 is OK: OK - load average: 1.44, 66.29, 74.03 [09:06:57] RECOVERY - very high load average likely xfs on ms-be1050 is OK: OK - load average: 1.40, 65.45, 73.49 
[09:09:30] (03PS1) 10MaxSem: Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T187716) [09:10:47] (03CR) 10jerkins-bot: [V: 04-1] Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [09:11:12] (03CR) 10Vgutierrez: [C: 032] mx: Get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/479398 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:13:29] (03PS1) 10GTirloni: prometheus-directory-size: Ignore stderr from du [puppet] - 10https://gerrit.wikimedia.org/r/479400 (https://phabricator.wikimedia.org/T211861) [09:13:42] (03PS1) 10Filippo Giunchedi: Remove restbase200[1-6] from restbase [puppet] - 10https://gerrit.wikimedia.org/r/479401 (https://phabricator.wikimedia.org/T211070) [09:13:44] (03PS1) 10Filippo Giunchedi: Remove restbase200[1-6] cassandra instances [puppet] - 10https://gerrit.wikimedia.org/r/479402 (https://phabricator.wikimedia.org/T211070) [09:13:46] (03PS1) 10Filippo Giunchedi: site: spare::system for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/479403 (https://phabricator.wikimedia.org/T211070) [09:15:06] (03CR) 10jerkins-bot: [V: 04-1] site: spare::system for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/479403 (https://phabricator.wikimedia.org/T211070) (owner: 10Filippo Giunchedi) [09:18:46] (03PS3) 10Vgutierrez: mx: Get rid of nginx [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/479396 (https://phabricator.wikimedia.org/T207050) [09:18:48] (03PS3) 10Vgutierrez: mx: Get rid of nginx [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/479397 (https://phabricator.wikimedia.org/T207050) [09:18:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479404 [09:23:03] (03PS7) 10ArielGlenn: 
convert dump scripts to python3 [dumps] - 10https://gerrit.wikimedia.org/r/478702 (https://phabricator.wikimedia.org/T210989) [09:28:41] (03PS2) 10MaxSem: Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T187716) [09:32:42] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479404 (owner: 10Marostegui) [09:33:46] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479404 (owner: 10Marostegui) [09:34:36] (03PS2) 10GTirloni: prometheus-directory-size: Ignore stderr from du [puppet] - 10https://gerrit.wikimedia.org/r/479400 (https://phabricator.wikimedia.org/T211861) [09:34:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3317 T86338 T202167 (duration: 00m 51s) [09:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:54] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [09:34:55] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [09:35:38] !log rebooting etcd/kubernetes hosts in codfw to pick up SSBD-enabled qemu [09:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:56] (03CR) 10GTirloni: [C: 032] prometheus-directory-size: Ignore stderr from du [puppet] - 10https://gerrit.wikimedia.org/r/479400 (https://phabricator.wikimedia.org/T211861) (owner: 10GTirloni) [09:36:25] (03PS2) 10Filippo Giunchedi: site: spare::system for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/479403 (https://phabricator.wikimedia.org/T211070) [09:39:40] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479404 (owner: 10Marostegui) [09:40:13] (03CR) 10Filippo Giunchedi: [C: 032] 
Remove restbase200[1-6] from restbase [puppet] - 10https://gerrit.wikimedia.org/r/479401 (https://phabricator.wikimedia.org/T211070) (owner: 10Filippo Giunchedi) [09:40:20] (03PS2) 10Filippo Giunchedi: Remove restbase200[1-6] from restbase [puppet] - 10https://gerrit.wikimedia.org/r/479401 (https://phabricator.wikimedia.org/T211070) [09:42:09] (03CR) 10Elukey: "> I am still thinking about from which angle we should approach this." [puppet] - 10https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) (owner: 10Banyek) [09:43:43] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:43:55] PROBLEM - etcd request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:45:27] (03CR) 10Filippo Giunchedi: [C: 032] Remove restbase200[1-6] cassandra instances [puppet] - 10https://gerrit.wikimedia.org/r/479402 (https://phabricator.wikimedia.org/T211070) (owner: 10Filippo Giunchedi) [09:45:36] (03PS2) 10Filippo Giunchedi: Remove restbase200[1-6] cassandra instances [puppet] - 10https://gerrit.wikimedia.org/r/479402 (https://phabricator.wikimedia.org/T211070) [09:46:01] 10Operations, 10Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (10Marostegui) [09:47:23] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:47:35] RECOVERY - etcd request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:48:24] (03CR) 10Mobrovac: [C: 031] site: spare::system for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/479403 (https://phabricator.wikimedia.org/T211070) (owner: 10Filippo Giunchedi) [09:51:48] (03CR) 10Filippo Giunchedi: [C: 032] site: spare::system for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/479403 (https://phabricator.wikimedia.org/T211070) (owner: 10Filippo Giunchedi) [09:51:57] (03PS3) 10Filippo Giunchedi: site: spare::system for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/479403 (https://phabricator.wikimedia.org/T211070) [09:54:46] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479405 (https://phabricator.wikimedia.org/T86338) [09:55:58] !log removed openssl 1.1.0f-3+deb9u2+wmf1 from stretch-wikimedia/component/node10 (superseded by openssl update in DSA 4348 for stretch) [09:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479405 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [09:58:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479405 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [09:59:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 T86338 T202167 (duration: 00m 51s) [10:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:01] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [10:00:01] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [10:00:44] !log upgrade nodejs on aqs100[5-9] [10:00:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:07] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479405 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [10:08:20] (03CR) 10DCausse: wdqs: prefix exporter with wdqs_updater_ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479395 (https://phabricator.wikimedia.org/T208215) (owner: 10Mathew.onipe) [10:10:20] (03PS1) 10Hoo man: WikibaseClient: Enable Lua function usage tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479407 (https://phabricator.wikimedia.org/T191416) [10:10:22] (03CR) 10Reedy: Increase default minimum password length on multiple group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: 10Dmaza) [10:15:11] (03CR) 10Reedy: Increase default minimum password length on multiple group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: 10Dmaza) [10:23:52] (03CR) 10DCausse: "not fully checked all possibilities but I think you can remove more" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479376 (https://phabricator.wikimedia.org/T210431) (owner: 10Mathew.onipe) [10:26:13] (03PS1) 10Elukey: superset: set /tmp as upload directory [puppet] - 10https://gerrit.wikimedia.org/r/479408 [10:27:16] (03CR) 10Ema: [C: 031] Fix CAA iodef tag [dns] - 10https://gerrit.wikimedia.org/r/479394 (https://phabricator.wikimedia.org/T211860) (owner: 10Vgutierrez) [10:28:27] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:28:53] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received 
[10:28:57] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [10:28:57] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [10:28:59] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:28:59] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:28:59] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:05] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [10:29:05] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:11] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:13] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:15] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [10:29:15] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:17] PROBLEM - 
Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:20] <_joe_> sigh [10:29:21] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [10:29:21] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [10:29:21] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:21] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:23] what's going on? [10:29:25] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:35] <_joe_> marostegui: I think a oom in zotero [10:29:35] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:29:38] <_joe_> again [10:30:03] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [10:30:03] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:30:05] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:31:59] (03CR) 10Ema: [C: 031] cookbook: split main 
into argument_parser and run [software/spicerack] - 10https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:33:21] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:50] (03PS1) 10GTirloni: nfs-mount-manager: Reduce verbosity and fix linter warnings [puppet] - 10https://gerrit.wikimedia.org/r/479409 (https://phabricator.wikimedia.org/T211817) [10:33:54] (03CR) 10Mathew.onipe: wdqs: prefix exporter with wdqs_updater_ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479395 (https://phabricator.wikimedia.org/T208215) (owner: 10Mathew.onipe) [10:34:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479410 [10:39:20] 10Operations, 10Graphite, 10Services (watching), 10User-fgiunchedi: Cassandra Graphite metrics space usage audit and cleanup - https://phabricator.wikimedia.org/T191315 (10fgiunchedi) 05Open>03declined We no longer have separate cassandra metrics hosts since moving to Prometheus. 
[10:40:16] (03CR) 10Hoo man: "To be deployed after wmf9 is on all wikis (Dec 20)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479407 (https://phabricator.wikimedia.org/T191416) (owner: 10Hoo man) [10:40:43] (03PS2) 10Arturo Borrero Gonzalez: osm::planet_sync: configure logrotate to use non-root user [puppet] - 10https://gerrit.wikimedia.org/r/479348 (https://phabricator.wikimedia.org/T211013) (owner: 10BryanDavis) [10:41:29] (03CR) 10Arturo Borrero Gonzalez: [C: 032] osm::planet_sync: configure logrotate to use non-root user [puppet] - 10https://gerrit.wikimedia.org/r/479348 (https://phabricator.wikimedia.org/T211013) (owner: 10BryanDavis) [10:42:11] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479410 (owner: 10Marostegui) [10:43:15] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479410 (owner: 10Marostegui) [10:43:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479410 (owner: 10Marostegui) [10:44:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 T86338 T202167 (duration: 00m 51s) [10:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:29] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [10:44:30] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [10:48:59] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:49:49] (03PS1) 10Tulsi Bhagat: Update maiwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479411 (https://phabricator.wikimedia.org/T211845) [10:50:21] (03CR) 10DCausse: wdqs: prefix exporter with wdqs_updater_ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479395 
(https://phabricator.wikimedia.org/T208215) (owner: 10Mathew.onipe) [10:51:20] !log rebooting dbmonitor hosts to pick up SSBD-enabled qemu [10:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:23] (03CR) 10Tulsi Bhagat: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479411 (https://phabricator.wikimedia.org/T211845) (owner: 10Tulsi Bhagat) [10:58:38] zotero looks like it is having a bad time https://grafana.wikimedia.org/d/000000011/service-citoid?orgId=1 [10:58:46] is anyone looking into it? [10:58:54] cc mobrovac akosiaris ? [10:59:41] (03CR) 10Jayprakash12345: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479411 (https://phabricator.wikimedia.org/T211845) (owner: 10Tulsi Bhagat) [11:02:59] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:04:11] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:04:15] (03CR) 10Filippo Giunchedi: "Thanks for the fixes! See inline, just one non-nit left to do" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [11:04:33] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) Thanks! I love diagrams, they help me better understand topology and architectures. Please, confirm the following ar... [11:05:02] fsero: it's the same ol' problem ... 
[11:14:16] !log rebooting webperf hosts to pick up SSBD-enabled qemu [11:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:16] (03CR) 10Jayprakash12345: [C: 031] Update maiwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479411 (https://phabricator.wikimedia.org/T211845) (owner: 10Tulsi Bhagat) [11:26:12] (03CR) 10Fdans: [C: 031] superset: set /tmp as upload directory [puppet] - 10https://gerrit.wikimedia.org/r/479408 (owner: 10Elukey) [11:28:44] (03PS8) 10ArielGlenn: convert dump scripts to python3 [dumps] - 10https://gerrit.wikimedia.org/r/478702 (https://phabricator.wikimedia.org/T210989) [11:31:13] PROBLEM - Host google is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:31:41] !log fsero@deploy1001 scap-helm zotero upgrade production -f ../zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [11:31:42] !log fsero@deploy1001 scap-helm zotero cluster eqiad completed [11:31:42] !log fsero@deploy1001 scap-helm zotero finished [11:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:52] !log mobrovac@deploy1001 Started deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871 [11:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:56] T211070: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 [11:31:57] T211871: TFA missing from MCS response - https://phabricator.wikimedia.org/T211871 [11:32:23] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 601.14 ms [11:38:15] !log rebooting krypton to pick up SSBD-enabled qemu [11:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:38:37] !log fsero@deploy1001 scap-helm zotero upgrade production -f ../zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [11:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:39] !log fsero@deploy1001 scap-helm zotero cluster codfw completed [11:38:39] !log fsero@deploy1001 scap-helm zotero finished [11:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:00] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871 (duration: 07m 08s) [11:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:05] T211070: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 [11:39:05] T211871: TFA missing from MCS response - https://phabricator.wikimedia.org/T211871 [11:39:18] !log mobrovac@deploy1001 Started deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871 [11:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:39] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [11:43:01] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [11:43:22] !log rebooting vega/bromine to pick up SSBD-enabled qemu [11:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:27] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth 
Vader) timed out before a response was received [11:44:07] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy [11:44:11] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [11:45:26] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871 (duration: 06m 08s) [11:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:31] T211070: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 [11:45:31] T211871: TFA missing from MCS response - https://phabricator.wikimedia.org/T211871 [11:47:53] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [11:47:55] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [11:48:33] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [11:49:55] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy [11:51:01] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@35841a7]: (no justification provided) [11:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:39] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@35841a7]: (no justification provided) (duration: 00m 38s) [11:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:33] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was 
received [11:53:49] !log mobrovac@deploy1001 Started deploy [restbase/deploy@55fcd4b]: Remove restbase200[1-6], ensure body.tfa exists for feed responses and disable Citoid check - T211070 T211871 T211411 [11:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:55] T211411: Citoid automated monitoring times out due to Zotero v2 - https://phabricator.wikimedia.org/T211411 [11:53:55] T211070: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 [11:53:55] T211871: TFA missing from MCS response - https://phabricator.wikimedia.org/T211871 [11:59:11] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [11:59:39] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) For the D day: `lang=shell-session # create new subnet root@cloudcontrol1004:~# neutron subnet-create --gateway 208.... [11:59:53] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1200). [12:00:04] Thiemo_WMDE and Tulsi: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:14] o/ [12:00:28] zeljkof: Hi [12:00:29] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [12:00:31] !log rebooting ununpentium to pick up SSBD-enabled qemu [12:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:40] Thiemo_WMDE: around for swat? [12:01:05] hi Tulsi [12:01:05] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy [12:01:08] I can SWAT today [12:01:18] Great [12:01:22] :) [12:01:29] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [12:01:31] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [12:01:37] Tulsi: please stand by, I'll let you know when the first patch is ready for testing at mwdebug1002 [12:01:41] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [12:01:41] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [12:01:51] Tulsi: do you know how to test there, or do you need help? [12:01:55] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [12:01:57] Of course! [12:01:59] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [12:02:02] I know a bit [12:02:28] zeljkof: I will let you know if I need help. 
[12:02:33] ok [12:04:01] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy [12:04:09] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy [12:04:21] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy [12:04:55] Tulsi: there is trailing whitespace in commit message for 478143, I'll fix it now, but please make sure there is none in the future [12:05:10] zeljkof: Okay [12:05:15] (03PS3) 10Zfilipin: Namespace configuration on shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478143 (https://phabricator.wikimedia.org/T210699) (owner: 10Tulsi Bhagat) [12:05:55] (03CR) 10Zfilipin: "PS3 removes trailing white-space from commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478143 (https://phabricator.wikimedia.org/T210699) (owner: 10Tulsi Bhagat) [12:06:02] (03PS4) 10Zfilipin: Namespace configuration on shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478143 (https://phabricator.wikimedia.org/T210699) (owner: 10Tulsi Bhagat) [12:06:27] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy [12:06:27] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy [12:06:50] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478143 (https://phabricator.wikimedia.org/T210699) (owner: 10Tulsi Bhagat) [12:07:53] (03Merged) 10jenkins-bot: Namespace configuration on shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478143 (https://phabricator.wikimedia.org/T210699) (owner: 10Tulsi Bhagat) [12:08:43] 10Operations, 10Patch-For-Review, 10User-jijiki: Add option maxmemory-policy: 'volatile-lru' on Redis class for debian stretch - https://phabricator.wikimedia.org/T209628 (10jijiki) 05Open>03Resolved [12:08:49] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [12:08:50] Zfilipin: Hello [12:08:51] 
Tulsi: 478143 is at mwdebug1002, please test and let me know if I can deploy it [12:09:07] Jayprakash12345: hi, it's zeljkof not zflipin :) [12:10:04] !log oblivian@deploy1001 scap-helm -h [namespace: -h, clusters: eqiad,codfw] [12:10:04] !log oblivian@deploy1001 scap-helm -h cluster eqiad completed [12:10:04] !log oblivian@deploy1001 scap-helm -h cluster codfw completed [12:10:04] !log oblivian@deploy1001 scap-helm -h finished [12:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:49] (03CR) 10jenkins-bot: Namespace configuration on shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478143 (https://phabricator.wikimedia.org/T210699) (owner: 10Tulsi Bhagat) [12:10:59] _joe_: -h is not for getting the help message? :) [12:11:00] zeljkof: Ok, I have tested the https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/478143/ on 1002. [12:11:19] <_joe_> volans: oh it does [12:11:35] <_joe_> but I forgot we didn't do any check [12:11:36] zeljkof: It is working fine. 
[12:11:45] zeljkof: Yup [12:12:22] ack :) [12:12:23] Tulsi: please remove trailing whitespace from commit message for 478138 [12:12:47] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@55fcd4b]: Remove restbase200[1-6], ensure body.tfa exists for feed responses and disable Citoid check - T211070 T211871 T211411 (duration: 18m 59s) [12:12:52] Tulsi, Jayprakash12345: ok, deploying 478143 [12:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:53] T211411: Citoid automated monitoring times out due to Zotero v2 - https://phabricator.wikimedia.org/T211411 [12:12:53] T211070: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 [12:12:54] T211871: TFA missing from MCS response - https://phabricator.wikimedia.org/T211871 [12:13:37] (03PS4) 10Tulsi Bhagat: Enable 'flood' user group at ne.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478138 (https://phabricator.wikimedia.org/T211181) [12:14:26] zeljkof: Done. Removed. Check it plz [12:14:29] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:478143|Namespace configuration on shnwiki (T210699)]] (duration: 00m 53s) [12:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:33] T210699: Set WP namespace alias to NS_PROJECT in Shan Wikipedia (shnwiki) - https://phabricator.wikimedia.org/T210699 [12:14:47] Tulsi, Jayprakash12345: 478143 deployed [12:15:28] zeljkof: Thank you! ;) [12:15:38] (03PS5) 10Zfilipin: Enable 'flood' user group at ne.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478138 (https://phabricator.wikimedia.org/T211181) (owner: 10Tulsi Bhagat) [12:15:46] namespaceDupes? 
[12:16:14] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478138 (https://phabricator.wikimedia.org/T211181) (owner: 10Tulsi Bhagat) [12:16:39] zeljkof: that shnwiki namespace patch needs namespaceDupes [12:16:48] Tulsi, Jayprakash12345, Hauskatze: do I need to run namespaceDupes for 478143? [12:16:52] (03CR) 10Alexandros Kosiaris: ">Thank you for the guide! I will post in the discussion page. Hopefully the format will be okay." [deployment-charts] - 10https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708) (owner: 10Jeena Huneidi) [12:16:53] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes [12:17:03] zeljkof: Yes [12:17:05] zeljkof: yes, first dry run and post results [12:17:06] Hauskatze: ok, running it [12:17:15] (03Merged) 10jenkins-bot: Enable 'flood' user group at ne.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478138 (https://phabricator.wikimedia.org/T211181) (owner: 10Tulsi Bhagat) [12:17:17] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Great! I am merging this. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708) (owner: 10Jeena Huneidi) [12:17:27] Jayprakash12345: you have to make a note in gerrit that a script needs to run [12:17:33] Jayprakash12345: for future reference [12:18:23] zeljkof: will remember, Thanks [12:20:56] zeljkof: Is 478138 on the mwdebug1002? 
[12:21:18] Tulsi, Jayprakash12345, Hauskatze: https://phabricator.wikimedia.org/T210699#4820079 [12:22:06] (03PS1) 10Alexandros Kosiaris: Package blubberoid and update repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/479420 [12:22:13] Tulsi, Jayprakash12345: 478138 is at mwdebug1002, please test [12:22:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Package blubberoid and update repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/479420 (owner: 10Alexandros Kosiaris) [12:22:51] zeljkof: looks good, thanks [12:23:00] wrt the namespaceDupes.php run [12:23:12] not commenting on 478138 [12:23:24] (03CR) 10jenkins-bot: Enable 'flood' user group at ne.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478138 (https://phabricator.wikimedia.org/T211181) (owner: 10Tulsi Bhagat) [12:25:02] zeljkof: 478138 is working fine. Please go ahead [12:25:21] Jayprakash12345: deploying [12:26:18] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:478138|Enable flood user group at ne.wiki (T211181)]] (duration: 00m 51s) [12:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:22] T211181: Enable flood flag at newiki - https://phabricator.wikimedia.org/T211181 [12:27:25] Tulsi, Jayprakash12345: 478138 is deployed, please test [12:27:45] zeljkof: Sure [12:28:22] zeljkof: https://ne.wikipedia.org/wiki/विशेष:प्रयोगकर्ता_समूह_अधिकार?uselang=en Looks good. [12:28:25] zeljkof: Working fine. [12:29:17] It's Pseudobots? [12:29:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479411 (https://phabricator.wikimedia.org/T211845) (owner: 10Tulsi Bhagat) [12:29:40] It should be Flooders? [12:30:25] (03Merged) 10jenkins-bot: Update maiwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479411 (https://phabricator.wikimedia.org/T211845) (owner: 10Tulsi Bhagat) [12:30:35] Tulsi: I don't understand you, something is wrong? 
[12:30:59] zeljkof: I mean User group name. [12:31:14] zeljkof: No, I think it is realted to i18n [12:31:18] Tulsi: you can customize the name locally [12:31:32] Oh. Okay [12:31:34] flood = i18n -> pseudobot [12:31:41] Got it. [12:31:45] :) [12:33:31] Tulsi, Jayprakash12345: 479411 is at mwdebug1002, please test [12:34:09] zeljkof: Looks good. [12:34:21] Tulsi: ok, deploying [12:34:29] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:34:33] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [12:34:33] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [12:34:37] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:34:41] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [12:34:51] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [12:34:51] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [12:35:26] !log oblivian@deploy1001 scap-helm zotero upgrade production -f ../zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: eqiad,codfw] [12:35:27] !log oblivian@deploy1001 scap-helm zotero cluster eqiad completed [12:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:28] !log oblivian@deploy1001 scap-helm zotero cluster codfw completed [12:35:28] !log oblivian@deploy1001 scap-helm zotero finished [12:35:29] !log zfilipin@deploy1001 Synchronized static/images/project-logos: SWAT: [[gerrit:479411|Update maiwiki logo (T211845)]] (duration: 00m 52s) [12:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:38] T211845: 
Update Maithili Wikipedia Logo - https://phabricator.wikimedia.org/T211845 [12:35:44] (03CR) 10jenkins-bot: Update maiwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479411 (https://phabricator.wikimedia.org/T211845) (owner: 10Tulsi Bhagat) [12:35:47] Tulsi: Are there any more patches left now? [12:35:56] Jayprakash12345: no [12:36:07] Done.. for today [12:36:12] Tulsi, Jayprakash12345: 479411 is deployed, I've purged the logos, please test [12:36:30] zeljkof: Fine. [12:37:02] Okay for me. [12:37:17] zeljkof, Tulsi Good Bye. Thanks for being here :) [12:37:36] Tulsi, Jayprakash12345: great! thanks for deploying with #releng! :) [12:38:27] zeljkof: Thank you! Nice to meet you!! :) [12:38:31] !log EU SWAT finished [12:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:46] !log icinga downtime (30 mins) cloudcontrol1003, cloudnet1003 and cloudnet1004 for package upgrades [12:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:24] (03PS1) 10GTirloni: prometheus-node-exporter: Ignore Docker/Kubelet mount points [puppet] - 10https://gerrit.wikimedia.org/r/479424 (https://phabricator.wikimedia.org/T211810) [12:46:20] (03CR) 10Vgutierrez: [C: 032] Fix CAA iodef tag [dns] - 10https://gerrit.wikimedia.org/r/479394 (https://phabricator.wikimedia.org/T211860) (owner: 10Vgutierrez) [12:47:16] (03CR) 10GTirloni: "I think ignoring 'nsfs' is uncontroversial. 
However, I'm not sure ignoring /var/lib/docker and /var/lib/kubelet will cause any trouble for" [puppet] - 10https://gerrit.wikimedia.org/r/479424 (https://phabricator.wikimedia.org/T211810) (owner: 10GTirloni) [12:48:02] 10Operations, 10Traffic: Fix CAA iodef tags - https://phabricator.wikimedia.org/T211860 (10Vgutierrez) 05Open>03Resolved [12:51:49] 10Operations, 10Patch-For-Review: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10akosiaris) I 've had to deannotate the zotero namespace with commands like the one below ` sudo KUBECONFIG="/etc/kubernetes/admin-codfw.config" kubectl annotate namespace zotero net.be... [12:59:36] !log superset on analytics-tool1003 upgraded to 0.28.1 [12:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1300) [13:15:38] !log stop restbase and cassandra on restbase200[1-6] - T211070 [13:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:41] T211070: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 [13:22:27] !log creating 300+ wiki indices on elastic-omega@codfw [13:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:24] (03PS4) 10Volans: puppet: add PuppetMaster class [software/spicerack] - 10https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884) [13:34:26] (03PS4) 10Volans: Add ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) [13:34:28] (03PS3) 10Volans: icinga: fix typo in test docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/478931 (https://phabricator.wikimedia.org/T205884) [13:34:30] (03PS1) 10Volans: puppet: add additional methods to PuppetHosts [software/spicerack] - 
10https://gerrit.wikimedia.org/r/479431 (https://phabricator.wikimedia.org/T205884) [13:34:40] (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:34:55] (03CR) 10Volans: "done" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:40:13] (03PS12) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [13:40:15] (03PS12) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [13:40:17] (03PS14) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [13:40:20] (03PS1) 10DCausse: [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) [13:41:11] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [13:41:19] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) >>! In T207258#4804302, @Cmjohnson wrote: > @marostegui and all, > > the system board that was replaced yesterday was faulty. Showing errors on DIMM slots B4 and B1. After swapping... 
[13:41:21] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [13:41:26] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [13:43:54] (03PS1) 10BBlack: Move authdns zone data options to other repo [dns] - 10https://gerrit.wikimedia.org/r/479433 [13:44:16] (03PS4) 10BBlack: gdnsd config: update for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/464862 [13:44:18] (03PS1) 10BBlack: Move authdns zone data options to other repo [puppet] - 10https://gerrit.wikimedia.org/r/479434 [13:47:09] (03PS2) 10DCausse: [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) [13:47:11] (03PS13) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [13:47:13] (03PS13) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [13:47:15] (03PS15) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [13:48:06] (03CR) 10BBlack: [C: 032] Move authdns zone data options to other repo [dns] - 10https://gerrit.wikimedia.org/r/479433 (owner: 10BBlack) [13:48:18] (03CR) 10BBlack: [C: 032] gdnsd config: update for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/464862 (owner: 10BBlack) [13:48:22] (03CR) 10BBlack: [C: 032] Move authdns zone data options to other repo [puppet] - 10https://gerrit.wikimedia.org/r/479434 (owner: 10BBlack) [13:48:53] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] 
fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [13:48:57] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [13:49:06] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [13:49:21] (03CR) 10Filippo Giunchedi: profile: enable statsd_exporter and add matching rules to logstash::collector (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [13:59:17] (03PS3) 10DCausse: [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) [13:59:19] (03PS14) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [13:59:21] (03PS14) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [13:59:23] (03PS16) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [13:59:44] subbu: double-checking… You're ready for me to delete mw-expt.wikitextexp.eqiad.wmflabs and mw-base.wikitextexp.eqiad.wmflabs? [14:00:04] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:00:04] zeljkof: That opportune time is upon us again. 
Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1400). [14:00:16] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:00:19] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:01:27] (03PS1) 10BBlack: authdns: remove authdns-lint bits [puppet] - 10https://gerrit.wikimedia.org/r/479435 (https://phabricator.wikimedia.org/T205439) [14:01:29] (03PS1) 10BBlack: authdns-local-update: only support 3.x [puppet] - 10https://gerrit.wikimedia.org/r/479436 [14:01:31] (03PS1) 10BBlack: authdns spec: require stretch [puppet] - 10https://gerrit.wikimedia.org/r/479437 [14:03:08] thanks for the reminder jouncebot [14:04:32] (03CR) 10Vgutierrez: [C: 032] mx: Get rid of nginx [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/479396 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:04:40] (03PS4) 10Vgutierrez: mx: Get rid of nginx [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/479396 (https://phabricator.wikimedia.org/T207050) [14:06:32] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Services, 10Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) [14:09:36] (03CR) 10Vgutierrez: [C: 032] mx: Get rid of nginx [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/479397 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:09:44] (03PS4) 10Vgutierrez: mx: Get rid of nginx [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/479397 (https://phabricator.wikimedia.org/T207050) [14:10:41] (03PS4) 10DCausse: [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) [14:10:43] (03PS15) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [14:10:45] (03PS15) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [14:10:47] (03PS17) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [14:11:40] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:11:42] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:11:47] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:14:04] 10Operations, 10User-Elukey: Move oxygen to weblog1001 - https://phabricator.wikimedia.org/T211883 (10elukey) p:05Triage>03Normal [14:14:20] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10elukey) Created https://phabricator.wikimedia.org/T211883 :) [14:14:43] PROBLEM - very high load average likely xfs on ms-be1044 is CRITICAL: CRITICAL - load average: 121.34, 121.32, 113.83 [14:14:57] PROBLEM - very high load average likely xfs on ms-be1045 is CRITICAL: CRITICAL - load average: 121.17, 121.29, 113.91 [14:16:04] 10Operations, 10Traffic: Migrate most standard public TLS certificates to CertCentral issuance - 
https://phabricator.wikimedia.org/T207050 (10Vgutierrez) 05Open>03Resolved [14:17:36] (03PS1) 10Zfilipin: all wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479441 [14:17:38] (03CR) 10Zfilipin: [C: 032] all wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479441 (owner: 10Zfilipin) [14:18:06] (03PS5) 10Andrew Bogott: Nova: lower cpu_allocation_ratio by a lot [puppet] - 10https://gerrit.wikimedia.org/r/478955 [14:18:48] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479441 (owner: 10Zfilipin) [14:18:56] (03CR) 10Andrew Bogott: [C: 032] "In eqiad1 the busiest hypervisor (which is VERY busy) is just over 3x. So 4x won't actively change the current scheduling behavior, but i" [puppet] - 10https://gerrit.wikimedia.org/r/478955 (owner: 10Andrew Bogott) [14:21:23] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.8 [14:21:39] (03PS5) 10DCausse: [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) [14:21:41] (03PS16) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [14:21:41] !log ladsgroup@deploy1001 Started deploy [ores/deploy@a9d5e95]: noop [14:21:43] (03PS16) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [14:21:45] (03PS18) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [14:21:56] I'm deploying something for ores, it's a noop for prod [14:22:03] akosiaris: ^ [14:22:43] (03PS1) 10Elukey: profile::kafkatee::webrequest::ops: move hiera to parameters [puppet] - 
10https://gerrit.wikimedia.org/r/479443 (https://phabricator.wikimedia.org/T211883) [14:23:23] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:43] heh [14:23:44] (03CR) 10Elukey: [C: 032] profile::kafkatee::webrequest::ops: move hiera to parameters [puppet] - 10https://gerrit.wikimedia.org/r/479443 (https://phabricator.wikimedia.org/T211883) (owner: 10Elukey) [14:26:15] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479441 (owner: 10Zfilipin) [14:27:07] (03PS1) 10Elukey: Add IPv6 mapped config to weblog1001 [puppet] - 10https://gerrit.wikimedia.org/r/479444 (https://phabricator.wikimedia.org/T211883) [14:28:04] (03CR) 10Elukey: [C: 032] Add IPv6 mapped config to weblog1001 [puppet] - 10https://gerrit.wikimedia.org/r/479444 (https://phabricator.wikimedia.org/T211883) (owner: 10Elukey) [14:29:29] (03CR) 10Ottomata: [C: 031] "Huh...how will this work? I don' think we should let people upload into the 'main' database; this is the one on an-coord1001." [puppet] - 10https://gerrit.wikimedia.org/r/479408 (owner: 10Elukey) [14:31:02] (03CR) 10Elukey: "> Huh...how will this work? 
I don' think we should let people upload" [puppet] - 10https://gerrit.wikimedia.org/r/479408 (owner: 10Elukey) [14:31:12] (03PS6) 10Volans: cookbook: split main into argument_parser and run [software/spicerack] - 10https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) [14:32:22] (03PS1) 10Muehlenhoff: Remove Diamond from further roles [puppet] - 10https://gerrit.wikimedia.org/r/479446 (https://phabricator.wikimedia.org/T183454) [14:34:17] (03CR) 10Volans: [C: 032] cookbook: split main into argument_parser and run [software/spicerack] - 10https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:35:39] (03Merged) 10jenkins-bot: cookbook: split main into argument_parser and run [software/spicerack] - 10https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:36:35] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Addshore) As far as I remember from an IRC discussion, the next step here is to wait for the WMF to get a cert for the domain, then we can move... 
[14:36:40] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@a9d5e95]: noop (duration: 14m 59s) [14:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:38] akosiaris: my deployment is done, it should be a noop now [14:37:39] (03PS1) 10Elukey: Add AAAA/PTR records for weblog1001 [dns] - 10https://gerrit.wikimedia.org/r/479447 (https://phabricator.wikimedia.org/T211883) [14:38:13] RECOVERY - very high load average likely xfs on ms-be1045 is OK: OK - load average: 0.31, 34.32, 79.20 [14:38:40] (03CR) 10Elukey: [C: 032] Add AAAA/PTR records for weblog1001 [dns] - 10https://gerrit.wikimedia.org/r/479447 (https://phabricator.wikimedia.org/T211883) (owner: 10Elukey) [14:39:02] Amir1: ok, proceeding then [14:39:09] RECOVERY - very high load average likely xfs on ms-be1044 is OK: OK - load average: 0.15, 28.59, 74.67 [14:39:46] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Revert "ores: Remove added celery configs"" [puppet] - 10https://gerrit.wikimedia.org/r/479206 (owner: 10Ladsgroup) [14:39:54] (03PS3) 10Alexandros Kosiaris: Revert "Revert "ores: Remove added celery configs"" [puppet] - 10https://gerrit.wikimedia.org/r/479206 (owner: 10Ladsgroup) [14:40:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Revert "ores: Remove added celery configs"" [puppet] - 10https://gerrit.wikimedia.org/r/479206 (owner: 10Ladsgroup) [14:40:49] !log disable puppet on ores1* and ores2* machines to deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479206/ [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:56] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10BBlack) There's still a couple of things that can be done serially at present, one of which is necessary for the cert issuance later: 1. Switc... 
[14:41:35] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: cronspam from elasticsearch-curator on stretch - https://phabricator.wikimedia.org/T211859 (10herron) a:03herron [14:41:37] (03PS3) 10Hashar: Log docker build output [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 [14:42:28] thanks [14:43:36] (03CR) 10Ottomata: [C: 031] "I think it makes sense, especially if there is a good place set aside for it. I think we are both superset admins, so we can see the 'mai" [puppet] - 10https://gerrit.wikimedia.org/r/479408 (owner: 10Elukey) [14:46:04] !log updating grafana/stretch-wikimedia to 5.4.2: reprepro --restrict grafana update stretch-wikimedia [14:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:09] (03PS1) 10Elukey: Swap oxygen with weblog1001 [puppet] - 10https://gerrit.wikimedia.org/r/479448 (https://phabricator.wikimedia.org/T211883) [14:48:26] zeljkof, Amir1: Did we have an uneventful train today, or is it still going on? [14:49:07] anomie: wmf.8 is at all wikis, one new problem noticed so far T211885 [14:49:07] T211885: ErrorException from line 47 of /srv/mediawiki/php-1.33.0-wmf.8/extensions/Kartographer/includes/ApiQueryMapData.php: PHP Warning: data error - https://phabricator.wikimedia.org/T211885 [14:50:29] anomie: most things got fixed, your patch for full scan issue got merged but not yet deployed [14:50:51] I'm planning to deploy it later today but I might not be around to do it [14:51:15] Amir1: I can do it now if we want. I also still have that config change on group 2 that I didn't get to do yesterday. [14:51:37] anomie: sure! [14:52:00] zeljkof: ^ Is that ok with you if I do that backport and my config change now? [14:52:23] anomie: go ahead, logs look good, I'm hunting for minor problems [14:53:21] ok, doing those now. Backport first. 
[14:55:17] !log installing openssl 1.1 security updates on mw canaries (along with nginx restart/upgrade) [14:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:08] anomie: +1 to deploy the backport now if you have time! :) [15:01:20] (if we are talking about the same one!) [15:01:47] Already working on it, currently waiting for Jenkins. I think we're both talking about the one for T211804. [15:01:48] T211804: A huge spike on read rows for commonswiki - https://phabricator.wikimedia.org/T211804 [15:01:50] Amir1: https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&orgId=1 looks fine I 'll proceed with the entire set of hosts [15:02:02] anomie: <3 [15:02:10] the overload errors are before my deployment btw [15:02:19] akosiaris: I tested it manually on ores1001 too [15:02:45] yes, those are for the deployment itself, when there is lots of restart at the same time, the queue gets super big [15:03:13] !log ores2* deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479206/ [15:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:43] Amir1: ok, then I 'll do eqiad a bit more conservatively [15:04:30] thanks! 
[15:11:44] jouncebot: next [15:11:44] In 1 hour(s) and 48 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1700) [15:12:06] (03CR) 10Hashar: Log docker build output (033 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 (owner: 10Hashar) [15:12:11] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10BPirkle) [15:12:43] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.8/includes/api/ApiPageSet.php: Backport fix for T211804: ApiPageSet::initFromPageIds: Default $filterIds to true (duration: 00m 46s) [15:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:50] T211804: A huge spike on read rows for commonswiki - https://phabricator.wikimedia.org/T211804 [15:13:15] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10BPirkle) [15:15:37] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-both/read-old on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479454 (https://phabricator.wikimedia.org/T188327) [15:16:06] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479454 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:17:10] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-old on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479454 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:17:52] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478621 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [15:18:40] !log anomie@deploy1001 
Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-old on all wikis (T188327) (duration: 00m 45s) [15:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:45] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [15:19:12] (03CR) 10Filippo Giunchedi: [C: 031] "IMHO ready to go, no impact expected in production at this point only on beta" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478621 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [15:20:19] anyone available for another set of eyes on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/478621 ? I'd like to push that out today now that wmf.8 is everywhere [15:20:52] 10Operations, 10ops-codfw: Interface errors on cr1-codfw:xe-5/3/1 - https://phabricator.wikimedia.org/T211715 (10Papaul) 05Open>03Resolved This is complete [15:23:57] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) [15:28:36] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-old on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479454 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:28:58] (03CR) 10Alexandros Kosiaris: [C: 031] phabricator: use Stdlib::Fqdn data type for hostname parameters [puppet] - 10https://gerrit.wikimedia.org/r/479327 (owner: 10Dzahn) [15:33:53] !log rebooting pybal-test hosts to pick up SSBD-enabled qemu [15:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:15] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Services, 10Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) [15:37:19] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 
10Patch-For-Review, and 2 others: Create Graphoid .pipeline files - https://phabricator.wikimedia.org/T203092 (10akosiaris) [15:37:26] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10akosiaris) [15:37:30] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Create Graphoid .pipeline files - https://phabricator.wikimedia.org/T203092 (10akosiaris) 05Open>03stalled Stalling until T211811 is done [15:37:39] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Services, 10Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) [15:37:43] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10akosiaris) [15:38:16] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10akosiaris) [15:38:23] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10akosiaris) 05Open>03stalled The migration uncovered a number of issues in graphoid t... 
[15:43:03] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Blubberoid - Create Helm Chart - https://phabricator.wikimedia.org/T211708 (10akosiaris) Chart merged and is available at https://releases.wikimedia.org/charts/ [15:44:27] !log installing openssl 1.1 security updates on Hadoop workers [15:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:45] (03PS1) 10Alexandros Kosiaris: Add blubberoid kubernetes tokens [puppet] - 10https://gerrit.wikimedia.org/r/479461 (https://phabricator.wikimedia.org/T205919) [15:55:52] (03PS1) 10Volans: API: convert to new Spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/479463 (https://phabricator.wikimedia.org/T205884) [15:56:59] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) For part 2. of this please contact @CRoslof [15:57:23] 10Operations, 10ops-codfw: Interface errors on cr1-codfw:xe-5/3/1 - https://phabricator.wikimedia.org/T211715 (10ayounsi) 05Resolved>03Open a:05Papaul>03ayounsi Still seeing errors. Next step is to either replace the patch cable or contact Telia. [15:59:26] (03CR) 10Volans: [C: 04-2] "To be merged only after a new package of Spicerack including Id6860d84131854a68f4f630eb90d7fbbdaf3cd91 has been released to prod." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/479463 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:00:25] (03PS1) 10Andrew Bogott: Add another 'maps' VM to the NFS hiera config [puppet] - 10https://gerrit.wikimedia.org/r/479464 [16:01:20] (03CR) 10Andrew Bogott: [C: 032] Add another 'maps' VM to the NFS hiera config [puppet] - 10https://gerrit.wikimedia.org/r/479464 (owner: 10Andrew Bogott) [16:02:38] 10Operations, 10Product-Analytics, 10Patch-For-Review: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10mpopov) @EBjune I am waiting for someone with access to the internal WMF apt repository to make shiny-server available. @aborrero Howdy! I wasn't sure... [16:04:10] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) ` papaul@asw-a-codfw# run show interfaces ge-5/0/8 descriptions Interface Admin Link Description ge-5/0/8 down down DISABLED pa... [16:06:57] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) [16:07:12] (03CR) 10Hashar: "Will have to review it. But in short Jenkins and Zuul each have an embedded web server which is proxied behind Apache. That is why the pro" [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [16:11:19] Amir1: I am seeing issues with change_tag [16:11:42] oh oh, marostegui where? [16:11:48] Looks like a non existing key [16:11:49] let me show you [16:12:30] https://logstash.wikimedia.org/goto/e782d0d83bc7f014b512b579a50f5c0b [16:12:34] let me see if it is always the same host [16:13:01] So far I only see db1089 [16:13:48] Isn't that the key we dropped? [16:13:52] marostegui: it's missing index [16:14:07] yep, but isn't that the one we dropped? 
[16:14:16] AFAIK it should not be that one [16:14:25] let me dig deep [16:15:07] https://phabricator.wikimedia.org/T205904 [16:15:49] mhhh icinga-wm is offline, I'm in a meeting tho [16:16:18] marostegui: shit, I made a mistake here: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/476470/1/frontend/specialpages/reports/ProblemChanges_body.php [16:16:21] Amir1: That change_tag_rev_tag index isn't on tables sql [16:16:28] I make a patch and deploy it right away [16:16:43] Amir1: Glad to hear! Alter all the slaves would have taken longer! :) [16:16:51] Amir1: Do you need me to create a task or no need? [16:17:09] marostegui: if you make the task and assign it to me, it would be amazing [16:17:10] Hmm. Scap sync-file claimed a timeout and didn't log here, but otherwise seemed to work. I'm going to run it again just in case... [16:17:12] so I can deploy [16:17:17] godog: icinga appears nonresponsive to http or ssh [16:17:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10Cmjohnson) a:05Cmjohnson>03RobH @robh I did a racreset....all is working [16:17:18] Amir1: Doing it! [16:17:35] (03CR) 10Dzahn: "Hashar, i think that's already what i'm doing here. 
I am deleting the "common" profile and instead the httpd setup is moved to the new htt" [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [16:17:46] shdubsh: confirmed, can't login [16:18:27] I can get to the console tho [16:18:33] !log Deployed fix for T210937: API: Use parenthesized join in ApiQueryBase::showHiddenUsersAddBlockInfo [16:18:57] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) no ETA [16:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:00] T210937: API query for userprops not working on group0 wikis (maybe because comment migration read-new) - https://phabricator.wikimedia.org/T210937 [16:19:23] Amir1: https://phabricator.wikimedia.org/T211896 [16:19:49] [2062477.420767] tg3 0000:04:00.0 eno1: Link is down [16:20:17] (03CR) 10Hashar: [C: 031] "Arhhh yeah indeed :-) Sorry I am tired today!" [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [16:20:21] marostegui: The patch is up, let me find someone to review. anomie do you have a minute to check https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/479467/1/frontend/specialpages/reports/ProblemChanges_body.php ? [16:20:22] i just got a catchpoint email about icinga [16:20:24] dc work going on in eqiad right now? [16:20:26] yea, cant connect [16:20:33] Fix up of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/476470/1/frontend/specialpages/reports/ProblemChanges_body.php [16:20:44] icinga1001 is indeed off the network, I've logged off the console now [16:21:13] in a meeting now, can't look further [16:21:17] robh, cmjohnson1, papaul dc work going on right now? [16:21:19] cmjohnson1: are you working in the DC ? 
[16:21:21] (03CR) 10Hashar: [C: 031] ci::httpd: add support for stretch/PHP 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/478125 (owner: 10Dzahn) [16:22:02] mutante nothing that would be affecting anything [16:22:09] Amir1: I commented on the patch [16:22:42] Thanks. [16:23:09] mutante i take that back....apparently the network cable is really touchy on icinga1001 [16:23:23] i must've gently grazed it connecting to sulfur1001 [16:23:30] confirmed, it's back [16:23:34] it's back on now [16:23:38] cmjohnson1: ah! ok, i like that we have an explanation :) [16:23:40] any help needed? [16:23:42] <_joe_> hey what [16:23:49] <_joe_> going on? [16:23:56] thanks cmjohnson1 :) [16:23:59] <_joe_> icinga reports everything donw [16:24:04] network cable of icinga1001 was disconnected [16:24:08] <_joe_> ok [16:24:09] phew... [16:24:09] RECOVERY - Host 2620:0:862:1:91:198:174:106 is UP: PING OK - Packet loss = 0%, RTA = 83.79 ms [16:24:10] <_joe_> sigh [16:24:15] I was terrified for a bit [16:24:17] how does affect catchpoint? [16:24:18] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 24.49 ms [16:24:19] RECOVERY - Host 2620:0:862:1:91:198:174:122 is UP: PING OK - Packet loss = 0%, RTA = 84.31 ms [16:24:24] I got just a page for api, nothing else [16:24:30] mark: i got email from catchpoint telling me that icinga was down [16:24:32] RECOVERY - Host api.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:24:33] ok [16:24:40] I may need to replace that cable [16:24:42] that right there [16:24:56] i mean I gently grazed it at the most [16:25:13] <_joe_> cmjohnson1: I would do that before doing anything else [16:25:25] Is it okay to do now? [16:26:36] I think so, since we're all watching it [16:26:49] yea, should i stop icinga for a moment? 
[16:27:04] good idea mutante [16:27:24] !log icinga1001 - disable puppet, stopped icinga, for cable replacement [16:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:32] Amir1: Found a bug in your patch. [16:28:48] okay..i'm standing by cable is ready just say go [16:29:23] cmjohnson1: do it [16:29:36] the service is stopped [16:31:01] anomie: oh thanks. Nice catch [16:31:13] nodone [16:31:14] done [16:32:27] <_joe_> icinga still down [16:33:09] !log icinga1001 - started service again, enabeld puppet [16:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:50] thanks chris, done. in a meetingnow [16:40:34] Amir1: Is this query new or something and doesn't happen too often? Asking cause after the initial spike of errors, I don't see more them happening too often, last one is from .27 [16:40:41] 16:27 UTC I mean [16:40:55] (03CR) 10Cwhite: [C: 031] Remove Diamond from further roles [puppet] - 10https://gerrit.wikimedia.org/r/479446 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:41:07] it happens when someone requests a certain special page [16:41:19] it can spikes [16:41:28] Ah, I see… [16:41:50] I will leave the task with High priority then and not UBN [16:44:51] subbu: double-checking… You're ready for me to delete mw-expt.wikitextexp.eqiad.wmflabs and mw-base.wikitextexp.eqiad.wmflabs? [16:46:14] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T211899 (10ops-monitoring-bot) [16:48:54] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Services, 10Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) The reason Graphoid was initially developed was to provide a static image so we wouldn't have to bundle Vega... 
[16:49:04] subbu: never mind, deleted :) [16:51:53] (03PS5) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [16:53:03] (03CR) 10jerkins-bot: [V: 04-1] ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:53:19] (03CR) 10Cwhite: ci: define statsd prometheus exporter mappings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:54:30] (03PS6) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [16:57:00] (03PS1) 10Anomie: Set comment migration stage to new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479489 (https://phabricator.wikimedia.org/T166733) [16:57:29] (03CR) 10Anomie: [C: 032] "Deploying config change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479489 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:58:30] hi [16:58:49] I get a message "The file you uploaded seems to be empty. This might be due to a typo in the filename. Please check whether you really want to upload this file." [16:58:55] (03Merged) 10jenkins-bot: Set comment migration stage to new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479489 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:59:01] but the file is ok, and there is no typo [17:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1700). [17:00:04] dcausse: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:41] !log Set comment migration to new on group 1 (T166733) [17:00:42] I see bugs with the same error message, but nothing related to my issue [17:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:45] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [17:00:45] o/ [17:01:13] I want upload by URL this file https://archive.org/download/lillustrationjou04pari/lillustrationjou04pari.djvu [17:01:29] I downloaded it, it is perfectly ok [17:06:50] (03CR) 10jenkins-bot: Set comment migration stage to new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479489 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [17:07:11] https://phabricator.wikimedia.org/T211900 [17:09:24] anyone around for puppetswat and a very innocent patch to deploy? :) [17:09:46] dcausse: yes, I'll take a look [17:09:55] godog: thanks! :) [17:12:09] <_joe_> dcausse: sorry I'm in meetings until 7 pm [17:12:12] (03CR) 10Mathew.onipe: [C: 031] Add ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [17:12:22] _joe_: np! [17:13:36] (03CR) 10Cwhite: profile: enable statsd_exporter and add matching rules to logstash::collector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:15:28] (03CR) 10Filippo Giunchedi: [C: 032] elasticsearch: configure LVS endpoint for new eqiad clusters [puppet] - 10https://gerrit.wikimedia.org/r/479184 (https://phabricator.wikimedia.org/T207195) (owner: 10DCausse) [17:15:36] (03PS3) 10Filippo Giunchedi: elasticsearch: configure LVS endpoint for new eqiad clusters [puppet] - 10https://gerrit.wikimedia.org/r/479184 (https://phabricator.wikimedia.org/T207195) (owner: 10DCausse) [17:17:26] andrewbogott, sorry i was not on and off .. but all good. :) [17:17:53] yep! 
[17:18:59] !log T205919 create namespace for blubberoid on eqiad/codfw/staging clusters [17:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:03] T205919: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 [17:20:05] jouncebot: now [17:20:05] For the next 0 hour(s) and 39 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1700) [17:20:06] jouncebot: next [17:20:06] In 0 hour(s) and 39 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1800) [17:20:21] !log run puppet and bounce pybal on lvs in eqiad to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/479184 [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:11] 10Operations, 10netops, 10cloud-services-team (Kanban): labtestn neutron router is accesible from the internet - https://phabricator.wikimedia.org/T211901 (10aborrero) p:05Triage>03Normal [17:22:22] (03PS2) 10Alexandros Kosiaris: Add blubberoid kubernetes tokens [puppet] - 10https://gerrit.wikimedia.org/r/479461 (https://phabricator.wikimedia.org/T205919) [17:22:27] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add blubberoid kubernetes tokens [puppet] - 10https://gerrit.wikimedia.org/r/479461 (https://phabricator.wikimedia.org/T205919) (owner: 10Alexandros Kosiaris) [17:22:41] 10Operations: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10Dzahn) [17:26:16] Since there is no patch in puppet swat, I'm going to do a ores deploy now [17:26:21] dcausse: almost done [17:26:22] is it fine akosiaris? [17:26:24] godog: from my end this works very well [17:26:40] Amir1: there is a patch for puppet swat, though feel free to go ahead [17:26:55] oh thanks! 
[17:27:22] dcausse: sweet [17:28:11] 10Operations, 10User-ArielGlenn: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10ArielGlenn) [17:29:43] 10Operations, 10User-ArielGlenn: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10jijiki) [17:30:41] ores revision to rollback (in case needed) a9d5e95a4888cc40c5b4080c881c1a66cb9e09cd [17:31:09] https://phabricator.wikimedia.org/T211900 can someone at this urgently please? [17:31:15] *look [17:31:20] !log T168967 added shiny-server .deb to stretch-wikimedia [17:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:24] T168967: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 [17:31:32] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:31:36] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:31:50] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:32:00] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:32:00] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:32:00] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:32:02] Amir1: sure [17:32:02] !log ladsgroup@deploy1001 Started deploy [ores/deploy@1a3de73]: T211267 [17:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:06] T211267: ORES preaching seem to not understand it's own config - https://phabricator.wikimedia.org/T211267 
[17:32:17] ah zotero again [17:32:27] * akosiaris looking [17:32:31] 10Operations, 10Product-Analytics, 10Patch-For-Review: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10aborrero) Please @mpopov try now. If later is decided that having this package in the WMF repo is not the way to go, someone ping me and I will drop it... [17:32:34] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:33:10] godog: is it ok on your side? [17:33:29] hmm, doesn't look like a memory issues this time around [17:34:18] dcausse: yup looks to me! [17:34:31] godog: great! thanks for your time and the deploy! [17:34:43] damn citoid logs are useless [17:34:56] 10Operations, 10User-ArielGlenn: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10ArielGlenn) Do we want a feed of changes to the subteam's tickets in the channel? [17:35:19] dcausse: np! glad it worked [17:35:34] @akosiaris network policies reapplied maybe? 
[17:35:44] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:35:48] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:35:52] !log elastic@eqiad created cirrus metastore on psi&omega [17:35:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:35:52] fsero: go, be with your headache, I 'll debug this one [17:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:06] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:36:06] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [17:36:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:36:15] ah great, so now both clusters [17:36:18] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:36:24] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:36:42] 10Operations, 10User-ArielGlenn: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10ArielGlenn) [17:37:01] 10Operations, 10User-ArielGlenn: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10jijiki) [17:37:56] basic 
zotero requests look fine though [17:38:13] 10Operations, 10User-ArielGlenn: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10jijiki) >>! In T211902#4821032, @ArielGlenn wrote: > Do we want a feed of changes to the subteam's tickets in the channel? I think we should try and see how it goes, after... [17:38:19] 10Operations, 10User-ArielGlenn: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10ArielGlenn) [17:38:32] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Services, 10Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yurik) > However, that begs a question. If the caller of the graphoid service API already knows the hash of the graph, t... [17:39:15] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10jijiki) [17:39:21] akosiaris: 38:09.38448119Z (1)(+0003842): Error: ETIMEDOUT from zotero logs [17:40:13] ah dammit ... [17:40:30] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [17:40:34] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [17:40:35] probably the calico policy reapplied right? [17:40:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [17:40:44] fsero: yeah that's exactly what happened [17:40:49] why though... 
[17:40:50] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [17:40:50] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [17:40:54] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [17:40:56] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [17:40:57] ah nope, I know [17:41:02] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [17:41:02] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [17:41:05] the addition of the blubberoid token [17:41:08] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [17:41:10] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [17:41:12] needs an api server restart [17:41:15] ok that explains it all [17:41:26] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [17:41:36] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [17:41:36] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [17:41:36] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [17:41:58] !log reapply the zotero calico policy to allow LVS endpoints [17:41:58] I am *this* close to removing that check from the service [17:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:17] I 'll gather some data on who is actually citing wikipedia using wikipedia next week [17:42:34] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10Dzahn) [17:42:36] maybe it's not really used at all [17:42:37] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 
(10Afandian) @jijiki Sorry this is getting drawn out! I am unable to log in using the instructions at https://wikitech.wiki... [17:42:57] then i can finally rest in peace [17:43:01] ttyl [17:44:00] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10Dzahn) I made a subtask to request wm-bot to join our channel. That's what creates public logs for wikimedia- channels at https://wm-bot.wmflabs.org/logs/... [17:45:17] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10ArielGlenn) [17:45:36] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Imarlier) [17:45:55] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@1a3de73]: T211267 (duration: 13m 53s) [17:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:59] T211267: ORES preaching seem to not understand it's own config - https://phabricator.wikimedia.org/T211267 [17:46:34] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10Dzahn) [17:46:35] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:46:41] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:46:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:46:57] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: 
/en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:46:57] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:47:03] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:47:09] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:47:15] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:50:33] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:50:33] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:37] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:52:07] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [17:52:10] I need to deploy this https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/479480 [17:52:11] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [17:52:17] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [17:52:23] It's fataling on one special page everywhere [17:52:41] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [17:52:41] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:52:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: 
All endpoints are healthy [17:52:47] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [17:53:01] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [17:53:01] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [17:53:06] https://en.wikipedia.org/wiki/Special:ProblemChanges [17:53:27] marostegui: FYI ^ [17:55:47] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [17:55:47] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [17:59:13] 10Operations, 10Performance-Team, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), and 3 others: Expose PHP7/HHVM to NavTiming in a header, send with navtiming beacon so we can use it as a dimension - https://phabricator.wikimedia.org/T211906 (10Imarlier) p:05Triage>03Normal [18:00:05] cscott, arlolra, subbu, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1800). [18:00:25] I have a deployment for ores but I'm deploying something for mediawiki right now [18:04:37] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [18:05:00] (03PS1) 10Daniel Kinzler: Allow MediaInfo on commons to reference Wikidata items. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) [18:06:13] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 43 connections established with conf1004.eqiad.wmnet:4001 (min=45) [18:06:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [18:07:01] !log ladsgroup@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/FlaggedRevs/frontend/specialpages/reports/ProblemChanges_body.php: Use the right index for change_tag (T211896) (duration: 00m 46s) [18:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:05] T211896: Query trying to use the wrong index (change_tag_rev_tag) on change_tag - https://phabricator.wikimedia.org/T211896 [18:07:54] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10ArielGlenn) [18:08:11] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:08:24] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10ArielGlenn) [18:09:04] (03CR) 10Dduvall: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 (owner: 10Dduvall) [18:09:46] (03Abandoned) 10Dduvall: Use sed instead of envsubst [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 (owner: 10Dduvall) [18:10:43] (03CR) 10Jforrester: Allow MediaInfo on commons to reference Wikidata items. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) (owner: 10Daniel Kinzler) [18:11:03] !log restart pybal on lvs1006 for config updates [18:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:25] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 45 connections established with conf1004.eqiad.wmnet:4001 (min=45) [18:11:50] (03CR) 10Lucas Werkmeister (WMDE): Perform even more PHP constraint checks before falling back (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478630 (https://phabricator.wikimedia.org/T209504) (owner: 10Michael Große) [18:12:17] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:13:15] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:15] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:15:53] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:54] Amir1: You still deploying? [18:16:02] James_F: not anymore [18:16:11] OK, I'm grabbing the conch. [18:16:22] have fun! 
[18:16:23] (03PS2) 10BBlack: authdns: remove authdns-lint bits [puppet] - 10https://gerrit.wikimedia.org/r/479435 (https://phabricator.wikimedia.org/T205439) [18:16:25] (03PS2) 10BBlack: authdns-local-update: only support 3.x [puppet] - 10https://gerrit.wikimedia.org/r/479436 [18:16:27] (03PS2) 10BBlack: authdns spec: require stretch [puppet] - 10https://gerrit.wikimedia.org/r/479437 [18:16:29] (03PS1) 10BBlack: authdns: Remove temporary ensure => absent file entries [puppet] - 10https://gerrit.wikimedia.org/r/479499 [18:17:12] (03PS2) 10Jforrester: [Beta Only] Allow MediaInfo on Commons to reference Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) (owner: 10Daniel Kinzler) [18:17:17] (03CR) 10Jforrester: [C: 032] [Beta Only] Allow MediaInfo on Commons to reference Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) (owner: 10Daniel Kinzler) [18:17:22] (03PS7) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [18:17:24] (03CR) 10BBlack: [C: 032] authdns: remove authdns-lint bits [puppet] - 10https://gerrit.wikimedia.org/r/479435 (https://phabricator.wikimedia.org/T205439) (owner: 10BBlack) [18:17:46] (03CR) 10BBlack: [C: 032] authdns-local-update: only support 3.x [puppet] - 10https://gerrit.wikimedia.org/r/479436 (owner: 10BBlack) [18:18:10] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:18:22] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [18:18:22] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [18:19:15] (03Merged) 10jenkins-bot: [Beta Only] Allow MediaInfo on Commons to reference Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) (owner: 
10Daniel Kinzler) [18:19:28] (03CR) 10Daniel Kinzler: [Beta Only] Allow MediaInfo on Commons to reference Wikidata items (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) (owner: 10Daniel Kinzler) [18:20:12] (03CR) 10Jforrester: [Beta Only] Allow MediaInfo on Commons to reference Wikidata items (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) (owner: 10Daniel Kinzler) [18:21:00] (03CR) 10jenkins-bot: [Beta Only] Allow MediaInfo on Commons to reference Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479497 (https://phabricator.wikimedia.org/T204748) (owner: 10Daniel Kinzler) [18:22:41] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T204748 Rename repo-only Wikibase config for clarity [no-op] (duration: 00m 45s) [18:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:45] T204748: Create Federated Wikibase instance on Beta Commons - https://phabricator.wikimedia.org/T204748 [18:22:50] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (10RobH) 05Open>03stalled a:05RobH>03Papaul @papaul, We will be returning these to Farnam sometime this o... [18:24:29] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: T204748 [Beta only] Use newly-fixed config for Wikibase->Commons federation (duration: 00m 44s) [18:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:46] OK, done. 
[18:26:47] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@93443fe]: Refine MW API queries [18:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:27] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@93443fe]: Refine MW API queries (duration: 03m 41s) [18:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:58] (03CR) 10BBlack: [C: 032] authdns spec: require stretch [puppet] - 10https://gerrit.wikimedia.org/r/479437 (owner: 10BBlack) [18:31:00] !log arlolra@deploy1001 Started deploy [parsoid/deploy@e27574c]: Updating Parsoid to 4242ad0 [18:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:39] (03CR) 10BBlack: [C: 032] authdns: Remove temporary ensure => absent file entries [puppet] - 10https://gerrit.wikimedia.org/r/479499 (owner: 10BBlack) [18:33:31] (03CR) 10BryanDavis: [C: 031] logging: introduce cee formatter usage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478621 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [18:37:28] (03CR) 1020after4: [C: 031] Add jenkins-agent user to releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/474824 (owner: 10Thcipriani) [18:40:17] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@e27574c]: Updating Parsoid to 4242ad0 (duration: 09m 17s) [18:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:10] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (10Papaul) 05stalled>03Resolved This can be resolved then since i am done with it . 
[18:47:07] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10Dzahn) [18:49:20] (03PS1) 10Volans: package_builder: add component/spicerack support [puppet] - 10https://gerrit.wikimedia.org/r/479506 (https://phabricator.wikimedia.org/T205884) [18:50:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:51:09] (03CR) 10Smalyshev: "Agree with David - probably more defaults can be added." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479376 (https://phabricator.wikimedia.org/T210431) (owner: 10Mathew.onipe) [18:51:48] Updating [18:52:25] !log Updated Parsoid to 4242ad0 (T204622, T211738) [18:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:31] T204622: Use native Javascript (ES6) classes instead of prototype-based definition pattern in the Parsoid codebase - https://phabricator.wikimedia.org/T204622 [18:52:31] T211738: Clarify "mw:ExpandedAttr" annotation - https://phabricator.wikimedia.org/T211738 [18:52:51] (03CR) 10Smalyshev: [C: 031] [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [18:53:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:54:35] what's going on? [18:54:37] (03PS2) 10Volans: package_builder: add component/spicerack support [puppet] - 10https://gerrit.wikimedia.org/r/479506 (https://phabricator.wikimedia.org/T205884) [18:55:28] XioNoX: 1s eqiad<->esams link down? [18:55:31] err 1x [18:55:42] looking [18:56:38] bblack: yes, one link is down since 8min ago [18:56:41] ok [18:56:45] (03CR) 10Volans: "Compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/479506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [18:56:57] we had a spike of 503s centered on that, I guess that was the in-flight stuff + route flapping over, etc [18:57:14] it went away though, ~3-5m [18:57:52] I don't see any maintenance notifications for that one [18:58:14] will wait a bit and open a support ticket if they don't send an email first [18:58:58] 10Operations, 10User-ArielGlenn, 10User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (10Dzahn) [18:59:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T1900). [19:00:04] dmaza, Thiemo_WMDE, and dcausse: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:18] * Thiemo_WMDE is here :-) [19:00:19] o/ [19:00:29] hello [19:00:35] I can SWAT [19:02:17] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10CRoslof) @Dzahn Yes, we plan to renew all our current .ee domain registrations. [19:02:29] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234) (owner: 10Dmaza) [19:03:48] (03PS2) 10Muehlenhoff: Remove Diamond from further roles [puppet] - 10https://gerrit.wikimedia.org/r/479446 (https://phabricator.wikimedia.org/T183454) [19:04:43] (03PS2) 10DCausse: Enable Block notice stats on top blocking wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234) (owner: 10Dmaza) [19:04:58] (03CR) 10DCausse: Enable Block notice stats on top blocking wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234) (owner: 10Dmaza) [19:05:05] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234) (owner: 10Dmaza) [19:05:41] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond from further roles [puppet] - 10https://gerrit.wikimedia.org/r/479446 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [19:06:10] (03Merged) 10jenkins-bot: Enable Block notice stats on top blocking wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234) (owner: 10Dmaza) [19:06:12] (03PS1) 10Ottomata: Run Presto coordinator on ca-master-2 in cloud-analytics [puppet] - 10https://gerrit.wikimedia.org/r/479508 (https://phabricator.wikimedia.org/T204951) [19:06:52] (03PS2) 10Ottomata: Run Presto coordinator on ca-master-2 in cloud-analytics [puppet] - 10https://gerrit.wikimedia.org/r/479508 (https://phabricator.wikimedia.org/T204951) 
[19:06:53] dmaza: it's live on mwdebug1002, can you test? [19:07:43] (03PS3) 10Ottomata: Run Presto coordinator on ca-master-2 in cloud-analytics [puppet] - 10https://gerrit.wikimedia.org/r/479508 (https://phabricator.wikimedia.org/T204951) [19:07:46] (03CR) 10Ottomata: [C: 032] Run Presto coordinator on ca-master-2 in cloud-analytics [puppet] - 10https://gerrit.wikimedia.org/r/479508 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:07:49] yes.. it's gonna take me a few minutes [19:07:53] (03CR) 10Ottomata: [V: 032 C: 032] Run Presto coordinator on ca-master-2 in cloud-analytics [puppet] - 10https://gerrit.wikimedia.org/r/479508 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:07:54] checking now [19:07:54] sure [19:09:58] (03CR) 10jenkins-bot: Enable Block notice stats on top blocking wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479360 (https://phabricator.wikimedia.org/T211234) (owner: 10Dmaza) [19:12:24] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:12:32] (03PS3) 10Volans: apt: add component/spicerack support [puppet] - 10https://gerrit.wikimedia.org/r/479506 (https://phabricator.wikimedia.org/T205884) [19:13:38] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational [19:15:10] PROBLEM - Check systemd state on puppetdb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:15:39] dcausse: a few more mins.. sorry for the delay [19:15:44] np [19:15:54] (03PS1) 10Ottomata: Add ::profile::presto::server to ca-master-2 to run Presto coordinator [puppet] - 10https://gerrit.wikimedia.org/r/479509 (https://phabricator.wikimedia.org/T204951) [19:15:56] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/479506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [19:16:39] (03CR) 10Ottomata: [C: 032] Add ::profile::presto::server to ca-master-2 to run Presto coordinator [puppet] - 10https://gerrit.wikimedia.org/r/479509 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:16:50] moritzm: hmm wondering if you know what webperf flapped just now. Maintenance? [19:17:01] why* [19:17:34] RECOVERY - Check systemd state on puppetdb1001 is OK: OK - running: The system is fully operational [19:17:43] Maybe something for diamond crying shortly before being removed? [19:17:53] Krinkle: caused by the Diamond, there's a bug in the prerm script in the Debian package which makes it sometimes not properly clean up after itself [19:17:58] Diamond removal [19:18:00] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910787 [19:18:22] we have no good way to fix it (updating the package with a fixed prerm would also run the prerm once :-) [19:18:38] but as Diamond is outgoing anyway, this doesn't matter much [19:18:45] next puppet run fixes it [19:19:12] !log authdns1001: upgrading gdnsd to 2.99.9944-beta [19:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:18] (03PS4) 10Volans: apt: add component/spicerack support [puppet] - 10https://gerrit.wikimedia.org/r/479506 (https://phabricator.wikimedia.org/T205884) [19:20:38] dcausse: mwdebug1002 ?? [19:20:47] dmaza: yes [19:21:08] (03CR) 10Volans: [C: 032] apt: add component/spicerack support [puppet] - 10https://gerrit.wikimedia.org/r/479506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [19:21:20] oh yeah.. got it.. my bad [19:21:22] moritzm: thanks, makes sense. [19:21:23] looks good here [19:21:45] dcausse: we are good to go [19:21:45] dmaza: ok to sync? 
[19:21:47] ok [19:22:53] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T211234: Enable Block notice stats on top blocking wikis (duration: 00m 45s) [19:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:57] T211234: Set wgEnableBlockNoticeStats to true on 19 more wikis - https://phabricator.wikimedia.org/T211234 [19:24:26] dmaza: live [19:24:36] thank you dcausse [19:25:14] yw! [19:25:42] Thiemo_WMDE: it's live on mwdebug1002, is it possible for you to test? [19:27:25] !log multatuli: upgrading gdnsd to 2.99.9944-beta [19:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:37] dcausse: I'm testing! [19:27:44] thanks! :) [19:28:36] hm, 1002 doesn't respond... waiting... [19:28:40] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:46] hu [19:29:12] interesting: Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.8/includes/Title.php on line 1224 [19:29:25] cannot really be the patch. [19:29:50] !log authdns2001: upgrading gdnsd to 2.99.9944-beta [19:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:23] Thiemo_WMDE: can you try again? [19:30:31] dcausse: Now I got a response, and what I see is positive, the patch works as expected. 
[19:30:45] Thiemo_WMDE: ok, syncing [19:30:52] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74787 bytes in 0.173 second response time [19:33:07] (03PS1) 10BBlack: authdns: turn on NSID [puppet] - 10https://gerrit.wikimedia.org/r/479514 [19:33:08] (03PS1) 10BBlack: authdns: turn on NSID [puppet] - 10https://gerrit.wikimedia.org/r/479514 [19:33:54] 10Operations, 10Cloud-VPS, 10netops: Create cloud-in4 filter for labtestn - https://phabricator.wikimedia.org/T211921 (10chasemp) [19:33:57] (03CR) 10BBlack: [C: 03+2] authdns: turn on NSID [puppet] - 10https://gerrit.wikimedia.org/r/479514 (owner: 10BBlack) [19:33:59] (03CR) 10BBlack: [C: 032] authdns: turn on NSID [puppet] - 10https://gerrit.wikimedia.org/r/479514 (owner: 10BBlack) [19:34:02] 10Operations, 10Cloud-VPS, 10netops: Create cloud-in4 filter for labtestn - https://phabricator.wikimedia.org/T211921 (10chasemp) p:05Triage→03Normal [19:35:44] !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/TwoColConflict/: T210501: Add missing code to not loose edits on the other side (duration: 00m 45s) [19:35:45] (03CR) 1020after4: [C: 03+1] "awesome!" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [19:35:46] (03CR) 1020after4: [C: 031] "awesome!" 
[puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [19:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:48] T210501: Edits on the left (other) side get lost when using preview/diff - https://phabricator.wikimedia.org/T210501 [19:35:59] Thiemo_WMDE: it's live [19:37:47] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:37:48] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:39:01] (03PS1) 10Muehlenhoff: Switch jamesur to volunteer NDA [puppet] - 10https://gerrit.wikimedia.org/r/479516 [19:39:05] (03Merged) 10jenkins-bot: [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:39:08] dcausse: Awesome, thanks! 
[19:39:53] !log imported python-elasticsearch_5.4.0-1~deb9u1 into apt.w.o stretch-wikimedia component/spicerack - T205884 [19:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:57] T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules - https://phabricator.wikimedia.org/T205884 [19:40:46] !log removed labvirt1014 from debmonitor DB, has been renamed to cloudvirt1014 [19:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:15] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) Ok, thanks @CRoslof [19:42:44] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Dzahn) a:05Afandian→03Dzahn [19:44:51] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] fix cluster settings for temp clusters psi&omega (duration: 00m 44s) [19:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:55] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [19:46:34] !log SF Morning SWAT done [19:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:15] (03CR) 10jenkins-bot: [cirrus] fix cluster settings for temp clusters psi&omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479432 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:48:32] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Services, 10Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yurik) @akosiaris also, please add usage before the Varnish - to see how often graphoid objects are actually requested b... 
[19:49:02] !log creating 300 wiki indices in elastic-omega@eqiad [19:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:49] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Dzahn) @Afandian Hi, i can help you with this. First of all let me confim i see in the logs on bast1002 some failed login... [19:53:17] !log Ran scap pull on snapshot1005 to undo live changes done for dump performance testing [19:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:59] (03PS1) 10BBlack: authdns: synchronized cookie master key [puppet] - 10https://gerrit.wikimedia.org/r/479524 [19:59:20] (03CR) 10BBlack: [C: 03+2] authdns: synchronized cookie master key [puppet] - 10https://gerrit.wikimedia.org/r/479524 (owner: 10BBlack) [20:00:05] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T2000) [20:01:45] (03PS1) 10Ottomata: Make presto module and profile smarter [puppet] - 10https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) [20:01:51] (03CR) 10jerkins-bot: [V: 04-1] Make presto module and profile smarter [puppet] - 10https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [20:02:05] 10Operations, 10Performance-Team, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), and 3 others: Expose PHP7/HHVM to NavTiming in a header, send with navtiming beacon so we can use it as a dimension - https://phabricator.wikimedia.org/T211906 (10Gilles) If there's already a cookie p... [20:04:09] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Dzahn) >>! 
In T209298#4821089, @Afandian wrote: > Because I created the key in a different way to those given in the inst... [20:04:35] (03PS2) 10Ottomata: Make presto module and profile smarter [puppet] - 10https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) [20:05:29] (03CR) 10jerkins-bot: [V: 04-1] Make presto module and profile smarter [puppet] - 10https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [20:06:26] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Dzahn) @Afandian Could you make a new one with either "ssh-keygen -t ed25519" or "ssh-keygen -t rsa -b 4096 -o" per http... [20:07:09] (03PS3) 10Ottomata: Make presto module and profile smarter [puppet] - 10https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) [20:07:32] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Dzahn) a:05Dzahn→03Afandian [20:07:53] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Afandian) This was created, in error, using the standard arguments for ssh-keygen, i.e. RSA 2048, SHA256. My mistake. Pe... 
[20:08:02] (03CR) 10jerkins-bot: [V: 04-1] Make presto module and profile smarter [puppet] - 10https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [20:09:59] (03PS4) 10Ottomata: Make presto module and profile smarter [puppet] - 10https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) [20:10:44] !log LDAP - added mneisler to wmf (T211742) - existing shell user, so no gerrit change needed [20:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:50] T211742: LDAP access to the wmf group for Megan Neisler - https://phabricator.wikimedia.org/T211742 [20:11:18] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Done with CPT), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10CCicalese_WMF) [20:13:04] 10Operations, 10netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (10ayounsi) p:05Triage→03Normal [20:19:04] (03PS1) 10Dzahn: admins: replace/fix ssh key for afandian (rsa2048 -> ed25519) [puppet] - 10https://gerrit.wikimedia.org/r/479527 (https://phabricator.wikimedia.org/T209298) [20:21:05] (03CR) 10Dzahn: [C: 03+2] admins: replace/fix ssh key for afandian (rsa2048 -> ed25519) [puppet] - 10https://gerrit.wikimedia.org/r/479527 (https://phabricator.wikimedia.org/T209298) (owner: 10Dzahn) [20:21:32] !log imported elasticsearch-curator_5.2.0-1~deb9u1 into apt.w.o stretch-wikimedia component/spicerack - T205884 [20:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:36] T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules - https://phabricator.wikimedia.org/T205884 [20:21:53] (03CR) 10Dzahn: [C: 03+2] "fixes access in general because there was no prefix before the key before (typo)" [puppet] - 10https://gerrit.wikimedia.org/r/479527 
(https://phabricator.wikimedia.org/T209298) (owner: 10Dzahn) [20:25:30] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Dzahn) @Afandian I replaced the key and also noticed in https://gerrit.wikimedia.org/r/#/c/operatio... [20:29:54] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T211807 (10Dzahn) [20:29:56] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T211899 (10Dzahn) [20:30:46] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T211807 (10Dzahn) [20:33:57] !log creating 300+ wikis indices in elastic-psi @eqiad and @codfw [20:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:29] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Blubberoid - Create Helm Chart - https://phabricator.wikimedia.org/T211708 (10LarsWirzenius) Is this task done now? [20:37:33] 10Operations, 10Traffic, 10Performance-Team (Radar), 10Services (designing), and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [20:38:19] 10Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (10Dzahn) `apt-get clean` -> 93% So it's /srv and there the notable ones by size are: ~ 20G of Junos stuff for #netops , ~ 35G /srv/wikimedia/pool the APT repo of which are 20G main and 12G thirdparty. /srv/wikimedia/... 
[20:42:29] Operations, Release Pipeline, Release-Engineering-Team, Core Platform Team Backlog (Watching / External), and 2 others: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (thcipriani)
[20:42:36] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Blubberoid - Create Helm Chart - https://phabricator.wikimedia.org/T211708 (thcipriani) Open→Resolved >>! In T211708#4821969, @LarsWirzenius wrote: > Is this task done now? yes...
[20:42:50] Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (Dzahn) It's notable here that install1002 has more space than install2002 does. (should have picked the same size at VM creation?) but also install1002 uses more space (at least the aptrepo part should be automatical...
[20:47:19] (PS9) ArielGlenn: convert dump scripts to python3 [dumps] - https://gerrit.wikimedia.org/r/478702 (https://phabricator.wikimedia.org/T210989)
[20:51:26] (PS10) ArielGlenn: convert dump scripts to python3 [dumps] - https://gerrit.wikimedia.org/r/478702 (https://phabricator.wikimedia.org/T210989)
[20:59:45] !log otto@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: revert to version 0.26.3
[20:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:54] PROBLEM - superset on analytics-tool1003 is CRITICAL: connect to address 10.64.36.112 and port 9080: Connection refused
[21:00:04] this is me/andrew --^
[21:00:06] ^ oops shoudl have silenced
[21:00:10] should be back in a sec
[21:00:17] !log otto@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: revert to version 0.26.3 (duration: 00m 32s)
[21:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:06] RECOVERY - superset on analytics-tool1003 is OK: TCP OK - 0.001 second response
time on 10.64.36.112 port 9080
[21:02:14] (CR) Smalyshev: "I don't know enough about prometheus to evaluate this, unfortunately. We could either wait for Gehel to come back or somebody else with re" [puppet] - https://gerrit.wikimedia.org/r/479395 (https://phabricator.wikimedia.org/T208215) (owner: Mathew.onipe)
[21:08:28] (PS1) Herron: add forward/reverse records for kibana.svc.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/479539
[21:12:46] (CR) Dmaza: "@Reedy: I would really like a +1 here if you are comfortable with it so we can go ahead and deploy this asap. Unless you can think of any " [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[21:15:19] (PS1) Herron: add load balanced service kibana.svc.codfw.wmnet (eqiad backup) [puppet] - https://gerrit.wikimedia.org/r/479541 (https://phabricator.wikimedia.org/T205850)
[21:17:01] (PS5) Ottomata: Make presto module and profile smarter [puppet] - https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951)
[21:17:06] (CR) Ottomata: [V: +2 C: +2] Make presto module and profile smarter [puppet] - https://gerrit.wikimedia.org/r/479525 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata)
[21:18:02] (PS1) CDanis: cdanis dotfiles fun: auto-set reprepro vars [puppet] - https://gerrit.wikimedia.org/r/479543
[21:19:05] (CR) CDanis: [C: +2] cdanis dotfiles fun: auto-set reprepro vars [puppet] - https://gerrit.wikimedia.org/r/479543 (owner: CDanis)
[21:19:06] (PS2) CDanis: cdanis dotfiles fun: auto-set reprepro vars [puppet] - https://gerrit.wikimedia.org/r/479543
[21:20:30] (PS1) Ottomata: Fix missing $ in presto::server [puppet] - https://gerrit.wikimedia.org/r/479546
[21:21:15] (CR) Ottomata: [C: +2] Fix missing $ in presto::server [puppet] - https://gerrit.wikimedia.org/r/479546 (owner: Ottomata)
[21:31:02] (PS1) Ottomata: Fix
coordiantor typo in profile::presto::server [puppet] - https://gerrit.wikimedia.org/r/479548
[21:32:05] (CR) Ottomata: [C: +2] Fix coordiantor typo in profile::presto::server [puppet] - https://gerrit.wikimedia.org/r/479548 (owner: Ottomata)
[21:34:47] (CR) Reedy: [C: +1] Increase default minimum password length on multiple group [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[21:34:53] (PS1) Ottomata: Puppet booleans work find rendering to string in presto properties file [puppet] - https://gerrit.wikimedia.org/r/479550
[21:35:39] (CR) Ottomata: [C: +2] Puppet booleans work find rendering to string in presto properties file [puppet] - https://gerrit.wikimedia.org/r/479550 (owner: Ottomata)
[21:36:34] jouncebot: now
[21:36:34] For the next 0 hour(s) and 23 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181213T2000)
[21:36:36] jouncebot: next
[21:36:36] In 2 hour(s) and 23 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181214T0000)
[21:38:23] (PS1) Ottomata: presto::server should set up worker only by default [puppet] - https://gerrit.wikimedia.org/r/479552
[21:39:27] (CR) Ottomata: [C: +2] presto::server should set up worker only by default [puppet] - https://gerrit.wikimedia.org/r/479552 (owner: Ottomata)
[21:47:26] (PS1) Volans: spicerack: configure APT component/spicerack [puppet] - https://gerrit.wikimedia.org/r/479555 (https://phabricator.wikimedia.org/T205884)
[21:55:47] !log Ganeti - creating new 120G virtual disk on install2002 (T211850)
[21:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:51] T211850: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850
[22:08:30] Operations, Code-Stewardship-Reviews, Graphoid, Services,
Release-Engineering-Team (Kanban): graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (Tgr) >>! In T211881#4820828, @Milimetric wrote: > The reason Graphoid was initially developed was to provide a static im...
[22:08:56] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/13927/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/479327 (owner: Dzahn)
[22:09:03] (PS3) Dzahn: phabricator: use Stdlib::Fqdn data type for hostname parameters [puppet] - https://gerrit.wikimedia.org/r/479327
[22:13:31] (PS1) RobH: decom restbase200[1-6].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/479559 (https://phabricator.wikimedia.org/T211070)
[22:15:35] (PS1) RobH: decom restbase200[1-6] production dns entries [dns] - https://gerrit.wikimedia.org/r/479560 (https://phabricator.wikimedia.org/T211070)
[22:15:56] (CR) RobH: [C: +2] decom restbase200[1-6].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/479559 (https://phabricator.wikimedia.org/T211070) (owner: RobH)
[22:16:21] (PS2) RobH: decom restbase200[1-6].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/479559 (https://phabricator.wikimedia.org/T211070)
[22:16:25] hrmm
[22:17:07] (CR) RobH: [V: +2 C: +2] decom restbase200[1-6].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/479559 (https://phabricator.wikimedia.org/T211070) (owner: RobH)
[22:17:25] (CR) RobH: [C: +2] decom restbase200[1-6] production dns entries [dns] - https://gerrit.wikimedia.org/r/479560 (https://phabricator.wikimedia.org/T211070) (owner: RobH)
[22:21:59] Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (Dzahn) p:Triage→High
[22:22:20] Operations, User-ArielGlenn, User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (Dzahn) p:Triage→Normal
[22:23:23] Operations, Performance-Team,
Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (Dzahn)
[22:23:33] Operations, Performance-Team, Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (Dzahn) p:Triage→Normal
[22:24:50] Operations, ops-eqiad: Degraded RAID on ms-be1045 - https://phabricator.wikimedia.org/T211796 (Dzahn) T209618#4819650 shows a stress test was done on these. looks like it got too stressed
[22:25:43] Operations, media-storage, User-fgiunchedi: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (Dzahn) >>! In T209618#4819650, @Stashbot wrote: > stress-test ms-be10[44-50] - T209618 T211796 shows a new RAID failure on ms-be1045. looks like that happene...
[22:26:17] Operations, ops-codfw, DC-Ops, decommission: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 (RobH) a:RobH→Papaul
[22:27:19] Operations, Puppet, puppet-compiler: Cleanup the puppetmaster module so that we stop breaking expectations (and the puppet compiler) - https://phabricator.wikimedia.org/T211547 (Dzahn) p:Triage→High
[22:29:20] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/MobileFrontend: T211903 (duration: 00m 48s)
[22:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:24] T211903: McsContentProviderTest::testGetHtmlWithNoResponse Undefined index: 1 - https://phabricator.wikimedia.org/T211903
[22:30:10] Operations, Security-Team: Fetching ORES API from en.wikipedia.org blocked in debug mode - https://phabricator.wikimedia.org/T211511 (Dzahn) T137433 cc: @Ladsgroup ?
[22:31:15] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.8/includes/http/HttpRequestFactory.php: T211886 (duration: 00m 44s)
[22:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:23] T211886: Fatal error: request has exceeded memory limit in /srv/mediawiki/php-1.33.0-wmf.8/vendor/guzzlehttp/psr7/src/Stream.php on line 97 - https://phabricator.wikimedia.org/T211886
[22:31:30] Operations, ORES, Scoring-platform-team, Security-Team: Fetching ORES API from en.wikipedia.org blocked in debug mode - https://phabricator.wikimedia.org/T211511 (Dzahn)
[22:33:27] (PS1) Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/479561
[22:33:29] (PS1) Zoranzoki21: wmgBabelMainCategory: Update srwikinews translation, add srwikibooks, srwikisource, srwikiquote and srwiktionary translations [mediawiki-config] - https://gerrit.wikimedia.org/r/479562
[22:35:11] (Abandoned) Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/479561 (owner: Zoranzoki21)
[22:36:08] (PS2) Zoranzoki21: wmgBabelMainCategory: Update srwikinews translation, add srwikibooks, srwikiquote and srwiktionary translations [mediawiki-config] - https://gerrit.wikimedia.org/r/479562
[22:36:32] Operations, ops-eqiad: Degraded RAID on ms-be1045 - https://phabricator.wikimedia.org/T211796 (Cmjohnson) I am not seeing anything wrong with the disks Exit Code: 0x00 cmjohnson@ms-be1045:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Online, Spun Up...
[22:37:00] (PS1) Cwhite: profile: enable statsd_exporter and add matching rules to ores::worker [puppet] - https://gerrit.wikimedia.org/r/479563 (https://phabricator.wikimedia.org/T205870)
[22:37:55] Operations, Wikimedia-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (Dzahn) As MarcoAurelio said, we can create a wiki for chapters. That would run on our cluster just like all the other wikis. It would be separate from the other...
[22:38:15] Operations, Wikimedia-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (Dzahn) p:Triage→Normal
[22:40:55] (PS3) Jforrester: wmgBabelMainCategory: Update srwikinews translation, add srwikibooks, srwikiquote and srwiktionary translations [mediawiki-config] - https://gerrit.wikimedia.org/r/479562 (owner: Zoranzoki21)
[22:41:20] Operations, ops-eqiad: Degraded RAID on ms-be1045 - https://phabricator.wikimedia.org/T211796 (Dzahn) Open→Resolved a:Dzahn oh, false alert. the ticket got auto-created because the check failed. the cause was actually "CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by...
[22:41:24] Operations, ops-eqiad: Degraded RAID on ms-be1045 - https://phabricator.wikimedia.org/T211796 (Dzahn) Resolved→Invalid
[22:44:41] Operations, ops-codfw, ops-eqiad, ops-ulsfo: Devices with wmf* names and status active - https://phabricator.wikimedia.org/T209074 (Cmjohnson) Open→Resolved Moved all the servers that begin with wmf from active to planned
[22:50:54] Operations, Analytics, EventBus, Wikidata, and 8 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (CCicalese_WMF)
[22:51:30] (PS1) BBlack: prometheus gdnsd stats: add cookie stats [puppet] - https://gerrit.wikimedia.org/r/479564
[22:53:30] (CR) BBlack: [C: +2] prometheus gdnsd stats: add cookie stats [puppet] - https://gerrit.wikimedia.org/r/479564 (owner: BBlack)
[22:54:13] Operations, ops-codfw, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Done with CPT), and 3 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (CCicalese_WMF)
[23:11:02] (PS5) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:11:54] (CR) jerkins-bot: [V: -1] gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116 (owner: Dzahn)
[23:13:51] !log andrew@deploy1001 Started deploy [horizon/deploy@18c4ca6]: Rolling out fix for T131367
[23:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:55] T131367: Proxy corner case: proxy name foo.wmflabs.org == domain name foo.wmflabs.org - https://phabricator.wikimedia.org/T131367
[23:17:16] !log andrew@deploy1001 Finished deploy [horizon/deploy@18c4ca6]: Rolling out fix for T131367 (duration: 03m 25s)
[23:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:50] (PS6) Dzahn: gerrit: add data types
for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:20:30] (CR) jerkins-bot: [V: -1] gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116 (owner: Dzahn)
[23:24:59] (PS7) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:37:33] (PS2) Jforrester: Increase default minimum new password length to 10 for privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[23:38:13] I'm grabbing the conch.
[23:38:29] (CR) Jforrester: [C: +2] Increase default minimum new password length to 10 for privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[23:42:18] (PS1) Dduvall: ci: Permit ES traffic from jenkins masters to relforge [puppet] - https://gerrit.wikimedia.org/r/479567 (https://phabricator.wikimedia.org/T78705)
[23:43:00] (PS4) Paladox: php: Add support for php 7.3 [puppet] - https://gerrit.wikimedia.org/r/479144
[23:45:38] (PS3) Jforrester: Increase default minimum new password length to 10 for privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[23:45:44] (CR) Jforrester: [C: +2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[23:46:51] (Merged) jenkins-bot: Increase default minimum new password length to 10 for privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[23:49:44] Okie-dokie.
[23:49:50] mooeypoo: Ready to test?
[23:50:16] Live on mwdebug1002.
[23:51:00] (CR) jenkins-bot: Increase default minimum new password length to 10 for privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/478807 (https://phabricator.wikimedia.org/T208246) (owner: Dmaza)
[23:51:27] (PS8) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:54:37] Operations: Add LDAP to aezell for read/write access of Grafana - https://phabricator.wikimedia.org/T211945 (aezell)
[23:55:20] OK, I think we're good to go.
[23:55:37] (PS9) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:57:14] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T208246: Increase default minimum new password length to 10 for privileged groups (duration: 00m 44s)
[23:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:28] T208246: Change password length requirement and ensure enforcement for privileged users (from 8 to 10) - https://phabricator.wikimedia.org/T208246
[23:57:54] OK, conch is up.