[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T0000).
[00:00:04] kart_: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:15] * kart_ is here.
[00:00:20] Who can SWAT?
[00:02:40] kart_: bribe legoktm/reedy with a stroopwafel maybe, if no one else stands up?
[00:02:46] (PS1) GTirloni: wmcs::nfs::misc - Disable notifications temporarily [puppet] - https://gerrit.wikimedia.org/r/485977 (https://phabricator.wikimedia.org/T209527)
[00:02:57] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on kafka1023 is CRITICAL: 6.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T194249 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad+prometheus/ops
[00:03:33] (CR) GTirloni: [C: +2] wmcs::nfs::misc - Disable notifications temporarily [puppet] - https://gerrit.wikimedia.org/r/485977 (https://phabricator.wikimedia.org/T209527) (owner: GTirloni)
[00:03:52] p858snake: Sure :)
[00:04:07] ACKNOWLEDGEMENT - Check systemd state on cloudnet1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T214299
[00:05:18] !log T209527 disabled notifications for cloudstore100{8,9}
[00:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:22] T209527: Set up scratch and maps NFS services on cloudstore1008/9 - https://phabricator.wikimedia.org/T209527
[00:06:47] legoktm: can you SWAT?
[00:08:22] Operations, Cloud-Services: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (RobH) p:Triage→Normal
[00:09:12] ACKNOWLEDGEMENT - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Mount[/srv/scratch] daniel_zahn https://phabricator.wikimedia.org/T214079
[00:10:17] ACKNOWLEDGEMENT - Toolforge instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: worker class instances not spread out enough daniel_zahn https://phabricator.wikimedia.org/T214447
[00:10:19] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:13:23] Looks like no one is around for SWAT.
[00:14:10] Operations, Cloud-Services: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (RobH) Ok, just confirmed with @ayounsi that row B is the row for cloudvirt hosts! @papaul: Unless the #cloud-services-team states differently, I think you can move ahe...
[00:20:42] (PS1) EBernhardson: cirrus: Set shard counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485984 (https://phabricator.wikimedia.org/T213851)
[00:20:50] kart_: i can probably do swat i suppose
[00:21:53] (CR) jerkins-bot: [V: -1] cirrus: Set shard counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485984 (https://phabricator.wikimedia.org/T213851) (owner: EBernhardson)
[00:23:01] kart_: if you're around let me know and can ship it
[00:24:12] (PS2) EBernhardson: cirrus: Set shard counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485984 (https://phabricator.wikimedia.org/T213851)
[00:26:18] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/485984 (https://phabricator.wikimedia.org/T213851) (owner: EBernhardson)
[00:26:41] ebernhardson: thanks. Please go ahead.
[00:27:21] * kart_ was grabbing coffee
[00:27:37] (Merged) jenkins-bot: cirrus: Set shard counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485984 (https://phabricator.wikimedia.org/T213851) (owner: EBernhardson)
[00:28:14] (CR) jenkins-bot: cirrus: Set shard counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485984 (https://phabricator.wikimedia.org/T213851) (owner: EBernhardson)
[00:28:38] Operations, Cloud-Services: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (bd808) >>! In T214448#4900967, @RobH wrote: > Ok, just confirmed with @ayounsi that row B is the row for cloudvirt hosts! > > @papaul: Unless the #cloud-services-team...
[00:30:09] Operations, Cloud-Services: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (RobH)
[00:32:39] (PS1) EBernhardson: cirrus: Set replica counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485985 (https://phabricator.wikimedia.org/T213851)
[00:34:01] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/485985 (https://phabricator.wikimedia.org/T213851) (owner: EBernhardson)
[00:36:26] (Merged) jenkins-bot: cirrus: Set replica counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485985 (https://phabricator.wikimedia.org/T213851) (owner: EBernhardson)
[00:40:21] (PS2) Tim Starling: When running scripts from staging, use the CommonSettings.php from staging [puppet] - https://gerrit.wikimedia.org/r/480695
[00:40:44] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T213851 Cirrus: Setup archive index shard/replica counts (duration: 00m 54s)
[00:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:47] T213851: Drop "archive" from generic index on testwiki - https://phabricator.wikimedia.org/T213851
[00:40:47] (CR) Tim Starling: [C: +2] When running scripts from staging, use the CommonSettings.php from staging [puppet] - https://gerrit.wikimedia.org/r/480695 (owner: Tim Starling)
[00:40:49] (CR) jenkins-bot: cirrus: Set replica counts for new archive index [mediawiki-config] - https://gerrit.wikimedia.org/r/485985 (https://phabricator.wikimedia.org/T213851) (owner: EBernhardson)
[00:48:43] ebernhardson: patch merged :)
[00:49:07] kart_: it's syncing to mwdebug1001 right now
[00:49:20] kart_: is there anything to test? i see it's probably a maintenance script
[00:49:54] ebernhardson: it is just adding debug to script, so nothing can be tested but let me try to run script to make sure it is OK.
[00:50:36] ebernhardson: go ahead for sync to wmf.13, I can test it on mwdebug1001 :)
[00:50:43] Eh. I can't :)
[00:51:16] kart_: ok, syncing to cluster
[00:52:07] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/ContentTranslation/scripts/purge-unpublished-drafts.php: SWAT T203059 ContentTranslation: Remove waitForReplication for dry-run (duration: 00m 55s)
[00:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:11] T203059: Fourth manual run of unpublished draft purge script - https://phabricator.wikimedia.org/T203059
[00:52:15] kart_: ok, all shipped
[00:52:26] ebernhardson: cool. Thanks a lot.
[00:52:46] (PS8) Dzahn: geoip::maxmind: add data types, rm deprecated validate_string [puppet] - https://gerrit.wikimedia.org/r/483222
[00:53:00] (CR) Dzahn: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/483222 (owner: Dzahn)
[00:57:13] (PS7) CRusnov: Upgrade netbox to v2.5.3 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524)
[00:58:37] !log scandium - deleting /etc/apt/preferences.d/stretch_backports.pref ; apt-get remove nodejs
[00:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:49] (PS1) Krinkle: webperf: Document which class is which regarding xhgui [puppet] - https://gerrit.wikimedia.org/r/485990 (https://phabricator.wikimedia.org/T180761)
[01:05:05] !log scandium - deleting /etc/apt/preferences.d/stretch_backports.pref ; apt-get remove nodejs ; apt-get install -t stretch-backports npm ; now has nodejs 10 and npm from backports installed (T201366)
[01:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:08] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366
[01:08:03] (CR) Dzahn: "is this really just used on puppetmaster? https://puppet-compiler.wmflabs.org/compiler1001/102/" [puppet] - https://gerrit.wikimedia.org/r/483222 (owner: Dzahn)
[01:08:17] (PS3) Volans: decorators: make retry() DRY-RUN aware [software/spicerack] - https://gerrit.wikimedia.org/r/484582
[01:12:36] !log scandium - git cloning parsoid from gerrit - mediawiki/services/parsoid/deploy to /srv/deployment/parsoid/deploy ; still needs https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/484602/ (T201366)
[01:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:39] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366
[01:13:14] RECOVERY - parsoid on scandium is OK: HTTP OK: HTTP/1.1 200 OK - 1564 bytes in 0.058 second response time
[01:15:28] !log scandium - puppet run now without errors for the first time for the parsoid testing role on stretch instead of jessie. nodejs 10. - @subbu @arlolra you can start using it to replace ruthenium (T201366)
[01:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:19] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) >>! In T201366#4901109, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AWh4Rr...
[01:21:38] (PS2) Volans: sre.host: add Icinga downtime cookbook [cookbooks] - https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886)
[01:21:56] (CR) Volans: "replies inline" (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) (owner: Volans)
[01:23:59] (CR) Volans: [C: +1] "LGTM, thanks! But before merging let's plan a date and time for the upgrade." [software/netbox-deploy] - https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) (owner: CRusnov)
[01:37:34] RECOVERY - Backup of s6 in eqiad on db1115 is OK: Backup for s6 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-01-22 23:54:34 from dbstore1001.eqiad.wmnet:3316 (78 GB)
[01:51:36] RECOVERY - Backup of s4 in eqiad on db1115 is OK: Backup for s4 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-01-22 22:16:36 from db1102.eqiad.wmnet:3314 (108 GB)
[02:11:12] (PS2) Volans: puppet: add check_{en,dis}abled() methods [software/spicerack] - https://gerrit.wikimedia.org/r/485066
[02:12:57] (CR) Volans: "replies inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/485066 (owner: Volans)
[02:13:22] (PS3) Volans: documentation: fine-tune generated documentation [software/spicerack] - https://gerrit.wikimedia.org/r/484330
[02:31:38] (PS4) Tim Starling: Un-revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480419
[02:40:15] (CR) Tim Starling: [C: +2] Un-revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480419 (owner: Tim Starling)
[02:41:28] (Merged) jenkins-bot: Un-revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480419 (owner: Tim Starling)
[02:49:18] (CR) jenkins-bot: Un-revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480419 (owner: Tim Starling)
[02:50:20] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:51:26] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 76485 bytes in 0.132 second response time
[03:09:56] !log tstarling@deploy1001 Scap failed!: Call to mwscript eval.php returned: None
[03:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:14:56] seems to be the same problem as for labs
[03:22:45] !log manually edited LocalSettings.php in php-1.33.0-wmf.13 and php-1.33.0-wmf.14 to use a relative path, like in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480695/
[03:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:10] !log tstarling@deploy1001 Synchronized src/XWikimediaDebug.php: gerrit 480419 (duration: 00m 55s)
[03:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:46] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.14/LocalSettings.php: gerrit 480419 (duration: 00m 52s)
[03:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:32:31] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.13/LocalSettings.php: gerrit 480419 (duration: 00m 54s)
[03:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:36:52] !log tstarling@deploy1001 Synchronized wmf-config/arclamp.php: gerrit 480419 (duration: 00m 54s)
[03:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:38:36] !log tstarling@deploy1001 scap failed: average error rate on 9/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
[03:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:09] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit 480419 (duration: 00m 54s)
[03:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:41:42] !log tstarling@deploy1001 Synchronized wmf-config/profiler.php: gerrit 480419 (duration: 00m 54s)
[03:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:44:17] !log tstarling@deploy1001 Synchronized wmf-config/PhpAutoPrepend.php: gerrit 480419 (duration: 00m 52s)
[03:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:45:28] !log tstarling@deploy1001 Started scap: gerrit 480419
[03:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:02:38] PROBLEM - MariaDB Slave Lag: s1 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.67 seconds
[04:02:50] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.68 seconds
[04:02:56] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.66 seconds
[04:03:00] PROBLEM - MariaDB Slave Lag: s1 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.62 seconds
[04:03:00] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.70 seconds
[04:03:10] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.62 seconds
[04:03:18] PROBLEM - MariaDB Slave Lag: s1 on db2072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.64 seconds
[04:03:26] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.14 seconds
[04:03:28] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.18 seconds
[04:03:38] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.31 seconds
[04:05:01] !log tstarling@deploy1001 Finished scap: gerrit 480419 (duration: 19m 33s)
[04:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:44] (PS1) Tim Starling: In LocalSettings.php use a relative path to CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/486004
[04:18:44] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:54:09] (PS8) Tim Starling: Class wrapper for ProductionServices.php etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/477956
[05:14:14] Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) What IMO we need to know to understand how the service will deal with spikes: * Are requests cached?...
[05:39:04] Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Pchelolo) > Does that also go for errors? (Especially when rendering takes too long and gets aborted by th...
[06:02:55] Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Joe) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service will dea...
[06:10:18] (PS3) Marostegui: parsercachepurging: Increase keys TTL [puppet] - https://gerrit.wikimedia.org/r/485583 (https://phabricator.wikimedia.org/T210992)
[06:10:20] (PS3) Marostegui: InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992)
[06:13:49] (CR) Marostegui: [C: +2] InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[06:14:52] (Merged) jenkins-bot: InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[06:16:48] !log marostegui@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Increase parsercache TTL keys from 22 to 24 days T210992 (duration: 01m 06s)
[06:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:51] T210992: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992
[06:17:29] (CR) Marostegui: [C: +2] parsercachepurging: Increase keys TTL [puppet] - https://gerrit.wikimedia.org/r/485583 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[06:18:24] Operations, DBA, Patch-For-Review, Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (Marostegui) I have merged both changes after the review from @Krinkle (thanks!). Let's see how it goes
[06:19:01] Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) >>! In T213371#4901269, @Pchelolo wrote: > Given the req rate, my gut feeling is that PDFs will take...
[06:25:20] !log Stop s4 on db1102 to clone dbstore1004 - T210478
[06:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:23] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[06:25:47] (CR) jenkins-bot: InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[06:28:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:32:46] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:46:11] (PS1) Marostegui: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/486010 (https://phabricator.wikimedia.org/T210713)
[06:51:00] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/486010 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:52:06] (Merged) jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/486010 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:52:22] (CR) jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/486010 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:53:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113:3316 T210713 (duration: 00m 53s)
[06:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:46] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[06:53:50] !log Deploy schema change on db1113:3316 - T210713
[06:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:31] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/486013
[06:58:54] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:23] Operations, Maps: Kartotherian service on maps100[2-4] timed out on when trying to get tiles. - https://phabricator.wikimedia.org/T214434 (Gehel) Steps taken to fix this: * run `nodetool repair system_auth` on all nodes * recreate users and permissions, based on the `/usr/local/bin/maps-grants.cql` scr...
[07:05:31] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/486013 (owner: Marostegui)
[07:06:35] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/486013 (owner: Marostegui)
[07:07:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113:3316 T210713 (duration: 00m 52s)
[07:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:41] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:08:45] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/486016 (https://phabricator.wikimedia.org/T210713)
[07:10:12] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/486016 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:11:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/486016 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:12:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 T210713 (duration: 00m 53s)
[07:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:56] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:13:02] !log Deploy schema change on db1085, this will generate lag on s6 labs - T210713
[07:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:35] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/486013 (owner: Marostegui)
[07:18:37] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/486016 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:24:04] Operations, Traffic, netops: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (Iluvatar)
[07:25:17] Operations, Traffic, netops: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (Iluvatar)
[07:28:12] (PS1) Elukey: systemd::timer::job: add proper handling of ensure [puppet] - https://gerrit.wikimedia.org/r/486017 (https://phabricator.wikimedia.org/T172532)
[07:29:39] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/486018
[07:34:42] RECOVERY - Backup of s5 in codfw on db1115 is OK: Backup for s5 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-01-23 04:27:40 from dbstore2001.codfw.wmnet:3315 (94 GB)
[07:35:34] (CR) Elukey: "Looks good from the pcc's perspective: https://puppet-compiler.wmflabs.org/compiler1002/14430/ (no-op)" [puppet] - https://gerrit.wikimedia.org/r/486017 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[07:35:57] if anybody wants to review --^ :)
[07:36:12] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:36:56] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/486018 (owner: Marostegui)
[07:37:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:46] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:05] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/486018 (owner: Marostegui)
[07:39:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085 T210713 (duration: 00m 52s)
[07:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:23] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:44:14] (PS1) Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/486022 (https://phabricator.wikimedia.org/T210713)
[07:45:09] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/486018 (owner: Marostegui)
[07:45:46] RECOVERY - Backup of s2 in codfw on db1115 is OK: Backup for s2 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-01-23 03:00:55 from dbstore2001.codfw.wmnet:3312 (113 GB)
[07:46:06] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/486022 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:47:08] (Merged) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/486022 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:48:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1093 T210713 (duration: 00m 54s)
[07:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:29] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:48:30] !log Deploy schema change on db1093 - T210713
[07:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:32] !log Compress tables on dbstore1004:3314 - T210478
[07:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:35] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[07:58:13] (CR) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/486022 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[08:03:17] (PS1) Elukey: profile::analytics::refinery::job::spark_job: handle ensure and cleanup [puppet] - https://gerrit.wikimedia.org/r/486024 (https://phabricator.wikimedia.org/T172532)
[08:03:57] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/486025
[08:07:46] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14431/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/486024 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[08:07:48] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/486025 (owner: Marostegui)
[08:08:27] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/486025 (owner: Marostegui)
[08:09:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1093 T210713 (duration: 00m 52s)
[08:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:31] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[08:09:36] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/486026 (https://phabricator.wikimedia.org/T210713)
[08:10:45] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/486026 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[08:11:44] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/486025 (owner: Marostegui)
[08:11:48] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/486026 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[08:12:03] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/486026 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[08:12:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 T210713 (duration: 00m 52s)
[08:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:54] !log Deploy schema change on db1088 T210713
[08:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:14] !log Add dbstore1004:3314 to zarcillo - T210478
[08:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:17] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[08:18:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:00] !log Add dbstore1004:3314 to tendril - T210478
[08:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:29] Operations, MediaWiki-Cache, Performance-Team, User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (Joe) We still have some usages of nutcracker for memcached, specifically within `$wgObjectCaches['mysql-multiwrite']`, whic...
[08:24:33] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/486028
[08:25:39] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/486028 (owner: Marostegui)
[08:26:41] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/486028 (owner: Marostegui)
[08:27:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1088 T210713 (duration: 00m 55s)
[08:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:54] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[08:28:45] !log Deploy schema change on db1061 (s6 primary master) - T210713
[08:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:37:22] (PS1) Hashar: (DO NOT SUBMIT) Debug for _is_published [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/486029 (https://phabricator.wikimedia.org/T214441)
[08:38:18] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/486028 (owner: Marostegui)
[08:38:48] (CR) jerkins-bot: [V: -1] (DO NOT SUBMIT) Debug for _is_published [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/486029 (https://phabricator.wikimedia.org/T214441) (owner: Hashar)
[08:44:14] !log addshore@mwmaint1002:~$ mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki yuewiktionary --batch-size 1000 // T214400
[08:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:17] T214400: Add yue.wikt to Cognate - https://phabricator.wikimedia.org/T214400
[08:46:28] (CR) Alexandros Kosiaris: [C: +1] "Yes, that I know of" [puppet] - https://gerrit.wikimedia.org/r/483222 (owner: Dzahn)
[08:47:31] (PS1) Elukey: Create a testing Analytics Druid cluster [puppet] - https://gerrit.wikimedia.org/r/486030 (https://phabricator.wikimedia.org/T212256)
[08:50:03] (CR) Alexandros Kosiaris: "I can't either. Nothing in the repo and nothing under https://tools.wmflabs.org/openstack-browser/puppetclass/." [puppet] - https://gerrit.wikimedia.org/r/485098 (owner: Dzahn)
[08:54:24] Operations, Traffic, Wikidata, serviceops, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Addshore) >>! In T99531#4878798, @CRoslof wrote: > Transferring the domain name from WMDE to the Foundation requires that WMDE complete an own...
[08:54:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Mathew.onipe) 05Open→03Invalid I'm closing this task as invalid I no longer see any error [08:55:00] !log Deploy schema change on s5 codfw master with replication, lag will be generated - T210713 [08:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:03] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [08:59:27] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Did a local diff between the configmaps of 2 helm releases created by the old and the new version of the chart and the only difference is " [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [09:00:10] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add an stdout log stanza to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/483227 (owner: 10Alexandros Kosiaris) [09:01:11] (03CR) 10Filippo Giunchedi: [C: 03+1] systemd::timer::job: add proper handling of ensure [puppet] - 10https://gerrit.wikimedia.org/r/486017 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:03:48] (03CR) 10Elukey: [C: 03+2] systemd::timer::job: add proper handling of ensure [puppet] - 10https://gerrit.wikimedia.org/r/486017 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:04:18] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::spark_job: handle ensure and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/486024 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:07:06] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:07:22] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully 
operational [09:22:50] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [09:22:51] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [09:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:52] !log akosiaris@deploy1001 scap-helm mathoid finished [09:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:31] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml --set resources.replicas=1 staging stable/mathoid [namespace: mathoid, clusters: staging] [09:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:38] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [09:23:38] !log akosiaris@deploy1001 scap-helm mathoid finished [09:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:00] Hi there! Just asking: is it possible for a SWAT member to dry-run a maintscript during a SWAT window? [09:29:47] (03CR) 10Elukey: [C: 03+2] Create a testing Analytics Druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/486030 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:29:54] (03PS2) 10Elukey: Create a testing Analytics Druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/486030 (https://phabricator.wikimedia.org/T212256) [09:30:55] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1003.eqiad.wmnet [09:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:33] Daimona: sure [09:35:05] is it connected to a patch that needs to be deployed? 
[09:38:40] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [09:38:41] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [09:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:43] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [09:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:43] !log akosiaris@deploy1001 scap-helm mathoid finished [09:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:00] !log upgrade mathoid in eqiad and codfw to latest chart version [09:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:50] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:43:26] zeljkof: Yes [09:43:50] Daimona: just leave a comment in gerrit, so we don't forget during swat [09:44:03] but yes, running a script during swat is normal operating procedure [09:44:07] I added a patch to deploy both on wmf.13 and .14 and it contains an update of a script which should be run [09:44:13] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Maintenance_scripts [09:44:15] Cool, thanks! [09:44:22] (03PS1) 10Elukey: Clean up old code in Analytics' classes using systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/486037 (https://phabricator.wikimedia.org/T172532) [09:44:33] Should I mention this in the commit msg or on wikitech (or both)? 
[09:44:48] Daimona: just as a gerrit comment [09:45:03] or in the commit message, if you think it's important [09:45:03] Ok, thanks :) [09:45:14] I'll go for the commit msg [09:45:32] I'm not sure if a commit message should be about deployment notes [09:45:42] but then, if it's important, put it there [09:45:47] can't hurt [09:46:05] 10Operations, 10Traffic, 10netops: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (10Aklapper) Thanks for the good report! See https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue for the full list. :) [09:47:28] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14433/" [puppet] - 10https://gerrit.wikimedia.org/r/486037 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:47:32] (03PS1) 10Giuseppe Lavagetto: Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486038 (https://phabricator.wikimedia.org/T214441) [09:48:14] (03CR) 10GTirloni: [C: 03+1] toolforge: Prometheus replacement for sge.py diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [09:48:26] Well, the goal of the patch is solely to fix a script, so I guess it's pretty important [09:51:10] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors [09:51:54] this might be related to my clean ups [09:55:03] (but I don't recall how to force a icinga config check) [09:58:11] Error: Could not find any hostgroup matching 'druid_test_analytics_eqiad' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 1744) [09:58:14] buuuuu [09:58:30] these analytics people [09:58:32] :D [09:58:33] fixing [09:58:47] (I'll also update https://wikitech.wikimedia.org/wiki/Icinga) [10:01:28] (03PS1) 10Elukey: profile::analyytics::refinery::job: remove old cron-related parameters [puppet] - 
10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:01:30] (03Abandoned) 10Hashar: (DO NOT SUBMIT) Debug for _is_published [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486029 (https://phabricator.wikimedia.org/T214441) (owner: 10Hashar) [10:02:08] (03CR) 10jerkins-bot: [V: 04-1] profile::analyytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [10:05:17] (03CR) 10Hashar: [C: 03+2] "Works for me now :)" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486038 (https://phabricator.wikimedia.org/T214441) (owner: 10Giuseppe Lavagetto) [10:05:46] (03PS2) 10Elukey: profile::analyytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:06:41] (03Merged) 10jenkins-bot: Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486038 (https://phabricator.wikimedia.org/T214441) (owner: 10Giuseppe Lavagetto) [10:07:09] (03CR) 10jenkins-bot: Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486038 (https://phabricator.wikimedia.org/T214441) (owner: 10Giuseppe Lavagetto) [10:07:50] (03PS1) 10Elukey: Add cluster config for druid_test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/486041 (https://phabricator.wikimedia.org/T212256) [10:09:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486042 (https://phabricator.wikimedia.org/T210713) [10:10:40] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486042 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:11:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/486042 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:12:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 T210713 (duration: 00m 53s) [10:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:56] !log Deploy schema change on db1096:3315 - T210713 [10:12:56] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:12] 10Operations, 10Traffic, 10netops: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (10Iluvatar) > See [wikitech] for the full list" Do you mean "Additional troubleshooting" sectional and curl to http? Is this a completely full list or maybe you need something else? The problem... [10:13:34] (03PS3) 10Elukey: profile::analyytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:16:28] (03PS4) 10Elukey: profile::analyytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:17:02] I steal this time to deploy a security patc [10:17:05] *patch [10:17:20] https://phabricator.wikimedia.org/T207814 [10:18:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486042 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:20:39] (03PS5) 10Elukey: profile::analyytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:20:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486046 [10:21:24] 10Operations, 10Traffic, 
10netops: Connection problem (Moscow ISP, 4G) - https://phabricator.wikimedia.org/T214459 (10Iluvatar) [10:22:29] Amir1: let me know when you are done [10:23:14] sure [10:23:35] (03PS6) 10Elukey: profile::analyytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:25:34] (03CR) 10Elukey: [C: 03+2] Add cluster config for druid_test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/486041 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [10:27:06] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) @Cmjohnson you've got any rough ETA for these? Thanks! [10:31:48] !log Deployed patch for T207814 on wmf.13 [10:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:14] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: neutron: introduce base interface hiera key [puppet] - 10https://gerrit.wikimedia.org/r/486047 (https://phabricator.wikimedia.org/T214299) [10:33:33] !log Deployed patch for T207814 on wmf.14 [10:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:50] marostegui: I'm done. 
Tell me if you see anything wrong in the logs [10:35:23] sure [10:35:24] thanks [10:35:28] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486046 (owner: 10Marostegui) [10:36:35] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486046 (owner: 10Marostegui) [10:37:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3315 T210713 (duration: 00m 52s) [10:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:42] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:38:17] (03PS7) 10Elukey: profile::analytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:38:24] (03PS8) 10Elukey: profile::analytics::refinery::job: remove old cron-related parameters [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) [10:38:50] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14438/" [puppet] - 10https://gerrit.wikimedia.org/r/486040 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [10:39:10] !log updating puppet catalog compiler facts: `PUPPET_COMPILER=compiler1002.puppet-diffs.eqiad.wmflabs modules/puppet_compiler/files/compiler-update-facts` [10:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:19] (03PS1) 10Elukey: Add druid_test_analytics_eqiad to monitoring_groups [puppet] - 10https://gerrit.wikimedia.org/r/486050 (https://phabricator.wikimedia.org/T212256) [10:41:55] (03CR) 10Elukey: [C: 03+2] Add druid_test_analytics_eqiad to monitoring_groups [puppet] - 10https://gerrit.wikimedia.org/r/486050 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [10:43:26] (03PS1) 10Filippo Giunchedi: WIP 
prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [10:44:00] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [10:44:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486046 (owner: 10Marostegui) [10:45:19] (03CR) 10Gehel: [C: 04-1] "Much cleaner! The complexity is now hidden from the clients!" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [10:46:50] (03PS2) 10Filippo Giunchedi: WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [10:47:15] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486052 (https://phabricator.wikimedia.org/T210713) [10:47:29] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [10:48:35] 10Operations: puppet: compiler-update-facts error - https://phabricator.wikimedia.org/T214472 (10aborrero) [10:48:42] (03CR) 10Gehel: sre.host: add Icinga downtime cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [10:49:55] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486052 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:50:33] 10Operations: puppet: compiler-update-facts error and warning - https://phabricator.wikimedia.org/T214472 (10aborrero) [10:50:49] 10Operations: puppet: compiler-update-facts error and warning - 
https://phabricator.wikimedia.org/T214472 (10aborrero) [10:51:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486052 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:52:12] (03PS1) 10Elukey: role::druid::test_analytics::worker: use zookeeper 3.4.9-3+deb9u1 [puppet] - 10https://gerrit.wikimedia.org/r/486053 (https://phabricator.wikimedia.org/T212256) [10:52:13] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[zookeeper] [10:52:17] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct [10:52:23] PROBLEM - Druid middlemanager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server middleManager [10:52:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 T210713 (duration: 00m 52s) [10:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:52:48] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: use zookeeper 3.4.9-3+deb9u1 [puppet] - 10https://gerrit.wikimedia.org/r/486053 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [10:52:51] (03CR) 10Gehel: [C: 03+1] "LGTM" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 (owner: 10Volans) [10:53:06] (03PS1) 10Arturo Borrero Gonzalez: compiler-update-facts: refresh default compiler VM name [puppet] - 10https://gerrit.wikimedia.org/r/486054 [10:53:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486055 [10:53:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] compiler-update-facts: refresh default compiler VM name 
[puppet] - 10https://gerrit.wikimedia.org/r/486054 (owner: 10Arturo Borrero Gonzalez) [10:55:59] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: neutron: introduce base interface hiera key [puppet] - 10https://gerrit.wikimedia.org/r/486047 (https://phabricator.wikimedia.org/T214299) [10:56:18] (03CR) 10Gehel: [C: 03+1] "Lot of magic! But still mostly readable. LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/485976 (owner: 10Volans) [10:57:20] (03PS1) 10Hashar: (WIP) better handle registry http codes [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486056 [10:58:12] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486052 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:58:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Catalog compilatio as expected: https://puppet-compiler.wmflabs.org/compiler1002/14441/" [puppet] - 10https://gerrit.wikimedia.org/r/486047 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo Borrero Gonzalez) [10:58:31] akosiaris: hey, tell me if I can do anything for moving ores to k8s for now [10:58:41] I'm not sure if I'm able to do anything [10:59:40] Amir1: you could comment on https://phabricator.wikimedia.org/T211625 if you feel like it. 
It's mostly about how we will support ORES [10:59:50] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486055 (owner: 10Marostegui) [10:59:57] noted [11:00:31] (03PS1) 10Elukey: role::druid::test_analytics::worker: fix druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/486057 (https://phabricator.wikimedia.org/T212256) [11:00:54] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: fix druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/486057 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [11:01:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486055 (owner: 10Marostegui) [11:02:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 T210713 (duration: 00m 52s) [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:15] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [11:02:19] (03PS3) 10Filippo Giunchedi: WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [11:03:06] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Marostegui) Maybe @MoritzMuehlenhoff can give some ideas [11:04:18] !log T214299 reboot cloudnet2001-dev, cloudnet2002-dev and cloudnet1003 for new interface names [11:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:21] T214299: cloudvps: neutron: upgrade jessie -> stretch - https://phabricator.wikimedia.org/T214299 [11:05:21] PROBLEM - Host cloudnet2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:13:33] RECOVERY - Host cloudnet2002-dev is UP: PING OK - Packet loss = 0%, RTA = 
36.41 ms [11:14:09] RECOVERY - Check systemd state on cloudnet2002-dev is OK: OK - running: The system is fully operational [11:17:20] (03PS1) 10Filippo Giunchedi: WIP: hieradata: use v2 for prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/486059 [11:17:35] (03CR) 10Gehel: [C: 04-1] "Quick test shows that there is a problem (I did absolutely no investigation)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [11:22:08] (03PS5) 10Filippo Giunchedi: WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [11:22:10] (03PS2) 10Filippo Giunchedi: WIP: hieradata: use v2 for prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/486059 [11:25:32] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: neutron: l3_agent: merge similar code for eqiad1/labtestn [puppet] - 10https://gerrit.wikimedia.org/r/486060 (https://phabricator.wikimedia.org/T214299) [11:26:23] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: neutron: l3_agent: merge similar code for eqiad1/labtestn [puppet] - 10https://gerrit.wikimedia.org/r/486060 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo Borrero Gonzalez) [11:28:58] (03PS1) 10Elukey: role::druid::test_analytics::worker: disable zookeeper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/486061 (https://phabricator.wikimedia.org/T212256) [11:29:15] jouncebot: now [11:29:15] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [11:29:18] jouncebot: next [11:29:18] In 0 hour(s) and 30 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1200) [11:29:31] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: disable zookeeper monitoring [puppet] - 10https://gerrit.wikimedia.org/r/486061 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [11:30:15] (03CR) 10Filippo 
Giunchedi: [C: 03+2] "I'll go ahead and merge this since we'll need it for https://gerrit.wikimedia.org/r/c/operations/puppet/+/486051 as well" [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [11:30:23] (03PS7) 10Filippo Giunchedi: role: add prometheus2 backwards-compatibility rules [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [11:34:44] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: neutron: l3_agent: merge similar code for eqiad1/labtestn [puppet] - 10https://gerrit.wikimedia.org/r/486060 (https://phabricator.wikimedia.org/T214299) [11:36:33] (03PS1) 10Mathew.onipe: maps: migrate maps1001 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/486062 (https://phabricator.wikimedia.org/T198622) [11:38:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Catalog looks ok: https://puppet-compiler.wmflabs.org/compiler1002/14447/" [puppet] - 10https://gerrit.wikimedia.org/r/486060 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo Borrero Gonzalez) [11:39:02] (03PS1) 10Reedy: Fix EP namespace typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486066 (https://phabricator.wikimedia.org/T214456) [11:42:27] (03CR) 10Reedy: [C: 03+2] Fix EP namespace typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486066 (https://phabricator.wikimedia.org/T214456) (owner: 10Reedy) [11:43:24] (03Merged) 10jenkins-bot: Fix EP namespace typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486066 (https://phabricator.wikimedia.org/T214456) (owner: 10Reedy) [11:44:01] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: T214456 (duration: 00m 53s) [11:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:09] T214456: Spelling error in education program talk namespace name - https://phabricator.wikimedia.org/T214456 [11:51:34] 10Operations, 10Traffic, 10netops: Connection problem (Moscow 
ISP, 4G) - https://phabricator.wikimedia.org/T214459 (10akosiaris) Adding that a, from our `esams` DC, traceroute to this IP seems to stop before `beelive.ru` ` $ traceroute 83.220.238.125 traceroute to 83.220.238.125 (83.220.238.125), 30 hops m... [11:52:01] (03CR) 10Filippo Giunchedi: "drive-by comment: is there an example of the metrics being generated?" [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [11:52:03] (03CR) 10jenkins-bot: Fix EP namespace typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486066 (https://phabricator.wikimedia.org/T214456) (owner: 10Reedy) [11:52:05] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: neutron: l3_agent: drop redundant hiera keys for interface names [puppet] - 10https://gerrit.wikimedia.org/r/486067 [11:52:54] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: neutron: l3_agent: drop redundant hiera keys for interface names [puppet] - 10https://gerrit.wikimedia.org/r/486067 [11:56:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Catalog compilation: https://puppet-compiler.wmflabs.org/compiler1002/14448/" [puppet] - 10https://gerrit.wikimedia.org/r/486067 (owner: 10Arturo Borrero Gonzalez) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1200). [12:00:04] aharoni, Daimona, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:17] I can swat today [12:00:25] hallo zeljkof [12:00:29] o/ [12:00:29] o/ [12:00:44] That was synced lol [12:00:59] hi aharoni :) I'll ping you in a few minutes when your patch is at mwdebug1002 ready for testing [12:01:23] (03PS3) 10Zfilipin: Define ImportSources for nywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485151 (owner: 10Amire80) [12:01:30] Daimona: I'll start merging one of the patches, they might take a while to merge [12:01:47] zeljkof: Yes, sure, they always do [12:01:57] aharoni: is there a task connected to your patch? it's not in the commit message [12:02:25] aharoni: it's not required, but it's usually there :) [12:02:33] Mmmm... no. [12:02:50] did the requester just ask in irc? [12:03:00] zeljkof: https://phabricator.wikimedia.org/T214139 is kind of related, but not really about this directly. [12:03:06] yes, asked me [12:03:29] I met him at the last Wikimania, and we've been talking frequently since then [12:03:35] aharoni: there's usually some trail about why some configuration is changed, phab task, community discussion on-wiki... 
[12:04:06] aharoni: anyway, I'm fine with deploying it as-is, if there's no task [12:04:09] OK :) [12:04:39] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485151 (owner: 10Amire80) [12:04:40] for what it's worth, that's why I created https://phabricator.wikimedia.org/T214139 - so that there would be a sensible default instead of ad hoc requests [12:05:43] (03Merged) 10jenkins-bot: Define ImportSources for nywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485151 (owner: 10Amire80) [12:07:09] zeljkof: it works [12:07:12] aharoni: the patch is at mwdebug1002, please test [12:07:17] well, that was quick :) [12:07:21] :) [12:07:23] ok, deploying then [12:07:31] thank you [12:09:46] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485151|Define ImportSources for nywiki]] (duration: 00m 54s) [12:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:02] aharoni: it's deployed, please test and thanks for deploying with #releng ;) [12:10:11] site config changes really should have a phab task, for tracking the future if needed, although importsources is generally something that doesn't cause much issue. 
even chapter wikis and etc that don't follow the normal census system [12:10:58] Amir1: go ahead and deploy your config changes, it will take a while for Daimona's patches to merge [12:11:06] Sure [12:11:09] Thanks [12:12:04] (03PS3) 10Ladsgroup: Fix project talk namespace alias of Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) [12:13:35] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) (owner: 10Ladsgroup) [12:14:41] (03Merged) 10jenkins-bot: Fix project talk namespace alias of Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) (owner: 10Ladsgroup) [12:14:46] (03PS3) 10Ladsgroup: Add new namespace abbreviation for Swedish (sv) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329) (owner: 10Ammarpad) [12:15:58] (03Abandoned) 10Muehlenhoff: Enable base::service_auto_restart for certcentral [puppet] - 10https://gerrit.wikimedia.org/r/481985 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:17:39] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329) (owner: 10Ammarpad) [12:17:42] (03CR) 10jenkins-bot: Define ImportSources for nywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485151 (owner: 10Amire80) [12:17:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485847|Fix project talk namespace alias of Persian Wikipedia (T213733)]] (duration: 00m 53s) [12:17:44] (03CR) 10jenkins-bot: Fix project talk namespace alias of Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) (owner: 10Ladsgroup) [12:17:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:47] T213733: Add new synonyms for namespaces in Persian (fa) - https://phabricator.wikimedia.org/T213733 [12:18:46] (03Merged) 10jenkins-bot: Add new namespace abbreviation for Swedish (sv) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329) (owner: 10Ammarpad) [12:20:41] (03PS1) 10Alexandros Kosiaris: Fix path to prometheus-statsd-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/486069 [12:20:58] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485704|Add new namespace abbreviation for Swedish (sv) (T214329)]] (duration: 00m 53s) [12:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:02] T214329: Swedish Wiktionary: Namespace abbreviation "KAT:" - https://phabricator.wikimedia.org/T214329 [12:21:13] zeljkof: I'm done! [12:22:47] Amir1: great, I'm still waiting for patches to merge :) [12:23:01] good synchronization [12:23:07] nobody waits [12:24:13] (03PS1) 10Giuseppe Lavagetto: Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] (1.x) - 10https://gerrit.wikimedia.org/r/486071 (https://phabricator.wikimedia.org/T214441) [12:24:29] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix path to prometheus-statsd-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/486069 (owner: 10Alexandros Kosiaris) [12:24:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] (1.x) - 10https://gerrit.wikimedia.org/r/486071 (https://phabricator.wikimedia.org/T214441) (owner: 10Giuseppe Lavagetto) [12:24:55] (03CR) 10jerkins-bot: [V: 04-1] Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] (1.x) - 10https://gerrit.wikimedia.org/r/486071 (https://phabricator.wikimedia.org/T214441) (owner: 10Giuseppe Lavagetto) 
[12:25:28] (03CR) 10jerkins-bot: [V: 04-1] Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] (1.x) - 10https://gerrit.wikimedia.org/r/486071 (https://phabricator.wikimedia.org/T214441) (owner: 10Giuseppe Lavagetto) [12:26:50] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Switch ImageFSM.is_published to use GET instead of HEAD [docker-images/docker-pkg] (1.x) - 10https://gerrit.wikimedia.org/r/486071 (https://phabricator.wikimedia.org/T214441) (owner: 10Giuseppe Lavagetto) [12:27:48] Daimona: both patches have just merged, I'll deploy them both to mwdebug and let you knwo [12:27:50] know [12:27:54] can you test there? [12:28:00] or is running the script the test? [12:28:08] (03CR) 10GTirloni: [C: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [12:29:04] zeljkof No need to test [12:29:06] :) [12:29:10] Running in dry mode is testing [12:29:20] Daimona: ok [12:29:44] To be sure, it should be run for every DB, with --dry-run option, and then the results should go for instance in a paste [12:31:01] (03CR) 10jenkins-bot: Add new namespace abbreviation for Swedish (sv) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329) (owner: 10Ammarpad) [12:31:20] (03PS1) 10Alexandros Kosiaris: Bump prometheus-statsd-exporter version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/486072 [12:32:06] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/AbuseFilter/: SWAT: [[gerrit:486034|Re-fix the throttle script (T209565)]] (duration: 00m 54s) [12:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:09] T209565: Dry run for normalizeThrottleParameters.php - https://phabricator.wikimedia.org/T209565 [12:32:27] Daimona: ok, that was my next question, what exactly do I need to run? [12:32:29] this? 
[12:32:34] mwscript extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run [12:33:09] Yes [12:33:46] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Bump prometheus-statsd-exporter version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/486072 (owner: 10Alexandros Kosiaris) [12:34:46] (03CR) 10Jcrespo: [C: 03+1] Change WMF logo to white [software/tendril] - 10https://gerrit.wikimedia.org/r/479741 (owner: 10Ladsgroup) [12:36:48] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.14/extensions/AbuseFilter: SWAT: [[gerrit:486035|Re-fix the throttle script (T209565)]] (duration: 00m 55s) [12:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:02] Daimona: both patches deployed, running script... [12:37:14] Alright, thanks [12:37:29] BTW public paste is fine [12:37:56] Daimona: good that you said it, since I usually paste the output into the task [12:38:04] Into the task is fine too [12:38:11] Whatever is fine, actually [12:40:32] 10Operations, 10Traffic, 10netops: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel - https://phabricator.wikimedia.org/T214459 (10Aklapper) [12:40:56] Daimona: hm, looks like I don't know how to run it [12:40:57] :( [12:41:05] (03PS2) 10Hashar: ImageFSM._is_published error handling [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486056 (https://phabricator.wikimedia.org/T214441) [12:41:10] Urgh [12:41:17] `zfilipin@mwmaint1002:~$ mwscript extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run` [12:41:37] `Fatal error: Usage: mwscript scriptName.php --wiki=dbname` [12:41:46] Ah it's missing the dbname [12:41:52] Well it should be run for every DB [12:42:01] huh, there are many of them, right? [12:42:07] I think there's "foreachdb" or similar function [12:42:08] hundreds? [12:42:47] Indeed [12:42:48] Amir1: do you know how to run a script for all DBs? [12:43:14] hashar: around to help with running a script? 
[12:43:22] (03CR) 10Hashar: "I tried to mock requests.get and its response via unittest.mock only to fail. Or to be more precise I was not happy with the resulting cod" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486056 (https://phabricator.wikimedia.org/T214441) (owner: 10Hashar) [12:43:24] Maybe it's foreachwiki [12:43:28] Uhmmm [12:44:02] found this https://wikitech.wikimedia.org/wiki/Wikimedia_binaries#foreachwiki [12:44:12] `foreachwiki maintenance/doExample.php [args]` [12:44:19] Yes, that's it [12:44:25] and this https://wikitech.wikimedia.org/wiki/Regenerate_cached_special_pages [12:44:28] And all.dblist should be the correct list [12:44:46] zeljkof: script? [12:45:07] hashar: I need to run this for all dbs `mwscript extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run` [12:45:17] So it should be: [12:45:22] foreachwiki extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run [12:45:28] so, `foreachwiki mwscript extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run`? [12:45:30] what is that maintenance script doing? [12:45:43] I think without "mwscript" [12:46:00] hashar: see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/486034 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/486035 [12:46:11] summary? :) [12:46:13] hashar It checks throttle parameters in AbuseFilter to see if they're malformed. Doesn't change anything being executed in dry run [12:46:51] Because we added some validation to the fields but old filters contain invalid data and we need to fix them [12:48:02] honestly. 
I don't know what that script is going to cause :( [12:48:11] It's already been executed by Tim [12:48:14] Twice [12:48:37] The patch deployed before only fix some minor problems [12:49:03] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Change WMF logo to white [software/tendril] - 10https://gerrit.wikimedia.org/r/479741 (owner: 10Ladsgroup) [12:49:06] The last one is https://phabricator.wikimedia.org/T209565#4833225 [12:49:44] maybe it is fast enough ;) [12:50:23] my worry is/was about a transaction with LOCK IN SHARE MODE then processing of each rows [12:50:36] but if there is only a few rows, I guess it is fine :) [12:50:42] * Daimona is checking [12:51:14] zeljkof: so yeah get the code on mwmaint1002.eqiad.wmnet (scap pull) [12:51:16] screen [12:51:33] and then what Daimona said: foreachwiki extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run [12:51:52] potentially you want to capture the whole session using script [12:51:54] screen [12:52:01] script afntp.log [12:52:07] foreachwiki extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php [12:52:08] ;) [12:52:59] Uhm hashar I sincerely don't know how long it took when Tim executed it... However, the abuse_filter_action holds unique records, so each filter may have 0 to 5-6 rows (1-2 on average), so even on enwiki it has no more than 2000 rows [12:53:24] yeah that would be rather quick to process them [12:53:56] Yup [12:54:47] The other query instead is on abuse_filter_history, which is a bigger table (although not //too// big), however the WHERE should be pretty efficient and reduce the amount of needed rows [12:55:12] thanks hashar, the script is running [12:55:20] Daimona: ^ [12:55:45] ! [12:56:10] (Confirming that it uses where and index) [12:56:12] zeljkof: Sorry was afk for lunch [12:56:17] did you manage to do it? 
[12:56:28] Amir1: no problem, hashar helped, all is good :) [12:56:37] cool [12:56:45] And the third -and last- query again has a where on a primary key so it should be very quick [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1300) [13:02:18] Ahhhh I see some PHP notices on logstash [13:02:20] From the script [13:02:22] I'm done with deployments, but one script is still running, it's at officewiki, so I guess I will be done soon [13:02:50] Cool, thanks [13:03:15] I guess I'll have to fix this PHP notice stuff, but let's see the results anyway [13:05:30] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:05:50] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:06:02] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:06:06] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:06:08] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:06:30] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:08:15] Daimona: script is finished, pasting output [13:09:42] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:10:11] https://phabricator.wikimedia.org/P8026 I see, thanks [13:10:23] Daimona: https://phabricator.wikimedia.org/T209565#4902141 [13:10:36] !log EU SWAT finished [13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:56] zeljkof Thanks again! 
[13:11:17] Daimona: thanks for deploying with #releng ;) [13:12:48] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10mark) I really don't see the point of this. With the scarcity of IPv4 space we only need to get MORE flexible about how we use our IP space, and we will almost certainly not be able... [13:13:56] (03PS1) 10Gehel: Revert "cache::upload: depool kartotherian in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/486074 (https://phabricator.wikimedia.org/T214434) [13:14:20] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [13:15:59] (03PS2) 10Gehel: Revert "cache::upload: depool kartotherian in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/486074 (https://phabricator.wikimedia.org/T214434) [13:16:38] RECOVERY - Disk space on stat1007 is OK: DISK OK [13:16:56] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:17:08] RECOVERY - DPKG on stat1007 is OK: All packages OK [13:17:12] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up [13:17:12] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational [13:17:18] (03CR) 10Mathew.onipe: [C: 03+1] Revert "cache::upload: depool kartotherian in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/486074 (https://phabricator.wikimedia.org/T214434) (owner: 10Gehel) [13:17:36] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient [13:17:44] (03CR) 10Gehel: [C: 03+2] Revert "cache::upload: depool kartotherian in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/486074 (https://phabricator.wikimedia.org/T214434) (owner: 10Gehel) [13:18:40] !log running cumin 'P{O:cache::upload} and A:eqiad' 'run-puppet-agent' [13:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:58] 10Operations, 
10CirrusSearch, 10Discovery-Search, 10serviceops, 10Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10Joe) After some work and further research on NGINX to the end of supporting connection pooling for all outgoing... [13:20:10] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:22:28] 10Operations, 10ops-esams, 10Patch-For-Review: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936 (10mark) [13:22:31] 10Operations, 10ops-esams, 10netops: reconfigure esams switch port for new bastion - https://phabricator.wikimedia.org/T186021 (10mark) 05Stalled→03Declined This was solved by fixing the original bastion, a while ago. [13:24:28] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png [13:24:28] p) timed out before a response was received [13:24:59] really [13:25:04] damn, trouble again [13:25:36] let's just have a look for 3 minutes before depooling again [13:26:13] Ok [13:27:15] !log Migrate some tokudb tables to innodb on dbstore1002 - T213706 [13:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:18] T213706: Convert Aria/Tokudb tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706 [13:27:19] might just be a warm up issue, same check directly on maps1003 now works [13:30:08] gehel: can you confirm that maps1002 is pooled? 
[13:30:32] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy [13:30:44] !log restarting kartotherian on maps1003 [13:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:56] onimisionipe: still inactive [13:31:22] (03PS3) 10Sbisson: Revert "Revert "Enable the Welcome survey on viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485212 [13:32:17] !log restarting kartotherian on maps100[234] [13:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:31] load is going down on maps eqiad, previous alert was probably a lack of warm up [13:34:45] Oh.. Ok [13:35:24] let's keep our eyes open, but it looks ok now [13:35:44] Yea.. Im tailing logs and looking at icinga :) [13:37:37] gehel: only maps1003 and maps1004 are receiving request.. [13:37:53] yep, let's repool 1002 as well [13:38:24] !log repooling maps1002 [13:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:54] 10Operations, 10netops: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (10mark) Have a look at https://github.com/mwiget/bgp_graceful_shutdown for a JunOS op script (SLAX) that does this fully automatically for all peers with a single command. It does unfortunately seem to need a m... 
[13:39:10] 10Operations, 10Puppet, 10Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10fgiunchedi) In case it is useful, on a buster system I'm using this to run `rake` locally: `PUPPET_GEM_VERSION=4.10.12 bundle exec rake test` [13:41:16] jouncebot: next [13:41:17] In 0 hour(s) and 18 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1400) [13:41:31] (03CR) 10Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler1002/14449/" [puppet] - 10https://gerrit.wikimedia.org/r/483695 (owner: 10Muehlenhoff) [13:44:15] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10BBlack) It's the same basic rationale as moving WMCS out of `10.68.0.0/16`. We could obviously leave them there and just manage our ACLs better with more automation, but it pays som... [13:44:34] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Wed 2019-01-23 13:44:33 UTC. [13:44:48] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:44:52] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:46:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [13:47:09] !log Convert a bunch of Aria tables to InnoDB on dbstore1002 [13:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:30] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:48:34] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:55:25] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10mark) >>! In T211254#4902223, @BBlack wrote: > It's the same basic rationale as moving WMCS out of `10.68.0.0/16`. We could obviously leave them there and just manage our ACLs bette... [13:55:32] 10Operations, 10Puppet, 10Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10jbond) I have also been trying to get this working the following is also a useful data point https://bugzilla.redhat.com/show_bug.cgi?id=1440710, however applying that f... [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1400) [14:06:36] (03PS6) 10Filippo Giunchedi: WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [14:06:38] (03PS3) 10Filippo Giunchedi: WIP: hieradata: use v2 for prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/486059 [14:07:36] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [14:13:12] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:13:19] sigh, one of those cases in which I'm running 'bundle exec rake test' locally and passes, unlike when ran by jenkins [14:13:24] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:13:48] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:13:58] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:14:02] PROBLEM - 
configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:14:02] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:14:08] I really need to start looking into limiting users abilities on stat boxes [14:14:14] it is really frustrating [14:14:26] PROBLEM - SSH on stat1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:56] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:15:03] checking console [14:15:17] elukey: being bold I'm sure will help in this case, limiting resources that is [14:15:55] godog: do you think that cgroups could help with some sane defaults? [14:16:46] RECOVERY - SSH on stat1007 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [14:16:57] elukey: for sure, even ulimit depending on what's going on [14:17:18] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [14:17:19] godog: usually users creating a ton of processes [14:17:28] like in this case [14:17:35] * elukey refers to hoo :) [14:18:16] !log Convert tokudb tables into innodb on dbstore1002 - T213706 [14:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:19] T213706: Convert Aria/Tokudb tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706 [14:18:46] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [14:18:49] stat1007 should recover [14:18:50] marostegui: \o/ [14:18:50] elukey: Killed it [14:18:52] let's see [14:18:56] RECOVERY - DPKG on stat1007 is OK: All packages OK [14:19:02] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up [14:19:02] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational [14:19:02] Meh, revscoring created 32 processes 
there [14:19:13] I now went down to 12 [14:19:18] thanks! [14:19:24] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient [14:19:36] RECOVERY - Disk space on stat1007 is OK: DISK OK [14:20:10] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:23:16] 10Operations, 10netops: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (10mark) Yes, we should probably move over to `prefix-limit` to prevent (improving) filters from making `accepted-prefix-limit` ineffective. 1) Is worth checking indeed, I suppose we can do that... [14:32:42] (03PS7) 10Filippo Giunchedi: WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [14:32:44] (03PS4) 10Filippo Giunchedi: WIP: hieradata: use v2 for prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/486059 [14:33:43] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [14:35:26] (03PS2) 10Elukey: profile::analytics::packages::statistics: deploy git-lfs [puppet] - 10https://gerrit.wikimedia.org/r/485852 (https://phabricator.wikimedia.org/T214089) [14:36:26] (03CR) 10Elukey: [C: 03+2] profile::analytics::packages::statistics: deploy git-lfs [puppet] - 10https://gerrit.wikimedia.org/r/485852 (https://phabricator.wikimedia.org/T214089) (owner: 10Elukey) [14:40:51] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10BBlack) >>! In T211254#4902250, @mark wrote: >>>! In T211254#4902223, @BBlack wrote: >> It's the same basic rationale as moving WMCS out of `10.68.0.0/16`. We could obviously leave... 
[14:42:54] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[git-lfs] [14:45:11] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) Not until after the all hands. I will move it up on the list. [14:46:02] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) Thank you! [14:47:32] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Wed 2019-01-23 14:47:31 UTC. [14:55:09] !log Compress InnoDB on a few tables on dbstore1002 to gain some extra space - T213670 [14:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:12] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 [15:17:01] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10mark) >>! In T211254#4902340, @BBlack wrote: >>>! In T211254#4902250, @mark wrote: >>>>! In T211254#4902223, @BBlack wrote: >>> It's the same basic rationale as moving WMCS out of `1... [15:20:14] !log Truncate wmf_checksum table on dbstore1002 - T213670 [15:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:17] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 [15:20:18] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10pmiazga) Done : documented how to perform updates and wrote d... 
[15:20:25] (03PS4) 10Bstorm: toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis) [15:20:37] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10pmiazga) a:03Jhernandez [15:21:20] (03CR) 10Bstorm: [C: 03+2] toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis) [15:23:01] (03PS1) 10Andrew Bogott: wmfkeystonehooks: update keystone auth [puppet] - 10https://gerrit.wikimedia.org/r/486084 (https://phabricator.wikimedia.org/T214106) [15:23:05] (03PS1) 10Andrew Bogott: wmfkeystonehooks: add some more logging [puppet] - 10https://gerrit.wikimedia.org/r/486085 (https://phabricator.wikimedia.org/T214106) [15:34:42] 10Operations, 10Puppet, 10Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10jbond) > https://bugzilla.redhat.com/show_bug.cgi?id=1440710, however applying that fix just causes a different error in fact, the patch mentioned in the commit above i... 
[15:43:36] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=dns2001.wikimedia.org [15:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:05] (03PS2) 10Andrew Bogott: wmfkeystonehooks: update keystone auth [puppet] - 10https://gerrit.wikimedia.org/r/486084 (https://phabricator.wikimedia.org/T214106) [15:44:07] (03PS2) 10Andrew Bogott: wmfkeystonehooks: add some more logging [puppet] - 10https://gerrit.wikimedia.org/r/486085 (https://phabricator.wikimedia.org/T214106) [15:44:09] (03PS1) 10Andrew Bogott: wmfkeystonehooks: use page.text() instead of page.edit() [puppet] - 10https://gerrit.wikimedia.org/r/486089 (https://phabricator.wikimedia.org/T214106) [15:44:11] (03PS1) 10Andrew Bogott: designatemakedomain: update keystone auth [puppet] - 10https://gerrit.wikimedia.org/r/486090 (https://phabricator.wikimedia.org/T214106) [15:45:11] (03PS1) 10CDanis: add dsharpe to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/486091 (https://phabricator.wikimedia.org/T214090) [15:45:19] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: update keystone auth [puppet] - 10https://gerrit.wikimedia.org/r/486084 (https://phabricator.wikimedia.org/T214106) (owner: 10Andrew Bogott) [15:45:59] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: add some more logging [puppet] - 10https://gerrit.wikimedia.org/r/486085 (https://phabricator.wikimedia.org/T214106) (owner: 10Andrew Bogott) [15:46:01] (03PS2) 10CDanis: add dsharpe to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/486091 (https://phabricator.wikimedia.org/T214090) [15:46:25] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: use page.text() instead of page.edit() [puppet] - 10https://gerrit.wikimedia.org/r/486089 (https://phabricator.wikimedia.org/T214106) (owner: 10Andrew Bogott) [15:46:54] (03CR) 10Andrew Bogott: [C: 03+2] designatemakedomain: update keystone auth [puppet] - 10https://gerrit.wikimedia.org/r/486090 
(https://phabricator.wikimedia.org/T214106) (owner: 10Andrew Bogott) [15:47:00] (03PS3) 10CDanis: add dsharpe to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/486091 (https://phabricator.wikimedia.org/T214090) [15:47:10] (03CR) 10CDanis: [V: 03+2 C: 03+2] add dsharpe to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/486091 (https://phabricator.wikimedia.org/T214090) (owner: 10CDanis) [15:47:19] 10Operations, 10Cloud-Services: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) a:03Papaul [15:47:24] (03PS4) 10CDanis: add dsharpe to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/486091 (https://phabricator.wikimedia.org/T214090) [15:57:20] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10pmiazga) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service will... 
[15:57:24] 10Operations, 10netops: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (10ayounsi) a:05faidon→03ayounsi [15:57:49] !log adding dbstore1004:s2 to tendril [15:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:01] !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=dns2001.wikimedia.org [16:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:18] !log restarting ntpd on dns2001 [16:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:34] (03PS1) 10Mforns: Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 (https://phabricator.wikimedia.org/T214136) [16:03:59] (03CR) 10jerkins-bot: [V: 04-1] Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 (https://phabricator.wikimedia.org/T214136) (owner: 10Mforns) [16:06:06] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:12:09] (03PS1) 10Filippo Giunchedi: prometheus: add alerts_default.yml [puppet] - 10https://gerrit.wikimedia.org/r/486095 [16:12:38] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add alerts_default.yml [puppet] - 10https://gerrit.wikimedia.org/r/486095 (owner: 10Filippo Giunchedi) [16:13:29] !log rolling restarts of PDNS recursors/ntpd in codfw/esams/ulsfi/eqsin to pick up openssl security update [16:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:35] (03PS2) 10Mforns: Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 (https://phabricator.wikimedia.org/T214136) [16:14:14] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=dns2002.wikimedia.org [16:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:07] (03PS3) 10Bstorm: toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:20:36] (03CR) 10Bstorm: "Did I do that right?" [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:20:40] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:21:30] (03CR) 10Bstorm: "apparently not :) Finding out what Jenkins needs." [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:22:12] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10BBlack) >>! In T211254#4902524, @mark wrote: > It may be //possible// to get more space in various shady ways, but it's not possible by following RIR rules. 
Well, we can obviously s... [16:25:39] (03PS8) 10Filippo Giunchedi: WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [16:25:42] (03PS5) 10Filippo Giunchedi: WIP: hieradata: use v2 for prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/486059 [16:26:03] (03PS4) 10Bstorm: toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:26:25] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [16:27:02] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:27:18] (03CR) 10Bstorm: "It has a tab that I do not have locally. That's bizarre. I'll try to fix." [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:29:41] (03PS5) 10Bstorm: toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:31:37] !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=dns2002.wikimedia.org [16:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:02] (03CR) 10Bstorm: "Ok, does that look good, Bryan?" 
[puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [16:33:09] (03CR) 10Smalyshev: Add allocator metrics export for Blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [16:34:35] (03PS3) 10Volans: puppet: add check_{en,dis}abled() methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 [16:35:06] (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [16:36:35] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=dns4001.wikimedia.org [16:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:47] (03CR) 1020after4: [C: 04-1] "I haven't deployed the upstream change that uses this yet. We need to coordinate this with the phabricator code deployment." [puppet] - 10https://gerrit.wikimedia.org/r/482400 (https://phabricator.wikimedia.org/T212989) (owner: 10Paladox) [16:37:09] twentyafterfour that config is in the version we use :) [16:37:19] (it's been in phabricator code base for a year) [16:37:30] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:38:01] (03CR) 10Volans: [C: 03+2] "thanks for the review!" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 (owner: 10Volans) [16:38:23] (03CR) 1020after4: [C: 03+1] "Nevermind, I haven't deployed the change that removes the old config but I see that cluster.mailers should be supported." 
[puppet] - 10https://gerrit.wikimedia.org/r/482400 (https://phabricator.wikimedia.org/T212989) (owner: 10Paladox) [16:42:57] (03PS1) 10Volans: Add missing timeout to requests calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/486097 [16:43:14] 10Operations, 10Security-Team: Improve LDAP logging - https://phabricator.wikimedia.org/T214489 (10herron) p:05Triage→03Normal [16:43:32] !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=dns4001.wikimedia.org [16:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:13] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=dns4002.wikimedia.org [16:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:43] (03PS1) 10Filippo Giunchedi: role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 [16:48:17] (03CR) 10jerkins-bot: [V: 04-1] role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 (owner: 10Filippo Giunchedi) [16:48:24] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:49:17] (03PS1) 10Bstorm: toolforge: add npm to the stretch grid nodes [puppet] - 10https://gerrit.wikimedia.org/r/486099 (https://phabricator.wikimedia.org/T212981) [16:49:59] (03PS2) 10Filippo Giunchedi: role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 [16:50:29] !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=dns4002.wikimedia.org [16:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:57] (03CR) 10jerkins-bot: [V: 04-1] role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 (owner: 10Filippo Giunchedi) [16:53:32] (03CR) 10GTirloni: [C: 03+1] toolforge: add npm to the stretch grid nodes [puppet] - 
10https://gerrit.wikimedia.org/r/486099 (https://phabricator.wikimedia.org/T212981) (owner: 10Bstorm) [16:53:52] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=dns5001.wikimedia.org [16:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:12] (03CR) 10Bstorm: [C: 03+2] toolforge: add npm to the stretch grid nodes [puppet] - 10https://gerrit.wikimedia.org/r/486099 (https://phabricator.wikimedia.org/T212981) (owner: 10Bstorm) [16:54:37] (03CR) 10Filippo Giunchedi: [C: 03+2] "This will be a profile eventually and then hiera arguments will be style-approved" [puppet] - 10https://gerrit.wikimedia.org/r/486098 (owner: 10Filippo Giunchedi) [16:54:53] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 (owner: 10Filippo Giunchedi) [16:55:01] (03PS3) 10Filippo Giunchedi: role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 [16:55:31] (03PS4) 10GTirloni: toolforge: Prometheus replacement for sge.py diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [16:55:38] (03CR) 10jerkins-bot: [V: 04-1] role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 (owner: 10Filippo Giunchedi) [16:55:50] (03CR) 10CRusnov: [C: 03+1] "Looks good. I can't help but think timeouts should be configurable but that's cool." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/486097 (owner: 10Volans) [16:56:36] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] role: add storage_retention for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/486098 (owner: 10Filippo Giunchedi) [16:57:10] bstorm_: I'm merging your change too FYI [16:57:16] Thanks :) [16:57:26] (03PS2) 10Dzahn: webperf: Document which class is which regarding xhgui [puppet] - 10https://gerrit.wikimedia.org/r/485990 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [16:57:34] (03PS1) 10Alexandros Kosiaris: statsd-exporter: Specify default mapping file [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/486101 [16:57:46] np! traffic is busy at the puppet intersection ATM [16:57:46] !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=dns5001.wikimedia.org [16:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:02] (03CR) 10Dzahn: [C: 03+2] "comments-only and thanks for adding them" [puppet] - 10https://gerrit.wikimedia.org/r/485990 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [16:58:48] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 59.60 seconds [16:58:48] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 58.28 seconds [16:59:04] RECOVERY - MariaDB Slave Lag: s1 on db2072 is OK: OK slave_sql_lag Replication lag: 55.41 seconds [16:59:12] (03CR) 10Volans: [C: 03+2] decorators: improve tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/485976 (owner: 10Volans) [16:59:16] RECOVERY - MariaDB Slave Lag: s1 on db2085 is OK: OK slave_sql_lag Replication lag: 53.17 seconds [16:59:35] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [16:59:46] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 38.71 seconds [16:59:52] RECOVERY - MariaDB Slave 
Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 36.71 seconds [16:59:54] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 37.31 seconds [16:59:56] (03PS3) 10Elukey: Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 (https://phabricator.wikimedia.org/T214136) (owner: 10Mforns) [17:00:00] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 35.73 seconds [17:00:02] RECOVERY - MariaDB Slave Lag: s1 on db2088 is OK: OK slave_sql_lag Replication lag: 34.45 seconds [17:00:02] RECOVERY - MariaDB Slave Lag: s1 on db2092 is OK: OK slave_sql_lag Replication lag: 34.58 seconds [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1700). [17:00:04] Zoranzoki21 and stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:02:39] Here :) [17:03:27] (03PS10) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [17:04:32] (03PS5) 10GTirloni: toolforge: Prometheus replacement for sge.py diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [17:04:48] (03Merged) 10jenkins-bot: decorators: improve tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/485976 (owner: 10Volans) [17:04:50] (03Merged) 10jenkins-bot: decorators: make retry() DRY-RUN aware [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 (owner: 10Volans) [17:05:58] (03CR) 10GTirloni: [C: 03+2] toolforge: Prometheus replacement for sge.py diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [17:06:00] (03CR) 10jenkins-bot: decorators: improve tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/485976 (owner: 10Volans) [17:06:47] (03CR) 10jenkins-bot: decorators: make retry() DRY-RUN aware [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 (owner: 10Volans) [17:07:27] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Jhernandez) This was discussed in the Web/Infra/SRE/Services Q3-Q4 interlock meeting today. I think there... 
[17:07:59] (03PS1) 10Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) [17:10:13] 10Operations, 10Security-Team: Improve LDAP logging - https://phabricator.wikimedia.org/T214489 (10herron) http://www.openldap.org/doc/admin24/overlays.html#Password%20Policies (specifically sections 12.2 and 12.10) outline some possibilities for audit logging and password policy that could be useful here [17:10:34] 10Operations, 10Proton, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3), 10Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10Jhernandez) >>! In T213366#4902530, @pmiazga wrote:... [17:11:00] 10Operations, 10Proton, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3), 10Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10Jhernandez) a:05Jhernandez→03None [17:11:35] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10ayounsi) [17:13:04] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10ayounsi) Removed: ` maerlant multatuli eeden nescio ` from the list as they are still in Puppet (and some it prod). Updated Netbox to set status "inventory" on the few hosts that were still "... [17:13:54] addshore: Heyas, are you asking lydia which account she uses for the ldap request? 
[17:14:13] (I didnt wanna assume you are asking, and im on clinic duty so my job to not assume this week ;) [17:14:46] (03PS2) 10Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) [17:15:05] (03CR) 10jerkins-bot: [V: 04-1] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [17:16:05] (03CR) 10Dzahn: [C: 03+2] geoip::maxmind: add data types, rm deprecated validate_string [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [17:16:18] (03PS9) 10Dzahn: geoip::maxmind: add data types, rm deprecated validate_string [puppet] - 10https://gerrit.wikimedia.org/r/483222 [17:17:49] (03PS3) 10Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) [17:20:01] ever stared at puppet compiler output and had to learn to ignore the warnings about geoip::maxmind on top? that should be gone now ^ [17:20:10] !log dcausse@deploy1001 Started deploy [search/mjolnir/deploy@a141ad3]: fix retry_on_conflict [17:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:12] (03CR) 10Dzahn: [C: 03+2] "noop on puppetmaster1001, puppetmaster2002" [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [17:21:54] Hi [17:22:01] SWAT is currently? 
[17:24:31] !log dcausse@deploy1001 Finished deploy [search/mjolnir/deploy@a141ad3]: fix retry_on_conflict (duration: 04m 21s) [17:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:09] jouncebot: now [17:25:09] For the next 0 hour(s) and 34 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1700) [17:26:42] Lucas_WMDE: I thinked who working it currently [17:28:27] robh: apparently she has 2 ldap accounts :P [17:28:42] addshore: yeah, i just replied on task requesting she confirm which she uses [17:28:47] ack! [17:28:51] i am 99.99% sure its the one already in data module [17:28:57] but easier to check than add wrong one =] [17:29:06] indeed, and then we can cleanup the other one etc [17:29:15] yep [17:30:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The change is obviously correct, but it needs some care in being released. We'll need to disable puppet on the appservers and api in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/474910 (https://phabricator.wikimedia.org/T209946) (owner: 10Hashar) [17:32:45] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] statsd-exporter: Specify default mapping file [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/486101 (owner: 10Alexandros Kosiaris) [17:34:00] 10Operations, 10SRE-Access-Requests: remove shell access for mkroetzsch on 2019-01-26 - https://phabricator.wikimedia.org/T214498 (10RobH) [17:35:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I have a real usability concern with this. 
As the patch is now, it will just make docker-pkg unusable on any minor quirk of the registry; " (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486056 (https://phabricator.wikimedia.org/T214441) (owner: 10Hashar) [17:36:09] 10Operations, 10SRE-Access-Requests: remove shell access for mkroetzsch on 2019-01-26 - https://phabricator.wikimedia.org/T214498 (10RobH) [17:37:32] (03PS1) 10Bstorm: toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) [17:37:48] (03PS1) 10RobH: disabling user mkroetzsch [puppet] - 10https://gerrit.wikimedia.org/r/486111 (https://phabricator.wikimedia.org/T214498) [17:42:49] (03PS2) 10Bstorm: toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) [17:43:51] (03PS3) 10Bstorm: toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) [17:44:57] 10Operations, 10serviceops, 10Patch-For-Review: "sql" command fails with "sh: 1: mysql: not found" on mwdebug1002 - https://phabricator.wikimedia.org/T211512 (10jijiki) [17:45:21] (03CR) 10Bstorm: [C: 03+2] toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) (owner: 10Bstorm) [17:45:30] (03PS4) 10Bstorm: toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) [17:48:22] (03CR) 10Bstorm: toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) (owner: 10Bstorm) [17:49:19] 10Operations, 10Cloud-Services: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - 
https://phabricator.wikimedia.org/T214448 (10Papaul) [17:52:54] (03PS4) 10Volans: documentation: fine-tune generated documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 [17:53:45] Sorry I super late to this SWAT window. There's nothing after this so I would like to deploy 2 quick patches that were already in the schedule. Any objections? [17:53:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Thanks for this, looking forward to merging it. A few comments inline as my tests revealed some issues." (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/482718 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:54:22] (03CR) 10Muehlenhoff: [C: 03+1] disabling user mkroetzsch [puppet] - 10https://gerrit.wikimedia.org/r/486111 (https://phabricator.wikimedia.org/T214498) (owner: 10RobH) [17:55:12] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485212 (owner: 10Sbisson) [17:56:22] (03Merged) 10jenkins-bot: Revert "Revert "Enable the Welcome survey on viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485212 (owner: 10Sbisson) [17:56:41] (03CR) 10jenkins-bot: Revert "Revert "Enable the Welcome survey on viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485212 (owner: 10Sbisson) [17:57:47] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=dns5002.wikimedia.org [17:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:53] (03PS5) 10Bstorm: toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) [17:58:12] (03PS1) 10Alexandros Kosiaris: Fix mathoid's prometheus-statsd.conf inclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/486114 [17:58:36] (03PS4) 10Elukey: Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 
(https://phabricator.wikimedia.org/T214136) (owner: 10Mforns) [17:59:47] (03CR) 10Bstorm: [C: 03+2] toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) (owner: 10Bstorm) [17:59:55] (03PS6) 10Bstorm: toolforge: ensure stretch grid nodes use backports for nodejs [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) [18:01:26] stephanebisson: Let me know when you're done, I have a patch to deploy too. [18:02:43] (03PS2) 10Volans: Add missing timeout to requests calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/486097 [18:03:45] !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=dns5002.wikimedia.org [18:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:04] (03CR) 10Elukey: [C: 03+2] Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 (https://phabricator.wikimedia.org/T214136) (owner: 10Mforns) [18:04:06] (03PS5) 10Elukey: Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 (https://phabricator.wikimedia.org/T214136) (owner: 10Mforns) [18:04:08] (03CR) 10Elukey: [V: 03+2 C: 03+2] Deactivate EL Druid loading job for PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/486094 (https://phabricator.wikimedia.org/T214136) (owner: 10Mforns) [18:09:24] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/GrowthExperiments/: SWAT: [[gerrit:485209|Help panel: ResourceLoaderHelpPanelModule handle help panel disabled]] (duration: 00m 54s) [18:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:57] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485212|Reapply Enable the Welcome survey on viwiki]] (duration: 00m 53s) [18:10:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:26] anomie: The floor is yours. [18:12:27] (03CR) 10Volans: [C: 03+2] Add missing timeout to requests calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/486097 (owner: 10Volans) [18:12:31] thanks [18:12:34] (03PS1) 10Jforrester: Clean-up: Drop B/C checking for $wgEchoConfig, not used since 2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486117 [18:12:36] (03PS1) 10Jforrester: Clean-up: Drop reading for wgEcho*FooterNotice*, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486118 [18:12:38] (03PS1) 10Jforrester: Clean-up: Drop writing to wgEcho*FooterNotice*, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486119 [18:12:40] (03PS1) 10Jforrester: Clean-up: Stop setting $wgFlowEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486120 [18:12:42] (03PS1) 10Jforrester: Clean-up: Stop setting $wgParsoidWikiPrefix, unused since the Parsoid extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486121 [18:14:55] (03CR) 10Filippo Giunchedi: "re: the mappings, after this review was out we've also worked on generic statsd + k8s guidelines at https://wikitech.wikimedia.org/wiki/Pr" [deployment-charts] - 10https://gerrit.wikimedia.org/r/482718 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [18:17:57] (03Merged) 10jenkins-bot: Add missing timeout to requests calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/486097 (owner: 10Volans) [18:18:58] (03PS4) 10Volans: puppet: add check_{en,dis}abled() methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 [18:19:39] (03CR) 10jenkins-bot: Add missing timeout to requests calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/486097 (owner: 10Volans) [18:24:42] (03CR) 10BryanDavis: toolforge: ensure stretch grid nodes use backports for nodejs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) (owner: 
10Bstorm) [18:26:49] !log rebooting mendelevium/ticket.wikimedia.org to pick up SSBD-enabled qemu [18:26:50] (03CR) 10Volans: [C: 03+2] puppet: add check_{en,dis}abled() methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [18:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:28] (03PS1) 10Bstorm: toolforge: remove nodejs-legacy from the stretch grid [puppet] - 10https://gerrit.wikimedia.org/r/486124 (https://phabricator.wikimedia.org/T212981) [18:31:45] anomie: I have a patch to deploy, too. :-) [18:32:07] * anomie is waiting for Jenkins [18:32:14] Aren't we all? :-) [18:32:21] (03CR) 10Bstorm: [C: 03+2] toolforge: remove nodejs-legacy from the stretch grid [puppet] - 10https://gerrit.wikimedia.org/r/486124 (https://phabricator.wikimedia.org/T212981) (owner: 10Bstorm) [18:33:40] (03Merged) 10jenkins-bot: puppet: add check_{en,dis}abled() methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [18:34:40] (03CR) 10jenkins-bot: puppet: add check_{en,dis}abled() methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [18:34:44] but Leeeroy goes first [18:35:01] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.13/includes/page/WikiPage.php: Add even more temporary logging for T210739 (duration: 00m 54s) [18:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:05] T210739: Target deletion during page move fails - https://phabricator.wikimedia.org/T210739 [18:35:10] (03PS1) 10Dzahn: delete ifttt roles and module [puppet] - 10https://gerrit.wikimedia.org/r/486126 [18:37:16] (03CR) 10Bstorm: "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486110 (https://phabricator.wikimedia.org/T212981) (owner: 10Bstorm) [18:37:24] James_F: Looks like I'm good, go ahead. [18:37:29] anomie: Brill, thanks. 
[18:37:42] (03PS3) 10Jforrester: WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 [18:37:48] (03CR) 10Jforrester: [C: 03+2] WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 (owner: 10Jforrester) [18:37:51] (03PS1) 10Dzahn: publichtml: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486127 [18:37:56] (03PS3) 10Jforrester: [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 [18:38:04] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 (owner: 10Jforrester) [18:38:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, we can always revert if there is indeed a user" [puppet] - 10https://gerrit.wikimedia.org/r/486126 (owner: 10Dzahn) [18:38:58] (03Merged) 10jenkins-bot: WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 (owner: 10Jforrester) [18:39:49] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:39:51] (03Merged) 10jenkins-bot: [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 (owner: 10Jforrester) [18:43:43] (03PS1) 10Mforns: Normalize field names in eventlogging_to_druid_job calls [puppet] - 10https://gerrit.wikimedia.org/r/486128 (https://phabricator.wikimedia.org/T214136) [18:44:23] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.14/extensions/WikimediaEvents/includes/WikimediaEventsHooks.php: Hot-deploy Ief9c9155c to avoid auto-opting new accounts into PHP7 (duration: 00m 53s) [18:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:54] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: Disable showing 'depicts' statements on Commons for now via I66d97031 (duration: 00m 52s) [18:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:14] (03CR) 10jenkins-bot: WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 (owner: 10Jforrester) [18:49:16] (03CR) 10jenkins-bot: [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 (owner: 10Jforrester) [18:58:58] 10Operations, 10Traffic, 10netops: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel - https://phabricator.wikimedia.org/T214459 (10ayounsi) I think the "poliplastic" track a red herring, even though those few IPs are assigned to them in whois, it's used by [[ https://bgp.he.net/ip/62.105.150.251... 
[19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T1900) [19:03:46] !log rebooting puppetdb2001 to pick up SSBD-enabled qemu [19:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:59] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:06:43] (03PS4) 10Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) [19:07:47] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:08:09] PROBLEM - puppet last run on mw2271 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:08:39] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:06] ^known, should recover soon [19:09:07] PROBLEM - puppet last run on elastic2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:15] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:17] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:47] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:10:05] PROBLEM - puppet last run on db2089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:10:19] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:10:28] Cool! 
[19:10:33] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:10:49] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:11] PROBLEM - puppet last run on cp4031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:23] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:47] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:53] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:53] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:12:04] (03PS1) 10Herron: prometheus: drop rsyslog_exporter default metrics not containing ':' [puppet] - 10https://gerrit.wikimedia.org/r/486132 [19:13:01] RECOVERY - puppet last run on maps2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:13:23] RECOVERY - puppet last run on mw2271 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:13:53] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [19:14:19] RECOVERY - puppet last run on elastic2050 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:14:29] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:14:31] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:15:01] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet 
is currently enabled, last run 31 seconds ago with 0 failures [19:15:17] RECOVERY - puppet last run on db2089 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:15:33] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [19:16:03] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:16:23] RECOVERY - puppet last run on cp4031 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:16:37] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:17:07] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:19:41] (03PS1) 10Aaron Schulz: Switch parser cache tier 1 to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486134 (https://phabricator.wikimedia.org/T214275) [19:20:58] (03PS1) 10Volans: spicerack: expose the icinga_master_host property [software/spicerack] - 10https://gerrit.wikimedia.org/r/486135 [19:20:59] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:22:15] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:22:21] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:23:30] (03PS2) 10Aaron Schulz: Switch parser cache top tier backing cache to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486134 (https://phabricator.wikimedia.org/T214275) [19:23:46] (03PS3) 10Volans: sre.host: add Icinga downtime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) [19:24:07] (03CR) 10Volans: "replies inline" (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [19:24:49] (03CR) 10星耀晨曦: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [19:27:00] (03CR) 10星耀晨曦: [C: 03+1] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [19:32:11] (03PS1) 10EBernhardson: elasticsearch: remote search_thread_pool_executors param [puppet] - 10https://gerrit.wikimedia.org/r/486139 [19:32:38] (03PS2) 10EBernhardson: elasticsearch: remove search_thread_pool_executors param [puppet] - 10https://gerrit.wikimedia.org/r/486139 [19:33:05] !log rebooting dubnium to pick up SSBD-enabled qemu [19:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:20] _joe_: I put a patch for https://phabricator.wikimedia.org/T214275 in SWAT [19:34:30] not sure how that was missed before [19:36:52] (03PS8) 10Volans: sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [19:36:58] (03CR) 10Volans: "replies inline" (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [19:46:16] (03CR) 10Smalyshev: "@Gehel how exactly do you test the script on 1010?" [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [19:50:28] (03CR) 10CRusnov: "Looks good, some minor linguistic issues as noted." 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 (owner: 10Volans) [19:52:09] (03PS1) 10GTirloni: toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - 10https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) [19:52:29] (03CR) 10jerkins-bot: [V: 04-1] toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - 10https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) (owner: 10GTirloni) [19:53:43] <_joe_> AaronSchulz: yeah not sure either, but we might have to brace for some impact when we switch it over [19:54:24] (03PS2) 10Bstorm: toolforge: change the default proxy options to real proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/484335 (https://phabricator.wikimedia.org/T213711) [19:56:17] (03CR) 10Bstorm: [C: 03+2] toolforge: change the default proxy options to real proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/484335 (https://phabricator.wikimedia.org/T213711) (owner: 10Bstorm) [19:58:40] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10mmodell) https://wikitech.wikimedia.org/wiki/Phabricator/Meeting_Notes/2019-01-23 [20:00:01] * AaronSchulz pictures Commander Sulu [20:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T2000). 
[20:01:46] (03PS2) 10Dzahn: publichtml: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486127 [20:03:06] (03PS1) 1020after4: group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486143 [20:03:08] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486143 (owner: 1020after4) [20:03:48] (03CR) 10Dzahn: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/486127 (owner: 10Dzahn) [20:04:24] (03Merged) 10jenkins-bot: group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486143 (owner: 1020after4) [20:05:44] (03CR) 10Dzahn: [V: 03+1 C: 03+2] publichtml: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486127 (owner: 10Dzahn) [20:05:55] (03PS3) 10Dzahn: publichtml: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486127 [20:06:55] 10Operations, 10ops-ulsfo, 10Traffic: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (10RobH) p:05Triage→03Normal [20:07:11] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.33.0-wmf.14 refs T206668 [20:08:15] twentyafterfour@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[20:08:16] T206668: 1.33.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T206668 [20:09:57] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.33.0-wmf.14 refs T206668 [20:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:29] (03PS2) 10Mforns: Normalize field names in eventlogging_to_druid_job calls [puppet] - 10https://gerrit.wikimedia.org/r/486128 (https://phabricator.wikimedia.org/T214136) [20:14:02] (03PS1) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 [20:14:25] (03CR) 10Mforns: "I already run the puppet catalog compiler against an-coord1001.eqiad.wmnet and the diff looked good!" [puppet] - 10https://gerrit.wikimedia.org/r/486128 (https://phabricator.wikimedia.org/T214136) (owner: 10Mforns) [20:16:24] so, uhm... error rates are up in wmf.13 after I promoted wmf.14 [20:16:32] I don't quite understand what happened there. [20:16:36] (03CR) 10jenkins-bot: group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486143 (owner: 1020after4) [20:17:24] 10Operations, 10ops-ulsfo, 10Traffic: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (10RobH) a:05RobH→03BBlack So, everything I see on wikitech supports that I can offline this single host at any time to do the work on reseating the dimm. However, it is the week before all ha... [20:17:57] (03PS2) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 [20:18:05] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [20:18:30] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 (10Jgreen) >>! 
In T210445#4900745, @bsisolak wrote: > > Not sure the best way to do this, but someone should set a timer and remov... [20:20:48] I'm going to roll back just to see if error rates go back down because this is somewhat concerning (very much higher rate of errors, even with only group0) [20:21:15] (03PS1) 1020after4: group0 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486151 [20:21:17] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486151 (owner: 1020after4) [20:21:22] (03CR) 10Dzahn: "i sent a quick mail to slaporte" [puppet] - 10https://gerrit.wikimedia.org/r/486126 (owner: 10Dzahn) [20:21:44] !log rolling back because error rate increased significantly after promoting [20:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:59] 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10Bstorm) [20:22:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: remove shell access for mkroetzsch on 2019-01-26 - https://phabricator.wikimedia.org/T214498 (10Peachey88) [20:22:33] (03Merged) 10jenkins-bot: group0 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486151 (owner: 1020after4) [20:23:03] twentyafterfour: But only timeouts? 
[20:23:49] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.33.0-wmf.13 refs T206668 [20:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:52] T206668: 1.33.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T206668 [20:24:52] Well, timeouts and the mysterious things caused by multi-line errors, like "parent, LightProcess exiting" [20:27:10] (03PS18) 10Paladox: gerrit: Add colour to PolyGerrit header and update the theme slightly [puppet] - 10https://gerrit.wikimedia.org/r/482379 [20:28:02] (03PS4) 10EBernhardson: Add wbsearchentities profiles for de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T214515) [20:28:05] (03PS1) 10EBernhardson: Turn on wbsearchentities ab test in de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486154 (https://phabricator.wikimedia.org/T214515) [20:29:45] (03CR) 10jenkins-bot: group0 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486151 (owner: 1020after4) [20:29:46] 10Operations, 10Traffic, 10netops: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel - https://phabricator.wikimedia.org/T214459 (10Elitre) Ciao @ayounsi! It is possible that CommRel can help you finding someone to diagnose this tomorrow if @Iluvatar doesn't happen to find others to test in the mea... 
[20:30:35] (03PS1) 1020after4: group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486155 [20:30:44] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486155 (owner: 1020after4) [20:31:10] James_F: not only timeouts though that looks like it might be most of them [20:31:47] (03Merged) 10jenkins-bot: group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486155 (owner: 1020after4) [20:32:02] twentyafterfour: Other than the Translate thing I don't see any non-timeouts (counting both HHVM timeouts and MariaDB ones). [20:32:24] Oh, wait, one `error: request has exceeded memory limit` [20:33:14] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.33.0-wmf.14 refs T206668 [20:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:17] T206668: 1.33.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T206668 [20:34:19] yeah still worrisome that there was such a spike but I'm gonna let things settle a little [20:37:33] well I re-promoted and the spike didn't happen this time so I guess it was coincidence [20:37:51] Now group1... [20:38:22] (03PS1) 1020after4: group1 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486156 [20:38:24] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486156 (owner: 1020after4) [20:39:43] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486156 (owner: 1020after4) [20:40:35] (03CR) 10Hashar: "When using pull, not knowing whether an image is published is a hard error in my opinion. 
The risk is to eventually to upload a different " (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486056 (https://phabricator.wikimedia.org/T214441) (owner: 10Hashar) [20:42:31] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.14 refs T206668 [20:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:51] T206668: 1.33.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T206668 [20:42:58] (03CR) 10jenkins-bot: group0 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486155 (owner: 1020after4) [20:43:00] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486156 (owner: 1020after4) [20:43:25] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.14 refs T206668 (duration: 00m 52s) [20:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:39] T214517 is getting quite a lot of hits. :-( [20:43:40] T214517: PHP Fatal Error: Argument 2 passed to TranslateHooks::onPageContentLanguage() must be an instance of Language, StubUserLang given - https://phabricator.wikimedia.org/T214517 [20:43:50] (03CR) 10Hashar: "T214441 is solved, docker-registry.wikimedia.org would sometime yield a 405 http error when issuing an HEAD request. We are not sending a " [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [20:44:02] yep [20:44:16] I think this will be a train blocker for sure [20:44:18] Bump to UBN as a train blocker for further roll-out? [20:44:26] James_F: yeah [20:44:30] Done. [20:44:35] thanks! [20:44:53] the question is should we stay at group 1 or go back to group 0 [20:46:59] error rate ~100 per minute [20:47:18] Yeah, I'm looking at the code now. 
[20:47:21] (03PS1) 1020after4: group1 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486158 [20:47:23] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486158 (owner: 1020after4) [20:47:38] at 100 errors per minute I'm rolling back for the time being [20:48:39] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486158 (owner: 1020after4) [20:50:01] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.13 refs T206668 [20:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:04] T206668: 1.33.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T206668 [20:50:54] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.13 refs T206668 (duration: 00m 52s) [20:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:39] ok I need to step away for a few minutes but I will check back in on the UBN task in case something changes in time to get the train going again today [20:52:16] (03CR) 10Volans: [C: 04-1] "LGTM, just one typo inline. Nice the check experimental!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [20:56:03] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.13 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486158 (owner: 1020after4) [20:59:16] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Milimetric) This wikitext-in-JSON thing seems really complicated. I read through both comments above and walked away with a mu... [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? 
A:They had arguments. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190123T2100). [21:02:08] 10Operations, 10ops-ulsfo, 10Traffic: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (10BBlack) a:05BBlack→03RobH https://wikitech.wikimedia.org/wiki/Cache_servers#Depool_and_downtime is correct, it just needs to be depooled (it will auto-depool on shutdown, but a manual depool... [21:05:35] 10Operations, 10ops-ulsfo, 10Traffic: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (10BBlack) See also T178011 for last time. Why didn't the icinga EDAC check catch this? [21:06:31] (03PS5) 10Volans: documentation: fine-tune generated documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 [21:06:45] (03CR) 10Volans: documentation: fine-tune generated documentation (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 (owner: 10Volans) [21:07:56] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177 (10Dzahn) T214516 was a case of a memory error but Icinga did not detect it? T214516#4903917 [21:10:24] (03PS3) 10Mforns: Normalize field names in eventlogging_to_druid_job calls [puppet] - 10https://gerrit.wikimedia.org/r/486128 (https://phabricator.wikimedia.org/T214136) [21:11:29] (03CR) 10Volans: wdqs: convert prom exporter script tp py3 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe) [21:15:02] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002 - https://phabricator.wikimedia.org/T195623 (10Dzahn) 05Resolved→03Open a:05mark→03None @Robh We noticed this server has 32GB of RAM but phab1001 has 64GB RAM and... 
[21:18:32] 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (10Jgreen) [21:18:50] 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (10Jgreen) [21:20:17] 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (10Jgreen) [21:21:00] 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (10Dzahn) This is the kind of ticket where we need that Phabricator calendar feature.. add a date to a task and have it notify or raise priority once the data gets cl... [21:30:45] !log scandium - removing npm and nodejs*, testing puppetization to reinstall them [21:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:27] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 (10Jgreen) 05Open→03Resolved Opened T214525 to schedule removal of the deprecated 1024-bit key on 2019-04-22. [21:41:21] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10ssastry) >>! In T201366#4901110, @Dzahn wrote: >>>! In T201366#4901109, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), hre... [21:44:17] subbu: while i had nodejs10/npm installed on scandium, i noticed one issue in puppetization / it worked because i did one thing manual. trying to fix that last bit now.. that's why i temp removed a package and trying to get that done now [21:45:08] also saw your comments, i will respond to the first 2. about the DNS change.. 
it doesn't have to be another ticket, just a TODO box for the one we have. i will be able to do it [21:45:12] (03PS19) 10Thcipriani: gerrit: update PolyGerrit theme [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [21:45:54] mutante, ok .. let us wait for the last 2 steps till next week ... we want to continue using ruthenium this week for rt testing and deploys. we can do scandium testing next week during the no-deploy period. [21:47:25] https://gerrit-new.wmflabs.org/r/