[02:24:50] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.12) (duration: 09m 42s) [02:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:46] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.13) (duration: 13m 51s) [02:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:08:23] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Jul 23 03:08:23 UTC 2018 (duration 10m 37s) [03:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:14] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 855.47 seconds [04:08:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 289.70 seconds [04:42:37] !log Deploy schema change on db1061 (s6 primary master) T144010 T51190 T199368 [04:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:43] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [04:42:44] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [04:42:44] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [06:06:57] (03PS2) 10Jcrespo: move_replica.py: A script to do replica topology changes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447225 [06:07:35] (03CR) 10Jcrespo: [C: 032] move_replica.py: A script to do replica topology changes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447225 (owner: 10Jcrespo) [06:15:41] (03PS2) 10Jcrespo: mariadb: Added functionality to perform arbitrary topology changes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447233 [06:16:08] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Added functionality to perform arbitrary topology changes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447233 (owner: 10Jcrespo) [06:23:48] (03PS3) 
10Jcrespo: mariadb: Added functionality to perform arbitrary topology changes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447233 [06:24:50] (03CR) 10Jcrespo: [C: 032] mariadb: Added functionality to perform arbitrary topology changes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447233 (owner: 10Jcrespo) [06:25:17] (03Merged) 10jenkins-bot: mariadb: Added functionality to perform arbitrary topology changes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447233 (owner: 10Jcrespo) [06:29:15] (03PS2) 10Jcrespo: replication: Add additional replica migration functionalities [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447341 [06:30:18] (03CR) 10Jcrespo: [C: 032] replication: Add additional replica migration functionalities [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447341 (owner: 10Jcrespo) [06:37:27] !log stop db1102 to clone to db1118 [06:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:05] 10Operations, 10Analytics, 10hardware-requests: eqiad: (1) new stat box to offload users from stat1005 - https://phabricator.wikimedia.org/T196345 (10elukey) @RobH If there are no more blockers I'd proceed with the quote request (no rush, just wanted to avoid this task to stall). 
[07:19:15] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (10elukey) [07:19:17] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (10elukey) [07:23:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) [07:24:10] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:24:19] (03PS1) 10Jcrespo: mariadb: Migrate db1118 from mysql 8 to mariadb 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/447365 (https://phabricator.wikimedia.org/T199224) [07:24:52] (03PS2) 10Jcrespo: mariadb: Migrate db1118 from mysql 8 to mariadb 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/447365 (https://phabricator.wikimedia.org/T199224) [07:25:33] (03PS2) 10Marostegui: db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) [07:25:47] (03CR) 10Jcrespo: [C: 032] mariadb: Migrate db1118 from mysql 8 to mariadb 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/447365 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [07:26:38] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:27:13] come on vim... 
[07:29:43] (03PS3) 10Marostegui: db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) [07:31:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:32:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:32:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1122 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447363 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:33:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122, db1103:3312 (duration: 00m 55s) [07:33:51] !log Deploy schema change on db1122 T144010 T51190 T199368 [07:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:57] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [07:33:58] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [07:33:58] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [07:46:32] 10Operations, 10Analytics, 10EventBus, 10Services: Set a proper max open files limit for Kafka clusters - https://phabricator.wikimedia.org/T200177 (10elukey) p:05Triage>03High [07:53:36] (03PS1) 10Jcrespo: WMFReplication: Fix bug where error was wrongly reported [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447379 [07:54:14] !log Stop replication in sync on db1122 and db2035 [07:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:33] !log start of ladsgroup@mwmaint1001:~$ 
foreachwikiindblist s6 populateChangeTagDef.php (T193873) [07:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:37] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873 [07:54:43] (03CR) 10Jcrespo: [C: 032] WMFReplication: Fix bug where error was wrongly reported [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447379 (owner: 10Jcrespo) [08:04:03] !log Apply schema change  T197891 to labswiki and labstestwiki T200140 [08:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:08] T200140: Ensure labswiki and labtestwiki are up to date with MW schema changes - https://phabricator.wikimedia.org/T200140 [08:04:08] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [08:05:22] !log Apply schema change  T190148 to labswiki and labstestwiki T200140 [08:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:27] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [08:07:54] !log Apply schema change  T191519 to labswiki and labstestwiki T200140 [08:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:59] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [08:11:52] (03PS1) 10Jcrespo: WMFReplication: Shortcircuit slaves() to avoid errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447380 [08:11:54] !log Apply schema change  T191316 to labswiki and labstestwiki T200140 [08:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:59] T200140: Ensure labswiki and labtestwiki are up to date with MW schema changes - https://phabricator.wikimedia.org/T200140 [08:11:59] T191316: Schema change to make archive.ar_rev_id NOT NULL - 
https://phabricator.wikimedia.org/T191316 [08:12:41] (03PS2) 10Jcrespo: WMFReplication: Shortcircuit slaves() to avoid errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447380 [08:13:08] (03CR) 10Jcrespo: [C: 032] WMFReplication: Shortcircuit slaves() to avoid errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447380 (owner: 10Jcrespo) [08:15:56] !log Apply schema change T160415 to labswiki and labstestwiki T200140 [08:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:00] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [08:21:16] !log Apply schema change T192926 to labswiki and labstestwiki T200140 [08:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:20] T200140: Ensure labswiki and labtestwiki are up to date with MW schema changes - https://phabricator.wikimedia.org/T200140 [08:21:21] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [08:21:44] 10Operations, 10Traffic: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (10ema) p:05Triage>03Normal [08:23:16] !log Stop replication in sync on db1122 and db1103:3312 [08:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:17] (03PS3) 10Ema: Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) [08:42:20] !log enable deprecation page for status.w.o - T199816 [08:42:23] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 51.37, 38.88, 24.04 [08:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:24] T199816: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 [08:44:40] (03CR) 10Hashar: "recheck" [debs/trafficserver] 
- 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [08:44:57] (03CR) 10jerkins-bot: [V: 04-1] Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [08:47:53] !log Stop replication in sync on db1103:3312 db2035 [08:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:04] PROBLEM - Disk space on neodymium is CRITICAL: DISK CRITICAL - free space: / 1151 MB (3% inode=95%) [08:53:35] * volans checking [08:53:41] I will clean up stuff I have [08:53:56] marostegui: is it your stuff? [08:54:48] volans: don't know, I just dumped a 1.8G file but I will clean up more stuff [08:55:02] but yeah, my home is 22G XD [08:55:03] ack, I'll have a look around [08:55:08] lol [08:55:23] for a 37G partition is not bad [08:56:43] RECOVERY - Disk space on neodymium is OK: DISK OK [08:58:54] ok my home is now 8.G [08:58:56] 8.6G [08:59:10] I can probably clean up more stuff, but I will do that a bit later [09:00:17] !log start of ladsgroup@mwmaint1001:~$ foreachwikiindblist s8 populateChangeTagDef.php (T193873) [09:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:21] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873 [09:01:13] Aaand 5.5G now [09:08:33] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 8.89, 13.33, 23.36 [09:21:00] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282 (10Prtksxna) >>! In T192129#4443301, @bd808 wrote: >>>! In T192129#4440027, @Prtksxna wrote: >> Would it make sense to add rules t... 
[09:23:34] 10Operations, 10monitoring, 10Patch-For-Review: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10fgiunchedi) I've setup a very bare deprecation page for status.wikimedia.org, we can sunset the DNS name in some weeks time. @Quiddity thanks for the list! I'm updating met... [09:24:44] (03CR) 10Ema: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [09:24:57] (03CR) 10jerkins-bot: [V: 04-1] Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [09:30:36] (03PS3) 10Volans: wmf-decommission-host: initial version [puppet] - 10https://gerrit.wikimedia.org/r/446887 (https://phabricator.wikimedia.org/T198649) [09:33:45] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1122 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447387 [09:35:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1122 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447387 (owner: 10Marostegui) [09:36:36] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1122 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447387 (owner: 10Marostegui) [09:37:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1122, db1103:3312 (duration: 00m 54s) [09:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:06] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1122 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447387 (owner: 10Marostegui) [09:38:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1074, db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447388 (https://phabricator.wikimedia.org/T200061) [09:40:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1074, db1105:3312 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/447388 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [09:41:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1074, db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447388 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [09:42:27] !log Stop replication in sync on db1105:3312 db1074 [09:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1074, db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447388 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [09:42:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1074, db1105:3312 (duration: 00m 53s) [09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:37] (03PS1) 10Elukey: profile::kafka::broker: raise default max open files to 128k [puppet] - 10https://gerrit.wikimedia.org/r/447389 (https://phabricator.wikimedia.org/T200177) [09:49:53] (03CR) 10Ema: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [09:53:00] (03CR) 10jerkins-bot: [V: 04-1] Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [10:03:45] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447392 (https://phabricator.wikimedia.org/T128546) [10:06:22] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447392 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:07:22] (03CR) 10Mobrovac: [C: 031] profile::kafka::broker: raise default max open files to 128k [puppet] - 10https://gerrit.wikimedia.org/r/447389 (https://phabricator.wikimedia.org/T200177) (owner: 10Elukey) [10:07:23] 10Operations, 
10Traffic, 10Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (10ema) CI tests [[https://integration.wikimedia.org/ci/job/debian-glue/1232/console | were failing ]] due to CI slaves being jessie and thus running with an old pristi... [10:07:35] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447392 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:08:12] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447392 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:09:33] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:447392|Bumping portals to master (T128546)]] (duration: 00m 55s) [10:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:37] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:10:28] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:447392|Bumping portals to master (T128546)]] (duration: 00m 54s) [10:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:00] (03CR) 10Hashar: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [10:28:08] (03CR) 10jerkins-bot: [V: 04-1] Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [10:29:01] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10fgiunchedi) [10:31:59] woooooowwwww!!! 
\o/ "Initial WMF packaging [debs/trafficserver]" [10:33:46] somebody drank too much coffee this morning :P [10:34:35] nono simply super happy to see this project moving [10:34:57] (I also cheer for ema independently of what he does just because he is great :D) [10:39:02] * ema thanks elukey for the unconditional support [10:49:16] (03PS4) 10Ema: Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) [11:00:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T1100). [11:00:06] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T1100). [11:00:06] razesoldier: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:24] I can SWAT today [11:00:31] I here [11:01:06] razesoldier: I'll ping you once the first commit is at mwdebug1002 for testing, do you know how to test there, or do you need help? 
[11:01:26] I know how to test [11:01:48] razesoldier: great [11:04:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [11:06:03] (03CR) 10Mobrovac: JobQueue: Signal JobQueueEventBus is never read-only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594) (owner: 10Mobrovac) [11:06:05] (03Merged) 10jenkins-bot: Change zhwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [11:08:26] (03CR) 10jenkins-bot: Change zhwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [11:10:04] razesoldier: 447030 is at mwdebug1002, please test and let me know if I can deploy it [11:10:32] (03PS4) 10Zfilipin: Exempt Template and Module namespace on zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445975 (https://phabricator.wikimedia.org/T187783) (owner: 10星耀晨曦) [11:13:34] zeljkof: LGTM [11:13:43] razesoldier: ok, deploying [11:14:54] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:447030|Change zhwikiquote logo (T199863)]] (duration: 00m 55s) [11:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:59] T199863: Change zh.wikiquote logo - https://phabricator.wikimedia.org/T199863 [11:15:12] razesoldier: it's deployed, I'll purge the cache [11:17:31] (03CR) 10Zfilipin: "Purging script output is at T199863#4444948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [11:17:40] razesoldier: purged https://phabricator.wikimedia.org/T199863#4444948 [11:17:47] moving on to the next patch [11:18:32] zeljkof: There have some issue [11:18:41] 
https://zh.wikiquote.org/static/images/project-logos/zhwikiquote-hans.png is 404 [11:18:45] razesoldier: what's the problem? [11:18:51] ok, let me check [11:19:07] razesoldier: works for me, can you check again? [11:20:21] I check again, still 404 [11:20:38] May be a cache issue? [11:20:56] ok, let me see again [11:21:50] razesoldier: ok, so that's one of the files that have been added [11:22:13] maybe it is a cache issue, I could try purging it [11:22:25] razesoldier: can you see other files, just this one is 404? [11:24:06] I've just checked, I can see all files [11:24:30] I'll purge the new ones, just in case, maybe it helps [11:25:17] Hum, just this file I can't see [11:25:44] Other files looks good [11:26:03] razesoldier: I've purged them all, try again [11:26:59] Ok, I see [11:27:19] razesoldier: all good now? [11:27:24] (03PS4) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [11:27:34] zeljkof: yes [11:27:46] razesoldier: ok, going on to the next patch then [11:28:07] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [11:28:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445975 (https://phabricator.wikimedia.org/T187783) (owner: 10星耀晨曦) [11:28:57] razesoldier: do I need to run a script after 445975 is deployed? 
[11:29:25] No [11:29:49] (03Merged) 10jenkins-bot: Exempt Template and Module namespace on zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445975 (https://phabricator.wikimedia.org/T187783) (owner: 10星耀晨曦) [11:30:07] (03CR) 10jenkins-bot: Exempt Template and Module namespace on zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445975 (https://phabricator.wikimedia.org/T187783) (owner: 10星耀晨曦) [11:30:59] razesoldier: 445975 is at mwdebug1002 [11:31:33] zeljkof: Checked, looks good [11:31:41] razesoldier: ok, deploying [11:32:46] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:445975|Exempt Template and Module namespace on zhwiktionary (T187783)]] (duration: 00m 55s) [11:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:50] T187783: Changing the first letter of title of Template and Module namespace into case-insensitive for zh.wiktionary - https://phabricator.wikimedia.org/T187783 [11:33:10] razesoldier: it's deployed, please check and thanks for deploying with #releng! ;) [11:33:45] Everything is fine! [11:33:54] razesoldier: great! [11:34:00] !log EU SWAT finished [11:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:25] zeljkof: Thanks for your SWAT [11:34:40] razesoldier: no problem at all! 
:D [11:57:39] (03PS1) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [11:58:21] (03CR) 10jerkins-bot: [V: 04-1] store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: 10ArielGlenn) [12:50:12] (03PS1) 10Jcrespo: WMFReplication: Fix bug where the wrong address was chosen on move() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447406 [12:50:35] (03CR) 10jerkins-bot: [V: 04-1] WMFReplication: Fix bug where the wrong address was chosen on move() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447406 (owner: 10Jcrespo) [12:54:39] (03PS5) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [12:55:38] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [12:58:34] PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:07:12] jouncebot: next [13:07:12] In 3 hour(s) and 52 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T1700) [13:09:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1074, db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447407 [13:09:09] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1074, db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447407 [13:10:58] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1074, db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447407 (owner: 10Marostegui) [13:11:10] (03PS1) 10Volans: Updated PyPI URLs to the new website [software/cumin] - 10https://gerrit.wikimedia.org/r/447408 [13:12:22] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1074, db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447407 (owner: 10Marostegui) [13:12:41] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1074, db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447407 (owner: 10Marostegui) [13:13:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1074, db1105:3312 (duration: 00m 55s) [13:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:49] 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) Just talked to Chris - as this disk is on predictive failure but not failed yet, we are going to wait for the new disks to arrive in order to avoid trying again with used ones. 
[13:14:08] (03CR) 10jerkins-bot: [V: 04-1] Updated PyPI URLs to the new website [software/cumin] - 10https://gerrit.wikimedia.org/r/447408 (owner: 10Volans) [13:18:41] (03CR) 10Ema: [C: 032] Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [13:23:25] !log Deploy schema change on labswiki and labstestwiki T144010 T51190 T199368 [13:23:29] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Update log config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497 (10Ladsgroup) 05Open>03Resolved a:03awight ORES has been moved out of scb nodes and ores nodes seems pretty clean to me: `... [13:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:30] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [13:23:30] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [13:23:31] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [13:24:14] RECOVERY - puppet last run on mw1348 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:36:43] !log Deploy schema change on s4 codfw master (db2051), this will generate lag on s4 codfw T144010 T51190 T199368 [13:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:49] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [13:36:50] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [13:36:50] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [13:44:33] !log !log trafficserver 7.1.3+ds-4wm1 uploaded to stretch-wikimedia T200178 [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:36] T200178: Traffic Server packaging and initial puppetization - 
https://phabricator.wikimedia.org/T200178 [13:45:11] 10Operations, 10Traffic, 10Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (10ema) [13:47:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447419 (https://phabricator.wikimedia.org/T200061) [13:49:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447419 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [13:50:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447419 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [13:50:39] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447419 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [13:51:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081, db1097:3314 (duration: 00m 54s) [13:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:15] !log Stop replication in sync on db1081 and db1097:3317 [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:43] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 52446 MB (10% inode=99%) [13:57:31] !log mobrovac@deploy1001 Started deploy [eventstreams/deploy@5ed03e0]: Update node-rdkafka to v2.3.4 - T199813 [13:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:35] T199813: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 [14:00:00] !log mobrovac@deploy1001 Finished deploy [eventstreams/deploy@5ed03e0]: Update node-rdkafka to v2.3.4 - T199813 (duration: 02m 29s) [14:00:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:38] (03PS1) 10Ema: cache: temporarily return 404 for stream.w.o/socket.io [puppet] - 10https://gerrit.wikimedia.org/r/447423 (https://phabricator.wikimedia.org/T199813) [14:08:03] RECOVERY - Disk space on elastic1027 is OK: DISK OK [14:17:51] !log Stop replication in sync on db1081 and db2051 [14:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:07] (03CR) 10Ema: [C: 032] cache: temporarily return 404 for stream.w.o/socket.io [puppet] - 10https://gerrit.wikimedia.org/r/447423 (https://phabricator.wikimedia.org/T199813) (owner: 10Ema) [14:42:37] (03CR) 10Filippo Giunchedi: [C: 031] profile::kafka::broker: raise default max open files to 128k [puppet] - 10https://gerrit.wikimedia.org/r/447389 (https://phabricator.wikimedia.org/T200177) (owner: 10Elukey) [14:43:33] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) They provided me with a list of 7 names, I asked them to specify which one (or two) are going onsite. It takes 24 hours for them to get back to me from any reply. [14:43:47] (03PS1) 10Ema: cache: fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/447438 [14:44:53] (03CR) 10Ema: [C: 032] cache: fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/447438 (owner: 10Ema) [14:45:17] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081, db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447439 [14:45:24] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [14:45:29] that's me ^ [14:46:11] (03PS2) 10Jcrespo: WMFReplication: Fix bug where the wrong address was chosen on move() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447406 [14:46:23] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [14:46:43] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [14:47:29] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1081, db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447439 (owner: 10Marostegui) [14:48:11] (03PS3) 10Jcrespo: WMFReplication: Fix bug where the wrong address was chosen on move() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447406 [14:48:40] (03CR) 10jerkins-bot: [V: 04-1] WMFReplication: Fix bug where the wrong address was chosen on move() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447406 (owner: 10Jcrespo) [14:48:47] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081, db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447439 (owner: 10Marostegui) [14:49:01] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081, db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447439 (owner: 10Marostegui) [14:49:03] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (10Gehel) [14:49:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081, db1097:3314 (duration: 00m 53s) [14:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] (03PS4) 10Jcrespo: WMFReplication: Fix bug where 
the wrong address was chosen on move() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447406 [14:50:33] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:51:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Andrew) [14:51:23] (03CR) 10Jcrespo: [C: 032] WMFReplication: Fix bug where the wrong address was chosen on move() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447406 (owner: 10Jcrespo) [14:51:24] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:51:44] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:52:04] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:52:23] ACKNOWLEDGEMENT - configured eth on labtestnet2002 is CRITICAL: eth1 reporting no carrier. andrew bogott https://phabricator.wikimedia.org/T199821 [14:53:22] !log delete empty/not-used/wrongly-created topics in Kafka main-eqiad - T199510 [14:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:25] T199510: Clean up leftover topics - https://phabricator.wikimedia.org/T199510 [14:54:55] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (10Gehel) Looking at http://localhost:9999/bigdata/#namespaces it seems that categories namespaces are deleted. But maybe the disk space is not r... 
[14:55:56] 10Operations: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (10herron) p:05Triage>03Normal a:03herron Hey @Capt_Swing, I can help you out with this. Prepping the patch now. [14:57:25] !log mobrovac@deploy1001 Started restart [eventstreams/deploy@5ed03e0]: Fresh start of the service after alleviating 404s [14:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:00:42] !log mobrovac@deploy1001 Started deploy [changeprop/deploy@5cfdf26]: Increase concurrency back to normal levels (from 20 to 50) [15:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:44] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54197 MB (3% inode=99%) [15:02:20] !log mobrovac@deploy1001 Finished deploy [changeprop/deploy@5cfdf26]: Increase concurrency back to normal levels (from 20 to 50) (duration: 01m 38s) [15:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:30] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (10Gehel) It looks like there is [[ https://wiki.blazegraph.com/wiki/index.php/RWStore#minReleaseAge | some configuration around the release of h... [15:03:01] gehel: maps1001 running out of disk space? ^ [15:03:17] mobrovac: yep, just saw that. Thanks! [15:03:27] kk np [15:03:32] damn, just back from vacation and both wdqs and maps are blowing up :( [15:04:14] gehel: that's their only way of telling you you've been missed! 
:) [15:04:21] :/ [15:05:10] hashar zeljkof Our AdvancedSearch browser tests where we check for specific colors have been failing for 1-2 weeks (e.g. https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AdvancedSearch/+/443071/). To me it looks like a (fixed) bug in the rgb2hex dependency of webdriver io (https://github.com/webdriverio/webdriverio/issues/2773) Could you help us with that, e.g. by updating the rgb2hex in node_modules for browser tests? [15:07:29] chiborg: is there a related phab task? [15:07:50] chiborg: so what exactly should I do, update package.json in core? [15:10:38] ACKNOWLEDGEMENT - Device not healthy -SMART- on labstore1003 is CRITICAL: cluster=labsnfs device=megaraid,11 instance=labstore1003:9100 job=node site=eqiad andrew bogott https://phabricator.wikimedia.org/T199780 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops [15:11:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003: more SMART failures - https://phabricator.wikimedia.org/T199780 (10Andrew) [15:12:56] (03PS1) 10Herron: admin: add ssh key for jmorgan's new laptop [puppet] - 10https://gerrit.wikimedia.org/r/447446 (https://phabricator.wikimedia.org/T200103) [15:13:51] (03CR) 10Gergő Tisza: Configure group management for interface-admin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440676 (owner: 10Gergő Tisza) [15:16:31] 10Operations, 10Traffic: Discard of cold labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (10ema) [15:16:44] 10Operations, 10Traffic: Discard of cold labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (10ema) p:05Triage>03Normal [15:21:20] 10Operations, 10Traffic: Discard of cold labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (10ema) [15:25:33] !log decrease gc_grace_seconds to 4 days on cassandra / maps [15:25:36] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:51] zeljkof There is no phab task yet, I can create one. Would Continuous-Integration-Infrastructure be the right board? I don't know enough about how the infrastructure works to answer what you should do with confidence. Does package.json have webdriver io as a dependency? Then we're out of luck since they did not do a release yet and rgb2hex is listed as its dependency. If it's a top-level dependency, then it works. [15:31:23] RECOVERY - Disk space on maps1001 is OK: DISK OK [15:34:00] chiborg: please create a task and CC me, I'll take care of tags [15:34:40] chiborg: can you reproduce the problem locally, using mediawiki vagrant for example? [15:34:57] That would speed up the process significantly [15:35:51] zeljkof I'll try tomorrow and hold off creating the ticket until then. [15:38:29] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install graphite2003 - https://phabricator.wikimedia.org/T196483 (10fgiunchedi) 05Open>03Resolved Resolving, I'll file decom tasks for graphite200[12] [15:40:25] chiborg: I have some time now, I'll create a task and test locally, I'll cc you, assuming the same irc and phab usernames [15:40:33] (03PS4) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [15:40:34] (03CR) 10EBernhardson: "the patches that introduce this daemon are now merged and this can be deployed" [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [15:41:22] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10fgiunchedi) @Cmjohnson what's the status for graphite1004 ? 
[15:42:50] 10Operations, 10ops-codfw, 10monitoring: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10fgiunchedi) [15:43:16] 10Operations, 10ops-codfw, 10monitoring: Decom graphite2002 - https://phabricator.wikimedia.org/T200210 (10fgiunchedi) [15:43:41] zeljkof I'm gabriel-wmde on phab. Please also cc Tonina_WMDE (Tonina Zhelyazkova (WMDE) on phab) and Tim_WMDE (Tim eurlitz WMDE on phab), I just realized I won't be around in the next few days and they'll be affected. [15:43:54] zeljkof Thank you! [15:44:21] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install graphite2003 - https://phabricator.wikimedia.org/T196483 (10fgiunchedi) Decom tasks: {T200210} {T200209} [15:46:21] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) Got it down to two names and submitting a smarthands ticket for escort on Wednesday, July 25th. [15:50:34] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10Cmjohnson) @fgiunchedi I am currently working through 14 racking tasks...the CP's are the highest priority. I am not sure where the graphite falls in the priority list but I am working thr... [15:59:13] PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:59:23] RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:01:17] wut? 
[16:02:13] 10 seconds of misery [16:02:21] no, rebooted [16:05:49] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (10ayounsi) [16:10:01] 10Operations, 10Performance-Team, 10vm-requests: Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (10Krinkle) [16:14:30] 10Operations, 10Patch-For-Review: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (10Capt_Swing) Thanks @herron! And @Reedy: that might be technically true, but I'm a long way from proficient in basic software development practice ;) But thanks to your nudge I'v... [16:15:33] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/02-redirects.conf] [16:15:33] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/disable-puppet] [16:16:48] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190 (10RobH) 05stalled>03declined I'm going to go ahead and close this out as declined for now, mostly because with no movement and likely solving this another way, this task is just sitting ope... [16:18:13] PROBLEM - puppet last run on mw1316 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [16:18:23] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [16:18:23] PROBLEM - puppet last run on kafka-jumbo1005 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/smart-data-dump] [16:18:44] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh] [16:23:14] RECOVERY - puppet last run on mw1316 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:02] (03PS4) 10Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) [16:27:45] (03CR) 10jerkins-bot: [V: 04-1] webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [16:28:33] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:34] RECOVERY - puppet last run on kafka-jumbo1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:03] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:54] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:30:54] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:02] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) Site Visit Ticket #: 1-162553077672 SmartHands Escort Ticket #: 1-162554266089 Emailed info over to the dell tech and its scheduled for 9am this Wednesday. (They may show up later, 9am is the ea... [16:35:54] (03PS1) 10Bstorm: dumps distribution: remove labstore1007 from NFS etc. 
[puppet] - 10https://gerrit.wikimedia.org/r/447467 (https://phabricator.wikimedia.org/T196651) [16:43:47] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (10Smalyshev) Yep, this looks like what we should be doing. [16:44:01] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (10Smalyshev) p:05Triage>03High [16:44:19] (03CR) 10Bstorm: [C: 032] dumps distribution: remove labstore1007 from NFS etc. [puppet] - 10https://gerrit.wikimedia.org/r/447467 (https://phabricator.wikimedia.org/T196651) (owner: 10Bstorm) [16:46:08] (03PS1) 10RobH: adding ppenda to ldap section of users module [puppet] - 10https://gerrit.wikimedia.org/r/447471 (https://phabricator.wikimedia.org/T199557) [16:46:54] (03PS2) 10RobH: adding ppenda to ldap section of users module [puppet] - 10https://gerrit.wikimedia.org/r/447471 (https://phabricator.wikimedia.org/T199557) [16:47:02] (03CR) 10RobH: [C: 032] adding ppenda to ldap section of users module [puppet] - 10https://gerrit.wikimedia.org/r/447471 (https://phabricator.wikimedia.org/T199557) (owner: 10RobH) [16:54:04] (03PS1) 10RobH: adding Nicholas Ray to ldap users section [puppet] - 10https://gerrit.wikimedia.org/r/447475 (https://phabricator.wikimedia.org/T200106) [16:54:45] (03CR) 10RobH: [C: 032] adding Nicholas Ray to ldap users section [puppet] - 10https://gerrit.wikimedia.org/r/447475 (https://phabricator.wikimedia.org/T200106) (owner: 10RobH) [16:54:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (10Gehel) Damn, we already set `minReleaseAge=1` in `RWStore.properties`. We need to be looking for something else. 
[16:55:18] (03PS2) 10Bstorm: gridengine: Add package information for stretch exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/447089 (https://phabricator.wikimedia.org/T199276) [17:00:04] gehel: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T1700). [17:00:16] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10RobH) a:03Aleksey_WMDE [17:02:22] jouncebot: o/ [17:02:56] 10Operations, 10JADE, 10TechCom, 10Goal, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) [17:03:11] 10Operations, 10JADE, 10TechCom, 10Goal, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) 05Open>03stalled [17:07:45] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@d7e8292]: new version of wdqs GUI (wdqs1009 only) (duration: 00m 30s) [17:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:02] (03PS3) 10Bstorm: gridengine: Add package information for stretch exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/447089 (https://phabricator.wikimedia.org/T199276) [17:09:58] (03CR) 10Bstorm: [C: 032] gridengine: Add package information for stretch exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/447089 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [17:13:05] hoo: T199983 is blocking the train, is there anything we can do to unblock the train? 
[17:13:06] T199983: Wikidata showing wrong language for page elements - https://phabricator.wikimedia.org/T199983 [17:13:59] zeljkof: There are two patches up, both should fix this… I guess [17:14:12] I'll take a look in a bit and do a manual test [17:15:06] hoo: please do, the train has been blocked for 5 days, and tomorrow we should already deploy .14 to group 0, but we will not be able to do it if we are still blocked [17:16:02] 10Operations, 10ops-codfw, 10decommission: Decommission mw2017 & mw2099 - https://phabricator.wikimedia.org/T187467 (10RobH) [17:16:21] !log canceling deployment of new WDQS GUI to all servers, looks like there is a JS issue. Will reschedule once the issue is fixed. [17:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:46] (03PS1) 10RobH: decom mw2017 and mw2099 [puppet] - 10https://gerrit.wikimedia.org/r/447476 (https://phabricator.wikimedia.org/T187467) [17:22:04] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 & mw2099 - https://phabricator.wikimedia.org/T187467 (10RobH) mw2099 seems to have been decommissioned and removed via another task, since it's fully out and in the decom rack. I'm removing it from this task. [17:22:43] 10Operations, 10ops-codfw, 10decommission: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10RobH) [17:37:38] 10Operations, 10Analytics, 10EventBus, 10Services (watching), and 2 others: Document the process for hard-deleting topics in kafka - https://phabricator.wikimedia.org/T199441 (10elukey) Just created https://wikitech.wikimedia.org/wiki/Kafka/Administration#Delete_a_topic, should be enough! 
[17:43:31] 10Operations, 10ops-codfw, 10decommission: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10RobH) [17:44:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Cmjohnson) a:03RobH This server is now out of warranty, expired March 2018. We will need to order a new disk. @RobH can you order a 2.5" 300GB disk Model EH0300JDYTH [17:44:17] (03CR) 10RobH: [C: 032] decom mw2017 and mw2099 [puppet] - 10https://gerrit.wikimedia.org/r/447476 (https://phabricator.wikimedia.org/T187467) (owner: 10RobH) [17:47:53] !log gehel@deploy1001 Started deploy [wdqs/wdqs@67fadb7]: new version of wdqs GUI (wdqs1009 only) [17:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:57] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@67fadb7]: new version of wdqs GUI (wdqs1009 only) (duration: 00m 03s) [17:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:28] (03PS1) 10RobH: decom mw2017 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/447481 (https://phabricator.wikimedia.org/T187467) [17:49:42] (03CR) 10RobH: [C: 032] decom mw2017 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/447481 (https://phabricator.wikimedia.org/T187467) (owner: 10RobH) [17:50:43] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10RobH) a:05RobH>03Papaul [17:56:08] (03PS1) 10Alex Monk: puppetmaster: hacks to fix puppet logstash report [puppet] - 10https://gerrit.wikimedia.org/r/447483 [17:56:29] !log gehel@deploy1001 Started deploy [wdqs/wdqs@67fadb7]: new version of wdqs GUI (wdqs1009 only) [17:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:53] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@67fadb7]: new version of wdqs GUI (wdqs1009 only) (duration: 00m 25s) [17:56:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:07] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Cmjohnson) disk ordered You have successfully submitted request SR977155354. [18:00:14] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T1800). [18:00:14] No GERRIT patches in the queue for this window AFAICS. [18:03:41] 10Operations, 10ops-eqiad, 10decommission, 10User-ArielGlenn: decommission dataset1001, ms1001 - https://phabricator.wikimedia.org/T194060 (10RobH) [18:07:40] 10Operations, 10Analytics, 10EventBus, 10Services (watching), and 2 others: Document the process for hard-deleting topics in kafka - https://phabricator.wikimedia.org/T199441 (10elukey) 05Open>03Resolved [18:16:54] (03PS1) 10Bstorm: dumps distribution: put back labstore1007 as a stats server [puppet] - 10https://gerrit.wikimedia.org/r/447484 (https://phabricator.wikimedia.org/T196651) [18:17:26] (03PS5) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [18:18:22] (03CR) 10Bstorm: [C: 032] dumps distribution: put back labstore1007 as a stats server [puppet] - 10https://gerrit.wikimedia.org/r/447484 (https://phabricator.wikimedia.org/T196651) (owner: 10Bstorm) [18:30:47] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10awight) Here are the notes from our meeting, plus some more discussion afterwards: https://etherpad.wikimedia.org/p/JADE_scalabili... 
[18:33:43] (03CR) 10Herron: [C: 032] admin: add ssh key for jmorgan's new laptop [puppet] - 10https://gerrit.wikimedia.org/r/447446 (https://phabricator.wikimedia.org/T200103) (owner: 10Herron) [18:33:50] (03PS2) 10Herron: admin: add ssh key for jmorgan's new laptop [puppet] - 10https://gerrit.wikimedia.org/r/447446 (https://phabricator.wikimedia.org/T200103) [18:40:29] (03PS1) 10Thcipriani: Beta: Fix for service dependency loops [puppet] - 10https://gerrit.wikimedia.org/r/447487 (https://phabricator.wikimedia.org/T171173) [18:41:13] (03CR) 10jerkins-bot: [V: 04-1] Beta: Fix for service dependency loops [puppet] - 10https://gerrit.wikimedia.org/r/447487 (https://phabricator.wikimedia.org/T171173) (owner: 10Thcipriani) [18:52:32] 10Operations, 10Patch-For-Review: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (10herron) 05Open>03Resolved @Capt_Swing your new ssh key has been added and I watched it deploy successfully to `bast1002`. Just give this change another 30 minutes to finish... [19:31:24] 10Operations, 10Maps, 10Maps-Sprint: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mholloway) [19:31:56] 10Operations, 10Maps, 10Maps-Sprint: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mholloway) [19:36:22] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) email sent to team list so all other sre team members are aware of this work next Wednesday (2018-07-25). 
[19:36:53] !log gehel@deploy1001 Started deploy [wdqs/wdqs@690bf52]: new version of wdqs GUI (wdqs1009 only) [19:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:15] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@690bf52]: new version of wdqs GUI (wdqs1009 only) (duration: 00m 22s) [19:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:08] !log gehel@deploy1001 Started deploy [wdqs/wdqs@690bf52]: new version of wdqs GUI [19:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:43] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@690bf52]: new version of wdqs GUI (duration: 11m 35s) [19:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:10] SMalyshev: ^deployment completed [19:50:35] tests are green [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T2000). 
[20:02:43] nothing for poor ORES [20:03:29] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@565b41a]: Update mobileapps to 254cef5 [20:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:53] 10Operations, 10vm-requests, 10Performance-Team (Radar): Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (10Imarlier) [20:09:25] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@565b41a]: Update mobileapps to 254cef5 (duration: 05m 56s) [20:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:15] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54936 MB (3% inode=99%) [20:13:01] !log arlolra@deploy1001 Started deploy [parsoid/deploy@80384a5]: Updating Parsoid to a1e851c [20:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:29] (03CR) 10Smalyshev: Generate daily diffs for categories RDF (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) (owner: 10Smalyshev) [20:14:39] (03PS13) 10Smalyshev: Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) [20:15:14] (03PS14) 10Smalyshev: Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) [20:22:47] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@80384a5]: Updating Parsoid to a1e851c (duration: 09m 45s) [20:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:20] 10Operations, 10Analytics, 10decommission: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097 (10RobH) [20:26:21] (03PS1) 10RobH: stat1002 decom prod dns [dns] - 10https://gerrit.wikimedia.org/r/447503 (https://phabricator.wikimedia.org/T173097) [20:26:56] 
(03CR) 10RobH: [C: 032] stat1002 decom prod dns [dns] - 10https://gerrit.wikimedia.org/r/447503 (https://phabricator.wikimedia.org/T173097) (owner: 10RobH) [20:27:34] !log Updated Parsoid to a1e851c (T199808, T194083) [20:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:39] T194083: Found nested inserted dom-diff flags! - https://phabricator.wikimedia.org/T194083 [20:27:40] T199808: Update portals to cater for foundationwiki move - https://phabricator.wikimedia.org/T199808 [20:28:31] 10Operations, 10Analytics, 10decommission: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097 (10RobH) a:05RobH>03Cmjohnson [20:28:56] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097 (10RobH) [20:31:35] 10Operations, 10ops-eqiad, 10PoolCounter, 10decommission: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025 (10RobH) [20:37:05] RECOVERY - Disk space on maps1001 is OK: DISK OK [20:43:36] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:44:27] (03PS6) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) [20:44:29] (03PS1) 10Andrew Bogott: Move labtest-recursor1 to 208.80.153.78 [dns] - 10https://gerrit.wikimedia.org/r/447552 [20:45:24] !log T156137: restarting elastic1027 to disable G1GC [20:45:25] (03PS2) 10Andrew Bogott: Move labtest-recursor1 to 208.80.153.78 [dns] - 10https://gerrit.wikimedia.org/r/447552 [20:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:27] (03PS7) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 
10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) [20:45:28] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [20:46:21] (03CR) 10jerkins-bot: [V: 04-1] Move labtest-recursor1 to 208.80.153.78 [dns] - 10https://gerrit.wikimedia.org/r/447552 (owner: 10Andrew Bogott) [20:46:53] (03CR) 10Andrew Bogott: [C: 032] Move labtest-recursor1 to 208.80.153.78 [dns] - 10https://gerrit.wikimedia.org/r/447552 (owner: 10Andrew Bogott) [20:50:37] (03PS3) 10Alex Monk: Re-combine labs and production exim minimal config [puppet] - 10https://gerrit.wikimedia.org/r/439774 [20:51:17] (03CR) 10jerkins-bot: [V: 04-1] Re-combine labs and production exim minimal config [puppet] - 10https://gerrit.wikimedia.org/r/439774 (owner: 10Alex Monk) [20:51:21] !log T156137: restarting elastic1035 to disable G1GC [20:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:25] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [20:54:40] (03PS4) 10Alex Monk: Re-combine labs and production exim minimal config [puppet] - 10https://gerrit.wikimedia.org/r/439774 [20:55:05] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:55:16] (03CR) 10jerkins-bot: [V: 04-1] Re-combine labs and production exim minimal config [puppet] - 10https://gerrit.wikimedia.org/r/439774 (owner: 10Alex Monk) [20:58:27] _joe_, hi [20:58:47] I'm struggling to understand this wmf-style thing [20:58:50] 20:55:14 modules/standard/manifests/mail/sender.pp:2 wmf-style: Found hiera call in class 'standard::mail::sender' for 'route_wiki_mail' [20:59:49] I don't understand why it complains about that and not e.g. 
the do_acme stuff in role::tendril
[21:00:04] bawolff and Reedy: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T2100).
[21:02:26] (PS5) Alex Monk: Re-combine labs and production exim minimal config [puppet] - https://gerrit.wikimedia.org/r/439774
[21:02:54] (PS6) Alex Monk: Re-combine labs and production exim minimal config [puppet] - https://gerrit.wikimedia.org/r/439774
[21:03:30] Operations, Core-Platform-Team, PoolCounter, monitoring, Wikimedia-Incident: High levels of PoolCounter errors should trigger alerts - https://phabricator.wikimedia.org/T133318 (Krinkle)
[21:04:46] Operations, Core-Platform-Team, PoolCounter, monitoring: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (Krinkle)
[21:04:48] Operations, Core-Platform-Team, PoolCounter, monitoring, Wikimedia-Incident: High levels of PoolCounter errors should trigger alerts - https://phabricator.wikimedia.org/T133318 (Krinkle)
[21:05:00] well ok
[21:05:09] turns out I can just bury the hiera call within the template
[21:05:14] Operations, Core-Platform-Team, PoolCounter, monitoring, Wikimedia-Incident: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (Krinkle) >>! In T133318#3459299, @tstarling wrote: > MW already provides a log of all PoolCounter errors, including queue overflow, in...
[21:14:37] Krinkle: if you're still here, thcipriani can swat it out
[21:14:43] * thcipriani waves
[21:15:13] what patches am I looking at?
[21:15:20] * Krinkle waves back
[21:18:14] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/447506 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/447507 though that last is failing tests (I just noticed)
[21:19:58] yeah, I don't know if a recheck is going to fix that failure.
Is there a depends-on relationship missing somewhere?
[21:23:26] * thcipriani tries recheck anyway
[21:24:38] Krinkle: do you know what could be causing the failure on https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/447507/ ?
[21:25:07] thcipriani: Yes, the "Revert .. Accept .. " needs to land first.
[21:25:23] Or rebase/git-parent-depend
[21:25:40] I'm not sure Depends-On works within the same repo
[21:25:53] But you can Rebase in gerrit and specify the other change id
[21:26:13] I don't think it does, probably just the git parent. Unfortunately gerrit UI has grayed out the rebase button for me :)
[21:26:36] thcipriani: The main one, or the modal one?
[21:26:36] I'll just go ahead and land the other one if you're available to test?
[21:27:01] Worked for me :)
[21:27:04] modal one
[21:27:12] Ah, needs to enable the check box first
[21:27:21] [ ] Change parent rev
[21:27:36] Otherwise it's only for rebasing on the latest HEAD of current branch, which it already is, and won't permit no-op.
[21:27:59] makes sense
[21:28:34] yeah, it's as intuitive as... Gerrit code review.
[21:28:58] :)
[21:34:09] PROBLEM - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[21:41:50] (PS1) Andrew Bogott: labservices: allow pdns database to talk to second_region designate [puppet] - https://gerrit.wikimedia.org/r/447554
[21:42:28] (CR) jerkins-bot: [V: -1] labservices: allow pdns database to talk to second_region designate [puppet] - https://gerrit.wikimedia.org/r/447554 (owner: Andrew Bogott)
[21:44:46] (PS2) Andrew Bogott: labservices: allow pdns database to talk to second_region designate [puppet] - https://gerrit.wikimedia.org/r/447554
[21:47:15] (CR) Andrew Bogott: [C: 2] labservices: allow pdns database to talk to second_region designate [puppet] - https://gerrit.wikimedia.org/r/447554 (owner: Andrew Bogott)
[21:48:35] Krinkle: still around for those patches?
After waiting on jenkins for a bit it seems they've merged.
[21:49:17] Yes.
[21:49:30] ok, they're staged on mwdebug1002, check please
[21:51:44] thcipriani: thx, confirmed still broken on 1001 and confirmed fixed on mwdebug1002
[21:52:13] Krinkle: ok, doing a full scap to sync out
[21:52:28] Aye, yeah, I guess it needs that.
[21:53:44] !log thcipriani@deploy1001 Started scap: [[gerrit:447506|Revert "Accept BCP 47 codes as aliases for nonstandard variants"]] T199941 [[gerrit:447507|Revert "Ensure LanguageCode::bcp47() returns a valid BCP 47 language code"]] T199941
[21:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:49] T199941: Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string" - https://phabricator.wikimedia.org/T199941
[21:56:06] thcipriani: btw, unrelated, I'm trying to find the task for switching scap (for mw) to use git fetch/checkout.
[21:56:10] I assume there is one, right?
[21:56:44] there was one at one point
[21:56:46] * thcipriani looks
[21:57:34] it would make syncs more atomic on each server compared to rsync, right? Or do we use symlinks with rsync currently to achieve that?
[21:57:43] https://phabricator.wikimedia.org/T147938
[21:58:02] it would make syncs more atomic to use scap3
[21:58:05] /git
[21:58:11] currently we use --delay-updates
[21:58:12] Yeah, exactly
[21:58:37] Which means we still get random fatals both for sync-dir and for full scap due to run-time conflicts between classes missing etc.
[21:59:14] and whatever random issues result from that, which can sometimes be sticky given most code we write isn't expected to handle with such unpredictable frankenstein intermediary states.
[21:59:16] indeed that can happen. Most of protection from that is due to the way train is deployed, i.e. having sync-wikiversions be the final step
[21:59:23] Right
[21:59:37] I have never been able to determine why those issues are "sticky"
[21:59:44] I forgot about wikiversions.
That's why we marked "the bug" about atomic issues as closed a while back.
[21:59:46] why we have to touch and resync IS.php
[21:59:59] some race condition somewhere
[22:00:01] Yeah, that's only a small part of the sticky problems though
[22:00:15] specifically the ones relating to wgConf expansion
[22:00:35] but other ones are not fixed by a touch or by anything else. They'll be properly stuck until we identify and fix it.
[22:00:54] E.g. partially executed responses that fatal due to a missing class or some other conflict.
[22:01:51] so for scap3 we got as far as flattening out /srv/mediawiki into a git repo. That was deployed on beta, but it takes up a ton of space even with regular git gc
[22:02:32] but it had a lot of nice side effects: it gave us a single number that represents the current state of what's deployed, which is neat
[22:02:33] I remember now, yeah. scap3 for mw is blocked on the flat repo.
[22:02:44] also tells you: what was deployed in a particular scap
[22:02:49] yep
[22:02:52] and that's without using it for transport yet
[22:03:06] so that's beta only for now, right?
[22:03:16] yep, it's feature-flagged in production
[22:03:56] and will stay that way for the mid-term future likely. every now and then I have to remove /srv/mediawiki/.git because it grows out of control :(
[22:04:12] well "out of control": grows over time
[22:04:18] and eventually gets very large
[22:04:31] Does the flat repo includes cdb builds?
[22:05:00] no, but it includes the json of those files
[22:05:29] was planning to leave cdb-rebuild as the last step on all targets alone
[22:05:29] right
[22:05:34] Yeah
[22:05:46] so switching to static array files won't avoid the git growth issue
[22:06:16] no, unfortunately
[22:07:06] I guess at this point scap3 for mw will be racing against container images for mw.
[22:07:11] as deployment method.
[22:07:25] Might not be worth investing in.
[22:07:33] yeah, which is part of the reason it's been on the backburner
[22:07:38] okay
[22:07:46] one last random q before I really sign off. ///
[22:07:55] how's the static array for l10n going?
[22:08:06] I think it's on test wikis now, right?
[22:08:41] https://phabricator.wikimedia.org/T99740
[22:09:18] I hadn't kept up on this at all
[22:09:52] I didn't realize it was going to be/was deployed anywhere?
[22:09:55] looks like scap and localisation rebuild may need to be updated to account for both formats during the transition period, but maybe that's done already.
[22:10:06] Hm.. yeah, it's not live on test wikis
[22:10:27] It's on beta according to the task, but that's possibly easier because it could just be switched all at once
[22:10:28] IIRC there was code in scap for it at one point, although it might not be what we need anymore
[22:10:55] I guess localisation update doesn't run in beta either
[22:11:17] not that it would make sense of course, aside from consistency.
[22:11:23] k, gotta go. thanks!
[22:11:36] ok, thanks for your help with train blockers!
[22:11:48] (scap still rebuilding l10n)
[22:27:05] !log thcipriani@deploy1001 Finished scap: [[gerrit:447506|Revert "Accept BCP 47 codes as aliases for nonstandard variants"]] T199941 [[gerrit:447507|Revert "Ensure LanguageCode::bcp47() returns a valid BCP 47 language code"]] T199941 (duration: 33m 21s)
[22:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:11] T199941: Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string" - https://phabricator.wikimedia.org/T199941
[22:46:08] RECOVERY - Recursive DNS on 208.80.153.78 is OK: DNS OK: 0.118 seconds response time.
www.wikipedia.org returns 208.80.154.224
[22:53:26] !log mobrovac@deploy1001 Started restart [eventstreams/deploy@5ed03e0]: A ggodnight restart - T199813
[22:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:30] T199813: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180723T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:09:03] Operations, netops: Unexpected network packets in codfw mgmt - https://phabricator.wikimedia.org/T199832 (ayounsi) Open>Resolved From support: > I have confirmed that these addresses 128.0.0.16 , 191.255.255.255 are used in the system for internal purposes only. > This type of traffic can be safe...
[23:09:16] Operations, netops: Unexpected network packets in codfw mgmt - https://phabricator.wikimedia.org/T199832 (ayounsi)
[23:13:28] PROBLEM - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[23:18:03] (PS1) Bstorm: gridengine: try to translate all the Ubuntu package calls to Debian [puppet] - https://gerrit.wikimedia.org/r/447561 (https://phabricator.wikimedia.org/T199276)
[23:49:35] (PS2) EBernhardson: Delete unused code in elasticsearch module [puppet] - https://gerrit.wikimedia.org/r/445320
[23:49:37] (PS7) EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351)
[23:49:39] (PS20) EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351)
[23:49:41] (PS22) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351)
[23:49:43] (PS1) EBernhardson: Remove support for elasticsearch 2.x [puppet] - https://gerrit.wikimedia.org/r/447564
[23:49:45] (PS1) EBernhardson: Split elasticsearch::log::hot_threads into two pieces [puppet] - https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351)
[23:49:47] (PS1) EBernhardson: Make cirrus specific elasticsearch profile [puppet] - https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351)
[23:49:49] (PS1) EBernhardson: Split per-cluster config out of elasticsearch::curator [puppet] - https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807)
[23:49:51] (PS1) EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351)
[23:53:03] (PS26) EBernhardson:
prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351)
[23:53:05] (PS29) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351)
[23:53:07] (PS57) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351)
[23:53:10] (PS2) EBernhardson: Cleanup ensure => absent after refactoring [puppet] - https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351)
[23:53:49] (CR) EBernhardson: Prep work for multi-instance elasticsearch refactor (11 comments) [puppet] - https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson)