[02:32:44] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.10) (duration: 13m 07s)
[02:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:20] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:34:29] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:43:03] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jul 3 02:43:03 UTC 2018 (duration 10m 19s)
[02:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:37] (PS1) Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443529
[02:43:39] (PS1) Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443530
[02:43:41] (PS1) Krinkle: Remove wgRightsUrl from Wikibase config [mediawiki-config] - https://gerrit.wikimedia.org/r/443531
[02:43:43] (PS1) Krinkle: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/443532
[02:51:50] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[02:55:09] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:55:19] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[02:57:39] (PS1) Krinkle: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443533
[02:57:41] (PS1) Krinkle: Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443534
[02:58:30] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:01:19] (PS2) Krinkle: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443533
[03:01:21] (PS2) Krinkle: Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443534
[03:16:12] (PS1) Krinkle: Move filebackend.php include towards the top near other includes [mediawiki-config] - https://gerrit.wikimedia.org/r/443535
[03:46:23] (PS6) Krinkle: webperf: Make performance::site apache config more dynamic [puppet] - https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314)
[03:51:10] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[03:54:30] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:25:30] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[04:27:19] !log Deploy schema change on s2 primary master db1066 T191316 T192926 T195193
[04:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:27:27] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[04:27:28] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[04:27:30] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[04:28:49] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:36:03] (PS1) Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/443537 (https://phabricator.wikimedia.org/T191316)
[04:37:30] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/443537 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:38:42] (Merged) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/443537 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:38:58] (CR) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/443537 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:40:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 for alter table (duration: 00m 52s)
[04:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:21] !log Deploy schema change on db1086 T191316 T192926 T89737 T195193 T197459
[04:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:27] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[04:40:28] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[04:40:28] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[04:40:28] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[04:40:29] T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459
[04:44:16] (PS1) Marostegui: es1017.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/443538 (https://phabricator.wikimedia.org/T197072)
[04:45:11] (CR) Marostegui: [C: 2] es1017.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/443538 (https://phabricator.wikimedia.org/T197072) (owner: Marostegui)
[04:55:10] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[05:04:44] !log Deploy schema change on s8 primary master db1071 T191316 T192926 T195193
[05:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:50] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:04:50] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:04:51] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:14:55] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/443541
[05:16:46] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:17:06] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/443541 (owner: Marostegui)
[05:18:18] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/443541 (owner: Marostegui)
[05:18:29] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/443541 (owner: Marostegui)
[05:19:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 after alter table (duration: 00m 50s)
[05:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:44] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/443544 (https://phabricator.wikimedia.org/T191316)
[05:24:05] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/443544 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:25:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/443544 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:26:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 for alter table (duration: 00m 49s)
[05:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:33] !log Deploy schema change on db1094 T191316 T192926 T89737 T195193 T197459
[05:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:39] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:26:40] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:26:40] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:26:40] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:26:41] T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459
[05:27:41] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/443544 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:58:39] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/443546
[06:02:37] (PS4) Giuseppe Lavagetto: profile::jobrunner: remove references to mediawiki::jobrunner [puppet] - https://gerrit.wikimedia.org/r/443427 (https://phabricator.wikimedia.org/T198220)
[06:02:46] <_joe_> elukey: ^^
[06:04:43] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/443546 (owner: Marostegui)
[06:05:54] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/443546 (owner: Marostegui)
[06:06:55] _joe_ (checking)
[06:07:00] <_joe_> elukey: wait
[06:07:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094 after alter table (duration: 00m 50s)
[06:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:07] <_joe_> I forgot some files, sorry
[06:08:01] <_joe_> now it should be better :P
[06:08:02] (PS5) Giuseppe Lavagetto: profile::jobrunner: remove references to mediawiki::jobrunner [puppet] - https://gerrit.wikimedia.org/r/443427 (https://phabricator.wikimedia.org/T198220)
[06:08:08] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/443546 (owner: Marostegui)
[06:08:19] <_joe_> elukey: look at ps5
[06:08:45] (PS1) Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/443547 (https://phabricator.wikimedia.org/T191316)
[06:08:55] yep yep
[06:09:58] (PS2) Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/443547 (https://phabricator.wikimedia.org/T191316)
[06:10:58] _joe_ what about modules/mediawiki/files/jobrunner.rsyslog.conf ?
[06:11:10] <_joe_> uh did I miss it again?
[06:11:21] <_joe_> I should've removed with the patch I merged yesterday
[06:11:24] <_joe_> yeah I'll add it
[06:11:53] ah okok maybe I don't have my puppet repo updated sorry
[06:11:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/443547 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:11:54] doing it
[06:12:08] re-checking
[06:12:19] <_joe_> elukey: no I guess I forgot
[06:12:20] <_joe_> :P
[06:13:03] (Merged) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/443547 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:13:18] (CR) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/443547 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:13:48] nono not there anymore :)
[06:14:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 for alter table (duration: 00m 50s)
[06:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:23] !log Deploy schema change on db1079 with replication, this will generate lag on s7 on labsdb hosts T191316 T192926 T89737 T195193 T197459
[06:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:31] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[06:14:31] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[06:14:31] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[06:14:32] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[06:14:32] T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459
[06:15:17] (CR) Elukey: [C: 1] profile::jobrunner: remove references to mediawiki::jobrunner [puppet] - https://gerrit.wikimedia.org/r/443427 (https://phabricator.wikimedia.org/T198220) (owner: Giuseppe Lavagetto)
[06:16:20] !log Stop replication on db1079 to drop triggers on db1125:s7 - T192926
[06:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:46] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/00-dummy.conf]
[06:30:16] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:42:03] (CR) Lokal Profil: Add versioning for config and validate it (1 comment) [dumps/dcat] - https://gerrit.wikimedia.org/r/425065 (https://phabricator.wikimedia.org/T163328) (owner: Lokal Profil)
[06:50:39] (PS1) Jcrespo: mariadb: Depool es1017 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/443550 (https://phabricator.wikimedia.org/T197072)
[06:50:55] (CR) Marostegui: [C: 1] mariadb: Depool es1017 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/443550 (https://phabricator.wikimedia.org/T197072) (owner: Jcrespo)
[06:53:31] (CR) Jcrespo: [C: 2] mariadb: Depool es1017 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/443550 (https://phabricator.wikimedia.org/T197072) (owner: Jcrespo)
[06:54:45] (Merged) jenkins-bot: mariadb: Depool es1017 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/443550 (https://phabricator.wikimedia.org/T197072) (owner: Jcrespo)
[06:55:45] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:55:56] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[06:56:46] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/443553
[06:56:57] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/443553
[06:57:54] (CR) jenkins-bot: mariadb: Depool es1017 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/443550 (https://phabricator.wikimedia.org/T197072) (owner: Jcrespo)
[06:59:15] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:59:16] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:02:19] jynus: are you going to scap?
[07:03:44] yes, I was waiting for jenkins
[07:03:52] but it was merged no?
[07:03:59] now yes
[07:04:00] at 8:54?
[07:04:15] but it takes enough time to get me distracted to start other things
[07:04:20] yeah
[07:04:21] :(
[07:04:54] (PS3) Alexandros Kosiaris: docker: Create the configuration before installing the package [puppet] - https://gerrit.wikimedia.org/r/443464
[07:04:59] (CR) Alexandros Kosiaris: [V: 2 C: 2] docker: Create the configuration before installing the package [puppet] - https://gerrit.wikimedia.org/r/443464 (owner: Alexandros Kosiaris)
[07:05:54] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1017 (duration: 00m 51s)
[07:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:05] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/443553 (owner: Marostegui)
[07:07:14] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/443553 (owner: Marostegui)
[07:07:41] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/443553 (owner: Marostegui)
[07:08:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 after alter table (duration: 00m 50s)
[07:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:08] (PS1) Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443554 (https://phabricator.wikimedia.org/T191316)
[07:11:05] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443554 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:12:15] (Merged) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443554 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:12:30] (CR) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443554 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:12:35] PROBLEM - Host kubestage1001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:13:01] (CR) Alexandros Kosiaris: [C: 2] apertium-separable: New upstream release [debs/contenttranslation/apertium-separable] - https://gerrit.wikimedia.org/r/443346 (https://phabricator.wikimedia.org/T191404) (owner: KartikMistry)
[07:13:22] (PS1) Giuseppe Lavagetto: jobqueue_redis: remove restarting cron in redis [puppet] - https://gerrit.wikimedia.org/r/443555 (https://phabricator.wikimedia.org/T191316)
[07:13:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 for alter table (duration: 00m 51s)
[07:13:34] !log Deploy schema change on db1090:3317 T191316 T192926 T89737 T195193 T197459
[07:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:41] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[07:13:41] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[07:13:42] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[07:13:42] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[07:13:43] T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459
[07:14:45] RECOVERY - Host kubestage1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[07:16:56] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:17:05] !log reimage kubestage1001
[07:17:05] PROBLEM - configured eth on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:17:05] PROBLEM - dhclient process on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:16] PROBLEM - DPKG on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:17:16] PROBLEM - MD RAID on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:17:26] PROBLEM - Check size of conntrack table on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:17:28] I passed --new by mistake to wmf-host-reimage
[07:17:36] PROBLEM - Disk space on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:17:45] PROBLEM - Check whether ferm is active by checking the default input chain on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:18:09] !log upgrading ffmpeg on mw1307 (T190333)
[07:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:12] T190333: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333
[07:18:35] akosiaris: so no icinga downtime and no depool
[07:18:44] yup, but both are fine
[07:18:48] it's the staging cluster
[07:18:56] not used yet
[07:19:05] the kubernetes staging cluster to be more precise
[07:19:09] ack, but I can add a check to the script
[07:19:15] PROBLEM - puppet last run on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:21:16] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[07:22:09] (PS6) Giuseppe Lavagetto: profile::jobrunner: remove references to mediawiki::jobrunner [puppet] - https://gerrit.wikimedia.org/r/443427 (https://phabricator.wikimedia.org/T198220)
[07:24:36] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:25:05] PROBLEM - DPKG on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:25:31] (PS2) Filippo Giunchedi: DHCP: Add MAC address entries for graphite2003 [puppet] - https://gerrit.wikimedia.org/r/443100 (https://phabricator.wikimedia.org/T196483) (owner: Papaul)
[07:25:45] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[07:27:16] PROBLEM - MD RAID on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:27:25] PROBLEM - Check size of conntrack table on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:27:36] PROBLEM - Disk space on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:27:45] PROBLEM - Check whether ferm is active by checking the default input chain on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:28:05] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:28:15] PROBLEM - configured eth on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:28:15] PROBLEM - dhclient process on kubestage1001 is CRITICAL: Return code of 255 is out of bounds
[07:28:42] (CR) Giuseppe Lavagetto: [C: 2] profile::jobrunner: remove references to mediawiki::jobrunner [puppet] - https://gerrit.wikimedia.org/r/443427 (https://phabricator.wikimedia.org/T198220) (owner: Giuseppe Lavagetto)
[07:29:03] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:29:23] PROBLEM - puppet last run on kubestage1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:31:44] PROBLEM - Check size of conntrack table on kubestage1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:32:14] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational
[07:32:23] RECOVERY - dhclient process on kubestage1001 is OK: PROCS OK: 0 processes with command name dhclient
[07:32:23] RECOVERY - configured eth on kubestage1001 is OK: OK - interfaces up
[07:32:34] RECOVERY - DPKG on kubestage1001 is OK: All packages OK
[07:32:34] RECOVERY - MD RAID on kubestage1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:32:43] RECOVERY - Check size of conntrack table on kubestage1001 is OK: OK: nf_conntrack is 0 % full
[07:33:25] !log upload apertium-separable_0.3.1-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main T191404
[07:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:29] T191404: Upgrade apertium-separable - https://phabricator.wikimedia.org/T191404
[07:37:47] (PS1) Ema: cache: stop setting varnish_version in hiera [puppet] - https://gerrit.wikimedia.org/r/443562
[07:38:19] !log reimage aqs1005 to debian stretch
[07:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:57] (PS3) Filippo Giunchedi: DHCP: Add MAC address entries for graphite2003 [puppet] - https://gerrit.wikimedia.org/r/443100 (https://phabricator.wikimedia.org/T196483) (owner: Papaul)
[07:40:06] (CR) Filippo Giunchedi: [C: 2] DHCP: Add MAC address entries for graphite2003 [puppet] - https://gerrit.wikimedia.org/r/443100 (https://phabricator.wikimedia.org/T196483) (owner: Papaul)
[07:41:40] (CR) Ema: [C: 2] cache: stop setting varnish_version in hiera [puppet] - https://gerrit.wikimedia.org/r/443562 (owner: Ema)
[07:41:49] (PS2) Ema: cache: stop setting varnish_version in hiera [puppet] - https://gerrit.wikimedia.org/r/443562
[07:42:19] (CR) Filippo Giunchedi: [C: 1] gerrit: use localhost exim as smtp server [puppet] - https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) (owner: Herron)
[07:42:31] (CR) Ema: [V: 2 C: 2] cache: stop setting varnish_version in hiera [puppet] - https://gerrit.wikimedia.org/r/443562 (owner: Ema)
[07:46:54] RECOVERY - Check whether ferm is active by checking the default input chain on kubestage1001 is OK: OK ferm input default policy is set
[07:48:33] RECOVERY - Disk space on kubestage1001 is OK: DISK OK
[07:55:23] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[07:58:42] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:04:31] PROBLEM - HHVM rendering on mw2140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:05:21] RECOVERY - HHVM rendering on mw2140 is OK: HTTP OK: HTTP/1.1 200 OK - 74985 bytes in 0.298 second response time
[08:06:09] !log resuming rolling restart of cassandra on restbase in codfw to pick up Java security updates
[08:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:41] RECOVERY - puppet last run on kubestage1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:17:30] (PS1) Alexandros Kosiaris: kubernetes: Switch to /dev/md1 across the fleet [puppet] - https://gerrit.wikimedia.org/r/443566
[08:22:05] (CR) Alexandros Kosiaris: [C: 2] kubernetes: Switch to /dev/md1 across the fleet [puppet] - https://gerrit.wikimedia.org/r/443566 (owner: Alexandros Kosiaris)
[08:24:51] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[08:28:02] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:32:42] !log stop es1017 mysql for upgrade and maintenance
[08:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:46] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1],Volume_group[docker]
[08:42:07] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1],Volume_group[docker]
[08:45:26] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1],Volume_group[docker]
[08:47:49] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1],Volume_group[docker]
[08:49:02] (PS1) Jcrespo: mariadb: Prepare for reimage to stretch of db1100 and db2051 [puppet] - https://gerrit.wikimedia.org/r/443572
[08:49:13] (PS1) Ema: cache_text: test switching to cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/443573 (https://phabricator.wikimedia.org/T164609)
[08:49:58] Operations, Traffic, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (ema)
[08:50:00] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1],Volume_group[docker]
[08:50:38] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443574
[08:50:46] (CR) Jcrespo: [C: 2] mariadb: Prepare for reimage to stretch of db1100 and db2051 [puppet] - https://gerrit.wikimedia.org/r/443572 (owner: Jcrespo)
[08:50:49] RECOVERY - puppet last run on kubernetes2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:51:37] (PS2) Ema: cache_text: test switching to cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/443573 (https://phabricator.wikimedia.org/T164609)
[08:52:13] (CR) Ema: [C: 2] cache_text: test switching to cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/443573 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[08:52:40] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1],Volume_group[docker]
[08:54:04] (PS1) Jcrespo: mariadb: Depool db1100 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443575
[08:54:58] Operations, Analytics, hardware-requests: eqiad: (2) hardware refresh for analytics1003 - https://phabricator.wikimedia.org/T198685 (elukey)
[09:00:03] (PS2) Jcrespo: mariadb: Depool db1100 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443575
[09:00:35] (CR) Muehlenhoff: [C: 1] "One nit, looks good" (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: Volans)
[09:01:04] (CR) Marostegui: [C: 1] mariadb: Depool db1100 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443575 (owner: Jcrespo)
[09:01:24] (CR) Jcrespo: [C: 2] mariadb: Depool db1100 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443575 (owner: Jcrespo)
[09:02:37] (Merged) jenkins-bot: mariadb: Depool db1100 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443575 (owner: Jcrespo)
[09:02:53] (CR) jenkins-bot: mariadb: Depool db1100 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443575 (owner: Jcrespo)
[09:04:05] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 51s)
[09:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:13] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443574 (owner: Marostegui)
[09:04:18] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443574
[09:07:19] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443574 (owner: Marostegui)
[09:08:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3317 after alter table (duration: 00m 50s)
[09:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:23] (PS1) Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443577 (https://phabricator.wikimedia.org/T191316)
[09:09:15] !log deploying sys schema into db1072 (phabricator master)
[09:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:45] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443577 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[09:11:33] !log reimage aqs1006 to Debian Stretch
[09:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:01] (Merged) jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443577 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[09:12:46] (CR) jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443577 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[09:13:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 for alter table (duration: 00m 50s)
[09:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:13] !log Deploy schema change on db1098:3317 T191316 T192926 T89737 T195193 T197459
[09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:25] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[09:13:25] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[09:13:26] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[09:13:26] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [09:13:29] T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459 [09:14:20] (CR) Alex Monk: [C: 2] Pin dependencies versions [software/certcentral] - https://gerrit.wikimedia.org/r/443372 (owner: Vgutierrez) [09:14:49] (Merged) jenkins-bot: Pin dependencies versions [software/certcentral] - https://gerrit.wikimedia.org/r/443372 (owner: Vgutierrez) [09:15:14] (CR) jenkins-bot: Pin dependencies versions [software/certcentral] - https://gerrit.wikimedia.org/r/443372 (owner: Vgutierrez) [09:21:09] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [09:24:20] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:24:49] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational [09:28:07] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:29:12] <_joe_> arturo: neutron-server is failing ^^ [09:29:49] <_joe_> ExecStart=/etc/init.d/neutron-server systemd-start (code=exited, status=1/FAILURE) [09:30:16] mmm [09:30:39] is an under-development system, icinga should not alert [09:33:50] downtimed for a week, thanks _joe_ [09:35:58] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [09:37:55] <_joe_> arturo: yw [09:42:47] (03PS1) 10Vgutierrez: Add IPv6 records for authdns2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/443580 (https://phabricator.wikimedia.org/T196664) [09:43:28] arturo: for those I normally add norifications_enabled: 0 to hiera [09:43:38] *notif [09:43:54] (03Abandoned) 10Alexandros Kosiaris: kubernetes: Docker physical volume remapping [puppet] - 10https://gerrit.wikimedia.org/r/436530 (owner: 10Alexandros Kosiaris) [09:44:52] !log reimage db1100 for upgrade to stretch [09:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:50] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [09:52:17] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664 (10Vgutierrez) 05Open>03Resolved [09:53:02] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664 (10Vgutierrez) [09:56:47] !log Stop replication on db2094:3313 to replace triggers - T192926 [09:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:51] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [09:58:59] !log Deploy schema change on s3 codfw primary master (db2043) with replication, this will generate lag on s3 codfw T191316 T192926 T89737 T195193 T197459 [09:59:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:06] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [09:59:06] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [09:59:07] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [09:59:07] T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459 [10:03:20] (CR) Elukey: [C: 1] jobqueue_redis: remove restarting cron in redis [puppet] - https://gerrit.wikimedia.org/r/443555 (https://phabricator.wikimedia.org/T191316) (owner: Giuseppe Lavagetto) [10:05:29] (PS1) Gehel: wdqs: create log files during log rotation [puppet] - https://gerrit.wikimedia.org/r/443583 [10:09:15] (PS1) MarcoAurelio: Close chairwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) [10:10:21] Operations, Analytics, hardware-requests: eqiad: (1) new stat box to offload users from stat1005 - https://phabricator.wikimedia.org/T196345 (faidon) The argument that switches between stat boxes are expensive in staff time, so we should make them less often doesn't resonate much with me (maybe we sh...
[10:12:17] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@ba672a3]: Decrease checker job concurrency T198462 [10:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:20] T198462: Rethink pacing the cirrusSearchCheckerJob - https://phabricator.wikimedia.org/T198462 [10:13:09] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@ba672a3]: Decrease checker job concurrency T198462 (duration: 00m 52s) [10:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:36] (03PS2) 10MarcoAurelio: Close chairwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) [10:14:58] (03CR) 10MarcoAurelio: Close chairwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) (owner: 10MarcoAurelio) [10:16:26] 10Operations, 10JADE, 10TechCom, 10Goal, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10awight) [10:18:39] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [10:18:54] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [10:20:22] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [10:20:47] (03CR) 10Jcrespo: [C: 04-1] "Waiting until buffer pool warmup." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 (owner: 10Jcrespo) [10:25:15] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational [10:44:40] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443588 [10:46:13] !log stop and reimage to stretch db2051 [10:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:28] this will create lag on all of s5-codfw replicas [10:47:40] also probably some noise on the mediawiki logs [10:59:03] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443588 (owner: 10Marostegui) [11:00:33] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443588 (owner: 10Marostegui) [11:00:51] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443588 (owner: 10Marostegui) [11:02:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3317 after alter table (duration: 00m 52s) [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:37] !log reimage aqs1007 to Debian Stretch [11:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:48] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[hive_mysql_create_database],Exec[hive_mysql_create_user],Exec[oozie_mysql_create_database] [11:42:58] this is me, checking --^ [12:00:36] (03PS1) 10Elukey: Explicitly set database password for hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/443595 [12:00:57] (03CR) 10jerkins-bot: [V: 04-1] Explicitly set database password for hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/443595 (owner: 10Elukey) [12:08:15] (03PS2) 10Elukey: Explicitly set database password for hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/443595 [12:17:04] (03PS1) 10Elukey: Add mysql passwords for role::analytics_cluster::coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/443600 [12:17:19] (03CR) 10Elukey: [V: 032 C: 032] Add mysql passwords for role::analytics_cluster::coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/443600 (owner: 10Elukey) [12:20:26] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/11650/" [puppet] - 10https://gerrit.wikimedia.org/r/443595 (owner: 10Elukey) [12:21:06] (03PS1) 10Cmjohnson: Changing DNS es1017 moving to row C [dns] - 10https://gerrit.wikimedia.org/r/443602 (https://phabricator.wikimedia.org/T197072) [12:49:56] !log shutting down es1017 [12:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:35] (03CR) 10Jcrespo: [C: 032] Changing DNS es1017 moving to row C [dns] - 10https://gerrit.wikimedia.org/r/443602 (https://phabricator.wikimedia.org/T197072) (owner: 10Cmjohnson) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180703T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] Present [13:00:22] ERR Too many patches ! :D [13:00:26] ERR? [13:00:33] I scheduled more than 6 of them? :D [13:00:45] I think he's just trolling :) [13:00:47] o/ [13:01:49] Who will be SWATing? [13:01:52] I can SWAT [13:03:04] I am rebasing the patches [13:03:15] (03PS3) 10Hashar: New throttle rule for Wikimania 2018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442268 (https://phabricator.wikimedia.org/T198288) (owner: 10Urbanecm) [13:03:17] (03PS3) 10Hashar: Enable SandboxLink on eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442345 (https://phabricator.wikimedia.org/T198335) (owner: 10Urbanecm) [13:03:19] (03PS4) 10Hashar: Add namespace alias on pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440987 (https://phabricator.wikimedia.org/T197507) (owner: 10Urbanecm) [13:03:21] (03PS5) 10Hashar: Allow bcts on private&fishbowl wikis advanced privilege manipulation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:03:23] (03PS4) 10Hashar: Clean legacy AddGroups/RemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440001 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:03:25] (03PS4) 10Hashar: Some wikis bureacurats are able to grant non-grantable groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) (owner: 10Urbanecm) [13:03:41] hashar: do you want to do the swat? [13:03:57] (I can swat, just checking if you have already started) [13:05:10] zeljkof: well I just rebased the patches. Can you please handle the deployment? [13:05:17] I am messing up with a nasty bug in CI :/ [13:05:24] so yeah, I could use the extra hour! 
[13:06:48] hashar: sure, I'll SWAT [13:08:11] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442268 (https://phabricator.wikimedia.org/T198288) (owner: 10Urbanecm) [13:09:33] (03Merged) 10jenkins-bot: New throttle rule for Wikimania 2018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442268 (https://phabricator.wikimedia.org/T198288) (owner: 10Urbanecm) [13:09:45] (03CR) 10jenkins-bot: New throttle rule for Wikimania 2018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442268 (https://phabricator.wikimedia.org/T198288) (owner: 10Urbanecm) [13:14:18] (03CR) 10Hashar: [C: 031] Enable SandboxLink on eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442345 (https://phabricator.wikimedia.org/T198335) (owner: 10Urbanecm) [13:14:28] (03CR) 10Hashar: [C: 031] Add namespace alias on pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440987 (https://phabricator.wikimedia.org/T197507) (owner: 10Urbanecm) [13:14:46] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:442268|New throttle rule for Wikimania 2018 (T198288)]] (duration: 00m 53s) [13:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:49] T198288: Increase account creation at Wikimania 2018 July 18-22 - https://phabricator.wikimedia.org/T198288 [13:15:21] Urbanecm: 442268 deployed [13:15:25] ack [13:16:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442345 (https://phabricator.wikimedia.org/T198335) (owner: 10Urbanecm) [13:17:58] (03Merged) 10jenkins-bot: Enable SandboxLink on eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442345 (https://phabricator.wikimedia.org/T198335) (owner: 10Urbanecm) [13:19:35] !log Deploy schema change on dbstore1002:s3 T191316 T192926 T89737 T195193 [13:19:38] Urbanecm: 442345 at mwdebug [13:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:19:41] ack [13:19:41] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [13:19:41] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [13:19:42] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [13:19:42] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [13:21:23] (03PS1) 10Marostegui: db-eqiad.php: Change es1017 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443616 (https://phabricator.wikimedia.org/T197072) [13:21:38] zeljkof, working, please push it to the whole universe [13:21:39] (03CR) 10Marostegui: [C: 04-1] "Wait for the rack confirmation from Chris" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443616 (https://phabricator.wikimedia.org/T197072) (owner: 10Marostegui) [13:22:37] Urbanecm: sure [13:23:12] (03PS2) 10Marostegui: db-eqiad.php: Change es1017 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443616 (https://phabricator.wikimedia.org/T197072) [13:24:12] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:442345|Enable SandboxLink on eswikiversity (T198335)]] (duration: 00m 51s) [13:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:16] T198335: Enable Extension:SandboxLink in the Spanish Wikiversity - https://phabricator.wikimedia.org/T198335 [13:25:11] Urbanecm: 442345 deployed [13:25:15] thanks [13:25:55] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440987 (https://phabricator.wikimedia.org/T197507) (owner: 10Urbanecm) [13:27:05] (03Merged) 10jenkins-bot: Add namespace alias on pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440987 (https://phabricator.wikimedia.org/T197507) (owner: 10Urbanecm) [13:28:14] Urbanecm: 440987 at mwdebug 
[13:28:23] ack, testing [13:29:23] working, please deploy [13:29:59] Urbanecm: deploying [13:30:02] ack [13:31:03] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:440987|Add namespace alias on pswikivoyage (T197507)]] (duration: 00m 49s) [13:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:06] T197507: Add namespace alias on ps.wikivoyage - https://phabricator.wikimedia.org/T197507 [13:31:30] Urbanecm: 440987 deployed, just in time, half of the commits, half time [13:31:40] Sounds gread [13:31:42] *great [13:33:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:34:33] (03Merged) 10jenkins-bot: Allow bcts on private&fishbowl wikis advanced privilege manipulation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:35:17] PROBLEM - Device not healthy -SMART- on cp3048 is CRITICAL: cluster=cache_upload device=sdc instance=cp3048:9100 job=node site=esams https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3048&var-datasource=esams%2520prometheus%252Fops [13:36:18] Urbanecm: 440000 at mwdebug [13:36:27] (03CR) 10Herron: "Puppet and nginx were both behaving as they should. 
It's just on the compiler hosts a separate local puppet master and CA are used, so /e" [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:36:42] (03PS4) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [13:37:02] zeljkof, working, please deploy [13:37:12] Urbanecm: deploying [13:38:36] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:440000|Allow bcts on private&fishbowl wikis advanced privilege manipulation (T197024)]] (duration: 00m 50s) [13:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:40] T197024: Allow private&fishbowl wikis bureaucrats to do advanced rights manipulation - https://phabricator.wikimedia.org/T197024 [13:38:46] Urbanecm: 440000 deployed [13:38:57] ack [13:39:55] (03PS2) 10Giuseppe Lavagetto: jobqueue_redis: remove restarting cron in redis [puppet] - 10https://gerrit.wikimedia.org/r/443555 (https://phabricator.wikimedia.org/T191316) [13:40:09] (03PS5) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [13:40:11] (03PS1) 10Jcrespo: mariadb: Increase db1100 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 [13:40:15] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440001 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:41:39] (03Merged) 10jenkins-bot: Clean legacy AddGroups/RemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440001 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:42:54] (03CR) 10Giuseppe Lavagetto: [C: 032] jobqueue_redis: remove restarting cron in redis [puppet] - 10https://gerrit.wikimedia.org/r/443555 (https://phabricator.wikimedia.org/T191316) (owner: 10Giuseppe Lavagetto) [13:43:45] (03CR) 10Jcrespo: [C: 04-1] "Not yet" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 (owner: 10Jcrespo) [13:43:47] Urbanecm: 440001 at mwdebug, please test carefully, this looks like it could break stuff :) [13:45:05] Well, that's hard to test, because it just cleans manual declarations that are replaced by the default settings introduced in 440000. [13:45:54] Urbanecm: please check a few things if they still work :) [13:46:54] Two wikis I checked still works. That's all private wikis I'm allowed to access [13:47:04] Please deploy it [13:47:10] Urbanecm: deploying [13:47:12] ack [13:48:31] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:440001|Clean legacy AddGroups/RemoveGroups (T197024)]] (duration: 00m 50s) [13:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:35] T197024: Allow private&fishbowl wikis bureaucrats to do advanced rights manipulation - https://phabricator.wikimedia.org/T197024 [13:48:56] Urbanecm: 440001 deployed [13:49:00] ack [13:50:35] Urbanecm: something did explode :/ [13:50:46] zeljkof, what explode? [13:50:48] logs are full of "slow query" [13:50:48] *d [13:50:59] not sure if it's related [13:51:01] That might have been me [13:51:07] s7, right? [13:51:10] Private wikis [13:51:20] mmm, that is s3, so then no [13:51:40] (and fishbowl, but no fishbowl wiki is on s7) [13:52:05] We've just disallowed some users to grant some groups, how this can possibly cause "slow query"? [13:52:06] marostegui: I'm looking at these logstash graphs https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Browser_tabs [13:52:14] let me see [13:53:01] I caused some lag on s7 [13:53:03] but it is gone [13:53:05] it was there for a minute [13:53:21] marostegui: this is the last thing I deployed https://gerrit.wikimedia.org/r/#/c/440001/ [13:53:53] yes, mediawiki-errors seems to have calmed down [13:54:22] Yeah, it should be gone now [13:54:32] marostegui: ok to continue with swat? 
[13:54:38] Yeah, from my side [13:54:45] Sorry for the confusion [13:55:06] marostegui: no problem, thanks for letting me know I did not earn the t-shirt :) [13:55:11] :) [13:55:13] (03CR) 10Gehel: convert role::logstash::elasticsearch to profiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441894 (owner: 10EBernhardson) [13:55:55] (03CR) 10jenkins-bot: Enable SandboxLink on eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442345 (https://phabricator.wikimedia.org/T198335) (owner: 10Urbanecm) [13:55:57] (03CR) 10jenkins-bot: Add namespace alias on pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440987 (https://phabricator.wikimedia.org/T197507) (owner: 10Urbanecm) [13:55:59] (03CR) 10jenkins-bot: Allow bcts on private&fishbowl wikis advanced privilege manipulation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:56:03] (03CR) 10jenkins-bot: Clean legacy AddGroups/RemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440001 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:56:05] Urbanecm: with logs back to normal, there is time for the last commit 440002 [13:56:32] That's great. 
Let's do the last commit then [13:57:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) (owner: 10Urbanecm) [13:58:07] (03PS1) 10Marostegui: Revert "es1017.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/443629 [13:58:18] (03Merged) 10jenkins-bot: Some wikis bureacurats are able to grant non-grantable groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) (owner: 10Urbanecm) [13:58:20] (03PS2) 10Marostegui: Revert "es1017.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/443629 [13:59:17] (03CR) 10Marostegui: [C: 032] Revert "es1017.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/443629 (owner: 10Marostegui) [13:59:31] Urbanecm: 440002 at mwdebug [14:00:05] kart_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) ContentTranslation Draft Purge deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180703T1400). [14:00:12] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1017 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443634 [14:00:22] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1017 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443634 [14:00:25] Cannot check, MW ignores such invalid definitions. Please proceed [14:00:37] jynus: I have this: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/443616/ [14:00:49] How do you want to proceed? [14:01:07] zeljkof: SWAT going on? [14:01:32] deploy that [14:01:39] when you can [14:01:43] ok, waiting for SWAT to finish [14:01:45] and I will push [14:03:13] (03PS3) 10Jcrespo: Revert "mariadb: Depool es1017 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443634 [14:03:16] kart_, marostegui: done in a minute [14:03:19] thanks [14:03:30] zeljkof, was 440002 deployed? 
[14:03:41] (03CR) 10jenkins-bot: Some wikis bureacurats are able to grant non-grantable groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) (owner: 10Urbanecm) [14:03:49] marostegui: how much time your deploy will take? [14:04:04] 1 minute :) [14:04:11] (more or less, depending on CI) [14:04:21] OK :) [14:04:28] marostegui: ping me when done. [14:04:33] kart_: will do [14:05:01] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:440002|Some wikis bureacurats are able to grant non-grantable groups (T197026)]] (duration: 00m 51s) [14:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:05] T197026: Bureaucrats are "officially" able to grant&revoke user and autoconfirmed on checkuserwiki and some other wikis - https://phabricator.wikimedia.org/T197026 [14:05:35] Urbanecm: 440002 deployed [14:05:47] Thank you a lot zeljkof ! [14:06:00] !log EU SWAT finished [14:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Change es1017 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443616 (https://phabricator.wikimedia.org/T197072) (owner: 10Marostegui) [14:06:14] kart_, marostegui: I'm done [14:06:22] Waiting for CI to merge! 
[14:07:16] (03Merged) 10jenkins-bot: db-eqiad.php: Change es1017 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443616 (https://phabricator.wikimedia.org/T197072) (owner: 10Marostegui) [14:07:23] sorry for the delay, my wife got stinged by a wasp a few minutes ago, was helping her [14:08:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Change es1017 IP and rack T197072 (duration: 00m 50s) [14:08:30] (03CR) 10Gehel: prometheus/elasticsearch support multiple exporters per host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441321 (owner: 10EBernhardson) [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:32] T197072: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072 [14:08:32] kart_ jynus I am done [14:08:37] !log rolling restart of cassandra on restbase in eqiad to pick up Java security updates [14:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:57] marostegui: thanks. [14:09:04] I'll go ahead with script run. [14:09:06] (03CR) 10jenkins-bot: db-eqiad.php: Change es1017 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443616 (https://phabricator.wikimedia.org/T197072) (owner: 10Marostegui) [14:10:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072 (10Marostegui) The move was successful. 
We need to still repool the server once the buffer pool is ready and then this can be considered done [14:11:14] !log Running ContentTranslation draft purge script [14:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:58] !log OS install on graphite2003 [14:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:16] !log Finished ContentTranslation draft purge script [14:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:27] OK. That was very quick :) [14:13:07] (03CR) 10Gehel: "Wow! That's an interesting patch to review! Amazing job!" [puppet] - 10https://gerrit.wikimedia.org/r/441321 (owner: 10EBernhardson) [14:16:12] ACKNOWLEDGEMENT - Device not healthy -SMART- on cp3048 is CRITICAL: cluster=cache_upload device=sdc instance=cp3048:9100 job=node site=esams Ema The drac apparently showed up as a usb device (sdc) because of a racadm getsel issued by Mark. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3048&var-datasource=esams%2520prometheus%252Fops [14:18:43] (03CR) 10Ema: "noop on upload and misc, text changes here:" [puppet] - 10https://gerrit.wikimedia.org/r/440157 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [14:26:04] (03PS6) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [14:26:06] (03PS2) 10Jcrespo: mariadb: Increase db1100 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 [14:27:03] (03PS7) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [14:27:05] (03PS3) 10Jcrespo: mariadb: Increase db1100 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 [14:28:12] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 (owner: 10Jcrespo) [14:29:35] 
(PS4) Jcrespo: Revert "mariadb: Depool es1017 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/443634 [14:33:49] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es1017 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/443634 (owner: Jcrespo) [14:34:00] (CR) Elukey: "credentials added to the production private repo too!" [puppet] - https://gerrit.wikimedia.org/r/443595 (owner: Elukey) [14:35:08] (Merged) jenkins-bot: Revert "mariadb: Depool es1017 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/443634 (owner: Jcrespo) [14:35:21] RECOVERY - Device not healthy -SMART- on cp3048 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3048&var-datasource=esams%2520prometheus%252Fops [14:35:43] (PS8) Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/443587 [14:37:13] (PS3) Elukey: Explicitly set database password for hive/oozie [puppet] - https://gerrit.wikimedia.org/r/443595 [14:37:54] (CR) Elukey: [C: 2] Explicitly set database password for hive/oozie [puppet] - https://gerrit.wikimedia.org/r/443595 (owner: Elukey) [14:38:04] (CR) jenkins-bot: Revert "mariadb: Depool es1017 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/443634 (owner: Jcrespo) [14:42:02] o/ kart_ how is your slot going?
:) [14:42:23] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1017 with low weight (duration: 00m 50s) [14:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:01] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:45:56] aaha kart_ leszek_wmde looks like the script finished over 30 mins ago :0 [14:46:08] leszek_wmde: looks like we are app preped for our slot then [14:46:34] addshore: i am ready [14:48:28] (03CR) 10Volans: [C: 04-1] "Typo, looks good otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443583 (owner: 10Gehel) [14:49:58] leszek_wmde: awesome! [14:50:57] (03PS2) 10Gehel: wdqs: create log files during log rotation [puppet] - 10https://gerrit.wikimedia.org/r/443583 [14:51:09] (03PS2) 10Addshore: Load WikibaseLexeme on testwiki (again x5(EP5)) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443436 (https://phabricator.wikimedia.org/T197454) [14:51:14] (03CR) 10Volans: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/443583 (owner: 10Gehel) [14:51:18] (03CR) 10Addshore: [C: 032] Load WikibaseLexeme on testwiki (again x5(EP5)) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443436 (https://phabricator.wikimedia.org/T197454) (owner: 10Addshore) [14:51:36] leszek_wmde: ^^ [14:51:59] so, lets merge it and put it on mwdebug1002, then add a statement on testwikidata test it, remove the statement, then do the same for gorup0! [14:52:33] addshore: plan approved [14:52:50] (03Merged) 10jenkins-bot: Load WikibaseLexeme on testwiki (again x5(EP5)) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443436 (https://phabricator.wikimedia.org/T197454) (owner: 10Addshore) [14:53:28] leszek_wmde: right, it is on mwdebug1002 [14:54:57] Looks we are good addshore! [14:55:06] reallly? :D [14:55:08] which page? 
[14:55:09] https://test.wikipedia.org/wiki/Client_test [14:55:30] works like a charm [14:55:55] leszek_wmde: sure we dont want to try any purges etc? ;) [14:56:04] did that already [14:56:10] but feel free to purge all the things! [14:56:22] * addshore looks around for 2 more mins [14:58:21] yup, pretty sure everything looks good [14:59:32] !log addshore@deploy1001 sync-file aborted: T197454 EP5 Enable WikibaseLexeme on clients: testwiki (duration: 00m 09s) [14:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:35] T197454: Deploy WikibaseLexeme to wikidata clients - https://phabricator.wikimedia.org/T197454 [14:59:47] * addshore waits for our slot to actually tick over [15:00:04] addshore: I, the Bot under the Fountain, allow thee, The Deployer, to do Deploy WikibaseLexeme to wikidata clients deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180703T1500). [15:00:14] wheee [15:01:02] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T197454 EP5 Enable WikibaseLexeme on clients: testwiki (duration: 00m 51s) [15:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:19] leszek_wmde: time for the next one! 
[15:01:30] (03PS10) 10Addshore: Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (https://phabricator.wikimedia.org/T197454) [15:01:36] (03CR) 10Addshore: [C: 032] Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (https://phabricator.wikimedia.org/T197454) (owner: 10Addshore) [15:02:07] addshore: cant wait [15:03:08] leszek_wmde: EP5 is so satisfying ( compared with previous episodes) [15:03:35] (03Merged) 10jenkins-bot: Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (https://phabricator.wikimedia.org/T197454) (owner: 10Addshore) [15:04:02] leszek_wmde: right, its on mwdebug1002 for mw.org now [15:04:08] lets test it there too? [15:05:24] hmm, leszek_wmde I am seeing some things in the log, not sure if they are related to this or not [15:05:26] Warning: Class __PHP_Incomplete_Class has no unserializer in /srv/mediawiki/php-1.32.0-wmf.10/vendor/wikibase/data-model/src/Entity/EntityIdValue.php on line 50 [15:05:37] oopsie [15:05:42] Catchable fatal error: Argument 1 passed to Wikibase\Client\Usage\UsageTrackingSnakFormatter::addLabelUsage() must be an instance of Wikibase\DataModel\Entity\EntityId, __PHP_Incomplete_Class given in /srv/mediawiki/php-1.32.0-wmf.10/extensions/Wikibase/client/includes/Usage/UsageTrackingSnakFormatter.php on line 73 [15:06:02] that it is for mw.org? 
[15:06:10] *looks* [15:06:33] this basically means "i got entity revision data from memcached and there is lexeme id in it but I dont know such class" [15:06:40] which wouldnt make much sense really [15:07:35] (03PS9) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [15:07:36] hmm, they are logged as hhvm errors not mediawiki errors so I see not which wiki they come from [15:07:59] hmm, should I add lexeme statement on wikidata.org yet, or do we dig into them? [15:08:01] how many? [15:08:06] https://www.irccloud.com/pastebin/DXoUXbu7/ [15:08:14] leszek_wmde: ^^ i guess if we remove that it will be fine? [15:08:17] getting myself to logstash as well [15:08:29] only a couple [15:08:39] I think they were triggered from my loading our test page not on mwdebug1002 [15:08:40] so this is the example I added while testing [15:08:43] yeah [15:08:46] indeed [15:08:48] let me kill it [15:09:00] removed [15:09:03] did it stop? [15:09:35] Well, it hasn't happened in a few mins now :) think it was my page load, I think we are okay to continue :) [15:09:42] okay [15:09:51] we have high QPS on main metadata servers [15:10:04] 50% increase [15:10:07] * addshore stops what he is doing [15:10:13] jeez, wait I cant edit Q3938 on wikidata without logging in [15:10:14] oops [15:10:14] since 15:02 [15:10:28] metadata servers? being the regular mysql servers? [15:10:30] or? [15:10:32] s* [15:10:40] on all shards? 
O_o [15:11:33] right now, mostly s1 [15:13:04] maybe s4, too, but it is going down [15:13:47] okay, the change I just deployed is only on testwiki, and the one we are looking at right now is group0 and only on mwdebug1002 right now [15:13:50] * addshore waits [15:14:25] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement complete [15:14:27] I think it predates that [15:14:50] 10Operations, 10ops-codfw: db2056: disk with predictive failure - https://phabricator.wikimedia.org/T198048 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement done. [15:15:30] I think things have been quite variable since the morning [15:15:43] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1530026135534&to=1530630935534&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All [15:17:00] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [15:17:09] I think s3 is particularly worrying, actually, https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1530026215478&to=1530631015478&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=All [15:17:23] lots of writes followed by lots of reads [15:17:42] !log reset-failed on ms-be1024 for two user sessions failed when the host was in trouble [15:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:57] jynus: that is indeed a big spike [15:18:04] jynus: codfw - that was probably the schema change [15:18:10] ACKNOWLEDGEMENT - Host ms-be1036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Filippo Giunchedi https://phabricator.wikimedia.org/T196873 [15:18:11] (03PS1) 10Bstorm: labstore: block_sync backup job should email on error only [puppet] - 10https://gerrit.wikimedia.org/r/443643 (https://phabricator.wikimedia.org/T171394) [15:18:22] or is it there in eqiad too? 
[15:18:26] * marostegui in a meeting [15:19:02] on s1, yes: [15:19:11] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1530026347276&to=1530631147276&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All [15:19:21] s1 wasn't touched by a schema change, so that should not be it for it at least [15:19:46] but for the other s3 codfw graph, that matches a schema change on the codfw master with replication [15:20:01] ok, then forgetting about s3 [15:20:05] yep [15:20:17] for s1, however, we are doing 120K QPS sustained [15:20:40] it could be nothing, if those QPS are trivial [15:20:59] but there are spikes too on rows read and written [15:21:55] they are many traffic servers, so not api- either webrequests or the jobqueue [15:22:45] I don't think we have seen such high traffic ever [15:22:52] in terms of db queries [15:23:05] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [15:23:23] that could be related [15:23:50] (03CR) 10jenkins-bot: Load WikibaseLexeme on testwiki (again x5(EP5)) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443436 (https://phabricator.wikimedia.org/T197454) (owner: 10Addshore) [15:23:52] (03CR) 10jenkins-bot: Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (https://phabricator.wikimedia.org/T197454) (owner: 10Addshore) [15:24:06] yeah, I am checking db1089 for instance and it has two big spikes, very weird [15:24:10] looks like the highest QPS for S2 in the last 3 months has been 107 [15:24:12] db1089 is s1 [15:24:18] *S1 [15:24:21] what is db1116? [15:24:40] graphs say it has 268K queries per second [15:24:58] it is not in tendril [15:25:17] * addshore doesnt see db1116 in the mw config [15:25:33] it is a graph issue, maybe, then? [15:25:40] so all false alarm? [15:25:53] you sure db1116? 
it is not anywhere [15:26:26] db1089 does have two spikes in traffic [15:26:40] https://phab.wmfusercontent.org/file/data/5jbnxftorwme5a676ivv/PHID-FILE-lqj7bmslydllqzwarvto/Screenshot_20180703_172619.png [15:26:56] it's on icinga, puppetboard, etc... [15:26:57] I think that is confusing, but there is still something else going on [15:27:23] there is no mysql running on db1116, it is a spare probably [15:27:52] db1116 is a Unused spare system (spare::system) [15:27:52] yes, old temp sanitarium: https://phabricator.wikimedia.org/T196376 [15:27:56] but 89 and 83 and other enwiki are still with high load- not yet too worrying, but 2x the normal load [15:28:15] in puppet it says #Old temporary sanitarium hosts now ready to be productionized [15:28:19] role(spare::system) [15:28:42] yes, but the graphs are still showing an increase on qps, and not due to that host [15:29:40] and objetivized by our independent metrics system [15:30:44] the spikes of db1089 kinda matches: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?orgId=1&from=now-24h&to=now [15:31:53] yeah, I can see the queries, and I don't see any patterns- just the queries normal webrequests of traffic normally do [15:32:51] addshore: continue what you are doing- we are getting a lot of traffic, there is not much we can do [15:32:59] jynus: ack thanks! [15:34:36] Could it be that news are spreading about https://it.wikipedia.org/wiki/Wikipedia:Comunicato_3_luglio_2018 and we are getting more traffic in general? [15:35:03] leszek_wmde: right, let me remember where we were at [15:35:09] leszek_wmde: I think I was just about to sync it to group0 ? 
[15:35:13] ACKNOWLEDGEMENT - HP RAID on db2056 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T198728 [15:35:17] 10Operations, 10ops-codfw: Degraded RAID on db2056 - https://phabricator.wikimedia.org/T198728 (10ops-monitoring-bot) [15:35:19] I think you did already addshore [15:35:23] *looks* [15:35:38] right, its on mwdebug1002 for mw.org now [15:35:41] only for testwiki, doing group0 now :) [15:35:53] that's what you said above [15:36:21] leszek_wmde: syncing [15:37:09] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T197454 EP5 Enable WikibaseLexeme on clients: group0 (duration: 00m 51s) [15:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:13] T197454: Deploy WikibaseLexeme to wikidata clients - https://phabricator.wikimedia.org/T197454 [15:37:38] leszek_wmde: right, time for the next one! [15:37:51] (03PS10) 10Addshore: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) [15:37:54] (03CR) 10Addshore: [C: 032] Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [15:38:28] addshore: shall I add the statement on wikidata.org yet? [15:38:48] or do you want to sync all of clients first? 
[15:38:55] leszek_wmde: I think lets hold off on that until we get this config everywhere now [15:39:00] yup [15:39:27] (03Merged) 10jenkins-bot: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [15:39:55] right, the patch for group1 is on mwdebug1002 now [15:41:19] all looking good, so will sync it [15:41:35] syncing [15:41:40] * leszek_wmde is getting excited [15:41:51] (03PS1) 10C. Scott Ananian: Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) [15:42:22] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T197454 EP5 Enable WikibaseLexeme on clients: group1 (duration: 00m 50s) [15:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] T197454: Deploy WikibaseLexeme to wikidata clients - https://phabricator.wikimedia.org/T197454 [15:42:30] leszek_wmde: great! 
[15:42:41] (03PS10) 10Addshore: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) [15:42:45] (03CR) 10Addshore: [C: 032] Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [15:43:05] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [15:43:42] jynus: fyi ^^ [15:44:00] (03Merged) 10jenkins-bot: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [15:44:38] right leszek_wmde that last patch is on mwdebug1002, lets make sure it feels fine [15:45:09] leszek_wmde: well, enwiki loads fine and lexeme appears in special version [15:45:11] i'll sync [15:45:16] woo [15:45:38] syncing [15:46:00] addshore: thanks, we are commenting that on another channel [15:46:25] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T197454 EP5 Enable WikibaseLexeme on clients: all wikidata clients (duration: 00m 50s) [15:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:36] leszek_wmde: right, that should be everywhere now then [15:46:41] leszek_wmde: so lets test :) [15:46:44] (03CR) 10C. Scott Ananian: "> FTR I approve of splitting this into two parts, PS1 which turns on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442142 (https://phabricator.wikimedia.org/T175706) (owner: 10Subramanya Sastry) [15:47:05] (03PS2) 10C. 
Scott Ananian: Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) [15:47:34] 10Operations, 10ops-codfw: Degraded RAID on db2056 - https://phabricator.wikimedia.org/T198728 (10Marostegui) @Papaul has pulled out the disk and inserted again, let's see what happens. It is weird, as it is a new disk. Could it be the server disk slot? [15:47:39] addshore Lexemes cannot currently be used on Items or Properties due to a bug. See phab:T195611 and phab:T195615 for details. [15:47:40] T195615: handle use of statements linking to Lexemes (and Forms?) more gracefully on client - https://phabricator.wikimedia.org/T195615 [15:47:40] T195611: Internal error when viewing a page (EntityLookupException) - https://phabricator.wikimedia.org/T195611 [15:47:50] fancy removing the abusefilter thing or whatever it was? [15:47:56] oh yes [15:48:11] RECOVERY - Device not healthy -SMART- on db2052 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw%2520prometheus%252Fops [15:48:56] leszek_wmde: it is disabled right now [15:50:21] addshore: editor loads forever... [15:50:45] looks like your edit on wikidata happened though right? [15:51:11] yup [15:52:06] sorry, have to login to edit my user page [15:52:14] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install graphite2003 - https://phabricator.wikimedia.org/T196483 (10Papaul) [15:52:17] leszek_wmde: hehe :) [15:53:25] addshore: all good https://www.mediawiki.org/wiki/User:Leszek_Manicki_(WMDE)/Lexeme [15:53:30] going to enwp [15:53:49] 10Operations, 10ops-codfw: Degraded RAID on db2056 - https://phabricator.wikimedia.org/T198728 (10Papaul) @could be since it is the second disk. let see what happen. [15:56:42] addshore: tested on mw.org and enwp. 
Looks it is all good [15:57:00] sweet [15:57:10] (03PS3) 10Addshore: BETA: Remove wmgUseWikibaseLexeme from IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443429 [15:57:15] (03CR) 10Addshore: [C: 032] BETA: Remove wmgUseWikibaseLexeme from IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443429 (owner: 10Addshore) [15:57:28] leszek_wmde: got this 1 patch for cleaning up the beta config now, but looks like we are all good! [15:57:39] season finale over! [15:57:47] woooo [15:58:42] (03Merged) 10jenkins-bot: BETA: Remove wmgUseWikibaseLexeme from IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443429 (owner: 10Addshore) [16:00:04] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: BETA ONLY (duration: 00m 51s) [16:00:05] godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180703T1600). Please do the needful. [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:06] !log WikibaseLexeme for Wikidata clients, SE01EP05 deploy slot complete [16:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:19] leszek_wmde: all done! woo! only took 5 weeks or something [16:00:58] leszek_wmde: I'll let you deal with https://phabricator.wikimedia.org/T195615 [16:01:06] lovely [16:01:15] thanks sir! [16:02:17] (03PS10) 10Jcrespo: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 [16:02:32] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 (owner: 10Jcrespo) [16:02:36] addshore: so the abusefilter thing is off now for good? So I dont say something on the ticket which is not true [16:02:43] yup its off! 
[16:03:58] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 (owner: 10Jcrespo) [16:05:42] (03CR) 10jenkins-bot: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [16:05:44] (03CR) 10jenkins-bot: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [16:05:46] (03CR) 10jenkins-bot: BETA: Remove wmgUseWikibaseLexeme from IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443429 (owner: 10Addshore) [16:05:48] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1100 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443587 (owner: 10Jcrespo) [16:09:12] RECOVERY - Device not healthy -SMART- on db2056 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2056&var-datasource=codfw%2520prometheus%252Fops [16:15:55] (03CR) 10Alex Monk: "Yeah I know that, I was referring to Alexandros' comment above." [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:19:56] (03CR) 10Jforrester: "You're not meant to have patches that touch both CS and IS any more (as we do `sync-file`. Instead, do this as two patches, where the firs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. 
Scott Ananian) [16:23:31] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 with low weight (duration: 00m 50s) [16:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:36] (03PS4) 10Jcrespo: mariadb: Increase db1100 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 [16:30:20] (03PS1) 10Jcrespo: mariadb: Increase es1017 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443650 (https://phabricator.wikimedia.org/T197072) [16:30:42] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450 (10Legoktm) Something is still wrong, unclear if it's related to mcrouter or not. e.g. on https://test.wikipedia.org/wiki/Special:Gadgets a bunch of messages... [16:32:47] (03CR) 10Andrew Bogott: [C: 031] labstore: block_sync backup job should email on error only [puppet] - 10https://gerrit.wikimedia.org/r/443643 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [16:36:12] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:37:43] (03CR) 10Jcrespo: [C: 032] mariadb: Increase db1100 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 (owner: 10Jcrespo) [16:39:06] (03Merged) 10jenkins-bot: mariadb: Increase db1100 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 (owner: 10Jcrespo) [16:39:18] (03CR) 10jenkins-bot: mariadb: Increase db1100 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443623 (owner: 10Jcrespo) [16:40:08] (03PS2) 10Jcrespo: mariadb: Increase es1017 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443650 
(https://phabricator.wikimedia.org/T197072) [16:40:14] (03CR) 10Muehlenhoff: [C: 031] DataTables: save state for the session [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [16:40:21] (03CR) 10Jcrespo: [C: 032] mariadb: Increase es1017 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443650 (https://phabricator.wikimedia.org/T197072) (owner: 10Jcrespo) [16:40:22] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:41:47] (03Merged) 10jenkins-bot: mariadb: Increase es1017 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443650 (https://phabricator.wikimedia.org/T197072) (owner: 10Jcrespo) [16:41:58] (03CR) 10Anomie: [C: 031] "Looks good now, and works in local testing on terbium." [puppet] - 10https://gerrit.wikimedia.org/r/441153 (owner: 10Tim Starling) [16:43:45] (03CR) 10jenkins-bot: mariadb: Increase es1017 weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443650 (https://phabricator.wikimedia.org/T197072) (owner: 10Jcrespo) [16:45:49] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase es1017,db1100 weights (duration: 00m 50s) [16:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:43] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:48:06] (03PS7) 10Muehlenhoff: webperf: Make performance::site apache config more dynamic [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [16:49:05] (03CR) 10Muehlenhoff: [C: 032] webperf: Make 
performance::site apache config more dynamic [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [16:49:45] (03CR) 10Bstorm: [C: 032] labstore: block_sync backup job should email on error only [puppet] - 10https://gerrit.wikimedia.org/r/443643 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [16:50:04] (03PS2) 10Bstorm: labstore: block_sync backup job should email on error only [puppet] - 10https://gerrit.wikimedia.org/r/443643 (https://phabricator.wikimedia.org/T171394) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180703T1700). [17:00:28] (03PS1) 10Ottomata: Set spark2 spark.sql.catalogImplementation=hive if hive enabled [puppet] - 10https://gerrit.wikimedia.org/r/443658 (https://phabricator.wikimedia.org/T190443) [17:03:44] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:04:18] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11654/stat1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/443658 (https://phabricator.wikimedia.org/T190443) (owner: 10Ottomata) [17:04:20] (03CR) 10Ottomata: [C: 032] Set spark2 spark.sql.catalogImplementation=hive if hive enabled [puppet] - 10https://gerrit.wikimedia.org/r/443658 (https://phabricator.wikimedia.org/T190443) (owner: 10Ottomata) [17:10:55] (03PS2) 10Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) [17:13:55] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:15:04] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:20:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:20:34] PROBLEM - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[docker-registry.discovery.wmnet/calico/node] [17:25:54] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:31:10] (03CR) 10Mobrovac: [C: 031] "LGTM. 
Mental note: we will also need to rework the hieradata for deployment-prep in puppet prefix once this is merged (and figure out what" [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [17:43:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:48:11] (03PS2) 10Herron: gerrit: use localhost exim as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) [17:48:54] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:49:51] (03CR) 10Herron: "planning to merge this and restart gerrit on cobalt in about 15 minutes" [puppet] - 10https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [18:01:28] (03PS1) 1020after4: phabricator: refactor preamble.php to separate unrelated functionality. [puppet] - 10https://gerrit.wikimedia.org/r/443665 [18:02:00] (03CR) 10jerkins-bot: [V: 04-1] phabricator: refactor preamble.php to separate unrelated functionality. [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [18:03:09] (03PS2) 1020after4: phabricator: refactor preamble.php to separate unrelated functionality. 
[puppet] - 10https://gerrit.wikimedia.org/r/443665 [18:07:33] (03CR) 10Herron: [C: 032] gerrit: use localhost exim as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [18:09:44] !log cobalt:~# systemctl restart gerrit [18:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:02] (03PS1) 10Herron: testing notification [puppet] - 10https://gerrit.wikimedia.org/r/443666 [18:11:53] (03PS3) 1020after4: phabricator: refactor preamble.php to separate unrelated functionality. [puppet] - 10https://gerrit.wikimedia.org/r/443665 [18:13:15] (03CR) 10Herron: [C: 032] "test message is looking good:" [puppet] - 10https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [18:14:06] (03Abandoned) 10Herron: testing notification [puppet] - 10https://gerrit.wikimedia.org/r/443666 (owner: 10Herron) [18:14:10] (03PS1) 10Mholloway: Use backports version of osm2pgsql on Stretch for improved memory handling [puppet] - 10https://gerrit.wikimedia.org/r/443668 (https://phabricator.wikimedia.org/T198485) [18:14:50] (03CR) 1020after4: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/11657/" [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [18:16:28] (03PS2) 10Herron: puppet-agent: remove --show_diff from scheduled puppet-run script [puppet] - 10https://gerrit.wikimedia.org/r/434719 (https://phabricator.wikimedia.org/T1) [18:17:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:18:03] (03PS2) 10Anomie: Move CLI overrides after InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) [18:18:25] 
RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:18:50] (03CR) 10Anomie: [C: 032] "Merging for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) (owner: 10Anomie) [18:18:56] _joe_: this patch fixes the preamble included in redirector problem in phabricator puppet module: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443665/ (and also fixes a regression with X_FORWARDED_FOR handling) [18:20:06] (03Merged) 10jenkins-bot: Move CLI overrides after InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) (owner: 10Anomie) [18:20:18] (03CR) 10jenkins-bot: Move CLI overrides after InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) (owner: 10Anomie) [18:21:49] !log anomie@deploy1001 Synchronized wmf-config/CommonSettings.php: Fix for T197475 ([[gerrit:440543]]) (duration: 00m 52s) [18:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:53] T197475: Wikimedia: Command-line scripts are saying to set $wgShowExceptionDetails - https://phabricator.wikimedia.org/T197475 [18:25:10] (03CR) 10Paladox: "Maybe move the php script into the deployment repo? Or phab repo?" [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [18:29:15] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:34:45] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:42:14] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[jupyterhub_create_virtualenv] [18:43:33] (03PS6) 10Volans: wmf-auto-reimage: validate certificate fingerprint [puppet] - 10https://gerrit.wikimedia.org/r/433928 [18:43:35] (03PS4) 10Volans: wmf-auto-reimage: improve donwtime of reimaged host [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) [18:43:37] (03PS2) 10Volans: wmf-auto-reimage: use absolute path for subprocess [puppet] - 10https://gerrit.wikimedia.org/r/434896 [18:43:39] (03PS1) 10Volans: wmf-auto-reimage: fix parse argument bug [puppet] - 10https://gerrit.wikimedia.org/r/443670 [18:43:42] (03PS1) 10Volans: wmf-auto-reimage: use warning log level [puppet] - 10https://gerrit.wikimedia.org/r/443671 [18:46:45] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,podsandbox_status,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:51:25] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:57:45] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:59:05] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [19:12:44] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:15:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:22:24] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 376.04 seconds Jcrespo backups running [19:22:53] 10Operations, 10Maps, 10Maps-Sprint: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mholloway) [19:27:25] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:38:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:39:34] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:43:41] (03CR) 1020after4: [C: 031] "@paladox: yeah I thought about it - it's a tossup between ease for me to do it vs. ease for ops to change things without deploying phab co" [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [19:44:03] (03CR) 10Paladox: [C: 031] phabricator: refactor preamble.php to separate unrelated functionality. 
[puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [19:46:29] (03CR) 10Paladox: [C: 031] phabricator: refactor preamble.php to separate unrelated functionality. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [19:49:29] (03CR) 1020after4: [C: 031] phabricator: refactor preamble.php to separate unrelated functionality. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [19:51:09] (03PS4) 1020after4: phabricator: refactor preamble.php to separate unrelated functionality. [puppet] - 10https://gerrit.wikimedia.org/r/443665 [19:51:33] (03CR) 1020after4: [C: 031] phabricator: refactor preamble.php to separate unrelated functionality. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [19:56:16] (03PS5) 1020after4: phabricator: refactor preamble.php to separate unrelated functionality. [puppet] - 10https://gerrit.wikimedia.org/r/443665 [20:10:09] (03PS1) 10Sbisson: Enable ORES wp10, draftquality on draft ns (118) for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443686 [20:12:08] (03CR) 10Paladox: [C: 031] phabricator: refactor preamble.php to separate unrelated functionality. [puppet] - 10https://gerrit.wikimedia.org/r/443665 (owner: 1020after4) [20:12:44] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:18:14] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:21:25] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:24:44] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:30:14] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:31:14] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [20:40:48] (03CR) 10Catrope: [C: 031] Enable ORES wp10, draftquality on draft ns (118) for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443686 (owner: 10Sbisson) [20:46:44] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:47:45] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:49:27] (03PS1) 10Ottomata: Install new venv with cwd set to deploy path [puppet] - 10https://gerrit.wikimedia.org/r/443735 (https://phabricator.wikimedia.org/T190443) [20:50:41] (03CR) 10Ottomata: [C: 032] Install new venv with cwd set to deploy path [puppet] - 10https://gerrit.wikimedia.org/r/443735 (https://phabricator.wikimedia.org/T190443) (owner: 10Ottomata) [20:55:33] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10Jgreen) Looks like .4, .9, and .15 are available. .9 was tellurium and still has crufty DNS so my suggestion is we use that, and clean up the cruft in the process. 
[20:56:43] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10Jgreen) Looks like .4, .9, and .15 are available. .9 was tellurium and still has crufty DNS so my suggestion is we use that, and clean up the cruft in the process. [20:58:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:59:44] 10Operations: Replace logo of the Spanish Wikipedia, which joins the protests against the European copyright directive proposal - https://phabricator.wikimedia.org/T198761 (10abian) [20:59:58] 10Operations: Replace logo of the Spanish Wikipedia, which joins the protests against the European copyright directive proposal - https://phabricator.wikimedia.org/T198761 (10abian) [21:04:29] 10Operations, 10Spanish-Sites: Replace logo of the Spanish Wikipedia, which joins the protests against the European copyright directive proposal - https://phabricator.wikimedia.org/T198761 (10abian) [21:05:22] (03PS1) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [21:05:28] (03PS3) 10C. Scott Ananian: Give a name to en-rtl wiki in Special:SiteMatrix in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443475 (https://phabricator.wikimedia.org/T195675) [21:06:03] (03CR) 10jerkins-bot: [V: 04-1] webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [21:09:44] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:12:09] (03PS2) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [21:21:18] (03PS3) 10Krinkle: webperf: Get graphite_host for coal::processor from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) [21:21:56] (03PS3) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [21:22:06] (03PS4) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [21:22:28] (03PS5) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [21:33:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:39:14] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:44:34] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:45:17] (03PS1) 10Urbanecm: Change logo for eswiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) [21:46:28] (03PS2) 10Catrope: Enable ORES wp10, draftquality on draft ns (118) for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443686 (https://phabricator.wikimedia.org/T198768) (owner: 10Sbisson) [21:46:37] (03PS3) 10Catrope: Enable ORES wp10, draftquality on draft ns (118) for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443686 (https://phabricator.wikimedia.org/T198768) (owner: 10Sbisson) [21:50:04] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:55:34] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:01:04] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:05:11] 10Puppet, 10Analytics, 10Cassandra, 10RESTBase-Cassandra, and 2 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10Eevans) AQS has been partially upgraded: ```name="prometheus-jmx-exporter" aqs1005.eqiad.wmnet: Installed: 1:0.3.0-1 aqs1008.eqi... 
[22:06:38] 10Puppet, 10Analytics, 10Cassandra, 10RESTBase-Cassandra, and 2 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10Eevans) [22:06:58] 10Puppet, 10Analytics, 10Cassandra, 10monitoring, 10Services (watching): Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10Eevans) [22:07:44] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:09:06] AaronSchulz addshore any insight into this RevisionAccessException https://logstash.wikimedia.org/goto/22b29f95b2c5e365115511a18e5847ae ? [22:09:08] thx in advance! [22:10:00] AndyRussG: I don't see the exception at that link [22:10:05] do you see it at this one? https://logstash.wikimedia.org/goto/610270aa608bafe1a8885e715135bbfd [22:10:24] ejegg: nope [22:10:48] oh nvm, I do see it down near the bottom of that lsit [22:10:50] sorry [22:11:41] anyone familiar with multi-content revision work might be able to help with ^^^ [22:12:02] ah okok [22:12:03] seems to be going through some of the MCS work from the past 6 months [22:12:34] The underlying error is a failure to load a blob from address tt: [22:12:35] (ejegg: yeah I think what u needed to change in Kibana is make the time frame absolute. Otherwise the shared link goes to the last 15 min) [22:13:14] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:13:39] ok, let's try to replicate this locally [22:14:00] adding a banner with translatable variables [22:14:24] yurp [22:18:35] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Current state and next steps for RESTBase storage - https://phabricator.wikimedia.org/T152724 (10Eevans) 05Open>03Resolved a:03Eevans I believe this was completed by virtue of the new storage strategy implementation. [22:19:45] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:23:03] ejegg: Seddon I'm able to reproduce it on the beta cluster [22:25:24] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:28:39] 10Operations, 10Cassandra, 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10Eevans) @fgiunchedi, is this still a thing? [22:35:15] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:36:24] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:50:21] (03PS6) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [22:58:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:59:34] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180703T2300). [23:00:05] MaxSem, Niharika, and Urbanecm: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:22] Here [23:00:46] (03CR) 10Smalyshev: [C: 031] wdqs: create log files during log rotation [puppet] - 10https://gerrit.wikimedia.org/r/443583 (owner: 10Gehel) [23:01:51] I'll deploy my patch [23:08:30] !log maxsem@deploy1001 Synchronized php-1.32.0-wmf.10/extensions/CodeMirror: SWAT https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CodeMirror/+/443743/ (duration: 00m 52s) [23:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:01] (03PS4) 10Krinkle: webperf: Get graphite_host for coal::processor from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) [23:09:24] (03CR) 10jerkins-bot: [V: 04-1] webperf: Get graphite_host for coal::processor from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [23:09:33] MaxSem: Will you deploy my patch as well? [23:10:29] sorry, I'm in a meeting :] [23:11:08] Ok then. Somebody else around? [23:12:39] (03PS11) 10Krinkle: webperf: Add statsv, navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) [23:12:41] (03PS5) 10Krinkle: webperf: Get graphite_host for coal::processor from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) [23:12:43] (03PS7) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [23:12:45] (03PS1) 10Krinkle: webperf: Rename webperf::processors_and_site profile to webperf::processors [puppet] - 10https://gerrit.wikimedia.org/r/443752 [23:14:01] Krinkle: Can you deploy a patch for me? [23:15:10] I don't usually operate SWAT, but if nobody else around I could deploy something if needed. In an hour or so? 
[23:15:23] Swar t [23:15:39] SWAT started 15 minutes ago [23:16:07] I understand, but it seems there is no operator. [23:17:30] Yup. I probably didn't catch what you meant by "in an hour or so". Do you plan to deploy it outside the window? [23:17:44] Yes. [23:18:04] I don't plan yet, but if in an hour no swatter has been able to deploy your patch, I'll have time then. [23:18:33] O [23:18:50] Ok then. Will you need me for that purpose? [23:19:37] (I plan to go to sleep, as it is an hour past midnight for me :) ) [23:23:03] https://phabricator.wikimedia.org/T198761 [23:23:13] Anyone developing this? :/ [23:23:25] *deplying, I don't know what I say [23:23:33] *deploying [23:23:40] ¬¬ [23:24:55] Abian: No SWATter has arrived. Krinkle might deploy it after an hour, provided nobody from the SWAT team arrives in that time. [23:29:47] Okay, I'll be around for half an hour or so [23:30:04] The blackout is planned for 00:00 [23:31:41] Sadly, when nobody with the buttons is here, no plan helps :/ [23:33:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:39:05] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:45:45] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:51:24] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:57:44] (03PS2) 10Krinkle: webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 [23:58:42] (03PS3) 10Krinkle: webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) [23:58:55] (03PS4) 10Krinkle: webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314)