[00:03:43] (PS1) Faidon Liambotis: autoinstall: add working support for EFI [puppet] - https://gerrit.wikimedia.org/r/512787 (https://phabricator.wikimedia.org/T93208)
[00:04:06] (PS2) Faidon Liambotis: autoinstall: add support for EFI [puppet] - https://gerrit.wikimedia.org/r/512787 (https://phabricator.wikimedia.org/T93208)
[00:05:47] (CR) Faidon Liambotis: [C: +2] autoinstall: add support for EFI [puppet] - https://gerrit.wikimedia.org/r/512787 (https://phabricator.wikimedia.org/T93208) (owner: Faidon Liambotis)
[00:07:47] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:10:42] Operations: (U)EFI support - https://phabricator.wikimedia.org/T93208 (Maintenance_bot)
[00:14:19] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 855.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:16:37] (PS1) Andrew Bogott: keystone: add firewall rules to acess from the nova controller [puppet] - https://gerrit.wikimedia.org/r/512788 (https://phabricator.wikimedia.org/T223905)
[00:18:32] (CR) Andrew Bogott: [C: +2] keystone: add firewall rules to acess from the nova controller [puppet] - https://gerrit.wikimedia.org/r/512788 (https://phabricator.wikimedia.org/T223905) (owner: Andrew Bogott)
[00:21:25] (PS1) Andrew Bogott: keystone: make the api service active on both controller nodes [puppet] - https://gerrit.wikimedia.org/r/512789 (https://phabricator.wikimedia.org/T223905)
[00:36:22] Operations, Cassandra, RESTBase, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (Eevans)
[00:37:25] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational
[00:38:00] !log decommissioning restbase1014-a -- T223976
[00:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:06] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[00:40:24] (PS1) Faidon Liambotis: autoinstall: kill trusty-installer [puppet] - https://gerrit.wikimedia.org/r/512790
[00:40:26] (PS1) Faidon Liambotis: autoinstall: remove ixgbe.allow_unsupported_sfp=1 [puppet] - https://gerrit.wikimedia.org/r/512791
[00:40:29] (PS1) Faidon Liambotis: autoinstall: add an efi-stretch-installer variant [puppet] - https://gerrit.wikimedia.org/r/512792
[00:42:00] (CR) Faidon Liambotis: [C: +2] autoinstall: kill trusty-installer [puppet] - https://gerrit.wikimedia.org/r/512790 (owner: Faidon Liambotis)
[00:42:41] (PS2) Faidon Liambotis: autoinstall: kill trusty-installer [puppet] - https://gerrit.wikimedia.org/r/512790
[00:42:43] (PS2) Faidon Liambotis: autoinstall: remove ixgbe.allow_unsupported_sfp=1 [puppet] - https://gerrit.wikimedia.org/r/512791
[00:42:45] (PS2) Faidon Liambotis: autoinstall: add an efi-stretch-installer variant [puppet] - https://gerrit.wikimedia.org/r/512792
[00:44:02] (CR) Faidon Liambotis: [C: +2] autoinstall: remove ixgbe.allow_unsupported_sfp=1 [puppet] - https://gerrit.wikimedia.org/r/512791 (owner: Faidon Liambotis)
[00:44:08] (CR) Faidon Liambotis: [C: +2] autoinstall: add an efi-stretch-installer variant [puppet] - https://gerrit.wikimedia.org/r/512792 (owner: Faidon Liambotis)
[00:53:45] Operations: (U)EFI support - https://phabricator.wikimedia.org/T93208 (faidon) Open→Resolved a:faidon OK, a few changes later, and we have a working EFI install in a VM (d-i-test) \o/ Everything should work. I even converted our `flat.cfg` partman recipe to work on both EFI & BIOS at the same ti...
[01:33:19] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:32:45] PROBLEM - Check the last execution of refinery-drop-query-clicks on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-query-clicks
[03:33:09] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:36:11] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (20576 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:22:21] (CR) Mobrovac: "@Volans, thnx! I think it's better to go with PS1 - we want to know as soon as $something is wrong with the hosts" [puppet] - https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: Volans)
[05:38:34] Operations, Dumps-Generation: Reboot dumps/snapshot hosts - https://phabricator.wikimedia.org/T223962 (ArielGlenn)
[05:39:25] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (20576 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:51:26] (PS1) Giuseppe Lavagetto: mediawiki::php: add total hits to the exporter [puppet] - https://gerrit.wikimedia.org/r/512801
[05:53:56] (PS2) Giuseppe Lavagetto: mediawiki::php: add total hits to the exporter [puppet] - https://gerrit.wikimedia.org/r/512801
[05:53:58] (PS1) Smalyshev: Enable wgSpecialSearchFormOptions on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/512802 (https://phabricator.wikimedia.org/T55652)
[06:12:08] (CR) ArielGlenn: [C: +1] mediawiki::php: add total hits to the exporter [puppet] - https://gerrit.wikimedia.org/r/512801 (owner: Giuseppe Lavagetto)
[06:16:14] (PS1) Effie Mouzeli: Send 20% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/512803 (https://phabricator.wikimedia.org/T219150)
[06:30:39] <_joe_> uhm some slowness on restbase
[06:32:09] (CR) Giuseppe Lavagetto: [C: +1] Send 20% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/512803 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[06:32:53] (PS3) Giuseppe Lavagetto: mediawiki::php: add total hits to the exporter [puppet] - https://gerrit.wikimedia.org/r/512801
[06:35:50] (CR) Effie Mouzeli: [C: +2] Send 20% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/512803 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[06:36:53] (Merged) jenkins-bot: Send 20% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/512803 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[06:37:08] (CR) jenkins-bot: Send 20% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/512803 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[06:38:11] Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active
[06:38:21] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: add total hits to the exporter [puppet] - https://gerrit.wikimedia.org/r/512801 (owner: Giuseppe Lavagetto)
[06:40:54] !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Send 20% of anonymous users to PHP7.2 - T219150 (duration: 00m 51s)
[06:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:00] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150
[06:47:33] (PS1) Giuseppe Lavagetto: mediawiki::php: restore the apc size to 512M [puppet] - https://gerrit.wikimedia.org/r/512805
[06:48:22] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: restore the apc size to 512M [puppet] - https://gerrit.wikimedia.org/r/512805 (owner: Giuseppe Lavagetto)
[07:02:36] !log decommission restbase1014-b -- T223976
[07:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:41] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[07:04:11] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:07:51] Operations, serviceops, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)
[07:14:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:17:10] !log uploaded ffmpeg 3.2.14-1~deb9u1+wmf1 to component/vp9 of stretch-wikimedia (rebase of our vp9-row-mt backport to the latest stretch-security ffmpeg update)
[07:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:18] (PS2) Vgutierrez: ATS: Ensure proper permissions for ATS layouts [puppet] - https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217)
[07:40:20] (PS62) Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[07:40:29] (PS1) Vgutierrez: ATS: Avoid using traffic_layout [puppet] - https://gerrit.wikimedia.org/r/512855 (https://phabricator.wikimedia.org/T224428)
[07:42:05] (CR) jerkins-bot: [V: -1] ATS: Avoid using traffic_layout [puppet] - https://gerrit.wikimedia.org/r/512855 (https://phabricator.wikimedia.org/T224428) (owner: Vgutierrez)
[07:46:59] (PS2) Vgutierrez: ATS: Avoid using traffic_layout [puppet] - https://gerrit.wikimedia.org/r/512855 (https://phabricator.wikimedia.org/T224428)
[07:47:01] (PS63) Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[07:47:23] good morning :) cutting the branch https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_the_new_branch_in_Gerrit
[07:47:39] (CR) Effie Mouzeli: [C: +1] restbase: add team-services to service::node alert [puppet] - https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: Volans)
[07:50:58] (PS1) Muehlenhoff: Add a pbuilder hook for the VP9 component (used on the video scalers) [puppet] - https://gerrit.wikimedia.org/r/512857
[07:52:50] (PS3) Mathew.onipe: Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[07:53:09] (CR) Mathew.onipe: Add maps reboot cookbook (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[07:54:30] (CR) jerkins-bot: [V: -1] Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[07:55:16] (CR) Volans: [C: +1] "LGTM, we should probably add a TODO somewhere to generalize the addition of a component, clearly we're using them more and more :)" [puppet] - https://gerrit.wikimedia.org/r/512857 (owner: Muehlenhoff)
[07:55:24] (CR) Vgutierrez: [C: +1] "the layout generated with this commit satisfies the checks performed by:" [puppet] - https://gerrit.wikimedia.org/r/512855 (https://phabricator.wikimedia.org/T224428) (owner: Vgutierrez)
[07:55:56] (PS4) Mathew.onipe: Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[07:58:07] (CR) Muehlenhoff: [C: +2] Add a pbuilder hook for the VP9 component (used on the video scalers) [puppet] - https://gerrit.wikimedia.org/r/512857 (owner: Muehlenhoff)
[08:00:57] Operations, ops-codfw, DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (jcrespo) Power drain and firmware upgrade, please (T216240), at least.
[08:02:38] (CR) Vgutierrez: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:02:52] \o/
[08:09:52] (PS1) Alex Monk: beta: dont include scap::scripts twice [puppet] - https://gerrit.wikimedia.org/r/512859
[08:15:18] (PS1) Volans: tests: fix caplog matching [software/spicerack] - https://gerrit.wikimedia.org/r/512861
[08:21:17] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[08:22:05] looks like I broke it :/
[08:22:09] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[08:24:26] gerrit ui is timing out
[08:24:29] is this a known issue?
[08:24:43] yes we're on it
[08:29:34] !log restarting gerrit due to stack threads - T224448
[08:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:40] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448
[08:31:49] it should be back
[08:31:57] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.062 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[08:32:31] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27504 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[08:34:09] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 3 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page]
[08:34:15] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[08:34:35] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[08:35:11] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars]
[08:35:19] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[08:35:59] ^ that is probably due to gerrit's short outage
[08:35:59] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[08:36:01] PROBLEM - puppet last run on schema1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[08:36:13] (PS1) Vgutierrez: debian: Add release 0.17 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/512866 (https://phabricator.wikimedia.org/T220518)
[08:36:41] Operations, ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (elukey) This host is currently part of the Hadoop testing cluster that uses old/to-be-decommed nodes, really sorry for this noise. I have put a request for new (not OOW) hardware for next fiscal for a new t...
[08:36:43] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[08:37:41] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars]
[08:38:07] PROBLEM - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[08:38:37] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[08:39:11] (CR) Effie Mouzeli: "After discussing on IRC, we decided to have a go at PS1 and see how it goes" [puppet] - https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: Volans)
[08:39:37] those failures are expected, forcing a puppet run
[08:40:25] !log T224448 sudo cumin -b 15 -p 95 'R:git::clone' 'run-puppet-agent -q --failed-only'
[08:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:30] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448
[08:41:21] (CR) Vgutierrez: [C: +2] debian: Add release 0.17 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/512866 (https://phabricator.wikimedia.org/T220518) (owner: Vgutierrez)
[08:42:31] although if I don't write 'y' and press enter that will not happen :D
[08:43:03] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:43:29] RECOVERY - puppet last run on labsdb1012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:43:52] (CR) jenkins-bot: debian: Add release 0.17 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/512866 (https://phabricator.wikimedia.org/T220518) (owner: Vgutierrez)
[08:43:59] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:44:38] Operations, Performance-Team, serviceops: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (jijiki)
[08:44:57] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:45:05] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:45:21] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:45:59] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:46:07] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:46:47] RECOVERY - puppet last run on schema1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:46:47] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:46:58] (PS4) Volans: restbase: add team-services to all Icinga alerts [puppet] - https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406)
[08:47:28] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/512861 (owner: Volans)
[08:47:31] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:47:47] !log uploaded acme-chief 0.17 to apt.wikimedia.org (buster) - T220518 T213820
[08:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:53] T220518: acme-chief: Validate that configured certificates can be actually issued - https://phabricator.wikimedia.org/T220518
[08:47:54] T213820: certcentral is incompatible with the current python3-acme version shipped in stretch-backports - https://phabricator.wikimedia.org/T213820
[08:49:41] (CR) Volans: "As agreed restored PS1, compiler:" [puppet] - https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: Volans)
[08:51:13] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.4386 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:52:20] !log uploaded ffmpeg 3.2.14-1~deb9u1+wmf3 to component/vp9 of stretch-wikimedia (rebase of our vp9-row-mt backport to the latest stretch-security ffmpeg update)
[08:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:38] (CR) Mobrovac: [C: +1] restbase: add team-services to all Icinga alerts [puppet] - https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: Volans)
[08:52:45] !log jiji@deploy1001 Started deploy [cpjobqueue/deploy@04cc66d]: Migrating ORESFetchScoresJob to PHP7 - T219148
[08:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:50] T219148: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148
[08:53:42] (CR) Volans: [C: +2] tests: fix caplog matching [software/spicerack] - https://gerrit.wikimedia.org/r/512861 (owner: Volans)
[08:53:53] (CR) Volans: [C: +2] restbase: add team-services to all Icinga alerts [puppet] - https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: Volans)
[08:54:03] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[08:54:06] !log jiji@deploy1001 Finished deploy [cpjobqueue/deploy@04cc66d]: Migrating ORESFetchScoresJob to PHP7 - T219148 (duration: 01m 21s)
[08:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:59] Operations, Icinga, Core Platform Team Backlog (Watching / External), Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (Volans) Open→Resolved And with the above patch merged it should all be resolved. Reopen if needed.
[08:57:28] (Merged) jenkins-bot: tests: fix caplog matching [software/spicerack] - https://gerrit.wikimedia.org/r/512861 (owner: Volans)
[08:58:25] (CR) jenkins-bot: tests: fix caplog matching [software/spicerack] - https://gerrit.wikimedia.org/r/512861 (owner: Volans)
[08:58:37] !log rebooting wdqs nodes for kernel upgrade
[08:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:46] !log installing ffmpeg security updates
[09:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:25] PROBLEM - HP RAID on db2035 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:08:27] ACKNOWLEDGEMENT - HP RAID on db2035 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T224456 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:08:31] Operations, ops-codfw: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T224456 (ops-monitoring-bot)
[09:10:44] Operations, Acme-chief, Traffic, HTTPS: acme-chief: Validate that configured certificates can be actually issued - https://phabricator.wikimedia.org/T220518 (Maintenance_bot)
[09:11:04] Operations, ops-codfw, DBA: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T224456 (Volans) p:Triage→Normal
[09:11:07] Operations, Performance-Team, serviceops: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (jijiki) p:Triage→Normal
[09:11:15] (PS1) Vgutierrez: acme_chief: Enable SNI prevalidation for non-canonical certificates [puppet] - https://gerrit.wikimedia.org/r/512871 (https://phabricator.wikimedia.org/T220518)
[09:14:32] (CR) Volans: "rebasing to get CI fix" [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[09:14:35] (PS8) Volans: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[09:16:43] Operations, ops-codfw, DBA: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T224456 (jcrespo) a:Papaul Please change it with a spare when possible.
[09:18:03] (PS2) Vgutierrez: acme_chief: Enable SNI prevalidation for non-canonical certificates [puppet] - https://gerrit.wikimedia.org/r/512871 (https://phabricator.wikimedia.org/T220518)
[09:18:12] Operations, serviceops, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)
[09:21:44] (CR) Vgutierrez: [C: +1] "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1001/16782/" [puppet] - https://gerrit.wikimedia.org/r/512871 (https://phabricator.wikimedia.org/T220518) (owner: Vgutierrez)
[09:24:21] Operations, Performance-Team, serviceops: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) Couple of notes: * We'd need to write a meaningful runbook to instruct people what metrics to check (mcrouter, redis, etc..) * Refactor https://grafana.wikimedia...
[09:24:39] Operations, Performance-Team, serviceops, User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey)
[09:25:22] (PS3) Mobrovac: Allow MW to honour the X-Request-Id header if set [mediawiki-config] - https://gerrit.wikimedia.org/r/510796 (https://phabricator.wikimedia.org/T201409)
[09:27:12] * mobrovac taking over deploy1001 for wmfconfig for 5 mins
[09:28:39] (CR) Mobrovac: [C: +2] Allow MW to honour the X-Request-Id header if set [mediawiki-config] - https://gerrit.wikimedia.org/r/510796 (https://phabricator.wikimedia.org/T201409) (owner: Mobrovac)
[09:28:46] !log installing php5 security updates
[09:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:42] (Merged) jenkins-bot: Allow MW to honour the X-Request-Id header if set [mediawiki-config] - https://gerrit.wikimedia.org/r/510796 (https://phabricator.wikimedia.org/T201409) (owner: Mobrovac)
[09:29:58] (CR) jenkins-bot: Allow MW to honour the X-Request-Id header if set [mediawiki-config] - https://gerrit.wikimedia.org/r/510796 (https://phabricator.wikimedia.org/T201409) (owner: Mobrovac)
[09:31:25] (PS9) Jbond: varnish: ratelimit unusual image sizes [puppet] - https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434)
[09:32:21] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Allow MW to honour the X-Request-Id header if set - T201409 (duration: 01m 12s)
[09:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:26] T201409: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409
[09:32:45] * mobrovac is done
[09:32:59] (PS10) Jbond: varnish: ratelimit unusual image sizes [puppet] - https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434)
[09:35:57] PROBLEM - Device not healthy -SMART- on db2035 is CRITICAL: cluster=mysql device=cciss,11 instance=db2035:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2035&var-datasource=codfw+prometheus/ops
[09:38:54] (CR) Jbond: "> Patch Set 8:" [puppet] - https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: Jbond)
[09:53:36] (CR) Gehel: [C: -1] Add postgres slave init cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[09:54:44] (PS1) Alexandros Kosiaris: otrs: Avoid setting Precedence header for stewards [puppet] - https://gerrit.wikimedia.org/r/512875 (https://phabricator.wikimedia.org/T224404)
[10:10:20] (CR) Volans: [C: +2] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[10:14:02] (Merged) jenkins-bot: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[10:14:57] (CR) jenkins-bot: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[10:15:22] (PS7) Volans: confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578 (owner: Giuseppe Lavagetto)
[10:15:41] (CR) Pmiazga: Disable the rdf2latex Collection portlet format (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: Pmiazga)
[10:16:37] (PS15) Mathew.onipe: Add postgres slave init cookbook [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[10:17:19] (CR) Mathew.onipe: Add postgres slave init cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[10:19:01] (CR) Mathew.onipe: [C: -1] Add postgres slave init cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[10:19:42] (CR) Volans: [C: +2] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/504578 (owner: Giuseppe Lavagetto)
[10:22:35] (PS16) Mathew.onipe: Add postgres slave init cookbook [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[10:23:12] (CR) Mathew.onipe: Add postgres slave init cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[10:23:16] (CR) jerkins-bot: [V: -1] confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578 (owner: Giuseppe Lavagetto)
[10:28:46] Operations, Operations-Software-Development, Discovery-Search (Current work), User-Joe, User-jijiki: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (Mathew.onipe)
[10:32:30] (PS1) Zfilipin: Group0 to 1.34.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/512878
[10:38:32] Operations, serviceops, User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (elukey) Some clarification about: > From the [[ https://grafana.wikimedia.org/d/000000316/memcache?panelId=21&fullscreen&orgId=...
[10:43:40] (CR) Elukey: [C: +1] Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[10:45:14] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.4 [keeping static files] (duration: 06m 06s)
[10:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:05] * Urbanecm waves to zeljkof
[10:47:06] PROBLEM - PHP7 rendering on mw1329 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 379 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:48:00] <_joe_> interesting
[10:48:04] (CR) Elukey: [C: +1] "> Looks good in https://puppet-compiler.wmflabs.org/compiler1002/16740/an-master1001.eqiad.wmnet/change.an-master1001.eqiad.wmnet.pson" [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[10:48:18] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.3 [keeping static files] (duration: 01m 32s)
[10:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:30] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 200 OK - 75781 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:49:03] hi Urbanecm
[10:51:19] !log zfilipin@deploy1001 Started scap: testwiki to php-1.34.0-wmf.7 and rebuild l10n cache
[10:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:45] Urbanecm: swat will be about 10 minutes late, I'm still preparing for train
[10:51:53] updating testwiki at the moment
[10:51:56] ack
[10:52:05] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki
[10:53:21] Michael_WMDE, raynor: swat will be 10 minutes late, I'm behind on train preparations
[10:53:40] it's also Urbanecm's first ever swat, so he might be a bit slow
[10:53:53] tldr: I'm not sure if all scheduled patches will make it [10:54:12] (03CR) 10Volans: [C: 04-1] "Race condition in a test" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [10:54:16] * Urbanecm is reading the docs one more time :D [10:54:19] !log zfilipin@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_4182265560" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 03m 00s) [10:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:34] zeljkof, np [10:54:41] thx for info [10:55:16] hashar: more train trouble :/ ^ [10:59:08] hashar: more info https://phabricator.wikimedia.org/T224465 [10:59:11] Michael_WMDE, forwarding messages from zeljkof, SWAT will be delayed, due to train preparations. [10:59:48] well, there are some scap problems with wmf.7, so I guess swat is on time [11:00:04] Amir1, Lucas_WMDE, MaxSem, RoanKattouw, and Niharika: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1100). [11:00:04] Urbanecm, Michael_WMDE, and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:18] Okay then zeljkof [11:02:09] zeljkof, can I start with 512422 then? [11:02:18] (03PS2) 10Pmiazga: Disable the rdf2latex Collection portlet format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) [11:02:45] I am here, but only half available for the next 30 minutes [11:02:51] Urbanecm: you’re a deployer now right? [11:02:54] yes [11:02:58] congrats :) [11:03:01] thanks [11:03:03] and I think you can go ahead with your patches! 
[11:03:07] (03PS3) 10Pmiazga: Disable the rdf2latex Collection portlet format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) [11:03:15] Okay Lucas_WMDE ! [11:03:36] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512422 (https://phabricator.wikimedia.org/T224308) (owner: 10Urbanecm) [11:04:41] (03Merged) 10jenkins-bot: Add abusefilter-modify-restricted to abusefilter group on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512422 (https://phabricator.wikimedia.org/T224308) (owner: 10Urbanecm) [11:06:09] zeljkof: What to do with the wikiversions.json train-related change? [11:06:09] Urbanecm: go ahead, I'm around if you need any help, trying to clean up train :/ [11:06:22] I can't rebase until the wikiversions.json change is away [11:06:35] Urbanecm: I'll fix it, just a second, sorry, forgot about that [11:06:39] okay [11:06:44] (03CR) 10jenkins-bot: Add abusefilter-modify-restricted to abusefilter group on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512422 (https://phabricator.wikimedia.org/T224308) (owner: 10Urbanecm) [11:10:37] Urbanecm: go ahead, things should be fine nwo [11:10:38] now [11:10:45] zeljkof, thanks, looks clear, continuing [11:14:03] verifying my change on mwdebug1002... [11:15:49] looks fine, deploying... 
[11:18:12] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: [[:gerrit:512422|Add abusefilter-modify-restricted to abusefilter group on plwiki (T224308)]] (duration: 02m 36s) [11:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:17] T224308: Add abusefilter-modify-restricted to abusefilter group on plwiki - https://phabricator.wikimedia.org/T224308 [11:18:41] deployed, moving to 512426 [11:18:58] (03CR) 10Urbanecm: [C: 03+2] Use underscores instead of spaces in wgMetaNamespace(Talk) for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [11:19:16] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [11:20:41] (03PS6) 10Urbanecm: Use underscores instead of spaces in wgMetaNamespace(Talk) for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) [11:20:56] (03CR) 10Urbanecm: [C: 03+2] Use underscores instead of spaces in wgMetaNamespace(Talk) for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [11:21:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [11:21:58] (03Merged) 10jenkins-bot: Use underscores instead of spaces in wgMetaNamespace(Talk) for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [11:22:01] (03PS6) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) [11:22:13] (03CR) 10jenkins-bot: Use underscores instead of spaces in 
wgMetaNamespace(Talk) for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [11:24:16] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [11:24:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [11:25:28] !log merging change to the puppet sudo module https://gerrit.wikimedia.org/r/c/operations/puppet/+/508311 [11:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:00] 512426 looks good at mwdebug, deploying [11:26:05] (03PS6) 10Lucas Werkmeister (WMDE): Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300) (owner: 10Michael Große) [11:27:43] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:512426|Use underscores instead of spaces in wgMetaNamespace(Talk) for several projects]] (T223039) (duration: 00m 54s) [11:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:48] T223039: Project pages inaccessible on several projects because of spaces in wgMetaNamespace(Talk) - https://phabricator.wikimedia.org/T223039 [11:28:14] (03PS7) 10Lucas Werkmeister (WMDE): Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300) (owner: 10Michael Große) [11:28:16] running namespaceDupes.php for all affected projects [11:28:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "added a comment indicating the task and fixed the capitalization of FeatureFlag" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300) (owner: 10Michael Große) [11:31:09] (03CR) 10Michael Große: [C: 03+1] Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300) (owner: 10Michael Große) [11:31:37] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) >>! In T224033#5213715, @Joe wrote: > I am 100% against having ci handle merges of ops/puppet. Think of the case ci is down and we need puppet for anything. Seem my thi... [11:31:48] !log Ran namespaceDupes.php for urwikibooks, urwikiquote, urwiktionary and aswikisource [11:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:20] Lucas_WMDE, Michael_WMDE: Deployed my patches. Can continue with other patches, or leave it to you, it's up to you :-). [11:32:27] I’ll take over, thank you :) [11:32:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300) (owner: 10Michael Große) [11:33:01] the first change should be beta-only, so the test will consist of ensuring that production Wikidata still works with it [11:33:18] roger [11:33:47] (03Merged) 10jenkins-bot: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300) (owner: 10Michael Große) [11:34:01] (03CR) 10jenkins-bot: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300) (owner: 10Michael Große) [11:34:05] Urbanecm: congratulations on your first ever swat :) [11:34:14] 🍾 [11:34:15] thank you zeljkof [11:34:53] config change is on mwdebug1002, testing [11:35:48] yup, Wikidata still 
exhibits the bug as it should, i. e. feature flag not yet in effect [11:35:53] deploying that one [11:37:05] +2ing the backports, they’ll take a bit to go through CI anyways [11:37:39] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:510204|Add feature flag config for breaking Wikibase API change (T223300)]] (duration: 00m 54s) [11:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:44] T223300: On beta enable bugfix for wbeditentity setting aliases to empty array - https://phabricator.wikimedia.org/T223300 [11:40:05] (03PS6) 10Lucas Werkmeister (WMDE): Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große) [11:40:12] Michael_WMDE: we can already deploy the config change with the list of IDs too, right? [11:40:20] sure [11:40:24] ok let’s do that then [11:40:37] (03Abandoned) 10Urbanecm: Configuring $wgMetaNamespace for ur.wiktionary, ur.wikibooks and ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511868 (https://phabricator.wikimedia.org/T223964) (owner: 10Tulsi Bhagat) [11:40:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große) [11:41:08] that one will not have any effect on its own either, nothing to test there [11:41:24] I think I’ll even skip mwdebug1002 and just rely on the canary servers [11:41:44] (03CR) 10Jbond: "Looks good just some minor nits and comments" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [11:41:50] (03Merged) 10jenkins-bot: Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große) [11:42:04] (03CR) 10jenkins-bot: Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große) [11:43:35] 
!log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:511753|Add a list of IDs to skip in production]] (duration: 00m 54s) [11:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:02] oh, and in the meantime the first backport was merged, nice [11:44:07] let’s deploy that one [11:45:19] it’s on mwdebug1002 now [11:45:34] which means that the next new schema on test wikidata should skip some IDs, right? [11:45:38] (if created on mwdebug1002) [11:45:40] testing… [11:45:45] right [11:46:26] E1…E4 were created, E5…E9 should be skipped, so it should be E10 [11:46:47] nope, I got E11 actually [11:47:00] works for me: https://test.wikidata.org/wiki/EntitySchema:E10 [11:47:04] ah, that’s why ^^ [11:47:12] ok seems to work as expected, deploying [11:48:22] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:48:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Urbanecm) a:03Volans I was able to deploy two patches without any problems, so I guess ev... [11:48:52] this HHVM getting stuck at 100% CPU on mwdebug1002 thing is getting pretty annoying [11:48:52] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/EntitySchema: SWAT: [[gerrit:512677|Skip configured IDs]] (duration: 00m 57s) [11:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:58] let’s hope the PHP7 switch will fix it [11:49:15] anyways, in the meantime the other two backports got merged too [11:49:22] ok, so that is not something new that we caused? 
[11:49:25] those are maintenance-only, so I’ll skip mwdebug1002 and sync both [11:49:28] no [11:49:36] has been happening for a while [11:49:38] * Michael_WMDE sighs in relieve [11:49:42] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 75585 bytes in 5.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:49:45] tada [11:49:53] it recovers itself after some timeout [11:50:18] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10MoritzMuehlenhoff) a:03RobH [11:50:37] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-Urbanecm, and 2 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Volans) 05Open→03Resolved Glad to hear, resolving then. [11:52:27] Michael_WMDE: and the maintenance script to run on testwikidata is just extensions/EntitySchema/maintenance/createPreexistingSchemas.php, with no arguments? 
[11:52:37] yes [11:52:38] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/EntitySchema/: SWAT: [[gerrit:512689|Add maintenance script to create preexisting Schemas]] + [[gerrit:512717|Small maintenance script adjustments]] (duration: 00m 54s) [11:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:45] ok then let’s do that [11:53:55] (03CR) 10Vgutierrez: Puppet, add RPKI validation daemon (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [11:54:06] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/EntitySchema/maintenance/createPreexistingSchemas.php --wiki=testwikidatawiki [11:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:24] !log ^ error, no change to wiki [11:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:16] okay, we know what the error is (User:Maintenance script exists and we didn’t tell User::newSystemUser() to steal it), but we won’t fix it in the remaining few minutes of the SWAT window [11:57:26] we’ll try again later, in our own “deploy EntitySchema” slot [11:57:53] and I don’t think that’s enough time to deploy raynor’s change either, sorry [11:58:02] !log EU SWAT done [11:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:06] mine is pretty quick config change, almost no testing :( [11:58:49] if any other SWATter wants to do it, that’s fine by me, but I’m not comfortable with it [11:58:51] sorry [11:59:05] raynor: are you a deployer? 
[11:59:10] yes, I am [11:59:18] oh, sorry, I didn’t know that [11:59:25] I can deploy it by myself [11:59:29] go ahead then, I guess [11:59:32] raynor: go ahead [11:59:34] zeljkof, if you have lots of stuff related to train I can wait [11:59:36] (shouldn’t have logged the EU SWAT done :/ ) [11:59:52] I need to to continue with train, but I need to finish something anyway, so 5-10 minutes is fine with me [11:59:53] that's not a big thing, just an UI glitch, it can go live later/tomorrow [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1200) [12:00:12] (03CR) 10Lucas Werkmeister (WMDE): Disable the rdf2latex Collection portlet format (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: 10Pmiazga) [12:00:13] zeljkof ok, then give me 5 mins [12:00:21] raynor: go ahead now, if it's quick to deploy :) [12:00:28] !log EU SWAT re-opened [12:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:56] (03PS4) 10Pmiazga: Disable the rdf2latex Collection portlet format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) [12:01:21] (03PS2) 10Alexandros Kosiaris: otrs: Avoid setting Precedence header for stewards [puppet] - 10https://gerrit.wikimedia.org/r/512875 (https://phabricator.wikimedia.org/T224404) [12:01:39] (03CR) 10Pmiazga: [C: 03+2] Disable the rdf2latex Collection portlet format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: 10Pmiazga) [12:02:43] (03Merged) 10jenkins-bot: Disable the rdf2latex Collection portlet format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: 10Pmiazga) [12:03:02] (03CR) 10jenkins-bot: Disable the rdf2latex Collection portlet format [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: 10Pmiazga) [12:03:34] (03CR) 10Pmiazga: Disable the rdf2latex Collection portlet format (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: 10Pmiazga) [12:04:17] (03PS2) 10Jbond: firewall logging: enable loggin on internal servers [puppet] - 10https://gerrit.wikimedia.org/r/511700 (https://phabricator.wikimedia.org/T116011) [12:08:35] 2 threads of hhvm on mwdebug1002 are using 100% cpu, trying to reload the page but it's not happening :/ [12:09:09] ok, got timeout, and now it works again [12:09:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Just tested this and it works, so merging" [puppet] - 10https://gerrit.wikimedia.org/r/512875 (https://phabricator.wikimedia.org/T224404) (owner: 10Alexandros Kosiaris) [12:10:45] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Maintenance_bot) [12:10:46] deploying [12:11:12] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:512743 Disable the rdf2latex Collection portlet format(T224433)]] (duration: 00m 55s) [12:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:17] T224433: [Bug] Two PDF links appearing on left sidebar in desktop - https://phabricator.wikimedia.org/T224433 [12:12:09] zeljkof, I'm doned, thx [12:12:16] done* [12:12:29] should I close SWAT window? is there anyone with anything extra? 
[12:13:49] I think that's a clear "nope, go ahead and close the SWAT" [12:13:55] !log EU SWAT done [12:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:19] raynor: yes, nothing left for swat [12:14:26] I'll continue with preparations for train [12:15:10] thx for letting me deploy my config change [12:15:47] raynor: no problem, I'm pretty much stuck with train anyway :) but I would like to try a few things before I give up [12:17:42] zeljkof, do you need any help with that? [12:18:09] tarrow: if you could help with T224465, that would be great :) [12:18:09] T224465: `scap sync` fails with CalledProcessError - https://phabricator.wikimedia.org/T224465 [12:29:32] (03CR) 10Volans: [C: 03+1] "LGTM, just a nit to add the missing report inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509445 (owner: 10CRusnov) [12:40:45] !log gilles@deploy1001 Started deploy [performance/asoranking@157c25f]: T224388 [12:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] T224388: AS report crashes on generation - https://phabricator.wikimedia.org/T224388 [12:40:51] !log gilles@deploy1001 Finished deploy [performance/asoranking@157c25f]: T224388 (duration: 00m 06s) [12:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:33] (03CR) 10Jbond: "mostly look sgood to me, some minor comments and questions. I also wonder if its worth advertising the 10.3.0.0/24 prefix as a fall back " (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (https://phabricator.wikimedia.org/T186550) (owner: 10Ayounsi) [12:41:54] (03PS1) 10Marostegui: db-eqiad.php: Remove db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512886 (https://phabricator.wikimedia.org/T223217) [12:44:33] (03CR) 10Urbanecm: [C: 04-2] "We're not deploying FR to any additional wikis, see Reedy's comment in the associated task. 
Escalating my previous -1 to -2, to prevent ac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) (owner: 10星耀晨曦) [12:49:43] (03PS3) 10Jbond: firewall logging: enable loggin on internal servers [puppet] - 10https://gerrit.wikimedia.org/r/511700 (https://phabricator.wikimedia.org/T116011) [12:50:53] !log gilles@deploy1001 Started deploy [performance/asoranking@1c60db1]: T224388 [12:50:58] !log gilles@deploy1001 Finished deploy [performance/asoranking@1c60db1]: T224388 (duration: 00m 04s) [12:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:58] (03CR) 10Jbond: [C: 03+2] firewall logging: enable loggin on internal servers [puppet] - 10https://gerrit.wikimedia.org/r/511700 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:51:00] T224388: AS report crashes on generation - https://phabricator.wikimedia.org/T224388 [12:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:57] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [12:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1300). 
[13:05:16] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:12:52] jouncebot: thanks for the reminder, doing train stuff, slightly behind schedule because of some problems with scap [13:13:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:15:35] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10Eevans) [13:16:26] <_joe_> that was one single spike ^^ [13:16:31] <_joe_> it has recovered [13:16:49] (03CR) 10Jcrespo: [C: 04-1] "I prefer to disable the checks on icinga or on puppet, as the checks on dumps are not in testing." [puppet] - 10https://gerrit.wikimedia.org/r/511454 (https://phabricator.wikimedia.org/T206203) (owner: 10Marostegui) [13:17:02] 10Operations, 10serviceops, 10User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (10elukey) [13:17:45] zeljkof: hey! I'm back from Lunch. Did you mean to ping me? 
I'm not really sure I know where to start with T224465 [13:17:45] T224465: `scap sync` fails with CalledProcessError - https://phabricator.wikimedia.org/T224465 [13:18:08] (03CR) 10Marostegui: "Your call, I don't have an opinion really" [puppet] - 10https://gerrit.wikimedia.org/r/511454 (https://phabricator.wikimedia.org/T206203) (owner: 10Marostegui) [13:19:58] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.068e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:20:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:20:11] tarrow: oops, not sure how I ended up pinging you about T224465 :) I wanted to ping raynor :D [13:20:23] I think I got is solved, working on it now [13:23:51] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [13:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:08] !log decommissioning restbase1014-c -- T223976 [13:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:13] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [13:31:32] (03PS1) 10Jcrespo: mariadb: Disable checks of database snapshots [puppet] - 10https://gerrit.wikimedia.org/r/512894 (https://phabricator.wikimedia.org/T206203) [13:31:49] !log gilles@deploy1001 Started deploy [performance/asoranking@1c60db1]: T224388 [13:31:50] !log gilles@deploy1001 deploy aborted: T224388 (duration: 00m 01s) [13:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:53] T224388: AS report 
crashes on generation - https://phabricator.wikimedia.org/T224388 [13:31:55] !log gilles@deploy1001 Started deploy [performance/asoranking@60369cc]: T224388 [13:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:59] !log gilles@deploy1001 Finished deploy [performance/asoranking@60369cc]: T224388 (duration: 00m 03s) [13:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:37] !log zfilipin@deploy1001 Started scap: testwiki to php-1.34.0-wmf.7 and rebuild l10n cache [13:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] zeljkof, sorry, I'm back - was streching my legs [13:36:23] and making some strong coffee [13:36:40] 10Operations, 10ops-eqiad, 10decommission: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10MoritzMuehlenhoff) [13:37:08] (03PS1) 10Muehlenhoff: Reclaim sulfur to spares [puppet] - 10https://gerrit.wikimedia.org/r/512896 (https://phabricator.wikimedia.org/T224475) [13:37:22] raynor: no problem, I ended up pinging the wrong person [13:37:25] * zeljkof facepalms [13:37:28] lol [13:37:32] happens [13:37:42] I think I've solved the problem, running `scap sync` now [13:38:10] (03CR) 10Muehlenhoff: [C: 03+2] Reclaim sulfur to spares [puppet] - 10https://gerrit.wikimedia.org/r/512896 (https://phabricator.wikimedia.org/T224475) (owner: 10Muehlenhoff) [13:38:40] ok, let me know if you need help [13:39:04] (03CR) 10Volans: [C: 04-1] "I've commented only the code with some comments/questions, see inline." 
(0312 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [13:39:55] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [13:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:02] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:42:34] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10MoritzMuehlenhoff) p:05Triage→03Normal [13:43:33] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH [13:48:36] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 5447 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:48:50] <_joe_> !log stopping hhvm on mwdebug1001 for testing [13:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] <_joe_> !log hhvm restarted on mwdebug1001 [13:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:17] (03CR) 10Elukey: "Giuseppe/Effie: ok to merge this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/510697 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [13:57:32] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:58:18] PROBLEM - Mjolnir bulk update failure check - codfw on icinga1001 is CRITICAL: 121.1 gt 2 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [13:58:24] PROBLEM - PHP7 rendering on mw1342 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 516 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:58:42] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:00:16] PROBLEM - PHP7 rendering on mw1244 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 494 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:00:50] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 553 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:00:52] (03CR) 10Jcrespo: [C: 03+1] "I'm ok with this, although we may need later something similar for the new hosts + labsdbs (or alternativelly, setup pt-kill there)." 
[software] - 10https://gerrit.wikimedia.org/r/511383 (owner: 10Marostegui) [14:01:58] PROBLEM - PHP7 rendering on mw1329 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 553 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:04:02] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:56] 10Operations, 10Traffic: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10jbond) p:05Triage→03Normal [14:05:48] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 552 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:06:10] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 63.20, 30.06, 19.83 [14:06:20] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 76.65, 35.34, 21.61 [14:06:35] ok let's have a look [14:06:40] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 72.28, 35.46, 23.43 [14:06:46] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 73.34, 35.07, 21.97 [14:08:06] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 32.37, 31.39, 21.79 [14:08:48] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 20.48, 28.96, 21.40 [14:08:58] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 20.80, 28.48, 21.26 [14:09:06] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:09:20] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 22.25, 
29.60, 23.21 [14:09:38] PROBLEM - PHP7 rendering on mw1326 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 106028 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:09:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:10:00] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.7 and rebuild l10n cache (duration: 34m 22s) [14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:26] RECOVERY - PHP7 rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 75396 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:10:48] 10Operations, 10ops-eqiad, 10decommission: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10Maintenance_bot) [14:10:49] !log beginning rolling reboots of codfw kafka-main cluster for security updates [14:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:00] RECOVERY - PHP7 rendering on mw1326 is OK: HTTP OK: HTTP/1.1 200 OK - 75396 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:11:04] RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 75396 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:11:06] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 200 OK - 75395 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:11:10] RECOVERY - PHP7 rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 75396 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:11:10] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 75396 bytes in 0.140 
second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:06] <_joe_> uh what [14:15:02] would be the scap deploy [14:15:26] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:15:38] <_joe_> which scap deploy? [14:16:26] <_joe_> we had a huge spike of fatals [14:16:31] <_joe_> starting at 13:50 [14:16:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.195e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:16:57] <_joe_> herron: do you know how to handle mirrormaker? ^^ [14:17:08] that one [17:10] <+logmsgbot> | !log zfilipin@deploy1001 Finished scap: testwiki [14:17:45] <_joe_> ok something very, very wrong is happening [14:18:09] nothing stood out in kibana [14:18:25] _joe_: that should clear as codfw catches up [14:18:52] I just rebooted kafka2001 [14:18:59] <_joe_> oh ok [14:19:10] <_joe_> jijiki: I think I know what's actually happening [14:19:19] shoot [14:19:56] <_joe_> uhm actually no [14:20:03] lol [14:20:16] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:20:25] herron: the mirror maker alert is due to a single topic lagging (cirrussearch etc..) 
that should be related to the rolling reboot [14:20:32] https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_main-codfw&refresh=5m&panelId=5&fullscreen&orgId=1 [14:20:33] <_joe_> so the deploy is [14:20:36] <_joe_> 13:35 zfilipin@deploy1001: Started scap: testwiki to php-1.34.0-wmf.7 and rebuild l10n cache [14:20:42] herron: (reboot of elasticsearch, not kafka) [14:20:58] <_joe_> for some reason, all those appservers started having issues with php7 around 13:50 [14:21:13] elukey: yes looks like a weekly event [14:21:14] <_joe_> and we don't know what happened that made them recover at 14:10 [14:21:40] <_joe_> or well, the scap command finished [14:21:51] <_joe_> so let's check the opcache clears on those servers [14:22:01] (03PS1) 10Muehlenhoff: Remove access for dkg [puppet] - 10https://gerrit.wikimedia.org/r/512908 [14:22:28] (03CR) 10Zfilipin: [C: 03+2] Group0 to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512878 (owner: 10Zfilipin) [14:22:48] <_joe_> jijiki: I see [14:22:59] <_joe_> fgrep /opcache-free php-admin-access.log [14:23:07] <_joe_> 2019-05-28T13:59:29 [14:23:13] <_joe_> and then 2019-05-28T14:09:59 [14:23:20] <_joe_> moments when the cache was cleared [14:23:39] elukey: or no actually I was looking at eqiad.cirrussearch.page-index-update [14:23:45] re weekly event [14:23:48] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512878 (owner: 10Zfilipin) [14:27:19] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512878 (owner: 10Zfilipin) [14:27:26] <_joe_> jijiki: so it seems we're back with the corrupted opcache [14:27:53] sigh [14:27:53] <_joe_> the first call corrupts it, the second fixes it [14:28:03] <_joe_> not even sure why it was fixed by the second call [14:28:09] <_joe_> or better [14:28:15] <_joe_> 
why there was a second call [14:28:51] <_joe_> It can't be a coincidence that was a time of high load for appservers [14:29:06] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 5333 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:29:46] _joe_: we were discussing with chris that this looks like a pattern [14:30:10] for quite some time now, and then we had a discussion about l10n cache [14:30:26] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.7 [14:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:43] (03PS2) 10Jforrester: Fix order of "Edit" tabs when multi-tab mode used on single-tab wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512732 (https://phabricator.wikimedia.org/T223793) (owner: 10Bartosz Dziewoński) [14:30:47] I could have a look at the scap code and see if I come up with something [14:31:05] <_joe_> let's keep this in mind, but [14:31:18] <_joe_> we mostly survived the issue if I have to believe our graphs [14:31:24] <_joe_> we depooled the servers that were down [14:32:34] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&from=1559051652695&to=1559053484000&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1327 [14:32:46] <_joe_> so go back to what you were doing [14:33:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for dkg [puppet] - 10https://gerrit.wikimedia.org/r/512908 (owner: 10Muehlenhoff) [14:36:30] (03PS1) 10Michael Große: Enable extension EntitySchema in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512909 (https://phabricator.wikimedia.org/T216955) [14:36:57] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [14:37:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:35] (03PS1) 10Filippo Giunchedi: parsoid: add local udp syslog shim [puppet] - 10https://gerrit.wikimedia.org/r/512910 [14:40:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/16784/" [puppet] - 10https://gerrit.wikimedia.org/r/512910 (owner: 10Filippo Giunchedi) [14:41:26] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:44:04] (03CR) 10Mobrovac: [C: 03+1] parsoid: add local udp syslog shim [puppet] - 10https://gerrit.wikimedia.org/r/512910 (owner: 10Filippo Giunchedi) [14:45:13] (03CR) 10Cwhite: [C: 03+1] parsoid: add local udp syslog shim [puppet] - 10https://gerrit.wikimedia.org/r/512910 (owner: 10Filippo Giunchedi) [14:47:21] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [14:50:18] (03CR) 10Filippo Giunchedi: [C: 03+2] parsoid: add local udp syslog shim [puppet] - 10https://gerrit.wikimedia.org/r/512910 (owner: 10Filippo Giunchedi) [14:50:26] (03PS2) 10Filippo Giunchedi: parsoid: add local udp syslog shim [puppet] - 10https://gerrit.wikimedia.org/r/512910 [14:52:10] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1028 - no PS redundancy - https://phabricator.wikimedia.org/T224065 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson [14:52:20] (03CR) 10Cwhite: [C: 03+2] logstash: add deprecated-input tag to deprecated inputs [puppet] - 10https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite) [14:52:28] (03PS4) 10Cwhite: logstash: add deprecated-input tag to deprecated inputs [puppet] - 10https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) [14:53:04] 10Operations, 10Gerrit, 
10serviceops, 10Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10Cmjohnson) [14:54:48] (03PS17) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:54:57] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:56:14] RECOVERY - IPMI Sensor Status on cloudvirt1028 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [14:57:51] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:57:51] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:57:52] !log reboot ms-be2016 [14:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:10] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:59:29] (03PS6) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 [15:00:04] Lucas_WMDE and Michael_WMDE: Time to snap out of that daydream and deploy N/A. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1500). 
[15:00:34] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:00:36] That sounds exciting [15:01:25] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [15:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:29] o/ [15:01:29] (03PS2) 10Jbond: firewall loggin: enable firewall logging on wmcs servers [puppet] - 10https://gerrit.wikimedia.org/r/511701 (https://phabricator.wikimedia.org/T116011) [15:01:42] deploying a new extension, EntitySchema [15:01:42] * Michael_WMDE is ready :) [15:01:48] first needs two backports though [15:02:45] (03CR) 10Jbond: [C: 03+2] firewall loggin: enable firewall logging on wmcs servers [puppet] - 10https://gerrit.wikimedia.org/r/511701 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [15:03:00] jouncebot: now [15:03:00] For the next 0 hour(s) and 56 minute(s): N/A (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1500) [15:03:08] deploying https://gerrit.wikimedia.org/r/512912 first, then we’ll see if the maintenance script works on testwikidatawiki now [15:04:15] Krinkle: is there some other value we should put as the “window” arg for {{#invoke:Deployment schedule|row}}? [15:04:38] PROBLEM - Host ms-be1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:05:05] also, there are some weird errors in the fatalmonitor (input is not proper UTF-8), are those known? 
[15:05:13] (03PS5) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [15:05:32] (03CR) 10CRusnov: "ty for review, changes implemented" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509445 (owner: 10CRusnov) [15:07:58] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:07:58] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:05] I’ll +2 the backport to the other branch as well, since CI takes so long [15:08:44] oh, and the first one is done, going ahead [15:09:40] Lucas_WMDE: checking now, what dashboard do you mean btw? mediawiki-errors? [15:10:29] `fatalmonitor` command on whatever host it is where you’re supposed to run it during SWAT [15:10:37] mwlog1001 [15:10:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10Maintenance_bot) [15:11:00] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10Maintenance_bot) [15:11:12] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/EntitySchema/: [[gerrit:512912|Steal maintenance script user]] (duration: 00m 59s) [15:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:25] Lucas_WMDE: hm.. ok. haven't used that for a few years. I use the mwdebug dashboard for staging, and mediawiki-errors post-deploy (Logstash). 
[15:11:35] ok [15:12:01] it’s a fairly simple shell script, effectively getting the most common errors from the last 1000 lines of hhvm.log [15:12:03] PHP Fatal Error from line 29 of /srv/mediawiki/php-1.34.0-wmf.6/vendor/wikibase/data-model/src/Entity/Item.php: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found [15:12:04] (03PS4) 10Bstorm: cloudstore: switch scratch mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) [15:12:07] Haven't seen that before. [15:12:10] Will remember for later. [15:12:18] you said something about UTF-8? [15:12:22] yes [15:12:29] 1484: parser error : Input is not proper UTF-8, indicate encoding ! [15:13:04] hm.. from where? Does it contain a channel or type? [15:13:11] e.g. php error, mw exception [15:13:14] syslog [15:13:23] https://gist.github.com/lucaswerkmeister/a467d7c89a7ca66bc3735cc458c11a20 in hhvm.log [15:13:38] the surrounding lines are slow queries, probably unrelated [15:14:42] lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/EntitySchema/maintenance/createPreexistingSchemas.php --wiki=testwikidatawiki [15:14:56] oh that should have been a !log – but it failed anyways [15:15:01] looking into it [15:15:55] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:15:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:05] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/EntitySchema/: [[gerrit:512912|Steal maintenance script user]] – forgot `git submodule update` before previous sync (duration: 00m 57s) [15:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:23] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript 
extensions/EntitySchema/maintenance/createPreexistingSchemas.php --wiki=testwikidatawiki [15:17:24] Lucas_WMDE: It's level=INFO, so that should be fine to ignore. [15:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:12] Stuff from hhvm's own syslog is rarely considered for monitoring (and not very detailed/actionable). As far as I know, anything useful there is already proxied to type:mediawiki in a more useful way. The rest we do include in scap canary checker if it's level=WARNING or above. [15:18:24] but tends to be duplicative, but doesn't hurt just in case [15:18:27] ok thanks [15:18:35] meanwhile, that maintenance script succeeded, yay [15:18:43] thx for checking first :) [15:18:45] deploying the wmf.6 backport as well [15:20:34] PROBLEM - puppet last run on dns5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [15:21:10] (03PS9) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [15:21:12] (03PS1) 10Mathew.onipe: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) [15:24:34] (03PS6) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [15:26:10] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:26:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:32] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-05-28 13:56:00 from db1116.eqiad.wmnet:3317 (810 GB) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups [15:28:56] (03PS2) 10Mathew.onipe: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) [15:32:56] RECOVERY - Host ms-be1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.26 ms [15:34:01] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/EntitySchema/: [[gerrit:512911|Steal maintenance script user]] (duration: 00m 58s) [15:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable extension EntitySchema in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512909 (https://phabricator.wikimedia.org/T216955) (owner: 10Michael Große) [15:34:37] alright, now for the config change [15:35:18] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:35:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:35] (03Merged) 10jenkins-bot: Enable extension EntitySchema in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512909 (https://phabricator.wikimedia.org/T216955) (owner: 10Michael Große) [15:36:02] it’s on mwdebug1002, checking [15:36:44] (03CR) 10jenkins-bot: Enable extension EntitySchema in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512909 (https://phabricator.wikimedia.org/T216955) (owner: 10Michael Große) [15:37:26] seems to work fine, deploying, and then I’ll run the maintenance script [15:38:06] (03PS2) 10Andrew Bogott: keystone: make the api service active on both controller nodes [puppet] - 10https://gerrit.wikimedia.org/r/512789 (https://phabricator.wikimedia.org/T223905) [15:38:25] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 
[[gerrit:512909|Enable extension EntitySchema in production]] (duration: 00m 56s) [15:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:35] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/EntitySchema/maintenance/createPreexistingSchemas.php --wiki=wikidatawiki [15:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:07] (03CR) 10Andrew Bogott: [C: 03+2] keystone: make the api service active on both controller nodes [puppet] - 10https://gerrit.wikimedia.org/r/512789 (https://phabricator.wikimedia.org/T223905) (owner: 10Andrew Bogott) [15:39:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but some minor comments inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:39:31] maintenance script succeeded, creating one schema manually as a final test [15:40:34] (03PS5) 10Volans: admins: add shell account and admin groups for iflorez [puppet] - 10https://gerrit.wikimedia.org/r/510985 (https://phabricator.wikimedia.org/T223496) (owner: 10Dzahn) [15:41:39] (03CR) 10Volans: [C: 03+2] admins: add shell account and admin groups for iflorez [puppet] - 10https://gerrit.wikimedia.org/r/510985 (https://phabricator.wikimedia.org/T223496) (owner: 10Dzahn) [15:41:55] manual test successful too, yay [15:42:09] !log Extension:EntitySchema deployment finished successfully [15:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:13] (03CR) 10Bstorm: cloudstore: switch scratch mounts from labstore1003 to cloudstore1008 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:42:40] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:42:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:42:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:59] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "LGMT https://puppet-compiler.wmflabs.org/compiler1001/16788/" [puppet] - 10https://gerrit.wikimedia.org/r/510697 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [15:43:04] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T224456 (10Papaul) a:05Papaul→03jcrespo disk replacement complete [15:44:11] (03CR) 10Bstorm: cloudstore: switch scratch mounts from labstore1003 to cloudstore1008 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:45:08] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:46:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks for double-checking! This LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:52:43] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Maintenance_bot) [15:52:58] RECOVERY - puppet last run on dns5001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [15:53:19] !log put back wrongly-replaced sdf on ms-be2043 - T222654 [15:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:24] T222654: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 [15:53:44] 10Operations, 10Parsoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move parsoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219927 (10Maintenance_bot) [15:54:51] !log shutting down db2091 for firmware upgrade [15:54:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:26] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:56:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:37] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Maintenance_bot) [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:14] I still have a minor cleanup to do after our deployment, sorry [16:01:23] requires no scap sync, just a bit of housekeeping on deployment [16:01:42] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [16:02:48] PROBLEM - dhclient process on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [16:02:57] known ^ [16:03:26] PROBLEM - DPKG on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [16:03:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:03:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:04:00] !log jbond@cumin1001 START - 
Cookbook sre.hosts.downtime [16:04:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:04:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:04:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:05:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:05:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:05:34] RECOVERY - dhclient process on ms-be2043 is OK: PROCS OK: 0 processes with command name dhclient [16:05:43] okay now I’m really done [16:05:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:05:48] PROBLEM - Host db2091.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:04] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [16:06:12] RECOVERY - DPKG on ms-be2043 is OK: All packages OK [16:07:20] RECOVERY - very high load average likely xfs on ms-be2043 is OK: OK - load average: 32.65, 71.09, 56.54 https://wikitech.wikimedia.org/wiki/Swift [16:07:31] (03CR) 10Muehlenhoff: "Sounds good, reboots of eqiad app servers won't start before tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/510697 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [16:07:32] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:07:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:07:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:08:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:09:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [16:09:19] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:09:22] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [16:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:39] 10Operations, 10ops-codfw, 10media-storage, 10observability, 10User-fgiunchedi: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10fgiunchedi) Disk has been replaced, thanks @Papaul ! [16:09:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:09:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:11:01] 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP error: "Undefined index: userlangattributes" from QuickTemplate.php - https://phabricator.wikimedia.org/T224491 (10Krinkle) [16:11:18] RECOVERY - Host db2091.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [16:11:40] (03PS1) 10Mobrovac: RESTRouter: Add initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) [16:12:16] 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) p:05Triage→03Normal [16:12:32] (03CR) 10Mobrovac: "TODO: add to index.yaml and add the tgz, but that can be done later" [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [16:13:02] 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) [16:14:24] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational [16:14:32] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [16:14:36] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:14:41] 10Operations, 10PHP 7.2 support, 
10Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (10Krinkle) Could be a coincidence, but I did see similar issues again today. The server... [16:14:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:15:00] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:15:06] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/509445 (owner: 10CRusnov) [16:15:19] !log rearmed keyholder on deploy2001 following reboot [16:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:38] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:15:46] (03PS7) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 [16:15:56] (03PS2) 10Smalyshev: Enable wgSpecialSearchFormOptions on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512802 (https://phabricator.wikimedia.org/T55652) [16:16:49] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10jijiki) [16:17:36] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime 
[16:17:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:50] we’re seeing UnknownContentModelExceptions on mwdebug1271 after deploying a new extension [16:18:56] on *mw1271, sorry [16:19:03] is it okay if I SSH there and do a manual `scap pull`? [16:19:14] or is anyone aware of other problems with that host? [16:19:42] Lucas_WMDE: in theory that should be ok [16:19:49] if it misbehaves, we depool it [16:19:54] nothing in icinga for mw1271 [16:20:31] unless [16:20:38] !log lucaswerkmeister-wmde@mw1271:~$ scap pull [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:55] no errors other than a bunch of “cannot delete non-empty directory” for php-1.32.0-wmf.3 [16:21:02] let’s see if it happens again… [16:21:14] (03CR) 10CRusnov: [C: 03+2] profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 (owner: 10CRusnov) [16:23:52] (03PS1) 10EBernhardson: Add cloudelastic LVS to DNS [dns] - 10https://gerrit.wikimedia.org/r/512924 (https://phabricator.wikimedia.org/T224324) [16:24:14] (03PS1) 10EBernhardson: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) [16:25:59] (03CR) 10EBernhardson: LVS for cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [16:27:23] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:27:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:04] !log Ran scap pull on mw1240 (curl -H 'Host: 
www.wikidata.org' … mw1240.eqiad.wmnet/wiki/Special:SetEntitySchemaLabelDescriptionAliases/E10/en returned 404) [16:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:21] Do we need to check further? Or shall I just assume this was the only such instance and leave it be? [16:37:25] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Volans) Added @Iflorez to the `wmf` LDAP group as agreed with @MoritzMuehlenhoff I've verified with @Iflorez that basic ac... [16:38:22] (03CR) 10CRusnov: "> Patch Set 14: Code-Review-1" (0314 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [16:39:42] 10Operations, 10netops: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (10elukey) p:05Triage→03Normal [16:40:14] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Volans) 05Open→03Resolved I've asked @elukey to sync the account to HUE as I don't have access myself. It should be all... [16:40:25] hoo: I’m not seeing any other instances in logstash [16:40:39] I’ll keep checking for a bit longer, but it looks like there’s nothing else to do [16:40:44] Cool :) [16:41:00] (03PS15) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [16:41:33] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[16:41:45] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:41:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:07] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [16:42:34] ^ me fixed [16:43:05] Oh or not fixed, but still probably me [16:44:48] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) [16:45:22] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) Hello here is the sample output. There are several inconsistencies that I can see the fix for that I'd already attempte... [16:46:56] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) I'm going to assume for now that T224493 is the same issue, because it too only... 
[16:49:19] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:49:20] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:14] (03PS1) 10Jbond: icinga: Add a function to force a recheck of all sevices [software/spicerack] - 10https://gerrit.wikimedia.org/r/512932 [16:56:21] (03PS1) 10CRusnov: hieradata common::netmon: Change timer defs to make puppet happy [puppet] - 10https://gerrit.wikimedia.org/r/512933 [16:57:03] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:57:03] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:57] (03CR) 10Volans: [C: 03+1] "LGTM, sorry if my suggestion didn't work" [puppet] - 10https://gerrit.wikimedia.org/r/512933 (owner: 10CRusnov) [16:58:10] (03CR) 10CRusnov: [C: 03+2] hieradata common::netmon: Change timer defs to make puppet happy [puppet] - 10https://gerrit.wikimedia.org/r/512933 (owner: 10CRusnov) [16:58:48] (03CR) 10CRusnov: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/512933 (owner: 10CRusnov) [17:00:04] cscott, arlolra, subbu, and halfak: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1700). 
[17:03:00] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:03:00] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:41] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:04:24] (03CR) 10Volans: [C: 04-1] "Thanks for the addition, can be useful! Just a couple of nit to fix, looks good otherwise." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/512932 (owner: 10Jbond) [17:06:35] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:08:29] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:08:47] (03PS2) 10Jbond: icinga: Add a function to force a recheck of all sevices [software/spicerack] - 10https://gerrit.wikimedia.org/r/512932 [17:11:42] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:11:43] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:52] (still no other mw1271 errors for the unknown content model – I’ll stop checking now, seems to be resolved) [17:15:15] hi! We just started seeing some timeouts in CentralNotice banner editing [17:15:17] https://logstash.wikimedia.org/goto/5a68dbc347ff197bc2d65d2b7ca1bee7 [17:15:19] anyone have any ideas? [17:15:58] about something recent that might have changed? 
Seems possibly related to the revisions table [17:16:12] (03PS3) 10Jbond: icinga: Add a function to force a recheck of all sevices [software/spicerack] - 10https://gerrit.wikimedia.org/r/512932 [17:26:42] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:26:43] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:59] 10Operations, 10LDAP-Access-Requests: Remove user Greta WMDE from wmde LDAP group - https://phabricator.wikimedia.org/T224507 (10WMDE-leszek) [17:29:58] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@0735c45]: Update mobileapps to ab67b78 [17:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:14] 08Warning Alert for device asw-c-codfw.mgmt.codfw.wmnet - Port with no description on access switch [17:34:35] (03CR) 10Volans: [C: 03+2] icinga: Add a function to force a recheck of all sevices [software/spicerack] - 10https://gerrit.wikimedia.org/r/512932 (owner: 10Jbond) [17:34:50] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:34:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:54] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@0735c45]: Update mobileapps to ab67b78 (duration: 05m 56s) [17:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:17] (03Merged) 10jenkins-bot: icinga: Add a function to force a recheck of all sevices [software/spicerack] - 10https://gerrit.wikimedia.org/r/512932 (owner: 10Jbond) [17:40:57] PROBLEM - Check Varnish expiry mailbox lag on cp3035 is CRITICAL: CRITICAL: expiry mailbox lag is 
2115212 https://wikitech.wikimedia.org/wiki/Varnish [17:41:28] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [17:41:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:15] !log rebooting yubiauth* servers for kernel update [17:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:26] (03PS23) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [17:44:47] (03CR) 10CRusnov: "thanks!" (0310 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:47:32] (03CR) 10jenkins-bot: icinga: Add a function to force a recheck of all sevices [software/spicerack] - 10https://gerrit.wikimedia.org/r/512932 (owner: 10Jbond) [17:48:02] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:48:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:07] (03PS5) 10Bstorm: cloudstore: switch scratch mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) [17:52:44] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [17:52:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:33] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: 
PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) ##### See also * – someone exper... [17:55:27] 10Operations, 10ops-codfw, 10DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Papaul) 05Open→03Resolved Power drain and firmware upgrade. Before Firmware Version 2.40.40.40 IP Address(es) 10.193.2.127 iDRAC MAC Address 84:7B:EB:F6:70:58 DNS Domain Name Lifecycle Contr... [17:57:26] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:57:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:35] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1800) [18:00:41] (03CR) 10Bstorm: [C: 03+2] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:03:30] 10Operations, 10netops: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi) p:05Triage→03High [18:04:15] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CRITICAL - load average: 104.52, 105.07, 98.93 https://wikitech.wikimedia.org/wiki/Swift [18:04:45] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:05:36] 10Operations, 10netops: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi) [18:07:16] (03PS1) 10Sbisson: Revert "Hardcode korean help desk config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512942 [18:08:13] 10Operations, 10DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Marostegui) 05Resolved→03Open a:05Papaul→03Marostegui Thanks, I will take it from here I am reopening because we still have to do stuff with it (bring mysql up, check data etc) Thanks @Papaul [18:09:01] (03PS3) 10Gergő Tisza: Revoke editmyuserjsredirect from all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) [18:11:41] !log Start mysql for s2 and s4 on db2091 T224393 [18:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:46] T224393: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 [18:14:19] 10Operations, 10netops: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10BBlack) Plan seems reasonable based on the info in the description! Maybe wait longer than 2h after the linecard is restarted? Or do we suspect that any recurrence is much less likely with no traffic? [18:14:31] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:16:39] (03PS2) 10Sbisson: Revert "Hardcode korean help desk config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512942 [18:16:47] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CRITICAL - load average: 104.52, 100.65, 98.75 https://wikitech.wikimedia.org/wiki/Swift [18:19:09] 10Operations, 10serviceops, 10User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (10elukey) Interesting data that might support what Joe thinks (namely that HHVM for some reason uses more gets than get): ` tcpdu... [18:20:05] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:23:18] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Papaul) [18:24:20] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T224456 (10Marostegui) 05Open→03Resolved The rebuilt finished, but it is reporting predictive failure. Let's not change it again until it has fully failed (as this host will be decommissioned soonish). Let's kee... [18:24:39] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [18:29:11] (03CR) 10jerkins-bot: [V: 04-1] Horizon: use keystone_controller rather than nova_controller to determine keystone host [puppet] - 10https://gerrit.wikimedia.org/r/512947 (owner: 10Andrew Bogott) [18:30:23] (03PS2) 10Andrew Bogott: Horizon: use keystone_controller rather than nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/512947 [18:30:55] (03PS4) 10Dzahn: Add jenkins-agent user to releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/474824 (owner: 10Thcipriani) [18:31:00] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: use keystone_controller rather than nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/512947 (owner: 10Andrew Bogott) [18:32:13] (03PS5) 10Dzahn: Add jenkins-agent user to releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/474824 (owner: 10Thcipriani) [18:32:21] 10Operations, 10netops: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi) I picked 2h for the sake of picking a number that //sounds// right, but it's not backed by anything. Any value works for me. 
[18:32:41] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational [18:33:00] (03CR) 10Dzahn: [C: 03+2] Add jenkins-agent user to releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/474824 (owner: 10Thcipriani) [18:35:28] (03PS2) 10Dzahn: delete the cgred module [puppet] - 10https://gerrit.wikimedia.org/r/511791 (https://phabricator.wikimedia.org/T194724) [18:35:30] 10Operations, 10ops-codfw, 10media-storage, 10observability, 10User-fgiunchedi: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10Papaul) Return information below {F29267063} [18:35:34] (03CR) 10Dzahn: [C: 03+2] delete the cgred module [puppet] - 10https://gerrit.wikimedia.org/r/511791 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:37:10] (03PS2) 10Dzahn: Revert "webserver_misc_apps: add PHP7.2 APT repository on stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512445 [18:37:51] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 2118138 https://wikitech.wikimedia.org/wiki/Varnish [18:38:11] (03CR) 10Urbanecm: [C: 04-1] "Community consensus?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510784 (https://phabricator.wikimedia.org/T223474) (owner: 10Petar.petkovic) [18:40:14] cloudcontrol2001-dev and 2003-dev have failed puppet and there is not associated SAL entry or ticket link [18:40:34] (03CR) 10Alex Monk: "The current situation strikes me as probably by mistake. Lets get it mentioned on each of the wikis (tech news?) and wait for serious obje" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510784 (https://phabricator.wikimedia.org/T223474) (owner: 10Petar.petkovic) [18:41:11] if they are supposed to be ignored let's remove monitoring instead of disabling it [18:43:11] let's avoid having 20 unhandled CRITs, it leads to monitoring fatigue [18:44:03] major deja vu there [18:44:11] did you say this before like 3 weeks ago? 
[18:44:36] yea, often [18:44:45] haha ok [18:44:51] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CRITICAL - load average: 156.70, 114.61, 100.37 https://wikitech.wikimedia.org/wiki/Swift [18:46:52] (03PS1) 10Cwhite: site: remove duplicate node definitions [puppet] - 10https://gerrit.wikimedia.org/r/512952 [18:47:34] (03CR) 10Jforrester: [C: 03+2] Wikibase: Add forwards-compatibility for dataCdnMaxAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512459 (owner: 10Jforrester) [18:47:42] (03PS2) 10Jforrester: Wikibase: Add forwards-compatibility for dataCdnMaxAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512459 [18:47:47] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512459 (owner: 10Jforrester) [18:48:55] (03Merged) 10jenkins-bot: Wikibase: Add forwards-compatibility for dataCdnMaxAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512459 (owner: 10Jforrester) [18:49:09] (03CR) 10Herron: [C: 03+1] "Thanks for this. LGTM as long as noop in PCC" [puppet] - 10https://gerrit.wikimedia.org/r/512952 (owner: 10Cwhite) [18:49:11] (03CR) 10jenkins-bot: Wikibase: Add forwards-compatibility for dataCdnMaxAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512459 (owner: 10Jforrester) [18:49:28] (03PS2) 10Bstorm: cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) [18:49:47] (03PS1) 10Andrew Bogott: Make cloudcontrol1004 the primary keystone host [puppet] - 10https://gerrit.wikimedia.org/r/512954 (https://phabricator.wikimedia.org/T221770) [18:50:10] (03CR) 10Bstorm: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:50:58] (03PS3) 10Jforrester: SDC: Stop setting wgMediaInfoEnableFilePageDepicts, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507060 [18:51:03] (03CR) 10Bstorm: "First and second rsync is done, so now this is all scheduling." [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:51:43] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: Add forwards-compatibility for dataCdnMaxAge (duration: 01m 00s) [18:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:24] (03CR) 10Jforrester: [C: 03+2] SDC: Stop setting wgMediaInfoEnableFilePageDepicts, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507060 (owner: 10Jforrester) [18:53:25] (03Merged) 10jenkins-bot: SDC: Stop setting wgMediaInfoEnableFilePageDepicts, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507060 (owner: 10Jforrester) [18:53:40] (03CR) 10jenkins-bot: SDC: Stop setting wgMediaInfoEnableFilePageDepicts, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507060 (owner: 10Jforrester) [18:55:44] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgMediaInfoEnableFilePageDepicts, no longer read (duration: 00m 57s) [18:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:55] (03PS3) 10Bstorm: cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T1900) [19:00:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - 
https://phabricator.wikimedia.org/T216240 (10Marostegui) [19:02:02] !log Reboot db2091 for full OS and MySQL upgrade - T224393 [19:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:08] T224393: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 [19:04:45] (03CR) 10Urbanecm: [C: 04-1] "I'll be satisfied if it will be announced & not opposed, but I don't think that just doing without saying anything is a good idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510784 (https://phabricator.wikimedia.org/T223474) (owner: 10Petar.petkovic) [19:05:11] (03PS2) 10Andrew Bogott: Make cloudcontrol1004 the primary keystone host [puppet] - 10https://gerrit.wikimedia.org/r/512954 (https://phabricator.wikimedia.org/T221770) [19:05:13] (03PS1) 10Andrew Bogott: glance: remove --delete from the image sync command [puppet] - 10https://gerrit.wikimedia.org/r/512956 [19:06:13] (03CR) 10Andrew Bogott: [C: 03+2] glance: remove --delete from the image sync command [puppet] - 10https://gerrit.wikimedia.org/r/512956 (owner: 10Andrew Bogott) [19:06:21] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 135.9 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [19:12:01] 10Operations, 10DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Marostegui) I am waiting for replication to catch up to start checking data consistency. 
[19:12:10] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Remove db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512886 (https://phabricator.wikimedia.org/T223217) (owner: 10Marostegui) [19:13:23] (03Merged) 10jenkins-bot: db-eqiad.php: Remove db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512886 (https://phabricator.wikimedia.org/T223217) (owner: 10Marostegui) [19:14:24] (03CR) 10jenkins-bot: db-eqiad.php: Remove db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512886 (https://phabricator.wikimedia.org/T223217) (owner: 10Marostegui) [19:14:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1064 from config as it will be decommissioned T223217 (duration: 00m 56s) [19:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:47] T223217: Decommission db1064 - https://phabricator.wikimedia.org/T223217 [19:15:45] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1064 from config as it will be decommissioned T223217 (duration: 00m 55s) [19:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:33] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [19:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:27] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [19:21:32] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) Note that this already helped find inconsistencies: [x] Some devices had status active while they shouldn't (cr3-esams... 
[19:25:14] 10Operations, 10Traffic: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10ayounsi) [19:25:17] 10Operations, 10netops: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi) [19:26:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512952 (owner: 10Cwhite) [19:26:39] (03Abandoned) 10Marostegui: check_mariadb_status.py: Clarify the status of the alert [puppet] - 10https://gerrit.wikimedia.org/r/511454 (https://phabricator.wikimedia.org/T206203) (owner: 10Marostegui) [19:26:41] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:26:54] 10Operations, 10netops: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10ayounsi) task description says sulfur.wikimedia.org but the link is to sodium ( https://netbox.wikimedia.org/dcim/devices/1171/ ) Which one is correct? 
[19:27:12] (03PS2) 10Marostegui: eventlogging.my.cnf: Increase buffer pool from 50G to 300G [puppet] - 10https://gerrit.wikimedia.org/r/512365 (https://phabricator.wikimedia.org/T224291) [19:28:18] (03CR) 10Marostegui: [C: 03+2] eventlogging.my.cnf: Increase buffer pool from 50G to 300G [puppet] - 10https://gerrit.wikimedia.org/r/512365 (https://phabricator.wikimedia.org/T224291) (owner: 10Marostegui) [19:32:11] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:36:05] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) Adding #performance because handshakes on TLS 1.3 are 100ms faster and also it caches handshakes (https://kinsta.com/blog/tls-1-3/). Hope that's fine for you. [19:37:14] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10Marostegui) For the record, the master failover for this host will be scheduled for the 19th June. [19:39:13] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:39:53] 10Operations, 10Traffic: rhenium [spare] server still receiving flow data - https://phabricator.wikimedia.org/T224477 (10ayounsi) a:03ayounsi Network devices have to have their target changed, see T212011. Note that only cr2-eqiad is actively sending netflow, the other routers are only sending keepalives. 
[19:42:23] 10Operations, 10netbox, 10observability: netbox / netmon1002: netbox report related service units failed - https://phabricator.wikimedia.org/T224517 (10Dzahn) [19:43:08] ACKNOWLEDGEMENT - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T224517 [19:43:25] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:45:41] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2035 is CRITICAL: cluster=mysql device=cciss,2 instance=db2035:9100 job=node site=codfw daniel_zahn https://phabricator.wikimedia.org/T221533 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2035&var-datasource=codfw+prometheus/ops [19:49:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.03e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:52:48] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn known [19:54:51] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10kchapman) [19:55:43] ACKNOWLEDGEMENT - Mjolnir bulk update failure check - codfw on icinga1001 is CRITICAL: 121.1 gt 2 daniel_zahn https://phabricator.wikimedia.org/T214494 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [19:56:12] (03PS1) 10Papaul: DNS: Add mgmt and productin DNS for dbproxy200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/512971 (https://phabricator.wikimedia.org/T223492) [19:56:15] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and productin DNS for dbproxy200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/512971 (https://phabricator.wikimedia.org/T223492) (owner: 10Papaul) [19:57:43] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10Dzahn) [19:57:55] (03PS2) 10Papaul: DNS: Add mgmt and productin DNS for dbproxy200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/512971 [19:58:03] (03PS3) 10Dzahn: Revert "webserver_misc_apps: add PHP7.2 APT repository on stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512445 [19:58:16] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and productin DNS for dbproxy200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/512971 (owner: 10Papaul) [19:59:31] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:01:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Papaul) [20:03:45] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but 
unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:04:34] 10Operations, 10serviceops, 10HHVM, 10Performance-Team (Radar), 10User-Marostegui: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (10kchapman) [20:04:36] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Dzahn) > moving public open tasks from Backlog to Acknowledged I think i may have a lack of understanding here, but if a bot or somebody outside the team moves tasks to "acknowledged" d... [20:05:07] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:06:17] (03CR) 10Dzahn: [C: 04-1] "E003|MISSING_OR_WRONG_PTR_FOR_NAME_AND_IP: Missing PTR '228.2.193.10.in-addr.arpa.' for name 'wmf6736.mgmt.codfw.wmnet.' and IP '10.193.2." [dns] - 10https://gerrit.wikimedia.org/r/512971 (owner: 10Papaul) [20:06:33] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:07:20] (03CR) 10Dzahn: [C: 04-1] "E101|MULTIPLE_IPS_FOR_NAME: Found 2 IPs for name 'wmf6736.mgmt.codfw.wmnet." 
(031 comment) [dns] - 10https://gerrit.wikimedia.org/r/512971 (owner: 10Papaul) [20:08:39] (03PS3) 10Papaul: DNS: Add mgmt and productin DNS for dbproxy200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/512971 [20:09:15] (03CR) 10Dzahn: [C: 03+2] Revert "webserver_misc_apps: add PHP7.2 APT repository on stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512445 (owner: 10Dzahn) [20:09:21] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:09:29] bstorm_: ^ [20:09:33] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:09:56] Eeek! [20:10:13] bstorm_: it's been in state "activating" for a while it looks [20:10:22] as opposed to activated or failed [20:10:33] That's not good... [20:10:47] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:12:11] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:13:50] 10Operations, 10netops: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (10MoritzMuehlenhoff) >>! In T212011#5218478, @ayounsi wrote: > task description says sulfur.wikimedia.org but the link is to sodium ( https://netbox.wikimedia.org/dcim/devices/1171/ ) > Which one is... 
[20:15:13] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:16:23] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:20:55] (03PS1) 10Mholloway: WikimediaEditorTasks: Update caption edit target counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512974 (https://phabricator.wikimedia.org/T224299) [20:21:29] (03CR) 10Dzahn: [C: 03+2] DNS: Add mgmt and productin DNS for dbproxy200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/512971 (owner: 10Papaul) [20:22:53] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Update caption edit target counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512974 (https://phabricator.wikimedia.org/T224299) (owner: 10Mholloway) [20:23:17] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Dzahn) [20:24:03] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Update caption edit target counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512974 (https://phabricator.wikimedia.org/T224299) (owner: 10Mholloway) [20:24:17] (03CR) 10jenkins-bot: WikimediaEditorTasks: Update caption edit target counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512974 (https://phabricator.wikimedia.org/T224299) (owner: 10Mholloway) [20:28:12] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Update caption edit target counts (duration: 00m 57s) [20:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:17] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.02e+05 gt 1e+05 
https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:40:39] 10Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (10Dzahn) I have removed a bunch of aliases where people responded they were not aware of having them or that they don't need them anymore. After that initial cleanup i opened a couple OIT tickets t... [20:41:49] RECOVERY - very high load average likely xfs on ms-be2043 is OK: OK - load average: 63.79, 72.85, 78.99 https://wikitech.wikimedia.org/wiki/Swift [20:42:18] (03PS2) 10Dzahn: phabricator: stop paging SRE for process checks, keep for https [puppet] - 10https://gerrit.wikimedia.org/r/512290 (https://phabricator.wikimedia.org/T224205) [20:44:51] (03PS3) 10Dzahn: phabricator: stop paging SRE for process checks, keep for https [puppet] - 10https://gerrit.wikimedia.org/r/512290 (https://phabricator.wikimedia.org/T224205) [20:46:18] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) [20:46:52] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) (Added another exception from another server, also around 2,000 events, for a d... 
[20:47:53] (03PS1) 10Dzahn: phabricator: remove cron to restart httpd [puppet] - 10https://gerrit.wikimedia.org/r/512977 [20:48:21] (03PS2) 10Dzahn: phabricator: remove cron to restart httpd [puppet] - 10https://gerrit.wikimedia.org/r/512977 (https://phabricator.wikimedia.org/T187790) [20:48:32] (03CR) 10Paladox: [C: 03+1] phabricator: stop paging SRE for process checks, keep for https [puppet] - 10https://gerrit.wikimedia.org/r/512290 (https://phabricator.wikimedia.org/T224205) (owner: 10Dzahn) [20:48:43] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Aklapper) That's about the existing backlog and an action to perform once. See #1 in my previous comment. [20:48:54] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) [20:49:25] (03CR) 10Dzahn: [C: 03+2] phabricator: stop paging SRE for process checks, keep for https [puppet] - 10https://gerrit.wikimedia.org/r/512290 (https://phabricator.wikimedia.org/T224205) (owner: 10Dzahn) [20:50:39] (03PS3) 10Dzahn: phabricator: remove cron to restart httpd [puppet] - 10https://gerrit.wikimedia.org/r/512977 (https://phabricator.wikimedia.org/T187790) [20:51:04] (03CR) 10Paladox: [C: 03+1] "Yay!" 
[puppet] - 10https://gerrit.wikimedia.org/r/512977 (https://phabricator.wikimedia.org/T187790) (owner: 10Dzahn) [20:52:31] (03CR) 10Dzahn: [C: 03+2] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/512977 (https://phabricator.wikimedia.org/T187790) (owner: 10Dzahn) [20:54:40] (03PS2) 10Dzahn: nagios_common: update phabricator contact group members [puppet] - 10https://gerrit.wikimedia.org/r/512291 (https://phabricator.wikimedia.org/T224205) [20:54:57] (03CR) 10jerkins-bot: [V: 04-1] nagios_common: update phabricator contact group members [puppet] - 10https://gerrit.wikimedia.org/r/512291 (https://phabricator.wikimedia.org/T224205) (owner: 10Dzahn) [20:58:49] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790 (10Dzahn) removed again since we are not seeing the leaks anymore since our recent upgrade to stretch and phab1003 [20:59:23] PROBLEM - Check Varnish expiry mailbox lag on cp3034 is CRITICAL: CRITICAL: expiry mailbox lag is 2120368 https://wikitech.wikimedia.org/wiki/Varnish [21:00:03] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:03:20] (03PS1) 10Smalyshev: Enable wgSpecialSearchFormOptions on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512979 (https://phabricator.wikimedia.org/T55652) [21:05:35] (03PS1) 10Bstorm: wikireplicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/512981 (https://phabricator.wikimedia.org/T221339) [21:07:34] (03CR) 10Ori.livneh: "Ping?" 
[puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [21:08:07] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:08:09] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [21:08:38] (03PS2) 10Bstorm: wikireplicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/512981 (https://phabricator.wikimedia.org/T221339) [21:11:14] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0) [21:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:21] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:12:43] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 8.54e+05 ge 2.592e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [21:14:24] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) [21:14:58] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) (Added two more different errors that are assumed to be corruptions. This time... 
[21:16:17] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 8.541e+05 ge 2.592e+05 Gehel OSM import in progress https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [21:19:15] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) [21:20:46] (03PS1) 10Smalyshev: Enable wgSpecialSearchFormOptions on production Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512989 (https://phabricator.wikimedia.org/T55652) [21:23:38] !log restart elasticsearch on cloudelastic1001 to test sanely sized readahead on /dev/dm-0 [21:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:14] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@5a69072]: Deploy GUI & Blazegraph update [21:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:51] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Eevans) [21:35:43] !log decommissioning restbase1015-a -- T223976 [21:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:48] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [21:38:51] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@5a69072]: Deploy GUI & Blazegraph update (duration: 14m 37s) [21:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:34] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/512981 (https://phabricator.wikimedia.org/T221339) (owner: 10Bstorm) [21:46:33] !log depool labsdb1010 for view updates [21:46:42] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:54] 10Operations, 10LDAP-Access-Requests: Remove user Greta WMDE from wmde LDAP group - https://phabricator.wikimedia.org/T224507 (10Aklapper) @WMDE-leszek: Does that mean that the accounts https://phabricator.wikimedia.org/p/Greta_Doci_WMDE/ and https://meta.wikimedia.org/wiki/Special:CentralAuth?target=Greta%20D... [21:55:37] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 7681 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:58:39] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Dzahn) > transitioning the ~1400 existing tasks currently in "backlog" on the workboard to "acknowledged" without loads of manual work and triggering notifications? I see.. though this... 
[21:59:23] (03PS1) 10EBernhardson: Update cloudelastic storage device to dm-0, matching reality [puppet] - 10https://gerrit.wikimedia.org/r/512994 [22:01:43] (03PS1) 10Bstorm: Revert "wikireplicas: depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/512997 [22:02:34] (03CR) 10Bstorm: [C: 03+2] Revert "wikireplicas: depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/512997 (owner: 10Bstorm) [22:02:35] 10Operations, 10netbox, 10observability: netbox / netmon1002: netbox report related service units failed - https://phabricator.wikimedia.org/T224517 (10Volans) a:03crusnov [22:02:49] (03PS3) 10Dzahn: nagios_common: update phabricator contact group members [puppet] - 10https://gerrit.wikimedia.org/r/512291 (https://phabricator.wikimedia.org/T224205) [22:09:40] !log restart varnish backend on cp3039 [22:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:55] !log cp3034 - restart varnish backend [22:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:30] reschedules service check on those... [22:11:35] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [22:11:47] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [22:13:07] https://wikitech.wikimedia.org/wiki/Varnish needs to be updated with those commands [22:13:22] !log repooled labsdb1010 [22:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:27] indeed. had to search for it [22:14:08] let's do cp3035 too [22:14:41] !log cp3035 - varnish-backend-restart [22:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:47] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:18:12] (03PS2) 10EBernhardson: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) [22:21:03] RECOVERY - Check Varnish expiry mailbox lag on cp3035 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [22:21:11] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish] [22:22:07] mutante: 3034 mailbox lag didn't drop like the others: https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-1h&to=now&panelId=13&fullscreen&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=backend [22:23:15] (03PS1) 10Bstorm: wikireplicas: depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/513002 (https://phabricator.wikimedia.org/T221339) [22:23:33] (03PS1) 10CRusnov: Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 [22:24:10] (03CR) 10jerkins-bot: [V: 04-1] Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (owner: 10CRusnov) [22:24:19] XioNoX: no idea, but i will repeat it with sudo -i to be exactly like cp3035 [22:25:14] !log cp3034 - sudo -i varnish-backend-restart [22:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:25:38] (03PS2) 10CRusnov: Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) [22:25:41] cp3035 shows an alert for 
systemd but it's a lie (now) [22:25:55] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/513002 (https://phabricator.wikimedia.org/T221339) (owner: 10Bstorm) [22:26:55] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [22:27:30] (03PS1) 10Papaul: DHCP: Add MAC address for dproxy200[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/513004 (https://phabricator.wikimedia.org/T223492) [22:27:40] XioNoX: the graph looks better now [22:29:23] RECOVERY - Check Varnish expiry mailbox lag on cp3034 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [22:36:47] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CRITICAL - load average: 147.90, 110.07, 88.00 https://wikitech.wikimedia.org/wiki/Swift [22:38:35] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [22:40:33] (03PS3) 10EBernhardson: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) [22:41:56] (03PS3) 10CRusnov: Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) [22:45:23] (03CR) 10Jforrester: [C: 03+2] Fix order of "Edit" tabs when multi-tab mode used on single-tab wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512732 (https://phabricator.wikimedia.org/T223793) (owner: 10Bartosz Dziewoński) [22:45:33] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:45:49] I'm taking SWAT (and starting now). 
[22:46:30] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudstore (backups) - https://phabricator.wikimedia.org/T224528 (10Papaul) [22:47:00] (03Merged) 10jenkins-bot: Fix order of "Edit" tabs when multi-tab mode used on single-tab wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512732 (https://phabricator.wikimedia.org/T223793) (owner: 10Bartosz Dziewoński) [22:47:15] (03CR) 10jenkins-bot: Fix order of "Edit" tabs when multi-tab mode used on single-tab wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512732 (https://phabricator.wikimedia.org/T223793) (owner: 10Bartosz Dziewoński) [22:47:43] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:49:49] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT Fix order of edit tabs for multi-tabs on SET wikis T223793 (duration: 00m 57s) [22:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:54] T223793: On non-SET wikis (two edit tabs), links to new pages (red links) should open the user's preferred editor (last used) - https://phabricator.wikimedia.org/T223793 [22:50:43] (03CR) 10Jforrester: [C: 03+2] Enable wgSpecialSearchFormOptions on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512802 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [22:50:52] (03PS3) 10Jforrester: Enable wgSpecialSearchFormOptions on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512802 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [22:50:57] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512802 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [22:51:15] SMalyshev: Yours will roll out to Beta Cluster soon-ish. [22:51:47] MatmaRex: Deployed and working, BTW. 
[22:52:09] :O [22:52:14] hi James_F [22:52:16] thanks [22:52:26] Sorry, meant to do that 10 hours ago and forgot. [22:52:31] (03Merged) 10jenkins-bot: Enable wgSpecialSearchFormOptions on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512802 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [22:52:48] (03CR) 10jenkins-bot: Enable wgSpecialSearchFormOptions on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512802 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [22:53:02] 10Operations, 10LDAP-Access-Requests: Remove user Greta WMDE from wmde LDAP group - https://phabricator.wikimedia.org/T224507 (10Dzahn) a:03ayounsi [22:53:33] RECOVERY - very high load average likely xfs on ms-be2043 is OK: OK - load average: 72.95, 75.27, 79.68 https://wikitech.wikimedia.org/wiki/Swift [22:53:56] James_F: thanks [22:54:34] SWAT done 9 minutes before it starts. Is that a record? [22:55:14] seems to be working [22:56:02] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (10Dzahn) The user name "darthmon" cannot be found anywhere in the admin module. Please add accounts there when adding them to LDAP groups. [22:56:45] James_F: It's a good one if it is [22:56:51] (03PS2) 10Smalyshev: Enable wgSpecialSearchFormOptions on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512979 (https://phabricator.wikimedia.org/T55652) [22:57:11] SMalyshev: Want ^^ deployed now too? 
[22:57:30] James_F: yes if you can, since testwikidata is already on .7 [22:57:38] and beta seems to be working [22:58:09] (03CR) 10Jforrester: [C: 03+2] Enable wgSpecialSearchFormOptions on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512979 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [22:59:16] (03Merged) 10jenkins-bot: Enable wgSpecialSearchFormOptions on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512979 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:00:05] MaxSem, RoanKattouw, and Niharika: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190528T2300). [23:00:05] MatmaRex and Smalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:16] SMalyshev: Live on mwdebug1002. [23:00:23] checking [23:00:59] James_F: yes works beautifully [23:01:11] (03CR) 10jenkins-bot: Enable wgSpecialSearchFormOptions on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512979 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:01:48] Kk. [23:02:39] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T55652 Enable wgSpecialSearchFormOptions on testwikidata (duration: 00m 56s) [23:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:44] T55652: Special:Search doesn't use labels and descriptions for suggestions but just the item ID - https://phabricator.wikimedia.org/T55652 [23:02:46] Live. [23:03:06] Anyone have any more config they want live, whilst I wait for back-ports to merge? 
:-) [23:03:39] (03PS2) 10Jforrester: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512479 (https://phabricator.wikimedia.org/T224337) (owner: 10Urbanecm) [23:04:04] (03CR) 10Jforrester: [C: 03+2] Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512479 (https://phabricator.wikimedia.org/T224337) (owner: 10Urbanecm) [23:04:38] Oh, yes, Sam's FR things. [23:04:45] (03PS4) 10Jforrester: FlaggedRevisions: Copy in rest of the config, for static registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512053 (owner: 10Reedy) [23:04:50] (03CR) 10Jforrester: [C: 03+2] FlaggedRevisions: Copy in rest of the config, for static registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512053 (owner: 10Reedy) [23:05:00] (03PS4) 10Jforrester: Stop using array_merge for $wgFlaggedRevsNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512061 (owner: 10Reedy) [23:05:07] (03Merged) 10jenkins-bot: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512479 (https://phabricator.wikimedia.org/T224337) (owner: 10Urbanecm) [23:05:54] (03Merged) 10jenkins-bot: FlaggedRevisions: Copy in rest of the config, for static registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512053 (owner: 10Reedy) [23:06:11] (03CR) 10Jforrester: [C: 03+2] Stop using array_merge for $wgFlaggedRevsNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512061 (owner: 10Reedy) [23:06:23] !log T221339 depooled labsdb1011 and updated views [23:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:29] T221339: Missing index on revision_userindex.rev_actor - https://phabricator.wikimedia.org/T221339 [23:06:46] (03PS5) 10Jforrester: Stop using array_merge for $wgFlaggedRevsNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512061 (owner: 10Reedy) [23:06:50] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/512061 (owner: 10Reedy) [23:06:57] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 92.64 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [23:06:57] !log jforrester@deploy1001 Synchronized wmf-config/throttle.php: Remove expired throttle rules I4ba3d489 (duration: 00m 55s) [23:06:57] (03PS1) 10Bstorm: Revert "wikireplicas: depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/513014 [23:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:16] 10Operations, 10ops-codfw, 10decommission, 10User-jijiki: Decommission rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) [23:07:21] (03CR) 10jenkins-bot: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512479 (https://phabricator.wikimedia.org/T224337) (owner: 10Urbanecm) [23:07:52] (03CR) 10Bstorm: [C: 03+2] Revert "wikireplicas: depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/513014 (owner: 10Bstorm) [23:08:09] (03Merged) 10jenkins-bot: Stop using array_merge for $wgFlaggedRevsNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512061 (owner: 10Reedy) [23:10:16] !log T221339 repooled labsdb1011 [23:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:10] !log jforrester@deploy1001 Synchronized wmf-config/flaggedrevs.php: FlaggedRevisions: Copy in rest of the config, for static registration I77d70519f Id0cd2e18c (duration: 00m 56s) [23:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:25] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/TimedMediaHandler/includes/ApiTimedText.php: T224522 Fix fatal in ApiTimedText following redirect pages (duration: 00m 58s) [23:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:29] T224522: PHP Fatal 
Error from ApiTimedText: Argument to WikiPage::factory must Title (WikiPage given) - https://phabricator.wikimedia.org/T224522 [23:15:59] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/TimedMediaHandler/includes/handlers/TextHandler/TextHandler.php: T224367 Fix regression in subtitles for non-English sites on Commons videos (duration: 00m 56s) [23:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:05] T224367: TimedText not working on non-English wikis: Serves the translated namespace instead of canonical one for Commons files - https://phabricator.wikimedia.org/T224367 [23:17:02] !log T221339 completed view updates on labsdb1009 without depooling [23:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:06] T221339: Missing index on revision_userindex.rev_actor - https://phabricator.wikimedia.org/T221339 [23:17:57] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/TimedMediaHandler/includes/handlers/TextHandler/TextHandler.php: T224367 Fix regression in subtitles for non-English sites on Commons videos (duration: 00m 57s) [23:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:10] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/TimedMediaHandler/includes/ApiTimedText.php: T224522 Fix fatal in ApiTimedText following redirect pages (duration: 00m 56s) [23:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:12] OK, all done. Hopefully. [23:22:06] (CR) Ayounsi: "Thanks! All great comments, all addressed!" 
(10 comments) [puppet] - https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: Ayounsi) [23:29:29] (PS2) Dzahn: DHCP: Add MAC address for dproxy200[1-4] [puppet] - https://gerrit.wikimedia.org/r/513004 (https://phabricator.wikimedia.org/T223492) (owner: Papaul) [23:30:06] (CR) Dzahn: [C: +2] DHCP: Add MAC address for dproxy200[1-4] [puppet] - https://gerrit.wikimedia.org/r/513004 (https://phabricator.wikimedia.org/T223492) (owner: Papaul) [23:32:47] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational [23:34:16] papaul: ^ you can start to install dbproxy [23:34:23] it ran on install2002 [23:40:08] (PS3) Dzahn: phabricator: enable php-fpm [puppet] - https://gerrit.wikimedia.org/r/510597 (https://phabricator.wikimedia.org/T190568) [23:45:05] PROBLEM - swift-container-updater on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:45:13] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:45:33] PROBLEM - swift-account-reaper on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:45:51] PROBLEM - SSH on ms-be2043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:46:03] PROBLEM - MD RAID on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:46:11] PROBLEM - Disk space on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 
10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [23:46:11] PROBLEM - swift-container-auditor on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:46:33] PROBLEM - dhclient process on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [23:46:35] PROBLEM - DPKG on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [23:46:35] PROBLEM - swift-object-updater on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:46:41] PROBLEM - swift-account-server on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:46:43] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [23:46:49] PROBLEM - Check size of conntrack table on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:46:57] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [23:47:07] PROBLEM - swift-account-auditor on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:47:07] PROBLEM - swift-object-auditor on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:47:17] PROBLEM - configured eth on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer 
[23:47:21] PROBLEM - swift-account-replicator on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:47:29] PROBLEM - swift-object-replicator on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:47:29] PROBLEM - swift-container-server on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:47:29] Operations, Discovery-Search (Current work): Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (debt) Open→Resolved a:debt [23:48:35] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:49:01] PROBLEM - swift-container-auditor on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:49:21] PROBLEM - swift-object-server on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:49:23] PROBLEM - DPKG on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [23:49:23] PROBLEM - swift-object-updater on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:49:29] PROBLEM - swift-account-server on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:50:51] PROBLEM - swift-container-replicator on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: 
Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:50:55] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:51:19] PROBLEM - swift-object-auditor on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:51:52] papaul: ^ in the past i would have tried https://wikitech-static.wikimedia.org/w/index.php?title=Swift%2FHow_To&type=revision&diff=309358&oldid=297455 but he removed the docs [23:52:15] PROBLEM - swift-object-updater on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:52:21] PROBLEM - swift-account-server on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:52:35] RECOVERY - swift-account-reaper on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [23:52:37] RECOVERY - swift-object-auditor on ms-be2043 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [23:52:37] RECOVERY - swift-account-auditor on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [23:52:43] RECOVERY - SSH on ms-be2043 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:52:45] papaul: ^ well.. it was down but coming back. 
not sure why [23:52:47] RECOVERY - configured eth on ms-be2043 is OK: OK - interfaces up [23:52:59] RECOVERY - MD RAID on ms-be2043 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:53:07] RECOVERY - Disk space on ms-be2043 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [23:53:29] RECOVERY - DPKG on ms-be2043 is OK: All packages OK [23:53:33] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2043 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:53:43] RECOVERY - Check size of conntrack table on ms-be2043 is OK: OK: nf_conntrack is 5 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:53:51] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational [23:57:23] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures