[00:09:33] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:11:13] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:26:56] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Prevention): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Krinkle) [02:10:32] (03PS4) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) [02:11:22] (03CR) 10jerkins-bot: [V: 04-1] Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [02:19:15] (03PS5) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) [02:20:08] (03CR) 10jerkins-bot: [V: 04-1] Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [02:23:51] (03CR) 10Ryan Kemper: "Still need to tune the rescore, and also separately circle back to make replica counts the same between codfw/eqiad" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [02:28:35] (03PS6) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) [02:29:25] (03CR) 10jerkins-bot: [V: 04-1] Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 
(https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [02:35:49] (03PS7) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) [02:47:39] (03CR) 10Ryan Kemper: "I decided to tack the replica change onto this ticket rather than making it separate. It made the math a bit easier to reason about, and a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [03:12:52] (03CR) 10Ryan Kemper: [C: 03+2] "These changes and the previous ones (https://gerrit.wikimedia.org/r/c/operations/puppet/+/608633) look pretty straightforward." [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [03:13:44] (03CR) 10Ryan Kemper: [C: 03+2] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [03:45:17] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [03:50:49] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 66.1 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [04:24:55] RECOVERY - Long running screen/tmux on kubernetes1001 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [04:48:43] (03CR) 10ArielGlenn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris) [05:33:16] !log remove chassis redundancy failover from fasw-c-codfw for consistency with all other VCs [05:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) (owner: 10Ssingh) [05:46:36] !log remove chassis redundancy failover from fasw-c-eqiad for consistency with all other VCs [05:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [05:51:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609186 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [06:06:31] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Create component/systemd241 [puppet] - 10https://gerrit.wikimedia.org/r/609181 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [06:09:45] !log rebooting mw1390-mw1419 for kernel security updates [06:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:58] good morning [06:26:02] (03Abandoned) 10Muehlenhoff: Enable CAS staging host for Icinga [puppet] - 10https://gerrit.wikimedia.org/r/596174 (owner: 10Muehlenhoff) [06:27:27] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [06:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:34] 10Operations, 10observability, 
10User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (10MoritzMuehlenhoff) [06:34:52] 10Operations, 10observability, 10User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (10MoritzMuehlenhoff) [06:34:54] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10MoritzMuehlenhoff) [06:38:50] hashar: bonjour [06:47:46] !log installing php5 security updates [06:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:52] php5? ;D [06:52:28] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10elukey) Adding my 2c :) * BGP communities - if pmacct supports adding them to the Kafka JSON message directly, it should be very easy to support from the Analytics poi... [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200703T0700) [07:00:08] unfortunately yes :-) [07:00:29] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:02] hashar: that, said there are also five images in our Docker registry still using PHP 5: [07:01:05] docker-registry.wikimedia.org/releng/composer-php56:0.2.0-s1 (image) 5.6.40+dfsg-0+deb8u11 upgrade [07:01:06] docker-registry.wikimedia.org/releng/composer-php56:0.2.0-s2 (image) 5.6.40+dfsg-0+deb8u12 upgrade [07:01:08] docker-registry.wikimedia.org/releng/composer-test-php56:0.2.0-s1 (image) 5.6.40+dfsg-0+deb8u11 upgrade [07:01:09] docker-registry.wikimedia.org/releng/composer-test-php56:0.2.0-s2 (image) 5.6.40+dfsg-0+deb8u12 upgrade [07:01:11] docker-registry.wikimedia.org/releng/php56:0.1.2 (image) 5.6.40+dfsg-0+deb8u12 upgrade [07:01:40] oh, there's even another 9 actually [07:02:14] yeah indeed, the last use was to run php linter for the integration.wikimedia.org website [07:02:46] and that requirement disappeared with the upgrade of the machine from Jessie to Buster [07:03:55] what surprise me is that we still have some php5.6 in production while CI no more has it which probably means we miss some test coverage :\ [07:04:23] this is 5.5 and it's not mediawiki running on PHP 5, but unrelated services [07:05:14] https://debmonitor.wikimedia.org/packages/php5-common has the full list of images, can you remove these from the Docker registry? 
(or open a task to get them removed) [07:05:42] we also have a mystery php5.5 source package: https://debmonitor.wikimedia.org/packages/php5.5-cli [07:05:53] which is used on docker-registry.wikimedia.org/releng/quibble-jessie-php55:0.0.31-1 (image) [07:15:29] moritzm: ah yeah that one is a port of php5.5 to jessie [07:15:55] I guess we should garbage collect the Docker images we no more have any use for [07:17:38] (03CR) 10Ema: [C: 03+2] cumin: update prometheus alias [puppet] - 10https://gerrit.wikimedia.org/r/609178 (https://phabricator.wikimedia.org/T243057) (owner: 10Ema) [07:17:43] 10Operations, 10netops, 10observability: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10ayounsi) In case that's useful for LibreNMS: https://github.com/librenms/librenms/pull/11488/files [07:24:29] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) [07:24:32] (03CR) 10Ema: [C: 03+2] varnish: update 19-unparseable-host-header.vtc [puppet] - 10https://gerrit.wikimedia.org/r/609179 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [07:31:58] (03PS2) 10Ema: Varnish: Include request ID in Set-Cookie warning [puppet] - 10https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395) (owner: 10Gergő Tisza) [07:35:30] (03PS1) 10Elukey: Remove notebook1004 from production [puppet] - 10https://gerrit.wikimedia.org/r/609387 (https://phabricator.wikimedia.org/T256363) [07:37:19] (03CR) 10Elukey: [C: 03+2] Remove notebook1004 from production [puppet] - 10https://gerrit.wikimedia.org/r/609387 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [07:39:05] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [07:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:47] 10Operations: Handle archival of jessie suite in Debian archive - https://phabricator.wikimedia.org/T257019 
(10MoritzMuehlenhoff) [07:39:56] 10Operations: Handle archival of jessie suite in Debian archive - https://phabricator.wikimedia.org/T257019 (10MoritzMuehlenhoff) p:05Triage→03High [07:40:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [07:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:05] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [07:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:43] this is part of https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging [07:45:11] I am renaming notebook1004 to an-scheduler1001 [07:46:44] ah of course there is not only my change in the diffset [07:46:45] sigh [07:51:37] (03CR) 10JMeybohm: [C: 03+2] Add certificate for helm-charts (chartmuseum) [puppet] - 10https://gerrit.wikimedia.org/r/609122 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [07:51:55] (03CR) 10JMeybohm: [C: 03+2] secret: add dummy key for helm-charts (chartmuseum) [labs/private] - 10https://gerrit.wikimedia.org/r/609121 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [07:51:59] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] secret: add dummy key for helm-charts (chartmuseum) [labs/private] - 10https://gerrit.wikimedia.org/r/609121 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [07:52:33] (03CR) 10JMeybohm: [C: 03+2] Introduce chartmuseum[12]001 [dns] - 10https://gerrit.wikimedia.org/r/609164 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [07:55:28] !log installing mutt security updates for jessie (stretch/buster already fixed) [07:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:09] (03PS1) 10Giuseppe Lavagetto: profile::envoy::builder A profile to use in building envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/609388 [08:01:11] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:33] !log elukey@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:34] !log authdns-update for chartmuseum - T256970 [08:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:38] T256970: Site: eqiad/codwf each 1 VM for helm-charts.wikimedia.org (chartmuseum) - https://phabricator.wikimedia.org/T256970 [08:04:57] jayme: let me know if you see anything weird in the diff [08:05:40] elukey: my change only, so nothing more weird than that :) [08:06:07] goood [08:06:08] thanks :) [08:06:55] (nice how the word "weird" is aligned with a fixed size font for both of our sentences in my IRC client :D) [08:07:21] same thing for me ahhaha [08:09:37] (03PS2) 10Giuseppe Lavagetto: profile::envoy::builder A profile to use in building envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/609388 [08:13:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::envoy::builder A profile to use in building envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/609388 (owner: 10Giuseppe Lavagetto) [08:13:37] (03CR) 10Hashar: "Some thoughts here and there ;)" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/609388 (owner: 10Giuseppe Lavagetto) [08:14:20] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:57] (03PS1) 10Elukey: sre.dns.netbox: print some suggestions in case the diff is wrong [cookbooks] - 10https://gerrit.wikimedia.org/r/609390 [08:30:00] 10Operations, 10DBA, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - 
https://phabricator.wikimedia.org/T256866 (10Kormat) 05Open→03Resolved [08:33:38] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10JAllemandou) > Region/site/AS-names - I don't love the Druid lookups idea for two reasons: 1) the data would be augmented only in Druid, not in Hive, so in the future i... [08:35:03] (03CR) 10Volans: "Indeed, it's totally safe to abort, thanks for adding the message!" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/609390 (owner: 10Elukey) [08:36:53] (03CR) 10Elukey: "For context, the diff was the following: https://phabricator.wikimedia.org/P11728" [cookbooks] - 10https://gerrit.wikimedia.org/r/609390 (owner: 10Elukey) [08:39:42] (03CR) 10Elukey: sre.dns.netbox: print some suggestions in case the diff is wrong (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/609390 (owner: 10Elukey) [08:40:03] (03PS1) 10Ema: varnish: VTC for cacheable responses with cookies [puppet] - 10https://gerrit.wikimedia.org/r/609394 (https://phabricator.wikimedia.org/T256395) [08:43:25] !log rebooting netflow* hosts for kernel security update [08:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:33] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:42] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) >>! In T256444#6273882, @ema wrote: > This was the last occurrence of the issue, and no other host has been affected since the librdkafka upgrade yest... 
[08:44:59] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) 05Open→03Resolved a:03ema [08:47:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:29] (03CR) 10Volans: sre.dns.netbox: print some suggestions in case the diff is wrong (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/609390 (owner: 10Elukey) [08:51:18] 10Operations, 10Wikimedia-Logstash: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10fgiunchedi) [08:51:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:40] (03CR) 10Arturo Borrero Gonzalez: "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/609181 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [09:00:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:11] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [09:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:20] (03CR) 10Privacybatm: "Done!" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [09:08:45] (03PS2) 10Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) [09:09:30] (03PS1) 10Elukey: Rename notebook1004 to an-scheduler1001 [dns] - 10https://gerrit.wikimedia.org/r/609396 (https://phabricator.wikimedia.org/T256363) [09:11:17] (03CR) 10Elukey: [C: 03+2] Rename notebook1004 to an-scheduler1001 [dns] - 10https://gerrit.wikimedia.org/r/609396 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [09:11:46] (03PS1) 10Filippo Giunchedi: logstash: decom check_procs [puppet] - 10https://gerrit.wikimedia.org/r/609397 (https://phabricator.wikimedia.org/T234854) [09:17:04] (03PS1) 10Elukey: Add basic setup for an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/609398 (https://phabricator.wikimedia.org/T256363) [09:17:34] (03CR) 10Jcrespo: "When I run it, I get:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [09:17:52] (03CR) 10Elukey: [C: 03+2] Add basic setup for an-scheduler1001 [puppet] - 
10https://gerrit.wikimedia.org/r/609398 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [09:21:04] (03PS1) 10Jbond: apereo_cas: login page redirect frames [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) [09:21:40] (03CR) 10Privacybatm: "> Patch Set 2:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [09:24:44] (03PS1) 10Muehlenhoff: Switch Graphite to CAS-only [puppet] - 10https://gerrit.wikimedia.org/r/609400 [09:26:57] (03PS3) 10Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) [09:27:18] (03CR) 10Privacybatm: "> Patch Set 2:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [09:36:33] (03PS1) 10Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609402 [09:37:38] (03PS2) 10Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609402 [09:38:34] (03PS1) 10Alexandros Kosiaris: WIP: Test something [puppet] - 10https://gerrit.wikimedia.org/r/609403 [09:42:14] (03PS3) 10Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609402 [09:45:25] (03PS1) 10Giuseppe Lavagetto: profile::envoy::builder: use an exec for creating the docker volume [puppet] - 10https://gerrit.wikimedia.org/r/609406 [09:46:55] (03PS4) 10Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609402 [09:49:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, 
unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:52:49] (03PS2) 10Muehlenhoff: Switch Graphite to CAS-only [puppet] - 10https://gerrit.wikimedia.org/r/609400 [09:53:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:54:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::envoy::builder: use an exec for creating the docker volume [puppet] - 10https://gerrit.wikimedia.org/r/609406 (owner: 10Giuseppe Lavagetto) [09:55:07] (03CR) 10Jcrespo: "I think I was getting None because I was testing on a different transport (puppet vs direct). So the "bug" was real but you could not have" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [10:00:36] (03PS4) 10Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) [10:01:03] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23657/" [puppet] - 10https://gerrit.wikimedia.org/r/609400 (owner: 10Muehlenhoff) [10:01:14] (03CR) 10Privacybatm: "> Patch Set 3:" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [10:02:42] (03PS5) 10Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) [10:03:25] (03CR) 10Jcrespo: [C: 03+2] "Works great." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [10:03:28] (03PS6) 10Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) [10:05:11] (03CR) 10Privacybatm: "> Patch Set 5: Code-Review+2" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [10:07:13] (03PS2) 10Alexandros Kosiaris: WIP: Test something [puppet] - 10https://gerrit.wikimedia.org/r/609403 [10:07:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [10:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:50] (03CR) 10JMeybohm: [C: 04-1] apereo_cas: login page redirect frames (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [10:09:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:32] !log notebook1004 renamed to an-scheduler1001 [10:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:38] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Radar: Renamed notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T256397 (10elukey) [10:25:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:25:32] !log installing nss security updates on jessie [10:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:27:45] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (10Kormat) [10:31:56] (03PS1) 10Arturo Borrero Gonzalez: toolforge: exec_environ: install libxml-feed-perl [puppet] - 10https://gerrit.wikimedia.org/r/609410 (https://phabricator.wikimedia.org/T256734) [10:33:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: exec_environ: install libxml-feed-perl [puppet] - 10https://gerrit.wikimedia.org/r/609410 (https://phabricator.wikimedia.org/T256734) (owner: 10Arturo Borrero Gonzalez) [10:38:33] (03PS2) 10Jbond: apereo_cas: login page redirect frames [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) [10:39:19] (03CR) 10Jbond: "updated thanks 😊" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [10:51:05] !log installing ruby-json security updates [10:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:54:00] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:55:27] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) 05Open→03Resolved Solved in T244574. [10:59:08] !log installing json-c security updates on jessie [10:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:01:18] 10Operations, 10netops: Upgrade Fastnetmon to 1.1.6 - https://phabricator.wikimedia.org/T257035 (10ayounsi) p:05Triage→03Low [11:02:57] (03PS1) 10Giuseppe Lavagetto: profile::envoy::builder: fix repo location, add timer [puppet] - 10https://gerrit.wikimedia.org/r/609411 [11:05:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609400 (owner: 10Muehlenhoff) [11:06:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:07:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::envoy::builder: fix repo location, add timer [puppet] - 10https://gerrit.wikimedia.org/r/609411 (owner: 10Giuseppe Lavagetto) [11:07:57] (03CR) 10Ema: [C: 03+2] varnish: VTC for cacheable responses with cookies [puppet] - 10https://gerrit.wikimedia.org/r/609394 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [11:10:36] (03CR) 10Muehlenhoff: [C: 03+1] "Nothing I'm really familiar with, but https://css-tricks.com/snippets/javascript/break-out-of-iframe/ agrees, so let's give this a shot." 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [11:13:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:18:12] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:19:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609400 (owner: 10Muehlenhoff) [11:19:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:20:02] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:29:15] !log rebooting urldownloader standby hosts for kernel updates (1002/2002) [11:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:31] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I reran PCC against the latest PS and it's also sane: 
https://puppet-compiler.wmflabs.org/compiler1003/23521/" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [11:48:59] (03PS1) 10Muehlenhoff: Failover url downloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/609412 [12:15:26] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) [12:16:00] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) Ported wikidata-database-cpu-saturation, just needed to change the data source for each graph. [12:30:27] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10conny-kawohl_WMDE) [12:41:24] !log Restarting Zuul / CI [12:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:10] stupid systemd ... [12:45:18] (03PS1) 10Muehlenhoff: systemd/slice: Install systemd 241 from component/systemd241 [puppet] - 10https://gerrit.wikimedia.org/r/609419 (https://phabricator.wikimedia.org/T256877) [12:47:57] (03Abandoned) 10Muehlenhoff: Unconditionally install systemd packages [puppet] - 10https://gerrit.wikimedia.org/r/609104 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [12:52:09] 10Operations, 10netops: Upgrade Fastnetmon to 1.1.6 - https://phabricator.wikimedia.org/T257035 (10ayounsi) [12:58:41] (03PS1) 10Muehlenhoff: Remove apt::pin for python3-prometheus-client-package [puppet] - 10https://gerrit.wikimedia.org/r/609420 (https://phabricator.wikimedia.org/T256877) [12:59:52] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Aklapper) Removing #WMF-Legal (not sure why this tag was added) [13:06:15] (03PS1) 10Elukey: piwik: add binlog and server-id [puppet] - 10https://gerrit.wikimedia.org/r/609421 
(https://phabricator.wikimedia.org/T234826) [13:09:31] 10Operations, 10netops: Upgrade Fastnetmon to 1.1.6 - https://phabricator.wikimedia.org/T257035 (10MoritzMuehlenhoff) Steps to update the existing package on deneb: ` apt-get source fastnetmon cd fastnetmon- 1.1.4-1~deb10u1 uupdate ../1.1.4.orig.tar.xz (some patches might be merged, drop those from debian/pa... [13:16:19] (03PS2) 10Elukey: piwik: add binlog to database config. [puppet] - 10https://gerrit.wikimedia.org/r/609421 (https://phabricator.wikimedia.org/T234826) [13:19:33] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/23662/matomo1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/609421 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:22:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:23:38] (03PS4) 10Hashar: scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) [13:24:10] (03CR) 10Hashar: [C: 03+1] "Trivial rebase ;)" [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [13:33:20] (03CR) 10Filippo Giunchedi: "LGTM overall, still not seeing the redirectmatch for /problems in latest PCC, though might be just a rebase missing?" 
[puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [13:34:04] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:34:18] (03CR) 10Hashar: [C: 03+1] Switch CI to profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [13:45:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:46:16] (03CR) 10Elukey: [C: 04-1] Switch CI to profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [13:48:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:51:46] (03PS1) 10Ayounsi: Netflow: send as little options templates as possible [homer/public] - 10https://gerrit.wikimedia.org/r/609426 (https://phabricator.wikimedia.org/T240658) [13:52:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:55:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] Failover url downloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/609412 (owner: 10Muehlenhoff) [13:56:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:59:01] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster [13:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:14] test cluster --^ [13:59:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:01:23] looks like the jobqueue is unhappy? top exception is JobQueueEventBus.php: Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable [14:01:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:05:59] <_joe_> godog: so eventgate-main [14:06:28] <_joe_> akosiaris: can you take a look? [14:07:19] <_joe_> https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=70&fullscreen&orgId=1&refresh=1m seems better [14:07:27] (03PS3) 10Alexandros Kosiaris: WIP: Test something [puppet] - 10https://gerrit.wikimedia.org/r/609403 [14:08:28] * akosiaris looking [14:09:10] PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1311 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:09:18] <_joe_> uh [14:09:21] ? [14:09:46] PROBLEM - Apache HTTP on wtp1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:10:22] <_joe_> oh this is not new [14:10:24] eventgate-main reports way fewer messages right now https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=54&fullscreen&orgId=1&refresh=1m [14:10:52] stratch that [14:11:00] RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 200 OK - 84124 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:11:01] <_joe_> !log restarted php-fpm on wtp1033, stuck in sigill [14:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:31] we had a pretty big spike of purges it seems from 13:00 to 13:50 [14:11:34] RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:11:41] !log elukey@cumin1001 END (FAIL) - Cookbook 
sre.hadoop.stop-cluster (exit_code=99) [14:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] (03PS4) 10Alexandros Kosiaris: WIP: Test something [puppet] - 10https://gerrit.wikimedia.org/r/609403 [14:37:10] (03PS1) 10Elukey: sre.hadoop.stop-cluster.py: fix minor errors/details [cookbooks] - 10https://gerrit.wikimedia.org/r/609436 (https://phabricator.wikimedia.org/T244499) [14:37:32] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) [14:38:32] (03CR) 10Elukey: [C: 03+2] sre.hadoop.stop-cluster.py: fix minor errors/details [cookbooks] - 10https://gerrit.wikimedia.org/r/609436 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [14:42:56] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) mysql-aggretated ported. This was more involved. The steps were: 1. convert from `$dc` source var to `$site` query parameter 1. change the metric used for label_values to one that is prese... [14:55:53] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [14:57:41] (03PS1) 10Elukey: profile::prometheus::analytics: rename mysql jobs [puppet] - 10https://gerrit.wikimedia.org/r/609440 [14:59:17] (03CR) 10Kormat: [C: 03+1] "That's a +1 and a <3 from the data-persistence team 😊" [puppet] - 10https://gerrit.wikimedia.org/r/609440 (owner: 10Elukey) [14:59:56] awww [15:02:42] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster [15:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:53] (03CR) 10Hashar: [C: 04-1] "This change confuses me, mostly because of the gid 903 in data.yaml. 
Reading the documentation for sysusers.d in seems type 'u' cause the " [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [15:09:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) [15:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:19] * elukey dances [15:21:45] (03PS1) 10Elukey: sre.hadoop.change-distro.py: fix misc details [cookbooks] - 10https://gerrit.wikimedia.org/r/609442 (https://phabricator.wikimedia.org/T244499) [15:21:58] 10Operations, 10DBA, 10User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (10Kormat) [15:22:09] 10Operations, 10DBA, 10User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (10Kormat) p:05Triage→03Medium [15:23:36] (03PS3) 10Filippo Giunchedi: Move to Debian packaging [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/552486 (https://phabricator.wikimedia.org/T217340) [15:26:07] (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro.py: fix misc details [cookbooks] - 10https://gerrit.wikimedia.org/r/609442 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [15:38:17] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
Please note that this will rename the metrics and thus lose historical data, should be fine though" [puppet] - 10https://gerrit.wikimedia.org/r/609440 (owner: 10Elukey) [15:39:34] !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm [15:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:49] (03CR) 10Elukey: [C: 03+2] profile::prometheus::analytics: rename mysql jobs [puppet] - 10https://gerrit.wikimedia.org/r/609440 (owner: 10Elukey) [15:41:53] !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm [15:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:38] !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1118 weight to spread load mode evenly', diff saved to https://phabricator.wikimedia.org/P11730 and previous config saved to /var/cache/conftool/dbconfig/20200703-154337-jynus.json [15:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:52] (03PS1) 10Reedy: Use $wgShellwgRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 [15:58:31] (03PS2) 10Reedy: Use $wgShellwgRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 [15:59:08] (03PS3) 10Reedy: Use $wgShellRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 [15:59:29] (03CR) 10CDanis: [C: 03+1] Use $wgShellRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 (owner: 10Reedy) [15:59:42] (03CR) 10Reedy: [C: 03+2] Use $wgShellRestrictionMethod not 
$wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 (owner: 10Reedy) [16:00:34] (03Merged) 10jenkins-bot: Use $wgShellRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 (owner: 10Reedy) [16:02:09] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Rename wgRestrictionMethod to wgShellRestrictionMethod (duration: 00m 58s) [16:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:27] (03PS1) 10JMeybohm: add chartmuseum[12]001 to dhcp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/609449 (https://phabricator.wikimedia.org/T256970) [16:07:29] (03PS1) 10JMeybohm: Add cumin alias for chartmuseum hosts [puppet] - 10https://gerrit.wikimedia.org/r/609450 (https://phabricator.wikimedia.org/T256970) [16:07:43] (03CR) 10Privacybatm: [C: 03+1] Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [16:09:38] (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [16:09:51] (03PS7) 10Jcrespo: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [16:19:53] (03PS1) 10Elukey: Set BigTop for Hadoop master/standby/worker nodes. [puppet] - 10https://gerrit.wikimedia.org/r/609452 (https://phabricator.wikimedia.org/T244499) [16:20:40] (03CR) 10Elukey: [C: 03+2] Set BigTop for Hadoop master/standby/worker nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/609452 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [16:30:02] (03PS14) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [16:31:55] (03PS2) 10Privacybatm: Firewall.py: Solve auto port detection concurrency issue [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) [16:51:02] 10Operations, 10MediaWiki-extensions-Score: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10CDanis) [16:51:34] 10Operations, 10MediaWiki-extensions-Score, 10Wikimedia-General-or-Unknown: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) [16:51:48] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: Ifa929b2ad4 (duration: 00m 57s) [16:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:24] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) [16:55:12] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ebe123) p:05Triage→03High To be more precise, the error is: > Could not execute LilyPond: /dev/nu... [16:56:38] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) >>! In T257066#6277557, @Ebe123 wrote: > To be more precise, the error is: >> Could not execut... 
[16:56:53] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10CDanis) >>! In T257066#6277557, @Ebe123 wrote: > To be more precise, the error is: >> Could not execu... [17:08:26] Reedy, cdanis: pm? [17:08:34] Sure [17:09:40] * RhinosF1 sent pm [17:26:03] (03CR) 10Alexandros Kosiaris: "looks pretty ok, but do add an entry in site.pp for those hosts with role(insetup)" [puppet] - 10https://gerrit.wikimedia.org/r/609449 (https://phabricator.wikimedia.org/T256970) (owner: 10JMeybohm) [17:26:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] add chartmuseum[12]001 to dhcp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/609449 (https://phabricator.wikimedia.org/T256970) (owner: 10JMeybohm) [17:36:08] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:37:32] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:37:32] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation 
suggestions) timed out befor [17:37:32] received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:37:32] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:37:39] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on icinga1001 is CRITICAL: 0.04156 lt 0.3 https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qe04i0ld9cvl https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [17:37:44] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [17:37:44] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:37:50] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:37:54] PROBLEM - recommendation_api endpoints health 
on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:37:58] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:38:08] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:38:14] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:38:15] oh come on [17:38:26] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:38:35] 👋 [17:38:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve 
aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:39:00] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was rece [17:39:00] tech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:01] <_joe_> it's a spike of requests [17:39:14] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:17] <_joe_> 15k reqps [17:39:22] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:39:28] <_joe_> gimme 5 mins and I'll be at my computer [17:39:43] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-15m&to=now [17:39:55] <_joe_> someone look at the logs and find out what's going on [17:39:56] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:40:00] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[17:40:48] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:40:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:40:52] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:34] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:43:02] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [17:43:40] logstash indicates lots of cirrussearch-too-busy-error [17:44:14] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:44:14] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[17:44:54] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:45:18] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:45:26] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:45:27] (03PS1) 10Krinkle: Temporarily turn off LilyPond execution [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609470 (https://phabricator.wikimedia.org/T257062) [17:45:28] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:45:39] (03CR) 10Krinkle: [C: 03+2] Temporarily turn off LilyPond execution [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609470 (https://phabricator.wikimedia.org/T257062) (owner: 10Krinkle) [17:46:00] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:46:00] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:46:02] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:46:20] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:46:22] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:46:22] (03Merged) 10jenkins-bot: Temporarily turn off LilyPond execution [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609470 (https://phabricator.wikimedia.org/T257062) (owner: 10Krinkle) [17:46:23] <_joe_> shdubsh: yes that's expected [17:46:34] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:46:44] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:46:56] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:47:02] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:47:04] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:47:04] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:47:06] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:47:14] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:47:26] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:47:38] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:47:46] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:47:50] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:48:36] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:48:36] (03PS1) 10Krinkle: noc: improve tab anchor links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609471 [17:48:45] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on icinga1001 is OK: (C)0.3 lt (W)0.5 lt 0.7015 https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qe04i0ld9cvl https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [17:48:45] (03CR) 10Krinkle: [C: 03+2] noc: improve tab anchor links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609471 (owner: 10Krinkle) [17:49:08] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:51:04] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-data [17:51:04] etheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:52:58] (03Merged) 10jenkins-bot: noc: improve tab anchor links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609471 (owner: 10Krinkle) [17:55:34] (03PS1) 10Majavah: Remove "Create a book" link from sidebar on Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609472 (https://phabricator.wikimedia.org/T257073) [17:58:30] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [18:02:02] (03CR) 10RhinosF1: [C: 03+1] Remove "Create a book" link from sidebar on Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609472 (https://phabricator.wikimedia.org/T257073) (owner: 10Majavah) [18:08:48] (03PS1) 10CDanis: vcl: ratelimit search API calls [puppet] - 10https://gerrit.wikimedia.org/r/609475 [18:09:30] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 19.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:10:46] (03PS2) 10CDanis: vcl: ratelimit search API calls [puppet] - 10https://gerrit.wikimedia.org/r/609475 [18:12:22] (03CR) 10Jbond: [C: 03+1] "LGTM could also add "std.ip(req.http.X-Client-IP, "192.0.2.1") ~ public_cloud_nets " to the claus to tie it to clouds" [puppet] - 10https://gerrit.wikimedia.org/r/609475 (owner: 10CDanis) [18:12:31] (03CR) 10RLazarus: [C: 03+1] "LGTM as an emergency patch to keep handy; someone else should review more thoughtfully for longer-term use" [puppet] - 10https://gerrit.wikimedia.org/r/609475 (owner: 10CDanis) [18:16:52] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:20:18] (03PS1) 10Jbond: varnish: Rate limit cloud providers for all requiests [puppet] - 10https://gerrit.wikimedia.org/r/609477 [18:24:50] (03PS1) 10CDanis: vcl: public_clouds_shutdown: ratelimit API reqs as well [puppet] - 10https://gerrit.wikimedia.org/r/609480 [18:25:27] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Bugreporter) p:05High→03Unbreak! [18:40:44] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:41:59] For the records, I have intermittent DB problems connecting on Phabricator, e.g. the browser showing "Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL)." and/or "Unable to establish a connection to any database host phuser@m3-master.eqiad.wmnet." [18:42:04] Phab down? [18:42:57] Hmh, appears to be working again [18:45:21] Hi ops people - would any of you help a poor analytics-engineer restart a service? [18:45:36] see the above error message about hive-server [18:47:31] !log ✔️ cdanis@an-coord1001.eqiad.wmnet ~ 🕒☕ sudo systemctl restart hive-server2.service [18:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:06] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:48:13] \o/ thanks a lot cdanis :) [18:49:27] Majavah: Phab still has DB issues. Just that they are intermittent. 
[18:50:20] * joal goes restarting failed jobs [18:53:46] I'm getting Phab errors too. [19:02:24] PROBLEM - SSH on an-coord1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:04:12] I just killed a process on an-coord1001 that is probably the source of the issues above --^ [19:04:53] right - memory full [19:04:57] sorry for that [19:06:41] Better now. Thanks! [19:11:26] RECOVERY - SSH on an-coord1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:02:03] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Dsharpe) An issue is being diagnosed involving this extension, and it will likely remain down until a... [20:07:21] (03PS5) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [20:13:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:18:13] (03PS1) 10Peter.ovchyn: Rename WPBSkinBlacklist to WPBSkinDisabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) [20:19:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:23:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:26:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:33:21] (03PS6) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [21:03:15] (03PS1) 10Krinkle: Remove bogus $wgWMEPhp7SamplingRate setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609494 (https://phabricator.wikimedia.org/T219127) [21:04:44] (03PS7) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [21:05:10] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1002/23669/" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [21:21:50] (03PS1) 10Reedy: Add a maintenance script to get all LY files [extensions/Score] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/609303 [21:21:54] (03CR) 10Reedy: [C: 03+2] Add a maintenance script to get all LY files [extensions/Score] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/609303 (owner: 10Reedy) [21:23:37] 10Operations, 10MediaWiki-Shell, 10Wikimedia-General-or-Unknown, 10Security, 10Sustainability (Incident Prevention): Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (10Krinkle) 
[21:31:38] (03CR) 10Andrew Bogott: [C: 03+1] puppetmaster::frontend: add hiera calls and type validation [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [21:35:32] (03CR) 10Andrew Bogott: [C: 03+1] "Thanks for splitting this out" [puppet] - 10https://gerrit.wikimedia.org/r/609186 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [21:37:12] (03Merged) 10jenkins-bot: Add a maintenance script to get all LY files [extensions/Score] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/609303 (owner: 10Reedy) [21:41:50] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 63 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:42:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:42:52] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10DannyS712) p:05Unbreak!→03High Move back to high - this is just the public tracking task, the act... [21:45:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:49:10] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.39/extensions/Score/: Sync maintenance script (duration: 00m 58s) [21:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:53:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:53:30] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas