[00:09:33] <icinga-wm>	 PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:11:13] <icinga-wm>	 RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:26:56] <wikibugs>	 Operations, Performance-Team, serviceops, Patch-For-Review, Sustainability (Incident Prevention): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Krinkle)
[02:10:32] <wikibugs>	 (PS4) Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928)
[02:11:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] Scale largest shards to be closer to 30GB [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: Ryan Kemper)
[02:19:15] <wikibugs>	 (PS5) Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928)
[02:20:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] Scale largest shards to be closer to 30GB [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: Ryan Kemper)
[02:23:51] <wikibugs>	 (CR) Ryan Kemper: "Still need to tune the rescore, and also separately circle back to make replica counts the same between codfw/eqiad" (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: Ryan Kemper)
[02:28:35] <wikibugs>	 (PS6) Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928)
[02:29:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Scale largest shards to be closer to 30GB [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: Ryan Kemper)
[02:35:49] <wikibugs>	 (PS7) Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928)
[02:47:39] <wikibugs>	 (CR) Ryan Kemper: "I decided to tack the replica change onto this ticket rather than making it separate. It made the math a bit easier to reason about, and a" [mediawiki-config] - https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: Ryan Kemper)
[03:12:52] <wikibugs>	 (CR) Ryan Kemper: [C: +2] "These changes and the previous ones (https://gerrit.wikimedia.org/r/c/operations/puppet/+/608633) look pretty straightforward." [puppet] - https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) (owner: ZPapierski)
[03:13:44] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Handle oauth proxy settings [puppet] - https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: ZPapierski)
[03:45:17] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[03:50:49] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 66.1 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:24:55] <icinga-wm>	 RECOVERY - Long running screen/tmux on kubernetes1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[04:48:43] <wikibugs>	 (CR) ArielGlenn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: Alexandros Kosiaris)
[05:33:16] <XioNoX>	 !log remove chassis redundancy failover from fasw-c-codfw for consistency with all other VCs
[05:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:26] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) (owner: Ssingh)
[05:46:36] <XioNoX>	 !log remove chassis redundancy failover from fasw-c-eqiad for consistency with all other VCs
[05:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: Jbond)
[05:51:08] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/609186 (https://phabricator.wikimedia.org/T256721) (owner: Jbond)
[06:06:31] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Create component/systemd241 [puppet] - https://gerrit.wikimedia.org/r/609181 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[06:09:45] <moritzm>	 !log rebooting mw1390-mw1419 for kernel security updates
[06:09:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:58] <hashar>	 good morning
[06:26:02] <wikibugs>	 (Abandoned) Muehlenhoff: Enable CAS staging host for Icinga [puppet] - https://gerrit.wikimedia.org/r/596174 (owner: Muehlenhoff)
[06:27:27] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[06:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:35] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:34] <wikibugs>	 Operations, observability, User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (MoritzMuehlenhoff)
[06:34:52] <wikibugs>	 Operations, observability, User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (MoritzMuehlenhoff)
[06:34:54] <wikibugs>	 Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (MoritzMuehlenhoff)
[06:38:50] <elukey>	 hashar: bonjour
[06:47:46] <moritzm>	 !log installing php5 security updates
[06:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:52] <hashar>	 php5? ;D
[06:52:28] <wikibugs>	 Operations, Analytics, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (elukey) Adding my 2c :)  * BGP communities - if pmacct supports adding them to the Kafka JSON message directly, it should be very easy to support from the Analytics poi...
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200703T0700)
[07:00:08] <moritzm>	 unfortunately yes :-)
[07:00:29] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:00:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:40] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:02] <moritzm>	 hashar: that said, there are also five images in our Docker registry still using PHP 5:
[07:01:05] <moritzm>	 docker-registry.wikimedia.org/releng/composer-php56:0.2.0-s1 (image) 	5.6.40+dfsg-0+deb8u11 upgrade
[07:01:06] <moritzm>	 docker-registry.wikimedia.org/releng/composer-php56:0.2.0-s2 (image) 	5.6.40+dfsg-0+deb8u12 upgrade
[07:01:08] <moritzm>	 docker-registry.wikimedia.org/releng/composer-test-php56:0.2.0-s1 (image) 	5.6.40+dfsg-0+deb8u11 upgrade
[07:01:09] <moritzm>	 docker-registry.wikimedia.org/releng/composer-test-php56:0.2.0-s2 (image) 	5.6.40+dfsg-0+deb8u12 upgrade
[07:01:11] <moritzm>	 docker-registry.wikimedia.org/releng/php56:0.1.2 (image) 	5.6.40+dfsg-0+deb8u12 upgrade
[07:01:40] <moritzm>	 oh, there's even another 9 actually
[07:02:14] <hashar>	 yeah indeed, the last use was to run php linter for the integration.wikimedia.org website
[07:02:46] <hashar>	 and that requirement disappeared with the upgrade of the machine from Jessie to Buster 
[07:03:55] <hashar>	 what surprises me is that we still have some php5.6 in production while CI no longer has it, which probably means we miss some test coverage :\
[07:04:23] <moritzm>	 this is 5.5 and it's not mediawiki running on PHP 5, but unrelated services
[07:05:14] <moritzm>	 https://debmonitor.wikimedia.org/packages/php5-common has the full list of images, can you remove these from the Docker registry? (or open a task to get them removed)
[07:05:42] <moritzm>	 we also have a mystery php5.5 source package: https://debmonitor.wikimedia.org/packages/php5.5-cli
[07:05:53] <moritzm>	 which is used on docker-registry.wikimedia.org/releng/quibble-jessie-php55:0.0.31-1 (image)
[07:15:29] <hashar>	 moritzm: ah yeah that one is a port of php5.5 to jessie
[07:15:55] <hashar>	 I guess we should garbage collect the Docker images we no longer have any use for
[07:17:38] <wikibugs>	 (CR) Ema: [C: +2] cumin: update prometheus alias [puppet] - https://gerrit.wikimedia.org/r/609178 (https://phabricator.wikimedia.org/T243057) (owner: Ema)
[07:17:43] <wikibugs>	 Operations, netops, observability: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (ayounsi) In case that's useful for LibreNMS: https://github.com/librenms/librenms/pull/11488/files
[07:24:29] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (ayounsi)
[07:24:32] <wikibugs>	 (CR) Ema: [C: +2] varnish: update 19-unparseable-host-header.vtc [puppet] - https://gerrit.wikimedia.org/r/609179 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[07:31:58] <wikibugs>	 (PS2) Ema: Varnish: Include request ID in Set-Cookie warning [puppet] - https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395) (owner: Gergő Tisza)
[07:35:30] <wikibugs>	 (PS1) Elukey: Remove notebook1004 from production [puppet] - https://gerrit.wikimedia.org/r/609387 (https://phabricator.wikimedia.org/T256363)
[07:37:19] <wikibugs>	 (CR) Elukey: [C: +2] Remove notebook1004 from production [puppet] - https://gerrit.wikimedia.org/r/609387 (https://phabricator.wikimedia.org/T256363) (owner: Elukey)
[07:39:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[07:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:47] <wikibugs>	 Operations: Handle archival of jessie suite in Debian archive - https://phabricator.wikimedia.org/T257019 (MoritzMuehlenhoff)
[07:39:56] <wikibugs>	 Operations: Handle archival of jessie suite in Debian archive - https://phabricator.wikimedia.org/T257019 (MoritzMuehlenhoff) p:Triage→High
[07:40:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[07:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[07:44:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:43] <elukey>	 this is part of https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
[07:45:11] <elukey>	 I am renaming notebook1004 to an-scheduler1001
[07:46:44] <elukey>	 ah of course there is not only my change in the diffset
[07:46:45] <elukey>	 sigh
[07:51:37] <wikibugs>	 (CR) JMeybohm: [C: +2] Add certificate for helm-charts (chartmuseum) [puppet] - https://gerrit.wikimedia.org/r/609122 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:51:55] <wikibugs>	 (CR) JMeybohm: [C: +2] secret: add dummy key for helm-charts (chartmuseum) [labs/private] - https://gerrit.wikimedia.org/r/609121 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:51:59] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] secret: add dummy key for helm-charts (chartmuseum) [labs/private] - https://gerrit.wikimedia.org/r/609121 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:52:33] <wikibugs>	 (CR) JMeybohm: [C: +2] Introduce chartmuseum[12]001 [dns] - https://gerrit.wikimedia.org/r/609164 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:55:28] <moritzm>	 !log installing mutt security updates for jessie (stretch/buster already fixed)
[07:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:09] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::envoy::builder A profile to use in building envoyproxy [puppet] - https://gerrit.wikimedia.org/r/609388
[08:01:11] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:33] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:34] <jayme>	 !log authdns-update for chartmuseum - T256970
[08:04:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:38] <stashbot>	 T256970: Site: eqiad/codwf each 1 VM for helm-charts.wikimedia.org (chartmuseum) - https://phabricator.wikimedia.org/T256970
[08:04:57] <elukey>	 jayme: let me know if you see anything weird in the diff
[08:05:40] <jayme>	 elukey: my change only, so nothing more weird than that :)
[08:06:07] <elukey>	 goood
[08:06:08] <elukey>	 thanks :)
[08:06:55] <jayme>	 (nice how the word "weird" is aligned with a fixed size font for both of our sentences in my IRC client :D)
[08:07:21] <elukey>	 same thing for me ahhaha
[08:09:37] <wikibugs>	 (PS2) Giuseppe Lavagetto: profile::envoy::builder A profile to use in building envoyproxy [puppet] - https://gerrit.wikimedia.org/r/609388
[08:13:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] profile::envoy::builder A profile to use in building envoyproxy [puppet] - https://gerrit.wikimedia.org/r/609388 (owner: Giuseppe Lavagetto)
[08:13:37] <wikibugs>	 (CR) Hashar: "Some thoughts here and there ;)" (7 comments) [puppet] - https://gerrit.wikimedia.org/r/609388 (owner: Giuseppe Lavagetto)
[08:14:20] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:22] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:57] <wikibugs>	 (PS1) Elukey: sre.dns.netbox: print some suggestions in case the diff is wrong [cookbooks] - https://gerrit.wikimedia.org/r/609390
[08:30:00] <wikibugs>	 Operations, DBA, User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (Kormat) Open→Resolved
[08:33:38] <wikibugs>	 Operations, Analytics, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (JAllemandou) > Region/site/AS-names - I don't love the Druid lookups idea for two reasons: 1) the data would be augmented only in Druid, not in Hive, so in the future i...
[08:35:03] <wikibugs>	 (CR) Volans: "Indeed, it's totally safe to abort, thanks for adding the message!" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/609390 (owner: Elukey)
[08:36:53] <wikibugs>	 (CR) Elukey: "For context, the diff was the following: https://phabricator.wikimedia.org/P11728" [cookbooks] - https://gerrit.wikimedia.org/r/609390 (owner: Elukey)
[08:39:42] <wikibugs>	 (CR) Elukey: sre.dns.netbox: print some suggestions in case the diff is wrong (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/609390 (owner: Elukey)
[08:40:03] <wikibugs>	 (PS1) Ema: varnish: VTC for cacheable responses with cookies [puppet] - https://gerrit.wikimedia.org/r/609394 (https://phabricator.wikimedia.org/T256395)
[08:43:25] <moritzm>	 !log rebooting netflow* hosts for kernel security update
[08:43:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:42] <wikibugs>	 Operations, Traffic, Patch-For-Review, User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (ema) >>! In T256444#6273882, @ema wrote: > This was the last occurrence of the issue, and no other host has been affected since the librdkafka upgrade yest...
[08:44:59] <wikibugs>	 Operations, Traffic, Patch-For-Review, User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (ema) Open→Resolved a:ema
[08:47:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:29] <wikibugs>	 (CR) Volans: sre.dns.netbox: print some suggestions in case the diff is wrong (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/609390 (owner: Elukey)
[08:51:18] <wikibugs>	 Operations, Wikimedia-Logstash: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (fgiunchedi)
[08:51:44] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:38] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:55:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:03] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:10] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:30] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "thanks!" [puppet] - https://gerrit.wikimedia.org/r/609181 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[09:00:53] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:00:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:11] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[09:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:13] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:20] <wikibugs>	 (CR) Privacybatm: "Done!" (2 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[09:08:45] <wikibugs>	 (PS2) Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951)
[09:09:30] <wikibugs>	 (PS1) Elukey: Rename notebook1004 to an-scheduler1001 [dns] - https://gerrit.wikimedia.org/r/609396 (https://phabricator.wikimedia.org/T256363)
[09:11:17] <wikibugs>	 (CR) Elukey: [C: +2] Rename notebook1004 to an-scheduler1001 [dns] - https://gerrit.wikimedia.org/r/609396 (https://phabricator.wikimedia.org/T256363) (owner: Elukey)
[09:11:46] <wikibugs>	 (PS1) Filippo Giunchedi: logstash: decom check_procs [puppet] - https://gerrit.wikimedia.org/r/609397 (https://phabricator.wikimedia.org/T234854)
[09:17:04] <wikibugs>	 (PS1) Elukey: Add basic setup for an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/609398 (https://phabricator.wikimedia.org/T256363)
[09:17:34] <wikibugs>	 (CR) Jcrespo: "When I run it, I get:" [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[09:17:52] <wikibugs>	 (CR) Elukey: [C: +2] Add basic setup for an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/609398 (https://phabricator.wikimedia.org/T256363) (owner: Elukey)
[09:21:04] <wikibugs>	 (PS1) Jbond: apereo_cas: login page redirect frames [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513)
[09:21:40] <wikibugs>	 (CR) Privacybatm: "> Patch Set 2:" [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[09:24:44] <wikibugs>	 (PS1) Muehlenhoff: Switch Graphite to CAS-only [puppet] - https://gerrit.wikimedia.org/r/609400
[09:26:57] <wikibugs>	 (PS3) Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951)
[09:27:18] <wikibugs>	 (CR) Privacybatm: "> Patch Set 2:" [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[09:36:33] <wikibugs>	 (PS1) Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609402
[09:37:38] <wikibugs>	 (PS2) Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609402
[09:38:34] <wikibugs>	 (PS1) Alexandros Kosiaris: WIP: Test something [puppet] - https://gerrit.wikimedia.org/r/609403
[09:42:14] <wikibugs>	 (PS3) Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609402
[09:45:25] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::envoy::builder: use an exec for creating the docker volume [puppet] - https://gerrit.wikimedia.org/r/609406
[09:46:55] <wikibugs>	 (PS4) Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609402
[09:49:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:52:49] <wikibugs>	 (PS2) Muehlenhoff: Switch Graphite to CAS-only [puppet] - https://gerrit.wikimedia.org/r/609400
[09:53:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:54:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] profile::envoy::builder: use an exec for creating the docker volume [puppet] - https://gerrit.wikimedia.org/r/609406 (owner: Giuseppe Lavagetto)
[09:55:07] <wikibugs>	 (CR) Jcrespo: "I think I was getting None because I was testing on a different transport (puppet vs direct). So the "bug" was real but you could not have" (2 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[10:00:36] <wikibugs>	 (PS4) Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951)
[10:01:03] <wikibugs>	 (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23657/" [puppet] - https://gerrit.wikimedia.org/r/609400 (owner: Muehlenhoff)
[10:01:14] <wikibugs>	 (CR) Privacybatm: "> Patch Set 3:" (2 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[10:02:42] <wikibugs>	 (PS5) Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951)
[10:03:25] <wikibugs>	 (CR) Jcrespo: [C: +2] "Works great." [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[10:03:28] <wikibugs>	 (PS6) Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951)
[10:05:11] <wikibugs>	 (CR) Privacybatm: "> Patch Set 5: Code-Review+2" [software/transferpy] - https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: Privacybatm)
[10:07:13] <wikibugs>	 (PS2) Alexandros Kosiaris: WIP: Test something [puppet] - https://gerrit.wikimedia.org/r/609403
[10:07:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[10:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:50] <wikibugs>	 (CR) JMeybohm: [C: -1] apereo_cas: login page redirect frames (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) (owner: Jbond)
[10:09:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:27] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:32] <elukey>	 !log notebook1004 renamed to an-scheduler1001
[10:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:20] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:38] <wikibugs>	 Operations, ops-eqiad, Analytics-Cluster, Analytics-Radar: Renamed notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T256397 (elukey)
[10:25:16] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:25:32] <moritzm>	 !log installing nss security updates on jessie
[10:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:28] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:27:45] <wikibugs>	 Operations, DBA, User-Kormat, User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (Kormat)
[10:31:56] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: toolforge: exec_environ: install libxml-feed-perl [puppet] - https://gerrit.wikimedia.org/r/609410 (https://phabricator.wikimedia.org/T256734)
[10:33:16] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: exec_environ: install libxml-feed-perl [puppet] - https://gerrit.wikimedia.org/r/609410 (https://phabricator.wikimedia.org/T256734) (owner: Arturo Borrero Gonzalez)
[10:38:33] <wikibugs>	 (PS2) Jbond: apereo_cas: login page redirect frames [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513)
[10:39:19] <wikibugs>	 (CR) Jbond: "updated thanks 😊" (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) (owner: Jbond)
[10:51:05] <moritzm>	 !log installing ruby-json security updates
[10:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:54:00] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:55:27] <wikibugs>	 Operations, netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (ayounsi) Open→Resolved Solved in T244574.
[10:59:08] <moritzm>	 !log installing json-c security updates on jessie
[10:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:34] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:01:18] <wikibugs>	 Operations, netops: Upgrade Fastnetmon to 1.1.6 - https://phabricator.wikimedia.org/T257035 (ayounsi) p:Triage→Low
[11:02:57] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::envoy::builder: fix repo location, add timer [puppet] - https://gerrit.wikimedia.org/r/609411
[11:05:31] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/609400 (owner: Muehlenhoff)
[11:06:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:07:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] profile::envoy::builder: fix repo location, add timer [puppet] - https://gerrit.wikimedia.org/r/609411 (owner: Giuseppe Lavagetto)
[11:07:57] <wikibugs>	 (CR) Ema: [C: +2] varnish: VTC for cacheable responses with cookies [puppet] - https://gerrit.wikimedia.org/r/609394 (https://phabricator.wikimedia.org/T256395) (owner: Ema)
[11:10:36] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Nothing I'm really familiar with, but https://css-tricks.com/snippets/javascript/break-out-of-iframe/ agrees, so let's give this a shot." [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) (owner: Jbond)
[11:13:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:18:12] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:19:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609400 (owner: 10Muehlenhoff)
[11:19:16] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:20:02] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:29:15] <moritzm>	 !log rebooting urldownloader standby hosts for kernel updates (1002/2002)
[11:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:31] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[11:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:15] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:37] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[11:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:20] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I reran PCC against the latest PS and it's also sane: https://puppet-compiler.wmflabs.org/compiler1003/23521/" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond)
[11:48:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover url downloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/609412
[12:15:26] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat)
[12:16:00] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) Ported wikidata-database-cpu-saturation, just needed to change the data source for each graph.
[12:30:27] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10conny-kawohl_WMDE)
[12:41:24] <hashar>	 !log Restarting Zuul / CI
[12:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:10] <hashar>	 stupid systemd ...
[12:45:18] <wikibugs>	 (03PS1) 10Muehlenhoff: systemd/slice: Install systemd 241 from component/systemd241 [puppet] - 10https://gerrit.wikimedia.org/r/609419 (https://phabricator.wikimedia.org/T256877)
[12:47:57] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Unconditionally install systemd packages [puppet] - 10https://gerrit.wikimedia.org/r/609104 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff)
[12:52:09] <wikibugs>	 10Operations, 10netops: Upgrade Fastnetmon to 1.1.6 - https://phabricator.wikimedia.org/T257035 (10ayounsi)
[12:58:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove apt::pin for python3-prometheus-client-package [puppet] - 10https://gerrit.wikimedia.org/r/609420 (https://phabricator.wikimedia.org/T256877)
[12:59:52] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Aklapper) Removing #WMF-Legal (not sure why this tag was added)
[13:06:15] <wikibugs>	 (03PS1) 10Elukey: piwik: add binlog and server-id [puppet] - 10https://gerrit.wikimedia.org/r/609421 (https://phabricator.wikimedia.org/T234826)
[13:09:31] <wikibugs>	 10Operations, 10netops: Upgrade Fastnetmon to 1.1.6 - https://phabricator.wikimedia.org/T257035 (10MoritzMuehlenhoff) Steps to update the existing package on deneb:   ` apt-get source fastnetmon cd fastnetmon- 1.1.4-1~deb10u1 uupdate ../1.1.4.orig.tar.xz (some patches might be merged, drop those from debian/pa...
[13:16:19] <wikibugs>	 (03PS2) 10Elukey: piwik: add binlog to database config. [puppet] - 10https://gerrit.wikimedia.org/r/609421 (https://phabricator.wikimedia.org/T234826)
[13:19:33] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/23662/matomo1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/609421 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey)
[13:22:20] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:23:38] <wikibugs>	 (03PS4) 10Hashar: scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005)
[13:24:10] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Trivial rebase ;)" [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar)
[13:33:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, still not seeing the redirectmatch for /problems in latest PCC, though might be just a rebase missing?" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond)
[13:34:04] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:34:18] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Switch CI to profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff)
[13:45:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:46:16] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] Switch CI to profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff)
[13:48:54] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:51:46] <wikibugs>	 (03PS1) 10Ayounsi: Netflow: send as little options templates as possible [homer/public] - 10https://gerrit.wikimedia.org/r/609426 (https://phabricator.wikimedia.org/T240658)
[13:52:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:55:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Failover url downloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/609412 (owner: 10Muehlenhoff)
[13:56:10] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:59:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[13:59:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:14] <elukey>	 test cluster --^
[13:59:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:01:23] <godog>	 looks like the jobqueue is unhappy? top exception is JobQueueEventBus.php: Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable
[14:01:38] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:05:59] <_joe_>	 godog: so eventgate-main
[14:06:28] <_joe_>	 akosiaris: can you take a look? 
[14:07:19] <_joe_>	 https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=70&fullscreen&orgId=1&refresh=1m seems better
[14:07:27] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: WIP: Test something [puppet] - 10https://gerrit.wikimedia.org/r/609403
[14:08:28] * akosiaris looking
[14:09:10] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1311 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:09:18] <_joe_>	  uh
[14:09:21] <akosiaris>	 ?
[14:09:46] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:22] <_joe_>	 oh this is not new
[14:10:24] <akosiaris>	 eventgate-main reports way fewer messages right now https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=54&fullscreen&orgId=1&refresh=1m
[14:10:52] <akosiaris>	 scratch that
[14:11:00] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 200 OK - 84124 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:01] <_joe_>	 !log restarted php-fpm on wtp1033, stuck in sigill
[14:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:31] <akosiaris>	 we had a pretty big spike of purges it seems from 13:00 to 13:50
[14:11:34] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:11:41] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.stop-cluster (exit_code=99)
[14:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:30] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: WIP: Test something [puppet] - 10https://gerrit.wikimedia.org/r/609403
[14:37:10] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.stop-cluster.py: fix minor errors/details [cookbooks] - 10https://gerrit.wikimedia.org/r/609436 (https://phabricator.wikimedia.org/T244499)
[14:37:32] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat)
[14:38:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.stop-cluster.py: fix minor errors/details [cookbooks] - 10https://gerrit.wikimedia.org/r/609436 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey)
[14:42:56] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) mysql-aggregated ported. This was more involved. The steps were: 1. convert from `$dc` source var to `$site` query parameter 1. change the metric used for label_values to one that is prese...
[14:55:53] <wikibugs>	 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[14:57:41] <wikibugs>	 (03PS1) 10Elukey: profile::prometheus::analytics: rename mysql jobs [puppet] - 10https://gerrit.wikimedia.org/r/609440
[14:59:17] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] "That's a +1 and a <3 from the data-persistence team 😊" [puppet] - 10https://gerrit.wikimedia.org/r/609440 (owner: 10Elukey)
[14:59:56] <elukey>	 awww
[15:02:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[15:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:53] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "This change confuses me, mostly because of the gid 903 in data.yaml. Reading the documentation for sysusers.d it seems type 'u' cause the " [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[15:09:35] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
[15:09:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:19] * elukey dances
[15:21:45] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.change-distro.py: fix misc details [cookbooks] - 10https://gerrit.wikimedia.org/r/609442 (https://phabricator.wikimedia.org/T244499)
[15:21:58] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (10Kormat)
[15:22:09] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (10Kormat) p:05Triage→03Medium
[15:23:36] <wikibugs>	 (03PS3) 10Filippo Giunchedi: Move to Debian packaging [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/552486 (https://phabricator.wikimedia.org/T217340)
[15:26:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro.py: fix misc details [cookbooks] - 10https://gerrit.wikimedia.org/r/609442 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey)
[15:38:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Please note that this will rename the metrics and thus lose historical data, should be fine though" [puppet] - 10https://gerrit.wikimedia.org/r/609440 (owner: 10Elukey)
[15:39:34] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm
[15:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::prometheus::analytics: rename mysql jobs [puppet] - 10https://gerrit.wikimedia.org/r/609440 (owner: 10Elukey)
[15:41:53] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm
[15:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:38] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1118 weight to spread load more evenly', diff saved to https://phabricator.wikimedia.org/P11730 and previous config saved to /var/cache/conftool/dbconfig/20200703-154337-jynus.json
[15:43:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:59] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[15:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:15] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[15:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:52] <wikibugs>	 (03PS1) 10Reedy: Use $wgShellwgRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446
[15:58:31] <wikibugs>	 (03PS2) 10Reedy: Use $wgShellwgRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446
[15:59:08] <wikibugs>	 (03PS3) 10Reedy: Use $wgShellRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446
[15:59:29] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] Use $wgShellRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 (owner: 10Reedy)
[15:59:42] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Use $wgShellRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 (owner: 10Reedy)
[16:00:34] <wikibugs>	 (03Merged) 10jenkins-bot: Use $wgShellRestrictionMethod not $wgRestrictionMethod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609446 (owner: 10Reedy)
[16:02:09] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Rename wgRestrictionMethod to wgShellRestrictionMethod (duration: 00m 58s)
[16:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:27] <wikibugs>	 (03PS1) 10JMeybohm: add chartmuseum[12]001 to dhcp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/609449 (https://phabricator.wikimedia.org/T256970)
[16:07:29] <wikibugs>	 (03PS1) 10JMeybohm: Add cumin alias for chartmuseum hosts [puppet] - 10https://gerrit.wikimedia.org/r/609450 (https://phabricator.wikimedia.org/T256970)
[16:07:43] <wikibugs>	 (03CR) 10Privacybatm: [C: 03+1] Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm)
[16:09:38] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm)
[16:09:51] <wikibugs>	 (03PS7) 10Jcrespo: Transferer.py: Produce correct error when users try with a wrong host. [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm)
[16:19:53] <wikibugs>	 (03PS1) 10Elukey: Set BigTop for Hadoop master/standby/worker nodes. [puppet] - 10https://gerrit.wikimedia.org/r/609452 (https://phabricator.wikimedia.org/T244499)
[16:20:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set BigTop for Hadoop master/standby/worker nodes. [puppet] - 10https://gerrit.wikimedia.org/r/609452 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey)
[16:30:02] <wikibugs>	 (03PS14) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[16:31:55] <wikibugs>	 (03PS2) 10Privacybatm: Firewall.py: Solve auto port detection concurrency issue [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450)
[16:51:02] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10CDanis)
[16:51:34] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Wikimedia-General-or-Unknown: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy)
[16:51:48] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: Ifa929b2ad4 (duration: 00m 57s)
[16:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:24] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy)
[16:55:12] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ebe123) p:05Triage→03High To be more precise, the error is: > Could not execute LilyPond: /dev/nu...
[16:56:38] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) >>! In T257066#6277557, @Ebe123 wrote: > To be more precise, the error is: >> Could not execut...
[16:56:53] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, 10Security: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10CDanis) >>! In T257066#6277557, @Ebe123 wrote: > To be more precise, the error is: >> Could not execu...
[17:08:26] <RhinosF1>	 Reedy, cdanis: pm?
[17:08:34] <Reedy>	 Sure
[17:09:40] * RhinosF1 sent pm
[17:26:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "looks pretty ok, but do add an entry in site.pp for those hosts with role(insetup)" [puppet] - 10https://gerrit.wikimedia.org/r/609449 (https://phabricator.wikimedia.org/T256970) (owner: 10JMeybohm)
[17:26:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] add chartmuseum[12]001 to dhcp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/609449 (https://phabricator.wikimedia.org/T256970) (owner: 10JMeybohm)
[17:36:08] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[17:37:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:37:32] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[17:37:32] <icinga-wm>	 received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:37:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:37:39] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on icinga1001 is CRITICAL: 0.04156 lt 0.3 https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qe04i0ld9cvl https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[17:37:44] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef
[17:37:44] <icinga-wm>	 s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:37:50] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:37:54] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:37:58] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:38:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:38:14] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:38:15] <apergos>	 oh come on
[17:38:26] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:38:35] <rzl>	 👋
[17:38:38] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:39:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:39:00] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was rece
[17:39:00] <icinga-wm>	 tech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:39:01] <_joe_>	 it's a spike of requests
[17:39:14] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:39:17] <_joe_>	 15k reqps
[17:39:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:39:28] <_joe_>	 gimme 5 mins and I'll be at my computer
[17:39:43] <_joe_>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-15m&to=now
[17:39:55] <_joe_>	 someone look at the logs and find out what's going on
[17:39:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:40:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:40:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:40:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:40:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:43:02] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-
[17:43:40] <shdubsh>	 logstash indicates lots of cirrussearch-too-busy-error
[17:44:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:44:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:44:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:45:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:45:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:45:27] <wikibugs>	 (03PS1) 10Krinkle: Temporarily turn off LilyPond execution [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609470 (https://phabricator.wikimedia.org/T257062)
[17:45:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:45:39] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Temporarily turn off LilyPond execution [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609470 (https://phabricator.wikimedia.org/T257062) (owner: 10Krinkle)
[17:46:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:46:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:46:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:46:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:46:22] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:46:22] <wikibugs>	 (03Merged) 10jenkins-bot: Temporarily turn off LilyPond execution [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609470 (https://phabricator.wikimedia.org/T257062) (owner: 10Krinkle)
[17:46:23] <_joe_>	 shdubsh: yes that's expected
[17:46:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:46:44] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:46:56] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:47:02] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:47:04] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:47:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:47:06] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:47:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:47:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:47:38] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:47:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:47:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:48:36] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[17:48:36] <wikibugs>	 (03PS1) 10Krinkle: noc: improve tab anchor links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609471
[17:48:45] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on icinga1001 is OK: (C)0.3 lt (W)0.5 lt 0.7015 https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qe04i0ld9cvl https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[17:48:45] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] noc: improve tab anchor links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609471 (owner: 10Krinkle)
[17:49:08] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:51:04] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-data
[17:51:04] <icinga-wm>	 etheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[17:52:58] <wikibugs>	 (03Merged) 10jenkins-bot: noc: improve tab anchor links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609471 (owner: 10Krinkle)
[17:55:34] <wikibugs>	 (03PS1) 10Majavah: Remove "Create a book" link from sidebar on Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609472 (https://phabricator.wikimedia.org/T257073)
[17:58:30] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[18:02:02] <wikibugs>	 (03CR) 10RhinosF1: [C: 03+1] Remove "Create a book" link from sidebar on Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609472 (https://phabricator.wikimedia.org/T257073) (owner: 10Majavah)
[18:08:48] <wikibugs>	 (03PS1) 10CDanis: vcl: ratelimit search API calls [puppet] - 10https://gerrit.wikimedia.org/r/609475
[18:09:30] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 19.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:10:46] <wikibugs>	 (03PS2) 10CDanis: vcl: ratelimit search API calls [puppet] - 10https://gerrit.wikimedia.org/r/609475
[18:12:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM could also add "std.ip(req.http.X-Client-IP, "192.0.2.1") ~ public_cloud_nets " to the clause to tie it to clouds" [puppet] - 10https://gerrit.wikimedia.org/r/609475 (owner: 10CDanis)
[18:12:31] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "LGTM as an emergency patch to keep handy; someone else should review more thoughtfully for longer-term use" [puppet] - 10https://gerrit.wikimedia.org/r/609475 (owner: 10CDanis)
[18:16:52] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:20:18] <wikibugs>	 (03PS1) 10Jbond: varnish: Rate limit cloud providers for all requests [puppet] - 10https://gerrit.wikimedia.org/r/609477
[18:24:50] <wikibugs>	 (03PS1) 10CDanis: vcl: public_clouds_shutdown: ratelimit API reqs as well [puppet] - 10https://gerrit.wikimedia.org/r/609480
[18:25:27] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Bugreporter) p:05High→03Unbreak!
[18:40:44] <icinga-wm>	 PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[18:41:59] <andre__>	 For the record, I'm seeing intermittent DB connection problems on Phabricator, e.g. the browser showing "Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL)." and/or "Unable to establish a connection to any database host phuser@m3-master.eqiad.wmnet."
[18:42:04] <Majavah>	 Phab down?
[18:42:57] <Majavah>	 Hmh, appears to be working again
[18:45:21] <joal>	 Hi ops people - would any of you help a poor analytics-engineer restart a service?
[18:45:36] <joal>	 see the above error message about hive-server
[18:47:31] <cdanis>	 !log ✔️ cdanis@an-coord1001.eqiad.wmnet ~ 🕒☕ sudo systemctl restart hive-server2.service 
[18:47:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:06] <icinga-wm>	 RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[18:48:13] <joal>	 \o/ thanks a lot cdanis :)
[18:49:27] <andre__>	 Majavah: Phab still has DB issues; they are just intermittent.
[18:50:20] * joal goes restarting failed jobs
[18:53:46] <dancy>	 I'm getting Phab errors too.
[19:02:24] <icinga-wm>	 PROBLEM - SSH on an-coord1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:04:12] <joal>	 I just killed a process on an-coord1001 that is probably the source of the issues above --^
[19:04:53] <joal>	 right - memory full
[19:04:57] <joal>	 sorry for that
[19:06:41] <dancy>	 Better now. Thanks!
[19:11:26] <icinga-wm>	 RECOVERY - SSH on an-coord1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:02:03] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Dsharpe) An issue is being diagnosed involving this extension, and it will likely remain down until a...
[20:07:21] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403
[20:13:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:18:13] <wikibugs>	 (03PS1) 10Peter.ovchyn: Rename WPBSkinBlacklist to WPBSkinDisabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675)
[20:19:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:23:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:26:52] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:33:21] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403
[21:03:15] <wikibugs>	 (03PS1) 10Krinkle: Remove bogus $wgWMEPhp7SamplingRate setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609494 (https://phabricator.wikimedia.org/T219127)
[21:04:44] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403
[21:05:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1002/23669/" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris)
[21:21:50] <wikibugs>	 (03PS1) 10Reedy: Add a maintenance script to get all LY files [extensions/Score] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/609303
[21:21:54] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Add a maintenance script to get all LY files [extensions/Score] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/609303 (owner: 10Reedy)
[21:23:37] <wikibugs>	 10Operations, 10MediaWiki-Shell, 10Wikimedia-General-or-Unknown, 10Security, 10Sustainability (Incident Prevention): Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (10Krinkle)
[21:31:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] puppetmaster::frontend: add hiera calls and type validation [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond)
[21:35:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Thanks for splitting this out" [puppet] - 10https://gerrit.wikimedia.org/r/609186 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond)
[21:37:12] <wikibugs>	 (03Merged) 10jenkins-bot: Add a maintenance script to get all LY files [extensions/Score] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/609303 (owner: 10Reedy)
[21:41:50] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 63 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:42:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:42:52] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10DannyS712) p:05Unbreak!→03High Move back to high - this is just the public tracking task, the act...
[21:45:52] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:49:10] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.35.0-wmf.39/extensions/Score/: Sync maintenance script (duration: 00m 58s)
[21:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:26] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:53:16] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:53:30] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas