[00:00:43] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:39] <icinga-wm>	 PROBLEM - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 4 days ago: Most recent backup 2019-09-08 23:30:18 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:09:20] <XioNoX>	 will deal with netflow2001 later, on my phone right now
[00:24:35] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:59] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:37:09] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:33] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:07:11] <XioNoX>	 !log enable netflow sampling on cr2-codfw
[01:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:06] <XioNoX>	 !log add IPv6 sampling to cr1-eqiad
[01:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:33] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ayounsi known issue, will investigate https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:07] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:09] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:17] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 135165784 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:48:51] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 45928 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:05:30] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (Dzahn)
[03:30:30] <wikibugs>	 (PS10) DannyS712: Fix typos in code [puppet] - https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491)
[03:44:28] <wikibugs>	 (PS3) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392)
[03:44:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli)
[04:12:40] <wikibugs>	 (PS4) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392)
[04:18:07] <wikibugs>	 (CR) Effie Mouzeli: [V: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1001/18274/mwmaint1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli)
[04:26:19] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[04:27:55] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[04:42:30] <wikibugs>	 Operations, serviceops: Confd died on bast3002 - https://phabricator.wikimedia.org/T227592 (jijiki) Open→Resolved a:jijiki It has not happened again, Resolving for now.
[04:48:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli)
[04:48:53] <wikibugs>	 (PS1) Tim Starling: Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613)
[04:49:04] <wikibugs>	 (PS1) Vgutierrez: ATS: Set a explicit SSL Session cache timeout [puppet] - https://gerrit.wikimedia.org/r/536399 (https://phabricator.wikimedia.org/T231849)
[04:52:05] <wikibugs>	 (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18275/" [puppet] - https://gerrit.wikimedia.org/r/536399 (https://phabricator.wikimedia.org/T231849) (owner: Vgutierrez)
[04:53:50] <vgutierrez>	 !log restarting ats-tls on cp4021 and cp2002 to pick up the new SSL session cache timeout - T231849
[04:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:53] <stashbot>	 T231849: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849
[04:57:12] <wikibugs>	 (PS1) Tim Starling: In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400
[04:58:37] <wikibugs>	 (PS2) Tim Starling: In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400 (https://phabricator.wikimedia.org/T232613)
[05:05:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling)
[05:05:25] <wikibugs>	 (PS3) Giuseppe Lavagetto: In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling)
[05:09:37] <wikibugs>	 (PS1) Dzahn: puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262)
[05:11:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262) (owner: Dzahn)
[05:17:33] <effie>	 !log Rolling restart php-fpm across the fleet for 536400
[05:17:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:11] <wikibugs>	 (PS2) Dzahn: puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262)
[05:22:39] <icinga-wm>	 PROBLEM - PHP opcache health on mw1333 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:23:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262) (owner: Dzahn)
[05:24:21] <icinga-wm>	 PROBLEM - PHP opcache health on mw1328 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:25:10] <effie>	 ^ will take care of it 
[05:25:45] <icinga-wm>	 PROBLEM - PHP opcache health on mw1241 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:25:49] <icinga-wm>	 RECOVERY - PHP opcache health on mw1333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:25:57] <icinga-wm>	 RECOVERY - PHP opcache health on mw1328 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:26:47] <icinga-wm>	 PROBLEM - PHP opcache health on mw1324 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:27:17] <icinga-wm>	 PROBLEM - PHP opcache health on mw1254 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:27:19] <icinga-wm>	 RECOVERY - PHP opcache health on mw1241 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:28:23] <icinga-wm>	 RECOVERY - PHP opcache health on mw1324 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:28:51] <icinga-wm>	 RECOVERY - PHP opcache health on mw1254 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:29:05] <icinga-wm>	 PROBLEM - PHP opcache health on mw1249 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:29:33] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:29:49] <icinga-wm>	 PROBLEM - PHP opcache health on mw1325 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:29:50] <wikibugs>	 (PS3) Dzahn: puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262)
[05:30:41] <icinga-wm>	 RECOVERY - PHP opcache health on mw1249 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:30:59] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:31:25] <icinga-wm>	 RECOVERY - PHP opcache health on mw1325 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:33:52] <wikibugs>	 (CR) Effie Mouzeli: [V: +1 C: +2] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli)
[05:34:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling)
[05:35:05] <wikibugs>	 (CR) Dzahn: [C: +1] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli)
[05:35:25] <icinga-wm>	 PROBLEM - PHP opcache health on mw1227 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:36:49] <wikibugs>	 (PS5) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392)
[05:36:53] <icinga-wm>	 PROBLEM - PHP opcache health on mw1347 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:36:59] <icinga-wm>	 RECOVERY - PHP opcache health on mw1227 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:38:13] <icinga-wm>	 PROBLEM - PHP opcache health on mw1247 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:38:29] <icinga-wm>	 RECOVERY - PHP opcache health on mw1347 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:39:29] <wikibugs>	 (PS6) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392)
[05:39:49] <icinga-wm>	 RECOVERY - PHP opcache health on mw1247 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:40:03] <wikibugs>	 (PS4) Dzahn: puppetize setting advmss (MTU) size for GRE mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602)
[05:40:22] <_joe_>	 !log live-hacking mw1348, setting rlimit_core = unlimited to allow core dumps to be taken
[05:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:53] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[05:41:46] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli)
[05:42:29] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[05:43:45] <wikibugs>	 (PS5) Dzahn: puppetize setting advmss (MTU) size for GRE tunnel mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602)
[05:43:48] <wikibugs>	 Operations, MediaWiki-extensions-Mailgun, cloud-services-team, serviceops, and 5 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki)
[05:44:01] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:17] <wikibugs>	 Operations, serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (jijiki)
[05:44:19] <wikibugs>	 Operations, MediaWiki-extensions-Mailgun, cloud-services-team, serviceops, and 5 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki) Open→Resolved
[05:44:30] <wikibugs>	 Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (jijiki)
[05:45:56] <wikibugs>	 (CR) Dzahn: "set: https://puppet-compiler.wmflabs.org/compiler1002/18281/cobalt.wikimedia.org/change.cobalt.wikimedia.org.pson" [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: Dzahn)
[05:48:53] <wikibugs>	 (PS6) Dzahn: puppetize setting advmss (MTU) size for GRE tunnel mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602)
[05:52:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling)
[05:52:55] <wikibugs>	 (Merged) jenkins-bot: Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling)
[05:53:11] <wikibugs>	 (CR) jenkins-bot: Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling)
[05:58:01] <logmsgbot>	 !log oblivian@deploy1001 Synchronized w/fatal-error.php: Adding core dump function to fatal-error (duration: 01m 04s)
[05:58:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:20] <wikibugs>	 (PS3) Elukey: Add config for wmf_netflow to Turnilo [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria)
[06:04:58] <wikibugs>	 (CR) Elukey: "Tried to add the config on the fly on an-tool1007 and I get the following error in the logs:" [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria)
[06:15:19] <wikibugs>	 (CR) Elukey: Add config for wmf_netflow to Turnilo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria)
[06:32:50] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "-2 because while the patch is technically correct, I think this is noise." [puppet] - https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: Herron)
[06:35:33] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[06:37:51] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:05] <wikibugs>	 (CR) Giuseppe Lavagetto: "Going in the right direction, code could still be simpler if you just declare metrics_manager as a class property of EndpointRequest." (1 comment) [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[06:48:15] <wikibugs>	 (PS4) Elukey: Add config for wmf_netflow to Turnilo [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria)
[06:49:05] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[06:49:59] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:50:39] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[06:51:31] <_joe_>	 uhm
[06:51:35] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:52:29] <_joe_>	 that's mostly maps AIUI
[06:53:19] <wikibugs>	 (CR) Elukey: [C: +2] "I noticed in the Druid UI that some of the more recent hours have multiple segments, and some of them with 1 dimensions. This might be due" [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria)
[06:57:43] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:01:43] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:03:19] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:07:15] <effie>	 should we start looking into this alert that keeps popping up?
[07:13:23] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:14:59] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:15:08] <icinga-wm>	 PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[07:16:42] <icinga-wm>	 RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[07:22:23] <wikibugs>	 Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (jcrespo) The raid is back \o/  ` root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Firmware state' Firmware state: Unconfigured(good), Spun Up Firmware state: Unconfigured(...
[07:23:57] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:24:29] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:24:51] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:24:59] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_upload site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:27:05] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:27:39] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:28:01] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:28:09] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:38:09] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:39:43] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:42:22] <wikibugs>	 Operations, ops-eqiad, Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['elastic1046.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[07:45:58] <wikibugs>	 (PS2) Muehlenhoff: Remove obsolete restbase hiera settings [puppet] - https://gerrit.wikimedia.org/r/536204
[08:25:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] profile: use prometheus for logstash alerting (2 comments) [puppet] - https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[08:25:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "> Patch Set 2: Code-Review-1" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[08:33:13] <_joe_>	 !log restarting kartotherian on maps1003, all workers seem stuck
[08:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:55] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1002 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:34:06] <onimisionipe>	 there we go
[08:34:19] <_joe_>	 onimisionipe: I didn't expect anything different
[08:34:45] <onimisionipe>	 let's do a rolling restart on all maps eqiad for kartotherian
[08:34:55] <onimisionipe>	 you have the cumin power :)
[08:35:14] <onimisionipe>	 I see the same pattern in maps1004 and maps1001
[08:35:33] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:35:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:36:02] <_joe_>	 onimisionipe: yeah I'm going to 
[08:36:13] <_joe_>	 but not via cumin
[08:37:09] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:37:12] <_joe_>	 !log rolling restart of kartotherian
[08:37:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:37] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:38:37] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:40:29] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) is CRITICAL: Could not fetch url http://10.64.48.154:6533/v4/marker/pin-m-fuel+ffffff.png: Generic connection error: HTTPConnectionPool(host=u10.64.48.154, port=6533): Max retries exceeded with url: /v4/marker/pin-m-fuel+ffffff.png (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f4da1b31a5
[08:40:29] <icinga-wm>	 blish a new connection: [Errno 111] Connection refused,)): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Could not fetch url http://10.64.48.154:6533/v4/marker/pin-m-fuel+ffffff@2x.png: Generic connection error: HTTPConnectionPool(host=u10.64.48.154, port=6533): Max retries exceeded with url: /v4/marker/pin-m-fuel+ffffff@2x.png (Caused by NewConnectionError(urllib3.connection.HTTPConnection
[08:40:29] <icinga-wm>	 a1b31ad0: Failed to establish a new connection: [Errno 111] Connection refused,)) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:40:43] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:41:14] <icinga-wm>	 PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /_info (Untitled test) timed out before a response was re
[08:41:14] <icinga-wm>	 r/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[08:41:23] <_joe_>	 uh
[08:41:25] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:41:44] <_joe_>	 that's because of the restarts I hope?
[08:41:57] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:42:08] <_joe_>	 no, kartotherian is just not responding
[08:42:19] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:43:01] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:43:33] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:43:35] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:43:38] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:43:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:44:04] <icinga-wm>	 PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[08:44:23] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:45:06] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:45:29] <gehel>	 !log stop tilerator on maps to help reduce load
[08:45:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:35] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:45:59] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:46:01] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:09] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:47:31] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1046.eqiad.wmnet'] `  Of which those **FAILED**: ` ['elastic1046.eqiad.wmnet'] `
[08:47:38] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset
[08:47:38] <logmsgbot>	 !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
[08:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:58] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset
[08:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:20] <logmsgbot>	 !log jmm@cumin2001 Updating IPMI password on 1 hosts - jmm@cumin2001
[08:48:20] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
[08:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:23] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:49:04] <icinga-wm>	 RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[08:50:01] <icinga-wm>	 PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:01] <icinga-wm>	 PROBLEM - Check systemd state on maps2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:11] <wikibugs>	 (03PS4) 10Vgutierrez: Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298)
[08:50:15] <icinga-wm>	 PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:50:23] <onimisionipe>	 there we go
[08:50:33] <icinga-wm>	 PROBLEM - Check systemd state on maps1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:41] <icinga-wm>	 PROBLEM - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:45] <icinga-wm>	 PROBLEM - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:50:55] <icinga-wm>	 PROBLEM - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:51:05] <icinga-wm>	 PROBLEM - Check systemd state on maps2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:05] <icinga-wm>	 PROBLEM - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:51:13] <icinga-wm>	 PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:51:13] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:25] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[08:51:35] <icinga-wm>	 PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:51:41] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:51:42] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:51:56] <_joe_>	 godog: let's downtime it?
[08:52:01] <godog>	 yeah, doing
[08:52:26] <_joe_>	 thanks!
[08:52:49] <icinga-wm>	 RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:52:53] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.668 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[08:52:56] <godog>	 !log downtime kartotherian pages for 1h
[08:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:12] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 2.459 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:53:12] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[08:53:15] <icinga-wm>	 RECOVERY - Check systemd state on maps2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:18] <godog>	 that's only eqiad downtimed btw
[08:54:19] <icinga-wm>	 RECOVERY - Check systemd state on maps2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:56:25] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:57:15] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[08:57:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:27] <_joe_>	 gehel:^^ depooled eqiad right now
[08:57:39] <_joe_>	 not sure it will work for the caches as well
[08:57:44] <gehel>	 thanks
[08:57:50] <_joe_>	 vgutierrez: all the upload caches backends are now ATS right?
[08:57:56] <vgutierrez>	 yes
[08:57:57] <vgutierrez>	 that's right
[08:58:13] <_joe_>	 so they use the discovery url I guess to reach maps
[08:58:31] <icinga-wm>	 PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:58:40] <vgutierrez>	 _joe_: that's right
[08:58:46] <_joe_>	 gehel: I will set elastic1017 and 1046 to pooled=inactive, they're polluting pybal's logs
[08:59:00] <_joe_>	 Sep 13 08:58:49 lvs1015 pybal[658]: [kartotherian-ssl_443] ERROR: Could not depool server maps1004.eqiad.wmnet because of too many down!
[08:59:01] <icinga-wm>	 PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:03] <icinga-wm>	 PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:59:04] <_joe_>	 so still no dice
[08:59:05] <icinga-wm>	 PROBLEM - Check systemd state on maps2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:11] <icinga-wm>	 PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[08:59:12] <gehel>	 _joe_: ok, thanks
[08:59:25] <vgutierrez>	 _joe_: specifically map http://maps.wikimedia.org https://kartotherian.discovery.wmnet
[08:59:35] <icinga-wm>	 PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:35] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[08:59:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:11] <_joe_>	 can someone check that all the alerts on maps on codfw are from tilerator please
[09:00:24] <vgutierrez>	 and of course ATS is screaming (20190913.09h00m04s CONNECT: could not connect to 10.2.1.13 for 'https://kartotherian.discovery.wmnet/osm-intl/7/41/-15.png' (setting last failure time) connect_result=5)
[09:00:32] <vgutierrez>	 (that's cp5001 BTW)
[09:00:40] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=elastic1017.eqiad.wmnet
[09:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:12] <_joe_>	 so maps in codfw are now down as well?
[09:01:25] <vgutierrez>	 it looks like it
[09:01:38] <_joe_>	 can someone check pybal in codfw please?
[09:01:59] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=elastic1046.eqiad.wmnet
[09:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:02] <vgutierrez>	 checking
[09:02:57] <vgutierrez>	 _joe_: proxyfetch is not happy: Sep 13 07:02:20 lvs2003 pybal[45433]: [kartotherian-ssl_443] ERROR: Monitoring instance ProxyFetch reports server maps2001.codfw.wmnet (enabled/up/pooled) down: Getting https://localhost/ took longer than 5 seconds.
[09:03:02] <_joe_>	 because the cpu has still not spiked there
[09:03:06] <_joe_>	 sigh
[09:03:11] <vgutierrez>	 oh fuck.. that's 2 hours ago
[09:03:13] <vgutierrez>	 forget about it
[09:03:18] <_joe_>	 ok 
[09:03:20] <_joe_>	 :D
[09:03:35] <_joe_>	 a tail -f should be enough to understand the current state
[09:03:40] <vgutierrez>	 last message: Sep 13 07:02:30 lvs2003 pybal[45433]: [kartotherian-ssl_443] INFO: Server maps2001.codfw.wmnet (enabled/partially up/not pooled) is up
[09:03:56] <_joe_>	 the cpu is spiking though
[09:04:04] <vgutierrez>	 and maps2002 just failed
[09:04:05] <vgutierrez>	 Sep 13 09:03:38 lvs2003 pybal[45433]: [kartotherian-ssl_443 ProxyFetch] WARN: maps2002.codfw.wmnet (enabled/up/pooled): Fetch failed (https://localhost/), 5.004 s
[09:04:11] <_joe_>	 yeah
[09:04:31] <gehel>	 crap
[09:04:43] <_joe_>	 I expected as much
[09:05:02] <_joe_>	 vgutierrez: just on the ssl endpoint or also on the non-ssl one?
[09:05:15] <vgutierrez>	 SSL only
[09:05:28] <vgutierrez>	 now is time for 2003..
[09:05:50] <vgutierrez>	 and pybal is already below the depool threshold.. so keeping pooled failed servers
[09:06:40] <vgutierrez>	 _joe_: what's handling TLS for kartotherian? envoy?
[09:06:50] <godog>	 I'm silencing pages in karto codfw too, it is going to page too soon anyways
[09:07:21] <gehel>	 any idea to track those json errors to a specific URL? No context in the kartotherian logs
[09:08:17] <godog>	 !log downtime kartotherian pages for 1h in codfw
[09:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:30] <vgutierrez>	 oh.. nginx there
[09:09:59] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:10:11] <_joe_>	 sigh
[09:10:15] <_joe_>	 let's repool eqiad
[09:10:21] <_joe_>	 this is doing nothing good anyways
[09:10:39] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:11:06] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[09:11:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:15] <_joe_>	 we proved the problem is tied to the requests
[09:11:26] <onimisionipe>	 yep
[09:12:18] <gehel>	 I'm suspecting the geoline endpoint (https://maps.wikimedia.org/geoline), but no proof
[09:12:41] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps2001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[09:12:47] <gehel>	 where would be a good place to drop this traffic and see if that mitigates the issue?
[09:13:09] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:14:09] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[09:14:22] <godog>	 I'd say either varnish on frontend or nginx on maps hosts
[09:14:43] <vgutierrez>	 varnish-fe would be the most efficient place to do it yeah
[09:15:01] <gehel>	 vgutierrez: is that easy to do?
[09:15:08] * gehel has no idea about varnish
[09:15:27] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:16:17] <_joe_>	 vgutierrez: why not in the nginx for maps? but yeah whatever you choose
[09:16:28] <vgutierrez>	 _joe_: edge VS backend
[09:16:30] <vgutierrez>	 but yeah
[09:16:37] <vgutierrez>	 it would work as well
[09:16:51] <_joe_>	 let's do it in vcl, if you feel confident about that
[09:17:02] <jbond42>	 fyi the error i saw on maps2004 was not the same geoip error 
[09:17:03] <jbond42>	 Error: GroupId not available\n    at dm.loadGroups.then.dataGroups (/srv/deployment/kartotherian/deploy-cache/revs/c4c9e8b9dcc0747a8329b061408ddda32378ca5e/node_modules/@kartotherian/snapshot/lib/mapdataLoader.js:53:19
[09:17:23] <_joe_>	 that is apparently common and well known
[09:17:28] <jbond42>	 ack 
[09:17:44] <vgutierrez>	 _joe_: something like https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/upload-frontend.inc.vcl.erb#L129-L135
[09:18:09] <gehel>	 jbond42: sadly: T158657
[09:18:10] <stashbot>	 T158657: Kartotherian error: GroupId not available - https://phabricator.wikimedia.org/T158657
[09:18:28] <vgutierrez>	 gehel: which URLs need to be blocked?
[09:18:48] <gehel>	 https://maps.wikimedia.org/geoline
[09:18:55] <gehel>	 well, anything starting with that
[09:19:17] <vgutierrez>	 ack
[09:19:54] <gehel>	 this is me guessing that the json errors are the core issue and that they are related to this endpoint. But I don't have anything better to propose
[09:19:54] <_joe_>	 can I suggest you add such a block manually in one nginx backend
[09:20:02] <_joe_>	 and see if it solves the issue on that node?
[09:20:27] <wikibugs>	 10Operations, 10Discovery, 10Maps: Wikimedia maps unstability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10jcrespo)
[09:20:34] <jynus>	 ^ gehel
[09:21:07] <jynus>	 feel free to correct any misunderstanding I have written
[09:21:40] <gehel>	 _joe_: will do
[09:21:47] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[09:22:43] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoline [puppet] - 10https://gerrit.wikimedia.org/r/536549
[09:23:50] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoline [puppet] - 10https://gerrit.wikimedia.org/r/536549
[09:23:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536549 (owner: 10Vgutierrez)
[09:24:01] <vgutierrez>	 jbond42: two seconds? :)
[09:24:16] <gehel>	 !log deny access to /geoline on maps1004 - T232817
[09:24:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:19] <stashbot>	 T232817: Wikimedia maps unstability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817
[09:24:20] <vgutierrez>	 between PS2 and your review, that's impressive
[09:24:21] <jbond42>	 its a simple one :)
[09:24:34] <vgutierrez>	 jbond42: syntax error on PS1 though :(
[09:25:01] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[09:25:06] <jbond42>	 oh yes :#
[09:25:23] <vgutierrez>	 PS2 looks ok though
[09:26:03] <jbond42>	 yes looks good 
[09:26:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] varnish: Rate limit heavily maps.wm.o/geoline [puppet] - 10https://gerrit.wikimedia.org/r/536549 (owner: 10Vgutierrez)
[09:26:27] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:26:41] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:26:57] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_upload site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:27:20] <vgutierrez>	 gehel: maps1004 looks better without /geoline traffic?
[09:27:40] <gehel>	 !log restart kartotherian on maps1004 - T232817
[09:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:57] <gehel>	 vgutierrez: just restarted kartotherian, does not really look much better
[09:28:07] <wikibugs>	 10Operations, 10Discovery, 10Maps: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10jcrespo)
[09:28:25] <_joe_>	 gehel: the cpu usage is more humane for now though
[09:28:53] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:29:09] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:29:23] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:29:31] <gehel>	 slightly better, maybe, let's wait 5 minutes see what happens
[09:29:52] <vgutierrez>	 ack
[09:30:48] <vgutierrez>	 gehel: do we have metrics about maps endpoints being hit?
[09:30:53] <vgutierrez>	 or just webrequest?
[09:31:36] <gehel>	 no specific metrics on the maps side
[09:33:07] <gehel>	 load seems to be decreasing on other maps servers as well, no idea why
[09:33:56] <vgutierrez>	 I don't see any crazy spike on /geoline
[09:34:04] <jbond42>	 could it be the geoshape endpoint?
[09:34:07] <icinga-wm>	 PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:43] <vgutierrez>	 https://w.wiki/8J5
[09:34:58] <vgutierrez>	 jbond42: geoshape gets more traffic but nothing crazy checking the last 7 days
[09:35:13] <vgutierrez>	 but my knowledge of maps.wm.o is /dev/null
[09:35:29] <icinga-wm>	 RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:16] <jbond42>	 i turned on access logging on maps1003 to get a sample and i see this in the log
[09:36:20] <jbond42>	 10.64.16.25 - - [13/Sep/2019:09:26:38 +0000] "GET 
[09:36:22] <jbond42>	 /geoshape?getgeojson=1&query=++SELECT+%3Fid+++%28if%28%3Fid+%3D+wd%3AQ5970%2C+%27%23C12838%27%2C+%27%2307c63e%27%29+as+%3Ffill%29+++%28if%28BOUND%28%3Flink%29%2C+++++++concat%28%27%5B%5B%27%2C+substr%28str%28%3Flink%29%2C31%2C500%29%2C++%27%7C%27%2C+%3FidLabel%2C+%27%5D%5D%27%29%2C+++++++%3FidLabel%29++++as+%3Ftitle%29+WHERE+%7B+++%7B+%3Fid+p%3AP31%2Fps%3AP31%2Fwdt%3AP279%2A+wd%3AQ22865+.+++h
[09:36:28] <jbond42>	 int%3APrior+hint%3Agearing+%27forward%27++%7D+++UNION+++%7B+%3Fid+p%3AP31%2Fps%3AP31%2Fwdt%3AP279%2A+wd%3AQ106658+.++++hint%3APrior+hint%3Agearing+%27forward%27++%7D+++++%3Fid+p%3AP131+%3Flocst+.+++%3Flocst+ps%3AP131%2Fwdt%3AP131%2A+wd%3AQ1197.+++hint%3APrior+hint%3Agearing+%27forward%27++++MINUS+%7B+%3Flocst+pq%3AP582+%5B%5D+%7D+++MINUS+%7B+%3Fid+wdt%3AP576+%5B%5D+%7D+++SERVICE+wikibase%3Ala
[09:36:34] <jbond42>	 bel+%7B+++++bd%3AserviceParam+wikibase%3Alanguage+%27en%27+.+++++%3Fid+rdfs%3Alabel+%3FidLabel+.+++%7D+++OPTIONAL+%7B%3Flink+schema%3Aabout+%3Fid+.+++%3Flink+schema%3AisPartOf+%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E+.+%7D+%7D++GROUP+BY+%3Fid+%3Flink+%3FidLabel++ HTTP/1.1" 200 699067 "-" "kartotherian-getJSON (yurik @ wikimedia)"
[09:36:39] <jbond42>	 maps1003 ~ %
[09:36:40] <jbond42>	 is that normal? the SELECT jumped out at me
[09:36:52] <vgutierrez>	 I think so
[09:36:53] <gehel>	 sadly, this is "normal"
[09:36:58] <vgutierrez>	 it's a feature not a bug
[09:37:02] * vgutierrez hides
[09:37:08] <jbond42>	 ahh ok :S
[09:37:10] <gehel>	 this is a SPARQL query to WDQS
[09:37:15] <jbond42>	 ahh
[09:37:48] <gehel>	 I'll drop geoshape as well on maps1004, see if that changes anything
[09:37:55] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:08] <jynus>	 question: was caching in general affected? Because I can see tiles not loading, but the js didn't use to load for me either
[09:38:32] <gehel>	 !log drop /geoshape and restart kartotherian on maps1004 - T232817
[09:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:35] <stashbot>	 T232817: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817
[09:38:56] <vgutierrez>	 jynus: I don't think so, maps.wikimedia.org opens just fine here via eqsin
[09:39:09] <jynus>	 vgutierrez: it didn't for me (it started recently)
[09:39:38] <jynus>	 but maybe at app side the ui doesn't load if tiles didn't, so don't take my word for it
[09:39:43] <vgutierrez>	 esams, right?
[09:39:45] <jynus>	 yep
[09:40:06] <onimisionipe>	 maps.wm.o works fine for me
[09:40:07] <jynus>	 just I was expecting grey squares and the minimal ui
[09:40:17] <jynus>	 yes, it works for me too now
[09:40:27] <jynus>	 I mean during the peak of the issues
[09:40:42] <godog>	 !log install linux-perf-4.9 on maps1002 and attempt to capture a stack sample
[09:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:43] <gehel>	 majority of the tiles should be in cache
[09:40:57] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +2] "thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/536163 (owner: Alexandros Kosiaris)
[09:41:00] <wikibugs>	 (PS3) Alexandros Kosiaris: blubberoid: Remove monitoring support [deployment-charts] - https://gerrit.wikimedia.org/r/536163
[09:41:08] <jynus>	 none loaded to me and I was at a low zoom
[09:41:27] <jynus>	 just reporting, don't need any actionable as things are now working
[09:41:28] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +2] blubberoid: Remove monitoring support [deployment-charts] - https://gerrit.wikimedia.org/r/536163 (owner: Alexandros Kosiaris)
[09:42:08] <gehel>	 looks like dropping /geoline has some positive impact, but waiting a few more minutes to see if I'm just dreaming or not
[09:42:27] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:56] <logmsgbot>	 !log @ helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[09:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:00] <vgutierrez>	 gehel: you mean geoshape?
[09:43:07] <jbond42>	 is this meant to be the case on the maps servers?
[09:43:08] <jbond42>	 ● tilerator.service masked failed failed tilerator.service 
[09:43:18] <gehel>	 right, geoshape
[09:43:21] <vgutierrez>	 :)
[09:43:53] <gehel>	 jbond42: yes, I killed tilerator to reduce load, did not seem to help much
[09:43:59] <jbond42>	 ack
[09:44:09] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[09:44:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:03] <gehel>	 ok, I think that maps1004 is behaving better since dropping geoshape, I'll re-enable geoline and see how that looks, and then we can apply this more widely
[09:45:12] <vgutierrez>	 gehel: ack
[09:45:19] <vgutierrez>	 I'll update the CR for varnish
[09:45:22] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[09:45:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:02] <gehel>	 !log re-enabling /geoline on maps1004 - T232817
[09:46:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:04] <stashbot>	 T232817: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817
[09:46:30] <wikibugs>	 (PS3) Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoshape [puppet] - https://gerrit.wikimedia.org/r/536549
[09:47:00] <apergos>	 geoline? or geoshape? I am now quite confused
[09:47:44] <vgutierrez>	 apergos: geoshape now
[09:47:52] <gehel>	 geoshape is still banned, geoline is enabled
[09:47:55] <vgutierrez>	 feel free to check the SAL though
[09:47:58] <apergos>	 ah gotcha
[09:48:24] <gehel>	 don't trust me too much, geoline and geoshape are somewhat mixed up in my brain atm
[09:48:52] <gehel>	 vgutierrez: if I read your VCL patch correctly, that introduces a global rate limit on geoshape, right?
[09:49:17] <gehel>	 Or is it magically bucketed by client in some way?
[09:49:31] <vgutierrez>	 yeah, that's X-Client-IP right there :)
[09:50:05] <vgutierrez>	 X-Client-IP is set by nginx and sent to varnish-fe with the real client IP
[09:50:24] <gehel>	 Ok, so that's part of the bucket
[09:50:50] <gehel>	 the suspicion is that the issue is not so much because of the amount of traffic, but because of the kind of requests
[09:51:13] <godog>	 fwiw my attempt at a stack trace/sample with perf didn't yield anything useful unfortunately
[09:51:17] <gehel>	 let's see if 1/second/ip is sufficient
[09:51:31] <vgutierrez>	 gehel: we can tune that of course :)
[09:51:41] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/536549 (owner: Vgutierrez)
[09:51:48] <vgutierrez>	 gehel: now?
[09:51:51] <gehel>	 let's try as-is first and see if that's sufficient
[09:52:13] <vgutierrez>	 do you want it merged now?
[09:52:22] <gehel>	 vgutierrez: yeah, maps1004 is definitely less saturating - https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=maps&var-instance=All&from=now-1h&to=now&panelId=2160&fullscreen&refresh=1m
[09:52:26] <vgutierrez>	 ack
[09:52:37] <godog>	 I'm seeing cpu sort of recovered across other maps hosts too
[09:52:49] <godog>	 still very high but not pegged at 100%
[09:52:51] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: Rate limit heavily maps.wm.o/geoshape [puppet] - https://gerrit.wikimedia.org/r/536549 (owner: Vgutierrez)
[09:53:03] <wikibugs>	 (PS4) Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoshape [puppet] - https://gerrit.wikimedia.org/r/536549
[09:53:20] <vgutierrez>	 and this is how I get a t-shirt
[09:53:22] <vgutierrez>	 /o\
[09:53:58] <godog>	 hahaha vgutierrez 
[09:54:14] <gehel>	 vgutierrez: I don't have any t-shirt for you, but I can send chocolate :)
[09:54:30] <vgutierrez>	 gehel: that's tempting.. here I don't get proper one
[09:55:52] <vgutierrez>	 (gotta love A:cp-upload cumin alias)
[09:58:39] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:59:14] <gehel>	 CPU seems to be climbing back up, maybe this has nothing to do with geoline / geoshape, but just about the traffic we're getting
[10:00:13] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[10:00:25] <vgutierrez>	 gehel: traffic doesn't seem to have increased over the last week
[10:00:52] <wikibugs>	 (PS1) Elukey: eventlogging:dependencies: allow to specify python-kafka version [puppet] - https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941)
[10:00:59] <vgutierrez>	 at least from turnilo PoV
[10:01:02] <gehel>	 vgutierrez: so it must be about the type of traffic and not the quantity
[10:01:14] <vgutierrez>	 so more expensive queries
[10:01:57] <vgutierrez>	 puppet is running on the cp upload cluster
[10:02:51] <gehel>	 looks like most of our traffic is referred from https://twpkinfo.com/
[10:03:47] <gehel>	 vgutierrez: change already applied? CPU seems to be going down in the last 3 minutes
[10:03:59] <icinga-wm>	 RECOVERY - snapshot of s4 in codfw on db1115 is OK: snapshot for s4 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-09-13 08:08:54 from db2099.codfw.wmnet:3314 (1076 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[10:04:03] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:04:11] <vgutierrez>	 gehel: half ot if
[10:04:17] <vgutierrez>	 *half of it :)
[10:04:18] <vgutierrez>	 damn
[10:04:29] <wikibugs>	 (PS1) Effie Mouzeli: WIP: systemd: Add support for coredump.conf [puppet] - https://gerrit.wikimedia.org/r/536558
[10:05:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: systemd: Add support for coredump.conf [puppet] - https://gerrit.wikimedia.org/r/536558 (owner: Effie Mouzeli)
[10:07:05] <gehel>	 things seem to be mostly stable for maps, but I'm not entirely sure why
[10:07:11] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:07:25] <gehel>	 I'll wait more before re-enabling tilerator
[10:08:22] <godog>	 yeah cpu looking good now on maps afaics
[10:09:28] <logmsgbot>	 !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[10:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:30] <wikibugs>	 (PS2) Elukey: eventlogging:dependencies: allow to specify python-kafka version [puppet] - https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941)
[10:13:44] <vgutierrez>	 !log disable ATS-TLS debug options on cp5001 - T232298
[10:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:47] <stashbot>	 T232298: Investigate segfaults on ats-tls running on cp5001  - https://phabricator.wikimedia.org/T232298
[10:14:19] <wikibugs>	 (CR) Jbond: WIP: systemd: Add support for coredump.conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536558 (owner: Effie Mouzeli)
[10:17:07] <wikibugs>	 (CR) Elukey: [C: +2] "Andrew: I know that this is a horrible hack but it is just get puppet unblocked. I noticed that we are about to absent all the profile::ev" [puppet] - https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941) (owner: Elukey)
[10:18:47] <wikibugs>	 (PS1) Alexandros Kosiaris: Add more entries to .gitignore [deployment-charts] - https://gerrit.wikimedia.org/r/536560
[10:19:19] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +2] Add more entries to .gitignore [deployment-charts] - https://gerrit.wikimedia.org/r/536560 (owner: Alexandros Kosiaris)
[10:19:43] <wikibugs>	 (PS1) Elukey: eventlogging: move comment about python-kafka version [puppet] - https://gerrit.wikimedia.org/r/536561 (https://phabricator.wikimedia.org/T222941)
[10:20:02] <wikibugs>	 (CR) Elukey: [C: +2] eventlogging: move comment about python-kafka version [puppet] - https://gerrit.wikimedia.org/r/536561 (https://phabricator.wikimedia.org/T222941) (owner: Elukey)
[10:21:30] <wikibugs>	 Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (Eevans) restbase2009 is fully decommissioned and ready to be reimaged.
[10:21:58] <wikibugs>	 (PS1) Muehlenhoff: Remove access for atgomez [puppet] - https://gerrit.wikimedia.org/r/536562
[10:22:37] <wikibugs>	 Operations, Analytics, Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (BBlack) The URL mentioned at the top isn't a media URL, it actually is HTML content and is a pageview.  Try it in your browser: https://commons.wikimedia.org//wiki/File:Arm_muscles_b...
[10:23:10] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/536562 (owner: Muehlenhoff)
[10:23:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] Remove access for atgomez [puppet] - https://gerrit.wikimedia.org/r/536562 (owner: Muehlenhoff)
[10:24:00] <wikibugs>	 (CR) Elukey: [C: +2] "Of course the comment was placed in the wrong place, fixed it with https://gerrit.wikimedia.org/r/536561" [puppet] - https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941) (owner: Elukey)
[10:24:43] <wikibugs>	 (PS2) Muehlenhoff: Remove access for atgomez [puppet] - https://gerrit.wikimedia.org/r/536562
[10:27:37] <wikibugs>	 (PS3) Muehlenhoff: Remove access for atgomez [puppet] - https://gerrit.wikimedia.org/r/536562
[10:30:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for atgomez [puppet] - https://gerrit.wikimedia.org/r/536562 (owner: Muehlenhoff)
[10:31:11] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[10:32:45] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[10:36:05] <logmsgbot>	 !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[10:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:46] <logmsgbot>	 !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[10:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:21] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:38:25] <moritzm>	 !log repool restbase1018 after reimage to stretch and completed Cassandra bootstrap
[10:38:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:33] <wikibugs>	 Operations, Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez) Reported and proposed a solution to upstream in https://github.com/apache/trafficserver/pull/5935
[10:41:51] <moritzm>	 !log reimage restbase2009 to stretch T224553
[10:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:54] <stashbot>	 T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553
[10:42:15] <logmsgbot>	 !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[10:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:00] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (Rxy) p:Triage→Normal
[10:47:01] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/536367 (https://phabricator.wikimedia.org/T224553) (owner: Eevans)
[10:47:03] <wikibugs>	 (PS2) Muehlenhoff: hieradata: specify restbase2009 jbod devices [puppet] - https://gerrit.wikimedia.org/r/536367 (https://phabricator.wikimedia.org/T224553) (owner: Eevans)
[10:47:51] <vgutierrez>	 !log rebooting acmechief-test servers to catch up latest kernel upgrades
[10:47:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:17] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (Rxy) p:Normal→Triage
[10:55:22] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[10:55:53] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[10:56:57] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 9.932 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[10:58:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] hieradata: specify restbase2009 jbod devices [puppet] - https://gerrit.wikimedia.org/r/536367 (https://phabricator.wikimedia.org/T224553) (owner: Eevans)
[11:00:09] <gehel>	 theory on maps: we have some kind of non-terminating request on geoshape and it slowly saturates. The current throttling limits the problem but still lets it grow slowly
[11:00:39] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[11:00:54] <gehel>	 vgutierrez: could we be way more aggressive on that throttling? see if it helps
[11:01:13] <gehel>	 I'm trying again to reach mateusbs17, see if he has any better idea
[11:03:47] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[11:05:30] <godog>	 I'm going to silence kartotherian for a couple of hours
[11:05:54] <godog>	 !log silence kartotherian pages for 2h, known issue
[11:05:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:12] <wikibugs>	 (PS1) Muehlenhoff: restbase2010: Add JBOD hiera config for upcoming reimage [puppet] - https://gerrit.wikimedia.org/r/536565
[11:07:11] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (Rxy)
[11:07:37] <wikibugs>	 (PS1) Muehlenhoff: restbase2011: Add JBOD hiera config for upcoming reimage [puppet] - https://gerrit.wikimedia.org/r/536566
[11:07:47] <vgutierrez>	 I'm out for dinner.. maybe bblack can handle that gehel 
[11:08:07] <gehel>	 vgutierrez: sure, enjoy dinner!
[11:08:08] <jbond42>	 gehel: the current limit is 1 request per second on average, with an allowance for a burst of 10; how much more aggressive would you like it?
[11:08:30] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:08:30] <logmsgbot>	 !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[11:08:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:39] <wikibugs>	 (PS1) Muehlenhoff: restbase2012: Add JBOD hiera config for upcoming reimage [puppet] - https://gerrit.wikimedia.org/r/536567
[11:08:47] <gehel>	 I'm thinking blocking it completely, but I'm also thinking it's probably unrelated
[11:09:11] <gehel>	 the other quite extreme action would be to block that pokemongo website based on referrer
[11:09:53] <jbond42>	 ok one sec, I can look at both of those options
[11:10:02] <elukey>	 !log reboot an-tool1007 (runs turnilo) for kernel upgrades
[11:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:05] <elukey>	 !log reboot an-conf100* (Analytics Zookeeper nodes - not yet in production) for kernel upgrades
[11:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:24] <_joe_>	 gehel: did anyone analyze the traffic we're getting?
[11:13:31] <_joe_>	 there are many other sites
[11:14:10] <gehel>	 _joe_: only at a high level (so mostly: no)
[11:14:17] <_joe_>	 ok
[11:14:23] <_joe_>	 that's what I would do at this point
[11:14:31] <_joe_>	 there must be one specific caller causing this
[11:14:43] <icinga-wm>	 PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:15:57] <icinga-wm>	 RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[11:17:27] <gehel>	 ok I'll try to dig into the logs
[11:17:33] <gehel>	 but not even sure what I'm looking for
[11:18:39] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[11:18:51] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[11:19:43] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[11:20:57] <wikibugs>	 (PS1) Jbond: maps: block 9db.jp from maps [puppet] - https://gerrit.wikimedia.org/r/536568
[11:22:45] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[11:23:30] <wikibugs>	 (PS1) Jbond: maps: block goeshape [puppet] - https://gerrit.wikimedia.org/r/536570
[11:23:47] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[11:24:05] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[11:24:46] <jbond42>	 gehel: changes above ^^ one to disable geoshape and one to block 9db.jp as referrer. bblack is also added as reviewer
[11:26:31] <wikibugs>	 (CR) Gehel: "LGTM as a mitigation strategy." [puppet] - https://gerrit.wikimedia.org/r/536570 (owner: Jbond)
[11:28:37] <wikibugs>	 (PS2) Jbond: maps: block 9db.jp from maps [puppet] - https://gerrit.wikimedia.org/r/536568
[11:30:47] <wikibugs>	 (PS3) Jbond: maps: block 9db.jp from maps [puppet] - https://gerrit.wikimedia.org/r/536568
[11:33:52] <wikibugs>	 (CR) Urbanecm: "> Patch Set 5:" [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg)
[11:36:37] <icinga-wm>	 PROBLEM - Host an-conf1002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:45] <elukey>	 this is me --^
[11:37:03] <icinga-wm>	 RECOVERY - Host an-conf1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[11:38:56] <_joe_>	 !log manually raising the worker heap limit to 600 MB on kartotherian on maps1003
[11:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:21] <jbond42>	 !log enable access logs on maps1003
[11:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:46] <wikibugs>	 Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (MoritzMuehlenhoff)
[11:40:19] <wikibugs>	 Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (MoritzMuehlenhoff) restbase2009 has been reimaged and is ready to be bootstrapped in Cassandra.
[11:41:25] <wikibugs>	 (PS2) Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - https://gerrit.wikimedia.org/r/536558
[11:42:28] <wikibugs>	 (PS3) Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - https://gerrit.wikimedia.org/r/536558
[11:43:07] <icinga-wm>	 PROBLEM - Host an-conf1003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:36] <elukey>	 this is me --^
[11:44:01] <icinga-wm>	 RECOVERY - Host an-conf1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[11:48:17] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:05] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[11:53:07] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:55:05] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:57:07] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[11:58:43] <gehel>	 there was a new kartotherian version deployed yesterday, checking what was the content
[11:59:09] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:59:21] <moritzm>	 ^restbase is reimage alert spam, silencing
[11:59:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff Pending bootstrap after reimage https://phabricator.wikimedia.org/T120662
[11:59:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive Muehlenhoff Pending bootstrap after reimage https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:59:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused Muehlenhoff Pending bootstrap after reimage https://phabricator.wikimedia.org/T93886
[11:59:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff Pending bootstrap after reimage https://phabricator.wikimedia.org/T120662
[11:59:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive Muehlenhoff Pending bootstrap after reimage https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:02:23] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1002 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:03:35] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:03:54] <wikibugs>	 Operations, Analytics, hardware-requests, User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (faidon) a:faidon→RobH Approved.  It sounds like our spare pools are being drained, so if that's the case feel free to open a task to replenis...
[12:04:29] <wikibugs>	 Operations, Analytics, hardware-requests, User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (faidon) a:faidon→RobH Approved.
[12:05:57] <wikibugs>	 (PS4) Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - https://gerrit.wikimedia.org/r/536558
[12:07:46] <wikibugs>	 (PS5) Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - https://gerrit.wikimedia.org/r/536558
[12:14:47] <jbond42>	 !log add timing information to maps1003 access logs
[12:14:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:32] <wikibugs>	 (PS1) BBlack: LVS for MW: Remove RunCommand checks [puppet] - https://gerrit.wikimedia.org/r/536581 (https://phabricator.wikimedia.org/T111899)
[12:31:23] <wikibugs>	 (PS1) Effie Mouzeli: profile::mediawiki::api: Setup systemd-coredump on api servers [puppet] - https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613)
[12:33:01] <wikibugs>	 (PS1) BBlack: maps tiles URI sanity filter [puppet] - https://gerrit.wikimedia.org/r/536583
[12:36:47] <wikibugs>	 (CR) Gehel: [C: -1] maps tiles URI sanity filter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536583 (owner: BBlack)
[12:37:02] <_joe_>	 !log temp ban of class of urls on maps1003 nginx
[12:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:19] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:26] <wikibugs>	 (CR) MSantos: [C: -1] maps tiles URI sanity filter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/536583 (owner: BBlack)
[12:43:10] <wikibugs>	 Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (MoritzMuehlenhoff) @Bstorm @Andrew These were once installed with Stretch, but in the mean time Buster was released, let's reimage those be...
[12:56:25] <wikibugs>	 (PS2) BBlack: maps tiles URI sanity filter [puppet] - https://gerrit.wikimedia.org/r/536583
[12:56:30] <wikibugs>	 (CR) BBlack: maps tiles URI sanity filter (3 comments) [puppet] - https://gerrit.wikimedia.org/r/536583 (owner: BBlack)
[12:56:55] <_joe_>	 !log banning more urls on maps1003
[12:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:49] <wikibugs>	 (PS3) Muehlenhoff: Remove obsolete restbase hiera settings [puppet] - https://gerrit.wikimedia.org/r/536204
[12:57:51] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM, but mateus has more knowledge of what is supported here." [puppet] - https://gerrit.wikimedia.org/r/536583 (owner: BBlack)
[12:59:10] <wikibugs>	 (CR) MSantos: maps tiles URI sanity filter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536583 (owner: BBlack)
[12:59:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete restbase hiera settings [puppet] - https://gerrit.wikimedia.org/r/536204 (owner: Muehlenhoff)
[13:01:17] <wikibugs>	 (PS1) Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123)
[13:05:45] <wikibugs>	 Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (Jclark-ctr) @jcrespo   Support ticket did not include disk.   it was only a cable issue.  No other tickets open.
[13:09:14] <wikibugs>	 Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (Cmjohnson) Actually we need to close this task and open a separate task about the disk.  Different issue should get a different task.
[13:10:11] <wikibugs>	 (CR) MSantos: [C: +1] maps tiles URI sanity filter [puppet] - https://gerrit.wikimedia.org/r/536583 (owner: BBlack)
[13:10:22] <wikibugs>	 (PS2) Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123)
[13:10:23] <wikibugs>	 (PS2) Effie Mouzeli: profile::mediawiki::api: Setup systemd-coredump on api servers [puppet] - https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613)
[13:11:33] <wikibugs>	 (PS3) BBlack: maps tiles URI sanity filter [puppet] - https://gerrit.wikimedia.org/r/536583
[13:12:21] <wikibugs>	 (CR) BBlack: [C: +2] maps tiles URI sanity filter [puppet] - https://gerrit.wikimedia.org/r/536583 (owner: BBlack)
[13:12:27] <wikibugs>	 (CR) Muehlenhoff: "What's the rough ETA on obsoleting labtestpuppetmaster as well? It's currently the last puppet master on jessie in prod, so a bit of an ou" [puppet] - https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) (owner: Andrew Bogott)
[13:14:51] <wikibugs>	 (CR) Muehlenhoff: "Is this really a compatible drop-in replacement? It's probably better to simply continue to use the jar provided by Gerrit upstream on bus" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[13:15:40] <wikibugs>	 (PS1) Jbond: maps: tiles URI sanity filter [puppet] - https://gerrit.wikimedia.org/r/536588
[13:19:36] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (fgiunchedi)
[13:19:39] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[13:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:41] <wikibugs>	 (03PS2) 10Jbond: maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588
[13:20:08] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[13:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:14] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond)
[13:21:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond)
[13:21:38] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[13:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:16] <wikibugs>	 (03PS3) 10Jbond: maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588
[13:24:49] <icinga-wm>	 RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[13:25:13] <wikibugs>	 (03CR) 10Jbond: maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond)
[13:25:27] <icinga-wm>	 RECOVERY - Check systemd state on maps1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:17] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[13:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:43] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[13:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:23] <wikibugs>	 (03PS1) 10Awight: Enable source wiki editing for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851)
[13:30:38] <wikibugs>	 (03CR) 10BBlack: [C: 04-1] maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond)
[13:30:45] <bblack>	 jbond42: ^
[13:30:51] <jbond42>	 looking
[13:32:28] <jbond42>	 ahh thx and you can ignore my comment on your early change i see why now :)
[13:34:16] <wikibugs>	 (03PS4) 10Jbond: maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588
[13:35:21] <wikibugs>	 (03CR) 10Jbond: maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond)
[13:35:33] <jbond42>	 bblack: can you recheck ^^
[13:36:07] <wikibugs>	 10Operations, 10Discovery, 10Maps, 10Product-Infrastructure-Team-Backlog: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10Mholloway)
[13:36:42] <bblack>	 jbond42: yeah that part LGTM, we're assuming the main regex is ok from mateus earlier +1 right?
[13:37:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: alert on overall puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303)
[13:37:25] <jbond42>	 bbl yes
[13:37:30] <jbond42>	 bblack: yes
[13:38:02] <jbond42>	 also from https://github.com/wikimedia/mediawiki-services-kartotherian/tree/master/packages/kartotherian
[13:38:21] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond)
[13:43:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/536233 (owner: 10Muehlenhoff)
[13:47:03] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable coredns based cluster DNS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/536593
[13:49:18] <wikibugs>	 (03PS1) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595
[13:50:25] <wikibugs>	 (03PS2) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595
[13:51:41] <wikibugs>	 (03PS3) 10Subramanya Sastry: Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042)
[13:51:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Let's do this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry)
[13:54:37] <wikibugs>	 (03PS3) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595
[13:54:57] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry)
[13:55:10] <_joe_>	 subbu: deploying
[13:55:41] <subbu>	 k
[13:56:01] <wikibugs>	 (03PS5) 10Elukey: profile::kerberos::kdc: add debconf settings [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089)
[13:56:28] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry)
[13:57:24] <wikibugs>	 (03CR) 10MSantos: [C: 04-1] maps: add filter to block any unknown URI's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[13:57:29] <logmsgbot>	 !log oblivian@deploy1001 Synchronized wmf-config/logging.php: unbreak mediawiki logging on scandium (duration: 01m 04s)
[13:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:51] <wikibugs>	 (03PS4) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595
[13:59:07] <wikibugs>	 (03PS5) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595
[13:59:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry)
[14:01:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:01:30] <subbu>	 _joe_, confirmed that the revert works on scandium .. i now see logs in logstash when i parse a page that errors.
[14:01:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:02:27] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: codfw: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536597
[14:04:00] <wikibugs>	 (03CR) 10Gehel: maps: add filter to block any unknown URI's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[14:05:11] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[14:05:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:36] <wikibugs>	 (03CR) 10MSantos: maps: add filter to block any unknown URI's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[14:06:59] <_joe_>	 subbu: in a few minutes the logstash filter should be deployed
[14:07:18] <subbu>	 \o/ ty .. 
[14:07:27] <subbu>	 once you confirm, i'll test and verify.
[14:07:56] <wikibugs>	 (03PS6) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595
[14:10:01] <wikibugs>	 (03CR) 10Jbond: maps: add filter to block any unknown URI's (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[14:12:34] <wikibugs>	 (03PS1) 10Subramanya Sastry: beta cluster: Make deployment-parsoid09 a Mediawiki appserver as well [puppet] - 10https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538)
[14:13:37] <wikibugs>	 (03PS7) 10BBlack: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[14:14:00] <logmsgbot>	 !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[14:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:24] <wikibugs>	 (03CR) 10Subramanya Sastry: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/536598 is the puppet patch for making the parsoid host on beta cluster a mediawiki ap" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry)
[14:15:05] <_joe_>	 subbu: we're done here
[14:15:15] <subbu>	 ok!
[14:15:47] <wikibugs>	 (03PS8) 10BBlack: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[14:17:03] <subbu>	 _joe_, works! :)
[14:17:21] <_joe_>	 subbu: good!
[14:17:22] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[14:17:26] <_joe_>	 and sorry for the delay
[14:17:50] <subbu>	 all good! :)
[14:18:40] <moritzm>	 !log installing reportbug update from Buster 10.1 point release
[14:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:53] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: codfw: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536597
[14:20:14] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond)
[14:20:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] codfw: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536597 (owner: 10Alexandros Kosiaris)
[14:22:50] <moritzm>	 !log installing bzip2 update from Buster 10.1 point release
[14:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:17] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:30] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: eqiad: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536603
[14:26:49] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:27:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eqiad: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536603 (owner: 10Alexandros Kosiaris)
[14:28:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Overall lgtm, a couple small things." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli)
[14:30:30] <moritzm>	 !log installing cups security update on buster (only client-side libs installed)
[14:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:47] <logmsgbot>	 !log akosiaris@ helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
[14:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:16] <wikibugs>	 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff)
[14:34:43] <wikibugs>	 (03CR) 10Jbond: "looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli)
[14:36:19] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:31] <_joe_>	 uh effie
[14:36:39] <_joe_>	 why this? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536558/1/modules/systemd/manifests/init.pp
[14:36:42] <_joe_>	 I missed it
[14:36:51] <_joe_>	 oh nevermind
[14:36:54] <_joe_>	 old patchset
[14:36:56] <effie>	 that is 1 :p
[14:36:57] <_joe_>	 lol
[14:36:59] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:36:59] <_joe_>	 yeah
[14:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:00] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:37:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:03] <_joe_>	 dunno why
[14:37:13] <effie>	 _joe_: it was a really brilliant idea
[14:37:23] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:39:36] <_joe_>	 no I mean dunno why I ended up there
[14:39:41] <_joe_>	 prolly following a comment
[14:40:50] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "* Do this on the profile::mediawiki::php class" [puppet] - 10https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613) (owner: 10Effie Mouzeli)
[14:41:53] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi)
[14:42:05] <wikibugs>	 (03PS1) 10BBlack: maps URIs: allow root, with or without query [puppet] - 10https://gerrit.wikimedia.org/r/536606
[14:43:03] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] maps URIs: allow root, with or without query [puppet] - 10https://gerrit.wikimedia.org/r/536606 (owner: 10BBlack)
[14:44:04] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Edit Project Config [deployment-charts] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/536607
[14:44:21] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: Edit Project Config [deployment-charts] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/536607 (owner: 10Alexandros Kosiaris)
[14:50:41] <wikibugs>	 (03CR) 10Hashar: "That will add the instance to the list of target scap deploys/sync mediawiki to." [puppet] - 10https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: 10Subramanya Sastry)
[14:51:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:51:03] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:51:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-dan: New upstream release [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/535847 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry)
[14:52:31] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-swe: New upstream release [debs/contenttranslation/apertium-swe] - 10https://gerrit.wikimedia.org/r/535853 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry)
[14:53:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/535863 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry)
[14:53:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-nob: New upstream release [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/536165 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry)
[14:54:32] <hashar>	 going to test https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/536580/  on mwdebug1001 (poke Daimona & anomie )
[14:54:38] <wikibugs>	 (03PS6) 10Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558
[14:55:09] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:58] <wikibugs>	 (03PS3) 10Effie Mouzeli: profile::mediawiki::api: Setup systemd-coredump on api servers [puppet] - 10https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613)
[15:00:05] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:00:25] <hashar>	 anomie: so my patch might well end up being too spammy (pointed by Daimona)
[15:00:29] <hashar>	 since it would always log something
[15:00:39] <hashar>	 1/11 of traffic to wikidata
[15:00:50] <hashar>	 but that might not be worse than api-feature-usages or the csp reports
[15:01:36] <hashar>	 giving a try and will rollback if that is way too much
[15:01:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey)
[15:02:19] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[15:02:25] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:02:26] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Add more log and context for T232613 logging - T232613 (duration: 01m 04s)
[15:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:56] <stashbot>	 T232613: LBFactoryMulti.php: PHP Notice: Undefined index:  - https://phabricator.wikimedia.org/T232613
[15:03:46] <wikibugs>	 (03PS1) 10Jbond: ipmi: relax password minimum length [software/spicerack] - 10https://gerrit.wikimedia.org/r/536616 (https://phabricator.wikimedia.org/T147074)
[15:04:10] <Daimona>	 Looks like definitely too spammy...
[15:04:23] <wikibugs>	 (03PS2) 10Jbond: ipmi: relax password minimum length [software/spicerack] - 10https://gerrit.wikimedia.org/r/536616 (https://phabricator.wikimedia.org/T147074)
[15:06:11] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) Note that with new eqiad routing engines we can set the MSS at the router level (untested). Advantages are: easier to deploy (one configuration change) and can be applied to ext...
[15:06:13] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:32] <hashar>	 Daimona: yeah though that is not that much. The AdHocDebug channel does not show up in the top channels (based on the home dashboard https://logstash.wikimedia.org/app/kibana )
[15:06:34] <hashar>	 but yeah
[15:06:36] <hashar>	 I will let it run
[15:06:48] <hashar>	 and once a core dump gets captured, I guess I will disable the feature entirely
[15:06:53] <hashar>	 we probably have enough core dumps
[15:07:01] <Daimona>	 I was looking at https://logstash.wikimedia.org/goto/58e68b0a1ecf0e61b1c88bbcf8c22386
[15:07:14] <Daimona>	 932 hits, of which only one generated a coredump
[15:07:31] <Daimona>	 Yeah probably
[15:07:50] <hashar>	 AbuseFilter generates an order of magnitude more logs :]
[15:08:37] <hashar>	 soo hmm 200 events per minute, that is manageable
[15:08:42] <hashar>	 I am letting it run
[15:08:51] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[15:08:52] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:08:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:00] <hashar>	 Daimona: and thank you for the reviews / tips etc :]
[15:10:10] <hashar>	 that did not last long :]
[15:10:12] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Enable coredns based cluster DNS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/536593
[15:10:15] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable coredns based cluster DNS in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/536617
[15:10:26] <wikibugs>	 (03CR) 10Jbond: "small nit, also do we want to add something to `/etc/network/interfaces`?" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: 10Dzahn)
[15:10:39] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:10:57] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:03] <Daimona>	 Yeah, we're currently deprecating lots of stuff, so there are a lot of entries. wmf.22 should reduce the amount, although it's blocked on this empty string issue :)
[15:11:10] <Daimona>	 yw :)
[15:11:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[15:11:34] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:11:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:06] <akosiaris>	 !log upload apertium-dan_0.6.0-1+wmf3 apertium-nno_1.0.0-1+wmf1 apertium-nob_1.0.0-2+wmf1 apertium-swe_0.8.0-1+wmf1 to apt.wikimedia.org/jessie-wikimedia T218184
[15:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:09] <stashbot>	 T218184: Update apertium-nno-nob, apertium-swe-dan, apertium-swe-nor and apertium-dan-nor packages - https://phabricator.wikimedia.org/T218184
[15:14:45] <wikibugs>	 (03PS3) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870)
[15:15:01] <wikibugs>	 (03CR) 10Cwhite: profile: use prometheus for logstash alerting (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[15:22:52] <wikibugs>	 (03PS3) 10Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/536359
[15:24:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli)
[15:25:24] <wikibugs>	 (03PS1) 10MSantos: maps: allow float value for image scale [puppet] - 10https://gerrit.wikimedia.org/r/536621
[15:28:01] <wikibugs>	 (03PS1) 10Hashar: Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613)
[15:28:10] <wikibugs>	 (03PS2) 10MSantos: maps: allow float value for image scale [puppet] - 10https://gerrit.wikimedia.org/r/536621
[15:29:28] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar)
[15:31:02] <wikibugs>	 (03Merged) 10jenkins-bot: Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar)
[15:31:18] <wikibugs>	 (03CR) 10jenkins-bot: Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar)
[15:32:47] <wikibugs>	 10Operations, 10Analytics, 10Traffic: Add google weblight to the list of trusted proxies - https://phabricator.wikimedia.org/T232849 (10Nuria)
[15:33:28] <wikibugs>	 (03PS1) 10Andrew Bogott: Puppet CAs: rename a local '$puppetmaster' variable [puppet] - 10https://gerrit.wikimedia.org/r/536625 (https://phabricator.wikimedia.org/T232428)
[15:34:40] <logmsgbot>	 !log hashar@deploy1001 Synchronized wmf-config/CommonSettings.php: Disable adhoc core dump logging - T232613 (duration: 01m 04s)
[15:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:02] <stashbot>	 T232613: LBFactoryMulti.php: PHP Notice: Undefined index:  - https://phabricator.wikimedia.org/T232613
[15:35:30] <hashar>	 so that is all I have for this evening
[15:35:33] <hashar>	 will be back later tonight
[15:36:05] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Puppet CAs: rename a local '$puppetmaster' variable [puppet] - 10https://gerrit.wikimedia.org/r/536625 (https://phabricator.wikimedia.org/T232428) (owner: 10Andrew Bogott)
[15:39:47] <wikibugs>	 (03CR) 10Ayounsi: "2 inline comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: 10Dzahn)
[15:42:43] <wikibugs>	 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Resolve local commits on cloud-puppetmaster-01.cloudinfra.eqiad.wmflabs and cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs - https://phabricator.wikimedia.org/T232428 (10Andrew) 05Open→03Resolved The attache...
[15:42:46] <wikibugs>	 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew)
[15:43:40] <effie>	 !log reverting live hacks on mw1348
[15:43:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:28] <urandom>	 !log bootstrapping Cassandra, restbase2009-a -- T224553
[15:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:31] <stashbot>	 T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553
[15:53:11] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10BBlack) Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed which haven't set off big alarm bells), but we'll still need direct m...
[15:56:05] <wikibugs>	 (03CR) 10BBlack: [C: 04-1] "In general, I'm hoping we don't need to go down this road and we'll find better ways to deal with this." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: 10Dzahn)
[16:00:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: "non-blocking nit inline, LGTM but please attach a PCC run as well" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[16:02:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See line, other than that LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[16:04:02] <wikibugs>	 (03PS6) 10Elukey: profile::kerberos::kdc: add debconf settings [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089)
[16:05:08] <wikibugs>	 (03PS4) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870)
[16:05:23] <wikibugs>	 (03CR) 10Cwhite: profile: use prometheus for logstash alerting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[16:05:29] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18288/kerberos1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey)
[16:07:03] <icinga-wm>	 RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[16:09:11] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:37] <XioNoX>	 !log fix bgp group netflow on cr2-codfw
[16:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:50] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) As discussed on IRC, this *should* work for inbound (clamping the SYNACK too), but to be tested.
[16:36:33] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 44.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:41:17] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 54.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:43:57] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:01] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 79.29 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:53:40] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm but best have brandon double check i dont want to push to varnish without a +1 from traffic just yet" [puppet] - https://gerrit.wikimedia.org/r/536621 (owner: MSantos)
[16:56:54] <wikibugs>	 Operations, Analytics, Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data  - https://phabricator.wikimedia.org/T232795 (Nuria)
[16:58:41] <wikibugs>	 Operations, Analytics, Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (Nuria) I have started another ticket that as you mentioned, better explains the rationale behing having "trusted proxies", we really do not need them if we can capture the original i...
[16:59:02] <wikibugs>	 Operations, Analytics, Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (Nuria) ping @Ottomata and @JAllemandou for thou...
[17:00:21] <wikibugs>	 (CR) Brennen Bearnes: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: Brennen Bearnes)
[17:00:30] <wikibugs>	 (PS1) Herron: prometheus: add per-site systemd failed unit checks [puppet] - https://gerrit.wikimedia.org/r/536642 (https://phabricator.wikimedia.org/T230570)
[17:07:42] <wikibugs>	 (PS1) Ayounsi: Kafkatee, mask default (package provided) systemd service [puppet] - https://gerrit.wikimedia.org/r/536645
[17:12:05] <wikibugs>	 (CR) Ayounsi: "I'm not sure of all the implication, but my understanding is that the default service is not used?" [puppet] - https://gerrit.wikimedia.org/r/536645 (owner: Ayounsi)
[17:16:57] <wikibugs>	 (CR) Jbennett: "We view this risk as having a decent impact but a relatively low likelihood that's why it is ranked as low. As for signing off, we don't d" [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg)
[17:24:45] <urandom>	 !log bootstrapping Cassandra, restbase2009-b -- T224553
[17:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:55] <stashbot>	 T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553
[17:25:06] <icinga-wm>	 RECOVERY - cassandra-b service on restbase2009 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:26:11] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-b valid until 2020-06-24 13:01:53 +0000 (expires in 284 days) https://phabricator.wikimedia.org/T120662
[17:37:31] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:31] <wikibugs>	 (PS7) Cwhite: add generic interface to metrics gathering [software/service-checker] - https://gerrit.wikimedia.org/r/532807
[17:50:44] <wikibugs>	 (CR) BBlack: [C: +2] maps: allow float value for image scale [puppet] - https://gerrit.wikimedia.org/r/536621 (owner: MSantos)
[17:50:53] <wikibugs>	 (PS3) BBlack: maps: allow float value for image scale [puppet] - https://gerrit.wikimedia.org/r/536621 (owner: MSantos)
[17:53:29] <wikibugs>	 (PS2) Cwhite: initial commit [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/536376
[17:59:31] <wikibugs>	 Operations, Analytics, Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (BBlack) The problem stems from the "Trust" in "...
[18:01:55] <wikibugs>	 (PS1) Herron: kafka-main: replace kafka1002 hardware with kafka-main1002 [puppet] - https://gerrit.wikimedia.org/r/536655 (https://phabricator.wikimedia.org/T225005)
[18:05:55] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is CRITICAL: instance={cp4027:9536,cp4028:9536,cp4029:9536,cp4030:9536,cp4031:9536,cp4032:9536} site=ulsfo tunnel={cp1075_v4,cp1075_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[18:06:08] <wikibugs>	 Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-herron: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (herron) Open Distro for Elasticsearch looks quite promising https://opendistro.github.io/for-elasticsearch/
[18:07:38] <wikibugs>	 Operations, Wikimedia-Logstash, observability: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (herron) https://opendistro.github.io/for-elasticsearch/ appears to be a valid option, although this was resolved I'll update the description to include it
[18:11:14] <wikibugs>	 Operations, Wikimedia-Logstash, observability: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (herron)
[18:12:57] <wikibugs>	 (CR) Krinkle: [C: +1] Gzip SVGs served by MediaWiki [puppet] - https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: Gilles)
[18:15:52] <wikibugs>	 Operations, ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (jcrespo)
[18:15:57] <wikibugs>	 Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (jcrespo) Open→Resolved I am ok with that (I actually said to open a new one if this was to be closed). I just didn't know if it had to be open to track whatever was you...
[18:17:37] <wikibugs>	 (CR) Krinkle: [C: +1] "Looks like beta puppetmaster is borked." [puppet] - https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: Gilles)
[18:24:31] <wikibugs>	 Operations, ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (wiki_willy) a:Jclark-ctr
[18:40:16] <wikibugs>	 (CR) Jeena Huneidi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[18:40:45] <wikibugs>	 Operations, Analytics, Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (Nuria) Right, I see the UA issue but in the abs...
[18:52:09] <wikibugs>	 (PS1) Alex Monk: Add cloudinfra hiera data [puppet] - https://gerrit.wikimedia.org/r/536663 (https://phabricator.wikimedia.org/T232509)
[18:54:46] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (Dzahn) Open→Stalled
[18:54:48] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[18:54:57] <wikibugs>	 Operations, Phabricator, serviceops, Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (Dzahn)
[18:56:12] <wikibugs>	 Operations, Phabricator, hardware-requests, serviceops, Release-Engineering-Team (Development services): The server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (mmodell)
[18:56:34] <wikibugs>	 Operations, Phabricator, hardware-requests, serviceops, Release-Engineering-Team (Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (mmodell)
[19:01:06] <wikibugs>	 (PS1) Jhedden: openstack: configure apache wsgi for keystone api [puppet] - https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907)
[19:01:44] <wikibugs>	 (PS2) Andrew Bogott: Add cloudinfra hiera data [puppet] - https://gerrit.wikimedia.org/r/536663 (https://phabricator.wikimedia.org/T232509) (owner: Alex Monk)
[19:01:56] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn)
[19:02:50] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Add cloudinfra hiera data [puppet] - https://gerrit.wikimedia.org/r/536663 (https://phabricator.wikimedia.org/T232509) (owner: Alex Monk)
[19:03:05] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: configure apache wsgi for keystone api [puppet] - https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden)
[19:07:01] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:07:30] <wikibugs>	 (CR) Jeena Huneidi: [C: -1] "we should make sure the image is published before merging and also follow this to update the CPU/memory limits: https://wikitech.wikimedia" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: Brennen Bearnes)
[19:09:22] <wikibugs>	 (CR) Brennen Bearnes: "> Patch Set 1: Code-Review-1" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: Brennen Bearnes)
[19:11:19] <wikibugs>	 Operations, Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (Dzahn)
[19:14:46] <wikibugs>	 Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Dzahn)
[19:14:51] <wikibugs>	 Operations, Phabricator, Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (Dzahn)
[19:15:10] <wikibugs>	 Operations, Phabricator, Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (Dzahn) merging into T226044
[19:18:30] <wikibugs>	 Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (mmodell) == From the merged task:  Blog posts on phame cannot currently be cached by our...
[19:21:16] <wikibugs>	 Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (mmodell) a:mmodell→None This is unblocked on my end, @ema feel free to proceed wh...
[19:22:33] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn)
[19:22:37] <wikibugs>	 Operations, Phabricator: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129 (Dzahn)
[19:22:39] <wikibugs>	 Operations, Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (Dzahn)
[19:22:44] <wikibugs>	 Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (mmodell) Also important, @epriestley's comment at T219978#5346100
[19:24:53] <wikibugs>	 Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (srishakatux) @Nuria I want to read MySQL credentials from `/etc/mysql/conf.d/research-client.cnf` via a script that...
[19:30:35] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:30:46] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Fibercut, telia working on it. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:30:46] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Fibercut, telia working on it. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:35:21] <icinga-wm>	 ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is CRITICAL: instance={cp4027:9536,cp4028:9536,cp4029:9536,cp4030:9536,cp4031:9536,cp4032:9536} site=ulsfo tunnel={cp1075_v4,cp1075_v6} Ayounsi If an icinga alert brought you here, please disregard for the time being. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[19:42:08] <wikibugs>	 Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Krinkle)
[19:48:45] <wikibugs>	 (PS2) Krinkle: beta cluster: Make deployment-parsoid09 a Mediawiki appserver as well [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[19:50:31] <wikibugs>	 (PS1) 20after4: Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883)
[19:54:02] <urandom>	 !log bootstrapping Cassandra, restbase2009-c -- T224553
[19:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:06] <stashbot>	 T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553
[19:54:59] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:56:03] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2020-06-24 13:01:54 +0000 (expires in 284 days) https://phabricator.wikimedia.org/T120662
[20:00:38] <twentyafterfour>	 !log hotfixing T232600 due to severity of the bug and relative safety of the fix (if this breaks, yell at James_F who twisted my arm and made me do it) 
[20:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:41] <stashbot>	 T232600: Some Phabricator boards do not load cards anymore in Chrome 77 - https://phabricator.wikimedia.org/T232600
[20:01:09] * James_F grins.
[20:01:27] <wikibugs>	 (PS1) Herron: prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236)
[20:01:47] <twentyafterfour>	 James_F: fixed?
[20:02:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[20:02:13] <James_F>	 twentyafterfour: Looks like it. Thanks!
[20:02:35] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[20:03:08] <wikibugs>	 (PS2) Herron: prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236)
[20:08:35] <wikibugs>	 (PS1) Andrew Bogott: codfw1dev: First pass at buildingout cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:09:20] <wikibugs>	 (PS2) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:10:12] <wikibugs>	 (PS3) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:12:40] <wikibugs>	 Operations, ops-eqiad, DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (ayounsi) Trying to figure out why this is failing: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-b6-eqiad error is: > External command error: Error in p...
[20:13:37] <wikibugs>	 (CR) Urbanecm: "> Patch Set 5:" [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg)
[20:16:35] <wikibugs>	 (PS4) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:18:15] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:23:46] <wikibugs>	 (CR) Subramanya Sastry: [C: -1] "I am going to spin up a new instance deployment-mediawiki-parsoid10 which will be the parsoid/php server. Will work with Petr to figure ou" [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[20:23:50] <chaomodus>	 !log restarting netbox1001.wikimedia.org
[20:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:30] <wikibugs>	 (PS5) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:30:02] <wikibugs>	 (PS1) Zoranzoki21: Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/536679
[20:47:02] <wikibugs>	 (CR) Jbennett: "It means it is low risk from a security standpoint if the patch uploader wishes to deploy it." [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg)
[20:59:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:59:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:03:01] <wikibugs>	 Operations, ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (Jclark-ctr) contacted dell regarding failed drive will update with response
[21:03:46] <wikibugs>	 (PS1) Krinkle: Gerrit: Add colorblind-friendly diff styles to 'eclipse' syntax theme [puppet] - https://gerrit.wikimedia.org/r/536687 (https://phabricator.wikimedia.org/T232893)
[21:04:42] <wikibugs>	 Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (wiki_willy) Hi @Dzahn - just following up on this one, to see when the server can be taken down.  Thanks, Willy
[21:05:33] <wikibugs>	 Operations, ops-eqiad, netops: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (wiki_willy) @Cmjohnson - can you provide an update on this one next week?  Thanks, Willy
[21:16:19] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 https://phabricator.wikimedia.org/T93886
[21:16:49] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:26] <wikibugs>	 Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (wiki_willy) @Cmjohnson or @Jclark-ctr - can one of you guys check this out early next week?   Thanks, Willy
[21:20:30] <wikibugs>	 (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:21:26] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (Quiddity) Open→Resolved a:Quiddity Done x3.
[21:22:31] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:32] <wikibugs>	 (CR) Paladox: "> > Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:26:04] <wikibugs>	 (CR) Dzahn: gerrit: add gerrit1001 as a replica host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536352 (owner: Dzahn)
[21:26:17] <wikibugs>	 (Abandoned) Dzahn: gerrit: add gerrit1001 as a replica host [puppet] - https://gerrit.wikimedia.org/r/536352 (owner: Dzahn)
[21:31:51] <wikibugs>	 (CR) Cwhite: [C: +1] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[21:33:45] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:39] <wikibugs>	 (PS3) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[21:36:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:36:49] <wikibugs>	 (CR) Paladox: gerrit: do not link to mysql-connector-java.jar if on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:38:00] <wikibugs>	 (PS4) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[21:38:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:39:55] <wikibugs>	 (CR) Paladox: "Note that the plan is to stop using mysql when upgraded to gerrit 2.16." [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:40:37] <wikibugs>	 (CR) Paladox: "Also the lib is downloaded by gerrit from maven (mysql-lib)." [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:40:52] <wikibugs>	 (CR) Dzahn: "so apparently "provided by Gerrit" means "Gerrit tells you to download it from Maven.org. is that ok, Moritz?" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:41:43] <wikibugs>	 (CR) Paladox: "We can install the jar manually without gerrit doing it." [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:42:01] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:17] <wikibugs>	 (PS5) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[21:42:43] <wikibugs>	 (CR) Dzahn: "just pointing out that we went from "use distro package" to "just use the one provided by gerrit" to "manually install it"" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:43:58] <wikibugs>	 (CR) Dzahn: "fine with it. hopefully won't be for long since as you say 2.16 and we can get rid of it" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:44:29] <icinga-wm>	 PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:46:22] <wikibugs>	 (CR) Cwhite: [C: +1] "PCC checks out https://puppet-compiler.wmflabs.org/compiler1002/18273/logstash1009.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[21:46:58] <wikibugs>	 (CR) Dzahn: "> Patch Set 3:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:47:29] <icinga-wm>	 RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 79572 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:48:53] <wikibugs>	 (CR) Dzahn: [C: +1] "noop on prod and doing nothing (in puppet) on buster" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:50:19] <wikibugs>	 (CR) Paladox: [C: +1] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:54:37] <wikibugs>	 (CR) Dzahn: [C: -1] "just like we talked about it in the meeting. agree and looks mostly good. just found an issue when using the compiler:" [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[21:55:42] <wikibugs>	 (CR) Cwhite: "Do the thresholds need to be tweaked as well?  Looking back at the last 90 days of data, the check might not have gone off once?" [puppet] - https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) (owner: Filippo Giunchedi)
[21:57:16] <wikibugs>	 (PS2) 20after4: Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883)
[22:02:05] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[22:02:14] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/18298/  running on 1003 , stopped on 2001" [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[22:02:19] <wikibugs>	 Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (wiki_willy) a:Jclark-ctr
[22:04:46] <wikibugs>	 (CR) Dzahn: [C: +2] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[22:04:54] <wikibugs>	 (PS6) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[22:13:30] <wikibugs>	 (PS3) Dzahn: Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[22:16:54] <wikibugs>	 (CR) Alex Monk: "(I replaced a cherry-pick of this from PS9 or earlier with one of PS11 because everything using this was broken with an error about a miss" [puppet] - https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: Ladsgroup)
[22:19:24] <wikibugs>	 (PS1) Bstorm: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058)
[22:19:55] <wikibugs>	 (CR) Dzahn: [C: +2] Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[22:20:49] <wikibugs>	 (PS3) Subramanya Sastry: beta cluster: Make deployment-mediawiki-parsoid10 a MW scap target [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538)
[22:22:05] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1010 is CRITICAL: 1.695e+04 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:23:14] <onimisionipe>	 ^thats me
[22:23:26] <onimisionipe>	 Extending downtime
[22:26:23] <mutante>	 thanks
[22:40:25] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:40:43] <mutante>	 chaomodus: ^
[22:42:04] <wikibugs>	 (PS1) Dzahn: DHCP: switch phab1001 from jessie to buster [puppet] - https://gerrit.wikimedia.org/r/536698 (https://phabricator.wikimedia.org/T190568)
[22:45:03] <wikibugs>	 (PS1) Alex Monk: cloudinfra hiera: Add missing statsd key [puppet] - https://gerrit.wikimedia.org/r/536699 (https://phabricator.wikimedia.org/T232509)
[22:47:36] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[22:47:46] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn) Stalled→Open
[22:47:52] <wikibugs>	 Operations, Phabricator, serviceops, Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (Dzahn)
[22:49:19] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm)
[22:50:13] <wikibugs>	 (CR) Dzahn: [C: +2] "phab1001 has been idling for long enough, we definitely don't need the jessie system anymore. reinstalling and then either we keep both 10" [puppet] - https://gerrit.wikimedia.org/r/536698 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[22:51:19] <wikibugs>	 (PS2) Dzahn: DHCP: switch phab1001 from jessie to buster [puppet] - https://gerrit.wikimedia.org/r/536698 (https://phabricator.wikimedia.org/T190568)
[22:59:54] <wikibugs>	 Puppet, Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941 (Krenair) I believe we've achieved this with T220268#5275994, the only catch is that you need to make a hiera change like https://wikitech.wikimedia.org/w/index.php?title=Hiera:Dep...
[23:00:14] <wikibugs>	 (PS1) Dzahn: site: apply spare::system role to phab1001 [puppet] - https://gerrit.wikimedia.org/r/536701 (https://phabricator.wikimedia.org/T190568)
[23:01:07] <wikibugs>	 (CR) Dzahn: [C: +2] site: apply spare::system role to phab1001 [puppet] - https://gerrit.wikimedia.org/r/536701 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[23:02:17] <wikibugs>	 (PS2) Dzahn: site: apply spare::system role to phab1001 [puppet] - https://gerrit.wikimedia.org/r/536701 (https://phabricator.wikimedia.org/T190568)
[23:05:20] <wikibugs>	 (PS2) Reedy: Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/534706 (owner: Jforrester)
[23:05:22] <wikibugs>	 (PS2) Reedy: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - https://gerrit.wikimedia.org/r/534707 (owner: Jforrester)
[23:05:47] <wikibugs>	 (CR) Jforrester: "Oh, yeah, forgot these. Let's do them on Monday?" [mediawiki-config] - https://gerrit.wikimedia.org/r/534706 (owner: Jforrester)
[23:06:21] <gehel>	 !log re-enable puppet on maps - T232817
[23:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:25] <stashbot>	 T232817: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817
[23:06:27] <wikibugs>	 (CR) Reedy: "WFM :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/534706 (owner: Jforrester)
[23:07:35] <icinga-wm>	 RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:07:43] <icinga-wm>	 RECOVERY - tilerator on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:07:43] <icinga-wm>	 RECOVERY - tilerator on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:07:45] <icinga-wm>	 RECOVERY - Check systemd state on maps2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:05] <icinga-wm>	 RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:11] <icinga-wm>	 RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:17] <icinga-wm>	 RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:31] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:35] <icinga-wm>	 RECOVERY - Check systemd state on maps2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:43] <icinga-wm>	 RECOVERY - tilerator on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:45] <icinga-wm>	 RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:57] <icinga-wm>	 RECOVERY - Check systemd state on maps1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:57] <icinga-wm>	 RECOVERY - Check systemd state on maps2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:09] <icinga-wm>	 RECOVERY - Check systemd state on maps1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:36] <wikibugs>	 Puppet, Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941 (Krenair) Open→Resolved a:Krenair Actually I've just tested a couple of new instance creations with the above, it comes up without needing to do anything special anymor...
[23:12:57] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin10...
[23:19:27] <wikibugs>	 (PS1) Dzahn: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704
[23:21:44] <wikibugs>	 (PS2) Andrew Bogott: cloudinfra hiera: Add missing statsd key [puppet] - https://gerrit.wikimedia.org/r/536699 (https://phabricator.wikimedia.org/T232509) (owner: Alex Monk)
[23:22:18] <wikibugs>	 (CR) Paladox: gerrit: make scap user configurable in Hiera (4 comments) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:22:37] <wikibugs>	 (PS2) Bstorm: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058)
[23:26:29] <wikibugs>	 (PS2) Dzahn: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704
[23:29:31] <wikibugs>	 (CR) Dzahn: gerrit: make scap user configurable in Hiera (4 comments) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:37:50] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:45] <wikibugs>	 (CR) Paladox: gerrit: make scap user configurable in Hiera (3 comments) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:42:03] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[23:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:32] <wikibugs>	 (PS3) Paladox: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:43:15] <wikibugs>	 (PS4) Paladox: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:43:58] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudinfra hiera: Add missing statsd key [puppet] - https://gerrit.wikimedia.org/r/536699 (https://phabricator.wikimedia.org/T232509) (owner: Alex Monk)
[23:44:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:44:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:47:59] <wikibugs>	 (PS5) Paladox: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:51:45] <wikibugs>	 Operations, Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): cloud-puppetmasters:  move some hiera settings from Horizon to git/gerrit - https://phabricator.wikimedia.org/T232509 (Krenair) Open→Resolved
[23:51:49] <wikibugs>	 Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair)
[23:52:33] <wikibugs>	 Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair) Open→Resolved
[23:52:36] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[23:56:02] <wikibugs>	 (CR) Dzahn: gerrit: make scap user configurable in Hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:57:30] <wikibugs>	 (CR) Paladox: gerrit: make scap user configurable in Hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:59:47] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab1001.eqiad.wmne...