[00:00:43] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:39] PROBLEM - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 4 days ago: Most recent backup 2019-09-08 23:30:18 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:09:20] will deal with netflow2001 later, on my phone right now [00:24:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:27:59] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:37:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:33] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:07:11] !log enable netflow sampling on cr2-codfw [01:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:06] !log add IPv6 sampling to cr1-eqiad [01:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:33] ACKNOWLEDGEMENT - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Ayounsi known issue, will investigate https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:48:07] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:47:17] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 135165784 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:48:51] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 45928 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:05:30] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (Dzahn) [03:30:30] (PS10) DannyS712: Fix typos in code [puppet] - https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) [03:44:28] (PS3) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) [03:44:52] (CR) jerkins-bot: [V: -1] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli) [04:12:40] (PS4) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) [04:18:07] (CR) Effie Mouzeli: [V: +1] "LGTM
https://puppet-compiler.wmflabs.org/compiler1001/18274/mwmaint1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli) [04:26:19] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:27:55] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:42:30] Operations, serviceops: Confd died on bast3002 - https://phabricator.wikimedia.org/T227592 (jijiki) Open→Resolved a:jijiki It has not happened again, Resolving for now.
[04:48:34] (CR) Giuseppe Lavagetto: [C: +1] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli) [04:48:53] (PS1) Tim Starling: Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) [04:49:04] (PS1) Vgutierrez: ATS: Set a explicit SSL Session cache timeout [puppet] - https://gerrit.wikimedia.org/r/536399 (https://phabricator.wikimedia.org/T231849) [04:52:05] (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18275/" [puppet] - https://gerrit.wikimedia.org/r/536399 (https://phabricator.wikimedia.org/T231849) (owner: Vgutierrez) [04:53:50] !log restarting ats-tls on cp4021 and cp2002 to pick up the new SSL session cache timeout - T231849 [04:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:53] T231849: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 [04:57:12] (PS1) Tim Starling: In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400 [04:58:37] (PS2) Tim Starling: In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400 (https://phabricator.wikimedia.org/T232613) [05:05:01] (CR) Giuseppe Lavagetto: [C: +2] In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling) [05:05:25] (PS3) Giuseppe Lavagetto: In PHP FPM, enable process.dumpable everywhere [puppet] - https://gerrit.wikimedia.org/r/536400 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling) [05:09:37] (PS1) Dzahn: puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401
(https://phabricator.wikimedia.org/T23262) [05:11:51] (CR) jerkins-bot: [V: -1] puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262) (owner: Dzahn) [05:17:33] !log Rolling restart php-fpm across the fleet for 536400 [05:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:11] (PS2) Dzahn: puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262) [05:22:39] PROBLEM - PHP opcache health on mw1333 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:23:29] (CR) jerkins-bot: [V: -1] puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262) (owner: Dzahn) [05:24:21] PROBLEM - PHP opcache health on mw1328 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:25:10] ^ will take care of it [05:25:45] PROBLEM - PHP opcache health on mw1241 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:25:49] RECOVERY - PHP opcache health on mw1333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:25:57] RECOVERY - PHP opcache health on mw1328 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:26:47] PROBLEM - PHP opcache health on mw1324 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:27:17] PROBLEM - PHP opcache health
on mw1254 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:27:19] RECOVERY - PHP opcache health on mw1241 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:28:23] RECOVERY - PHP opcache health on mw1324 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:28:51] RECOVERY - PHP opcache health on mw1254 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:29:05] PROBLEM - PHP opcache health on mw1249 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:29:33] PROBLEM - PHP7 jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:29:49] PROBLEM - PHP opcache health on mw1325 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:29:50] (PS3) Dzahn: puppetize setting advmss (MTU) size for GRU mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T23262) [05:30:41] RECOVERY - PHP opcache health on mw1249 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:30:59] RECOVERY - PHP7 jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:31:25] RECOVERY - PHP opcache health on mw1325 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:33:52] (CR) Effie Mouzeli: [V: +1 C: +2] mw-maintenance and scap: Revert changes for PHP7 transition
[puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli) [05:34:59] (CR) Giuseppe Lavagetto: [C: +1] Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling) [05:35:05] (CR) Dzahn: [C: +1] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli) [05:35:25] PROBLEM - PHP opcache health on mw1227 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:36:49] (PS5) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) [05:36:53] PROBLEM - PHP opcache health on mw1347 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:36:59] RECOVERY - PHP opcache health on mw1227 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:38:13] PROBLEM - PHP opcache health on mw1247 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:38:29] RECOVERY - PHP opcache health on mw1347 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:39:29] (PS6) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) [05:39:49] RECOVERY - PHP opcache health on mw1247 is OK: OK: opcache is healthy
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:40:03] (PS4) Dzahn: puppetize setting advmss (MTU) size for GRE mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) [05:40:22] <_joe_> !log live-hacking mw1348, setting rlimit_core = unlimited to allow core dumps to be taken [05:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:53] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [05:41:46] (CR) Effie Mouzeli: [C: +2] mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli) [05:42:29] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [05:43:45] (PS5) Dzahn: puppetize setting advmss (MTU) size for GRE tunnel mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) [05:43:48] Operations, MediaWiki-extensions-Mailgun, cloud-services-team, serviceops, and 5 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki) [05:44:01] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:44:17] Operations, serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (jijiki) [05:44:19] Operations, MediaWiki-extensions-Mailgun, cloud-services-team, serviceops, and 5 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki) Open→Resolved [05:44:30] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (jijiki) [05:45:56] (CR) Dzahn: "set: https://puppet-compiler.wmflabs.org/compiler1002/18281/cobalt.wikimedia.org/change.cobalt.wikimedia.org.pson" [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: Dzahn) [05:48:53] (PS6) Dzahn: puppetize setting advmss (MTU) size for GRE tunnel mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) [05:52:05] (CR) Giuseppe Lavagetto: [C: +2] Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling) [05:52:55] (Merged) jenkins-bot: Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling) [05:53:11] (CR) jenkins-bot: Add coredump action to fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/536398 (https://phabricator.wikimedia.org/T232613) (owner: Tim Starling) [05:58:01] !log oblivian@deploy1001 Synchronized w/fatal-error.php: Adding core dump function to fatal-error (duration: 01m 04s) [05:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:20] (PS3) Elukey: Add config for wmf_netflow to
Turnilo [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria) [06:04:58] (CR) Elukey: "Tried to add the config on the fly on an-tool1007 and I get the following error in the logs:" [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria) [06:15:19] (CR) Elukey: Add config for wmf_netflow to Turnilo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria) [06:32:50] (CR) Giuseppe Lavagetto: [C: -2] "-2 because while the patch is technically correct, I think this is noise." [puppet] - https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: Herron) [06:35:33] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [06:37:51] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:05] (CR) Giuseppe Lavagetto: "Going in the right direction, code could still be simpler if you just declare metrics_manager as a class property of EndpointRequest."
(1 comment) [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite) [06:48:15] (PS4) Elukey: Add config for wmf_netflow to Turnilo [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria) [06:49:05] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [06:49:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:50:39] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [06:51:31] <_joe_> uhm [06:51:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:52:29] <_joe_> that's mostly maps AIUI [06:53:19] (CR) Elukey: [C: +2] "I noticed in the Druid UI that some of the more recent hours have multiple segments, and some of them with 1 dimensions.
This might be due" [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria) [06:57:43] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:01:43] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:03:19] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:07:15] should we start looking into this alert that keeps popping up ? [07:13:23] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:14:59] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:15:08] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [07:16:42] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [07:22:23] Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (jcrespo) The raid is back \o/ ` root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Firmware state' Firmware state: Unconfigured(good), Spun Up Firmware state: Unconfigured(... [07:23:57] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:24:29] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:24:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:24:59] PROBLEM
- HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_upload site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:27:05] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:27:39] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:28:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:28:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:38:09] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:39:43] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:42:22] Operations, ops-eqiad, Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['elastic1046.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[07:45:58] (PS2) Muehlenhoff: Remove obsolete restbase hiera settings [puppet] - https://gerrit.wikimedia.org/r/536204 [08:25:32] (CR) Filippo Giunchedi: [C: -1] profile: use prometheus for logstash alerting (2 comments) [puppet] - https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [08:25:49] (CR) Filippo Giunchedi: [C: -1] "> Patch Set 2: Code-Review-1" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [08:33:13] <_joe_> !log restarting kartotherian on maps1003, all workers seem stuck [08:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:55] PROBLEM - kartotherian endpoints health on maps1002 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:34:06] there we go [08:34:19] <_joe_> onimisionipe: I didn't expect anything different [08:34:45] let's do a rolling restart on all maps eqiad for kartotherian [08:34:55] you have the cumin power :) [08:35:14] I see the same pattern in maps1004 and maps1001 [08:35:33] PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:35:57] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:36:02] <_joe_> onimisionipe: yeah I'm going to [08:36:13] <_joe_> but not via cumin [08:37:09] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:37:12] <_joe_> !log rolling restart of kartotherian [08:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:37] RECOVERY - kartotherian endpoints health on maps1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:38:37] RECOVERY - kartotherian endpoints health on maps1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:40:29] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) is CRITICAL: Could not fetch url http://10.64.48.154:6533/v4/marker/pin-m-fuel+ffffff.png: Generic connection error: HTTPConnectionPool(host=u10.64.48.154, port=6533): Max retries exceeded with url: /v4/marker/pin-m-fuel+ffffff.png (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f4da1b31a5 [08:40:29] blish a new connection: [Errno 111] Connection refused,)): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Could not fetch url http://10.64.48.154:6533/v4/marker/pin-m-fuel+ffffff@2x.png: Generic connection error: HTTPConnectionPool(host=u10.64.48.154, port=6533): Max retries exceeded with url: /v4/marker/pin-m-fuel+ffffff@2x.png (Caused by NewConnectionError(urllib3.connection.HTTPConnection [08:40:29] a1b31ad0: Failed to establish a new connection: [Errno 111] Connection refused,)) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:40:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:41:14] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /_info (Untitled test) timed out before a response was re [08:41:14] r/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [08:41:23] <_joe_> uh [08:41:25] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:41:44] <_joe_> that's because of the restarts I hope? [08:41:57] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:42:08] <_joe_> no, kartotherian is just not responding [08:42:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:43:01] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:43:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:43:35] RECOVERY - kartotherian endpoints health on maps1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:43:38] PROBLEM - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:43:57] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:44:04] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [08:44:23] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:45:06] RECOVERY - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:45:29] !log stop tilerator on maps to help reduce load [08:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:45:59] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:46:01] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:47:31] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1046.eqiad.wmnet'] ` Of which those **FAILED**: ` ['elastic1046.eqiad.wmnet'] ` [08:47:38] !log jmm@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [08:47:38] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [08:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:58] !log jmm@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [08:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:20] !log jmm@cumin2001 Updating IPMI password on 1 hosts - jmm@cumin2001 [08:48:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [08:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:23] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:49:04] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [08:50:01] PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:01] PROBLEM - Check systemd state on maps2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:11] (03PS4) 10Vgutierrez: Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) [08:50:15] PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:50:23] there we go [08:50:33] PROBLEM - Check systemd state on maps1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:41] PROBLEM - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:45] PROBLEM - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:50:55] PROBLEM - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:51:05] PROBLEM - Check systemd state on maps2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:05] PROBLEM - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:51:13] PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:51:13] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:25] PROBLEM - Maps HTTPS on maps1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [08:51:35] PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:51:41] PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:51:42] PROBLEM - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:51:56] <_joe_> godog: let's downtime it? [08:52:01] yeah, doing [08:52:26] <_joe_> thanks! 
[08:52:49] RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:52:53] RECOVERY - Maps HTTPS on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.668 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [08:52:56] !log downtime kartotherian pages for 1h [08:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:12] RECOVERY - LVS HTTP IPv4 #page on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 2.459 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:53:12] RECOVERY - kartotherian endpoints health on maps1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [08:53:15] RECOVERY - Check systemd state on maps2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:18] that's only eqiad downtimed btw [08:54:19] RECOVERY - Check systemd state on maps2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:25] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:57:15] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [08:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:27] <_joe_> gehel:^^ depooled eqiad right now [08:57:39] <_joe_> not sure it will work for the caches as well [08:57:44] thanks [08:57:50] <_joe_> vgutierrez: all the 
upload caches backends are now ATS right? [08:57:56] yes [08:57:57] that's right [08:58:13] <_joe_> so they use the discovery url I guess to reach maps [08:58:31] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:58:40] _joe_: that's right [08:58:46] <_joe_> gehel: I will set elastic1017 and 1046 to pooled=inactive, they're polluting pybal's logs [08:59:00] <_joe_> Sep 13 08:58:49 lvs1015 pybal[658]: [kartotherian-ssl_443] ERROR: Could not depool server maps1004.eqiad.wmnet because of too many down! [08:59:01] PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:03] PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:59:04] <_joe_> so still no dice [08:59:05] PROBLEM - Check systemd state on maps2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:11] PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [08:59:12] _joe_: ok, thanks [08:59:25] _joe_: specifically map http://maps.wikimedia.org https://kartotherian.discovery.wmnet [08:59:35] PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:35] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:59:39] PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:11] <_joe_> can someone check that all the alerts on maps on codfw are from tilerator please [09:00:24] and of course ATS is screaming (20190913.09h00m04s CONNECT: could not connect to 10.2.1.13 for 'https://kartotherian.discovery.wmnet/osm-intl/7/41/-15.png' (setting last failure time) connect_result=5) [09:00:32] (that's cp5001 BTW) [09:00:40] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=elastic1017.eqiad.wmnet [09:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:12] <_joe_> so maps in codfw are now down as well? [09:01:25] it looks like it [09:01:38] <_joe_> can someone check pybal in codfw please? [09:01:59] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=elastic1046.eqiad.wmnet [09:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:02] checking [09:02:57] _joe_: proxyfetch is not happy: Sep 13 07:02:20 lvs2003 pybal[45433]: [kartotherian-ssl_443] ERROR: Monitoring instance ProxyFetch reports server maps2001.codfw.wmnet (enabled/up/pooled) down: Getting https://localhost/ took longer than 5 seconds. [09:03:02] <_joe_> because the cpu has still not spiked there [09:03:06] <_joe_> sigh [09:03:11] of fuck.. 
that's 2 hours ago [09:03:13] forget about it [09:03:18] <_joe_> ok [09:03:20] <_joe_> :D [09:03:35] <_joe_> a tail -f should be enough to understand the current state [09:03:40] last message: Sep 13 07:02:30 lvs2003 pybal[45433]: [kartotherian-ssl_443] INFO: Server maps2001.codfw.wmnet (enabled/partially up/not pooled) is up [09:03:56] <_joe_> the cpu is spiking though [09:04:04] and maps2002 just failed [09:04:05] Sep 13 09:03:38 lvs2003 pybal[45433]: [kartotherian-ssl_443 ProxyFetch] WARN: maps2002.codfw.wmnet (enabled/up/pooled): Fetch failed (https://localhost/), 5.004 s [09:04:11] <_joe_> yeah [09:04:31] crap [09:04:43] <_joe_> I expected as much [09:05:02] <_joe_> vgutierrez: just on the ssl endpoint or also on the non-ssl one? [09:05:15] SSL only [09:05:28] now is time for 2003.. [09:05:50] and pybal is already below the depool threshold.. so keeping pooled failed servers [09:06:40] _joe_: what's handling TLS for kartotherian? envoy? [09:06:50] I'm silencing pages in karto codfw too, it is going to page too soon anyways [09:07:21] any idea to track those json errors to a specific URL? No context in the kartotherian logs [09:08:17] !log downtime kartotherian pages for 1h in codfw [09:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:30] oh.. 
nginx there [09:09:59] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:10:11] <_joe_> sigh [09:10:15] <_joe_> let's repool eqiad [09:10:21] <_joe_> this is doing nothing good anyways [09:10:39] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:11:06] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [09:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:15] <_joe_> we proved the problem is tied to the requests [09:11:26] yep [09:12:18] I'm suspecting the geoline endpoint (https://maps.wikimedia.org/geoline), but no proof [09:12:41] PROBLEM - kartotherian endpoints health on maps2001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [09:12:47] where would be a good place to drop this traffic and see if that mitigates the issue? [09:13:09] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:14:09] RECOVERY - kartotherian endpoints health on maps2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [09:14:22] I'd say either varnish on frontend or nginx on maps hosts [09:14:43] varnish-fe would be the most efficient place to do it yeah [09:15:01] vgutierrez: is that easy to do? [09:15:08] * gehel has no idea about varnish [09:15:27] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:16:17] <_joe_> vgutierrez: why not in the nginx for maps? but yeah whatever you choose [09:16:28] _joe_: edge VS backend [09:16:30] but yeah [09:16:37] it would work as well [09:16:51] <_joe_> let's do it in vcl, if you feel confident about that [09:17:02] fyi the error i saw on maps2004 was not the same geoip error [09:17:03] Error: GroupId not available\n at dm.loadGroups.then.dataGroups (/srv/deployment/kartotherian/deploy-cache/revs/c4c9e8b9dcc0747a8329b061408ddda32378ca5e/node_modules/@kartotherian/snapshot/lib/mapdataLoader.js:53:19 [09:17:23] <_joe_> that is apparently common and well known [09:17:28] ack [09:17:44] _joe_: something like https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/upload-frontend.inc.vcl.erb#L129-L135 [09:18:09] jbond42: sadly: T158657 [09:18:10] T158657: Kartotherian error: GroupId not available - https://phabricator.wikimedia.org/T158657 [09:18:28] gehel: which URLs need to be blocked? 
[09:18:48] https://maps.wikimedia.org/geoline [09:18:55] well, anything starting with that [09:19:17] ack [09:19:54] this is me guessing that the json errors are the core issues and that they are related to this endpoint. But I don't have anything better to propose [09:19:54] <_joe_> can I suggest you add such a block manually in one nginx backend [09:20:02] <_joe_> and see if it solves the issue on that node? [09:20:27] 10Operations, 10Discovery, 10Maps: Wikimedia maps unstability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10jcrespo) [09:20:34] ^ gehel [09:21:07] feel free to correct any misunderstanding I have written [09:21:40] _joe_: will do [09:21:47] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [09:22:43] (03PS1) 10Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoline [puppet] - 10https://gerrit.wikimedia.org/r/536549 [09:23:50] (03PS2) 10Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoline [puppet] - 10https://gerrit.wikimedia.org/r/536549 [09:23:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536549 (owner: 10Vgutierrez) [09:24:01] jbond42: two seconds? 
:) [09:24:16] !log deny access to /geoline on maps1004 - T232817 [09:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:19] T232817: Wikimedia maps unstability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 [09:24:20] between PS2 and your review, that's impressive [09:24:21] it's a simple one :) [09:24:34] jbond42: syntax error on PS1 though :( [09:25:01] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [09:25:06] oh yes :# [09:25:23] PS2 looks ok though [09:26:03] yes looks good [09:26:08] (03CR) 10Jbond: [C: 03+1] varnish: Rate limit heavily maps.wm.o/geoline [puppet] - 10https://gerrit.wikimedia.org/r/536549 (owner: 10Vgutierrez) [09:26:27] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:26:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:26:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_upload site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:27:20] gehel: maps1004 looks better without /geoline traffic? 
[09:27:40] !log restart kartotherian on maps1004 - T232817 [09:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:57] vgutierrez: just restarted kartotherian, does not really look much better [09:28:07] 10Operations, 10Discovery, 10Maps: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10jcrespo) [09:28:25] <_joe_> gehel: the cpu usage is more humane for now though [09:28:53] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:29:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:29:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:29:31] slightly better, maybe, let's wait 5 minutes see what happens [09:29:52] ack [09:30:48] gehel: do we have metrics about maps endpoints being hit? [09:30:53] or just webrequest? [09:31:36] no specific metrics on the maps side [09:33:07] load seems to be decreasing on other maps servers as well, no idea why [09:33:56] I don't see any crazy spike on /geoline [09:34:04] could it be the geoshape endpoint? [09:34:07] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:43] https://w.wiki/8J5 [09:34:58] jbond42: geoshape gets more traffic but nothing crazy checking the last 7 days [09:35:13] but my knowledge of maps.wm.o is /dev/null [09:35:29] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:16] i turned on access logging on maps1003 to get a sample and i see this in the log [09:36:20] 10.64.16.25 - - [13/Sep/2019:09:26:38 +0000] "GET [09:36:22] /geoshape?getgeojson=1&query=++SELECT+%3Fid+++%28if%28%3Fid+%3D+wd%3AQ5970%2C+%27%23C12838%27%2C+%27%2307c63e%27%29+as+%3Ffill%29+++%28if%28BOUND%28%3Flink%29%2C+++++++concat%28%27%5B%5B%27%2C+substr%28str%28%3Flink%29%2C31%2C500%29%2C++%27%7C%27%2C+%3FidLabel%2C+%27%5D%5D%27%29%2C+++++++%3FidLabel%29++++as+%3Ftitle%29+WHERE+%7B+++%7B+%3Fid+p%3AP31%2Fps%3AP31%2Fwdt%3AP279%2A+wd%3AQ22865+.+++h [09:36:28] int%3APrior+hint%3Agearing+%27forward%27++%7D+++UNION+++%7B+%3Fid+p%3AP31%2Fps%3AP31%2Fwdt%3AP279%2A+wd%3AQ106658+.++++hint%3APrior+hint%3Agearing+%27forward%27++%7D+++++%3Fid+p%3AP131+%3Flocst+.+++%3Flocst+ps%3AP131%2Fwdt%3AP131%2A+wd%3AQ1197.+++hint%3APrior+hint%3Agearing+%27forward%27++++MINUS+%7B+%3Flocst+pq%3AP582+%5B%5D+%7D+++MINUS+%7B+%3Fid+wdt%3AP576+%5B%5D+%7D+++SERVICE+wikibase%3Ala [09:36:34] bel+%7B+++++bd%3AserviceParam+wikibase%3Alanguage+%27en%27+.+++++%3Fid+rdfs%3Alabel+%3FidLabel+.+++%7D+++OPTIONAL+%7B%3Flink+schema%3Aabout+%3Fid+.+++%3Flink+schema%3AisPartOf+%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E+.+%7D+%7D++GROUP+BY+%3Fid+%3Flink+%3FidLabel++ HTTP/1.1" 200 699067 "-" "kartotherian-getJSON (yurik @ wikimedia)" [09:36:39] maps1003 ~ % [09:36:40] is that normal? the SELECT jumped out at me [09:36:52] I think so [09:36:53] sadly, this is "normal" [09:36:58] it's a feature not a bug [09:37:02] * vgutierrez hides [09:37:08] ahh ok :S [09:37:10] this is a SPARQL query to WDQS 
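A logged request like the one pasted above is easier to read once the query string is percent-decoded. The following is a minimal sketch, assuming only the Python standard library; the fragment is copied from the access-log line above (it is not the full query), and the helper name `decode_query_fragment` is made up for illustration.

```python
from urllib.parse import unquote_plus

# Short fragment copied from the access-log line above (not the whole query).
ENCODED_FRAGMENT = "%3Fid+p%3AP131+%3Flocst+."


def decode_query_fragment(fragment: str) -> str:
    """Percent-decode a form-encoded fragment; '+' is turned into a space."""
    return unquote_plus(fragment)


if __name__ == "__main__":
    # Prints the SPARQL triple pattern hidden in the encoded fragment.
    print(decode_query_fragment(ENCODED_FRAGMENT))
```

Because IRC split the pasted query across several messages, the pieces would need to be joined back together before decoding the whole thing.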
[09:37:15] ahh [09:37:48] I'll drop geoshape as well on maps1004, see if that changes anything [09:37:55] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:08] question: was caching in general affected? Because I can see tiles not loading, but the js didn't use to load for me either [09:38:32] !log drop /geoshape and restart kartotherian on maps1004 - T232817 [09:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:35] T232817: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 [09:38:56] jynus: I don't think so, maps.wikimedia.org opens just fine here via eqsin [09:39:09] vgutierrez: it didn't for me (it started recently) [09:39:38] but maybe at app side the ui doesn't load if tiles didn't, so don't take my word for it [09:39:43] esams, right? [09:39:45] yep [09:40:06] maps.m.o works fine for me [09:40:07] just I was expecting grey squares and the minimal ui [09:40:17] yes, it works for me too now [09:40:27] I mean during the peak issues [09:40:42] !log install linux-perf-4.9 on maps1002 and attempt to capture a stack sample [09:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:43] majority of the tiles should be in cache [09:40:57] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 (owner: 10Alexandros Kosiaris) [09:41:00] (03PS3) 10Alexandros Kosiaris: blubberoid: Remove monitoring support [deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 [09:41:08] none loaded to me and I was at a low zoom [09:41:27] just reporting, don't need any actionable as things are now working [09:41:28] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] blubberoid: Remove monitoring support [deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 (owner: 10Alexandros Kosiaris) [09:42:08] looks like dropping /geoline has some positive impact, but waiting a few more minutes to see if I'm just dreaming or not [09:42:27] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:56] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [09:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:00] gehel: you mean geoshape? [09:43:07] is this meant to be the case on the maps servers? [09:43:08] ● tilerator.service masked failed failed tilerator.service [09:43:18] right, geoshape [09:43:21] :) [09:43:53] jbond42: yes, I killed tilerator to reduce load, did not seem to help much [09:43:59] ack [09:44:09] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [09:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:03] ok, I think that maps1004 is behaving better since dropping geoshape, I'll re-enable geoline and see how that looks, and then we can apply this more widely [09:45:12] gehel: ack [09:45:19] I'll update the CR for varnish [09:45:22] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . 
[09:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:02] !log re-enabling /geoline on maps1004 - T232817 [09:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:04] T232817: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 [09:46:30] (03PS3) 10Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoshape [puppet] - 10https://gerrit.wikimedia.org/r/536549 [09:47:00] geoline? or geoshape? I am now quite confused [09:47:44] apergos: geoshape now [09:47:52] geoshape is still banned, geoline is enabled [09:47:55] feel free to check the SAL though [09:47:58] ah gotcha [09:48:24] don't trust me too much, geoline and geoshape are somewhat mixed up in my brain atm [09:48:52] vgutierrez: if I read your VCL patch correctly, that introduces a global rate limit on geoshape, right? [09:49:17] Or is it magically bucketed by client in some way? [09:49:31] yeah, that's X-Client-IP right there :) [09:50:05] X-Client-IP is set by nginx and sent to varnish-fe with the real client IP [09:50:24] Ok, so that's part of the bucket [09:50:50] the suspicion is that the issue is not so much because of the amount of traffic, but because of the kind of requests [09:51:13] fwiw my attempt at a stack trace/sample with perf didn't yield anything useful unfortunately [09:51:17] let's see if 1/second/ip is sufficient [09:51:31] gehel: we can tune that of course :) [09:51:41] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536549 (owner: 10Vgutierrez) [09:51:48] gehel: now? [09:51:51] let's try as-is first and see if that's sufficient [09:52:13] do you want it merged now? 
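The per-client bucketing discussed above (roughly one request per second for each X-Client-IP value) can be pictured as a token bucket keyed by client IP. What follows is a rough Python sketch of that idea only; it is not the VCL from change 536549. The class name, parameter names and the injectable clock are hypothetical, and the rate of 1/s with a burst of 1 is an assumption taken from the "1/second/ip" remark.

```python
import time
from typing import Callable, Dict, Tuple


class PerClientRateLimiter:
    """Token bucket per client key (for example the X-Client-IP value)."""

    def __init__(self, rate_per_second: float = 1.0, burst: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        # client key -> (tokens remaining, time of last update)
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_ip: str) -> bool:
        """Return True if this client may proceed, False if it is over limit."""
        now = self.clock()
        tokens, last = self.buckets.get(client_ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[client_ip] = (tokens - 1.0, now)
            return True
        self.buckets[client_ip] = (tokens, now)
        return False
```

Keying the bucket on the client IP is what keeps one heavy client from using up the allowance of everyone else, which is the difference from the global limit asked about above.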
[09:52:22] vgutierrez: yeah, maps1004 is definitely less saturating - https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=maps&var-instance=All&from=now-1h&to=now&panelId=2160&fullscreen&refresh=1m [09:52:26] ack [09:52:37] I'm seeing cpu sort of recovered across other maps hosts too [09:52:49] still very high but not pegged at 100% [09:52:51] (03CR) 10Vgutierrez: [C: 03+2] varnish: Rate limit heavily maps.wm.o/geoshape [puppet] - 10https://gerrit.wikimedia.org/r/536549 (owner: 10Vgutierrez) [09:53:03] (03PS4) 10Vgutierrez: varnish: Rate limit heavily maps.wm.o/geoshape [puppet] - 10https://gerrit.wikimedia.org/r/536549 [09:53:20] and this is how I get a t-shirt [09:53:22] /o\ [09:53:58] hahaha vgutierrez [09:54:14] vgutierrez: I don't have any t-shirt for you, but I can send chocolate :) [09:54:30] gehel: that's tempting.. here I don't get proper one [09:55:52] (gotta love A:cp-upload cumin alias) [09:58:39] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:59:14] CPU seems to be climbing back up, maybe this has nothing to do with geoline / geoshape, but just about the traffic we're getting [10:00:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:00:25] gehel: traffic doesn't seem to have increased over the last week [10:00:52] (03PS1) 10Elukey: eventlogging:dependencies: allow to specify python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941) [10:00:59] at least from turnilo PoV [10:01:02] vgutierrez: so it must be about the type of traffic and the the quantity [10:01:14] so more expensive queries [10:01:57] puppet is running on the cp upload cluster [10:02:51] looks like most of our traffic is refered from https://twpkinfo.com/ [10:03:47] vgutierrez: change already applied? CPU seems to be going down in the last 3 minutes [10:03:59] RECOVERY - snapshot of s4 in codfw on db1115 is OK: snapshot for s4 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-09-13 08:08:54 from db2099.codfw.wmnet:3314 (1076 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [10:04:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:04:11] gehel: half ot if [10:04:17] *half of it :) [10:04:18] damn [10:04:29] (03PS1) 10Effie Mouzeli: WIP: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 [10:05:57] (03CR) 10jerkins-bot: [V: 04-1] WIP: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [10:07:05] things seem to be mostly stable for maps, but I'm not entirely sure why [10:07:11] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:07:25] I'll wait more before re-enabling tilerator [10:08:22] yeah cpu looking good now on maps afaics [10:09:28] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [10:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:30] (03PS2) 10Elukey: eventlogging:dependencies: allow to specify python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941) [10:13:44] !log disable ATS-TLS debug options on cp5001 - T232298 [10:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:47] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 [10:14:19] (03CR) 10Jbond: WIP: systemd: Add support for coredump.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [10:17:07] (03CR) 10Elukey: [C: 03+2] "Andrew: I know that this is a horrible hack but it is just get puppet unblocked. 
I noticed that we are about to absent all the profile::ev" [puppet] - 10https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941) (owner: 10Elukey) [10:18:47] (03PS1) 10Alexandros Kosiaris: Add more entries to .gitignore [deployment-charts] - 10https://gerrit.wikimedia.org/r/536560 [10:19:19] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add more entries to .gitignore [deployment-charts] - 10https://gerrit.wikimedia.org/r/536560 (owner: 10Alexandros Kosiaris) [10:19:43] (03PS1) 10Elukey: eventlogging: move comment about python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/536561 (https://phabricator.wikimedia.org/T222941) [10:20:02] (03CR) 10Elukey: [C: 03+2] eventlogging: move comment about python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/536561 (https://phabricator.wikimedia.org/T222941) (owner: 10Elukey) [10:21:30] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) restbase2009 is fully decommissioned and ready to be reimaged. [10:21:58] (03PS1) 10Muehlenhoff: Remove access for atgomez [puppet] - 10https://gerrit.wikimedia.org/r/536562 [10:22:37] 10Operations, 10Analytics, 10Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10BBlack) The URL mentioned at the top isn't a media URL, it actually is HTML content and is a pageview. Try it in your browser: https://commons.wikimedia.org//wiki/File:Arm_muscles_b... 
[10:23:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/536562 (owner: 10Muehlenhoff) [10:23:42] (03CR) 10jerkins-bot: [V: 04-1] Remove access for atgomez [puppet] - 10https://gerrit.wikimedia.org/r/536562 (owner: 10Muehlenhoff) [10:24:00] (03CR) 10Elukey: [C: 03+2] "Of course the comment was placed in the wrong place, fixed it with https://gerrit.wikimedia.org/r/536561" [puppet] - 10https://gerrit.wikimedia.org/r/536557 (https://phabricator.wikimedia.org/T222941) (owner: 10Elukey) [10:24:43] (03PS2) 10Muehlenhoff: Remove access for atgomez [puppet] - 10https://gerrit.wikimedia.org/r/536562 [10:27:37] (03PS3) 10Muehlenhoff: Remove access for atgomez [puppet] - 10https://gerrit.wikimedia.org/r/536562 [10:30:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for atgomez [puppet] - 10https://gerrit.wikimedia.org/r/536562 (owner: 10Muehlenhoff) [10:31:11] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:32:45] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:36:05] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [10:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:46] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . 
[10:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:25] !log repool restbase1018 after reimage to stretch and completed Cassandra bootstrap [10:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:33] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) Reported and proposed a solution to upstream in https://github.com/apache/trafficserver/pull/5935 [10:41:51] !log reimage restbase2009 to stretch T224553 [10:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:54] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [10:42:15] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . 
[10:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:00] 10Operations, 10Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (10Rxy) p:05Triage→03Normal [10:47:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536367 (https://phabricator.wikimedia.org/T224553) (owner: 10Eevans) [10:47:03] (03PS2) 10Muehlenhoff: hieradata: specify restbase2009 jbod devices [puppet] - 10https://gerrit.wikimedia.org/r/536367 (https://phabricator.wikimedia.org/T224553) (owner: 10Eevans) [10:47:51] !log rebooting acmechief-test servers to catch up latest kernel upgrades [10:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:17] 10Operations, 10Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (10Rxy) p:05Normal→03Triage [10:55:22] PROBLEM - Maps HTTPS on maps1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:55:53] PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [10:56:57] RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 9.932 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:58:13] (03CR) 10Muehlenhoff: [C: 03+2] hieradata: specify restbase2009 jbod devices [puppet] - 10https://gerrit.wikimedia.org/r/536367 (https://phabricator.wikimedia.org/T224553) (owner: 10Eevans) [11:00:09] theory on maps: we have some kind of non terminating request on geoshape and it slowly saturates. 
The current throttling limits the problem but still lets it grow slowly [11:00:39] PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:00:54] vgutierrez: could we be way more aggressive on that throttling? see if it helps [11:01:13] I'm trying again to reach mateusbs17, see if he has any better idea [11:03:47] RECOVERY - kartotherian endpoints health on maps1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:05:30] I'm going to silence kartotherian for a couple of hours [11:05:54] !log silence kartotherian pages for 2h, known issue [11:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:12] (03PS1) 10Muehlenhoff: restbase2010: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536565 [11:07:11] 10Operations, 10Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (10Rxy) [11:07:37] (03PS1) 10Muehlenhoff: restbase2011: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536566 [11:07:47] I'm out for dinner.. maybe bblack can handle that gehel [11:08:07] vgutierrez: sure, enjoy dinner! 
[11:08:08] gehel: the current limit is 1 request per second on average, with an allowance for a burst of 10. How much more aggressive would you like it? [11:08:30] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:08:30] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:39] (03PS1) 10Muehlenhoff: restbase2012: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536567 [11:08:47] I'm thinking blocking it completely, but I'm also thinking it's probably unrelated [11:09:11] the other quite extreme action would be to block that pokemongo website based on referrer [11:09:53] ok one sec, I can look at both of those options [11:10:02] !log reboot an-tool1007 (runs turnilo) for kernel upgrades [11:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:05] !log reboot an-conf100* (Analytics Zookeeper nodes - not yet in production) for kernel upgrades [11:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:24] <_joe_> gehel: did anyone analyze the traffic we're getting? 
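(Editor's aside on the 11:08:08 message: the limit quoted there, about 1 request per second with a burst of 10, can be pictured as a token bucket. The sketch below is only an illustration under that assumption. The log does not show the actual Varnish rule, and the names `TokenBucket` and `count_allowed` are hypothetical.)

```python
import time


class TokenBucket:
    """Illustrative token bucket: `rate` tokens per second, capped at `burst`.

    Hypothetical helper, not the Varnish configuration discussed in the log.
    """

    def __init__(self, rate, burst, now=None):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start full, so an initial burst is allowed
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if a request arriving at `now` is within the limit."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(bucket, timestamps):
    """Count how many requests at the given timestamps pass the bucket."""
    return sum(1 for t in timestamps if bucket.allow(now=t))
```

(With `rate=1` and `burst=10`, a client sending many requests at once gets the first 10 through and then roughly one more per second, which matches gehel's observation that throttling limits the problem while still letting slow, long-running requests accumulate.)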
[11:13:31] <_joe_> there are many other sites [11:14:10] _joe_: only at a high level (so mostly: no) [11:14:17] <_joe_> ok [11:14:23] <_joe_> that's what I would do at this point [11:14:31] <_joe_> there must be one specific caller causing this [11:14:43] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:15:57] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [11:17:27] ok I'll try to dig into the logs [11:17:33] but not even sure what I'm looking for [11:18:39] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:18:51] PROBLEM - Maps HTTPS on maps1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:19:43] RECOVERY - Maps HTTPS on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:20:57] (03PS1) 10Jbond: maps: block 9db.jp from maps [puppet] - 10https://gerrit.wikimedia.org/r/536568 [11:22:45] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:23:30] (03PS1) 10Jbond: maps: block goeshape [puppet] - 10https://gerrit.wikimedia.org/r/536570 [11:23:47] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:24:05] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:24:46] gehel: changes above ^^ one to disable geoshapes and one to disable 9db.jp as refer. bblack is also added as reviewer [11:26:31] (03CR) 10Gehel: "LGTM as a mitigation strategy." [puppet] - 10https://gerrit.wikimedia.org/r/536570 (owner: 10Jbond) [11:28:37] (03PS2) 10Jbond: maps: block 9db.jp from maps [puppet] - 10https://gerrit.wikimedia.org/r/536568 [11:30:47] (03PS3) 10Jbond: maps: block 9db.jp from maps [puppet] - 10https://gerrit.wikimedia.org/r/536568 [11:33:52] (03CR) 10Urbanecm: "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [11:36:37] PROBLEM - Host an-conf1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:36:45] this is me --^ [11:37:03] RECOVERY - Host an-conf1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [11:38:56] <_joe_> !log manually raising the worker heap limit to 600 MB on kartotherian on maps1003 [11:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:21] !log enable access logs on maps1003 [11:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:46] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) [11:40:19] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) restbase2009 has been reimaged and is ready to be bootstrapped in Cassandra. 
[11:41:25] (03PS2) 10Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 [11:42:28] (03PS3) 10Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 [11:43:07] PROBLEM - Host an-conf1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:43:36] this is me --^ [11:44:01] RECOVERY - Host an-conf1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:48:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:05] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [11:53:07] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:55:05] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [11:57:07] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [11:58:43] there was a new kartotherian version deployed yesterday, checking what was the content [11:59:09] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:59:21] ^restbase is reimage alert spam, silencing [11:59:59] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff Pending bootstrap after reimage 
https://phabricator.wikimedia.org/T120662 [11:59:59] ACKNOWLEDGEMENT - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive Muehlenhoff Pending bootstrap after reimage https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:59:59] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused Muehlenhoff Pending bootstrap after reimage https://phabricator.wikimedia.org/T93886 [11:59:59] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff Pending bootstrap after reimage https://phabricator.wikimedia.org/T120662 [11:59:59] ACKNOWLEDGEMENT - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive Muehlenhoff Pending bootstrap after reimage https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:02:23] PROBLEM - kartotherian endpoints health on maps1002 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:03:35] RECOVERY - kartotherian endpoints health on maps1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:03:54] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10faidon) a:05faidon→03RobH Approved. It sounds like our spare pools are being drained, so if that's the case feel free to open a task to replenis... [12:04:29] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (10faidon) a:05faidon→03RobH Approved. 
[12:05:57] (03PS4) 10Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 [12:07:46] (03PS5) 10Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 [12:14:47] !log add timing information to maps1003 access logs [12:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:32] (03PS1) 10BBlack: LVS for MW: Remove RunCommand checks [puppet] - 10https://gerrit.wikimedia.org/r/536581 (https://phabricator.wikimedia.org/T111899) [12:31:23] (03PS1) 10Effie Mouzeli: profile::mediawiki::api: Setup systemd-coredump on api servers [puppet] - 10https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613) [12:33:01] (03PS1) 10BBlack: maps tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536583 [12:36:47] (03CR) 10Gehel: [C: 04-1] maps tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536583 (owner: 10BBlack) [12:37:02] <_joe_> !log temp ban of class of urls on maps1003 nginx [12:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:26] (03CR) 10MSantos: [C: 04-1] maps tiles URI sanity filter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536583 (owner: 10BBlack) [12:43:10] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10MoritzMuehlenhoff) @Bstorm @Andrew These were once installed with Stretch, but in the mean time Buster was released, let's reimage those be... 
[12:56:25] (03PS2) 10BBlack: maps tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536583 [12:56:30] (03CR) 10BBlack: maps tiles URI sanity filter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536583 (owner: 10BBlack) [12:56:55] <_joe_> !log banning more urls on maps1003 [12:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:49] (03PS3) 10Muehlenhoff: Remove obsolete restbase hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/536204 [12:57:51] (03CR) 10Gehel: [C: 03+1] "LGTM, but mateus has more knowledge of what is supported here." [puppet] - 10https://gerrit.wikimedia.org/r/536583 (owner: 10BBlack) [12:59:10] (03CR) 10MSantos: maps tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536583 (owner: 10BBlack) [12:59:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete restbase hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/536204 (owner: 10Muehlenhoff) [13:01:17] (03PS1) 10Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [13:05:45] 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10Jclark-ctr) @jcrespo Support ticket did not include disk. it was only a cable issue. No other tickets open. [13:09:14] 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10Cmjohnson) Actually we need to close this task and open a separate task about the disk. Different issue should get a different task. 
[13:10:11] (03CR) 10MSantos: [C: 03+1] maps tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536583 (owner: 10BBlack) [13:10:22] (03PS2) 10Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [13:10:23] (03PS2) 10Effie Mouzeli: profile::mediawiki::api: Setup systemd-coredump on api servers [puppet] - 10https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613) [13:11:33] (03PS3) 10BBlack: maps tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536583 [13:12:21] (03CR) 10BBlack: [C: 03+2] maps tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536583 (owner: 10BBlack) [13:12:27] (03CR) 10Muehlenhoff: "What's the rough ETA on obsoleting labtestpuppetmaster as well? It's currently the last puppet master on jessie in prod, so a bit of an ou" [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [13:14:51] (03CR) 10Muehlenhoff: "Is this really a compatible drop-in replacement? It's probably better to simply continue to use the jar provided by Gerrit upstream on bus" [puppet] - 10https://gerrit.wikimedia.org/r/536355 (owner: 10Dzahn) [13:15:40] (03PS1) 10Jbond: maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588 [13:19:36] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10fgiunchedi) [13:19:39] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . 
[13:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:41] (03PS2) 10Jbond: maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588 [13:20:08] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [13:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:14] (03CR) 10MSantos: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond) [13:21:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond) [13:21:38] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:16] (03PS3) 10Jbond: maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588 [13:24:49] RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:25:13] (03CR) 10Jbond: maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond) [13:25:27] RECOVERY - Check systemd state on maps1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:17] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [13:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:43] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . 
[13:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:23] (03PS1) 10Awight: Enable source wiki editing for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851) [13:30:38] (03CR) 10BBlack: [C: 04-1] maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond) [13:30:45] jbond42: ^ [13:30:51] looking [13:32:28] ahh thx and you can ignore my comment on your early change i see why now :) [13:34:16] (03PS4) 10Jbond: maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588 [13:35:21] (03CR) 10Jbond: maps: tiles URI sanity filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond) [13:35:33] bblack: can you recheck ^^ [13:36:07] 10Operations, 10Discovery, 10Maps, 10Product-Infrastructure-Team-Backlog: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10Mholloway) [13:36:42] jbond42: yeah that part LGTM, we're assuming the main regex is ok from mateus earlier +1 right? 
[13:37:10] (03PS1) 10Filippo Giunchedi: prometheus: alert on overall puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) [13:37:25] bbl yes [13:37:30] bblack: yes [13:38:02] also from https://github.com/wikimedia/mediawiki-services-kartotherian/tree/master/packages/kartotherian [13:38:21] (03CR) 10BBlack: [C: 03+2] maps: tiles URI sanity filter [puppet] - 10https://gerrit.wikimedia.org/r/536588 (owner: 10Jbond) [13:43:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/536233 (owner: 10Muehlenhoff) [13:47:03] (03PS1) 10Alexandros Kosiaris: Enable coredns based cluster DNS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/536593 [13:49:18] (03PS1) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 [13:50:25] (03PS2) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 [13:51:41] (03PS3) 10Subramanya Sastry: Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) [13:51:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Let's do this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry) [13:54:37] (03PS3) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 [13:54:57] (03Merged) 10jenkins-bot: Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry) [13:55:10] <_joe_> subbu: deploying [13:55:41] k [13:56:01] (03PS5) 10Elukey: profile::kerberos::kdc: add debconf settings [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) [13:56:28] (03CR) 10jenkins-bot: Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry) [13:57:24] (03CR) 10MSantos: [C: 04-1] maps: add filter to block any unknown URI's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [13:57:29] !log oblivian@deploy1001 Synchronized wmf-config/logging.php: unbreak mediawiki logging on scandium (duration: 01m 04s) [13:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:51] (03PS4) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 [13:59:07] (03PS5) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 [13:59:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [14:01:27] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:01:30] _joe_, confirmed that the revert works on scandium 
.. i now see logs in logstash when i parse a page that errors. [14:01:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:02:27] (03PS1) 10Alexandros Kosiaris: codfw: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536597 [14:04:00] (03CR) 10Gehel: maps: add filter to block any unknown URI's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [14:05:11] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [14:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:36] (03CR) 10MSantos: maps: add filter to block any unknown URI's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [14:06:59] <_joe_> subbu: in a few minutes the logstash filter should be deployed [14:07:18] \o/ ty .. [14:07:27] once you confirm, i'll test and verify. [14:07:56] (03PS6) 10Jbond: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 [14:10:01] (03CR) 10Jbond: maps: add filter to block any unknown URI's (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [14:12:34] (03PS1) 10Subramanya Sastry: beta cluster: Make deployment-parsoid09 a Mediawiki appserver as well [puppet] - 10https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) [14:13:37] (03PS7) 10BBlack: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [14:14:00] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . 
[14:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:24] (03CR) 10Subramanya Sastry: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/536598 is the puppet patch for making the parsoid host on beta cluster a mediawiki ap" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [14:15:05] <_joe_> subbu: we're done here [14:15:15] ok! [14:15:47] (03PS8) 10BBlack: maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [14:17:03] _joe_, works! :) [14:17:21] <_joe_> subbu: good! [14:17:22] (03CR) 10MSantos: [C: 03+1] maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [14:17:26] <_joe_> and sorry for the delay [14:17:50] all good! :) [14:18:40] !log installing reportbug update from Buster 10.1 point release [14:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:53] (03PS2) 10Alexandros Kosiaris: codfw: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536597 [14:20:14] (03CR) 10BBlack: [C: 03+2] maps: add filter to block any unknown URI's [puppet] - 10https://gerrit.wikimedia.org/r/536595 (owner: 10Jbond) [14:20:36] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] codfw: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536597 (owner: 10Alexandros Kosiaris) [14:22:50] !log installing bzip2 update from Buster 10.1 point release [14:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:30] (03PS1) 10Alexandros Kosiaris: eqiad: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536603 [14:26:49] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:27:52] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eqiad: coredns set to 4 replicas with pod antifinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/536603 (owner: 10Alexandros Kosiaris) [14:28:43] (03CR) 10Giuseppe Lavagetto: "Overall lgtm, a couple small things." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [14:30:30] !log installing cups security update on buster (only client-side libs installed) [14:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:47] !log akosiaris@ helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [14:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:16] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [14:34:43] (03CR) 10Jbond: "looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [14:36:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:31] <_joe_> uh effie [14:36:39] <_joe_> why this? 
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536558/1/modules/systemd/manifests/init.pp [14:36:42] <_joe_> I missed it [14:36:51] <_joe_> oh nevermind [14:36:54] <_joe_> old patchset [14:36:56] that is 1 :p [14:36:57] <_joe_> lol [14:36:59] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:36:59] <_joe_> yeah [14:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:03] <_joe_> dunno why [14:37:13] _joe_: it was a really brilliant idea [14:37:23] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:39:36] <_joe_> no I mean dunno why I ended up there [14:39:41] <_joe_> prolly following a comment [14:40:50] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "* Do this on the profile::mediawiki::php class" [puppet] - 10https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613) (owner: 10Effie Mouzeli) [14:41:53] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [14:42:05] (03PS1) 10BBlack: maps URIs: allow root, with or without query [puppet] - 10https://gerrit.wikimedia.org/r/536606 [14:43:03] (03CR) 10BBlack: [C: 03+2] maps URIs: allow root, with or without query [puppet] - 10https://gerrit.wikimedia.org/r/536606 (owner: 10BBlack) [14:44:04] (03PS1) 10Alexandros Kosiaris: Edit Project Config [deployment-charts] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/536607 [14:44:21] (03Abandoned) 10Alexandros Kosiaris: Edit Project Config [deployment-charts] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/536607 (owner: 10Alexandros 
Kosiaris) [14:50:41] (03CR) 10Hashar: "That will add the instance to the list of target scap deploys/sync mediawiki to." [puppet] - 10https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: 10Subramanya Sastry) [14:51:02] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:51:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-dan: New upstream release [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/535847 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [14:52:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-swe: New upstream release [debs/contenttranslation/apertium-swe] - 10https://gerrit.wikimedia.org/r/535853 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [14:53:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/535863 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [14:53:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-nob: New upstream release [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/536165 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [14:54:32] going to test https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/536580/ on mwdebug1001 (poke Daimona & anomie ) [14:54:38] (03PS6) 10Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 [14:55:09] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:58] (03PS3) 10Effie Mouzeli: profile::mediawiki::api: Setup systemd-coredump on api servers [puppet] - 10https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613) [15:00:05] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:00:25] anomie: so my patch might well end up being too spammy (pointed by Daimona) [15:00:29] since it would always log something [15:00:39] 1/11 of traffic to wikidata [15:00:50] but that might not be worse than api-feature-usages or the csp reports [15:01:36] giving a try and will rollback if that is way too much [15:01:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [15:02:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:02:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:02:26] !log hashar@deploy1001 Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Add more log and context for T232613 logging - T232613 (duration: 01m 04s) [15:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:56] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [15:03:46] (03PS1) 10Jbond: ipmi: relax password minimum length [software/spicerack] - 10https://gerrit.wikimedia.org/r/536616 (https://phabricator.wikimedia.org/T147074) [15:04:10] Looks like definitely too spammy... 
[15:04:23] (03PS2) 10Jbond: ipmi: relax password minimum length [software/spicerack] - 10https://gerrit.wikimedia.org/r/536616 (https://phabricator.wikimedia.org/T147074) [15:06:11] 10Operations, 10Traffic, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) Note that with new eqiad routing engines we can set the MSS at the router level (untested). Advantages are: easier to deploy (one configuration change) and can be applied to ext... [15:06:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:32] Daimona: yeah though that is not that much. The AdHocDebug channel does not show up in the top channels (based on the home dashboard https://logstash.wikimedia.org/app/kibana ) [15:06:34] but yeah [15:06:36] I will let it run [15:06:48] and once a core dump gets captured, I guess I will disable the feature entirely [15:06:53] we probably have enough core dumps [15:07:01] I was looking at https://logstash.wikimedia.org/goto/58e68b0a1ecf0e61b1c88bbcf8c22386 [15:07:14] 932 hits, of which only one generated a coredump [15:07:31] Yeah probably [15:07:50] AbuseFilter generates an order of magnitude more logs :] [15:08:37] soo hmm 200 events per minute, that is manageable [15:08:42] I am letting it run [15:08:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:08:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:00] Daimona: and thank you for the reviews / tips etc :] [15:10:10] that did not last long :] [15:10:12] (03PS2) 10Alexandros Kosiaris: Enable coredns based cluster DNS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/536593 [15:10:15] (03PS1) 10Alexandros Kosiaris: Enable
coredns based cluster DNS in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/536617 [15:10:26] (03CR) 10Jbond: "small nit, also do we want to add something to `/etc/network/interfaces`?" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: 10Dzahn) [15:10:39] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:10:57] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:03] Yeah, we're currently deprecating lots of stuff, so there are a lot of entries. wmf.22 should reduce the amount, although it's blocked on this empty string issue :) [15:11:10] yw :) [15:11:33] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:11:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:06] !log upload apertium-dan_0.6.0-1+wmf3 apertium-nno_1.0.0-1+wmf1 apertium-nob_1.0.0-2+wmf1 apertium-swe_0.8.0-1+wmf1 to apt.wikimedia.org/jessie-wikimedia T218184 [15:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:09] T218184: Update apertium-nno-nob, apertium-swe-dan, apertium-swe-nor and apertium-dan-nor packages - https://phabricator.wikimedia.org/T218184 [15:14:45] (03PS3) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) [15:15:01] (03CR) 10Cwhite: profile: use prometheus for logstash alerting (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536358 
(https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [15:22:52] (03PS3) 10Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/536359 [15:24:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [15:25:24] (03PS1) 10MSantos: maps: allow float value for image scale [puppet] - 10https://gerrit.wikimedia.org/r/536621 [15:28:01] (03PS1) 10Hashar: Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613) [15:28:10] (03PS2) 10MSantos: maps: allow float value for image scale [puppet] - 10https://gerrit.wikimedia.org/r/536621 [15:29:28] (03CR) 10Hashar: [C: 03+2] Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [15:31:02] (03Merged) 10jenkins-bot: Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [15:31:18] (03CR) 10jenkins-bot: Disable adhoc core dump logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536622 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [15:32:47] 10Operations, 10Analytics, 10Traffic: Add google weblight to the list of trusted proxies - https://phabricator.wikimedia.org/T232849 (10Nuria) [15:33:28] (03PS1) 10Andrew Bogott: Puppet CAs: rename a local '$puppetmaster' variable [puppet] - 10https://gerrit.wikimedia.org/r/536625 (https://phabricator.wikimedia.org/T232428) [15:34:40] !log hashar@deploy1001 Synchronized wmf-config/CommonSettings.php: Disable adhoc core dump logging - T232613 (duration: 01m 04s) [15:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:02] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [15:35:30] so 
that is all I have for this evening [15:35:33] will be back later tonight [15:36:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:38] (03CR) 10Andrew Bogott: [C: 03+2] Puppet CAs: rename a local '$puppetmaster' variable [puppet] - 10https://gerrit.wikimedia.org/r/536625 (https://phabricator.wikimedia.org/T232428) (owner: 10Andrew Bogott) [15:39:47] (03CR) 10Ayounsi: "2 inline comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: 10Dzahn) [15:42:43] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Resolve local commits on cloud-puppetmaster-01.cloudinfra.eqiad.wmflabs and cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs - https://phabricator.wikimedia.org/T232428 (10Andrew) 05Open→03Resolved The attache... [15:42:46] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [15:43:40] !log reverting live hacks on mw1348 [15:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:28] !log bootstrapping Cassandra, restbase2009-a -- T224553 [15:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:31] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [15:53:11] 10Operations, 10Traffic, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10BBlack) Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed which haven't set off big alarm bells), but we'll still need direct m...
[15:56:05] (03CR) 10BBlack: [C: 04-1] "In general, I'm hoping we don't need to go down this road and we'll find better ways to deal with this." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: 10Dzahn) [16:00:17] (03CR) 10Filippo Giunchedi: "non-blocking nit inline, LGTM but please attach a PCC run as well" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:02:16] (03CR) 10Filippo Giunchedi: "See line, other than that LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:04:02] (03PS6) 10Elukey: profile::kerberos::kdc: add debconf settings [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) [16:05:08] (03PS4) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) [16:05:23] (03CR) 10Cwhite: profile: use prometheus for logstash alerting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:05:29] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18288/kerberos1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [16:07:03] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:09:11] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:37] !log fix bgp group netflow on cr2-codfw [16:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:50] 10Operations, 10Traffic, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) As discussed on IRC, this *should* work for inbound (clamping the SYNACK too), but to be tested. [16:36:33] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 44.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:41:17] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 54.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:43:57] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:01] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 79.29 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:53:40] (03CR) 10Jbond: [C: 03+1] "lgtm but best have brandon double check i dont want to push to varnish without a +1 from traffic just yet" [puppet] - 10https://gerrit.wikimedia.org/r/536621 (owner: 10MSantos) [16:56:54] 10Operations, 10Analytics, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. 
x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) [16:58:41] 10Operations, 10Analytics, 10Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10Nuria) I have started another ticket that as you mentioned, better explains the rationale behing having "trusted proxies", we really do not need them if we can capture the original i... [16:59:02] 10Operations, 10Analytics, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) ping @Ottomata and @JAllemandou for thou... [17:00:21] (03CR) 10Brennen Bearnes: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: 10Brennen Bearnes) [17:00:30] (03PS1) 10Herron: prometheus: add per-site systemd failed unit checks [puppet] - 10https://gerrit.wikimedia.org/r/536642 (https://phabricator.wikimedia.org/T230570) [17:07:42] (03PS1) 10Ayounsi: Kafkatee, mask default (package provided) systemd service [puppet] - 10https://gerrit.wikimedia.org/r/536645 [17:12:05] (03CR) 10Ayounsi: "I'm not sure of all the implication, but my understanding is that the default service is not used?" [puppet] - 10https://gerrit.wikimedia.org/r/536645 (owner: 10Ayounsi) [17:16:57] (03CR) 10Jbennett: "We view this risk as having a decent impact but a relatively low likelihood that's why it is ranked as low. 
As for signing off, we don't d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [17:24:45] !log bootstrapping Cassandra, restbase2009-b -- T224553 [17:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:55] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [17:25:06] RECOVERY - cassandra-b service on restbase2009 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:26:11] RECOVERY - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-b valid until 2020-06-24 13:01:53 +0000 (expires in 284 days) https://phabricator.wikimedia.org/T120662 [17:37:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:31] (03PS7) 10Cwhite: add generic interface to metrics gathering [software/service-checker] - 10https://gerrit.wikimedia.org/r/532807 [17:50:44] (03CR) 10BBlack: [C: 03+2] maps: allow float value for image scale [puppet] - 10https://gerrit.wikimedia.org/r/536621 (owner: 10MSantos) [17:50:53] (03PS3) 10BBlack: maps: allow float value for image scale [puppet] - 10https://gerrit.wikimedia.org/r/536621 (owner: 10MSantos) [17:53:29] (03PS2) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [17:59:31] 10Operations, 10Analytics, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10BBlack) The problem stems from the "Trust" in "... 
[18:01:55] (03PS1) 10Herron: kafka-main: replace kafka1002 hardware with kafka-main1002 [puppet] - 10https://gerrit.wikimedia.org/r/536655 (https://phabricator.wikimedia.org/T225005) [18:05:55] PROBLEM - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is CRITICAL: instance={cp4027:9536,cp4028:9536,cp4029:9536,cp4030:9536,cp4031:9536,cp4032:9536} site=ulsfo tunnel={cp1075_v4,cp1075_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [18:06:08] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (10herron) Open Distro for Elasticsearch looks quite promising https://opendistro.github.io/for-elasticsearch/ [18:07:38] 10Operations, 10Wikimedia-Logstash, 10observability: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) https://opendistro.github.io/for-elasticsearch/ appears to be a valid option, although this was resolved I'll update the description to include it [18:11:14] 10Operations, 10Wikimedia-Logstash, 10observability: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) [18:12:57] (03CR) 10Krinkle: [C: 03+1] Gzip SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [18:15:52] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10jcrespo) [18:15:57] 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10jcrespo) 05Open→03Resolved I am ok with that (I actually said to open a new one if this was to be closed). I just didn't know if it had to be open to track whatever was you... 
[18:17:37] (03CR) 10Krinkle: [C: 03+1] "Looks like beta puppetmaster is borked." [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [18:24:31] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10wiki_willy) a:03Jclark-ctr [18:40:16] (03CR) 10Jeena Huneidi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: 10Subramanya Sastry) [18:40:45] 10Operations, 10Analytics, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) Right, I see the UA issue but in the abs... [18:52:09] (03PS1) 10Alex Monk: Add cloudinfra hiera data [puppet] - 10https://gerrit.wikimedia.org/r/536663 (https://phabricator.wikimedia.org/T232509) [18:54:46] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) 05Open→03Stalled [18:54:48] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [18:54:57] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10Dzahn) [18:56:12] 10Operations, 10Phabricator, 10hardware-requests, 10serviceops, 10Release-Engineering-Team (Development services): The server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (10mmodell) [18:56:34] 10Operations, 10Phabricator, 10hardware-requests, 10serviceops, 10Release-Engineering-Team 
(Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (10mmodell) [19:01:06] (03PS1) 10Jhedden: openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) [19:01:44] (03PS2) 10Andrew Bogott: Add cloudinfra hiera data [puppet] - 10https://gerrit.wikimedia.org/r/536663 (https://phabricator.wikimedia.org/T232509) (owner: 10Alex Monk) [19:01:56] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) [19:02:50] (03CR) 10Andrew Bogott: [C: 03+2] Add cloudinfra hiera data [puppet] - 10https://gerrit.wikimedia.org/r/536663 (https://phabricator.wikimedia.org/T232509) (owner: 10Alex Monk) [19:03:05] (03CR) 10jerkins-bot: [V: 04-1] openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [19:07:01] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:30] (03CR) 10Jeena Huneidi: [C: 04-1] "we should make sure the image is published before merging and also follow this to update the CPU/memory limits: https://wikitech.wikimedia" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: 10Brennen Bearnes) [19:09:22] (03CR) 10Brennen Bearnes: "> Patch Set 1: Code-Review-1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494) (owner: 10Brennen Bearnes) [19:11:19] 10Operations, 10Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10Dzahn) [19:14:46] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Dzahn) [19:14:51] 10Operations, 10Phabricator, 10Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10Dzahn) [19:15:10] 10Operations, 10Phabricator, 10Traffic: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10Dzahn) merging into T226044 [19:18:30] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) == From the merged task: Blog posts on phame cannot currently be cached by our... [19:21:16] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) a:05mmodell→03None This is unblocked on my end, @ema feel free to proceed wh... 
[19:22:33] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) [19:22:37] 10Operations, 10Phabricator: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129 (10Dzahn) [19:22:39] 10Operations, 10Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10Dzahn) [19:22:44] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) Also important, @epriestley's comment at T219978#5346100 [19:24:53] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10srishakatux) @Nuria I want to read MySQL credentials from `/etc/mysql/conf.d/research-client.cnf` via a script that... [19:30:35] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:46] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Fibercut, telia working on it. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:46] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Fibercut, telia working on it. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:35:21] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is CRITICAL: instance={cp4027:9536,cp4028:9536,cp4029:9536,cp4030:9536,cp4031:9536,cp4032:9536} site=ulsfo tunnel={cp1075_v4,cp1075_v6} Ayounsi If an icinga alert brought you here, please disregard for the time being. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [19:42:08] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Krinkle) [19:48:45] (03PS2) 10Krinkle: beta cluster: Make deployment-parsoid09 a Mediawiki appserver as well [puppet] - 10https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: 10Subramanya Sastry) [19:50:31] (03PS1) 1020after4: Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - 10https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) [19:54:02] !log bootstrapping Cassandra, restbase2009-c -- T224553 [19:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:06] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [19:54:59] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:56:03] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2020-06-24 13:01:54 +0000 (expires in 284 days) https://phabricator.wikimedia.org/T120662 [20:00:38] !log hotfixing T232600 due to severity of the bug and relative safety of the fix (if this breaks, yell at James_F who twisted my arm and made me do it) [20:00:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:41] T232600: Some Phabricator boards do not load cards anymore in Chrome 77 - https://phabricator.wikimedia.org/T232600
[20:01:09] * James_F grins.
[20:01:27] (PS1) Herron: prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236)
[20:01:47] James_F: fixed?
[20:02:03] (CR) jerkins-bot: [V: -1] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[20:02:13] twentyafterfour: Looks like it. Thanks!
[20:02:35] RECOVERY - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[20:03:08] (PS2) Herron: prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236)
[20:08:35] (PS1) Andrew Bogott: codfw1dev: First pass at buildingout cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:09:20] (PS2) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:10:12] (PS3) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:12:40] Operations, ops-eqiad, DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (ayounsi) Trying to figure out why this is failing: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-b6-eqiad error is: > External command error: Error in
p...
[20:13:37] (CR) Urbanecm: "> Patch Set 5:" [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg)
[20:16:35] (PS4) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:18:15] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:23:46] (CR) Subramanya Sastry: [C: -1] "I am going to spin up a new instance deployment-mediawiki-parsoid10 which will be the parsoid/php server. Will work with Petr to figure ou" [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[20:23:50] !log restarting netbox1001.wikimedia.org
[20:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:30] (PS5) Andrew Bogott: codfw1dev: First pass at building out cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/536672 (https://phabricator.wikimedia.org/T229441)
[20:30:02] (PS1) Zoranzoki21: Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/536679
[20:47:02] (CR) Jbennett: "It means it is low risk from a security standpoint if the patch uploader wishes to deploy it."
[mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg)
[20:59:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:59:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:03:01] Operations, ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (Jclark-ctr) contacted dell regarding failed drive will update with response
[21:03:46] (PS1) Krinkle: Gerrit: Add colorblind-friendly diff styles to 'eclipse' syntax theme [puppet] - https://gerrit.wikimedia.org/r/536687 (https://phabricator.wikimedia.org/T232893)
[21:04:42] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (wiki_willy) Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy
[21:05:33] Operations, ops-eqiad, netops: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (wiki_willy) @Cmjohnson - can you provide an update on this one next week? Thanks, Willy
[21:16:19] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 https://phabricator.wikimedia.org/T93886
[21:16:49] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:26] Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (wiki_willy) @Cmjohnson or @Jclark-ctr - can one of you guys check this out early next week? Thanks, Willy
[21:20:30] (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:21:26] Operations, Wikimedia-Mailing-lists: Reset Mailing list admin password for oversight-wp-ja - https://phabricator.wikimedia.org/T232822 (Quiddity) Open→Resolved a:Quiddity Done x3.
[21:22:31] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:32] (CR) Paladox: "> > Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:26:04] (CR) Dzahn: gerrit: add gerrit1001 as a replica host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536352 (owner: Dzahn)
[21:26:17] (Abandoned) Dzahn: gerrit: add gerrit1001 as a replica host [puppet] - https://gerrit.wikimedia.org/r/536352 (owner: Dzahn)
[21:31:51] (CR) Cwhite: [C: +1] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[21:33:45] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:39] (PS3) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[21:36:15] (CR) jerkins-bot: [V: -1] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:36:49] (CR) Paladox:
gerrit: do not link to mysql-connector-java.jar if on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:38:00] (PS4) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[21:38:35] (CR) jerkins-bot: [V: -1] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:39:55] (CR) Paladox: "Note that the plan is to stop using mysql when upgraded to gerrit 2.16." [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:40:37] (CR) Paladox: "Also the lib is downloaded by gerrit from maven (mysql-lib)." [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:40:52] (CR) Dzahn: "so apparently "provided by Gerrit" means "Gerrit tells you to download it from Maven.org. is that ok, Moritz?" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:41:43] (CR) Paladox: "We can install the jar manually without gerrit doing it." [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:42:01] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:17] (PS5) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[21:42:43] (CR) Dzahn: "just pointing out that we went from "use distro package" to "just use the one provided by gerrit" to "manually install it"" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:43:58] (CR) Dzahn: "fine with it.
hopefully won't be for long since as you say 2.16 and we can get rid of it" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:44:29] PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:46:22] (CR) Cwhite: [C: +1] "PCC checks out https://puppet-compiler.wmflabs.org/compiler1002/18273/logstash1009.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[21:46:58] (CR) Dzahn: "> Patch Set 3:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:47:29] RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 79572 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:48:53] (CR) Dzahn: [C: +1] "noop on prod and doing nothing (in puppet) on buster" [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:50:19] (CR) Paladox: [C: +1] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[21:54:37] (CR) Dzahn: [C: -1] "just like we talked about it in the meeting. agree and looks mostly good. just found an issue when using the compiler:" [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[21:55:42] (CR) Cwhite: "Do the thresholds need to be tweaked as well? Looking back at the last 90 days of data, the check might not have gone off once?"
[puppet] - https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) (owner: Filippo Giunchedi)
[21:57:16] (PS2) 20after4: Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883)
[22:02:05] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[22:02:14] (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/18298/ running on 1003 , stopped on 2001" [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[22:02:19] Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (wiki_willy) a:Jclark-ctr
[22:04:46] (CR) Dzahn: [C: +2] gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355 (owner: Dzahn)
[22:04:54] (PS6) Dzahn: gerrit: do not link to mysql-connector-java.jar if on buster [puppet] - https://gerrit.wikimedia.org/r/536355
[22:13:30] (PS3) Dzahn: Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[22:16:54] (CR) Alex Monk: "(I replaced a cherry-pick of this from PS9 or earlier with one of PS11 because everything using this was broken with an error about a miss" [puppet] - https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: Ladsgroup)
[22:19:24] (PS1) Bstorm: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] -
https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058)
[22:19:55] (CR) Dzahn: [C: +2] Phabricator: Make a separate hiera option to ensure phd stopped/running [puppet] - https://gerrit.wikimedia.org/r/536669 (https://phabricator.wikimedia.org/T232883) (owner: 20after4)
[22:20:49] (PS3) Subramanya Sastry: beta cluster: Make deployment-mediawiki-parsoid10 a MW scap target [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538)
[22:22:05] PROBLEM - WDQS high update lag on wdqs1010 is CRITICAL: 1.695e+04 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:23:14] ^thats me
[22:23:26] Extending downtime
[22:26:23] thanks
[22:40:25] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:40:43] chaomodus: ^
[22:42:04] (PS1) Dzahn: DHCP: switch phab1001 from jessie to buster [puppet] - https://gerrit.wikimedia.org/r/536698 (https://phabricator.wikimedia.org/T190568)
[22:45:03] (PS1) Alex Monk: cloudinfra hiera: Add missing statsd key [puppet] - https://gerrit.wikimedia.org/r/536699 (https://phabricator.wikimedia.org/T232509)
[22:47:36] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[22:47:46] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn) Stalled→Open
[22:47:52] Operations, Phabricator, serviceops, Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832
(Dzahn)
[22:49:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:41] (CR) jerkins-bot: [V: -1] sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm)
[22:50:13] (CR) Dzahn: [C: +2] "phab1001 has been idling for long enough, we definitely don't need the jessie system anymore. reinstalling and then either we keep both 10" [puppet] - https://gerrit.wikimedia.org/r/536698 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[22:51:19] (PS2) Dzahn: DHCP: switch phab1001 from jessie to buster [puppet] - https://gerrit.wikimedia.org/r/536698 (https://phabricator.wikimedia.org/T190568)
[22:59:54] Puppet, Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941 (Krenair) I believe we've achieved this with T220268#5275994, the only catch is that you need to make a hiera change like https://wikitech.wikimedia.org/w/index.php?title=Hiera:Dep...
[23:00:14] (PS1) Dzahn: site: apply spare::system role to phab1001 [puppet] - https://gerrit.wikimedia.org/r/536701 (https://phabricator.wikimedia.org/T190568)
[23:01:07] (CR) Dzahn: [C: +2] site: apply spare::system role to phab1001 [puppet] - https://gerrit.wikimedia.org/r/536701 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[23:02:17] (PS2) Dzahn: site: apply spare::system role to phab1001 [puppet] - https://gerrit.wikimedia.org/r/536701 (https://phabricator.wikimedia.org/T190568)
[23:05:20] (PS2) Reedy: Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/534706 (owner: Jforrester)
[23:05:22] (PS2) Reedy: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - https://gerrit.wikimedia.org/r/534707 (owner: Jforrester)
[23:05:47] (CR) Jforrester: "Oh, yeah, forgot these. Let's do them on Monday?" [mediawiki-config] - https://gerrit.wikimedia.org/r/534706 (owner: Jforrester)
[23:06:21] !log re-enable puppet on maps - T232817
[23:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:25] T232817: Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817
[23:06:27] (CR) Reedy: "WFM :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/534706 (owner: Jforrester)
[23:07:35] RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:07:43] RECOVERY - tilerator on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:07:43] RECOVERY - tilerator on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:07:45] RECOVERY -
Check systemd state on maps2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:05] RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:11] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:17] RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:31] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:35] RECOVERY - Check systemd state on maps2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:43] RECOVERY - tilerator on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:45] RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[23:08:57] RECOVERY - Check systemd state on maps1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:57] RECOVERY - Check systemd state on maps2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:09] RECOVERY - Check systemd state on maps1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:36] Puppet, Cloud-Services: Make changing puppetmasters
for Labs instances more easy - https://phabricator.wikimedia.org/T152941 (Krenair) Open→Resolved a:Krenair Actually I've just tested a couple of new instance creations with the above, it comes up without needing to do anything special anymor...
[23:12:57] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin10...
[23:19:27] (PS1) Dzahn: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704
[23:21:44] (PS2) Andrew Bogott: cloudinfra hiera: Add missing statsd key [puppet] - https://gerrit.wikimedia.org/r/536699 (https://phabricator.wikimedia.org/T232509) (owner: Alex Monk)
[23:22:18] (CR) Paladox: gerrit: make scap user configurable in Hiera (4 comments) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:22:37] (PS2) Bstorm: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058)
[23:26:29] (PS2) Dzahn: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704
[23:29:31] (CR) Dzahn: gerrit: make scap user configurable in Hiera (4 comments) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:37:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:45] (CR) Paladox: gerrit: make scap user configurable in Hiera (3 comments) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:42:03] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[23:42:05] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:32] (PS3) Paladox: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:43:15] (PS4) Paladox: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:43:58] (CR) Andrew Bogott: [C: +2] cloudinfra hiera: Add missing statsd key [puppet] - https://gerrit.wikimedia.org/r/536699 (https://phabricator.wikimedia.org/T232509) (owner: Alex Monk)
[23:44:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:43] (CR) jerkins-bot: [V: -1] gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:47:59] (PS5) Paladox: gerrit: make scap user configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:51:45] Operations, Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): cloud-puppetmasters: move some hiera settings from Horizon to git/gerrit - https://phabricator.wikimedia.org/T232509 (Krenair) Open→Resolved
[23:51:49] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair)
[23:52:33] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair) Open→Resolved
[23:52:36] Operations, Cloud-VPS, cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[23:56:02] (CR) Dzahn: gerrit: make scap user configurable in Hiera (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:57:30] (CR) Paladox: gerrit: make scap user configurable in Hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536704 (owner: Dzahn)
[23:59:47] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab1001.eqiad.wmne...